Link to the original text:http://tecdat.cn/?p=21467
objective
Data on house prices may reflect changes in China in recent years:
 People get more resources (salaries) and expect better houses
 populous
 One child policy: how to affect the geometric structure of the house? More bedrooms, more space
My core idea is to forecast house prices. However, I’m not going to use any ARIMA model; Instead, I will use the characteristics of the data to fit the regression year by year.
The structure is as follows:
 Data preparation: Transform numerical features into classification; Missing value
 EDA: for numerical and categorical features: average prices and performance of these features

Modeling:
 Segmenting training / testing data for a given year: for example, segmenting data in 2000; The regression model is trained according to these data
 Then, for all new years up to 2016, predict the value of each home.
 The measure used for validation will be the average price of the house (i.e. the average price and forecast value obtained from the test sample each year)
Data preparation
We have a very complete description of the features:
URL: the URL to get data (characters)
ID: ID (character)
LG: and lat coordinates, using bd09 protocol（ (number)
CID: Community ID (number)
Transaction time: transaction time (characters)
DOM: market active day（ (number)
Followers: the number of people after the transaction（ (number)
Total price: (value)
Price: the average price (value) in square terms
Area: square of the house (Figure)
 The living room
: number of characters
 a living room
: number of characters
Kitchens: number of kitchens (figures)
Number of bathrooms (characters)
The height of the house
Building type: including tower (1), bungalow (2), plate tower combination (3), plate (4) (numerical value)
Construction time
Decoration: including others (1), rough (2), simple (3), hardcover (4) (numerical value)
Building structure: including unclean (1), mixed (2), brick and wood (3), brick concrete (4), steel (5) and steelconcrete composite (6)（ Value)
Elevator ratio: the ratio of the number of residents on the same floor to the number of elevators.
Elevator has (1) or no elevator (0) (value)
Five year period: property owned by the owner for less than 5 years (Figure)
Data cleaning, feature creation
From the initial data:
 From the website, I found that it has location information, such as Chengjiao / 101084782030. Similarly, a simple regexp is used to extract provincial features.
 Another big data preparation work is to convert some digital features, such as the subway. The family code near the subway station is 1, and vice versa is 0.
 A large part of DOM is missing. I can’t use this feature in modeling or delete Na, but it also reduces the size of the data frame.
#Extract province from web site
sapply(df$url, function(x) strsplit(x,'/')[[1]][4])
Inspection missing
#Missing data graph
ggplot(data = .,aes(x = V2, y = V1)) + geom_tile(aes(fill = value )) +
 As mentioned above, a large part of the DOM is missing. I decided to keep this feature and fill in the missing values with intermediate values (the distribution is very skewed)
 Otherwise, there are only a few missing values in buildingtype and community average (pop.), and I decided to simply delete them. In fact, they only account for about 30 rows, while the data volume of the whole dataset is 300K +, so the loss is not too large.
 Now I simply delete features that I don’t intend to use in the future.
ifelse(is.na(df$DOM),median(df$DOM,na.rm=T),df$DOM)
A custom function for converting numbers to categories
For some features, you need a function to handle multiple tags. For others (living room, living room and bathroom), the conversion is very simple.
df2$livingRoom < as.numeric(df2$livingRoom)
It appears that buildingtype has the wrong coded numeric value:
buildingType
count
0.048
4
0.125
3
0.250
2
0.333
5
0.375
1
0.429
1
0.500
15
0.667
1
1.000
84541
2.000
137
3.000
59715
4.000
172405
NaN
2021
Since the number of wrong encoded values and Na is small, I will discard these rows again
df2$renovationCondition < sapply(df2$renovationCondition, ionCondition)
df2$buildingStructure < sapply(df2$buildingStructure, makeStructure)
df2$elevator < ifelse(df2$elevator==1,'has_elevator','no_elevator')
Missing value inspection
#Missing data graph
df2 %>% is.na %>% melt %>%
ggplot(data = .,aes(x = Var2, y = Var1)) + geom_tile(aes(fill = value)) +
scale_fill_manual(values = c("grey20","white")) + theme_minimal(14) +
kable(df %>% group_by(constructionTime) %>% summarise(count=n()) %>% arrange(count) %>% head(5))
constructionTime
count
2004
21145
2003
19409
NA
19283
2005
18924
2006
14854
df3 < data.frame(df2 %>% na.omit())
Final inspection after interpolation
any(is.na(df3))
## [1] FALSE
Exploratory analysis
Due to the digital and classified features, the EDA technologies I will use are:
 Numerical value: correlation matrix
 Classification: boxplot and map
We have to focus on price (unit price / unit price) and total price (million yuan)
Total price will be the target variable of regression model.
Numerical characteristics
corrplot(cor(
df3 ,
tl.col='black')
comment
 There is a strong positive correlation between totalprice and community average, that is, the housing prices in densely populated areas are higher
 There is a positive correlation between total price and the number of living room and bathroom.
 As for the area variable, we can see that it also has a strong correlation with the above variables: this makes sense, because if the house has a large area, more rooms can be built (obviously).
 Other interesting correlations: community average is negatively correlated with construction time, which means that it takes less time to build in densely populated areas
Classification features
Map
 Three level (province) map of China
 I looked at the suburbs, which are near Beijing, so I filtered the map of that particular province
ggplot() +
geom_polygon(data = shapefile_test,aes(x = long, y = lat, group = group),
BeijingLoc < data.frame('Long'=116.4075,'Lat' = 39.904)
building structure
makeEDA('buildingStructure' )
Brick and wood houses are the most expensive, almost twice as expensive as other types of houses
Building type
makeEDA('buildingType' )
Bungalows are the most expensive and local
Decoration conditions
Elevator.
 The price has little dependence on the elevator
 The distribution of houses is relatively equal to this feature.
metro
 The price has little dependence on the vicinity of the subway station.
 The distribution of houses is relatively equal to this feature.
Is it full_ Five years_
makeFeatureCatEDA('fiveYearsProperty', length(unique(df3$fiveYearsProperty)))
 The dependence on price is really small for a property owned for less than five years
 In terms of this feature, the distribution of families is relatively equal
region
regression model
strategy
 Extract year and month from tradetime
Group by year and month to get the number and average price of houses

Split dataset:
 For the year [20102017] = train and run the regression model in this group of years
 For > 2017: test samples monthly and forecast the average price
Average price overview
First we need to see what we want to predict
df3$year < year(df3$tradeTimeTs)
df3$month < month(df3$tradeTimeTs)
df3 %>% filter(year>2009) %>% group_by(monthlyTrad) %>%
summarise(count=n(), mean = mean(price)) %>%
ggplot(aes(x=monthlyTradeTS, y= mean)) +
 Average prices rose to mid2017 and then fell rapidly
 At the same time, the number of houses increases with the rise of prices, and now the number of housing transactions decreases with the rise of prices.
Prepare training / test samples
I split the data on January 1, 2017. For all samples, I need to change the classification features into pseudo variables.
df_train < data.frame(df %>% filter(year>2009 & year<2017))
df_test < data.frame(df %>% filter(year>=2017))
as.data.frame(cbind(
df_train %>% select_if(is.numeric) %>% select(Lng, Lat, year, month),
'bldgType'= dummy.code(df_train$buildingType),
'bldgStruc'= dummy.code(df_train$buildingStructure),
'renovation'= dummy.code(df_train$renovationCondition),
'hasElevator'= dummy.code(df_train$elevator),
In this step, I only train a linear model
regressors<c('lm')
Control < trainControl(method = "cv",number = 5, repeats=3)
for(r in regressors){
cnt<cnt+1
res[[cnt]]<train(totalPrice ~., data = train ,method=r,trControl = Control)
R ^ 2 is around 0.88, good. Let’s look at the details.
Training accuracy
g1<ggplot(data=PRED,aes(x=Prediction,y=True)) + geom_jitter() + geom_smooth(method='lm',size=.5) +
#Calculation index
mse < mean((PRED$TruePRED$Prediction)^2)
rmse<mse^0.5
SSE = sum((PRED$Pred  PR
## [1] "MSE: 15952.845934 RMSE : 126.304576 R2 :0.795874"
 So it looks like the residuals are OK (the distribution is normal, centered at 0), but they seem to fail for low prices.
The relationship between the prediction of training and testing samples and time
 Basically the same as above, but I’ll repeat the forecast for all months of training
 My target is average house price.
 The training is done in more than 10 years of training samples, so it will be very interesting to look at the forecast month by month.
#Training sample > training accuracy
for (i in 1:length(dates_train)){
current_df < prepareDF(current_df)
current_pred < mean(predict(res[[1]],current_df))
#Run test sample  test accuracy
for (i in 1:length(dates_test)){
current_df < prepareDF(current_df)
current_pred < mean(predict(res[[1]],current_df))
RES %>% reshape2::melt(id=c('date','split')) %>%
ggplot(aes(x=date,y=value)) + geom_line(aes(color=variable, lty=split),size=1) +
 The forecast is really good for the data after 2012, which may correspond to the months with enough data
Improvement / to do
Geographic location as a feature
 Here’s an interesting picture; It shows the total price for each location. In the center of the twodimensional distribution, the price is higher.
 The idea is to calculate the distance from the center of each house and associate it with a grade / score
BeijingLoc < data.frame('Long'=116.4075,'Lat' = 39.904)
df3 %>% ggplot(aes(x=Lng,y=Lat)) + geom_point(aes(color=price),size=.1,alpha=.5) +
theme(legend.position = 'bottom') +
Most popular insights
1.Application of R language multivariate logistic regression
2.Implementation of panel smooth transition regression (PSTR) analysis case
3.Partial least square regression (PLSR) and principal component regression (PCR) in MATLAB
4.A case study of R language Poisson Poisson regression model
5.Hosmer lemeshow goodness of fit test in R language regression
6.The realization of lasso regression, ridge ridge regression and elastic net model in R language
7.Realization of logistic regression in R language
8.Predicting stock price with linear regression in Python
9.How does R language calculate IDI and NRI in survival analysis and Cox regression