R language linear regression and time series analysis: a case study of influencing factors of housing prices in Beijing

Time:2021-6-5

Link to the original text:http://tecdat.cn/?p=21467 

objective

Data on house prices may reflect changes in China in recent years:

  • People get more resources (salaries) and expect better houses
  • populous
  • One child policy: how to affect the geometric structure of the house? More bedrooms, more space

My core idea is to forecast house prices. However, I’m not going to use any ARIMA model; Instead, I will use the characteristics of the data to fit the regression year by year.

The structure is as follows:

  • Data preparation: Transform numerical features into classification; Missing value
  • EDA: for numerical and categorical features: average prices and performance of these features
  • Modeling:

    • Segmenting training / testing data for a given year: for example, segmenting data in 2000; The regression model is trained according to these data
    • Then, for all new years up to 2016, predict the value of each home.
    • The measure used for validation will be the average price of the house (i.e. the average price and forecast value obtained from the test sample each year)

Data preparation

We have a very complete description of the features:

  • URL: the URL to get data (characters)
  • ID: ID (character)
  • LG: and lat coordinates, using bd09 protocol( (number)
  • CID: Community ID (number)
  • Transaction time: transaction time (characters)
  • DOM: market active day( (number)
  • Followers: the number of people after the transaction( (number)
  • Total price: (value)
  • Price: the average price (value) in square terms
  • Area: square of the house (Figure)
  • The living room: number of characters
  • a living room: number of characters
  • Kitchens: number of kitchens (figures)
  • Number of bathrooms (characters)
  • The height of the house
  • Building type: including tower (1), bungalow (2), plate tower combination (3), plate (4) (numerical value)
  • Construction time
  • Decoration: including others (1), rough (2), simple (3), hardcover (4) (numerical value)
  • Building structure: including unclean (1), mixed (2), brick and wood (3), brick concrete (4), steel (5) and steel-concrete composite (6)( Value)
  • Elevator ratio: the ratio of the number of residents on the same floor to the number of elevators.
  • Elevator has (1) or no elevator (0) (value)
  • Five year period: property owned by the owner for less than 5 years (Figure)

Data cleaning, feature creation

From the initial data:

  • From the website, I found that it has location information, such as Chengjiao / 101084782030. Similarly, a simple regexp is used to extract provincial features.
  • Another big data preparation work is to convert some digital features, such as the subway. The family code near the subway station is 1, and vice versa is 0.
  • A large part of DOM is missing. I can’t use this feature in modeling or delete Na, but it also reduces the size of the data frame.
#Extract province from web site
  sapply(df$url, function(x) strsplit(x,'/')[[1]][4])

Inspection missing

#Missing data graph
 
  ggplot(data = .,aes(x = V2, y = V1)) + geom_tile(aes(fill = value )) +

 R language linear regression and time series analysis: a case study of influencing factors of housing prices in Beijing

R language linear regression and time series analysis: a case study of influencing factors of housing prices in Beijing

  • As mentioned above, a large part of the DOM is missing. I decided to keep this feature and fill in the missing values with intermediate values (the distribution is very skewed)
  • Otherwise, there are only a few missing values in buildingtype and community average (pop.), and I decided to simply delete them. In fact, they only account for about 30 rows, while the data volume of the whole dataset is 300K +, so the loss is not too large.
  • Now I simply delete features that I don’t intend to use in the future.
 ifelse(is.na(df$DOM),median(df$DOM,na.rm=T),df$DOM) 

A custom function for converting numbers to categories

For some features, you need a function to handle multiple tags. For others (living room, living room and bathroom), the conversion is very simple.

df2$livingRoom <- as.numeric(df2$livingRoom) 

It appears that buildingtype has the wrong coded numeric value:

buildingType

count

0.048

4

0.125

3

0.250

2

0.333

5

0.375

1

0.429

1

0.500

15

0.667

1

1.000

84541

2.000

137

3.000

59715

4.000

172405

NaN

2021

Since the number of wrong encoded values and Na is small, I will discard these rows again

 df2$renovationCondition <- sapply(df2$renovationCondition, ionCondition)

df2$buildingStructure <- sapply(df2$buildingStructure, makeStructure)
df2$elevator <- ifelse(df2$elevator==1,'has_elevator','no_elevator') 

Missing value inspection

#Missing data graph
df2 %>% is.na %>% melt %>% 
  ggplot(data = .,aes(x = Var2, y = Var1)) + geom_tile(aes(fill = value)) +
  scale_fill_manual(values = c("grey20","white")) + theme_minimal(14) +

R language linear regression and time series analysis: a case study of influencing factors of housing prices in Beijing

kable(df %>% group_by(constructionTime) %>% summarise(count=n()) %>% arrange(-count) %>% head(5))

constructionTime

count

2004

21145

2003

19409

NA

19283

2005

18924

2006

14854

df3 <- data.frame(df2 %>% na.omit())

Final inspection after interpolation

any(is.na(df3))

## [1] FALSE

Exploratory analysis

Due to the digital and classified features, the EDA technologies I will use are:

  • Numerical value: correlation matrix
  • Classification: boxplot and map

We have to focus on price (unit price / unit price) and total price (million yuan)
Total price will be the target variable of regression model.

Numerical characteristics

corrplot(cor(
  df3  ,
  tl.col='black')

R language linear regression and time series analysis: a case study of influencing factors of housing prices in Beijing

comment

  • There is a strong positive correlation between totalprice and community average, that is, the housing prices in densely populated areas are higher
  • There is a positive correlation between total price and the number of living room and bathroom.
  • As for the area variable, we can see that it also has a strong correlation with the above variables: this makes sense, because if the house has a large area, more rooms can be built (obviously).
  • Other interesting correlations: community average is negatively correlated with construction time, which means that it takes less time to build in densely populated areas

Classification features

Map

  • Three level (province) map of China
  • I looked at the suburbs, which are near Beijing, so I filtered the map of that particular province
ggplot() + 
  geom_polygon(data = shapefile_test,aes(x = long, y = lat, group = group), 

BeijingLoc <- data.frame('Long'=116.4075,'Lat' = 39.904)

building structure

makeEDA('buildingStructure' )

R language linear regression and time series analysis: a case study of influencing factors of housing prices in Beijing

Brick and wood houses are the most expensive, almost twice as expensive as other types of houses

Building type

makeEDA('buildingType' )

R language linear regression and time series analysis: a case study of influencing factors of housing prices in Beijing

  • Bungalows are the most expensive and local

Decoration conditions

R language linear regression and time series analysis: a case study of influencing factors of housing prices in Beijing

R language linear regression and time series analysis: a case study of influencing factors of housing prices in Beijing

Elevator.

R language linear regression and time series analysis: a case study of influencing factors of housing prices in Beijing

  • The price has little dependence on the elevator
  • The distribution of houses is relatively equal to this feature.

metro

R language linear regression and time series analysis: a case study of influencing factors of housing prices in Beijing

  • The price has little dependence on the vicinity of the subway station.
  • The distribution of houses is relatively equal to this feature.

Is it full_ Five years_

makeFeatureCatEDA('fiveYearsProperty', length(unique(df3$fiveYearsProperty)))

R language linear regression and time series analysis: a case study of influencing factors of housing prices in Beijing

  • The dependence on price is really small for a property owned for less than five years
  • In terms of this feature, the distribution of families is relatively equal

region

R language linear regression and time series analysis: a case study of influencing factors of housing prices in Beijing

regression model

strategy

  • Extract year and month from tradetime
  • Group by year and month to get the number and average price of houses
  • Split dataset:

    • For the year [2010-2017] = train and run the regression model in this group of years
    • For > 2017: test samples monthly and forecast the average price

Average price overview

First we need to see what we want to predict

df3$year <- year(df3$tradeTimeTs)
df3$month <- month(df3$tradeTimeTs) 

df3 %>% filter(year>2009) %>% group_by(monthlyTrad) %>% 
  summarise(count=n(), mean = mean(price)) %>% 
  ggplot(aes(x=monthlyTradeTS, y= mean)) + 

R language linear regression and time series analysis: a case study of influencing factors of housing prices in Beijing

  • Average prices rose to mid-2017 and then fell rapidly
  • At the same time, the number of houses increases with the rise of prices, and now the number of housing transactions decreases with the rise of prices.

Prepare training / test samples

I split the data on January 1, 2017. For all samples, I need to change the classification features into pseudo variables.

df_train <- data.frame(df  %>% filter(year>2009 & year<2017))
df_test <- data.frame(df %>% filter(year>=2017))
as.data.frame(cbind(
  df_train %>% select_if(is.numeric) %>% select(-Lng, -Lat, -year, -month),
  'bldgType'= dummy.code(df_train$buildingType),
  'bldgStruc'= dummy.code(df_train$buildingStructure),
  'renovation'= dummy.code(df_train$renovationCondition),
  'hasElevator'= dummy.code(df_train$elevator), 

In this step, I only train a linear model

regressors<-c('lm')
 
 Control <- trainControl(method = "cv",number = 5, repeats=3)
for(r in regressors){
    cnt<-cnt+1
     res[[cnt]]<-train(totalPrice ~., data = train ,method=r,trControl =  Control) 

R language linear regression and time series analysis: a case study of influencing factors of housing prices in Beijing

R language linear regression and time series analysis: a case study of influencing factors of housing prices in Beijing

R ^ 2 is around 0.88, good. Let’s look at the details.

Training accuracy

g1<-ggplot(data=PRED,aes(x=Prediction,y=True)) + geom_jitter() + geom_smooth(method='lm',size=.5) +
    #Calculation index
    mse <- mean((PRED$True-PRED$Prediction)^2)
    rmse<-mse^0.5
    SSE = sum((PRED$Pred - PR

R language linear regression and time series analysis: a case study of influencing factors of housing prices in Beijing

## [1] "MSE: 15952.845934 RMSE : 126.304576 R2 :0.795874"

  • So it looks like the residuals are OK (the distribution is normal, centered at 0), but they seem to fail for low prices.

The relationship between the prediction of training and testing samples and time

  • Basically the same as above, but I’ll repeat the forecast for all months of training
  • My target is average house price.
  • The training is done in more than 10 years of training samples, so it will be very interesting to look at the forecast month by month.
#Training sample > training accuracy
 
for (i in 1:length(dates_train)){
     current_df <- prepareDF(current_df)
     current_pred <- mean(predict(res[[1]],current_df))
 

#Run test sample --- test accuracy
 
for (i in 1:length(dates_test)){
     current_df <- prepareDF(current_df)
    current_pred <- mean(predict(res[[1]],current_df))

 RES %>% reshape2::melt(id=c('date','split')) %>% 
  ggplot(aes(x=date,y=value)) + geom_line(aes(color=variable, lty=split),size=1) + 

R language linear regression and time series analysis: a case study of influencing factors of housing prices in Beijing

  • The forecast is really good for the data after 2012, which may correspond to the months with enough data

Improvement / to do

Geographic location as a feature

  • Here’s an interesting picture; It shows the total price for each location. In the center of the two-dimensional distribution, the price is higher.
  • The idea is to calculate the distance from the center of each house and associate it with a grade / score
BeijingLoc <- data.frame('Long'=116.4075,'Lat' = 39.904)
df3 %>% ggplot(aes(x=Lng,y=Lat)) + geom_point(aes(color=price),size=.1,alpha=.5)  + 
  theme(legend.position = 'bottom') + 

R language linear regression and time series analysis: a case study of influencing factors of housing prices in Beijing


R language linear regression and time series analysis: a case study of influencing factors of housing prices in Beijing

Most popular insights

1.Application of R language multivariate logistic regression

2.Implementation of panel smooth transition regression (PSTR) analysis case

3.Partial least square regression (PLSR) and principal component regression (PCR) in MATLAB

4.A case study of R language Poisson Poisson regression model

5.Hosmer lemeshow goodness of fit test in R language regression

6.The realization of lasso regression, ridge ridge regression and elastic net model in R language

7.Realization of logistic regression in R language

8.Predicting stock price with linear regression in Python

9.How does R language calculate IDI and NRI in survival analysis and Cox regression