Regression Analysis of Interval Data in R

Time:2021-9-11

Original link: http://tecdat.cn/?p=14850

Regression analysis is a very common data analysis method that uses observed data to determine the relationship between variables. Traditional regression analysis takes point data as its object of study, and its predictions are also point values, whereas real data often vary within a range. A confidence interval built at a chosen confidence level makes up, to some extent, for the limitation of a single-point prediction. The analysis still starts from point data, however, and information is often lost when all the data within a range are represented by a single point.

Interval regression analysis is a data analysis method that takes interval numbers as its object of study. An interval number reflects the range over which the data vary and is closer to the actual situation. Interval symbolic data are a kind of interval number formed through “data packaging”, so they carry not only the endpoints of the interval but also information about the points scattered inside it.

This article briefly explains how to use R to extract the lower and upper limits of intervals. Let’s start by generating data:

n=200  # sample size; not given in the original, 200 is an assumed value
X=rnorm(n)
Y=2+X+rnorm(n,sd = .3)

Suppose we no longer observe the variable X itself, only the class it falls in (we create eight classes, each containing one eighth of the observations):

Q=quantile(x = X,(0:8)/8)
Q[1]=Q[1]-.00001  # move the lowest break just below min(X) so that no observation is dropped
Xcut=cut(X,breaks = Q)
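
As a quick check on the binning step, the following sketch counts the observations per class and shows why the lowest break point is nudged down. The sample size of 200 and the seed are assumptions made only so that the example runs on its own; they are not taken from the article.

```r
# Self-contained sketch; n and the seed are assumed, not taken from the article
set.seed(1)
n <- 200
X <- rnorm(n)

# Octile break points, used as-is: the minimum of X equals the lowest break,
# and cut() builds intervals of the form (a, b], so that observation is left out
Q <- quantile(X, (0:8) / 8)
n_missing <- sum(is.na(cut(X, breaks = Q)))

# Nudging the lowest break point down keeps every observation
Q[1] <- Q[1] - 1e-5
Xcut <- cut(X, breaks = Q)
class_sizes <- table(Xcut)

print(n_missing)     # number of observations lost without the nudge
print(class_sizes)   # observations per class
```

Passing include.lowest = TRUE to cut() is another way to keep the minimum; the first label then reads [a,b] instead of (a,b].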

For example, for the first value, we have

as.character(Xcut[1])
[1] "(-0.626,-0.348]"

To extract these boundaries, we can use the following small function, which returns the lower limit, the midpoint and the upper limit of the interval:

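extrai=function(x){
  ax=as.character(x)
  # The parsing lines are missing from the original; these regular expressions are an assumed reconstruction.
  # Variants 1 and 2 cover labels of the form "(a,b]" and "[a,b)"; a variant that does not match yields NA.
  lower1=suppressWarnings(as.numeric(sub("\\((.+),.*","\\1",ax)))
  upper1=suppressWarnings(as.numeric(sub("[^,]*,([^]]*)\\]","\\1",ax)))
  lower2=suppressWarnings(as.numeric(sub("\\[(.+),.*","\\1",ax)))
  upper2=suppressWarnings(as.numeric(sub("[^,]*,([^]]*)\\)","\\1",ax)))
  lower = c(lower1,lower2)
  lower=lower[!is.na(lower)]
  upper = c(upper1,upper2)
  upper=upper[!is.na(upper)]
  mid = (lower+upper)/2
  return(c(lower=lower,mid=mid,upper=upper))
}
extrai(Xcut[1])
 lower    mid  upper 
-0.626 -0.487 -0.348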
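
The labels printed by cut() are rounded (three significant digits by default), so bounds parsed from them are approximate. When the break points are still available, the bounds can also be read from them directly. The sketch below is an alternative under that assumption, not the method used in this article; the seed, the sample size and the names code and bounds are introduced here for illustration only.

```r
# Self-contained sketch; n and the seed are assumed values
set.seed(1)
n <- 200
X <- rnorm(n)
Q <- quantile(X, (0:8) / 8)
Q[1] <- Q[1] - 1e-5
Xcut <- cut(X, breaks = Q)

# The integer code of a factor level is the index of its class (1 to 8),
# so class k runs from break k to break k + 1
code  <- as.integer(Xcut)
lower <- unname(Q[code])
upper <- unname(Q[code + 1])
mid   <- (lower + upper) / 2

bounds <- data.frame(lower, mid, upper)
print(head(bounds, 3))
```

Because these bounds are not rounded, they differ slightly from the values recovered from the labels.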

As you can see, we can add three variables (the lower limit, the midpoint and the upper limit) to the data frame:

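# B and B2 are not defined in the original; the next two lines are an assumed reconstruction
B=data.frame(Y=Y,X=Xcut)
B2=sapply(Xcut,extrai)  # 3 x n matrix with rows lower, mid, upper
B$lower=B2[1,]
B$mid  =B2[2,]
B$upper=B2[3,]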

We can compare four regressions: (I) Y on the eight classes, treated as a factor with eight levels; (II) Y on the lower limit of the interval; (III) Y on the midpoint of the interval; and (IV) Y on the upper limit.

regF=lm(Y~X,data=B)
regL=lm(Y~lower,data=B)
regM=lm(Y~mid,data=B)
regU=lm(Y~upper,data=B)

We can compare the predictions of the four models.
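
One way to make this comparison concrete is to tabulate the prediction of each model for each of the eight classes, since every model predicts a single value per class. This is a minimal sketch under stated assumptions: the seed and sample size are chosen here, the bounds are taken from the break points rather than parsed from the labels, and the names classes and pred are hypothetical.

```r
# Self-contained sketch; n and the seed are assumed values
set.seed(1)
n <- 200
X <- rnorm(n)
Y <- 2 + X + rnorm(n, sd = 0.3)
Q <- quantile(X, (0:8) / 8)
Q[1] <- Q[1] - 1e-5
Xcut <- cut(X, breaks = Q)
code <- as.integer(Xcut)

# Data frame with the class and its (unrounded) bounds
B <- data.frame(Y = Y, X = Xcut,
                lower = unname(Q[code]),
                upper = unname(Q[code + 1]))
B$mid <- (B$lower + B$upper) / 2

regF <- lm(Y ~ X, data = B)       # eight-level factor
regL <- lm(Y ~ lower, data = B)   # lower limit
regM <- lm(Y ~ mid, data = B)     # midpoint
regU <- lm(Y ~ upper, data = B)   # upper limit

# One row per class, ordered from the lowest class to the highest
classes <- unique(B[order(code), c("X", "lower", "mid", "upper")])
pred <- data.frame(class    = classes$X,
                   by_class = predict(regF, newdata = classes),
                   by_lower = predict(regL, newdata = classes),
                   by_mid   = predict(regM, newdata = classes),
                   by_upper = predict(regU, newdata = classes))
print(pred, row.names = FALSE)
```

The factor model simply predicts the mean of Y within each class, while the other three models place the classes on a straight line in the chosen summary of the interval.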


We can also compare the AIC of the models:

AIC(regF)
[1] 204.5653
AIC(regM)
[1] 201.1201
AIC(regL)
[1] 266.5246
AIC(regU)
[1] 255.0687
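
The four calls can also be collected into a single table, because AIC() accepts several fitted models at once. The sketch below rebuilds the models on simulated data with an assumed seed and sample size, and takes the bounds from the break points, so its numbers will not match the ones printed above.

```r
# Self-contained sketch; n and the seed are assumed values
set.seed(1)
n <- 200
X <- rnorm(n)
Y <- 2 + X + rnorm(n, sd = 0.3)
Q <- quantile(X, (0:8) / 8)
Q[1] <- Q[1] - 1e-5
Xcut <- cut(X, breaks = Q)
code <- as.integer(Xcut)

B <- data.frame(Y = Y, X = Xcut,
                lower = unname(Q[code]),
                upper = unname(Q[code + 1]))
B$mid <- (B$lower + B$upper) / 2

regF <- lm(Y ~ X, data = B)
regL <- lm(Y ~ lower, data = B)
regM <- lm(Y ~ mid, data = B)
regU <- lm(Y ~ upper, data = B)

# One row per model: df counts the estimated parameters (coefficients plus
# the residual variance), AIC is the criterion itself
aic_table <- AIC(regF, regM, regL, regU)
print(aic_table[order(aic_table$AIC), ])   # smallest AIC first
```

The df column makes the trade-off visible: the factor model spends nine parameters, while each of the three numeric models spends three.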

Using only the lower or the upper limit gives a clearly worse fit (AIC of 266.5 and 255.1, respectively), whereas using the midpoint of the interval (201.1) is slightly better than using the eight-level factor (204.6).

