[Artificial Intelligence Project] A popular machine learning project – Boston house prices:
1. Data overview analysis
1.1 data overview
- train.csv: the training set;
- test.csv: the test set;
- submission.csv: the file of real house prices;
The training set has 404 rows and 14 columns. Each row describes a house and its surroundings, together with the corresponding median value of owner-occupied homes. The task is to predict the house prices for the 102 test records.
1.2 data analysis
By learning detailed information about a house and its surroundings – including the town crime rate, nitric oxide concentration, average number of rooms per dwelling, weighted distance to employment centers, and the median home value – the trained model predicts the median value of owner-occupied homes in an area from that area's features.
This is a regression problem: submit the predicted median home value for each record in the test set. The evaluation metric is mean squared error (MSE).
2. General idea of the project
2.1 data reading
Data set: the Boston housing training set, train.csv (404 records)
The data set fields are as follows:
CRIM: per capita crime rate by town.
ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS: proportion of non-retail business acres per town.
CHAS: Charles River dummy variable (1 if the tract bounds the river; 0 otherwise).
NOX: nitric oxide concentration.
RM: average number of rooms per dwelling.
AGE: proportion of owner-occupied units built before 1940.
DIS: weighted distance to five Boston employment centers.
RAD: index of accessibility to radial highways.
TAX: full-value property tax rate per $10,000.
PTRATIO: pupil-teacher ratio by town.
B: 1000(Bk − 0.63)², where Bk is the proportion of Black residents by town.
LSTAT: percentage of lower-status population.
MEDV: median value of owner-occupied homes, in thousands of dollars.
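As a sketch of the reading step, the CSV can be loaded with pandas and the target column MEDV separated from the features. The snippet below uses an in-memory stand-in for train.csv (only 3 rows and a few of the 14 columns, for brevity); in practice you would call `pd.read_csv("train.csv")`.

```python
import io

import pandas as pd

# Stand-in for train.csv; in practice: train = pd.read_csv("train.csv")
csv_text = """CRIM,RM,LSTAT,MEDV
0.00632,6.575,4.98,24.0
0.02731,6.421,9.14,21.6
0.02729,7.185,4.03,34.7
"""
train = pd.read_csv(io.StringIO(csv_text))

# Separate the features from the target column MEDV
X_train = train.drop(columns=["MEDV"])
y_train = train["MEDV"]
print(X_train.shape, y_train.shape)  # (3, 3) (3,)
```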
2.2 Data preprocessing
(1) Data outlier processing
First, the training set is split into a sub-training set and a sub-test set. DataFrame.sort_values is used to sort the training set by each feature in turn; the outlier samples for that feature are deleted, the model is trained on the sub-training set and evaluated on the sub-test set, and the optimal number of samples to delete for that feature is determined.
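A minimal sketch of the sort-and-drop idea, using a toy frame (the column names, data, and number of rows dropped are illustrative; in the article the drop count per feature is chosen by scoring on the sub-test set):

```python
import pandas as pd

# Toy training frame; in practice this is the 404-row training set
df = pd.DataFrame(
    {"LSTAT": [4.98, 9.14, 4.03, 29.9, 5.33],
     "MEDV": [24.0, 21.6, 34.7, 8.4, 36.2]}
)

def drop_extremes(frame, feature, k):
    """Sort by `feature` and drop the k largest values (candidate outliers)."""
    ordered = frame.sort_values(feature)
    return ordered.iloc[: len(ordered) - k] if k else ordered

# Here we simply drop the single most extreme LSTAT row
cleaned = drop_extremes(df, "LSTAT", 1)
print(len(cleaned))  # 4
```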
(2) Data normalization processing
sklearn.preprocessing.StandardScaler is used to standardize the features and the labels respectively.
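A short example of the standardization step (the input matrix is synthetic); after fitting, each column has zero mean and unit variance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

scaler = StandardScaler()        # fit on training data only
X_std = scaler.fit_transform(X)  # zero mean, unit variance per column

print(X_std.mean(axis=0))  # ~[0. 0.]
print(X_std.std(axis=0))   # ~[1. 1.]
```

At prediction time, the same fitted scaler transforms the test set, and `scaler.inverse_transform` maps standardized labels back to the original scale.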
2.3 Feature engineering
A random forest feature-selection algorithm is used to eliminate insensitive (low-importance) features.
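One way to realize this step (a sketch on synthetic data; the 0.05 importance threshold is an assumption, not from the article) is to rank features by a random forest's impurity-based importances and keep only those above a threshold:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Target depends only on the first two columns; the third is pure noise
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_

# Keep features whose importance exceeds a threshold (0.05 is assumed)
keep = importances > 0.05
print(keep)  # the noise feature is eliminated
```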
2.4 Model selection
GradientBoostingRegressor is used to build the ensemble regression model.
Gradient boosting moves in the direction of gradient descent at each iteration to work toward the best final result. The loss function describes the "reliability" of the model: assuming the model is not overfitting, the larger the loss, the higher the model's error.
If the model keeps driving the loss function down, it is continually improving, and the most effective way to do so is to decrease the loss along its gradient direction.
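The idea above can be seen directly in scikit-learn: each boosting stage fits a tree to the negative gradient of the loss, so the training loss recorded per stage keeps falling (the data below is synthetic):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 2))
y = 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Each iteration fits a tree to the negative gradient of the loss
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# train_score_ holds the training loss at each boosting stage
print(model.train_score_[0] > model.train_score_[-1])  # True
```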
2.5 Model evaluation
Mean squared error (MSE) is adopted as the scoring standard. MSE is the expected value of the squared difference between the estimated value of a parameter and its true value.
MSE measures how far predictions deviate from the data: the smaller the MSE, the better the prediction model describes the experimental data. The calculation formula is:
MSE = (1/n) · Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
Its MSE value on the test set is:
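The score itself can be computed with scikit-learn; the values below are illustrative, each prediction off by exactly 1:

```python
from sklearn.metrics import mean_squared_error

y_true = [22.0, 30.5, 15.2]
y_pred = [21.0, 31.5, 14.2]

# Average of the squared errors: (1 + 1 + 1) / 3
mse = mean_squared_error(y_true, y_pred)
print(mse)  # 1.0
```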
2.6 Model tuning
The n_estimators parameter is tuned:
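One standard way to tune n_estimators (the candidate values and the synthetic data here are assumptions for illustration) is a cross-validated grid search scored by negative MSE:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Candidate values for n_estimators are illustrative
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100, 200]},
    scoring="neg_mean_squared_error",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_["n_estimators"])
```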
3. Project summary
Through many experiments, the optimal MSE found is about 8.18. When dealing with overfitting on small data sets, first consider simplifying the model or enlarging the data set. Since the best result in this experiment was reached through extensive training with the default parameters, further hyperparameter optimization may yield additional improvement.