By Kaushik Choudhury
Compile | VK
Source: toward Data Science
About 13 years ago, David cournapeau’s scikit learn started as part of the Google summer of code project. Over time, scikit learn became one of the most famous machine learning libraries in Python. It offers several classification, regression and clustering algorithms, and in my opinion, its key advantage is seamless integration with numpy, pandas and SciPy.
In this paper, I will use Excel to compare the prediction accuracy of scikit learn multiple linear regression. Scikit learn provides many parameters (super parameters called estimators) to fine tune the training of models and improve the accuracy of prediction.
In Excel, we don’t have much to adjust the regression algorithm. For fair comparison, I will train the sklearn regression model with default parameters.
The purpose of this comparison is to understand the prediction accuracy of linear regression in Excel and scikit learn. In addition, I will briefly introduce the process of performing linear regression in Excel.
Sample data file
For comparison, we will use 100000 datasets of precipitation, minimum temperature, maximum temperature and wind speed. The data were obtained by measuring several times a day for eight years.
We will use precipitation, minimum and maximum temperatures to predict wind speed. Therefore, wind speed is a dependent variable and other data are independent variables.
We will first establish and forecast the linear regression model of wind speed on Excel. Then we’ll do the same exercise using scikit learn, and finally we’ll compare the predictions.
To perform linear regression in Excel, we will open the sample data file and click the data tab in the excel ribbon. In the data tab, select data analysis options.
Tip: if you don’t see the data analysis option, click File > Options > Add ins. Select the analysis toolpak and click the go button, as shown below
Click the data analysis option to open a pop-up window showing the different analysis tools available in Excel. We’ll select regression regression and click OK.
Another pop-up window will be displayed. “The excel cell reference for wind speed (dependent variable) is filled in the input y range field. In input x range, we will provide unit references for independent variables (i.e., precipitation, minimum temperature, and maximum temperature).
We need to select the check box “label” because the first row in the sample data has a variable name.
After specifying the data, click the “OK” button, excel will establish a linear regression model. You can think of it as training in scikit learn (fit function).
Excel calculates and displays information in good format. In our example, Excel can fit a linear regression model with R-square of 0.953. Considering the 100000 records in the training dataset, excel performed linear regression in less than 7 seconds. Along with other statistics, it also shows the intercept and coefficient of different independent variables.
Based on the linear regression output of Excel, we can combine the following mathematical relations.
Wind speed = 2.438 + (precipitation * 0.026) + (minimum temperature * 0.393) + (maximum temperature * 0.395)
We will use this formula to predict the wind speed of the test data set, which is not used for training in Excel.
For example, for the first test dataset, wind speed = 2.438 + (0.51 * 0.026) + (17.78 * 0.393) + (25.56 * 0.395) = 19.55
In addition, we calculated the residual of the forecast and plotted the trend. We can see that in almost all cases, the predicted wind speed is lower than the actual value, and the faster the wind speed, the greater the prediction error.
Let’s explore linear regression in scikit learn.
Step 1-We will import the package to be used for analysis. The values of independent variables are distributed in different value ranges and are not normally distributed. Therefore, we need standardscaler to realize the standardization of independent variables.
from sklearn.preprocessing import StandardScaler from sklearn.linear_model import LinearRegression from sklearn.metrics import r2_score import pandas as pd import numpy as np import matplotlib.pyplot as plt
Step 2-The training data and test data are read from excel file to pandas data frame as training data and test data respectively.
Training_data=pd.read_excel(“Weather.xlsx”, sheet_name=”Sheet1") Test_data=pd.read_excel(“Weather Test.xlsx”, sheet_name=”Sheet1")
In this paper, I will not focus on the preliminary data quality check, such as blank value, outlier value and corresponding correction methods
Step 3-In the following code, we declare all column data except “wind speed” as independent variables, and only “wind speed” as dependent variable is used for training and test data. Please note that we will not use “sourcedata”_ test_ Instead, the predicted value is compared with it.
SourceData_ train_ independent= Training_ data.drop (["windspeed"], axis = 1) ා removes the windspeed variable from the training dataset SourceData_ train_ dependent=Training_ Data ["windspeed"]. Copy() ා the training data set has only independent variables SourceData_test_independent=Test_data.drop(["WindSpeed"], axis=1) SourceData_test_dependent=Test_data["WindSpeed"].copy()
Step 4-Since the ranges of independent variables are completely different, we need to adjust them to avoid the performance impact caused by some variables with large ranges and some with small ranges. In the following code, the arguments are scaled and saved to x-train and X, respectively_ test。 In y_ The training variables are not saved.
sc_X = StandardScaler() X_ train=sc_ X.fit_ transform(SourceData_ train_ independent.values ）Zoom arguments Y_ train=SourceData_ train_ Dependent ා the dependent variable does not need to be scaled X_test=sc_X.transform(SourceData_test_independent) y_test=SourceData_test_dependent
Step 5-Now we’re going to enter the independent and dependent data, x, respectively_ Train and Y_ To train the linear regression model. For the reasons mentioned at the beginning of this article, we will perform model fitting using default parameters.
reg = LinearRegression().fit(X_train, y_train) print("The Linear regression score on training data is ", round(reg.score(X_train, y_train),2))
The linear regression scores of training data were consistent with the results observed by Excel.
Step 6-Finally, we will predict the wind speed based on the test set.
It can be seen from the predicted wind speed value and residual scatter diagram that the predicted value of sklean is closer to the actual value.
Comparing sklearn and excel residuals in parallel, we can see that with the increase of wind speed, the deviation between the model and the actual value is relatively large, but sklearn is better than excel.
On the other hand, excel does predict the wind speed range similar to sklearn. If an approximate linear regression model is good enough for your business case, excel is a good choice for quick forecasting.
Excel can perform linear regression prediction at the same precision level as sklearn. Because sklearn can greatly improve the prediction accuracy of sklearn linear regression by fine tuning the parameters, and it is more suitable to deal with complex models. For fast and approximate forecasting, excel is a very good choice with acceptable accuracy.
Link to the original text: https://towardsdatascience.co…
Welcome to visit pan Chuang AI blog station:
Sklearn machine learning Chinese official document:
Welcome to pay attention to pan Chuang blog resource collection station: