Data cleaning: handling missing values. There are three options: 1) delete the records, 2) interpolate the data, 3) leave the values untreated.
The data and code come from the resource package at https://book.tipdm.org/jc/219, under Chapter4\demo\data\catering_sale.xls
Common interpolation methods
Lagrange interpolation
From basic mathematics, for n known points in the plane (no two of them sharing the same x-coordinate), a polynomial of degree at most n-1 can be found whose curve passes through all n points.
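As a quick check of this statement, here is a minimal sketch using `scipy.interpolate.lagrange`. The four sample points are made up for illustration and are not from the catering data.

```python
import numpy as np
from scipy.interpolate import lagrange

# Hypothetical sample: 4 points with distinct x-coordinates
x = np.array([0.0, 1.0, 2.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

poly = lagrange(x, y)   # returns a numpy.poly1d object

print(poly.order)       # degree of the polynomial, at most len(x) - 1
print(poly(x))          # evaluating at the known x values reproduces y (up to rounding)
```

With n = 4 points the polynomial has degree at most 3, and evaluating it at the known x values gives back the known y values.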
1) Write down the polynomial of degree n-1 through the n known points, with unknown coefficients a_0, a_1, ..., a_(n-1).
2) Substitute the coordinates of the n points into the polynomial. This gives n linear equations in the coefficients.
3) Solve these equations to obtain the Lagrange interpolation polynomial L(x).
4) Substitute the x of the missing function value into the polynomial to obtain the approximate value L(x).
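Written out in standard notation, the procedure above looks as follows. The symbols a_i for the coefficients and (x_i, y_i) for the known points are labels chosen here for illustration.

```latex
% Polynomial of degree n-1 with unknown coefficients
\[
  y = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1}
\]

% Substituting the n known points (x_i, y_i) gives n linear equations
\[
  y_i = a_0 + a_1 x_i + a_2 x_i^2 + \cdots + a_{n-1} x_i^{n-1},
  \qquad i = 1, \dots, n
\]

% Solving the system yields the Lagrange interpolation polynomial
\[
  L(x) = \sum_{i=1}^{n} y_i \prod_{\substack{j=1 \\ j \neq i}}^{n}
         \frac{x - x_j}{x_i - x_j}
\]
```

Each product term equals 1 at x = x_i and 0 at every other known x_j, so L(x_i) = y_i for all n points.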
# Lagrange interpolation code
import numpy as np
import pandas as pd  # data analysis library
from scipy.interpolate import lagrange  # Lagrange interpolation function

inputfile = '../data/catering_sale.xls'  # sales data path
outputfile = '../tmp/sales.xls'  # output data path

data = pd.read_excel(inputfile)  # read in the data

# Find the values that do not meet the requirements.
# 'sales volume' must match the column header in the spreadsheet.
temp = data['sales volume'][(data['sales volume'] < 400) | (data['sales volume'] > 5000)]
for i in range(temp.shape[0]):
    data.loc[temp.index[i], 'sales volume'] = np.nan  # change out-of-range values to null

# Custom column-vector interpolation function.
# s is the column vector, n is the position to interpolate, k is the number
# of data points taken before and after it (default 5).
def ployinterp_column(s, n, k=5):
    y = s.iloc[list(range(n - k, n)) + list(range(n + 1, n + 1 + k))]  # k points on each side
    y = y[y.notnull()]  # drop null values
    f = lagrange(y.index, list(y))
    return f(n)  # interpolate and return the result

# Check element by element whether interpolation is needed
for i in data.columns:
    for j in range(len(data)):
        if (data[i].isnull())[j]:  # interpolate if the value is missing
            data.loc[j, i] = ployinterp_column(data[i], j)

# Write the result to file (writing .xls relies on the xlwt engine of older
# pandas versions; newer versions may need an .xlsx path instead)
data.to_excel(outputfile)
print("success")
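This code runs. While getting it to work, I ran into this warning:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
At first I did not know how to get rid of it. After fiddling with the code for a while, it went away almost by accident. As far as I can tell, the cause is chained assignment of the form data[column][rows] = value, which may write to a temporary copy instead of the DataFrame itself. Assigning through data.loc, as the loop above does, avoids it.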
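A minimal sketch of the usual way to avoid this warning, assuming a toy DataFrame with a hypothetical `sales` column in place of the real data. Selecting rows and column in a single `.loc` call operates on the DataFrame itself, so many values can be set at once without a loop.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the sales column
data = pd.DataFrame({"sales": [3000.0, 60.0, 2800.0, 9100.0, 3200.0]})

# Boolean mask of the values outside the accepted range [400, 5000]
out_of_range = (data["sales"] < 400) | (data["sales"] > 5000)

# Chained form that can trigger the warning (it may assign to a copy):
#     data["sales"][out_of_range] = np.nan
# Single .loc call: rows and column are selected in one step on `data`
data.loc[out_of_range, "sales"] = np.nan

print(data)
```

The loop over `temp.index` in the code above has the same effect. The mask form is just shorter.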
However, a closer look at the output shows a problem: one of the interpolated values is abnormal.
When processing the data, we change the values below 400 and above 5000 into nulls and then fill them in by Lagrange interpolation. Yet one of the filled-in values is negative, which is absurd for a sales figure. I checked the code and found nothing wrong, so I printed the data used and the fitted Lagrange polynomial:
f = -0.008874 x^9 + 11.53 x^8 - 6657 x^7 + 2.242e+06 x^6 - 4.854e+08 x^5 + 7.005e+10 x^4 - 6.74e+12 x^3 + 4.168e+14 x^2 - 1.504e+16 x + 2.411e+17
I found no problem there either. Next I wondered whether the fit simply did not use enough points, so I increased the number of points. The results did not improve; they became even more outrageous. This is overfitting: the model fits the data it was built from very well but performs poorly on data it has not seen. A high-degree interpolating polynomial passes exactly through every known point, yet it can swing wildly between them.
For example, when fitting a set of data points, a low-degree polynomial leaves many of the points off the curve, while a degree-4 polynomial fits more of them. A degree-14 polynomial fits the points better still, yet its predictions on test data may be very unreasonable.
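The effect can be reproduced with a small experiment. This is only a sketch under assumptions: the data comes from the function 1/(1 + 25x^2), chosen here for illustration, sampled at 15 equally spaced points, and the fits use `numpy.polynomial.Polynomial.fit` with degrees 1, 4 and 14.

```python
import numpy as np
from numpy.polynomial import Polynomial

def runge(x):
    # Assumed example function, not taken from the original data
    return 1.0 / (1.0 + 25.0 * x ** 2)

def rms(values):
    # Root-mean-square size of an error vector
    return float(np.sqrt(np.mean(values ** 2)))

x_known = np.linspace(-1.0, 1.0, 15)    # the 15 known ("training") points
y_known = runge(x_known)
x_dense = np.linspace(-1.0, 1.0, 1001)  # points in between ("test")

known_err = {}
dense_err = {}
for degree in (1, 4, 14):
    p = Polynomial.fit(x_known, y_known, degree)  # least-squares polynomial fit
    known_err[degree] = rms(p(x_known) - y_known)
    dense_err[degree] = rms(p(x_dense) - runge(x_dense))
    print(degree, known_err[degree], dense_err[degree])
```

The degree-14 polynomial passes through all 15 known points, so its error there is essentially zero, but between the points near the ends of the interval it oscillates, and its error on the dense grid is larger than that of the degree-4 fit.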
Finally, I reduced the number of points and found that taking 4 points on each side gives a good result, as does taking 3, 2, or 1 (with 1 point on each side the fit is a straight line, which is not recommended). So there was nothing wrong with fitting 5 points on each side as such. The polynomial it produces simply takes an outrageous value at that particular point.
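To compare window sizes like this, the interpolation can be wrapped in a small helper that takes k as a parameter. The sketch below is not the original function: `interp_at` is a hypothetical helper, the series is made-up toy data with a default integer index, and the window is clamped at the ends of the series so that it never reaches past them.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import lagrange

def interp_at(s, n, k):
    """Lagrange-interpolate position n of Series s from up to k neighbours per side."""
    left = list(range(max(n - k, 0), n))                # clamp at the start
    right = list(range(n + 1, min(n + 1 + k, len(s))))  # clamp at the end
    y = s.iloc[left + right]
    y = y[y.notnull()]                                  # skip other missing values
    f = lagrange(y.index.to_numpy(dtype=float), y.to_numpy())
    return float(f(n))

# Hypothetical toy series (not the catering data), missing value at position 5
s = pd.Series([3000.0, 3100.0, 2900.0, 3200.0, 3050.0, np.nan,
               3150.0, 2950.0, 3100.0, 3000.0, 3200.0])

results = {k: interp_at(s, 5, k) for k in range(1, 6)}
for k, value in results.items():
    print(k, value)
```

With k = 1 the result is just the midpoint of the two neighbouring values. Printing the values for each k makes it easy to spot a window size that produces an implausible number.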