Simple data analysis
import pandas as pd
datafile = '../data/air_data.csv'  # airline raw data; the first row contains the attribute labels
resultfile = '../tmp/explore.csv'  # data exploration result table
#Read the original data and specify UTF-8 encoding (convert the data file to UTF-8 in a text editor first)
data = pd.read_csv(datafile, encoding='utf-8')
#Basic description of the data; the percentiles parameter is the list of quantiles to compute (e.g. the 1/4 quantile, the median)
explore = data.describe(percentiles=[], include='all').T  # .T transposes the result, which is easier to read
explore['null'] = len(data) - explore['count']  # describe() only counts non-null values, so null counts are computed manually
explore = explore[['null', 'max', 'min']]
explore.columns = [u'number of null values', u'maximum value', u'minimum value']  # rename the header
'''
Only part of the results are selected here.
The fields automatically calculated by the describe() function are count (non-null values), unique (number of unique values), top (most frequent value),
freq (frequency of the top value), mean, std, min, 50% (median), and max.
'''
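The result path resultfile defined above is not otherwise used in the listing. A minimal sketch for persisting the summary, assuming the ../tmp directory already exists:
#minimal sketch: write the exploration summary to the result file defined above
#(assumes the ../tmp directory already exists)
explore.to_csv(resultfile)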
Simple visualization
import pandas as pd
import matplotlib.pyplot as plt
datafile = '../data/air_data.csv'  # airline raw data; the first row contains the attribute labels
#Read the original data and specify UTF-8 encoding (convert the data file to UTF-8 in a text editor first)
data = pd.read_csv(datafile, encoding='utf-8')
#Customer basic information
#Year in which each member joined
from datetime import datetime
ffp = data['FFP_DATE'].apply(lambda x: datetime.strptime(x, '%Y/%m/%d'))
ffp_year = ffp.map(lambda x : x.year)
#Draw a histogram of the number of members joining in each year
fig = plt.figure(figsize=(8, 5))  # set the canvas size
plt.rcParams['font.sans-serif'] = 'SimHei'  # enable Chinese character display
plt.rcParams['axes.unicode_minus'] = False
plt.hist(ffp_year, bins='auto', color='#0504aa')
plt.xlabel('year')
plt.ylabel('number of members')
plt.title('number of members joining in each year')
plt.show()
plt.close()
#Number of members of different genders
male = pd.value_counts(data['GENDER'])['male']
female = pd.value_counts(data['GENDER'])['female']
#Draw a pie chart of the gender proportion of members
fig = plt.figure(figsize=(7, 4))  # set the canvas size
plt.pie([male, female], labels=['male', 'female'], colors=['lightskyblue', 'lightcoral'],
        autopct='%1.1f%%')
plt.title('gender ratio of members')
plt.show()
plt.close()
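Note that the top-level pd.value_counts function used above is deprecated in recent pandas releases; the Series method gives the same counts. A minimal equivalent sketch, assuming the same 'GENDER' column and 'male'/'female' values:
#equivalent counting via the Series method (pd.value_counts is deprecated in newer pandas)
gender_counts = data['GENDER'].value_counts()
male = gender_counts['male']
female = gender_counts['female']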
Data cleaning
import numpy as np
import pandas as pd
datafile = '../data/air_data.csv'  # path of the airline raw data
cleanedfile = '../tmp/data_cleaned.csv'  # path where the cleaned data will be saved
#Read data
airline_data = pd.read_csv(datafile, encoding='utf-8')
print('The shape of the original data is:', airline_data.shape)
#Remove records where the fare is null
airline_notnull = airline_data.loc[airline_data[‘SUM_YR_1’].notnull() &
airline_data[‘SUM_YR_2’].notnull(),:]
print('The shape of the data after deleting the missing records is:', airline_notnull.shape)
#Keep only records whose ticket price is non-zero, or whose average discount rate is non-zero and total flown kilometres are greater than 0
index1 = airline_notnull[‘SUM_YR_1’] != 0
index2 = airline_notnull[‘SUM_YR_2’] != 0
index3 = (airline_notnull[‘SEG_KM_SUM’]> 0) & (airline_notnull[‘avg_discount’] != 0)
index4 = airline_notnull['AGE'] > 100  # flag records with age greater than 100 for removal
airline = airline_notnull[(index1 | index2) & index3 & ~index4]
print('The shape of the data after cleaning is:', airline.shape)
airline.to_csv(cleanedfile)  # save the cleaned data
The shape of the original data is: (62988, 44)
The shape of the data after deleting the missing records is: (62299, 44)
The shape of the data after cleaning is: (62043, 44)
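To see how each condition contributes to the drop from 62,299 to 62,043 records, the boolean masks built above can be inspected directly. A minimal sketch reusing index1 to index4 from the listing:
#minimal sketch: count how many rows each cleaning condition flags (reusing the masks above)
print('rows with both yearly fares zero:', (~(index1 | index2)).sum())
print('rows with zero kilometres or zero average discount:', (~index3).sum())
print('rows with age over 100:', index4.sum())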
Attribute selection, construction and data standardization
#Attribute selection, construction and data standardization
import pandas as pd
import numpy as np
#Read data after cleaning
cleanedfile = '../tmp/data_cleaned.csv'  # path of the cleaned data saved earlier
airline = pd.read_csv(cleanedfile, encoding='utf-8')
#Select the required attributes
airline_selection = airline[['FFP_DATE', 'LOAD_TIME', 'LAST_TO_END',
                             'FLIGHT_COUNT', 'SEG_KM_SUM', 'avg_discount']]
print('The first 5 rows of the selected attributes are:\n', airline_selection.head())
Code 7-8
#Construct attribute L (length of membership, in months)
L = pd.to_datetime(airline_selection['LOAD_TIME']) - \
    pd.to_datetime(airline_selection['FFP_DATE'])
L = L.astype('str').str.split().str[0]
L = L.astype('int') / 30
#Merge properties
airline_features = pd.concat([L,airline_selection.iloc[:,2:]],axis = 1)
airline_features.columns = [‘L’,’R’,’F’,’M’,’C’]
print('The first 5 rows of the constructed LRFMC attributes are:\n', airline_features.head())
#Data standardization
from sklearn.preprocessing import StandardScaler
data = StandardScaler().fit_transform(airline_features)
np.savez(‘../tmp/airline_scale.npz’,data)
print('The first 5 rows of the standardized LRFMC attributes are:\n', data[:5, :])
The first 5 rows of the selected attributes are:
FFP_DATE LOAD_TIME LAST_TO_END FLIGHT_COUNT SEG_KM_SUM avg_discount
0 2006/11/2 2014/3/31 1 210 580717 0.961639
1 2007/2/19 2014/3/31 7 140 293678 1.252314
2 2007/2/1 2014/3/31 11 135 283712 1.254676
3 2008/8/22 2014/3/31 97 23 281336 1.090870
4 2009/4/10 2014/3/31 5 152 309928 0.970658
The first 5 rows of the constructed LRFMC attributes are:
L R F M C
0 90.200000 1 210 580717 0.961639
1 86.566667 7 140 293678 1.252314
2 87.166667 11 135 283712 1.254676
3 68.233333 97 23 281336 1.090870
4 60.533333 5 152 309928 0.970658
The first 5 rows of the standardized LRFMC attributes are:
[[ 1.43579256 -0.94493902 14.03402401 26.76115699 1.29554188]
[ 1.30723219 -0.91188564 9.07321595 13.12686436 2.86817777]
[ 1.32846234 -0.88985006 8.71887252 12.65348144 2.88095186]
[ 0.65853304 -0.41608504 0.78157962 12.54062193 1.99471546]
[ 0.3860794 -0.92290343 9.92364019 13.89873597 1.34433641]]
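The string-splitting step above extracts the day count from the Timedelta's text form. The same value can be read from the Timedelta accessor directly; a minimal alternative sketch (not the listing's method, just an equivalent):
#alternative construction of L via the .dt.days accessor instead of string splitting
L_alt = (pd.to_datetime(airline_selection['LOAD_TIME'])
         - pd.to_datetime(airline_selection['FFP_DATE'])).dt.days / 30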
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
#Read standardized data
airline_scale = np.load(‘../tmp/airline_scale.npz’)[‘arr_0’]
k = 5  # number of cluster centers
#Build the model and set the random seed as 123
kmeans_model = KMeans(n_clusters=k, n_jobs=4, random_state=123)
fit_kmeans = kmeans_model.fit(airline_scale)  # model training
#View cluster results
kmeans_cc = kmeans_model.cluster_centers_  # cluster centers
print('The cluster centers are:\n', kmeans_cc)
kmeans_labels = kmeans_model.labels_  # category label of each sample
print('The category labels of each sample are:\n', kmeans_labels)
r1 = pd.Series(kmeans_model.labels_).value_counts()  # count the number of samples in each category
print('The final number of each category is:\n', r1)
#Output clustering results
cluster_center = pd.DataFrame(kmeans_model.cluster_centers_,\
                              columns=['ZL', 'ZR', 'ZF', 'ZM', 'ZC'])  # put the cluster centers into a DataFrame
cluster_center.index = pd.DataFrame(kmeans_model.labels_).\
                       drop_duplicates().iloc[:, 0]  # use the sample category as the DataFrame index
print(cluster_center)
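The n_jobs argument reflects the scikit-learn version the listing was written for; KMeans no longer accepts it in scikit-learn 1.0 and later. A minimal sketch of the equivalent call on a current release, assuming the same data and seed:
#equivalent model construction on newer scikit-learn, where KMeans has no n_jobs parameter
kmeans_model = KMeans(n_clusters=k, random_state=123, n_init=10)
fit_kmeans = kmeans_model.fit(airline_scale)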
Code 7-10
import matplotlib.pyplot as plt
#Customer group radar map
labels = [‘ZL’,’ZR’,’ZF’,’ZM’,’ZC’]
legen = ['customer group ' + str(i + 1) for i in cluster_center.index]  # customer group names, used as the radar chart legend
lstype = [‘-‘,’–‘,(0, (3, 5, 1, 5, 1, 5)),’:’,’-.’]
kinds = list(cluster_center.iloc[:, 0])
#The radar chart needs closed data, so append the ZL column again and convert to an np.ndarray
cluster_center = pd.concat([cluster_center, cluster_center[[‘ZL’]]], axis=1)
centers = np.array(cluster_center.iloc[:, 0:])
#Divide and close the circumference of a circle
n = len(labels)
angle = np.linspace(0, 2 * np.pi, n, endpoint=False)
angle = np.concatenate((angle, [angle[0]]))
#Drawing
fig = plt.figure(figsize = (8,6))
ax = fig.add_subplot(111, polar=True)  # plot in polar coordinates
plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # display minus signs correctly
#Draw the lines
for i in range(len(kinds)):
ax.plot(angle, centers[i], linestyle=lstype[i], linewidth=2, label=kinds[i])
#Add attribute label
ax.set_thetagrids(angle[:-1] * 180 / np.pi, labels)  # one grid angle per label; drop the duplicated closing angle
plt.title('customer characteristic analysis radar chart')
plt.legend(legen)
plt.show()
plt.close()
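If the radar chart is needed as a file rather than an interactive window, it can be saved before plt.show() is called. A minimal sketch; the output path is only an example:
#minimal sketch: save the radar chart to disk (the path is only an example)
fig.savefig('../tmp/customer_radar.png', dpi=150, bbox_inches='tight')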
The cluster centers are:
[[ 1.1606821 -0.37729768 -0.08690742 -0.09484273 -0.15591932]
[-0.31365557 1.68628985 -0.57402225 -0.53682279 -0.17332449]
[ 0.05219076 -0.00264741 -0.22674532 -0.23116846 2.19158505]
[-0.70022016 -0.4148591 -0.16116192 -0.1609779 -0.2550709 ]
[ 0.48337963 -0.79937347 2.48319841 2.42472188 0.30863168]]
The category labels of each sample are:
[4 4 4 … 3 1 1]
The final number of each category is:
3 24661
0 15739
1 12125
4 5336
2 4182
dtype: int64
ZL ZR ZF ZM ZC
0
4 1.160682 -0.377298 -0.086907 -0.094843 -0.155919
2 -0.313656 1.686290 -0.574022 -0.536823 -0.173324
0 0.052191 -0.002647 -0.226745 -0.231168 2.191585
3 -0.700220 -0.414859 -0.161162 -0.160978 -0.255071
1 0.483380 -0.799373 2.483198 2.424722 0.308632
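For later profiling it can help to attach each sample's cluster label back to the LRFMC table built earlier. A minimal sketch, assuming airline_features from the attribute-construction listing and kmeans_model from the clustering listing are both still in scope:
#minimal sketch: attach each sample's cluster label to the LRFMC attribute table
labelled = airline_features.copy()
labelled['cluster'] = kmeans_model.labels_
print(labelled.groupby('cluster').size())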