Data mining practice — airline value analysis

Time: 2020-01-26

Simple data analysis

import pandas as pd

datafile = '../data/air_data.csv'  # raw airline data; the first row holds the attribute labels
resultfile = '../tmp/explore.csv'  # data-exploration result table

# Read the raw data with UTF-8 encoding (convert the file to UTF-8 in a text editor first)
data = pd.read_csv(datafile, encoding='utf-8')

# Basic description of the data; the percentiles parameter specifies which quantiles
# to compute (e.g. quartiles, median)
explore = data.describe(percentiles=[], include='all').T  # .T transposes; the table is easier to read transposed
explore['null'] = len(data) - explore['count']  # describe() counts non-null values; null counts must be derived manually
explore = explore[['null', 'max', 'min']]
explore.columns = [u'null count', u'maximum', u'minimum']  # rename the header
explore.to_csv(resultfile)  # save the exploration result

'''
Only part of the result is kept here.
The fields describe() computes automatically are: count (non-null values), unique (distinct values),
top (most frequent value), freq (frequency of the top value), mean, std, min, 50% (median) and max.
'''
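The manual null-count trick above can be checked on a toy frame; the column names here (`fare`, `km`) are illustrative only, not from the airline data:

```python
import pandas as pd
import numpy as np

# Toy frame with one missing value per column (names are made up for illustration)
df = pd.DataFrame({'fare': [100.0, np.nan, 50.0, 80.0],
                   'km': [500.0, 1200.0, np.nan, 900.0]})

explore = df.describe(percentiles=[], include='all').T
explore['null'] = len(df) - explore['count']  # describe() reports non-nulls; nulls are derived
explore = explore[['null', 'max', 'min']]
print(explore)
```

Each column has four rows and one NaN, so the derived null count is 1 for both.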

Simple visualization

import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime

datafile = '../data/air_data.csv'  # raw airline data; the first row holds the attribute labels

# Read the raw data with UTF-8 encoding (convert the file to UTF-8 in a text editor first)
data = pd.read_csv(datafile, encoding='utf-8')

# Customer information category
# Year in which each member enrolled
ffp = data['FFP_DATE'].apply(lambda x: datetime.strptime(x, '%Y/%m/%d'))
ffp_year = ffp.map(lambda x: x.year)

# Histogram of the number of members enrolling each year
fig = plt.figure(figsize=(8, 5))  # set the canvas size
plt.rcParams['font.sans-serif'] = 'SimHei'  # display Chinese characters
plt.rcParams['axes.unicode_minus'] = False
plt.hist(ffp_year, bins='auto', color='#0504aa')
plt.xlabel('Year')
plt.ylabel('Number of members')
plt.title('Number of members enrolling each year')
plt.show()
plt.close()

# Number of members of each gender
male = data['GENDER'].value_counts()['male']
female = data['GENDER'].value_counts()['female']

# Pie chart of the member gender ratio
fig = plt.figure(figsize=(7, 4))  # set the canvas size
plt.pie([male, female], labels=['male', 'female'],
        colors=['lightskyblue', 'lightcoral'],  # 'lightcoral' is the matplotlib colour intended here
        autopct='%1.1f%%')
plt.title('Gender ratio of members')
plt.show()
plt.close()
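The value_counts tally used above simply counts each distinct value, in descending order of frequency; a minimal check with made-up values:

```python
import pandas as pd

# Made-up gender series for illustration
gender = pd.Series(['male', 'female', 'male', 'male', 'female'])
counts = gender.value_counts()  # sorted by frequency, descending
print(counts['male'], counts['female'])
```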

Data cleaning

import numpy as np
import pandas as pd

datafile = '../data/air_data.csv'  # raw airline data path
cleanedfile = '../tmp/data_cleaned.csv'  # path of the file saved after cleaning

# Read data
airline_data = pd.read_csv(datafile, encoding='utf-8')
print('The shape of the original data is:', airline_data.shape)

# Remove records with an empty fare
airline_notnull = airline_data.loc[airline_data['SUM_YR_1'].notnull() &
                                   airline_data['SUM_YR_2'].notnull(), :]
print('The shape of the data after deleting the missing record is:', airline_notnull.shape)

# Keep only records with a non-zero fare, or with a non-zero average discount rate
# and total flown kilometres greater than zero
index1 = airline_notnull['SUM_YR_1'] != 0
index2 = airline_notnull['SUM_YR_2'] != 0
index3 = (airline_notnull['SEG_KM_SUM'] > 0) & (airline_notnull['avg_discount'] != 0)
index4 = airline_notnull['AGE'] > 100  # flag records with age above 100 for removal
airline = airline_notnull[(index1 | index2) & index3 & ~index4]
print('The shape of data after data cleaning is:', airline.shape)

airline.to_csv(cleanedfile)  # save the cleaned data

The shape of the original data is: (62988, 44)
The shape of the data after deleting the missing record is: (62299, 44)
The shape of data after data cleaning is: (62043, 44)
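The mask logic `(index1 | index2) & index3 & ~index4` can be traced on a tiny made-up frame with the same column names:

```python
import pandas as pd

# Made-up rows: row 0 has zero fares, row 3 has an implausible age
df = pd.DataFrame({'SUM_YR_1': [0, 100, 0, 200],
                   'SUM_YR_2': [0, 0, 50, 0],
                   'SEG_KM_SUM': [0, 800, 900, 700],
                   'avg_discount': [0.0, 0.9, 0.8, 0.7],
                   'AGE': [30, 40, 50, 120]})

index1 = df['SUM_YR_1'] != 0
index2 = df['SUM_YR_2'] != 0
index3 = (df['SEG_KM_SUM'] > 0) & (df['avg_discount'] != 0)
index4 = df['AGE'] > 100  # implausibly old records are dropped
kept = df[(index1 | index2) & index3 & ~index4]
print(kept.index.tolist())  # rows 1 and 2 survive
```

Row 0 fails `index1 | index2` (both fares zero) and row 3 fails `~index4` (age 120), so only rows 1 and 2 remain.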

Attribute selection, construction and data standardization

# Attribute selection, construction and data standardization
import pandas as pd
import numpy as np

# Read the cleaned data
cleanedfile = '../tmp/data_cleaned.csv'  # path of the file saved after cleaning
airline = pd.read_csv(cleanedfile, encoding='utf-8')

# Select the required attributes
airline_selection = airline[['FFP_DATE', 'LOAD_TIME', 'LAST_TO_END',
                             'FLIGHT_COUNT', 'SEG_KM_SUM', 'avg_discount']]
print('The first 5 rows of the selected attributes are:\n', airline_selection.head())


# Construct the attribute L (membership length in months)
L = pd.to_datetime(airline_selection['LOAD_TIME']) - \
    pd.to_datetime(airline_selection['FFP_DATE'])
L = L.astype('str').str.split().str[0]  # keep the day count from the timedelta string
L = L.astype('int') / 30  # convert days to months

# Merge the attributes
airline_features = pd.concat([L, airline_selection.iloc[:, 2:]], axis=1)
airline_features.columns = ['L', 'R', 'F', 'M', 'C']
print('The first 5 rows of the constructed LRFMC attributes are:\n', airline_features.head())

# Standardize the data
from sklearn.preprocessing import StandardScaler
data = StandardScaler().fit_transform(airline_features)
np.savez('../tmp/airline_scale.npz', data)
print('The first 5 rows of the standardized LRFMC attributes are:\n', data[:5, :])

The first 5 rows of the selected attributes are:

FFP_DATE LOAD_TIME LAST_TO_END FLIGHT_COUNT SEG_KM_SUM avg_discount

0 2006/11/2 2014/3/31 1 210 580717 0.961639

1 2007/2/19 2014/3/31 7 140 293678 1.252314

2 2007/2/1 2014/3/31 11 135 283712 1.254676

3 2008/8/22 2014/3/31 97 23 281336 1.090870

4 2009/4/10 2014/3/31 5 152 309928 0.970658

The first 5 rows of the constructed LRFMC attributes are:

L R F M C

0 90.200000 1 210 580717 0.961639

1 86.566667 7 140 293678 1.252314

2 87.166667 11 135 283712 1.254676

3 68.233333 97 23 281336 1.090870

4 60.533333 5 152 309928 0.970658

The first 5 rows of the standardized LRFMC attributes are:

[[ 1.43579256 -0.94493902 14.03402401 26.76115699 1.29554188]

[ 1.30723219 -0.91188564 9.07321595 13.12686436 2.86817777]

[ 1.32846234 -0.88985006 8.71887252 12.65348144 2.88095186]

[ 0.65853304 -0.41608504 0.78157962 12.54062193 1.99471546]

[ 0.3860794 -0.92290343 9.92364019 13.89873597 1.34433641]]
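StandardScaler performs a column-wise z-score, (x − mean) / std, which is why each standardized LRFMC column above has mean 0 and standard deviation 1; a small verifiable sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns on very different scales
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

Z = StandardScaler().fit_transform(X)  # per-column (x - mean) / std (population std)
print(Z.mean(axis=0), Z.std(axis=0))  # each column: mean ~0, std ~1
```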

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

# Read the standardized data
airline_scale = np.load('../tmp/airline_scale.npz')['arr_0']
k = 5  # number of cluster centers

# Build the model with random seed 123
# (n_jobs was removed from KMeans in scikit-learn 1.0; drop it on newer versions)
kmeans_model = KMeans(n_clusters=k, n_jobs=4, random_state=123)
fit_kmeans = kmeans_model.fit(airline_scale)  # train the model

# Inspect the clustering results
kmeans_cc = kmeans_model.cluster_centers_  # cluster centers
print('The cluster centers are:\n', kmeans_cc)
kmeans_labels = kmeans_model.labels_  # category label of each sample
print('The category labels of each sample are:\n', kmeans_labels)
r1 = pd.Series(kmeans_model.labels_).value_counts()  # count the samples in each category
print('The final number of each category is:\n', r1)

# Output the clustering result
cluster_center = pd.DataFrame(kmeans_model.cluster_centers_,
                              columns=['ZL', 'ZR', 'ZF', 'ZM', 'ZC'])  # name the columns
cluster_center.index = pd.DataFrame(kmeans_model.labels_).\
    drop_duplicates().iloc[:, 0]  # use the sample categories as the index
print(cluster_center)
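The fit / `labels_` / `cluster_centers_` pattern above can be sanity-checked on synthetic data; the two blobs here are made up, and the cluster numbering KMeans assigns is arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs of 20 points each
rng = np.random.RandomState(123)
blob_a = rng.normal(0.0, 0.1, size=(20, 2))
blob_b = rng.normal(5.0, 0.1, size=(20, 2))
X = np.vstack([blob_a, blob_b])

model = KMeans(n_clusters=2, n_init=10, random_state=123).fit(X)
labels = model.labels_
# Each blob should land wholly in a single (different) cluster
print(model.cluster_centers_.shape, len(set(labels[:20])), len(set(labels[20:])))
```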


import matplotlib.pyplot as plt

# Radar chart of the customer groups
labels = ['ZL', 'ZR', 'ZF', 'ZM', 'ZC']
legend = ['Customer group ' + str(i + 1) for i in cluster_center.index]  # group names, used as the legend
lstype = ['-', '--', (0, (3, 5, 1, 5, 1, 5)), ':', '-.']
kinds = list(cluster_center.iloc[:, 0])

# The radar chart must close on itself, so append the ZL column again and convert to np.ndarray
cluster_center = pd.concat([cluster_center, cluster_center[['ZL']]], axis=1)
centers = np.array(cluster_center.iloc[:, 0:])

# Split the circle evenly and close the polyline
n = len(labels)
angle = np.linspace(0, 2 * np.pi, n, endpoint=False)
angle = np.concatenate((angle, [angle[0]]))

# Draw the figure
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, polar=True)  # plot in polar coordinates
plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels normally
plt.rcParams['axes.unicode_minus'] = False  # display minus signs normally

# Draw the lines
for i in range(len(kinds)):
    ax.plot(angle, centers[i], linestyle=lstype[i], linewidth=2, label=kinds[i])

# Add the attribute labels (drop the duplicated closing angle so angles and labels match)
ax.set_thetagrids(angle[:-1] * 180 / np.pi, labels)
plt.title('Radar chart of customer characteristic analysis')
plt.legend(legend)
plt.show()
plt.close()

The cluster centers are:

[[ 1.1606821 -0.37729768 -0.08690742 -0.09484273 -0.15591932]

[-0.31365557 1.68628985 -0.57402225 -0.53682279 -0.17332449]

[ 0.05219076 -0.00264741 -0.22674532 -0.23116846 2.19158505]

[-0.70022016 -0.4148591 -0.16116192 -0.1609779 -0.2550709 ]

[ 0.48337963 -0.79937347 2.48319841 2.42472188 0.30863168]]

The category labels of each sample are:

[4 4 4 … 3 1 1]

The final number of each category is:

3 24661

0 15739

1 12125

4 5336

2 4182

dtype: int64

ZL ZR ZF ZM ZC

0

4 1.160682 -0.377298 -0.086907 -0.094843 -0.155919

2 -0.313656 1.686290 -0.574022 -0.536823 -0.173324

0 0.052191 -0.002647 -0.226745 -0.231168 2.191585

3 -0.700220 -0.414859 -0.161162 -0.160978 -0.255071

1 0.483380 -0.799373 2.483198 2.424722 0.308632
