Project background
Explore classic data sets in depth.
1 data set review
import matplotlib.pyplot as plt  # graphics library
import numpy as np
import pandas as pd
from sklearn.metrics import silhouette_score  # silhouette coefficient metric
from sklearn.cluster import KMeans  # KMeans module
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder  # data preprocessing library
from mlxtend.preprocessing import one_hot
import seaborn as sns
data = pd.read_table(r'E:\BaiduNetdiskDownload\statistics\Python data analysis and data operation\Chapter7\ad_performance.txt')
data

data.describe()  # feature scales differ hugely: daily average UV is about 540 while registration and conversion rates are near 0.0, almost a thousand-fold gap, so MinMaxScaler() will be needed later

data.isnull().any(axis=0)  # check for missing values: "average residence time" has them

pd.DataFrame(data.corr()["access depth"].sort_values(ascending=False))  # on newer pandas, use data.corr(numeric_only=True)

The correlation coefficient between average residence time and access depth is strong, so one of them has to be dropped: the task here is clustering, and clustering algorithms are sensitive to collinear features.
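The deletion itself is not shown above; a minimal sketch, assuming the column is named "average residence time" as in the correlation table:

data = data.drop(columns=["average residence time"])  # drop the collinear feature (column name assumed; adjust if the dataset names it differently)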
2 data visualization exploration
import seaborn as sns
sns.set_style('white', {'font.sans-serif': ['SimHei', 'Arial']})  # set a white background and Chinese font support
2.1 exploration of categorical variables
2.1.1 advertising type
sns. Countlot (y = "advertisement type", data = data, color = '#1e90ff',
Order = data ["advertisement type"] value_ counts(). index)
Data ["advertisement type"] value_ Counts() / sum (data ["ad type"]. Value_counts())

2.1.2 material type
sns. Countlot (y = "material type", data = data, color = '#1e90ff',
Order = data ["material type"] value_ counts(). index)
Data ["material type"] value_ Counts() / sum (data ["material type"]. Value_counts())

2.1.3 cooperation mode
sns. Countlot (y = "cooperation method", data = data, color = '#1e90ff',
Order = data ["cooperation mode"] value_ counts(). index)
Data ["cooperation mode"] value_ Counts() / sum (data ["cooperation method"]. Value_counts())

2.1.4 advertising size
sns. Countlot (y = "ad size", data = data, color = '#1e90ff',
Order = data ["ad size"] value_ counts(). index)
Data ["ad size"] value_ Counts() / sum (data ["ad size"]. Value_counts())

2.1.4.1 convert advertising size into advertising area
Data ["ad size"] unique()[0]. Split ("*", 2) [0], data ["ad size"] unique()[0]. Split ("*", 2) [1]# to extract 140,40

length_lst = []  # list of lengths
for i in range(len(data["ad size"].unique())):
    length_lst.append(data["ad size"].unique()[i].split("*", 2)[0])
width_lst = []  # list of widths
for j in range(len(data["ad size"].unique())):
    width_lst.append(data["ad size"].unique()[j].split("*", 2)[1])
length_lst = list(map(int, length_lst))  # split() yields strings, so convert to int before computing areas
width_lst = list(map(int, width_lst))
2.1.4.2 data table of advertising area
frames = [pd.DataFrame(length_lst, columns=["ad length"]),
          pd.DataFrame(width_lst, columns=["ad width"]),
          pd.DataFrame(np.array(length_lst) * np.array(width_lst), columns=["ad area"])]
df_adv = pd.concat(frames, join='inner', axis=1)
df_adv.sort_values(by="ad area", ascending=False)
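As an aside, pandas can do the split-and-convert in one vectorized step; a sketch equivalent to the loops above (it parses every row rather than only the unique sizes, and assumes every value looks like "140*40"):

sizes = data["ad size"].str.split("*", expand=True).astype(int)  # split "length*width" strings into two integer columns
sizes.columns = ["ad length", "ad width"]
sizes["ad area"] = sizes["ad length"] * sizes["ad width"]
sizes.sort_values(by="ad area", ascending=False)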

2.1.5 advertising selling points
sns.countplot(y="advertising selling point", data=data, color='#1e90ff',
              order=data["advertising selling point"].value_counts().index)
data["advertising selling point"].value_counts() / sum(data["advertising selling point"].value_counts())

3 data conversion
3.1 categorical variable conversion
3.1.1 method 1: get_dummies
x_cate = pd.get_dummies(data.loc[:, "material type":"advertising selling point"])
x_cate  # get_dummies performs the one-hot encoding directly

3.1.2 method 2: OneHotEncoder
OneHotEncoder(sparse=False).fit_transform(data.loc[:, "material type":"advertising selling point"])
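Unlike get_dummies, OneHotEncoder returns a bare NumPy array with no column names. A sketch of recovering them (get_feature_names_out requires sklearn >= 1.0; on sklearn >= 1.2 the sparse argument was renamed sparse_output):

enc = OneHotEncoder(sparse=False)  # use sparse_output=False on newer sklearn
cate_cols = data.loc[:, "material type":"advertising selling point"]
x_cate_df = pd.DataFrame(enc.fit_transform(cate_cols),
                         columns=enc.get_feature_names_out(cate_cols.columns))  # restore readable column names
x_cate_df.head()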

3.2 numerical variable conversion
data.iloc[:, 1:7].describe()  # note the large gap between the max and min of daily average UV and access depth

x_seq = MinMaxScaler().fit_transform(data.iloc[:, 1:7]).round(2)  # fit_transform does fit and transform in one step
x_seq

3.2.1 merged data table after conversion
prepar_df = pd.concat([pd.DataFrame(x_seq,columns=[x for x in data.iloc[:,1:7].columns]),x_cate],join='inner',axis=1)
prepar_df

4 establishment of the K-means model
4.1 method 1: construct a scree (elbow) plot
4.1.1 categorical variables only
#inertias_1 = []
for i in range(1, 45):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(x_cate)
    inertia = kmeans.inertia_
    #inertias_1.append(inertia)
    print('For n_clusters =', i, 'the inertia is:', inertia)


4.1.2 numerical variables
for i in range(1, 200):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(x_seq)
    inertia = kmeans.inertia_
    #inertias_1.append(inertia)
    print('For n_clusters =', i, 'the inertia is:', inertia)


I didn't let it run all the way to nearly 200, ha ha.
4.1.3 pack the numerical and categorical features together and draw the scree plot
inertias_1 = []
for i in range(1, 20):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(prepar_df)
    inertia = kmeans.inertia_
    inertias_1.append(inertia)
    print('For n_clusters =', i, 'the inertia is:', inertia)
figure = plt.figure(1, figsize=(15, 6))
plt.plot(np.arange(1, 20), inertias_1, alpha=0.5, marker='o')
plt.xlabel("K")
plt.ylabel("Inertia")

It is easy to see from the figure that the curve gradually flattens from about k = 3 up to k = 7. Larger values become meaningless simply because the inertia keeps falling toward 0: at inertia = 0, every sample point is its own cluster.
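Reading the elbow off the plot is subjective; as a rough cross-check (my own heuristic, not part of the original analysis), the bend can be located as the largest second difference of the inertia curve:

second_diff = np.diff(inertias_1, n=2)  # second differences measure how sharply the curve bends
elbow_k = int(np.argmax(second_diff)) + 2  # +2 because two diffs are consumed and k starts at 1
print('Heuristic elbow at k =', elbow_k)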
4.2 method 2: find k through the average silhouette coefficient
4.2.1 the general approach of a clustering model
First output the average silhouette score when n_clusters = 3:
model_kmeans = KMeans(n_clusters=3)  # build the clustering model object
labels_tmp = model_kmeans.fit_predict(prepar_df)  # fit_predict labels the converted data with cluster assignments
silhouette_score(prepar_df, labels_tmp)  # average silhouette score of this labeling; prefer the highest score with k as small as possible
【out】:0.45746043641666684
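For reference: for each sample i the silhouette is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other points of its own cluster and b(i) is the mean distance to the points of the nearest other cluster. silhouette_score averages s(i) over all samples, so values near 1 indicate tight, well-separated clusters and values near 0 indicate overlapping ones.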
4.2.2 loop over candidate k values
score_list = []  # stores the silhouette score for each k
for k in range(3, 8):  # traverse k from 3 to 7
    model_kmeans = KMeans(n_clusters=k)  # build the clustering model object
    labels = model_kmeans.fit_predict(prepar_df)  # fit the clustering model
    silhouetteScore = silhouette_score(prepar_df, labels)  # average silhouette score for this k
    score_list.append([k, silhouetteScore])  # append takes a single argument, so pack k and the score into one list
print(score_list)
【out】:[[3, 0.45746043641666684], [4, 0.5019703686844438], [5, 0.4798826406042495], [6, 0.4773992930791803], [7, 0.5005669314822621]]
From this, k = 4 gives the best clustering result.
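A quick sketch of plotting the scores collected above, to make the peak at k = 4 visible at a glance:

ks, scores = zip(*score_list)  # unpack the [k, score] pairs
plt.figure(figsize=(6, 4))
plt.plot(ks, scores, marker='o')
plt.xlabel("K")
plt.ylabel("average silhouette score")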
4.2.3 refit the model to obtain the cluster labels
This time k = 4 is set manually.
model_kmeans = KMeans(n_clusters=4, random_state=117)  # build the clustering model object
labels = model_kmeans.fit_predict(prepar_df)
pd.DataFrame(labels, columns=['clusters'])

4.2.4 important turning point: merge the original data with the obtained labels
In the previous steps the converted data lived in prepar_df; now that clustering is complete and the clusters labels have been produced, prepar_df is no longer needed: the labels are merged back onto the original data instead.
final_df = pd.concat((data,pd.DataFrame(labels, columns=['clusters'])),axis=1)
final_df

5 summarizing the model results
When clusters = 0
final_ df[final_df["clusters"]==0]. iloc[:,1:7]. Descriptive statistics when describe() #clusters = 0. The overall data results shall be obtained through the mean value in the follow-up.

final_ df[final_df["clusters"]==0]. iloc[:,7:-1]. Describe() # here, the top can view the highest statistics

5.1 take the mean of the numerical features and combine it with the categorical features
cluster_features = []
for i in range(4):  # loop over the 4 clusters
    label_data = final_df[final_df['clusters'] == i]  # data of one specific cluster
    part1_data = label_data.iloc[:, 1:7]  # numerical features
    part1_desc = part1_data.describe()  # descriptive statistics of the numerical features
    merge_data1 = part1_desc.iloc[1, :]  # mean row of the numerical features
    part2_data = label_data.iloc[:, 7:-1]  # categorical (string) features
    part2_desc = part2_data.describe(include='all')  # descriptive statistics of the categorical features
    merge_data2 = part2_desc.iloc[2, :]  # most frequent value of each categorical feature
    merge_line = pd.concat((merge_data1, merge_data2), axis=0)  # stitch the numerical and categorical summaries together
    cluster_features.append(merge_line)  # collect each cluster's summary
cluster_pd = pd.DataFrame(cluster_features).T
cluster_pd

5.2 try changing to the median
cluster_features_ = []
for i in range(4):  # loop over the 4 clusters
    label_data = final_df[final_df['clusters'] == i]
    part1_data = label_data.iloc[:, 1:7]
    part1_desc = part1_data.describe()
    merge_data1_ = label_data.iloc[:, 1:7].median()  # median of the numerical features
    part2_data = label_data.iloc[:, 7:-1]
    part2_desc = part2_data.describe(include='all')
    merge_data2 = part2_desc.iloc[2, :]
    merge_line = pd.concat((merge_data1_, merge_data2), axis=0)
    cluster_features_.append(merge_line)
cluster_pd_ = pd.DataFrame(cluster_features_).T
cluster_pd_

6 explore with a radar chart
fig = plt.figure(figsize=(6, 6))  # create the canvas
ax = fig.add_subplot(111, polar=True)  # add a subplot; note the polar parameter
labels = np.array(merge_data1.index)  # labels of the data to display
cor_list = ['b', 'g', 'r', 'c', 'm', 'y', 'k', 'w']  # colors for the different clusters
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)  # angle of each sector
angles = np.concatenate((angles, [angles[0]]))

6.1 data preprocessing
num_sets = cluster_pd.iloc[:6, :].T.astype(np.float64)  # the [mean] data to display
num_sets_max_min = MinMaxScaler().fit_transform(num_sets)
num_sets_max_min

data_tmp = num_sets_max_min[0, :]  # data for one cluster
data_con = np.concatenate((data_tmp, [data_tmp[0]]))  # repeat the first value at the end so the polygon closes
ax.plot(angles, data_con, 'o-', c=cor_list[0], label=0)  # draw the line
ax.set_title("cluster summary diagram", fontproperties="SimHei", fontweight="black", fontsize="x-large")
fig

6.2 display using mean value
fig = plt.figure(figsize=(6, 6))  # create the canvas
ax = fig.add_subplot(111, polar=True)  # add a subplot; note the polar parameter
labels = np.array(merge_data1.index)  # labels of the data to display
cor_list = ['b', 'g', 'r', 'c', 'm', 'y', 'k', 'w']  # colors for the different clusters
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)  # angle of each sector
angles = np.concatenate((angles, [angles[0]]))
for i in range(4):
    data_tmp = num_sets_max_min[i, :]  # data for cluster i
    data_con = np.concatenate((data_tmp, [data_tmp[0]]))  # close the polygon
    ax.plot(angles, data_con, 'o-', c=cor_list[i], label=i)  # draw the lines
ax.set_thetagrids(angles[:-1] * 180 / np.pi, labels, fontproperties="SimHei")  # label the polar axes (drop the duplicated closing angle)
ax.set_title("cluster summary diagram", fontproperties="SimHei", fontweight="black", fontsize="x-large")
ax.set_rlim(-0.2, 1.2)  # set the radial axis range
plt.legend(loc=0)  # set legend position
cluster_pd

6.3 change to the median and check the relationship with clusters
num_sets = cluster_pd_.iloc[:6, :].T.astype(np.float64)  # the [median] data to display
num_sets_max_min = MinMaxScaler().fit_transform(num_sets)
fig = plt.figure(figsize=(6, 6))  # create the canvas
ax = fig.add_subplot(111, polar=True)  # add a subplot; note the polar parameter
labels = np.array(merge_data1.index)  # labels of the data to display
cor_list = ['b', 'g', 'r', 'c', 'm', 'y', 'k', 'w']  # colors for the different clusters
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)  # angle of each sector
angles = np.concatenate((angles, [angles[0]]))
for i in range(4):
    data_tmp = num_sets_max_min[i, :]  # data for cluster i
    data_con = np.concatenate((data_tmp, [data_tmp[0]]))  # close the polygon
    ax.plot(angles, data_con, 'o-', c=cor_list[i], label=i, alpha=0.5)  # draw the lines
ax.set_thetagrids(angles[:-1] * 180 / np.pi, labels, fontproperties="SimHei")  # label the polar axes
ax.set_title("cluster summary diagram", fontproperties="SimHei", fontweight="black", fontsize="x-large")
ax.set_rlim(-0.2, 1.2)  # set the radial axis range
plt.legend(loc=0)
cluster_pd_

The median and mean displays differ considerably. Since the goal of this study is the overall delivery effect of the ads, the median is chosen as the visualization standard.
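A toy illustration of why the two summaries diverge (made-up numbers, not from the dataset): the mean chases extreme ads while the median ignores them.

s = pd.Series([0.01, 0.02, 0.02, 0.03, 5.0])  # one extreme value skews the mean
print(s.mean(), s.median())  # 1.016 vs 0.02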
7 visually explore the indicators corresponding to each category
final_df.columns[1:]
【out】:Index(['daily average UV', 'average registration rate', 'average search volume', 'access depth', 'order conversion rate', 'total delivery time', 'material type', 'advertisement type', 'cooperation method', 'ad size', 'advertising selling point', 'clusters'], dtype='object')
col=final_df.columns[1:]
7.1 when clusters = 0
f=plt.figure(figsize=(18,7))
f.suptitle('clusters = 0',fontweight="black",fontsize="x-large")
f.subplots_adjust(hspace=1)  # increase the spacing between subplots
f.set_size_inches(18, 7)
f.add_subplot(6,2,1)
sns.histplot(final_df[final_df['clusters']==0],x=col[0], kde=True)
f.add_subplot(6,2,2)
sns.histplot(final_df[final_df['clusters']==0],x=col[1], kde=True)
f.add_subplot(6,2,3)
sns.histplot(final_df[final_df['clusters']==0],x=col[2], kde=True)
f.add_subplot(6,2,4)
sns.histplot(final_df[final_df['clusters']==0],x=col[3], kde=True)
f.add_subplot(6,2,5)
sns.histplot(final_df[final_df['clusters']==0],x=col[4], kde=True)
f.add_subplot(6,2,6)
sns.histplot(final_df[final_df['clusters']==0],x=col[5], kde=True)
f.add_subplot(6,2,7)
sns.histplot(final_df[final_df['clusters']==0],x=col[6])
f.add_subplot(6,2,8)
sns.histplot(final_df[final_df['clusters']==0],x=col[7])
f.add_subplot(6,2,9)
sns.histplot(final_df[final_df['clusters']==0],x=col[8])
f.add_subplot(6,2,10)
sns.histplot(final_df[final_df['clusters']==0],x=col[9])
f.add_subplot(6,2,11)
sns.histplot(final_df[final_df['clusters']==0],x=col[10])

7.2 when clusters = 1
f=plt.figure(figsize=(18,7))
f.suptitle('clusters = 1',fontweight="black",fontsize="x-large")
f.subplots_adjust(hspace=1)  # increase the spacing between subplots
f.set_size_inches(18, 7)
f.add_subplot(6,2,1)
sns.histplot(final_df[final_df['clusters']==1],x=col[0], kde=True)
f.add_subplot(6,2,2)
sns.histplot(final_df[final_df['clusters']==1],x=col[1], kde=True)
f.add_subplot(6,2,3)
sns.histplot(final_df[final_df['clusters']==1],x=col[2], kde=True)
f.add_subplot(6,2,4)
sns.histplot(final_df[final_df['clusters']==1],x=col[3], kde=True)
f.add_subplot(6,2,5)
sns.histplot(final_df[final_df['clusters']==1],x=col[4], kde=True)
f.add_subplot(6,2,6)
sns.histplot(final_df[final_df['clusters']==1],x=col[5], kde=True)
f.add_subplot(6,2,7)
sns.histplot(final_df[final_df['clusters']==1],x=col[6])
f.add_subplot(6,2,8)
sns.histplot(final_df[final_df['clusters']==1],x=col[7])
f.add_subplot(6,2,9)
sns.histplot(final_df[final_df['clusters']==1],x=col[8])
f.add_subplot(6,2,10)
sns.histplot(final_df[final_df['clusters']==1],x=col[9])
f.add_subplot(6,2,11)
sns.histplot(final_df[final_df['clusters']==1],x=col[10])

7.3 when clusters = 2
f=plt.figure(figsize=(18,7))
f.suptitle('clusters = 2',fontweight="black",fontsize="x-large")
f.subplots_adjust(hspace=1)  # increase the spacing between subplots
f.set_size_inches(18, 7)
f.add_subplot(6,2,1)
sns.histplot(final_df[final_df['clusters']==2],x=col[0], kde=True)
f.add_subplot(6,2,2)
sns.histplot(final_df[final_df['clusters']==2],x=col[1], kde=True)
f.add_subplot(6,2,3)
sns.histplot(final_df[final_df['clusters']==2],x=col[2], kde=True)
f.add_subplot(6,2,4)
sns.histplot(final_df[final_df['clusters']==2],x=col[3], kde=True)
f.add_subplot(6,2,5)
sns.histplot(final_df[final_df['clusters']==2],x=col[4], kde=True)
f.add_subplot(6,2,6)
sns.histplot(final_df[final_df['clusters']==2],x=col[5], kde=True)
f.add_subplot(6,2,7)
sns.histplot(final_df[final_df['clusters']==2],x=col[6])
f.add_subplot(6,2,8)
sns.histplot(final_df[final_df['clusters']==2],x=col[7])
f.add_subplot(6,2,9)
sns.histplot(final_df[final_df['clusters']==2],x=col[8])
f.add_subplot(6,2,10)
sns.histplot(final_df[final_df['clusters']==2],x=col[9])
f.add_subplot(6,2,11)
sns.histplot(final_df[final_df['clusters']==2],x=col[10])

7.4 when clusters = 3
f=plt.figure(figsize=(18,7))
f.suptitle('clusters = 3',fontweight="black",fontsize="x-large")
f.subplots_adjust(hspace=1)  # increase the spacing between subplots
f.set_size_inches(18, 7)
f.add_subplot(6,2,1)
sns.histplot(final_df[final_df['clusters']==3],x=col[0], kde=True)
f.add_subplot(6,2,2)
sns.histplot(final_df[final_df['clusters']==3],x=col[1], kde=True)
f.add_subplot(6,2,3)
sns.histplot(final_df[final_df['clusters']==3],x=col[2], kde=True)
f.add_subplot(6,2,4)
sns.histplot(final_df[final_df['clusters']==3],x=col[3], kde=True)
f.add_subplot(6,2,5)
sns.histplot(final_df[final_df['clusters']==3],x=col[4], kde=True)
f.add_subplot(6,2,6)
sns.histplot(final_df[final_df['clusters']==3],x=col[5], kde=True)
f.add_subplot(6,2,7)
sns.histplot(final_df[final_df['clusters']==3],x=col[6])
f.add_subplot(6,2,8)
sns.histplot(final_df[final_df['clusters']==3],x=col[7])
f.add_subplot(6,2,9)
sns.histplot(final_df[final_df['clusters']==3],x=col[8])
f.add_subplot(6,2,10)
sns.histplot(final_df[final_df['clusters']==3],x=col[9])
f.add_subplot(6,2,11)
sns.histplot(final_df[final_df['clusters']==3],x=col[10])
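The four blocks above differ only in the cluster id; a sketch of collapsing them into one loop (same plots, far less duplication):

for c in range(4):
    f = plt.figure(figsize=(18, 7))
    f.suptitle(f'clusters = {c}', fontweight="black", fontsize="x-large")
    f.subplots_adjust(hspace=1)  # spacing between subplots
    sub = final_df[final_df['clusters'] == c]
    for j in range(11):
        f.add_subplot(6, 2, j + 1)
        sns.histplot(sub, x=col[j], kde=(j < 6))  # KDE only on the 6 numerical columns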

Score the clusters by their rankings on each radar-chart indicator, with a maximum of 4 points and a minimum of 1: the cluster ranked first on an indicator scores 4.
Score = sum of the revenue-type indicator rankings / the cost-type indicator ranking. In other words, the denominator is total delivery time and the numerator covers all the other numerical features.
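A sketch of that scoring rule in pandas, under my reading of it: rank each numerical feature across the four clusters (best value = rank 4), then sum the revenue-type ranks and divide by the cost-type rank. The column name "total delivery time" is assumed from the column list in section 7, and num_sets is the cluster-by-feature table built in section 6.

ranks = num_sets.rank(axis=0, method='min')  # rank 1..4 per feature, 4 = best value
cost_rank = ranks["total delivery time"]  # cost-type indicator (assumed column name)
revenue_rank = ranks.drop(columns=["total delivery time"]).sum(axis=1)  # all other numerical features
score = revenue_rank / cost_rank
print(score.sort_values(ascending=False))  # highest score = best overall cluster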

8 conclusion and summary
8.1 project conclusion
1. Category 3 has the highest score and the best overall advertising effect. Its daily average UV is not high, most likely because of its combination (full-reduction selling point, jpg material, 308*388 ad size); 308*388 is the only "tree type" ad among all the ads.
2. Categories 1 and 0 score similarly and are worth comparing. Category 1 may represent more expensive goods: compared with category 0, user-type indicators such as access depth and average search volume are higher, but seller-type indicators such as conversion rate and registration rate are lower, so users are still watching and waiting. Category 0 is the opposite, with high seller-type indicators and low user-type indicators, suggesting distinctive mass-market product advertising.
3. Category 2 has the lowest score: despite the longest delivery time, indicators such as access depth and daily average UV stay low, and its ad size (600*90) is a small-area format. A reasonable explanation is that the low delivery cost encouraged a long delivery time, yet the final results are unsatisfactory. In future campaigns it is suggested to drop this category outright to avoid wasting time and energy.

8.2 theoretical summary and Prospect
1. Advantages of clustering: this project shows that clustering can bring order to messy data. The benefit is obvious when seaborn displays the indicators for each cluster; the same structure could not be seen by drawing histograms before modeling.
2. The clustering produced four categories with imbalanced sample sizes. As a follow-up, a multi-class classification model could be trained to see which indicators drive the clustering. For example, the ad area could serve as a new feature, and the ad shape could likewise be included in the classification model as a new feature (see the sketch after this list).
3. The visualization ability still needs to be strengthened.
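A sketch of deriving the two suggested features, reusing the "length*width" parsing from section 2.1.4 (the new column names "ad area" and "ad shape" are my own):

wh = final_df["ad size"].str.split("*", expand=True).astype(int)  # two integer columns per ad
final_df["ad area"] = wh[0] * wh[1]  # area as a new numerical feature
final_df["ad shape"] = np.where(wh[0] >= wh[1], "landscape", "portrait")  # shape as a new categorical feature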
