K-means clustering of advertising effect

Time:2022-5-4

Project background

Explore classic data sets in depth.

1 data set review

import matplotlib.pyplot as plt  # graphics library
import numpy as np
import pandas as pd
from sklearn.metrics import silhouette_score  # silhouette coefficient metric
from sklearn.cluster import KMeans  # K-means module
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder  # data preprocessing library
from mlxtend.preprocessing import one_hot
import seaborn as sns
data = pd.read_table(r'e:\baidunetdiskdownload\statistics\Python data analysis and data operation\Chapter7\ad_performance.txt')
data
data.describe()  # The features differ greatly in scale: the mean daily average UV is about 540, while the registration rate and conversion rate are about 0.0, a difference of nearly a thousand times. Apply min-max scaling later.
data.isnull().any(axis=0)  # check each column for missing values: average residence time
pd.DataFrame(data.corr(numeric_only=True)["access depth"].sort_values(ascending=False))  # numeric_only skips the string columns

Average residence time is strongly correlated with access depth, so it needs to be deleted. Our data analysis task is clustering, and clustering algorithms are sensitive to collinear features.
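
The deletion step itself is not shown in the text, so here is a minimal sketch of one way to find and drop a collinear column. The table `toy`, its values, and the 0.9 threshold are assumptions made for illustration; on the article's `data` the analogous call would presumably be `data.drop(columns=["average residence time"])`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical values, not the article's data set).
rng = np.random.default_rng(0)
depth = rng.normal(2.0, 0.5, size=200)
toy = pd.DataFrame({
    "access depth": depth,
    # Built to be almost a linear function of access depth.
    "average residence time": 100 * depth + rng.normal(0, 5, size=200),
    "daily average UV": rng.normal(500, 100, size=200),
})

# Correlation of every column with access depth, strongest first.
corr = toy.corr()["access depth"].sort_values(ascending=False)

# Drop every other column whose absolute correlation exceeds the assumed threshold.
threshold = 0.9
to_drop = [c for c in corr.index if c != "access depth" and abs(corr[c]) > threshold]
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))
```

Keeping only one column of a strongly correlated pair stops the same information from being counted twice in the distance that K-means minimizes.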

2 data visualization exploration

import seaborn as sns
sns.set_style('white', {'font.sans-serif': ['simhei', 'arial']})  # set a white background and a font that can display Chinese

2.1 exploration of categorical variables

2.1.1 advertising type

sns.countplot(y="advertising type", data=data, color='#1e90ff',
              order=data["advertising type"].value_counts().index)
data["advertising type"].value_counts() / sum(data["advertising type"].value_counts())  # share of each category

2.1.2 material type

sns.countplot(y="material type", data=data, color='#1e90ff',
              order=data["material type"].value_counts().index)
data["material type"].value_counts() / sum(data["material type"].value_counts())  # share of each category

2.1.3 cooperation mode

sns.countplot(y="cooperation method", data=data, color='#1e90ff',
              order=data["cooperation method"].value_counts().index)
data["cooperation method"].value_counts() / sum(data["cooperation method"].value_counts())  # share of each category

2.1.4 advertising size

sns.countplot(y="advertising size", data=data, color='#1e90ff',
              order=data["advertising size"].value_counts().index)
data["advertising size"].value_counts() / sum(data["advertising size"].value_counts())  # share of each category

2.1.4.1 convert advertising size into advertising area

data["advertising size"].unique()[0].split("*", 2)[0], data["advertising size"].unique()[0].split("*", 2)[1]  # extracts 140 and 40
length_lst = []  # length list
for i in range(len(data["advertising size"].unique())):
    length_lst.append(data["advertising size"].unique()[i].split("*", 2)[0])
width_lst = []  # width list
for j in range(len(data["advertising size"].unique())):
    width_lst.append(data["advertising size"].unique()[j].split("*", 2)[1])
length_lst = list(map(int, length_lst))  # the extracted values are strings and must be converted to int before computing the advertising area
width_lst = list(map(int, width_lst))

2.1.4.2 data table of advertising area

frames = [pd.DataFrame(length_lst, columns=["advertising size length"]),
          pd.DataFrame(width_lst, columns=["advertising size width"]),
          pd.DataFrame(np.array(length_lst) * np.array(width_lst), columns=["advertising size area"])]
df_adv = pd.concat(frames, join='inner', axis=1)
df_adv.sort_values(by="advertising size area", ascending=False)
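
The length, width and area table can also be built without explicit loops. The sketch below uses pandas string methods on a small sample series; the sizes 140*40, 308*388 and 600*90 appear in this article, while the series itself and the name `df_area` are illustrative assumptions.

```python
import pandas as pd

# Sample sizes in the "length*width" string format used by the data set.
sizes = pd.Series(["140*40", "308*388", "600*90", "140*40"], name="advertising size")

# Split each distinct size on "*" into two integer columns.
parts = sizes.drop_duplicates().str.split("*", expand=True).astype(int)
parts.columns = ["advertising size length", "advertising size width"]

# Area is length times width, computed column-wise.
parts["advertising size area"] = parts["advertising size length"] * parts["advertising size width"]

df_area = parts.sort_values(by="advertising size area", ascending=False).reset_index(drop=True)
print(df_area)
```

This yields the same three columns as the loop version and keeps length and width aligned row by row.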

2.1.5 advertising selling points

sns.countplot(y="advertising selling points", data=data, color='#1e90ff',
              order=data["advertising selling points"].value_counts().index)
data["advertising selling points"].value_counts() / sum(data["advertising selling points"].value_counts())  # share of each category

3 data conversion

3.1 classification variable conversion

3.1.1 method 1: get_dummies

x_cate = pd.get_dummies(data.loc[:, "material type":"advertising selling points"])
x_cate  # get_dummies performs one-hot encoding directly

3.1.2 method 2: OneHotEncoder

OneHotEncoder(sparse=False).fit_transform(data.loc[:, "material type":"advertising selling points"])  # in scikit-learn 1.2 and later this parameter is named sparse_output
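
To make the effect of one-hot encoding concrete, here is a small sketch on a toy table. The category values ("jpg", "swf", "cpc", "roi") are hypothetical placeholders, not values taken from the data set.

```python
import pandas as pd

# Toy categorical table (hypothetical category values).
toy = pd.DataFrame({
    "material type": ["jpg", "swf", "jpg"],
    "cooperation method": ["cpc", "roi", "cpc"],
})

# Each distinct value becomes its own indicator column named "<column>_<value>".
encoded = pd.get_dummies(toy)

print(list(encoded.columns))
print(encoded.astype(int).values.tolist())
```

Every row has exactly one 1 per original column, so all categories sit at the same distance from each other, which is what a distance-based method such as K-means needs.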

3.2 numerical variable conversion

data.iloc[:, 1:7].describe()  # The maximum and minimum values of daily average UV and access depth are far apart. Keep this in mind.
x_seq = MinMaxScaler().fit_transform(data.iloc[:, 1:7]).round(2)  # fit_transform fits and transforms in one step; otherwise call fit first and then transform
x_seq
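
MinMaxScaler maps each column to the range [0, 1] with (x - min) / (max - min). The sketch below checks this formula against the library on a tiny hypothetical matrix whose two columns differ in scale by several orders of magnitude.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values: a UV-like column and a rate-like column.
X = np.array([[100.0, 0.001],
              [540.0, 0.002],
              [980.0, 0.005]])

# Column-wise min-max scaling written out by hand.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The same transformation through scikit-learn.
scaled = MinMaxScaler().fit_transform(X)

print(manual.round(2))
```

After scaling, both columns span 0 to 1, so neither dominates the Euclidean distance used by K-means.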

3.2.1 merged data table after conversion

prepar_df = pd.concat([pd.DataFrame(x_seq,columns=[x for x in data.iloc[:,1:7].columns]),x_cate],join='inner',axis=1)
prepar_df

4. Establishment of K-means model

4.1 method 1: build a scree plot

4.1.1 categorical variables

#inertias_1 = []
for i in range(1,45):
    kmeans = KMeans(n_clusters=i, init='k-means++',max_iter=300, n_init=10,random_state=0)
    kmeans.fit(x_cate)
    inertia = kmeans.inertia_
    #inertias_1.append(inertia)
    print('For n_cluster = ',i,'The inertia is:',inertia)

4.1.2 numerical variables

for i in range(1,200):
    kmeans = KMeans(n_clusters=i, init='k-means++',max_iter=300, n_init=10,random_state=0)
    kmeans.fit(x_seq)
    inertia = kmeans.inertia_
    #inertias_1.append(inertia)
    print('For n_cluster = ',i,'The inertia is:',inertia)

The run over nearly 200 values of K did not finish, ha ha.

4.1.3 combine the numerical and categorical features to draw the scree plot

inertias_1 = []
for i in range(1,20):
    kmeans = KMeans(n_clusters=i, init='k-means++',max_iter=300, n_init=10,random_state=0)
    kmeans.fit(prepar_df)
    inertia = kmeans.inertia_
    inertias_1.append(inertia)
    print('For n_cluster = ',i,'The inertia is:',inertia)
figure = plt.figure(1, figsize=(15,6))
plt.plot(np.arange(1,20), inertias_1, alpha=0.5, marker='o')
plt.xlabel("K")
plt.ylabel("Inertia ")

The figure shows that the curve gradually flattens from about K = 3 to K = 7. Larger values of K are not meaningful, because inertia keeps falling as K grows: when inertia = 0, every sample point is treated as its own category.
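
The behaviour described above can be reproduced on synthetic data. This is a minimal sketch, assuming three well-separated Gaussian blobs (the centres and spread are made up for illustration): inertia drops sharply up to the true number of clusters, and it reaches 0 once every point is its own centre.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: 3 blobs of 30 points each (hypothetical centres and spread).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(0, 0.3, size=(30, 2)) for c in centers])

# Inertia (within-cluster sum of squared distances) for K = 1..6.
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Extreme case: use every sample as its own initial centre, so K equals the sample count.
km_all = KMeans(n_clusters=len(X), init=X, n_init=1).fit(X)

print([round(v, 1) for v in inertias])
print(km_all.inertia_)
```

The elbow is the K after which the drop becomes small; a low inertia on its own says nothing about whether the clusters are useful.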

4.2 method 2: find the K value through the average silhouette coefficient

4.2.1 general usage of the clustering model

Output the average silhouette coefficient when n_clusters = 3

model_kmeans = KMeans(n_clusters=3)  # create the clustering model object
labels_tmp = model_kmeans.fit_predict(prepar_df)  # fit_predict assigns a cluster label to each row of the converted data
silhouette_score(prepar_df, labels_tmp)  # average silhouette coefficient, computed from the converted data and the labels. With K kept as small as possible, the K with the largest value is the best choice.

【out】:0.45746043641666684

4.2.2 build a for loop

score_list = []  # stores the silhouette coefficient of the model for each K
for k in range(3, 8):  # iterate K from 3 to 7
    model_kmeans = KMeans(n_clusters=k)  # create the clustering model object
    labels = model_kmeans.fit_predict(prepar_df)  # train the clustering model
    silhouetteScore = silhouette_score(prepar_df, labels)  # average silhouette coefficient for this K
    score_list.append([k, silhouetteScore])  # append takes a single value, so K and the score are passed as one list
print(score_list)

【out】:[[3, 0.45746043641666684], [4, 0.5019703686844438], [5, 0.4798826406042495], [6, 0.4773992930791803], [7, 0.5005669314822621]]
From this, K = 4 gives the best clustering result.
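
The same selection loop can be tried on data where the answer is known. The sketch below assumes four well-separated synthetic blobs (made-up centres and spread) and fixes random_state so the scores are reproducible between runs, which the loop above does not do.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: 4 blobs of 40 points each (hypothetical centres and spread).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0], [6.0, 0.0]])
X = np.vstack([c + rng.normal(0, 0.4, size=(40, 2)) for c in centers])

# Average silhouette coefficient for each candidate K.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the K with the highest average silhouette coefficient.
best_k = max(scores, key=scores.get)
print(best_k)
```

The silhouette coefficient lies between -1 and 1, and values near 1 mean points sit much closer to their own cluster than to the nearest other cluster.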

4.2.3 refit the model to obtain the cluster labels

Manually enter the value of k = 4 this time

model_kmeans = KMeans(n_clusters=4, random_state=117)  # create the clustering model object
labels = model_kmeans.fit_predict(prepar_df)
pd.DataFrame(labels, columns=['clusters'])

4.2.4 key turning point: merge the original data with the obtained labels

The converted table prepar_df from the previous steps will no longer be used from here on, because clustering is complete and every row now has its clusters label.

final_df = pd.concat((data,pd.DataFrame(labels, columns=['clusters'])),axis=1)
final_df

5 model summary processing

When clusters = 0

final_df[final_df["clusters"] == 0].iloc[:, 1:7].describe()  # descriptive statistics when clusters = 0; the mean is used later to summarize each cluster
final_df[final_df["clusters"] == 0].iloc[:, 7:-1].describe()  # here the "top" row shows the most frequent value

5.1 combine the mean of the numerical features with the categorical features

cluster_features = []
for i in range(4):  # loop over the 4 clusters
    label_data = final_df[final_df['clusters'] == i]  # data of one cluster

    part1_data = label_data.iloc[:, 1:7]  # numerical features
    part1_desc = part1_data.describe()  # descriptive statistics of the numerical features
    merge_data1 = part1_desc.iloc[1, :]  # mean of the numerical features

    part2_data = label_data.iloc[:, 7:-1]  # string features
    part2_desc = part2_data.describe(include='all')  # descriptive statistics of the string features
    merge_data2 = part2_desc.iloc[2, :]  # most frequent value of each string feature

    merge_line = pd.concat((merge_data1, merge_data2), axis=0)  # join the typical numerical and string features into one row
    cluster_features.append(merge_line)  # append the features of this cluster to the list
cluster_pd = pd.DataFrame(cluster_features).T
cluster_pd
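
The per-cluster summary (mean of numerical features plus most frequent value of string features) can also be written with groupby. This is a sketch on a toy table with hypothetical values; it is an alternative formulation, not the code used for the article's results.

```python
import pandas as pd

# Toy labelled table (hypothetical values).
toy = pd.DataFrame({
    "daily average UV": [100.0, 300.0, 50.0, 70.0, 90.0],
    "material type": ["jpg", "jpg", "swf", "swf", "gif"],
    "clusters": [0, 0, 1, 1, 1],
})

# Mean of the numerical feature per cluster.
num_mean = toy.groupby("clusters")[["daily average UV"]].mean()

# Most frequent value of the string feature per cluster (first one if tied).
cat_mode = toy.groupby("clusters")[["material type"]].agg(lambda s: s.mode().iloc[0])

# Features as rows, clusters as columns, like cluster_pd.
summary = pd.concat([num_mean, cat_mode], axis=1).T
print(summary)
```

Replacing `.mean()` with `.median()` gives the median variant in the same way.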

5.2 try changing to the median

cluster_features_ = [] 
for i in range(4):  # loop over the 4 clusters
    label_data = final_df[final_df['clusters'] == i]  

    part1_data = label_data.iloc[:, 1:7]  
    part1_desc = part1_data.describe()  
    merge_data1_ = label_data.iloc[:, 1:7].median()  # median of the numerical features

    part2_data = label_data.iloc[:, 7:-1]  
    part2_desc = part2_data.describe(include='all')  
    merge_data2 = part2_desc.iloc[2, :]  

    merge_line = pd.concat((merge_data1_, merge_data2), axis=0)
    cluster_features_.append(merge_line)
cluster_pd_ = pd.DataFrame(cluster_features_).T
cluster_pd_

6 explore with a radar chart

fig = plt.figure(figsize=(6, 6))  # create the canvas
ax = fig.add_subplot(111, polar=True)  # add a subplot; note the polar parameter
labels = np.array(merge_data1.index)  # labels to display
cor_list = ['b', 'g', 'r', 'c', 'm', 'y', 'k', 'w']  # colors for the different clusters
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)  # angle of each axis
angles = np.concatenate((angles, [angles[0]]))  # repeat the first angle so the outline closes
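
The angle arithmetic behind the radar chart can be checked without drawing anything. In this sketch the six labels are the numerical column names listed later in the article, and `values` is a hypothetical row of scaled data for one cluster.

```python
import numpy as np

labels = ["daily average UV", "average registration rate", "average search volume",
          "access depth", "order conversion rate", "total delivery time"]
values = np.array([0.2, 1.0, 0.5, 0.0, 0.7, 0.3])  # hypothetical scaled values for one cluster

# One evenly spaced angle per label; endpoint=False avoids duplicating 0 and 2*pi.
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)

# Repeat the first point at the end so the plotted outline closes.
angles_closed = np.concatenate((angles, [angles[0]]))
values_closed = np.concatenate((values, [values[0]]))

# Tick positions in degrees: one per label, without the repeated closing point.
tick_degrees = np.degrees(angles)
print(tick_degrees)
```

The closed arrays have one more element than the label list, which is why the tick positions are taken from the unclosed angles.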

6.1 data preprocessing

num_sets = cluster_pd.iloc[:6, :].T.astype(np.float64)  # get the [mean] data to display
num_sets_max_min = MinMaxScaler().fit_transform(num_sets)
num_sets_max_min
data_tmp = num_sets_max_min[0, :]  # data of the corresponding cluster
data_con = np.concatenate((data_tmp, [data_tmp[0]]))  # repeat the first value at the end so the outline closes
ax.plot(angles, data_con, 'o-', c=cor_list[0], label=0)  # draw the line
ax.set_title("cluster summary diagram", fontproperties="simhei", fontweight="black", fontsize="x-large")
fig

6.2 display using mean value

fig = plt.figure(figsize=(6, 6))  # create the canvas
ax = fig.add_subplot(111, polar=True)  # add a subplot; note the polar parameter
labels = np.array(merge_data1.index)  # labels to display
cor_list = ['b', 'g', 'r', 'c', 'm', 'y', 'k', 'w']  # colors for the different clusters
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)  # angle of each axis
angles = np.concatenate((angles, [angles[0]]))
for i in range(4):
    data_tmp = num_sets_max_min[i, :]  # data of the corresponding cluster
    data_con = np.concatenate((data_tmp, [data_tmp[0]]))  # repeat the first value at the end so the outline closes
    ax.plot(angles, data_con, 'o-', c=cor_list[i], label=i)  # draw the line
ax.set_thetagrids(angles[:-1] * 180 / np.pi, labels, fontproperties="simhei")  # set the polar axis ticks; drop the repeated closing angle so ticks and labels match in number
ax.set_title("cluster summary diagram", fontproperties="simhei", fontweight="black", fontsize="x-large")
ax.set_rlim(-0.2, 1.2)  # set the radial axis range
plt.legend(loc=0)  # set the legend position
cluster_pd

6.3 change to the median and check the relationship with clusters

num_sets = cluster_pd_.iloc[:6, :].T.astype(np.float64)  # get the [median] data to display
num_sets_max_min = MinMaxScaler().fit_transform(num_sets)
fig = plt.figure(figsize=(6, 6))  # create the canvas
ax = fig.add_subplot(111, polar=True)  # add a subplot; note the polar parameter
labels = np.array(merge_data1.index)  # labels to display
cor_list = ['b', 'g', 'r', 'c', 'm', 'y', 'k', 'w']  # colors for the different clusters
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)  # angle of each axis
angles = np.concatenate((angles, [angles[0]]))
for i in range(4):
    data_tmp = num_sets_max_min[i, :]  # data of the corresponding cluster
    data_con = np.concatenate((data_tmp, [data_tmp[0]]))  # repeat the first value at the end so the outline closes
    ax.plot(angles, data_con, 'o-', c=cor_list[i], label=i, alpha=0.5)  # draw the line
ax.set_thetagrids(angles[:-1] * 180 / np.pi, labels, fontproperties="simhei")  # set the polar axis ticks; drop the repeated closing angle so ticks and labels match in number
ax.set_title("cluster summary diagram", fontproperties="simhei", fontweight="black", fontsize="x-large")
ax.set_rlim(-0.2, 1.2)  # set the radial axis range
plt.legend(loc=0)
cluster_pd_

The median and mean plots differ considerably. Because the purpose of this study is to explore the overall delivery effect of advertising, the median is chosen as the basis for visualization.
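
One common reason for mean and median to diverge is a few extreme values inside a cluster. The numbers below are hypothetical and only illustrate that effect; they are not taken from the data set.

```python
import numpy as np

# Hypothetical daily average UV values within one cluster, with one extreme channel.
uv = np.array([100.0, 110.0, 120.0, 130.0, 5000.0])

mean_uv = uv.mean()        # pulled far upward by the single outlier
median_uv = np.median(uv)  # stays at the typical value

print(mean_uv, median_uv)
```

With unbalanced clusters, a summary based on the median describes the typical channel in a cluster rather than its most extreme one.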

7 visually explore the indicators corresponding to each category

final_df.columns[1:]

Index(['daily average UV', 'average registration rate', 'average search volume', 'access depth', 'order conversion rate', 'total delivery time', 'material type', 'advertising type',
       'cooperation method', 'advertising size', 'advertising selling points', 'clusters'],
      dtype='object')

col=final_df.columns[1:]

7.1 when clusters = 0

f=plt.figure(figsize=(18,7))
f.suptitle('clusters = 0',fontweight="black",fontsize="x-large")
f.subplots_adjust(hspace=1)  # increase the spacing between subplots
f.set_size_inches(18, 7)
f.add_subplot(6,2,1)
sns.histplot(final_df[final_df['clusters']==0],x=col[0], kde=True)
f.add_subplot(6,2,2)
sns.histplot(final_df[final_df['clusters']==0],x=col[1], kde=True)
f.add_subplot(6,2,3)
sns.histplot(final_df[final_df['clusters']==0],x=col[2], kde=True)
f.add_subplot(6,2,4)
sns.histplot(final_df[final_df['clusters']==0],x=col[3], kde=True)
f.add_subplot(6,2,5)
sns.histplot(final_df[final_df['clusters']==0],x=col[4], kde=True)
f.add_subplot(6,2,6)
sns.histplot(final_df[final_df['clusters']==0],x=col[5], kde=True)
f.add_subplot(6,2,7)
sns.histplot(final_df[final_df['clusters']==0],x=col[6])
f.add_subplot(6,2,8)
sns.histplot(final_df[final_df['clusters']==0],x=col[7])
f.add_subplot(6,2,9)
sns.histplot(final_df[final_df['clusters']==0],x=col[8])
f.add_subplot(6,2,10)
sns.histplot(final_df[final_df['clusters']==0],x=col[9])
f.add_subplot(6,2,11)
sns.histplot(final_df[final_df['clusters']==0],x=col[10])

7.2 when clusters = 1

f=plt.figure(figsize=(18,7))
f.suptitle('clusters = 1',fontweight="black",fontsize="x-large")
f.subplots_adjust(hspace=1)  # increase the spacing between subplots
f.set_size_inches(18, 7)
f.add_subplot(6,2,1)
sns.histplot(final_df[final_df['clusters']==1],x=col[0], kde=True)
f.add_subplot(6,2,2)
sns.histplot(final_df[final_df['clusters']==1],x=col[1], kde=True)
f.add_subplot(6,2,3)
sns.histplot(final_df[final_df['clusters']==1],x=col[2], kde=True)
f.add_subplot(6,2,4)
sns.histplot(final_df[final_df['clusters']==1],x=col[3], kde=True)
f.add_subplot(6,2,5)
sns.histplot(final_df[final_df['clusters']==1],x=col[4], kde=True)
f.add_subplot(6,2,6)
sns.histplot(final_df[final_df['clusters']==1],x=col[5], kde=True)
f.add_subplot(6,2,7)
sns.histplot(final_df[final_df['clusters']==1],x=col[6])
f.add_subplot(6,2,8)
sns.histplot(final_df[final_df['clusters']==1],x=col[7])
f.add_subplot(6,2,9)
sns.histplot(final_df[final_df['clusters']==1],x=col[8])
f.add_subplot(6,2,10)
sns.histplot(final_df[final_df['clusters']==1],x=col[9])
f.add_subplot(6,2,11)
sns.histplot(final_df[final_df['clusters']==1],x=col[10])

7.3 when clusters = 2

f=plt.figure(figsize=(18,7))
f.suptitle('clusters = 2',fontweight="black",fontsize="x-large")
f.subplots_adjust(hspace=1)  # increase the spacing between subplots
f.set_size_inches(18, 7)
f.add_subplot(6,2,1)
sns.histplot(final_df[final_df['clusters']==2],x=col[0], kde=True)
f.add_subplot(6,2,2)
sns.histplot(final_df[final_df['clusters']==2],x=col[1], kde=True)
f.add_subplot(6,2,3)
sns.histplot(final_df[final_df['clusters']==2],x=col[2], kde=True)
f.add_subplot(6,2,4)
sns.histplot(final_df[final_df['clusters']==2],x=col[3], kde=True)
f.add_subplot(6,2,5)
sns.histplot(final_df[final_df['clusters']==2],x=col[4], kde=True)
f.add_subplot(6,2,6)
sns.histplot(final_df[final_df['clusters']==2],x=col[5], kde=True)
f.add_subplot(6,2,7)
sns.histplot(final_df[final_df['clusters']==2],x=col[6])
f.add_subplot(6,2,8)
sns.histplot(final_df[final_df['clusters']==2],x=col[7])
f.add_subplot(6,2,9)
sns.histplot(final_df[final_df['clusters']==2],x=col[8])
f.add_subplot(6,2,10)
sns.histplot(final_df[final_df['clusters']==2],x=col[9])
f.add_subplot(6,2,11)
sns.histplot(final_df[final_df['clusters']==2],x=col[10])

7.4 when clusters = 3

f=plt.figure(figsize=(18,7))
f.suptitle('clusters = 3',fontweight="black",fontsize="x-large")
f.subplots_adjust(hspace=1)  # increase the spacing between subplots
f.set_size_inches(18, 7)
f.add_subplot(6,2,1)
sns.histplot(final_df[final_df['clusters']==3],x=col[0], kde=True)
f.add_subplot(6,2,2)
sns.histplot(final_df[final_df['clusters']==3],x=col[1], kde=True)
f.add_subplot(6,2,3)
sns.histplot(final_df[final_df['clusters']==3],x=col[2], kde=True)
f.add_subplot(6,2,4)
sns.histplot(final_df[final_df['clusters']==3],x=col[3], kde=True)
f.add_subplot(6,2,5)
sns.histplot(final_df[final_df['clusters']==3],x=col[4], kde=True)
f.add_subplot(6,2,6)
sns.histplot(final_df[final_df['clusters']==3],x=col[5], kde=True)
f.add_subplot(6,2,7)
sns.histplot(final_df[final_df['clusters']==3],x=col[6])
f.add_subplot(6,2,8)
sns.histplot(final_df[final_df['clusters']==3],x=col[7])
f.add_subplot(6,2,9)
sns.histplot(final_df[final_df['clusters']==3],x=col[8])
f.add_subplot(6,2,10)
sns.histplot(final_df[final_df['clusters']==3],x=col[9])
f.add_subplot(6,2,11)
sns.histplot(final_df[final_df['clusters']==3],x=col[10])

Score each indicator by its ranking in the radar chart, with a maximum of 4 points and a minimum of 1 point. For example, the category that ranks first on an indicator scores 4 for that indicator.

Calculation method: score = sum of the ranking scores of the benefit-type indicators / ranking score of the cost-type indicator. In other words, the denominator is the total delivery time, and the numerator covers all other numerical features.
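
The scoring rule can be sketched as follows. The table values are hypothetical, only two benefit-type indicators are included to keep it short, and treating the largest value as rank 4 for the cost-type indicator as well is my reading of the rule above rather than something the text states explicitly.

```python
import pandas as pd

# Hypothetical per-cluster summary values (rows are clusters 0 to 3).
summary = pd.DataFrame({
    "daily average UV": [300.0, 150.0, 100.0, 200.0],
    "order conversion rate": [0.002, 0.001, 0.003, 0.004],
    "total delivery time": [10.0, 20.0, 30.0, 15.0],
}, index=[0, 1, 2, 3])

# Ranking score per indicator: the largest value gets 4, the smallest gets 1.
ranks = summary.rank(ascending=True).astype(int)

# Benefit-type ranks are summed; the cost-type rank is the denominator.
benefit_cols = ["daily average UV", "order conversion rate"]
score = ranks[benefit_cols].sum(axis=1) / ranks["total delivery time"]

print(score)
```

A cluster with strong benefit indicators and a short delivery time gets a high score, while a long delivery time lowers the score even when the other ranks are moderate.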

8 conclusion and summary

8.1 project conclusion

1. Category 3 has the highest score, so its overall advertising effect is the best. Its daily average UV is not high, most likely because of its typical attributes (full reduction, JPG, 308*388 advertising size); 308*388 is the only "tree type" advertisement among all advertisements.
2. The scores of category 1 and category 0 are similar, so they can be compared together. Category 1 may advertise more expensive goods: compared with category 0, its user-side indicators such as access depth and average search volume are higher, but its seller-side indicators such as conversion rate and registration rate are lower, which suggests users are still waiting and watching. Category 0 is the opposite, with high seller-side indicators and low user-side indicators, so it may be advertising for more distinctive mass-market products.
3. Category 2 has the lowest score, with access depth, daily average UV and other indicators weighed down by its long delivery time. Moreover, its advertising size is 600*90, a small-area advertisement. A reasonable explanation is that the low delivery cost allowed a longer delivery time, yet the final results are unsatisfactory. In future business activities, it is suggested to drop this category directly to avoid wasting time and energy.

8.2 theoretical summary and outlook

1. Advantages of clustering: this project shows that clustering can summarize messy data. The advantage is clear when Seaborn is used to show the indicators for each category, a view that histograms drawn before modeling cannot give.
2. Clustering finally formed four categories, and the sample sizes were unbalanced. As a follow-up, we can fit a multi-class classification model to see which indicators drive this clustering. For example, advertising size area can be used as a new feature, and advertising shape can also be included in the classification model as a new feature.
3. The visualization ability needs to be strengthened.
