In the previous installment, we exported WeChat chat records, took group chat records as the analysis object, and, after some preprocessing, made a simple analysis of message frequency and message type. This time we analyze the time preferences of WeChat group chats, pick the group with the most records to find the most-followed people in it, and finally segment the text messages and draw a word cloud.
As before, this analysis is carried out in a Jupyter environment; the main libraries used are listed below.
That is, we statistically analyze how active the WeChat groups are at different times.
First, import the libraries you need and make some simple settings (a few new libraries are introduced for the later analysis).
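The original import cell was not preserved; the following is a plausible reconstruction, an assumption based on the libraries actually used throughout the post:

```python
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; adjust for your environment
import matplotlib.pyplot as plt

# The plotting and word-cloud sections additionally use (if installed):
#   import seaborn as sns
#   import jieba, jieba.analyse
#   from wordcloud import WordCloud
```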
```python
print('Min Time: ', message.index.min())
print('Max Time: ', message.index.max())
print('Total Message: ', message['real_content'].count())
```
```
Min Time:  2018-08-11 13:52:14+08:00
Max Time:  2020-02-16 11:38:21+08:00
Total Message:  155407
```
That is to say, the messages in this analysis span August 11, 2018 to February 16, 2020, with 155407 records in total. Below we analyze message frequency at three time granularities:
- frequency statistics by month (format `%Y-%m`);
- frequency statistics by day of the week;
- frequency statistics by hour (format `%H`).
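The pandas accessors behind these three granularities can be sketched as follows (a minimal illustration with a made-up timestamp):

```python
import pandas as pd

# A made-up timestamp to illustrate the three granularities used below.
ts = pd.Timestamp('2019-07-15 11:30:00')

print(ts.strftime('%Y-%m'))  # 2019-07  (month granularity)
print(ts.day_name())         # Monday   (day-of-week granularity)
print(ts.hour)               # 11       (hour granularity)
```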
```python
month_count = message['real_content'].resample('M', kind='period').count()
month_count.head()
```
```
createTime
2018-08    1064
2018-09    4092
2018-10    3214
2018-11    3116
2018-12    4376
Freq: M, Name: real_content, dtype: int64
```
```python
fig, ax = plt.subplots(figsize=(9, 6))
month_count.plot(kind='bar', ax=ax)
```
We can clearly see that the number of group chat messages has increased significantly since July 2019, which is because we joined some very active chat groups around then.
```python
# Timestamp.weekday_name was removed in pandas 1.0; day_name() is the modern equivalent
weekday_count = message.groupby(lambda x: x.day_name())['real_content'].count()
weekday_count
```
```
Friday       27563
Monday       25577
Saturday     12692
Sunday       10291
Thursday     26028
Tuesday      27318
Wednesday    25938
Name: real_content, dtype: int64
```
```python
weekday_per = weekday_count / weekday_count.sum() * 100
```
In other words, on the whole, WeChat group activity is significantly higher on weekdays than on weekends.
```python
hour_count = message.groupby(message.index.hour)['real_content'].count()
hour_count
```
```
createTime
0     1994
1      169
2      155
3       15
4       18
      ...
19    8381
20    7666
21    5600
22    5791
23    3902
Name: real_content, Length: 24, dtype: int64
```
```python
fig, ax = plt.subplots(figsize=(10, 8))
hour_count.plot(kind='bar', ax=ax)  # plotting call assumed; only the figure setup survived
```
We find that group messages peak twice, at 11:00-12:00 and 17:00-18:00, which is right around meal time.
Of course, the three time-based counts above cover all WeChat groups. To analyze the chat-time habits of a specific group, simply filter `message.talker` down to that group.
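A minimal sketch of that filter; the group id `'example@chatroom'` and the tiny DataFrame are made up for illustration (WeChat group ids typically end in `@chatroom`):

```python
import pandas as pd

# Made-up sample data standing in for the real `message` DataFrame.
message = pd.DataFrame({
    'talker': ['example@chatroom', 'friend_a', 'example@chatroom'],
    'real_content': ['hi', 'hello', 'lunch?'],
})

# Keep only messages from the target group.
group_message = message[message['talker'] == 'example@chatroom']
print(group_message['real_content'].count())  # 2
```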
We can split the group chat records into sessions according to whether the conversation is continuous. If, when someone joins the conversation, the number of participants and the total number of messages increase significantly, we consider that person to receive more attention in the group.
```python
target_group = message.groupby('talker')['real_content'].count().idxmax()
```
```
WeChat group with the most chat records: 12*** [email protected]
```
The group with the most records is selected for analysis. We add a new column `diff_second` to `target_message`, holding the time difference between consecutive messages, and use a 60-second threshold to split sessions (in fact, a good splitting threshold is not easy to choose).
```python
diff_second = (target_message.index[1:] - target_message.index[0:-1]).values / 1e9
diff_second = diff_second.astype('d')
target_message['diff_second'] = np.concatenate(([0.], diff_second))
len(diff_second[diff_second <= 60]) / target_message['real_content'].count()
```
Gaps of no more than 60 seconds account for about 82.65% of messages, so the 60-second split threshold is reasonable.
```python
target_message['talk_segid'] = np.where(target_message['diff_second'] <= 60, 0, 1).cumsum()
target_message = target_message[['isSend', 'message_type', 'username', 'real_content', 'diff_second', 'talk_segid']]
target_message.talk_segid.value_counts().value_counts().sort_index()
```
```
1      5370
2      2722
3      1447
4       883
5       539
       ...
371       1
378       1
527       1
638       1
925       1
Name: talk_segid, Length: 146, dtype: int64
```
With the help of `cumsum()` we mark the sessions, then keep only the data columns needed later. In other words, there are 5370 single-message "sessions" whose gap to the neighbouring messages exceeds 60 seconds, about 6.75% of all messages; the longest continuous session runs to 925 messages.
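The `np.where(...).cumsum()` trick is easiest to see on a toy example: rows whose gap exceeds 60 seconds contribute 1 to the cumulative sum, so the session id increases exactly at each long gap.

```python
import numpy as np

# Toy gaps (seconds) between consecutive messages; a gap over 60 s starts a new session.
diff_second = np.array([0., 5., 120., 3., 400., 2.])
talk_segid = np.where(diff_second <= 60, 0, 1).cumsum()
print(talk_segid.tolist())  # [0, 0, 1, 1, 2, 2]
```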
Define an `impact_factor(df)` function that calculates, around each person's first message in a session, the change in activity:

- the difference in the number of participants before and after the message;
- the difference in the total number of messages before and after the message.

Note: "before and after" is always relative to a person's *first* message in the session; chat density is measured by the average message interval.
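The body of `impact_factor` did not survive extraction; the sketch below is one plausible implementation of the description above, not the author's original code. The column names `username`, `diff_people`, and `diff_talks` are taken from the code that follows; the toy `session` DataFrame is made up.

```python
import pandas as pd

def impact_factor(df):
    # For each speaker's first message in this session, compare the part of
    # the session up to (and including) that message with the part after it.
    rows = []
    for idx, row in df.drop_duplicates('username').iterrows():
        before = df.loc[:idx]          # messages up to and including the first message
        after = df.loc[idx:].iloc[1:]  # messages after it
        rows.append({
            'username': row['username'],
            'diff_people': after['username'].nunique() - before['username'].nunique(),
            'diff_talks': len(after) - len(before),
        })
    return pd.DataFrame(rows)

# Toy session: 'b' speaks once early and is followed by more activity.
session = pd.DataFrame({'username': ['a', 'b', 'a', 'c', 'a']})
impact = impact_factor(session)
print(impact)
```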
```python
impact = target_message.reset_index().groupby('talk_segid').apply(impact_factor)
impact_people_mean = impact.groupby('username')['diff_people'].mean()
impact_talks_mean = impact.groupby('username')['diff_talks'].mean()
talks = target_message[['username', 'talk_segid']].drop_duplicates().groupby('username')['talk_segid'].count()
```
We compute each person's number of sessions participated in (`talks`), the average change in participant count after they speak (`impact_people_mean`), and the average change in message count after they speak (`impact_talks_mean`).
```python
talks = pd.merge(pd.merge(talks, impact_people_mean, left_index=True, right_index=True),
                 impact_talks_mean, left_index=True, right_index=True)
talks.columns = ['total_talks', 'diff_people_mean', 'diff_talks_mean']
talks
```
```
313 rows × 3 columns
```
```python
fig = plt.figure(figsize=(10, 8))
talks['sqrt_total_talks'] = np.sqrt(talks.total_talks)
sns.scatterplot(x='sqrt_total_talks', y='diff_people_mean', hue='diff_people_mean',
                size='diff_people_mean', data=talks)
plt.annotate('target', xy=(5, 10), xytext=(10, 15), arrowprops=dict(width=1, headwidth=5))
plt.show()
```
Because some group members are extremely active, we take the square root of the number of sessions each person participated in before drawing the scatter plot.
There is, in fact, an obvious caveat (a bug of sorts) in measuring attention by changes in participant count and message volume: if a person speaks only a few times and happens to follow someone who attracts attention (or speaks early in a hot discussion), their `diff_people_mean` and `diff_talks_mean` will come out large.
For people who speak many times, however, whether they receive attention after speaking is reasonably measured by `diff_people_mean` and `diff_talks_mean`, as the two figures show.
```python
fig = plt.figure(figsize=(10, 8))
sns.scatterplot(x='sqrt_total_talks', y='diff_talks_mean', hue='diff_talks_mean',
                size='diff_talks_mean', data=talks)
plt.annotate('target', xy=(5, 95), xytext=(10, 150), arrowprops=dict(width=1, headwidth=5))
plt.show()
```
```python
print('The most concerned person who has spoken five times or more:',
      talks[talks.total_talks >= 5]['diff_people_mean'].idxmax())
```

```
The most concerned person who has spoken five times or more: ad***47
```
In other words, after `'ad****47'` speaks, more people join in and more messages follow. In fact, the group was created because of him, so this is in line with expectations.
```
total_talks         23.000000
diff_people_mean     9.695652
diff_talks_mean     92.565217
sqrt_total_talks     4.795832
Name: ad814156147, dtype: float64
```
That is, `'ad***47'` participated in 23 sessions in total; on average, after he spoke, more than 9 additional people chimed in and the message count grew by 92.
In the end, `'ad****47'` is crowned "the most beautiful cub in the whole group"!
Finally, we build a simple word cloud from the chat records in `target_message`, mainly using `jieba` for word segmentation and `wordcloud` to draw the word cloud.
```python
chats = target_message[target_message.message_type == 'text']['real_content']
```
For word segmentation, only text messages are used. The textual forms of WeChat emoticons, such as '[cover your face]', '[smile]', and '[laugh and cry]', are added to the jieba dictionary; because they contain special characters, jieba's `__init__.py` needs a small modification, see https://github.com/fxsjy/jieba/issues/423.
```
Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/bf/yt56193d35lbzfh4rn_xxct40000gn/T/jieba.cache
Loading model cost 0.966 seconds.
Prefix dict has been built successfully.
```
After adding the WeChat emoticons to the dictionary and removing some stop words, `collections.Counter` makes the word-frequency counting straightforward.
```python
word_counts = Counter()
for chat in chats:
    word_counts.update(jieba.analyse.extract_tags(chat))
word_counts = pd.Series(word_counts)
```
This yields 1402 distinct words in total.
With the help of wordcloud, we can draw the word cloud of 200 words with the highest frequency.
```python
wc = WordCloud(font_path='/fontpath/simhei.ttf', background_color="white", repeat=False)
wc.fit_words(word_counts.sort_values()[-200:])
fig, ax = plt.subplots(figsize=(10, 8))
ax.axis('off')
ax.imshow(wc)
```
That’s all for the chat analysis of wechat group.
- Wei Yang, "WeChat chat record export to computer TXT file tutorial", Wei Yang's blog, 2020/3/2.
- hangcom, "WeChat chat record export – release", Where hangcom writes, 2020/3/2.
- Asher117, "[Python] DataFrame: split one column into multiple columns and one row into multiple rows", 2020/3/2.
- The wind is shallow and calm, "Chinese display in Matplotlib and Seaborn", 2020/3/2.
- alpiny, "[Share] What many people need: keywords with spaces and special characters~~", 2020/3/9.
This work is released under a CC license. Reprints must credit the author and link to this article.