Two or Three Things in a WeChat Group (Part Two)

Time: 2021-04-15

In the previous part, we exported the WeChat chat records, selected the group-chat records as the object of analysis, and, after some preprocessing, did a simple analysis of message frequency and message type. This time we analyze the time preferences of group chatting, then pick the group with the most records and look for the most-followed person in it. Finally, we segment the text messages and draw a word cloud.

As in the previous part, this analysis is carried out in a Jupyter environment; the main libraries used are as follows:

  • pandas
  • numpy
  • matplotlib
  • seaborn
  • jieba
  • wordcloud
  • collections

1 Time preference of chats

That is, we run statistics on how frequently the WeChat groups are active at different times.

First, import the libraries needed for the subsequent analysis and make some simple settings.

print('Min Time: ', message.index.min())
print('Max Time: ', message.index.max())
print('Total Message: ', message['real_content'].count())
Min Time:  2018-08-11 13:52:14+08:00
Max Time:  2020-02-16 11:38:21+08:00
Total Message:  155407

That is, the records in this analysis span August 11, 2018 to February 16, 2020, for a total of 155,407 messages. Below we analyze message frequency on three time scales:

  • frequency statistics at month (%Y-%m) granularity;
  • frequency statistics at weekday granularity;
  • frequency statistics at hour (%H) granularity.

1.1 Message counts by month

month_count = message['real_content'].resample('M', kind='period').count()
month_count.head()
createTime
2018-08    1064
2018-09    4092
2018-10    3214
2018-11    3116
2018-12    4376
Freq: M, Name: real_content, dtype: int64
fig, ax = plt.subplots(figsize=(9, 6))

month_count.plot(kind='bar', ax = ax)

[Figure: bar chart of monthly message counts]

We can clearly see that the number of group messages has increased markedly since July 2019, which is actually because we joined some very active chat groups.

1.2 Messages by day of week

weekday_count = message.groupby(lambda x: x.day_name())['real_content'].count()
weekday_count
Friday       27563
Monday       25577
Saturday     12692
Sunday       10291
Thursday     26028
Tuesday      27318
Wednesday    25938
Name: real_content, dtype: int64
weekday_per = weekday_count / weekday_count.sum() * 100

[Figure: bar chart of message percentage by day of week]

Overall, the groups are noticeably more active on weekdays than on weekends.

1.3 Messages by hour of day

hour_count = message.groupby(message.index.hour)['real_content'].count()
hour_count
createTime
0     1994
1      169
2      155
3       15
4       18
      ... 
19    8381
20    7666
21    5600
22    5791
23    3902
Name: real_content, Length: 24, dtype: int64
fig = plt.figure(figsize=(10, 8))

[Figure: bar chart of message counts by hour of day]

We find that the two peaks of group activity appear at 11:00–12:00 and 17:00–18:00, right around meal times. Ha ha.

Of course, the three time-based counts above cover all WeChat groups together. To analyze the chatting habits of one specific group, simply restrict message.talker to the specified group first.
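A minimal sketch of that restriction on toy data (the column names talker and real_content follow the article; the group id here is made up):

```python
import pandas as pd

# Stand-in for the full `message` frame, indexed by message time.
idx = pd.to_datetime(['2019-01-01 09:10', '2019-01-01 09:40',
                      '2019-01-02 18:05'])
message = pd.DataFrame({'talker': ['g1@chatroom', 'g2@chatroom', 'g1@chatroom'],
                        'real_content': ['hi', 'hello', 'bye']}, index=idx)

# Restrict to one group, then reuse any of the time-based counts.
one_group = message[message.talker == 'g1@chatroom']
hour_count = one_group.groupby(one_group.index.hour)['real_content'].count()
```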

2 The most-followed person in the group

We can split the group chat records into sessions according to whether the conversation is continuous. If, when someone joins the conversation, the number of participants and the total number of messages increase significantly, we consider that this person receives more attention in the group.

target_group = message.groupby('talker')['real_content'].count().idxmax()
WeChat group with the most chat records: 12***@chatroom

2.1 "Segmenting" the chat records

We select the group with the most messages for analysis, compute the time difference between consecutive messages as a new column diff_second of target_message, and use a 60-second threshold to split the records into sessions (in fact, finding a good split point is not easy).

diff_second = (target_message.index[1:] - target_message.index[0:-1]).values / 1e9
diff_second = diff_second.astype('d')

target_message['diff_second'] = np.concatenate(([0.], diff_second))

len(diff_second[diff_second <= 60]) / target_message['real_content'].count()
0.8264578046880888

Gaps of 60 seconds or less between consecutive messages account for about 82.65% of the total, so the split is reasonable.

target_message['talk_segid'] = np.where(target_message['diff_second'] <= 60, 0, 1).cumsum()

target_message = target_message[['isSend', 'message_type', 'username', 'real_content', 
                                 'diff_second', 'talk_segid']]

target_message.talk_segid.value_counts().value_counts().sort_index()
1      5370
2      2722
3      1447
4       883
5       539
       ... 
371       1
378       1
527       1
638       1
925       1
Name: talk_segid, Length: 146, dtype: int64

With the help of cumsum() we mark the sessions, and keep only the columns needed later. That is, 5,370 sessions consist of a single message, one separated from both its neighbors by more than 60 seconds, accounting for about 6.75% of the messages; the longest continuous session runs to 925 messages.
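The gap-and-split computation above can be reproduced on toy data with pandas' diff(), a slightly more idiomatic equivalent of subtracting shifted indices:

```python
import pandas as pd

idx = pd.to_datetime(['2019-01-01 09:00:00', '2019-01-01 09:00:30',
                      '2019-01-01 09:05:00'])
target_message = pd.DataFrame({'real_content': ['a', 'b', 'c']}, index=idx)

# Seconds since the previous message (0 for the very first one)
target_message['diff_second'] = (
    target_message.index.to_series().diff().dt.total_seconds().fillna(0.0)
)

# Start a new session id whenever the gap exceeds 60 seconds
target_message['talk_segid'] = (target_message['diff_second'] > 60).cumsum()
```

Here the third message is 270 s after the second, so it opens a new session.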

2.2 Measuring the influence of a message

def impact_factor(df):

The function computes, for each person's first message in a chat session, how the number of speakers and the number of messages change before and after that message:

  • the difference in the number of participants before and after the message;
  • the difference in the total number of messages before and after the message.

Note: "before and after" is always measured at a person's first message within the session; the average message interval can serve as a measure of chat density.
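The article omits the body of impact_factor; one possible implementation of the description above (a sketch under those assumptions, not the author's original code):

```python
import pandas as pd

def impact_factor(df):
    """For each participant's first message in one chat session, return the
    change in distinct-speaker count and message count before vs after it."""
    df = df.reset_index(drop=True)
    records = []
    for user in df['username'].unique():
        first = int((df['username'] == user).idxmax())  # row of first message
        before, after = df.iloc[:first], df.iloc[first + 1:]
        records.append({
            'username': user,
            'diff_people': after['username'].nunique() - before['username'].nunique(),
            'diff_talks': len(after) - len(before),
        })
    return pd.DataFrame(records)

# Tiny demo session: a speaks, b speaks, a again, then c
session = pd.DataFrame({'username': ['a', 'b', 'a', 'c']})
impact = impact_factor(session)
```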

impact = target_message.reset_index().groupby('talk_segid').apply(impact_factor)

impact_people_mean = impact.groupby('username')['diff_people'].mean()
impact_talks_mean  = impact.groupby('username')['diff_talks'].mean()

talks = target_message[['username', 'talk_segid']].drop_duplicates().groupby('username')['talk_segid'].count()

We compute each person's number of chat sessions participated in, talks; the average change in participant count after their first message, impact_people_mean; and the average change in message count, impact_talks_mean.

talks = pd.merge(pd.merge(talks, impact_people_mean, left_index=True, right_index=True),
                 impact_talks_mean, left_index=True, right_index=True)
talks.columns = ['total_talks', 'diff_people_mean', 'diff_talks_mean']
talks
username      total_talks  diff_people_mean  diff_talks_mean
FJ***21                20         -0.950000        -5.900000
Ha***73                32         -0.093750         1.500000
Hx***55              1166          0.845626         9.343911
JA***37                76         -0.828947         0.763158
JJ***er                 4         -2.500000        -1.500000
...                   ...               ...              ...
zj***mc                 1         11.000000        37.000000
zl***69               186          0.032258         2.365591
zs***24                11          0.727273         7.090909
zs***41                43          0.232558         8.209302
zw***247459            14         -0.785714        -0.214286

313 rows × 3 columns

2.3 The most-followed candidate

fig = plt.figure(figsize=(10, 8))
talks['sqrt_total_talks'] = np.sqrt(talks.total_talks)

sns.scatterplot(x='sqrt_total_talks', y='diff_people_mean', hue='diff_people_mean',
                size='diff_people_mean', data=talks)

plt.annotate('target', xy=(5, 10), xytext=(10, 15), 
             arrowprops=dict(width=1, headwidth=5))
plt.show()

[Figure: scatter plot of sqrt_total_talks vs diff_people_mean, with the target point annotated]

Because a few group members are extremely active, we take the square root of the number of chat sessions each person participated in before drawing the scatter plot.

In fact, measuring attention by the change in speaker count and message volume has an obvious bug: if a person speaks only a few times and happens to follow someone who draws attention (or speaks early in a hot discussion), their diff_people_mean and diff_talks_mean will also come out large.

For someone who speaks often and still shows high diff_people_mean and diff_talks_mean, it is reasonable to conclude that they receive attention after speaking, as shown by the target point in the two figures.

fig = plt.figure(figsize=(10, 8))

sns.scatterplot(x='sqrt_total_talks', y='diff_talks_mean', hue='diff_talks_mean',
                size='diff_talks_mean', data=talks)

plt.annotate('target', xy=(5, 95), xytext=(10, 150), 
             arrowprops=dict(width=1, headwidth=5))
plt.show()

[Figure: scatter plot of sqrt_total_talks vs diff_talks_mean, with the target point annotated]

print('Most-followed person with five or more sessions:',
      talks[talks.total_talks >= 5]['diff_people_mean'].idxmax())
Most-followed person with five or more sessions: ad****47

In other words, after 'ad****47' speaks, more people join in and more messages follow. In fact, the group was founded because of him, so this matches expectations.

talks.loc['ad****47']
total_talks         23.000000
diff_people_mean     9.695652
diff_talks_mean     92.565217
sqrt_total_talks     4.795832
Name: ad****47, dtype: float64

In other words, 'ad****47' took part in 23 chat sessions; on average, after each of his first messages, more than 9 additional people came out to chime in, and about 92 more messages were posted.

In the end, 'ad****47' is crowned "the most beautiful cub in the whole group"!

3 Word preference in chatting

Finally, we build a simple word cloud from the target_message chat records, mainly using jieba for word segmentation and wordcloud to draw the cloud.

3.1 Word segmentation

chats = target_message[target_message.message_type == 'text']['real_content']

For word segmentation, we keep only messages of type 'text'.

We add the text forms of WeChat emoticons, such as '[Facepalm]', '[Smile]' and '[Laugh-cry]', to the user dictionary. Because these contain special characters, jieba's __init__.py needs a small tweak; see https://github.com/fxsjy/jieba/issues/423.

jieba.load_userdict('wechat_emoji_dict.txt')
jieba.analyse.set_stop_words('stop_words.txt')
Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/bf/yt56193d35lbzfh4rn_xxct40000gn/T/jieba.cache
Loading model cost 0.966 seconds.
Prefix dict has been built successfully.

After adding the WeChat emoticons to the dictionary and removing some stop words, collections.Counter simplifies the word-frequency statistics.
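The Counter-based tallying can be tried standalone; a sketch where a plain whitespace split stands in for jieba.analyse.extract_tags (the sample sentences are made up):

```python
from collections import Counter
import pandas as pd

chats = pd.Series(['hello world', 'hello again', 'world world'])

word_counts = Counter()
for chat in chats:
    # In the article this is jieba.analyse.extract_tags(chat);
    # split() is a stand-in so the sketch runs without jieba.
    word_counts.update(chat.split())

word_counts = pd.Series(word_counts).sort_values()
```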

word_counts = Counter()
for chat in chats:
    word_counts.update(jieba.analyse.extract_tags(chat))

word_counts = pd.Series(word_counts)
word_counts.sort_values().tail()
不是    1402

3.2 Drawing the word cloud

With the help of wordcloud, we draw a word cloud of the 200 highest-frequency words.

wc = WordCloud(font_path='/fontpath/simhei.ttf', background_color="white", repeat=False)
wc.fit_words(word_counts.sort_values()[-200:])

fig, ax = plt.subplots(figsize=(10, 8))

ax.axis('off')
ax.imshow(wc)

[Figure: word cloud of the 200 most frequent words]

That's all for this round of WeChat group chat analysis.

4 References

  1. Wei Yang, "Tutorial on exporting WeChat chat records to a TXT file on the computer", Wei Yang's blog, 2020/3/2.
  2. hangcom, "WeChat chat record export – release", Where hangcom writes, 2020/3/2.
  3. Asher117, "[Python] DataFrame: splitting one column into multiple columns and one row into multiple rows", 2020/3/2.
  4. The wind is shallow and calm, "Chinese display in Matplotlib and Seaborn", 2020/3/2.
  5. alpiny, "[Share] What many people need: keywords with spaces and special characters", 2020/3/9.

This work is licensed under a CC license; reprints must credit the author and link to the original article.