# Two or three things about WeChat groups (Part Two)

Time: 2021-4-15

In the previous part, we exported our WeChat chat records and chose the group chats as the object of analysis. After some preprocessing, we did a simple analysis of message frequency and message type. This time we analyze the time preferences of group chat, then pick the group with the most records and look for the person who gets the most attention in it. Finally, we segment the text messages into words and draw a word cloud.

As in the previous part, the analysis is carried out in Jupyter, and the main libraries used are as follows:

• pandas
• numpy
• matplotlib
• seaborn
• jieba
• wordcloud
• collections

## 1 Time preference of chat

That is, a statistical analysis of how often people speak in WeChat groups at different times.

``# New libraries for subsequent analysis``

First, import the libraries you need and make some simple settings.

``````print('Min Time: ', message.index.min())
print('Max Time: ', message.index.max())
print('Total Message: ', message['real_content'].count())``````
``````Min Time:  2018-08-11 13:52:14+08:00
Max Time:  2020-02-16 11:38:21+08:00
Total Message:  155407``````

That is to say, the group chat records in this analysis were sent between August 11, 2018 and February 16, 2020, 155,407 records in total. Below, message frequency is analyzed at three time granularities:

• by month, at `%Y-%m` precision;
• by day of the week, at `%w` precision;
• by hour of the day, at `%H` precision.

### 1.1 Messages by month

``````month_count = message['real_content'].resample('M', kind='period').count()
month_count.head()``````
``````createTime
2018-08    1064
2018-09    4092
2018-10    3214
2018-11    3116
2018-12    4376
Freq: M, Name: real_content, dtype: int64``````
``````fig, ax = plt.subplots(figsize=(9, 6))

month_count.plot(kind='bar', ax=ax)``````

We can clearly see that the number of group messages has increased significantly since July 2019. This is because we joined some very active chat groups.
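As a cross-check on the monthly count, here is a minimal sketch of an equivalent computation that does not rely on the `kind` argument of `resample`. The tiny `message` frame below is made up for illustration; only the column name `real_content` and the index name `createTime` come from the article.

```python
import pandas as pd

# Hypothetical stand-in for `message`: a datetime index named createTime
# and a real_content column, as in the article.
idx = pd.to_datetime([
    "2018-08-11 13:52:14", "2018-08-20 09:00:00",
    "2018-09-01 12:30:00", "2018-09-15 18:05:00", "2018-09-30 23:59:59",
    "2018-11-02 08:00:00",
])
message = pd.DataFrame({"real_content": list("abcdef")},
                       index=pd.Index(idx, name="createTime"))

# Bucket every timestamp into its calendar month, then count messages.
month_count = message["real_content"].groupby(message.index.to_period("M")).count()

# A groupby skips months with no messages; reindex so they show up as 0.
full_range = pd.period_range(month_count.index.min(),
                             month_count.index.max(), freq="M")
month_count = month_count.reindex(full_range, fill_value=0)
print(month_count)
```

Filling the empty months matters for the bar chart: without it, a silent month would simply be missing from the x-axis instead of showing as a zero-height bar.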

### 1.2 Messages by day of the week

``````weekday_count = message.groupby(lambda x: x.day_name())['real_content'].count()
weekday_count``````
``````Friday       27563
Monday       25577
Saturday     12692
Sunday       10291
Thursday     26028
Tuesday      27318
Wednesday    25938
Name: real_content, dtype: int64``````
``weekday_per = weekday_count / weekday_count.sum() * 100``

On the whole, WeChat group activity is significantly higher on weekdays than at weekends.
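The weekday table above comes out in alphabetical order, which makes the weekday/weekend contrast harder to read. A small sketch of the same count put into calendar order and converted to percentages is shown here; the four sample messages are invented, and `weekday_order` is a helper name introduced only for this example.

```python
import pandas as pd

# Hypothetical sample: two messages on a Monday, one on a Tuesday, one on a Saturday.
idx = pd.to_datetime([
    "2019-07-01 10:00",  # Monday
    "2019-07-01 11:00",  # Monday
    "2019-07-02 12:00",  # Tuesday
    "2019-07-06 09:00",  # Saturday
])
message = pd.DataFrame({"real_content": ["a", "b", "c", "d"]}, index=idx)

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Count per weekday name, then force calendar order (missing days become 0).
weekday_count = (
    message.groupby(message.index.day_name())["real_content"]
    .count()
    .reindex(weekday_order, fill_value=0)
)

# Share of all messages that falls on each weekday, in percent.
weekday_per = weekday_count / weekday_count.sum() * 100
print(weekday_per.round(1))
```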

### 1.3 Messages by hour of the day

``````hour_count = message.groupby(message.index.hour)['real_content'].count()
hour_count``````
``````createTime
0     1994
1      169
2      155
3       15
4       18
...
19    8381
20    7666
21    5600
22    5791
23    3902
Name: real_content, Length: 24, dtype: int64``````
``fig = plt.figure(figsize=(10, 8))``

We found that the two peaks of group chat appear at 11-12 and 17-18, which are both around meal times. Ha ha.

Of course, the three time-based counts above are computed over all WeChat groups. To analyze the chat-time habits of a specific group, just restrict `message.talker` to that group.
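Restricting the analysis to one group could look like the sketch below. The group ids `g1@chatroom` and `g2@chatroom` and the five messages are invented for illustration; the column names `talker` and `real_content` are the ones used in the article.

```python
import pandas as pd

# Hypothetical messages from two groups.
idx = pd.to_datetime([
    "2019-07-01 11:05", "2019-07-01 11:40", "2019-07-01 17:20",
    "2019-07-02 11:15", "2019-07-02 23:10",
])
message = pd.DataFrame(
    {"talker": ["g1@chatroom", "g2@chatroom", "g1@chatroom",
                "g1@chatroom", "g1@chatroom"],
     "real_content": ["a", "b", "c", "d", "e"]},
    index=pd.Index(idx, name="createTime"),
)

# Keep only the rows of the group we are interested in.
one_group = message[message.talker == "g1@chatroom"]

# Count per hour of day; reindex so that all 24 hours are present.
hour_count = (
    one_group.groupby(one_group.index.hour)["real_content"]
    .count()
    .reindex(range(24), fill_value=0)
)
print(hour_count[hour_count > 0])
```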

## 2 The person who gets the most attention in a WeChat group

We can split the chat records of a WeChat group into sessions according to whether the chat is continuous. If the number of participants and the total number of messages increase noticeably once someone joins a session, we consider that this person gets more attention in the group.

``target_group = message.groupby('talker')['real_content'].count().idxmax()``
``Wechat group with the most chat records: 12*** [email protected]``
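The later steps work on a frame called `target_message`, whose definition is not shown. A plausible sketch, assuming it is simply the rows of `message` that belong to `target_group`, is given here with made-up group ids:

```python
import pandas as pd

# Hypothetical messages from three groups; g2 is the busiest.
message = pd.DataFrame({
    "talker": ["g1@chatroom", "g2@chatroom", "g2@chatroom",
               "g3@chatroom", "g2@chatroom"],
    "real_content": ["a", "b", "c", "d", "e"],
})

# Group id with the largest number of records.
target_group = message.groupby("talker")["real_content"].count().idxmax()

# Assumed definition of target_message: all rows of that group.
# .copy() gives an independent frame, so adding columns later is safe.
target_message = message[message.talker == target_group].copy()
print(target_group, len(target_message))
```

Taking a copy is a deliberate choice here: the following steps add columns such as `diff_second` and `talk_segid`, and assigning to a filtered view of `message` can trigger pandas' chained-assignment warning.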

### 2.1 "Segmentation" of chat records

We select the group with the most chat records for analysis and add the time difference between consecutive messages to `target_message` as a new column, `diff_second`. The split threshold is set to 60 seconds (in fact, a good cut-off point for splitting is not easy to find).

``````diff_second = (target_message.index[1:] - target_message.index[0:-1]).values / 1e9
diff_second = diff_second.astype('d')

target_message['diff_second'] = np.concatenate(([0.], diff_second))

len(diff_second[diff_second <= 60]) / target_message['real_content'].count()``````
``0.8264578046880888``

About 82.65% of consecutive messages are no more than 60 seconds apart, so the split is reasonable.

``````target_message['talk_segid'] = np.where(target_message['diff_second'] <= 60, 0, 1).cumsum()

target_message = target_message[['isSend', 'message_type', 'username', 'real_content',
'diff_second', 'talk_segid']]

target_message.talk_segid.value_counts().value_counts().sort_index()``````
``````1      5370
2      2722
3      1447
4       883
5       539
...
371       1
378       1
527       1
638       1
925       1
Name: talk_segid, Length: 146, dtype: int64``````

`cumsum()` is used to mark the segments, and the data columns needed later are selected. The result shows that 5,370 sessions consist of a single message, i.e. the message is more than 60 seconds away from both the previous and the next one; these account for about 6.75% of all messages in the group. The longest continuous session has 925 messages.
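To see why `cumsum()` yields session ids, here is a small worked sketch on six invented timestamps. It uses `diff()` on the index instead of the array subtraction above, which is an alternative formulation, not the article's code.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime([
    "2019-07-01 12:00:00",
    "2019-07-01 12:00:20",   # 20 s later   -> same session
    "2019-07-01 12:01:10",   # 50 s later   -> same session
    "2019-07-01 12:05:00",   # 230 s later  -> new session
    "2019-07-01 12:05:30",   # 30 s later   -> same session
    "2019-07-01 13:00:00",   # 3270 s later -> new session
])
target_message = pd.DataFrame({"real_content": list("abcdef")}, index=idx)

# Seconds since the previous message; the first message has no predecessor, so use 0.
diff_second = target_message.index.to_series().diff().dt.total_seconds().fillna(0.0)
target_message["diff_second"] = diff_second.to_numpy()

# A gap of more than 60 s contributes a 1, everything else a 0.
# The running total of those flags only changes at a gap, so it acts as a session id.
is_gap = np.where(target_message["diff_second"] <= 60, 0, 1)
target_message["talk_segid"] = is_gap.cumsum()
print(target_message["talk_segid"].tolist())

# Session sizes, then how many sessions there are of each size.
segment_size = target_message["talk_segid"].value_counts().sort_index()
print(segment_size.value_counts().sort_index())
```

The gap flags here are `0, 0, 0, 1, 0, 1`, so the running sum gives the ids `0, 0, 0, 1, 1, 2`: three sessions with 3, 2 and 1 messages.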

### 2.2 Calculating speech influence

``def impact_factor(df):``

For each person, calculate how the number of speakers and the number of messages change before and after that person's first message in a session.

• the difference in the number of participants before and after the message;
• the difference in the total number of messages before and after the message.

Note: "before" and "after" are relative to the first time a person speaks in the session; chat density is measured by the average interval between messages.
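Only the signature of `impact_factor` is shown, so the following is a hypothetical reconstruction from the description, not the author's implementation. It assumes that `df` holds one session in time order with a `username` column, that the person's own later messages do not count as extra participants, and it works with raw counts rather than the chat density mentioned in the note. The output column names `diff_people` and `diff_talks` are also assumptions.

```python
import pandas as pd


def impact_factor(df):
    """Hypothetical sketch: per-user before/after differences within one session.

    `df` contains the messages of a single session in time order and needs a
    `username` column. The split point is each user's first message.
    """
    usernames = df["username"].to_numpy()
    rows = []
    seen = set()
    for pos, user in enumerate(usernames):
        if user in seen:
            continue  # only the first message of each user is a split point
        seen.add(user)
        before = usernames[:pos]
        after = usernames[pos + 1:]
        rows.append({
            "username": user,
            # distinct other speakers afterwards minus distinct speakers before
            "diff_people": len(set(after) - {user}) - len(set(before)),
            # number of messages afterwards minus number of messages before
            "diff_talks": len(after) - len(before),
        })
    return pd.DataFrame(rows).set_index("username")


# One invented session with four speakers.
session = pd.DataFrame({"username": ["A", "B", "A", "C", "B", "D"]})
result = impact_factor(session)
print(result)
```

In this example A opens the session, so everything counts as "after" (3 other speakers, 5 messages), while D speaks last and gets strongly negative values. A function of this shape can be passed to `groupby('talk_segid').apply(...)`, which would give one row per session and user.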

``````impact = target_message.reset_index().groupby('talk_segid').apply(impact_factor)``````

Then calculate, for each person, the number of sessions taken part in (`talks`), the average difference in the number of participants per session (`impact_people_mean`), and the average difference in the number of chat records per session (`impact_talks_mean`).

``````talks = pd.merge(pd.merge(talks, impact_people_mean, left_index=True, right_index=True),
impact_talks_mean, left_index=True, right_index=True)
talks.columns = ['total_talks', 'diff_people_mean', 'diff_talks_mean']
talks``````
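
|  | total_talks | diff_people_mean | diff_talks_mean |
|---|---|---|---|
| FJ***21 | 20 | -0.950000 | -5.900000 |
| Ha***73 | 32 | -0.093750 | 1.500000 |
| Hx***55 | 1166 | 0.845626 | 9.343911 |
| JA***37 | 76 | -0.828947 | 0.763158 |
| JJ***er | 4 | -2.500000 | -1.500000 |
| ... | ... | ... | ... |
| zj***mc | 1 | 11.000000 | 37.000000 |
| zl***69 | 186 | 0.032258 | 2.365591 |
| zs***24 | 11 | 0.727273 | 7.090909 |
| zs***41 | 43 | 0.232558 | 8.209302 |
| zw***247459 | 14 | -0.785714 | -0.214286 |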

313 rows × 3 columns
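The code that produces `talks`, `impact_people_mean` and `impact_talks_mean` is not shown. A minimal sketch of one way to get a table of this shape follows, assuming `impact` has one row per (session, user) with columns named `diff_people` and `diff_talks`; both the layout and the column names are assumptions. It uses `pd.concat` where the article nests two `pd.merge` calls.

```python
import pandas as pd

# Hypothetical `impact`: one row per (session id, user).
impact = pd.DataFrame(
    {"diff_people": [3, 2, 0, 1, -1],
     "diff_talks": [5, 3, -1, 4, -2]},
    index=pd.MultiIndex.from_tuples(
        [(0, "A"), (0, "B"), (0, "C"), (1, "A"), (1, "C")],
        names=["talk_segid", "username"],
    ),
)

by_user = impact.groupby(level="username")

# Number of sessions each user took part in.
talks = by_user.size()
# Average before/after differences over those sessions.
impact_people_mean = by_user["diff_people"].mean()
impact_talks_mean = by_user["diff_talks"].mean()

# All three Series share the username index, so they can be placed side by side.
talks = pd.concat([talks, impact_people_mean, impact_talks_mean], axis=1)
talks.columns = ["total_talks", "diff_people_mean", "diff_talks_mean"]
print(talks)
```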

### 2.3 Candidates for "most concerned"

``````fig = plt.figure(figsize=(10, 8))
talks['sqrt_total_talks'] = np.sqrt(talks.total_talks)

sns.scatterplot(x='sqrt_total_talks', y='diff_people_mean', hue='diff_people_mean',
                size='diff_people_mean', data=talks)

# The end of this call is cut off in the source; arrowprops is an assumed completion.
plt.annotate('target', xy=(5, 10), xytext=(10, 15),
             arrowprops=dict(arrowstyle='->'))
plt.show()``````

Because some members of the group are extremely active, the number of sessions each person took part in is square-rooted before the scatter plot is drawn.

In fact, measuring attention by the difference in the number of speakers and the amount of chat has an obvious `bug`: if a person speaks only a few times and happens to follow someone who gets attention (or speaks early in a hot discussion), that person's `diff_people_mean` and `diff_talks_mean` will be large.

For people who have spoken many times, it is reasonable to use `diff_people_mean` and `diff_talks_mean` to measure whether they receive attention, as with the `target` point marked in the two figures.

``````fig = plt.figure(figsize=(10, 8))

sns.scatterplot(x='sqrt_total_talks', y='diff_talks_mean', hue='diff_talks_mean',
                size='diff_talks_mean', data=talks)

# The end of this call is cut off in the source; arrowprops is an assumed completion.
plt.annotate('target', xy=(5, 95), xytext=(10, 150),
             arrowprops=dict(arrowstyle='->'))
plt.show()``````

``print('The most concerned person who has spoken five times or more:', talks[talks.total_talks >= 5]['diff_people_mean'].idxmax())``
``The most concerned person who has spoken five times or more: ad****47``

In other words, after `'ad****47'` speaks, more people speak and more messages are sent. In fact, this group was set up because of him, so the result matches expectations.
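The effect of the "five times or more" filter can be checked with four rows taken from the figures shown in this article. Without the filter, a member who appears in a single session wins; with it, the result is `ad****47`. The variable names `unfiltered` and `filtered` are introduced only for this sketch.

```python
import pandas as pd

# Four rows copied from the outputs shown in the article.
talks = pd.DataFrame(
    {"total_talks": [1166, 1, 186, 23],
     "diff_people_mean": [0.845626, 11.000000, 0.032258, 9.695652]},
    index=["Hx***55", "zj***mc", "zl***69", "ad****47"],
)

# Without a minimum number of sessions, a one-off participant comes out on top.
unfiltered = talks["diff_people_mean"].idxmax()

# Requiring at least 5 sessions removes such small-sample outliers.
filtered = talks[talks.total_talks >= 5]["diff_people_mean"].idxmax()

print(unfiltered, filtered)
```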

``talks.loc['ad****47']``
``````total_talks         23.000000
diff_people_mean     9.695652
diff_talks_mean     92.565217
sqrt_total_talks     4.795832``````

In other words, `'ad****47'` took part in 23 sessions in total. On average, after he first spoke in a session, more than 9 additional people came out to chat and the number of messages increased by about 92.

In the end, `'ad****47'` is elected "the most beautiful cub in the whole group"!

## 3 Word preference in chat

Finally, we use the chat records in `target_message` to make a simple word cloud, mainly using `jieba` for word segmentation and `wordcloud` to draw the word cloud.

### 3.1 Word segmentation

``chats = target_message[target_message.message_type == 'text']['real_content']``
``createTime``

For word segmentation, only messages of type `'text'` are selected.

Add the text that WeChat emoticons are converted to, such as `[cover your face]`, `[smile]` and `[laugh and cry]`, to the dictionary. For the related change to `__init__.py`, please refer to https://github.com/fxsjy/jieba/issues/423.

``````jieba.load_userdict('wechat_emoji_dict.txt')
jieba.analyse.set_stop_words('stop_words.txt')``````
``````Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/bf/yt56193d35lbzfh4rn_xxct40000gn/T/jieba.cache
Prefix dict has been built successfully.``````

Add the WeChat emoticons to the dictionary and remove some stop words, then use `collections.Counter` to simplify the word-frequency count.

``````word_counts = Counter()
for chat in chats:
    # extract_tags returns the keywords of one message; Counter accumulates them
    word_counts.update(jieba.analyse.extract_tags(chat))

word_counts = pd.Series(word_counts)``````
``word_counts.sort_values().tail()``
``It's not 1402``
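The counting pattern can be tried out without `jieba`. In this sketch, `tokenize` is a hypothetical stand-in that just splits on whitespace, and the three chat lines are invented; with real data, `jieba.analyse.extract_tags` would take its place.

```python
from collections import Counter

import pandas as pd

# Invented messages; "[smile]" mimics a WeChat emoticon converted to text.
chats = ["hello world", "hello [smile]", "world hello"]


def tokenize(text):
    # Hypothetical stand-in for jieba.analyse.extract_tags: split on whitespace.
    return text.split()


word_counts = Counter()
for chat in chats:
    # update() expects an iterable of tokens and adds 1 for each of them.
    word_counts.update(tokenize(chat))

top_two = word_counts.most_common(2)
print(top_two)

# Same conversion as in the article: a Series sorted by frequency.
word_counts = pd.Series(word_counts).sort_values()
print(word_counts.tail())
```

One thing to watch with `Counter.update`: it must receive a list of tokens. Passing a plain string would count individual characters instead of words.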

### 3.2 Drawing the word cloud

With the help of `wordcloud`, we draw a word cloud of the 200 most frequent words.

``````wc = WordCloud(font_path='/fontpath/simhei.ttf', background_color="white", repeat=False)
wc.fit_words(word_counts.sort_values()[-200:])

fig, ax = plt.subplots(figsize=(10, 8))

ax.axis('off')
ax.imshow(wc)``````

That's all for the chat analysis of WeChat groups.