Two or Three Things About WeChat Groups (Part One)

Time:2021-4-16

WeChat is probably the app I use most often. For work, study, hobbies and other reasons, I have joined many WeChat groups. Today, we will use Python to analyze the chat records of these groups.

The data analysis is carried out in a Jupyter environment. The main libraries used are as follows:

  • pandas
  • numpy
  • matplotlib
  • seaborn
  • re

1 data acquisition

First of all, we need to get the WeChat group chat records. Besides being stored on Tencent's servers, WeChat chat records are also saved locally, so we can export them from the phone to obtain the data we need.

For exporting chat records from the iPhone version of WeChat, refer to Hangcom's post; for the Android version, refer to @Godweiyang's post.

The local WeChat database is saved in EnMicroMsg.db. The most important tables in the database are:

  • Userinfo: the user's personal information
  • Voiceinfo: voice messages sent
  • Voicetranstext: text converted from voice messages
  • Chatroom: WeChat group information
  • Message: chat records
  • Harddevicerankinfo: hardware device information ~ WeChat exercise (step count) data
  • Emojigroupinfo: sticker pack group information
  • Rcontact: contact information
  • friend_ext: information about friends
  • Sportstepitem: personal step counts ~ there may be multiple records per day

This analysis mainly uses three tables: chatroom, message, and rcontact. Using sqlcipher.exe and the password obtained earlier, the tables are exported in CSV format.

2 data preprocessing

After obtaining the relevant data, we import it into pandas for some preprocessing to facilitate the subsequent analysis.

import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


pd.options.display.max_rows = 10
plt.rcParams['font.sans-serif']=['SimHei']
sns.set(font='SimHei')

%matplotlib inline

Define some variables that will be used later:

  • my: my own WeChat ID
my = 'wxid_********22'

2.1 data reading

Note when importing data:

  • First, convert the CSV files to UTF-8 locally (importing them with pandas as GB2312 raised an error);
  • For message, set index_col=6 so that the time a message was sent or received serves as the index, which facilitates subsequent analysis.
chatroom = pd.read_csv('chatroom.csv')
message  = pd.read_csv('message.csv', index_col=6)
rcontact = pd.read_csv('rcontact.csv')

The data to be used in each dataframe and its meaning are as follows:

  1. chatroom: WeChat group information
    • chatroomname: WeChat group name ~ generated automatically by WeChat ~ unique
    • memberlist: member list ~ list of WeChat IDs
    • roomowner: WeChat ID of the group owner
    • memberCount: number of group members (may be wrong)
  2. message: chat records
    • type: message type
    • isSend: whether the message was sent or received
    • createTime: time the message was sent ~ already set as the index
    • talker: chat partner
    • content: chat content
  3. rcontact: contacts, which in fact means everyone I have been in contact with (including non-friends in WeChat groups)
    • username: user name ~ WeChat ID ~ generated automatically
    • alias: alias ~ the modified WeChat ID
    • conRemark: remark name
    • nickname: nickname
    • type: contact type
chatroom = chatroom[['chatroomname', 'memberlist', 'roomowner', 'memberCount']]
chatroom.head()
chatroomname memberlist roomowner memberCount
0 5604**@chatroom wxid_****22;fj*71;wxid_82… Gr**92
1 1202**@chatroom Gr**92;lk**09;zh**75;sh… su***e
2 7766*****@chatroom ww4;cha**;wxid_***22;wxid_… dc****25
3 1682****@chatroom wxid_;wxid_21;wxid_2… wxid_****21
4 4346****@chatroom wxid_*22;wxid_22;wxid_2… wxid_*****22
message = message[['type', 'isSend', 'talker', 'content']]
message.head()
createTime type isSend talker content
1537709186000 318767153 0.0 weixin \n \n …
1537709314000 1 0.0 weixin Welcome back to wechat. If you have any questions or suggestions in the process of using, please remember to send me feedback.
1534500421000 1 0.0 5604****@chatroom It’s raining again [cover your face]
1534500558000 1 0.0 5604****@chatroom Let’s have a meal first
1534500839000 1 0.0 5604****@chatroom XL * * 45: would you like to play
rcontact = rcontact[['username', 'alias', 'conRemark', 'nickname', 'type']]
rcontact.head()
username alias conRemark nickname type
0 filehelper NaN NaN File transfer assistant 1
1 qqmail NaN NaN QQ email reminder 33
2 floatbottle NaN NaN Drifting bottle 33
3 shakeapp NaN NaN Shake it 33
4 lbsapp NaN NaN people nearby 33

2.2 rcontact preprocessing

rcontact.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4569 entries, 0 to 4568
Data columns (total 5 columns):
username     4568 non-null object
alias        1036 non-null object
conRemark    389 non-null object
nickname     4498 non-null object
type         4569 non-null int64
dtypes: int64(1), object(4)
memory usage: 178.6+ KB

For some contact objects (such as group chats), it is perfectly normal to have no alias or conRemark, so these two columns are filled directly with 'EMPTY'. Only one contact has an empty username, and it can be deleted directly.

rcontact.dropna(subset=['username'], inplace=True)
rcontact.fillna({'alias': 'EMPTY', 'conRemark': 'EMPTY'}, inplace=True)
rcontact[rcontact.nickname.isnull()]
username alias conRemark nickname type
436 5479****@chatroom EMPTY EMPTY NaN 2
445 fake_1573021893227 EMPTY EMPTY NaN 0
1261 fake_1538661262204 EMPTY EMPTY NaN 0
1340 fake_1541554264077 EMPTY EMPTY NaN 0
1342 fake_1539142500399 EMPTY EMPTY NaN 0
4174 fake_1576654843913 EMPTY EMPTY NaN 0
4205 fake_1577183589467 EMPTY EMPTY NaN 0
4217 fake_1577447590269 EMPTY EMPTY NaN 0
4328 fake_1578041960745 EMPTY EMPTY NaN 0
4501 fake_1580712150796 EMPTY EMPTY NaN 0

70 rows × 5 columns

As for nickname, 70 contacts have an empty value: 63 have a username beginning with 'fake_', 4 are group chats, and 3 are other users. This data is kept for now and filled in as 'ALIEN'.

nan = rcontact.nickname.isnull()
name = rcontact.username.str

print('fake_: ', rcontact[nan & name.startswith('fake_') ]['username'].count())
print('@chatroom: ', rcontact[nan & name.endswith('@chatroom') ]['username'].count())
fake_:  63
@chatroom:  4
rcontact.fillna({'nickname': 'ALIEN'}, inplace=True)

According to my own collation and inference, the values of rcontact.type mean the following:

  • 0: mini programs that have been used
  • 1: friends who added me ~ official accounts
  • 2: group chats
  • 3: friends I actively added ~ official accounts
  • 4: non-friends in WeChat groups
  • 7: friends I chat with frequently
  • 8, 9, 10, 11: friends I deleted or who deleted me
  • 33: official WeChat accounts (built-in features)
  • 259: friends not allowed to see my Moments
  • 2051: pinned friends
  • 8193: friends I have never chatted with
  • 65536, 65537, 65539: friends whose Moments are hidden in one direction or the other
rcontact[rcontact.username.str.endswith('@chatroom')].type.value_counts()
2    92
0     2
Name: type, dtype: int64

In other words, two WeChat groups in the local database have an incorrectly marked type, so we correct them.

rcontact.loc[rcontact.username.str.endswith('@chatroom'), 'type'] = 2

To simplify later operations, the WeChat contact types are reduced to a contact_type column with three values: ['friend', 'non friend', 'group chat'].

contact_dict = {1: 'friend', 2: 'group chat', 3: 'friend', 4: 'non friend', 7: 'friend', 8: 'non friend', 9: 'non friend', 10: 'non friend',
Not friends 2697
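
The mapping step above is cut off in this copy, so the following is only a minimal sketch of how such a simplification could work with Series.map. The dictionary name contact_dict_sketch, the toy data, and the fallback label for unlisted codes are assumptions for illustration, not the author's full mapping.

```python
import pandas as pd

# Codes 1-10 follow the visible part of the article's dictionary.
# Everything else here (name, toy rows, fallback label) is hypothetical.
contact_dict_sketch = {
    1: 'friend', 2: 'group chat', 3: 'friend', 4: 'non friend',
    7: 'friend', 8: 'non friend', 9: 'non friend', 10: 'non friend',
}

toy = pd.DataFrame({
    'username': ['a', 'b@chatroom', 'c', 'd'],
    'type': [1, 2, 4, 12345],
})

# Series.map yields NaN for codes missing from the dictionary, so an
# explicit fallback keeps unknown codes from silently becoming NaN.
toy['contact_type'] = toy['type'].map(contact_dict_sketch).fillna('non friend')
print(toy['contact_type'].value_counts())
```

Checking value_counts() after the mapping is a quick way to confirm that the three categories add up to the number of contacts.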

2.3 chatroom treatment

According to the imported data, there are 94 group chats. To get more detailed data about them, chatroom needs to be merged with rcontact.

chatroom = pd.merge(chatroom, rcontact, left_on='chatroomname', right_on='username')

chatroom.groupby(['alias', 'conRemark'])['chatroomname'].count()
alias  conRemark
EMPTY  EMPTY        94
Name: chatroomname, dtype: int64

In other words, the alias and conRemark columns contain only the filled-in values. In fact, WeChat groups have no concept of an alias or a remark, so these columns are removed directly. In addition, chatroomname and username duplicate each other, so one of them is removed.

chatroom.drop(columns=['alias', 'conRemark', 'username'], inplace=True)
chatroom.head()
chatroomname memberlist roomowner memberCount nickname type contact_type
0 5712****@chatroom wxid_***… Gr****92 18 Make complaints about the first meeting of Tucao 2 Group chat
1 1202****@chatroom Gr*6… s****e 24 2018?? 2 Group chat
2 7766***@chatroom ww*_… d**** 166 Construction investment company 2 Group chat
3 1682****@chatroom wxid_… wxid_** 9 2017 2 Group chat
4 4346****@chatroom wxid_… wxid_****22 52 Youth without troubles 2 Group chat
chatroom['memberCount'].value_counts()
-1      9
 7      5
 11     5
 10     5
 18     4
       ..
 36     1
 25     1
 230    1
 34     1
 418    1
Name: memberCount, Length: 47, dtype: int64

We find that some memberCount values are negative, which is obviously abnormal and needs to be corrected; the number of group members can be obtained by counting the entries in memberlist.

chatroom.memberlist.str.split(';').str.len().value_counts().sort_index()
2      2
3      3
4      4
5      3
6      1
      ..
166    1
206    1
230    1
405    1
418    1
Name: memberlist, Length: 48, dtype: int64
(chatroom.memberlist.str.split(';').str.len() == chatroom.memberCount).sum()
85

The calculation shows that the member count obtained from memberlist equals memberCount in 85 cases. That is, except for the 9 groups where memberCount is -1, the two always agree, so the memberlist count can be used to correct memberCount.

chatroom['memberCount'] = chatroom.memberlist.str.split(';').str.len()
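
As a small illustration of the check above, the sketch below repeats it on made-up data: count the entries of memberlist, confirm that the only disagreements are the -1 placeholders, then overwrite memberCount. The toy values are assumptions.

```python
import pandas as pd

# Toy data (hypothetical): the second group has the -1 placeholder.
toy = pd.DataFrame({
    'memberlist': ['a;b;c', 'a;b', 'x;y;z;w'],
    'memberCount': [3, -1, 4],
})

counted = toy['memberlist'].str.split(';').str.len()
agree = counted == toy['memberCount']

# Every row that disagrees should be a -1 placeholder before overwriting.
only_placeholders_differ = (toy.loc[~agree, 'memberCount'] == -1).all()

toy['memberCount'] = counted
print(toy)
```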

2.4 message preprocessing

The plan is to analyze the chat records of WeChat groups, that is, the records whose talker ends with '@chatroom'.

message = message[message.talker.str.endswith('@chatroom')]

The index createTime is a cumulative time in milliseconds (a Unix timestamp), so it is converted to a common date format.

message.index = pd.to_datetime(message.index, unit='ms', utc=True).tz_convert('Asia/Shanghai')
message.head()
createTime type isSend talker content
2018-08-17 18:07:01+08:00 1 0.0 5604****@chatroom It’s raining again [cover your face]
2018-08-17 18:09:18+08:00 1 0.0 5604****@chatroom Let’s have a meal first
2018-08-17 18:13:59+08:00 1 0.0 2434****@chatroom TC * *: \ n would you like to play
2018-08-17 18:14:04+08:00 3 0.0 5604****@chatroom XL**:\n<img cdnbigimgurl=”null” hd…
2018-08-17 18:14:13+08:00 1 0.0 1285**@chatroom Overtime
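
The conversion above can be checked on two of the raw createTime values shown earlier in the article. This is a minimal sketch; only the variable names are new.

```python
import pandas as pd

# Two raw createTime values (milliseconds) taken from the message table above.
ms = pd.Index([1534500421000, 1534500558000])

# Interpret as UTC milliseconds since the Unix epoch, then shift to Beijing time.
local = pd.to_datetime(ms, unit='ms', utc=True).tz_convert('Asia/Shanghai')
print(local[0])
```

Passing utc=True first and converting afterwards keeps the index timezone-aware, which avoids an implicit eight-hour offset error when grouping by hour later.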
message.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 155485 entries, 2018-08-17 18:07:01+08:00 to 2020-02-16 11:38:21+08:00
Data columns (total 4 columns):
type       155485 non-null int64
isSend     155477 non-null float64
talker     155485 non-null object
content    155415 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 5.9+ MB

In the message table, isSend is missing for 8 items and content for 70 items:

  • isSend: a missing value marks the end of a group call, corresponding to a group voice call that was initiated; these rows can be deleted directly;
  • content: few items are missing, so to keep the analysis focused, the null values are deleted directly.
message.dropna(subset=['isSend', 'content'], inplace=True)

Note that isSend takes three values:

  • 1: a message sent by myself;
  • 0: a message received;
  • 2: other messages, including group voice calls and group invitations initiated by me.
message.isSend.value_counts()
0.0    153785
1.0      1611
2.0        11
Name: isSend, dtype: int64

By inference, the message types corresponding to the different type values are as follows:

  • 1: plain text message
  • 3: ordinary picture
  • 34: voice message
  • 42: official account name card
  • 43: ordinary video
  • 47: sticker (expression pack)
  • 48: location message
  • 49: official account or mini program share
  • 64: group voice call ~ isSend is 2 when the call is initiated and NaN when it ends
  • 10000: recalled message
  • 1048625: favorited stickers
  • 16777265: web page share
  • 436207665: WeChat red envelope
  • 486539313: video forwarded from an official account
  • 520093745: WeChat card pack ~ gift card
  • 570425393: invitation to join a group chat
  • 587202609: mini program message ~ game
  • 805306417: WeChat Jielong (group sign-up chain)
  • 822083633: quoted reply message
  • -1879048186: location sharing
In order to facilitate the subsequent processing, the type is simplified

message_dict = {1: 'word', 3: 'picture', 34: 'voice', 42: 'official account', 43: 'video', 47: 'expression', 48: 'location',

In addition, the content column contains not only the chat content but also the speaker's WeChat ID, with the two separated by ':'. The WeChat ID contains only the characters [a-zA-Z0-9_\-].

map is suitable for mapping a single column, but the type information is also needed here to help decide, so it does not fit. Splitting the text directly on ':' would treat non-user messages as user messages, and messages without ':' as my own. The split is therefore implemented with the help of the apply function.

regex = re.compile(r'([a-zA-Z0-9_\-]+):(.*)', flags=re.S)
message['username'], message['real_content'] = zip(*message.apply(split_content, axis=1))
message[message.username == 'None User'].type.value_counts()
10000        1745
570425393     780
64              8
Name: type, dtype: int64

In other words, only 2533 messages, namely recalled messages, group invitations, and group calls, are marked as 'None User'.
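
The definition of split_content does not appear in this copy of the article. The sketch below is one hypothetical way to write it, based only on the rules described above: system message types go to 'None User', my own messages (isSend equal to 1) carry no ID prefix, and everything else is split with the regular expression. The set SYSTEM_TYPES, the placeholder ID, and the toy rows are assumptions.

```python
import re
import pandas as pd

regex = re.compile(r'([a-zA-Z0-9_\-]+):(.*)', flags=re.S)
my = 'wxid_me'                          # placeholder for my own WeChat ID
SYSTEM_TYPES = {64, 10000, 570425393}   # group call, recall, group invitation

def split_content(row):
    """Hypothetical helper: return (username, real_content) for one row."""
    if row['type'] in SYSTEM_TYPES:
        return 'None User', row['content']
    if row['isSend'] == 1:
        return my, row['content']       # own messages have no 'id:' prefix
    match = regex.match(row['content'])
    if match:
        return match.group(1), match.group(2).lstrip('\n')
    return 'None User', row['content']

toy = pd.DataFrame({
    'type': [1, 1, 10000],
    'isSend': [0.0, 1.0, 0.0],
    'content': ['xl_45:\nwould you like to play', 'Overtime',
                '"abc" recalled a message'],
})
# apply returns a Series of tuples; zip(*...) unpacks it into two columns.
toy['username'], toy['real_content'] = zip(*toy.apply(split_content, axis=1))
print(toy[['username', 'real_content']])
```

Checking the type before applying the regular expression matters because system messages can themselves contain a colon.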

3 preliminary analysis

A simple preliminary analysis is carried out:

  1. Speech frequency ~ analyze how often contacts speak, comparing friends and non-friends
  2. Speech types ~ analyze how message types differ according to group size

3.1 speech frequency analysis

First, count how often each contact speaks, then merge the result with rcontact.

people = message.groupby('username')['real_content'].count()
people = pd.merge(people, rcontact, left_index=True, right_on='username', how='right')
people.head()
real_content username alias conRemark nickname type contact_type
0 NaN filehelper EMPTY EMPTY File transfer assistant 1 Friends
48 1611.0 wxid_*****22 **** EMPTY M** 1 Friends
49 NaN gh_** z*u EMPTY Z * meeting 3 Friends
51 NaN gh_* ls** EMPTY Dragon * NET 3 Friends
52 NaN wxid_** EMPTY Big* Learning* 1 Friends
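
The NaN values in the output above come from contacts who never appear in the message table. A minimal sketch on made-up data shows the same effect; it uses a left merge from the contact side, which keeps every contact just as the right merge in the article does. All names and values here are illustrative assumptions.

```python
import pandas as pd

# Toy data (hypothetical): u3 is a contact who never spoke.
message = pd.DataFrame({
    'username': ['u1', 'u1', 'u2'],
    'real_content': ['hi', 'ok', 'yes'],
})
rcontact = pd.DataFrame({
    'username': ['u1', 'u2', 'u3'],
    'nickname': ['A', 'B', 'C'],
})

# Messages per speaker, as a Series indexed by username.
counts = message.groupby('username')['real_content'].count().rename('counts')

# Keep every contact; speakers missing from `counts` get NaN.
people = rcontact.merge(counts, left_on='username', right_index=True, how='left')
print(people)
```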

Rename the 'real_content' column to 'counts'. An empty value in this column means the contact has never spoken, so it can be filled with 0 directly.

people.columns = ['counts', 'username', 'alias', 'conRemark', 'nickname', 'type', 'contact_type']
people.fillna(0, inplace=True)
people.sort_values(by='counts').tail(10)
counts username alias conRemark nickname type contact_type
2888 2432.0 wxid_** li**h EMPTY Sleeping rain** 4 Not friends
2459 2473.0 wa** EMPTY EMPTY Small business* 4 Not friends
2815 2677.0 wxid_** a5** EMPTY Bai Tao** 4 Not friends
2955 2757.0 wxid_** Y** EMPTY Flowers** 4 Not friends
3125 2976.0 O* EMPTY EMPTY Happy** 4 Not friends
3115 4117.0 wxid_** EMPTY EMPTY Gold** 4 Not friends
357 5264.0 wxid_** Y** Flying** Big** 3 Friends
2869 5300.0 H* EMPTY EMPTY T* 4 Not friends
2920 5646.0 wxid_** EMPTY EMPTY rap** 4 Not friends
2790 10204.0 h** s** EMPTY Beans* 4 Not friends

That is to say, among the 10 people who speak the most (None User excluded), only one is my friend; the rest are non-friends from the same WeChat groups.

And I, despite having joined the most groups, rank only 23rd by number of messages, ha ha.

rank = people['counts'].rank(method='max')

rank.max() - rank[people[people.username == my].index] + 1
48    23.0
Name: counts, dtype: float64
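
The expression rank.max() - rank + 1 turns an ascending rank into a position counted from the top. A small sketch on made-up counts shows that it agrees with ranking in descending order directly; the toy values are assumptions.

```python
import pandas as pd

# Toy message counts (hypothetical); 'b' and 'c' are tied.
counts = pd.Series([0, 5, 5, 12, 30], index=['a', 'b', 'c', 'd', 'e'])

# Ascending rank where ties take the highest rank of the tied block.
rank = counts.rank(method='max')

# Flip it: 1 now means "most messages".
position = rank.max() - rank + 1

# Same result obtained by ranking in descending order.
direct = counts.rank(method='min', ascending=False)
print(pd.DataFrame({'counts': counts, 'position': position, 'direct': direct}))
```

With method='max' the largest value always receives rank n, so the flipped value equals one plus the number of contacts who spoke strictly more.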

Let's look at the total, mean, median, and standard deviation of the number of messages for each contact type.

people.groupby('contact_type')['counts'].agg(['sum', 'mean', 'median', 'std', 'count'])
contact_type sum mean median std count
Friends 58605.0 106.168478 0.0 354.541844 552
Group chat 0.0 0.000000 0.0 0.000000 94
Not friends 94269.0 34.953281 0.0 309.216804 2697

We can clearly see the difference between friends and non-friends: friends speak significantly more on average than non-friends. What the two have in common is a large variance and a high proportion of lurkers.

people[people.counts == 0].groupby('contact_type')['username'].count() / people.contact_type.value_counts()
Friend 0.54529

When analyzing individual chat partners, the WeChat group entries can be removed directly.

people = people[people.contact_type != 'group chat']

Group users by number of messages into [0, 1) < [1, 10) < [10, 100) < [100, 1000) < [1000, 10205), then view the number of people in each group and draw a bar chart.

# bin edges taken from the intervals listed above
people_bins = [0, 1, 10, 100, 1000, 10205]
people_groups = ['depth charge', 'lurker', 'small bubble', 'activist', 'senior talker']
people['cat'] = pd.cut(people['counts'], people_bins, labels=people_groups,right=False)
people['cat'].value_counts()
Depth charge 1988
cat_counts = pd.crosstab(people.cat, people.contact_type)
cat_counts.plot(kind='bar')

[Figure: bar chart of the number of contacts in each message-count group, split by contact type]

That is to say, whether they are friends or not, the majority of them do not speak.
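
The binning used in this section relies on pd.cut with right=False, which makes each interval closed on the left and open on the right. The sketch below applies the article's bin edges to a few made-up counts to show where the boundary values land.

```python
import pandas as pd

# Toy message counts (hypothetical), chosen to sit on the bin boundaries.
counts = pd.Series([0, 1, 9, 10, 250, 10204])

bins = [0, 1, 10, 100, 1000, 10205]
labels = ['depth charge', 'lurker', 'small bubble', 'activist', 'senior talker']

# right=False gives [0, 1), [1, 10), ... so a count of 0 falls in the
# first bin and a count of 10 falls in the third.
cat = pd.cut(counts, bins, labels=labels, right=False)
print(pd.DataFrame({'counts': counts, 'cat': cat}))
```

The last edge has to be strictly greater than the largest count, which is why the article uses 10205 for a maximum of 10204.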

3.2 analysis of speech types

Combine group chat and speech, and select the data column you need

room = pd.merge(chatroom, message, left_on='chatroomname', right_on='talker')
room = room[['chatroomname', 'memberCount', 'message_type']]

Group the chats by group size into (0, 10] < (10, 20] < (20, 50] < (50, 100] < (100, 419] to view the message type preferences of groups of different sizes.

# bin edges taken from the intervals listed above
room_bins = [0, 10, 20, 50, 100, 419]
room_groups = ['mini group', 'small group', 'medium group', 'large group', 'giant group']
room['cat'] = pd.cut(room['memberCount'], room_bins, labels=room_groups)

Message type preferences of groups of different sizes:

room_total = room.groupby('cat')['chatroomname'].count()
talk_count = pd.crosstab(room.cat, room.message_type)
talk_count
message_type  location  official account  join group  picture  recall   word  red envelope  web page  expression  video  voice  call
cat
Mini group          19               365          57      891     145   7920           207         1         523     90     68     0
Small group         56               619          90     1751     396  28293           328         0        2650    205    235     2
Medium group        27               405         235      733     264  10278            49         0         536     88    492     0
Large group          6               346         103     1562     159   4053           465         6         183    384    600     0
Giant group         32              1073         295     4011     798  75176           241         4        6999    500    387     6
talk_per = talk_count.div(room_total, axis=0) * 100
talk_per
message_type  location  official account  join group    picture    recall       word  red envelope  web page  expression     video     voice      call
cat
Mini group    0.184717          3.548513    0.554151   8.662259  1.409683  76.997861      2.012444  0.009722    5.084581  0.874976  0.661093  0.000000
Small group   0.161733          1.787726    0.259928   5.057040  1.143682  81.712635      0.947292  0.000000    7.653430  0.592058  0.678700  0.005776
Medium group  0.205997          3.089952    1.792935   5.592432  2.014191  78.416114      0.373846  0.000000    4.089418  0.671397  3.753719  0.000000
Large group   0.076268          4.398119    1.309267  19.855091  2.021101  51.519003      5.910766  0.076268    2.326173  4.881149  7.626795  0.000000
Giant group   0.035745          1.198588    0.329528   4.480463  0.891401  83.974889      0.269208  0.004468    7.818190  0.558522  0.432296  0.006702
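
The percentage table above divides each row of the count table by that row's total. The sketch below repeats the calculation on made-up data and compares it with the normalize='index' option of pd.crosstab, which should give the same shares in one step. The toy rows are assumptions.

```python
import pandas as pd

# Toy data (hypothetical): one row per message.
room = pd.DataFrame({
    'cat': ['small', 'small', 'small', 'large', 'large'],
    'message_type': ['word', 'word', 'picture', 'word', 'voice'],
})

# Approach used in the article: counts divided by the per-group total.
talk_count = pd.crosstab(room['cat'], room['message_type'])
room_total = room.groupby('cat')['message_type'].count()
talk_per = talk_count.div(room_total, axis=0) * 100

# Alternative: let crosstab normalize each row directly.
talk_per_alt = pd.crosstab(room['cat'], room['message_type'],
                           normalize='index') * 100
print(talk_per.round(2))
```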

A heat map is drawn from the percentage of each message type within each group category.

f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(np.sqrt(np.sqrt(talk_per)), annot=True, linewidths=.5, ax=ax, cmap='Set3')

[Figure: heat map of message type percentages by group size]

In terms of proportion, text messages dominate in every group, and the proportions differ too widely to show on one color scale, so the values are plotted after taking the square root twice (that is, the fourth root).
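
To see why the double square root helps, the sketch below applies it to four percentages taken from the largest-group row of the table above. The point is only that the spread between the biggest and smallest value shrinks from four orders of magnitude to roughly one.

```python
import numpy as np

# Percentages from the largest-group row: word, expression, voice, web page.
shares = np.array([83.974889, 7.818190, 0.432296, 0.004468])

# Two square roots in a row are the same as a fourth root.
scaled = np.sqrt(np.sqrt(shares))

ratio_before = shares.max() / shares.min()
ratio_after = scaled.max() / scaled.min()
print(ratio_before, ratio_after)
```

The transform keeps the ordering of the values and maps zero to zero, so empty cells in the heat map remain the lowest color.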

From the results, apart from text messages, pictures and expressions are the most commonly used message types; mini, small and medium groups share locations more often, and medium and large groups recall more messages!

This part mainly focuses on data preprocessing, with relatively little actual analysis; the follow-up analysis will be based on the time and content of the messages.

4 references

  1. Wei Yang’s blog, Wei Yang, Wechat chat record export to computer TXT file tutorial, 2020/3/2.
  2. Where hangcom writes, hangcom, Wechat chat record export – release, 2020/3/2.
  3. Asher117, [Python] dataframe splits one column into multiple columns and one row into multiple rows, 2020/3/2.
  4. The wind is shallow and calm, Chinese display of Matplotlib and Seaborn, 2020/3/2.

This work is licensed under a CC license. Reprints must credit the author and include a link to this article.