A simple analysis of 150,000 Juejin boiling points (2)

Time: 2021-7-17

1. Data preprocessing and storage

After getting the original data, the next step is to clean and store it.

1.1 data model

Because this is a simple analysis, the model covers only three parts: topic, user, and message. The details are as follows:

class Pins(object):
    """
    Boiling point
    """
    msg_id = None            # boiling point ID
    topic_id = None          # topic ID
    topic_title = None       # topic name
    user_id = None           # user ID
    user_name = None         # user name
    msg_content = None       # boiling point content
    msg_ctime = None         # boiling point creation time
    msg_digg_count = 0       # number of likes
    msg_comment_count = 0    # number of comments

    def __repr__(self):
        return '<Pins: %s>' % self.msg_id
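
The loader in section 1.3 calls Pins().parse_from_item(item), which the article does not show. Below is a minimal sketch of what such a method might look like. The nested key names (msg_Info, author_user_info, topic and their children) are assumptions about the crawled JSON layout, not something the article confirms, so adjust them to the real data.

```python
class Pins(object):
    """Boiling point record, with the same fields as the data model."""
    msg_id = None
    topic_id = None
    topic_title = None
    user_id = None
    user_name = None
    msg_content = None
    msg_ctime = None
    msg_digg_count = 0
    msg_comment_count = 0

    def parse_from_item(self, item):
        """Fill the fields from one raw JSON item and return self.

        The key names used here are assumed, not taken from the article.
        """
        msg = item.get('msg_Info') or {}
        user = item.get('author_user_info') or {}
        topic = item.get('topic') or {}

        self.msg_id = item.get('msg_id')
        # A boiling point may have no topic; fall back to empty strings
        self.topic_id = topic.get('topic_id', '')
        self.topic_title = topic.get('title', '')
        self.user_id = user.get('user_id')
        self.user_name = user.get('user_name')
        self.msg_content = msg.get('content', '')
        self.msg_ctime = msg.get('ctime')
        self.msg_digg_count = int(msg.get('digg_count', 0))
        self.msg_comment_count = int(msg.get('comment_count', 0))
        return self

    def __repr__(self):
        return '<Pins: %s>' % self.msg_id
```

Returning self lets the call be chained as Pins().parse_from_item(item), which matches how the loader uses it.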

1.2 database table creation

The database is MySQL. Because the boiling point msg_content can contain emoji, the table must be created with the utf8mb4 character set.

The table creation SQL statement is as follows:

CREATE SCHEMA `juejin` DEFAULT CHARACTER SET utf8mb4 ;

CREATE TABLE `juejin`.`pins` (
  `msg_id` VARCHAR(20) NOT NULL COMMENT 'message ID',
  `topic_id` VARCHAR(20) NOT NULL COMMENT 'topic ID',
  `topic_title` VARCHAR(16) NOT NULL COMMENT 'topic name',
  `user_id` VARCHAR(20) NOT NULL COMMENT 'user ID',
  `user_name` VARCHAR(32) NOT NULL COMMENT 'user nickname',
  `msg_content` TEXT CHARACTER SET 'utf8mb4' NOT NULL COMMENT 'message content',
  `msg_ctime` VARCHAR(16) NOT NULL COMMENT 'message creation timestamp',
  `msg_digg_count` INT(11) NOT NULL COMMENT 'message likes',
  `msg_comment_count` INT(11) NOT NULL COMMENT 'number of message comments',
  `msg_create` DATETIME NOT NULL DEFAULT NOW() COMMENT 'message creation time (same as the msg_ctime timestamp)',
  PRIMARY KEY (`msg_id`));
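
To see why utf8mb4 is needed for emoji, it helps to look at byte lengths. The small check below is only an illustration: an emoji takes 4 bytes in UTF-8, while MySQL's legacy utf8 character set stores at most 3 bytes per character. The two sample characters are arbitrary choices.

```python
# An emoji lies outside the Basic Multilingual Plane, so UTF-8 needs 4 bytes for it.
emoji = '\U0001F600'   # grinning face
cjk = '\u6cb8'         # a common Chinese character, inside the BMP

emoji_len = len(emoji.encode('utf-8'))
cjk_len = len(cjk.encode('utf-8'))

print(emoji_len)  # 4: does not fit MySQL's 3-byte utf8, needs utf8mb4
print(cjk_len)    # 3: fits either character set
```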

1.3 reading and storage of original data

As mentioned above, all boiling point data has been saved to the json_data folder. We only need to traverse and read every JSON file in that folder, do some simple processing, and then store the records in the database.

The sample code is as follows:

import json
import os


def read_all_data():
    """
    Traverse and read all JSON data, then store it in the database
    :return: list of all parsed Pins objects
    """
    pins_list = []
    for dirpath, dirnames, filenames in os.walk('./json_data'):
        # Order the files by characters 5-8 of the file name
        filenames = sorted(filenames, key=lambda name: name[5:9])
        for filename in filenames:
            # Join with dirpath so files in subfolders resolve correctly
            filename = os.path.join(dirpath, filename)
            print(filename)
            with open(filename, 'r', encoding='utf-8') as pins_file:
                items_data = json.load(pins_file)['data']
                for item in items_data:
                    pins = Pins().parse_from_item(item)
                    pins_list.append(pins)
                    insert_db([pins])
    return pins_list
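
The insert_db helper is not shown in the article. The sketch below is one possible shape for it, under a few assumptions: it takes an explicit DB-API connection as a second argument (the article's version takes only the list), the driver uses %s placeholders as mysqlclient and PyMySQL do, and msg_create is left to its column default. The names FIELDS, INSERT_SQL and pins_to_row are hypothetical.

```python
# Columns written by the loader; msg_create is filled by its DEFAULT NOW()
FIELDS = (
    'msg_id', 'topic_id', 'topic_title', 'user_id', 'user_name',
    'msg_content', 'msg_ctime', 'msg_digg_count', 'msg_comment_count',
)

# Parameterized statement, so message text is never spliced into the SQL
INSERT_SQL = 'INSERT INTO pins (%s) VALUES (%s)' % (
    ', '.join(FIELDS),
    ', '.join(['%s'] * len(FIELDS)),
)


def pins_to_row(pins):
    """Turn one Pins-like object into a tuple ordered like FIELDS."""
    return tuple(getattr(pins, name) for name in FIELDS)


def insert_db(pins_list, conn):
    """Insert a batch of records through a DB-API connection (assumed signature)."""
    rows = [pins_to_row(p) for p in pins_list]
    cursor = conn.cursor()
    try:
        cursor.executemany(INSERT_SQL, rows)
        conn.commit()
    finally:
        cursor.close()
    return len(rows)
```

Passing a larger batch per call would reduce round trips compared with inserting one record at a time.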

Finally, the database table is shown in the figure below.

[Figure: the pins table in the database]

2. Introduction to Superset

The official description is as follows: A modern, enterprise-ready business intelligence web application.

First, some impressions from using Superset in company projects. We mainly embed the configured charts into other pages as iframes, because building each chart by hand is time-consuming and laborious.

  • ① The first problem we ran into was permissions. To keep up with the schedule at the time, we simply granted all read permissions to the Public role, which is a data security risk.
  • ② Superset makes it easy to generate an iframe, but the drawback is that the iframe code has to be updated every time the chart is modified.
  • ③ Because it is a general-purpose tool, many specialized features are missing or hard to implement, such as data drill-down.
  • ④ Charts are rendered with D3.js, and I feel the style does not match domestic tastes. Fortunately, Superset is open source and can be extended, for example with ECharts.

Overall, configuration and use are fairly convenient. After all, it is free, so don't ask for too much.

2.1 installation

Following the official documentation, we install Superset using the OS dependencies approach.

Follow the document step by step. For how to use virtualenv, refer to its official documentation.

Superset itself is installed with pip: pip install apache-superset. The latest version at the time of writing is 0.37.0.

Finally, load the official examples with superset load_examples, then start the development server: `superset run -p 8088 --with-threads --reload --debugger`.

In theory, opening http://127.0.0.1:8088/superset/dashboard/births/ should now show the following:

[Figure: the births example dashboard]

2.2 official documents

The official documentation is a must-read: http://superset.apache.org/

3. Chart construction based on Superset

Before making charts, we need to set a few goals, that is, decide what questions we want the data to answer.

Let's build charts for the following six topics.

  • Daily boiling point histogram
  • Curve of total boiling point with time
  • Pie chart of the top 10 boiling point topics by share
  • Top 25 users with the most published boiling points
  • Top 25 boiling points with the most comments
  • Top 25 boiling points with the most likes

3.0 preparation for chart making

Superset charts can be generated directly from a database table. Here we choose a more general approach: in SQL Lab -> SQL Editor, we fetch the target data directly with SQL.
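
As an illustration of the kind of aggregate query one might write in SQL Lab for the daily chart, here is a sketch. It runs against an in-memory SQLite database only so that the example is self-contained. The query is an assumption, not the author's actual SQL, and in MySQL the timestamp conversion would use FROM_UNIXTIME rather than SQLite's date(..., 'unixepoch'). The three sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE pins (msg_id TEXT PRIMARY KEY, msg_ctime TEXT)")
conn.executemany(
    "INSERT INTO pins VALUES (?, ?)",
    [
        ('1', '1594944000'),  # 2020-07-17 00:00:00 UTC
        ('2', '1594947600'),  # 2020-07-17 01:00:00 UTC
        ('3', '1595030400'),  # 2020-07-18 00:00:00 UTC
    ],
)

# msg_ctime is stored as text, so cast it before converting to a date
DAILY_SQL = """
SELECT date(CAST(msg_ctime AS INTEGER), 'unixepoch') AS day,
       COUNT(*) AS pins_count
FROM pins
GROUP BY day
ORDER BY day
"""

daily = conn.execute(DAILY_SQL).fetchall()
print(daily)  # [('2020-07-17', 2), ('2020-07-18', 1)]
```

The result has one row per day, which is the shape a time-series bar chart needs.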

3.0.1 add database link

The connection format is a SQLAlchemy URI. Anyone who has used Python will probably be familiar with this ORM. If you are interested, see the official documentation: https://www.sqlalchemy.org/

On first configuration, the error Could not load database driver: mysql appears. Run pip install mysqlclient to install the MySQL driver.
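
For reference, a SQLAlchemy URI for MySQL has the form mysql://user:password@host:port/database. The helper below is a small sketch for building one. The credentials and host are placeholders, mysql_uri is a hypothetical name, and the charset=utf8mb4 query parameter is an optional addition assumed here so that emoji survive the connection.

```python
from urllib.parse import quote_plus


def mysql_uri(user, password, host, port, database):
    """Build a SQLAlchemy URI for MySQL, escaping special characters."""
    return 'mysql://%s:%s@%s:%d/%s?charset=utf8mb4' % (
        quote_plus(user), quote_plus(password), host, port, database)


# Placeholder credentials; '@' and '/' in the password must be percent-encoded
uri = mysql_uri('root', 'p@ss/word', '127.0.0.1', 3306, 'juejin')
print(uri)  # mysql://root:p%40ss%2Fword@127.0.0.1:3306/juejin?charset=utf8mb4
```

The resulting string is what goes into the SQLAlchemy URI field when adding the database.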

3.1 example of chart making

3.1.1 column chart of daily boiling points

[Figure: column chart of daily boiling points]

3.1.2 curve of total boiling point with time

[Figure: curve of total boiling points over time]
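
The total-over-time curve is a running sum of the daily counts. A minimal sketch of that calculation, using made-up daily counts rather than the article's data:

```python
from itertools import accumulate

# (day, number of boiling points published that day); sample values only
daily = [('2020-07-17', 2), ('2020-07-18', 1), ('2020-07-19', 4)]

days = [day for day, _ in daily]
totals = list(accumulate(count for _, count in daily))
cumulative = list(zip(days, totals))

print(cumulative)  # [('2020-07-17', 2), ('2020-07-18', 3), ('2020-07-19', 7)]
```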

3.1.3 pie chart of the top 10 boiling point topics by share

Boiling points without a topic are excluded from these statistics.

[Figure: pie chart of the top 10 boiling point topics]
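
The topic statistic can be sketched in a few lines: drop boiling points with no topic, count the rest, and keep the ten most common. The function name topic_share and the sample titles are hypothetical, and the share is assumed to be relative to all boiling points that have a topic.

```python
from collections import Counter


def topic_share(topic_titles, top_n=10):
    """Return (title, count, share) for the most common non-empty topics."""
    counts = Counter(t for t in topic_titles if t)  # skips '' and None
    total = sum(counts.values())
    return [(title, n, n / total) for title, n in counts.most_common(top_n)]


sample = ['A', 'A', '', 'B', None, 'A', 'B', 'C']
result = topic_share(sample)
for title, n, share in result:
    print(title, n, round(share, 2))
```

With the sample above, topic A accounts for 3 of the 6 boiling points that have a topic, so its share is 0.5.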

3.1.4 top 25 users with the most publications

[Figure: top 25 users with the most publications]

3.1.5 top 25 boiling points with the most comments

[Figure: top 25 boiling points with the most comments]

3.1.6 top 25 boiling points with the most likes

However, the top two boiling points are suspected of soliciting likes.

[Figure: top 25 boiling points with the most likes]

3.2 using the created chart to create a dashboard

[Figure: dashboard built from the charts above]

4. Postscript

A possible follow-up is to analyze the data in more depth and along more dimensions, for example by using jieba word segmentation + wordcloud to build a word cloud of the keywords in boiling point content.

If possible, a dedicated background service could capture and update the boiling point data on a schedule and feed a large-screen data display.

Shocking! One quarter of all topic posts are about "fishing" (slacking off).

The source code has been uploaded to GitHub and Gitee.