1. Data preprocessing and storage
After getting the original data, the next step is to clean and store it.
1.1 data model
Since this is only a simple analysis, the data model keeps just three parts: the message (the boiling point itself), its topic, and its author. The details are as follows:
```python
class Pins(object):
    """Boiling point (pin) model."""
    msg_id = None             # message ID
    topic_id = None           # topic ID
    topic_title = None        # topic name
    user_id = None            # user ID
    user_name = None          # user nickname
    msg_content = None        # message content
    msg_ctime = None          # message creation timestamp
    msg_digg_count = 0        # number of likes
    msg_comment_count = 0     # number of comments

    def __repr__(self):
        return '<Pins: %s>' % self.msg_id
```
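The reader code in section 1.3 calls a `parse_from_item` method that maps one raw JSON item onto this model. The original implementation is not shown, so here is a hypothetical sketch; the item key names (`topic`, `author_user_info`, `msg_info`, ...) are assumptions about the crawled payload, not a documented API.

```python
class Pins(object):
    """Boiling point model with a hypothetical parse_from_item sketch."""
    msg_id = None
    topic_id = None
    topic_title = None
    user_id = None
    user_name = None
    msg_content = None
    msg_ctime = None
    msg_digg_count = 0
    msg_comment_count = 0

    def parse_from_item(self, item):
        # NOTE: the key names below are assumptions about the crawled
        # JSON, not the documented Juejin API.
        self.msg_id = item.get('msg_id')
        topic = item.get('topic') or {}
        self.topic_id = topic.get('topic_id')
        self.topic_title = topic.get('title')
        author = item.get('author_user_info') or {}
        self.user_id = author.get('user_id')
        self.user_name = author.get('user_name')
        msg = item.get('msg_info') or {}
        self.msg_content = msg.get('content')
        self.msg_ctime = msg.get('ctime')
        self.msg_digg_count = msg.get('digg_count', 0)
        self.msg_comment_count = msg.get('comment_count', 0)
        return self

# Quick demo on a hand-made item dict
demo = Pins().parse_from_item({
    'msg_id': '42',
    'topic': {'topic_id': 't1', 'title': 'treehole'},
    'author_user_info': {'user_id': 'u1', 'user_name': 'alice'},
    'msg_info': {'content': 'hi', 'ctime': '1600000000',
                 'digg_count': 2, 'comment_count': 1},
})
print(demo.msg_id, demo.topic_title, demo.msg_digg_count)  # -> 42 treehole 2
```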
1.2 database table creation
For storage we use MySQL. Because boiling point content can contain emoji, the tables must be created with the utf8mb4 character set. The table creation SQL is as follows:
```sql
CREATE SCHEMA `juejin` DEFAULT CHARACTER SET utf8mb4;

CREATE TABLE `juejin`.`pins` (
  `msg_id`            VARCHAR(20) NOT NULL COMMENT 'message ID',
  `topic_id`          VARCHAR(20) NOT NULL COMMENT 'topic ID',
  `topic_title`       VARCHAR(16) NOT NULL COMMENT 'topic name',
  `user_id`           VARCHAR(20) NOT NULL COMMENT 'user ID',
  `user_name`         VARCHAR(32) NOT NULL COMMENT 'user nickname',
  `msg_content`       TEXT CHARACTER SET 'utf8mb4' NOT NULL COMMENT 'message content',
  `msg_ctime`         VARCHAR(16) NOT NULL COMMENT 'message creation timestamp',
  `msg_digg_count`    INT(11) NOT NULL COMMENT 'number of likes',
  `msg_comment_count` INT(11) NOT NULL COMMENT 'number of comments',
  `msg_create`        DATETIME NOT NULL DEFAULT NOW() COMMENT 'message creation time (same as msg_ctime timestamp)',
  PRIMARY KEY (`msg_id`)
);
```
1.3 reading and storage of original data
As mentioned above, we have saved all boiling point data to
json_dataFolder. You just need to traverse and read all the JSON files under the file, do simple processing, and then store them in the database.
The sample code is as follows:
```python
import json
import os


def read_all_data():
    """Traverse all JSON files and insert their records into the database.

    :return: the list of parsed Pins objects
    """
    pins_list = []
    for dirpath, dirnames, filenames in os.walk('./json_data'):
        # Sort the files by the numeric part of their names
        filenames = sorted(filenames, key=lambda name: name[5:9])
        for filename in filenames:
            filename = os.path.join('./json_data', filename)
            print(filename)
            with open(filename, 'r') as pins_file:
                items_data = json.load(pins_file)['data']
            for item in items_data:
                pins = Pins().parse_from_item(item)
                pins_list.append(pins)
                insert_db([pins])
    return pins_list
```
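The `insert_db` helper used above is not shown either. A minimal sketch of what it might do follows; the original presumably takes only the list and connects to MySQL internally (e.g. via mysqlclient), but here a connection is passed in and the demo runs against an in-memory SQLite table so the sketch is self-contained and runnable.

```python
import sqlite3

# Columns in insertion order, matching the `pins` table above.
COLUMNS = ('msg_id', 'topic_id', 'topic_title', 'user_id', 'user_name',
           'msg_content', 'msg_ctime', 'msg_digg_count', 'msg_comment_count')


def insert_db(conn, pins_list):
    """Insert Pins objects with a parameterized statement (upsert on msg_id)."""
    sql = 'INSERT OR REPLACE INTO pins (%s) VALUES (%s)' % (
        ', '.join(COLUMNS), ', '.join('?' * len(COLUMNS)))
    rows = [tuple(getattr(p, c) for c in COLUMNS) for p in pins_list]
    with conn:  # commits on success
        conn.executemany(sql, rows)


# Demo with a stand-in record and an in-memory table mirroring the schema.
class FakePins(object):
    msg_id = '1'
    topic_id = 't1'
    topic_title = 'treehole'
    user_id = 'u1'
    user_name = 'alice'
    msg_content = 'hello'
    msg_ctime = '1600000000'
    msg_digg_count = 3
    msg_comment_count = 1


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE pins (%s, PRIMARY KEY (msg_id))' % ', '.join(COLUMNS))
insert_db(conn, [FakePins()])
count = conn.execute('SELECT COUNT(*) FROM pins').fetchone()[0]
print(count)  # -> 1
```

Using `INSERT OR REPLACE` (or `ON DUPLICATE KEY UPDATE` in MySQL) keyed on `msg_id` makes re-running the import idempotent.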
Finally, the database table is shown in the figure below.
2. Introduction to Superset
The official description is as follows:
A modern, enterprise-ready business intelligence web application.
First, some impressions from using Superset in a company project, where we mainly embedded the configured charts into other pages as iframes; building each chart by hand would have been time-consuming and laborious.
- ① The first problem we hit was permissions. To catch up with the schedule we simply made everything publicly readable, which left a data-security risk.
- ② Superset generates iframes easily, but the disadvantage is that the iframe code has to be updated every time a chart is modified.
- ③ Because it is a general-purpose tool, some features are missing or hard to implement, such as data drill-down.
- ④ Charts are rendered with D3.js, whose default style may not match domestic tastes; fortunately Superset is open source and can be extended with other libraries such as ECharts.
Overall, configuration and use are fairly convenient. After all, it's free; don't ask for too much.
Following the official documentation, we install Superset from scratch, starting with the OS dependencies it lists, and step through the guide in order. Using a virtualenv is recommended; its use is covered in the official documentation.

Superset itself is installed with pip: `pip install apache-superset` (pip pulls the latest released version). After initializing the metadata database and an admin account (`superset db upgrade`, `superset fab create-admin`, `superset init`), load the official examples with `superset load_examples`, then start the development server:

`superset run -p 8088 --with-threads --reload --debugger`
Now open http://127.0.0.1:8088/superset/dashboard/births/ and you should see something like the following figure:
2.2 Official documentation
3. Chart construction based on Superset
Before making a chart, we need to set a few goals, that is, what topics we want to get from the data.
Let's make charts for the following six topics.
- Daily boiling point histogram
- Curve of total boiling point with time
- Boiling point topic proportion pie chart top 10
- Top 25 users with the most published boiling points
- Top 25 boiling points with the most comments
- Top 25 boiling points with the most likes
3.0 preparation for chart making
A Superset chart can be generated directly from a database table. Here we choose a more general approach: writing SQL in SQL Lab -> SQL Editor to fetch the target data directly.
3.0.1 Add a database connection
The connection string format is an SQLAlchemy URI. Anyone who has used Python ORMs will be familiar with SQLAlchemy; if interested, see the official documentation: https://www.sqlalchemy.org/.
On first configuration you may hit a "Could not load database driver: mysql" error. Run `pip install mysqlclient` to install the MySQL driver.
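With the driver installed, the SQLAlchemy URI for our `juejin` database looks like the following (host and credentials are placeholders):

```text
mysql://username:password@127.0.0.1:3306/juejin?charset=utf8mb4
```

The `charset=utf8mb4` query parameter keeps emoji intact on the connection, matching the table character set above.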
3.1 example of chart making
3.1.1 column chart of daily boiling points
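A query along these lines (using the `msg_create` datetime column defined in the table above) would feed the daily count chart:

```sql
SELECT DATE(msg_create) AS day,
       COUNT(*)         AS pins_count
FROM juejin.pins
GROUP BY DATE(msg_create)
ORDER BY day;
```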
3.1.2 curve of total boiling point with time
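A running total over the daily counts can be computed with a window function over the grouped result (this requires MySQL 8; on older versions a user variable would be needed):

```sql
SELECT DATE(msg_create) AS day,
       SUM(COUNT(*)) OVER (ORDER BY DATE(msg_create)) AS total_pins
FROM juejin.pins
GROUP BY DATE(msg_create)
ORDER BY day;
```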
3.1.3 boiling point topic proportion pie chart top 10
Boiling points without a topic are excluded from the statistics.
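A possible query for the top-10 topic share; how topicless messages are encoded is an assumption (here they are treated as an empty or NULL `topic_title`):

```sql
SELECT topic_title,
       COUNT(*) AS pins_count
FROM juejin.pins
WHERE topic_title IS NOT NULL AND topic_title <> ''
GROUP BY topic_title
ORDER BY pins_count DESC
LIMIT 10;
```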
3.1.4 Top 25 users with the most published boiling points
3.1.5 Top 25 boiling points with the most comments
3.1.6 Top 25 boiling points with the most likes
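The three top-25 charts share one query pattern; for example, sorting by likes (the comments variant swaps in `msg_comment_count`, and the per-user variant groups by `user_id` and counts rows):

```sql
SELECT user_name,
       msg_content,
       msg_digg_count
FROM juejin.pins
ORDER BY msg_digg_count DESC
LIMIT 25;
```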
However, the like counts on the top two boiling points look suspiciously inflated.
3.2 Using the created charts to build a dashboard
The follow-up plan is to analyze the data across more dimensions and in more depth, for example using wordcloud to build a word cloud of boiling point content keywords.
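As a stdlib-only illustration of the counting step behind such a word cloud: in practice Chinese text needs a real tokenizer (e.g. jieba) and the wordcloud package for rendering, so this sketch only shows the frequency counting on whitespace-separated tokens.

```python
from collections import Counter


def top_keywords(contents, n=10):
    """Return the n most common whitespace-separated tokens.

    A stand-in for real tokenization: Chinese content would be
    segmented with jieba before counting, then rendered with wordcloud.
    """
    counter = Counter()
    for text in contents:
        counter.update(text.split())
    return counter.most_common(n)


print(top_keywords(['fish fish work', 'fish lunch'], n=2))
# -> [('fish', 3), ('work', 1)]
```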
If possible, the background runs a special service to capture and update the boiling point data regularly, and makes a large data screen for display.
Shocked! A quarter of all topics are about slacking off ("摸鱼").