How to use ClickHouse for time series data management and mining?

Time: 2021-03-01

ClickHouse is an efficient open-source columnar database management system for online analytical processing, developed by the Russian IT company Yandex and open-sourced in June 2016. This article explains in detail how JUST (https://just.urban-computing.cn/), JD's urban spatio-temporal data engine, uses ClickHouse for time series data management and mining.

1. Introduction to time series data

Time series data is a series of data points indexed in time order. Most commonly it is a sequence sampled at successive, fixed intervals of time, making it a sequence of discrete data [1].

Time series data is almost everywhere. As time flows in one direction, a person's pulse, the air humidity, stock prices and much else change with its passage. Time series data is singled out as an object of special analysis because of these distinctive and valuable characteristics.

Time series data can be modeled with the following components:

  • Metric: the measured data set, similar to a table in a relational database; it is a fixed attribute and does not change over time
  • Timestamp: the point in time at which the data point was collected
  • Tags: dimension columns that describe the metric, representing the data's ownership and attributes (for example, which device or module generated it); they generally do not change over time
  • Field/Value: indicator columns holding the measured values, which can be single-valued or multi-valued

A concrete example of the multi-valued time series model is shown in Table 1.


Table 1 Time series data example
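For instance, a multi-valued air-quality metric might look like this (names and values here are hypothetical, loosely matching the air-quality example used later in this article):

Metric: air_quality
time                 | city (tag) | station (tag) | PM25 (value) | PM10 (value)
2021-03-01 10:00:00  | Beijing    | station_01    | 35.0         | 60.2
2021-03-01 10:05:00  | Beijing    | station_01    | 36.4         | 61.0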

2. Overview of time series data management

  1. The flow of time series data management

All data ultimately exists to deliver value, and the process of extracting that value is data management and analysis. Technically, any data, from creation to deletion, goes through the process shown in Figure 1.


Figure 1 data life cycle

Time series data is no exception, though each stage has its own particularities.

(1) Data collection. Within one scenario, the frequency at which time series data is generated is generally constant, but across scenarios the collection frequency varies greatly: handling 1,000 points per second and 10 million points per second calls for completely different technology. The main considerations for data collection are therefore frequency and concurrency.

(2) Data storage. Storage exists to serve queries and analysis. The storage format, the indexes to build, the volume of data, and the retention period are the main considerations for time series storage. Time series data is generally written much and read little, and it is time-sensitive, so separating hot and cold storage is worth considering.

(3) Data query and analysis. Time series queries have distinctive characteristics as well: reads are generally bounded by a time range, the most recent data is read most frequently, and aggregate queries are issued at different time granularities, for example counting the data volume per day over the last week.
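For example, counting rows per day over the last week might look like this in CK's SQL dialect (table and column names are hypothetical):

-- Daily row counts for the last week
SELECT toDate(time) AS day, count() AS cnt
FROM airquality
WHERE time >= now() - INTERVAL 7 DAY
GROUP BY day
ORDER BY day;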

Analysis depends on queries, and time series analysis is usually multidimensional. Web page hits, for instance, can be analyzed by source website, source IP, click frequency and many other dimensions, depending on the scenario. Time series data is also well suited to data mining, using history to predict the future.

(4) Data deletion. Deletion here targets not individual rows but batches of data within a specific time range. Because time series data is time-sensitive, historical data usually loses its value; whether it is removed on a schedule or manually, deletion marks the end of its short life cycle.
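In CK, for example, such range deletion is cheap when it aligns with partitions, and expiry can also be automated with TTL (table name hypothetical, monthly partitioning assumed):

-- Drop a whole month of data in one operation
ALTER TABLE airquality DROP PARTITION '202102';
-- Or let rows expire automatically after 90 days
ALTER TABLE airquality MODIFY TTL time + INTERVAL 90 DAY;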

2. Objectives of a time series data management system

Given the characteristics and scenarios of time series data, we need a time series data management platform that meets the following objectives:

  • High-throughput writes: tens of millions to hundreds of millions of data points ingested in real time within seconds, under sustained high-concurrency writes;
  • No update operations: most data represents device status and need not be updated after writing;
  • Massive data storage: from TB to PB scale;
  • Efficient real-time queries: statistical analysis of indicators along different dimensions; hot and cold data are clearly distinguished, and generally only recent data is queried frequently;
  • High availability;
  • Scalability;
  • Ease of use;
  • Ease of maintenance.

3. Technology selection

When databases come up, the first that come to mind are MySQL, Oracle and other relational databases that have been around for many years. The relational model remains valid and practical: for modest volumes (millions to tens of millions of rows) MySQL is adequate, and beyond that the data must be split across databases and tables. Time series data is usually split into tables by time, but that places heavy demands on external design and on operations and maintenance. Clearly this cannot satisfy big data scenarios, so few people choose this solution.

Looking at the top ten time series databases on DB-Engines [2] and excluding the commercial ones, not many open-source options remain. Below we introduce several popular time series databases.


Figure 2 DB-Engines time series database ranking

(1) OpenTSDB. OpenTSDB has been open source for nearly 10 years and is an early solution. Built on Hadoop and HBase, it can store massive amounts of data and is reputed to reach write speeds of millions of points per second. For the same reason, though, it depends on the heavy Hadoop ecosystem, so operation and maintenance costs are high and it is neither simple nor lightweight. Another drawback is its HBase key-value storage model, which is neither friendly nor efficient for aggregate queries, and HBase's own problems surface in it as well.


Figure 3 OpenTSDB user interface

(2) InfluxDB. InfluxDB can be called the standard-bearer of the time series industry. It has grown into a platform that covers everything time series data needs, from storage to display. However, although InfluxDB has open-sourced its core code, the crucial clustering capability is available only in the enterprise edition, which is not free. Many large companies either use it as is or develop their own clustering on top of it.


Figure 4 Features supported by the various InfluxDB editions

(3) TDengine. TDengine is an efficient platform for storing, querying and analyzing time series big data, developed by the TAOS team. Its founder, Tao Jianhui, is nearly 50 and still develops the database himself.

TDengine targets time series data from the Internet of Things, the Internet of Vehicles and operations monitoring, and its design is tailored down to the individual device: every collection point gets its own table, so 10 million air monitoring stations mean 10 million tables. To aggregate across collection points it introduces the concept of a super table: tables for collection points of the same type share the same structure and are distinguished by tags. This design is indeed highly targeted; although it narrows the scope, it greatly improves efficiency. According to its official test report [4], its aggregate queries run hundreds of times faster than InfluxDB's while consuming less CPU, memory and disk.


Figure 5 Performance comparison of time series databases, as reported by the TAOS team

TDengine is without doubt a standout among time series databases, and having open-sourced its clustering feature not long ago [5], it is winning over more users. At the time we made our selection, however, its clustering was not yet open source, so we set it aside for later observation.

(4) ClickHouse. ClickHouse (hereinafter CK) is an open-source big data analytics database and a complete DBMS. CK is undoubtedly the dark horse of OLAP databases: less than four years after being open-sourced, its GitHub star count has exceeded 12k (InfluxDB's stands at 19k+), and their fork counts are not far apart.

CK was open-sourced by Yandex, the Russian search engine company. It was originally designed to analyze web page click traffic, hence the name ClickHouse. It iterates very quickly, with a release every month and more than 500 developers, many of them community contributors; the community is very active.

CK is a general-purpose analytics database, not designed specifically for time series data, but used properly it still delivers powerful performance.

3. Introduction to CK's principles

To take advantage of CK, we must first know what its advantages are and then understand the core principles behind them. In our tests, on a table with 27 fields, the write speed of a single instance approached 200 MB per second, more than 4 million rows/s. Because the test data was randomly generated, it did not compress well.

For queries, when an index can be used, results return in milliseconds across different orders of magnitude (millions, tens of millions, hundreds of millions of rows). In the extreme case of an aggregate query over multiple unindexed fields, that is, a full table scan, the aggregation speed can still reach 4 million rows/s.

1. Why is CK fast?

It comes down to choices and details: choices set the direction, details decide success or failure.

CK chooses optimal algorithms, such as LZ4 [6] for column compression; it focuses on the hardware, making full use of the CPU and its cache hierarchy; it treats different scenarios differently, applying SIMD, for example, to text and data filtering; and its fast, continuous iteration not only fixes bugs quickly but also quickly absorbs excellent new algorithms.

2. CK basics

(1) CK is a pure columnar database. A column is one or more files on disk (multiple partitions mean multiple files). We will not expand on column storage here; in short, it benefits analytics: because each column is stored separately, every column can be compressed on its own, which both saves disk space and reduces disk IO.

(2) CK processes data in parallel across many cores. Multithreading over multiple cores is essential to make full use of CPU resources, and vectorized execution raises speed greatly as well.

(3) It provides a SQL query interface. CK clients connect over either HTTP or TCP: TCP is lower-level and more efficient, HTTP easier to use and extend. HTTP is generally sufficient, and the community provides clients in many languages.

(4) CK does not support transactions; in big data scenarios the demand for transactions is not that high.

(5) Updating and deleting by row is not recommended. A delete in CK is also converted into an append-style (mutation) operation; the granularity is too fine and seriously hurts efficiency. A sketch of these mutations follows.
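For completeness, row-level mutations do exist, but as heavyweight asynchronous operations (table name hypothetical):

-- Mutations rewrite the affected data parts in the background; use sparingly
ALTER TABLE airquality DELETE WHERE id = 42;
ALTER TABLE airquality UPDATE name = 'station_renamed' WHERE id = 43;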

3. CK cluster

Cluster deployment is the norm in production. CK clustering differs slightly from Hadoop's. As shown in Figure 6, a CK cluster involves the following key concepts.


Figure 6 CK cluster example

(1) CK instance. You can run several CK instances on one host using different ports, or one CK instance per host.

(2) Shard. A shard holds one horizontal slice of the data; with random sharding, for example, each shard in Figure 6 holds roughly half of the data.

(3) Replica. A redundant copy of the data that also serves as a query node. Multiple replicas serving queries at the same time speeds up queries and raises concurrency. Instances 1 and 3 in Figure 6 store the same data.

(4) Multi-master mode. Every CK instance can be called a replica, and every one of them can serve queries; there is no fixed master or slave. Only when data is written is a leader replica temporarily chosen within each shard to coordinate data synchronization; see the writing process below for details.

4. CK distributed engine

To implement sharding, a distributed engine is needed. In a cluster, CK tables are divided into local tables and distributed tables. The following two statements create them; note that a distributed table is a logical table that maps onto multiple local tables.

create table t_local on cluster shard2_replica2_cluster (t DateTime, id UInt64)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/t_local', '{replica}')
PARTITION BY toYYYYMM(t)
ORDER BY id;

create table t on cluster shard2_replica2_cluster (t DateTime, id UInt64)
ENGINE = Distributed(shard2_replica2_cluster, default, t_local, id);

Here t_local is the local table and t is the distributed table. ReplicatedMergeTree is the engine that synchronizes replicas; its parameters can be ignored for now. Distributed is the distributed engine, whose parameters are the cluster name, the database name, the local table name and the sharding key (which may be rand() for random distribution).

The Distributed engine plays an important role in both writing and querying; the specific processes follow.

5. CK writing process

The writing process depends on the table engine used. The table definitions above are the conventional approach; with them, internal replication must be enabled in the cluster configuration at the same time:

<shard2_replica2_cluster>
    <shard>
        <weight>1</weight>
        <internal_replication>true</internal_replication>
        <replica>
            …
        </replica>
        <replica>
            …
        </replica>
    </shard>
</shard2_replica2_cluster>

Write two rows with insert into t values (now(), 1), (now(), 2); the flow is shown in Figure 7. Writing happens in two steps: the distributed write and replica synchronization.


Figure 7 CK writing process

(1) Distributed write

1) The client picks one replica in the cluster and connects to it, here instance 1. All data is written to instance 1 first. Following the sharding rule, rows belonging to shard 01 are written to instance 1's local table, while rows belonging to shard 02 are first written to a temporary directory; a connection is then established to instance 2 (the leader replica of shard 02) and the data is sent to it.

2) Instance 2 receives the data and writes it to its local table.

3) Instance 1 returns success to the client (by default, success is returned once one replica of each shard has been written; this is configurable).

(2) Replica synchronization

ZooKeeper (ZK) is used during synchronization; the first parameter of ReplicatedMergeTree in the table definition above is a path on ZK. When the table is created, a replica election takes place (generally the first replica becomes the leader), and each replica's node information is registered in ZK. ZK maintains only replica and task metadata and handles distributed coordination; it does not transmit data. Once registered, a replica starts listening to the log under /log. With the replicas online, an insert proceeds as follows:

1) After instance 1 writes its local shard's data, it pushes an operation log entry to /log on ZK, carrying the data part name and the source host (instance 1's host).

2) The other replicas of shard 01, instance 3 here, detect the log change, pull the log entry, create a task and put it into their execution queue /queue on ZK (this part is asynchronous), and then execute tasks from the queue.

3) To execute the task, the replica selects a source (the replica with the most complete data and the fewest queued tasks), connects to it (instance 1), and pulls the data.

Note that using a replicated engine without enabling internal replication is unwise, because the data would be written twice. Deduplication checks keep the data from being duplicated, but they add needless overhead.

6. CK query process

A query targets the distributed table but must be resolved to actual local tables, that is, replica selection; several selection algorithms exist, with random selection the default. Only one replica responds to the client's request, but execution may involve several replicas. Take select count(*) from t: because the data is spread across two shards, no single replica can produce the complete result.


Figure 8 CK multi instance query process

7. Important table engines in CK

CK's core engine is MergeTree, and many further engines are derived from it. Here are some common ones.

(1) ReplacingMergeTree. To address the fact that MergeTree allows duplicate primary keys, ReplacingMergeTree can be used, but its deduplication is limited: rows are deduplicated only within a partition. Usage reference: https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/replacingmergetree/
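A minimal sketch of the idea (table and column names are hypothetical):

-- Rows with the same sorting key are collapsed within a partition during merges,
-- keeping the row with the largest version column (time here); merging is
-- asynchronous, so use SELECT ... FINAL when exact deduplication matters
CREATE TABLE device_status
(
    device_id UInt64,
    time DateTime,
    status String
)
ENGINE = ReplacingMergeTree(time)
PARTITION BY toYYYYMM(time)
ORDER BY device_id;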

(2) SummingMergeTree. For a fixed group by + sum query that is time-consuming, you can create a SummingMergeTree, which pre-aggregates during merges: rows are grouped by the ORDER BY fields and the remaining numeric fields (or a custom set of aggregate fields) are summed.
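A sketch under the same hypothetical schema:

-- Rows sharing (city, day) are merged into one row whose pm25_sum is their sum;
-- merges happen asynchronously, so queries should still GROUP BY and sum()
CREATE TABLE pm25_daily
(
    city String,
    day Date,
    pm25_sum Float64
)
ENGINE = SummingMergeTree(pm25_sum)
ORDER BY (city, day);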

(3) AggregatingMergeTree. Aggregation is obviously the focus of query analysis. AggregatingMergeTree is generally combined with a materialized view, which is synchronized automatically as data is inserted; the aggregated results can then be queried directly from the materialized view, as sketched below.
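A minimal sketch of this pattern, with hypothetical names, using CK's -State/-Merge aggregate combinators:

-- The view keeps partial aggregation states per (city, hour), updated on insert
CREATE MATERIALIZED VIEW pm25_hourly
ENGINE = AggregatingMergeTree()
ORDER BY (city, hour)
AS SELECT
    city,
    toStartOfHour(time) AS hour,
    avgState(PM25) AS pm25_avg
FROM airquality
GROUP BY city, hour;

-- Reading the stored states requires the matching -Merge combinator
SELECT city, hour, avgMerge(pm25_avg) AS pm25
FROM pm25_hourly
GROUP BY city, hour;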

(4) ReplicatedXXXMergeTree. Prefixing any of the above engines with Replicated upgrades it to support replicas.

(5) Materialized views. A materialized view stores the results of its view query in a table. The special point in CK is that only newly inserted data triggers the view query and enters the view table; existing historical data is not synchronized. The triggering also does not propagate across a cluster, so to use a materialized view in a distributed setting, the table the view depends on can be specified as a distributed table.

4. Combining CK with time series data

Having covered CK's basic principles, let us look at its capabilities for time series data.

(1) Time: time is essential; partitioning by time can greatly narrow the range of data scanned;

(2) Filtering: filter conditions generally involve a handful of columns, where column storage has an obvious advantage;

(3) Downsampling: a very important time series feature, achievable through aggregation; CK's time conversion functions for every granularity and its powerful aggregation capabilities meet the requirement (see the sketch after this list);

(4) Analysis and mining: extension functions can be developed to support it.
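As a hypothetical downsampling example against the air-quality table used later in this article:

-- Downsample raw points to 5-minute averages over the last day
SELECT
    toStartOfInterval(time, INTERVAL 5 MINUTE) AS bucket,
    avg(PM25) AS pm25
FROM airquality
WHERE time >= now() - INTERVAL 1 DAY
GROUP BY bucket
ORDER BY bucket;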

In addition, as a big data system, CK also meets the following basic requirements:

(1) High-throughput writes;

(2) Massive data storage: hot/cold tiering and TTL;

(3) Efficient real-time queries;

(4) High availability;

(5) Extensibility: custom development is possible;

(6) Ease of use: JDBC and HTTP interfaces are provided;

(7) Ease of maintenance: data migration is convenient and recovery is easy; in the future ZK may be removed in favor of a built-in distributed mechanism.

Therefore, CK can readily serve as the basis of a high-performance, highly available time series data management and analysis system. The key points are introduced in detail below.

1. Time series indexing and partitioning

Time series query scenarios involve many aggregate queries. For a specific scenario that is used very frequently over a very large amount of data, we can pre-aggregate with a materialized view and query the view instead. For a general analysis platform whose query conditions change arbitrarily, however, materialized views cost too much, so for now we query the original tables directly; the materialized view scheme will be evaluated further in the future.

The following is JUST's syntax for creating a time series table: the first parenthesized group holds the tag fields, the second holds the value fields (which must be numeric), and braces hold special configuration for the underlying storage, namely CK's indexes and parameters. Besides the user-specified fields, an implicit time field is reserved for the series.

create table my_ts_table as ts (
    tag1 string,
    tag2 string [:primarykey] [:comment='description']
)
(
    value1 double,
    value2 double
)

Underneath, JUST maps each time series table to two CK tables (a local table and a distributed table), which by default are partitioned and sorted by time.

For example, an air-quality table corresponds to the following CK table creation statements:

CREATE TABLE just.username_dbname_airquality_local
(
    `id` Int32,
    `oid` Int32,
    `name` String,
    `city` String,
    `time` DateTime,
    `PM10` Float64,
    `PM25` Float64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/24518511-2939-489b-94a8-0567384d927d', '{replica}')
PARTITION BY toYYYYMM(time)
ORDER BY (time)
SETTINGS index_granularity = 8192;

CREATE TABLE just.username_dbname_airquality
(
    `id` Int32,
    `oid` Int32,
    `name` String,
    `city` String,
    `time` DateTime,
    `PM10` Float64,
    `PM25` Float64
)
ENGINE = Distributed('just_default', 'just', 'username_dbname_airquality_local', rand());

This ensures the index can be used for time-range queries. If other query conditions filter by tag, custom indexes and sort fields can also be specified (CK requires the index fields to be a prefix of the sort fields).
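For example, a hypothetical variant of the table above that also serves frequent filters on city:

-- The primary key must be a prefix of the sort key, so ordering by (city, time)
-- lets both city filters and time-range scans use the sparse index
CREATE TABLE just.username_dbname_airquality_local
(
    `id` Int32,
    `oid` Int32,
    `name` String,
    `city` String,
    `time` DateTime,
    `PM10` Float64,
    `PM25` Float64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/airquality_by_city', '{replica}')
PARTITION BY toYYYYMM(time)
PRIMARY KEY (city)
ORDER BY (city, time);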

In different scenarios, the partitioning key and index granularity must be chosen according to data volume and data characteristics. In our test environment CK scans roughly 1 GB of data per second; multiplying by a compression ratio of 1-10x suggests a partition should hold on the order of tens of millions to hundreds of millions of rows to maintain good speed, and CK queries partitions in parallel with multiple threads. The query pattern matters too: if the maximum query window is one month but most queries cover a week, partitioning by week may work better. It is a trade-off to weigh as a whole.

2. Deployment and high availability

In JUST, high scalability and high availability are what we pursue. For scalability we shard the data horizontally; for availability we keep at least two replicas of every shard.

For cluster deployment, the minimal case is two machines, which allows two layouts: 1) crossed replicas; 2) one primary and one standby, as shown in Figure 9.


Figure 9 Deployment options with two replicas

The experimental results of these two schemes' impact on queries and writes are shown in Figure 10.


Figure 10 Write and query comparison of the two layouts

The experiments show that at the limit the write speed (abscissa: number of instances written to; ordinate: speed in MB/s) is almost identical for both layouts, and query performance (abscissa: SQL statement number, see Appendix 1 for the statements; ordinate: time in seconds) differs little for simple queries. For complex queries that consume many resources, however, one-primary-one-standby is more efficient, because CK's strong performance rests on making full use of the CPU; in our bare-metal tests CPU usage frequently exceeded 90%.

If CK has machines to itself, it can no doubt use their resources fully. In our environment, however, it shares machines with other big data platforms, and CK must be kept from taking so many resources that it affects other services, so we chose Docker deployment. For containerized deployment there is also an open-source Kubernetes-based implementation, clickhouse-operator; in a small environment, manual configuration with simple scripts for automated deployment is enough.

Based on the test conclusions above, CK clustering and data redundancy are essential to keep the service highly available. Our scheme guarantees at least two replicas of every shard, makes the number of shards as large as possible, uses the machines fully, and runs exactly one CK instance per machine. This yields the following rule for the number of shards and replicas:

f(n) = (number of shards, number of replicas), n >= 2

where f(n) denotes the deployment layout for n machines: f(2) = (1, 2) means two machines run one shard with two replicas; f(3) = (1, 3) means one shard with three replicas; f(4) = (2, 2) means two shards with two replicas each; and so on.

3. Dynamic expansion

As the data volume grows and nodes need to be added, the cluster can be expanded dynamically without downtime, mainly by exploiting the weight relationship between shards.

There are two expansion cases:

(1) Adding a replica: simply modify the configuration file and add the replica instance, and the data synchronizes automatically. Because of CK's multi-master design, the new replica also serves as a query node and shares the query load;

(2) Adding a shard: this is more troublesome. Weights must be computed from the current data volume and the volume still to be ingested, then changed back once the data reaches equilibrium.

Suppose we start with a single shard holding 100 rows of data:

<test_extend>
       <shard>
              <weight>1</weight>
              <internal_replication>true</internal_replication>
              <replica>
                     <host>10.220.48.106</host>
                     <port>9000</port>
              </replica>
              <replica>
                     <host>10.220.48.105</host>
                     <port>9000</port>
              </replica>
       </shard>
</test_extend> 

Now another shard is added. The weight calculation goes as follows (to simplify, we ignore data inserted during the rebalancing):

Suppose the cluster balances out after n more rows are inserted; each shard should then hold (n + 100) / 2 rows, while shard01 already has 100. With weights w1 and w2, the new shard receives n * w2 / (w1 + w2) of the inserted rows, so balance requires n * w2 / (w1 + w2) = (n + 100) / 2, with n > 100. Taking w1 = 1 and n = 200 gives 200 * w2 / (1 + w2) = 150, hence w2 = 3.

Therefore, the configuration is modified as follows:

<test_extend>
       <shard>
              <weight>1</weight>
              <internal_replication>true</internal_replication>
              <replica>
                     <host>10.220.48.106</host>
                     <port>9000</port>
              </replica>
              <replica>
                     <host>10.220.48.105</host>
                     <port>9000</port>
              </replica>
       </shard>
       <shard>
              <weight>3</weight>
              <internal_replication>true</internal_replication>
              <replica>
                     <host>10.220.48.103</host>
                     <port>9000</port>
              </replica>
       </shard>
</test_extend> 

Wait until the data has synchronized and balanced out, then change the weights back to 1:1.

4. System overview and shortcomings

The JUST time series analysis module uses CK underneath as its storage and query engine and provides a reusable visual analysis interface on top. You are welcome to try it at https://just.urban-computing.cn/.


Figure 11 Schematic diagram of the JUST time series analysis module

Users create a time series table through the unified query interface, import data, and then switch to the time series analysis module for visual queries.


Figure 12 Creating a time series table in JUST

The main query functions at present are querying by time and filtering by tag. When the data volume is large, the data can be downsampled at a coarser time granularity to view the overall trend, and linear, Lagrange and other missing-value filling functions are provided.

The analysis and mining part mainly covers locating specific values, percentage-based filtering, and some simple function transformations.

At present the time series module's functionality is relatively simple, and its SQL query support for time series data is incomplete. The following features will be integrated in the future:

(1) Real-time data ingestion;

(2) Pre-aggregation of panel functions through the aggregating engines for complex queries;

(3) More complete analysis and mining functions;

(4) Data fault tolerance and validation;

(5) SQL query support consistent with the rest of JUST.

Reference link:

[1] https://en.wikipedia.org/wiki…

[2] https://db-engines.com/en/ran…

[3] https://www.influxdata.com/bl…

[4] https://www.taosdata.com/down…

[5] https://www.taosdata.com/blog…

[6] LZ4 [EB/OL]. https://lz4.github.io/lz4/, 20…

[7] https://clickhouse.tech/docs/…
