HTAP | highway from Mysql to Clickhouse

Time:2022-2-2
HTAP | highway from Mysql to Clickhouse

Author: tceason Qingyun technology database R & D Engineer

Since 2000, MySQL [1] has been the most popular OLTP (online transaction processing) database in the world, and Clickhouse [2] is an OLAP (online analytical processing) database that has attracted great attention in recent years. So will there be any sparks between the two?

This article will lead you to break the barriers of heterogeneous databases and synchronize MySQL data to Clickhouse.

background

1. Development of MySQL replication

HTAP | highway from Mysql to Clickhouse

Figure 1-1 development of MySQL replication

Figure 1-1 details the development of MySQL replication.

MySQL version 3.23 in 2001 already supports isomorphic databasesAsynchronous replication; Because it is asynchronous replication, it can not be used in large quantities in actual production.

2013 MySQL version 5.7.2 supportEnhanced semi synchronous replicationAbility, which can barely be regarded as an enterprise level available data synchronization scheme.

MySQL 5.7.17 in 2016 supportsMGR, and constantly developed into a data synchronization scheme available at the financial level.

For homogeneous MySQL data synchronization, the next thing to do is to continuously optimize the experience, improve the timeliness of synchronization, and solve various problems under Network exceptions.

Based on this, major manufacturers also began to make their own high availability synchronization components. For example, xenon, developed and open-source by the qingcloud database R & D team, has real strong consistency and high availability.

2、MySQL + Xenon

HTAP | highway from Mysql to Clickhouse

Xenon in Figure 1-2 is a high availability component implemented by raft like algorithm, which is used to manage MySQL election and exploration activities and correct data accuracy. MySQL data synchronization still uses semi sync replication or Mgr, so as to achieve strong data consistency, non centralized automatic master selection and master-slave second switching, as well as cross region disaster tolerance based on cloud.

Clickhouse synchronize MySQL data

To speed up OLAP queries, qingcloud MySQL plus [3] (MySQL + xenon) borrows Clickhouse to synchronize MySQL data.

1. Clickhouse overview

HTAP | highway from Mysql to Clickhouse

Clickhouse is a columnar database management system (DBMS) for on-line analysis (OLAP). Clickhouse was conceived in 2008 and was originally developed for yandex METRICA (the world’s second largest web analysis platform). It has been continuously used by the system as the core component of the system for many years, and announced open source in 2016.

HTAP | highway from Mysql to Clickhouse

From the latest DB engines, we can see that its ranking curve is rising all the way, and major factories have deployed a large number of important businesses, which is a very obvious trend. Therefore, it seems that we can conclude that the popularity of Clickhouse is not just a temporary phenomenon, it will survive for a long time. Moreover, Clickhouse’s flexible external table engine can easily synchronize data with MySQL. Let’s learn about it next.

2、MySQL Table Engine

MySQL table engine features.

  • Mapping to MySQL table
  • Fetch table struct from MySQL
  • Fetch data from MySQL when executing query

Clickhouse initially supports table level synchronization of MySQL data, and realizes the mapping with MySQL tables through the external table engine MySQL table engine. frominformation_schemaGet the structure of the corresponding table in the table and convert it into the data structure supported by Clickhouse. At this time, the table structure is successfully established on the Clickhouse side. But at this time, there is no real data synchronization. Only when a request is made to the table in Clickhouse will the data of the MySQL table to be synchronized be actively pulled.

MySQL table engine is very simple to use, but it is very meaningful. Because this is the first time to open the data channel between Clickhouse and mysql. However, the disadvantages are obvious:

i. It is only a mapping of MySQL table relationships;

ii. Transmitting MySQL data to Clickhouse during query may cause unknown network pressure and reading pressure to MySQL, which may affect the normal use of MySQL in production.

Based on the disadvantage that MySQL table engine can only map MySQL table relationships, the qingcloud Clickhouse team implemented MySQL database engine.

3、MySQL Database Engine

Mysql database engine features.

  • Mapping to MySQL Database
  • Fetch table list from MySQL
  • Fetch table struct from MySQL
  • Fetch data from MySQL when executing query

Mysql database engine is a database level mapping, which is frominformation_schemaPull the structure of all MySQL tables contained in the database to be synchronized, which solves the problem of establishing multiple tables. However, it still has the same disadvantages as MySQL table engine: transmitting MySQL data to Clickhouse during query may cause unknown network pressure and reading pressure to MySQL, which may affect the normal use of MySQL in production.

4. Borrow third-party software synchronization

HTAP | highway from Mysql to Clickhouse

In addition to the MySQL table engine and MySQL database engine mentioned above, third-party software, such as canal or Kafka, can be used to synchronize data by parsing MySQL binlog, and then writing a program to control writing to Clickhouse. This has a great advantage, that is, the synchronization process is autonomous and controllable. But there are additional problems:

i. It increases the complexity of data synchronization.

ii. Third party software is added, which increases the difficulty of operation and maintenance exponentially.

Based on this, we can think about another question: can Clickhouse actively synchronize and subscribe to MySQL data?

Materialize MySQL

In order to solve the problem that MySQL database engine still exists and support Clickhouse to actively synchronize and subscribe to MySQL data, the qingcloud Clickhouse team independently developed the materialized MySQL [4] engine.

1. Brief introduction to materializemysql

Materializemysql engine is a library engine independently developed by qingcloud Clickhouse team. At present, it is incorporated into Clickhouse version 20.8 as an experimental feature. It is a mapping of MySQL library level relationship and subscribes to MySQL data by consuming binlog and storing it in mergetree.

The specific usage is a simple create database SQL example:

CREATE DATABASE test ENGINE = MaterializeMySQL(
  '172.17.0.3:3306', 'demo', 'root', '123'
)

#172.17.0.3:3306 - MySQL address and port
#Demo - name of MySQL Library
#Root - MySQL synchronization account
#123 - password of MySQL synchronization account

2. Design idea of materialized MySQL

  • Check MySQL Vars
  • Select history data
  • Consume new data

The design idea of materializemysql is as follows:

  1. First, check whether the source MySQL parameters meet the specifications;
  2. Then the data is divided into historical data and incremental data according to gtid;
  3. Synchronize historical data to gtid point;
  4. Continuous consumption increment data.

3. Materializemysql function flow

HTAP | highway from Mysql to Clickhouse

As shown in Figure 3-1, the main process of materializemysql function is as follows:

CheckMySQLVars -> prepareSynchronized -> Synchronized

(1)CheckMySQLVars

The inspection parameters are relatively simple, which is to query whether these parameters meet the expectations.

SHOW VARIABLES WHERE (Variable_name = 'log_bin'
AND upper(Value) = 'ON')
OR (Variable_name = 'binlog_format'
AND upper(Value) = 'ROW')
OR (Variable_name = 'binlog_row_image'
AND upper(Value) = 'FULL')
OR (Variable_name = 'default_authentication_plugin'
AND upper(Value) = 'MYSQL_NATIVE_PASSWORD')
OR (Variable_name = 'log_bin_use_v1_row_events'
AND upper(Value) = 'OFF');

(2)prepareSynchronized

This step is used to pull historical data.

  • To initialize gtid information first;
  • In order to ensure idempotency, clean up the tables under Clickhouse materializemysql engine library every time you resynchronize;
  • Pull the historical data again and rewrite the MySQL table structure on the Clickhouse side;
  • Establish a binlog transmission channel with MySQL.
std::optional<MaterializeMetadata> MaterializeMySQLSyncThread::prepareSynchronized()
{
    connection = pool.get();
    MaterializeMetadata metadata(
connection, DatabaseCatalog::instance().getDatabase(database_name)->getMetadataPath() + "/.metadata", mysql_database_name, opened_transaction);
    if (!metadata.need_dumping_tables.empty())
    {
        Position position;
        position.update(metadata.binlog_position, metadata.binlog_file, metadata.executed_gtid_set);
        metadata.transaction(position, [&]()
        {
            cleanOutdatedTables(database_name, global_context);
            dumpDataForTables(connection, metadata, query_prefix, database_name, mysql_database_name, global_context, [this] { return isCancelled(); });
         });
    }
    connection->query("COMMIT").execute();
}

In mysql, there is a table t under the demo library. The primary key is ID and the common column is col_ 1。

CREATE TABLE demo.t (
  id int(11) NOT NULL,
  col_1 varchar(20) DEFAULT NULL,
  PRIMARY KEY (id)
) ENGINE = InnoDB;

In Clickhouse, ID is still the primary key column, but there are many hidden columns_ Sign and_ version。

i. _ Sign: the values are only 1 and – 1. Where, 1 means that this row of data exists and – 1 means that this row of data is deleted.

ii. _ Version: only the high version value will be read, and the rows with the same primary key will be merged in the background, and finally the row with the highest version will be retained.

CREATE TABLE test.t
(
    `id` Int32,
    `col_1` Nullable(String),
    `_sign` Int8 MATERIALIZED 1,
    `_version` UInt64 MATERIALIZED 1
)
ENGINE = ReplacingMergeTree(_version)
PARTITION BY intDiv(id, 4294967)
ORDER BY tuple(id)

(3)Synchronized

In preparessynchronized, we get the historical data and historical data location information, and obtain the binlog transmission channel with MySQL. The next step is to synchronize incremental data from this site. Read the contents of each binlog through the readonebinlogevent function, and then use onevent to convert it into the statement format of Clickhouse. Finally, for data security, call the flushbuffersdata function to drop the data to the disk.

client.connect();
client.startBinlogDumpGTID(randomNumber(), mysql_database_name, metadata.executed_gtid_set, metadata.binlog_checksum);
Buffers buffers(database_name);
while (!isCancelled())
{
    BinlogEventPtr binlog_event = client.readOneBinlogEvent(std::max(UInt64(1), max_flush_time - watch.elapsedMilliseconds()));
    if (binlog_event)
        onEvent(buffers, binlog_event, *metadata);
    if (!buffers.data.empty())
        flushBuffersData(buffers, *metadata);
}

HTAP application scenario

When we get through the replication channel between Clickhouse and MYSQL, and the analysis ability of Clickhouse is so amazing, can we use MySQL + Clickhouse to realize HTAP?

HTAP | highway from Mysql to Clickhouse

In the architecture in Figure 4-1, xenon, a highly available component, is still used to manage MySQL replication. Meanwhile, xenon has increased the supervision of Clickhouse and synchronized MySQL data through materialized mysql.

In the previous architecture diagram, the read-only instance of MySQL was used for business analysis, user portrait and other business analysis. Now Clickhouse can be directly added to MySQL replication as an analysis instance to replace some read-only instances for analysis and calculation. At the same time, Clickhouse itself supports a large number of functions to support analysis ability, and also supports standard SQL. I believe users can enjoy a good experience.

At present, Clickhouse can support the synchronization of MySQL 5.7 and 8.0 data, but does not support the synchronization of MySQL 5.6 data. However, as an experimental feature, the timeline of materialized MySQL is equivalent to MySQL that just supported replication in 2001. Welcome to contribute and maintain materializemysql.

[1]. MySQL : https://www.mysql.com/

[2]. ClickHouse : https://clickhouse.tech/docs/en/

[3]. MySQL Plus:https://www.qingcloud.com/products/mysql-plus/

[4]. MaterializeMySQL:https://clickhouse.tech/docs/en/engines/database-engines/materialize-mysql/

About radondb

Radondb open source community is a cloud oriented native and containerized database open source community. It provides database technology lovers with a technology sharing platform around mainstream open source databases (mysql, PostgreSQL, redis, mongodb, Clickhouse, etc.), and provides enterprise level radondb open source products and services.

At present, radondb open source database series products have beenEverbright Bank, Pudong Development Silicon Valley Bank, Hami bank, Taikang Insurance, Taiping Insurance, AXA insurance, sunshine insurance, Centennial life insurance, Anji logistics, Anchang logistics, blue moon, Tiancai Shanglong, rockjiahua, Shengzhe technology, Wuxi huipao sports, Beijing Telecom, Jiangsu transportation holding, Sichuan Airlines, Kunming Airlines and state-controlled biologyAnd thousands of enterprises and community users.

Radondb can be delivered based on cloud platform and kubernetes container platform. It not only provides database product solutions covering multiple scenarios, but also provides professional cluster management and automatic operation and maintenance capabilities. Its main functional features include:High availability master-slave switching, strong data consistency, read-write separation, one click installation and deployment, multi-dimensional index monitoring & alarm, elastic expansion & shrinkage, horizontal free expansion, automatic backup & recovery, multi activity in the same city and remote disaster recoveryWait. Radondb only needs enterprise and community users to focus on business layer logic development, without paying attention to complex issues such as cluster high availability selection, management and operation and maintenance, so as to help enterprise and community users greatly improve the efficiency of business development and value innovation!

GitHub:

https://github.com/radondb

This article will introduce how the materialized MySQL engine synchronizes MySQL data to Clickhouse.

Recommended Today

The implementation in python outputs the bases in the fasta file with the specified number per line

  001. Test data [[email protected] test2]# ls a.fa test.py [[email protected] test2]# cat a.fa ## Test data >OR4F5_ENSG00000186092_ENST00000641515_61_1038_2618 CCCAGATCTCTTCAGTTTTTATGCCTCATTCTGTGAAAATTGCTGTAGTCTCTTCCAGTTATGAAGAAGGTAACTGCAGAGGCTATTTCCTGGAATGAATCAACGAGTGAAACGAATAACTCTATGGTGACTGAATTCATTTTTCTGGGTCTCTCTGATTCTCAGGAACTCCAGACCTTCCTATTTATGTTGTTTTTT >OR4F29_ENSG00000284733_ENST00000426406_20_955_995 AGCCCAGTTGGCTGGACCAATGGATGGAGAGAATCACTCAGTGGTATCTGAGTTTTTGTTTCTGGGACTCACTCATTCATGGGAGATCCAGCTCCTCCTCCTAGTGTTTTCCTCTGTGCTCTATGTGGCAAGCATTACTGGAAACATCCTCATTGTGTTTTCTGTGACCACTGACCCTCACTTAC [[email protected] test2]# cat test.py ## Conversion script #!/usr/bin/python dict1 = {} in_file = open(“a.fa”, “r”) out_file = open(“result.fa”, “w”) for i in in_file: i = i.strip() if i.startswith(“>”): key = i dict1[key] = “” else: […]