Flink + Hudi framework: lake-warehouse integrated solution

Time: 2021-12-24

This article is reproduced from the WeChat official account "Qi" and introduces in detail how to build a prototype of the Flink + Hudi lake-warehouse integrated solution.

  1. Hudi
  2. The new lake-warehouse integrated architecture
  3. Best practices
  4. Flink on Hudi
  5. Flink CDC 2.0 on Hudi


1、 Hudi

1. Introduction

Apache Hudi (pronounced "Hoodie") provides the following streaming primitives on datasets stored on DFS:

  • Upsert (how do I change the dataset?)
  • Incremental pull (how do I fetch data that has changed?)

Hudi maintains a timeline of all operations performed on the dataset in order to provide instantaneous views of the dataset. Hudi organizes a dataset into a directory structure under a base path, very similar to a Hive table. The dataset is divided into multiple partitions, and each partition folder contains the files that belong to that partition. Each partition is uniquely identified by its partition path relative to the base path.

Records in a partition are distributed across multiple files. Each file is identified by a unique file ID and the commit that produced it. When updates occur, several files share the same file ID but were written by different commits.

Storage type – how data is stored (see the DDL sketch after this list)

  • Copy on write: purely columnar format; each update creates a new version of the files.
  • Merge on read: stores data using a combination of columnar and row-based formats; near real time.
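For illustration, with Hudi's Flink connector (used later in this article) the storage type is chosen per table through the 'table.type' option. A minimal DDL sketch; the table name and path are placeholders:

CREATE TABLE hudi_cow_demo (
  id BIGINT PRIMARY KEY NOT ENFORCED,
  name STRING,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/hudi_cow_demo',   -- placeholder path
  'table.type' = 'COPY_ON_WRITE'          -- or 'MERGE_ON_READ'
);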

View – how data is read

Read-optimized view – the input format selects only compacted columnar files

  • Parquet-level query performance
  • Latency of roughly 30 minutes for 500 GB of data
  • Can be imported as an existing Hive table

Near-real-time view

  • Reads a mix of columnar and row-based data
  • Latency of about 1-5 minutes
  • Provides a near-real-time table

Incremental view

  • A stream of changes to the dataset
  • Enables incremental pull

The Hudi storage layer consists of three different parts:

metadata – maintains the metadata of all operations performed on the dataset in the form of a timeline, which allows instantaneous views of the dataset; it is stored in the metadata directory under the base path. The operation types on the timeline include:

  • Commit: a commit represents an atomic write of a batch of records into the dataset. Each commit is identified by a monotonically increasing timestamp that marks the start of the write operation.
  • Clean: removes older versions of files in the dataset that are no longer needed by queries.
  • Compaction: the background activity that converts row-based files into columnar files.

index – quickly maps an incoming record key to a file if the record key already exists (see the DDL sketch after this list). The index implementation is pluggable: Bloom filter – the default, since it does not rely on any external system; the index and the data are always consistent. Apache HBase – more efficient for a small batch of keys; it can save a few seconds during index tagging.

data – Hudi stores data in two different storage formats. The format actually used is pluggable but must have the following characteristics: a read-optimized columnar storage format (ROFormat), Apache Parquet by default; and a write-optimized row-based storage format (WOFormat), Apache Avro by default.
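Tying this to the Flink SQL examples later in the article: the record key that the index maps to files is declared through the table's primary key, and a precombine field decides which record wins when several share the same key. A minimal sketch; the 'write.precombine.field' option name is taken from Hudi 0.10's Flink options and should be verified against the version you use:

CREATE TABLE hudi_index_demo (
  id BIGINT PRIMARY KEY NOT ENFORCED,   -- record key that the index maps to files
  name STRING,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/hudi_index_demo',
  'table.type' = 'MERGE_ON_READ',
  'write.precombine.field' = 'ts'       -- assumed option name; the newest ts wins for duplicate keys
);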


2. Why is Hudi important for large-scale and near real-time applications?

Hudi addresses the following limitations:

  • Scalability limitations of HDFS;
  • The need to deliver data faster in Hadoop;
  • The lack of direct support for updating and deleting existing data;
  • The need for fast ETL and modeling;
  • To retrieve all changed records, whether they are new records added to the latest date partition or updates to older data, Hudi lets the consumer pull incrementally from the last checkpoint timestamp instead of running a query that scans the entire source table (a Flink SQL sketch of such an incremental read follows this list).
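As a concrete illustration, Hudi's Flink connector exposes this incremental pull as a streaming read that starts from a given commit instant instead of scanning the whole table. A minimal sketch, assuming the Hudi 0.10 option names 'read.streaming.enabled', 'read.streaming.check-interval' and 'read.streaming.start-commit' (verify them against your Hudi version); the table name and path are placeholders:

CREATE TABLE hudi_orders_incr (
  id BIGINT PRIMARY KEY NOT ENFORCED,
  amount DOUBLE,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/hudi_orders',
  'table.type' = 'MERGE_ON_READ',
  'read.streaming.enabled' = 'true',                 -- poll the timeline for new commits
  'read.streaming.check-interval' = '4',             -- poll interval in seconds
  'read.streaming.start-commit' = '20211201000000'   -- earliest commit instant to read, e.g. the last checkpointed one
);

-- Only records from commits after the start instant are returned:
SELECT * FROM hudi_orders_incr;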

3. Hudi’s advantages

  • Overcomes the scalability limitations of HDFS;
  • Presents data in Hadoop faster;
  • Supports updating and deleting existing data;
  • Enables fast ETL and modeling.

(the above content is mainly quoted from Apache Hudi)

2、 The new lake-warehouse integrated architecture

Through lake-warehouse integration and stream-batch unification, we can achieve, in quasi-real-time scenarios, data with the same source, the same computing engine, the same storage, and the same computing caliber. Data timeliness can reach the minute level, which meets the needs of a quasi-real-time business data warehouse well. The architecture diagram is shown below:

[Figure: lake-warehouse integrated architecture]

MySQL data enters Kafka through Flink CDC. The data first goes into Kafka rather than directly into Hudi so that it can be reused by multiple real-time tasks; this avoids the load on the MySQL database that would result from several tasks each connecting to the MySQL tables and binlog through Flink CDC.

Besides feeding the ODS layer of the offline data warehouse, the data that enters Kafka through CDC flows along the real-time warehouse link ODS -> DWD -> DWS -> OLAP database and is finally used for data services such as reports. The result data of every layer of the real-time warehouse is also sent to the offline warehouse in quasi-real time. In this way, a program is developed once, metric calibers are unified, and the data is unified.
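To make the first hop of this link concrete, the changelog stream that Flink CDC writes into Kafka can be read back with the changelog-json format (the corresponding jar appears in section 5) and written into a Hudi ODS table. A minimal sketch; the topic, schema, servers and the hudi_ods_users target table are placeholders, not taken from the original article:

CREATE TABLE kafka_ods_users (
  id BIGINT,
  name STRING,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'ods_users',                               -- placeholder topic
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'ods_to_hudi',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'changelog-json'                          -- provided by flink-format-changelog-json
);

-- hudi_ods_users is a Hudi table defined like the ones in section 5
INSERT INTO hudi_ods_users SELECT id, name, ts FROM kafka_ods_users;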

The architecture diagram also shows a data-correction step (rerunning historical data). This step exists because historical data may have to be rerun when calibers are adjusted or when the previous day's real-time computation produced wrong results.

Kafka retains data only for a limited time and does not keep historical data for long, so long-range historical reruns cannot obtain their source data from Kafka. Moreover, pushing a large amount of historical data back into Kafka and correcting it through the real-time pipeline could affect the current day's real-time jobs. Therefore, rerunning historical data is handled through the data-correction step.

Overall, this architecture is a hybrid of Lambda and Kappa. Every data link of the stream-batch integrated warehouse has a data-quality verification step. On the next day, the previous day's data is reconciled; if the previous day's real-time results are normal, no correction is needed and the Kappa architecture alone is sufficient.

(This section is quoted from "The practice of 37 Mobile Games with the Flink CDC + Hudi lake-warehouse integration scheme".)

3、 Best practices

1. Version matching

The choice of versions may be the first stumbling block. The following version pairings are recommended by the Hudi Chinese community:

Flink Hudi
1.12.2 0.9.0
1.13.1 0.10.0

It is recommended to use Hudi master + Flink 1.13 to better adapt to CDC connector.

2. Download Hudi

https://mvnrepository.com/art…

At present, the latest version in the Maven central repository is 0.9.0. If you need version 0.10.0, you can join the community group and download it from the shared files, or download the source code and compile it yourself.

3. Implementation

If you put hudi-flink-bundle_2.11-0.10.0.jar under flink/lib, you only need to execute the following to start the SQL client; otherwise all kinds of "class not found" exceptions will appear:

bin/sql-client.sh embedded

4、 Flink on Hudi

Create a new Maven project and modify the POM as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>Flink_Hudi_test</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <flink.version>1.13.1</flink.version>
        <hudi.version>0.10.0</hudi.version>
        <hadoop.version>2.10.1</hadoop.version>
    </properties>

    <dependencies>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-core</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-jdbc_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-api-java-bridge_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-common</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-planner_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-planner-blink_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-planner-blink_2.11</artifactId>
            <version>${flink.version}</version>
            <type>test-jar</type>
        </dependency>

        <dependency>
            <groupId>com.ververica</groupId>
            <artifactId>flink-connector-mysql-cdc</artifactId>
            <version>2.0.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hudi</groupId>
            <artifactId>hudi-flink-bundle_2.11</artifactId>
            <version>${hudi.version}</version>
            <scope>system</scope>
            <systemPath>${project.basedir}/libs/hudi-flink-bundle_2.11-0.10.0-SNAPSHOT.jar</systemPath>
        </dependency>

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.49</version>
        </dependency>

    </dependencies>
</project>

We use the statement insert into t2 select replace(uuid(),'-',''), id, name, description, now() from mysql_binlog to write the rows of the MySQL table we created into Hudi.

package name.lijiaqi;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.SqlDialect;
import org.apache.flink.table.api.TableResult;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class MySQLToHudiExample {
    public static void main(String[] args) throws Exception {
        EnvironmentSettings fsSettings = EnvironmentSettings.newInstance()
                .useBlinkPlanner()
                .inStreamingMode()
                .build();
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env, fsSettings);

        tableEnv.getConfig().setSqlDialect(SqlDialect.DEFAULT);

        // Data source table (reads the MySQL table via the JDBC connector)
        String sourceDDL =
                "CREATE TABLE mysql_binlog (\n" +
                        " id INT NOT NULL,\n" +
                        " name STRING,\n" +
                        " description STRING\n" +
                        ") WITH (\n" +
                        " 'connector' = 'jdbc',\n" +
                        " 'url' = 'jdbc:mysql://127.0.0.1:3306/test', \n"+
                        " 'driver' = 'com.mysql.jdbc.Driver', \n"+
                        " 'username' = 'root',\n" +
                        " 'password' = 'dafei1288', \n" +
                        " 'table-name' = 'test_cdc'\n" +
                        ")";

        // Output target table (Hudi MERGE_ON_READ table on HDFS)
        String sinkDDL =
                "CREATE TABLE t2(\n" +
                        "\tuuid VARCHAR(20),\n"+
                        "\tid INT NOT NULL,\n" +
                        "\tname VARCHAR(40),\n" +
                        "\tdescription VARCHAR(40),\n" +
                        "\tts TIMESTAMP(3)\n"+
//                        "\t`partition` VARCHAR(20)\n" +
                        ")\n" +
//                        "PARTITIONED BY (`partition`)\n" +
                        "WITH (\n" +
                        "\t'connector' = 'hudi',\n" +
                        "\t'path' = 'hdfs://172.19.28.4:9000/hudi_t4/',\n" +
                        "\t'table.type' = 'MERGE_ON_READ'\n" +
                        ")" ;
        // Simple transformation: generate a uuid and a timestamp for every source row
        String transformSQL =
                "insert into t2 select replace(uuid(),'-',''),id,name,description,now()  from mysql_binlog";

        tableEnv.executeSQL(sourceDDL);
        tableEnv.executeSQL(sinkDDL);
        TableResult result = tableEnv.executeSQL(transformSQL);
        result.print();
        // Note: executeSQL already submits the job, so no extra env.execute() call is needed here.
    }
}

Query Hudi

package name.lijiaqi;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.SqlDialect;
import org.apache.flink.table.api.TableResult;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class ReadHudi {
    public static void main(String[] args) throws Exception {
        EnvironmentSettings fsSettings = EnvironmentSettings.newInstance()
                .useBlinkPlanner()
                .inStreamingMode()
                .build();
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env, fsSettings);

        tableEnv.getConfig().setSqlDialect(SqlDialect.DEFAULT);

        // The same Hudi table definition that the writer job used
        String sourceDDL =
                "CREATE TABLE t2(\n" +
                        "\tuuid VARCHAR(20),\n"+
                        "\tid INT NOT NULL,\n" +
                        "\tname VARCHAR(40),\n" +
                        "\tdescription VARCHAR(40),\n" +
                        "\tts TIMESTAMP(3)\n"+
//                        "\t`partition` VARCHAR(20)\n" +
                        ")\n" +
//                        "PARTITIONED BY (`partition`)\n" +
                        "WITH (\n" +
                        "\t'connector' = 'hudi',\n" +
                        "\t'path' = 'hdfs://172.19.28.4:9000/hudi_t4/',\n" +
                        "\t'table.type' = 'MERGE_ON_READ'\n" +
                        ")" ;
        tableEnv.executeSQL(sourceDDL);
        TableResult result2 = tableEnv.executeSQL("select * from t2");
        result2.print();
        // Note: executeSQL already submits the job, so no extra env.execute() call is needed here.
    }
}

The results are shown below:

[Screenshot: query results]

5、 Flink CDC 2.0 on Hudi

In the previous chapter we built the experiment in code. In this chapter we use the Flink distribution downloaded from the official website directly to set up the experimental environment.

1. Add dependency

Add the following jars under $FLINK_HOME/lib:

  • hudi-flink-bundle_2.11-0.10.0-SNAPSHOT.jar (built from the Hudi master branch with the Flink version changed to 1.13.2)
  • hadoop-mapreduce-client-core-2.7.3.jar (resolves a Hudi ClassNotFoundException)
  • flink-sql-connector-mysql-cdc-2.0.0.jar
  • flink-format-changelog-json-2.0.0.jar
  • flink-sql-connector-kafka_2.11-1.13.2.jar

Note that CDC 2.0 changed the Maven group id when looking for these jars: it is no longer com.alibaba.ververica but com.ververica.


2. Flink SQL CDC on Hudi

Create MySQL CDC table

CREATE TABLE mysql_users (
 id BIGINT PRIMARY KEY NOT ENFORCED ,
 name STRING,
 birthday TIMESTAMP(3),
 ts TIMESTAMP(3)
) WITH (
 'connector' = 'mysql-cdc',
 'hostname' = 'localhost',
 'port' = '3306',
 'username' = 'root',
 'password' = 'dafei1288',
 'server-time-zone' = 'Asia/Shanghai',
 'database-name' = 'test',
 'table-name' = 'users'
);
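For reference, the MySQL-side table that this CDC source maps to could look like the following; this schema is an assumption for illustration, not taken from the original article:

CREATE TABLE users (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(40),
  birthday TIMESTAMP(3) NULL,
  ts TIMESTAMP(3) NULL DEFAULT CURRENT_TIMESTAMP(3)
);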

Create Hudi table

CREATE TABLE hudi_users5(
 id BIGINT PRIMARY KEY NOT ENFORCED,
    name STRING,
    birthday TIMESTAMP(3),
    ts TIMESTAMP(3),
    `partition` VARCHAR(20)
) PARTITIONED BY (`partition`) WITH (
    'connector' = 'hudi',
    'table.type' = 'MERGE_ON_READ',
    'path' = 'hdfs://localhost:9009/hudi/hudi_users5'
);

Modify the configuration so that query results are displayed as a table, and set the checkpoint interval (the Hudi Flink writer only commits data when a checkpoint completes, so checkpointing must be enabled):

set execution.result-mode=tableau;

set execution.checkpointing.interval=10sec;

Insert data

INSERT INTO hudi_users5(id, name, birthday, ts, `partition`) SELECT id, name, birthday, ts, DATE_FORMAT(birthday, 'yyyyMMdd') FROM mysql_users;

Query data

select * from hudi_users5;

Execution results:

[Screenshot: query results]

3. Stuck execution plan


I spent a long time on this problem. On the surface everything looked normal: there were no errors in the log, the CDC part clearly worked and data was being written, but the hoodie_stream_write operator showed no activity and no data was distributed to it. Thanks to Danny Chan from the community, who suggested it might be a checkpoint problem, so I set:

set execution.checkpointing.interval=10sec;

After that, the job finally ran normally:

[Screenshot: job running normally]

At this point, the prototype of the Flink + Hudi lake-warehouse integrated solution is complete.

Reference link

https://blog.csdn.net/weixin_…

https://blog.csdn.net/qq_3709…

https://mp.weixin.qq.com/s/xo…

