Flink Doris Connector design

Time: 2021-12-30

Summary

First of all, thanks to the author of the Spark Doris Connector.

From Doris's perspective, bringing its data into Flink makes Doris data available to Flink's rich ecosystem of products, broadens what can be imagined for the product, and makes joint queries between Doris and other data sources possible.

Starting from our business architecture and requirements, we chose Flink as part of our architecture, as the ETL and real-time computing framework for our data. The community already supports the Spark Doris Connector, so we designed and developed the Flink Doris Connector with reference to it.

Technology selection

At the beginning of technology selection, like the Spark Doris Connector, we considered the JDBC approach. It has advantages, but its disadvantages are more obvious, as the Spark Doris Connector article explains. We then read and tested the Spark code and decided to build on the shoulders of giants (note: by copying the Spark Doris Connector code and modifying it).

The following paragraph is from the Spark Doris Connector blog and is quoted directly:

So we developed a new data source for Doris: the Spark Doris Connector. Under this scheme, Doris can expose Doris data and distribute it to Spark. The Spark driver accesses Doris FE to obtain the schema and the underlying data distribution of the Doris table. Then, according to this data distribution, it reasonably assigns the data query tasks to the executors. Finally, the Spark executors access different BEs to run the queries. This greatly improves query efficiency.

Usage

Compile doris-flink-1.0.0-SNAPSHOT.jar in the extension/flink-doris-connector/ directory of the Doris code base, then add this jar to Flink's classpath to use the Flink-on-Doris functionality.

SQL mode

CREATE TABLE flink_doris_source (
    name STRING,
    age INT,
    price DECIMAL(5,2),
    sale DOUBLE
)
WITH (
    'connector' = 'doris',
    'fenodes' = '$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESTFUL_PORT',
    'table.identifier' = '$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME',
    'username' = '$YOUR_DORIS_USERNAME',
    'password' = '$YOUR_DORIS_PASSWORD'
);

CREATE TABLE flink_doris_sink (
    name STRING,
    age INT,
    price DECIMAL(5,2),
    sale DOUBLE
)
WITH (
    'connector' = 'doris',
    'fenodes' = '$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESTFUL_PORT',
    'table.identifier' = '$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME',
    'username' = '$YOUR_DORIS_USERNAME',
    'password' = '$YOUR_DORIS_PASSWORD'
);

INSERT INTO flink_doris_sink SELECT name, age, price, sale FROM flink_doris_source;

DataStream mode

DorisOptions.Builder options = DorisOptions.builder()
                .setFenodes("$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESTFUL_PORT")
                .setUsername("$YOUR_DORIS_USERNAME")
                .setPassword("$YOUR_DORIS_PASSWORD")
                .setTableIdentifier("$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME");
env.addSource(new DorisSourceFunction(options.build(), new SimpleListDeserializationSchema())).print();
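
To put the snippet above into a complete program, here is a minimal runnable sketch. The import paths are assumptions based on the connector's package layout, and the compiled connector jar must be on the classpath.

import org.apache.doris.flink.cfg.DorisOptions;
import org.apache.doris.flink.datastream.DorisSourceFunction;
import org.apache.doris.flink.deserialization.SimpleListDeserializationSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DorisSourceExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Same builder as in the snippet above
        DorisOptions.Builder options = DorisOptions.builder()
                .setFenodes("$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESTFUL_PORT")
                .setUsername("$YOUR_DORIS_USERNAME")
                .setPassword("$YOUR_DORIS_PASSWORD")
                .setTableIdentifier("$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME");

        // Each row arrives as a List of field values and is printed
        env.addSource(new DorisSourceFunction(options.build(), new SimpleListDeserializationSchema()))
           .print();

        env.execute("flink doris source example");
    }
}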

Applicable scenarios


1. Use Flink to jointly analyze data in Doris and other data sources

Many business departments put their data on different storage systems: for example, online analytics and report data in Doris, structured retrieval data in Elasticsearch, transactional data in MySQL, and so on. Businesses often need to analyze across multiple storage sources. After Flink and Doris are connected through the Flink Doris Connector, businesses can use Flink directly to run joint queries and computations over the data in Doris together with multiple external data sources.
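
As an illustration of such a federated query, the sketch below joins a Doris table with a MySQL table inside one Flink SQL job. The table names and schemas (doris_orders, mysql_users) are hypothetical, and it assumes the Flink JDBC connector is also on the classpath.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class DorisMysqlJoinExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.newInstance().build());

        // Doris table, registered through the Flink Doris Connector
        tEnv.executeSql(
            "CREATE TABLE doris_orders (user_id INT, amount DOUBLE) WITH ("
            + " 'connector' = 'doris',"
            + " 'fenodes' = '$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESTFUL_PORT',"
            + " 'table.identifier' = 'example_db.orders',"
            + " 'username' = '$YOUR_DORIS_USERNAME',"
            + " 'password' = '$YOUR_DORIS_PASSWORD')");

        // MySQL dimension table, registered through the Flink JDBC connector
        tEnv.executeSql(
            "CREATE TABLE mysql_users (user_id INT, user_name STRING) WITH ("
            + " 'connector' = 'jdbc',"
            + " 'url' = 'jdbc:mysql://localhost:3306/example_db',"
            + " 'table-name' = 'users',"
            + " 'username' = 'root',"
            + " 'password' = '******')");

        // Joint query across Doris and MySQL
        tEnv.executeSql(
            "SELECT u.user_name, SUM(o.amount) AS total"
            + " FROM doris_orders o JOIN mysql_users u ON o.user_id = u.user_id"
            + " GROUP BY u.user_name").print();
    }
}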

2. Real-time data ingestion

Before the Flink Doris Connector: for irregular business data, the messages usually had to be standardized first, for example by filtering out null values, and written to a new topic, after which a Routine Load job was started to write them into Doris.


After the Flink Doris Connector: Flink reads Kafka and writes directly into Doris.
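
A sketch of this simplified path: read a Kafka topic, filter null values in SQL, and insert into the Doris table defined in the SQL mode example earlier. The topic name, schema, and Kafka options are hypothetical placeholders.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class KafkaToDorisExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Kafka source: raw business messages
        tEnv.executeSql(
            "CREATE TABLE kafka_source (name STRING, age INT, price DECIMAL(5,2), sale DOUBLE) WITH ("
            + " 'connector' = 'kafka',"
            + " 'topic' = 'business_events',"
            + " 'properties.bootstrap.servers' = 'localhost:9092',"
            + " 'properties.group.id' = 'doris_loader',"
            + " 'scan.startup.mode' = 'latest-offset',"
            + " 'format' = 'json')");

        // Doris sink: same definition as flink_doris_sink above
        tEnv.executeSql(
            "CREATE TABLE flink_doris_sink (name STRING, age INT, price DECIMAL(5,2), sale DOUBLE) WITH ("
            + " 'connector' = 'doris',"
            + " 'fenodes' = '$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESTFUL_PORT',"
            + " 'table.identifier' = '$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME',"
            + " 'username' = '$YOUR_DORIS_USERNAME',"
            + " 'password' = '$YOUR_DORIS_PASSWORD')");

        // Cleaning (here: null filtering) and loading happen in one statement
        tEnv.executeSql(
            "INSERT INTO flink_doris_sink"
            + " SELECT name, age, price, sale FROM kafka_source WHERE name IS NOT NULL");
    }
}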

Technical implementation

Architecture diagram


Doris exposes more capabilities externally

Doris FE
It has opened interfaces for obtaining the metadata of internal tables, the query plan of a single table, and some statistics.

All REST API endpoints require HTTP Basic authentication. The username and password are the ones used to log in to the database; pay attention to assigning permissions correctly.

// Get the table schema metadata
GET api/{database}/{table}/_schema

// Get the query plan for a single table
POST api/{database}/{table}/_query_plan
{
    "sql": "select k1, k2 from {database}.{table}"
}

// Get the table size
GET api/{database}/{table}/_count
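
For example, the schema endpoint can be called with plain JDK classes as in the sketch below; the database and table names (example_db, example_table) are placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class FetchDorisSchema {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESTFUL_PORT"
                + "/api/example_db/example_table/_schema");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        // HTTP Basic authentication with the database username and password
        String auth = Base64.getEncoder()
                .encodeToString("$YOUR_DORIS_USERNAME:$YOUR_DORIS_PASSWORD".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + auth);

        // Print the JSON schema response
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}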
Doris BE
Through the Thrift protocol, it directly provides the ability to filter, scan, and prune data.

service TDorisExternalService {
    // Initialize the query executor
    TScanOpenResult open_scanner(1: TScanOpenParams params);

    // Fetch data in streaming batches, in the Apache Arrow format
    TScanBatchResult get_next(1: TScanNextBatchParams params);

    // End the scan
    TScanCloseResult close_scanner(1: TScanCloseParams params);
}
For the definitions of the related Thrift structures, please refer to:

https://github.com/apache/incubator-doris/blob/master/gensrc/thrift/DorisExternalService.thrift
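
To make the scan flow concrete, here is a sketch of the open_scanner / get_next / close_scanner sequence as the generated Java Thrift stubs would expose it. The BE port, transport choice, and field accessor names are assumptions read off the referenced .thrift file rather than verified API, and the Arrow decoding of each batch is omitted.

import org.apache.doris.thrift.TDorisExternalService;
import org.apache.doris.thrift.TScanBatchResult;
import org.apache.doris.thrift.TScanCloseParams;
import org.apache.doris.thrift.TScanNextBatchParams;
import org.apache.doris.thrift.TScanOpenParams;
import org.apache.doris.thrift.TScanOpenResult;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;

public class BeScanSketch {
    public static void main(String[] args) throws Exception {
        // Connect to one BE; host and port come from the partition info returned by FE
        TSocket socket = new TSocket("$YOUR_DORIS_BE_HOSTNAME", 9060);
        socket.open();
        TDorisExternalService.Client client =
                new TDorisExternalService.Client(new TBinaryProtocol(socket));

        // 1. Open a scanner; the params carry database, table, tablet ids and
        //    the opaqued query plan obtained from FE's _query_plan endpoint (omitted)
        TScanOpenParams openParams = new TScanOpenParams();
        TScanOpenResult openResult = client.open_scanner(openParams);

        // 2. Pull the first Apache Arrow encoded batch; real readers loop,
        //    advancing the offset by the rows decoded, until isEos() is true
        TScanNextBatchParams nextParams = new TScanNextBatchParams();
        nextParams.setContext_id(openResult.getContext_id());
        nextParams.setOffset(0);
        TScanBatchResult batch = client.get_next(nextParams);
        System.out.println("eos=" + batch.isEos());

        // 3. Close the scanner and release BE resources
        TScanCloseParams closeParams = new TScanCloseParams();
        closeParams.setContext_id(openResult.getContext_id());
        client.close_scanner(closeParams);
        socket.close();
    }
}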

Implement DataStream
DorisSourceFunction inherits from org.apache.flink.streaming.api.functions.source.RichSourceFunction as a custom Doris source function; during initialization, it obtains the query plan of the relevant table and the corresponding partitions.

Override the run method to read data from the partitions in a loop.

public void run(SourceContext sourceContext) {
    // Read each partition in turn
    for (PartitionDefinition partition : dorisPartitions) {
        scalaValueReader = new ScalaValueReader(partition, settings);
        while (scalaValueReader.hasNext()) {
            Object next = scalaValueReader.next();
            sourceContext.collect(next);
        }
    }
}
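
Since this is a SourceFunction, cancel must also be implemented. A minimal sketch: a volatile flag that the loops in run above would additionally check, so the job can shut down cleanly.

private volatile boolean running = true;

public void cancel() {
    // Signal the read loops in run() to exit
    running = false;
}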

Implement Flink SQL on Doris
Referring to Flink's custom source & sink mechanism and the Flink JDBC connector, the following is achieved: Flink SQL can operate on Doris tables directly, for both reading and writing.

Implementation details

1. Implement DynamicTableSourceFactory and DynamicTableSinkFactory to register the Doris connector (a sketch follows this list).

2. Customize DynamicTableSource and DynamicTableSink to generate the logical plan.

3. After the logical plan is obtained, DorisRowDataInputFormat and DorisDynamicOutputFormat begin to execute.
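
As a sketch of step 1, a source factory for the 'doris' identifier could look roughly like the following. The option set is limited to those used earlier, the construction of the actual table source is stubbed out, and in a real connector the factory is additionally registered for SPI discovery via a META-INF/services/org.apache.flink.table.factories.Factory file.

import java.util.HashSet;
import java.util.Set;
import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;
import org.apache.flink.table.connector.source.DynamicTableSource;
import org.apache.flink.table.factories.DynamicTableSourceFactory;
import org.apache.flink.table.factories.FactoryUtil;

public class DorisDynamicTableSourceFactorySketch implements DynamicTableSourceFactory {

    private static final ConfigOption<String> FENODES =
            ConfigOptions.key("fenodes").stringType().noDefaultValue();
    private static final ConfigOption<String> TABLE_IDENTIFIER =
            ConfigOptions.key("table.identifier").stringType().noDefaultValue();
    private static final ConfigOption<String> USERNAME =
            ConfigOptions.key("username").stringType().noDefaultValue();
    private static final ConfigOption<String> PASSWORD =
            ConfigOptions.key("password").stringType().noDefaultValue();

    @Override
    public String factoryIdentifier() {
        // Matches 'connector' = 'doris' in the WITH clause
        return "doris";
    }

    @Override
    public Set<ConfigOption<?>> requiredOptions() {
        Set<ConfigOption<?>> options = new HashSet<>();
        options.add(FENODES);
        options.add(TABLE_IDENTIFIER);
        return options;
    }

    @Override
    public Set<ConfigOption<?>> optionalOptions() {
        Set<ConfigOption<?>> options = new HashSet<>();
        options.add(USERNAME);
        options.add(PASSWORD);
        return options;
    }

    @Override
    public DynamicTableSource createDynamicTableSource(Context context) {
        // Validate the WITH options; the real connector then constructs
        // and returns its DynamicTableSource implementation here
        FactoryUtil.TableFactoryHelper helper = FactoryUtil.createTableFactoryHelper(this, context);
        helper.validate();
        throw new UnsupportedOperationException("table source construction omitted in this sketch");
    }
}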

The most important pieces of the implementation are DorisRowDataInputFormat and DorisDynamicOutputFormat, customized on the basis of RichInputFormat and RichOutputFormat.

In DorisRowDataInputFormat, the dorisPartitions obtained earlier are divided into multiple splits in createInputSplits for parallel computation.

public DorisTableInputSplit[] createInputSplits(int minNumSplits) {
    List<DorisTableInputSplit> dorisSplits = new ArrayList<>();
    int splitNum = 0;
    // One split per Doris partition, so splits can be read in parallel
    for (PartitionDefinition partition : dorisPartitions) {
        dorisSplits.add(new DorisTableInputSplit(splitNum++, partition));
    }
    return dorisSplits.toArray(new DorisTableInputSplit[0]);
}

public RowData nextRecord(RowData reuse) {
    if (!hasNext) {
        // All data has been read; signal end of input
        return null;
    }
    List next = (List) scalaValueReader.next();
    GenericRowData genericRowData = new GenericRowData(next.size());
    for (int i = 0; i < next.size(); i++) {
        genericRowData.setField(i, next.get(i));
    }
    // Check whether more rows remain
    hasNext = scalaValueReader.hasNext();
    return genericRowData;
}
In DorisDynamicOutputFormat, data is written to Doris via Stream Load. The Stream Load code references org.apache.doris.plugin.audit.DorisStreamLoader (a sketch of the underlying HTTP call follows the code below).

public void writeRecord(RowData row) throws IOException {
    // Stream Load's default column separator is \t
    StringJoiner value = new StringJoiner("\t");
    GenericRowData rowData = (GenericRowData) row;
    for (int i = 0; i < row.getArity(); ++i) {
        value.add(rowData.getField(i).toString());
    }
    // Write the row to Doris via Stream Load
    DorisStreamLoad.LoadResponse loadResponse = dorisStreamLoad.loadBatch(value.toString());
    System.out.println(loadResponse);
}
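
For reference, the core of what a DorisStreamLoader-style helper does is a single authenticated HTTP PUT against the _stream_load endpoint. The sketch below shows the shape of that call with placeholder host, database, and table names; a real implementation must also follow FE's redirect to a BE and parse the JSON response body, which are omitted here.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;
import java.util.UUID;

public class StreamLoadSketch {
    public static void main(String[] args) throws Exception {
        String loadUrl = "http://$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESTFUL_PORT"
                + "/api/$YOUR_DORIS_DATABASE_NAME/$YOUR_DORIS_TABLE_NAME/_stream_load";
        HttpURLConnection conn = (HttpURLConnection) new URL(loadUrl).openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);

        // Stream Load authenticates with HTTP Basic, like the FE REST API
        String auth = Base64.getEncoder()
                .encodeToString("$YOUR_DORIS_USERNAME:$YOUR_DORIS_PASSWORD".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + auth);

        // A unique label makes the load retryable without duplicating data
        conn.setRequestProperty("label", "flink_connector_" + UUID.randomUUID());

        // One tab-separated row, matching the \t separator used in writeRecord above
        byte[] body = "doris\t18\t12.34\t56.78\n".getBytes("UTF-8");
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body);
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}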