Flink + Hologres: best practice for real-time exact UV deduplication over 100 million users

Time: 2021-06-18

Introduction: This article describes the best practice of using Flink + Hologres with RoaringBitmap to achieve real-time, exact UV deduplication over 100 million users.

UV (unique visitor) and PV (page view) computation is usually divided into two scenarios, depending on business requirements:

  • Offline computing scenario: T+1 batch computation over historical data
  • Real-time computing scenario: real-time computation over each day's new data, with user ID deduplication

For offline computing scenarios, Hologres provides ultra-high-cardinality UV computation based on RoaringBitmap: only the finest-grained pre-aggregation needs to be computed, producing a finest-grained pre-aggregated result table that supports sub-second queries. For details, see the earlier article: How Hologres supports ultra-high-cardinality UV computation (based on RoaringBitmap)

For real-time computing scenarios, you can use Flink + Hologres and deduplicate user IDs in real time with RoaringBitmap. This yields fine-grained, real-time UV and PV data, and makes it easy to adjust the minimum statistical window on demand (for example, the UV of the last 5 minutes), achieving a near-real-time monitoring effect suitable for display on BI dashboards and large screens. Compared with day, week, or month granularity, this is better suited to fine-grained statistics on activity days, while statistics for larger time units can still be obtained by simple aggregation.

Overall approach

  1. Flink converts the streaming data into a table, joins it with a dimension table, and converts the result back into a stream. Efficient uid mapping is achieved by combining the dimension table's insertIfNotExists feature with an auto-increment field.
  2. Flink processes the joined result data by time window, aggregates it by the query dimensions using RoaringBitmap, and stores each query dimension together with the aggregated uids in the aggregation result table, where the aggregated uid result is written into a Hologres roaringbitmap-typed field.
  3. Querying works much like the offline mode: query the aggregation result table directly by the query conditions, OR the roaringbitmap fields together, and count the cardinality to obtain the number of users under those conditions.
  4. The processing flow is shown in the figure below.

[Figure: processing flow]

Best practices

1. Create the relevant base tables

1) Create the table uid_mapping as the uid mapping table, used to map uids to 32-bit int values.

  • The RoaringBitmap type requires the user ID to be a 32-bit int, and the denser the user IDs the better. However, user IDs in most business systems or tracking logs are string or long types, so a uid_mapping table is needed to build the mapping. The mapping table uses the Hologres serial type (auto-incrementing 32-bit int) so that the user mapping is managed automatically and stays stable.
  • Because the data is written in real time, set the table to row-oriented storage to improve the QPS of Flink's real-time dimension-table joins.
BEGIN;
CREATE TABLE public.uid_mapping (
uid text NOT NULL,
uid_int32 serial,
PRIMARY KEY (uid)
);
--Set uid as clustering_key and distribution_key so that its corresponding int32 value can be looked up efficiently
CALL set_table_property('public.uid_mapping', 'clustering_key', 'uid');
CALL set_table_property('public.uid_mapping', 'distribution_key', 'uid');
CALL set_table_property('public.uid_mapping', 'orientation', 'row');
COMMIT;
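
For intuition, the following hypothetical statements sketch what this mapping behaves like at the SQL level (in the actual pipeline, the Flink connector's insertIfNotExists feature performs this lookup-or-insert automatically):

--Hypothetical example: insert a new uid; uid_int32 is assigned automatically by serial.
--Re-inserting the same uid does nothing, so the mapping stays stable.
INSERT INTO uid_mapping (uid) VALUES ('user_abc') ON CONFLICT (uid) DO NOTHING;
SELECT uid_int32 FROM uid_mapping WHERE uid = 'user_abc';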

2) Create the table dws_app as the base aggregation table, used to store the aggregated results on the base dimensions.

  • Before using RoaringBitmap, you need to create the roaringbitmap extension, which requires Hologres instance version 0.10 or above:
CREATE EXTENSION IF NOT EXISTS roaringbitmap;
  • To improve performance, set the shard count reasonably according to the data volume of the base aggregation table, but keep it no larger than the number of cores of the compute resources. It is recommended to set the shard count through a table group, as follows:
--Create a new table group with a shard count of 16
--(the test data volume is in the millions and the backend compute resource is 100 cores, so the shard count is set to 16)
BEGIN;
CREATE TABLE tg16 (a int);                             -- sentinel table for the table group
call set_table_property('tg16', 'shard_count', '16'); 
COMMIT;
  • Compared with the offline result table, this result table adds a timestamp field to support statistics per Flink window period. The DDL is as follows:
BEGIN;
create table dws_app(
  country text,
  prov text,
  city text, 
  ymd text not null,          -- date field
  timetz timestamptz,         -- statistics timestamp, enabling statistics per Flink window period
  uid32_bitmap roaringbitmap, -- RoaringBitmap recording the UVs
  primary key(country, prov, city, ymd, timetz) -- query dimensions and time as the primary key, preventing duplicate data inserts
);
CALL set_table_property('public.dws_app', 'orientation', 'column');
--Set the date field as clustering_key and event_time_column for efficient filtering
CALL set_table_property('public.dws_app', 'clustering_key', 'ymd');
CALL set_table_property('public.dws_app', 'event_time_column', 'ymd');
--Equivalent to placing the table in the table group with shard count 16
call set_table_property('public.dws_app', 'colocate_with', 'tg16');
--Set the GROUP BY fields as distribution_key
CALL set_table_property('public.dws_app', 'distribution_key', 'country,prov,city');
COMMIT;
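
As a quick way to verify the extension and get a feel for roaringbitmap values, a hand-built bitmap can be inspected directly (a hypothetical check, not part of the pipeline):

--Hypothetical check: build a bitmap from an int array; the duplicate 3 is
--deduplicated, so the cardinality is 3
SELECT RB_CARDINALITY(RB_BUILD(ARRAY[1, 2, 3, 3]));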

2. Flink reads data in real time and updates the dws_app base aggregation table

For the complete sample source code, see alibabacloud-hologres-connectors examples.

1) Flink reads the data stream and converts it into a source table

//Here a CSV file is used as the data source; Kafka and other sources work as well
DataStreamSource odsStream = env.createInput(csvInput, typeInfo);
//To join with the dimension table, a proctime field must be added. See https://help.aliyun.com/document_detail/62506.html
Table odsTable =
    tableEnv.fromDataStream(
    odsStream,
    $("uid"),
    $("country"),
    $("prov"),
    $("city"),
    $("ymd"),
    $("proctime").proctime());
//Register to the catalog environment
tableEnv.createTemporaryView("odsTable", odsTable);

2) Join the source table with the Hologres dimension table (uid_mapping)

The dimension table uses the insertIfNotExists parameter: when a lookup finds no row, the row is inserted, and the uid_int32 field is generated automatically by the Hologres serial type.

//Create the Hologres dimension table, where insertIfNotExists means insert when the lookup misses
String createUidMappingTable =
    String.format(
    "create table uid_mapping_dim("
    + "  uid string,"
    + "  uid_int32 INT"
    + ") with ("
    + "  'connector'='hologres',"
    +"'dbname '=% s'" // hologres db name
    +"'tablename '=% s'" // hologres table name
    +"'user name '=% s'" // access ID of current account
    +"'password '=% s'" // access key of current account
    + "  'endpoint' = '%s'," //Hologres endpoint
    + "  'insertifnotexists'='true'"
    + ")",
    database, dimTableName, username, password, endpoint);
tableEnv.executeSql(createUidMappingTable);
//Join source table and dimension table
String odsJoinDim =
    "SELECT ods.country, ods.prov, ods.city, ods.ymd, dim.uid_int32"
    + "  FROM odsTable AS ods JOIN uid_mapping_dim FOR SYSTEM_TIME AS OF ods.proctime AS dim"
    + "  ON ods.uid = dim.uid";
Table joinRes = tableEnv.sqlQuery(odsJoinDim);

3) Convert the join result into a DataStream, process it with a Flink time window, and aggregate with RoaringBitmap
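
The snippet below is a minimal sketch (the conversion call and type declarations are assumptions consistent with the surrounding code) of turning the join result table joinRes into the source DataStream of Tuple5 records consumed by the windowed aggregation that follows:

//Convert the join result table into an append-only DataStream of
//Tuple5<country, prov, city, ymd, uid_int32>; this produces the `source`
//stream used below (a sketch; the exact conversion in the full sample may differ)
DataStream<Tuple5<String, String, String, String, Integer>> source =
    tableEnv.toAppendStream(
        joinRes,
        new TupleTypeInfo<>(
            Types.STRING, Types.STRING, Types.STRING, Types.STRING, Types.INT));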

DataStream<Tuple6<String, String, String, String, Timestamp, byte[]>> processedSource =
    source
    //Key by the dimensions to be counted (country, prov, city, ymd)
    .keyBy(0, 1, 2, 3)
    //Tumbling time window; processing time is used here because the input is simulated from a CSV file; in real use, event time can be used instead
    .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
    //Trigger that emits the (partial) aggregation result before the window ends
    .trigger(ContinuousProcessingTimeTrigger.of(Time.minutes(1)))
    .aggregate(
    //Aggregate function that aggregates within the dimensions selected by keyBy
    new AggregateFunction<
        Tuple5<String, String, String, String, Integer>,
        RoaringBitmap,
        RoaringBitmap>() {
            @Override
            public RoaringBitmap createAccumulator() {
                return new RoaringBitmap();
            }
            @Override
            public RoaringBitmap add(
                Tuple5<String, String, String, String, Integer> in,
                RoaringBitmap acc) {
                //Add the 32-bit uid to the RoaringBitmap for deduplication
                acc.add(in.f4);
                return acc;
            }
            @Override
            public RoaringBitmap getResult(RoaringBitmap acc) {
                return acc;
            }
            @Override
            public RoaringBitmap merge(
                RoaringBitmap acc1, RoaringBitmap acc2) {
                return RoaringBitmap.or(acc1, acc2);
            }
     },
    //Window function that emits the aggregation result
    new WindowFunction<
        RoaringBitmap,
        Tuple6<String, String, String, String, Timestamp, byte[]>,
        Tuple,
        TimeWindow>() {
            @Override
            public void apply(
                Tuple keys,
                TimeWindow timeWindow,
                Iterable<RoaringBitmap> iterable,
                Collector<
                Tuple6<String, String, String, String, Timestamp, byte[]>> out)
                throws Exception {
                RoaringBitmap result = iterable.iterator().next();
                //Optimize the RoaringBitmap's internal representation
                result.runOptimize();
                //Serialize the RoaringBitmap to a byte array for storage in Hologres
                byte[] byteArray = new byte[result.serializedSizeInBytes()];
                result.serialize(ByteBuffer.wrap(byteArray));
                //Tuple6.f4 (timestamp) is the window end, truncated to whole seconds, used as the statistics period
                out.collect(
                    new Tuple6<>(
                        keys.getField(0),
                        keys.getField(1),
                        keys.getField(2),
                        keys.getField(3),
                        new Timestamp(
                            timeWindow.getEnd() / 1000 * 1000),
                        byteArray));
        }
    });

4) Write to the result table

Note that the roaringbitmap type in Hologres corresponds to the byte-array type in Flink.
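
As a sanity check of that correspondence, a byte array produced this way can be deserialized back into a RoaringBitmap (a minimal round-trip sketch using the same org.roaringbitmap API as above):

//Round-trip sketch: serialize a RoaringBitmap to bytes and read it back
RoaringBitmap original = RoaringBitmap.bitmapOf(1, 2, 3);
byte[] bytes = new byte[original.serializedSizeInBytes()];
original.serialize(ByteBuffer.wrap(bytes));
RoaringBitmap restored = new RoaringBitmap();
restored.deserialize(ByteBuffer.wrap(bytes));
//restored.getCardinality() is 3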

//Convert the computed results into a table
Table resTable =
    tableEnv.fromDataStream(
        processedSource,
        $("country"),
        $("prov"),
        $("city"),
        $("ymd"),
        $("timest"),
        $("uid32_bitmap"));
//Create the Hologres result table; the Hologres roaringbitmap column is written as a byte array
String createHologresTable =
    String.format(
        "create table sink("
        + "  country string,"
        + "  prov string,"
        + "  city string,"
        + "  ymd string,"
        + "  timetz timestamp,"
        + "  uid32_bitmap BYTES"
        + ") with ("
        + "  'connector'='hologres',"
        + "  'dbname' = '%s',"
        + "  'tablename' = '%s',"
        + "  'username' = '%s',"
        + "  'password' = '%s',"
        + "  'endpoint' = '%s',"
        + "  'connectionSize' = '%s',"
        + "  'mutatetype' = 'insertOrReplace'"
        + ")",
    database, dwsTableName, username, password, endpoint, connectionSize);
tableEnv.executeSql(createHologresTable);
//Write the computed results into the dws_app table
tableEnv.executeSql("insert into sink select * from " + resTable);

3. Data query

When querying, aggregate over the base aggregation table dws_app with GROUP BY on the query dimensions, then compute the bitmap cardinality to obtain the number of users under those GROUP BY conditions.

  • Query the UV of each city on a given day
--Before running the RB_OR_AGG query below, turn off the three-stage aggregation switch (off by default) for better performance
set hg_experimental_enable_force_three_stage_agg=off;
SELECT  country
        ,prov
        ,city
        ,RB_CARDINALITY(RB_OR_AGG(uid32_bitmap)) AS uv
FROM    dws_app
WHERE   ymd = '20210329'
GROUP BY country
         ,prov
         ,city
;
  • Query the UV of each province in a given time period
--Before running the RB_OR_AGG query below, turn off the three-stage aggregation switch (off by default) for better performance
set hg_experimental_enable_force_three_stage_agg=off;
SELECT  country
        ,prov
        ,RB_CARDINALITY(RB_OR_AGG(uid32_bitmap)) AS uv
FROM    dws_app
WHERE   timetz > '2021-04-19 18:00:00+08' and timetz < '2021-04-19 19:00:00+08'
GROUP BY country
         ,prov
;
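
As noted earlier, results for larger time units can be obtained from the same table by simple aggregation. For example, weekly UV per country (a sketch with assumed dates) is just an RB_OR_AGG across all windows of that week, since OR-ing the bitmaps deduplicates users across windows:

--Roll the fine-grained bitmaps up to a larger time unit: weekly UV per country
SELECT  country
        ,RB_CARDINALITY(RB_OR_AGG(uid32_bitmap)) AS weekly_uv
FROM    dws_app
WHERE   ymd >= '20210419' and ymd <= '20210425'
GROUP BY country
;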
