Storage and use of user tag data in ClickHouse

Date: 2022-05-23

Basic ideas

  • Offline part: import offline data from HDFS into ClickHouse
  • Real-time part: write directly to ClickHouse and use the AggregatingMergeTree table engine to merge the latest data by sorting key
  • Queries: use the FINAL keyword so that only the latest, merged data is read
  • Run OPTIMIZE on a schedule to merge the whole table's data by sorting key (see the sketch after this list)
  • Data delivery: split the conditions submitted by the user, insert the matching data from the different sub-tables into a delivery detail table (AggregatingMergeTree), and finally deduplicate by user_id before delivering the list
  • Audience counting: split the conditions submitted by the user, compute a user_id bitmap from each sub-table, merge the bitmaps, and return the number of users
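A minimal sketch of the query-and-merge pattern from the list above (the tag column tag_a is hypothetical):

-- query only the latest merged data; FINAL merges any remaining parts at read time
SELECT user_id, tag_a
FROM cust_label_table_a FINAL
WHERE tag_a = 'xxx';

-- scheduled job: merge the whole table's data by sorting key
OPTIMIZE TABLE cust_label_table_a FINAL;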

Table structure

The wide table has 800+ fields.

CREATE TABLE cust_label_all_hdfs (
  user_id String COMMENT 'customer account number',
  ……
) ENGINE = HDFS('hdfs://xx.xx.xx.xx:8020/hive/cust_label.db/cust_label_all/stat_dt=2021-09-09/*', 'Parquet')

Sub-table A has 19 fields and ~30 million rows; sub-table B has 13 fields and ~50 million rows.

CREATE TABLE cust_label_table_a (
  user_id String COMMENT 'customer account number',
  ……
) ENGINE = ReplicatedAggregatingMergeTree('/clickhouse/tables/{layer}-{shard}/cust_label/cust_label_d_member', '{replica}')
ORDER BY user_id
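The production DDL above elides most fields. As a hedged sketch, "merge the latest data by sorting key" can be expressed in AggregatingMergeTree with SimpleAggregateFunction columns; the columns tag_a and stat_dt are hypothetical:

CREATE TABLE cust_label_rt_sketch (
  user_id String COMMENT 'customer account number',
  tag_a   SimpleAggregateFunction(anyLast, String),  -- the last written value survives the merge
  stat_dt SimpleAggregateFunction(max, Date)         -- the newest stat date survives the merge
) ENGINE = AggregatingMergeTree()
ORDER BY user_id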

Data delivery

Split the conditions submitted by the user, insert the matching data from the different sub-tables into a delivery detail table (AggregatingMergeTree), and finally deduplicate by user_id before delivering the list; a sketch follows.
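A minimal sketch of the delivery flow; the table delivery_detail, the column job_id, and cust_label_table_b are hypothetical names:

-- collect matching users from each sub-table into the delivery detail table
INSERT INTO delivery_detail (job_id, user_id)
SELECT 101, user_id FROM cust_label_table_a WHERE <condition_a>;

INSERT INTO delivery_detail (job_id, user_id)
SELECT 101, user_id FROM cust_label_table_b WHERE <condition_b>;

-- deduplicate by user_id before delivering the list
SELECT DISTINCT user_id FROM delivery_detail WHERE job_id = 101;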

[Figure: tag library in ClickHouse]

Audience counting

Split the conditions submitted by the user, compute a user_id bitmap from each sub-table, merge the bitmaps, and return the number of users.

select
  bitmapOrCardinality(id_bitmap_a, id_bitmap_b) as user_num
from
  (
    select
      1 as join_id,
      groupBitmapState(user_id) as id_bitmap_a
    from
      table_a
    where
      <condition>
  ) T1
  inner join (
    select
      1 as join_id,
      groupBitmapState(user_id) as id_bitmap_b
    from
      table_b
    where
      <condition>
  ) T2 on T1.join_id = T2.join_id
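The same pattern with bitmapAndCardinality returns the intersection (users matching both conditions). Note that ClickHouse bitmap functions work on unsigned integers, so a String user_id must first be mapped to a numeric id; the toUInt64 cast below assumes the account number is numeric:

select
  -- number of users matching BOTH conditions
  bitmapAndCardinality(id_bitmap_a, id_bitmap_b) as user_num
from
  (
    select 1 as join_id, groupBitmapState(toUInt64(user_id)) as id_bitmap_a
    from table_a where <condition>
  ) T1
  inner join (
    select 1 as join_id, groupBitmapState(toUInt64(user_id)) as id_bitmap_b
    from table_b where <condition>
  ) T2 on T1.join_id = T2.join_id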

Data import

Importing sub-tables into a wide table (deprecated)

The source tables use the HDFS table engine to read the sub-tables in HDFS directly; the target is a single wide table using the AggregatingMergeTree table engine, so data imported from the sub-tables can be merged by sorting key.
Use INSERT INTO ... SELECT statements to run the import.
Use the OPTIMIZE statement to trigger the merge; run SET optimize_throw_if_noop = 1 first so that an exception is raised when the merge cannot be performed.
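A minimal sketch of this sequence; cust_label_table_a_hdfs is a hypothetical HDFS-engine table for sub-table A, and cust_label.cust_label_d_all is the wide table named in the merge result below:

-- import one sub-table's fields into the wide table
INSERT INTO cust_label.cust_label_d_all (user_id, ……)
SELECT user_id, ……
FROM cust_label_table_a_hdfs;

-- make OPTIMIZE throw an exception instead of silently doing nothing
SET optimize_throw_if_noop = 1;
OPTIMIZE TABLE cust_label.cust_label_d_all FINAL;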

Merging the sub-tables into a wide table in Hive is already very time-consuming offline, so the goal here was to avoid importing from the offline wide table and instead let ClickHouse perform the merge.

Sub-table A import result

Elapsed: 1839.742 sec. Processed 29.87 million rows, 8.33 GB (16.23 thousand rows/s., 4.53 MB/s.)

Sub-table B import result

Elapsed: 3731.763 sec. Processed 52.14 million rows, 13.51 GB (13.97 thousand rows/s., 3.62 MB/s.)

Merge result

OPTIMIZE TABLE cust_label.cust_label_d_all FINAL
Query id: 08a04d4d-93cc-4386-8cf9-5012072ff984
Ok.
0 rows in set. Elapsed: 0.002 sec.

The statement completed in only 0.002 seconds, which indicates that no real merge of the wide-table data took place, yet no exception was thrown either. This makes the scheme of importing sub-tables and merging them into a wide table unusable.

Direct sub-table import

The source tables use the HDFS table engine to read the sub-tables in HDFS directly; each target table is a sub-table with the same fields as its HDFS table.
The real-time tag tables use the AggregatingMergeTree table engine so that newly written data can be merged by sorting key.
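A minimal sketch of the direct import, assuming a hypothetical HDFS-engine source table cust_label_table_a_hdfs with the same fields as the target:

-- identical schemas on both sides, so SELECT * maps columns one-to-one
INSERT INTO cust_label_table_a
SELECT * FROM cust_label_table_a_hdfs;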

Sub-table A import result

Elapsed: 172.970 sec. Processed 29.87 million rows, 8.33 GB (172.67 thousand rows/s., 48.18 MB/s.)

Sub-table B import result

Elapsed: 306.156 sec. Processed 52.14 million rows, 13.51 GB (170.29 thousand rows/s., 44.12 MB/s.)

Merge results

Sub-table A elapsed: 51.150 sec
Sub-table B elapsed: 83.499 sec