Introduction:This paper will introduce hologres’s implementation of ultra-high cardinality UV calculation based on roaringbitmap
Roaringbitmap is a compressed bitmap index. Its data compression and de duplication features are very suitable for UV computing under big data. The main principles are as follows:
- For 32bit number, roaringbitmap will construct 2 ^ 16 buckets, corresponding to the upper 16 bits of 32 bits; The lower 16 bits of 32 bits are mapped to a bit of the corresponding bucket. The capacity of a single barrel is determined by the existing maximum value in the barrel
- Bitmap represents 32 bits with 1 bit, which can greatly compress the data size.
- Bitmap bit operation provides a means for de duplication.
Main idea (T + 1): put the uid results aggregated from all the data of the previous day according to the largest query dimension into the roaringbitmap, and store the roaringbitmap and query dimension in the aggregation result table (millions per day). Then, when querying, we use the powerful column storage calculation of hologres to query the aggregation result table directly according to the query dimension, do or operation on the key roaringbitmap field, remove the duplicate and count the cardinality, then we can get the number of applications UV, count the number of applications, and then calculate the PV, reaching the sub second level query.
Only one fine-grained pre aggregation calculation is required, and only one fine-grained pre aggregation result table is generated. Thanks to the real-time computing power of hologres, the number and space required for precomputation in this scheme achieve low overhead.
Details of UV and PV schemes calculated by hologres
Figure 1 Calculation PV UV process of hologres based on roaringbitmap
1. Create related basic tables
1) Before using roaringbitmap, you need to create roaringbitmap extension. The syntax is as follows. At the same time, this function requires holograms Version 0.10.
CREATE EXTENSION IF NOT EXISTS roaringbitmap;
2) Create table ODS\_ App is a detail source table, which stores a large amount of detailed data (divided by days) of users every day. Its DDL is as follows:
BEGIN; CREATE TABLE IF NOT EXISTS public.ods_app ( uid text, country text, prov text, city text, channel text, operator text, brand text, ip text, click_time text, year text, month text, day text, ymd text NOT NULL ); CALL set_table_property('public.ods_app', 'bitmap_columns', 'country,prov,city,channel,operator,brand,ip,click_time, year, month, day, ymd'); --distribution_ Key set according to the requirements and according to the real-time query requirements of the table, the dimension from which fragmentation can achieve better results CALL set_table_property('public.ods_app', 'distribution_key', 'uid'); --It is used for where filtering conditions, including the complete time field of month, year and day. It is recommended to set it to clustering_ Key and event_ time_ column CALL set_table_property('public.ods_app', 'clustering_key', 'ymd'); CALL set_table_property('public.ods_app', 'event_time_column', 'ymd'); CALL set_table_property('public.ods_app', 'orientation', 'column'); COMMIT;
3) Create table UID\_ Mapping is a uid mapping table, which is used to map UIDs to 32-bit int types.
The roaringbitmap type requires that the user ID must be a 32-bit int type, and the denser the user ID, the better (the user ID should be continuous). However, many user IDs in common business systems or buried points are string types, so uid is used\_ The mapping type builds a mapping table. The mapping table uses the serial type (self incremented 32-bit int) of hologres to realize the automatic management and stable mapping of user mapping.
Note: in this example, this table is written in batches every day. It can be a row table or a column table. There is no big difference. If real-time data is needed (for example, in combination with Flink), row storage table is needed to improve the QPS of real-time join of Flink dimension table.
BEGIN; CREATE TABLE public.uid_mapping ( uid text NOT NULL, uid_int32 serial, PRIMARY KEY (uid) ); --Set uid to clustering_ Key and distribution_ Key is convenient to quickly find its corresponding int32 value CALL set_table_property('public.uid_mapping', 'clustering_key', 'uid'); CALL set_table_property('public.uid_mapping', 'distribution_key', 'uid'); CALL set_table_property('public.uid_mapping', 'orientation', 'row'); COMMIT;
3) Create table DWS\_ App basic aggregation table is used to store the aggregated results on the basic dimension
The basic dimension is the smallest dimension of PV and UV after query. Here, take country, Prov and city as examples to build an aggregation table
begin; create table dws_app( country text, prov text, city text, YMD text not null, -- date field uid32_ Bitmap roaringbitmap, -- UV calculation PV integer, -- PV calculation Primary key (country, Prov, city, YMD) -- Query dimension and time as the primary key to prevent repeated data insertion ); CALL set_table_property('public.dws_app', 'orientation', 'column'); --clustering_ Key and event_ time_ Column is set as the date field to facilitate filtering CALL set_table_property('public.dws_app', 'clustering_key', 'ymd'); CALL set_table_property('public.dws_app', 'event_time_column', 'ymd'); --distribution_ Set key to group by field CALL set_table_property('public.dws_app', 'distribution_key', 'country,prov,city'); end;
2. Update DWS table and ID\_ Mapping table
Find new customers from the uid of the previous day every day (UIDs not in the uid mapping table UID \ _mapping) and insert them into the uid mapping table
WITH --Where YMD = '20210329' represents the data of the previous day user_ids AS ( SELECT uid FROM ods_app WHERE ymd = '20210329' GROUP BY uid ) ,new_ids AS ( SELECT user_ids.uid FROM user_ids LEFT JOIN uid_mapping ON (user_ids.uid = uid_mapping.uid) WHERE uid_mapping.uid IS NULL ) INSERT INTO uid_mapping SELECT new_ids.uid FROM new_ids ;
After updating the uid mapping table, aggregate the data and insert it into the aggregation result table. The main steps are as follows:
- Firstly, the aggregation conditions and corresponding UIDs of the previous day are obtained through the source table inner join uid mapping table\_ int32；
- Then, after performing aggregation operations according to the aggregation conditions, insert the roaringbitmap aggregation result table as the aggregation result of the previous day;
- Aggregation is only required once a day to store a copy of data. The worst number of data pieces is equal to the amount of UV. It is illustrated by a case that the detailed list has an increment of hundreds of millions every day, and only millions of data need to be stored in the aggregation result table every day.
WITH aggregation_src AS( SELECT country, prov, city, uid_int32 FROM ods_app INNER JOIN uid_mapping ON ods_app.uid = uid_mapping.uid WHERE ods_app.ymd = '20210329' ) INSERT INTO dws_app SELECT country ,prov ,city ,'20210329' ,RB_BUILD_AGG(uid_int32) ,COUNT(1) FROM aggregation_src GROUP BY country ,prov ,city ;
3. UV and PV query
When querying, select DWS from summary table\_ In the app, aggregate calculation is performed according to the query dimension, and the bitmap cardinality is queried to obtain the number of users under the condition of group by
--Run the following RB_ For AGG operation query, you can turn off the three-stage aggregation switch first for better performance (off by default) set hg_experimental_enable_force_three_stage_agg=off --You can query the UV and PV of any combination of basic dimensions and any time period SELECT country ,prov ,city ,RB_CARDINALITY(RB_OR_AGG(uid32_bitmap)) AS uv ,sum(1) AS pv FROM dws_app WHERE ymd = '20210329' GROUP BY country ,prov ,city; --Check for a month SELECT country ,prov ,RB_CARDINALITY(RB_OR_AGG(uid32_bitmap)) AS uv ,sum(1) AS pv FROM dws_app WHERE ymd >= '20210301' and ymd <= '20210331' GROUP BY country ,prov; The query is equivalent to SELECT country ,prov ,city ,COUNT(DISTINCT uid) AS uv ,COUNT(1) AS pv FROM ods_app WHERE ymd = '20210329' GROUP BY country ,prov ,city; SELECT country ,prov ,COUNT(DISTINCT uid) AS uv ,COUNT(1) AS pv FROM ods_app WHERE ymd >= '20210301' and ymd <= '20210331' GROUP BY country ,prov;
4. Visual display
UV, PV and are calculated. In most cases, Bi tools need to be used to visually display them in a more intuitive way, because RB needs to be used\_ Cardinality and Rb\_ OR\_ AGG aggregation calculation requires the ability to use Bi’s custom aggregation function. Common Bi with this ability include Apache superset and tableau. The best practices of these two Bi tools will be described below.
4.1 using Apache superset
For how Apache superset interfaces with hologres, please refer toProduct manual。 DWS can be used directly in superset\_ The app table is used as a dataset
And in the dataset, create a separate metrics named UV with the following expression:
Then you can start exploring the data
Of course, you can also create dashborad:
4.2 using tableau
For the way tableau connects to hologres, please refer toProduct manual。 You can use tableau’s pass through function to directly implement user-defined functions. For details, please refer toTableau’s Manual。 After tableau interfaces with hologres, you can create a calculation field with the following expression
RAWSQLAGG_INT("RB_CARDINALITY(RB_OR_AGG(%1))", [Uid32 Bitmap])
Then you can start exploring the data
Of course, you can also create dashborad
Copyright notice:The content of this article is spontaneously contributed by Alibaba cloud real name registered users, and the copyright belongs to the original author. Alibaba cloud developer community does not own its copyright or bear corresponding legal liabilities. Please refer to Alibaba cloud developer community user service agreement and Alibaba cloud developer community intellectual property protection guidelines for specific rules. If you find any content suspected of plagiarism in the community, fill in the infringement complaint form to report. Once verified, the community will immediately delete the content suspected of infringement.