background
Clickhouse does not implement join very well, which is not suitable for large table Association. It advocates the concept of large and wide tables. Because by default, Clickhouse uses the hash join algorithm. Clickhouse takes <right_ Table> and create a hash table for it in RAM. After a certain memory consumption threshold, Clickhouse falls back to the merge join algorithm. Therefore, when Clickhouse is used in the case of a large amount of data, join should be used as little as possible, and field redundancy should be carried out through redundant large wide tables. However, in some scenarios, such as the scenario of dimension completion, we can use the join method when the data in the dimension table is not large. Of course, there will be better methods. Let me talk about them slowly.
Business scenario
Existence fact table: ODS_ device_ Log (equipment submission log), dimension table: product table (dim_product), account tariff configuration table (dim_billing_policy).
A simple requirement is to count the metering data of devices in the billing cycle: the number of reported messages.
ods_ device_ The log table structure is as follows:
`product_ Key` string comment “product key, which is used to obtain the account information”,
`device_ Key` string comment “device key”,
`data_ Time ` datetime comment “data reporting time”,
`log_ Info` string comment “report data content”
The structure of the product table is as follows:
`org_ ID ` Int64 comment “the account ID to which the product belongs”,
`product_ Key` string comment “product key”,
`product_ Name` string comment “product name”
The account tariff configuration table is as follows:
`org_ ID ` Int64 comment “the account ID to which the product belongs”,
`
` policy_ Name` string comment “tariff name”,
`billing_ start_ Time` datetime comment “billing cycle start time”,
`billing_ end_ Time` datetime comment “billing cycle end time”
Through join
The SQL is still relatively simple. The countif function can be used to easily count the measurement data. The SQL is as follows:
select
countIf(a.data_time>c.billing_start_time)
from
ods_device_log a
left join dim_product b on a.product_key = b.product_key
left join dim_billing_policy c on b.org_id =c.org_id
where
a.data_time >= toDateTime(‘2021-12-01 00:00:00’)
and a.data_time <toDateTime(‘2021-12-02 00:00:00’)
GROUP by
a.product_key,
a.device_key;
A better implementation
For dimension tables with small volume, we can replace the join method with an external dictionary. Because join each query will revisit the associated table. You can build a dimension table into an external dictionary table. The dictionary table periodically synchronizes dimension information. Then obtain dimension data through dictionary function. The reference SQL is as follows:
select
countIf(a.data_time > dictGetDateTime(‘dim_billing_policy’,’billing_start_time’,
tuple(dictGetInt64(‘dim_product’,’org_id’,tuple(a.product_key)),’00-00-01′))
) num
from
ods_device_log a
where
a.data_time >= toDateTime(‘2021-12-01 00:00:00’)
and a.data_time <toDateTime(‘2021-12-02 00:00:00’)
GROUP by
a.product_key,
a.device_key;
Description: dictionary table dim_ billing_ Policy primary key is ORG_ id、policy_ code。 Dictionary table dim_ The product primary key is product_ key。 First, through the product_ Key to obtain the account ID, and then obtain the start time of the billing cycle through the account ID and billing policy code.