ClickHouse DDL

Time: 2022-06-01

1. database

A database acts as a namespace: it avoids naming conflicts and provides a basis for subsequent data isolation. Every data table must belong to a database. The complete syntax for creating a database is as follows:

CREATE DATABASE IF NOT EXISTS db_name [ENGINE = engine];

The database currently supports five engines, as shown below.

  • Ordinary: the default engine. In most cases we use it without declaring it explicitly. Tables of any engine type can be created in such a database.
  • Dictionary: dictionary engine. A database of this type automatically creates data tables for all data dictionaries.
  • Memory: memory engine, used for temporary data. Tables in such a database live only in memory and never touch disk; their data is cleared when the service restarts.
  • Lazy: log engine. Only table engines of the Log family can be used in a database of this type.
  • MySQL: MySQL engine. A database of this type automatically pulls data from a remote MySQL server and creates tables of the MySQL table engine for it.
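As a sketch of the last item, a MySQL-engine database might be declared as follows; the host, database name, and credentials here are illustrative placeholders, not values from this article:

```sql
-- Sketch: map a remote MySQL database into ClickHouse.
-- Host, database name, and credentials below are placeholders.
CREATE DATABASE db_mysql
ENGINE = MySQL('192.168.1.10:3306', 'shop', 'reader', 'secret');
```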

In most cases, you only need to use the default database engine. For example, execute the following statement to create our first database:

CREATE DATABASE DB_TEST;

The default database engine is essentially a file directory on the physical disk, so after the statement executes, ClickHouse creates a DB_TEST directory for the database under its data path:

# ls -l /var/lib/clickhouse/data/
total 12
drwxr-x--- 2 clickhouse clickhouse 4096 Aug 16 16:44 DB_TEST
drwxr-x--- 2 clickhouse clickhouse 4096 Aug 16 16:45 default
drwxr-x--- 2 clickhouse clickhouse 4096 Aug 16 13:38 system

View database:

show databases;
┌─name────┐
│ DB_TEST │
│ default │
│ system  │
└─────────┘

Switch the database and view the tables under the database:

use DB_TEST;
show tables;

Delete database:

DROP DATABASE [IF EXISTS] db_name;

2. data table

General method to create a table:

CREATE TABLE [IF NOT EXISTS] [db_name.]table_name (
    name1 [type] [DEFAULT|MATERIALIZED|ALIAS expr],
    name2 [type] [DEFAULT|MATERIALIZED|ALIAS expr],
    ...
) ENGINE = engine;
  • If no database name is specified, the table is created in the default database.
  • The ENGINE parameter specifies the table engine, which determines the table's characteristics and how its data is stored and loaded. For example, the Memory table engine is the simplest in ClickHouse: data is kept only in memory and is lost when the service restarts.

Copy the table structure of other tables:

CREATE TABLE [IF NOT EXISTS] [db_name1.]table_name AS [db_name2.]table_name2 [ENGINE = engine];
  • The engine for creating a new table can be different from the engine for the replicated table

Create a table from a SELECT clause: the table structure is derived from the SELECT clause, and the rows returned by the query are written into the new table:

CREATE TABLE [IF NOT EXISTS] [db_name.]table_name ENGINE = engine AS SELECT ...
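As a concrete sketch of this form (the table and column names below are illustrative, not from this article):

```sql
-- Sketch: the new table's structure and its data both come from the SELECT.
-- The source table "hits" and its columns are hypothetical.
CREATE TABLE IF NOT EXISTS hits_2019
ENGINE = TinyLog
AS SELECT ID, URL FROM hits WHERE EventTime >= '2019-01-01';
```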

View table structure: DESC tb_name;

Delete table: DROP TABLE [IF EXISTS] [db_name.]table_name;

3. default value expression

Table fields support three ways of defining a default value expression: DEFAULT, MATERIALIZED, and ALIAS. Whichever form is used, once a field has a default value its data type becomes optional, because ClickHouse infers the type from the default value. If both a data type and a default value expression are defined for a field, the explicitly declared data type prevails. For example:

CREATE TABLE dfv_v1 ( 
    id String,
    c1 DEFAULT 1000,
    c2 String DEFAULT c1
) ENGINE = TinyLog;
INSERT INTO dfv_v1(id) VALUES ('A000');
SELECT c1,c2,toTypeName(c1),toTypeName(c2) from dfv_v1;
┌───c1─┬─c2───┬─toTypeName(c1)─┬─toTypeName(c2)─┐
│ 1000 │ 1000 │ UInt16         │ String         │
└──────┴──────┴────────────────┴────────────────┘

The query result confirms that the priority works as expected: the c1 field is inferred as UInt16 from its default value, while c2 defines both a data type and a default value, so its final type comes from the explicitly declared String.

The three kinds of default value expression also differ from one another, in the following three respects.
(1) Data writing: only DEFAULT fields may appear in an INSERT statement. Neither MATERIALIZED nor ALIAS fields can be assigned explicitly; their values can only be computed. For example, trying to write to a MATERIALIZED field produces the following error.

DB::Exception: Cannot insert column URL, because it is MATERIALIZED column..

(2) Data query: only DEFAULT fields are returned by SELECT *. MATERIALIZED and ALIAS fields do not appear in the result set of a SELECT * query.
(3) Data storage: only DEFAULT and MATERIALIZED fields support persistence. If the table engine supports physical storage (such as the TinyLog table engine), these columns are stored on disk. ALIAS fields do not support persistence; their values are always computed on read and never reach the disk.
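The three behaviors above can be observed side by side with a small table like the following; this is a sketch, and the table and column names are illustrative:

```sql
-- Sketch: one column of each default-expression kind.
CREATE TABLE dfv_v2 (
    id String,
    c1 UInt16 DEFAULT 100,        -- writable, returned by SELECT *, persisted
    c2 UInt16 MATERIALIZED c1 * 2, -- not writable, hidden from SELECT *, persisted
    c3 UInt16 ALIAS c1 * 3         -- not writable, hidden from SELECT *, computed on read
) ENGINE = TinyLog;

INSERT INTO dfv_v2 (id) VALUES ('A001');  -- only id and c1 may appear in the column list
SELECT *, c2, c3 FROM dfv_v2;             -- c2 and c3 must be named explicitly
```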

You can use the alter statement to modify the default values, for example:

ALTER TABLE [db_name.]table MODIFY COLUMN col_name DEFAULT value;

The modification does not affect data already in the table. However, modifying default values is subject to many restrictions: in the MergeTree family of table engines, primary key fields cannot be modified, and some table engines (such as TinyLog) do not support modification at all.

4. temporary table

ClickHouse also has the concept of a temporary table. A temporary table is created by adding the TEMPORARY keyword to a normal CREATE TABLE statement; its complete syntax is as follows:

CREATE TEMPORARY TABLE [IF NOT EXISTS] table_name (
    name1 [type] [DEFAULT|MATERIALIZED|ALIAS expr],
    name2 [type] [DEFAULT|MATERIALIZED|ALIAS expr],
    ...
);

Compared with ordinary tables, temporary tables have the following two special features:

  • Its life cycle is bound to the session, so it supports only the Memory table engine. When the session ends, the table is destroyed;
  • The temporary table does not belong to any database, so there are neither database parameters nor table engine parameters in its table creation statement.

Temporary tables take precedence over normal tables: when two tables have the same name, the temporary table's data is read first.
In daily use of ClickHouse, temporary tables are rarely used deliberately. They are used more internally by ClickHouse, as the carrier of data transfer between cluster nodes.
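The name-shadowing behavior described above can be sketched as follows (table and column names are illustrative):

```sql
-- Sketch: a temporary table shadows a normal table with the same name
-- for the remainder of the session.
CREATE TABLE t1 (id String) ENGINE = TinyLog;
INSERT INTO t1 VALUES ('from normal table');

CREATE TEMPORARY TABLE t1 (id String);
INSERT INTO t1 VALUES ('from temporary table');

SELECT * FROM t1;  -- reads the temporary table's data, not the TinyLog table's
```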

5. partition table

Data partitioning and data sharding are completely different concepts: partitioning is a vertical division of data within a node, while sharding is a horizontal division of data across nodes. Partitioning matters greatly for an OLAP database: with its help, queries can skip irrelevant data directories, improving query performance. Used well, the partition feature also enables a kind of data update, because partitions support delete, replace, and reset operations. If a table is partitioned by month, for instance, its data can be replaced and updated at month granularity.
Useful as partitioning is, not every table engine supports it; at present only the MergeTree family of table engines does. A simple example demonstrates how to use a partitioned table. The partition key is declared with PARTITION BY; the table partition_v1 below uses the date field as the partition key, formatted as year and month:

CREATE TABLE partition_v1 ( 
    ID String,
    URL String,
    EventTime Date
) ENGINE =  MergeTree()
PARTITION BY toYYYYMM(EventTime) 
ORDER BY ID;

INSERT INTO partition_v1 VALUES 
('A000','www.nauu.com','2019-05-01'),
('A001','www.brunce.com','2019-06-02');

SELECT table,partition,path from system.parts WHERE table = 'partition_v1';
┌─table────────┬─partition┬─path────────────────────────────────────────────────────────┐
│ partition_v1 │ 201905   │ /var/lib/clickhouse/data/DB_TEST/partition_v1/201905_1_1_0/ │
│ partition_v1 │ 201906   │ /var/lib/clickhouse/data/DB_TEST/partition_v1/201906_2_2_0/ │
└──────────────┴──────────┴─────────────────────────────────────────────────────────────┘

As shown, after partition_v1 is partitioned by year and month, it currently has two data partitions, each with an independent file directory holding its own data. Designing the partition key well is very important, and the design is usually driven by the table's query patterns. In the example above the table is partitioned by year and month, so subsequent queries can filter by the partition key, for example:

SELECT * FROM  partition_v1 WHERE EventTime ='2019-05-01';

Such a query can use the partition index to skip the June partition directory and load only the May data, improving query performance.

Of course, an unreasonable partition key backfires. Avoid partition keys with too fine a granularity: partitioning by hour, for example, causes the number of partitions to grow sharply and performance to degrade.

6. view

ClickHouse has two kinds of views: normal views and materialized views. A materialized view has independent storage, while a normal view is merely a query proxy. The complete syntax for creating a normal view is as follows:

CREATE VIEW [IF NOT EXISTS] [db_name.]view_name AS SELECT ...

A normal view stores no data; it is only a mapping onto a SELECT query, which simplifies queries and clarifies semantics but does not improve query performance. Suppose a normal view view_tb_v1 is created on the table tb_v1; then the following two SELECT queries are completely equivalent:

-- query the table directly
SELECT * FROM tb_v1;
-- query through the view of tb_v1
SELECT * FROM view_tb_v1;

A materialized view has a table engine, which determines how its data is saved. The complete syntax for creating a materialized view is as follows:

CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db.]table_name [TO[db.]name] [ENGINE = engine] [POPULATE] AS SELECT ...

After the materialized view is created, if new data is written to the source table, the materialized view will be updated synchronously.

The POPULATE modifier determines the initialization strategy for a materialized view:

  • With POPULATE, the data already in the source table is imported as part of the view creation, as if a SELECT INTO had been executed;
  • Without POPULATE, the materialized view is empty after creation and only synchronizes data written to the source table from then on.
  • With TO [db].[table], the data is saved into the specified table, and POPULATE cannot be used. Without the TO [db].[table] syntax, you must specify a table engine for the table that stores the data (that is, the newly created view).
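As a hedged sketch of a materialized view, the following pre-aggregates the partition_v1 table from the earlier example by month; the view name, engine choice, and aggregation are illustrative:

```sql
-- Sketch: a materialized view that pre-aggregates partition_v1 by month.
-- POPULATE imports the rows already in the source table at creation time.
CREATE MATERIALIZED VIEW IF NOT EXISTS view_hits_monthly
ENGINE = SummingMergeTree()
ORDER BY month
POPULATE
AS SELECT
    toYYYYMM(EventTime) AS month,
    count() AS hits
FROM partition_v1
GROUP BY month;
```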

Materialized views currently do not support synchronized deletion: rows deleted from the source table remain in the materialized view. A materialized view is in essence a special data table; SHOW TABLES lists it under the name .inner.view_name. The syntax for deleting a view is: DROP TABLE view_name;

7. data table modification

Currently, only the MergeTree, Merge, and Distributed table engines support ALTER syntax.

  • Append field
ALTER TABLE tb_name ADD COLUMN [IF NOT EXISTS] name [type] [default_expr] [AFTER name_after];
# Add a column at the end of the table
ALTER TABLE testcol_v1 ADD COLUMN OS String DEFAULT 'mac';
# Add a column after a specified column
ALTER TABLE testcol_v1 ADD COLUMN IP String AFTER ID;

For old data already in the table, the newly added field is filled with its default value.

  • Modify data type
ALTER TABLE tb_name MODIFY COLUMN [IF EXISTS] name [type] [default_expr];

Modifying a field's data type essentially calls the corresponding toType conversion function. If the current type is incompatible with the target type, the modification fails. For example, changing an IP field of String type to the IPv4 type works, while trying to convert String to a UInt type raises an error.

  • Modify comments

The syntax for adding comments is as follows:

ALTER TABLE tb_name COMMENT COLUMN [IF EXISTS] name 'some comment';
  • Delete field
ALTER TABLE tb_name DROP COLUMN [IF EXISTS] name;

After a column field is deleted, its data will also be deleted.

  • Move data table
RENAME TABLE [db_name11.]tb_name11 TO [db_name12.]tb_name12,[db_name21.]tb_name21 TO [db_name22.]tb_name22, ...

RENAME changes the name of a data table. If the source database and the target database differ, the table is moved between the two databases; if they are the same, only the table name is changed.

Note that a table can only be moved within a single node: the source database and the target database must reside on the same service node, not on a remote node in the cluster.

  • Empty data table
TRUNCATE TABLE [IF EXISTS] [db_name.]tb_name;

8. basic partition operation

Currently, only the MergeTree family of table engines supports data partitioning.

  • Query partition information

ClickHouse has many built-in system tables for querying its own state. The parts system table is dedicated to querying the partition information of data tables. For example, the following statement retrieves the partition status of the table partition_v2:

SELECT partition_id,name,table,database FROM system.parts WHERE table = 'partition_v2';
┌─partition_id───┬─name───────────┬─table────────┬─database┐
│ 201905         │ 201905_1_1_0_6 │ partition_v2 │ default │
│ 201910         │ 201910_3_3_0_6 │ partition_v2 │ default │
│ 201911         │ 201911_4_4_0_6 │ partition_v2 │ default │
│ 201912         │ 201912_5_5_0_6 │ partition_v2 │ default │
└────────────────┴────────────────┴──────────────┴─────────┘
# partition_v2 currently has four partitions; partition_id or name acts as the partition's primary key, and either value identifies a specific partition.
  • Delete specified partition
ALTER TABLE tb_name DROP PARTITION partition_expr;

# To update the data for the whole of July in partition_v2, first drop the July partition
ALTER TABLE partition_v2 DROP PARTITION 201907;
# then write the new July data again, achieving the update
INSERT INTO partition_v2 VALUES ('A004-update','www.bruce.com', '2019-07-02');
  • Copy partition data

ClickHouse supports copying partition data from table A to table B. This feature is useful for fast data writing, data synchronization between tables, and backup. Its complete syntax is as follows:

ALTER TABLE B REPLACE PARTITION partition_expr FROM A;

Not all tables can copy partitions between each other; two prerequisites must be met:
(1) the two tables have the same partition key;
(2) their table structures are identical.

# Assume the table partition_v2 has exactly the same partition key and structure as the earlier partition_v1
# First write a batch of new August data into partition_v1
INSERT INTO partition_v1 VALUES ('A006-v1','www.v1.com','2019-08-05'),('A007-v1','www.v1.com','2019-08-20');
# Then execute the following statement
ALTER TABLE partition_v2 REPLACE PARTITION 201908 FROM partition_v1;
# This copies the data of partition_v1's entire 201908 partition into partition_v2
  • Reset partition data

If the data in a column of a table is incorrect and needs to be reset to its initial value, use the following statement:

ALTER TABLE tb_name CLEAR COLUMN column_name IN PARTITION partition_expr;

The reset value follows this rule: if a default value expression is declared for the column, that expression prevails; otherwise the default value of the column's data type is used.

  • Detach and attach partitions

A table partition can be detached with the DETACH statement. Detaching does not delete the partition's physical data; the data is moved into the detached subdirectory of the table's directory. Attaching is the reverse operation: it reloads a partition from the detached subdirectory. Together, detach and attach are commonly used for migrating and backing up partition data.

The syntax for detaching a partition is as follows:

ALTER TABLE tb_name DETACH PARTITION partition_expr;

Once a partition has been moved into the detached subdirectory, it is outside ClickHouse's management; ClickHouse will not clean these files up. They remain until we delete them manually or reload them with the ATTACH statement.

The complete syntax for mounting a partition is as follows:

ALTER TABLE tb_name ATTACH PARTITION partition_expr;
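As a sketch, detaching and re-attaching one partition of the partition_v2 table from the earlier examples might look like this (the partition value is illustrative):

```sql
-- Sketch: detach a partition, then re-attach it.
ALTER TABLE partition_v2 DETACH PARTITION 201908;
-- the partition's files now sit under .../partition_v2/detached/ and are not queried
ALTER TABLE partition_v2 ATTACH PARTITION 201908;
```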

9. distributed DDL

ClickHouse supports cluster mode, in which a cluster has one or more nodes. The DDL statements CREATE, ALTER, DROP, RENAME, and TRUNCATE all support distributed execution: a DDL statement executed on any node of the cluster is executed by every node in the same order. This feature is significant; like a batch command, it removes the need to run the same DDL on each node in turn.

Converting an ordinary DDL statement to distributed execution is simple: just add an ON CLUSTER cluster_name declaration. For example, the following statement is broadcast to every node in the ch_cluster cluster:

CREATE TABLE partition_v3 ON CLUSTER ch_cluster( 
    ID String,
    URL String,
    EventTime Date
) ENGINE =  MergeTree()
PARTITION BY toYYYYMM(EventTime)
ORDER BY ID;

10. data writing

The INSERT statement supports three syntax forms, each with its own characteristics; choose among them flexibly according to the writing requirements.

The first is the general syntax using the values format:

INSERT INTO [db.]table [(c1, c2, c3…)] VALUES (v11, v12, v13…), (v21, v22, v23…), ...

Here c1, c2, and c3 are column declarations and may be omitted. VALUES is followed by the data to be written, as tuples whose elements correspond to the declared columns by position. Data can be written in batches, with multiple rows separated by commas.

The VALUES format also supports expressions and functions in the written data, for example:

INSERT INTO partition_v2 VALUES ('A0014',toString(1+2),now());

The second is to use the syntax of the specified format:

INSERT INTO [db.]table [(c1, c2, c3…)] FORMAT format_name data_set

Take the common CSV format as an example:

INSERT INTO partition_v2 FORMAT CSV \
'A0017','www.nauu.com','2019-10-01' \
'A0018','www.nauu.com','2019-10-01'

The third is the syntax in the form of a select clause:

INSERT INTO [db.]table [(c1, c2, c3…)] SELECT ...

Writing data through a SELECT clause also supports expressions and functions. Although both the VALUES and SELECT forms allow them, expressions and functions add extra overhead and degrade write performance. So if you are chasing maximum write throughput, avoid them as much as possible.

All data operations in ClickHouse are block oriented, so an INSERT statement ultimately converts its data into blocks. Because of this, the INSERT is atomic within a single block. By default a block holds at most 1048576 rows (controlled by the max_insert_block_size parameter). That is, if an INSERT statement writes fewer rows than max_insert_block_size, the batch is written atomically: it either fully succeeds or fully fails. Note that this atomicity only applies when the ClickHouse server shapes the data, such as when using the JDBC or HTTP interfaces; max_insert_block_size does not take effect when writing with the CLI client or with an INSERT SELECT clause.

11. data deletion and modification

ClickHouse provides delete and update capabilities through what it calls mutation queries, which can be seen as a variant of the ALTER statement. Although mutations can ultimately modify and delete data, they cannot be understood as UPDATE and DELETE in the usual sense, and we must clearly recognize the differences:

  • First, a mutation is a "heavy" operation, better suited to modifying and deleting data in bulk;
  • Second, it does not support transactions: once a statement is submitted for execution, it takes effect on existing data immediately and cannot be rolled back;
  • Finally, a mutation executes as an asynchronous background process; the statement returns immediately after submission, which does not mean the work has been done. Its progress can be checked through the system.mutations system table.

The complete syntax of the delete statement is as follows:

ALTER TABLE [db_name.]table_name DELETE WHERE filter_expr

Delete process:

  • Each delete operation generates a mutation_id
SELECT database, table, mutation_id, block_numbers.number as num, is_done FROM system.mutations;

┌─database─┬─table────────┬─mutation_id────┬─num──┬─is_done─┐
│ default  │ partition_v2 │ mutation_6.txt │ [6]  │ 1       │
└──────────┴──────────────┴────────────────┴──────┴─────────┘

In the table's root directory, a log file named after the mutation_id is generated; it records the delete statement and its execution time in full. The deletion itself works by rewriting every partition directory of the table into a new directory, whose name is the old name with system.mutations.block_numbers.number appended. Before the delete operation, the partition directory looked like this:

201905_1_1_0

After the delete operation:

201905_1_1_0
# newly generated directory
201905_1_1_0_6
# log file
mutation_6.txt

The data in the 201905_1_1_0 directory is rewritten into the 201905_1_1_0_6 directory, and the rows to be deleted are filtered out during the rewrite.

The old data directory is not deleted immediately but is marked inactive; inactive directories are physically removed the next time the MergeTree engine triggers a merge.

Data modification follows the same logic as deletion, except that specific columns are specified. Its complete syntax is as follows:

ALTER TABLE [db_name.]table_name UPDATE column1 = expr1 [, ...] WHERE filter_expr;

UPDATE supports modifying multiple fields in one statement, but partition key and primary key fields cannot be modified. For example, the following statement modifies the URL and OS fields of partition_v2 according to the WHERE condition:

ALTER TABLE partition_v2 UPDATE URL = 'www.wayne.com',OS = 'mac' WHERE ID IN (SELECT ID FROM partition_v2 WHERE EventTime = '2019-06-01');

View in realizingViewsThere are no special guidelines to follow, althoughMVIKotlinWhat is provided may be useful. existMVIKotlinThere are two related toViewInterface: ViewRenderer Use and render “Models。 ViewEventsProductionEvents。 one leftMviViewinterface, which simply contains bothViewRendererandViewEventsinterface. Usually no direct implementation is requiredMviViewinterface, which can be inherited throughBaseMviViewclass to implement. If usingJetpack Compose, then most likely you don't need to […]