Clickhouse introduction
Clickhouse is a columnar DBMS for online analytical processing (OLAP).
Clickhouse application scenario
- Most requests are for read access
- The data needs to be updated in large batches (greater than 1000 rows), not in a single line; Or there is no update operation at all
- The data is only added to the database, and there is no need to modify it
- When reading data, a large number of rows are extracted from the database, but only a small number of columns are used
- The table is’ wide ‘, that is, it contains a large number of columns
- The query frequency is relatively low (usually hundreds or fewer queries per second per server)
- For simple queries, a delay of about 50 milliseconds is allowed
- The value of the column is a small numeric value and a short string (for example, only 60 bytes per URL)
- High throughput is required when processing a single query (> billions of rows per second per server)
- No transaction required
- Low data consistency requirements
- Only one large table will be queried in each query. Except for a big watch, the rest are small watches
- The query result is significantly smaller than the data source. That is, data can be filtered or aggregated. The returned result does not exceed the memory size of a single server
Clickhouse restrictions
- Real delete and update transactions are not supported (expected to be supported by subsequent versions)
- Secondary indexes are not supported
- With limited SQL support, the join implementation is different
- Window function is not supported
- Metadata management requires manual intervention and maintenance
Environment and objectives
The author uses CentOS 7 to build the environment. First, install a virtual machine, download and install Clickhouse + zookeeper, then clone the virtual machine, and finally complete the configuration and construction of Clickhouse cluster environment.
Installing virtual machines
Virtual machine network settings – bridging

Set up the network and connect to the external network
- Enter the directory of network related configuration files to be modified
cd /etc/sysconfig/network-scripts/
-
ls
Directory foundifcfg-eth**
file

-
vi ifcfg-ens33
set upONBOOT=yes
,:wq
Save and exit

Service network restart ` # (restart network service)
Clickhouse configuration
install
Refer to the official Clickhouse documentation,RPM
Installation package
To add an official repository, you need to confirm the selectiony
enter:
sudo yum install yum-utils
sudo rpm --import https://repo.clickhouse.tech/CLICKHOUSE-KEY.GPG
sudo yum-config-manager --add-repo https://repo.clickhouse.tech/rpm/stable/x86_64
To run the installation command, you need to confirm the selectiony
enter:
sudo yum install clickhouse-server clickhouse-client
start-up
withoutservice
, you can run the following command to start the service in the background:
sudo /etc/init.d/clickhouse-server start
If prompted:Init script is already running
Start with the following command
systemctl start clickhouse-server.service
The log file will be output in/var/log/clickhouse-server/
folder.
If the server does not start, check/etc/clickhouse-server/config.xml
Configuration in.
You can also manually start the server from the console:
clickhouse-server --config-file=/etc/clickhouse-server/config.xml
In this case, the log will be printed to the console, which is convenient in the development process.
If the configuration file is in the current directory, it does not need to be specified——config-file
Parameters. By default, its path is./config.xml
。
Clickhouse supports access restriction settings. They are locatedusers.xml
Documents (andconfig.xml
Sibling directory).
By default, thedefault
Users can access from anywhere without a password. Viewableuser/default/networks
。
After starting the service, you can connect to it using the command line client:
clickhouse-client
By default, thedefault
The user does not carry a password to connect tolocalhost:9000
。 You can also use--host
Parameter to connect to the specified server.
The terminal must use UTF-8 coding.
LAN access
Enable external access configuration
cd /etc/clickhouse-server/
vim config. XML # release < listen_ host>::</listen_ Host > comments
systemctl restart clickhouse-server. Service # restart the service
View all open ports of firewall
firewall-cmd --zone=public --list-ports
Firewall port operation
Firewall CMD -- Query port = 8123 / TCP # query whether the specified port is open. Yes means open; No means not on.
Firewall CMD -- zone = public -- add port = 8123 / TCP -- permanent # open port 8123
Firewall CMD -- zone = public -- remove port = 8123 / TCP -- permanent # close port 8123
Firewall CMD -- reload # reload configuration
Netstat - ANP # query open ports
Turn off firewall
If there are too many ports to be opened and it is troublesome, you can close the firewall and evaluate the security by yourself
systemctl stop firewalld.service
View firewall status
firewall-cmd --state
View listening ports
netstat -lnpt
Installing zookeeper
usewget
Mode installation
-
yum
Mode installationwget
yum -y install wget
- Create / opt folder:
mkdir -p /apps
- Enter the / opt Directory:
cd /apps
- Download zookeeper-3.4.9 tar. gz:
wget https://archive.apache.org/dist/zookeeper/zookeeper-3.4.9/zookeeper-3.4.9.tar.gz
- Unzip zookeeper-3.4.9 tar. gz:
tar -zxvf zookeeper-3.4.9.tar.gz
- Enter / usr / local / services / zookeeper / zookeeper-3.4.9/conf Directory:
cd zookeeper-3.4.9/conf/
- Copy Zoo_ sample. The cfg file is and named zoo cfg:
cp zoo_sample.cfg zoo.cfg
- Edit zoo cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir= /apps/zookeeper-3.4.9/data/zookeeper
dataLogDir= /apps/zookeeper-3.4.9/log/zookeeper
clientPort=2182
autopurge.purgeInterval=0
globalOutstandingLimit=200
server.1=10.10.7.143:2888:3888
server.2=10.10.7.193:2888:3888
server.3=10.10.7.152:2888:3888
- Create related directories
mkdir -p /apps/zookeeper-3.4.9/data/zookeeper
mkdir -p /apps/zookeeper-3.4.9/log/zookeeper
- Set myid
VIM / apps / zookeeper-3.4.9/data/zookeeper/myid # three nodes are configured as 1 / 2 / 3 respectively
- Enter the zookeeper bin directory and start the zookeeper service. Each node needs to be started
./zkServer.sh start
./zkServer. SH status # view node status
Clickhouse high availability cluster solution
Briefly, CKTable engine, there are more than ten in total, and three are recommended here
- MergeTree, is the most advanced engine in CK, with ultra-high performance. Single machine write can reach 50W peak, and query performance is very fast
- ReplicatedMergeTree, based on mergetree, ZK is introduced and replicated
- Distributed, the distributed engine itself does not store data. It can be considered as a view. If it writes, it will throw the request to the node in the cluster (controlled by the algorithm). If it queries, it will help you forward the query and aggregate it back
Viewing and changing host names
Hostnamectl set hostname XXX # change the hostname to XXX
Hostname # view hostname
Cluster configuration
After successful installation, generate the configuration file config. In the / etc / Clickhouse server / directory XML and user XML, custom configuration directory config D and users d,recommendIn config D and users D, and then in config XML and user Introducing config.xml into XML D and users Configuration file under D.
- edit
config.xml
vim /etc/clickhouse-server/config.xml
<include_ from>/etc/clickhouse-server/config. d/metrika. xml</include_ from> <!-- Add this parameter -- >
- Configure cluster nodes
Briefly introduce the configuration keywords, which can be used for referenceClickhouse official website
Each server needs to be specifiedhost
,port
, and optionaluser
,password
,secure
,compression
Parameters:
- Host – remote server address. Can be domain name, IPv4 or IPv6. If the domain name is specified, the service initiates a DNS request at startup, and the request results will be recorded during the operation of the server. If the DNS request fails, the service will not start. If you modify the DNS record, you need to restart the service.
- Port – TCP port for messaging (tcp_port configuration is usually set to 9000). Don’t talk to http_ Port confusion.
- User – the user name used to connect to the remote server. Default value: default. The user must have access to the remote server. Access rights are configured in users XML file. For more information, see the « access rights » section.
- Password – the password used to connect to the remote server. Default: empty string.
- Secure – whether to connect using SSL, set to
true
Port = 9440 should also be set when. The server should also listen for < TCP_ port_ secure>9440</tcp_ port_ Secure > and have the correct certificate. - Compression – whether to use data compression. Default value: true.
Data slice configuration:
-
internal_replication
–true
Write operation selects only one normal copy to write data. Use this scenario if the child table of a distributed table is a replicated table (* replicamergetree). In other words, this is actually to hand over the data replication work to the table itself that actually needs to write data, rather than the distributed table;false
(default) a write operation writes data to all copies. In essence, this means copying data from the distributed table itself. This method is not as good as using replicated tables, because the consistency of the replica is not checked, and the replica data may be somewhat different over time. -
Sharding refers to servers that contain different parts of data (to read all data, you must access all shards).
-
A replica is a server that stores replicated data (to read all data, you can access the data on any replica).
/etc/clickhouse-server/config.d
directories creatingmetrika.xml
/etc/clickhouse-server/config.d/metrika.xml
<yandex>
<!-- CK cluster node -- >
<remote_servers>
<test_ck_cluster>
<!-- Slice 1 -- >
<shard>
<weight>1</weight>
<internal_replication>false</internal_replication>
<replica>
<host>10.10.7.143</host>
<port>9000</port>
<user>default</user>
<password></password>
<compression>true</compression>
</replica>
<replica>
<host>10.10.7.193</host>
<port>9000</port>
<user>default</user>
<password></password>
<compression>true</compression>
</replica>
</shard>
<!-- Slice 2 -- >
<shard>
<weight>1</weight>
<internal_replication>false</internal_replication>
<replica>
<host>10.10.7.152</host>
<port>9000</port>
<user>default</user>
<password></password>
<compression>true</compression>
</replica>
<replica>
<host>10.10.7.101</host>
<port>9000</port>
<user>default</user>
<password></password>
<compression>true</compression>
</replica>
</shard>
</test_ck_cluster>
</remote_servers>
<!-- Zookeeper related configurations -- >
<zookeeper>
<node index="1">
<host>10.10.7.143</host>
<port>2181</port>
</node>
<node index="2">
<host>10.10.7.193</host>
<port>2181</port>
</node>
<node index="3">
<host>10.10.7.152</host>
<port>2181</port>
</node>
</zookeeper>
<macros>
<replica>10.10.7.152</replica> <!-- Current node hostname -- >
</macros>
<networks>
<ip>::/0</ip>
</networks>
<!-- Compression related configuration -- >
<clickhouse_compression>
<case>
<min_part_size>10000000000</min_part_size>
<min_part_size_ratio>0.01</min_part_size_ratio>
<method>lz4</method> <!-- Compression algorithm lz4 compression is faster than zstd and occupies more disk -- >
</case>
</clickhouse_compression>
</yandex>
The differences among the three nodes in the above configuration are:
<macros>
<replica>10.10.7.XXX</replica>
</macros>
Configuration completerestartService.
Cluster configuration verification
select * from system.clusters;

Visualization tools
Official websiteaddress, I use dbaver, you can choose.
Reference connection:
https://clickhouse.tech/docs/en/getting-started/tutorial/
https://clickhouse.tech/docs/en/engines/table-engines/special/distributed/
Questions:
-
internal_replication
bytrue
Time*ReplicaMergeTree
Copy data under each partition cannot be copied?A: port 9009 is used for replication. The firewall needs to be turned on 9009
Port replication
Follow up learning points:
- At present, Rand () is used in the distributed table fragmentation scheme. Learn about other fragmentation schemes later.link