ClickHouse is a columnar DBMS for online analytical processing (OLAP).
ClickHouse application scenarios
- Most requests are for read access
- Data is updated in large batches (more than 1000 rows) rather than row by row, or is not updated at all
- The data is only added to the database, and there is no need to modify it
- When reading data, a large number of rows are extracted from the database, but only a small number of columns are used
- The table is "wide", that is, it contains a large number of columns
- The query frequency is relatively low (usually hundreds or fewer queries per second per server)
- For simple queries, a delay of about 50 milliseconds is allowed
- The value of the column is a small numeric value and a short string (for example, only 60 bytes per URL)
- High throughput is required when processing a single query (> billions of rows per second per server)
- No transaction required
- Low data consistency requirements
- Each query reads only one large table; all other tables in the query are small
- The query result is significantly smaller than the data source. That is, data can be filtered or aggregated. The returned result does not exceed the memory size of a single server
ClickHouse limitations
- True DELETE and UPDATE transactions are not supported (expected in later versions)
- Secondary indexes are not supported
- SQL support is limited, and the JOIN implementation differs from standard SQL
- Window functions are not supported
- Metadata management requires manual intervention and maintenance
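Several of the scenario points above (wide tables; many rows read but few columns used) follow from columnar storage. The toy Python sketch below (illustrative only, not ClickHouse internals) shows why scanning two columns of a fifty-column table touches a small fraction of the stored values:

```python
# A "wide" table stored column-wise: one list per column.
columns = {f"col_{i}": list(range(1000)) for i in range(50)}

def scan(table, wanted):
    """Read only the requested columns; untouched columns cost nothing."""
    touched = 0
    result = {}
    for name in wanted:
        result[name] = table[name]
        touched += len(table[name])
    return result, touched

# A query over 2 of the 50 columns touches 2,000 values instead of 50,000.
_, touched = scan(columns, ["col_0", "col_7"])
print(touched)  # 2000
```

A row store would have to read every row in full (50,000 values here) even when the query projects only two columns.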
Environment and objectives
The author uses CentOS 7 to build the environment: first install a virtual machine, download and install ClickHouse and ZooKeeper, then clone the virtual machine, and finally complete the configuration of the ClickHouse cluster.
Installing virtual machines
Virtual machine network settings – bridged mode
Set up the network and connect to the external network
- Enter the directory containing the network configuration files to be modified
vi ifcfg-ens33    # edit the settings
:wq    # save and exit
service network restart    # restart the network service
Refer to the official ClickHouse documentation.
To add the official repository, confirm the selection when prompted:
sudo yum install yum-utils
sudo rpm --import https://repo.clickhouse.tech/CLICKHOUSE-KEY.GPG
sudo yum-config-manager --add-repo https://repo.clickhouse.tech/rpm/stable/x86_64
To run the installation command, you need to confirm the selection
sudo yum install clickhouse-server clickhouse-client
After installation, you can run the following command to start the service in the background:
sudo /etc/init.d/clickhouse-server start
If you see "Init script is already running", start the service with the following command instead:
systemctl start clickhouse-server.service
The log file will be output in
If the server does not start, check
You can also manually start the server from the console:
In this case, the log will be printed to the console, which is convenient in the development process.
If the configuration file is in the current directory, you do not need to specify the --config-file parameter. By default, its path is /etc/clickhouse-server/config.xml.
ClickHouse supports access restriction settings; they are configured in users.xml. By default, the default user can access from anywhere without a password.
After starting the service, you can connect to it using the command line client:
By default, the default user connects to localhost:9000 without a password. You can also use the --host parameter to connect to a specified server.
The terminal must use UTF-8 encoding.
Enable external access configuration
cd /etc/clickhouse-server/
vim config.xml    # uncomment <listen_host>::</listen_host>
systemctl restart clickhouse-server.service    # restart the service
View all open ports of firewall
firewall-cmd --zone=public --list-ports
Firewall port operation
firewall-cmd --query-port=8123/tcp    # check whether port 8123 is open ("yes" = open, "no" = closed)
firewall-cmd --zone=public --add-port=8123/tcp --permanent    # open port 8123
firewall-cmd --zone=public --remove-port=8123/tcp --permanent    # close port 8123
firewall-cmd --reload    # reload the configuration
netstat -anp    # list open ports
Turn off firewall
If too many ports need to be opened and this becomes troublesome, you can turn off the firewall entirely, after evaluating the security risk yourself:
systemctl stop firewalld.service
View firewall status
systemctl status firewalld.service
View listening ports
netstat -lnp
yum -y install wget
- Create the /apps folder:
mkdir -p /apps
- Enter the /apps directory:
cd /apps
- Download zookeeper-3.4.9.tar.gz
- Unzip zookeeper-3.4.9.tar.gz:
tar -zxvf zookeeper-3.4.9.tar.gz
- Enter the /apps/zookeeper-3.4.9/conf directory:
cd /apps/zookeeper-3.4.9/conf
- Copy zoo_sample.cfg and name the copy zoo.cfg:
cp zoo_sample.cfg zoo.cfg
- Edit zoo.cfg:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/apps/zookeeper-3.4.9/data/zookeeper
dataLogDir=/apps/zookeeper-3.4.9/log/zookeeper
clientPort=2182
autopurge.purgeInterval=0
globalOutstandingLimit=200
server.1=10.10.7.143:2888:3888
server.2=10.10.7.193:2888:3888
server.3=10.10.7.152:2888:3888
- Create related directories
mkdir -p /apps/zookeeper-3.4.9/data/zookeeper
mkdir -p /apps/zookeeper-3.4.9/log/zookeeper
- Set myid
vim /apps/zookeeper-3.4.9/data/zookeeper/myid    # set to 1, 2, and 3 on the three nodes respectively
- Enter the ZooKeeper bin directory and start the ZooKeeper service on every node:
./zkServer.sh start
./zkServer.sh status    # view node status
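As a sanity check on the zoo.cfg above, the server.N lines define the quorum members. A small Python sketch (a hypothetical helper, not part of ZooKeeper) that maps each myid to its quorum and election addresses:

```python
# The server.N lines from the zoo.cfg configured above.
ZOO_CFG = """\
tickTime=2000
initLimit=10
syncLimit=5
clientPort=2182
server.1=10.10.7.143:2888:3888
server.2=10.10.7.193:2888:3888
server.3=10.10.7.152:2888:3888
"""

def quorum_members(cfg_text):
    """Map each myid to (host, peer_port, election_port)."""
    members = {}
    for line in cfg_text.splitlines():
        key, _, value = line.partition("=")
        if key.startswith("server."):
            myid = int(key.split(".", 1)[1])
            host, peer_port, election_port = value.split(":")
            members[myid] = (host, int(peer_port), int(election_port))
    return members

members = quorum_members(ZOO_CFG)
print(members[1])  # ('10.10.7.143', 2888, 3888)
```

The number written into each node's myid file must match the N of its own server.N line, which is how a node finds itself in this map.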
Clickhouse high availability cluster solution
Briefly, ClickHouse has more than ten table engines in total; three are recommended here:
- MergeTree is the most capable engine in ClickHouse, with very high performance: peak single-machine writes of around 500,000 rows per second, and very fast queries
- ReplicatedMergeTree is based on MergeTree and adds data replication coordinated through ZooKeeper
- Distributed: the distributed engine stores no data itself and can be thought of as a view. On writes, it forwards the request to a node in the cluster (chosen by the sharding algorithm); on queries, it forwards the query to the shards and aggregates the results
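The Distributed engine behavior just described can be modeled in a few lines (a toy Python model, not ClickHouse code): writes go to exactly one shard chosen by a sharding key, and queries fan out to every shard before the partial results are merged.

```python
class Shard:
    """Stands in for a local table on one cluster node."""
    def __init__(self):
        self.rows = []

    def insert(self, row):
        self.rows.append(row)

    def count(self):
        return len(self.rows)

class DistributedTable:
    """Stores nothing itself; routes writes and fans out reads."""
    def __init__(self, shards, shard_key):
        self.shards = shards
        self.shard_key = shard_key  # the "controlled by the algorithm" part

    def insert(self, row):
        # A write goes to exactly one shard, chosen by the sharding key.
        self.shards[self.shard_key(row) % len(self.shards)].insert(row)

    def count(self):
        # A query is forwarded to all shards and the partial results merged.
        return sum(shard.count() for shard in self.shards)

table = DistributedTable([Shard(), Shard()], shard_key=lambda row: row["id"])
for i in range(10):
    table.insert({"id": i})
print(table.count())  # 10 rows in total, spread across both shards
```

The real engine additionally handles replicas, weights, and remote execution, but the view-like split between routing and storage is the same.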
Viewing and changing host names
hostnamectl set-hostname XXX    # change the hostname to XXX
hostname    # view the hostname
After a successful installation, the configuration files config.xml and users.xml are generated in the /etc/clickhouse-server/ directory, along with the custom configuration directories config.d and users.d. The recommendation is to put custom configuration in config.d and users.d, and then reference those files from config.xml and users.xml.
<include_from>/etc/clickhouse-server/config.d/metrika.xml</include_from> <!-- Add this parameter -->
- Configure cluster nodes
A brief introduction to the configuration keywords follows; for reference, see the ClickHouse official website.
For each server, you must specify host and port, and optionally user, password, secure, and compression:
- Host – the address of the remote server: a domain name, or an IPv4 or IPv6 address. If a domain name is specified, the server issues a DNS request at startup and caches the result for as long as it runs. If the DNS request fails, the server does not start. If you change the DNS record, restart the server.
- Port – the TCP port for messaging (the tcp_port setting, usually 9000). Do not confuse it with http_port.
- User – the user name for connecting to the remote server. Default value: default. This user must have access to the remote server. Access rights are configured in the users.xml file; for more information, see the «Access rights» section.
- Password – the password used to connect to the remote server. Default: empty string.
- Secure – whether to use SSL for the connection; when set to true, port should usually be set to 9440. The remote server should also listen on <tcp_port_secure>9440</tcp_port_secure> and have a valid certificate.
- Compression – whether to use data compression. Default value: true.
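To make the parameter set concrete, here is a short Python sketch (standard library only; the element names and defaults mirror the keys described above) that assembles one replica entry:

```python
import xml.etree.ElementTree as ET

def replica_element(host, port=9000, user="default", password="",
                    secure=False, compression=True):
    """Build a <replica> element carrying the connection parameters."""
    replica = ET.Element("replica")
    for tag, value in [("host", host), ("port", port), ("user", user),
                       ("password", password),
                       ("secure", str(secure).lower()),
                       ("compression", str(compression).lower())]:
        ET.SubElement(replica, tag).text = str(value)
    return replica

xml_text = ET.tostring(replica_element("10.10.7.143"), encoding="unicode")
print(xml_text)
```

This is only a template generator; in practice you would paste the resulting fragment into the <shard> sections shown below.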
Shard configuration (internal_replication):
- true – a write operation selects only one healthy replica and writes the data to it. Use this if the underlying tables of the distributed table are replicated tables (*ReplicatedMergeTree); in other words, the replication work is handed to the table that actually stores the data, rather than to the distributed table.
- false (default) – a write operation writes the data to all replicas. In essence, the distributed table replicates the data itself. This is worse than using replicated tables, because replica consistency is not checked, and over time the replicas may diverge slightly.
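The two write paths can be sketched as follows (an illustrative Python stub, not ClickHouse's actual logic):

```python
def replicas_written(replicas, internal_replication):
    """Which replicas the Distributed table writes to directly."""
    if internal_replication:
        # One healthy replica; ReplicatedMergeTree replicates the rest.
        return [replicas[0]]
    # The Distributed table itself writes every copy.
    return list(replicas)

shard = ["10.10.7.143", "10.10.7.193"]
print(replicas_written(shard, internal_replication=True))   # one target
print(replicas_written(shard, internal_replication=False))  # both targets
```

With internal_replication=true the replication path also gets consistency checks from ZooKeeper, which the false path lacks.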
A shard is a server that holds part of the data (to read all the data, you must access every shard).
A replica is a server that stores a copy of the data (to read all the data, it is enough to access any one replica).
<yandex>
    <!-- ClickHouse cluster nodes -->
    <remote_servers>
        <test_ck_cluster>
            <!-- Shard 1 -->
            <shard>
                <weight>1</weight>
                <internal_replication>false</internal_replication>
                <replica>
                    <host>10.10.7.143</host>
                    <port>9000</port>
                    <user>default</user>
                    <password></password>
                    <compression>true</compression>
                </replica>
                <replica>
                    <host>10.10.7.193</host>
                    <port>9000</port>
                    <user>default</user>
                    <password></password>
                    <compression>true</compression>
                </replica>
            </shard>
            <!-- Shard 2 -->
            <shard>
                <weight>1</weight>
                <internal_replication>false</internal_replication>
                <replica>
                    <host>10.10.7.152</host>
                    <port>9000</port>
                    <user>default</user>
                    <password></password>
                    <compression>true</compression>
                </replica>
                <replica>
                    <host>10.10.7.101</host>
                    <port>9000</port>
                    <user>default</user>
                    <password></password>
                    <compression>true</compression>
                </replica>
            </shard>
        </test_ck_cluster>
    </remote_servers>
    <!-- ZooKeeper configuration -->
    <zookeeper>
        <node index="1">
            <host>10.10.7.143</host>
            <port>2181</port>
        </node>
        <node index="2">
            <host>10.10.7.193</host>
            <port>2181</port>
        </node>
        <node index="3">
            <host>10.10.7.152</host>
            <port>2181</port>
        </node>
    </zookeeper>
    <macros>
        <replica>10.10.7.152</replica> <!-- current node -->
    </macros>
    <networks>
        <ip>::/0</ip>
    </networks>
    <!-- Compression settings -->
    <clickhouse_compression>
        <case>
            <min_part_size>10000000000</min_part_size>
            <min_part_size_ratio>0.01</min_part_size_ratio>
            <method>lz4</method> <!-- lz4 compresses faster than zstd but uses more disk -->
        </case>
    </clickhouse_compression>
</yandex>
The differences among the three nodes in the above configuration are:
<macros>
    <replica>10.10.7.XXX</replica>
</macros>
Cluster configuration verification
select * from system.clusters;
The official website lists client tools; the author uses DBeaver, but you can choose your own.
Q: Why is data under the partitions of a *ReplicatedMergeTree table not being replicated?
A: Port 9009 is used for replication; open port 9009 in the firewall.
Follow-up learning points:
- The distributed table sharding scheme currently uses rand(); learn about other sharding schemes later (link)
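The difference between rand() sharding and a key-based scheme can be sketched in Python (illustrative; crc32 stands in for ClickHouse hash functions such as cityHash64): hashing a column guarantees that all rows with the same key land on the same shard, which random sharding does not.

```python
import random
import zlib

NUM_SHARDS = 2

def shard_by_rand(_row, rng=random.Random(42)):
    """rand()-style sharding: even spread, but no key locality."""
    return rng.randrange(NUM_SHARDS)

def shard_by_key(row):
    """Key-based sharding: deterministic hash of one column."""
    return zlib.crc32(row["user"].encode()) % NUM_SHARDS

rows = [{"user": "alice"}, {"user": "alice"}, {"user": "bob"}]
alice_shards = {shard_by_key(r) for r in rows if r["user"] == "alice"}
print(len(alice_shards))  # 1 — same key always lands on the same shard
```

Key-based sharding makes per-key queries hit a single shard, at the cost of possible skew if some keys are much hotter than others.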