Build a ClickHouse high-availability cluster from scratch

Date: 2022-01-11

ClickHouse introduction

ClickHouse is a column-oriented DBMS for online analytical processing (OLAP).

ClickHouse application scenarios

  1. Most requests are for read access
  2. Data is updated in large batches (more than 1000 rows), not single rows; or there are no updates at all
  3. The data is only added to the database, and there is no need to modify it
  4. When reading data, a large number of rows are extracted from the database, but only a small number of columns are used
  5. The table is 'wide', that is, it contains a large number of columns
  6. The query frequency is relatively low (usually hundreds or fewer queries per second per server)
  7. For simple queries, a delay of about 50 milliseconds is allowed
  8. The value of the column is a small numeric value and a short string (for example, only 60 bytes per URL)
  9. High throughput is required when processing a single query (> billions of rows per second per server)
  10. No transaction required
  11. Low data consistency requirements
  12. Each query reads from only one large table; all other tables in the query are small
  13. The query result is significantly smaller than the data source. That is, data can be filtered or aggregated. The returned result does not exceed the memory size of a single server

ClickHouse restrictions

  1. Real delete/update operations are not supported, nor are transactions (expected in later versions)
  2. Secondary indexes are not supported
  3. SQL support is limited, and the join implementation differs from standard SQL
  4. Window functions are not supported
  5. Metadata management requires manual intervention and maintenance

Environment and objectives

This guide uses CentOS 7. First install one virtual machine, download and install ClickHouse + ZooKeeper on it, then clone the virtual machine, and finally complete the configuration of the ClickHouse cluster environment.

Installing virtual machines

Virtual machine network settings – bridging


Virtual machine network settings

Set up the network and connect to the external network

  1. Enter the directory containing the network configuration files:
cd /etc/sysconfig/network-scripts/
  2. Run `ls`; the directory contains an ifcfg-e** file

Network settings directory

  3. Edit it with `vi ifcfg-ens33`, set ONBOOT=yes, then `:wq` to save and exit

ONBOOT=yes
service network restart    # restart the network service
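The ONBOOT edit above can also be scripted. A minimal sketch, run here against a temporary copy so it is safe to try anywhere; on a real host the file is /etc/sysconfig/network-scripts/ifcfg-ens33 and the edit requires root:

```shell
# Flip ONBOOT to yes in an ifcfg-style file (temp copy for illustration).
cfg=$(mktemp)
printf 'DEVICE=ens33\nBOOTPROTO=dhcp\nONBOOT=no\n' > "$cfg"

# The actual edit: replace ONBOOT=no with ONBOOT=yes in place.
sed -i 's/^ONBOOT=no/ONBOOT=yes/' "$cfg"
grep ONBOOT "$cfg"   # -> ONBOOT=yes
```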

ClickHouse configuration

Installation

Following the official ClickHouse documentation, install from the RPM packages.

Add the official repository (confirm the prompts with `y` + Enter):

sudo yum install yum-utils
sudo rpm --import https://repo.clickhouse.tech/CLICKHOUSE-KEY.GPG
sudo yum-config-manager --add-repo https://repo.clickhouse.tech/rpm/stable/x86_64

Run the installation command (confirm the prompts with `y` + Enter):

sudo yum install clickhouse-server clickhouse-client

Startup

If the `service` command is not available, you can start the server in the background with:

sudo /etc/init.d/clickhouse-server start

If you see `Init script is already running`, start it with systemd instead:

systemctl start clickhouse-server.service

Logs are written to the /var/log/clickhouse-server/ directory.

If the server does not start, check the configuration in /etc/clickhouse-server/config.xml.

You can also manually start the server from the console:

clickhouse-server --config-file=/etc/clickhouse-server/config.xml

In this case, the log will be printed to the console, which is convenient in the development process.

If the configuration file is in the current directory, the `--config-file` parameter can be omitted; its default path is ./config.xml.

ClickHouse supports access restrictions. They are configured in users.xml (in the same directory as config.xml).
By default, the `default` user can connect from anywhere without a password; see `user/default/networks`.

After starting the service, you can connect to it using the command line client:

clickhouse-client

By default, it connects as the `default` user without a password to localhost:9000. You can also use the `--host` parameter to connect to a specified server.

The terminal must use UTF-8 coding.

LAN access

Enable external access configuration

cd /etc/clickhouse-server/
vim config.xml    # uncomment <listen_host>::</listen_host>
systemctl restart clickhouse-server.service    # restart the service
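After uncommenting, the relevant line in config.xml looks like this (the IPv6 wildcard `::` also accepts IPv4 connections on most systems):

```xml
<listen_host>::</listen_host>
```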

View all open ports of firewall

firewall-cmd --zone=public --list-ports

Firewall port operation

firewall-cmd --query-port=8123/tcp    # check whether port 8123 is open: yes means open, no means closed

firewall-cmd --zone=public --add-port=8123/tcp --permanent    # open port 8123

firewall-cmd --zone=public --remove-port=8123/tcp --permanent    # close port 8123

firewall-cmd --reload    # reload the configuration

netstat -anp    # list open ports

Turn off firewall

If opening ports one by one is too cumbersome, you can stop the firewall entirely; evaluate the security implications yourself.

systemctl stop firewalld.service

View firewall status

firewall-cmd --state

View listening ports

netstat -lnpt

Installing ZooKeeper

Install via wget

  1. Install wget with yum:
yum -y install wget
  2. Create the /apps directory:
mkdir -p /apps
  3. Enter the /apps directory:
cd /apps
  4. Download zookeeper-3.4.9.tar.gz:
wget https://archive.apache.org/dist/zookeeper/zookeeper-3.4.9/zookeeper-3.4.9.tar.gz
  5. Unpack zookeeper-3.4.9.tar.gz:
tar -zxvf zookeeper-3.4.9.tar.gz
  6. Enter the /apps/zookeeper-3.4.9/conf directory:
cd zookeeper-3.4.9/conf/
  7. Copy zoo_sample.cfg to zoo.cfg:
cp zoo_sample.cfg zoo.cfg
  8. Edit zoo.cfg:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/apps/zookeeper-3.4.9/data/zookeeper
dataLogDir=/apps/zookeeper-3.4.9/log/zookeeper
clientPort=2182    # note: the <zookeeper> section in metrika.xml below uses port 2181; keep the client port consistent across both
autopurge.purgeInterval=0
globalOutstandingLimit=200
server.1=10.10.7.143:2888:3888
server.2=10.10.7.193:2888:3888
server.3=10.10.7.152:2888:3888
  9. Create the data and log directories:
mkdir -p /apps/zookeeper-3.4.9/data/zookeeper
mkdir -p /apps/zookeeper-3.4.9/log/zookeeper
  10. Set myid:
vim /apps/zookeeper-3.4.9/data/zookeeper/myid    # set to 1, 2 and 3 on the three nodes respectively
  11. Enter the ZooKeeper bin directory and start the service on every node:
./zkServer.sh start
./zkServer.sh status    # check node status
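The myid step must produce a different id on each node. A quick local sketch of the idea, simulating the three servers with temp directories (on the real cluster the path is /apps/zookeeper-3.4.9/data/zookeeper/myid):

```shell
# Each simulated node gets exactly one myid file containing its own id.
base=$(mktemp -d)
i=1
for node in node1 node2 node3; do              # stand-ins for the three servers
    mkdir -p "$base/$node/data/zookeeper"
    echo "$i" > "$base/$node/data/zookeeper/myid"
    i=$((i + 1))
done
cat "$base/node2/data/zookeeper/myid"          # -> 2
```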

ClickHouse high-availability cluster solution

A brief note on ClickHouse table engines: there are more than a dozen in total, and three are recommended here

  • MergeTree is the most capable engine in ClickHouse, with very high performance: single-machine writes can peak at around 500,000 rows/s, and queries are very fast
  • ReplicatedMergeTree builds on MergeTree and adds ZooKeeper-based data replication
  • Distributed stores no data itself and can be thought of as a view: writes are routed to nodes in the cluster (controlled by the sharding algorithm), and queries are forwarded to the nodes and the results aggregated
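To make the pairing concrete, here is a hypothetical sketch of the usual pattern: a ReplicatedMergeTree local table plus a Distributed table routed over the test_ck_cluster defined below. The table and database names are made up; the ZooKeeper path uses an explicit shard id because the metrika.xml in this guide only defines a {replica} macro (a {shard} macro would normally be added as well):

```sql
-- Local replicated table; create on every node, with shard01 replaced
-- by the node's own shard id (hypothetical names throughout).
CREATE TABLE default.events_local
(
    event_date Date,
    event_id   UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/shard01/events_local', '{replica}')
ORDER BY (event_date, event_id);

-- Distributed "view" over the cluster; rand() is the sharding scheme
-- mentioned later in this guide.
CREATE TABLE default.events_all AS default.events_local
ENGINE = Distributed(test_ck_cluster, default, events_local, rand());
```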

Viewing and changing host names

hostnamectl set-hostname XXX    # change the hostname to XXX

hostname    # view the hostname

Cluster configuration

After installation, the /etc/clickhouse-server/ directory contains the generated configuration files config.xml and users.xml, plus the custom configuration directories config.d and users.d. It is recommended to put custom settings in config.d and users.d, and then reference those files from config.xml and users.xml.

  1. Edit config.xml
vim /etc/clickhouse-server/config.xml
<include_from>/etc/clickhouse-server/config.d/metrika.xml</include_from> <!-- add this parameter -->
  2. Configure the cluster nodes

A brief introduction to the configuration keywords; see the ClickHouse official documentation for details.

For each server you must specify `host` and `port`, and optionally `user`, `password`, `secure` and `compression`:

  • host – address of the remote server. Can be a domain name, or an IPv4 or IPv6 address. If a domain name is specified, the server makes a DNS request at startup and caches the result while it runs; if the DNS request fails, the server does not start. If you change the DNS record, restart the server.
  • port – TCP port for messaging (tcp_port in the config, usually 9000). Do not confuse it with http_port.
  • user – user name for connecting to the remote server. Default: default. The user must have access to the remote server; access rights are configured in users.xml. For more information, see the "Access rights" section.
  • password – password for connecting to the remote server. Default: empty string.
  • secure – whether to use SSL for the connection; if set to `true`, port should usually be set to 9440 as well. The server must also listen on <tcp_port_secure>9440</tcp_port_secure> and have a valid certificate.
  • compression – whether to use data compression. Default: true.

Data slice configuration:

  • internal_replication – when `true`, a write selects only one healthy replica and writes the data there; use this when the underlying local table of the Distributed table is itself a replicated table (*ReplicatedMergeTree). In other words, replication is delegated to the table that actually stores the data, not to the Distributed table. When `false` (the default), a write sends the data to all replicas; in essence, the Distributed table itself does the copying. This is weaker than using replicated tables, because replica consistency is not checked, and the replicas may drift apart over time.

  • A shard is a group of servers holding a different part of the data (to read all the data, you must access all shards).

  • A replica is a server storing a copy of the data (to read all the data, you can access the data on any one replica).

Create metrika.xml in the /etc/clickhouse-server/config.d directory:

/etc/clickhouse-server/config.d/metrika.xml
<yandex>
<!-- ClickHouse cluster nodes -->
<remote_servers>
    <test_ck_cluster>
        <!-- Shard 1 -->
        <shard>
            <weight>1</weight>
            <internal_replication>false</internal_replication>
            <replica>
                <host>10.10.7.143</host>
                <port>9000</port>
                <user>default</user>
                <password></password>
                <compression>true</compression>
            </replica>
            <replica>
                <host>10.10.7.193</host>
                <port>9000</port>
                <user>default</user>
                <password></password>
                <compression>true</compression>
            </replica>
        </shard>
        <!-- Shard 2 -->
        <shard>
            <weight>1</weight>
            <internal_replication>false</internal_replication>
            <replica>
                <host>10.10.7.152</host>
                <port>9000</port>
                <user>default</user>
                <password></password>
                <compression>true</compression>
            </replica>
            <replica>
                <host>10.10.7.101</host>
                <port>9000</port>
                <user>default</user>
                <password></password>
                <compression>true</compression>
            </replica>
        </shard>
    </test_ck_cluster>
</remote_servers>

<!-- ZooKeeper configuration -->
<zookeeper>
    <node index="1">
        <host>10.10.7.143</host>
        <port>2181</port>
    </node>
    <node index="2">
        <host>10.10.7.193</host>
        <port>2181</port>
    </node>
    <node index="3">
        <host>10.10.7.152</host>
        <port>2181</port>
    </node>
</zookeeper>

<macros>
    <replica>10.10.7.152</replica> <!-- identifier of the current node; differs on each server -->
</macros>

<networks>
    <ip>::/0</ip>
</networks>

<!-- Compression settings -->
<clickhouse_compression>
    <case>
        <min_part_size>10000000000</min_part_size>
        <min_part_size_ratio>0.01</min_part_size_ratio>
        <method>lz4</method> <!-- lz4 compresses faster than zstd but uses more disk -->
    </case>
</clickhouse_compression>
</yandex>

The differences among the three nodes in the above configuration are:

<macros>
  <replica>10.10.7.XXX</replica>
</macros>

After the configuration is complete, restart the service on every node.

Cluster configuration verification

select * from system.clusters;

Cluster information
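To inspect just the cluster defined above, the query can be narrowed to the cluster name from metrika.xml (these system.clusters columns are standard):

```sql
SELECT cluster, shard_num, replica_num, host_address
FROM system.clusters
WHERE cluster = 'test_ck_cluster';
```

Each of the four replicas should appear as one row, grouped by shard_num.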

Visualization tools

See the official website for client options; I use DBeaver, but choose whichever you like.

Reference links:

https://clickhouse.tech/docs/en/getting-started/tutorial/

https://clickhouse.tech/docs/en/engines/table-engines/special/distributed/

Questions:

  1. When internal_replication is `true`, why does data under a *ReplicatedMergeTree shard fail to replicate between replicas?

    A: Port 9009 is used for inter-server replication; it must be opened in the firewall.


    Port replication

Follow-up learning points:

  1. The distributed table currently uses rand() as its sharding scheme; look into other sharding schemes later (link).
