Orzclick: write a Clickhouse client for National Day

Time:2021-9-17

cause

I think Clickhouse has a C + + client (Clickhouse CPP), and I used PHP CPP to write extensions, so I wrote it on the national dayOrzClick, a Clickhouse client for PHP.

What’s more embarrassing is that I found seasclick in the middle of writing. It is also a binding of Clickhouse CPP, and it was written in C. I feel that I have lost half with PHP CPP, so my small goal is to surpass seasclick in performanceOrzclick: write a Clickhouse client for National Day

performance testing

Select results:

Orzclick: write a Clickhouse client for National Day

  • Using PDO to access the MySQL interface of Clickhouse can query a small amount of data with better performanceOrzclick: write a Clickhouse client for National Day
  • When there is a small amount of data, the performance of orzclick and seasclick is similar. When there is a large amount of data, orzclick > seasclick > MySQL interfaceOrzclick: write a Clickhouse client for National Day

Insert result:

Orzclick: write a Clickhouse client for National Day

  • Orzclick indexed refers to seasclick, which is the closest API (see code:1 2), it’s a small goalOrzclick: write a Clickhouse client for National Day
  • Both seasclick and orzclick have APIs to improve the insertion performance. The startwrite write endwrite performance of seasclick is very good (seasclick block in the figure). The insertcolumnar of orzclick can only exceed it when the data volume is greater than 5000 (orzclick columnar in the figure)Orzclick: write a Clickhouse client for National Day

Which Clickhouse CPP?

Search GitHub for Clickhouse CPP and you will find two similar libraries:

seeLICENSEAnd developers, you can know that the official of Clickhouse is fork. After a brief comparison of the code, the bottom layers of the two are still the same, but there is a small difference in functional features.

Orzclick uses the fork of Clickhouse / Clickhouse CPP, while seasclick is the fork of artpaul / Clickhouse CPP, so we are still homologous, and the performance difference is reflected in the use mode and patch.

Optimization of seasclick

The data insertion interface of Clickhouse CPP is very simple. There is only one entry method:

void Insert(const std::string& table_name, const Block& block);

Seasclick splits it into:

void InsertQuery(const std::string& query, SelectCallback cb);
void InsertData(const Block& block);
void InsertDataEnd();

This split is very helpful for performance improvement, expansion and Implementation:

  • InsertQueryYou can get the field type information and simplify the use of PHP interface. Unlike orzclick, you need to specify the field type
  • InsertQuery+Many timesInsertData + InsertDataEndContinuous insertion can be realized, and the performance is greatly improved (see seasclick block on the figure)

Optimization of orzclick

Data access mode

Clickhouse is a column storage database, and its interface also uses the same design. A select will return multiple blocks. There are multiple columns in the block. The data in one column is stored continuously, and the columns are independent of each other.

The application layer uses data mainly by behavior, so here we need to reorganize the data and convert column data into row data. Seasclick is processed by row, while orzclick is processed by column, which is one of the main differences between the two.

Seasclick traversal mode

The implementation of seasclick is similar to this:

For (auto I = 0; I < block. Getrowcount(); I + +) {/ / outer layer traverses by line

The implementation of orzclick is similar to this:

For (auto, I = 0; I < block. Getcolumncount(); I + +) {/ / outer layer traverses by column

By comparison, we can see that the inner loop of seasclick will have a large number of switch branch jumps, while orzclick

The type is determined in the outer layer, and the inner layer loop is very compact without redundant branches.

Through perf stat analysis, the number of branches and branch misses of seasclick are more than twice that of orzclick:

# perf stat php select-orzclick.php 1000 1000

 Performance counter stats for 'php select-orzclick.php 1000 1000':

        496.85 msec task-clock:u              #    0.340 CPUs utilized
             0      context-switches:u        #    0.000 K/sec
             0      cpu-migrations:u          #    0.000 K/sec
         1,977      page-faults:u             #    0.004 M/sec
 1,761,248,425      cycles:u                  #    3.545 GHz
 2,601,973,475      instructions:u            #    1.48  insn per cycle
   487,402,260      branches:u                #  980.986 M/sec
     2,879,008      branch-misses:u           #    0.59% of all branches

# perf stat php select-seasclick.php 1000 1000

 Performance counter stats for 'php select-seasclick.php 1000 1000':

        896.48 msec task-clock:u              #    0.482 CPUs utilized
             0      context-switches:u        #    0.000 K/sec
             0      cpu-migrations:u          #    0.000 K/sec
         1,962      page-faults:u             #    0.002 M/sec
 3,316,728,038      cycles:u                  #    3.700 GHz
 6,019,365,862      instructions:u            #    1.81  insn per cycle
 1,316,036,409      branches:u                # 1468.000 M/sec           (2.7x)
    10,073,424      branch-misses:u           #    0.77% of all branches (3.4x)

Therefore, in the select test, orzclick is only slightly better than seasclick when the amount of data is small, but the performance gap is widened when the amount of data is large.

Of course, there is also an adverse situation that degenerates to orzclick, that is, Clickhouse returns multiple blocks, but each block has only one row. At present, this is only found in the memory engine.

TCP_NODELAY

During the test, it is found that a small amount of data is slower, which is the difference of one byte:

$ time php insert-orzclick.php 8170 100

real    0m3.894s
user    0m0.030s
sys 0m0.061s

$ time php insert-orzclick.php 8171 100

real    0m0.422s
user    0m0.050s
sys 0m0.022s

Looking at the Clickhouse log, it took about 40ms to process a small amount of data (big guys probably guessed it when they saw 40ms).

Comparing the two flame diagrams, although the total execution time is different, the proportion of various functions is close, and the big head is_zend_hash_find_known_hash:

Orzclick: write a Clickhouse client for National Day

Orzclick: write a Clickhouse client for National Day

Is the problem really PHP? I removed the call of Clickhouse CPP and found that the execution time of the two cases was basically the same, which ruled out the possibility of PHP. The problem should be Clickhouse CPP.

Then trace with strace. It is found that when there is little data, there is only one send system call, and when there is more data, it will be divided into two:

# 8170
sendto(3, "
# 8170
sendto(3, "\2\0\1\0\2\377\377\377\377\0\1\352?\2u8"..., 8192, MSG_NOSIGNAL, NULL, 0) = 8192
# 8171
sendto(3, "\2\0\1\0\2\377\377\377\377\0\1\353?\2u8"..., 22, MSG_NOSIGNAL, NULL, 0) = 22
sendto(3, "\1\2\3\4\5\6\7\10\t\n\v\f\r\16\17\20"..., 8171, MSG_NOSIGNAL, NULL, 0) = 8171
# 8170
sendto(3, "\2\0\1\0\2\377\377\377\377\0\1\352?\2u8"..., 8192, MSG_NOSIGNAL, NULL, 0) = 8192
# 8171
sendto(3, "\2\0\1\0\2\377\377\377\377\0\1\353?\2u8"..., 22, MSG_NOSIGNAL, NULL, 0) = 22
sendto(3, "\1\2\3\4\5\6\7\10\t\n\v\f\r\16\17\20"..., 8171, MSG_NOSIGNAL, NULL, 0) = 8171
7777
# 8170
sendto(3, "\2\0\1\0\2\377\377\377\377\0\1\352?\2u8"..., 8192, MSG_NOSIGNAL, NULL, 0) = 8192
# 8171
sendto(3, "\2\0\1\0\2\377\377\377\377\0\1\353?\2u8"..., 22, MSG_NOSIGNAL, NULL, 0) = 22
sendto(3, "\1\2\3\4\5\6\7\10\t\n\v\f\r\16\17\20"..., 8171, MSG_NOSIGNAL, NULL, 0) = 8171
2?u8"..., 8192, MSG_NOSIGNAL, NULL, 0) = 8192 # 8171 sendto(3, "
# 8170
sendto(3, "\2\0\1\0\2\377\377\377\377\0\1\352?\2u8"..., 8192, MSG_NOSIGNAL, NULL, 0) = 8192
# 8171
sendto(3, "\2\0\1\0\2\377\377\377\377\0\1\353?\2u8"..., 22, MSG_NOSIGNAL, NULL, 0) = 22
sendto(3, "\1\2\3\4\5\6\7\10\t\n\v\f\r\16\17\20"..., 8171, MSG_NOSIGNAL, NULL, 0) = 8171
# 8170
sendto(3, "\2\0\1\0\2\377\377\377\377\0\1\352?\2u8"..., 8192, MSG_NOSIGNAL, NULL, 0) = 8192
# 8171
sendto(3, "\2\0\1\0\2\377\377\377\377\0\1\353?\2u8"..., 22, MSG_NOSIGNAL, NULL, 0) = 22
sendto(3, "\1\2\3\4\5\6\7\10\t\n\v\f\r\16\17\20"..., 8171, MSG_NOSIGNAL, NULL, 0) = 8171
7777
# 8170
sendto(3, "\2\0\1\0\2\377\377\377\377\0\1\352?\2u8"..., 8192, MSG_NOSIGNAL, NULL, 0) = 8192
# 8171
sendto(3, "\2\0\1\0\2\377\377\377\377\0\1\353?\2u8"..., 22, MSG_NOSIGNAL, NULL, 0) = 22
sendto(3, "\1\2\3\4\5\6\7\10\t\n\v\f\r\16\17\20"..., 8171, MSG_NOSIGNAL, NULL, 0) = 8171
3?u8"..., 22, MSG_NOSIGNAL, NULL, 0) = 22 sendto(3, "\t\n\v\f\r"..., 8171, MSG_NOSIGNAL, NULL, 0) = 8171

The critical points 8170 and 8171 are found to be very close to the buffer size 8192 of Clickhouse CPP. So I try to adjust the size of the Clickhouse CPP buffer. It does affect the number of send, but the critical point changes a little, which can’t solve the problem.

So far, we can basically determine the impact of the kernel and protocol stack, so we think about the configurations that may affect the send and receive delays, and then we think of themTCP_NODELAYSo I put forward a PR and added it to Clickhouse CPPTCP_NODELAYOption (PR merged on January 10, 2020), and the test performance is finally stable.

Later, I tried to use the off CPU flame diagram. I can only see that there is a wait during recv, but I can’t directly see the reason. This problem is inexperienced and difficult to deal with (although searching)TCP 40msThere will be results).

Orzclick: write a Clickhouse client for National Day

Php-cpp loss

Php-cpp encapsulates Zend API. The development and extension can basically ignore the lower layers of Zend engine (zval, hashtable, etc.), which is very convenient at the cost of more additional operation and performance loss.

The optimization method is very violent. Modify php-cpp directly to expose the encapsulated zval, and then operate directly with Zend API. The process is to first write in php-cpp, then use the flame diagram to find hot spots, and then replace it with Zend API.

For example, innestedForeachMethod, you need to get the value of the array. If you use php-cppValue::get()Last copy:

Value::Value(struct _zval_struct *val, bool ref)
{
    // do we have to force a reference?
    if (!ref)
    {
        // we don't, simply duplicate the value
        ZVAL_DUP(_val, val);
    }

When inserting in batches, there will be unnecessary array copying. So this is changed tozend_hash_findGet*zval, and then directly traverse:

zval *item;
auto column = zend_hash_find(Z_ARRVAL_P(data._val), key);
auto ht = Z_ARRVAL_P(column);

ZEND_HASH_FOREACH_VAL(ht, item) {
    callback(item);
}
ZEND_HASH_FOREACH_END();

Conclusion

During the National Day holiday, I learned a little from this project:

  • ClickHouse
  • PHP extension development
  • C++
  • CMake
  • performance optimization

There are also some things that have not been done well:

  • I originally wanted to use PHPT for unit testing, but I didn’t write it. At present, there are several examples I use when developing in the tests directory
  • Ci, ready to try GitHub action

Finally, from the name orzclick, you should know that it is written for the purpose of playing and learning. Seasclick is recommended for production environment.

This work adoptsCC agreement, reprint must indicate the author and the link to this article