Technology Sharing | Multi-threaded Parallel Replay of Slave MTS (1)

Date: 2021-01-21

Author: Gao Peng (八怪)

The function-call flow for the distribution discussed in this section can be found at the link below:
https://www.jianshu.com/p/870…

1. Overview

Unlike replay with a single SQL thread, MTS contains multiple worker threads, and the original SQL thread is transformed into a coordinator thread; this coordinator thread also takes on the checkpoint work. We know that there are two parallel replay modes, LOGICAL_CLOCK and DATABASE, each with different rules for deciding which transactions can be replayed in parallel. In the source code they correspond to two different classes:

  • Mts_submode_logical_clock
  • Mts_submode_database

I will only discuss the LOGICAL_CLOCK-based method and not the older DATABASE-based one. The parameters I set are:

  • slave_parallel_type: LOGICAL_CLOCK
  • slave_parallel_workers: 4

Note that slave_parallel_workers sets the number of worker threads and does not include the coordinator thread. Therefore, if you do not want to use MTS, set this parameter to 0 and then run "stop slave; start slave" for it to take effect, because the worker threads are initialized when the slave threads start.

We know that in 5.7 an anonymous GTID event is written even if GTID is not enabled, and it carries last_committed and sequence_number. Therefore MTS can be used even with GTID turned off, although this is not recommended; the reason is discussed in section 26.
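As a quick illustration, this is roughly what that looks like in mysqlbinlog output (this fragment is reconstructed for illustration, not taken from a real server; timestamps and positions are elided):

#... Anonymous_GTID  last_committed=21  sequence_number=22
#... Anonymous_GTID  last_committed=22  sequence_number=23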

Earlier we discussed the MySQL-layer transaction commit process and writeset-based parallel replication, and mentioned three ways of generating last_committed and sequence_number:

  • ORDER_COMMIT
  • WRITESET
  • WRITESET_SESSION

These control the rules for generating last_committed and sequence_number. The slave only needs to set the parameter slave_parallel_type to LOGICAL_CLOCK; whether transactions can be replayed in parallel is decided from last_committed and sequence_number.

In the following description we use an ordinary "delete" statement that deletes one row of data as the example. The sequence of events it generates is as follows:

(Figure: event sequence of the example DELETE transaction: GTID_LOG_EVENT, QUERY_EVENT, MAP_EVENT, DELETE_EVENT, XID_EVENT)

At the same time, let us clarify the three places where MTS information can be persisted in MySQL, because MTS needs to store more information than the traditional single-SQL-thread master-slave setup. Note that we only discuss the case where master_info_repository and relay_log_info_repository are set to TABLE, as follows:

  • slave_master_info table: updated by the IO thread once the number of events exceeds the sync_master_info setting.
  • slave_relay_log_info table: updated by the SQL coordinator thread whenever a checkpoint is executed.
  • slave_worker_info table: updated by a worker thread every time it commits a transaction.

Refer to section 25 for a more detailed explanation, including the reason why only master_info_repository and relay_log_info_repository set to TABLE are considered.

2. Distribution mechanism of the coordinator thread

When distributing events, the coordinator thread mainly completes the following two tasks:

  • Determine whether the transaction can be replayed in parallel.
  • Determine which worker thread will replay the transaction.

The difference from single-SQL-thread execution is mainly reflected in the function apply_event_and_update_pos: with a single thread it actually applies the event, whereas under MTS it only distributes the event, and the actual application is completed by a worker thread.
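The following condensed pseudo-C++ sketch (invented for this article; is_mts stands in for the real parallel-execution check, while apply_event and get_slave_worker are the functions referenced in this section) summarizes the difference:

//Inside apply_event_and_update_pos, conceptually:
if (!is_mts)
  ev->apply_event(rli);                          //single SQL thread: apply the event here
else {
  Slave_worker *w= ev->get_slave_worker(rli);    //MTS: only decide and distribute
  //the chosen worker thread applies the event later from its jobs queue
}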
Here is the simplified process; see the notes for the specific function calls. Below is a flow chart (Figure 19-1; the original HD image is included at the end of the article).

(Figure 19-1: flow chart of the coordinator thread's event distribution)

3. Step-by-step analysis

Each step is analyzed as follows:
(1) If it is a GTID_LOG_EVENT, it marks the beginning of a transaction, and the transaction is added to the GAQ queue (the GAQ is described in detail in the next section). Reference function: Log_event::get_slave_worker.
(2) Add the GTID_LOG_EVENT to the curr_group_da queue. Reference function: Log_event::get_slave_worker.
(3) Get the last_committed and sequence_number values from the GTID_LOG_EVENT. Reference function: Mts_submode_logical_clock::schedule_next_event.
(4) Get the current_lwm value. This value represents the sequence_number of the last committed transaction before the earliest not-yet-committed transaction in the GAQ queue; transactions after that point may already have committed. It may sound convoluted, but it is very important. If every transaction in the GAQ has committed, it is simply the sequence_number of the most recently committed transaction. The figure below, which comes from the source code, expresses this meaning. This value is obtained by the function Mts_submode_logical_clock::get_lwm_timestamp.

       the last time index containing lwm
               +------+
               | LWM  |
               |  |   |
               V  V   V
GAQ:x  xoooooxxxxxXXXXX...X
             ^   ^
             |   | LWM + 1 (LWM represents the location of the checkpoint)
             |
             + new current_lwm

      <---- logical (commit) time ----

here `x' stands for committed, `X' for committed and discarded from
the running range of the queue, `o' for not committed.

We can ignore the LWM part for now; the checkpoint LWM will be discussed later. sequence_number increases from right to left. There are actually three kinds of entries in the GAQ:

  • X: transactions that have been checkpointed and dequeued from the GAQ.
  • x: transactions that have committed.
  • o: transactions that have not yet committed.

We can see that the current_lwm we need is not the sequence_number of the most recently committed transaction, but the sequence_number of the committed transaction immediately before the earliest uncommitted one. This is very important: once it is understood, we will know how large transactions affect MTS parallel replay. At the same time, the five "o"s in the middle are the so-called "gap", which will be described in detail in the next section.
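Here is a minimal sketch of this definition (plain C++ invented for illustration, not the MySQL implementation; the real logic is in Mts_submode_logical_clock::get_lwm_timestamp): walk the running range of the GAQ from oldest to newest and remember the last committed sequence_number seen before the first uncommitted transaction.

#include <vector>

enum class State { Committed, NotCommitted };   //'x' and 'o' in the diagram

struct Txn {
  long long seq_number;   //sequence_number of the transaction
  State state;
};

//gaq holds the running range, oldest first, with checkpointed ('X')
//entries already dequeued; checkpoint_seq is the fallback when the
//oldest transaction in the queue is still uncommitted
long long current_lwm(const std::vector<Txn> &gaq, long long checkpoint_seq) {
  long long lwm= checkpoint_seq;
  for (const Txn &t : gaq) {
    if (t.state == State::NotCommitted)
      break;               //stop at the earliest uncommitted transaction
    lwm= t.seq_number;     //last committed sequence_number before that point
  }
  return lwm;
}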

(5) Compare the last_committed of the GTID_LOG_EVENT with current_lwm. Reference function: Mts_submode_logical_clock::schedule_next_event. The general rules are:

  • If last_committed is less than or equal to current_lwm, the transaction can be replayed in parallel and processing continues.
  • If last_committed is greater than current_lwm, parallel replay is not possible. The coordinator thread must wait until the less-than-or-equal condition is established, at which point it is woken up by a worker thread. During the wait its status is set to "Waiting for dependent transaction to commit".

The relevant source code is also fairly simple:

longlong lwm_estimate= estimate_lwm_timestamp();
//This value is only set by min_waited_timestamp during the wait below;
//setting min_waited_timestamp updates lwm_estimate
    if (!clock_leq(last_committed, lwm_estimate) &&
//  clock_leq() returns true when a <= b, i.e. when last_committed <= lwm_estimate
        rli->gaq->assigned_group_index != rli->gaq->entry)
    {
      if (wait_for_last_committed_trx(rli, last_committed, lwm_estimate))
//Wait for the previous group commit to complete
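For reference, clock_leq is essentially a "<=" comparison that also treats an uninitialized timestamp specially. A minimal sketch consistent with the comment above (the real function is Mts_submode_logical_clock::clock_leq; SEQ_UNINIT is the "no timestamp" marker):

static bool clock_leq(longlong a, longlong b)
{
  if (a == SEQ_UNINIT) return true;    //an unset last_committed precedes everything
  if (b == SEQ_UNINIT) return false;
  return a <= b;
}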

(6) If it is a QUERY_EVENT, add it to the curr_group_da queue.
(7) If it is a MAP_EVENT, allocate a worker thread. Reference function: Mts_submode_logical_clock::get_least_occupied_worker. The worker thread is allocated as follows:

  • If there is an idle worker thread, the allocation completes and processing continues.
  • If there is no idle worker thread, wait for one to become idle. In this case the status is set to "Waiting for slave workers to process their queues".

The allocation criterion is actually very simple:

for (Slave_worker **it= rli->workers.begin(); it != rli->workers.end(); ++it)
  {
    Slave_worker *w_i= *it;
    if (w_i->jobs.len == 0)
//If the task queue is 0, the worker thread is idle and can be allocated
      return w_i;
  }
  return 0;

(8) The GTID_LOG_EVENT and QUERY_EVENT are assigned to this worker thread. Reference function: append_item_to_jobs.

A worker thread has already been allocated above, so the events can now be assigned to it. During assignment, the coordinator checks whether the worker's task queue is full; if it is, it waits, with the status set to "Waiting for Slave Worker queue". Because the unit of assignment is the event, a transaction may contain many events; if the worker thread applies events more slowly than the coordinator thread enqueues them, the task queue can back up, so a full task queue is entirely possible. The size of the task queue is 16384 events, as follows:

mts_slave_worker_queue_len_max= 16384;  

Here is part of the enqueue code:

while (worker->running_status == Slave_worker::RUNNING && !thd->killed &&
         (ret= en_queue(&worker->jobs, job_item)) == -1)
//The queue is full
  {
    thd->ENTER_COND(&worker->jobs_cond, &worker->jobs_lock,
                    &stage_slave_waiting_worker_queue, &old_stage);
//Mark the wait state
    worker->jobs.overfill= TRUE;
    worker->jobs.waited_overfill++;
    rli->mts_wq_overfill_cnt++; //count the number of times the queue was full
    mysql_cond_wait(&worker->jobs_cond, &worker->jobs_lock);
//Wait to be woken up
    mysql_mutex_unlock(&worker->jobs_lock);
    thd->EXIT_COND(&old_stage);
    mysql_mutex_lock(&worker->jobs_lock);
  }

(9) The MAP_EVENT is assigned to the worker thread, as above.

(10) The DELETE_EVENT is assigned to the worker thread, as above.

(11) The XID_EVENT is also assigned to the worker thread, but it needs additional processing, mainly filling in checkpoint-related information:

ptr_group->checkpoint_log_name= my_strdup(key_memory_log_event,
rli->get_group_master_log_name(), MYF(MY_WME));
ptr_group->checkpoint_log_pos= rli->get_group_master_log_pos();
ptr_group->checkpoint_relay_log_name=my_strdup(key_memory_log_event,
rli->get_group_relay_log_name(), MYF(MY_WME));
ptr_group->checkpoint_relay_log_pos= rli->get_group_relay_log_pos();
ptr_group->ts= common_header->when.tv_sec + (time_t) exec_time;
//Related to Seconds_behind_master; this value is later passed to
//mts_checkpoint_routine()
ptr_group->checkpoint_seqno= rli->checkpoint_seqno;
//The seqno; after a checkpoint the offset is subtracted from it

If a checkpoint happens to land on this transaction, this information will appear in the slave_worker_info table and also in show slave status; in other words, much of the information in show slave status comes from MTS checkpoints. The next section describes checkpoints in detail.

(12) If the event distribution process described above takes more than 2 minutes (120 seconds), a log like the following may appear:

(Screenshot: "Multi-threaded slave statistics" message from the error log)

This screenshot comes from a friend's question. This log can be regarded as just an informational warning. The corresponding source code is:

sql_print_information("Multi-threaded slave statistics%s: "
                "seconds elapsed = %lu; "
                "events assigned = %llu; "
                "worker queues filled over overrun level = %lu; "
                "waited due a Worker queue full = %lu; "
                "waited due the total size = %lu; "
                "waited at clock conflicts = %llu "
                "waited (count) when Workers occupied = %lu "
                "waited when Workers occupied = %llu",
                rli->get_for_channel_str(),
                static_cast<unsigned long>
                (my_now - rli->mts_last_online_stat),
//Total elapsed time, in seconds
                rli->mts_events_assigned,
//Total number of events assigned
                rli->mts_wq_overrun_cnt,
//Number of times a worker's task queue filled beyond 90%; the threshold
//is currently hard-coded to 14746
                rli->mts_wq_overfill_cnt,
//Number of waits because a worker's task queue was full; the queue size
//is currently hard-coded to 16384
                rli->wq_size_waits_cnt,
//Number of waits caused by big events; generally 0
                rli->mts_total_wait_overlap,
//Total time waited because the previous group of transactions had not
//committed and the current one could not be assigned, in nanoseconds
                rli->mts_wq_no_underrun_cnt,
//Number of waits because there was no idle worker thread
                rli->mts_total_wait_worker_avail);
//Total time waited because there was no idle worker thread, in nanoseconds

Friends often ask me to explain these fields in detail. From the preceding analysis we can see three waiting points in total:

  • “Waiting for dependent transaction to commit”

The coordinator thread determined that this transaction's last_committed is greater than current_lwm, so it cannot be replayed in parallel and the coordinator waits. Large transactions aggravate this wait.

  • “Waiting for slave workers to process their queues”

There is no idle worker thread, so the coordinator thread waits. This indicates that the theoretical parallelism is good, but the parameter slave_parallel_workers may be set too low. Of course, the number of worker threads should be chosen in light of the server configuration and load, because as section 29 will show, a thread is the smallest unit of CPU scheduling.

  • “Waiting for Slave Worker queue”

The coordinator thread waits because the worker's task queue is full. The cause is that a transaction contains too many events, and the worker thread applies events more slowly than the coordinator thread assigns them, creating a backlog of more than 16384 events. (A sketch of this producer/consumer backpressure pattern follows the next paragraph.)

In addition, there is actually one more kind of wait:
"Waiting for slave workers to free pending events": it is caused by so-called "big events". What is a big event? In the source code it is described as an event whose size is greater than slave_pending_jobs_size_max but less than slave_max_allowed_packet. Personally I don't think it is likely to happen, so I have not given it much consideration. You can refer to the function append_item_to_jobs.
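As mentioned above, here is a self-contained sketch of the backpressure pattern behind "Waiting for Slave Worker queue" (plain C++ invented for this article, not MySQL code): the coordinator-side push() blocks while the bounded queue is full, and a worker-side pop() frees a slot and wakes it up.

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

template <typename Event>
class BoundedJobQueue {
 public:
  explicit BoundedJobQueue(std::size_t cap) : cap_(cap) {}

  //called by the coordinator, analogous to append_item_to_jobs()
  void push(Event ev) {
    std::unique_lock<std::mutex> lk(m_);
    not_full_.wait(lk, [&] { return q_.size() < cap_; });  //queue full: wait
    q_.push(std::move(ev));
    not_empty_.notify_one();
  }

  //called by a worker thread when it takes the next event to apply
  Event pop() {
    std::unique_lock<std::mutex> lk(m_);
    not_empty_.wait(lk, [&] { return !q_.empty(); });
    Event ev= std::move(q_.front());
    q_.pop();
    not_full_.notify_one();              //wake a coordinator blocked in push()
    return ev;
  }

 private:
  std::size_t cap_;                      //16384 in the MySQL case
  std::mutex m_;
  std::condition_variable not_full_, not_empty_;
  std::queue<Event> q_;
};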

The outputs in that log are explained in detail as follows:

(Screenshot: field-by-field annotation of the statistics log)

We can see that this log is very complete, basically covering all the possibilities discussed above. Looking at the log in the case: waited at clock conflicts = 91895169800 nanoseconds, i.e. 91895169800 / 10^9 ≈ 91.9 seconds. Out of 120 seconds, about 91 seconds were spent waiting because transactions could not be replayed in parallel; obviously, we should check whether there are large transactions.

4. Determinants of parallel replay

The following is a binary log fragment generated by our master using the WRITESET method. We mainly observe last_committed and sequence_number, and become familiar with the process through analysis.

(Screenshot: binary log fragment showing last_committed and sequence_number for each transaction)

According to the parallel-replay rule just described, namely:

  • If last_committed is less than or equal to current_lwm, the transaction can be replayed in parallel and processing continues.
  • If last_committed is greater than current_lwm, parallel replay is not possible and the coordinator must wait.

The specific analysis is as follows:
The transaction (last_committed:22 sequence_number:23) can only be executed after the transaction (last_committed:21 sequence_number:22) completes, because it requires last_committed:22 <= current_lwm, i.e. sequence_number 22 must have committed. The subsequent transactions up to (last_committed:22 sequence_number:30) can then all be executed in parallel. Suppose they have all been executed, and observe the next three transactions:

  • last_committed:29 sequence_number:31
  • last_committed:30 sequence_number:32
  • last_committed:27 sequence_number:33

Here we notice an obvious feature of writeset-based parallel replication: last_committed may be smaller than that of the preceding transaction, because it is computed from the writeset history map. According to the rules above, these three transactions can be executed in parallel, because obviously:

  • last_committed:29 <= current_lwm:30
  • last_committed:30 <= current_lwm:30
  • last_committed:27 <= current_lwm:30

However, if among the transactions before (last_committed:22 sequence_number:30) there were a large transaction that had not finished executing, current_lwm would not be 30. For example, if (last_committed:22 sequence_number:27) were a large transaction, current_lwm would stay at 26, and the three transactions above would be blocked; the assignment of (last_committed:29 sequence_number:31) would already block. The reasons:

  • last_committed:29 > current_lwm:26
  • last_committed:30 > current_lwm:26
  • last_committed:27 > current_lwm:26

Now consider the transaction (last_committed:27 sequence_number:33) under writeset-based parallel replication: under our parallel rule, the smaller the last_committed, the higher the chance of concurrency. So writeset-based parallel replication does improve replay parallelism on the slave, but as mentioned in section 16, it adds some overhead on the master.
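To make the two scenarios concrete, here is a toy C++ simulation (invented for this article; it exercises only the last_committed <= current_lwm comparison, not the real scheduler) over the three transactions above, once with current_lwm = 30 and once with current_lwm = 26:

#include <cstdio>

struct Trx { long long last_committed, sequence_number; };

int main() {
  const Trx trx[]= { {29, 31}, {30, 32}, {27, 33} };
  const long long lwms[]= { 30, 26 };    //the two scenarios discussed above
  for (long long lwm : lwms) {
    std::printf("current_lwm = %lld\n", lwm);
    for (const Trx &t : trx) {
      bool ok= t.last_committed <= lwm;  //the LOGICAL_CLOCK rule
      std::printf("  last_committed:%lld sequence_number:%lld -> %s\n",
                  t.last_committed, t.sequence_number,
                  ok ? "can run in parallel" : "coordinator waits");
    }
  }
  return 0;
}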

The end of section 19.
