A 10,000-word detailed explanation of the Redis Cluster gossip protocol

Time:2021-7-25

Redis Cluster gossip protocol

Hello, I’m Li Xiaobing. Today I’ll talk about the gossip protocol and cluster operation of Redis Cluster. The mind map of the article is as follows.

Introduction to cluster mode and gossip

In the field of data storage, once the amount of data or request traffic grows beyond a certain point, distributed storage has to be introduced. For example, Redis has excellent single-instance performance, but it still has to introduce clusters for the following reasons.

  • A single machine cannot guarantee high availability, so multiple instances are needed to provide it;
  • A single machine can serve up to about 80,000 QPS, so higher QPS requires multiple instances;
  • The amount of data a single machine can hold is limited, so multiple instances are needed to handle more data;
  • When the network traffic handled by a single machine exceeds the upper limit of the server’s network card, multiple instances are needed to split the traffic.

Once there is a cluster, it usually needs to maintain some metadata, such as instance IP addresses and the slot information of the cache shards, so a mechanism is needed to keep that metadata consistent across the cluster. Such mechanisms generally come in two modes: decentralized and centralized.

The decentralized mode stores metadata on some or all nodes, and the nodes communicate with each other continuously to propagate metadata changes and keep them consistent. Redis Cluster, Consul, etc. all use this mode.

The centralized mode stores the cluster metadata centrally on external nodes or middleware, such as ZooKeeper. Older versions of Kafka and Storm use this mode.

The two modes have their own advantages and disadvantages, as shown in the table below:

Mode | Advantages | Disadvantages
Centralized | Metadata updates and reads are timely. Once the metadata changes, it is written to the centralized external node immediately, and other nodes see the change as soon as they read it. | Heavy update pressure. All updates are concentrated on the external node, which becomes a single point that affects the whole system.
Decentralized | Update pressure is spread out. Metadata updates are not concentrated on one node; update requests are scattered across nodes, which process them with some delay, reducing concurrency pressure. | Metadata updates are delayed, so the cluster’s view of a change may lag behind.

The decentralized metadata mode can use a variety of algorithms for metadata synchronization, such as Paxos, Raft and gossip. Paxos and Raft require most nodes (more than half) to be running normally for the whole cluster to operate stably, while gossip does not require more than half of the nodes to be running.

The gossip protocol, as its name implies, works like gossip. It spreads information across the whole network in a random, infectious way, and makes the data on all nodes in the system consistent within a certain period of time. Mastering this protocol not only helps you understand one of the most commonly used algorithms for eventual consistency, but also makes it easier to implement eventual consistency in your own later work.

The gossip protocol, also known as the epidemic protocol, is a protocol for exchanging information between nodes or processes in the way an epidemic spreads. It is widely used in P2P networks and distributed systems, and its methodology is particularly simple:

In a cluster on a bounded network, if each node randomly exchanges specific information with other nodes, then after a long enough time every node’s view of the cluster will eventually converge to the same state.

The “specific information” here generally refers to metadata such as the cluster status and the status of each node. The gossip protocol fully conforms to the BASE principle and can be used in any field that requires eventual consistency, such as distributed storage and service registries. In addition, it makes elastic clusters easy to implement, allows nodes to go online and offline at any time, and provides fast failure detection and dynamic load balancing.

Beyond that, the biggest advantage of the gossip protocol is that even as the number of cluster nodes increases, the load on each node does not increase much and stays almost constant. This allows the number of nodes managed by a Redis Cluster or Consul cluster to scale horizontally to the thousands.
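
To make the convergence idea concrete, here is a minimal simulation sketch in C. It is not Redis code: the names SIM_NODES, sim_next and gossip_rounds_to_converge are invented for illustration, and it assumes a simple push-pull exchange in which each node contacts one random peer per round and both keep the newer version number.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_NODES 32  /* hypothetical cluster size for the simulation */

/* Small deterministic pseudo-random generator (LCG), so runs are repeatable. */
static uint32_t sim_next(uint32_t *state) {
    *state = *state * 1664525u + 1013904223u;
    return *state >> 8; /* drop the weakest low bits */
}

/* Simulate push-pull gossip of a single metadata version number.
 * Node 0 starts with version 1, every other node with version 0.
 * Each round, every node picks one random peer and both keep the
 * newer version. Returns the number of rounds until all nodes agree,
 * or -1 if max_rounds is reached first. */
int gossip_rounds_to_converge(uint32_t seed, int max_rounds) {
    int version[SIM_NODES] = {0};
    uint32_t rng = seed;
    assert(max_rounds > 0);
    version[0] = 1;

    for (int round = 1; round <= max_rounds; round++) {
        for (int i = 0; i < SIM_NODES; i++) {
            int peer = (int)(sim_next(&rng) % SIM_NODES);
            if (peer == i) continue; /* a node does not gossip with itself */
            int newest = version[i] > version[peer] ? version[i] : version[peer];
            version[i] = newest;     /* pull: learn from the peer */
            version[peer] = newest;  /* push: tell the peer */
        }
        int informed = 0;
        for (int i = 0; i < SIM_NODES; i++) informed += version[i];
        if (informed == SIM_NODES) return round;
    }
    return -1;
}
```

Calling gossip_rounds_to_converge(42u, 1000) returns the round in which the last node learned the new version. Note that each node only talks to one peer per round, which is why the per-node load stays flat as the cluster grows.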

Redis Cluster’s gossip communication mechanism

Redis introduced the cluster feature in version 3.0. So that each instance in the cluster knows the status of all other instances, Redis Cluster stipulates that instances communicate and pass information to each other according to the gossip protocol.

The figure above shows a Redis Cluster with a master-slave architecture, in which the solid lines represent the master-slave replication relationships between nodes, and the dotted lines represent the gossip communication between nodes.

Each node in Redis Cluster maintains the current status of the entire cluster from its own perspective, mainly including:

  1. Current cluster status
  2. The slot information of each node in the cluster and its migration status
  3. Master-slave status of each node in the cluster
  4. Survival status and suspected fail status of each node in the cluster

In other words, the information above is the subject matter that nodes in the cluster gossip about, and it is fairly comprehensive: it covers both a node’s own state and that of others. Because every node passes it on to the others, the information eventually becomes complete and consistent everywhere.

Redis cluster nodes send a variety of messages to each other. The more important ones are as follows:

  • Meet: through the “CLUSTER MEET ip port” command, a node in the existing cluster sends an invitation to a new node to join the cluster, after which the new node starts communicating with the other nodes;
  • Ping: a node sends Ping messages to other nodes in the cluster at the configured interval. The message contains its own status, the cluster metadata it maintains, and the metadata of some other nodes;
  • Pong: a node uses it to respond to Ping and Meet messages. Its structure is similar to that of a Ping message, and it also contains the node’s own status and other information. It can also be used to broadcast and update information;
  • Fail: after a node determines that another node has failed (it cannot be pinged), it broadcasts a message that the node is down to all nodes in the cluster. Other nodes mark that node as offline after receiving the message.

In Redis’s source code, the cluster.h file defines all message types. The code shown is from Redis 4.0.

//Note that Ping, Pong and Meet are actually the same kind of message.
//Pong is a reply to Ping, and its actual format is also that of a Ping message.
//Meet is a special Ping message that forces the receiver to add the sender to the cluster (if the sender is not already in its node list).
#define CLUSTERMSG_TYPE_PING 0          /* Ping message */
#define CLUSTERMSG_TYPE_PONG 1          /* Pong, used to reply to Ping */
#define CLUSTERMSG_TYPE_MEET 2          /* Meet, requests adding a node to the cluster */
#define CLUSTERMSG_TYPE_FAIL 3          /* Fail, marks a node as FAIL */
#define CLUSTERMSG_TYPE_PUBLISH 4       /* Propagates Pub/Sub publish messages */
#define CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST 5 /* Requests a failover; the receiver is asked to vote for the sender */
#define CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK 6     /* The receiver agrees to vote for the sender */
#define CLUSTERMSG_TYPE_UPDATE 7        /* Slots have changed; the sender asks the receiver to update accordingly */
#define CLUSTERMSG_TYPE_MFSTART 8       /* Pause clients for manual failover */
#define CLUSTERMSG_TYPE_COUNT 9         /* Total number of message types */

Through these messages, each instance in the cluster can obtain the status of all other instances. Even when events such as a new node joining, a node failing or a slot change occur, the cluster state is synchronized to every instance through the exchange of Ping and Pong messages. Next, let’s look at several common scenarios in turn.

Periodic Ping / Pong messages

All nodes in Redis Cluster regularly send Ping messages to other nodes to exchange status information and check each node’s status, including online, suspected offline (PFAIL) and offline (FAIL).

How periodic Ping / Pong works in Redis Cluster can be summarized in two points:

  • First, at a certain frequency, each instance randomly selects some instances from the cluster and sends them a Ping message, to detect whether they are online and to exchange status information. The Ping message carries the status of the sending instance, the status of some other instances, and the slot mapping table.
  • Second, after receiving a Ping message, an instance sends a Pong message back to the sender. The content of a Pong message has the same format as that of a Ping message.

The following figure shows the Ping and Pong messages exchanged between two instances. Instance 1 is the sending node and instance 2 is the receiving node.

New node online

When a new node joins Redis Cluster, the client needs to execute the CLUSTER MEET command, as shown in the following figure.

When a node executes the CLUSTER MEET command, it first creates a clusterNode structure for the new node and adds it to the nodes dictionary of the clusterState it maintains. The relationship between clusterState and clusterNode is covered with a detailed schematic diagram and source code in the last section.

Then the node (node 1) sends a Meet message to the new node, using the IP address and port number given in the CLUSTER MEET command. After the new node receives the Meet message from node 1, it also creates a clusterNode structure for node 1 and adds it to the nodes dictionary of its own clusterState.

Next, the new node returns a Pong message to node 1. As soon as node 1 receives the Pong message returned by the new node, it knows that the new node has successfully received the Meet message it sent.

Finally, node 1 sends a Ping message to the new node. After receiving the Ping message, the new node knows that node 1 has successfully received the Pong message it returned, which completes the handshake for the new node joining.

After the Meet operation succeeds, node 1 passes the information about the new node to the other nodes in the cluster through the periodic Ping mechanism described earlier, so that they can shake hands with the new node as well. After a period of time, the new node is known to all nodes in the cluster.
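
The handshake steps above can be sketched as a tiny state machine. This is a simplified model of the exchange as described in this section, not the logic in cluster.c; all names here (toy_node, toy_start_meet, toy_receive and the enums) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { MSG_MEET, MSG_PONG, MSG_PING } msg_type;            /* hypothetical */
typedef enum { PEER_UNKNOWN, PEER_HANDSHAKE, PEER_KNOWN } peer_state;

typedef struct {
    peer_state peer;   /* what this node currently thinks of the other node */
} toy_node;

/* Inviting side: CLUSTER MEET creates the peer entry and emits a Meet. */
msg_type toy_start_meet(toy_node *inviter) {
    assert(inviter != NULL);
    inviter->peer = PEER_HANDSHAKE;
    return MSG_MEET;
}

/* Handle one incoming message. Returns 1 and sets *reply when a reply is due. */
int toy_receive(toy_node *self, msg_type in, msg_type *reply) {
    assert(self != NULL && reply != NULL);
    switch (in) {
    case MSG_MEET:                 /* new node: record the sender, answer with Pong */
        self->peer = PEER_HANDSHAKE;
        *reply = MSG_PONG;
        return 1;
    case MSG_PONG:                 /* first Pong: Meet was received, follow up with Ping */
        if (self->peer == PEER_HANDSHAKE) {
            self->peer = PEER_KNOWN;
            *reply = MSG_PING;
            return 1;
        }
        return 0;                  /* later Pongs need no answer */
    case MSG_PING:                 /* new node: its Pong arrived, handshake is done */
        self->peer = PEER_KNOWN;
        *reply = MSG_PONG;         /* every Ping is answered with a Pong */
        return 1;
    }
    return 0;
}
```

Driving two toy_node values through Meet, Pong and Ping leaves both with peer == PEER_KNOWN, and the final Pong produces no further reply, so the exchange terminates.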

Node suspected offline and truly offline

Each node in Redis Cluster regularly checks whether a node to which it has sent a Ping message has returned a Pong message within the specified time (cluster-node-timeout). If not, it marks that node as suspected offline, that is, the PFAIL state, as shown in the following figure.

Then node 1 passes the information that node 2 is suspected offline to other nodes, such as node 3, through Ping messages. After node 3 receives the Ping message from node 1 and learns that node 2 has entered the PFAIL state, it finds the clusterNode structure for node 2 in the nodes dictionary of its own clusterState and adds the offline report from master node 1 to the fail_reports linked list of that clusterNode structure.

Over time, if node 10 (for example) also considers node 2 suspected offline because of a Pong timeout, and finds that the fail_reports list of the clusterNode it maintains for node 2 contains non-expired PFAIL reports from more than half of the master nodes, then node 10 marks node 2 as offline (FAIL) and immediately broadcasts a Fail message to the other nodes in the cluster, saying that master node 2 is offline. All nodes that receive the Fail message immediately mark node 2 as offline, as shown in the figure below.

Note that suspected-offline reports are only valid for a limited time. Once a report is older than cluster-node-timeout * 2, it is ignored and node 2 returns to the normal state.
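
The report bookkeeping described above can be sketched as follows. This is a simplified model using a fixed-size array instead of Redis’s linked list; the names fail_report_list, add_failure_report, valid_failure_reports and can_mark_fail are invented for this sketch. It assumes, as described above, that reports expire after twice the node timeout and that more than half of the masters must agree, with the checking node counting its own opinion if it is a master.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_REPORTS 64        /* hypothetical cap for this sketch */

typedef struct {
    int reporter_id;          /* master that reported PFAIL */
    int64_t time_ms;          /* when the report was last refreshed */
} fail_report;

typedef struct {
    fail_report reports[MAX_REPORTS];
    int count;
} fail_report_list;

/* Add a report, or refresh its timestamp if this reporter already has one. */
void add_failure_report(fail_report_list *l, int reporter_id, int64_t now_ms) {
    for (int i = 0; i < l->count; i++) {
        if (l->reports[i].reporter_id == reporter_id) {
            l->reports[i].time_ms = now_ms;
            return;
        }
    }
    assert(l->count < MAX_REPORTS);
    l->reports[l->count].reporter_id = reporter_id;
    l->reports[l->count].time_ms = now_ms;
    l->count++;
}

/* Drop reports older than node_timeout_ms * 2 and return how many remain. */
int valid_failure_reports(fail_report_list *l, int64_t now_ms, int64_t node_timeout_ms) {
    int kept = 0;
    for (int i = 0; i < l->count; i++) {
        if (now_ms - l->reports[i].time_ms <= node_timeout_ms * 2)
            l->reports[kept++] = l->reports[i];
    }
    l->count = kept;
    return kept;
}

/* PFAIL may be promoted to FAIL once a majority of masters agree.
 * `reports` counts other masters; `self_is_master` adds our own opinion. */
int can_mark_fail(int reports, int self_is_master, int total_masters) {
    int needed = total_masters / 2 + 1;
    return reports + (self_is_master ? 1 : 0) >= needed;
}
```

For example, with 5 masters the quorum is 3, so two valid reports plus the checking master’s own PFAIL opinion are enough to mark the node as FAIL.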

Redis Cluster communication source code implementation

So far we have covered the principles and workflows of Redis Cluster for periodic Ping / Pong, a new node coming online, and nodes being suspected offline and truly offline. Now let’s look at the source code and the concrete operations Redis performs in each of these steps.

Data structure involved

First, let’s explain the data structures involved, that is, clusterNode and the other structures mentioned above.

Each node maintains a clusterState structure, which represents the overall status of the cluster from that node’s point of view. It is defined as follows.

typedef struct clusterState {
   clusterNode *myself;  /* clusterNode of the current node */
   ....
   dict *nodes;          /* Dictionary mapping node name to clusterNode */
   ....
   clusterNode *slots[CLUSTER_SLOTS]; /* Mapping from slot to the node serving it */
   ....
} clusterState;

It has three key fields:

  • The myself field is a clusterNode structure that records the node’s own status;
  • The nodes dictionary maps names to clusterNode structures and records the status of the nodes known to this node;
  • The slots array records the clusterNode structure responsible for each slot.

The clusterNode structure stores the current state of a node, for example the node’s creation time, name, current configuration epoch, IP address and port number. In addition, the link attribute of the clusterNode structure is a clusterLink structure that holds the information needed to connect to the node, for example the socket descriptor, input buffer and output buffer. clusterNode also has a fail_reports list, used to record suspected-offline reports. The definition is as follows.

typedef struct clusterNode {
    mstime_t ctime; /* Node creation time */
    char name[CLUSTER_NAMELEN]; /* Node name */
    int flags;      /* Node flags marking the role or status of the node, such as master, slave, PFAIL and FAIL */
    uint64_t configEpoch; /* Last configEpoch observed for this node */
    unsigned char slots[CLUSTER_SLOTS/8]; /* slots handled by this node */
    int numslots;   /* Number of slots handled by this node */
    int numslaves;  /* Number of slave nodes, if this is a master */
    struct clusterNode **slaves; /* pointers to slave nodes */
    struct clusterNode *slaveof; /* pointer to the master node. Note that it
                                    may be NULL even if the node is a slave
                                    if we don't have the master node in our
                                    tables. */
    mstime_t ping_sent;      /* Last time the current node sent a Ping to this node */
    mstime_t pong_received;  /* Last time the current node received a Pong from this node */
    mstime_t fail_time;      /* Time when the FAIL flag was set */
    mstime_t voted_time;     /* Last time we voted for a slave of this master */
    mstime_t repl_offset_time;  /* Unix time we received offset for this node */
    mstime_t orphaned_time;     /* Starting time of orphaned master condition */
    long long repl_offset;      /* Last known replication offset of this node */
    char ip[NET_IP_STR_LEN];  /* IP address of the node */
    int port;                   /* Client port */
    int cport;                  /* Cluster bus port, by default port + 10000 */
    clusterLink *link;          /* TCP connection to this node */
    list *fail_reports;         /* List of offline reports about this node */
} clusterNode;

clusterNodeFailReport is the structure that records one offline report about a node: node is the node that made the report, and time is when the report was made.

typedef struct clusterNodeFailReport {
    struct clusterNode *node;  /* Node that reported the target node as offline */
    mstime_t time;             /* Time of the report */
} clusterNodeFailReport;

Message structure

Now that we understand the data structures maintained by a Redis node, let’s look at the structure of the messages nodes exchange. The outermost structure of a message is clusterMsg. It contains general message information, including the “RCmb” signature, the total message length, the protocol version and the message type. It also contains information about the sending node, such as its name, the slots it is responsible for, and its IP and port. Finally, it contains a clusterMsgData that carries the type-specific message body.

typedef struct {
    char sig[4];        /* Signature "RCmb" (Redis Cluster message bus) */
    uint32_t totlen;    /* Total message length */
    uint16_t ver;       /* Message protocol version */
    uint16_t port;      /* Client port of the sender */
    uint16_t type;      /* Message type */
    uint16_t count;     /* Number of gossip sections; only used by some message types */
    uint64_t currentEpoch;  /* The cluster-wide epoch currently recorded by this node, used for elections and voting. Unlike configEpoch, which identifies a master node, currentEpoch identifies the state of the cluster as a whole */
    uint64_t configEpoch;   /* Each master node has a unique configEpoch. If it conflicts with another master, it is forcibly incremented so that it is unique in the cluster */
    uint64_t offset;    /* Replication offset; its meaning differs between master and slave nodes */
    char sender[CLUSTER_NAMELEN]; /* Name of the sending node */
    unsigned char myslots[CLUSTER_SLOTS/8]; /* Slots served by this node: a bitmap of 16384/8 bytes, 16384 bits in total */
    char slaveof[CLUSTER_NAMELEN]; /* Name of the master, if this node is a slave */
    char myip[NET_IP_STR_LEN];    /* IP of the sender */
    char notused1[34];  /* Reserved */
    uint16_t cport;      /* Cluster bus port of the sender */
    uint16_t flags;      /* Flags of this node, such as CLUSTER_NODE_HANDSHAKE or CLUSTER_NODE_MEET */
    unsigned char state; /* Cluster state from the POV of the sender */
    unsigned char mflags[3]; /* Message flags; currently only CLUSTERMSG_FLAG0_PAUSED and CLUSTERMSG_FLAG0_FORCEACK */
    union clusterMsgData data;
} clusterMsg;
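
The myslots field above (like the slots field of clusterNode) packs one bit per slot into CLUSTER_SLOTS/8 bytes. The following sketch shows how such a bitmap can be written and read; the helper names slot_bitmap_set, slot_bitmap_test and slot_bitmap_count and the constant SKETCH_SLOTS are my own and are not taken from cluster.c.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_SLOTS 16384    /* same value as CLUSTER_SLOTS */

/* Mark `slot` as owned in a bitmap of SKETCH_SLOTS/8 bytes. */
void slot_bitmap_set(unsigned char *bitmap, int slot) {
    assert(slot >= 0 && slot < SKETCH_SLOTS);
    bitmap[slot >> 3] |= (unsigned char)(1 << (slot & 7));
}

/* Return 1 if `slot` is marked as owned, 0 otherwise. */
int slot_bitmap_test(const unsigned char *bitmap, int slot) {
    assert(slot >= 0 && slot < SKETCH_SLOTS);
    return (bitmap[slot >> 3] & (1 << (slot & 7))) != 0;
}

/* Count how many slots are marked as owned. */
int slot_bitmap_count(const unsigned char *bitmap) {
    int owned = 0;
    for (int slot = 0; slot < SKETCH_SLOTS; slot++)
        owned += slot_bitmap_test(bitmap, slot);
    return owned;
}
```

With this layout the full ownership map of a node fits in 2048 bytes, which is why every Ping and Pong header can afford to carry it.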

clusterMsgData is a union, which can hold the body of a Ping, Meet, Pong or Fail message. When the message type is Ping, Meet or Pong, the ping field is used; when the message type is Fail, the fail field is used.

//Note the union keyword here
union clusterMsgData {
    /* Used when the message is a Ping, Meet or Pong */
    struct {
        /* Array of N clusterMsgDataGossip structures */
        clusterMsgDataGossip gossip[1];
    } ping;
    /* Used when the message is a Fail */
    struct {
        clusterMsgDataFail about;
    } fail;
    //... the fields for the publish and update messages are omitted
};
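
As a rough illustration of how the message type selects which union member is valid, here is a cut-down sketch. The types toy_msg, gossip_entry and fail_entry and the function toy_msg_subject are hypothetical stand-ins, far smaller than the real clusterMsg, and the sketch treats node names as zero-terminated strings for simplicity.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NAMELEN 40
enum { TYPE_PING = 0, TYPE_PONG = 1, TYPE_MEET = 2, TYPE_FAIL = 3 };

typedef struct { char nodename[NAMELEN]; uint16_t port; } gossip_entry;
typedef struct { char nodename[NAMELEN]; } fail_entry;

typedef struct {
    uint16_t type;    /* decides which union member is meaningful */
    uint16_t count;   /* number of gossip entries for Ping/Pong/Meet */
    union {
        struct { gossip_entry gossip[1]; } ping;
        struct { fail_entry about; } fail;
    } data;
} toy_msg;

/* Return the name of the first node the body talks about:
 * the first gossip entry for Ping/Pong/Meet, the failed node for Fail,
 * or NULL when there is nothing to report. */
const char *toy_msg_subject(const toy_msg *m) {
    assert(m != NULL);
    switch (m->type) {
    case TYPE_PING:
    case TYPE_PONG:
    case TYPE_MEET:
        return m->count > 0 ? m->data.ping.gossip[0].nodename : NULL;
    case TYPE_FAIL:
        return m->data.fail.about.nodename;
    default:
        return NULL;
    }
}
```

Reading the wrong member would reinterpret the same bytes, so the receiver has to check the type field before touching data.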

clusterMsgDataGossip is the structure carried in Ping, Pong and Meet messages. It describes other nodes known to the sending node, that is, entries from the nodes field of the clusterState above. The code is as follows; you will notice that its fields are similar to those of clusterNode.

typedef struct {
    /* Node name. It is random by default; once the Meet message has been sent and answered, the node is given its formal name */
    char nodename[CLUSTER_NAMELEN];
    uint32_t ping_sent; /* Timestamp of the last Ping the sender sent to this node; reset to 0 once the matching Pong is received */
    uint32_t pong_received; /* Timestamp of the last Pong the sender received from this node */
    char ip[NET_IP_STR_LEN];  /* IP address last time it was seen */
    uint16_t port;       /* Client port last time it was seen */
    uint16_t cport;      /* Cluster bus port last time it was seen */
    uint16_t flags;      /* Node flags */
    uint32_t notused1;   /* Unused, kept for alignment */
} clusterMsgDataGossip;

typedef struct {
    char nodename[CLUSTER_NAMELEN]; /* Name of the offline node */
} clusterMsgDataFail;

Having covered the data structures maintained by a node and the structure of the messages it sends, let’s look at the source code for the concrete behavior of Redis.

Sending Ping messages randomly and periodically

Redis’s clusterCron function is called periodically. On every 10th execution, it prepares to send a Ping message to a random node.

It first randomly samples five nodes, then picks from them the node it has gone longest without hearing from, and calls the clusterSendPing function to send a message of type CLUSTERMSG_TYPE_PING.

//cluster.c
//clusterCron() sends a gossip message to a random node every 10 iterations (so at least one second apart)
//de, min_pong_node (initially NULL) and min_pong are declared earlier in clusterCron()
if (!(iteration % 10)) {
    int j;

    /* Sample 5 random nodes and pick one of them */
    for (j = 0; j < 5; j++) {
        de = dictGetRandomKey(server.cluster->nodes);
        clusterNode *this = dictGetVal(de);

        /* Do not Ping disconnected nodes or nodes with a Ping still outstanding */
        if (this->link == NULL || this->ping_sent != 0) continue;
        if (this->flags & (CLUSTER_NODE_MYSELF|CLUSTER_NODE_HANDSHAKE))
            continue;
        /* Compare pong_received and keep the node whose last Pong is the oldest */
        if (min_pong_node == NULL || min_pong > this->pong_received) {
            min_pong_node = this;
            min_pong = this->pong_received;
        }
    }
    /* Send a Ping to the node whose last Pong is the oldest */
    if (min_pong_node) {
        serverLog(LL_DEBUG,"Pinging node %.40s", min_pong_node->name);
        clusterSendPing(min_pong_node->link, CLUSTERMSG_TYPE_PING);
    }
}

We will look at the clusterSendPing function in detail later, because it is also used in several other steps.
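
The selection rule in the fragment above can be isolated into a small standalone function. This is a sketch with invented names (sample_node, oldest_pong_among); it assumes the random candidate indexes have already been drawn and only reproduces the filtering and the “oldest Pong wins” comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int connected;          /* stands in for link != NULL */
    int64_t ping_sent;      /* non-zero while a Ping is outstanding */
    int64_t pong_received;  /* time of the last Pong from this node */
} sample_node;

/* Given `n` candidate indexes already drawn at random (the source draws 5),
 * return the index of the eligible node with the oldest pong_received,
 * or -1 if none of the candidates is eligible. */
int oldest_pong_among(const sample_node *nodes, const int *candidates, int n) {
    int best = -1;
    assert(nodes != NULL && candidates != NULL);
    for (int j = 0; j < n; j++) {
        const sample_node *c = &nodes[candidates[j]];
        /* skip disconnected nodes and nodes that already have a Ping in flight */
        if (!c->connected || c->ping_sent != 0) continue;
        if (best == -1 || nodes[best].pong_received > c->pong_received)
            best = candidates[j];
    }
    return best;
}
```

Sampling a handful of nodes and preferring the stalest one keeps the per-second cost constant while still biasing Pings toward nodes that have not been heard from recently.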

Nodes join the cluster

After a node executes the CLUSTER MEET command, it creates a clusterNode structure for the new node. The link field of that structure, that is, the TCP connection, is NULL, indicating that no connection to the new node has been established yet.

The clusterCron function also handles these new nodes that have no connection yet: it calls createClusterLink to create the connection, and then calls clusterSendPing to send a Meet message.

/* Part of clusterCron() in cluster.c: create a connection for nodes that do not have one yet */
if (node->link == NULL) {
    int fd;
    mstime_t old_ping_sent;
    clusterLink *link;
    /* Establish a connection to this node's cluster bus port */
    fd = anetTcpNonBlockBindConnect(server.neterr, node->ip,
        node->cport, NET_FIRST_BIND_ADDR);
    /* ... error handling when fd is -1 */
    /* Create the link */
    link = createClusterLink(node);
    link->fd = fd;
    node->link = link;
    aeCreateFileEvent(server.el,link->fd,AE_READABLE,
            clusterReadHandler,link);
    /* Send a Ping to the newly connected node right away, so that it is not wrongly considered offline */
    /* If the node is flagged as MEET, send a Meet message, otherwise send a Ping */
    old_ping_sent = node->ping_sent;
    clusterSendPing(link, node->flags & CLUSTER_NODE_MEET ?
            CLUSTERMSG_TYPE_MEET : CLUSTERMSG_TYPE_PING);
    /* .... */
    /* The MEET flag can be cleared after the first packet is sent. If no Pong ever arrives, no further packets are sent to the target node */
    /* Once a Pong is received and the node is no longer in handshake state, normal Ping messages are sent to it */
    node->flags &= ~CLUSTER_NODE_MEET;
}

Preventing false node timeouts and stale status

Preventing false node timeouts and marking nodes as suspected offline both happen in the clusterCron function, as shown below. It walks the current list of all nodes. If a Ping to a node has been waiting for its Pong for more than half of the configured timeout, it releases the link to that node so that the link is re-created, in case the connection rather than the node is the problem. And if no Pong has been received from a node for more than half of the timeout and no Ping is outstanding, it actively sends a Ping to that node.

/* Part of clusterCron() in cluster.c: iterate over the nodes to check for failed ones */
while((de = dictNext(di)) != NULL) {
    clusterNode *node = dictGetVal(de);
    now = mstime(); /* Use an updated time at every iteration. */
    mstime_t delay;

    /* If we have been waiting for the Pong for more than half of the node timeout, */
    /* the node may be fine but the connection may have a problem */
    if (node->link && /* is connected */
        now - node->link->ctime >
        server.cluster_node_timeout && /* was not already reconnected */
        node->ping_sent && /* a Ping has been sent */
        node->pong_received < node->ping_sent && /* still waiting for the Pong */
        /* and the wait for the Pong exceeds timeout / 2 */
        now - node->ping_sent > server.cluster_node_timeout/2)
    {
        /* Release the link; clusterCron() will reconnect automatically on its next run */
        freeClusterLink(node->link);
    }

    /* If no Ping to this node is currently outstanding */
    /* and no Pong has been received from it for half of the node timeout, */
    /* send it a Ping now so that its information does not get too old; random selection alone may never pick it */
    if (node->link &&
        node->ping_sent == 0 &&
        (now - node->pong_received) > server.cluster_node_timeout/2)
    {
        clusterSendPing(node->link, CLUSTERMSG_TYPE_PING);
        continue;
    }
    /* ... handle manual failover and mark suspected offline */
}

Handle failover and mark suspected offline

If, after the false-timeout handling above, the node still has not received a Pong message from the target node and the wait has exceeded cluster_node_timeout, it marks the target node as suspected offline.

/* If this node is a master and a slave is requesting manual failover, keep sending Pings to that slave */
if (server.cluster->mf_end &&
    nodeIsMaster(myself) &&
    server.cluster->mf_slave == node &&
    node->link)
{
    clusterSendPing(node->link, CLUSTERMSG_TYPE_PING);
    continue;
}

/* The code below only runs when a Ping to this node is outstanding */
if (node->ping_sent == 0) continue;

/* Calculate how long we have been waiting for the Pong reply */
delay = now - node->ping_sent;
/* The wait for the Pong exceeds the limit: mark the target node as PFAIL (suspected offline) */
if (delay > server.cluster_node_timeout) {
    /* Timed out: mark as suspected offline unless already PFAIL or FAIL */
    if (!(node->flags & (CLUSTER_NODE_PFAIL|CLUSTER_NODE_FAIL))) {
        serverLog(LL_DEBUG,"*** NODE %.40s possibly failing",
            node->name);
        /* Set the suspected-offline flag */
        node->flags |= CLUSTER_NODE_PFAIL;
        update_state = 1;
    }
}
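
The three timing checks from the last two sections can be summarized in one small decision function. This is only a sketch: peer_view, cron_action and cron_decide are invented names, and unlike the real loop, which can fall through several checks in one pass, this version returns only the first action that applies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { ACT_NONE, ACT_FREE_LINK, ACT_SEND_PING, ACT_MARK_PFAIL } cron_action;

typedef struct {
    int connected;          /* stands in for node->link != NULL */
    int64_t link_ctime;     /* when the link was created (ms) */
    int64_t ping_sent;      /* 0 when no Ping is outstanding (ms) */
    int64_t pong_received;  /* time of the last Pong (ms) */
} peer_view;

/* Decide what the periodic check would do for one peer at time `now`,
 * given the configured node timeout in milliseconds. */
cron_action cron_decide(const peer_view *p, int64_t now, int64_t timeout) {
    assert(p != NULL && timeout > 0);
    /* Pong overdue by more than timeout/2 on an old link: suspect the link */
    if (p->connected && now - p->link_ctime > timeout && p->ping_sent != 0 &&
        p->pong_received < p->ping_sent && now - p->ping_sent > timeout / 2)
        return ACT_FREE_LINK;
    /* Nothing in flight and the last Pong is older than timeout/2: refresh */
    if (p->connected && p->ping_sent == 0 && now - p->pong_received > timeout / 2)
        return ACT_SEND_PING;
    /* A Ping has been unanswered for more than the full timeout: PFAIL */
    if (p->ping_sent != 0 && now - p->ping_sent > timeout)
        return ACT_MARK_PFAIL;
    return ACT_NONE;
}
```

With a timeout of 15000 ms, for example, a connected peer with no Ping in flight whose last Pong is 10 seconds old gets a fresh Ping, while a disconnected peer whose Ping has gone unanswered for 20 seconds is marked PFAIL.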

Actually sending the gossip message

The following is the source code of the clusterSendPing() function that has been called many times above. The code has detailed comments, which you can read for yourself. Its main job is to convert the clusterState maintained by the node into the corresponding message structure.

/* Send a Meet, Ping or Pong message to the specified node */
void clusterSendPing(clusterLink *link, int type) {
    unsigned char *buf;
    clusterMsg *hdr;
    int gossipcount = 0; /* Number of gossip sections added so far. */
    int wanted; /* Number of gossip sections we want to append if possible. */
    int totlen; /* Total packet length. */
    //freshnodes is a budget of how many nodes can still be gossiped about
    //Each time a gossip section is added, freshnodes is decremented by one
    //When freshnodes drops to 0 or below, no more gossip sections are added
    //It starts as the number of nodes in the nodes table minus 2
    //The 2 stands for two nodes: one is myself (the node sending the message)
    //The other is the node receiving the gossip message
    int freshnodes = dictSize(server.cluster->nodes)-2;

    
    /* Number of gossip sections to carry: 1/10 of the nodes in the cluster, at least 3, and never more than freshnodes */
    wanted = floor(dictSize(server.cluster->nodes)/10);
    if (wanted < 3) wanted = 3;
    if (wanted > freshnodes) wanted = freshnodes;

    /* ... calculation of totlen etc. omitted */

    /* If the message being sent is a Ping, update the timestamp of the last Ping sent */
    if (link->node && type == CLUSTERMSG_TYPE_PING)
        link->node->ping_sent = mstime();
    /* Write the current node's information (name, address, port, slots served) into the message header */
    clusterBuildMessageHdr(hdr,type);

    /* Populate the gossip fields */
    int maxiterations = wanted*3;
    /* Randomly pick nodes and add one gossip section per pick, until
       `wanted` sections are added or the iteration budget runs out */
    while(freshnodes > 0 && gossipcount < wanted && maxiterations--) {
        /* Randomly pick a node from the nodes dictionary */
        dictEntry *de = dictGetRandomKey(server.cluster->nodes);
        clusterNode *this = dictGetVal(de);

        /* The following nodes are not selected:
         * Myself: the node itself.
         * Nodes in PFAIL state (they are added later).
         * Nodes in handshake state.
         * Nodes with the NOADDR flag.
         * Disconnected nodes that do not serve any slots.
         */
        if (this == myself) continue;
        if (this->flags & CLUSTER_NODE_PFAIL) continue;
        if (this->flags & (CLUSTER_NODE_HANDSHAKE|CLUSTER_NODE_NOADDR) ||
            (this->link == NULL && this->numslots == 0))
        {
            freshnodes--; /* Technically not correct, but saves CPU. */
            continue;
        }

        //Check whether the selected node is already in the hdr->data.ping.gossip array
        //If so, this node has been selected before
        //Do not select it again (otherwise it would be repeated)
        if (clusterNodeIsInGossipSection(hdr,gossipcount,this)) continue;

        /*The selected node is valid and the counter is decremented by one*/
        clusterSetGossipEntry(hdr,gossipcount,this);
        freshnodes--;
        gossipcount++;
    }

    /*... omitted: nodes in PFAIL state, if any, are appended at the end*/


    /*Calculate message length*/
    totlen = sizeof(clusterMsg)-sizeof(union clusterMsgData);
    totlen += (sizeof(clusterMsgDataGossip)*gossipcount);
    /*Record the number of selected nodes (how many gossip entries the message carries) in the count attribute*/
    hdr->count = htons(gossipcount);
    /*Record the length of the message in the message*/
    hdr->totlen = htonl(totlen);
    /*Send network request*/
    clusterSendMessage(link,buf,totlen);
    zfree(buf);
}
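
The rule for how many gossip entries a message carries can be sketched in isolation. This is only an illustration: `gossip_wanted` and `gossip_max_iterations` are hypothetical helpers, not Redis functions, and the sketch assumes that `freshnodes` starts at the total number of known nodes minus 2 (the sender and the receiver).

```c
#include <assert.h>

/* Hypothetical helper (not part of Redis): how many gossip entries one
 * PING/PONG tries to carry in a cluster of `total_nodes` known nodes. */
static int gossip_wanted(int total_nodes) {
    assert(total_nodes >= 0);

    /* Assumption: candidates exclude the sender and the receiver. */
    int freshnodes = total_nodes - 2;
    if (freshnodes < 0) freshnodes = 0;

    /* Integer division already rounds down for non-negative values. */
    int wanted = total_nodes / 10;
    if (wanted < 3) wanted = 3;                   /* lower bound */
    if (wanted > freshnodes) wanted = freshnodes; /* cannot exceed candidates */
    return wanted;
}

/* Upper bound on random picks, mirroring maxiterations = wanted*3. */
static int gossip_max_iterations(int total_nodes) {
    return gossip_wanted(total_nodes) * 3;
}
```

Under these assumptions a 6-node cluster carries 3 entries per message, a 100-node cluster 10, and a 1000-node cluster 100. The iteration cap matters because random picks can repeatedly hit nodes that are skipped or already selected.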


void clusterSetGossipEntry(clusterMsg *hdr, int i, clusterNode *n) {
    clusterMsgDataGossip *gossip;
    /*Point to the i-th gossip entry of the message*/
    gossip = &(hdr->data.ping.gossip[i]);
    /*Record the name of the selected node in the gossip entry*/
    memcpy(gossip->nodename,n->name,CLUSTER_NAMELEN);
    /*Record when a PING was last sent to the selected node (in seconds)*/
    gossip->ping_sent = htonl(n->ping_sent/1000);
    /*Record when a PONG was last received from the selected node (in seconds)*/
    gossip->pong_received = htonl(n->pong_received/1000);
    /*Record the IP of the selected node in the gossip entry*/
    memcpy(gossip->ip,n->ip,sizeof(n->ip));
    /*Record the client port and the cluster bus port of the selected node*/
    gossip->port = htons(n->port);
    gossip->cport = htons(n->cport);
    /*Record the flags of the selected node in the gossip entry*/
    gossip->flags = htons(n->flags);
    gossip->notused1 = 0;
}
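
Two details of clusterSetGossipEntry are easy to miss: timestamps are kept in milliseconds locally but sent as seconds, and every multi-byte field is converted to network byte order. A minimal sketch of that round trip follows. `demoGossipEntry`, `demo_pack` and `demo_pong_received_ms` are hypothetical names for illustration, and the struct is a trimmed-down stand-in, not the real clusterMsgDataGossip layout.

```c
#include <arpa/inet.h>   /* htonl, htons, ntohl, ntohs (POSIX) */
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down gossip entry: only the numeric fields. */
typedef struct {
    uint32_t ping_sent;      /* seconds, network byte order */
    uint32_t pong_received;  /* seconds, network byte order */
    uint16_t port;           /* network byte order */
    uint16_t flags;          /* network byte order */
} demoGossipEntry;

/* Sender side: millisecond timestamps are reduced to seconds, then all
 * multi-byte fields are converted to network byte order. */
static void demo_pack(demoGossipEntry *g, int64_t ping_ms, int64_t pong_ms,
                      int port, int flags) {
    g->ping_sent = htonl((uint32_t)(ping_ms / 1000));
    g->pong_received = htonl((uint32_t)(pong_ms / 1000));
    g->port = htons((uint16_t)port);
    g->flags = htons((uint16_t)flags);
}

/* Receiver side: convert back to host order and to milliseconds.
 * The sub-second part was dropped by the sender and cannot be recovered. */
static int64_t demo_pong_received_ms(const demoGossipEntry *g) {
    return (int64_t)ntohl(g->pong_received) * 1000;
}
```

Packing a PONG time of 1627200001987 ms and reading it back gives 1627200001000 ms, so the receiver sees second-level precision only.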

The following is the clusterBuildMessageHdr function, which fills in the common header fields of the message and the status information of the current node.

/*Build the header of the message*/
void clusterBuildMessageHdr(clusterMsg *hdr, int type) {
    int totlen = 0;
    uint64_t offset;
    clusterNode *master;

    /*If the current node is a slave, master is its master node; otherwise master is the current node itself*/
    master = (nodeIsSlave(myself) && myself->slaveof) ?
              myself->slaveof : myself;

    memset(hdr,0,sizeof(*hdr));
    /*Initialize the protocol version, signature ("RCmb") and message type*/
    hdr->ver = htons(CLUSTER_PROTO_VER);
    hdr->sig[0] = 'R';
    hdr->sig[1] = 'C';
    hdr->sig[2] = 'm';
    hdr->sig[3] = 'b';
    hdr->type = htons(type);
    /*The message header sets the current node ID*/
    memcpy(hdr->sender,myself->name,CLUSTER_NAMELEN);

    /*The message header sets the IP address of the current node*/
    memset(hdr->myip,0,NET_IP_STR_LEN);
    if (server.cluster_announce_ip) {
        strncpy(hdr->myip,server.cluster_announce_ip,NET_IP_STR_LEN);
        hdr->myip[NET_IP_STR_LEN-1] = '\0';
    }

    /*Client port and cluster bus port to announce*/
    int announced_port = server.cluster_announce_port ?
                         server.cluster_announce_port : server.port;
    int announced_cport = server.cluster_announce_bus_port ?
                          server.cluster_announce_bus_port :
                          (server.port + CLUSTER_PORT_INCR);

    /*Set the slot information (a slave announces the slots of its master)*/
    memcpy(hdr->myslots,master->slots,sizeof(hdr->myslots));
    memset(hdr->slaveof,0,CLUSTER_NAMELEN);
    if (myself->slaveof != NULL)
        memcpy(hdr->slaveof,myself->slaveof->name, CLUSTER_NAMELEN);
    hdr->port = htons(announced_port);
    hdr->cport = htons(announced_cport);
    hdr->flags = htons(myself->flags);
    hdr->state = server.cluster->state;

    /*Set currentEpoch and configEpoch*/
    hdr->currentEpoch = htonu64(server.cluster->currentEpoch);
    hdr->configEpoch = htonu64(master->configEpoch);

    /*Set the replication offset*/
    if (nodeIsSlave(myself))
        offset = replicationGetSlaveOffset();
    else
        offset = server.master_repl_offset;
    hdr->offset = htonu64(offset);

    /* Set the message flags. */
    if (nodeIsMaster(myself) && server.cluster->mf_end)
        hdr->mflags[0] |= CLUSTERMSG_FLAG0_PAUSED;

    /*Calculate and set the total length for FAIL and UPDATE messages;
      for other message types the caller sets it*/
    if (type == CLUSTERMSG_TYPE_FAIL) {
        totlen = sizeof(clusterMsg)-sizeof(union clusterMsgData);
        totlen += sizeof(clusterMsgDataFail);
    } else if (type == CLUSTERMSG_TYPE_UPDATE) {
        totlen = sizeof(clusterMsg)-sizeof(union clusterMsgData);
        totlen += sizeof(clusterMsgDataUpdate);
    }
    hdr->totlen = htonl(totlen);
}
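
The length calculation used in both functions follows one pattern: take the size of the whole message struct, subtract the size of the payload union, then add only the payload actually sent. The sketch below illustrates that pattern with simplified, hypothetical types (`demoMsg`, `demoGossip`, `demoFail`); their fields and sizes are stand-ins and do not match the real definitions in cluster.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical stand-ins for the Redis message structures. */
typedef struct {
    char nodename[40];
    uint32_t ping_sent;
    uint32_t pong_received;
    char ip[46];
    uint16_t port, cport, flags;
    uint32_t notused1;
} demoGossip;

typedef struct { char nodename[40]; } demoFail;

/* The payload is a union, so sizeof(union) is the size of its largest member. */
union demoMsgData {
    struct { demoGossip gossip[1]; } ping;
    struct { demoFail about; } fail;
};

typedef struct {
    char sig[4];
    uint32_t totlen;
    uint16_t ver, type, count;
    union demoMsgData data;
} demoMsg;

/* Bytes that every message carries, regardless of its type. */
static size_t demo_header_len(void) {
    return sizeof(demoMsg) - sizeof(union demoMsgData);
}

/* PING/PONG/MEET: header plus one gossip entry per selected node. */
static size_t demo_ping_len(size_t gossipcount) {
    return demo_header_len() + sizeof(demoGossip) * gossipcount;
}

/* FAIL: header plus the fixed-size FAIL payload. */
static size_t demo_fail_len(void) {
    return demo_header_len() + sizeof(demoFail);
}
```

Computing the length this way means a message with no gossip entries costs only the header, and a FAIL message does not pay for the larger gossip member of the union.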

Postscript

Originally, I only wanted to write about the gossip protocol of Redis cluster, but the article kept growing as I wrote. As a result, the source code analysis at the end is thinner than the start promised: a strong opening with a weak finish. Please bear with it. I hope you will keep following my future articles.

You are welcome to visit my personal blog.
