CLOSE_WAIT Problem Caused by a Container IP Change After Restart in Docker Swarm Mode



Brief description of the problem

The sequence below shows the problem: after the server is restarted with docker restart, log data written by the client is lost and no error is reported.
Sequence diagrams are not supported here, so the diagram source is included as text.

client->server: log_data
client->server: log_data
server->server: docker restart
server->client: fin
client->server: log_data loss without error

TCP state diagram

Locating the Problem

Why is it stuck in CLOSE_WAIT?

The TCP state transition diagram shows that the client has received the server's FIN but, because it never calls recv, has not noticed it and has not closed its socket, so the connection stays in CLOSE_WAIT. This matches the actual code.
So why does the client still report no error when it sends messages after the server's docker restart has put the connection into CLOSE_WAIT?

  1. TCP allows a client to keep sending after it has received FIN, because the FIN closes only the server-to-client direction.
  2. After docker restart the server has a new IP, but the client still sends to the original IP. No host at that address answers with RST, so the messages pile up in the system send buffer.
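
The first point can be reproduced locally. The following is a minimal sketch, assuming a passive-mode socket on the loopback interface; the module name HalfCloseDemo is hypothetical and not part of the original code. It differs from the swarm case in one respect: on loopback the peer host still exists, so a later send can fail once an RST comes back, whereas in the scenario above nothing ever answers.

```elixir
defmodule HalfCloseDemo do
  # Hypothetical demo module, not taken from the original code base.
  # Returns whatever :gen_tcp.send/2 reports after the peer has sent FIN.
  def send_after_peer_fin do
    {:ok, listener} = :gen_tcp.listen(0, [:binary, active: false, ip: {127, 0, 0, 1}])
    {:ok, port} = :inet.port(listener)
    {:ok, client} = :gen_tcp.connect({127, 0, 0, 1}, port, [:binary, active: false], 1_000)
    {:ok, server_side} = :gen_tcp.accept(listener, 1_000)

    # Closing the accepted socket makes the kernel send FIN to the client,
    # which moves the client's connection into CLOSE_WAIT.
    :ok = :gen_tcp.close(server_side)
    Process.sleep(100)

    # The client never calls recv, so it has not noticed the FIN.
    # The data is simply handed to the kernel send buffer.
    result = :gen_tcp.send(client, "log_data")

    :gen_tcp.close(client)
    :gen_tcp.close(listener)
    result
  end
end
```

Calling HalfCloseDemo.send_after_peer_fin() shows what the first send after the FIN returns; a success here only means the kernel accepted the bytes, not that the server received them.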

The backlog is visible in the Send-Q column of netstat (the third column), which grows from 402 to 70125:

[email protected]:/# netstat -nap | grep 27017 | grep 10.0.0
tcp        1  402         CLOSE_WAIT  4308/server
[email protected]:/# netstat -nap | grep 27017 | grep 10.0.0
tcp        1  70125         CLOSE_WAIT  4308/server

At this point, everything looks fine from the Elixir socket interface: both querying the socket and sending on it succeed.

iex([email protected])25> socket |> :inet.port
{:ok, 57395}
iex([email protected])26> socket |> :gen_tcp.send("aaa")
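
A read, unlike a send, does expose the FIN. The sketch below assumes a passive-mode (active: false) socket; PeerCloseCheck and its 100 ms default timeout are hypothetical choices for illustration. Note that any payload that happens to be readable is consumed and discarded, so this only suits connections where the client expects no inbound data.

```elixir
defmodule PeerCloseCheck do
  # Hypothetical helper, assuming a passive-mode (active: false) socket.
  # Tries a short read: a peer that has sent FIN shows up as {:error, :closed}.
  def peer_closed?(socket, timeout_ms \\ 100) do
    case :gen_tcp.recv(socket, 0, timeout_ms) do
      # FIN already received: the connection is in CLOSE_WAIT.
      {:error, :closed} -> true
      # Nothing to read within the timeout: the peer has not closed.
      {:error, :timeout} -> false
      # Data arrived (and is dropped here): the peer is still talking.
      {:ok, _data} -> false
      # Any other error (for example a reset) also means the socket is unusable.
      {:error, _other} -> true
    end
  end
end
```

Running PeerCloseCheck.peer_closed?(socket) before each batch of writes would let the client reconnect instead of writing into a half-closed connection.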

If the client then actively closes the socket, the connection enters the LAST_ACK state:

iex([email protected])27> socket |> :gen_tcp.close()    
[email protected]:/# netstat -nap | grep 27017 | grep 10.0.0
tcp        1  70126         LAST_ACK    -   

Recovery of CLOSE_WAIT

CLOSE_WAIT cannot be detected if the code only sends and never receives. An application-layer heartbeat is the obvious solution. So when can the error be detected without a heartbeat and without any extra sending or receiving?

  1. When the send buffer is full.
  2. TODO: look deeper into TCP keepalive, and into the maximum idle time of a TCP connection when keepalive is not used.
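
Both items map to socket options that :inet.setopts/2 accepts. The following is a minimal sketch rather than a verified fix for the swarm case: SocketGuards and the 5 second default are hypothetical. The send_timeout option turns a send that blocks on a full buffer into {:error, :timeout}, and keepalive asks the kernel to probe idle connections. How soon keepalive fires depends on kernel settings, and whether it helps once unacknowledged data is already queued is exactly the open TODO above.

```elixir
defmodule SocketGuards do
  # Hypothetical helper; the default timeout is an illustrative assumption.
  def harden(socket, send_timeout_ms \\ 5_000) do
    :inet.setopts(socket,
      # Ask the kernel to probe the peer when the connection is idle.
      keepalive: true,
      # A send blocked on a full send buffer returns {:error, :timeout}
      # after this many milliseconds instead of blocking indefinitely.
      send_timeout: send_timeout_ms,
      # Close the socket automatically when such a send timeout occurs.
      send_timeout_close: true
    )
  end
end
```

Calling SocketGuards.harden(socket) right after :gen_tcp.connect would apply these options to the log connection; :inet.getopts(socket, [:keepalive]) can be used to confirm the setting.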
