How Linux handles the TIME_WAIT and FIN_WAIT_2 states

Date: 2021-07-20
  1. Kernel version 3.10 is used as the example; kernel 4.1+ changed the handling of FIN_WAIT_2, which is covered later.
  2. The code has been simplified where reasonable.

TL;DR

  • The Linux TCP TIME_WAIT timeout is 60 seconds by default and cannot be modified
  • FIN_WAIT_2 and TIME_WAIT share one implementation in the Linux TCP stack
  • The FIN_WAIT_2 timeout can be changed through tcp_fin_timeout
  • The implementation behind tcp_fin_timeout differs between the 3.10 kernel and 4.1+ kernels
  • Both tcp_tw_reuse and tcp_tw_recycle require timestamps to be enabled, which is not NAT-friendly
  • A 4.3+ kernel is recommended; see the last section for parameter configuration


Figure 1. TCP state machine

Source code analysis

Entry point

Start from tcp_fin() in tcp_input.c: when the actively closing side, sitting in FIN_WAIT_2, receives the FIN from the passively closing side, it moves to TIME_WAIT for further processing.

link:linux/net/ipv4/tcp_input.c

/*
 *  /net/ipv4/tcp_input.c
 *  ...
 *	If we are in FINWAIT-2, a received FIN moves us to TIME-WAIT.
 */
static void tcp_fin(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);

	inet_csk_schedule_ack(sk);

	sk->sk_shutdown |= RCV_SHUTDOWN;
	sock_set_flag(sk, SOCK_DONE);

	switch (sk->sk_state) {
	case TCP_SYN_RECV:
	case TCP_ESTABLISHED:
		...
	case TCP_CLOSE_WAIT:
	case TCP_CLOSING:
		...
	case TCP_LAST_ACK:
		...
	case TCP_FIN_WAIT1:
		...
	case TCP_FIN_WAIT2:
		/* Received the FIN from the passive closer: send an ACK and move to TIME_WAIT */
		tcp_send_ack(sk);
		tcp_time_wait(sk, TCP_TIME_WAIT, 0);
		break;
	default:
		/* Only TCP_LISTEN and TCP_CLOSE are left, in these
		 * cases we should never reach this piece of code.
		 */
		pr_err("%s: Impossible, sk->sk_state=%d\n",
		       __func__, sk->sk_state);
		break;
	}
	...
}

Handling TIME_WAIT

tcp_minisocks.c handles the state recycling and bounds the size of the TIME_WAIT bucket:

a) net.ipv4.tcp_tw_recycle only enables fast recycling when net.ipv4.tcp_timestamps is enabled as well
b) while a connection is in TIME_WAIT, the cleanup time is 60 s by default and cannot be modified (see the defines quoted just below)
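
The 60 s value is a compile-time constant rather than a sysctl, which is why it cannot be changed at runtime. For reference, the relevant defines in include/net/tcp.h of this era look roughly like this:

/* include/net/tcp.h */
#define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy TIME-WAIT
				  * state, about 60 seconds	*/
#define TCP_FIN_TIMEOUT	TCP_TIMEWAIT_LEN
				 /* BSD style FIN_WAIT2 deadlock breaker.
				  * It used to be 3min, new value is 60sec,
				  * to combine FIN-WAIT-2 timeout with
				  * TIME-WAIT timer. */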

link:linux/net/ipv4/tcp_minisocks.c

// The tcp_death_row structure
/*
tcp_death_row has two recycling mechanisms: a timewait socket with a long
timeout is put on the tw_timer queue, while one with a short timeout is put
on the twcal_timer queue.
The precision of tw_timer is TCP_TIMEWAIT_LEN / INET_TWDR_TWKILL_SLOTS = 7.5 s.
The twcal_timer unit is not a fixed value but is derived from HZ (250 in the
3.10 kernel); its precision is 1 << INET_TWDR_RECYCLE_TICK jiffies, roughly 1/8 s.
*/
void tcp_time_wait(struct sock *sk, int state, int timeo)
{
	struct inet_timewait_sock *tw = NULL;
	const struct inet_connection_sock *icsk = inet_csk(sk);
	const struct tcp_sock *tp = tcp_sk(sk);
	bool recycle_ok = false;

	// Fast recycling needs both tcp_tw_recycle and a remembered peer timestamp
	if (tcp_death_row.sysctl_tw_recycle && tp->rx_opt.ts_recent_stamp)
		recycle_ok = tcp_remember_stamp(sk);

	// If the number of sockets waiting to be recycled is below the configured
	// bucket limit, allocate a timewait socket and put it on the queue
	if (tcp_death_row.tw_count < tcp_death_row.sysctl_max_tw_buckets)
		tw = inet_twsk_alloc(sk, state);

	if (tw != NULL) {
		struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
		const int rto = (icsk->icsk_rto << 2) - (icsk->icsk_rto >> 1); // 3.5*RTO
		struct inet_sock *inet = inet_sk(sk);

		tw->tw_transparent	= inet->transparent;
		tw->tw_mark		= sk->sk_mark;
		tw->tw_rcv_wscale	= tp->rx_opt.rcv_wscale;
		tcptw->tw_rcv_nxt	= tp->rcv_nxt;
		tcptw->tw_snd_nxt	= tp->snd_nxt;
		tcptw->tw_rcv_wnd	= tcp_receive_window(tp);
		tcptw->tw_ts_recent	= tp->rx_opt.ts_recent;
		tcptw->tw_ts_recent_stamp = tp->rx_opt.ts_recent_stamp;
		tcptw->tw_ts_offset	= tp->tsoffset;
		tcptw->tw_last_oow_ack_time = 0;

		/* ... #ifdef/#endif block elided ... */

		/* Get the TIME_WAIT timeout firing. */
    // The timeo passed in from tcp_fin() is 0, so it is raised to 3.5*RTO here
		if (timeo < rto)
			timeo = rto;

    // If fast recycling is enabled, the timewait timeout is the RTO-based value;
    // otherwise it is 60 s. When the state is TCP_TIME_WAIT, timeo is also forced
    // to 60 s, which drives the scheduling below.
    // BTW, the RTO is usually much smaller than the configured value, unless both
    // sides suffer network jitter or hardware problems and retransmit many times
		if (recycle_ok) {
			tw->tw_timeout = rto;
		} else {
			tw->tw_timeout = TCP_TIMEWAIT_LEN;
			if (state == TCP_TIME_WAIT)
				timeo = TCP_TIMEWAIT_LEN;
		}

		/* Linkage updates. */
		__inet_twsk_hashdance(tw, sk, &tcp_hashinfo);

    // Two timers:
    // 1. the tw_timer with the 60 s timeout defined by TCP_TIMEWAIT_LEN
    // 2. the timer for the short 3.5*RTO timeout
		inet_twsk_schedule(tw, &tcp_death_row, timeo,
				   TCP_TIMEWAIT_LEN);
		inet_twsk_put(tw);
	} else {
		/* Sorry, if we're out of memory, just CLOSE this
		 * socket up.  We've got bigger problems than
		 * non-graceful socket closings.
		 */
		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPTIMEWAITOVERFLOW);
	}

	tcp_update_metrics(sk);
	tcp_done(sk);
}

TIME_WAIT timer scheduling

The slot is computed from the timeo value; depending on the result, the socket is put on one of two timers and waits there to be cleaned up.

/*
* /net/ipv4/inet_timewait_sock.c
*/
void inet_twsk_schedule(struct inet_timewait_sock *tw,
		       struct inet_timewait_death_row *twdr,
		       const int timeo, const int timewait_len)
{
	struct hlist_head *list;
	unsigned int slot;

	/* timeout := RTO * 3.5
	 *
	 * 3.5 = 1+2+0.5 to wait for two retransmits.
	 *
	 * RATIONALE: if FIN arrived and we entered TIME-WAIT state,
	 * our ACK acking that FIN can be lost. If N subsequent retransmitted
	 * FINs (or previous seqments) are lost (probability of such event
	 * is p^(N+1), where p is probability to lose single packet and
	 * time to detect the loss is about RTO*(2^N - 1) with exponential
	 * backoff). Normal timewait length is calculated so, that we
	 * waited at least for one retransmitted FIN (maximal RTO is 120sec).
	 * [ BTW Linux. following BSD, violates this requirement waiting
	 *   only for 60sec, we should wait at least for 240 secs.
	 *   Well, 240 consumes too much of resources 8)
	 * ]
	 * This interval is not reduced to catch old duplicate and
	 * responces to our wandering segments living for two MSLs.
	 * However, if we use PAWS to detect
	 * old duplicates, we can reduce the interval to bounds required
	 * by RTO, rather than MSL. So, if peer understands PAWS, we
	 * kill tw bucket after 3.5*RTO (it is important that this number
	 * is greater than TS tick!) and detect old duplicates with help
	 * of PAWS.
	 */
  // The slot index is computed from the timeout value
	slot = (timeo + (1 << INET_TWDR_RECYCLE_TICK) - 1) >> INET_TWDR_RECYCLE_TICK;

	spin_lock(&twdr->death_lock);

	/* Unlink it, if it was scheduled */
	if (inet_twsk_del_dead_node(tw))
		twdr->tw_count--;
	else
		atomic_inc(&tw->tw_refcnt);
  
  // If the computed slot is >= INET_TWDR_RECYCLE_SLOTS (1 << 5 = 32), use the slow timer; otherwise use the fast timer
	if (slot >= INET_TWDR_RECYCLE_SLOTS) {
		/* Schedule to slow timer */
		if (timeo >= timewait_len) {
			slot = INET_TWDR_TWKILL_SLOTS - 1;
		} else {
			slot = DIV_ROUND_UP(timeo, twdr->period);
			if (slot >= INET_TWDR_TWKILL_SLOTS)
				slot = INET_TWDR_TWKILL_SLOTS - 1;
		}
		tw->tw_ttd = inet_tw_time_stamp() + timeo;
		slot = (twdr->slot + slot) & (INET_TWDR_TWKILL_SLOTS - 1);
		list = &twdr->cells[slot];
	} else {
		tw->tw_ttd = inet_tw_time_stamp() + (slot << INET_TWDR_RECYCLE_TICK);

		if (twdr->twcal_hand < 0) {
			twdr->twcal_hand = 0;
			twdr->twcal_jiffie = jiffies;
			twdr->twcal_timer.expires = twdr->twcal_jiffie +
					      (slot << INET_TWDR_RECYCLE_TICK);
			add_timer(&twdr->twcal_timer);
		} else {
			if (time_after(twdr->twcal_timer.expires,
				       jiffies + (slot << INET_TWDR_RECYCLE_TICK)))
				mod_timer(&twdr->twcal_timer,
					  jiffies + (slot << INET_TWDR_RECYCLE_TICK));
			slot = (twdr->twcal_hand + slot) & (INET_TWDR_RECYCLE_SLOTS - 1);
		}
		list = &twdr->twcal_row[slot];
	}

	hlist_add_head(&tw->tw_death_node, list);

	if (twdr->tw_count++ == 0)
		mod_timer(&twdr->tw_timer, jiffies + twdr->period);
	spin_unlock(&twdr->death_lock);
}
EXPORT_SYMBOL_GPL(inet_twsk_schedule);
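
To make the rounding concrete, here is a minimal user-space sketch (not kernel code) of the slot arithmetic above. It assumes HZ = 250 as in the 3.10 setup, so INET_TWDR_RECYCLE_TICK = 5, INET_TWDR_RECYCLE_SLOTS = 32 and INET_TWDR_TWKILL_SLOTS = 8; the timeouts fed in (e.g. RTO = 200 ms) are only examples:

/* Stand-alone sketch of the slot rounding in inet_twsk_schedule().
 * Constants assume HZ = 250 and are copied here only for illustration. */
#include <stdio.h>

#define HZ                      250
#define TCP_TIMEWAIT_LEN        (60 * HZ)
#define INET_TWDR_RECYCLE_TICK  5                /* (8 + 2 - 5) for HZ <= 256 */
#define INET_TWDR_RECYCLE_SLOTS (1 << 5)         /* 32 fast-timer slots */
#define INET_TWDR_TWKILL_SLOTS  8                /* slow-timer slots */
#define TWDR_PERIOD             (TCP_TIMEWAIT_LEN / INET_TWDR_TWKILL_SLOTS) /* 7.5 s */

static void classify(int timeo) /* timeo in jiffies */
{
	unsigned int slot = (timeo + (1 << INET_TWDR_RECYCLE_TICK) - 1)
			    >> INET_TWDR_RECYCLE_TICK;

	if (slot >= INET_TWDR_RECYCLE_SLOTS)
		printf("timeo %.2fs -> slow tw_timer, precision %.2fs\n",
		       timeo / (double)HZ, TWDR_PERIOD / (double)HZ);
	else
		printf("timeo %.2fs -> fast twcal_timer, slot %u, precision %.3fs\n",
		       timeo / (double)HZ, slot,
		       (1 << INET_TWDR_RECYCLE_TICK) / (double)HZ);
}

int main(void)
{
	classify(175);              /* 3.5 * RTO with RTO = 200 ms, i.e. 0.7 s */
	classify(TCP_TIMEWAIT_LEN); /* the default 60 s TIME_WAIT timeout */
	return 0;
}

With these constants a recycled socket (0.7 s) lands on the fast timer with 1/8 s precision, while the default 60 s timeout goes to the slow timer with 7.5 s precision.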

What changed in 4.1+ kernels

The TIME_WAIT processing logic was reworked in 4.1; see the commit below for the specific change.

tcp/dccp: get rid of central timewait timer

Using a timer wheel for timewait sockets was a good choice about 15 years ago, when memory was expensive and machines had a single CPU. But it does not scale, the code is ugly, and it is a source of huge latencies (CPUs can frequently be seen spinning on the death_lock spinlock for up to 30 ms).
We can now afford an extra 64 bytes per timewait socket and spread the timewait load across all CPUs for better behavior.
The tests are as follows; /proc/sys/net/ipv4/tcp_tw_recycle is set to 1 on the server side (lpaa24):

Before the patch:
lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
419594

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
437171
While the test runs, latencies of 25 to 33 ms can be observed:

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2

After the patch:
Throughput increased by about 90%:

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
810442

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
800992

Network utilization increased by more than 90%, while latency stays at a very low level:

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms

commit:789f558cfb3680aeb52de137418637f6b04b7d22

link:v4.1/net/ipv4/inet_timewait_sock.c

void inet_twsk_schedule(struct inet_timewait_sock *tw, const int timeo)
{
	tw->tw_kill = timeo <= 4*HZ;
	if (!mod_timer_pinned(&tw->tw_timer, jiffies + timeo)) {
		atomic_inc(&tw->tw_refcnt);
		atomic_inc(&tw->tw_dr->tw_count);
	}
}
EXPORT_SYMBOL_GPL(inet_twsk_schedule);

That change was then fixed up in 4.3; see the commit below. A rough translation of the commit message:

When creating a timewait socket, we need to arm the timer before allowing other CPUs to find it. The signal that allows other CPUs to find the socket is setting tw_refcnt to a non-zero value.

Since tw_refcnt is set in __inet_twsk_hashdance(), we therefore need to call inet_twsk_schedule() first.

This also means the tw_refcnt changes have to be removed from inet_twsk_schedule() and handled by the caller.

Note that because we use mod_timer_pinned() and run in BH context, the timer is guaranteed not to expire before tw_refcnt is set.

To make things more readable, I introduced an inet_twsk_reschedule() helper.

When rearming the timer, mod_timer_pending() can be used to make sure a canceled timer is not rearmed.

Note: this bug can only trigger if packets of a flow can hit multiple CPUs, which normally does not happen unless flow steering is broken somehow. The bug was found about five months after the change.

reqsk_queue_hash_req() needs a similar fix for SYN_RECV sockets, but that will come in a separate patch so the fix can be tracked properly.

commit:ed2e923945892a8372ab70d2f61d364b0b6d9054

link:v4.3/net/ipv4/inet_timewait_sock.c#L222

void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm)
{
	if (!rearm) {
		BUG_ON(mod_timer_pinned(&tw->tw_timer, jiffies + timeo));
		atomic_inc(&tw->tw_dr->tw_count);
	} else {
		mod_timer_pending(&tw->tw_timer, jiffies + timeo);
	}
}

In short, the change improves CPU utilization and throughput; a 4.3+ kernel is recommended.

A few common parameters

net.ipv4.tcp_tw_reuse

Conditions for reusing a TIME_WAIT connection:

  • tcp_timestamps = 1, i.e. enabled.
  • tcp_tw_reuse = 1, i.e. enabled.
  • The timestamp of the new connection is larger than the last timestamp recorded for the old connection.
  • The connection has been in TIME_WAIT for more than 1 second: get_seconds() - tcptw->tw_ts_recent_stamp > 1

Connections that can be reused: only outbound (outgoing) connections, never inbound ones.
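
As a quick sanity check, the two preconditions that are sysctls can be read from /proc; a minimal sketch (the file paths are the standard ones exported by the kernel, everything else is illustrative):

/* Minimal sketch: check whether tcp_tw_reuse can apply at all, by reading
 * the two relevant sysctls from /proc. */
#include <stdio.h>

static int read_sysctl(const char *path)
{
	FILE *f = fopen(path, "r");
	int val = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

int main(void)
{
	int ts    = read_sysctl("/proc/sys/net/ipv4/tcp_timestamps");
	int reuse = read_sysctl("/proc/sys/net/ipv4/tcp_tw_reuse");

	printf("tcp_timestamps=%d tcp_tw_reuse=%d\n", ts, reuse);
	if (ts == 1 && reuse >= 1)
		printf("TIME_WAIT reuse may kick in for outgoing connections\n");
	else
		printf("TIME_WAIT reuse will not be used\n");
	return 0;
}

Reuse itself still depends on the per-connection timestamp checks performed in tcp_twsk_unique() shown below.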

What "safe" means

  • TIME_WAIT prevents stale duplicate packets from being accepted by a later connection by mistake; thanks to the timestamp mechanism, such duplicates are simply discarded.
  • TIME_WAIT ensures that when the last ACK sent by the actively closing side is lost (for example dropped because of network delay), the passively closing side does not stay stuck in LAST_ACK and fail to close the connection; to guarantee this, the passively closing side keeps retransmitting its FIN until it is acknowledged.

The corresponding check in the kernel is tcp_twsk_unique():

int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
{
	const struct tcp_timewait_sock *tcptw = tcp_twsk(sktw);
	struct tcp_sock *tp = tcp_sk(sk);

	/* With PAWS, it is safe from the viewpoint
	   of data integrity. Even without PAWS it is safe provided sequence
	   spaces do not overlap i.e. at data rates <= 80Mbit/sec.

	   Actually, the idea is close to VJ's one, only timestamp cache is
	   held not per host, but per port pair and TW bucket is used as state
	   holder.

	   If TW bucket has been already destroyed we fall back to VJ's scheme
	   and use initial timestamp retrieved from peer table.
	 */
    // The timestamp option must be enabled for reuse to be considered
	if (tcptw->tw_ts_recent_stamp &&
	    (twp == NULL || (sysctl_tcp_tw_reuse &&
			     get_seconds() - tcptw->tw_ts_recent_stamp > 1))) {
		tp->write_seq = tcptw->tw_snd_nxt + 65535 + 2;
		if (tp->write_seq == 0)
			tp->write_seq = 1;
		tp->rx_opt.ts_recent	   = tcptw->tw_ts_recent;
		tp->rx_opt.ts_recent_stamp = tcptw->tw_ts_recent_stamp;
		sock_hold(sktw);
		return 1;
	}

	return 0;
}
EXPORT_SYMBOL_GPL(tcp_twsk_unique);

net.ipv4.tcp_tw_recycle

See the TIME_WAIT handling analysis above for details.

Using tcp_tw_recycle is not recommended. In fact, the net.ipv4.tcp_tw_recycle parameter was removed in Linux kernel 4.12; see the removal commit.

tcp_max_tw_buckets

Sets the maximum number of TIME_WAIT sockets, as a defense against simple DoS attacks. It usually should not be lowered artificially. If the limit is exceeded, the kernel immediately destroys the extra TIME_WAIT sockets and logs "TCP: time wait bucket table overflow".
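
To see how close a host is to that limit, the current TIME_WAIT count (the "tw" field of /proc/net/sockstat) can be compared with the sysctl; a minimal sketch with deliberately simple parsing:

/* Minimal sketch: compare the current TIME_WAIT socket count ("tw" in
 * /proc/net/sockstat) against net.ipv4.tcp_max_tw_buckets. Assumes the
 * usual single "TCP:" line format of that file. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	long tw = -1, max_tw = -1;
	FILE *f;

	f = fopen("/proc/net/sockstat", "r");
	if (f) {
		while (fgets(line, sizeof(line), f)) {
			char *p = strstr(line, " tw ");
			if (strncmp(line, "TCP:", 4) == 0 && p)
				sscanf(p, " tw %ld", &tw);
		}
		fclose(f);
	}

	f = fopen("/proc/sys/net/ipv4/tcp_max_tw_buckets", "r");
	if (f) {
		if (fscanf(f, "%ld", &max_tw) != 1)
			max_tw = -1;
		fclose(f);
	}

	printf("time-wait sockets: %ld / %ld\n", tw, max_tw);
	if (tw >= 0 && max_tw > 0 && tw >= max_tw)
		printf("expect \"TCP: time wait bucket table overflow\" in dmesg\n");
	return 0;
}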

How to set the correct value

See [3] for the analysis of timer precision.

Kernels before 4.1 (e.g. 3.10)

  1. tcp_ fin_ timeout <= 3, FIN_ WAIT_ 2 State timeout is TCP_ fin_ Timeout value.
  2. 3
  3. tcp_ fin_ timeout > 60, FIN_ WAIT_ 2 state will go through the keep alive state first, and the duration is TMO = TCP_ fin_ Timeout-60 value, and then experience the timewait state, the duration is (TCP)_ fin_ Timeout – 60) + timer precision, which is based on (TCP_ fin_ The calculated value of “timeout – 60” will eventually fall into the above two precision ranges (1 / 8 second or 7 second).

4.3 + kernel

  1. tcp_fin_timeout <= 60: the FIN_WAIT_2 timeout is the tcp_fin_timeout value.
  2. tcp_fin_timeout > 60: FIN_WAIT_2 first goes through a keepalive phase lasting tmo = tcp_fin_timeout - 60, and then the timewait phase, also lasting tmo = tcp_fin_timeout - 60.
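
Besides the global sysctl, the FIN_WAIT_2 lifetime can also be overridden per socket with the TCP_LINGER2 socket option (documented in tcp(7) as a per-socket override of net.ipv4.tcp_fin_timeout; this is not covered by the analysis above). A minimal sketch:

/* Minimal sketch: override the FIN_WAIT_2 timeout for a single socket with
 * TCP_LINGER2 (per-socket override of net.ipv4.tcp_fin_timeout, see tcp(7)).
 * Error handling kept minimal. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	int fin_timeout = 10;   /* seconds to stay in FIN_WAIT_2 for this socket */

	if (fd < 0) {
		perror("socket");
		return 1;
	}
	if (setsockopt(fd, IPPROTO_TCP, TCP_LINGER2,
		       &fin_timeout, sizeof(fin_timeout)) < 0)
		perror("setsockopt(TCP_LINGER2)");

	/* ... connect()/send()/close() as usual; after close(), this socket's
	 * FIN_WAIT_2 lifetime follows fin_timeout instead of the sysctl ... */
	close(fd);
	return 0;
}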

References

[1] Analysis of the Linux TCP fin_wait_2 / time_wait states, https://blog.csdn.net/dog250/article/details/81582604

[2] Fast recycling and reuse of TCP TIME_WAIT, https://blog.csdn.net/dog250/article/details/13760985

[3] A study of the tcp_fin_timeout parameter through optimizing the FIN_WAIT_2 state timeout, https://www.talkwithtrend.com/Article/251641

[4] TCP TIME_WAIT, https://www.zhuxiaodong.net/2018/tcp-time-wait-instruction/