Troubleshooting and solution of HTTP timeout in go

Time:2020-12-23

background

Recently, some colleagues have reported that there is a call timeout between services, and the probability and frequency of occurrence are relatively high in the business peak. From the call relationship in the log, there are 2 call chains that often have timeouts.

Problem 1: service a timed out sending a request to service B using HTTP1.1.

Problem 2: the a service timed out sending a request to the C service using a lightweight HTTP SDK (internal http2.0).

When error information is reported by golang:

Post http://host/v1/xxxx: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

The log trace ID is notified to check, and it is found that some requests have already timed out before the server arrives.

Some of them have already arrived at the service side, but they have also timed out.

Here is the first investigation of problem 2, the following is the process.

Investigation

Speculation

The HTTP request timeout set by the caller is 1s.

The request has already timed out

  1. The service provider responded slowly. Through the log check, there are some of them.
  2. It takes 990ms for the client to call, and only 10ms to the server. This will definitely time out.

The reason why the request did not arrive at the server timeout may be:

  1. The golang CPU cannot be dispatched. This possibility is ruled out by CPU monitoring
  2. Reasons for golang network library. Key investigation

Investigation method:

Write a test program locally, and 1000 call the C service of the test environment


n := 1000
var waitGroutp = sync.WaitGroup{}
waitGroutp.Add(n)
for i := 0; i < n; i++ {
 go func(x int) {
  httpSDK.Request()
 }
}
waitGroutp.Wait()

report errors:

Too many open files / / this error is the reason why the local ulimit is too small. Net / http: request cancelled can be ignored( Client.Timeout exceeded while awaiting headers)

If the concurrency number is adjusted to 500 to continue testing, the same error is reported.

connection timed out

If the problem can be reproduced locally, it is generally easier to find out.

Start with the source code of golang. The following is the code for creating httpclient, which is globally reused.

func createHttpClient(host string, tlsArg *TLSConfig) (*http.Client, error) {
 httpClient := &http.Client{
 Timeout: time.Second,
 }
 tlsConfig := &tls.Config{InsecureSkipVerify: true}
 transport := &http.Transport{
 TLSClientConfig: tlsConfig,
 MaxIdleConnsPerHost: 20,
 }
 http2.ConfigureTransport(transport)
 return httpClient, nil
}
//Using httpclient
httpClient.Do(req)

Jump to net / HTTP/ client.go Do method of


func (c *Client) do(req *Request) (retres *Response, reterr error) {
 if resp, didTimeout, err = c.send(req, deadline); err != nil {
 }
}

Continue with the send method. The actual request is sent through the roundtrip function.


func send(ireq *Request, rt RoundTripper, deadline time.Time) (resp *Response, didTimeout func() bool, err error) {
 rt.RoundTrip(req) 
}

The RT parameter received by the send function is an inteface, so the http.Transport Go to the roundtrip function.

amonglog.Println("getConn time", time.Now().Sub(start), x)It is a log added by the author. It takes time to create a connection in order to verify.

var n int
// roundTrip implements a RoundTripper over HTTP.
func (t *Transport) roundTrip(req *Request) (*Response, error) {
 //Check whether http2 is registered. If so, use roundtrip of http2 directly
 if t.useRegisteredProtocol(req) {
 altProto, _ := t.altProto.Load().(map[string]RoundTripper)
 if altRT := altProto[scheme]; altRT != nil {
  resp, err := altRT.RoundTrip(req)
  if err != ErrSkipAltProtocol {
  return resp, err
  }
 }
 }
 for {
 //n++
 // start := time.Now()
 pconn, err := t.getConn(treq, cm)
  // log.Println("getConn time", time.Now().Sub(start), x)
 if err != nil {
  t.setReqCanceler(req, nil)
  req.closeBody()
  return nil, err
 }
 }
}

It’s true that a large number of logs run out of time.

doubt

Here are two questions:

  1. Why does http2 not reuse connections, but creates a large number of connections?
  2. Why is connection creation slower and slower?

Continue with the source code of getconn. The first step of getconn is to obtain the idle connection, because http2 is used here, so you can ignore it.

Append the time-consuming log to confirm that dialconn is time-consuming.


func (t *Transport) getConn(treq *transportRequest, cm connectMethod) (*persistConn, error) {
 if pc, idleSince := t.getIdleConn(cm); pc != nil {
 }
 //n++
 go func(x int) {
 // start := time.Now()
 // defer func(x int) {
 // log.Println("getConn dialConn time", time.Now().Sub(start), x)
 // }(n)
 pc, err := t.dialConn(ctx, cm)
 dialc <- dialRes{pc, err}
 }(n)
}

Continue with the dialconn function. There are two time-consuming parts in it:

Connection established, three handshakes.

TLS handshake time-consuming, see the following http2 chapter dialconn source code.

The log is appended to the position of t.dial and addtls in dialconn function.

As can be seen, the connection of three handshakes is relatively stable, and the later connection takes nearly 1 s in the time of TLS handshake.

2019/10/23 14:51:41 DialTime 39.511194ms https.Handshake 1.059698795s 2019/10/23 14:51:41 DialTime 23.270069ms https.Handshake 1.064738698s 2019/10/23 14:51:41 DialTime 24.854861ms https.Handshake 1.0405369s 2019/10/23 14:51:41 DialTime 31.345886ms https.Handshake 1.076014428s 2019/10/23 14:51:41 DialTime 26.767644ms https.Handshake 1.084155891s 2019/10/23 14:51:41 DialTime 22.176858ms https.Handshake 1.064704515s 2019/10/23 14:51:41 DialTime 26.871087ms https.Handshake 1.084666172s 2019/10/23 14:51:41 DialTime 33.718771ms https.Handshake 1.084348815s 2019/10/23 14:51:41 DialTime 20.648895ms https.Handshake 1.094335678s 2019/10/23 14:51:41 DialTime 24.388066ms https.Handshake 1.084797011s 2019/10/23 14:51:41 DialTime 34.142535ms https.Handshake 1.092597021s 2019/10/23 14:51:41 DialTime 24.737611ms https.Handshake 1.187676462s 2019/10/23 14:51:41 DialTime 24.753335ms https.Handshake 1.161623397s 2019/10/23 14:51:41 DialTime 26.290747ms https.Handshake 1.173780655s 2019/10/23 14:51:41 DialTime 28.865961ms https.Handshake 1.178235202s

Conclusion: the answer to the second question is that the TLS handshake takes time

http2

Why does http2 not reuse connections, but creates a large number of connections?

Created before http.Client In the http2.configuretransport (transport) method, configuretransport is called internally

func configureTransport(t1 *http.Transport) (*Transport, error) {
 //Declare a connection pool
 //Nodialclientconnpool is very important here. It indicates that the connection does not need dial, but is upgraded from http1 connection 
 connPool := new(clientConnPool)
 t2 := &Transport{
 ConnPool: noDialClientConnPool{connPool},
 t1: t1,
 }
 connPool.t = t2
//Register the roundtripp method of http2 with the altproto variable of transport on http1.
//When the roundtrip method of http1 is requested, check whether altproto has registered http2, and if so, use
//The use registered protocol in the previous code is the detection method
 if err := registerHTTPSProtocol(t1, noDialH2RoundTripper{t2}); err != nil  {
 return nil, err
 }
 //After HTTP1.1 is upgraded to http2, the callback function will add the connection to the connection pool of http2 through the addconnifneeded function
 upgradeFn := func(authority string, c *tls.Conn) http.RoundTripper {
 addr := authorityAddr("https", authority)
 if used, err := connPool.addConnIfNeeded(addr, t2, c); err != nil {
  go c.Close()
  return erringRoundTripper{err}
 } else if !used {
  go c.Close()
 }
 return t2
 }
 if m := t1.TLSNextProto; len(m) == 0 {
 t1.TLSNextProto = map[string]func(string, *tls.Conn) http.RoundTripper{
  "h2": upgradeFn,
 }
 } else {
 m["h2"] = upgradeFn
 }
 return t2, nil
}

Tlsnextproto is in http.Transport ->Used in dialconn. Call the upgradefn function, return roundtripper of http2, and assign it to alt.

ALT will be in http.Transport Roundtripper internal check call in.

func (t *Transport) dialConn(ctx context.Context, cm connectMethod) (*persistConn, error) {
 pconn := &persistConn{
 t:  t,
 }
 if cm.scheme() == "https" && t.DialTLS != nil {
 //There is no custom dialtls method, you will not go to this step
 } else {
 conn, err := t.dial(ctx, "tcp", cm.addr())
 if err != nil {
  return nil, wrapErr(err)
 }
 pconn.conn = conn
 if cm.scheme() == "https" {
  //TLS handshake in addtls is also the most time-consuming place to establish a new connection.
  if err = pconn.addTLS(firstTLSHost, trace); err != nil {
  return nil, wrapErr(err)
  }
 }
 }
 if s := pconn.tlsState; s != nil && s.NegotiatedProtocolIsMutual && s.NegotiatedProtocol != "" {
 if next, ok := t.TLSNextProto[s.NegotiatedProtocol]; ok {
  //Next calls the registered upgrade function
  return &persistConn{t: t, cacheKey: pconn.cacheKey, alt: next(cm.targetAddr, pconn.conn.(*tls.Conn))}, nil
 }
 }
 return pconn, nil
}

Conclusion

When there is no connection, if there is a large wave of requests at this time, more than n HTTP1.1 connections will be created for upgrading and handshaking. However, TLS handshake becomes very slow as the number of connections increases.

Resolve timeout

The above conclusion does not fully explain the problem of multiplexing connections. Because when the service is running normally, there are always requests, and the connection will not be disconnected. Therefore, http2 connections should be reused under normal circumstances except for the first connection or network disconnection.

According to the following test, when the connection with http2 is reproduced, n new connections will still be created:

sdk.Request () // request once, establish the connection, and test whether the connection has been reused.
time.Sleep(time.Second)
n := 1000
var waitGroutp = sync.WaitGroup{}
waitGroutp.Add(n)
for i := 0; i < n; i++ {
 go func(x int) {
  sdk.Request()
 }
}
waitGroutp.Wait()

Therefore, it is still suspected that the upgrade of HTTP1.1 has resulted in the use of http2.transport


httpClient.Transport = &http2.Transport{
  TLSClientConfig: tlsConfig,
}

After modification, the test found that there was no error.

To verify the difference between upgrade mode and direct http2 mode. Here, return to the addconnifneeded function in upgrade mode, which will call the run function of addconncall:


func (c *addConnCall) run(t *Transport, key string, tc *tls.Conn) {
 cc, err := t.NewClientConn(tc)
}

The transport of http2 is passed in the run parameter.

The whole explanation is that after HTTP1.1 creates a connection, the transport layer will be connected through addconnifneeded – > Run – > Transport.NewClientConn Form an http2 connection. Because http2 and HTTP1.1 are application layer protocols in essence, the transport layer connections are the same.

Then add a log to the newclientconn connection.


func (t *Transport) newClientConn(c net.Conn, singleUse bool) (*ClientConn, error) {
 // log.Println("http2.newClientConn")
}

Conclusion

In the upgrade mode, a lot of http2.newclientconn will be printed. According to the previous investigation, this makes sense. In the pure http2 mode, new connections will be created, though rarely.

Number of concurrent connections

Under what circumstances will new connections be created in http2 mode?

Here’s how http2 calls newclientconn. Back to clientconn pool, dialonmiss is true in http2 mode, and dial > dialclientconn > newclientconn will be called in getstartdiallocked.


func (p *clientConnPool) getClientConn(req *http.Request, addr string, dialOnMiss bool) (*ClientConn, error) {
 p.mu.Lock()
 for _, cc := range p.conns[addr] {
 if st := cc.idleState(); st.canTakeNewRequest {
  if p.shouldTraceGetConn(st) {
  traceGetConn(req, addr)
  }
  p.mu.Unlock()
  return cc, nil
 }
 }
 if !dialOnMiss {
 p.mu.Unlock()
 return nil, ErrNoCachedConn
 }
 traceGetConn(req, addr)
 call := p.getStartDialLocked(addr)
 p.mu.Unlock()
 }


If there is a connection, if cantakenewrequest is false, a new connection will also be created. Take a look at this variable


func (cc *ClientConn) idleStateLocked() (st clientConnIdleState) {
 if cc.singleUse && cc.nextStreamID > 1 {
 return
 }
 var maxConcurrentOkay bool
 if cc.t.StrictMaxConcurrentStreams {
 maxConcurrentOkay = true
 } else {
 maxConcurrentOkay = int64(len(cc.streams)+1) < int64(cc.maxConcurrentStreams)
 }
 st.canTakeNewRequest = cc.goAway == nil && !cc.closed && !cc.closing && maxConcurrentOkay &&
 int64(cc.nextStreamID)+2*int64(cc.pendingRequests) < math.MaxInt32
 // if st.canTakeNewRequest == false {
 // log.Println("clientConnPool", cc.maxConcurrentStreams, cc.goAway == nil, !cc.closed, !cc.closing, maxConcurrentOkay, int64(cc.nextStreamID)+2*int64(cc.pendingRequests) < math.MaxInt32)
 // }
 st.freshConn = cc.nextStreamID == 1 && st.canTakeNewRequest
 return
}


In order to check the problem, here added a detailed log. After testing, it is found that maxconcurrenctstreams are exceeded, and cantakenewrequest is false.

In the initialization configuration of newclientconn in http2, maxconcurrenctstreams is 1000 by default

maxConcurrentStreams: 1000, // “infinite”, per spec. 1000 seems good enough.

But in practice, it is found that 500 concurrent will create new connections. Continue to trace where this variable is set:


func (rl *clientConnReadLoop) processSettings(f *SettingsFrame) error {
 case SettingMaxConcurrentStreams:
  cc.maxConcurrentStreams = s.Val
  //log.Println("maxConcurrentStreams", s.Val)
}

Run the test and find that it is the configuration from the service, and the value is 250.

Conclusion: the server limits the number of concurrent connections in a single connection, and new connections will be created after exceeding the limit.

Server side restrictions

In the server-side framework, find the listenandservetls function, followed by – > servetls > serve > setup http2_ Serve-

>onceSetNextProtoDefaults_Serve->onceSetNextProtoDefaults->http2ConfigureServer。

The statement of new (http2server) is found. Because the web framework supports both HTTP1.1 and http2, it does not specify any http2 related configuration and uses the default.


// Server is an HTTP/2 server.
type http2Server struct {
 // MaxConcurrentStreams optionally specifies the number of
 // concurrent streams that each client may have open at a
 // time. This is unrelated to the number of http.Handler goroutines
 // which may be active globally, which is MaxHandlers.
 // If zero, MaxConcurrentStreams defaults to at least 100, per
 // the HTTP/2 spec's recommendations.
 MaxConcurrentStreams uint32
} 


From the comments on this field, the http2 standard recommends at least 100, and the default variable http2defaultmaxstreams is used in golang with a value of 250.

truth

The above steps are more to record the troubleshooting process and the key points in the source code, so as to provide a reference for similar problems in the future.

In a nutshell:

  1. The caller and server communicate using HTTP1.1 to http2
  2. The server http2server limits the concurrency of a single connection to 250
  3. When the concurrency exceeds 250, say 1000, the caller creates 750 connections concurrently. The TLS handshake time for these connections will be longer and longer. However, the call timeout is only 1 s, which leads to a large number of timeout.
  4. Some of these connections timed out before they arrived at the server. Some of them arrived, but the service provider could not handle them. The caller canceled the connection, which was also a timeout.
  5. In the case of high concurrency, if the network is disconnected, it will also cause this situation.

retry

The lightweight HTTP SDK used by a service has a retrial mechanism. When a temporary error is detected, it will try again twice.

Temporary() bool / / is the error temporary? This timeout error is a temporary error, which enlarges this situation.

terms of settlement

Http2 is not in upgrade mode.


httpClient.Transport = &http2.Transport{
  TLSClientConfig: tlsConfig,
}

Why doesn’t http2 create a lot of connections?

This is because http2 will lock when creating a new connection. After the subsequent request is unlocked, if it is found that the number of concurrent connections does not exceed, the connection can be reused directly. So there is no such situation. The lock is in the clientConnPool.getStartDialLocked Source code.

Question 1

Problem 1: service a timed out sending a request to service B using HTTP1.1.

The reason for question 1 and question 2 is the same, that is, in the case of high concurrency, a large number of connections will be created, and the connection creation will become more and more slow, thus timeout.

There is no good way to solve this problem. Http2 is recommended.

If you can’t use http2, increase the maxidleconnsperhost parameter to alleviate this situation. By default, HTTP1.1 only keeps two idle connections for each host. If 1000 concurrent connections are created, 998 new connections will be created.

The amount of adjustment can be adjusted according to the system situation, such as 50100.

summary

The above is the troubleshooting and solution of HTTP timeout in go introduced by Xiaobian. I hope it will be helpful to you. If you have any questions, please leave me a message, and the editor will reply to you in time. Thank you very much for your support to the developeppaer website! If you think this article is helpful to you, welcome to reprint, please indicate the source, thank you!

Recommended Today

Use of Android WebView (super detailed usage)

1.1 overview of WebView Android WebView is a special view on the Android platform. It can be used to display web pages. This WebView class can be used to display only one online web page in the app. Of course, it can also be used to develop browsers. The internal implementation of WebView uses WebKit […]