Home > Article > Backend Development > i/o timeout, I hope you don’t step on the trap of this net/http package
Let’s look at a piece of daily code. What
1package main 2 3import ( 4 "bytes" 5 "encoding/json" 6 "fmt" 7 "io/ioutil" 8 "net" 9 "net/http" 10 "time" 11) 12 13var tr *http.Transport 14 15func init() { 16 tr = &http.Transport{ 17 MaxIdleConns: 100, 18 Dial: func(netw, addr string) (net.Conn, error) { 19 conn, err := net.DialTimeout(netw, addr, time.Second*2) //设置建立连接超时 20 if err != nil { 21 return nil, err 22 } 23 err = conn.SetDeadline(time.Now().Add(time.Second * 3)) //设置发送接受数据超时 24 if err != nil { 25 return nil, err 26 } 27 return conn, nil 28 }, 29 } 30} 31 32func main() { 33 for { 34 _, err := Get("http://www.baidu.com/") 35 if err != nil { 36 fmt.Println(err) 37 break 38 } 39 } 40} 41 42 43func Get(url string) ([]byte, error) { 44 m := make(map[string]interface{}) 45 data, err := json.Marshal(m) 46 if err != nil { 47 return nil, err 48 } 49 body := bytes.NewReader(data) 50 req, _ := http.NewRequest("Get", url, body) 51 req.Header.Add("content-type", "application/json") 52 53 client := &http.Client{ 54 Transport: tr, 55 } 56 res, err := client.Do(req) 57 if res != nil { 58 defer res.Body.Close() 59 } 60 if err != nil { 61 return nil, err 62 } 63 resBody, err := ioutil.ReadAll(res.Body) 64 if err != nil { 65 return nil, err 66 } 67 return resBody, nil 68}
does is relatively simple, which is to loop to request http://www.baidu.com/
and then wait for the response.
It seems like there is no problem.
When the code runs, it can indeed send and receive messages normally.
But if this code runs for a period of time, an i/o timeout
error will appear.
This is actually a problem that has been investigated recently. I found that this pitfall may be easier to step on, so I have simplified the code here.
The phenomenon that occurs in actual production is that when the golang
service initiates a http
call, although http.Transport
If the 3s
timeout is set, an i/o timeout
error will occasionally appear.
But when I checked the downstream service, I found that the downstream service had actually returned in 100ms
.
It’s very strange, the server clearly displays The processing time is only 100ms
, and the client timeout is set to 3s
, so why does a timeout error i/o timeout
appear?
There are two possibilities here.
Because the logs printed by the server are actually only logs printed by the server application layer. But after the client application layer sends the data, it also passes through the client's transport layer, network layer, data link layer and physical layer , and then the server's physical layer, data link layer, network Layer, transport layer to application layer on the server side. It takes 100ms at the server application layer and then returns to the original path. The remaining 3s-100ms
may be wasted on various layers in the entire process. For example, when the network is not good, the transport layer TCP vigorously loses packets and retransmits them.
There is no problem with the network. The entire sending and receiving process from the client to the server probably takes about 100ms
. A timeout occurs due to client processing logic issues.
Generally when you encounter problems, in most cases it is not a problem with the underlying network. If you boldly suspect that it is your own problem, you are right. , if you don’t want to give up, grab a bag and take a look.
After analysis, from the beginning of the three handshakes (the place where the red box is drawn).
到最后出现超时报错 i/o timeout
(画了蓝框的地方)。
从time
那一列从7
到10
,确实间隔3s
。而且看右下角的蓝框,是51169
端口发到80
端口的一次Reset
连接。
80
端口是服务端的端口。换句话说就是客户端3s
超时主动断开链接的。
但是再仔细看下第一行三次握手到最后客户端超时主动断开连接的中间,其实有非常多次HTTP请求。
回去看代码设置超时的方式。
1 tr = &http.Transport{ 2 MaxIdleConns: 100, 3 Dial: func(netw, addr string) (net.Conn, error) { 4 conn, err := net.DialTimeout(netw, addr, time.Second*2) //设置建立连接超时 5 if err != nil { 6 return nil, err 7 } 8 err = conn.SetDeadline(time.Now().Add(time.Second * 3)) //设置发送接受数据超时 9 if err != nil { 10 return nil, err 11 } 12 return conn, nil 13 }, 14 }
也就是说,这里的3s
超时,其实是在建立连接之后开始算的,而不是单次调用开始算的超时。
看注释里写的是
SetDeadline sets the read and write deadlines associated with the connection.
Everyone knows that HTTP
is an application layer protocol, and the transport layer uses the TCP
protocol.
HTTP protocol has used short connection
by default since 1.0
, and a TCP connection will be established every time a request is made. Send and receive data. Then disconnect.
Every TCP connection is a three-way handshake. Each disconnect requires four waves.
In fact, there is no need to establish a new connection every time. As long as the established connection is not disconnected, it will be reused every time the data is sent.
Since then, the HTTP protocol has used long connections
by default since 1.1
. For specific information, please see this previous article.
Then golang standard library
is also compatible with this implementation.
By establishing a connection pool, establish a TCP long connection for each domain name
, such as http://baidu.com
and http:// golang.com
are two different domain names.
When you visit http://baidu.com
for the first time, a connection will be established. After it is used up, it will be placed in the idle connection pool. Next time you want to visit http ://baidu.com
will retrieve this connection from the connection pool reuse
.
Insert a digression: This also explains the last statement in this article Question, why should we emphasize that it is the same domain name: a domain name will establish a connection, and one connection corresponds to a reading goroutine and a writing goroutine. Because it is the same domain name,
3
goroutines will be leaked in the end. If they are different domain names,1 2*N
goroutines will be leaked,N
is the number of domain names.
Assuming that the first request takes 100ms
, after each request http://baidu.com
Put them all into the connection pool, and continue to reuse them next time, repeating 29
times, taking 2900ms
.
At the time of the 30
request, the connection had already taken 3000ms
from the time it was established to the time the service returned, just reaching the set 3s Timeout threshold, then the client will report a timeouti/o timeout
.
Although the server actually took 100ms
at this time, it couldn't bear the previous 29 times
and the total time was already very long.
也就是说只要通过 http.Transport
设置了 err = conn.SetDeadline(time.Now().Add(time.Second * 3))
,并且你用了长连接,哪怕服务端处理再快,客户端设置的超时再长,总有一刻,你的程序会报超时错误。
原本预期是给每次调用设置一个超时,而不是给整个连接设置超时。
另外,上面出现问题的原因是给长连接设置了超时,且长连接会复用。
基于这两点,改一下代码。
1package main 2 3import ( 4 "bytes" 5 "encoding/json" 6 "fmt" 7 "io/ioutil" 8 "net/http" 9 "time" 10) 11 12var tr *http.Transport 13 14func init() { 15 tr = &http.Transport{ 16 MaxIdleConns: 100, 17 // 下面的代码被干掉了 18 //Dial: func(netw, addr string) (net.Conn, error) { 19 // conn, err := net.DialTimeout(netw, addr, time.Second*2) //设置建立连接超时 20 // if err != nil { 21 // return nil, err 22 // } 23 // err = conn.SetDeadline(time.Now().Add(time.Second * 3)) //设置发送接受数据超时 24 // if err != nil { 25 // return nil, err 26 // } 27 // return conn, nil 28 //}, 29 } 30} 31 32 33func Get(url string) ([]byte, error) { 34 m := make(map[string]interface{}) 35 data, err := json.Marshal(m) 36 if err != nil { 37 return nil, err 38 } 39 body := bytes.NewReader(data) 40 req, _ := http.NewRequest("Get", url, body) 41 req.Header.Add("content-type", "application/json") 42 43 client := &http.Client{ 44 Transport: tr, 45 Timeout: 3*time.Second, // 超时加在这里,是每次调用的超时 46 } 47 res, err := client.Do(req) 48 if res != nil { 49 defer res.Body.Close() 50 } 51 if err != nil { 52 return nil, err 53 } 54 resBody, err := ioutil.ReadAll(res.Body) 55 if err != nil { 56 return nil, err 57 } 58 return resBody, nil 59} 60 61func main() { 62 for { 63 _, err := Get("http://www.baidu.com/") 64 if err != nil { 65 fmt.Println(err) 66 break 67 } 68 } 69}
看注释会发现,改动的点有两个
http.Transport
里的建立连接时的一些超时设置干掉了。
在发起http请求的时候会场景http.Client
,此时加入超时设置,这里的超时就可以理解为单次请求的超时了。同样可以看下注释
Timeout specifies a time limit for requests made by this Client.
到这里,代码就改好了,实际生产中问题也就解决了。
实例代码里,如果拿去跑的话,其实还会下面的错
1Get http://www.baidu.com/: EOF
这个是因为调用得太猛了,http://www.baidu.com
那边主动断开的连接,可以理解为一个限流措施,目的是为了保护服务器,毕竟每个人都像这么搞,服务器是会炸的。。。
解决方案很简单,每次HTTP调用中间加个sleep
间隔时间就好。
到这里,其实问题已经解决了,下面会在源码层面分析出现问题的原因。对读源码不感兴趣的朋友们可以不用接着往下看,直接拉到文章底部右下角,做点正能量的事情(点两下)支持一下。(疯狂暗示,拜托拜托,这对我真的很重要!)
用的go版本是1.12.7。
从发起一个网络请求开始跟。
1res, err := client.Do(req) 2func (c *Client) Do(req *Request) (*Response, error) { 3 return c.do(req) 4} 5 6func (c *Client) do(req *Request) { 7 // ... 8 if resp, didTimeout, err = c.send(req, deadline); err != nil { 9 // ... 10 } 11 // ... 12} 13func send(ireq *Request, rt RoundTripper, deadline time.Time) { 14 // ... 15 resp, err = rt.RoundTrip(req) 16 // ... 17} 18 19// 从这里进入 RoundTrip 逻辑 20/src/net/http/roundtrip.go: 16 21func (t *Transport) RoundTrip(req *Request) (*Response, error) { 22 return t.roundTrip(req) 23} 24 25func (t *Transport) roundTrip(req *Request) (*Response, error) { 26 // 尝试去获取一个空闲连接,用于发起 http 连接 27 pconn, err := t.getConn(treq, cm) 28 // ... 29} 30 31// 重点关注这个函数,返回是一个长连接 32func (t *Transport) getConn(treq *transportRequest, cm connectMethod) (*persistConn, error) { 33 // 省略了大量逻辑,只关注下面两点 34 // 有空闲连接就返回 35 pc := <-t.getIdleConnCh(cm) 36 37 // 没有创建连接 38 pc, err := t.dialConn(ctx, cm) 39 40}
这里上面很多代码,其实只是为了展示这部分代码是怎么跟踪下来的,方便大家去看源码的时候去跟一下。
最后一个上面的代码里有个 getConn
方法。在发起网络请求的时候,会先取一个网络连接,取连接有两个来源。
如果有空闲连接,就拿空闲连接
1/src/net/http/tansport.go:810 2func (t *Transport) getIdleConnCh(cm connectMethod) chan *persistConn { 3 // 返回放空闲连接的chan 4 ch, ok := t.idleConnCh[key] 5 // ... 6 return ch 7}
没有空闲连接,就创建长连接。
1/src/net/http/tansport.go:1357 2func (t *Transport) dialConn() { 3 //... 4 conn, err := t.dial(ctx, "tcp", cm.addr()) 5 // ... 6 go pconn.readLoop() 7 go pconn.writeLoop() 8 // ... 9}
当第一次发起一个http请求时,这时候肯定没有空闲连接,会建立一个新连接。同时会创建一个读goroutine和一个写goroutine。
注意上面代码里的t.dial(ctx, "tcp", cm.addr())
,如果像文章开头那样设置了 http.Transport
的
1Dial: func(netw, addr string) (net.Conn, error) { 2 conn, err := net.DialTimeout(netw, addr, time.Second*2) //设置建立连接超时 3 if err != nil { 4 return nil, err 5 } 6 err = conn.SetDeadline(time.Now().Add(time.Second * 3)) //设置发送接受数据超时 7 if err != nil { 8 return nil, err 9 } 10 return conn, nil 11},
那么这里就会在下面的dial
里被执行到
1func (t *Transport) dial(ctx context.Context, network, addr string) (net.Conn, error) { 2 // ... 3 c, err := t.Dial(network, addr) 4 // ... 5}
这里面调用的设置超时,会执行到
1/src/net/net.go 2func (c *conn) SetDeadline(t time.Time) error { 3 //... 4 c.fd.SetDeadline(t) 5 //... 6} 7 8//... 9 10func setDeadlineImpl(fd *FD, t time.Time, mode int) error { 11 // ... 12 runtime_pollSetDeadline(fd.pd.runtimeCtx, d, mode) 13 return nil 14} 15 16 17//go:linkname poll_runtime_pollSetDeadline internal/poll.runtime_pollSetDeadline 18func poll_runtime_pollSetDeadline(pd *pollDesc, d int64, mode int) { 19 // ... 20 // 设置一个定时器事件 21 rtf = netpollDeadline 22 // 并将事件注册到定时器里 23 modtimer(&pd.rt, pd.rd, 0, rtf, pd, pd.rseq) 24}
上面的源码,简单来说就是,当第一次调用请求的,会建立个连接,这时候还会注册一个定时器事件,假设时间设了3s
,那么这个事件会在3s
后发生,然后执行注册事件的逻辑。而这个注册事件就是netpollDeadline
。 注意这个netpollDeadline
,待会会提到。
设置了超时事件,且超时事件是3s后之后,发生。再次期间正常收发数据。一切如常。
直到3s
过后,这时候看读goroutine
,会等待网络数据返回。
1/src/net/http/tansport.go:1642 2func (pc *persistConn) readLoop() { 3 //... 4 for alive { 5 _, err := pc.br.Peek(1) // 阻塞读取服务端返回的数据 6 //... 7}
然后就是一直跟代码。
1src/bufio/bufio.go: 129 2func (b *Reader) Peek(n int) ([]byte, error) { 3 // ... 4 b.fill() 5 // ... 6} 7 8func (b *Reader) fill() { 9 // ... 10 n, err := b.rd.Read(b.buf[b.w:]) 11 // ... 12} 13 14/src/net/http/transport.go: 1517 15func (pc *persistConn) Read(p []byte) (n int, err error) { 16 // ... 17 n, err = pc.conn.Read(p) 18 // ... 19} 20 21// /src/net/net.go: 173 22func (c *conn) Read(b []byte) (int, error) { 23 // ... 24 n, err := c.fd.Read(b) 25 // ... 26} 27 28func (fd *netFD) Read(p []byte) (n int, err error) { 29 n, err = fd.pfd.Read(p) 30 // ... 31} 32 33/src/internal/poll/fd_unix.go: 34func (fd *FD) Read(p []byte) (int, error) { 35 //... 36 if err = fd.pd.waitRead(fd.isFile); err == nil { 37 continue 38 } 39 // ... 40} 41 42func (pd *pollDesc) waitRead(isFile bool) error { 43 return pd.wait('r', isFile) 44} 45 46func (pd *pollDesc) wait(mode int, isFile bool) error { 47 // ... 48 res := runtime_pollWait(pd.runtimeCtx, mode) 49 return convertErr(res, isFile) 50}
直到跟到 runtime_pollWait,这个可以简单认为是等待服务端数据返回。
1//go:linkname poll_runtime_pollWait internal/poll.runtime_pollWait 2func poll_runtime_pollWait(pd *pollDesc, mode int) int { 3 4 // 1.如果网络正常返回数据就跳出 5 for !netpollblock(pd, int32(mode), false) { 6 // 2.如果有出错情况也跳出 7 err = netpollcheckerr(pd, int32(mode)) 8 if err != 0 { 9 return err 10 } 11 } 12 return 0 13}
整条链路跟下来,就是会一直等待数据,等待的结果只有两个
有可以读的数据
出现报错
这里面的报错,又有那么两种
连接关闭
超时
1func netpollcheckerr(pd *pollDesc, mode int32) int { 2 if pd.closing { 3 return 1 // errClosing 4 } 5 if (mode == 'r' && pd.rd < 0) || (mode == 'w' && pd.wd < 0) { 6 return 2 // errTimeout 7 } 8 return 0 9}
其中提到的超时
,就是指这里面返回的数字2
,会通过下面的函数,转化为 ErrTimeout
, 而 ErrTimeout.Error()
其实就是 i/o timeout
。
1func convertErr(res int, isFile bool) error { 2 switch res { 3 case 0: 4 return nil 5 case 1: 6 return errClosing(isFile) 7 case 2: 8 return ErrTimeout // ErrTimeout.Error() 就是 "i/o timeout" 9 } 10 println("unreachable: ", res) 11 panic("unreachable") 12}
那么问题来了。上面返回的超时错误,也就是返回2的时候的条件是怎么满足的?
1 if (mode == 'r' && pd.rd < 0) || (mode == 'w' && pd.wd < 0) { 2 return 2 // errTimeout 3 }
还记得刚刚提到的 netpollDeadline
吗?
这里面放了定时器3s
到点时执行的逻辑。
1func timerproc(tb *timersBucket) { 2 // 计时器到设定时间点了,触发之前注册函数 3 f(arg, seq) // 之前注册的是 netpollDeadline 4} 5 6func netpollDeadline(arg interface{}, seq uintptr) { 7 netpolldeadlineimpl(arg.(*pollDesc), seq, true, true) 8} 9 10/src/runtime/netpoll.go: 428 11func netpolldeadlineimpl(pd *pollDesc, seq uintptr, read, write bool) { 12 //... 13 if read { 14 pd.rd = -1 15 rg = netpollunblock(pd, 'r', false) 16 } 17 //... 18}
这里会设置pd.rd=-1
,是指 poller descriptor.read deadline
,含义网络轮询器文件描述符的读超时时间, 我们知道在linux里万物皆文件,这里的文件其实是指这次网络通讯中使用到的socket。
这时候再回去看发生超时的条件就是if (mode == 'r' && pd.rd < 0)
。
至此。我们的代码里就收到了 io timeout
的报错。
Do not set a timeout in http.Transport
, that is the timeout of the connection, not the timeout of the request. Otherwise, an inexplicable io timeout
error may occur.
The request timeout is set in the creation client
.
The above is the detailed content of i/o timeout, I hope you don’t step on the trap of this net/http package. For more information, please follow other related articles on the PHP Chinese website!