Troubleshooting unexpected 502 errors when nginx is configured with proxy_next_upstream

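When nginx proxies several gateway instances
and a GET endpoint of the requested service fails in one of the ways proxy_next_upstream can match (error, timeout, invalid_header, http_500, http_502, http_503, http_504),
nginx may respond with a 502 status code.

As I previously understood it, nginx only forwards the backend's response and generally does not modify the status code.

The nginx configuration is as follows: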

worker_processes  1;
daemon off;
master_process off; 
error_log  logs/error.log  debug; 
events {
    worker_connections  1024;
}
http {
    include       mime.types;
    default_type  application/octet-stream;
     log_format apm '[$time_local]\tclient=$remote_addr\t'
               'upstream_addr=$upstream_addr\t'
               'upstream_status=$upstream_status\t'
               'document_root="$document_root"\t'
               'fastcgi_script_name="$fastcgi_script_name"\t'
               'request_filename="$request_filename"\t'
               'request_time=$request_time\t'
               'upstream_response_time=$upstream_response_time\t'
               'upstream_connect_time=$upstream_connect_time\t'
               'upstream_header_time=$upstream_header_time\t';
    access_log  logs/access.log  apm;
    sendfile        on; 
    keepalive_timeout  65;
    upstream gateway {
        server 192.168.2.102:12012;
        server 192.168.2.102:12011;
    }
    server {
        listen       80;
        server_name  localhost; 
        location / {
            root   html;
            index  index.html index.htm;
        }
        location /api/ {
            proxy_pass http://gateway/;
            proxy_next_upstream error http_503 http_502;
        } 
        error_page   500 502 503 504  /50x.html;
        location = /50x.html {
            root   html;
        } 
    }
}

Sample test code:

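    // Test endpoint: waits 200 ms, then always answers 503 Service Unavailable.
    @GetMapping("/excep503")
    public ResponseEntity<String> excep503(HttpServletRequest request, Integer times) throws InterruptedException {
        Thread.sleep(200);
        // The body text means "service unavailable".
        return ResponseEntity.status(HttpStatus.SERVICE_UNAVAILABLE).body("服务不可用");
    }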

Test method:

Send repeated GET requests to an endpoint that always fails.
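
The test step can be scripted. The following is a minimal Python sketch, not the tooling used for this investigation: it sends repeated GET requests and tallies the status codes it gets back. The URL http://localhost/api/excep503 is an assumption derived from the configuration above (listen 80, location /api/ proxied to the gateway's /excep503), and the function names are made up for illustration.

```python
from collections import Counter
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request, including 4xx/5xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # urllib raises for error statuses; the code is still what we want.
        return exc.code


def tally_statuses(url: str, count: int, fetch=fetch_status) -> Counter:
    """Send `count` GET requests and count how often each status occurs."""
    return Counter(fetch(url) for _ in range(count))


# Example call (assumed URL, requires the nginx setup above to be running):
# print(tally_statuses("http://localhost/api/excep503", 20))
```

With the setup described here, the resulting counter would be expected to contain both 502 and 503.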

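Symptom:

The response is sometimes 502 and sometimes 503.

When 503 is returned

upstream_addr in access_log contains two entries: 192.168.2.102:12012, 192.168.2.102:12011
error_log shows requests to both gateways in turn:
First, connect to 192.168.2.102:12011;
102:12011 returns 503 Service Unavailable
and the following error is logged: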

upstream server temporarily disabled while reading response header from upstream

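nginx then moves on to the other server: connect to 192.168.2.102:12012
102:12012 also returns 503 Service Unavailable

When 502 is returned

upstream_addr in access_log contains only one entry: upstream_addr=192.168.2.102:12011

error_log shows only one request to a gateway:
connect to 192.168.2.102:12011;
102:12011 returns 503 Service Unavailable
and the following errors are logged: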

upstream server temporarily disabled while reading response header from upstream,
no live upstreams while connecting to upstream,

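Why 502 is returned

According to the material I consulted, the ft_type passed in is 40000000. It matches none of the explicit cases and falls through to default, so the final status code is NGX_HTTP_BAD_GATEWAY, i.e. 502.

nginx-1.24.0\src\http\ngx_http_upstream.c (ngx_http_upstream_next), line 4370: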

switch (ft_type) {

    case NGX_HTTP_UPSTREAM_FT_TIMEOUT:
    case NGX_HTTP_UPSTREAM_FT_HTTP_504:
        status = NGX_HTTP_GATEWAY_TIME_OUT;
        break;

    case NGX_HTTP_UPSTREAM_FT_HTTP_500:
        status = NGX_HTTP_INTERNAL_SERVER_ERROR;
        break;

    case NGX_HTTP_UPSTREAM_FT_HTTP_503:
        status = NGX_HTTP_SERVICE_UNAVAILABLE;
        break;

    /*
     * NGX_HTTP_UPSTREAM_FT_BUSY_LOCK and NGX_HTTP_UPSTREAM_FT_MAX_WAITING
     * never reach here
     */

    default:
        status = NGX_HTTP_BAD_GATEWAY;
    }
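
To make the fall-through concrete, here is a small Python model of the switch above. It is a sketch rather than nginx code: the function name is hypothetical, and the numeric flag values are assumed to match ngx_http_upstream.h in nginx 1.24.0 (with 0x40000000 being NGX_HTTP_UPSTREAM_FT_NOLIVE), so check them against your own source tree.

```python
# Failure-type flags; values assumed to match ngx_http_upstream.h (nginx 1.24.0).
FT_TIMEOUT = 0x00000004
FT_HTTP_500 = 0x00000010
FT_HTTP_503 = 0x00000040
FT_HTTP_504 = 0x00000080
FT_NOLIVE = 0x40000000  # assumed: the "no live upstreams" failure type


def status_for_failure(ft_type: int) -> int:
    """Mirror the switch in ngx_http_upstream_next (only the cases quoted above)."""
    if ft_type in (FT_TIMEOUT, FT_HTTP_504):
        return 504
    if ft_type == FT_HTTP_500:
        return 500
    if ft_type == FT_HTTP_503:
        return 503
    return 502  # default branch: NGX_HTTP_BAD_GATEWAY


print(status_for_failure(FT_HTTP_503))  # 503
print(status_for_failure(FT_NOLIVE))    # 502
```

Under these assumptions, a 503 from the backend keeps its status, while the 40000000 failure type has no case of its own and ends up as 502.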

Where the 502 and 503 paths diverge:

nginx-1.24.0\src\http\ngx_http_upstream_round_robin.c (ngx_http_upstream_get_round_robin_peer), line 449:

    peers = rrp->peers;
    ngx_http_upstream_rr_peers_wlock(peers);

    if (peers->single) {
        peer = peers->peer;

        if (peer->down) {
            goto failed;
        }

        if (peer->max_conns && peer->conns >= peer->max_conns) {
            goto failed;
        }

        rrp->current = peer;

    } else {

        peer = ngx_http_upstream_get_peer(rrp);

        if (peer == NULL) {
            goto failed;
        }

        ngx_log_debug2(NGX_LOG_DEBUG_HTTP, pc->log, 0,
                       "get rr peer, current: %p %i",
                       peer, peer->current_weight);
    }

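The single flag marks an upstream group that contains only one server. The group configured here has two servers, so the else branch runs, and the code jumps to failed when ngx_http_upstream_get_peer returns NULL.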

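So the question now is:

Why does upstream_addr sometimes contain two addresses and sometimes only one?

Debugging the nginx source

At startup nginx assigns every backend node a default fail_timeout of 10s.

When a failure occurs, the node is marked unavailable, and the selection loop skips it until that timeout has passed:

nginx-1.24.0/src/http/ngx_http_upstream_round_robin.c (ngx_http_upstream_get_peer), line 522: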

    for (peer = rrp->peers->peer, i = 0;
         peer;
         peer = peer->next, i++)
    {
        n = i / (8 * sizeof(uintptr_t));
        m = (uintptr_t) 1 << i % (8 * sizeof(uintptr_t));

        if (rrp->tried[n] & m) {
            continue;
        }

        if (peer->down) {
            continue;
        }

        if (peer->max_fails
            && peer->fails >= peer->max_fails
            && now - peer->checked <= peer->fail_timeout)
        {
            continue;
        }

        if (peer->max_conns && peer->conns >= peer->max_conns) {
            continue;
        }

        peer->current_weight += peer->effective_weight;
        total += peer->effective_weight;

        if (peer->effective_weight < peer->weight) {
            peer->effective_weight++;
        }

        if (best == NULL || peer->current_weight > best->current_weight) {
            best = peer;
            p = i;
        }
    }
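
The effect of the max_fails / fail_timeout check can be reproduced with a small Python model. This is a simplified sketch under stated assumptions, not nginx's algorithm: weights are ignored and the first eligible peer is picked, max_fails is assumed to be the default of 1, every backend always answers 503 after 200 ms (as in the test endpoint), times are compared in whole seconds, and a failed attempt is assumed to increment fails and set checked to the current second. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    name: str
    max_fails: int = 1      # assumed nginx default for `server`
    fail_timeout: int = 10  # seconds, the default mentioned above
    fails: int = 0
    checked: int = 0        # second of the last recorded failure


def is_skipped(peer: Peer, now: int) -> bool:
    """Same condition as the max_fails / fail_timeout test in the loop above."""
    return (peer.max_fails > 0
            and peer.fails >= peer.max_fails
            and now - peer.checked <= peer.fail_timeout)


def handle_request(peers, start_ms, latency_ms=200):
    """Simulate one client request while every backend answers 503.

    Returns (status sent to the client, list of peers actually contacted).
    """
    clock_ms = start_ms
    contacted = []
    for _ in range(len(peers)):          # at most one try per peer
        now = clock_ms // 1000           # compare in whole seconds
        peer = next((p for p in peers
                     if p.name not in contacted and not is_skipped(p, now)),
                    None)
        if peer is None:
            return 502, contacted        # no live upstreams: default branch
        contacted.append(peer.name)
        clock_ms += latency_ms           # backend sleeps, then returns 503
        peer.fails += 1                  # http_503 counts as a failed attempt
        peer.checked = clock_ms // 1000
    return 503, contacted                # every peer was tried, last status wins


peers = [Peer("192.168.2.102:12011"), Peer("192.168.2.102:12012")]
for start_ms in (700, 5000, 11000):
    print(start_ms, handle_request(peers, start_ms))
```

In this model the request at 700 ms tries both peers and returns 503. The request at 5000 ms finds both peers still disabled and returns 502 without contacting either. The request at 11000 ms contacts only 12011, whose timeout has just expired, then runs out of live peers and returns 502, which corresponds to the single-address 502 case described earlier.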