HAProxy Marks One Backend Server as DOWN for Response Time Over 2000ms

I have configured my haproxy.cfg as follows:

global
    chroot      /var/run/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     6000
    user        haproxy-user
    group       haproxy-group
    daemon
    stats socket /var/run/haproxy-admin.sock

defaults
    log         global
    option      httplog
    option      dontlognull
    retries     2
    timeout connect 4000ms
    timeout client  4000ms
    timeout server  4000ms

backend app_servers
    balance roundrobin
    option tcplog
    option tcp-check
    option httpchk GET /check-status
    server app1 10.10.10.10:8080 check
    server app2 20.20.20.20:8080 check

Logs that I received:

Server app2 is DOWN, Layer7 timeout, check duration: 2000ms.
Server app2 is UP, Layer7 check passed, code: 200, duration: 150ms.

When the host is marked DOWN, the response is a 504 error:

20.20.20.20 504 POST /service/request
10.10.10.10 200 POST /service/request

My question is, despite setting a timeout of 4000ms, why does the error appear when the response time of the backend server exceeds 2000ms? Can the timeout be adjusted to prevent this error?

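The 2000ms limit comes from the health check, not from ‘timeout server’, which only applies to regular traffic. When no ‘timeout check’ is set, HAProxy bounds the whole health check (connect plus response) by the check interval ‘inter’, which defaults to 2000ms. That is why app2 is reported as "Layer7 timeout" at exactly 2000ms even though your other timeouts are 4000ms. The 504 responses are a separate symptom of the same slowness: HAProxy returns 504 when a server does not answer a request within ‘timeout server’. To give the health check more time, add a specific ‘timeout check’ directive to your backend configuration.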

Try adding this line to your backend section:

timeout check 4000ms

This will align the health check timeout with your other timeouts. Also, ensure your /check-status endpoint responds quickly. If it’s consistently slow, you might need to optimize it or consider using a different health check method.
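
As a rough sketch, the backend from the question could look like this with the check timeout made explicit. The ‘inter 2s’ on each server line is only HAProxy's default check interval written out for visibility, and 4000ms is the value from the question, not a recommendation:

```
backend app_servers
    balance roundrobin
    option tcplog
    option tcp-check
    option httpchk GET /check-status
    # Read timeout for health checks once the connection is established.
    # Without this line, the whole check is bounded by "inter".
    timeout check 4000ms
    # "inter 2s" is the default check interval, shown here explicitly.
    server app1 10.10.10.10:8080 check inter 2s
    server app2 20.20.20.20:8080 check inter 2s
```

With this in place, the check may still take up to 2s to connect, and then has up to 4s to receive the response from /check-status.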

Remember, while increasing timeouts can prevent false negatives, it’s crucial to balance this with maintaining responsiveness for your users. Monitor your backend performance closely to ensure it’s meeting your service level objectives.
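
If the concern is a single slow check flapping the server, the ‘fall’ and ‘rise’ parameters are another knob besides the timeout. The sketch below just spells out the defaults (3 consecutive failures to go DOWN, 2 consecutive successes to come back UP) via ‘default-server’; the numbers are a starting point to tune, not values taken from the question:

```
backend app_servers
    balance roundrobin
    option httpchk GET /check-status
    timeout check 4000ms
    # Applies to every server line below in this backend:
    # check every 2s, mark DOWN after 3 failed checks, UP after 2 passed.
    default-server inter 2s fall 3 rise 2
    server app1 10.10.10.10:8080 check
    server app2 20.20.20.20:8080 check
```

A higher ‘fall’ tolerates occasional slow checks at the cost of detecting a real outage later.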

hey there! have u tried adding ‘timeout check 4000ms’ to ur backend config? that might help with the health check timing out too fast. also, double-check ur backend server’s performance - sometimes slow responses can trigger these issues. hope this helps! lemme know if u need more info

hm interesting setup! have u tried adjusting the ‘timeout check’ setting specifically? sometimes that can be different from the other timeouts. also, what’s ur server’s typical response time? the 2000ms might just be the default check interval (‘inter’), which caps the check when no ‘timeout check’ is set. just curious, have u noticed any patterns in when it happens?