Here’s how I’ve seen DB connection checks in a load balancer healthcheck become a bad idea (a sketch of the anti-pattern follows the list):
- A healthcheck timeout causes server replacement, leading to hard downtime: every instance gets pulled out of rotation at once, and new ones fail to start, because DB latency was temporarily greater than the healthcheck timeout (3 seconds for us currently!)
- Relying on a healthcheck to detect DB health takes time: unhealthy threshold * interval before any action is taken (120s for us, compared to crashing at start time, which surfaces the problem right away)
- Healthcheck-induced DB query floods amplify a DB issue: as more instances are spun up, they all try to open new connections and execute queries at once. The fix is usually hopping onto the DB server and killing connections, not spinning up more server instances, which can make things worse.
- A DB restart causes healthcheck failures and cycles the server instances, taking healthcheck threshold + Fargate provisioning time to fully recover
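
For concreteness, here’s a minimal Go sketch of the anti-pattern: the load balancer healthcheck pings the database on every probe. The endpoint path, DSN, timeout, and Postgres driver are illustrative assumptions, not details from my actual setup.

```go
// Anti-pattern sketch: the LB healthcheck touches the database on every probe,
// so a DB latency spike or restart marks the whole fleet unhealthy at once.
package main

import (
	"context"
	"database/sql"
	"net/http"
	"time"

	_ "github.com/lib/pq" // hypothetical driver choice
)

func main() {
	// sql.Open only validates the DSN; connections are established lazily.
	db, err := sql.Open("postgres", "postgres://app:app@db:5432/app?sslmode=disable")
	if err != nil {
		panic(err)
	}

	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		// Every LB probe hits the database. If DB latency exceeds the probe
		// timeout (3s in my case), every instance fails the check at the same
		// time and gets pulled from rotation together.
		ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second)
		defer cancel()
		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "db unreachable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	http.ListenAndServe(":8080", nil)
}
```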
What to do instead?
Monitor database health separately. Identify the failure modes a load balancer can solve by pulling unhealthy server instances out of rotation (unrecoverable crash, hardware failure) and the ones it can’t (database cluster issues, latency), then tune healthchecks accordingly to speed up recovery time.
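
Here’s a minimal sketch of that separation, with assumed endpoint names and intervals: the LB healthcheck only reports process liveness, while a background goroutine checks the DB on its own schedule and exposes the result for monitoring and alerting instead of failing the LB probe.

```go
// Sketch: shallow LB healthcheck + separate DB health monitoring.
package main

import (
	"context"
	"database/sql"
	"net/http"
	"sync/atomic"
	"time"

	_ "github.com/lib/pq" // hypothetical driver choice
)

var dbHealthy atomic.Bool

// monitorDB pings the database on its own schedule and records the result.
// Feed this into metrics/alerts, not into the load balancer healthcheck.
func monitorDB(db *sql.DB) {
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		err := db.PingContext(ctx)
		cancel()
		dbHealthy.Store(err == nil)
		time.Sleep(10 * time.Second)
	}
}

func main() {
	// sql.Open only validates the DSN; connections are established lazily.
	db, err := sql.Open("postgres", "postgres://app:app@db:5432/app?sslmode=disable")
	if err != nil {
		panic(err)
	}
	go monitorDB(db)

	// LB healthcheck: only answers "can this instance serve traffic?"
	// A DB outage no longer cycles the whole fleet.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Separate endpoint for DB health, scraped by monitoring rather than the LB.
	http.HandleFunc("/dbhealth", func(w http.ResponseWriter, r *http.Request) {
		if dbHealthy.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		http.Error(w, "db degraded", http.StatusServiceUnavailable)
	})

	http.ListenAndServe(":8080", nil)
}
```

With this split, DB-side problems show up in monitoring within one monitor interval instead of unhealthy threshold * interval, and the load balancer only replaces instances for failures it can actually fix.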