Lætitia AVROT: Mostly Dead Is Slightly Alive: Killing Zombie Sessions
Why It Matters
Zombie sessions waste CPU, I/O, and lock resources, degrading performance and increasing latency for OLTP and cloud‑native workloads. Proper tuning eliminates hours‑long resource waste and improves system reliability.
Key Takeaways
- Linux waits 7200 seconds (two hours) by default before sending the first TCP keepalive probe
- Zombie sessions hold locks and block vacuum
- client_connection_check_interval detects dead clients while a query is still running
- Tuning keepalives and the connection check together prevents hours-long resource waste
- PgBouncer also needs keepalive settings for pool health
Pulse Analysis
PostgreSQL administrators constantly battle “zombie” sessions: backend processes that linger in an active or idle‑in‑transaction state after the client has disappeared. Because Linux waits two hours by default before sending its first TCP keepalive probe, these dead connections retain locks, inflate the process list, and can keep vacuum from removing dead rows. In micro‑service architectures where queries finish in seconds or minutes, waiting hours for a default timeout is unacceptable. Yet PostgreSQL ships its tcp_keepalives_* parameters set to zero, which means “use the operating system’s default”, to preserve portability across diverse network environments.
The solution lies in combining kernel‑level keepalive probes with PostgreSQL’s own client‑connection check. Adjusting tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count makes the OS start probing a silent socket after a short idle period and abort it once the probes go unanswered, freeing locks that would otherwise linger. Meanwhile, the client_connection_check_interval parameter, introduced in PostgreSQL 14, makes the backend periodically poll the socket even while a query runs, so a closed connection (an application crash, say, or a closed browser tab) is noticed within one check interval rather than when the query finishes. Together they cover both network‑level silence and application‑level termination, eliminating the “mostly dead” state that wastes CPU and I/O.
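To make the kernel-level half concrete, here is a minimal Python sketch of the socket options that keepalive tuning ultimately maps to. It is an illustration only, not PostgreSQL's own code: the helper names enable_keepalive and keepalive_settings are hypothetical, the TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT constants are the Linux names, and the sketch simply skips any constant the platform's socket module does not expose.

```python
import socket

# Linux names for the three keepalive tuning options (assumed platform).
_TUNING_OPTIONS = ("TCP_KEEPIDLE", "TCP_KEEPINTVL", "TCP_KEEPCNT")


def enable_keepalive(sock, idle=60, interval=10, count=6):
    """Turn on TCP keepalive for sock and tune it where the platform allows."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    for name, value in zip(_TUNING_OPTIONS, (idle, interval, count)):
        option = getattr(socket, name, None)  # None if the platform lacks it
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


def keepalive_settings(sock):
    """Read back the keepalive options that are available on this platform."""
    settings = {
        "SO_KEEPALIVE": sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
    }
    for name in _TUNING_OPTIONS:
        option = getattr(socket, name, None)
        if option is not None:
            settings[name] = sock.getsockopt(socket.IPPROTO_TCP, option)
    return settings


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        enable_keepalive(s)
        print(keepalive_settings(s))
```

The options can be set on a socket before it is connected, which is why the demo needs no server. A value of zero in the PostgreSQL parameters leaves these options at whatever the operating system provides.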
Practitioners typically set tcp_keepalives_idle to 60 seconds, tcp_keepalives_interval to 10 seconds, and tcp_keepalives_count to six, which kills a non‑responsive connection within roughly two minutes: 60 seconds of idle time followed by six unanswered probes sent 10 seconds apart. Pairing this with client_connection_check_interval = 2s ensures long‑running queries abort within a couple of seconds of the client's connection closing. In connection‑pooling environments like PgBouncer, enabling tcp_keepalive and rotating idle server connections further reduces stale backends. These adjustments translate into faster lock turnover, lower vacuum latency, and more predictable resource consumption, all critical factors for high‑throughput OLTP systems and cloud‑native deployments that cannot afford hours of hidden waste.
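The detection time for a silent peer can be worked out from the three keepalive values: the idle wait plus the full probe phase. The sketch below does that arithmetic and renders the settings as configuration lines. The function names are hypothetical, the Linux defaults used for comparison (7200 s idle, 75 s interval, 9 probes) are the stock kernel values, and the formula is an approximation that ignores scheduling and retransmission details.

```python
def keepalive_detection_seconds(idle, interval, count):
    """Approximate time until the kernel gives up on a silent peer.

    The kernel waits `idle` seconds, then sends up to `count` probes
    spaced `interval` seconds apart before declaring the peer dead.
    """
    return idle + interval * count


def postgresql_conf_lines(idle, interval, count, check_interval="2s"):
    """Render the settings discussed above as postgresql.conf lines."""
    return [
        f"tcp_keepalives_idle = {idle}",
        f"tcp_keepalives_interval = {interval}",
        f"tcp_keepalives_count = {count}",
        f"client_connection_check_interval = '{check_interval}'",
    ]


if __name__ == "__main__":
    tuned = keepalive_detection_seconds(idle=60, interval=10, count=6)
    stock = keepalive_detection_seconds(idle=7200, interval=75, count=9)
    print(f"tuned: {tuned} s")   # tuned: 120 s
    print(f"stock: {stock} s")   # stock: 7875 s
    print("\n".join(postgresql_conf_lines(60, 10, 6)))
```

With the tuned values a vanished client is dropped after about two minutes, against more than two hours with the stock kernel settings.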