PostgreSQL - DBA Toolkit - Replication Lag

Your Replica Looks Fine — Until It Does Not

Streaming replication lag is a silent failure.

The replica is connected. WAL is flowing. Everything looks healthy in a dashboard that only checks whether the process is running. But the replica is ten minutes behind. If the primary fails right now, that is ten minutes of committed transactions that do not exist on the standby.

This script queries pg_stat_replication on the primary to show every connected standby, how far behind each one is in bytes and in estimated time, and what stage of the replication pipeline each is at.

    Loading…
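A query along the following lines would produce the columns described below. It is a minimal sketch rather than the original script: it assumes PostgreSQL 10 or later (where the `*_lag` columns and `pg_current_wal_lsn()` exist) and a role that is a superuser or a member of `pg_monitor`, since other roles see NULLs for most columns. The two `*_lag_bytes` expressions are derived from the column descriptions in this article.

```sql
-- Sketch only: run on the primary, PostgreSQL 10 or later.
SELECT
    client_addr,
    usename,
    application_name,
    state,
    sync_state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    write_lag,
    flush_lag,
    replay_lag,
    -- WAL generated on the primary but not yet sent to this standby
    pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS sent_lag_bytes,
    -- WAL sent to this standby but not yet replayed there
    pg_wal_lsn_diff(sent_lsn, replay_lsn)           AS replay_lag_bytes
FROM pg_stat_replication
ORDER BY application_name;
```

One row comes back per connected WAL sender, so an empty result on a primary that should have standbys is itself a finding.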

Reading the Results

pg_stat_replication (run on primary)

Column	What It Tells You
`client_addr`	IP address of the standby server
`usename`	Replication user
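`application_name`	Standby name, taken from `application_name` in the standby's `primary_conninfo` (set in `recovery.conf` up to PostgreSQL 11, in `postgresql.conf` from version 12)
`state`	`streaming` = healthy; `catchup` = standby is catching up with the primary; `backup` = base backup in progress; `startup` and `stopping` are brief transitional states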
`sent_lsn`	WAL position sent to this standby
`write_lsn`	WAL position written by the standby (handed to the operating system, not yet guaranteed to be on disk)
`flush_lsn`	WAL position flushed (durable) on standby
`replay_lsn`	WAL position applied to standby data files
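`write_lag`	Time between the primary flushing WAL locally and learning that the standby has written it
`flush_lag`	Time between the primary flushing WAL locally and learning that the standby has flushed it
`replay_lag`	Time between the primary flushing WAL locally and learning that the standby has replayed it: the lag that queries on the standby actually see
`sync_state`	`async`, `potential`, `sync`, or `quorum`: whether this standby participates in synchronous commit (`potential` = asynchronous now, becomes `sync` if a current synchronous standby fails)
`sent_lag_bytes`	Bytes of WAL not yet sent to standby (computed by the script, not a native column)
`replay_lag_bytes`	Bytes of WAL sent but not yet replayed (computed by the script, not a native column)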
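The byte-based figures come from subtracting one LSN from another. An LSN printed as `X/Y` is a 64-bit WAL byte position, with `X` the high 32 bits and `Y` the low 32 bits, both in hexadecimal, and `pg_wal_lsn_diff()` returns the difference in bytes. The LSN values below are made up for illustration; the statements run on any PostgreSQL 10+ server, replicated or not.

```sql
-- Illustrative LSNs only. Low parts differ by 0x848 = 2120 bytes.
SELECT pg_wal_lsn_diff('16/B374D848'::pg_lsn, '16/B374D000'::pg_lsn) AS lag_bytes;
-- lag_bytes = 2120

-- The high part carries over: 1/0 is one byte past 0/FFFFFFFF.
SELECT pg_wal_lsn_diff('1/0'::pg_lsn, '0/FFFFFFFF'::pg_lsn) AS lag_bytes;
-- lag_bytes = 1

-- Human-readable form of a byte count (here 64 MB worth of WAL).
SELECT pg_size_pretty(67108864::bigint) AS pretty;
```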

What to Watch For

Signal	What It Means
`state` = `catchup` for extended time	Replica is behind and recovering — check network and replica I/O capacity
`replay_lag` growing steadily	The standby cannot keep up with WAL generation rate
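`replay_lag_bytes` persistently > 0 on a synchronous standby	Commits wait for this standby's replay only when `synchronous_commit = remote_apply`; with the default `on` they wait for its flush, so check `flush_lag` as well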
Missing rows (expected standby not appearing)	Replica has disconnected — check standby logs and connectivity
`sync_state = sync` with high `flush_lag` (or high `replay_lag` under `remote_apply`)	Synchronous replication is slowing your primary commits
Large gap between `sent_lsn` and `replay_lsn`	WAL is arriving but not being applied — standby I/O or CPU bottleneck
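Several of these signals can be folded into one check suitable for a monitoring job. The sketch below returns only standbys that look unhealthy. The thresholds (30 seconds, 64 MB) are placeholder assumptions, not recommendations from this article, and should be tuned to your WAL volume and recovery objectives.

```sql
-- Sketch only: run on the primary. Thresholds are hypothetical.
SELECT
    application_name,
    client_addr,
    state,
    sync_state,
    flush_lag,
    replay_lag,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication
WHERE state <> 'streaming'                                          -- catchup, backup, ...
   OR replay_lag > interval '30 seconds'                            -- time-based threshold
   OR pg_wal_lsn_diff(sent_lsn, replay_lsn) > 64 * 1024 * 1024;     -- byte-based threshold

-- A disconnected standby has no row at all, so the query above cannot see it.
-- Compare this count with the number of standbys you expect.
SELECT count(*) AS connected_standbys
FROM pg_stat_replication;
```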

Checking Lag from the Standby

If you need to check lag from the standby server itself (when you cannot access the primary):

    Loading…

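replay_delay shows how old the last replayed transaction is. It measures data freshness reliably only while the primary is committing transactions: on an idle primary the value keeps growing even though the standby is fully caught up.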
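A standby-side check could look like the sketch below. It assumes PostgreSQL 10 or later and that `replay_delay` is computed as `now() - pg_last_xact_replay_timestamp()`, which matches the description above but is not confirmed as the original query. The `fully_replayed` column is an addition for illustration: when it is true, all received WAL has been applied, so a large `replay_delay` more likely reflects a quiet primary than a slow standby.

```sql
-- Sketch only: run on the standby, PostgreSQL 10 or later.
SELECT
    pg_is_in_recovery()                       AS is_standby,
    pg_last_wal_receive_lsn()                 AS receive_lsn,   -- last WAL received and synced to disk
    pg_last_wal_replay_lsn()                  AS replay_lsn,    -- last WAL replayed
    -- WAL received but not yet applied, in bytes
    pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                    pg_last_wal_replay_lsn()) AS receive_replay_gap_bytes,
    -- true when everything received so far has been replayed
    pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() AS fully_replayed,
    -- age of the most recently replayed transaction
    now() - pg_last_xact_replay_timestamp()   AS replay_delay;
```

The receive-side columns are NULL on a standby that has never streamed from a primary (for example one fed only from a WAL archive).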

Gareth Winterman