Always On Replica Health

High Availability Isn’t Magic

You build Availability Groups.

You configure replicas.
You set synchronous commit.
You test failover.

Everyone claps.

“Great — we’re highly available now.”

No.

You’re capable of high availability.

That’s not the same thing as being healthy.

The Silent Drift Problem

Always On doesn’t usually explode.

It drifts.

A replica disconnects
Sync health drops from HEALTHY to PARTIALLY_HEALTHY
A node flips to RESOLVING
Latency creeps up
Someone patches a node and forgets to recheck

No alarms fire.

No outage yet.

But your safety net just developed a tear.

And you won’t notice until you need it.

That’s the nightmare.

What This Query Actually Tells You

This script joins:

sys.availability_groups
sys.availability_replicas
sys.dm_hadr_availability_replica_states

And it shows you, in plain English:

Availability Group name
Replica server name
Role (PRIMARY / SECONDARY / RESOLVING)
Connection state
Synchronization health

That’s it.

No drama.

Just the truth.

And the truth is what saves you at 2am.

What Good Looks Like

You want to see:

One PRIMARY
One or more SECONDARY
CONNECTED
HEALTHY

Anything else?

You dig.

Because here’s the uncomfortable reality:

If your secondary isn’t synchronised, you don’t have HA.
You have hope.

And hope is not a strategy.

Why This Matters More Than Backups (Sometimes)

Backups protect you from data loss.

Availability Groups protect you from downtime.

But if your AG is silently unhealthy:

Failover won’t work properly
Data may not be synchronised
Your RPO assumptions are wrong
Your RTO assumptions are fantasy

And those assumptions are often written into SLAs.

You don’t want to discover your SLA is fiction during an incident call.

Drift Happens Quietly

Common reasons replica health degrades:

Network instability
Storage latency
Log send queue growth
Redo queue growth
Long-running transactions
Endpoint misconfiguration
Certificate expiry
Windows patching
Someone “just testing something”

Always On is brilliant.

But it’s not self-healing.

It’s self-reporting.

Only if you look.

This Is Pattern Monitoring Again

Just like file growth.

Just like blocking.

Just like backups.

You’re not looking for explosions.

You’re looking for behaviour change.

If a replica has been:

HEALTHY for 3 years
and suddenly becomes
PARTIALLY_HEALTHY

That’s your signal.

Not when it breaks.

When it starts drifting.

The Difference Between Junior and Senior DBA Work

Junior DBA:

“It failed over fine in testing.”

Senior DBA:

“Show me replica health over the last 30 days.”

High availability isn’t a checkbox.

It’s a living system.

And living systems degrade.

When Things Go Wrong

If you ever see:

DISCONNECTED
NOT_HEALTHY
RESOLVING

Stop.

Don’t ignore it.

Because when the primary fails, that’s when the business discovers whether your HA design was real — or theoretical.

And you don’t want to discover that at the same time they do.

This Is Why We Check

You don’t build Availability Groups and forget them.

You monitor them.

Daily.

Automated if possible.

Because resilience is not configuration.

It’s visibility.

You don’t just build redundancy.
You verify it.

Always On Replica Health Script

    Loading…

Gareth Winterman