High Availability Isn’t Magic
You build Availability Groups.
You configure replicas.
You set synchronous commit.
You test failover.
Everyone claps.
“Great — we’re highly available now.”
No.
You’re capable of high availability.
That’s not the same thing as being healthy.
The Silent Drift Problem
Always On doesn’t usually explode.
It drifts.
- A replica disconnects
- Sync health drops from
HEALTHYtoPARTIALLY_HEALTHY - A node flips to
RESOLVING - Latency creeps up
- Someone patches a node and forgets to recheck
No alarms fire.
No outage yet.
But your safety net just developed a tear.
And you won’t notice until you need it.
That’s the nightmare.
What This Query Actually Tells You
This script joins:
sys.availability_groupssys.availability_replicassys.dm_hadr_availability_replica_states
And it shows you, in plain English:
- Availability Group name
- Replica server name
- Role (
PRIMARY/SECONDARY/RESOLVING) - Connection state
- Synchronization health
That’s it.
No drama.
Just the truth.
And the truth is what saves you at 2am.
What Good Looks Like
You want to see:
- One
PRIMARY - One or more
SECONDARY CONNECTEDHEALTHY
Anything else?
You dig.
Because here’s the uncomfortable reality:
If your secondary isn’t synchronised, you don’t have HA.
You have hope.
And hope is not a strategy.
Why This Matters More Than Backups (Sometimes)
Backups protect you from data loss.
Availability Groups protect you from downtime.
But if your AG is silently unhealthy:
- Failover won’t work properly
- Data may not be synchronised
- Your RPO assumptions are wrong
- Your RTO assumptions are fantasy
And those assumptions are often written into SLAs.
You don’t want to discover your SLA is fiction during an incident call.
Drift Happens Quietly
Common reasons replica health degrades:
- Network instability
- Storage latency
- Log send queue growth
- Redo queue growth
- Long-running transactions
- Endpoint misconfiguration
- Certificate expiry
- Windows patching
- Someone “just testing something”
Always On is brilliant.
But it’s not self-healing.
It’s self-reporting.
Only if you look.
This Is Pattern Monitoring Again
Just like file growth.
Just like blocking.
Just like backups.
You’re not looking for explosions.
You’re looking for behaviour change.
If a replica has been:
HEALTHYfor 3 years
and suddenly becomesPARTIALLY_HEALTHY
That’s your signal.
Not when it breaks.
When it starts drifting.
The Difference Between Junior and Senior DBA Work
Junior DBA:
“It failed over fine in testing.”
Senior DBA:
“Show me replica health over the last 30 days.”
High availability isn’t a checkbox.
It’s a living system.
And living systems degrade.
When Things Go Wrong
If you ever see:
DISCONNECTEDNOT_HEALTHYRESOLVING
Stop.
Don’t ignore it.
Because when the primary fails, that’s when the business discovers whether your HA design was real — or theoretical.
And you don’t want to discover that at the same time they do.
This Is Why We Check
You don’t build Availability Groups and forget them.
You monitor them.
Daily.
Automated if possible.
Because resilience is not configuration.
It’s visibility.
You don’t just build redundancy.
You verify it.
Always On Replica Health Script
Loading…
Gareth Winterman