Database Administration — Sat Jan 31

← Home | ← database-admin

Always On Replica Health

Sat Jan 31
#sql-server #always-on #monitoring #high-availability #dba-life

High Availability Isn’t Magic

You build Availability Groups.

You configure replicas.
You set synchronous commit.
You test failover.

Everyone claps.

“Great — we’re highly available now.”

No.

You’re capable of high availability.

That’s not the same thing as being healthy.


The Silent Drift Problem

Always On doesn’t usually explode.

It drifts.

  • A replica disconnects
  • Sync health drops from HEALTHY to PARTIALLY_HEALTHY
  • A node flips to RESOLVING
  • Latency creeps up
  • Someone patches a node and forgets to recheck

No alarms fire.

No outage yet.

But your safety net just developed a tear.

And you won’t notice until you need it.

That’s the nightmare.


What This Query Actually Tells You

This script joins:

  • sys.availability_groups
  • sys.availability_replicas
  • sys.dm_hadr_availability_replica_states

And it shows you, in plain English:

  • Availability Group name
  • Replica server name
  • Role (PRIMARY / SECONDARY / RESOLVING)
  • Connection state
  • Synchronization health

That’s it.

No drama.

Just the truth.

And the truth is what saves you at 2am.


What Good Looks Like

You want to see:

  • One PRIMARY
  • One or more SECONDARY
  • CONNECTED
  • HEALTHY

Anything else?

You dig.

Because here’s the uncomfortable reality:

If your secondary isn’t synchronised, you don’t have HA.
You have hope.

And hope is not a strategy.


Why This Matters More Than Backups (Sometimes)

Backups protect you from data loss.

Availability Groups protect you from downtime.

But if your AG is silently unhealthy:

  • Failover won’t work properly
  • Data may not be synchronised
  • Your RPO assumptions are wrong
  • Your RTO assumptions are fantasy

And those assumptions are often written into SLAs.

You don’t want to discover your SLA is fiction during an incident call.


Drift Happens Quietly

Common reasons replica health degrades:

  • Network instability
  • Storage latency
  • Log send queue growth
  • Redo queue growth
  • Long-running transactions
  • Endpoint misconfiguration
  • Certificate expiry
  • Windows patching
  • Someone “just testing something”

Always On is brilliant.

But it’s not self-healing.

It’s self-reporting.

Only if you look.


This Is Pattern Monitoring Again

Just like file growth.

Just like blocking.

Just like backups.

You’re not looking for explosions.

You’re looking for behaviour change.

If a replica has been:

  • HEALTHY for 3 years
    and suddenly becomes
  • PARTIALLY_HEALTHY

That’s your signal.

Not when it breaks.

When it starts drifting.


The Difference Between Junior and Senior DBA Work

Junior DBA:

“It failed over fine in testing.”

Senior DBA:

“Show me replica health over the last 30 days.”

High availability isn’t a checkbox.

It’s a living system.

And living systems degrade.


When Things Go Wrong

If you ever see:

  • DISCONNECTED
  • NOT_HEALTHY
  • RESOLVING

Stop.

Don’t ignore it.

Because when the primary fails, that’s when the business discovers whether your HA design was real — or theoretical.

And you don’t want to discover that at the same time they do.


This Is Why We Check

You don’t build Availability Groups and forget them.

You monitor them.

Daily.

Automated if possible.

Because resilience is not configuration.

It’s visibility.


You don’t just build redundancy.
You verify it.

Always On Replica Health Script

Always On Replica Health.sql
    Loading…
  

Gareth Winterman