Data Architecture — Mon Feb 16

← Home | ← data-architecture

Party with Partitions

Mon Feb 16
#database #dba #architecture #organizing data #on-prem #performance #cloud data #data lake #partitioning #data-architecture #scalability

I Used to Think Partitioning Was a DBA Thing

Back when I was living deep in SQL Server, partitioning felt tactical.

You used it when a table got too big. You used it when maintenance windows were tight. You used it when index rebuilds started taking longer than your patience.

It felt operational.

But the longer I’ve worked in data engineering and architecture, the more I’ve realised something important:

Partitioning is not a DBA optimisation. It’s an architectural contract.

And if you get that contract wrong early on, everything downstream becomes expensive.


What Partitioning Actually Is

At its simplest, partitioning is dividing data into logical segments based on a key.

Common examples:

  • By date (daily, monthly, yearly)
  • By tenant / customer
  • By region or site
  • By device / turbine / asset

But here’s the important bit:

Partitioning isn’t about storage. It’s about access patterns.

You partition based on how the data is:

  • Written
  • Queried
  • Retained
  • Deleted
  • Archived

If you design it around size alone, you’ll regret it.


The Real Reason We Partition

Partitioning gives you leverage in five critical areas:

1. Performance

When queries can eliminate partitions early (partition pruning), the engine reads less data. Less I/O. Less memory pressure. Less pain.

2. Maintenance

You can:

  • Switch partitions in/out
  • Rebuild indexes per partition
  • Drop old data instantly

Instead of running a 6‑hour delete that fills your transaction log.

3. Cost Control

In cloud platforms (Databricks, Snowflake, BigQuery), partitioning directly impacts:

  • Scan cost
  • Compute usage
  • Storage tiering

Bad partitioning = unnecessary compute spend.

4. Data Lifecycle Management

Hot data. Warm data. Cold data. Archive.

Partitioning lets you manage that cleanly.

5. Operational Safety

If something goes wrong, you isolate blast radius.

A bad load into one partition is fixable. A bad load into a single monolithic table is… a long night.


Database Partitioning vs Folder Partitioning

This is where things get interesting.

Traditional RDBMS partitioning:

  • SQL Server partition functions
  • PostgreSQL declarative partitioning
  • Oracle range/list partitions

Modern lakehouse partitioning:

  • Folder structure (/year=2026/month=02/day=17/)
  • Delta Lake partition columns
  • Hive-style partitioning

They solve similar problems — but at different layers.

Database partitioning controls physical storage inside the engine. Lakehouse partitioning controls file layout and query pruning.

If you’re building modern pipelines, you need to think about both.


The Classic Mistakes

I’ve seen all of these in the wild.

Don’t Partitioning by an ID with high cardinality

Millions of tiny partitions. Metadata explosion. Terrible performance.

Don’t Over-partitioning by date (hourly when daily is enough)

Small files everywhere. Query planning overhead.

Don’t Under-partitioning

One giant table. Deletes take hours. Maintenance windows vanish.

Don’t Designing partitions around today’s queries only

Architecture must anticipate growth.


What I Now Believe

Partitioning belongs in the Data Architecture pillar.

Not database admin. Not performance tuning. Not “we’ll fix it later.”

It should be decided when:

  • You define your domain model
  • You define retention policy
  • You design ingestion frequency
  • You understand query behaviour

Partitioning is a first-class design decision.


A Simple Mental Model

Ask yourself four questions before choosing a partition key:

  1. What is the dominant filter in queries?
  2. How often does new data arrive?
  3. How long do we retain data?
  4. How will we delete or archive it?

If you can’t answer those, you’re guessing.

And guessing in architecture is expensive.


Final Thought

When I was a younger DBA, partitioning felt like a clever trick.

Now, as a data engineer thinking architecturally, I see it differently.

Partitioning is not about making a query 5% faster.

It’s about making sure your system still works when your data is 10x bigger than you expected.

Design it early. Design it deliberately.

Or be ready to rebuild it later.


Gareth Winterman