Wide-Column Stores

A distributed sorted map: rows partitioned by a hashed key, sorted within each partition by clustering key. Built to absorb writes at firehose speed and serve them back by partition + range. The internet's time-series and feed engine.

Partition key · Clustering key · LSM-tree · Tunable consistency · Linear scale
Quick Facts

At a Glance

Core Ideas

  • Partition key: hashed to choose a node. Every read for that partition lands on the same shard.
  • Clustering key: sorts rows within the partition. The reason range scans on time work efficiently.
  • Wide row: a partition can hold millions of cells — the whole "time-series for sensor 42" lives together.
  • LSM-tree storage: writes go to memory + commit log, then flush + compact. Sequential, fast — that's the trick.
  • Tunable consistency: per-query ONE / QUORUM / ALL on reads and writes. CAP, made dial-able (all five ideas are sketched in CQL just below).
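
A minimal CQL sketch of how those five ideas land in a schema. The table and column names (readings_by_device, device_id, reading_ts) are illustrative, not taken from any particular system.

    -- Partition key (device_id): hashed to pick the owning replicas.
    -- Clustering key (reading_ts): sorts cells within the partition, newest first.
    CREATE TABLE readings_by_device (
        device_id   text,
        reading_ts  timestamp,
        value       double,
        PRIMARY KEY ((device_id), reading_ts)
    ) WITH CLUSTERING ORDER BY (reading_ts DESC);

    -- Consistency is chosen per statement; in cqlsh, for example:
    CONSISTENCY QUORUM;
    SELECT reading_ts, value FROM readings_by_device
    WHERE device_id = 'sensor-42' LIMIT 10;
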
The Engines

Who Plays Here

Apache Cassandra

The reference open-source wide-column DB. Masterless, multi-DC, CQL query language. Powers Apple and Netflix at scale.

ScyllaDB

Cassandra-compatible, rewritten in C++ on a shard-per-core architecture. Same model, several times the throughput per node; Discord moved its message store here.

HBase

The Bigtable-on-Hadoop one. HDFS underneath, master-based architecture. Common in batch + analytics stacks.

Google Bigtable

The original. Powers Search, Analytics, Maps. Available as a managed service on GCP.

Amazon Keyspaces

Managed, Cassandra-compatible. Serverless pricing.

YugabyteDB

Distributed SQL with a Cassandra-compatible YCQL API — wide-column semantics with optional ACID.

When Wide-Column Wins

The Sweet Spots

Write-Heavy, Append-Mostly

Telemetry, IoT readings, click streams, audit logs, chat messages. Millions of writes per second, almost no updates. The LSM-tree thrives on this — sequential writes, no in-place updates, compaction in the background.

Time-Ordered Reads of a Single Entity

"Last 100 messages in this channel." "All sensor readings for device 42 in the last hour." Partition key = entity, clustering key = timestamp DESC. One disk seek, one sequential read. This is the killer pattern.

Multi-Region Active-Active

Cassandra was designed for it. Writes accepted in any datacenter, replicated asynchronously, conflicts resolved by last-write-wins on cell timestamps. Geographic latency stays local; eventual convergence across regions.
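
A typical shape for that topology, sketched with placeholder keyspace and datacenter names (the DC names must match whatever your snitch reports):

    -- Three replicas in each region; writes are accepted in either.
    CREATE KEYSPACE chat
      WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us_east': 3,
        'eu_west': 3
      };

    -- Clients read and write at LOCAL_QUORUM so latency stays inside the
    -- local DC; cross-region replication happens asynchronously behind it.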

Linear Scale on Commodity Hardware

Add a node, the cluster rebalances, throughput goes up roughly linearly. No master, no sharding ceremony, no read replicas to wire up — the model is "more nodes = more capacity" all the way to thousands.

When to Stay Away

Bad Fits

  • Ad-hoc queries. No joins, no WHERE on arbitrary columns, secondary indexes are weak. If you don't know the access pattern up front, you'll regret picking this.
  • Strong cross-partition transactions. Lightweight transactions exist (Paxos in Cassandra), but they're slow and limited to a single partition (see the sketch after this list). Cross-partition multi-row ACID — not the model.
  • Small data. The operational complexity (multi-node clusters, repairs, compaction tuning) doesn't pay off until you actually have the volume.
  • Heavy update / delete workloads. Tombstones and compaction overhead pile up. If you're updating the same row 100× a day, you're fighting the engine.
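
For scale, a lightweight transaction is just a single-partition conditional write, each one paying a Paxos round trip. The sketch below assumes an illustrative usernames table; there is no cross-partition, multi-row version of this.

    -- Assumes: CREATE TABLE usernames (name text PRIMARY KEY, owner_id text);
    -- One Paxos round, confined to the 'ada' partition.
    INSERT INTO usernames (name, owner_id)
    VALUES ('ada', 'user-42')
    IF NOT EXISTS;
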
Modeling

Designing for the Query

Pick Partition Key for Distribution

High cardinality, even access. user_id usually fine. country_code is a hot-shard machine — a few countries dominate. (user_id, day_bucket) avoids unbounded partitions for users who generate millions of events.
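
An illustrative before/after (events_by_user, day_bucket, and the columns are made-up names):

    -- Hot-shard risk: a handful of country codes own most of the traffic.
    --   PRIMARY KEY ((country_code), event_ts)

    -- Better: a high-cardinality key plus a time bucket, so no partition grows forever.
    CREATE TABLE events_by_user (
        user_id     uuid,
        day_bucket  int,        -- e.g. 20240115
        event_ts    timeuuid,
        payload     text,
        PRIMARY KEY ((user_id, day_bucket), event_ts)
    ) WITH CLUSTERING ORDER BY (event_ts DESC);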

Pick Clustering Key for Order

Whatever the read needs in order — timestamp DESC, message ID, score. The clustering key controls on-disk layout, so a range scan is one seek + sequential read.
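
For instance, a per-game leaderboard might cluster on score (illustrative schema; ? is a bind marker):

    CREATE TABLE scores_by_game (
        game_id    uuid,
        score      int,
        player_id  uuid,
        PRIMARY KEY ((game_id), score, player_id)
    ) WITH CLUSTERING ORDER BY (score DESC, player_id ASC);

    -- Top 10 is one seek plus a short sequential read from the head of the partition.
    SELECT player_id, score FROM scores_by_game WHERE game_id = ? LIMIT 10;

One caveat with this shape: because score is part of the clustering key, changing a player's score means writing a new row and deleting the old one; the read stays a single-partition scan either way.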

One Table per Query

Need messages by channel and by author? Two tables, both written to. Denormalization is the design — storage is cheap, reads are the expensive thing. Use logged batches or change-data-capture to keep them in sync.
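
A sketch of the two-table version, kept in step with a logged batch (names are illustrative; the channel-keyed table matches the earlier sketch, and ? marks bind markers):

    CREATE TABLE messages_by_author (
        author_id   uuid,
        created_at  timeuuid,
        channel_id  uuid,
        body        text,
        PRIMARY KEY ((author_id), created_at)
    ) WITH CLUSTERING ORDER BY (created_at DESC);

    -- One logical write, two physical tables, applied atomically.
    BEGIN BATCH
      INSERT INTO messages_by_channel (channel_id, created_at, author_id, body)
        VALUES (?, ?, ?, ?);
      INSERT INTO messages_by_author (author_id, created_at, channel_id, body)
        VALUES (?, ?, ?, ?);
    APPLY BATCH;

A logged batch guarantees both writes eventually land (atomicity, not isolation), and multi-partition batches are slower than separate writes, so keep them for exactly this denormalization-consistency job.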

Bound the Partition Size

Rule of thumb: keep partitions under ~100MB / 100k rows. Bucket time-series by day or hour: (sensor_id, yyyymmdd). Otherwise repair, compaction, and read latency all suffer.
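
Continuing the sensor example as a sketch, the bucket becomes part of the partition key, and reads name it explicitly:

    CREATE TABLE readings_by_sensor (
        sensor_id   text,
        yyyymmdd    int,
        reading_ts  timestamp,
        value       double,
        PRIMARY KEY ((sensor_id, yyyymmdd), reading_ts)
    ) WITH CLUSTERING ORDER BY (reading_ts DESC);

    -- Each day is its own bounded partition; a multi-day read is just
    -- one query per bucket, which the client can issue in parallel.
    SELECT reading_ts, value FROM readings_by_sensor
    WHERE sensor_id = 'sensor-42' AND yyyymmdd = 20240115;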

Pitfalls

Common Mistakes

  • Querying without the partition key. Forces an ALLOW FILTERING scan across the cluster. Don't.
  • Tombstone storms. Mass deletes on a partition slow reads to a crawl until compaction catches up. Prefer TTLs over DELETE (sketched after this list).
  • Counters used like SQL columns. Counter columns are not idempotent — retries can double-count. Reach for them deliberately.
  • Skipping repair. Anti-entropy repairs need to run on schedule, or replicas drift and reads return inconsistent data.
  • Treating it like a relational DB. Trying to JOIN, GROUP BY, or run reports — you'll bring the cluster to its knees. Push analytics to a warehouse.
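
A sketch of the TTL approach (numbers are illustrative). Expired cells still become tombstones, but expiry is spread per cell and, with time-window compaction, whole SSTables can be dropped at once, which is far gentler than a mass DELETE.

    -- Per-write TTL: this row ages out after 7 days (604800 seconds).
    INSERT INTO readings_by_sensor (sensor_id, yyyymmdd, reading_ts, value)
    VALUES ('sensor-42', 20240115, '2024-01-15 10:00:00', 21.5)
    USING TTL 604800;

    -- Or make it the table default so every write expires automatically.
    ALTER TABLE readings_by_sensor
      WITH default_time_to_live = 604800;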