Document Databases · Database Family Deep Dive

Quick Facts

At a Glance

Core Ideas

Document: a JSON-like object stored as one unit — fields, arrays, nested sub-documents.
Collection: the bucket of documents (analogous to a table, but no enforced shape).
Aggregate: the unit that's read and written atomically — design around it, not around normal forms.
Schema-on-read: the database accepts whatever you write; the app validates and reshapes on the way out.
Secondary indexes let you query inside documents — but they cost on writes and storage.
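
To make these terms concrete, here is a minimal in-memory sketch, not any engine's real API. The names `collection`, `status_index`, `insert`, and `find_by_status` are hypothetical and exist only for illustration. It shows a document stored as one unit, a collection that accepts different shapes, and a secondary index that adds work to every write.

```python
from collections import defaultdict

collection = {}                  # hypothetical collection: _id -> whole document
status_index = defaultdict(set)  # hypothetical secondary index: status -> {_id}

def insert(doc):
    """Store the document as one unit, then pay the extra index write."""
    collection[doc["_id"]] = doc
    if "status" in doc:
        status_index[doc["status"]].add(doc["_id"])

def find_by_status(status):
    """Query inside documents through the index instead of scanning."""
    return [collection[i] for i in sorted(status_index.get(status, ()))]

# Two documents with different shapes in the same collection.
insert({
    "_id": "order-1001",
    "status": "paid",
    "customer": {"id": "cust-42", "name": "Ada"},           # nested sub-document
    "items": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}],  # array
})
insert({"_id": "note-1", "text": "no status field, still accepted"})
```

Every index added this way means one more structure to update on each insert, which is the write and storage cost mentioned above.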

The Engines

Who Plays Here

MongoDB

The default choice among document databases. Rich query language, aggregation pipeline, and multi-document ACID transactions since 4.0 (on replica sets; sharded clusters followed in 4.2). Atlas is the managed offering.

Couchbase

Document + key-value with SQL-like N1QL queries (now branded SQL++). Strong on caching workloads thanks to its memcached lineage.

Firestore

Google's serverless document store with real-time listeners — built for mobile/web clients that want live data sync.

DynamoDB (document mode)

AWS's managed key-value and document hybrid. Predictable single-digit-millisecond latency, but you must design around the access pattern up front.

Cosmos DB

Azure's multi-model database. It offers a MongoDB-compatible API, a SQL-style API (now called API for NoSQL), and others; pick the API that fits your clients and tooling.

RavenDB / ArangoDB

Niche choices: RavenDB for .NET-first shops; ArangoDB if you want documents and graph in one engine.

When Documents Win

Good Fits

The Entity is Naturally a Tree

A product with variants, options, images, and translated descriptions. A blog post with embedded comments and tags. A medical record with sections and history. Stuffing these into eight relational tables and joining them back together on every read is work the application doesn't get paid for. One document, one read.
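
A small sketch of the difference, using plain Python lists and dicts as stand-ins for tables and a collection. All table names, fields, and both functions are hypothetical. The relational shape has to be reassembled on each read; the document shape is fetched as it is.

```python
# Relational shape: three hypothetical tables as lists of rows.
products = [{"id": 1, "name": "Trail Shoe"}]
variants = [
    {"product_id": 1, "sku": "TS-42", "size": 42},
    {"product_id": 1, "sku": "TS-43", "size": 43},
]
images = [{"product_id": 1, "url": "front.jpg"}]

def assemble_product(product_id):
    """Join the rows back into a tree on every read."""
    base = next(p for p in products if p["id"] == product_id)
    return {
        "_id": base["id"],
        "name": base["name"],
        "variants": [
            {"sku": v["sku"], "size": v["size"]}
            for v in variants if v["product_id"] == product_id
        ],
        "images": [i["url"] for i in images if i["product_id"] == product_id],
    }

# Document shape: the tree is already stored in one piece.
product_docs = {
    1: {
        "_id": 1,
        "name": "Trail Shoe",
        "variants": [{"sku": "TS-42", "size": 42}, {"sku": "TS-43", "size": 43}],
        "images": ["front.jpg"],
    }
}

def fetch_product(product_id):
    """One lookup, no assembly."""
    return product_docs[product_id]
```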

The Shape Changes Often

Early-stage products, integrations with third parties whose payloads keep evolving, multi-tenant systems where each tenant has its own custom fields. Adding a field is just writing it; old documents simply don't have it. Migration is a backfill script, not an ALTER TABLE on a 200M-row table.
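
As a rough illustration, assuming a hypothetical `locale` field added after some users were already stored: readers tolerate the missing field with a default, and the migration is a loop that can run at its own pace. `DEFAULT_LOCALE`, `read_locale`, and `backfill_locale` are illustrative names, and the list stands in for a collection.

```python
DEFAULT_LOCALE = "en-US"  # assumed default, chosen for this example only

users = [
    {"_id": 1, "name": "Ada"},                       # written before the field existed
    {"_id": 2, "name": "Grace", "locale": "en-GB"},  # written after
]

def read_locale(doc):
    """Schema-on-read: old documents simply lack the field."""
    return doc.get("locale", DEFAULT_LOCALE)

def backfill_locale(docs):
    """Backfill script: fill the field where it is missing, return the count."""
    updated = 0
    for doc in docs:
        if "locale" not in doc:
            doc["locale"] = DEFAULT_LOCALE
            updated += 1
    return updated
```

Because the backfill only touches documents that lack the field, it is safe to rerun after an interruption.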

Read-Heavy by ID or Small Key Set

If 95% of your reads are "give me this user's profile" or "give me this order with everything in it," documents are exactly that — one fetch, no assembly. Where they struggle is the 5% that wants to slice across documents (revenue by region, top customers by month).

Geographic / Edge Distribution

Most document stores were built for sharding from day one. Globally distributed Cosmos DB, Firestore multi-region, DynamoDB Global Tables: the data follows the user. Relational systems can do this too, but traditional engines usually get there through retrofitted sharding layers or a move to a newer distributed SQL engine; document stores assume it.

When to Stay Away

Bad Fits

Cross-entity transactions. "Move money from A to B and decrement stock by 1" wants ACID across rows. Mongo can do multi-doc transactions now, but they add latency and coordination overhead and are rarely the right shape; relational is built for this.
Reporting / BI. Analysts speak SQL. If finance has to query the production document store, you'll end up exporting to a warehouse anyway — start there.
Highly relational data. Friend-of-a-friend, "products that share categories with this one" — graph or relational. Documents force you to denormalize the same fact into many places, then keep them in sync.
Strict invariants. Foreign keys, unique constraints across collections, declared types — document stores don't enforce these. The app does, and the app is wrong sometimes.

Modeling

Designing the Document

Embed vs Reference

Embed when the child belongs to one parent and is read with it (order line items, blog comments on a small post). Reference when the child is shared (a product referenced by many orders) or unbounded (every comment on a Reddit thread). MongoDB's 16 MB document size limit is a real ceiling, so model around it.
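
A minimal sketch of both choices in one order, with hypothetical field names throughout. Line items are embedded because they belong to this order alone; the product is referenced by id because many orders share it. The size check uses the JSON encoding as a rough proxy only, since BSON sizes differ.

```python
import json

MAX_DOC_BYTES = 16 * 1024 * 1024  # MongoDB's 16 MB document limit

products = {"sku-1": {"_id": "sku-1", "name": "Kettle"}}  # shared, so referenced

order = {
    "_id": "order-9",
    "line_items": [  # owned by this order and read with it, so embedded
        {"product_id": "sku-1", "qty": 2, "unit_price_cents": 2500},
    ],
}

def order_total_cents(order):
    """Everything needed is embedded: no second lookup."""
    return sum(li["qty"] * li["unit_price_cents"] for li in order["line_items"])

def product_names(order, products):
    """Following references costs an extra lookup per distinct product."""
    return [products[li["product_id"]]["name"] for li in order["line_items"]]

def approx_size_bytes(doc):
    """Rough proxy for stored size; real BSON encoding will differ."""
    return len(json.dumps(doc).encode("utf-8"))
```

Copying the unit price into the line item is deliberate in this sketch: the order keeps the price that applied at purchase time even if the product changes later.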

Design for the Query, Not the Entity

In SQL you model the data and the query optimizer figures out the access path. In document stores you model the read. If two queries want different shapes of the same data, store both shapes; the duplication is the point. Keep the copies consistent with change streams or a transactional outbox.
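
One way to picture "store both shapes" is a projector that applies each order event to two read models. This is an in-memory sketch under assumptions: the event shape, the dict names, and `apply_order_created` are all hypothetical, and a real system would receive the events from a change stream or an outbox table rather than a direct function call.

```python
orders = {}              # shape 1: full order, read by order id
customer_summaries = {}  # shape 2: per-customer rollup, read by customer id

def apply_order_created(order):
    """Project one order into both shapes; replays are ignored."""
    if order["_id"] in orders:  # delivered twice: already projected
        return
    orders[order["_id"]] = order
    summary = customer_summaries.setdefault(
        order["customer_id"],
        {"_id": order["customer_id"], "order_count": 0,
         "total_cents": 0, "recent_order_ids": []},
    )
    summary["order_count"] += 1
    summary["total_cents"] += order["total_cents"]
    # Keep only the five newest ids so the summary stays bounded.
    summary["recent_order_ids"] = ([order["_id"]] + summary["recent_order_ids"])[:5]

first = {"_id": "o-1", "customer_id": "c-1", "total_cents": 1200}
apply_order_created(first)
apply_order_created(first)  # duplicate delivery has no effect
apply_order_created({"_id": "o-2", "customer_id": "c-1", "total_cents": 800})
```

The duplicate check matters because change feeds and outbox relays commonly deliver at least once, so the projector should tolerate replays.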

Schema Validation — Use It

The options include Mongo's $jsonSchema validator, Firestore security rules, and DynamoDB condition expressions on writes. Schema-flexible doesn't mean schema-absent; it means the schema lives where you choose. Validate at the database boundary, not just in the app, so jobs and migrations can't write garbage.
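
For illustration, a `$jsonSchema` validator can be written as a plain Python dict. The collection and field names below are assumptions made for this example. The helper `missing_required` is a toy stand-in that checks top-level required fields only; it is not how MongoDB evaluates the schema.

```python
# Assumed shape for a hypothetical "orders" collection.
order_validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["customer_id", "status", "items"],
        "properties": {
            "customer_id": {"bsonType": "string"},
            "status": {"enum": ["pending", "paid", "shipped"]},
            "items": {
                "bsonType": "array",
                "minItems": 1,
                "items": {
                    "bsonType": "object",
                    "required": ["sku", "qty"],
                    "properties": {
                        "sku": {"bsonType": "string"},
                        "qty": {"bsonType": "int", "minimum": 1},
                    },
                },
            },
        },
    }
}

def missing_required(doc, validator):
    """Toy check of top-level required fields; not MongoDB's validation logic."""
    required = validator["$jsonSchema"].get("required", [])
    return [name for name in required if name not in doc]

# With PyMongo and a live server, the same dict would typically be passed as:
#   db.create_collection("orders", validator=order_validator)
```

Keeping the validator in version control next to the application code gives jobs and migrations the same rules the app follows.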

Partition / Shard Key

Pick a key with high cardinality and even access. tenant_id works only if tenants are roughly the same size — one whale tenant becomes a hot shard. Composite keys (tenant_id + user_id) usually distribute better. Once chosen, the key is hard to change.
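
The whale-tenant effect can be simulated with a toy hash-based placement. The sketch below assumes a made-up workload (one tenant with 9,000 documents, 100 tenants with 10 each) and eight shards. `shard_for` and `hottest_shard_share` are hypothetical helpers, not how any particular engine assigns shards.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 8  # assumed cluster size for the example

def shard_for(key, num_shards=NUM_SHARDS):
    """Map a key to a shard with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def hottest_shard_share(keys, num_shards=NUM_SHARDS):
    """Fraction of all documents that land on the busiest shard."""
    counts = Counter(shard_for(k, num_shards) for k in keys)
    return max(counts.values()) / len(keys)

# Hypothetical workload: one whale tenant plus 100 small tenants.
docs = [("whale", f"user-{i}") for i in range(9000)]
docs += [(f"tenant-{t}", f"user-{i}") for t in range(100) for i in range(10)]

by_tenant = hottest_shard_share([tenant for tenant, _ in docs])
by_composite = hottest_shard_share([f"{tenant}:{user}" for tenant, user in docs])
```

With `tenant_id` alone, every whale document hashes to the same shard, so that shard holds at least 90% of the data. The composite key spreads the same documents close to evenly. The trade-off is that a query for one whole tenant may then have to touch several shards.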

Pitfalls

Common Mistakes

Treating it like JSON-shaped SQL. Throwing normalized data into collections and joining client-side. You'll fight the engine forever.
Unbounded arrays. Comments embedded in a viral post grow until the document hits the size limit and writes start failing. Cap it or move to references.
No indexes on filter fields. Collection scans are silent for a while, then catastrophic at scale.
Trusting eventual consistency where you needed strong. Read-your-writes across replicas isn't guaranteed by default; pin reads to the primary or use causal consistency.
Skipping the access-pattern doc. DynamoDB punishes this most, but every document store rewards designing the access patterns before the schema.
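
For the unbounded-array pitfall, one common shape is to embed only the newest few items and keep the full history by reference. The sketch below is in-memory and hypothetical (`MAX_EMBEDDED`, `comments_collection`, and `add_comment` are illustrative names). In MongoDB, a similar cap on the embedded array can be expressed with `$push` combined with `$each` and `$slice`.

```python
MAX_EMBEDDED = 3          # assumed cap, kept small for the example
comments_collection = []  # full history, stored by reference to the post

def add_comment(post, comment):
    """Store every comment by reference; embed only the newest few."""
    comments_collection.append({"post_id": post["_id"], **comment})
    recent = post.get("recent_comments", []) + [comment]
    post["recent_comments"] = recent[-MAX_EMBEDDED:]  # bounded embedded array
    post["comment_count"] = post.get("comment_count", 0) + 1

post = {"_id": "post-1", "title": "Going viral"}
for i in range(5):
    add_comment(post, {"author": f"user-{i}", "text": f"c{i}"})
```

The post document stays a fixed size however many comments arrive, and the first page of comments still comes back with the post in a single read.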
