Database Family · Warehouse / Lake

Warehouses & Data Lakes

Columnar storage, massively parallel execution, separated compute and storage. Designed to scan billions of rows for one analytical answer — not to serve a request in 5ms. The engine behind dashboards, BI, and the modern data stack.

Columnar · OLAP · MPP · Parquet / Iceberg · Lakehouse
Quick Facts

OLTP vs OLAP

The Split That Justifies This Family

  • OLTP (your app DB): many small transactions, row-oriented, milliseconds, freshness matters most. Postgres, MySQL.
  • OLAP (warehouse): few enormous queries, column-oriented, seconds-to-minutes, throughput matters most. Snowflake, BigQuery.
  • Columnar: store one column's values together. Aggregations read only the columns the query needs and compress beautifully; see the sketch after this list.
  • MPP: massively parallel processing — query split across hundreds of cores, results combined.
  • Storage / compute separation: data lives in object storage (S3, GCS); compute is spun up on demand. Pay for what you query, scale independently.
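
A minimal sketch of the columnar advantage, against a hypothetical orders table (generic SQL; runs as-is on DuckDB or Postgres):

```sql
-- The table may have dozens of columns; this query names two.
-- A columnar engine reads only order_date and amount (each compressing
-- well, since a column's values resemble each other); a row store must
-- read every column of every row.
SELECT date_trunc('month', order_date) AS month,
       sum(amount)                     AS revenue
FROM orders
GROUP BY 1
ORDER BY 1;
```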
The Engines

Who Plays Here

Snowflake

The cloud warehouse that defined the modern shape — multi-cluster compute against shared storage, time travel, zero-copy clones. Cross-cloud.
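Both of those features are plain SQL; a sketch with a hypothetical orders table:

```sql
-- Time travel: the table as it looked an hour ago.
SELECT * FROM orders AT(OFFSET => -3600);

-- Zero-copy clone: an instant writable copy that shares storage
-- until either side diverges.
CREATE TABLE orders_dev CLONE orders;
```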

BigQuery

Google's serverless warehouse. No clusters to manage; run a query, get billed by the bytes it scans. Native ML, geospatial, streaming inserts.

Databricks

Lakehouse on Spark + Delta Lake. Strong on data engineering, ML, and notebooks. Photon engine for SQL competes with the warehouses.

Amazon Redshift

The original cloud warehouse. RA3 nodes separate storage from compute; Spectrum queries S3 directly.

ClickHouse

Open-source columnar DB. Brutally fast for analytics; powers product analytics tools (PostHog, Plausible) and observability platforms.

DuckDB

SQLite-shaped OLAP. An entire columnar engine in one binary; runs on your laptop or in your browser. Reshaping how analysts work locally.
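
The local workflow it enables, sketched against a hypothetical Parquet file: no server, no load step, just query the file.

```sql
-- In a duckdb shell; events.parquet is a hypothetical local file.
SELECT event_type, count(*) AS n
FROM read_parquet('events.parquet')
GROUP BY event_type
ORDER BY n DESC;
```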

Iceberg / Delta / Hudi

Open table formats over Parquet in object storage. ACID, time travel, schema evolution — without locking into a single vendor's compute.
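
One hedged example of what the format buys you: in Spark SQL, time travel on an Iceberg or Delta table (name hypothetical) is part of the query, and the snapshot history lives in the table format, not in any one engine.

```sql
-- The table as of a point in time; the table format tracks the
-- snapshots that make this answerable.
SELECT *
FROM lake.analytics.orders
TIMESTAMP AS OF '2024-01-15 00:00:00';
```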

Warehouse vs Lake vs Lakehouse

The Three Words That Get Confused

Data Warehouse

Curated, schema-on-write, governed. Data is loaded through pipelines, modeled, ready to query. Snowflake, BigQuery, Redshift in their classic form. The home of "the metric is right."

Data Lake

Raw files in object storage — Parquet, JSON, CSV — schema-on-read. Cheap, infinite, undisciplined. Great for keeping everything; bad for trusting any single answer. Pure lakes are increasingly rare.

Lakehouse

Object storage as the substrate, with an open table format (Iceberg, Delta, Hudi) layering ACID, time travel, and schema evolution on top. Multiple engines (Spark, Trino, Snowflake, DuckDB) read the same data. Where the industry is converging.

When Warehouses Win

What They're For

BI Dashboards & Reporting

Tableau, Looker, Power BI, Metabase pointed at the warehouse. Finance, growth, product analytics. The warehouse is the place where every team's data meets and the same definition of "active user" lives.

Big Aggregations

"Revenue by country by week for the last 3 years, joined with marketing spend." A query that scans terabytes and returns in seconds. OLTP databases would melt; columnar warehouses are built for exactly this.

ML Feature Engineering & Training Sets

Most ML training data starts as a warehouse query. Snowflake/BigQuery/Databricks all have ML pieces grafted on so you don't have to move the data to compute it.
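
A sketch of a training-set query (the events table and the label definition are hypothetical): one row per user, behavioral features plus a label, handed straight to the training job.

```sql
SELECT user_id,
       count(*)                                    AS events_30d,
       count(DISTINCT date_trunc('day', event_ts)) AS active_days_30d,
       max(CASE WHEN event_type = 'purchase'
                THEN 1 ELSE 0 END)                 AS label
FROM events
WHERE event_ts >= dateadd('day', -30, current_timestamp())
GROUP BY user_id;
```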

Reverse ETL & Activation

The warehouse is the source of truth; tools (Hightouch, Census) push the segments back to Salesforce, HubSpot, ad platforms. The warehouse becomes the operational data plane, not just the analytics one.

When to Stay Away

Bad Fits

  • App-facing low-latency reads. Even fast warehouses are tens-to-hundreds of milliseconds at best. Don't put one in your request path.
  • Many small writes. Warehouses love big batched loads, not single-row inserts. Stream into a buffer (Kafka, Kinesis) and micro-batch; see the sketch after this list.
  • Point lookups. "Get user 42's record" reads one row out of billions; row-oriented stores are an order of magnitude better at this.
  • Strict OLTP transactions. Warehouse "transactions" exist but are designed for bulk loads, not concurrent updates from a checkout flow.
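
What the micro-batch landing looks like, sketched in Snowflake syntax with hypothetical names: the buffer flushes files to a stage, and one bulk COPY replaces thousands of single-row inserts.

```sql
-- Load everything the buffer flushed for this window in one shot.
COPY INTO raw.events
FROM @events_stage/2024/01/15/
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```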
The Modern Data Stack

How Data Gets Into the Warehouse

Ingestion (E + L)

Fivetran, Airbyte, Stitch, log-based CDC (Debezium) — pull from operational databases, SaaS APIs, event streams; land in the warehouse with minimal transformation. ELT, not ETL: load first, transform inside the warehouse.

Transformation (T)

dbt is the dominant tool. SQL files + Jinja, version-controlled, tested, documented. Models build downstream models. Your fct_orders isn't a magic dashboard — it's a tested, lineage-tracked SQL artifact.
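
A sketch of what a model file contains (columns hypothetical): plain SQL plus Jinja ref() calls, which is how dbt learns the dependency graph and build order.

```sql
-- models/fct_orders.sql: dbt resolves each ref() to the upstream
-- model and builds those first.
SELECT o.order_id,
       o.customer_id,
       o.amount,
       c.country
FROM {{ ref('stg_orders') }} AS o
LEFT JOIN {{ ref('stg_customers') }} AS c
  ON c.customer_id = o.customer_id
```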

Orchestration

Airflow, Dagster, Prefect schedule the DAG: ingestion runs, then dbt builds, then dashboards refresh, then reverse-ETL syncs. Failures, retries, lineage, alerts.

Quality & Observability

dbt tests for schema and freshness, Great Expectations / Monte Carlo / Soda for data quality, OpenLineage for lineage. The warehouse is now a product — it deserves the same monitoring as one.
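
Schema and freshness checks live in YAML, but a dbt singular test is just a SQL file (this one is hypothetical): it selects the rows that violate the expectation, and any row returned fails the build.

```sql
-- tests/assert_no_negative_amounts.sql
SELECT order_id, amount
FROM {{ ref('fct_orders') }}
WHERE amount < 0
```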

Pitfalls

Common Mistakes

  • SELECT * on a wide table. Columnar engines reward selecting only what you need; * reads every column.
  • Surprise cost spikes. Snowflake credits, BigQuery bytes-scanned — one bad query in a loop can rack up thousands of dollars overnight. Set quotas and alerts.
  • Treating it like the OLTP DB. Updating one row at a time, micro-transactions, sub-second freshness expectations. Wrong tool.
  • No partitioning / clustering. Without partition pruning, every query scans the whole table. Partition by the time column you actually filter on; see the DDL sketch after this list.
  • One-warehouse-fits-all. Mixing ad-hoc analyst queries, dbt builds, and reverse-ETL on the same warehouse — they fight for resources. Separate workloads onto separate compute.
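
The partitioning fix, sketched as BigQuery-style DDL with hypothetical names:

```sql
-- Partition on the timestamp queries filter by; cluster on a second
-- commonly filtered column.
CREATE TABLE analytics.events
PARTITION BY DATE(event_ts)
CLUSTER BY country
AS SELECT * FROM raw_staging.events;

-- This predicate prunes to one day's partition instead of scanning
-- the whole table.
SELECT count(*)
FROM analytics.events
WHERE DATE(event_ts) = '2024-01-15';
```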