Columnar storage, massively parallel execution, separated compute and storage. Designed to scan billions of rows for one analytical answer — not to serve a request in 5ms. The engine behind dashboards, BI, and the modern data stack.
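A toy contrast of the two layouts in plain Python — field names are hypothetical, and this only illustrates the access pattern, not how any real engine is implemented:

```python
# Row-oriented: each record stored together (OLTP-friendly).
rows = [
    {"order_id": 1, "country": "DE", "amount": 120.0},
    {"order_id": 2, "country": "US", "amount": 80.0},
    {"order_id": 3, "country": "DE", "amount": 45.5},
]

# Column-oriented: one contiguous array per column (OLAP-friendly).
columns = {
    "order_id": [1, 2, 3],
    "country": ["DE", "US", "DE"],
    "amount": [120.0, 80.0, 45.5],
}

# Row store: SUM(amount) walks every record, dragging all fields along.
row_total = sum(r["amount"] for r in rows)

# Column store: the same aggregate touches exactly one array —
# sequential reads, heavy compression, SIMD-friendly at scale.
col_total = sum(columns["amount"])

assert row_total == col_total == 245.5
```

At billions of rows, the column version reads only the bytes of the one column it aggregates; the row version reads everything.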
Snowflake: the cloud warehouse that defined the modern shape — multi-cluster compute against shared storage, time travel, zero-copy clones. Cross-cloud.
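In Snowflake, time travel and cloning are plain SQL — table names here are hypothetical:

```sql
-- Time travel: query the table as it looked an hour ago (OFFSET in seconds)
SELECT * FROM orders AT(OFFSET => -3600);

-- Zero-copy clone: a new table sharing the same underlying micro-partitions,
-- no data copied until either side diverges
CREATE TABLE orders_dev CLONE orders;
```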
Google's serverless warehouse. No clusters to manage; query, get billed by bytes scanned. Native ML, geospatial, streaming inserts.
Lakehouse on Spark + Delta Lake. Strong on data engineering, ML, and notebooks. Photon engine for SQL competes with the warehouses.
The original cloud warehouse. RA3 nodes separate storage from compute; Spectrum queries S3 directly.
Open-source columnar DB. Brutally fast for analytics; powers product analytics tools (PostHog, Plausible) and observability platforms.
SQLite-shaped OLAP. An entire columnar engine in one binary; runs on your laptop or in your browser. Reshaping how analysts work locally.
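The local workflow in practice: DuckDB can query a Parquet file in place — no server, no load step (file and column names hypothetical):

```sql
SELECT country, sum(amount) AS revenue
FROM read_parquet('events.parquet')
GROUP BY country
ORDER BY revenue DESC;
```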
Open table formats over Parquet in object storage. ACID, time travel, schema evolution — without locking into a single vendor's compute.
Curated, schema-on-write, governed. Data is loaded through pipelines, modeled, ready to query. Snowflake, BigQuery, Redshift in their classic form. The home of "the metric is right."
Raw files in object storage — Parquet, JSON, CSV — schema-on-read. Cheap, infinite, undisciplined. Great for keeping everything; bad for trusting any single answer. Pure lakes are increasingly rare.
Object storage as the substrate, with an open table format (Iceberg, Delta, Hudi) layering ACID, time travel, and schema evolution on top. Multiple engines (Spark, Trino, Snowflake, DuckDB) read the same data. Where the industry is converging.
Tableau, Looker, Power BI, Metabase pointed at the warehouse. Finance, growth, product analytics. The warehouse is the place where every team's data meets and the same definition of "active user" lives.
"Revenue by country by week for the last 3 years, joined with marketing spend." A query that scans terabytes and returns in seconds. OLTP databases would melt; columnar warehouses are built for exactly this.
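The shape of that query, runnable at toy scale — sqlite3 here only stands in for the warehouse (it's a row store), and all table and column names are hypothetical:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (country TEXT, week TEXT, revenue REAL);
    CREATE TABLE marketing_spend (country TEXT, week TEXT, spend REAL);
    INSERT INTO orders VALUES
        ('DE', '2024-W01', 1000.0), ('DE', '2024-W01', 250.0),
        ('US', '2024-W01', 900.0);
    INSERT INTO marketing_spend VALUES
        ('DE', '2024-W01', 400.0), ('US', '2024-W01', 300.0);
""")

# Aggregate revenue per country per week, joined with spend — the same
# query a warehouse runs over terabytes instead of six rows.
rows = db.execute("""
    SELECT o.country, o.week, sum(o.revenue) AS revenue, s.spend
    FROM orders o
    JOIN marketing_spend s ON s.country = o.country AND s.week = o.week
    GROUP BY o.country, o.week, s.spend
    ORDER BY o.country
""").fetchall()
print(rows)  # [('DE', '2024-W01', 1250.0, 400.0), ('US', '2024-W01', 900.0, 300.0)]
```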
Most ML training data starts as a warehouse query. Snowflake/BigQuery/Databricks all have ML pieces grafted on so you don't have to move the data to compute it.
The warehouse is the source of truth; tools (Hightouch, Census) push the segments back to Salesforce, HubSpot, ad platforms. The warehouse becomes the operational data plane, not just the analytics one.
Fivetran, Airbyte, Stitch, native CDC (Debezium) — pull from operational databases, SaaS APIs, event streams; land in the warehouse with minimal transformation. ELT, not ETL: load first, transform inside the warehouse.
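The ELT split, sketched with sqlite3 standing in for the warehouse (all names hypothetical): the load step writes raw values with minimal typing, and the transform step is SQL run inside the warehouse afterwards.

```python
import sqlite3

wh = sqlite3.connect(":memory:")

# Load step: everything lands as raw text, exactly as extracted.
wh.execute("CREATE TABLE raw_orders (id TEXT, amount TEXT, created_at TEXT)")
wh.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", [
    ("1", "19.99", "2024-01-05"),
    ("2", "5.00",  "2024-01-06"),
])

# Transform step: typing, renaming, filtering happen in SQL, post-load —
# inside the warehouse, not in the pipeline.
wh.execute("""
    CREATE TABLE stg_orders AS
    SELECT CAST(id AS INTEGER) AS order_id,
           CAST(amount AS REAL) AS amount_usd,
           created_at AS ordered_on
    FROM raw_orders
""")
rows = wh.execute("SELECT * FROM stg_orders ORDER BY order_id").fetchall()
print(rows)  # [(1, 19.99, '2024-01-05'), (2, 5.0, '2024-01-06')]
```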
dbt is the dominant tool. SQL files + Jinja, version-controlled, tested, documented. Models build downstream models. Your fct_orders isn't a magic dashboard — it's a tested, lineage-tracked SQL artifact.
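A minimal dbt model shaped like that fct_orders — `{{ ref() }}` is what wires models into a dependency graph; the upstream model names are hypothetical:

```sql
-- models/fct_orders.sql
SELECT
    o.order_id,
    o.customer_id,
    o.ordered_at,
    sum(p.amount) AS amount_paid
FROM {{ ref('stg_orders') }} o
LEFT JOIN {{ ref('stg_payments') }} p ON p.order_id = o.order_id
GROUP BY o.order_id, o.customer_id, o.ordered_at
```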
Airflow, Dagster, Prefect schedule the DAG: ingestion runs, then dbt builds, then dashboards refresh, then reverse-ETL syncs. Failures, retries, lineage, alerts.
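The dependency ordering those schedulers enforce can be sketched with the standard library — the task names come from the line above; the real tools add schedules, retries, lineage, and alerting on top of exactly this shape:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it waits for.
dag = TopologicalSorter({
    "ingestion": set(),
    "dbt_build": {"ingestion"},
    "dashboard_refresh": {"dbt_build"},
    "reverse_etl_sync": {"dbt_build"},
})

# A valid execution order: ingestion, then dbt_build, then the two consumers.
order = list(dag.static_order())
print(order)
```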
dbt tests for schema and freshness, Great Expectations / Monte Carlo / Soda for data quality, OpenLineage for lineage. The warehouse is now a product — it deserves the same monitoring as one.
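Those dbt schema tests live in YAML next to the models — a minimal example for an fct_orders model (column names hypothetical):

```yaml
# models/schema.yml
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
```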
SELECT * reads every column — on a columnar engine, the savings only come when you name just the columns you need.