The unglamorous foundation of every AI system — ingesting, cleaning, joining, transforming, and labeling data so models have something useful to learn from. The famous "80% of the work."
| Category | Tools |
|---|---|
| In-memory analysis | pandas, Polars (Rust-built, much faster), DuckDB |
| Big data / distributed | Apache Spark (PySpark), Dask, Ray Data |
| SQL transforms | dbt, SQLMesh — version-controlled SQL pipelines in your warehouse |
| Workflow orchestration | Airflow, Prefect, Dagster, Argo Workflows |
| Streaming | Kafka + Flink, Spark Streaming, Beam |
| Data quality | Great Expectations, Soda, dbt tests, Pandera |
| Data warehouses | Snowflake, BigQuery, Databricks, Redshift, ClickHouse |
| Lakehouses / formats | Delta Lake, Apache Iceberg, Apache Hudi, Parquet |
| Ingestion / CDC | Fivetran, Airbyte, Debezium, Estuary |
| Labeling platforms | Label Studio, Snorkel, Scale AI, Surge, Prodigy |
| Synthetic data | Gretel, Mostly AI, Tonic — privacy-preserving fakes |
Get raw data from operational systems (databases, SaaS apps, files, streams) into a place where you can analyze it.
```python
import pandas as pd

df = pd.read_parquet("raw/orders.parquet")
df = (
    df
    .drop_duplicates(subset="order_id")       # one row per order
    .dropna(subset=["customer_id"])           # orphan orders are unusable
    .assign(
        # lambdas receive the *current* frame in the chain,
        # not the stale pre-pipeline df
        order_date=lambda d: pd.to_datetime(d.order_date),
        revenue=lambda d: d.qty * d.unit_price,
    )
    .query("revenue > 0")                     # drop refunds / zero-value rows
)
```
Modern teams increasingly do this in SQL via dbt, or in Polars / DuckDB for 10-100× speedups over pandas on large data.
Fail loud, fail early. Tests run as part of the pipeline:
Beyond classical ML data prep, LLM apps need: