AI Stack Layer · 8 of 8

Data Prep

The unglamorous foundation of every AI system — ingesting, cleaning, joining, transforming, and labeling data so models have something useful to learn from. The famous "80% of the work."

Quick Facts

At a Glance

Basic Concepts

  • ETL vs ELT: Extract → Transform → Load (older, transform in-pipeline) vs Extract → Load → Transform (modern, transform in-warehouse).
  • Schema-on-read vs schema-on-write: data lakes vs data warehouses.
  • Cleaning: dedupe, normalize, handle missing values & outliers.
  • Feature engineering: turn raw fields into model-ready inputs (encoding, scaling, lag features).
  • Labeling: the supervision signal — humans, rules, or other models.
Landscape

The Major Tools

  • In-memory analysis: pandas, Polars (Rust-built, much faster), DuckDB
  • Big data / distributed: Apache Spark (PySpark), Dask, Ray Data
  • SQL transforms: dbt, SQLMesh — version-controlled SQL pipelines in your warehouse
  • Workflow orchestration: Airflow, Prefect, Dagster, Argo Workflows
  • Streaming: Kafka + Flink, Spark Streaming, Beam
  • Data quality: Great Expectations, Soda, dbt tests, Pandera
  • Data warehouses: Snowflake, BigQuery, Databricks, Redshift, ClickHouse
  • Lakehouses / formats: Delta Lake, Apache Iceberg, Apache Hudi, Parquet
  • Ingestion / CDC: Fivetran, Airbyte, Debezium, Estuary
  • Labeling platforms: Label Studio, Snorkel, Scale AI, Surge, Prodigy
  • Synthetic data: Gretel, Mostly AI, Tonic — privacy-preserving fakes
Mechanics

The Data Prep Workflow

Ingestion

Get raw data from operational systems (databases, SaaS, files, streams) into a place you can analyze it.

  • Batch: nightly dumps via Fivetran / Airbyte / custom scripts.
  • CDC (Change Data Capture): stream every row change in near-real-time (Debezium, Estuary).
  • Streaming: Kafka / Kinesis for event firehoses.
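The batch pattern above usually hinges on an incremental "high-water mark": each run pulls only rows modified since the last successful load. A minimal sketch, using an in-memory list as a stand-in for a real source table (`extract_since` and the column names are illustrative, not from any real tool):

```python
# Incremental batch ingestion via a high-water mark (toy in-memory "source").
rows = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-01-03"},
    {"id": 3, "updated_at": "2024-01-05"},
]

def extract_since(source, watermark):
    """Return only rows modified after the last successful load."""
    return [r for r in source if r["updated_at"] > watermark]

last_watermark = "2024-01-02"            # persisted from the previous run
batch = extract_since(rows, last_watermark)
new_watermark = max(r["updated_at"] for r in batch)  # persist for next run
```

Tools like Fivetran and Airbyte manage this bookkeeping (plus retries and schema drift) for you; CDC replaces the polling entirely by tailing the database's change log.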
Cleaning & Transformation
import pandas as pd

df = pd.read_parquet("raw/orders.parquet")
df = (
    df
    .drop_duplicates(subset="order_id")   # one row per order
    .dropna(subset=["customer_id"])       # drop rows with no customer
    .assign(
        # Use lambdas so these operate on the already-deduped/filtered
        # frame, not the original df captured from the outer scope.
        order_date=lambda d: pd.to_datetime(d.order_date),
        revenue=lambda d: d.qty * d.unit_price,
    )
    .query("revenue > 0")                 # drop refunds and bad rows
)

Modern teams increasingly do this in SQL via dbt, or in Polars / DuckDB, which are often 10-100× faster than pandas on large data.

Feature Engineering
  • Encoding: categorical → one-hot, target, hashing.
  • Scaling: standardize / normalize for distance-based models.
  • Aggregations: rolling sums, group-by means, lags, recency.
  • Datetime: hour-of-day, day-of-week, holiday flags.
  • Text/Image: embeddings (often produced by a foundation model).
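Three of the transforms above — one-hot encoding, datetime features, and a lag feature — can be sketched in plain Python (the records and field names here are toy examples, not a real schema):

```python
from datetime import datetime

rows = [
    {"city": "NYC", "ts": "2024-03-04T09:30:00", "amount": 120.0},
    {"city": "SF",  "ts": "2024-03-05T22:10:00", "amount": 80.0},
    {"city": "NYC", "ts": "2024-03-09T14:00:00", "amount": 200.0},
]

# 1) One-hot encode a categorical field over its observed vocabulary.
cities = sorted({r["city"] for r in rows})
def one_hot(city):
    return [1 if city == c else 0 for c in cities]

# 2) Datetime features: hour-of-day and day-of-week (Mon=0).
def time_feats(ts):
    dt = datetime.fromisoformat(ts)
    return [dt.hour, dt.weekday()]

# 3) Lag feature: previous row's amount (0.0 for the first row).
amounts = [r["amount"] for r in rows]
lags = [0.0] + amounts[:-1]

features = [
    one_hot(r["city"]) + time_feats(r["ts"]) + [lag]
    for r, lag in zip(rows, lags)
]
```

In practice you would do this with pandas/Polars or scikit-learn transformers, and fit encoders on training data only to avoid leakage.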
Data Quality & Validation

Fail loud, fail early. Tests run as part of the pipeline:

  • Schema: column exists, type is right.
  • Range: price > 0, percentage in [0,100].
  • Cardinality: distinct values within bounds.
  • Freshness: latest timestamp not older than X.
  • Uniqueness & nulls.
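Hand-rolled versions of these checks fit in a few lines; in production you would express the same assertions in Great Expectations, Pandera, or dbt tests. The row shape and thresholds below are illustrative:

```python
from datetime import datetime, timedelta

def validate(rows, now):
    """Run pipeline-style data-quality checks; return a list of failures."""
    errors = []
    ids = [r.get("order_id") for r in rows]
    if any(i is None for i in ids):                 # schema: column present
        errors.append("schema: missing order_id")
    if len(ids) != len(set(ids)):                   # uniqueness
        errors.append("uniqueness: duplicate order_id")
    if any(r["price"] <= 0 for r in rows):          # range
        errors.append("range: price <= 0")
    newest = max(datetime.fromisoformat(r["ts"]) for r in rows)
    if now - newest > timedelta(hours=24):          # freshness
        errors.append("freshness: data is stale")
    return errors

rows = [
    {"order_id": 1, "price": 9.99, "ts": "2024-06-01T10:00:00"},
    {"order_id": 2, "price": -1.0, "ts": "2024-06-02T08:00:00"},
]
errors = validate(rows, now=datetime(2024, 6, 2, 12, 0))
```

A failing check should block the downstream pipeline run (or at least page someone), not silently pass bad rows to the model.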
Labeling for ML
  • Manual labeling in tools like Label Studio (with QA & agreement scoring).
  • Programmatic labeling (Snorkel) — write weak rules, train a label model.
  • LLM-assisted labeling — use a strong model to pre-label; humans verify.
  • Active learning — label only the examples the model is most uncertain about.
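The active-learning idea reduces to uncertainty sampling: label the examples whose predicted probability sits closest to the decision boundary. A toy sketch with hard-coded scores standing in for a real model's predictions:

```python
# Uncertainty sampling: smallest margin from p = 0.5 gets labeled first.
pool = {
    "doc_a": 0.98,   # model is confident -> low labeling value
    "doc_b": 0.52,   # near the boundary  -> label this first
    "doc_c": 0.10,
    "doc_d": 0.47,
}

def most_uncertain(scores, k):
    """Return the k item ids closest to the 0.5 decision boundary."""
    return sorted(scores, key=lambda d: abs(scores[d] - 0.5))[:k]

to_label = most_uncertain(pool, k=2)
```

In a real loop you would retrain after each labeled batch, re-score the pool, and repeat; entropy or margin-between-top-two-classes generalizes this beyond binary classification.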
Data Prep for LLM Apps

Beyond classical ML data prep, LLM apps need:

  • Document chunking — split long documents into RAG-friendly pieces (semantic, fixed-size, recursive).
  • Cleaning — strip nav/ads from HTML, OCR PDFs, normalize whitespace.
  • Embedding generation — batch-call an embedding API; store in a vector DB.
  • Metadata enrichment — tag chunks with source, date, ACL for filtering.
  • Eval datasets — golden Q&A pairs to measure RAG quality over time.
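Of the chunking strategies above, fixed-size with overlap is the simplest to sketch. This version splits on characters for brevity; production pipelines usually split on tokens and prefer semantic or recursive splitting:

```python
def chunk(text, chunk_size=100, overlap=20):
    """Split text into fixed-size pieces, each overlapping the last."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "x" * 250
chunks = chunk(doc)
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk, at the cost of some index duplication.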