AI Stack Layer · 8 of 8

Data Prep

The unglamorous foundation of every AI system — ingesting, cleaning, joining, transforming, and labeling data so models have something useful to learn from. The famous "80% of the work."

Quick Facts

At a Glance

Basic Concepts

  • ETL vs ELT: Extract → Transform → Load (older, transform in-pipeline) vs Extract → Load → Transform (modern, transform in-warehouse).
  • Schema-on-read vs schema-on-write: data lakes vs data warehouses.
  • Cleaning: dedupe, normalize, handle missing values & outliers.
  • Feature engineering: turn raw fields into model-ready inputs (encoding, scaling, lag features).
  • Labeling: the supervision signal — humans, rules, or other models.
Landscape

The Major Tools

  • In-memory analysis: pandas, Polars (Rust-built, much faster), DuckDB
  • Big data / distributed: Apache Spark (PySpark), Dask, Ray Data
  • SQL transforms: dbt, SQLMesh — version-controlled SQL pipelines in your warehouse
  • Workflow orchestration: Airflow, Prefect, Dagster, Argo Workflows
  • Streaming: Kafka + Flink, Spark Streaming, Beam
  • Data quality: Great Expectations, Soda, dbt tests, Pandera
  • Data warehouses: Snowflake, BigQuery, Databricks, Redshift, ClickHouse
  • Lakehouses / formats: Delta Lake, Apache Iceberg, Apache Hudi, Parquet
  • Ingestion / CDC: Fivetran, Airbyte, Debezium, Estuary
  • Labeling platforms: Label Studio, Snorkel, Scale AI, Surge, Prodigy
  • Synthetic data: Gretel, Mostly AI, Tonic — privacy-preserving fakes
Mechanics

The Data Prep Workflow

Ingestion

Get raw data from operational systems (databases, SaaS, files, streams) into a place you can analyze it.

  • Batch: nightly dumps via Fivetran / Airbyte / custom scripts.
  • CDC (Change Data Capture): stream every row change in near-real-time (Debezium, Estuary).
  • Streaming: Kafka / Kinesis for event firehoses.
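The batch pattern above usually hinges on an incremental "high-water mark": each run pulls only rows modified since the last successful load. A minimal sketch, using an in-memory list as a stand-in for a real source table (`extract_since` and the column names are illustrative, not from any real tool):

```python
# Incremental batch ingestion via a high-water mark (toy in-memory "source").
rows = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-01-03"},
    {"id": 3, "updated_at": "2024-01-05"},
]

def extract_since(source, watermark):
    """Return only rows modified after the last successful load."""
    return [r for r in source if r["updated_at"] > watermark]

last_watermark = "2024-01-02"            # persisted from the previous run
batch = extract_since(rows, last_watermark)
new_watermark = max(r["updated_at"] for r in batch)  # persist for next run
```

Tools like Fivetran and Airbyte manage this bookkeeping (plus retries and schema drift) for you; CDC replaces the polling entirely by tailing the database's change log.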
Cleaning & Transformation
import pandas as pd

df = pd.read_parquet("raw/orders.parquet")
df = (
    df
    .drop_duplicates(subset="order_id")   # one row per order
    .dropna(subset=["customer_id"])       # drop rows with no customer
    .assign(
        # Use lambdas so these operate on the already-deduped/filtered
        # frame, not the original df captured from the outer scope.
        order_date=lambda d: pd.to_datetime(d.order_date),
        revenue=lambda d: d.qty * d.unit_price,
    )
    .query("revenue > 0")                 # drop refunds and bad rows
)

Modern teams increasingly do this in SQL via dbt, or in Polars / DuckDB, which are often 10-100× faster than pandas on large data.

Feature Engineering
  • Encoding: categorical → one-hot, target, hashing.
  • Scaling: standardize / normalize for distance-based models.
  • Aggregations: rolling sums, group-by means, lags, recency.
  • Datetime: hour-of-day, day-of-week, holiday flags.
  • Text/Image: embeddings (often produced by a foundation model).
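Three of the transforms above — one-hot encoding, datetime features, and a lag feature — can be sketched in plain Python (the records and field names here are toy examples, not a real schema):

```python
from datetime import datetime

rows = [
    {"city": "NYC", "ts": "2024-03-04T09:30:00", "amount": 120.0},
    {"city": "SF",  "ts": "2024-03-05T22:10:00", "amount": 80.0},
    {"city": "NYC", "ts": "2024-03-09T14:00:00", "amount": 200.0},
]

# 1) One-hot encode a categorical field over its observed vocabulary.
cities = sorted({r["city"] for r in rows})
def one_hot(city):
    return [1 if city == c else 0 for c in cities]

# 2) Datetime features: hour-of-day and day-of-week (Mon=0).
def time_feats(ts):
    dt = datetime.fromisoformat(ts)
    return [dt.hour, dt.weekday()]

# 3) Lag feature: previous row's amount (0.0 for the first row).
amounts = [r["amount"] for r in rows]
lags = [0.0] + amounts[:-1]

features = [
    one_hot(r["city"]) + time_feats(r["ts"]) + [lag]
    for r, lag in zip(rows, lags)
]
```

In practice you would do this with pandas/Polars or scikit-learn transformers, and fit encoders on training data only to avoid leakage.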
Data Quality & Validation

Fail loud, fail early. Tests run as part of the pipeline:

  • Schema: column exists, type is right.
  • Range: price > 0, percentage in [0,100].
  • Cardinality: distinct values within bounds.
  • Freshness: latest timestamp not older than X.
  • Uniqueness & nulls.
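Hand-rolled versions of these checks fit in a few lines; in production you would express the same assertions in Great Expectations, Pandera, or dbt tests. The row shape and thresholds below are illustrative:

```python
from datetime import datetime, timedelta

def validate(rows, now):
    """Run pipeline-style data-quality checks; return a list of failures."""
    errors = []
    ids = [r.get("order_id") for r in rows]
    if any(i is None for i in ids):                 # schema: column present
        errors.append("schema: missing order_id")
    if len(ids) != len(set(ids)):                   # uniqueness
        errors.append("uniqueness: duplicate order_id")
    if any(r["price"] <= 0 for r in rows):          # range
        errors.append("range: price <= 0")
    newest = max(datetime.fromisoformat(r["ts"]) for r in rows)
    if now - newest > timedelta(hours=24):          # freshness
        errors.append("freshness: data is stale")
    return errors

rows = [
    {"order_id": 1, "price": 9.99, "ts": "2024-06-01T10:00:00"},
    {"order_id": 2, "price": -1.0, "ts": "2024-06-02T08:00:00"},
]
errors = validate(rows, now=datetime(2024, 6, 2, 12, 0))
```

A failing check should block the downstream pipeline run (or at least page someone), not silently pass bad rows to the model.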
Labeling for ML
  • Manual labeling in tools like Label Studio (with QA & agreement scoring).
  • Programmatic labeling (Snorkel) — write weak rules, train a label model.
  • LLM-assisted labeling — use a strong model to pre-label; humans verify.
  • Active learning — label only the examples the model is most uncertain about.
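The active-learning idea reduces to uncertainty sampling: label the examples whose predicted probability sits closest to the decision boundary. A toy sketch with hard-coded scores standing in for a real model's predictions:

```python
# Uncertainty sampling: smallest margin from p = 0.5 gets labeled first.
pool = {
    "doc_a": 0.98,   # model is confident -> low labeling value
    "doc_b": 0.52,   # near the boundary  -> label this first
    "doc_c": 0.10,
    "doc_d": 0.47,
}

def most_uncertain(scores, k):
    """Return the k item ids closest to the 0.5 decision boundary."""
    return sorted(scores, key=lambda d: abs(scores[d] - 0.5))[:k]

to_label = most_uncertain(pool, k=2)
```

In a real loop you would retrain after each labeled batch, re-score the pool, and repeat; entropy or margin-between-top-two-classes generalizes this beyond binary classification.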
Data Prep for LLM Apps

Beyond classical ML data prep, LLM apps need:

  • Document chunking — split long documents into RAG-friendly pieces (semantic, fixed-size, recursive).
  • Cleaning — strip nav/ads from HTML, OCR PDFs, normalize whitespace.
  • Embedding generation — batch-call an embedding API; store in a vector DB.
  • Metadata enrichment — tag chunks with source, date, ACL for filtering.
  • Eval datasets — golden Q&A pairs to measure RAG quality over time.
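Of the chunking strategies above, fixed-size with overlap is the simplest to sketch. This version splits on characters for brevity; production pipelines usually split on tokens and prefer semantic or recursive splitting:

```python
def chunk(text, chunk_size=100, overlap=20):
    """Split text into fixed-size pieces, each overlapping the last."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "x" * 250
chunks = chunk(doc)
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk, at the cost of some index duplication.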