AI Stack Layer · 6 of 8

MLOps & LLMOps

DevOps for ML. Track experiments, manage models, run training pipelines, deploy to production, and monitor everything — for both classical ML and LLM applications.

Experiment Tracking · Model Registry · Pipelines · Monitoring · Layer 6
Quick Facts

At a Glance

Basic Concepts

  • Experiment tracking: log every run's params, metrics, code, and artifacts so any result can be reproduced.
  • Model registry: versioned, stage-tagged storage for trained models (Staging → Production).
  • Pipelines: reproducible DAGs that turn raw data into trained, evaluated, deployed models.
  • Feature store: central source of computed features, shared between training and serving (see the Feast sketch after this list).
  • LLMOps adds: prompt versioning, eval harnesses, tracing, cost & token monitoring.
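
A minimal sketch of the serving half of a feature store, using Feast (it appears in the Landscape table below). The repo path, feature names, and entity key here are assumptions for illustration:

from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at a Feast feature repo

# Low-latency lookup of the same features the training pipeline used
features = store.get_online_features(
    features=["driver_hourly_stats:avg_trips", "driver_hourly_stats:rating"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()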
Landscape

The Major Tools

Tool | Layer | Best for
MLflow | Tracking + Registry | Open-source standard for experiments & models.
Weights & Biases | Tracking + Sweeps | Beautiful UI, hyperparameter sweeps, collaboration.
Vertex AI | Full platform (GCP) | Pipelines, registry, serving, monitoring on Google Cloud.
Amazon SageMaker | Full platform (AWS) | Notebooks → training jobs → endpoints, all on AWS.
Azure Machine Learning | Full platform (Azure) | Microsoft's MLOps suite.
Databricks | Lakehouse + ML | Spark + MLflow + serving in one platform.
Kubeflow | K8s-native | Open-source ML pipelines on Kubernetes.
Metaflow | Workflow framework | Netflix's framework; simple Python decorators.
DVC | Data versioning | Git-like versioning for datasets & models.
Feast / Tecton | Feature store | Train & serve from the same feature definitions.
BentoML / KServe / Triton | Model serving | Productionize models behind a REST/gRPC API.
Ray | Distributed compute | Distributed training, tuning, serving, RL.
Mechanics

The MLOps Lifecycle

Experiment Tracking

Every training run logs its hyperparameters, metrics over time, the code's git SHA, the dataset version, the environment, and artifacts (model file, plots). The same UI then lets you compare runs, reproduce winners, and roll back losers.

import mlflow
import mlflow.sklearn  # log_model lives in a flavor module (sklearn assumed here)

with mlflow.start_run():
    mlflow.log_params({"lr": 0.001, "epochs": 10})
    # … train, producing `model` and its validation accuracy `acc` …
    mlflow.log_metric("accuracy", acc)
    mlflow.sklearn.log_model(model, "model")
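
Comparing runs later is one call: mlflow.search_runs returns logged params and metrics as a pandas DataFrame (the lr and accuracy column names assume the run above):

import mlflow

# Top runs by logged accuracy, newest experiments first
best = mlflow.search_runs(order_by=["metrics.accuracy DESC"], max_results=5)
print(best[["run_id", "params.lr", "metrics.accuracy"]])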
Pipelines & Reproducibility

An ML pipeline is a DAG: ingest → validate → preprocess → train → evaluate → register → deploy. Each step has versioned inputs/outputs. Re-running with the same config gives the same model.
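
As a sketch of that shape, here is the same DAG as a Metaflow flow (Metaflow is in the table above); the step bodies are placeholders rather than a working pipeline:

from metaflow import FlowSpec, step

class TrainPipeline(FlowSpec):
    @step
    def start(self):
        # ingest + validate raw data
        self.next(self.train)

    @step
    def train(self):
        # preprocess + fit; Metaflow versions any self.* artifacts per run
        self.next(self.end)

    @step
    def end(self):
        # evaluate and, if metrics pass, register the model
        pass

if __name__ == "__main__":
    TrainPipeline()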

Model Registry & Deployment
  • Stages: None → Staging → Production → Archived.
  • Aliases / Champion-Challenger: traffic-split between candidate models (see the registry sketch after this list).
  • Serving: REST/gRPC endpoint, batch prediction, edge deploy, embedded.
  • Shadow mode: a new model receives real traffic but its predictions don't ship — you compare offline.
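
A minimal sketch of promotion with MLflow's registry: register the model a run logged, then point a champion alias at that version. The model name, alias, and the <run_id> placeholder are assumptions:

import mlflow
from mlflow import MlflowClient

# Register the artifact logged under runs:/<run_id>/model, then alias it
result = mlflow.register_model("runs:/<run_id>/model", "churn-model")
MlflowClient().set_registered_model_alias("churn-model", "champion", result.version)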
Monitoring
  • Operational: latency, error rate, cost (same as any service).
  • Data drift: are inputs today statistically different from training data? (See the drift check after this list.)
  • Concept drift: are labels / outcomes shifting?
  • Quality: precision, recall, AUC over time on labeled samples.
  • For LLMs: token usage, hallucination rate, eval scores, tool-call success.
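
One way to implement the data-drift check, as a sketch: compare a feature's live distribution against its training distribution with a two-sample Kolmogorov-Smirnov test from scipy. The two arrays and the 0.01 threshold are assumptions:

from scipy.stats import ks_2samp

# train_feature / live_feature: 1-D samples of the same input feature,
# one from the training set and one from recent serving traffic
stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:  # threshold is an assumption; tune per feature
    print("possible data drift on this feature")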
LLMOps — What's Different
Concern | LLMOps tools
Prompt versioning & evals | Langfuse, LangSmith, Promptfoo, Braintrust
Tracing every call | Langfuse, Phoenix (Arize), Helicone, W&B Weave
Cost & token tracking | OpenMeter, Helicone, native provider dashboards
Guardrails / safety | NVIDIA NeMo Guardrails, Guardrails AI
Fine-tuning ops | Provider APIs + W&B / MLflow tracking
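
The table lists hosted tools, but the two cheapest ideas are easy to sketch without any SDK: version prompts like code, and derive per-call cost from token counts. The prompt text and per-token prices below are illustrative assumptions, not real provider rates:

PROMPTS = {
    ("summarize", "v2"): "Summarize the following text in three bullets:\n{text}",
}

def render(name: str, version: str, **kwargs) -> str:
    # Pin each call to an exact prompt version so eval scores stay comparable
    return PROMPTS[(name, version)].format(**kwargs)

def call_cost(prompt_tokens: int, completion_tokens: int,
              in_price=0.15e-6, out_price=0.60e-6) -> float:
    # Prices are placeholders; read real per-token rates from your provider
    return prompt_tokens * in_price + completion_tokens * out_price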