AI Stack Layer · 6 of 8

MLOps & LLMOps

DevOps for ML. Track experiments, manage models, run training pipelines, deploy to production, and monitor everything — for both classical ML and LLM applications.

Experiment Tracking · Model Registry · Pipelines · Monitoring · Layer 6
Quick Facts

At a Glance

Basic Concepts

  • Experiment tracking: log every run's params, metrics, code, and artifacts so any result can be reproduced.
  • Model registry: versioned, stage-tagged storage for trained models (Staging → Production).
  • Pipelines: reproducible DAGs that turn raw data into trained, evaluated, deployed models.
  • Feature store: central source of computed features, shared between training and serving (see the Feast sketch after this list).
  • LLMOps adds: prompt versioning, eval harnesses, tracing, cost & token monitoring.
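
A minimal sketch of the serving half of a feature store, using Feast (it appears in the Landscape table below). The repo path, feature names, and entity key here are assumptions for illustration:

from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at a Feast feature repo

# Low-latency lookup of the same features the training pipeline used
features = store.get_online_features(
    features=["driver_hourly_stats:avg_trips", "driver_hourly_stats:rating"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()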
Landscape

The Major Tools

Tool | Layer | Best for
MLflow | Tracking + Registry | Open-source standard for experiments & models.
Weights & Biases | Tracking + Sweeps | Beautiful UI, hyperparameter sweeps, collaboration.
Vertex AI | Full platform (GCP) | Pipelines, registry, serving, monitoring on Google Cloud.
Amazon SageMaker | Full platform (AWS) | Notebooks → training jobs → endpoints, all on AWS.
Azure Machine Learning | Full platform (Azure) | Microsoft's MLOps suite.
Databricks | Lakehouse + ML | Spark + MLflow + serving in one platform.
Kubeflow | K8s-native | Open-source ML pipelines on Kubernetes.
Metaflow | Workflow framework | Netflix's framework; simple Python decorators.
DVC | Data versioning | Git-like versioning for datasets & models.
Feast / Tecton | Feature store | Train & serve from the same feature definitions.
BentoML / KServe / Triton | Model serving | Productionize models behind a REST/gRPC API.
Ray | Distributed compute | Distributed training, tuning, serving, RL.
Mechanics

The MLOps Lifecycle

Experiment Tracking

Every training run logs its hyperparameters, metrics over time, the code's git SHA, the dataset version, the environment, and artifacts (model file, plots). The same UI then lets you compare runs, reproduce winners, and roll back losers.

import mlflow
import mlflow.sklearn  # log_model lives in a flavor module (sklearn assumed here)

with mlflow.start_run():
    mlflow.log_params({"lr": 0.001, "epochs": 10})
    # … train, producing `model` and its validation accuracy `acc` …
    mlflow.log_metric("accuracy", acc)
    mlflow.sklearn.log_model(model, "model")
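
Comparing runs later is one call: mlflow.search_runs returns logged params and metrics as a pandas DataFrame (the lr and accuracy column names assume the run above):

import mlflow

# Top runs by logged accuracy, newest experiments first
best = mlflow.search_runs(order_by=["metrics.accuracy DESC"], max_results=5)
print(best[["run_id", "params.lr", "metrics.accuracy"]])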
Pipelines & Reproducibility

An ML pipeline is a DAG: ingest → validate → preprocess → train → evaluate → register → deploy. Each step has versioned inputs/outputs. Re-running with the same config gives the same model.
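
As a sketch of that shape, here is the same DAG as a Metaflow flow (Metaflow is in the table above); the step bodies are placeholders rather than a working pipeline:

from metaflow import FlowSpec, step

class TrainPipeline(FlowSpec):
    @step
    def start(self):
        # ingest + validate raw data
        self.next(self.train)

    @step
    def train(self):
        # preprocess + fit; Metaflow versions any self.* artifacts per run
        self.next(self.end)

    @step
    def end(self):
        # evaluate and, if metrics pass, register the model
        pass

if __name__ == "__main__":
    TrainPipeline()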

Model Registry & Deployment
  • Stages: None → Staging → Production → Archived.
  • Aliases / Champion-Challenger: traffic-split between candidate models (see the registry sketch after this list).
  • Serving: REST/gRPC endpoint, batch prediction, edge deploy, embedded.
  • Shadow mode: a new model receives real traffic but its predictions don't ship — you compare offline.
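
A minimal sketch of promotion with MLflow's registry: register the model a run logged, then point a champion alias at that version. The model name, alias, and the <run_id> placeholder are assumptions:

import mlflow
from mlflow import MlflowClient

# Register the artifact logged under runs:/<run_id>/model, then alias it
result = mlflow.register_model("runs:/<run_id>/model", "churn-model")
MlflowClient().set_registered_model_alias("churn-model", "champion", result.version)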
Monitoring
  • Operational: latency, error rate, cost (same as any service).
  • Data drift: are inputs today statistically different from training data? (See the drift check after this list.)
  • Concept drift: are labels / outcomes shifting?
  • Quality: precision, recall, AUC over time on labeled samples.
  • For LLMs: token usage, hallucination rate, eval scores, tool-call success.
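
One way to implement the data-drift check, as a sketch: compare a feature's live distribution against its training distribution with a two-sample Kolmogorov-Smirnov test from scipy. The two arrays and the 0.01 threshold are assumptions:

from scipy.stats import ks_2samp

# train_feature / live_feature: 1-D samples of the same input feature,
# one from the training set and one from recent serving traffic
stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:  # threshold is an assumption; tune per feature
    print("possible data drift on this feature")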
LLMOps — What's Different
Concern | LLMOps tools
Prompt versioning & evals | Langfuse, LangSmith, Promptfoo, Braintrust
Tracing every call | Langfuse, Phoenix (Arize), Helicone, W&B Weave
Cost & token tracking | OpenMeter, Helicone, native provider dashboards
Guardrails / safety | NVIDIA NeMo Guardrails, Guardrails AI
Fine-tuning ops | Provider APIs + W&B / MLflow tracking
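
The table lists hosted tools, but the two cheapest ideas are easy to sketch without any SDK: version prompts like code, and derive per-call cost from token counts. The prompt text and per-token prices below are illustrative assumptions, not real provider rates:

PROMPTS = {
    ("summarize", "v2"): "Summarize the following text in three bullets:\n{text}",
}

def render(name: str, version: str, **kwargs) -> str:
    # Pin each call to an exact prompt version so eval scores stay comparable
    return PROMPTS[(name, version)].format(**kwargs)

def call_cost(prompt_tokens: int, completion_tokens: int,
              in_price=0.15e-6, out_price=0.60e-6) -> float:
    # Prices are placeholders; read real per-token rates from your provider
    return prompt_tokens * in_price + completion_tokens * out_price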