DevOps for ML. Track experiments, manage models, run training pipelines, deploy to production, and monitor everything — for both classical ML and LLM applications.
| Tool | Layer | Best for |
|---|---|---|
| MLflow | Tracking + Registry | Open-source standard for experiments & models. |
| Weights & Biases | Tracking + Sweeps | Beautiful UI, hyperparameter sweeps, collaboration. |
| Vertex AI | Full platform (GCP) | Pipelines, registry, serving, monitoring on Google Cloud. |
| Amazon SageMaker | Full platform (AWS) | Notebooks → training jobs → endpoints, all on AWS. |
| Azure Machine Learning | Full platform (Azure) | Microsoft's MLOps suite. |
| Databricks | Lakehouse + ML | Spark + MLflow + serving in one platform. |
| Kubeflow | K8s-native | Open-source ML pipelines on Kubernetes. |
| Metaflow | Workflow framework | Netflix's framework — simple Python decorators. |
| DVC | Data versioning | Git-like versioning for datasets & models. |
| Feast / Tecton | Feature store | Train & serve from the same feature definitions. |
| BentoML / KServe / Triton | Model serving | Productionize models behind a REST/gRPC API. |
| Ray | Distributed compute | Distributed training, tuning, serving, RL. |

Every training run should log its hyperparameters, metrics over time, the Git commit SHA of the code, the dataset version, the environment, and artifacts (model file, plots). With all of that in one UI, you can compare runs, reproduce the winners, and roll back the losers.
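
```python
import mlflow
import mlflow.sklearn  # assumption: a scikit-learn model; use the flavor matching your framework

with mlflow.start_run():
    mlflow.log_params({"lr": 0.001, "epochs": 10})
    # … train … (expected to produce `model` and `acc`)
    mlflow.log_metric("accuracy", acc)
    # Models are logged through a flavor module such as mlflow.sklearn.
    mlflow.sklearn.log_model(model, "model")
```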
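
To make the list of logged items concrete, here is a minimal, library-free sketch of the record a tracker might store per run. It is not MLflow's internal format: the helper names (`git_sha`, `dataset_version`, `make_run_record`), the record layout, and the sample data are all assumptions made for illustration, and the accuracy value is a placeholder rather than a real result.

```python
import hashlib
import json
import platform
import subprocess
import sys
import time


def git_sha() -> str:
    """Return the current commit SHA, or "unknown" outside a Git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def dataset_version(data: bytes) -> str:
    """Content hash of the dataset bytes; it changes whenever the data changes."""
    return hashlib.sha256(data).hexdigest()[:12]


def make_run_record(params: dict, metrics: dict, data: bytes) -> dict:
    """Bundle what is needed to compare or reproduce a run (hypothetical layout)."""
    return {
        "started_at": time.time(),
        "params": params,
        "metrics": metrics,
        "git_sha": git_sha(),
        "dataset_version": dataset_version(data),
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
    }


record = make_run_record(
    params={"lr": 0.001, "epochs": 10},
    metrics={"accuracy": 0.93},  # placeholder value, not a real result
    data=b"id,label\n1,0\n2,1\n",  # tiny stand-in dataset
)
print(json.dumps(record, indent=2))
```

Hashing the dataset content rather than its filename means two runs share a `dataset_version` only when they saw identical bytes, which is what makes a comparison between runs meaningful.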
An ML pipeline is a DAG: ingest → validate → preprocess → train → evaluate → register → deploy. Each step has versioned inputs and outputs, so re-running with the same config, data version, and random seeds should reproduce the same model.
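
The DAG idea can be sketched with the standard library alone. The example below is an illustration under stated assumptions, not how any particular orchestrator works: the `PIPELINE` mapping, `step_version`, and `plan` are hypothetical names, and a step's "version" is simply a hash of its name, its config, and the versions of its upstream steps.

```python
import hashlib
import json
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on.
PIPELINE = {
    "ingest": set(),
    "validate": {"ingest"},
    "preprocess": {"validate"},
    "train": {"preprocess"},
    "evaluate": {"train"},
    "register": {"evaluate"},
    "deploy": {"register"},
}


def step_version(name: str, config: dict, upstream: list[str]) -> str:
    """Hash the step name, its config, and the versions of its inputs."""
    payload = json.dumps([name, config, sorted(upstream)], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:10]


def plan(config: dict) -> dict[str, str]:
    """Walk the DAG in dependency order and assign a version to every step."""
    versions: dict[str, str] = {}
    for name in TopologicalSorter(PIPELINE).static_order():
        upstream = [versions[dep] for dep in PIPELINE[name]]
        versions[name] = step_version(name, config.get(name, {}), upstream)
    return versions


base = plan({"train": {"lr": 0.001, "epochs": 10, "seed": 42}})
same = plan({"train": {"lr": 0.001, "epochs": 10, "seed": 42}})
changed = plan({"train": {"lr": 0.01, "epochs": 10, "seed": 42}})

print(base == same)                                  # identical config, identical versions
print(base["preprocess"] == changed["preprocess"])   # steps upstream of train are unaffected
print(base["train"] == changed["train"])             # train and everything after it change
```

Because each version depends on its upstream versions, changing the training config invalidates `train` and every later step while leaving `ingest`, `validate`, and `preprocess` reusable from cache.
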
| Concern | LLMOps tool |
|---|---|
| Prompt versioning & evals | Langfuse, LangSmith, Promptfoo, Braintrust |
| Tracing every call | Langfuse, Phoenix (Arize), Helicone, W&B Weave |
| Cost & token tracking | OpenMeter, Helicone, native provider dashboards |
| Guardrails / safety | NVIDIA NeMo Guardrails, Guardrails AI |
| Fine-tuning ops | Provider APIs + W&B / MLflow tracking |
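
As a rough illustration of the tracing and cost-tracking rows above, the sketch below wraps a stubbed model call and records latency, token counts, and cost. It does not use the API of any tool in the table: `Tracer`, `Trace`, and `fake_llm` are hypothetical names, the per-token prices are made up, and tokens are approximated by whitespace-separated words.

```python
import functools
import time
from dataclasses import dataclass, field


@dataclass
class Trace:
    name: str
    latency_s: float
    input_tokens: int
    output_tokens: int
    cost_usd: float


@dataclass
class Tracer:
    # Hypothetical prices in USD per 1,000 tokens; not any provider's real rates.
    input_price_per_1k: float = 0.001
    output_price_per_1k: float = 0.002
    traces: list = field(default_factory=list)

    def traced(self, fn):
        """Wrap a call that returns (text, input_tokens, output_tokens)."""
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            text, n_in, n_out = fn(*args, **kwargs)
            latency = time.perf_counter() - start
            cost = (n_in * self.input_price_per_1k
                    + n_out * self.output_price_per_1k) / 1000
            self.traces.append(Trace(fn.__name__, latency, n_in, n_out, cost))
            return text
        return wrapper

    def total_cost(self) -> float:
        """Sum the cost of every recorded call."""
        return sum(t.cost_usd for t in self.traces)


tracer = Tracer()


@tracer.traced
def fake_llm(prompt: str):
    """Stand-in for a provider call; counts whitespace-separated words as tokens."""
    reply = "stub reply to: " + prompt
    return reply, len(prompt.split()), len(reply.split())


fake_llm("summarize this ticket")
print(tracer.traces[0])
print(tracer.total_cost())
```

In practice the token counts would come from the provider's response metadata rather than a word count, and the traces would be shipped to a tracing backend instead of being kept in a list.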