AI Stack Layer · 2 of 8

Model Providers

The companies that train, host, and serve the world's foundation models — through APIs, SDKs, and increasingly through cloud marketplaces. Pick a provider, get a model.

APIs · SDKs · Hosted Inference · Pricing per token · Layer 2
Quick Facts

At a Glance

Basic Concepts

  • Inference API: a hosted endpoint you call with a prompt and get tokens back.
  • SDKs exist for every major provider — Python, TypeScript, Go, Java, .NET.
  • Pricing is per million tokens, split into input and output.
  • Prompt caching lets you reuse repeated context cheaply (Anthropic, OpenAI, Google).
  • Cloud marketplaces (Azure, AWS, GCP) re-sell the same models inside their security perimeter.
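The first bullet above can be made concrete by sketching the wire-level shape of an inference call. This is a minimal illustration using OpenAI-style field names (other providers differ slightly); the response values are made up for the example, not real API output.

```python
# What you send: a model name plus a list of chat messages.
request = {
    "model": "gpt-5",
    "messages": [
        {"role": "user", "content": "Hello"},
    ],
}

# What you get back: generated text plus a usage block — the token
# counts in "usage" are what per-million-token pricing is billed on.
response = {
    "choices": [{"message": {"role": "assistant", "content": "Hi there!"}}],
    "usage": {"prompt_tokens": 8, "completion_tokens": 3, "total_tokens": 11},
}

reply = response["choices"][0]["message"]["content"]
billed_tokens = response["usage"]["total_tokens"]
print(reply, billed_tokens)
```

Every major provider returns some variant of this shape: a list of candidate completions and a usage object for billing.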
Landscape

The Major Providers

| Provider | Flagship Models | Notable Strengths |
| --- | --- | --- |
| OpenAI | GPT-5, GPT-4o, o-series | Largest market share, ChatGPT brand, Realtime API. |
| Anthropic | Claude (Opus, Sonnet, Haiku) | Long context, coding, safety, agent SDK. |
| Google | Gemini Pro / Flash / Ultra | Massive context (1M+), native multimodal, Workspace. |
| Meta | Llama (open weights) | Free to download, run anywhere. |
| Mistral AI | Mistral, Mixtral, Codestral | European, efficient MoE, open + commercial. |
| xAI | Grok | Real-time X (Twitter) data. |
| Cohere | Command, Embed, Rerank | Enterprise RAG, multilingual embeddings. |
| AWS Bedrock | Claude, Llama, Titan, Nova, Mistral | One API across many providers, AWS perimeter. |
| Azure OpenAI | GPT family | Microsoft enterprise + Azure compliance. |
| Vertex AI | Gemini, Claude, Llama | Google Cloud's model garden. |
| Together / Fireworks / Groq | Open models, hosted | Fast inference for open models (esp. Groq's LPU). |
| Hugging Face | ~All open models | The "GitHub of AI" + Inference Endpoints. |
Mechanics

Working with Provider APIs

A Typical Call
```python
# OpenAI Python SDK (reads OPENAI_API_KEY from the environment)
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": "You are concise."},
        {"role": "user",   "content": "Summarize quantum entanglement."},
    ],
)
print(resp.choices[0].message.content)
```

Anthropic, Google, and Mistral SDKs follow nearly identical shapes — the API surface has converged.
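To illustrate that convergence, here is the same call in the Anthropic-style Messages API. A sketch, not a drop-in: the model name is an assumed example, and the live call runs only when an API key is present (the Messages API takes `system` as a top-level field and requires `max_tokens`).

```python
import os

def build_request() -> dict:
    """Anthropic-style request mirroring the OpenAI example above."""
    return dict(
        model="claude-sonnet-4-20250514",  # assumed example model name
        max_tokens=256,                    # required by the Messages API
        system="You are concise.",         # system prompt is top-level, not a message
        messages=[
            {"role": "user", "content": "Summarize quantum entanglement."},
        ],
    )

# Only attempt the network call if credentials are configured.
if os.environ.get("ANTHROPIC_API_KEY"):
    import anthropic  # pip install anthropic
    client = anthropic.Anthropic()
    resp = client.messages.create(**build_request())
    print(resp.content[0].text)
```

The differences are cosmetic: where the system prompt lives, whether `max_tokens` is required, and how the response content is indexed. The mental model — model name, message list, text back — is identical.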

Pricing & Cost Optimization
  • Input tokens cost less than output — typically 4–5× cheaper.
  • Prompt caching can cut repeated-context costs by 90% (Anthropic, OpenAI, Gemini).
  • Batch APIs halve cost for non-real-time work.
  • Smaller models first: try Haiku / Flash / Mini before reaching for Opus, Pro, or other frontier-tier models.
  • Distillation: use a big model to generate training data for a small one.
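The bullets above combine into a simple back-of-envelope cost model. The prices below are illustrative placeholders (USD per million tokens), not any provider's real rates, and the 90% cache discount and 50% batch discount are the figures quoted in the list.

```python
def cost_usd(input_tokens: int, output_tokens: int,
             in_price: float, out_price: float,
             cached_fraction: float = 0.0,
             cache_discount: float = 0.9,
             batch: bool = False) -> float:
    """Estimate request cost given per-million-token prices.

    Cached input tokens are billed at (1 - cache_discount) of the
    input rate; batch submission roughly halves the total.
    """
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    cost = (fresh * in_price
            + cached * in_price * (1 - cache_discount)
            + output_tokens * out_price) / 1_000_000
    return cost * 0.5 if batch else cost

# 100k input tokens (80% cacheable) + 5k output at $3 in / $15 out per Mtok:
print(round(cost_usd(100_000, 5_000, 3.0, 15.0, cached_fraction=0.8), 4))
# -> 0.159
```

Note how the output side dominates despite being 20× smaller in token count, which is why the input/output price split matters when estimating costs.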
Streaming & Tool Use
  • Streaming sends tokens as they're generated — essential for chat UX.
  • Tool / function calling lets the model invoke your code (search, DB queries, calculators).
  • Structured outputs constrain responses to a JSON schema; plain JSON mode only guarantees syntactically valid JSON.
  • Vision — pass images alongside text on most modern APIs.
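Tool calling in particular deserves a sketch: you describe your functions to the model in a JSON schema, and when it decides to call one, it emits a name plus JSON-encoded arguments that your code dispatches. The tool schema below uses the OpenAI-style `function` shape; the local `get_weather` function and the simulated tool call are illustrative stand-ins, not real model output.

```python
import json

# A hypothetical local tool the model may invoke.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

# The schema you send to the API so the model knows the tool exists.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def dispatch(tool_call: dict) -> str:
    """Route a model-emitted tool call to local code and return the result."""
    fn = TOOLS[tool_call["name"]]
    args = json.loads(tool_call["arguments"])  # arguments arrive as a JSON string
    return fn(**args)

# Simulated model output: the model asked to call get_weather("Paris").
print(dispatch({"name": "get_weather", "arguments": '{"city": "Paris"}'}))
# -> Sunny in Paris
```

In a real loop you would send the tool's return value back to the model as a tool-result message so it can compose the final answer.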
Choosing a Provider
| If you need… | Consider |
| --- | --- |
| Best general capability | OpenAI, Anthropic, Google |
| Long context (1M+ tokens) | Google Gemini, Anthropic Claude |
| Coding agents | Anthropic Claude |
| On-prem / data residency | Open weights via vLLM, Hugging Face TGI |
| Lowest latency | Groq, Cerebras, Fireworks |
| Enterprise compliance | Azure OpenAI, AWS Bedrock, Vertex AI |