AI Stack Layer · 2 of 8

Model Providers

The companies that train, host, and serve the world's foundation models — through APIs, SDKs, and increasingly through cloud marketplaces. Pick a provider, get a model.

APIs · SDKs · Hosted Inference · Pricing per token · Layer 2
Quick Facts

At a Glance

Basic Concepts

  • Inference API: a hosted endpoint you call with a prompt and get tokens back.
  • SDKs exist for every major provider — Python, TypeScript, Go, Java, .NET.
  • Pricing is per million tokens, split into input and output.
  • Prompt caching lets you reuse repeated context cheaply (Anthropic, OpenAI, Google).
  • Cloud marketplaces (Azure, AWS, GCP) re-sell the same models inside their security perimeter.
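The first bullet above can be made concrete by sketching the wire-level shape of an inference call. This is a minimal illustration using OpenAI-style field names (other providers differ slightly); the response values are made up for the example, not real API output.

```python
# What you send: a model name plus a list of chat messages.
request = {
    "model": "gpt-5",
    "messages": [
        {"role": "user", "content": "Hello"},
    ],
}

# What you get back: generated text plus a usage block — the token
# counts in "usage" are what per-million-token pricing is billed on.
response = {
    "choices": [{"message": {"role": "assistant", "content": "Hi there!"}}],
    "usage": {"prompt_tokens": 8, "completion_tokens": 3, "total_tokens": 11},
}

reply = response["choices"][0]["message"]["content"]
billed_tokens = response["usage"]["total_tokens"]
print(reply, billed_tokens)
```

Every major provider returns some variant of this shape: a list of candidate completions and a usage object for billing.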
Landscape

The Major Providers

| Provider | Flagship Models | Notable Strengths |
| --- | --- | --- |
| OpenAI | GPT-5, GPT-4o, o-series | Largest market share, ChatGPT brand, Realtime API. |
| Anthropic | Claude (Opus, Sonnet, Haiku) | Long context, coding, safety, agent SDK. |
| Google | Gemini Pro / Flash / Ultra | Massive context (1M+), native multimodal, Workspace. |
| Meta | Llama (open weights) | Free to download, run anywhere. |
| Mistral AI | Mistral, Mixtral, Codestral | European, efficient MoE, open + commercial. |
| xAI | Grok | Real-time X (Twitter) data. |
| Cohere | Command, Embed, Rerank | Enterprise RAG, multilingual embeddings. |
| AWS Bedrock | Claude, Llama, Titan, Nova, Mistral | One API across many providers, AWS perimeter. |
| Azure OpenAI | GPT family | Microsoft enterprise + Azure compliance. |
| Vertex AI | Gemini, Claude, Llama | Google Cloud's model garden. |
| Together / Fireworks / Groq | Open models, hosted | Fast inference for open models (esp. Groq's LPU). |
| Hugging Face | ~All open models | The "GitHub of AI" + Inference Endpoints. |
Mechanics

Working with Provider APIs

A Typical Call
```python
# OpenAI Python SDK (reads OPENAI_API_KEY from the environment)
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": "You are concise."},
        {"role": "user",   "content": "Summarize quantum entanglement."},
    ],
)
print(resp.choices[0].message.content)
```

Anthropic, Google, and Mistral SDKs follow nearly identical shapes — the API surface has converged.
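To illustrate that convergence, here is the same call in the Anthropic-style Messages API. A sketch, not a drop-in: the model name is an assumed example, and the live call runs only when an API key is present (the Messages API takes `system` as a top-level field and requires `max_tokens`).

```python
import os

def build_request() -> dict:
    """Anthropic-style request mirroring the OpenAI example above."""
    return dict(
        model="claude-sonnet-4-20250514",  # assumed example model name
        max_tokens=256,                    # required by the Messages API
        system="You are concise.",         # system prompt is top-level, not a message
        messages=[
            {"role": "user", "content": "Summarize quantum entanglement."},
        ],
    )

# Only attempt the network call if credentials are configured.
if os.environ.get("ANTHROPIC_API_KEY"):
    import anthropic  # pip install anthropic
    client = anthropic.Anthropic()
    resp = client.messages.create(**build_request())
    print(resp.content[0].text)
```

The differences are cosmetic: where the system prompt lives, whether `max_tokens` is required, and how the response content is indexed. The mental model — model name, message list, text back — is identical.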

Pricing & Cost Optimization
  • Input tokens cost less than output — typically 4–5× cheaper.
  • Prompt caching can cut repeated-context costs by 90% (Anthropic, OpenAI, Gemini).
  • Batch APIs halve cost for non-real-time work.
  • Smaller models first: try Haiku / Flash / Mini before reaching for Opus, Pro, or other frontier-tier models.
  • Distillation: use a big model to generate training data for a small one.
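The bullets above combine into a simple back-of-envelope cost model. The prices below are illustrative placeholders (USD per million tokens), not any provider's real rates, and the 90% cache discount and 50% batch discount are the figures quoted in the list.

```python
def cost_usd(input_tokens: int, output_tokens: int,
             in_price: float, out_price: float,
             cached_fraction: float = 0.0,
             cache_discount: float = 0.9,
             batch: bool = False) -> float:
    """Estimate request cost given per-million-token prices.

    Cached input tokens are billed at (1 - cache_discount) of the
    input rate; batch submission roughly halves the total.
    """
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    cost = (fresh * in_price
            + cached * in_price * (1 - cache_discount)
            + output_tokens * out_price) / 1_000_000
    return cost * 0.5 if batch else cost

# 100k input tokens (80% cacheable) + 5k output at $3 in / $15 out per Mtok:
print(round(cost_usd(100_000, 5_000, 3.0, 15.0, cached_fraction=0.8), 4))
# -> 0.159
```

Note how the output side dominates despite being 20× smaller in token count, which is why the input/output price split matters when estimating costs.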
Streaming & Tool Use
  • Streaming sends tokens as they're generated — essential for chat UX.
  • Tool / function calling lets the model invoke your code (search, DB queries, calculators).
  • Structured outputs constrain responses to a JSON schema; plain JSON mode only guarantees syntactically valid JSON.
  • Vision — pass images alongside text on most modern APIs.
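Tool calling in particular deserves a sketch: you describe your functions to the model in a JSON schema, and when it decides to call one, it emits a name plus JSON-encoded arguments that your code dispatches. The tool schema below uses the OpenAI-style `function` shape; the local `get_weather` function and the simulated tool call are illustrative stand-ins, not real model output.

```python
import json

# A hypothetical local tool the model may invoke.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

# The schema you send to the API so the model knows the tool exists.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def dispatch(tool_call: dict) -> str:
    """Route a model-emitted tool call to local code and return the result."""
    fn = TOOLS[tool_call["name"]]
    args = json.loads(tool_call["arguments"])  # arguments arrive as a JSON string
    return fn(**args)

# Simulated model output: the model asked to call get_weather("Paris").
print(dispatch({"name": "get_weather", "arguments": '{"city": "Paris"}'}))
# -> Sunny in Paris
```

In a real loop you would send the tool's return value back to the model as a tool-result message so it can compose the final answer.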
Choosing a Provider
| If you need… | Consider |
| --- | --- |
| Best general capability | OpenAI, Anthropic, Google |
| Long context (1M+ tokens) | Google Gemini, Anthropic Claude |
| Coding agents | Anthropic Claude |
| On-prem / data residency | Open weights via vLLM, Hugging Face TGI |
| Lowest latency | Groq, Cerebras, Fireworks |
| Enterprise compliance | Azure OpenAI, AWS Bedrock, Vertex AI |