AI Stack Layer · 1 of 8

Foundation Models

Massive neural networks trained on internet-scale data — the engines under every modern AI app. Foundation models are pre-trained once, then adapted to thousands of downstream tasks.

LLMs · Pre-trained · Multi-modal · Open & Closed · Layer 1
Quick Facts

At a Glance

  • What it is: Pre-trained neural network
  • Training data: Trillions of tokens / images
  • Sizes: Billions to trillions of parameters
  • Modalities: Text, image, audio, video, code
  • Access: Closed (API) or open weights
  • Examples: GPT, Claude, Gemini, Llama

Basic Concepts

  • Foundation model: a single model trained on broad data that can be adapted to many downstream tasks (term coined by Stanford researchers in 2021).
  • Token: the unit of text the model processes — roughly ¾ of a word in English.
  • Context window: how many tokens the model can "see" at once (4K → 200K → 1M+ today).
  • Parameters: the learned weights — more isn't always better, but capability generally scales with size.
  • Inference: running the model on new input. Training: teaching it from data.
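The token and context-window ideas above can be sketched in a few lines. This is a toy illustration, not a real tokenizer: it uses the "1 token ≈ ¾ of a word" rule of thumb to estimate counts, then trims a conversation to fit a window. Production models use subword tokenizers (e.g. BPE), and the function names here are made up for the example.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate via the rule of thumb: ~4/3 tokens per English word."""
    return max(1, round(len(text.split()) * 4 / 3))

def trim_to_context(messages: list[str], context_window: int) -> list[str]:
    """Keep only the most recent messages that fit in the context window."""
    kept, used = [], 0
    for msg in reversed(messages):          # walk from newest to oldest
        cost = estimate_tokens(msg)
        if used + cost > context_window:
            break                           # older messages fall out of view
        kept.append(msg)
        used += cost
    return list(reversed(kept))             # restore chronological order
```

This is why long conversations "forget" their beginnings: once the window is full, the oldest tokens are simply not shown to the model.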
Landscape

The Major Models

Family | Maker | Open? | Strengths
GPT (4o, 4.1, 5) | OpenAI | Closed | Multimodal, broad capability; powers ChatGPT.
Claude (Sonnet, Opus, Haiku) | Anthropic | Closed | Long context, coding, safety, agentic workflows.
Gemini (Pro, Flash, Ultra) | Google DeepMind | Closed | Native multimodal, huge context windows, Workspace integration.
Llama (3.x, 4) | Meta | Open weights | Most-used open model; runs anywhere.
Mistral / Mixtral | Mistral AI | Mixed | European, efficient, MoE architecture.
DeepSeek | DeepSeek | Open weights | Strong reasoning at a fraction of the training cost.
Qwen | Alibaba | Open weights | Multilingual, strong on Asian languages.
Grok | xAI | Mixed | X (Twitter) integration, real-time data.
Mechanics

How They Work

The Transformer Architecture

Almost all modern foundation models are transformers — a 2017 architecture from Google ("Attention Is All You Need"). The core idea: self-attention lets the model weigh how every token relates to every other token, in parallel. This scales beautifully with compute.
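The self-attention idea can be shown in miniature. This is a single-head sketch in NumPy with queries, keys, and values all equal to the input for simplicity; real transformers use separate learned projections (W_Q, W_K, W_V), multiple heads, residual connections, and layer norm.

```python
import numpy as np

def self_attention(X: np.ndarray) -> np.ndarray:
    """X: (seq_len, d_model) token embeddings -> attended outputs."""
    d = X.shape[-1]
    # Score how strongly every token relates to every other token, in parallel.
    scores = X @ X.T / np.sqrt(d)
    # Softmax each row so attention weights form a probability distribution.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted mix of ALL token embeddings.
    return weights @ X
```

Because the score matrix covers every pair of tokens at once, the whole operation is a few matrix multiplies, which is exactly why the architecture scales so well on GPUs.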

Pre-training vs Fine-tuning
  • Pre-training: learn language patterns from trillions of tokens (months of GPU time, $$$$).
  • Fine-tuning: adapt to a specific task or style with a small dataset (hours, $).
  • RLHF (Reinforcement Learning from Human Feedback): teach the model to prefer helpful, harmless responses.
  • Instruction tuning: train on (instruction, response) pairs so it follows requests.
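Instruction tuning, the last step above, is mostly a data-formatting exercise: (instruction, response) pairs are rendered into training strings with special tokens. The template and `<|user|>` / `<|assistant|>` markers below are purely illustrative; every model family defines its own chat format.

```python
# Hypothetical instruction-tuning pairs (illustrative data, not a real dataset).
PAIRS = [
    ("Translate to French: Hello", "Bonjour"),
    ("Summarize: The cat sat on the mat.", "A cat sat on a mat."),
]

def format_example(instruction: str, response: str) -> str:
    """Render one pair into a single training string with special tokens."""
    return f"<|user|>{instruction}<|assistant|>{response}<|end|>"

dataset = [format_example(i, r) for i, r in PAIRS]
```

Fine-tuning then continues ordinary next-token training on these strings, which is why it is so much cheaper than pre-training: the hard part (language itself) is already learned.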
Capabilities & Limitations
  • Strengths: language understanding, code, reasoning, summarization, translation, image understanding.
  • Hallucination: models can confidently invent facts — always validate.
  • Knowledge cutoff: models know nothing past their training date (without tools/RAG).
  • Math & precision: still uneven — use tool calling for arithmetic, code execution.
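The tool-calling fix for arithmetic looks roughly like this sketch: the app exposes a calculator tool, the model emits a structured request instead of guessing the answer, and the app executes it. The `model_output` dict here is hard-coded for illustration; in a real system it comes back from the model API.

```python
import ast
import operator

# Map AST operator nodes to safe arithmetic functions.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expr: str) -> float:
    """Safely evaluate a basic arithmetic expression (no eval())."""
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)

# Illustrative stand-in for a model's structured tool call:
model_output = {"tool": "calculator", "arguments": {"expr": "1234 * 5678"}}
if model_output["tool"] == "calculator":
    result = calculator(model_output["arguments"]["expr"])
    # `result` is fed back to the model, which writes the final answer.
```

The model never does the multiplication itself; it only decides *that* a tool is needed and *which* arguments to pass, which is the part language models are reliable at.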
Reasoning Models

A new generation of reasoning models (OpenAI's o-series, Claude with extended thinking, Gemini 2.0 Flash Thinking, DeepSeek-R1) spends extra compute on a private "thinking" pass before answering. The result: far better performance on math, code, and multi-step problems, in exchange for higher latency and cost.

Multi-modal Models

Modern models accept images, audio, and video alongside text. Native multimodal models (GPT-4o, Gemini, Claude) are trained jointly on all modalities; older systems bolt vision encoders on top.

Choosing

Open vs Closed

Closed (API)
  • State-of-the-art capability, no infra to manage.
  • Pay-per-token; instant scale.
  • Continuous improvement — model updates roll in.
  • Less control: data leaves your perimeter (with caveats).
Open Weights
  • Run on your own hardware — full data privacy.
  • Free to fine-tune, distill, quantize.
  • Generally trail the closed frontier by 6–12 months.
  • You own the GPU bill and ops headache.
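Quantization, one of the freedoms open weights give you, can be sketched in a few lines. This is a minimal symmetric int8 scheme, assuming nothing beyond NumPy; real methods (GPTQ, AWQ, and friends) quantize per-group and correct for the rounding error, but the core idea is the same: shrink float weights ~4x so the model fits on smaller hardware.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float weights onto int8 so the largest weight hits ±127."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from int8 + one scale factor."""
    return q.astype(np.float32) * scale

np.random.seed(0)                          # fixed seed for a reproducible demo
w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)                   # close to w, at ~1/4 the memory
```

The per-weight error is bounded by half the scale factor, which is why quantized models usually lose only a little quality while gaining a lot of deployability.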
