AI Stack Layer · 1 of 8

Foundation Models

Massive neural networks trained on internet-scale data — the engines under every modern AI app. Foundation models are pre-trained once, then adapted to thousands of downstream tasks.

LLMs · Pre-trained · Multi-modal · Open & Closed · Layer 1
Quick Facts

At a Glance

  • What it is: Pre-trained neural network
  • Training data: Trillions of tokens / images
  • Sizes: Billions to trillions of parameters
  • Modalities: Text, image, audio, video, code
  • Access: Closed (API) or open weights
  • Examples: GPT, Claude, Gemini, Llama

Basic Concepts

  • Foundation model: a single model trained on broad data that can be adapted to many downstream tasks (term coined by Stanford researchers in 2021).
  • Token: the unit of text the model processes — roughly ¾ of a word in English.
  • Context window: how many tokens the model can "see" at once (4K → 200K → 1M+ today).
  • Parameters: the learned weights — more isn't always better, but capability generally scales with size.
  • Inference: running the model on new input. Training: teaching it from data.
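The token and context-window ideas above can be sketched in a few lines. This is a toy illustration, not a real tokenizer: it uses the "1 token ≈ ¾ of a word" rule of thumb to estimate counts, then trims a conversation to fit a window. Production models use subword tokenizers (e.g. BPE), and the function names here are made up for the example.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate via the rule of thumb: ~4/3 tokens per English word."""
    return max(1, round(len(text.split()) * 4 / 3))

def trim_to_context(messages: list[str], context_window: int) -> list[str]:
    """Keep only the most recent messages that fit in the context window."""
    kept, used = [], 0
    for msg in reversed(messages):          # walk from newest to oldest
        cost = estimate_tokens(msg)
        if used + cost > context_window:
            break                           # older messages fall out of view
        kept.append(msg)
        used += cost
    return list(reversed(kept))             # restore chronological order
```

This is why long conversations "forget" their beginnings: once the window is full, the oldest tokens are simply not shown to the model.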
Landscape

The Major Models

Family | Maker | Open? | Strengths
GPT (4o, 4.1, 5) | OpenAI | Closed | Multimodal, broad capability; powers ChatGPT.
Claude (Sonnet, Opus, Haiku) | Anthropic | Closed | Long context, coding, safety, agentic workflows.
Gemini (Pro, Flash, Ultra) | Google DeepMind | Closed | Native multimodal, huge context windows, Workspace integration.
Llama (3.x, 4) | Meta | Open weights | Most-used open model; runs anywhere.
Mistral / Mixtral | Mistral AI | Mixed | European, efficient, MoE architecture.
DeepSeek | DeepSeek | Open weights | Strong reasoning at a fraction of the training cost.
Qwen | Alibaba | Open weights | Multilingual, strong on Asian languages.
Grok | xAI | Mixed | X (Twitter) integration, real-time data.
Mechanics

How They Work

The Transformer Architecture

Almost all modern foundation models are transformers — a 2017 architecture from Google ("Attention Is All You Need"). The core idea: self-attention lets the model weigh how every token relates to every other token, in parallel. This scales beautifully with compute.
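The self-attention idea can be shown in miniature. This is a single-head sketch in NumPy with queries, keys, and values all equal to the input for simplicity; real transformers use separate learned projections (W_Q, W_K, W_V), multiple heads, residual connections, and layer norm.

```python
import numpy as np

def self_attention(X: np.ndarray) -> np.ndarray:
    """X: (seq_len, d_model) token embeddings -> attended outputs."""
    d = X.shape[-1]
    # Score how strongly every token relates to every other token, in parallel.
    scores = X @ X.T / np.sqrt(d)
    # Softmax each row so attention weights form a probability distribution.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted mix of ALL token embeddings.
    return weights @ X
```

Because the score matrix covers every pair of tokens at once, the whole operation is a few matrix multiplies, which is exactly why the architecture scales so well on GPUs.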

Pre-training vs Fine-tuning
  • Pre-training: learn language patterns from trillions of tokens (months of GPU time, $$$$).
  • Fine-tuning: adapt to a specific task or style with a small dataset (hours, $).
  • RLHF (Reinforcement Learning from Human Feedback): teach the model to prefer helpful, harmless responses.
  • Instruction tuning: train on (instruction, response) pairs so it follows requests.
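Instruction tuning, the last step above, is mostly a data-formatting exercise: (instruction, response) pairs are rendered into training strings with special tokens. The template and `<|user|>` / `<|assistant|>` markers below are purely illustrative; every model family defines its own chat format.

```python
# Hypothetical instruction-tuning pairs (illustrative data, not a real dataset).
PAIRS = [
    ("Translate to French: Hello", "Bonjour"),
    ("Summarize: The cat sat on the mat.", "A cat sat on a mat."),
]

def format_example(instruction: str, response: str) -> str:
    """Render one pair into a single training string with special tokens."""
    return f"<|user|>{instruction}<|assistant|>{response}<|end|>"

dataset = [format_example(i, r) for i, r in PAIRS]
```

Fine-tuning then continues ordinary next-token training on these strings, which is why it is so much cheaper than pre-training: the hard part (language itself) is already learned.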
Capabilities & Limitations
  • Strengths: language understanding, code, reasoning, summarization, translation, image understanding.
  • Hallucination: models can confidently invent facts — always validate.
  • Knowledge cutoff: models know nothing past their training date (without tools/RAG).
  • Math & precision: still uneven — use tool calling for arithmetic, code execution.
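The tool-calling fix for arithmetic looks roughly like this sketch: the app exposes a calculator tool, the model emits a structured request instead of guessing the answer, and the app executes it. The `model_output` dict here is hard-coded for illustration; in a real system it comes back from the model API.

```python
import ast
import operator

# Map AST operator nodes to safe arithmetic functions.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expr: str) -> float:
    """Safely evaluate a basic arithmetic expression (no eval())."""
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)

# Illustrative stand-in for a model's structured tool call:
model_output = {"tool": "calculator", "arguments": {"expr": "1234 * 5678"}}
if model_output["tool"] == "calculator":
    result = calculator(model_output["arguments"]["expr"])
    # `result` is fed back to the model, which writes the final answer.
```

The model never does the multiplication itself; it only decides *that* a tool is needed and *which* arguments to pass, which is the part language models are reliable at.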
Reasoning Models

A new generation of reasoning models (OpenAI's o-series, Claude with extended thinking, Gemini 2.0 Flash Thinking, DeepSeek-R1) spends extra compute on a private "thinking" pass before answering. The result: far better performance on math, code, and multi-step problems, in exchange for higher latency and cost.

Multi-modal Models

Modern models accept images, audio, and video alongside text. Native multimodal models (GPT-4o, Gemini, Claude) are trained jointly on all modalities; older systems bolt vision encoders on top.

Choosing

Open vs Closed

Closed (API)
  • State-of-the-art capability, no infra to manage.
  • Pay-per-token; instant scale.
  • Continuous improvement — model updates roll in.
  • Less control: data leaves your perimeter (with caveats).
Open Weights
  • Run on your own hardware — full data privacy.
  • Free to fine-tune, distill, quantize.
  • Generally trail the closed frontier by 6–12 months.
  • You own the GPU bill and ops headache.
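Quantization, one of the freedoms open weights give you, can be sketched in a few lines. This is a minimal symmetric int8 scheme, assuming nothing beyond NumPy; real methods (GPTQ, AWQ, and friends) quantize per-group and correct for the rounding error, but the core idea is the same: shrink float weights ~4x so the model fits on smaller hardware.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float weights onto int8 so the largest weight hits ±127."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from int8 + one scale factor."""
    return q.astype(np.float32) * scale

np.random.seed(0)                          # fixed seed for a reproducible demo
w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)                   # close to w, at ~1/4 the memory
```

The per-weight error is bounded by half the scale factor, which is why quantized models usually lose only a little quality while gaining a lot of deployability.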
