Learning Track · 4 of 4

Build a Production-Grade Image Upload & Thumbnail Service

Eleven modules that take you from an empty Git repo to a scalable, resilient image service — handling uploads, validation, async processing, multiple formats and sizes, CDN distribution, signed URLs, and edge cases. This track covers blob storage, async pipelines, media processing, caching, and operational excellence for user-generated content at scale.

Blob storage · Image processing · Async pipelines · CDN · Access control · Observability
How to Use This Track

Learning by Shipping File Services

Ground rules

  • Validate early. User uploads are an attack vector (zip bombs, endless or oversized streams, malware). Validate format, size, and content.
  • Process asynchronously. Image resizing is expensive. Never block the upload on processing.
  • Embrace immutability. Once uploaded, images don't change. Use ETags and cache aggressively.
  • Pick a storage backend and SDK. AWS S3, Google Cloud Storage, or self-hosted MinIO are all valid. Learn your SDK deeply.
  • Edge cases are the work. EXIF data, corrupted files, unsupported formats, SSRF, path traversal — think like an attacker.

The eleven modules

Module 01 · ~2–3 hrs

Foundations & Project Setup

Start with a clear design. Image services are deceptively complex — edge cases, security, and performance are everywhere.

Tasks

  • Create a Git repo with standard config (.editorconfig, .gitignore, linter/formatter).
  • Scaffold an HTTP server. Pick one: Node, Python, Go, Rust.
  • Write a one-page design doc: problem, goals, storage backend choice, image sizes you'll generate, API sketch, failure modes.
  • Create database schema: images (metadata), variants (sizes/formats), uploads (user tracking).
  • Document your validation and security strategy in the design doc.
Acceptance criteria
  • git clone + one command spins up the server.
  • Design doc committed to /docs/design.md with schema sketch.
  • Linter passes. Pre-commit hook blocks failures.
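As a starting point for the schema task, here is a minimal sketch of the three tables using SQLite from the Python standard library. Every table and column name is an assumption made for illustration (including the `status` column, which anticipates Module 04), and a production setup would more likely target Postgres with a migration tool.

```python
import sqlite3

# Hypothetical schema sketch: all names below are assumptions, not a prescribed design.
SCHEMA = """
CREATE TABLE uploads (
    id          INTEGER PRIMARY KEY,
    user_id     TEXT NOT NULL,
    created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE images (
    id            INTEGER PRIMARY KEY,
    upload_id     INTEGER NOT NULL REFERENCES uploads(id),
    storage_key   TEXT NOT NULL UNIQUE,      -- key of the original in blob storage
    content_type  TEXT NOT NULL,
    byte_size     INTEGER NOT NULL,
    status        TEXT NOT NULL DEFAULT 'pending'
);
CREATE TABLE variants (
    id           INTEGER PRIMARY KEY,
    image_id     INTEGER NOT NULL REFERENCES images(id) ON DELETE CASCADE,
    size_name    TEXT NOT NULL,              -- e.g. 'thumb', 'medium'
    format       TEXT NOT NULL,              -- e.g. 'jpeg', 'webp'
    storage_key  TEXT NOT NULL UNIQUE,
    UNIQUE (image_id, size_name, format)     -- one row per size/format pair
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the three tables on the given connection."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
```

The unique constraint on `(image_id, size_name, format)` is one way to keep retried processing jobs from creating duplicate variant rows.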
Module 02 · ~4–6 hrs

Upload API & Blob Storage

The baseline: accept uploads, store them durably, and serve them back. Start with your storage backend.

Tasks

  • Set up blob storage: AWS S3, Google Cloud Storage, or MinIO locally.
  • Implement POST /upload: multipart file upload, store in blob storage, record metadata in DB.
  • Implement GET /images/:id: fetch from blob storage, return with correct content-type.
  • Implement DELETE /images/:id: delete from blob storage and DB.
  • Return proper HTTP headers: Content-Type, ETag, Cache-Control.
Acceptance criteria
  • A multipart POST to /upload (for example, curl -F file=@test.jpg <server-url>/upload) stores the file and returns an image ID.
  • GET /images/:id returns the file with correct content-type.
  • Deleting an image removes it from blob storage and the database.
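One way to approach the response-header task in this module is to derive the ETag from a hash of the image bytes and answer conditional requests with 304. The sketch below is framework-agnostic; `make_etag` and `image_response` are hypothetical names, and it only handles exact tag matches and `*`, not weak validators.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    """Build an ETag from the content hash, wrapped in the quotes HTTP expects."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def image_response(
    body: bytes, content_type: str, if_none_match: Optional[str] = None
) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for a GET /images/:id request."""
    etag = make_etag(body)
    headers = {
        "Content-Type": content_type,
        "ETag": etag,
        "Cache-Control": "public, max-age=31536000",
    }
    if if_none_match is not None:
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if "*" in candidates or etag in candidates:
            # The client already holds this exact representation.
            return 304, headers, b""
    return 200, headers, body
```

In a real service you would probably compute the hash once at upload time and store it with the metadata, rather than hashing the body on every request.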
Module 03 · ~4–6 hrs

Input Validation & Security

Users can upload anything. Validate format, size, content. Prevent zip bombs, SSRF, path traversal.

Tasks

  • File type validation: check magic bytes (file signature), not just extension.
  • Size limits: max file size (e.g., 50MB). Rate limit uploads per user.
  • SSRF prevention: if you support URL uploads, block private IPs and metadata endpoints.
  • Malware scanning (optional): integrate ClamAV or VirusTotal for safety.
  • Reject uploads as soon as validation fails, with a clear 4xx error: 400 for invalid content, 413 for oversized files, 429 for rate limits.
Acceptance criteria
  • Uploading a text file with .jpg extension is rejected.
  • Uploading a file larger than your limit is rejected with 413 Payload Too Large.
  • A rate-limited user gets 429 Too Many Requests.
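The magic-byte and SSRF tasks in this module can be sketched with the standard library alone. The function names are hypothetical, the signature list covers only a handful of formats, and the AVIF check looks only at the major brand, so treat this as a starting point rather than a complete validator.

```python
import ipaddress
from typing import Optional


def sniff_image_type(head: bytes) -> Optional[str]:
    """Guess a MIME type from the first bytes of a file; None if unrecognized."""
    if head.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if head.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if head.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return "image/webp"
    # Simplification: only the ISO-BMFF major brand is checked here.
    if head[4:8] == b"ftyp" and head[8:12] in (b"avif", b"avis"):
        return "image/avif"
    return None


def is_blocked_address(ip_text: str) -> bool:
    """True for addresses a URL-upload fetcher should refuse to contact."""
    ip = ipaddress.ip_address(ip_text)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local  # includes 169.254.169.254, a common metadata endpoint
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )
```

For the SSRF check to hold, apply it to the address you actually connect to (after DNS resolution and after every redirect), not just to the hostname the user supplied.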
Module 04 · ~5–7 hrs

Async Image Processing

Resizing images is expensive. Don't block the upload. Enqueue, process asynchronously, store variants.

Tasks

  • On successful upload, enqueue a processing job for the image.
  • A worker processes the job: read the original from blob storage, generate a thumbnail (200x200), and store the thumbnail alongside the original.
  • Use ImageMagick, libvips, or equivalent. libvips is faster and uses less memory.
  • Track processing state: pending, processing, success, failed.
  • If processing fails, don't lose the original image. Log the error and expose it in the API.
Acceptance criteria
  • Upload an image; GET /images/:id?size=thumb blocks until the thumbnail is ready.
  • Processing a 5MB image takes < 5 seconds on local hardware.
  • A processing failure doesn't delete the original.
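The state tracking and failure handling in this module can be sketched as follows. An in-process `queue.Queue` and a plain dict stand in for a real job queue and blob storage, and `make_thumbnail` is a placeholder for whatever libvips or ImageMagick call you end up using; all of these names are assumptions.

```python
import queue
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Job:
    image_id: str
    state: str = "pending"  # pending -> processing -> success | failed
    error: Optional[str] = None


def run_worker(
    jobs: "queue.Queue[Job]",
    blobs: Dict[str, bytes],
    make_thumbnail: Callable[[bytes], bytes],
) -> None:
    """Drain the queue once. A long-running worker would block on get() instead."""
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        job.state = "processing"
        try:
            original = blobs[f"{job.image_id}/original"]
            # Only a new key is written; the original is never modified or deleted.
            blobs[f"{job.image_id}/thumb"] = make_thumbnail(original)
            job.state = "success"
        except Exception as exc:
            job.state = "failed"
            job.error = str(exc)  # surfaced later through the API
        finally:
            jobs.task_done()
```

Because the worker only ever adds a thumbnail key, a failed job leaves the stored original untouched, which is the property the last acceptance criterion asks for.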
Module 05 · ~5–7 hrs

Multiple Sizes & Formats

Different clients need different sizes. Generate: thumbnail (200x200), medium (800x800), full (original). Support modern formats: WebP, AVIF.

Tasks

  • Generate multiple variants: thumb, medium, and full (the original). Store each as a variant in blob storage.
  • Implement responsive images: GET /images/:id?size=medium&format=webp returns WebP, falling back to JPEG.
  • On-demand generation (optional): if a variant doesn't exist, generate it on-the-fly and cache.
  • Document your variant strategy in the design doc: which sizes, which formats, storage costs.
  • Measure disk usage: log variant sizes so you can optimize later.
Acceptance criteria
  • GET /images/:id?size=thumb returns a 200x200 thumbnail.
  • GET /images/:id?format=webp returns WebP if available, else original.
  • You can describe storage overhead: original + 2 resized variants (thumb, medium) + WebP versions.
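Two small pieces of this module lend themselves to a sketch: computing target dimensions and choosing which stored variant to serve. Both function names are hypothetical. Note the assumption in `fit_within`: it preserves aspect ratio and never upscales, so a non-square source will not come out at exactly 200x200 unless you also crop or pad.

```python
from typing import Optional, Set, Tuple


def fit_within(width: int, height: int, box: int) -> Tuple[int, int]:
    """Scale (width, height) to fit inside a box x box square without upscaling."""
    scale = min(box / width, box / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


def pick_variant(
    available: Set[Tuple[str, str]],
    size: str,
    fmt: str,
    fallback_fmt: str = "jpeg",
) -> Optional[Tuple[str, str]]:
    """Prefer (size, fmt); fall back to (size, fallback_fmt); else None."""
    for candidate in ((size, fmt), (size, fallback_fmt)):
        if candidate in available:
            return candidate
    return None
```

For example, `fit_within(4000, 3000, 800)` yields an 800x600 target, and `pick_variant({("medium", "jpeg")}, "medium", "webp")` falls back to the JPEG variant.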
Module 06 · ~3–4 hrs

Metadata & EXIF Handling

EXIF data can leak sensitive info (GPS, camera, timestamps). Extract what's useful, strip what's not.

Tasks

  • Extract EXIF on upload: camera model, date taken, dimensions. Store in the images table.
  • Strip sensitive EXIF on variant generation (GPS, camera serial).
  • Expose GET /images/:id/metadata with safe EXIF (omit GPS, serial).
  • Validate dimensions on upload (e.g., reject images < 100x100). Read them from the decoded image header where you can; EXIF dimension tags can be missing or wrong.
  • Document your EXIF stripping strategy in the design doc.
Acceptance criteria
  • GET /images/:id/metadata returns EXIF data (no GPS or sensitive info).
  • Variants don't contain GPS or camera serial in EXIF.
  • You can describe why EXIF stripping matters (privacy).
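For the metadata endpoint, one hedged approach is an allowlist: return only tags you have decided are safe and drop everything else by default. The sketch assumes your EXIF reader hands you a dict keyed by standard EXIF tag names; `SAFE_TAGS` and `safe_metadata` are hypothetical, and the exact allowlist is yours to decide.

```python
from typing import Any, Dict

# Assumption: a deliberately small allowlist. Anything not listed is dropped,
# so GPS data, serial numbers, and unknown vendor tags never reach the API.
SAFE_TAGS = {"Make", "Model", "DateTimeOriginal", "Orientation"}


def safe_metadata(exif: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the allowlisted tags from a parsed EXIF dict."""
    return {tag: value for tag, value in exif.items() if tag in SAFE_TAGS}
```

An allowlist fails closed: a new or unexpected tag is hidden until you add it on purpose, whereas a denylist would expose it until you notice.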
Module 07 · ~4–6 hrs

CDN Integration & Distribution

Serve images from edge locations, not your origin. Integrate with a CDN: Cloudflare, Fastly, or AWS CloudFront.

Tasks

  • Set up a CDN (Cloudflare, Fastly, or equivalent). Point it at your origin.
  • Configure cache headers: Cache-Control: public, max-age=31536000 for immutable images.
  • Implement POST /images/:id/purge to purge from CDN if an image is deleted.
  • Use a CDN-friendly URL format: https://cdn.example.com/images/:id.
  • Measure: edge cache hit rate, origin traffic reduction, latency from different regions.
Acceptance criteria
  • Image served from CDN (check response headers for CF-Cache-Status or equivalent).
  • Deleting an image purges it from the CDN.
  • Cache-Control headers are correct (long-lived for immutable images).
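The cache-header and hit-rate tasks can be sketched in a few lines. Both function names are hypothetical. The `immutable` directive and the `private, no-store` policy for non-public images are assumptions layered on top of the header this module specifies, and `hit_rate` assumes you have collected cache-status values such as those in a CF-Cache-Status header.

```python
from typing import Dict, Iterable

ONE_YEAR_SECONDS = 31536000


def cache_headers(is_public: bool) -> Dict[str, str]:
    """Long-lived caching for public images; no shared caching otherwise."""
    if is_public:
        return {"Cache-Control": f"public, max-age={ONE_YEAR_SECONDS}, immutable"}
    # Assumption: non-public images should never be stored by the CDN.
    return {"Cache-Control": "private, no-store"}


def hit_rate(statuses: Iterable[str]) -> float:
    """Fraction of responses whose cache status was HIT (case-insensitive)."""
    normalized = [status.upper() for status in statuses]
    if not normalized:
        return 0.0
    return sum(status == "HIT" for status in normalized) / len(normalized)
```

Sampling the status header from a batch of requests and feeding it to `hit_rate` gives a rough edge hit rate before you have CDN analytics wired up.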
Module 08 · ~4–6 hrs

Signed URLs & Access Control

Not all images are public. Generate time-limited signed URLs for private uploads. Implement ownership and sharing.

Tasks

  • Add a public flag to images. Private images are only accessible via signed URLs.
  • Implement POST /images/:id/signed-url: generate a time-limited signed URL (valid for 1 hour, configurable).
  • Implement ownership: only the uploader can delete their images and generate signed URLs.
  • Use HMAC or JWT for signing. Validate the signature and expiry on GET /images/:id.
  • Expose POST /images/:id/share to grant time-limited access to another user.
Acceptance criteria
  • Private images return 403 Forbidden without a valid signed URL.
  • POST /images/:id/signed-url generates a URL valid for 1 hour.
  • User A cannot delete User B's private images.
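A minimal HMAC-based sketch of the signed-URL tasks, assuming the signature covers the image ID and an expiry timestamp. The query parameter names (`expires`, `sig`), the function names, and the hard-coded secret are all assumptions for illustration; a real service would load the secret from configuration and plan for key rotation.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET = b"replace-with-a-real-secret"  # assumption: loaded from config in practice


def _signature(image_id: str, expires: int, secret: bytes) -> str:
    message = f"{image_id}:{expires}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_url(
    image_id: str,
    ttl_seconds: int = 3600,
    secret: bytes = SECRET,
    now: Optional[float] = None,
) -> str:
    """Build a relative URL that is valid for ttl_seconds from now."""
    expires = int((time.time() if now is None else now) + ttl_seconds)
    return f"/images/{image_id}?expires={expires}&sig={_signature(image_id, expires, secret)}"


def verify(
    image_id: str,
    expires: int,
    sig: str,
    secret: bytes = SECRET,
    now: Optional[float] = None,
) -> bool:
    """Check expiry first, then compare signatures in constant time."""
    current = time.time() if now is None else now
    if current > expires:
        return False
    expected = _signature(image_id, expires, secret)
    return hmac.compare_digest(sig.encode(), expected.encode())
```

Because the expiry is part of the signed message, a client cannot extend a link by editing the `expires` parameter; the signature would no longer match.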
Module 09 · ~5–7 hrs

Observability & Performance

You can't debug what you can't see. Instrument: logs, metrics, traces. Monitor upload latency, processing time, CDN performance.

Tasks

  • Structured logs for every upload: file size, content-type, processing time, variants generated.
  • Metrics: uploads/sec, total storage used, variant generation time (p50, p95, p99), CDN hit rate.
  • Distributed tracing: instrument the full path (upload → validate → store → process variants).
  • Build a Grafana dashboard: upload volume, processing queue depth, variant generation latency, storage usage.
  • Define SLOs: e.g., 99.9% of uploads complete within 30 seconds (over 30 days).
Acceptance criteria
  • You can query logs for a single upload: validation → storage → variant generation.
  • Dashboard shows variant generation latency; you can identify bottlenecks.
  • SLO doc with error budget committed.
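To make the latency percentiles and the error budget concrete, here is a small sketch. It uses the nearest-rank definition of a percentile, which is one of several in common use and may differ slightly from what your metrics backend reports; the function names are hypothetical.

```python
from decimal import Decimal
from typing import Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p / 100 * n) in integer math
    return ordered[rank - 1]


def error_budget(total_uploads: int, slo_target: str = "0.999") -> int:
    """How many uploads may miss the SLO in the window before it is breached."""
    return int(total_uploads * (1 - Decimal(slo_target)))
```

With a 99.9% target, 1,000,000 uploads in the 30-day window leave a budget of 1,000 uploads that may fail or exceed 30 seconds.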
Module 10 · ~5–7 hrs

Testing & Resilience

Edge cases are everywhere: corrupted uploads, missing processing, quota exceeded. Test them all.

Tasks

  • Unit tests for validation: magic bytes, file size, EXIF parsing.
  • Integration tests: upload → validate → store → process → serve. Verify end-to-end.
  • Edge case tests: corrupted JPEG, 0-byte file, 50MB file, animated GIF, an AVIF upload when your processing library lacks AVIF support.
  • Chaos tests: kill blob storage, kill processing worker. Verify originals aren't lost.
  • Coverage threshold: 80% for validation and processing logic.
Acceptance criteria
  • Test suite runs in under 2 minutes.
  • You can describe 5 edge cases your tests cover (e.g., corrupted JPEG, missing processor).
  • Killing the processor doesn't lose uploads; jobs retry on restart.
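A sketch of what the validation unit tests might look like, using `unittest` from the standard library. `validate_upload` is a hypothetical stand-in for your real validator: it only recognizes JPEG and returns the status code a handler would send. The tests pass a tiny `max_bytes` so they do not need to allocate a 50MB payload.

```python
import unittest

MAX_BYTES = 50 * 1024 * 1024  # mirrors the 50MB example limit from Module 03
JPEG_MAGIC = b"\xff\xd8\xff"


def validate_upload(data: bytes, max_bytes: int = MAX_BYTES) -> int:
    """Return the HTTP status a hypothetical upload handler would send."""
    if len(data) == 0:
        return 400
    if len(data) > max_bytes:
        return 413
    if not data.startswith(JPEG_MAGIC):
        return 400
    return 201


class ValidateUploadTests(unittest.TestCase):
    def test_zero_byte_file_is_rejected(self):
        self.assertEqual(validate_upload(b""), 400)

    def test_oversized_file_gets_413(self):
        payload = JPEG_MAGIC + b"\x00" * 16
        self.assertEqual(validate_upload(payload, max_bytes=8), 413)

    def test_text_renamed_to_jpg_is_rejected(self):
        self.assertEqual(validate_upload(b"just some text"), 400)

    def test_minimal_jpeg_header_is_accepted(self):
        self.assertEqual(validate_upload(JPEG_MAGIC + b"\xe0"), 201)
```

Save it as a test module and run it with `python -m unittest`; each test name states the edge case it covers, which also helps with the "describe 5 edge cases" criterion.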
Module 11 · ~6–8 hrs

Capstone & Production Readiness

Ship it. Deploy, document, set up monitoring, go live.

Tasks

  • Write a multi-stage Dockerfile. Use docker-compose.yml for local dev (app + Postgres + worker + MinIO).
  • GitHub Actions: lint → test → build → push → deploy.
  • Deploy to production: Fly.io, Render, Railway, or a VM. Set up health checks and auto-rollback.
  • Write two runbooks: "storage quota exceeded" and "variant generation is slow".
  • Write a capacity planning guide: storage calculations, processing capacity, CDN costs.
  • Polish the README: architecture diagram, API examples, how to run locally, how to deploy.
Acceptance criteria
  • Every PR gets a preview URL; main deploys to production automatically.
  • A bad deploy auto-rolls back.
  • Runbooks + capacity planning guide committed to /docs.
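The storage part of the capacity planning guide comes down to simple arithmetic, sketched below. The function name is hypothetical and the inputs in the example (10,000 uploads per day, 2 MB average original, 35% variant overhead) are made-up numbers for illustration; substitute the figures you logged in Module 05.

```python
def monthly_storage_gb(
    uploads_per_day: int,
    avg_original_mb: float,
    variant_overhead: float,
    days: int = 30,
) -> float:
    """Storage added over the period, in decimal GB (1 GB = 1000 MB).

    variant_overhead is the combined size of all variants as a fraction of
    the original, e.g. 0.35 if variants add 35% on top of each original.
    """
    per_image_mb = avg_original_mb * (1 + variant_overhead)
    return uploads_per_day * days * per_image_mb / 1000


# Illustrative inputs only: 10,000 uploads/day, 2 MB originals, 35% overhead.
growth = monthly_storage_gb(10_000, 2.0, 0.35)
print(f"about {growth:.0f} GB of new storage per 30 days")
```

With those illustrative inputs the growth works out to roughly 810 GB per 30 days, which you can multiply by your provider's per-GB price to estimate the storage bill.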
After the Track

Where to Go Next

  • Stretch goals on the same project: watermarking, background blur, AI tagging, image search, video support, progressive uploads.
  • Read about media processing. Study libvips, FFmpeg, ML-based image processing; sketch how you'd add video transcoding.
  • Try the Webhook Delivery track to deepen your async systems knowledge.
  • Write about it. One blog post per module, especially on EXIF stripping, CDN cache invalidation, and variant generation.