Database Family · Search

Search Engines

Inverted indexes, analyzers, and ranking. Built to answer "find documents that match these tokens, in roughly the right order, fast" — and accidentally became the world's log database along the way.

Inverted index · Analyzers · BM25 · Faceting · Logs & observability
Quick Facts

At a Glance

Core Ideas

  • Inverted index: for each token, a sorted list of the documents that contain it. Lookups are fast because the index already has the answer (a toy sketch follows this list).
  • Analyzer: the pipeline that turns text into tokens — lowercase, strip punctuation, stem, fold accents, split on language rules.
  • Scoring: BM25 is the modern default. Documents with rarer terms and shorter fields rank higher.
  • Near-real-time: a write becomes searchable after a refresh interval (often 1s) — not instantly.
  • Schema-flexible but typed: mappings declare per-field analyzers and types. Get them wrong and ranking suffers silently.
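
A toy sketch of the first two ideas in Python, nothing like what Lucene actually stores on disk: a naive analyzer feeding a dict of posting lists. Real engines add positions, skip lists, compression, and the statistics BM25 needs.

```python
# Toy inverted index: token -> sorted list of doc ids that contain it.
# Illustration only; real engines add positions, skip lists, and compression.
from collections import defaultdict
import re

docs = {
    1: "The quick brown fox",
    2: "Quick brown dogs bark",
    3: "Lazy foxes sleep",
}

def analyze(text):
    """Minimal analyzer: lowercase and split on non-letters (no stemming)."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

index = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    for token in set(analyze(text)):
        index[token].append(doc_id)

def search(query):
    """AND query: intersect the posting lists of every query token."""
    postings = [set(index.get(t, [])) for t in analyze(query)]
    return sorted(set.intersection(*postings)) if postings else []

print(search("quick brown"))   # [1, 2]
```
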
The Engines

Who Plays Here

Elasticsearch

The category leader for a decade. Rich query DSL, aggregations, mature ecosystem (Beats, Logstash, Kibana). License changed in 2021 — read your terms.

OpenSearch

AWS' Apache-2.0 fork after the license change. Mostly a drop-in replacement for Elasticsearch 7.10; growing its own feature set since.

Apache Solr

The other Lucene-based veteran. Strong on faceting and configurable scoring; popular in libraries, ecommerce, government.

Meilisearch / Typesense

Modern, smaller, opinionated. Built for instant-search UIs — typo-tolerant, fast to set up, no operational sprawl.

Algolia

Hosted search-as-a-service. Polished SDKs, edge-replicated indexes. Pay per operation; great DX, watch the bill.

Postgres FTS / SQLite FTS5

Built-in full-text search. Often enough for moderate corpora before you need a dedicated engine.

When Search Wins

Where the Inverted Index Earns Its Keep

Full-Text Search Over a Catalog

Products, articles, support tickets, code. Tokens, stemming, synonyms, typo tolerance, phrase matching, fuzzy matching, language-aware analyzers. SQL LIKE never gets there; Postgres FTS does for moderate scale; a search engine is the answer past that.
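
A minimal sketch of such a query against the Elasticsearch/OpenSearch REST API, assuming a local node and a hypothetical products index with an analyzed title field:

```python
# Sketch: full-text match query with typo tolerance against a hypothetical
# "products" index on a local Elasticsearch/OpenSearch node.
import requests

body = {
    "query": {
        "match": {
            "title": {
                "query": "blu dress",   # note the typo
                "fuzziness": "AUTO",    # tolerate small edit distances
                "operator": "and",
            }
        }
    }
}

resp = requests.post("http://localhost:9200/products/_search", json=body)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```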

Faceted Search & Filters

"Show me dresses, blue, size M, under $100, in stock, sorted by relevance" — and aggregate the counts beside every facet so the UI shows "Blue (412), Red (89)…" Search engines compute facets natively in one query; SQL takes one query per facet.

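A sketch of that one-query pattern, again assuming a hypothetical products index whose facet fields (color, size) are mapped as keyword:

```python
# Sketch: filtered search plus facet counts in one round trip.
# Index name, field names, and the local node are assumptions.
import requests

body = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"category": "dresses"}},
                {"range": {"price": {"lte": 100}}},
                {"term": {"in_stock": True}},
            ]
        }
    },
    "aggs": {
        # Facet counts for the UI: "Blue (412), Red (89), ..."
        # Real UIs often move the user's own facet choices into post_filter
        # so each facet's counts ignore its own filter.
        "colors": {"terms": {"field": "color"}},
        "sizes": {"terms": {"field": "size"}},
    },
    "size": 20,
}

resp = requests.post("http://localhost:9200/products/_search", json=body).json()
for bucket in resp["aggregations"]["colors"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```
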
Logs, Metrics, Observability

The "ELK stack" became the default for centralized logging because Elasticsearch made grep-across-everything fast. Index by day, retain by policy, query in Kibana. Modern alternatives — ClickHouse, Loki, OpenSearch — compete here now.

Hybrid Lexical + Vector Search

BM25 finds the documents that contain the right words; vector embeddings find the documents that mean the same thing. Modern search engines do both and combine the scores (RRF, learned rankers). The standard pattern for RAG retrieval over enterprise corpora.
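
Reciprocal rank fusion is simple enough to sketch in a few lines of Python; it assumes you already have two ranked lists of ids, one from BM25 and one from a vector index, and cares only about ranks, never raw scores:

```python
# Sketch of reciprocal rank fusion (RRF): combine ranked lists of doc ids
# using only their ranks, so BM25 scores and cosine similarities never have
# to live on the same scale. k=60 is the commonly cited default constant.
from collections import defaultdict

def rrf(rankings, k=60):
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["doc3", "doc1", "doc7", "doc2"]  # lexical ranking (assumed)
vector_hits = ["doc1", "doc9", "doc3", "doc4"]  # embedding ranking (assumed)
print(rrf([bm25_hits, vector_hits]))            # doc1 and doc3 float to the top
```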

Geo Search

"Restaurants within 2km of here, sorted by distance, filtered by cuisine, ranked by review count." Geo points and shapes are first-class — geohash indexes make radius queries fast.

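A sketch of that query in Elasticsearch/OpenSearch terms, assuming a hypothetical restaurants index whose location field is mapped as geo_point:

```python
# Sketch: radius filter plus sort-by-distance. Index name, field names,
# and coordinates are placeholders; "location" must be mapped as geo_point.
import requests

here = {"lat": 52.52, "lon": 13.405}

body = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"cuisine": "thai"}},
                {"geo_distance": {"distance": "2km", "location": here}},
            ]
        }
    },
    "sort": [
        {"_geo_distance": {"location": here, "order": "asc", "unit": "km"}}
    ],
}

resp = requests.post("http://localhost:9200/restaurants/_search", json=body)
for hit in resp.json()["hits"]["hits"]:
    print(hit["sort"][0], "km:", hit["_source"]["name"])
```
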
When to Stay Away

Bad Fits

  • System of record. Search engines lose data more often than relational databases and lack foreign keys, constraints, and proper transactions. Index from a primary store; don't make the index the primary store.
  • Heavy updates to the same document. Every update is effectively a delete + reindex. Slow, expensive, fragments segments.
  • Strict consistency. Refresh intervals mean reads after writes can miss the new doc. Acceptable for search; not for "did the order go through."
  • Joins. Parent/child and nested docs exist but are awkward and slow. Denormalize at index time.
Modeling

Designing the Index

Mappings Matter

A field analyzed as text is searchable but not aggregatable. The same field as keyword is filterable and aggregatable but not full-text searchable. Typical pattern: a multi-field that is both — title (text) and title.keyword. Get this wrong on day one and reindexing later is painful.
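
A sketch of the multi-field pattern at index creation time, assuming a local node and a hypothetical articles index:

```python
# Sketch: create an index where "title" is both full-text searchable (text)
# and filterable/aggregatable via the "title.keyword" sub-field.
import requests

mapping = {
    "mappings": {
        "dynamic": "strict",            # reject fields you didn't declare
        "properties": {
            "title": {
                "type": "text",         # analyzed: match queries, BM25
                "fields": {
                    "keyword": {"type": "keyword", "ignore_above": 256}
                },
            },
            "published_at": {"type": "date"},
            "tags": {"type": "keyword"},
        },
    }
}

resp = requests.put("http://localhost:9200/articles", json=mapping)
print(resp.json())
```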

Analyzers per Language

English stemming applied to French content produces nonsense tokens. Set per-field analyzers when the corpus is multilingual; consider language detection and per-document analyzer routing.
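
One way to express that in a mapping, sketched here with the built-in english and french analyzers; which field gets populated per document is left to the indexing pipeline:

```python
# Sketch: separate fields per language, each with its own built-in analyzer.
# The indexing pipeline decides which field to populate for each document.
import requests

mapping = {
    "mappings": {
        "properties": {
            "body_en": {"type": "text", "analyzer": "english"},
            "body_fr": {"type": "text", "analyzer": "french"},
        }
    }
}
requests.put("http://localhost:9200/articles_multilang", json=mapping)
```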

Index Lifecycle

For logs and time-series: rolling indices (logs-2026.04.27), hot/warm/cold tiers, automatic deletion past retention. ILM (Index Lifecycle Management) does this declaratively. Without it, the cluster fills up and stops accepting writes.
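
A sketch of such a policy using Elasticsearch's ILM API (OpenSearch's equivalent, ISM, uses different syntax); the thresholds are illustrative, not recommendations:

```python
# Sketch: roll over the write index daily or at ~50 GB per primary shard,
# then delete indices 30 days after rollover. Values are illustrative.
import requests

policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_age": "1d",
                                 "max_primary_shard_size": "50gb"}
                }
            },
            "delete": {
                "min_age": "30d",
                "actions": {"delete": {}}
            },
        }
    }
}
requests.put("http://localhost:9200/_ilm/policy/logs-retention", json=policy)
```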

Reindex from Source

Always be able to rebuild the index from the source of truth. Schema changed, analyzer changed, sharding changed — reindex. If you can't, you've coupled too tightly.
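
What "rebuild from source" tends to look like in practice, sketched with the bulk API; the row source, index name, and document shape are all assumptions:

```python
# Sketch: rebuild a fresh index from the source of truth via the _bulk API.
# The primary-store query, index name, and document shape are assumptions.
import json
import requests

def fetch_rows():
    # Placeholder for "read everything from the system of record".
    yield {"id": 1, "title": "Blue dress", "price": 79}
    yield {"id": 2, "title": "Red shoes", "price": 120}

def bulk_lines(rows, index_name):
    for row in rows:
        yield json.dumps({"index": {"_index": index_name, "_id": row["id"]}})
        yield json.dumps(row)

payload = "\n".join(bulk_lines(fetch_rows(), "products_v2")) + "\n"
resp = requests.post(
    "http://localhost:9200/_bulk",
    data=payload.encode("utf-8"),
    headers={"Content-Type": "application/x-ndjson"},
)
print(resp.json()["errors"])  # False means every action succeeded
```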

Pitfalls

Common Mistakes

  • Mapping explosion. Indexing a JSON blob with arbitrary keys creates a field per key. The mapping balloons until the cluster crashes. Use dynamic: strict or flatten before indexing.
  • Too many small shards. Each shard has fixed overhead. A few hundred million docs in 200 shards is worse than the same in 20.
  • Deep pagination. from + size past 10k is brutal. Use search_after or scroll/PIT (sketched after this list).
  • Tuning relevance with code instead of analyzers. If the wrong tokens are in the index, no amount of query magic fixes it. Fix the analyzer first.
  • Not accounting for refresh latency in tests. Tests that index then immediately search are flaky. Force a refresh or wait.
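
The deep-pagination fix from the list above, sketched: sort on a stable key, then pass the last hit's sort values as search_after to fetch the next page. Index and field names are assumptions; id stands in for any unique, keyword-mapped tiebreaker field.

```python
# Sketch: keyset-style pagination with search_after instead of from/size.
# Requires a deterministic sort; a unique field breaks ties between pages.
import requests

def pages(index="products", page_size=100):
    search_after = None
    while True:
        body = {
            "size": page_size,
            "sort": [{"price": "asc"}, {"id": "asc"}],  # stable tiebreaker
            "query": {"match_all": {}},
        }
        if search_after is not None:
            body["search_after"] = search_after
        hits = requests.post(
            f"http://localhost:9200/{index}/_search", json=body
        ).json()["hits"]["hits"]
        if not hits:
            return
        yield [h["_source"] for h in hits]
        search_after = hits[-1]["sort"]     # resume after the last hit

for page in pages():
    print(len(page))
```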