Inverted indexes, analyzers, and ranking. Built to answer "find documents that match these tokens, in roughly the right order, fast" — and accidentally became the world's log database along the way.
The category leader for a decade. Rich query DSL, aggregations, mature ecosystem (Beats, Logstash, Kibana). License changed in 2021 — read your terms.
AWS' Apache-2.0 fork after the license change. Mostly drop-in for Elasticsearch up to 7.10 (the fork point); growing its own feature set since.
The other Lucene-based veteran. Strong on faceting and configurable scoring; popular in libraries, ecommerce, government.
Modern, smaller, opinionated. Built for instant-search UIs — typo-tolerant, fast to set up, no operational sprawl.
Hosted search-as-a-service. Polished SDKs, edge-replicated indexes. Pay per operation; great DX, watch the bill.
Built-in full-text search. Often enough for moderate corpora before you need a dedicated engine.
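A sketch of how far that gets you, assuming a hypothetical articles table with a stored tsvector column (search_vec) kept current by a trigger or generated column, plus a GIN index on it:

```python
import psycopg2  # pip install psycopg2-binary

# Hypothetical schema: articles(id, title, body, search_vec tsvector),
# search_vec maintained by a trigger or generated column, GIN-indexed.
conn = psycopg2.connect("dbname=app")
cur = conn.cursor()
cur.execute(
    """
    SELECT id, title, ts_rank(search_vec, q) AS rank
    FROM articles, websearch_to_tsquery('english', %s) AS q
    WHERE search_vec @@ q          -- index-backed match
    ORDER BY rank DESC             -- frequency-based relevance, not BM25
    LIMIT 20
    """,
    ("blue summer dress",),
)
for row in cur.fetchall():
    print(row)
```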
Products, articles, support tickets, code. Tokens, stemming, synonyms, typo tolerance, phrase matching, fuzzy matching, language-aware analyzers. SQL LIKE never gets there; Postgres FTS does for moderate scale; a search engine is the answer past that.
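Past that point, the equivalent against a search engine's REST API. A sketch assuming a local Elasticsearch/OpenSearch node on :9200 and a hypothetical products index:

```python
import requests  # talking straight to the REST API; official clients wrap the same calls

# Hypothetical index "products" with an analyzed "title" field.
# match runs the field's analyzer (tokenizing, stemming, lowercasing);
# fuzziness absorbs typos; synonyms come from the analyzer chain.
query = {
    "query": {
        "match": {
            "title": {
                "query": "summr dress",   # note the typo
                "fuzziness": "AUTO",      # edit distance scaled by term length
            }
        }
    },
    "size": 10,
}
resp = requests.post("http://localhost:9200/products/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])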
"Show me dresses, blue, size M, under $100, in stock, sorted by relevance" — and aggregate the counts beside every facet so the UI shows "Blue (412), Red (89)…" Search engines compute facets natively in one query; SQL takes one query per facet.
The "ELK stack" became the default for centralized logging because Elasticsearch made grep-across-everything fast. Index by day, retain by policy, query in Kibana. Modern alternatives — ClickHouse, Loki, OpenSearch — compete here now.
BM25 finds the documents that contain the right words; vector embeddings find the documents that mean the same thing. Modern search engines do both and combine the scores (RRF, learned rankers). The standard pattern for RAG retrieval over enterprise corpora.
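RRF itself is small enough to sketch in full; k=60 is the constant from the original paper, and the doc ids are placeholders:

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d)).

    `rankings` is a list of doc-id lists, each ordered best-first
    (e.g. one from BM25, one from vector similarity). k damps the
    influence of the very top ranks.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["doc3", "doc1", "doc7"]     # lexical: right words
vector_top = ["doc1", "doc9", "doc3"]   # semantic: right meaning
print(rrf([bm25_top, vector_top]))      # doc1 and doc3 rise: found by both
```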
"Restaurants within 2km of here, sorted by distance, filtered by cuisine, ranked by review count." Geo points and shapes are first-class — geohash indexes make radius queries fast.
A field analyzed as text is searchable but not aggregatable. The same field as keyword is filterable and aggregatable but not full-text searchable. Typical pattern: a multi-field that is both — title (text) and title.keyword. Get this wrong on day one and reindexing later is painful.
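What that mapping looks like, with hypothetical index and field names:

```python
import requests

# The standard multi-field mapping: one stored value, two indexings.
mapping = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text",              # analyzed: full-text searchable
                "fields": {
                    "keyword": {             # exact value: filter, sort, aggregate
                        "type": "keyword",
                        "ignore_above": 256, # skip indexing very long values
                    }
                },
            }
        }
    }
}
requests.put("http://localhost:9200/products", json=mapping)
# Search on "title"; filter and aggregate on "title.keyword".
```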
An English stemmer on French content produces nonsense tokens. Set per-field analyzers when the corpus is multilingual; consider language detection and per-document analyzer routing.
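A sketch of the per-field approach, assuming a hypothetical articles index with one field per language; routing each document's text into the right field (via upstream language detection) happens at index time:

```python
import requests

# "english" and "french" are built-in language analyzers: each brings
# its own tokenizer rules, stopwords, and stemmer.
mapping = {
    "mappings": {
        "properties": {
            "title_en": {"type": "text", "analyzer": "english"},
            "title_fr": {"type": "text", "analyzer": "french"},
        }
    }
}
requests.put("http://localhost:9200/articles", json=mapping)
```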
For logs and time-series: rolling indices (logs-2026.04.27), hot/warm/cold tiers, automatic deletion past retention. ILM (Index Lifecycle Management) does this declaratively. Without it, the cluster fills up and stops accepting writes.
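A minimal policy sketch; the timings and sizes are illustrative, not a recommendation, and OpenSearch's equivalent (ISM) uses a different API:

```python
import requests

# Roll over daily or at 50 GB, shrink after a week, delete after 30 days.
policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_age": "1d", "max_primary_shard_size": "50gb"}
                }
            },
            "warm": {
                "min_age": "7d",
                "actions": {"shrink": {"number_of_shards": 1}},
            },
            "delete": {
                "min_age": "30d",
                "actions": {"delete": {}},
            },
        }
    }
}
requests.put("http://localhost:9200/_ilm/policy/logs-default", json=policy)
```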
Always be able to rebuild the index from the source of truth. Schema changed, analyzer changed, sharding changed — reindex. If you can't, you've coupled too tightly.
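One common shape, sketched with hypothetical index names: rebuild under a versioned name, then atomically swap the alias the app actually queries. _reindex copies documents from an existing index; a true rebuild replays from the source of truth into the new index. The alias swap is the same either way:

```python
import requests

BASE = "http://localhost:9200"

# Copy into the new index (created beforehand with the new mapping/analyzer).
requests.post(f"{BASE}/_reindex", json={
    "source": {"index": "products_v1"},
    "dest": {"index": "products_v2"},
})

# Atomically repoint the alias; readers never see a half-built index.
requests.post(f"{BASE}/_aliases", json={
    "actions": [
        {"remove": {"index": "products_v1", "alias": "products"}},
        {"add": {"index": "products_v2", "alias": "products"}},
    ]
})
```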
Unbounded dynamic mappings bloat the cluster state: set dynamic: strict, or flatten objects before indexing. And deep pagination with from + size past 10,000 hits is brutal; use search_after or scroll/PIT instead.
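A search_after loop, sketched against a hypothetical logs index; it assumes a sortable @timestamp plus a unique keyword id field as tiebreaker, and wrapping it in a point-in-time (PIT) would keep the view consistent across pages:

```python
import requests

URL = "http://localhost:9200/logs/_search"
body = {
    "size": 1000,
    "query": {"match": {"message": "timeout"}},
    "sort": [{"@timestamp": "desc"}, {"id": "asc"}],  # unique tiebreaker required
}
cursor = None
while True:
    if cursor:
        body["search_after"] = cursor   # resume exactly after the last hit
    hits = requests.post(URL, json=body).json()["hits"]["hits"]
    if not hits:
        break
    for hit in hits:
        print(hit["_id"])
    cursor = hits[-1]["sort"]           # sort values of the last hit = next cursor
```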