System Design Fundamentals

Search Engines & Full-Text Search

Building Your First Search Experience

You’re launching an e-commerce platform with three million products. A customer types “comfortable running shoes for flat feet” and expects relevant results in milliseconds. Your first instinct? Query your PostgreSQL database with:

SELECT * FROM products WHERE description LIKE '%running shoes%' LIMIT 20;

This works fine for thousands of rows. For millions? You’re looking at a sequential scan. Performance tanks. Worse, the query misses synonyms (“sneakers,” “trainers”), misses typos (“runniing”), and has no concept of relevance: a product that merely mentions “running shoes” in a review ranks the same as an actual running shoe.

This is the gap that search engines fill. They’re not traditional databases. They’re purpose-built systems for finding needles in massive haystacks, ranking results by relevance, handling typos, and delivering results in milliseconds. Think of them as specialized siblings to the time-series databases we explored earlier—another example of a database system optimized for a specific workload.

In this chapter, we’ll explore how search engines work, why they’re different from your primary data store, and when (and when not) to use them.

What Full-Text Search Actually Is

Full-text search breaks free from exact matching. Instead of “does this row match this pattern?”, it asks “how relevant is this document to this query?” and ranks results accordingly.

Let’s start with the fundamental data structure: the inverted index. Where a conventional index maps each document to its contents, an inverted index flips the relationship:

word → [list of documents containing that word]

When you search for “running shoes,” the engine:

  1. Finds all documents containing “running”
  2. Finds all documents containing “shoes”
  3. Combines results (intersection, union, or phrase proximity depending on query type)
  4. Scores and ranks them using relevance algorithms (steps 1–3 are sketched in code below)
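
Here’s that flow as a toy sketch in Python. It’s deliberately minimal: real engines store sorted, compressed posting lists (with term positions for phrase queries), but the look-up-and-intersect idea is the same.

# Toy inverted index: term -> set of document IDs.
inverted_index = {
    "running": {1, 2, 5},
    "shoes":   {1, 2, 4},
    "rubber":  {3, 5},
}

def search_and(terms):
    """AND semantics: intersect the posting lists of every query term."""
    postings = [inverted_index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(search_and(["running", "shoes"]))  # {1, 2} -> candidates for scoring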

Before indexing, text goes through analysis:

  • Tokenization: Split “comfortable running shoes” into tokens: [“comfortable”, “running”, “shoes”]
  • Lowercasing: Ensure “Running” and “running” are the same token
  • Stop word removal: Words like “the,” “a,” “for” often don’t add value; skip them
  • Stemming/Lemmatization: Reduce words to their root form. “Running,” “runs,” “ran” all become “run”

This preprocessing is where search engines handle the quirks that make searching hard.
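
To see what analysis does, here is a deliberately naive analyzer in Python. The stop word list and suffix-stripping “stemmer” are toy stand-ins; production engines use curated language rules and real stemmers such as Snowball.

import re

STOP_WORDS = {"the", "a", "an", "for", "and", "of"}

def naive_stem(token):
    # Toy stemmer: strip a few English suffixes. Real stemmers (Porter,
    # Snowball) and lemmatizers are far more careful.
    for suffix in ("ning", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())      # tokenize + lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]  # drop stop words
    return [naive_stem(t) for t in tokens]               # stem

print(analyze("Comfortable running shoes for flat feet"))
# ['comfortable', 'run', 'shoe', 'flat', 'feet']
# (a real lemmatizer would also map "feet" -> "foot")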

The Library Card Catalog Analogy

Imagine a library before computers. To find books on gardening, you’d:

  1. Walk to the card catalog
  2. Look up “gardening” → see cards pointing to shelf locations
  3. Notice the catalog also points you to cards labeled “garden,” “gardens,” “horticulture” (all related)
  4. Pull books from those shelves

You didn’t read every book to find gardening books. You used an index.

An inverted index in a search engine is exactly this. The “cards” are index entries. The “shelves” are document IDs or positions. The catalog’s ability to connect “gardening” and “gardens”? That’s stemming. The cross-references between related terms? That’s synonym expansion.

The key insight: a search engine trades write complexity for read speed. Building that card catalog takes time (indexing). But finding books is instant (querying). In systems with millions of documents, this trade-off is essential.

Inside a Search Engine: Elasticsearch and Friends

Modern search engines like Elasticsearch, OpenSearch, and Apache Solr share common architecture:

The Cluster Model

┌─────────────────────────────────┐
│  Search Cluster                 │
├─────────────────────────────────┤
│  ┌──────────────┐              │
│  │ Node 1       │              │
│  │ ┌──────────┐ │              │
│  │ │ Shard 0  │ │              │
│  │ │ (Primary)│ │              │
│  │ └──────────┘ │              │
│  │ ┌──────────┐ │              │
│  │ │ Shard 2  │ │              │
│  │ │(Replica) │ │              │
│  │ └──────────┘ │              │
│  └──────────────┘              │
│                                 │
│  ┌──────────────┐              │
│  │ Node 2       │              │
│  │ ┌──────────┐ │              │
│  │ │ Shard 1  │ │              │
│  │ │ (Primary)│ │              │
│  │ └──────────┘ │              │
│  │ ┌──────────┐ │              │
│  │ │ Shard 0  │ │              │
│  │ │(Replica) │ │              │
│  │ └──────────┘ │              │
│  └──────────────┘              │
│                                 │
│  ┌──────────────┐              │
│  │ Node 3       │              │
│  │ ┌──────────┐ │              │
│  │ │ Shard 2  │ │              │
│  │ │ (Primary)│ │
│  │ └──────────┘ │              │
│  │ ┌──────────┐ │              │
│  │ │ Shard 1  │ │              │
│  │ │(Replica) │ │              │
│  │ └──────────┘ │              │
│  └──────────────┘              │
└─────────────────────────────────┘

An index is divided into shards (partitions), each living on a different node. Replicas provide redundancy. This is the scatter-gather pattern in action: a query hits every shard in parallel, each shard searches locally, and results merge at the coordinator node.

The Indexing Pipeline

When you index a document:

Document (raw JSON)
        ↓
Analyzer (language-specific rules)
        ↓
Tokenizer ("comfortable running shoes" → ["comfortable", "running", "shoes"])
        ↓
Token Filters (lowercase, remove stop words, stem)
  ├→ "comfortable" (pass through)
  ├→ "running" → "run" (stemmed)
  └→ "shoes" → "shoe" (stemmed)
        ↓
Inverted Index
  {
    "comfortable": [doc_1, doc_5, ...],
    "run": [doc_1, doc_2, doc_3, ...],
    "shoe": [doc_1, doc_2, doc_4, ...]
  }

This happens once at index time. Search is fast because we’re working with pre-processed data.
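
Index construction itself is a single pass over the corpus. A minimal sketch, with a trivial analyzer standing in for the full pipeline above:

from collections import defaultdict

def analyze(text):
    # Stand-in for the full analysis pipeline: tokenize + lowercase only.
    return text.lower().split()

def build_index(docs):
    """docs: {doc_id: text}. Returns term -> set of document IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

index = build_index({
    1: "comfortable running shoes",
    2: "running socks",
    3: "leather office shoes",
})
print(sorted(index["shoes"]))  # [1, 3]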

Relevance Scoring: TF-IDF and BM25

Two algorithms dominate relevance scoring:

TF-IDF (Term Frequency × Inverse Document Frequency):

Score(doc, term) = TF(term in doc) × IDF(term in collection)

Where:
TF = how often the term appears in this document
IDF = log(total documents / documents containing term)

If “running” appears 10 times in a product description and shows up in 500,000 of 1,000,000 products:

  • TF: 10 (or normalized: 10/total_words_in_doc)
  • IDF: log(1,000,000 / 500,000) = ln 2 ≈ 0.7
  • Score: 10 × 0.7 ≈ 7

Common words have low IDF (they’re in half of all docs), so they contribute less. Rare words have high IDF.
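
Redoing that arithmetic in code makes the effect of rarity obvious. A minimal sketch using raw counts and natural log (real implementations normalize TF and smooth IDF in various ways):

import math

def tf_idf(tf, total_docs, docs_with_term):
    idf = math.log(total_docs / docs_with_term)  # natural log, as above
    return tf * idf

# "running": 10 occurrences, present in 500,000 of 1,000,000 docs
print(round(tf_idf(10, 1_000_000, 500_000), 2))  # 6.93  (10 x ln 2)

# The same frequency for a rare term scores ten times higher:
print(round(tf_idf(10, 1_000_000, 1_000), 2))    # 69.08 (10 x ln 1000)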

BM25 (used by Elasticsearch by default) improves on TF-IDF:

Score(doc, term) = IDF(term) × (
  (TF(term) × (k1 + 1)) / (TF(term) + k1 × (1 - b + b × (doc_length / avg_doc_length)))
)

Where:
k1 ≈ 1.2 (saturation parameter—diminishing returns after a point)
b ≈ 0.75 (length normalization)

Why the complexity? Because in real data, term frequency doesn’t scale linearly. If “running shoes” appears 1 time, it’s important. If it appears 100 times, we shouldn’t boost the score 100x—that’s diminishing returns. BM25 bakes this in.
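
A direct transcription of the formula shows the saturation. This sketch assumes the IDF is already computed and scores a single term:

def bm25_term(tf, idf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    # Length normalization: long documents need more matches to score as high.
    norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)

# Average-length document, idf fixed at 1.0:
for tf in (1, 10, 100):
    print(tf, round(bm25_term(tf, idf=1.0, doc_len=100, avg_doc_len=100), 2))
# 1 1.0
# 10 1.96
# 100 2.17  <- 100x the occurrences, nowhere near 100x the score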

Pro Tip: Field boosting lets you say “matches in the product name count 10x more than matches in the description.” You’d configure: "name": { "type": "text", "boost": 10 }.

Query Types

Elasticsearch supports multiple query types:

// Match query: "running shoes" (flexible matching)
{
  "query": {
    "match": {
      "description": {
        "query": "running shoes",
        "operator": "and",
        "fuzziness": "AUTO"
      }
    }
  }
}

// Phrase query: exact phrase in order
{
  "query": {
    "match_phrase": {
      "description": "comfortable running shoes"
    }
  }
}

// Boolean query: combine multiple conditions
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "running shoes" } }
      ],
      "filter": [
        { "range": { "price": { "lte": 150 } } },
        { "term": { "in_stock": true } }
      ],
      "should": [
        { "match": { "brand": "Nike" } }
      ],
      "minimum_should_match": 0
    }
  }
}

// Fuzzy query: handles typos
{
  "query": {
    "fuzzy": {
      "product_name": {
        "value": "runniing",
        "fuzziness": "1"
      }
    }
  }
}

The bool query is powerful: must clauses must match (AND), should clauses boost if matched (OR), filter clauses must match but don’t affect scoring, and must_not excludes results.

Elasticsearch doesn’t make documents searchable instantly. At a configurable refresh interval (default: one second), newly written segments of the inverted index become searchable. This trade-off, accepting a short visibility lag in exchange for indexing throughput, is fundamental.

You can force immediate visibility with an explicit refresh (a flush, by contrast, persists segments to disk), but refreshing on every write is expensive. Most use cases accept the slight lag.
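
With recent versions of the official Python client, the trade-off looks like this; a sketch assuming a local cluster and a products index:

from elasticsearch import Elasticsearch  # official elasticsearch-py client

es = Elasticsearch("http://localhost:9200")

# Normal indexing: the document becomes searchable after the next
# refresh (default interval: one second).
es.index(index="products", id="1",
         document={"product_name": "trail running shoes"})

# For the rare flow that needs read-your-writes, block until the next
# refresh makes the document visible. Use sparingly: it adds latency.
es.index(index="products", id="2",
         document={"product_name": "road running shoes"},
         refresh="wait_for")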

Search Engine Comparison

| Aspect | Elasticsearch | OpenSearch | Apache Solr | Meilisearch | Typesense |
|---|---|---|---|---|---|
| Setup Complexity | Moderate | Moderate | High | Low | Low |
| Scaling Model | Horizontal (shards/replicas) | Horizontal | Horizontal | Limited | Limited |
| Query Language | JSON DSL | JSON DSL | Lucene syntax | Simple JSON | Simple JSON |
| Relevance Tuning | Advanced (BM25, field boost, function score) | Advanced | Advanced | Limited | Good |
| Full-Text Features | Excellent | Excellent | Excellent | Good | Good |
| Typo Tolerance | Yes (fuzzy) | Yes | Yes | Native | Native |
| Autocomplete | Via prefix queries | Via prefix queries | Via edge-gram filter | Native | Native |
| Cost at Scale | High (memory-intensive) | High | Moderate | Low | Low |
| Operational Overhead | High | High | High | Low | Low |
| Best For | Enterprise, complex search | Enterprise (Elasticsearch alternative) | Complex search, Lucene fans | Simple SaaS apps | Real-time, hosted |

When to use Meilisearch or Typesense: You’re building a SaaS product or need simple search without running a cluster, and you want defaults that work out of the box.

When to use Elasticsearch/OpenSearch: Enterprise features matter. You need fine-grained relevance tuning. You’re comfortable with operational complexity.

When to use Solr: You’re already in the Java ecosystem. You need advanced Lucene features.

Your index might look like:

{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "product_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product_name": {
        "type": "text",
        "analyzer": "product_analyzer",
        "boost": 10
      },
      "description": {
        "type": "text",
        "analyzer": "product_analyzer"
      },
      "brand": {
        "type": "keyword"
      },
      "category": {
        "type": "keyword"
      },
      "price": {
        "type": "float"
      },
      "in_stock": {
        "type": "boolean"
      },
      "rating": {
        "type": "float"
      },
      "created_at": {
        "type": "date"
      }
    }
  }
}

Notice: product_name is boosted 10x (matches there matter more), brand and category are keywords (exact matching, used for filtering, not scoring), and numeric fields enable range filters. Recent Elasticsearch versions deprecate the mapping-level boost parameter in favor of query-time boosting (the product_name^10 syntax used in the query below).

A complex query:

{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "comfortable running shoes",
            "fields": ["product_name^10", "description", "brand"],
            "type": "best_fields"
          }
        }
      ],
      "filter": [
        { "range": { "price": { "lte": 150 } } },
        { "term": { "in_stock": true } },
        { "terms": { "category": ["athletic", "footwear"] } }
      ],
      "should": [
        { "range": { "rating": { "gte": 4 } } }
      ]
    }
  },
  "aggs": {
    "brands": {
      "terms": { "field": "brand", "size": 10 }
    },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 50 },
          { "from": 50, "to": 100 },
          { "from": 100, "to": 150 }
        ]
      }
    }
  }
}

This query finds products matching the search terms, filters by price and availability, boosts highly-rated products, and returns facets (aggregations) for filtering.

Handling Typos: Fuzzy Matching and Edit Distance

Users type “runniing” instead of “running.” How do we still find relevant results?

Edit distance (Levenshtein distance) measures how many single-character edits (insertions, deletions, substitutions) are needed to transform one word into another:

"running" → "runniing" = 1 insertion = distance 1
"nike" → "nike" = 0 = distance 0
"shoe" → "shoe" = 0 = distance 0
"comfort" → "comfert" = 1 substitution = distance 1

Elasticsearch accepts fuzziness: 1 or fuzziness: 2 (or AUTO, which scales the allowed distance with term length) for typo tolerance. Higher values match more candidates, but queries get slower and results less relevant.

For autocomplete, consider edge-grams: index every prefix of each token.

"running shoes" →
Tokens: ["r", "ru", "run", "runn", "runni", "runnin", "running", "s", "sh", "sho", "shoe", "shoes"]

User types “runn” → instantly matches edge-gram, returns “running shoes” as top result.
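
Generating edge-grams is a one-liner per token; a sketch (production mappings usually set a min_gram of 2 or 3 so the index doesn’t balloon with single-character entries):

def edge_ngrams(text, min_gram=1, max_gram=10):
    """Index-time expansion: every prefix of every token."""
    grams = []
    for token in text.lower().split():
        for n in range(min_gram, min(max_gram, len(token)) + 1):
            grams.append(token[:n])
    return grams

print(edge_ngrams("running shoes"))
# ['r', 'ru', 'run', ..., 'running', 's', 'sh', 'sho', 'shoe', 'shoes']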

Distributed Search: The Scatter-Gather Pattern

When a query arrives:

Query Request
        ↓
Coordinator Node
  ├→ Send query to Shard 0 (on Node 1)
  ├→ Send query to Shard 1 (on Node 2)
  └→ Send query to Shard 2 (on Node 3)
        ↓
Each shard searches locally, returns top-K results
        ↓
Coordinator merges and re-ranks results
        ↓
Return top results to client

This is embarrassingly parallel—each shard is independent. With 5 shards, a query touching all of them runs in parallel, achieving single-digit millisecond latency even across millions of documents.
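
The coordinator’s merge step is just a k-way merge of per-shard top-K lists. A sketch with threads standing in for shards and a hypothetical shard_search function:

import heapq
from concurrent.futures import ThreadPoolExecutor

# Per-shard results as (score, doc_id), each already sorted by score.
SHARDS = [
    [(9.1, "doc_3"), (7.4, "doc_8")],
    [(8.7, "doc_1"), (6.2, "doc_9")],
    [(9.5, "doc_5"), (5.0, "doc_2")],
]

def shard_search(shard, query, k):
    """Stand-in for a shard's local search returning its top-k hits."""
    return shard[:k]

def scatter_gather(query, k=3):
    with ThreadPoolExecutor() as pool:  # scatter: query shards in parallel
        partials = pool.map(lambda s: shard_search(s, query, k), SHARDS)
    # Gather: merge the per-shard top-k lists into a global top-k.
    return heapq.nlargest(k, (hit for part in partials for hit in part))

print(scatter_gather("running shoes"))
# [(9.5, 'doc_5'), (9.1, 'doc_3'), (8.7, 'doc_1')]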

The Cost of Change: Updates and Deletes

Search engines don’t delete documents instantly. Elasticsearch uses soft deletes and segment merging:

  1. When you delete a document, it’s marked as deleted (not removed)
  2. Search queries skip deleted documents (they have a filter applied)
  3. Periodically, segments merge, physically removing deleted documents
  4. This process (garbage collection) frees disk space

Why not just delete immediately? Because segments are immutable. Deleting would require rewriting the entire segment, which is expensive. Soft deletes defer the cost.
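
A toy model of the mechanics, with a Python dict standing in for an immutable on-disk segment:

# One "segment" with a tombstone set. Deletes only mark; merges rewrite.
segment = {"postings": {"shoe": [1, 2, 4]}, "deleted": set()}

def delete(doc_id):
    segment["deleted"].add(doc_id)  # soft delete: O(1), segment untouched

def search(term):
    # Queries filter out tombstoned documents on the fly.
    return [d for d in segment["postings"].get(term, [])
            if d not in segment["deleted"]]

def merge():
    # Periodic merge: rewrite postings without deleted docs, free the space.
    # (Real merges write brand-new segment files; this mutates for brevity.)
    segment["postings"] = {t: [d for d in ids if d not in segment["deleted"]]
                           for t, ids in segment["postings"].items()}
    segment["deleted"].clear()

delete(2)
print(search("shoe"))               # [1, 4]: doc 2 filtered at query time
merge()
print(segment["postings"]["shoe"])  # [1, 4]: now physically removed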

Did You Know? If you have a high update rate (say, millions of updates per day), the garbage collection overhead can become significant. Some teams run nightly reindex jobs to rebuild the index from a source-of-truth database.

PostgreSQL Full-Text Search: A Simpler Alternative

For smaller datasets or simpler requirements, PostgreSQL’s built-in tsvector and tsquery might suffice:

-- Create a full-text search column
ALTER TABLE products ADD COLUMN search_index tsvector;

-- Populate it with tokenized text
UPDATE products SET search_index =
  to_tsvector('english', product_name || ' ' || description);

-- Create a GIN index for fast lookups (generally preferred over GiST
-- for full-text search)
CREATE INDEX idx_search ON products USING GIN (search_index);

-- Query (bind the tsquery once so the WHERE clause and ts_rank share it)
SELECT p.*
FROM products p,
     to_tsquery('english', 'running & shoes') AS query
WHERE p.search_index @@ query
ORDER BY ts_rank(p.search_index, query) DESC;

Pros: No external dependency, transactional consistency, simple.

Cons: Limited to single node (no sharding), slower on millions of documents, weaker relevance tuning, no fuzzy matching or autocomplete by default.

For e-commerce with millions of products, you’d outgrow PostgreSQL full-text search. For an internal tools dashboard with thousands of logs? It’s perfect.

Key Takeaways

  • Inverted indexes flip the lookup: word → documents, enabling fast full-text search at scale
  • Analysis (tokenization, stemming, lowercasing) handles language complexity before indexing
  • BM25 scoring balances term frequency and rarity, outperforming simpler TF-IDF in real data
  • Sharding and replicas let you scale searches horizontally across a cluster
  • Near-real-time indexing (refresh intervals) trades immediate consistency for throughput
  • Choose your engine based on scale and complexity: PostgreSQL for small datasets, Meilisearch/Typesense for simple SAAS, Elasticsearch/OpenSearch for enterprise

Practice Scenarios

Scenario 1: Autocomplete at Scale

You need to autocomplete product searches while users type. 10 million products, 1,000 QPS, latency under 50ms. Design the indexing strategy (edge-grams vs. prefix queries) and query execution (single-node coordinator vs. distributed). What happens if autocomplete queries start hitting different shards for the same prefix? How do you optimize?

Scenario 2: Multi-Language Search

Your platform operates in 8 languages. Users search for products in any language, expecting relevant results even for synonyms and typos. A user searching for “chaussure” (French for shoe) should find “shoe” results. Design the analyzer configuration, consider language-specific stemming, and decide: one index with language field, or separate indices per language? Trade-offs?

Scenario 3: Search Relevance Tuning

Your e-commerce search team reports: “Results for ‘running shoes’ show office supplies before athletic shoes.” Investigate: what’s likely wrong? (Hint: consider field boosting, TF-IDF skew, or filter/must vs. should logic.) How do you debug relevance?

What’s Next

Search engines excel at full-text, but struggle with relationships. “Find products similar to this one” or “show me customers who bought related items” requires traversing connections—the domain of graph databases. Next chapter, we’ll explore how to model and query interconnected data efficiently, from social networks to recommendation engines.