Search Autocomplete System
The Problem: Instant Suggestions at Scale
Type “goo” into Google, and within 50 milliseconds you see “google”, “gmail”, “google translate”, “google maps” — all ranked by popularity. This is search autocomplete (typeahead), and it’s a deceptively complex problem.
The challenge: you need to serve billions of queries per day with under 100ms latency. The user is typing character by character, so you’re not handling 10B queries — you’re handling 10B × (average query length of 4 characters) = 40B autocomplete prefix matches per day. That’s 460K requests per second.
You cannot query a database for every keystroke. You cannot even do a full text search. You need pre-computed, in-memory data structures and aggressive caching.
Functional and Non-Functional Requirements
Functional requirements:
- Return top 5-10 suggestions as user types each character
- Suggestions ranked by popularity and relevance
- Support trending queries (top queries in the last hour)
- Optional: Personalized suggestions based on user’s search history
- Handle typos and spell correction (advanced)
Non-functional requirements:
- Latency: under 100ms per keystroke (p99)
- Throughput: ~10B search queries/day, which becomes ~40B autocomplete prefix requests/day
- Freshness: Suggestions update periodically (e.g., trending topics refresh hourly)
- Availability: 99.95% uptime (if autocomplete is down, the experience is degraded, not broken)
- Data size: Top 1M queries fit in memory; full query log in data warehouse
Scale Estimation
Let’s do the math carefully:
User query volume:
- 10B queries/day
- Average query length: 4 characters
- Each character triggers an autocomplete request
- 40B autocomplete requests/day = ~460K requests/second average
- Peak: ~1.5M requests/second (during peak hours)
Popular queries:
- Zipfian distribution: top 1% of queries account for 60% of traffic
- Top 1M queries cover ~95% of all autocomplete traffic
- The hottest prefixes (on the order of 100K) fit easily in a small in-memory cache on each server
Trending queries:
- Update every 15 minutes or hourly
- Top 1K to 10K trending queries at any time
- Much smaller than all-time top queries
The Trie Data Structure
The Trie (prefix tree) is the fundamental insight for this problem. It’s the natural data structure for efficiently finding all strings that start with a given prefix.
graph TD
root["root"]
g["g<br/>(5.2M queries)"]
go["go<br/>(3.1M queries)"]
goo["goo<br/>(2.8M queries)"]
goog["goog<br/>(2.7M queries)"]
googl["googl<br/>(2.6M queries)"]
c["c<br/>(1.2M queries)"]
ca["ca<br/>(800K queries)"]
cat["cat<br/>(600K queries)"]
root --> g
root --> c
g --> go
go --> goo
goo --> goog
goog --> googl
c --> ca
ca --> cat
goo_top5["google, google maps, google translate, google docs, google sheets"]
cat_top5["cat, cats, cat videos, cat memes, caterpillar"]
goo -.->|top_5| goo_top5
cat -.->|top_5| cat_top5
Key insight: Each node stores:
- Character
- Children (pointers to next nodes)
- Query count (frequency)
- Top-K suggestions (pre-computed list of top 5-10 strings with this prefix)
This is crucial: we don’t compute the top suggestions at query time. We pre-compute them when building the trie. So when the user types “goo”, we just look up the node, and top-5 suggestions are already there.
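A minimal sketch of this node layout and lookup in plain JavaScript (illustrative only; the field names are assumptions, not a specific library's API):

class TrieNode {
  constructor() {
    this.children = new Map(); // char -> TrieNode
    this.count = 0;            // total query traffic passing through this prefix
    this.topK = [];            // pre-computed [query, count] pairs, highest count first
  }
}

// Lookup is O(p) in the prefix length: follow one child pointer per character,
// then return the suggestions already stored on the final node.
function lookup(root, prefix) {
  let node = root;
  for (const ch of prefix) {
    node = node.children.get(ch);
    if (!node) return []; // no known queries with this prefix
  }
  return node.topK;
}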
High-Level Architecture
graph LR
Client["Client<br/>(Browser/Mobile)"]
CDN["CDN<br/>(Cloudflare)"]
APIGw["API Gateway"]
AutocompleteService["Autocomplete Service<br/>(Trie lookup)"]
TrieMemory["Trie<br/>(In-Memory or Redis)"]
TrendingService["Trending Service<br/>(Real-time analytics)"]
QueryLogger["Query Logger<br/>(Kafka)"]
OfflineProcessor["Offline Processor<br/>(Spark/MapReduce)"]
TrieBuilder["Trie Builder"]
Client -->|GET /autocomplete?q=goo| CDN
CDN --> APIGw
APIGw --> AutocompleteService
AutocompleteService --> TrieMemory
AutocompleteService --> TrendingService
AutocompleteService -.->|log queries| QueryLogger
QueryLogger -.->|query events| OfflineProcessor
OfflineProcessor -.->|top 1M queries| TrieBuilder
TrieBuilder -.->|rebuilt trie, every 15-60 min| TrieMemory
The flow:
- Client types a character and fires autocomplete request (debounced)
- CDN caches responses for popular prefixes
- Autocomplete Service does a simple trie lookup
- Trie returns top-K suggestions instantly
- Trending Service overlays real-time trending queries
- Offline pipeline (Spark/Hadoop) rebuilds the trie from query logs hourly
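A hedged sketch of the serving path (steps 3-4), assuming an Express-style HTTP service; lookup() is the trie walk sketched earlier, and trieRoot is assumed to be loaded into memory at startup:

const express = require('express');
const app = express();

app.get('/autocomplete', (req, res) => {
  const prefix = (req.query.q || '').toLowerCase();
  if (!prefix) return res.json([]);

  // trieRoot: in-memory trie built by the offline pipeline; lookup() is O(prefix length).
  const allTime = lookup(trieRoot, prefix);

  // In the full design, trending suggestions are merged in here (see "Updating for Trending").
  res.json(allTime.map(([query]) => query));
});

app.listen(8080);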
Deep Dive: Trie Construction and Updates
Building the Trie Offline
The trie is built offline from historical query data, not in real-time:
1. Data collection: Log all queries to Kafka
- query: "coffee"
- timestamp: 1707043200
- user_id: optional (for personalization)
2. Batch processing (hourly):
- Stream queries from Kafka to Spark
- Aggregate: GROUP BY query, COUNT(*) to get frequency
- Output: (query, count) for top 1M queries
- Example:
("google", 5_200_000)
("google maps", 1_800_000)
("gmail", 1_500_000)
...
3. Trie construction:
- For each (query, count) in sorted order:
- Insert into trie, store count at node
- Once inserts are done, compute top-K for each prefix node (see the sketch after this list)
- Example node "go":
top_5 = [
("google", 5200000),
("google maps", 1800000),
("google docs", 900000),
("google photos", 800000),
("go to meeting", 700000)
]
4. Serialize trie:
- Trie size for 1M queries: ~500 MB
- Serialize to binary format or JSON
- Store in Redis or in-memory cache
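A sketch of steps 2-3 in plain JavaScript, reusing the TrieNode sketch from the trie section (in production the aggregation runs as a Spark job; this only illustrates the logic):

const TOP_K = 5;

// queryCounts: [["google", 5200000], ["google maps", 1800000], ...] from the batch job
function buildTrie(queryCounts) {
  const root = new TrieNode();
  for (const [query, count] of queryCounts) {
    let node = root;
    for (const ch of query) {
      if (!node.children.has(ch)) node.children.set(ch, new TrieNode());
      node = node.children.get(ch);
      node.count += count; // every prefix of the query sees this traffic
    }
    node.isQuery = true;
    node.queryCount = count;
  }
  computeTopK(root, '');
  return root;
}

// Post-order walk: a node's top-K is the best of its own query plus its children's top-Ks.
function computeTopK(node, prefix) {
  let candidates = node.isQuery ? [[prefix, node.queryCount]] : [];
  for (const [ch, child] of node.children) {
    computeTopK(child, prefix + ch);
    candidates = candidates.concat(child.topK);
  }
  node.topK = candidates.sort((a, b) => b[1] - a[1]).slice(0, TOP_K);
}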
Updating for Trending
Trending queries update faster (every 15 minutes):
1. Real-time aggregation (last 15 min):
- Use lightweight approximate counters (e.g., a count-min sketch) for per-query frequency counts
- For prefix "c", compute trending queries:
trending_now = [
("covid", 50_000),
("crypto", 48_000),
("coffee", 45_000)
]
2. Merge with trie suggestions:
- When user types "c", combine:
- Trie top-5 for "c" (all-time popular)
- Trending top-5 for "c" (current hour)
- Deduplicate, re-rank
- Return top-5 (see the merge sketch below)
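A sketch of the merge step; the boost factor for trending entries is an assumption, and a production system would tune the ranking:

function mergeSuggestions(allTime, trending, k = 5, trendingBoost = 3) {
  const scores = new Map();
  for (const [query, count] of allTime) {
    scores.set(query, count);
  }
  for (const [query, count] of trending) {
    // Boost recent counts so a spike in the last 15 minutes can outrank all-time favorites.
    scores.set(query, Math.max(scores.get(query) || 0, count * trendingBoost));
  }
  // Deduplicated by the Map; re-rank by score and keep the top k.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, k)
    .map(([query]) => query);
}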
Trie Node Serialization
{
  "char": "g",
  "count": 5200000,
  "children": {
    "o": {...},
    "i": {...},
    "a": {...}
  },
  "top_k": [
    {
      "query": "google",
      "count": 5200000
    },
    {
      "query": "google maps",
      "count": 1800000
    },
    ...
  ]
}
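A sketch of the serialize/load round trip (file paths and the refresh cadence are placeholders; a compact binary format would shrink the footprint further):

const fs = require('fs');

// Offline: dump the built trie once per rebuild. Maps are converted to plain objects for JSON.
function serializeTrie(root, path) {
  const json = JSON.stringify(root, (key, value) =>
    value instanceof Map ? Object.fromEntries(value) : value);
  fs.writeFileSync(path, json);
}

// Online: the autocomplete service loads the latest snapshot at startup (or on a refresh timer)
// and rebuilds the child Maps so lookup() works unchanged.
function loadTrie(path) {
  const revive = (node) => {
    node.children = new Map(
      Object.entries(node.children).map(([ch, child]) => [ch, revive(child)]));
    return node;
  };
  return revive(JSON.parse(fs.readFileSync(path, 'utf8')));
}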
Client-Side Optimizations
The server can only do so much. The client must also optimize:
Debouncing
Don’t fire a request on every keystroke. Wait 100-200ms after the user stops typing:
// searchInput is the text <input> element being watched
let debounceTimer;
searchInput.addEventListener('input', (event) => {
  clearTimeout(debounceTimer);
  debounceTimer = setTimeout(() => {
    fetch(`/autocomplete?q=${encodeURIComponent(event.target.value)}`)
      .then(response => response.json())
      .then(data => displaySuggestions(data));
  }, 150); // 150ms debounce
});
Impact: Reduces requests from 40B/day to 10-15B/day.
Browser Caching
Cache recent suggestions in the browser:
const suggestionCache = {};

async function getSuggestions(prefix) {
  if (suggestionCache[prefix]) {
    return suggestionCache[prefix]; // hit: e.g. the user backspaced to a prefix seen before
  }
  const response = await fetch(`/autocomplete?q=${encodeURIComponent(prefix)}`);
  const suggestions = await response.json();
  suggestionCache[prefix] = suggestions;
  return suggestions;
}
Prefetching
After the user types “ca”, prefetch suggestions for “cat”, “car”, “cab”:
const prefix = userInput; // "ca"
const likelyNextChars = ['t', 'r', 'b', 'p']; // extend with other likely next characters
for (const char of likelyNextChars) {
  getSuggestions(`${prefix}${char}`); // warms the browser cache defined above
}
This masks network latency.
Scaling Considerations
Shard the Trie by Prefix Range
A single trie covering the top 1M queries is ~500 MB, which fits comfortably in one server’s memory. To cover more queries and spread request load, shard the trie across servers: with 10 shards the cluster holds roughly ten times as much trie data and each shard serves only a slice of the traffic.
Shard by prefix range:
Trie shard 1: a-c (queries starting with a, b, or c)
Trie shard 2: d-g
Trie shard 3: h-m
Trie shard 4: n-s
Trie shard 5: t-z
Route requests:
GET /autocomplete?q=cat -> shard 1
GET /autocomplete?q=google -> shard 2
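A sketch of the routing logic at the gateway; the shard boundaries follow the table above, and the host names are placeholders:

const SHARDS = [
  { range: 'abc',     host: 'trie-shard-1' },
  { range: 'defg',    host: 'trie-shard-2' },
  { range: 'hijklm',  host: 'trie-shard-3' },
  { range: 'nopqrs',  host: 'trie-shard-4' },
  { range: 'tuvwxyz', host: 'trie-shard-5' },
];

function routeToShard(prefix) {
  const first = (prefix[0] || '').toLowerCase();
  const shard = SHARDS.find(s => s.range.includes(first));
  return shard ? shard.host : SHARDS[0].host; // digits/symbols fall back to a default shard
}

// routeToShard('cat')    -> 'trie-shard-1'
// routeToShard('google') -> 'trie-shard-2'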
Replication and Redundancy
Each shard is replicated for high availability:
Shard 1 primary: server-1a
Shard 1 replica: server-1b, server-1c
If server-1a fails, traffic routes to server-1b or server-1c.
Multi-Layer Caching
Autocomplete responses are highly cacheable because suggestions don’t change much:
Layer 1: Browser cache (500 MB, user's device)
Layer 2: CDN edge (10 GB per POP, cached globally)
Layer 3: Application cache (Redis, 100 GB)
Layer 4: Trie (in-memory, 10 GB per shard)
Layer 5: Database (query logs, data warehouse)
Effectiveness: The top ~1,000 prefixes (mostly short ones such as “a”, “go”, “how”) serve ~80% of traffic and can be CDN-cached globally.
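For the browser and CDN layers to help, the service only needs to mark responses as cacheable; continuing the Express-style sketch above, with illustrative TTLs:

app.get('/autocomplete', (req, res) => {
  const prefix = (req.query.q || '').toLowerCase();
  const suggestions = lookup(trieRoot, prefix);
  // Browsers may reuse this for 1 minute, shared caches (CDN) for 15 minutes,
  // so repeated popular prefixes never reach the trie servers.
  res.set('Cache-Control', 'public, max-age=60, s-maxage=900');
  res.json(suggestions);
});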
Distributed Trie in Redis
If in-memory is not enough, store the trie in Redis:
Key: "trie:node:{prefix}"
Value: JSON with node data and top-K
GET trie:node:go -> returns top-10 for "go"
GET trie:node:google -> returns top-10 for "google"
Lookup becomes a Redis GET instead of an in-process memory read. Latency rises from roughly 5ms to 20ms but stays well under the 100ms budget.
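A sketch of the Redis-backed lookup using node-redis; the key naming follows the scheme above, and the connection URL is a placeholder:

const { createClient } = require('redis');

const redis = createClient({ url: 'redis://trie-cache:6379' });
// Call `await redis.connect()` once at service startup.

async function lookupFromRedis(prefix) {
  const raw = await redis.get(`trie:node:${prefix}`); // one GET per keystroke
  if (!raw) return [];
  return JSON.parse(raw).top_k; // top-K was pre-computed at trie build time
}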
Trade-offs and Design Decisions
| Dimension | Real-Time Updates | Periodic Rebuilds | Our Choice |
|---|---|---|---|
| Freshness | Up-to-date immediately | 15-60 min lag | Periodic rebuilds |
| Latency | Higher (update overhead) | Lowest | Periodic rebuilds |
| Complexity | Very high | Simple | Periodic rebuilds |
| Resource usage | High CPU/memory | Low | Periodic rebuilds |
Why periodic rebuilds win: Autocomplete trends don’t change that fast. If Google Trends shows “covid” trending now, 15 minutes later it’s still trending. Users don’t need real-time updates.
| Dimension | In-Memory Trie | Redis Trie | Our Choice |
|---|---|---|---|
| Latency | 5-10ms | 20-30ms | In-memory |
| Availability | Fails if instance restarts | Survives restarts | In-memory (with RTO plan) |
| Scalability | Limited by RAM per node | Scales beyond a single machine | In-memory with sharding |
| Cost | Cheap (single machine) | Expensive (Redis cluster) | In-memory |
Our choice: In-memory with replication. Trade off availability slightly (rebuild trie on restart) for speed and cost.
| Dimension | Personalized suggestions | Global suggestions | Our Choice |
|---|---|---|---|
| Relevance | Tailored to the user | Generic | Global + optional personalization |
| Latency | Slow (fetch user history) | Fast | Global |
| Complexity | Very high | Simple | Global |
| User experience | Better for power users | Same for all | Global (80/20 rule) |
Why global wins: 80% of users just want the most popular suggestions. The effort to personalize isn’t worth it for most.
Key Takeaways
- Trie is the right data structure: A hash map keyed on full queries can’t do prefix matching without scanning every key. A trie gives prefix lookup in O(p), where p is the prefix length.
- Pre-compute, don’t compute: Store top-K at each node. Don’t compute on every request.
- Offline processing scales better: Build trie from batch data hourly. Don’t update in real-time.
- Multi-layer caching is essential: Browser cache, CDN, app cache, trie. Each layer handles a different scale.
- Client-side optimization matters: Debouncing and prefetching reduce server load by 50-80%.
Practice Exercise
Extend this design to support:
- Multi-language: English, Spanish, Mandarin suggestions in separate tries or one unified trie?
- Spell correction: User types “seearch”. Should we suggest “search”?
- Personalization: After global suggestions, append user’s 3 most recent searches.
Next section: We’ve covered real-world system design walkthroughs. Let’s explore the patterns that emerge across all of them — how do we recognize when to use caching, queuing, or sharding?