Search Autocomplete System
The Problem: Instant Suggestions at Scale
Type “goo” into Google, and within 50 milliseconds you see “google”, “gmail”, “google translate”, “google maps” — all ranked by popularity. This is search autocomplete (typeahead), and it’s a deceptively complex problem.
The challenge: you need to serve billions of queries per day with under 100ms latency. The user is typing character by character, so you’re not handling 10B queries — you’re handling 10B × (average query length of 4 characters) = 40B autocomplete prefix matches per day. That’s 460K requests per second.
You cannot query a database for every keystroke. You cannot even do a full text search. You need pre-computed, in-memory data structures and aggressive caching.
Functional and Non-Functional Requirements
Functional requirements:
- Return top 5-10 suggestions as user types each character
- Suggestions ranked by popularity and relevance
- Support trending queries (top queries in the last hour)
- Optional: Personalized suggestions based on user’s search history
- Handle typos and spell correction (advanced)
Non-functional requirements:
- Latency: under 100ms per keystroke (p99)
- Throughput: ~10B search queries/day, which becomes ~40B autocomplete prefix requests/day
- Freshness: Suggestions update periodically (e.g., trending topics refresh hourly)
- Availability: 99.95% uptime (if autocomplete is down, the experience is degraded, not broken)
- Data size: Top 1M queries fit in memory; full query log in data warehouse
Scale Estimation
Let’s do the math carefully:
User query volume:
- 10B queries/day
- Average query length: 4 characters
- Each character triggers an autocomplete request
- 40B autocomplete requests/day = ~460K requests/second average
- Peak: ~1.5M requests/second (during peak hours)
Popular queries:
- Zipfian distribution: top 1% of queries account for 60% of traffic
- Top 1M queries cover ~95% of all autocomplete traffic
- The hottest prefixes (on the order of 100K) fit easily in a small in-memory cache on each server
Trending queries:
- Update every 15 minutes or hourly
- Top 1K to 10K trending queries at any time
- Much smaller than all-time top queries
The Trie Data Structure
The Trie (prefix tree) is the fundamental insight for this problem. It’s the natural data structure for efficiently finding all strings that start with a given prefix.
graph TD
root["root"]
g["g<br/>(5.2M queries)"]
go["go<br/>(3.1M queries)"]
goo["goo<br/>(2.8M queries)"]
goog["goog<br/>(2.7M queries)"]
googl["googl<br/>(2.6M queries)"]
c["c<br/>(1.2M queries)"]
ca["ca<br/>(800K queries)"]
cat["cat<br/>(600K queries)"]
root --> g
root --> c
g --> go
go --> goo
goo --> goog
goog --> googl
c --> ca
ca --> cat
goo_top5["google, google maps, google translate, google docs, google sheets"]
cat_top5["cat, cats, cat videos, cat memes, caterpillar"]
goo -.->|top_5| goo_top5
cat -.->|top_5| cat_top5
Key insight: Each node stores:
- Character
- Children (pointers to next nodes)
- Query count (frequency)
- Top-K suggestions (pre-computed list of top 5-10 strings with this prefix)
This is crucial: we don’t compute the top suggestions at query time. We pre-compute them when building the trie. So when the user types “goo”, we just look up the node, and top-5 suggestions are already there.
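A minimal sketch of this node layout and lookup in plain JavaScript (illustrative only; the field names are assumptions, not a specific library's API):

class TrieNode {
  constructor() {
    this.children = new Map(); // char -> TrieNode
    this.count = 0;            // total query traffic passing through this prefix
    this.topK = [];            // pre-computed [query, count] pairs, highest count first
  }
}

// Lookup is O(p) in the prefix length: follow one child pointer per character,
// then return the suggestions already stored on the final node.
function lookup(root, prefix) {
  let node = root;
  for (const ch of prefix) {
    node = node.children.get(ch);
    if (!node) return []; // no known queries with this prefix
  }
  return node.topK;
}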
High-Level Architecture
graph LR
Client["Client<br/>(Browser/Mobile)"]
CDN["CDN<br/>(Cloudflare)"]
APIGw["API Gateway"]
AutocompleteService["Autocomplete Service<br/>(Trie lookup)"]
TrieMemory["Trie<br/>(In-Memory or Redis)"]
TrendingService["Trending Service<br/>(Real-time analytics)"]
QueryLogger["Query Logger<br/>(Kafka)"]
OfflineProcessor["Offline Processor<br/>(Spark/MapReduce)"]
TrieBuilder["Trie Builder"]
Client -->|GET /autocomplete?q=goo| CDN
CDN --> APIGw
APIGw --> AutocompleteService
AutocompleteService --> TrieMemory
AutocompleteService --> TrendingService
AutocompleteService -.->|log queries| QueryLogger
QueryLogger -.->|query events| OfflineProcessor
OfflineProcessor -.->|top 1M queries| TrieBuilder
TrieBuilder -.->|rebuilt trie, every 15-60 min| TrieMemory
The flow:
- Client types a character and fires autocomplete request (debounced)
- CDN caches responses for popular prefixes
- Autocomplete Service does a simple trie lookup
- Trie returns top-K suggestions instantly
- Trending Service overlays real-time trending queries
- Offline pipeline (Spark/Hadoop) rebuilds the trie from query logs hourly
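A hedged sketch of the serving path (steps 3-4), assuming an Express-style HTTP service; lookup() is the trie walk sketched earlier, and trieRoot is assumed to be loaded into memory at startup:

const express = require('express');
const app = express();

app.get('/autocomplete', (req, res) => {
  const prefix = (req.query.q || '').toLowerCase();
  if (!prefix) return res.json([]);

  // trieRoot: in-memory trie built by the offline pipeline; lookup() is O(prefix length).
  const allTime = lookup(trieRoot, prefix);

  // In the full design, trending suggestions are merged in here (see "Updating for Trending").
  res.json(allTime.map(([query]) => query));
});

app.listen(8080);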
Deep Dive: Trie Construction and Updates
Building the Trie Offline
The trie is built offline from historical query data, not in real-time:
1. Data collection: Log all queries to Kafka
- query: "coffee"
- timestamp: 1707043200
- user_id: optional (for personalization)
2. Batch processing (hourly):
- Stream queries from Kafka to Spark
- Aggregate: GROUP BY query, COUNT(*) to get frequency
- Output: (query, count) for top 1M queries
- Example:
("google", 5_200_000)
("google maps", 1_800_000)
("gmail", 1_500_000)
...
3. Trie construction:
- For each (query, count) in sorted order:
- Insert into trie, store count at node
- Once inserts are done, compute top-K for each prefix node (see the sketch after this list)
- Example node "go":
top_5 = [
("google", 5200000),
("google maps", 1800000),
("google docs", 900000),
("google photos", 800000),
("go to meeting", 700000)
]
4. Serialize trie:
- Trie size for 1M queries: ~500 MB
- Serialize to binary format or JSON
- Store in Redis or in-memory cache
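A sketch of steps 2-3 in plain JavaScript, reusing the TrieNode sketch from the trie section (in production the aggregation runs as a Spark job; this only illustrates the logic):

const TOP_K = 5;

// queryCounts: [["google", 5200000], ["google maps", 1800000], ...] from the batch job
function buildTrie(queryCounts) {
  const root = new TrieNode();
  for (const [query, count] of queryCounts) {
    let node = root;
    for (const ch of query) {
      if (!node.children.has(ch)) node.children.set(ch, new TrieNode());
      node = node.children.get(ch);
      node.count += count; // every prefix of the query sees this traffic
    }
    node.isQuery = true;
    node.queryCount = count;
  }
  computeTopK(root, '');
  return root;
}

// Post-order walk: a node's top-K is the best of its own query plus its children's top-Ks.
function computeTopK(node, prefix) {
  let candidates = node.isQuery ? [[prefix, node.queryCount]] : [];
  for (const [ch, child] of node.children) {
    computeTopK(child, prefix + ch);
    candidates = candidates.concat(child.topK);
  }
  node.topK = candidates.sort((a, b) => b[1] - a[1]).slice(0, TOP_K);
}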
Updating for Trending
Trending queries update faster (every 15 minutes):
1. Real-time aggregation (last 15 min):
- Use lightweight approximate counters (e.g., a count-min sketch) for per-query frequency counts
- For prefix "c", compute trending queries:
trending_now = [
("covid", 50_000),
("crypto", 48_000),
("coffee", 45_000)
]
2. Merge with trie suggestions:
- When user types "c", combine:
- Trie top-5 for "c" (all-time popular)
- Trending top-5 for "c" (current hour)
- Deduplicate, re-rank
- Return top-5 (see the merge sketch below)
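A sketch of the merge step; the boost factor for trending entries is an assumption, and a production system would tune the ranking:

function mergeSuggestions(allTime, trending, k = 5, trendingBoost = 3) {
  const scores = new Map();
  for (const [query, count] of allTime) {
    scores.set(query, count);
  }
  for (const [query, count] of trending) {
    // Boost recent counts so a spike in the last 15 minutes can outrank all-time favorites.
    scores.set(query, Math.max(scores.get(query) || 0, count * trendingBoost));
  }
  // Deduplicated by the Map; re-rank by score and keep the top k.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, k)
    .map(([query]) => query);
}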
Trie Node Serialization
{
  "char": "g",
  "count": 5200000,
  "children": {
    "o": {...},
    "i": {...},
    "a": {...}
  },
  "top_k": [
    {
      "query": "google",
      "count": 5200000
    },
    {
      "query": "google maps",
      "count": 1800000
    },
    ...
  ]
}
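A sketch of the serialize/load round trip (file paths and the refresh cadence are placeholders; a compact binary format would shrink the footprint further):

const fs = require('fs');

// Offline: dump the built trie once per rebuild. Maps are converted to plain objects for JSON.
function serializeTrie(root, path) {
  const json = JSON.stringify(root, (key, value) =>
    value instanceof Map ? Object.fromEntries(value) : value);
  fs.writeFileSync(path, json);
}

// Online: the autocomplete service loads the latest snapshot at startup (or on a refresh timer)
// and rebuilds the child Maps so lookup() works unchanged.
function loadTrie(path) {
  const revive = (node) => {
    node.children = new Map(
      Object.entries(node.children).map(([ch, child]) => [ch, revive(child)]));
    return node;
  };
  return revive(JSON.parse(fs.readFileSync(path, 'utf8')));
}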
Client-Side Optimizations
The server can only do so much. The client must also optimize:
Debouncing
Don’t fire a request on every keystroke. Wait 100-200ms after the user stops typing:
// searchInput is the text <input> element being watched
let debounceTimer;
searchInput.addEventListener('input', (event) => {
  clearTimeout(debounceTimer);
  debounceTimer = setTimeout(() => {
    fetch(`/autocomplete?q=${encodeURIComponent(event.target.value)}`)
      .then(response => response.json())
      .then(data => displaySuggestions(data));
  }, 150); // 150ms debounce
});
Impact: Reduces requests from 40B/day to 10-15B/day.
Browser Caching
Cache recent suggestions in the browser:
const suggestionCache = {};

async function getSuggestions(prefix) {
  if (suggestionCache[prefix]) {
    return suggestionCache[prefix]; // hit: e.g. the user backspaced to a prefix seen before
  }
  const response = await fetch(`/autocomplete?q=${encodeURIComponent(prefix)}`);
  const suggestions = await response.json();
  suggestionCache[prefix] = suggestions;
  return suggestions;
}
Prefetching
After the user types “ca”, prefetch suggestions for “cat”, “car”, “cab”:
const prefix = userInput; // "ca"
const likelyNextChars = ['t', 'r', 'b', 'p']; // extend with other likely next characters
for (const char of likelyNextChars) {
  getSuggestions(`${prefix}${char}`); // warms the browser cache defined above
}
This masks network latency.
Scaling Considerations
Shard the Trie by Prefix Range
A single trie covering the top 1M queries is ~500 MB, which fits comfortably in one server’s memory. To cover more queries and spread request load, shard the trie across servers: with 10 shards the cluster holds roughly ten times as much trie data and each shard serves only a slice of the traffic.
Shard by prefix range:
Trie shard 1: a-c (queries starting with a, b, or c)
Trie shard 2: d-g
Trie shard 3: h-m
Trie shard 4: n-s
Trie shard 5: t-z
Route requests:
GET /autocomplete?q=cat -> shard 1
GET /autocomplete?q=google -> shard 2
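A sketch of the routing logic at the gateway; the shard boundaries follow the table above, and the host names are placeholders:

const SHARDS = [
  { range: 'abc',     host: 'trie-shard-1' },
  { range: 'defg',    host: 'trie-shard-2' },
  { range: 'hijklm',  host: 'trie-shard-3' },
  { range: 'nopqrs',  host: 'trie-shard-4' },
  { range: 'tuvwxyz', host: 'trie-shard-5' },
];

function routeToShard(prefix) {
  const first = (prefix[0] || '').toLowerCase();
  const shard = SHARDS.find(s => s.range.includes(first));
  return shard ? shard.host : SHARDS[0].host; // digits/symbols fall back to a default shard
}

// routeToShard('cat')    -> 'trie-shard-1'
// routeToShard('google') -> 'trie-shard-2'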
Replication and Redundancy
Each shard is replicated for high availability:
Shard 1 primary: server-1a
Shard 1 replica: server-1b, server-1c
If server-1a fails, traffic routes to server-1b or server-1c.
Multi-Layer Caching
Autocomplete responses are highly cacheable because suggestions don’t change much:
Layer 1: Browser cache (500 MB, user's device)
Layer 2: CDN edge (10 GB per POP, cached globally)
Layer 3: Application cache (Redis, 100 GB)
Layer 4: Trie (in-memory, 10 GB per shard)
Layer 5: Database (query logs, data warehouse)
Effectiveness: The top ~1,000 prefixes (mostly short ones such as “a”, “go”, “how”) serve ~80% of traffic and can be CDN-cached globally.
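For the browser and CDN layers to help, the service only needs to mark responses as cacheable; continuing the Express-style sketch above, with illustrative TTLs:

app.get('/autocomplete', (req, res) => {
  const prefix = (req.query.q || '').toLowerCase();
  const suggestions = lookup(trieRoot, prefix);
  // Browsers may reuse this for 1 minute, shared caches (CDN) for 15 minutes,
  // so repeated popular prefixes never reach the trie servers.
  res.set('Cache-Control', 'public, max-age=60, s-maxage=900');
  res.json(suggestions);
});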
Distributed Trie in Redis
If in-memory is not enough, store the trie in Redis:
Key: "trie:node:{prefix}"
Value: JSON with node data and top-K
GET trie:node:go -> returns top-10 for "go"
GET trie:node:google -> returns top-10 for "google"
Lookup becomes a Redis GET instead of an in-process memory read. Latency rises from roughly 5ms to 20ms but stays well under the 100ms budget.
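A sketch of the Redis-backed lookup using node-redis; the key naming follows the scheme above, and the connection URL is a placeholder:

const { createClient } = require('redis');

const redis = createClient({ url: 'redis://trie-cache:6379' });
// Call `await redis.connect()` once at service startup.

async function lookupFromRedis(prefix) {
  const raw = await redis.get(`trie:node:${prefix}`); // one GET per keystroke
  if (!raw) return [];
  return JSON.parse(raw).top_k; // top-K was pre-computed at trie build time
}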
Trade-offs and Design Decisions
| Dimension | Real-Time Updates | Periodic Rebuilds | Our Choice |
|---|---|---|---|
| Freshness | Up-to-date immediately | 15-60 min lag | Periodic rebuilds |
| Latency | Higher (update overhead) | Lowest | Periodic rebuilds |
| Complexity | Very high | Simple | Periodic rebuilds |
| Resource usage | High CPU/memory | Low | Periodic rebuilds |
Why periodic rebuilds win: Autocomplete trends don’t change that fast. If Google Trends shows “covid” trending now, 15 minutes later it’s still trending. Users don’t need real-time updates.
| Dimension | In-Memory Trie | Redis Trie | Our Choice |
|---|---|---|---|
| Latency | 5-10ms | 20-30ms | In-memory |
| Availability | Fails if instance restarts | Survives restarts | In-memory (with RTO plan) |
| Scalability | Limited by RAM per node | Scales beyond a single machine | In-memory with sharding |
| Cost | Cheap (single machine) | Expensive (Redis cluster) | In-memory |
Our choice: In-memory with replication. Trade off availability slightly (rebuild trie on restart) for speed and cost.
| Dimension | Personalized suggestions | Global suggestions | Our Choice |
|---|---|---|---|
| Relevance | Tailored to the user | Generic | Global + optional personalization |
| Latency | Slow (fetch user history) | Fast | Global |
| Complexity | Very high | Simple | Global |
| User experience | Better for power users | Same for all | Global (80/20 rule) |
Why global wins: 80% of users just want the most popular suggestions. The effort to personalize isn’t worth it for most.
Key Takeaways
- Trie is the right data structure: A hash map keyed on full queries can’t do prefix matching without scanning every key. A trie gives prefix lookup in O(p), where p is the prefix length.
- Pre-compute, don’t compute: Store top-K at each node. Don’t compute on every request.
- Offline processing scales better: Build trie from batch data hourly. Don’t update in real-time.
- Multi-layer caching is essential: Browser cache, CDN, app cache, trie. Each layer handles a different scale.
- Client-side optimization matters: Debouncing and prefetching reduce server load by 50-80%.
Practice Exercise
Extend this design to support:
- Multi-language: English, Spanish, Mandarin suggestions in separate tries or one unified trie?
- Spell correction: User types “seearch”. Should we suggest “search”?
- Personalization: After global suggestions, append user’s 3 most recent searches.
Next section: We’ve covered real-world system design walkthroughs. Let’s explore the patterns that emerge across all of them — how do we recognize when to use caching, queuing, or sharding?