Database Indexes
The 30-Second Problem
Imagine you’re debugging a production incident. A seemingly simple query that should be instant is taking 30 seconds to complete:
SELECT user_id, email, created_at FROM users WHERE email = '[email protected]';
Your users table has 50 million rows. Without an index, the database must scan every single row, comparing each email value until it finds a match. That’s 50 million comparisons. Now you create a single index:
CREATE INDEX idx_users_email ON users(email);
The same query now runs in 5 milliseconds.
That 6,000x improvement isn’t magic—it’s the power of indexes. This chapter explores how databases organize data for fast retrieval, the types of indexes available, and critically, how to use them wisely to avoid the common pitfall of over-indexing.
We’re building on the ACID guarantees from the previous chapter. Indexes are where the theoretical meets the practical: they’re the primary tool for making transactional systems fast enough for real-world workloads. But they come with hidden costs you need to understand.
What Is an Index, Really?
An index is a data structure that maps key values to their locations in a table, enabling the database to find rows without scanning everything. Think of it like this:
Without an index, finding something is like searching through a phone book linearly—start at the front, check every entry until you find the one you want.
With an index, it’s like having a sorted directory organized by last name, then first name. You can jump directly to the section you need.
The fundamental trade-off is the reason you can’t just index everything: every index you create makes reads faster but makes writes slower and uses additional storage. When you insert a row, the database must:
- Insert the row itself
- Update every index on that table
- Maintain the internal structure of each index
On a table with millions of rows and dozens of indexes, this overhead becomes significant.
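You can feel this cost directly. A minimal sketch, assuming PostgreSQL and the users table from above (note that EXPLAIN ANALYZE actually executes the insert):
EXPLAIN ANALYZE INSERT INTO users (email, created_at)
VALUES ('[email protected]', now());
-- Re-run this after each new CREATE INDEX on users: the reported
-- execution time grows with every index the database must maintain.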
Index Taxonomy
Let’s establish some vocabulary because different database systems use these terms differently, and precision matters.
Primary Index vs Secondary Index
A primary index is built on the primary key—the unique identifier for each row. Most databases create this automatically and guarantee uniqueness. A secondary index is any other index you add; its values need not be unique.
Clustered vs Non-Clustered Index
This distinction matters more in some systems than others. A clustered index physically determines the sort order of the table itself. There can be only one clustered index per table (because a table can only be sorted in one way). Some systems, such as MySQL's InnoDB and SQL Server, make the primary key the clustered index by default; others, like PostgreSQL, store tables as unordered heaps.
A non-clustered index is a separate structure that points to rows without affecting their physical order. You can have many non-clustered indexes.
Unique Indexes
Some indexes enforce uniqueness constraints. They’re useful for business logic (like ensuring no two users have the same email) and have the side benefit of helping the optimizer understand data cardinality.
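A minimal example, assuming a users table with an email column:
CREATE UNIQUE INDEX idx_users_email_unique ON users(email);
INSERT INTO users (email) VALUES ('[email protected]');  -- succeeds
INSERT INTO users (email) VALUES ('[email protected]');  -- fails: duplicate key value violates unique constraint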
Composite Indexes
A composite (or multi-column) index indexes multiple columns together. The order matters enormously—more on this shortly.
How Databases Actually Organize Indexes
Let me show you the analogy that stuck with me: a book’s table of contents versus its index.
A clustered index is like a book organized by chapters (1, 2, 3…). The book itself IS organized this way. If you want to find “Chapter 5”, you don’t need a separate lookup table—you just know it comes after Chapter 4. The chapters are stored in order.
A non-clustered index is like the alphabetical index at the back of the book. It maps topic names to page numbers. The book itself isn’t organized alphabetically—it still follows the chapter structure. But the index lets you quickly find where “recursion” is mentioned without reading every page.
In a database context, if your table is clustered by user_id, rows with consecutive user IDs are stored near each other physically. A non-clustered index on email is a separate lookup structure that says "email = '[email protected]' is at address X".
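The syntax for this split varies by system. As an illustrative sketch in SQL Server, where the distinction is explicit (PostgreSQL, by contrast, has no persistent clustered index, though its one-time CLUSTER command can reorder a table):
-- SQL Server (illustrative names):
CREATE CLUSTERED INDEX idx_users_id ON users(user_id);      -- determines the physical row order
CREATE NONCLUSTERED INDEX idx_users_email ON users(email);  -- a separate lookup structure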
B-Tree Indexes: The Workhorse
Almost all general-purpose databases use some variant of B-Trees for indexing. Here’s why they’re so powerful:
B-Trees are self-balancing tree structures that keep data sorted and enable efficient searches. A B-Tree with branching factor d has these properties:
- Every node can hold up to d keys and d + 1 children
- All leaves are at the same depth (perfectly balanced)
- Lookups take O(log n) time—even with a billion rows, you’re rarely more than a few levels deep
- Range queries are efficient because keys are sorted
Let me visualize a small B-Tree:
graph TD
A["[30, 60]"] --> B["[10, 20]"]
A --> C["[40, 50]"]
A --> D["[70, 80]"]
B --> E["5"]
B --> F["15"]
B --> G["25"]
C --> H["35"]
C --> I["45"]
C --> J["55"]
D --> K["65"]
D --> L["75"]
D --> M["85"]
To search for value 47, you'd start at the root [30, 60]. Since 47 lies between 30 and 60, you go to the middle child. Then you check [40, 50]: 47 lies between these, so you again take the middle child and land on the leaf holding 45. You either find 47 or reach a leaf knowing it doesn't exist. Three node visits to search a tree holding 17 keys.
B+ Trees improve on B-Trees for range queries. In a B+ Tree, only leaf nodes contain actual data; internal nodes just guide navigation. This means:
- Leaves are linked together, so range queries become sequential reads
- Searching for all values between 35 and 75 is fast—navigate to 35, then follow leaf pointers
- Internal nodes can store more keys since they don’t hold data, making trees shallower
Most modern databases (PostgreSQL, MySQL, SQL Server) use B+ Tree variants.
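This is why one B-Tree index serves both point and range predicates. A sketch, assuming an orders table with a created_at column:
CREATE INDEX idx_orders_created_at ON orders(created_at);
-- Point lookup: one descent of the tree.
SELECT * FROM orders WHERE created_at = '2024-03-15';
-- Range query: descend to the lower bound, then walk the linked leaves.
SELECT * FROM orders WHERE created_at BETWEEN '2024-01-01' AND '2024-02-01';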
Hash Indexes: Fast Point Lookups
Hash indexes take a different approach. They use a hash function to map keys to bucket locations, providing O(1) average-case lookup time. They’re excellent for equality queries (WHERE email = '[email protected]') but useless for range queries.
Why? When you hash a value, you lose all ordering information. The hashes of '[email protected]' and '[email protected]' have no relationship; they might land in buckets at opposite ends of the hash table.
When to use hash indexes:
- Point lookups are your primary access pattern
- You never need range queries on that column
- Your database fully supports them (some don’t)
When to avoid:
- Range queries are common (like WHERE created_at > '2024-01-01')
- You need prefix searches (like WHERE email LIKE 'alice%')
Many databases default to B-Trees because they’re more versatile, but PostgreSQL’s HASH indexes can be valuable for specific patterns.
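In PostgreSQL (version 10 and later, where hash indexes are crash-safe), creating one looks like this; the sessions table and token column are illustrative:
CREATE INDEX idx_sessions_token ON sessions USING HASH (token);
-- Served by the hash index:
SELECT * FROM sessions WHERE token = 'abc123';
-- NOT served: ranges and prefixes, because hashing destroys ordering.
-- SELECT * FROM sessions WHERE token LIKE 'abc%';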
Specialized Indexes for Special Cases
Bitmap Indexes
Bitmap indexes excel on low-cardinality columns (few distinct values). They’re common in data warehousing. Instead of storing pointers, they store a bitmap where each bit represents whether a row matches the value.
For a gender column with 100 million rows:
- Traditional B-Tree index: roughly 100 million key/pointer entries (gigabytes of storage)
- Bitmap index: one bitmap of 100 million bits (12.5 MB) per distinct value, and bitwise operations (AND, OR) combine bitmaps efficiently
Trade-off: great for analytics, poor for columns with thousands of distinct values.
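The DDL is vendor-specific. In Oracle, for example, it's a one-line declaration (illustrative sketch); PostgreSQL has no persistent bitmap indexes but builds transient bitmaps from ordinary indexes at query time:
-- Oracle syntax (illustrative):
CREATE BITMAP INDEX idx_customers_gender ON customers(gender);
-- Predicates on several low-cardinality columns combine via bitwise AND/OR:
-- WHERE gender = 'F' AND region = 'WEST' AND status = 'ACTIVE'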
Full-Text Indexes
Full-text search uses inverted indexes—a mapping from words to documents containing them. When you index a document column, the database tokenizes text into words and creates a lookup structure.
CREATE INDEX idx_articles_fulltext ON articles USING GIN (to_tsvector('english', content));
SELECT * FROM articles WHERE to_tsvector('english', content) @@ to_tsquery('postgres & database');
This enables fast keyword search without scanning every document.
GIN and GiST Indexes (PostgreSQL)
PostgreSQL’s Generalized Inverted Index (GIN) and Generalized Search Tree (GiST) handle complex data types:
- JSON fields: search nested objects quickly
- Arrays: find containment relationships
- Geometric types: spatial queries
These are increasingly important as databases support richer data types.
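For instance, a GIN index on a JSONB column lets containment queries skip the full scan. A sketch, assuming a products table with a JSONB attrs column:
CREATE INDEX idx_products_attrs ON products USING GIN (attrs);
-- The @> containment operator can now use the index:
SELECT * FROM products WHERE attrs @> '{"color": "red"}';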
Index Type Comparison
| Index Type | Lookup Time | Range Query | Use Case | Limitations |
|---|---|---|---|---|
| B-Tree | O(log n) | Excellent | General purpose | Slightly slower than hash for point lookups |
| B+ Tree | O(log n) | Excellent | General purpose, especially ranges | Internal overhead |
| Hash | O(1) | None | Point lookups only | Cannot handle ranges or sorting |
| Bitmap | O(1) typically | Good | Low-cardinality columns | Terrible for high-cardinality data |
| Full-Text (Inverted) | O(1) typical | N/A | Text search | Only for text; not general purpose |
| GIN | O(log n) | Good | JSON, arrays, full-text | More complex maintenance |
| GiST | O(log n) | Good | Spatial, geometric | More complex maintenance |
Creating and Using Indexes Effectively
Let’s see what actually happens. Start with a slow query:
-- Slow: full table scan
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
Output might show:
Seq Scan on orders (cost=0.00..50000.00 rows=1000)
Now add an index:
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
Same query now shows:
Index Scan using idx_orders_customer_id on orders (cost=0.29..4.30 rows=1000)
The optimizer chose a plan whose estimated cost is orders of magnitude lower. But here's the critical insight: the optimizer must know the index exists and believe, based on its statistics, that using it will be faster.
Composite Indexes and Column Ordering
The order of columns in a composite index matters enormously due to the leftmost prefix rule. Consider this common query:
SELECT * FROM users WHERE country = 'US' AND city = 'New York' AND state = 'NY';
If you create:
CREATE INDEX idx_location ON users(country, city, state);
The database can use this index efficiently for:
- WHERE country = 'US' ✓
- WHERE country = 'US' AND city = 'New York' ✓
- WHERE country = 'US' AND city = 'New York' AND state = 'NY' ✓
But it cannot efficiently use this index for:
- WHERE city = 'New York' ✗ (skips the first column)
- WHERE state = 'NY' ✗ (skips two columns)
Why? B-Tree indexes are sorted by the first column, then by the second, then by the third. If you don’t start with the first column, you can’t use the sorting.
Pro tip: When building composite indexes, put the most restrictive columns first (the ones that eliminate the most rows), unless query patterns suggest otherwise. Monitor EXPLAIN ANALYZE output to verify the optimizer uses your index.
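A quick way to verify, assuming the idx_location index above: run both query shapes through EXPLAIN ANALYZE and compare the plans.
EXPLAIN ANALYZE SELECT * FROM users WHERE country = 'US' AND city = 'New York';
-- Hope for: Index Scan using idx_location on users ...
EXPLAIN ANALYZE SELECT * FROM users WHERE city = 'New York';
-- Likely: Seq Scan on users ... (the leading column is missing)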
Covering Indexes and Query Satisfaction
Some databases support covering indexes—indexes that contain all columns needed to answer a query. Instead of looking up the row in the table, the database can return data directly from the index.
CREATE INDEX idx_user_email_cover ON users(email) INCLUDE (user_id, name);
-- This query touches ONLY the index, not the table
SELECT user_id, name FROM users WHERE email = '[email protected]';
This is powerful for read-heavy workloads but increases storage costs.
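In PostgreSQL, you can confirm the table itself is never touched by looking for an Index Only Scan in the plan (output abbreviated; assumes the index above and a recently vacuumed table):
EXPLAIN SELECT user_id, name FROM users WHERE email = '[email protected]';
-- Index Only Scan using idx_user_email_cover on users ...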
The Over-Indexing Trap
Here’s where many developers go wrong. Each index speeds up reads but slows writes:
CREATE INDEX idx_col1 ON big_table(col1);
CREATE INDEX idx_col2 ON big_table(col2);
CREATE INDEX idx_col3 ON big_table(col3);
CREATE INDEX idx_col4 ON big_table(col4);
-- ... 15 more indexes later
Now every INSERT, UPDATE, or DELETE must update 20+ indexes. On a table receiving millions of writes per day, this becomes a bottleneck. The database must:
- Modify the table
- Navigate and modify each B-Tree structure
- Maintain balance in each tree
- Write changes to the write-ahead log
A table with 1,000 QPS of reads and 100 QPS of writes might be fine with 5 indexes. A table with 100 QPS of reads and 1,000 QPS of writes should have very few indexes.
Did you know: Many databases have index monitoring tools. PostgreSQL’s pg_stat_user_indexes shows index usage—you might discover indexes that are never used and waste resources.
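A sketch of that check in PostgreSQL; indexes showing idx_scan = 0 since the last statistics reset are candidates for removal:
SELECT relname AS table_name,
       indexrelname AS index_name,
       idx_scan AS times_used
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;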
When NOT to Index
Be skeptical of indexing in these scenarios:
Small tables (under roughly 100,000 rows): A full scan is often faster than the overhead of an index lookup, and the database may serve the whole table from cache anyway.
Frequently updated columns: If you index a column that changes often, every update requires index maintenance. Consider whether the read benefit justifies the write cost.
Low-selectivity columns: If a column has only a few distinct values (like is_active with just true/false), an index doesn’t narrow the search much. You might still scan 50% of the table.
Varchar columns searched with LIKE '%pattern%': Prefix-only patterns like LIKE 'prefix%' can use a B-Tree index, but LIKE '%middle%' forces a full scan.
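If you genuinely need substring search, PostgreSQL's pg_trgm extension paired with a GIN index is one option; a sketch under that assumption:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_users_email_trgm ON users USING GIN (email gin_trgm_ops);
-- LIKE '%middle%' can now use the trigram index instead of a full scan:
SELECT * FROM users WHERE email LIKE '%alice%';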
Partial Indexes for Specific Patterns
Here’s an elegant technique many overlook. If you have a pattern like “find active users”, create an index only on rows that match:
CREATE INDEX idx_active_users_email ON users(email) WHERE status = 'active';
This index covers only active users, saving space and update cost, and it indexes email within that subset, so queries like this one can use it directly:
SELECT * FROM users WHERE status = 'active' AND email = '[email protected]';
Key Takeaways
- Indexes are the primary tool for making databases fast, enabling O(log n) or O(1) lookups instead of full table scans, but they're not free: every index adds write overhead and storage cost.
- B-Tree and B+ Tree indexes are the default for most databases because they support both point lookups and range queries efficiently; hash indexes are faster for point lookups but useless for ranges.
- Composite index column order matters due to the leftmost prefix rule: the first column must appear in your WHERE clause for the index to be useful; put restrictive columns first.
- Don't index everything: over-indexing turns fast small tables into slow big tables; measure with EXPLAIN ANALYZE and delete unused indexes rather than adding speculatively.
- The write cost of an index depends on your workload: read-heavy systems can support many indexes; write-heavy systems need few, strategic ones.
- Partial indexes and covering indexes solve specific problems elegantly without the overhead of full indexes.
Put It Into Practice
Scenario 1: E-Commerce Query Optimization
Your product catalog has 5 million items. You notice slow queries like:
SELECT * FROM products WHERE category_id = 42 AND price < 100 ORDER BY rating DESC;
You’re considering these indexes:
- (category_id)
- (price)
- (category_id, price)
- (category_id, price, rating)
Which should you create? Use EXPLAIN ANALYZE to compare. Consider whether the write cost is justified (how often are products inserted/updated?). What if you mostly query by category and always want results sorted by rating?
Scenario 2: Over-Indexing Recovery
A legacy system has 50 indexes on a critical transactions table that receives 10,000 writes per second. The table is getting slower, not faster. Writes are taking 50ms instead of 5ms. Design a plan to:
- Identify which indexes are actually used (hint: PostgreSQL has pg_stat_user_indexes)
- Remove the ones with 0 scans
- Consolidate overlapping indexes using composite columns
- Measure the impact
Estimate how much write latency you’d save by removing 30 unused indexes.
Scenario 3: Covering Index Design
You have a common query that reads user records for a dashboard:
SELECT user_id, name, email, last_login FROM users WHERE status = 'active' ORDER BY last_login DESC;
This query runs 1,000 times per minute. The table receives 10 writes per second. Design an index strategy using:
- Partial indexes (only active users)
- Covering indexes (include all columns needed)
- Column ordering (think about the ORDER BY clause)
Write the CREATE INDEX statement(s) and explain the trade-offs.
Next: Query Optimization
We’ve built the foundation—indexes are the building blocks. In the next chapter, we’ll see how query optimizers decide whether to use indexes, how to write queries that work with your indexes instead of against them, and how to read EXPLAIN plans like a pro. You’ll learn that the best index is worthless if the optimizer doesn’t use it, and that sometimes a full table scan is actually the right choice.