Database Indexes
The 30-Second Problem
Imagine you’re debugging a production incident. A seemingly simple query that should be instant is taking 30 seconds to complete:
SELECT user_id, email, created_at FROM users WHERE email = '[email protected]';
Your users table has 50 million rows. Without an index, the database must scan every single row, comparing each email value until it finds a match. That’s 50 million comparisons. Now you create a single index:
CREATE INDEX idx_users_email ON users(email);
The same query now runs in 5 milliseconds.
That 6,000x improvement isn’t magic—it’s the power of indexes. This chapter explores how databases organize data for fast retrieval, the types of indexes available, and critically, how to use them wisely to avoid the common pitfall of over-indexing.
We’re building on the ACID guarantees from the previous chapter. Indexes are where the theoretical meets the practical: they’re the primary tool for making transactional systems fast enough for real-world workloads. But they come with hidden costs you need to understand.
What Is an Index, Really?
An index is a data structure that maps key values to their locations in a table, enabling the database to find rows without scanning everything. Think of it like this:
Without an index, finding something is like searching through a phone book linearly—start at the front, check every entry until you find the one you want.
With an index, it’s like having a sorted directory organized by last name, then first name. You can jump directly to the section you need.
The fundamental trade-off is the reason you can’t just index everything: every index you create makes reads faster but makes writes slower and uses additional storage. When you insert a row, the database must:
- Insert the row itself
- Update every index on that table
- Maintain the internal structure of each index
On a table with millions of rows and dozens of indexes, this overhead becomes significant.
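You can feel this cost directly. A minimal sketch, assuming PostgreSQL and the users table from above (note that EXPLAIN ANALYZE actually executes the insert):
EXPLAIN ANALYZE INSERT INTO users (email, created_at)
VALUES ('[email protected]', now());
-- Re-run this after each new CREATE INDEX on users: the reported
-- execution time grows with every index the database must maintain.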
Index Taxonomy
Let’s establish some vocabulary because different database systems use these terms differently, and precision matters.
Primary Index vs Secondary Index
A primary index is built on the primary key—the unique identifier for each row. Most databases create this automatically and guarantee uniqueness. A secondary index is any other index you add; its values need not be unique.
Clustered vs Non-Clustered Index
This distinction matters more in some systems than others. A clustered index physically determines the sort order of the table itself. There can be only one clustered index per table (because a table can only be sorted in one way). Some systems, such as MySQL's InnoDB and SQL Server, make the primary key the clustered index by default; others, like PostgreSQL, store tables as unordered heaps.
A non-clustered index is a separate structure that points to rows without affecting their physical order. You can have many non-clustered indexes.
Unique Indexes
Some indexes enforce uniqueness constraints. They’re useful for business logic (like ensuring no two users have the same email) and have the side benefit of helping the optimizer understand data cardinality.
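A minimal example, assuming a users table with an email column:
CREATE UNIQUE INDEX idx_users_email_unique ON users(email);
INSERT INTO users (email) VALUES ('[email protected]');  -- succeeds
INSERT INTO users (email) VALUES ('[email protected]');  -- fails: duplicate key value violates unique constraint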
Composite Indexes
A composite (or multi-column) index indexes multiple columns together. The order matters enormously—more on this shortly.
How Databases Actually Organize Indexes
Let me show you the analogy that stuck with me: a book’s table of contents versus its index.
A clustered index is like a book organized by chapters (1, 2, 3…). The book itself IS organized this way. If you want to find “Chapter 5”, you don’t need a separate lookup table—you just know it comes after Chapter 4. The chapters are stored in order.
A non-clustered index is like the alphabetical index at the back of the book. It maps topic names to page numbers. The book itself isn’t organized alphabetically—it still follows the chapter structure. But the index lets you quickly find where “recursion” is mentioned without reading every page.
In a database context, if your table is clustered by user_id, rows with consecutive user IDs are stored near each other physically. A non-clustered index on email is a separate lookup structure that says "email = '[email protected]' is at address X".
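The syntax for this split varies by system. As an illustrative sketch in SQL Server, where the distinction is explicit (PostgreSQL, by contrast, has no persistent clustered index, though its one-time CLUSTER command can reorder a table):
-- SQL Server (illustrative names):
CREATE CLUSTERED INDEX idx_users_id ON users(user_id);      -- determines the physical row order
CREATE NONCLUSTERED INDEX idx_users_email ON users(email);  -- a separate lookup structure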
B-Tree Indexes: The Workhorse
Almost all general-purpose databases use some variant of B-Trees for indexing. Here’s why they’re so powerful:
B-Trees are self-balancing tree structures that keep data sorted and enable efficient searches. A B-Tree with branching factor d has these properties:
- Every node can hold up to d keys and d + 1 children
- All leaves are at the same depth (perfectly balanced)
- Lookups take O(log n) time—even with a billion rows, you’re rarely more than a few levels deep
- Range queries are efficient because keys are sorted
Let me visualize a small B-Tree:
graph TD
A["[30, 60]"] --> B["[10, 20]"]
A --> C["[40, 50]"]
A --> D["[70, 80]"]
B --> E["5"]
B --> F["15"]
B --> G["25"]
C --> H["35"]
C --> I["45"]
C --> J["55"]
D --> K["65"]
D --> L["75"]
D --> M["85"]
To search for value 47, you'd start at the root [30, 60]. Since 47 lies between 30 and 60, you go to the middle child. Then you check [40, 50]: 47 lies between these, so you again take the middle child and land on the leaf holding 45. You either find 47 or reach a leaf knowing it doesn't exist. Three node visits to search a tree holding 17 keys.
B+ Trees improve on B-Trees for range queries. In a B+ Tree, only leaf nodes contain actual data; internal nodes just guide navigation. This means:
- Leaves are linked together, so range queries become sequential reads
- Searching for all values between 35 and 75 is fast—navigate to 35, then follow leaf pointers
- Internal nodes can store more keys since they don’t hold data, making trees shallower
Most modern databases (PostgreSQL, MySQL, SQL Server) use B+ Tree variants.
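This is why one B-Tree index serves both point and range predicates. A sketch, assuming an orders table with a created_at column:
CREATE INDEX idx_orders_created_at ON orders(created_at);
-- Point lookup: one descent of the tree.
SELECT * FROM orders WHERE created_at = '2024-03-15';
-- Range query: descend to the lower bound, then walk the linked leaves.
SELECT * FROM orders WHERE created_at BETWEEN '2024-01-01' AND '2024-02-01';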
Hash Indexes: Fast Point Lookups
Hash indexes take a different approach. They use a hash function to map keys to bucket locations, providing O(1) average-case lookup time. They’re excellent for equality queries (WHERE email = '[email protected]') but useless for range queries.
Why? When you hash a value, you lose all ordering information. The hashes of '[email protected]' and '[email protected]' have no relationship; they might land in buckets at opposite ends of the hash table.
When to use hash indexes:
- Point lookups are your primary access pattern
- You never need range queries on that column
- Your database fully supports them (some don’t)
When to avoid:
- Range queries are common (like WHERE created_at > '2024-01-01')
- You need prefix searches (like WHERE email LIKE 'alice%')
Many databases default to B-Trees because they’re more versatile, but PostgreSQL’s HASH indexes can be valuable for specific patterns.
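In PostgreSQL (version 10 and later, where hash indexes are crash-safe), creating one looks like this; the sessions table and token column are illustrative:
CREATE INDEX idx_sessions_token ON sessions USING HASH (token);
-- Served by the hash index:
SELECT * FROM sessions WHERE token = 'abc123';
-- NOT served: ranges and prefixes, because hashing destroys ordering.
-- SELECT * FROM sessions WHERE token LIKE 'abc%';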
Specialized Indexes for Special Cases
Bitmap Indexes
Bitmap indexes excel on low-cardinality columns (few distinct values). They’re common in data warehousing. Instead of storing pointers, they store a bitmap where each bit represents whether a row matches the value.
For a gender column with 100 million rows:
- Traditional B-Tree index: roughly 100 million key/pointer entries (gigabytes of storage)
- Bitmap index: one bitmap of 100 million bits (12.5 MB) per distinct value, and bitwise operations (AND, OR) combine bitmaps efficiently
Trade-off: great for analytics, poor for columns with thousands of distinct values.
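The DDL is vendor-specific. In Oracle, for example, it's a one-line declaration (illustrative sketch); PostgreSQL has no persistent bitmap indexes but builds transient bitmaps from ordinary indexes at query time:
-- Oracle syntax (illustrative):
CREATE BITMAP INDEX idx_customers_gender ON customers(gender);
-- Predicates on several low-cardinality columns combine via bitwise AND/OR:
-- WHERE gender = 'F' AND region = 'WEST' AND status = 'ACTIVE'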
Full-Text Indexes
Full-text search uses inverted indexes—a mapping from words to documents containing them. When you index a document column, the database tokenizes text into words and creates a lookup structure.
CREATE INDEX idx_articles_fulltext ON articles USING GIN (to_tsvector('english', content));
SELECT * FROM articles WHERE to_tsvector('english', content) @@ to_tsquery('postgres & database');
This enables fast keyword search without scanning every document.
GIN and GiST Indexes (PostgreSQL)
PostgreSQL’s Generalized Inverted Index (GIN) and Generalized Search Tree (GiST) handle complex data types:
- JSON fields: search nested objects quickly
- Arrays: find containment relationships
- Geometric types: spatial queries
These are increasingly important as databases support richer data types.
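For instance, a GIN index on a JSONB column lets containment queries skip the full scan. A sketch, assuming a products table with a JSONB attrs column:
CREATE INDEX idx_products_attrs ON products USING GIN (attrs);
-- The @> containment operator can now use the index:
SELECT * FROM products WHERE attrs @> '{"color": "red"}';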
Index Type Comparison
| Index Type | Lookup Time | Range Query | Use Case | Limitations |
|---|---|---|---|---|
| B-Tree | O(log n) | Excellent | General purpose | Slightly slower than hash for point lookups |
| B+ Tree | O(log n) | Excellent | General purpose, especially ranges | Internal overhead |
| Hash | O(1) | None | Point lookups only | Cannot handle ranges or sorting |
| Bitmap | O(1) typically | Good | Low-cardinality columns | Terrible for high-cardinality data |
| Full-Text (Inverted) | O(1) typical | N/A | Text search | Only for text; not general purpose |
| GIN | O(log n) | Good | JSON, arrays, full-text | More complex maintenance |
| GiST | O(log n) | Good | Spatial, geometric | More complex maintenance |
Creating and Using Indexes Effectively
Let’s see what actually happens. Start with a slow query:
-- Slow: full table scan
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
Output might show:
Seq Scan on orders (cost=0.00..50000.00 rows=1000)
Now add an index:
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
Same query now shows:
Index Scan using idx_orders_customer_id on orders (cost=0.29..4.30 rows=1000)
The optimizer chose a plan whose estimated cost is orders of magnitude lower. But here's the critical insight: the optimizer must know the index exists and believe, based on its statistics, that using it will be faster.
Composite Indexes and Column Ordering
The order of columns in a composite index matters enormously due to the leftmost prefix rule. Consider this common query:
SELECT * FROM users WHERE country = 'US' AND city = 'New York' AND state = 'NY';
If you create:
CREATE INDEX idx_location ON users(country, city, state);
The database can use this index efficiently for:
- WHERE country = 'US' ✓
- WHERE country = 'US' AND city = 'New York' ✓
- WHERE country = 'US' AND city = 'New York' AND state = 'NY' ✓
But it cannot efficiently use this index for:
- WHERE city = 'New York' ✗ (skips the first column)
- WHERE state = 'NY' ✗ (skips two columns)
Why? B-Tree indexes are sorted by the first column, then by the second, then by the third. If you don’t start with the first column, you can’t use the sorting.
Pro tip: When building composite indexes, put the most restrictive columns first (the ones that eliminate the most rows), unless query patterns suggest otherwise. Monitor EXPLAIN ANALYZE output to verify the optimizer uses your index.
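A quick way to verify, assuming the idx_location index above: run both query shapes through EXPLAIN ANALYZE and compare the plans.
EXPLAIN ANALYZE SELECT * FROM users WHERE country = 'US' AND city = 'New York';
-- Hope for: Index Scan using idx_location on users ...
EXPLAIN ANALYZE SELECT * FROM users WHERE city = 'New York';
-- Likely: Seq Scan on users ... (the leading column is missing)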
Covering Indexes and Query Satisfaction
Some databases support covering indexes—indexes that contain all columns needed to answer a query. Instead of looking up the row in the table, the database can return data directly from the index.
CREATE INDEX idx_user_email_cover ON users(email) INCLUDE (user_id, name);
-- This query touches ONLY the index, not the table
SELECT user_id, name FROM users WHERE email = '[email protected]';
This is powerful for read-heavy workloads but increases storage costs.
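In PostgreSQL, you can confirm the table itself is never touched by looking for an Index Only Scan in the plan (output abbreviated; assumes the index above and a recently vacuumed table):
EXPLAIN SELECT user_id, name FROM users WHERE email = '[email protected]';
-- Index Only Scan using idx_user_email_cover on users ...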
The Over-Indexing Trap
Here’s where many developers go wrong. Each index speeds up reads but slows writes:
CREATE INDEX idx_col1 ON big_table(col1);
CREATE INDEX idx_col2 ON big_table(col2);
CREATE INDEX idx_col3 ON big_table(col3);
CREATE INDEX idx_col4 ON big_table(col4);
-- ... 15 more indexes later
Now every INSERT, UPDATE, or DELETE must update 20+ indexes. On a table receiving millions of writes per day, this becomes a bottleneck. The database must:
- Modify the table
- Navigate and modify each B-Tree structure
- Maintain balance in each tree
- Write changes to the write-ahead log
A table with 1,000 QPS of reads and 100 QPS of writes might be fine with 5 indexes. A table with 100 QPS of reads and 1,000 QPS of writes should have very few indexes.
Did you know: Many databases have index monitoring tools. PostgreSQL’s pg_stat_user_indexes shows index usage—you might discover indexes that are never used and waste resources.
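A sketch of that check in PostgreSQL; indexes showing idx_scan = 0 since the last statistics reset are candidates for removal:
SELECT relname AS table_name,
       indexrelname AS index_name,
       idx_scan AS times_used
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;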
When NOT to Index
Be skeptical of indexing in these scenarios:
Small tables (under roughly 100,000 rows): A full scan is often faster than the overhead of an index lookup, and the database may serve the whole table from cache anyway.
Frequently updated columns: If you index a column that changes often, every update requires index maintenance. Consider whether the read benefit justifies the write cost.
Low-selectivity columns: If a column has only a few distinct values (like is_active with just true/false), an index doesn’t narrow the search much. You might still scan 50% of the table.
Varchar columns searched with LIKE '%pattern%': Prefix-only patterns like LIKE 'prefix%' can use a B-Tree index, but LIKE '%middle%' forces a full scan.
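If you genuinely need substring search, PostgreSQL's pg_trgm extension paired with a GIN index is one option; a sketch under that assumption:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_users_email_trgm ON users USING GIN (email gin_trgm_ops);
-- LIKE '%middle%' can now use the trigram index instead of a full scan:
SELECT * FROM users WHERE email LIKE '%alice%';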
Partial Indexes for Specific Patterns
Here’s an elegant technique many overlook. If you have a pattern like “find active users”, create an index only on rows that match:
CREATE INDEX idx_active_users_email ON users(email) WHERE status = 'active';
This index covers only active users, saving space and update cost, and it indexes email within that subset, so queries like this one can use it directly:
SELECT * FROM users WHERE status = 'active' AND email = '[email protected]';
Key Takeaways
- Indexes are the primary tool for making databases fast, enabling O(log n) or O(1) lookups instead of full table scans, but they're not free: every index adds write overhead and storage cost.
- B-Tree and B+ Tree indexes are the default for most databases because they support both point lookups and range queries efficiently; hash indexes are faster for point lookups but useless for ranges.
- Composite index column order matters due to the leftmost prefix rule: the first column must appear in your WHERE clause for the index to be useful; put restrictive columns first.
- Don't index everything: over-indexing turns fast small tables into slow big tables; measure with EXPLAIN ANALYZE and delete unused indexes rather than adding speculatively.
- The write cost of an index depends on your workload: read-heavy systems can support many indexes; write-heavy systems need few, strategic ones.
- Partial indexes and covering indexes solve specific problems elegantly without the overhead of full indexes.
Put It Into Practice
Scenario 1: E-Commerce Query Optimization
Your product catalog has 5 million items. You notice slow queries like:
SELECT * FROM products WHERE category_id = 42 AND price < 100 ORDER BY rating DESC;
You’re considering these indexes:
- (category_id)
- (price)
- (category_id, price)
- (category_id, price, rating)
Which should you create? Use EXPLAIN ANALYZE to compare. Consider whether the write cost is justified (how often are products inserted/updated?). What if you mostly query by category and always want results sorted by rating?
Scenario 2: Over-Indexing Recovery
A legacy system has 50 indexes on a critical transactions table that receives 10,000 writes per second. The table is getting slower, not faster. Writes are taking 50ms instead of 5ms. Design a plan to:
- Identify which indexes are actually used (hint: PostgreSQL has pg_stat_user_indexes)
- Remove the ones with 0 scans
- Consolidate overlapping indexes using composite columns
- Measure the impact
Estimate how much write latency you’d save by removing 30 unused indexes.
Scenario 3: Covering Index Design
You have a common query that reads user records for a dashboard:
SELECT user_id, name, email, last_login FROM users WHERE status = 'active' ORDER BY last_login DESC;
This query runs 1,000 times per minute. The table receives 10 writes per second. Design an index strategy using:
- Partial indexes (only active users)
- Covering indexes (include all columns needed)
- Column ordering (think about the ORDER BY clause)
Write the CREATE INDEX statement(s) and explain the trade-offs.
Next: Query Optimization
We’ve built the foundation—indexes are the building blocks. In the next chapter, we’ll see how query optimizers decide whether to use indexes, how to write queries that work with your indexes instead of against them, and how to read EXPLAIN plans like a pro. You’ll learn that the best index is worthless if the optimizer doesn’t use it, and that sometimes a full table scan is actually the right choice.