Database Query Optimization

The 600x Improvement

You’ve identified the bottleneck: an API endpoint that should return in 100ms is taking 3 seconds. You trace the request and find it’s executing a single database query. That query is scanning 10 million rows in the users table and filtering in application memory to find one user. You add a simple index on the user_id column. The query now runs in 5ms instead of 3 seconds. That’s a 600x improvement. No code changes. No hardware upgrades. Just measurement and the right index.

This happens more often than you’d think. A single database query optimization can be the highest-impact change you make. This is why database query optimization deserves its own section.

Did you know? In practice, the large majority of slow API response times are caused by inefficient database queries, not slow application code or network latency. Yet many engineers spend time optimizing code when the real culprit is a missing index or a poorly written query.

How Databases Execute Queries

Before you can optimize a query, you need to understand what the database is doing. Every SQL query goes through this pipeline:

Query Text → Parsing → Query Planning → Query Optimization → Execution Engine → Result

The database parser checks syntax. The planner considers different execution strategies. The optimizer picks the best one based on table statistics. The execution engine runs it. You can’t see the parser and optimizer working, but you can peek at the planner’s output using EXPLAIN.

Understanding EXPLAIN Plans

The EXPLAIN command shows you the database’s planned execution strategy without running the query. Adding ANALYZE, as below, actually executes the query and reports real timings alongside the estimates:

EXPLAIN ANALYZE
SELECT u.id, u.name, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2024-01-01'
GROUP BY u.id, u.name;

A simplified output might look like:

Aggregate (cost=14000..15000 rows=50000 width=...)
  Group Key: u.id, u.name
  ->  Hash Join (cost=12000..14000 rows=50000 width=...)
        Hash Cond: (u.id = o.user_id)
        ->  Seq Scan on users u (cost=0..1000 rows=100000 width=...)
              Filter: (created_at > '2024-01-01')
        ->  Seq Scan on orders o (cost=0..12000 rows=1000000 width=...)

Let’s decode this:

  • Seq Scan on users means a sequential scan—reading every row in the users table.
  • Seq Scan on orders means reading every row in the orders table.
  • Hash Join combines the results using an in-memory hash table.
  • Aggregate groups and counts.
  • cost=0..1000 is the planner’s estimated cost (lower is better).
  • rows=100000 is the estimated number of rows processed.

The issue here: we’re scanning all 1 million orders, even though we only need orders for users created after 2024-01-01. An index on users(created_at) and orders(user_id) would help.
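
A minimal sketch of those two indexes, with illustrative names:

CREATE INDEX idx_users_created_at ON users (created_at);
CREATE INDEX idx_orders_user_id ON orders (user_id);

With these in place, re-running EXPLAIN ANALYZE should show index scans replacing the sequential scans, provided the filters are selective enough.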

Sequential Scan vs. Index Scan

Two main strategies for reading data:

  • Sequential Scan: Read every row in the table from start to finish. For small tables (thousands of rows), this is often faster than the overhead of traversing an index.
  • Index Scan: Use an index to jump directly to relevant rows. For large tables with selective filters, this is much faster.

Example: Finding users by user_id in a 100-million-row table.

SELECT * FROM users WHERE user_id = 42;

Without an index: read all 100 million rows sequentially. Cost: minutes. With an index: jump to user_id 42 directly. Cost: milliseconds.
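
You can ask the planner directly which strategy it picked; the annotated output below is illustrative, since exact node names and costs vary by database and table (the indexed case assumes an index named idx_users_user_id):

EXPLAIN SELECT * FROM users WHERE user_id = 42;

-- Without an index: Seq Scan on users (cost=0..2500000 rows=1 ...)
-- With an index:    Index Scan using idx_users_user_id on users (cost=0.5..8.5 rows=1 ...)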

The Three Join Types

Databases choose between different join strategies:

  • Nested Loop: For each row in the left table, scan the right table. Best for a small right table and unique lookups.
  • Hash Join: Build a hash table from the right table, probe with the left. Best for large joins with no index on the join column.
  • Merge Join: Both tables sorted, merge the sorted streams. Best for large joins with indexed columns and range queries.

If you see a Nested Loop with a sequential scan of a large table on the inner side, it’s likely slow. An index on the inner table’s join column lets the nested loop switch to fast indexed lookups, or lets the planner choose a Merge Join using the index’s sort order; a Hash Join needs no index, only enough memory to build the hash table.

Indexes: When They Help and When They Don’t

Indexes accelerate lookups, but they have costs:

Indexes Help When

  • The WHERE clause is selective (e.g., finding 1 row out of 1 million).
  • You’re joining on indexed columns.
  • You need sorting (indexed columns are pre-sorted).
  • The table is large (index overhead is worth it).

Indexes Don’t Help When

  • The filter is not selective. If you’re filtering on a boolean column (active=true) and 90% of rows are active, an index doesn’t narrow the search much; the cost of traversing the index and then fetching the matching rows offsets the benefit.
  • The table is small (under 10,000 rows). Sequential scan is often faster.
  • The comparison hides the column behind an expression. WHERE created_at < '2024-01-01' can use an index on created_at, but wrapping the column in a function or calculation prevents it (see Pattern 3 below); some databases are also sensitive to operand order and implicit casts.
  • The column is modified frequently (insert-heavy workloads). Index maintenance has overhead.

Query Planner Statistics

The query planner relies on table statistics: How many rows are in the table? What’s the cardinality of this column (how many distinct values)? When statistics are stale, the planner makes bad decisions.

Solution: Run ANALYZE on your tables periodically.

ANALYZE users;

This updates statistics without modifying data. After ANALYZE, the planner makes better decisions.
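
In PostgreSQL you can spot-check the collected statistics through the pg_stats view; a small sketch (PostgreSQL-specific):

SELECT attname, n_distinct, null_frac
FROM pg_stats
WHERE tablename = 'users';

PostgreSQL’s autovacuum daemon normally runs ANALYZE for you, but bulk loads can leave statistics stale until it catches up.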

Common Slow Query Patterns

Here are patterns you’ll see in real production databases:

Pattern 1: N+1 Queries

You want to fetch users and their recent orders:

users = db.query("SELECT * FROM users LIMIT 100")
for user in users:
    # One extra round trip per user: the "+1" in N+1.
    # (Parameterized to keep values out of the SQL string.)
    user.orders = db.query("SELECT * FROM orders WHERE user_id = %s", (user.id,))

This runs 101 queries: one to fetch users, then one for each of the 100 users. If each query takes 10ms, that’s 1 second total. This pattern appears everywhere—in ORMs, in nested loops, in business logic.

Solution: Use a JOIN.

SELECT u.*, o.*
FROM (SELECT * FROM users LIMIT 100) u
LEFT JOIN orders o ON u.id = o.user_id;

One query. The same users and orders in a single round trip: the LIMIT moves into a subquery so it caps the number of users, not the number of joined rows. Orders of magnitude faster.
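
If a JOIN is awkward (for example, deep inside ORM code), batching with IN is a common alternative; this sketch assumes the application collects the ids returned by the first query:

SELECT * FROM users LIMIT 100;

SELECT * FROM orders
WHERE user_id IN (1, 2, 3);  -- the ids collected from the first query

Two round trips instead of 101. Many ORMs expose exactly this strategy as eager loading.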

Pattern 2: Missing WHERE Clause Indexes

SELECT * FROM orders WHERE status = 'completed' AND user_id = 42;

If there’s no index on (user_id, status), the database scans the entire orders table (millions of rows) and filters in the execution engine.

Solution: Add a composite index.

CREATE INDEX idx_orders_user_status ON orders(user_id, status);

Now the database can jump directly to orders for user_id 42 with status ‘completed’.

Pattern 3: Functions in WHERE Clauses

SELECT * FROM users WHERE UPPER(name) = 'JOHN';

This looks innocent, but the database can’t use an index on the name column because it’s applying a function (UPPER) to every row. The database must scan all rows, apply UPPER to each, and filter.

Solution: Store the data normalized (names lowercase), or create an expression index (sometimes called a function-based index; support and syntax are database-specific), as sketched below.

SELECT * FROM users WHERE name = 'john';

Now an index on name works.
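
If you can’t normalize the stored data, an expression index matches the function to the query; a PostgreSQL-flavored sketch with an illustrative index name:

CREATE INDEX idx_users_upper_name ON users (UPPER(name));

-- The index can now serve the original query, because the indexed
-- expression matches the WHERE clause exactly:
SELECT * FROM users WHERE UPPER(name) = 'JOHN';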

Pattern 4: Implicit Type Casting

SELECT * FROM users WHERE user_id = '42';  -- user_id is an integer

Depending on the database, the mismatch forces an implicit cast; if the column (rather than the literal) gets cast, the cast is applied to every row and the index can’t be used.

Solution: Use the correct type.

SELECT * FROM users WHERE user_id = 42;

Pattern 5: Selecting All Columns

SELECT * FROM orders WHERE user_id = 42;

If the orders table has 50 columns but you only need 3, you’re transferring 47 unnecessary columns over the network and into memory.

Solution: Select only what you need.

SELECT id, created_at, total FROM orders WHERE user_id = 42;

Pattern 6: LIKE with Leading Wildcard

SELECT * FROM products WHERE name LIKE '%iPhone%';

A leading wildcard prevents index usage. The database can’t use an index to find strings that contain “iPhone” in the middle.

Solution: Use full-text search (PostgreSQL, MySQL) for this pattern, or store searchable data in a separate index.
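
As one concrete option, PostgreSQL’s pg_trgm extension indexes trigrams so that substring matches become index searches; a sketch assuming the extension is available and using an illustrative index name:

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX idx_products_name_trgm
ON products USING gin (name gin_trgm_ops);

-- A trigram GIN index supports LIKE even with a leading wildcard:
SELECT * FROM products WHERE name LIKE '%iPhone%';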

Advanced Optimization Techniques

Composite Indexes (Column Order Matters)

If you often filter on user_id first, then status:

CREATE INDEX idx_orders ON orders(user_id, status);

Put the most selective column first (or the one most often filtered). Column order also determines which queries can use the index at all: under the leftmost-prefix rule, this index serves filters on user_id alone or on user_id plus status, but not on status alone, as the sketch below shows.
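
A quick sketch of the leftmost-prefix rule against idx_orders(user_id, status):

-- Can use the composite index (the leading column is constrained):
SELECT * FROM orders WHERE user_id = 42;
SELECT * FROM orders WHERE user_id = 42 AND status = 'completed';

-- Cannot use it efficiently (the leading column is skipped):
SELECT * FROM orders WHERE status = 'completed';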

Covering Indexes

A covering index includes all columns needed by the query, so the database doesn’t have to fetch rows from the main table. The INCLUDE syntax below is PostgreSQL (11+); other databases either offer a similar clause or require putting the extra columns in the index key:

CREATE INDEX idx_orders_covering
ON orders(user_id, status)
INCLUDE (total, created_at);

The query fetches everything from the index (an index-only scan) and never touches the main table.
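
You can verify the behavior with EXPLAIN; in PostgreSQL, a fully covered query should show an Index Only Scan (output line illustrative):

EXPLAIN
SELECT total, created_at FROM orders
WHERE user_id = 42 AND status = 'completed';

-- Index Only Scan using idx_orders_covering on orders (cost=... rows=...)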

Partial Indexes

Index only the rows you care about:

CREATE INDEX idx_active_users ON users(id)
WHERE status = 'active';

This index is smaller and faster to maintain than indexing all users.
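
One caveat worth sketching: the planner can use a partial index only when the query’s WHERE clause implies the index’s predicate:

SELECT id FROM users WHERE status = 'active';   -- can use idx_active_users
SELECT id FROM users WHERE status = 'deleted';  -- cannot; falls outside the index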

Cursor-Based Pagination (vs. OFFSET)

For paginating large datasets, OFFSET is slow:

SELECT * FROM orders
WHERE user_id = 42
ORDER BY id
LIMIT 100 OFFSET 1000;

OFFSET 1000 means “skip the first 1000 rows”: the database must still read and order those rows even though it discards them. The deeper the page, the more work is thrown away.

Solution: Use cursor-based (keyset) pagination:

SELECT * FROM orders
WHERE user_id = 42
  AND id > 9999  -- last id from previous page
ORDER BY id
LIMIT 100;

With an index on (user_id, id), the database seeks directly past id 9999 and reads the next 100 rows in order. No skipping, and the cost stays flat no matter how deep you page.

Materialized Views for Complex Aggregations

If you have a complex query that’s run repeatedly:

SELECT user_id, COUNT(*) as order_count, SUM(total) as revenue
FROM orders
GROUP BY user_id;

Create a materialized view:

CREATE MATERIALIZED VIEW user_revenue AS
SELECT user_id, COUNT(*) as order_count, SUM(total) as revenue
FROM orders
GROUP BY user_id;

Now queries hit the pre-computed view instead of aggregating millions of rows.

Trade-off: The view is stale until you refresh it. Refresh periodically or on-demand.
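
In PostgreSQL syntax, the refresh looks like this (CONCURRENTLY avoids blocking readers, but requires a unique index on the view):

REFRESH MATERIALIZED VIEW user_revenue;

-- Or, to keep serving reads during the refresh:
REFRESH MATERIALIZED VIEW CONCURRENTLY user_revenue;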

The Query Optimization Decision Tree

Here’s a checklist for optimizing a slow query:

1. Run EXPLAIN ANALYZE
   ├─ Are there Seq Scans on large tables?
   │  └─ Add an index on the WHERE clause columns
   ├─ Are there nested loop joins?
   │  └─ Add an index on the join columns
   ├─ Is there a Sort step on a large dataset?
   │  └─ Add an index on the sort columns
   └─ Is the query plan reasonable but still slow?
      └─ Go to step 2

2. Check query structure
   ├─ Is this an N+1 pattern?
   │  └─ Rewrite as a JOIN
   ├─ Are you selecting * ?
   │  └─ Select only needed columns
   ├─ Are you paginating with OFFSET?
   │  └─ Use cursor-based pagination
   └─ Is the query inherently expensive?
      └─ Go to step 3

3. Consider caching or denormalization
   ├─ Can you cache the result?
   │  └─ Use a cache layer (Redis, Memcached)
   ├─ Is this a frequent aggregation?
   │  └─ Create a materialized view
   └─ Does the query match a common pattern?
      └─ Consider data structure changes

Example: Optimizing a Real Slow Query

Let’s optimize a dashboard query:

Original (takes 5 seconds):

SELECT u.id, u.name, u.email,
       COUNT(o.id) as orders,
       SUM(o.total) as revenue
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.status = 'active'
GROUP BY u.id, u.name, u.email;

EXPLAIN shows:

Aggregate
  -> Hash Join
     -> Seq Scan on users (1 million rows)
     -> Seq Scan on orders (100 million rows)

Every row in users and orders is scanned. Issues:

  1. No index on users(status). Seq Scan on 1 million rows.
  2. No index on orders(user_id). Full scan of 100 million rows for the join.
  3. Grouping on u.name and u.email is redundant when u.id is unique; they’re functionally dependent on the key.

Optimized:

CREATE INDEX idx_users_status ON users(status);
CREATE INDEX idx_orders_user_id ON orders(user_id);

SELECT u.id, u.name, u.email,
       COUNT(o.id) as orders,
       SUM(o.total) as revenue
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.status = 'active'
GROUP BY u.id;  -- name and email depend on the primary key, so grouping by id suffices (PostgreSQL permits this)

Result: 5 seconds becomes 150ms. The indexes allow the database to:

  • Use the status index to find only active users (worthwhile when active users are a selective slice of the table).
  • Use the user_id index to join efficiently instead of scanning all 100 million orders.

Key Takeaways

  • Most slow systems are database-bound. Master query optimization and you’ve addressed the largest share of real-world performance problems.
  • EXPLAIN ANALYZE is your primary diagnostic tool. Learn to read it fluently.
  • The four expensive operations: table scans, joins, sorts, and aggregations. Indexes can accelerate all of them.
  • Indexes have costs (write slowdown, storage). Use them strategically, not everywhere.
  • N+1 queries are the most common pattern. Use JOINs instead of loops.
  • Column order in composite indexes matters. Most selective first.
  • Cursor-based pagination beats OFFSET for large offsets.
  • Materialized views pre-compute expensive aggregations.

Practice Scenarios

Scenario 1: A user profile endpoint takes 800ms. EXPLAIN shows a Seq Scan on the users table with no filter (you’re fetching 1 user out of 10 million rows). The query is SELECT * FROM users WHERE username = ?. Why is this scanning the full table? What index would fix it?

Scenario 2: You have a query that JOINs users, orders, and order_items tables. It’s taking 3 seconds. You add an index on orders.user_id, but it’s still slow. What would you check next?

Scenario 3: Your dashboard query aggregates orders by user and category. It runs once per hour and takes 20 seconds. It’s not in your critical path, but it’s a database hog. What’s the best optimization for this pattern?

Connecting to the Next Section

Database optimization often solves the bulk of performance problems. But what about the rest? The next section (Ch. 107) covers application-level optimizations—the work your code does between receiving a request and hitting the database. You’ll learn caching, async processing, batching, and serialization optimization. Together, database and application optimizations form the foundation of a fast system.