Query Optimization
Introduction: The 15-Second Dashboard Crisis
It’s Monday morning at a B2B SaaS company. The product team opens Slack with an urgent message: “The analytics dashboard is unusable—it’s taking 15 seconds to load.” The CEO is in a board meeting. Three customers have already complained. Your team springs into action.
You pull up the query logs and find the culprit: a single SQL query hitting the orders table, then joining to customers, then to products, without a single index. The query is performing a full table scan on 50 million rows. Meanwhile, you know the answer exists—it always does. The data is there. But the path to that data is catastrophically inefficient.
This is query optimization. It’s not about making magical improvements; it’s about finding the right path through your data. In the previous section, we covered indexes as the tool for fast lookups. Now we’re looking at the orchestration layer—the query optimizer—that decides which path to take, and how to help it make better decisions.
By the end of this section, you’ll know how modern databases decide whether to use that index, when to rewrite your query automatically, and how to recognize optimization opportunities yourself. More importantly, you’ll understand the trade-offs: when to optimize in the database, when to optimize in code, and when to stop optimizing altogether.
What Is Query Optimization?
Query optimization is the process of taking a user’s request (a SQL query) and executing it as efficiently as possible. It’s not about rewriting the query to return different results—it’s about returning the same results faster.
The database doesn’t execute your query as written. Instead, it:
- Parses your SQL into an abstract syntax tree
- Plans multiple execution strategies and estimates their cost
- Chooses the plan with the lowest estimated cost
- Executes that plan
- Returns results
The Query Optimizer’s Job
The query optimizer is the engine of this process. It’s a sophisticated piece of software that tries to find the cheapest way to answer your question. “Cheap” here means minimizing resource usage—typically measured in disk I/O operations, CPU cycles, and memory.
Cost-Based vs. Rule-Based Optimization
There are two fundamental approaches to query optimization:
Cost-Based Optimization (used by PostgreSQL, MySQL, Oracle, SQL Server):
- The optimizer maintains statistics about your data (table sizes, column distributions, index selectivity)
- It generates multiple candidate plans and estimates the cost of each
- It picks the plan with the lowest estimated cost
- More computationally expensive for the optimizer, but better at adapting to your actual data
Rule-Based Optimization (older systems, sometimes faster for simple queries):
- The optimizer applies a fixed set of rules: “Always use an index if available,” “Join the smallest table first,” etc.
- No statistics needed, faster to plan
- Often makes poor choices when the rules don’t match reality
Modern systems use cost-based optimization because data distributions vary. An index that’s perfect when a filter matches 1% of a table’s rows can be terrible when it matches 99% of them.
Logical vs. Physical Query Plans
A logical plan describes what needs to happen without specifying how:
Filter (price > 100)
└─ Join (orders.customer_id = customers.id)
├─ Scan (orders table)
└─ Scan (customers table)
A physical plan specifies the concrete algorithms and data structures:
Filter (price > 100) using CPU
└─ Hash Join on orders.customer_id = customers.id
├─ Index Range Scan on orders (price_idx where price > 100)
└─ Hash lookup on customers.id (build from customers table)
The optimizer transforms a logical plan into a physical plan by choosing specific algorithms for each operation.
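To see the physical plan your database actually chose, you can ask it directly; in PostgreSQL, for example, EXPLAIN prints the chosen plan without executing the query. A minimal sketch, reusing the orders and customers tables from the running example:

-- Show the chosen physical plan without running the query (PostgreSQL)
EXPLAIN
SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.amount > 100;
-- The output is a tree of physical operators (Seq Scan, Index Scan, Hash Join, ...)
-- with the planner's cost and row estimates attached to each node.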
The Road Trip Analogy
Imagine you’re planning a drive from New York to Boston. You could:
- Scenic route: Drive down every road in America until you stumble upon Boston (full table scan)
- Highway route: Take the interstate directly (index lookup)
- Hybrid: Drive on highways but make several stops for gas (index with filtering)
Your GPS is the query optimizer. It considers:
- Traffic patterns (data distribution): The scenic route might have less traffic on weekends
- Distance (I/O cost): How many disk reads to get the answer?
- Tolls (CPU cost): How much computation per disk read?
- Multiple routes: Maybe the scenic route is better if traffic is bad
The GPS doesn’t re-plan your trip every 5 miles. Similarly, the optimizer plans once (using statistics) and commits. If your assumptions change—traffic spikes, roads close, or you discover a shortcut—the optimizer won’t adapt mid-trip. This is why stale statistics are dangerous.
How the Query Optimizer Works
Let’s trace a query through a modern optimizer (like PostgreSQL’s):
Parsing
SELECT o.id, c.name, SUM(o.amount)
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.amount > 100
GROUP BY o.id, c.name
HAVING SUM(o.amount) > 500
The parser converts this into an abstract syntax tree (AST). If there’s a syntax error, it stops here.
Logical Planning
The planner rewrites the query into an equivalent but often more efficient form:
- Predicate pushdown: Move filters closer to tables
- Join reordering: Determine the order in which to join tables
- Aggregate optimization: Rewrite GROUP BY to use indexes if possible
Physical Planning & Cost Estimation
This is where statistics matter. The planner asks the statistics subsystem:
- “How many rows are in the orders table?” (table size)
- “How many distinct values in customer_id?” (cardinality)
- “What percentage of orders have amount > 100?” (selectivity)
These are stored in histograms—sketches of the column’s value distribution. For example:
customer_id histogram:
[1-1000]: 5000 rows
[1001-2000]: 4800 rows
[2001-3000]: 5200 rows
...
The planner estimates the cost of each candidate plan using these statistics, and chooses the one with minimum cost.
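In PostgreSQL these statistics are exposed through the pg_stats system view, which is a convenient way to check what the planner currently believes about a column. A hedged sketch; other databases expose similar information through different views:

SELECT n_distinct,          -- estimated number of distinct values
       null_frac,           -- fraction of NULLs in the column
       most_common_vals,    -- the most frequent values...
       most_common_freqs,   -- ...and how often each occurs
       histogram_bounds     -- bucket boundaries for the remaining values
FROM pg_stats
WHERE tablename = 'orders'
  AND attname  = 'customer_id';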
Execution
The physical plan is compiled into executable code and run.
Table Statistics and Cardinality Estimation
Statistics are the optimizer’s crystal ball. Without them, it guesses (badly). With wrong statistics, it optimizes for the wrong scenario.
Key statistics:
| Statistic | Meaning | Why It Matters |
|---|---|---|
| Row count | Total rows in the table | Is this a huge table or small? |
| Column cardinality | Distinct values in a column | How selective is a filter on this column? |
| Null count | How many NULL values | Affects JOIN performance |
| Histogram | Distribution of values | Are values uniformly distributed or clustered? |
| Correlation | Relationship between columns | "Orders with high amounts are from repeat customers" |
Cardinality estimation is the optimizer’s hardest problem. If a query filters on age < 30, the optimizer needs to know: “What percentage of users are under 30?” If the actual percentage is 10% but the stats say 50%, the optimizer will pick a sequential scan when an index would have been perfect.
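The quickest way to spot a cardinality misestimate is to compare the planner’s guess with reality: EXPLAIN ANALYZE reports both the estimated and the actual row counts for every plan node. A sketch, assuming a hypothetical users table with an age column:

EXPLAIN ANALYZE
SELECT * FROM users WHERE age < 30;
-- Look for lines like:
--   Seq Scan on users  (cost=... rows=500000 ...) (actual time=... rows=98000 ...)
-- A large gap between the estimated rows and the actual rows means the
-- statistics are stale or the column needs a richer histogram.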
Pro Tip: Update Statistics Regularly
Most databases auto-analyze, but if you have volatile data or batch imports, explicitly run:
-- PostgreSQL
ANALYZE table_name;
-- MySQL
ANALYZE TABLE table_name;
-- SQL Server
UPDATE STATISTICS table_name;
Stale statistics are a leading cause of mysterious query slowdowns after data migrations or bulk loads.
Join Algorithms: The Art of Combining Data
When your query joins two tables, the optimizer must choose from several algorithms. Each has different performance characteristics.
Nested Loop Join
for each row r in table A:
for each row s in table B:
if r.id == s.id:
output (r, s)
Cost: roughly M × N comparisons for a plain nested loop, or about M × log(N) when the inner table has an index on the join key (an index nested loop), where M and N are the table sizes
Characteristics:
- Simplest algorithm
- Best when one table is small or the join condition is highly selective
- Scales terribly with large tables
When chosen: Small table joined with larger table on an indexed column
Hash Join
1. Build phase: Load table B into a hash table on join key
2. Probe phase: For each row in table A, probe the hash table
Cost: M + N (linear, very fast)
Characteristics:
- Requires enough memory to hold the hash table
- Excellent for large table joins
- Can spill to disk if the hash table exceeds memory
When chosen: Large table joined with large table, no index on join column
Merge Join
1. Sort table A by join key (if not already sorted)
2. Sort table B by join key (if not already sorted)
3. Merge the sorted streams
Cost: M log M + N log N (dominated by sorting)
Characteristics:
- Excellent when both tables are already sorted (e.g., both on indexed columns)
- Can produce results in sorted order (useful for downstream operations)
- Requires significant I/O for sorting if tables aren’t already sorted
When chosen: Tables are already sorted on the join column, or sorting is beneficial for other reasons
Comparison Table
| Algorithm | Small × Large | Medium × Medium | Large × Large | Indexed Join Col |
|---|---|---|---|---|
| Nested Loop | Best | Poor | Terrible | Best |
| Hash | Good | Best | Best | Doesn’t matter |
| Merge | Good | Good | Good | Excellent |
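You normally don’t pick the join algorithm yourself; you read it off the plan and, when debugging, temporarily steer the planner to compare alternatives. A hedged PostgreSQL sketch using the session-level enable_* planner settings (diagnostic knobs, not production fixes):

-- See which join the planner prefers for the running example
EXPLAIN
SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;

-- Discourage hash joins for this session and compare the alternative plan
SET enable_hashjoin = off;
EXPLAIN
SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;

RESET enable_hashjoin;   -- restore the default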
Query Rewriting Techniques
The optimizer often rewrites your query automatically. Understanding these transformations helps you write better queries.
Subquery Unnesting
Before (subquery):
SELECT * FROM orders
WHERE customer_id IN (
SELECT id FROM customers WHERE country = 'USA'
)
What the optimizer does (rewrite to JOIN):
SELECT DISTINCT o.* FROM orders o
INNER JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA'
The join form is usually faster: the planner can use indexes on either table and choose freely among join algorithms and join orders, instead of treating the subquery as a black box.
Predicate Pushdown
Before:
SELECT * FROM (
SELECT * FROM orders
WHERE amount > 0
) WHERE status = 'completed'
What the optimizer does (push both filters down):
SELECT * FROM orders
WHERE amount > 0 AND status = 'completed'
This filters at the source, reducing rows earlier in the pipeline.
Aggregate Optimization
Before:
SELECT customer_id, COUNT(*)
FROM orders
WHERE order_date > '2024-01-01'
GROUP BY customer_id
If there’s an index on (order_date, customer_id), the optimizer can use the index directly instead of scanning the entire table.
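A hedged sketch of such an index on the example’s columns; with the filter column first and the grouping column second, PostgreSQL can often answer this query from the index alone (an index-only scan) without touching the table heap:

CREATE INDEX idx_orders_date_customer
    ON orders (order_date, customer_id);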
The N+1 Query Problem
One query optimization pitfall that no optimizer can fix is the N+1 problem. It’s a coding pattern, not a query issue.
# BAD: N+1 queries
customers = db.query("SELECT * FROM customers LIMIT 100")
for customer in customers:
orders = db.query("SELECT * FROM orders WHERE customer_id = ?", customer.id)
# Process customer and orders
This executes 1 query to fetch customers + 100 queries to fetch their orders = 101 total queries. If each query has a 1ms network roundtrip, that’s 100ms wasted waiting.
Solution 1: Eager loading (JOIN)
SELECT c.*, o.*
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
LIMIT 100
One query, but potentially many duplicate rows (one per order).
Solution 2: Batch loading
customer_ids = [c.id for c in customers]
orders_by_customer = db.query(
"SELECT * FROM orders WHERE customer_id IN (?)",
customer_ids
)
# Index results in memory
orders_map = {}
for order in orders_by_customer:
if order.customer_id not in orders_map:
orders_map[order.customer_id] = []
orders_map[order.customer_id].append(order)
Two queries total, no matter how many customers, and no duplicated customer columns in the result; the grouping happens in application memory. ORMs often call this “batch loading” or “prefetching.”
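How the id list is bound varies by database and driver; in PostgreSQL it is common to pass the whole list as a single array parameter rather than interpolating a long IN list. A sketch (the :customer_ids placeholder syntax depends on your driver):

SELECT *
FROM orders
WHERE customer_id = ANY(:customer_ids);  -- bound once as an integer array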
Pagination: OFFSET vs. Keyset
How you paginate affects query performance dramatically.
OFFSET-Based Pagination
SELECT * FROM orders
ORDER BY id DESC
LIMIT 20 OFFSET 1000000
This tells the database: “Skip 1 million rows, then give me 20.” Even with an index on id, the database must walk past and discard those 1 million rows before it can return the 20 you asked for.
Cost: Grows linearly with the offset. Page 1000 costs roughly 1000× more than page 1.
Problem: Queries get progressively slower as users paginate deeper.
Keyset Pagination (Cursor-Based)
SELECT * FROM orders
WHERE id < ? /* last id from previous page */
ORDER BY id DESC
LIMIT 20
This says: “Give me the next 20 orders, starting just below the last id I saw on the previous page.” With an index on id, the database can jump straight to that position.
Cost: Constant regardless of page number.
Benchmark (PostgreSQL, 10M rows, index on id):
- Page 1 (OFFSET 0): 0.1ms
- Page 100 (OFFSET 2000): 5ms
- Page 100,000 (OFFSET 2M): 800ms
- Keyset, any page: 0.1ms
Trade-off: Keyset pagination is harder to implement (can’t jump to arbitrary pages) but vastly more scalable.
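Real tables are rarely paginated by id alone; if you sort by a non-unique column such as a timestamp, the cursor needs a tie-breaker so rows are never skipped or repeated. A hedged sketch using a row-value comparison (supported by PostgreSQL; assumes an index on (created_at, id) and driver-dependent placeholder syntax):

SELECT *
FROM orders
WHERE (created_at, id) < (:last_created_at, :last_id)  -- cursor from the previous page
ORDER BY created_at DESC, id DESC
LIMIT 20;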
Partitioning as Query Optimization
Partitioning tables by key (typically time or geography) can dramatically improve query optimization.
Example: an orders table partitioned by year (MySQL-style syntax)
CREATE TABLE orders (
id BIGINT,
customer_id INT,
order_date DATE,
amount DECIMAL
) PARTITION BY RANGE (YEAR(order_date)) (
PARTITION p2022 VALUES LESS THAN (2023),
PARTITION p2023 VALUES LESS THAN (2024),
PARTITION p2024 VALUES LESS THAN (2025)
)
Query for 2024 orders:
SELECT * FROM orders WHERE order_date >= '2024-01-01'
The optimizer uses partition pruning: it scans only the p2024 partition and skips p2022 and p2023 entirely. With data spread roughly evenly across the three years, that cuts the I/O for this query by about two-thirds.
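For comparison, the same idea in PostgreSQL’s declarative partitioning syntax, where each partition is created as its own table and pruned the same way (a minimal sketch):

CREATE TABLE orders (
    id          BIGINT,
    customer_id INT,
    order_date  DATE,
    amount      NUMERIC
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_2023 PARTITION OF orders
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE orders_2024 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');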
Limitations: Partitioning adds complexity. Use it when:
- Tables exceed available memory or reasonable index sizes
- Queries naturally filter by the partition key
- You need to archive/delete old partitions
Reading EXPLAIN ANALYZE Output
Let’s interpret a real query plan. Here’s a slow query:
EXPLAIN ANALYZE
SELECT o.id, c.name, SUM(o.amount)
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.amount > 100
GROUP BY o.id, c.name;
Output:
GroupAggregate
  Group Key: o.id, c.name
  ->  Sort
        Sort Key: o.id, c.name
        ->  Hash Join  (cost=500.00..55000.00 rows=500000)
              Hash Cond: (o.customer_id = c.id)
              ->  Seq Scan on orders o  (cost=0.00..50000.00 rows=500000)
                    Filter: (amount > 100)
                    Rows Removed by Filter: 9500000
              ->  Hash  (cost=250.00..250.00 rows=10000)
                    ->  Seq Scan on customers c  (cost=0.00..250.00 rows=10000)
How to read it:
- Seq Scan on orders: a full table scan (inefficient if an index exists on the filter column)
- Rows Removed by Filter: 9500000: only 500,000 of 10 million rows survive the amount > 100 filter
- Hash Join: the planner chose the hash join algorithm
- cost=0.00..50000.00: estimated cost units (startup cost..total cost), not elapsed time
Red flags:
- Full table scans (Seq Scan) when an index exists on the filter column
- “Rows removed” is very high (filter applied late)
- Memory issues (“Spill to disk”)
Optimization:
CREATE INDEX idx_orders_amount ON orders(amount);
Now the query can use an index range scan instead of a full scan.
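After creating the index, refresh the statistics and re-check the plan; the goal is to see the sequential scan replaced by an index-based scan on the filter column. A hedged PostgreSQL sketch (the exact plan shape depends on your data):

ANALYZE orders;

EXPLAIN ANALYZE
SELECT o.id, c.name, SUM(o.amount)
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.amount > 100
GROUP BY o.id, c.name;
-- Expect an Index Scan or Bitmap Index Scan on idx_orders_amount instead of
-- the Seq Scan with millions of rows removed by the filter.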
Trade-Offs in Query Optimization
Query Complexity vs. Readability
A hand-optimized query might run 10% faster but be unreadable:
-- Fast but unmaintainable
SELECT /*+ APPEND */ * FROM (
SELECT /*+ PARALLEL(8) */ * FROM orders
WHERE ROWNUM <= 1000
) WHERE customer_id IN (
SELECT id FROM customers WHERE status = 'active'
)
vs.
-- Readable, likely 5% slower
SELECT o.* FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.status = 'active'
LIMIT 1000
When to optimize for speed: High-traffic queries, dashboards, reporting. When to optimize for readability: Infrequent queries, maintenance code, legacy systems.
Denormalization for Read Performance
Normalized schemas are write-efficient but read-inefficient. Denormalization caches frequently-needed data.
Normalized approach:
SELECT o.id, c.name, s.status_name
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN order_statuses s ON o.status_id = s.id
Denormalized approach:
SELECT id, customer_name, status_name
FROM orders
The denormalized version is faster but requires careful updates: if a customer’s name changes, you must update every row in the orders table.
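A sketch of that write amplification, assuming a denormalized orders.customer_name column: a single rename has to touch every order the customer has ever placed.

UPDATE orders
SET    customer_name = 'Acme Corp (renamed)'
WHERE  customer_id = 42;   -- rewrites one row per historical order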
When to denormalize: Only for heavily-read data where writes are infrequent or batched.
Materialized Views: Precomputed Results vs. Staleness
A materialized view stores query results as a physical table:
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT CAST(order_date AS DATE) AS sale_date,
       SUM(amount)              AS total_sales,
       COUNT(*)                 AS order_count
FROM orders
GROUP BY CAST(order_date AS DATE);
Cost: Stale data. The view only reflects data as of its last refresh (typically run on a schedule, e.g., hourly, or on demand); between refreshes, reports lag behind the live tables.
Benefit: Queries run in microseconds instead of seconds.
Trade-off: Only use for reports where 1-hour staleness is acceptable.
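In PostgreSQL the view is refreshed explicitly, typically from a scheduled job. A hedged sketch, assuming the sale_date column from the view definition above; CONCURRENTLY avoids blocking readers but requires a unique index on the view:

-- Blocking refresh: simplest, but readers wait while the view rebuilds
REFRESH MATERIALIZED VIEW daily_sales_summary;

-- Non-blocking refresh: needs a unique index on the view
CREATE UNIQUE INDEX idx_daily_sales_summary_day ON daily_sales_summary (sale_date);
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales_summary;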
Optimization at the Application Level
Sometimes the best query optimization is to not query.
Caching example:
# Query: 500ms
result = db.query("SELECT COUNT(*) FROM products")
# With cache: 0.1ms
result = cache.get("product_count") or db.query(...)
When to cache: Queries that are expensive but change slowly.
Danger: Cache invalidation is hard. Stale cache is often worse than no cache.
Key Takeaways
- The query optimizer is a cost-based planner that estimates execution costs using table statistics. Stale or missing statistics are a leading cause of bad query plans.
- Join algorithms (Nested Loop, Hash, Merge) are chosen based on table size and index availability. Hash joins are best for large tables without indexes.
- The N+1 problem is a coding pattern where you issue one query per row instead of one query for all rows. Eager loading or batch loading fixes it.
- Keyset (cursor-based) pagination scales dramatically better than OFFSET pagination for large datasets: its cost stays flat no matter how deep the user pages.
- Query rewriting (subquery unnesting, predicate pushdown, partition pruning) happens automatically in modern databases, but you can help the optimizer by writing clear, simple queries.
- EXPLAIN ANALYZE is your window into the optimizer’s decisions. Use it to identify sequential scans, unused indexes, and inefficient join orders.
- Don’t optimize prematurely. Optimize high-traffic queries first. Measure before and after. Readable code is often worth 5-10% performance loss.
Put It Into Practice
Scenario 1: The Slow Report
Your reporting system runs this query every morning, and it takes 45 minutes:
SELECT
c.name,
COUNT(o.id) as order_count,
SUM(o.amount) as total_spent
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
WHERE o.created_at >= DATE_SUB(NOW(), INTERVAL 365 DAY)
GROUP BY c.name
ORDER BY total_spent DESC
Analyze this query:
- What index would help most?
- Would partitioning the orders table improve performance?
- Could a materialized view be a better solution?
- What statistics might be stale?
Scenario 2: The N+1 API Endpoint
Your REST API endpoint /api/customers/:id/orders is slow because it runs 1 + N queries (1 customer query + 1 query per order). Design a solution that:
- Reduces the number of queries to 1 or 2
- Keeps response size reasonable
- Handles pagination without OFFSET
Scenario 3: The Mysterious Slowdown
A query that ran in 100ms yesterday now runs in 5 seconds today. The query itself hasn’t changed. What could cause this? How would you diagnose it?
Connection to the Next Section
We’ve explored how to optimize individual queries, but a query’s speed is only half the story. The other half is how you connect to the database and manage those connections. In the next section on Connection Pooling and Database Performance, we’ll dive into the infrastructure layer: how to prevent connection storms, reuse connections efficiently, and handle failures gracefully. You’ll learn why even perfectly optimized queries can’t save you if you’re opening and closing connections carelessly.