System Design Fundamentals

Relational vs NoSQL Databases

When the Choice Matters Most

Picture this: You’re building a platform that serves both high-frequency trading data and flexible user profiles. Your first instinct is to use what you know—perhaps a traditional relational database. But halfway through, you realize you’re constantly adding new fields to your user profile, creating migration headaches. Meanwhile, your financial transaction system demands strict consistency and complex multi-table joins. You’re facing a problem that has plagued architects for the past two decades: which database paradigm is right for this workload?

This chapter builds on our previous exploration of caching strategies (Chapter 36) and edge distribution (Chapter 37). Those systems cache and replicate data efficiently, but they assume data already exists in a well-understood form. Before we dive into transactional guarantees in Chapter 39, we need to understand where and how data lives at its source.

The choice between relational and NoSQL databases isn’t academic—it determines your system’s scaling ceiling, operational complexity, and engineering velocity. Let’s move beyond “NoSQL is fast” and “SQL is reliable” to understand the fundamental tradeoffs.

Understanding the Fundamental Paradigms

Relational Databases (RDBMS)

A relational database organizes data into normalized tables with strictly defined schemas. Each table has columns with predefined types, and rows represent individual records. The schema is enforced at write time—you cannot insert a record that doesn’t conform to the table structure.

Core concepts:

  • Schema-on-write: Data structure is enforced when inserted
  • Normalization: Data is split across multiple tables to minimize redundancy
  • Foreign keys: References enforce referential integrity
  • ACID transactions: Multiple operations succeed together or fail together

-- Example: An e-commerce product catalog
CREATE TABLE products (
  id INT PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  price DECIMAL(10, 2),
  created_at TIMESTAMP
);

CREATE TABLE product_reviews (
  id INT PRIMARY KEY,
  product_id INT NOT NULL,
  rating INT CHECK (rating BETWEEN 1 AND 5),
  review_text TEXT,
  FOREIGN KEY (product_id) REFERENCES products(id)
);

-- Querying across tables
SELECT p.name, AVG(r.rating) as avg_rating
FROM products p
LEFT JOIN product_reviews r ON p.id = r.product_id
GROUP BY p.id;

NoSQL Databases

NoSQL encompasses diverse database types united by a rejection of the relational model. Instead of rigid schemas, NoSQL systems typically embrace flexible data structures, horizontal scaling, and high throughput.

Main categories:

Document stores (MongoDB, CouchDB): Store semi-structured documents (typically JSON-like). Each document can have a different structure within the same collection.

// MongoDB-style document store
db.products.insertOne({
  _id: ObjectId("..."),
  name: "Wireless Headphones",
  price: 149.99,
  specs: {
    weight_g: 250,
    battery_hours: 20,
    colors: ["black", "silver", "blue"]
  },
  reviews: [
    { user: "alice", rating: 5, text: "Great sound!" },
    { user: "bob", rating: 4, text: "Good but pricey" }
  ],
  created_at: ISODate("2024-01-15")
})

Key-value stores (Redis, DynamoDB, Memcached): Store values associated with keys. Simple but fast. Often in-memory or in-memory-first.

SET user:1001:profile '{"name": "Alice", "age": 28, "tier": "premium"}'
GET user:1001:profile
HSET user:1001:prefs color light theme dark

Column-family stores (Cassandra, HBase): Organize data by column rather than row. Excellent for time-series data and analytics.

// Conceptual Cassandra model
CREATE TABLE metrics (
  host TEXT,
  timestamp BIGINT,
  cpu_usage DECIMAL,
  memory_usage DECIMAL,
  disk_io DECIMAL,
  PRIMARY KEY (host, timestamp)
);

Graph databases (Neo4j, Amazon Neptune): Explicitly model relationships between entities. Traverse relationships with minimal computation.

// Neo4j Cypher query
MATCH (user:User {id: "alice"})-[:FOLLOWS]->(friend:User)
       -[:LIKES]->(post:Post)
WHERE user.created_at > datetime('2024-01-01')
RETURN DISTINCT post.title
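
Graph databases traverse relationships cheaply because each node stores direct references to its neighbors, so a traversal is pointer-hopping rather than repeated index lookups or joins. A minimal Python sketch of the same follows-then-likes traversal (the data and names are illustrative):

```python
# Adjacency lists stand in for a graph store: each node keeps direct
# references to its neighbors, so traversal never touches an index.
follows = {"alice": ["bob", "carol"]}                      # user -[:FOLLOWS]-> user
likes = {"bob": ["post1"], "carol": ["post1", "post2"]}    # user -[:LIKES]-> post

def posts_liked_by_followed(user):
    """Distinct posts liked by anyone the user follows."""
    seen = set()
    for friend in follows.get(user, []):
        for post in likes.get(friend, []):
            seen.add(post)
    return sorted(seen)

print(posts_liked_by_followed("alice"))  # -> ['post1', 'post2']
```

In a relational schema the same question needs two joins (users-to-follows, follows-to-likes); in a graph store the edges are already materialized as pointers.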

Schema-on-Write vs Schema-on-Read

The fundamental philosophical difference separates these paradigms:

Relational databases enforce schema-on-write: When you insert data, the database checks that it matches the table definition. Invalid data is rejected. This catches errors early but requires migrations when requirements change.

NoSQL systems embrace schema-on-read: The database accepts whatever structure you send. Applications interpret the data when reading it. This provides flexibility but can hide inconsistencies.

Consider a user profile system:

-- Schema-on-write (SQL)
ALTER TABLE users ADD COLUMN phone_number VARCHAR(20);
-- Every existing row now has a phone_number column (NULL initially)
-- New rows must include or default this field

// Schema-on-read (MongoDB)
db.users.updateOne({_id: 1}, {$set: {phone_number: "555-1234"}})
// Other documents may lack this field—applications handle it
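
With schema-on-read, the burden of tolerating missing or differently shaped fields shifts to application code. A sketch of the defensive-read pattern in Python (the field names and defaults are illustrative):

```python
def read_profile(doc):
    """Normalize a schemaless user document at read time.
    Older documents may lack fields entirely; handle both shapes."""
    return {
        "name": doc.get("name", "unknown"),
        "phone_number": doc.get("phone_number"),  # None if never written
        "tier": doc.get("tier", "free"),          # default for pre-tier documents
    }

old_doc = {"name": "Alice"}  # written before phone_number existed
new_doc = {"name": "Bob", "phone_number": "555-1234", "tier": "premium"}

assert read_profile(old_doc)["phone_number"] is None
assert read_profile(new_doc)["tier"] == "premium"
```

The database happily stores both documents; only this read path knows how to reconcile them.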

A Library Analogy

Imagine organizing a library:

Relational approach resembles the Dewey Decimal System. Every book gets a precise call number based on strict hierarchies. If a book is both fiction and biography, you must choose one category. Adding a new book type (like audiobooks) requires adjusting the entire taxonomy. But once classified, you can instantly find all books in a category, and the structure prevents duplication.

NoSQL approach resembles a personal filing system where you create folders dynamically. A folder for “favorite authors” might contain books, magazine clippings, and handwritten notes—all mixed together. Adding a new file type requires no restructuring. But finding “all fiction books” requires you to examine each folder and decide which items count as fiction.

Both systems work. The question is: what’s your primary use case?

Comparing Database Types: Technical Details

Architecture and Scaling

| Aspect | Relational | Document | Key-Value | Column-Family | Graph |
|--------|------------|----------|-----------|---------------|-------|
| **Write Model** | Row-based, normalized | Document | Single key | Column groups | Nodes & edges |
| **Scaling** | Vertical (sharding complex) | Horizontal (partition by key) | Horizontal (trivial) | Horizontal (excellent) | Horizontal (with replication cost) |
| **Consistency** | Strong (ACID) | Tunable (eventual by default) | Eventual (configurable) | Eventual (tunable) | Depends on implementation |
| **Query Flexibility** | Excellent (any join) | Good (within doc) | Poor (key-value only) | Limited (range scans) | Excellent (relationship traversal) |
| **Storage Efficiency** | High (normalized) | Medium (denormalized) | Depends on values | High (compression friendly) | Medium (relationship overhead) |

Relational Strengths

  • Complex joins: Efficiently combine data from multiple tables with predicates
  • Consistency guarantees: ACID properties ensure data integrity
  • Data integrity: Foreign keys prevent orphaned records
  • Analytics: Aggregate queries across normalized data
  • Transactional safety: Multiple operations atomic

When relational excels:

  • Financial systems (banking, trading)
  • E-commerce transactions (orders, inventory, payments)
  • HR and employee management
  • Reporting and analytics with complex aggregations

NoSQL Strengths

  • Flexible schema: Add fields without migrations
  • Horizontal scaling: Partition data across many servers trivially
  • High write throughput: Optimized for write-heavy workloads
  • Denormalization benefit: Embed related data in a single document for fast reads
  • Specific optimization: Each type optimizes for its access pattern

When NoSQL excels:

  • User profiles and session data (flexible fields)
  • Real-time analytics and metrics (high volume writes)
  • Content management (blogs, comments, social feeds)
  • Caching layers (Redis)
  • IoT sensor data (millions of writes per second)
  • Social graphs (relationship-heavy queries)

Practical Example: E-Commerce Product Catalog

Let’s design a product catalog system that must:

  • Store product information with variable attributes
  • Support fast lookups by product ID
  • Enable searching by category and price range
  • Track reviews and ratings
  • Handle 10,000 writes per second during sales

Relational Design (PostgreSQL)

CREATE TABLE products (
  id BIGINT PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  sku VARCHAR(50) UNIQUE,
  category_id INT NOT NULL,
  price_cents BIGINT NOT NULL,
  stock INT DEFAULT 0,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW(),
  FOREIGN KEY (category_id) REFERENCES categories(id)
);

CREATE TABLE product_attributes (
  id BIGINT PRIMARY KEY,
  product_id BIGINT NOT NULL,
  attribute_key VARCHAR(100),
  attribute_value TEXT,
  FOREIGN KEY (product_id) REFERENCES products(id),
  UNIQUE(product_id, attribute_key)
);

CREATE TABLE reviews (
  id BIGINT PRIMARY KEY,
  product_id BIGINT NOT NULL,
  user_id BIGINT NOT NULL,
  rating INT CHECK (rating BETWEEN 1 AND 5),
  review_text TEXT,
  helpful_count INT DEFAULT 0,
  created_at TIMESTAMP,
  FOREIGN KEY (product_id) REFERENCES products(id),
  FOREIGN KEY (user_id) REFERENCES users(id)
);

CREATE INDEX idx_products_category ON products(category_id);
CREATE INDEX idx_products_price ON products(price_cents);
CREATE INDEX idx_reviews_product ON reviews(product_id);

Query example:

SELECT
  p.id, p.name, p.price_cents,
  COUNT(r.id) as review_count,
  AVG(r.rating) as avg_rating,
  STRING_AGG(
    CASE WHEN pa.attribute_key = 'color' THEN pa.attribute_value END,
    ', '
  ) as colors
FROM products p
LEFT JOIN reviews r ON p.id = r.product_id
LEFT JOIN product_attributes pa ON p.id = pa.product_id
WHERE p.category_id = $1 AND p.price_cents BETWEEN $2 AND $3
GROUP BY p.id
ORDER BY avg_rating DESC
LIMIT 50;

Tradeoffs:

  • Creating indexes on category and price enables fast filtering
  • Joins across products, attributes, and reviews add query complexity
  • New attribute types (like “weight_kg”) go into the generic product_attributes table, which avoids migrations but sacrifices type safety and complicates queries (the EAV pattern)
  • High write volume during sales means contention on the products table

Document Store Design (MongoDB)

// Single document per product—fully denormalized
db.products.insertOne({
  _id: ObjectId("..."),
  name: "Premium Wireless Headphones",
  sku: "WH-1000-BLK",
  category: "electronics/audio",
  price_cents: 14999,
  stock: 542,
  attributes: {
    color: "black",
    weight_g: 250,
    battery_hours: 20,
    warranty_months: 24,
    bluetooth_version: "5.3"
  },
  reviews: [
    {
      user_id: ObjectId("..."),
      rating: 5,
      text: "Excellent sound quality",
      helpful_count: 127,
      created_at: ISODate("2024-02-10")
    },
    {
      user_id: ObjectId("..."),
      rating: 4,
      text: "Great but a bit pricey",
      helpful_count: 89,
      created_at: ISODate("2024-02-09")
    }
  ],
  rating_summary: {
    average: 4.6,
    count: 2847,
    distribution: { 5: 1850, 4: 650, 3: 200, 2: 100, 1: 47 }
  },
  created_at: ISODate("2024-01-15"),
  updated_at: ISODate("2024-02-11")
})

Query example:

// Fast single-document lookup
db.products.findOne({sku: "WH-1000-BLK"})

// Filter by category and price with aggregation
db.products.aggregate([
  {
    $match: {
      category: "electronics/audio",
      price_cents: { $gte: 10000, $lte: 20000 }
    }
  },
  {
    $sort: { "rating_summary.average": -1 }
  },
  {
    $limit: 50
  },
  {
    $project: {
      name: 1,
      price_cents: 1,
      rating_summary: 1,
      attributes: 1,
      review_count: { $size: "$reviews" }
    }
  }
])

Tradeoffs:

  • Adding new attributes (like “warranty_months”) requires no schema changes—just update documents
  • Reviews embedded in the document mean a single read gets all data needed for display
  • No joins needed—faster queries for simple access patterns
  • But updating an embedded review (e.g., changing helpful_count) may rewrite the entire document on disk (write amplification)
  • Rating summary is pre-aggregated for speed (sacrifices freshness for performance)

Hybrid Approach

In reality, sophisticated systems often use both:

┌──────────────────────────────────────────────┐
│  Application Layer                           │
├──────────────────────────────────────────────┤
│  PostgreSQL              Redis Cache         │
│  - Canonical products    - Hot products      │
│  - Normalization         - Reviews           │
│  - Transactions          - Rankings          │
├──────────────────────────────────────────────┤
│  MongoDB                 Elasticsearch       │
│  - Product variants      - Full-text search  │
│  - Flexible attributes   - Faceting          │
│  - Quick writes          - Analytics         │
└──────────────────────────────────────────────┘

This approach uses:

  • PostgreSQL for the source of truth (products, inventory, orders)
  • Redis for caching hot data (reducing PostgreSQL load)
  • MongoDB for flexible product variants and rapidly changing attributes
  • Elasticsearch for full-text search across descriptions
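
One common wiring for this stack is a read-through cache in front of the source of truth: check Redis, fall back to PostgreSQL on a miss, then populate the cache. A simplified Python sketch using in-memory dictionaries as stand-ins for both stores (a real version would use a Redis client, a database driver, and a TTL):

```python
cache = {}  # stand-in for Redis
database = {  # stand-in for PostgreSQL (source of truth)
    "product:1": {"name": "Wireless Headphones", "price_cents": 14999},
}

def get_product(key):
    """Read-through cache: fast path hits the cache, misses fall back
    to the canonical store and warm the cache for subsequent reads."""
    if key in cache:
        return cache[key]          # cache hit: no database load
    row = database.get(key)        # cache miss: read the canonical row
    if row is not None:
        cache[key] = row           # populate so the next read is cheap
    return row

assert get_product("product:1")["price_cents"] == 14999  # miss -> database
assert "product:1" in cache                              # now cached
```

The key design decision is invalidation: when the PostgreSQL row changes, the cached copy must be deleted or overwritten, or readers see stale data for the cache entry's lifetime.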

Understanding the Consistency vs Flexibility Tradeoff

Consistency-First (Relational)

Relational databases prioritize strong consistency: Every read returns the most recent write. Multiple concurrent operations either all succeed or all fail atomically.

-- Transfer money: both or nothing
BEGIN TRANSACTION;
  UPDATE accounts SET balance = balance - 100 WHERE id = 1;
  UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
-- If either UPDATE fails, both rollback

This safety requires coordination overhead. When your data spans multiple servers, achieving strong consistency becomes expensive (two-phase commit, consensus algorithms). This limits horizontal scaling.

Flexibility-First (NoSQL)

NoSQL systems typically embrace eventual consistency: Operations complete immediately, but replicas catch up asynchronously. Different queries might temporarily see different data.

// Write succeeds immediately
db.products.updateOne({_id: 1}, {$set: {stock: 542}})

// Reader in datacenter A might see stock: 543
// Reader in datacenter B sees stock: 542 (temporarily)
// All replicas typically converge quickly (milliseconds to seconds)

This speeds up writes and enables trivial horizontal scaling, but requires applications to handle temporary inconsistencies.
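
A common way eventually consistent replicas converge is last-write-wins: each write carries a timestamp (or version), and every replica keeps whichever value is newest. A toy Python sketch of the merge step (real systems often use vector clocks or hybrid logical clocks to cope with clock skew; this is a simplification):

```python
def merge(replica_a, replica_b):
    """Last-write-wins merge: for each key, keep the (timestamp, value)
    pair with the larger timestamp. Merging is order-independent, so
    both replicas converge to the same state."""
    merged = dict(replica_a)
    for key, (ts, value) in replica_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

# Two replicas that accepted writes independently, values as (timestamp, value):
a = {"stock": (100, 543)}  # older write
b = {"stock": (105, 542)}  # newer write

# Convergence: merge order doesn't matter, and the newest write survives.
assert merge(a, b) == merge(b, a) == {"stock": (105, 542)}
```
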

When Each Matters

Strong consistency critical:

  • Financial transactions (double-spending prevention)
  • Inventory management (overselling prevention)
  • User authentication (preventing duplicate accounts)
  • Any situation where inconsistency has legal/safety implications

Eventual consistency acceptable:

  • Social media feeds (a few seconds of staleness is fine)
  • Analytics and metrics (100% real-time isn’t required)
  • Caching layers (by definition allows stale data)
  • User profile updates (eventual propagation is expected)

Denormalization: When and Why

One of the most misunderstood concepts in database design is denormalization.

Normalized design (relational ideal):

  • Each fact stored once
  • No redundancy
  • Requires joins to answer queries
  • Updates always consistent

-- Normalized: user details separated
SELECT users.name, COUNT(orders.id)
FROM users
LEFT JOIN orders ON users.id = orders.user_id
WHERE users.id = 123;

Denormalized design (NoSQL common):

  • Related data stored together
  • Some redundancy accepted
  • Single read gets everything needed
  • Updates must maintain consistency across copies

// Denormalized: order counts cached
db.users.findOne({_id: 123}, {
  name: 1,
  order_count: 1,
  total_spent: 1
})

When denormalization makes sense:

  • Read-heavy workloads (reading far exceeds writing)
  • The denormalized field changes infrequently
  • The cost of reads far exceeds the cost of inconsistent updates
  • You have mechanisms to maintain consistency (triggers, application logic, change data capture)
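
Maintaining a denormalized aggregate in application logic looks like the sketch below: every write to the source of truth also updates the cached counts, ideally inside the same transaction. The names and data are illustrative:

```python
# Denormalized aggregates live alongside the user record.
users = {123: {"name": "Alice", "order_count": 0, "total_spent": 0}}
orders = []  # source of truth for individual orders

def place_order(user_id, amount_cents):
    """Write the order, then update the user's denormalized aggregates.
    In a real system both writes should share a transaction (or be
    reconciled via change data capture) so the copies cannot drift."""
    orders.append({"user_id": user_id, "amount_cents": amount_cents})
    user = users[user_id]
    user["order_count"] += 1
    user["total_spent"] += amount_cents

place_order(123, 2500)
place_order(123, 1500)
assert users[123]["order_count"] == 2
assert users[123]["total_spent"] == 4000
```

If the second write can fail independently of the first, the aggregate silently diverges from the truth; that failure mode is exactly the consistency responsibility denormalization transfers to you.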

Do NOT denormalize when:

  • Data changes frequently (stale copies become a problem)
  • Consistency is critical (conflicting updates)
  • Storage is expensive (redundancy wastes space)
  • Update frequency exceeds read frequency

Scaling Characteristics

Vertical Scaling (Making One Server Bigger)

Both relational and NoSQL databases can scale vertically—add more CPU, RAM, and disk to a single server. But this has limits:

  • Single-server throughput ceiling
  • Single point of failure
  • Increasingly expensive hardware
  • A hard ceiling once the largest available hardware is exhausted

Horizontal Scaling (Adding More Servers)

Relational databases struggle here because:

  • Joins across sharded data become complex and slow
  • Maintaining strong consistency across servers requires costly coordination
  • Foreign key constraints don’t work across shards
  • Many operations become distributed transactions

Sharding a relational database:

┌──────────────────────────────┐
│   Shard 1: users A-M         │
│   ├─ tables: products        │
│   ├─ tables: orders          │
│   └─ tables: reviews         │
└──────────────────────────────┘
┌──────────────────────────────┐
│   Shard 2: users N-Z         │
│   ├─ tables: products        │
│   ├─ tables: orders          │
│   └─ tables: reviews         │
└──────────────────────────────┘
-- Query across shards: requires scatter-gather
SELECT * FROM orders WHERE user_id > 500
-- Must query both shards and merge results

NoSQL databases embrace horizontal scaling:

  • Partition data by key deterministically
  • Each server owns a range of keys
  • No coordination needed for independent keys
  • Reads and writes scale linearly with cluster size

Document store sharding:

┌──────────────────────────────┐
│   Shard 1: keys A-F          │
│   ├─ products (A-F)          │
│   └─ reviews (A-F)           │
└──────────────────────────────┘
┌──────────────────────────────┐
│   Shard 2: keys G-M          │
│   ├─ products (G-M)          │
│   └─ reviews (G-M)           │
└──────────────────────────────┘
-- Query single key: routes to correct shard, no coordination
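
The deterministic key-to-shard routing above can be sketched in a few lines of Python: hash the key and take it modulo the shard count, so every client computes the same owner without any coordinator. (Production systems typically use consistent hashing instead, so adding a shard does not remap every key; the shard names here are illustrative.)

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]

def shard_for(key: str) -> str:
    """Deterministically route a key to a shard. Every node computes the
    same answer, so single-key reads and writes need no coordination."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same key always routes to the same shard:
assert shard_for("product:WH-1000-BLK") == shard_for("product:WH-1000-BLK")
# Different keys spread across shards (distributed, though not guaranteed distinct).
```

The weakness of plain modulo hashing is visible here too: changing `len(SHARDS)` remaps almost every key, which is why consistent hashing exists.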

Polyglot Persistence: Using the Right Tool

Modern systems increasingly use polyglot persistence: different databases optimized for different access patterns in the same system.

User Registration Service:
┌─ PostgreSQL: user accounts (strong consistency, complex joins)
├─ Redis: active sessions (fast, expiring)
├─ MongoDB: user preferences (flexible schema)
├─ Elasticsearch: user search (full-text and faceting)
└─ Neo4j: user social graph (relationship queries)

Example data flow:

1. User registers → PostgreSQL (authoritative)
2. Create session → Redis (fast lookup)
3. Store preferences → MongoDB (flexible)
4. Index for search → Elasticsearch (async)
5. Create friend graph → Neo4j (relationships)

Reading user profile:
  - Fetch from PostgreSQL + Redis cache
  - Render quickly with cached data
  - Async update search index and graph

Key Takeaways

  • Schema paradigm matters: Relational’s schema-on-write prevents errors but requires migrations. NoSQL’s schema-on-read offers flexibility but risks inconsistencies.

  • Consistency has a cost: Strong consistency (relational) limits horizontal scaling. Eventual consistency (NoSQL) enables scaling but complicates application logic.

  • Denormalization is a tradeoff: Embed related data in documents to avoid joins, but accept responsibility for maintaining consistency when that data changes.

  • Scalability differs fundamentally: Relational databases scale vertically and struggle with sharding. NoSQL databases partition naturally and scale horizontally trivially.

  • No universal winner exists: Relational databases excel at complex transactions and normalized data. NoSQL databases excel at flexible schemas and high-throughput writes. Use polyglot persistence to leverage both.

  • Query patterns drive architecture: Understand your read and write patterns first. Then choose the database that optimizes for those patterns.

Put It Into Practice

Scenario 1: User Analytics Dashboard

You’re building a dashboard showing user signup trends, regional distribution, and device breakdowns. You must handle 100,000 events per second with near-real-time visibility (within 5 seconds). Would you use relational or NoSQL? Why? What if you also need to run SQL-like queries across arbitrary dimensions?

Scenario 2: Financial Reconciliation System

Your fintech platform must reconcile user account balances with blockchain records every hour. The reconciliation must never miss a discrepancy, and every update must be correct or not happen at all. You have 50 million user accounts. How does this affect your database choice? Can you use eventual consistency?

Scenario 3: Product Catalog with Variants

An e-commerce site sells products with hundreds of different attributes (electronics have specs, clothing has sizes/colors, furniture has materials). New attribute types are added constantly. You need to search products efficiently and support 10,000 updates per second during sales. What’s your multi-database approach? Which system owns the canonical product data?

What’s Next

Now that we understand where data lives and its organizational structure, we need to understand how to protect it. Chapter 39 explores ACID properties and transactions—how databases guarantee that complex multi-step operations either complete safely or fail gracefully. This becomes critical when using relational databases and even more crucial when implementing application-level transaction logic across NoSQL systems.

The consistency guarantees we choose here directly shape the transaction patterns we can safely use downstream.