System Design Fundamentals

Vertical Partitioning

A

Vertical Partitioning

Vertical partitioning, also called column-based splitting, involves dividing a database table by columns rather than rows. Instead of splitting data horizontally across multiple servers based on a partition key, vertical partitioning separates different groups of columns into different tables or databases. This technique is particularly valuable when different columns have vastly different access patterns, sizes, or availability requirements.

Understanding Vertical Partitioning

In a traditional monolithic database, all columns of a table reside together. A user profile table might contain frequently accessed fields (username, email) alongside rarely accessed fields (bio, profile_image_blob), or fields with different performance characteristics.

Vertical partitioning addresses a fundamental mismatch: why keep all columns together when they have different usage patterns?

Consider a simple users table:

CREATE TABLE users (
  user_id BIGINT,
  username VARCHAR(255),
  email VARCHAR(255),
  password_hash VARCHAR(255),
  created_at TIMESTAMP,
  last_login TIMESTAMP,
  bio TEXT,
  profile_image BLOB,
  preferences JSON,
  account_balance DECIMAL,
  payment_method VARCHAR(100)
);

Different columns have dramatically different characteristics:

  • Hot data: user_id, username, email are accessed on nearly every request
  • Warm data: password_hash, last_login, account_balance are accessed during authentication and transactions
  • Cold data: bio, profile_image are accessed infrequently, only when viewing a detailed profile
  • Different sizes: profile_image could be megabytes while username is hundreds of bytes

Vertical partitioning splits this into multiple tables optimized for their access patterns.

Partitioning Strategies

Column-Based Splitting

The simplest vertical partitioning strategy divides columns into logical groups:

-- Hot data: users_core
CREATE TABLE users_core (
  user_id BIGINT PRIMARY KEY,
  username VARCHAR(255),
  email VARCHAR(255),
  created_at TIMESTAMP,
  last_login TIMESTAMP
);

-- Warm data: users_auth
CREATE TABLE users_auth (
  user_id BIGINT PRIMARY KEY REFERENCES users_core(user_id),
  password_hash VARCHAR(255),
  account_balance DECIMAL,
  payment_method VARCHAR(100)
);

-- Cold data: users_profile
CREATE TABLE users_profile (
  user_id BIGINT PRIMARY KEY REFERENCES users_core(user_id),
  bio TEXT,
  profile_image BLOB,
  preferences JSON
);

With this structure, queries fetching user details for a homepage can query only users_core, avoiding I/O from loading megabyte-sized profile images.

Hot and Cold Data Separation

A common pattern distinguishes between frequently and infrequently accessed data:

graph TB
    subgraph Original["Original Table: users (all columns)"]
        UA[user_id, username, email, created_at]
        UB[password_hash, last_login, account_balance]
        UC[bio, profile_image, preferences]
    end

    subgraph Partitioned["After Vertical Partitioning"]
        direction TB
        H["πŸ”₯ Hot Table: users_hot<br/>(Frequently accessed)"]
        W["⚠️ Warm Table: users_warm<br/>(Occasionally accessed)"]
        C["❄️ Cold Table: users_cold<br/>(Rarely accessed)"]
    end

    Original -->|Split by access pattern| Partitioned
    H -->|SSD/Memory| Hot["Fast tier storage"]
    W -->|SSD| Warm["Standard storage"]
    C -->|Blob storage/Archive| Cold["Cost-effective storage"]

Hot data includes columns accessed on most requests:

  • User IDs, usernames, email addresses
  • Frequently checked status fields
  • Recent activity timestamps

Warm data includes columns accessed occasionally:

  • Encrypted passwords (accessed only during authentication)
  • Financial balances (accessed during transactions)
  • Audit logs (accessed for compliance)

Cold data includes columns rarely accessed:

  • Long-form text (bios, descriptions)
  • Large binary files (profile pictures, documents)
  • Historical archives

Separating these allows different storage and caching strategies. Hot data can be cached aggressively, warm data kept in standard databases, and cold data archived to cheaper object storage.

Partitioning by Access Patterns

Instead of temperature-based separation, partition by functional access patterns:

Authentication system accesses:
- user_id, email, password_hash

Profile viewing accesses:
- user_id, username, profile_image, bio, preferences

Payment system accesses:
- user_id, account_balance, payment_method

Analytics system accesses:
- user_id, created_at, last_login, preferences

Each system queries different subsets. Separating these allows:

  1. Independent scaling: The authentication table can be replicated across more servers than the profile table
  2. Specialized storage: Payment data might use encrypted storage, profile images use CDN-backed blob storage
  3. Different consistency models: Authentication data requires strong consistency; profile updates can be eventually consistent
  4. Targeted indexing: Build indexes optimized for each partition’s queries

Normalization vs. Denormalization in Vertical Partitioning

Vertical partitioning intersects with normalization. When columns are split across tables, relationships emerge.

Normalized Approach

Partition with foreign key relationships:

-- Base entity
CREATE TABLE users (
  user_id BIGINT PRIMARY KEY,
  username VARCHAR(255),
  email VARCHAR(255)
);

-- Vertically partitioned: settings
CREATE TABLE user_settings (
  user_id BIGINT PRIMARY KEY REFERENCES users(user_id),
  theme VARCHAR(50),
  notifications_enabled BOOLEAN,
  language VARCHAR(10)
);

-- Query: Fetch user with settings
SELECT u.*, us.theme, us.language
FROM users u
LEFT JOIN user_settings us ON u.user_id = us.user_id
WHERE u.user_id = 123;

Advantages:

  • No data duplication
  • Consistency enforced by foreign keys
  • Clean schema design

Disadvantages:

  • Requires joins across partitions (performance cost)
  • Latency increases if partitions are on different servers
  • Network round-trips for related data

Denormalized Approach

Replicate key identifiers to avoid joins:

CREATE TABLE users_core (
  user_id BIGINT PRIMARY KEY,
  username VARCHAR(255),
  email VARCHAR(255)
);

CREATE TABLE users_settings (
  user_id BIGINT PRIMARY KEY,
  username VARCHAR(255),  -- denormalized copy
  email VARCHAR(255),     -- denormalized copy
  theme VARCHAR(50),
  notifications_enabled BOOLEAN,
  language VARCHAR(10)
);

-- Query: Fetch settings without join
SELECT theme, language FROM user_settings
WHERE user_id = 123;

Advantages:

  • No joins required, queries are faster
  • Partition is self-contained
  • Better for distributed systems

Disadvantages:

  • Data duplication increases storage
  • Consistency challenges (username changes must update both tables)
  • Requires application logic or triggers for synchronization

In distributed systems, denormalization often wins. The cost of joins across network boundaries exceeds the benefit of normalized storage.

Combining Vertical and Horizontal Partitioning

Vertical and horizontal partitioning are complementary. Combine them for maximum optimization:

User Table (Vertical Split):
β”œβ”€β”€ users_core (user_id, username, email)
β”‚   └── Horizontal: Partition by user_id range
β”‚       β”œβ”€β”€ Shard 1: user_id 1-1M
β”‚       β”œβ”€β”€ Shard 2: user_id 1M-2M
β”‚       └── Shard 3: user_id 2M-3M
β”‚
β”œβ”€β”€ users_auth (user_id, password_hash, account_balance)
β”‚   └── Horizontal: Partition by user_id range
β”‚       β”œβ”€β”€ Shard 1: user_id 1-1M
β”‚       β”œβ”€β”€ Shard 2: user_id 1M-2M
β”‚       └── Shard 3: user_id 2M-3M
β”‚
└── users_profile (user_id, bio, profile_image)
    └── Horizontal: Partition by user_id range
        β”œβ”€β”€ Shard 1: user_id 1-1M
        β”œβ”€β”€ Shard 2: user_id 1M-2M
        └── Shard 3: user_id 2M-3M

Each partition (vertical split) is then sharded horizontally, creating a two-dimensional distribution:

  1. Vertical: Group related columns
  2. Horizontal: Distribute each group across servers

This maximizes benefits:

users_core (hot data)
β”œβ”€β”€ Small, frequently accessed
β”œβ”€β”€ Shard 1: users 1-1M (Memory-cached, SSD)
β”œβ”€β”€ Shard 2: users 1M-2M (Memory-cached, SSD)
└── Shard 3: users 2M-3M (Memory-cached, SSD)

users_profile (cold data)
β”œβ”€β”€ Large, rarely accessed
β”œβ”€β”€ Shard 1: users 1-1M (Object storage)
β”œβ”€β”€ Shard 2: users 1M-2M (Object storage)
└── Shard 3: users 2M-3M (Object storage)

The hot data shards can be small and fast; cold data shards can use cheaper storage.

Practical Implementation Patterns

API Layer Routing

The application layer must know which partition to query:

class UserService:
    def get_user_core(self, user_id):
        """Fetch hot data only"""
        return db_hot.query(
            "SELECT * FROM users_core WHERE user_id = %s",
            (user_id,)
        )

    def get_user_profile(self, user_id):
        """Fetch cold data when needed"""
        return db_cold.query(
            "SELECT * FROM users_profile WHERE user_id = %s",
            (user_id,)
        )

    def get_user_full(self, user_id):
        """Fetch all data (multiple queries)"""
        core = self.get_user_core(user_id)
        profile = self.get_user_profile(user_id)

        # Merge in application
        return {**core, **profile}

Caching Strategy

Different partitions benefit from different caching strategies:

# Hot data: Cache aggressively
cache.set(
    f"user:core:{user_id}",
    user_core_data,
    ttl=3600  # 1 hour
)

# Warm data: Cache moderately
cache.set(
    f"user:auth:{user_id}",
    user_auth_data,
    ttl=300  # 5 minutes
)

# Cold data: Cache rarely or not at all
# Profile images: Cache via CDN, not in-process cache

Trade-Offs and Considerations

Advantages

  • Reduced I/O: Queries accessing only hot data don’t read cold columns
  • Faster caching: Smaller partitions fit in memory more easily
  • Flexible storage tiers: Place hot on SSD, cold on cheaper blob storage
  • Independent optimization: Optimize each partition for its access pattern
  • Query performance: Smaller tables mean faster scans and indexes

Disadvantages

  • Increased complexity: Application must route to correct partitions
  • Join overhead: Reconstructing full records requires multiple queries
  • Data consistency: Maintaining consistency across partitions is harder
  • Schema changes: Adding columns requires updating partition logic
  • Operational complexity: Managing multiple tables instead of one

When to Use Vertical Partitioning

Vertical partitioning is most valuable when:

  1. Access patterns diverge significantly: Some columns accessed constantly, others rarely
  2. Column sizes vary dramatically: Some columns are large (images, blobs), others tiny
  3. Different consistency requirements: Some columns need strong consistency, others can be eventually consistent
  4. Storage cost matters: Expensive to keep massive columns in hot storage
  5. Scalability bottlenecked by I/O: Reducing data volume per query improves performance

When columns are uniformly accessed and similar size, horizontal partitioning alone is usually sufficient.

Real-World Examples

E-Commerce Product Catalog

products_meta (hot)
β”œβ”€β”€ product_id, name, price, stock_level
└── Accessed on every product page load

products_details (warm)
β”œβ”€β”€ product_id, description, specifications
└── Accessed when user views full product page

products_media (cold)
β”œβ”€β”€ product_id, images, videos, 3d_models
└── Accessed when user wants to see gallery

products_reviews (warm)
β”œβ”€β”€ product_id, review_count, average_rating
└── Accessed on product page (but count only)
└── Detailed reviews in separate partition

Social Media User Profile

profile_core (hot)
β”œβ”€β”€ user_id, username, follower_count
└── Accessed on feed, mentions, search

profile_contact (warm)
β”œβ”€β”€ user_id, email, phone, address
└── Accessed during settings, rarely

profile_content (warm)
β”œβ”€β”€ user_id, bio, profile_picture_url
└── Accessed when viewing profile

profile_media (cold)
β”œβ”€β”€ user_id, photos, videos, archives
└── Accessed when user browses own content

Key Takeaways

  1. Vertical partitioning splits tables by columns, separating data with different access patterns
  2. Hot/cold separation allows cost-effective storage and caching strategies
  3. Combine vertical and horizontal partitioning for maximum scalability
  4. Application logic must route requests to the correct partition
  5. Denormalization often wins in distributed systems to avoid expensive joins
  6. Use vertical partitioning when access patterns, sizes, or consistency requirements diverge significantly

Connection to Next Section

Successful vertical partitioning requires choosing which columns to group togetherβ€”a decision that mirrors the challenge of horizontal partitioning: choosing the right partition key. In the next section, we explore how to select partition keys that balance data distribution, access patterns, and consistency requirements.