System Design Fundamentals

Appendix A: Popular Databases

The database landscape has exploded in recent years. Gone are the days when PostgreSQL or MySQL were your only options. Today, we choose databases based on our specific access patterns, consistency requirements, and scale expectations. This appendix provides a practical reference for the most commonly used databases in system design, organized by category.

Relational Databases

PostgreSQL

PostgreSQL is often called the “Swiss Army knife” of databases—it does nearly everything well. It’s open-source, ACID-compliant, and constantly evolving.

Key Features:

  • Mature and battle-tested (over three decades of development, built on decades of database research)
  • Rich type system (JSON, arrays, custom types, composite types)
  • Powerful extensions ecosystem (PostGIS for geo, pg_trgm for fuzzy matching, TimescaleDB)
  • Full-text search capabilities
  • JSONB for document-like storage with indexing
  • Excellent query optimizer
  • Replication and high-availability solutions (streaming replication, logical replication)

When to Use: General-purpose applications, complex queries, strong ACID guarantees needed, moderate to large scale. If you’re unsure what database to pick, PostgreSQL is often the right starting point.

Typical Use Cases: SaaS platforms, financial systems, e-commerce backends, content management systems.

Considerations: Single-node writes can become a bottleneck at extreme scale. Horizontal sharding is manual and complex. Slower than specialized databases for specific workloads.
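The ACID guarantee is easiest to see in a transaction that must succeed or fail as a unit. Below is a minimal, hypothetical sketch; Python's built-in SQLite stands in for PostgreSQL since both offer the same all-or-nothing commit semantics, and the table and account names are invented.

```python
import sqlite3

# SQLite stands in for PostgreSQL here; the transactional behavior
# (commit everything or roll back everything) is the same idea.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move funds atomically; any failure rolls the whole transfer back."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False

transfer(conn, "alice", "bob", 30)   # succeeds: alice 70, bob 80
transfer(conn, "alice", "bob", 500)  # fails; the partial debit is rolled back
```

Even the failed transfer leaves the data consistent, which is the property financial and e-commerce systems lean on.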

MySQL

MySQL remains the most widely deployed database in the world, particularly for web applications. InnoDB is the default storage engine and provides ACID compliance.

Key Features:

  • Easy to set up and operate
  • Replication (asynchronous by default, with semi-synchronous and group replication options; source-replica and multi-source topologies)
  • InnoDB storage engine with row-level locking
  • Reasonable performance for most web workloads
  • Mature tooling and broad community knowledge

When to Use: Web applications, content platforms, anything where you need reliable replication but don’t need PostgreSQL’s advanced features.

Typical Use Cases: WordPress, Drupal, custom web applications, payment processing.

Considerations: Historically slower than PostgreSQL for complex queries. InnoDB can struggle with very large tables. Sharding requires application-level coordination.
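The application-level sharding mentioned above usually starts with a routing function: hash a stable key and map it to one of N shards. A minimal sketch, with placeholder shard names:

```python
import hashlib

# Hypothetical shard list; in practice these would be connection
# strings or pool handles for each MySQL instance.
SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]

def shard_for(user_id: str) -> str:
    """Route a user to a shard deterministically.

    Uses a stable hash (not Python's process-randomized hash()) so the
    same user lands on the same shard across processes and restarts.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]
```

The catch is resharding: changing the shard count remaps most keys, which is why larger deployments move to consistent hashing or a directory (lookup-table) scheme.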

Amazon Aurora

Aurora is AWS’s cloud-native relational database. It’s compatible with MySQL and PostgreSQL but reimplemented from the ground up.

Key Features:

  • MySQL/PostgreSQL compatibility (you can migrate easily)
  • Up to 5x the throughput of MySQL and up to 3x that of PostgreSQL (per AWS’s own benchmarks)
  • Shared storage layer (separates compute from storage)
  • Automatic scaling—storage grows automatically up to 128 TB
  • Read replicas are nearly free (share storage layer)
  • Multi-AZ failover in under 30 seconds
  • Serverless option (Aurora Serverless v2) auto-scales compute

When to Use: AWS-native architectures, applications needing high availability with low operational overhead, variable or unpredictable workloads (with Serverless).

Typical Use Cases: Cloud-native web applications, SaaS products on AWS, high-performance OLTP systems.

Considerations: AWS lock-in. More expensive than self-managed PostgreSQL/MySQL. Backups are automatic but tied to AWS (not portable).

Document Databases

MongoDB

MongoDB is the most popular document database. Data is stored as JSON-like documents within collections, offering flexible schema.

Key Features:

  • Flexible, schema-less design (each document can differ)
  • Horizontal scaling via sharding (data distributed by shard key)
  • Aggregation pipeline for complex data transformations
  • Multi-document ACID transactions (as of version 4.0)
  • Replica sets for high availability
  • Rich indexing options
  • Full-text search

When to Use: When your data structure varies or evolves frequently, when you want quick prototyping without schema design, rapid development cycles.

Typical Use Cases: Mobile app backends, IoT data collection, user profiles with variable attributes, content platforms.

Considerations: Write operations are slower than some alternatives. Transactions incur overhead. Schema flexibility becomes a liability without discipline: documents drift apart and data grows inconsistent over time. And over-normalizing documents into many small collections defeats the purpose of the document model.

Pro Tip: Be intentional about your data model even in MongoDB. Flexibility is powerful but can hide data quality issues.
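One way to stay intentional is to validate a small set of required fields at the application boundary while letting everything else vary per document. A hypothetical sketch with invented field names; MongoDB itself can enforce similar rules server-side with `$jsonSchema` collection validators:

```python
# Required fields and their types; everything beyond these may vary
# freely per document -- that's the flexibility we keep.
REQUIRED = {"user_id": str, "email": str}

def validate(doc: dict) -> dict:
    """Reject documents missing a required field; allow anything extra."""
    for field, ftype in REQUIRED.items():
        if not isinstance(doc.get(field), ftype):
            raise ValueError(f"{field} must be a {ftype.__name__}")
    return doc

profiles = []  # stands in for a MongoDB collection

# An extra, document-specific field is fine.
profiles.append(
    validate({"user_id": "u1", "email": "a@example.com", "theme": "dark"})
)

# A document missing a required field is rejected before insert.
rejected = False
try:
    validate({"user_id": "u2"})
except ValueError:
    rejected = True
```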

Amazon DynamoDB

DynamoDB is AWS’s fully managed NoSQL service. You don’t manage servers or capacity—AWS handles it all.

Key Features:

  • Fully managed (serverless)
  • Single-digit millisecond latency at any scale
  • Auto-scaling based on read/write capacity
  • Pay-per-request or provisioned capacity options
  • Built-in replication across Availability Zones
  • Global Secondary Indexes (GSIs) for alternate query patterns
  • TTL for automatic data expiration
  • Streams for change data capture

When to Use: AWS-native architectures, applications with unpredictable traffic, when you want zero operational overhead for the database.

Typical Use Cases: Real-time analytics dashboards, user sessions, leaderboards, notifications, IoT sensor data.

Considerations: Limited query flexibility (you must model your access patterns around partition keys and sort keys). No joins, and transactions are capped in size (on the order of 100 items per transaction). Pricing can surprise you if you’re not careful with capacity planning.
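The partition-key/sort-key model is easier to see in code. The sketch below is a toy in-memory stand-in for a DynamoDB table, using a hypothetical single-table key design (the key formats are invented):

```python
from collections import defaultdict

# Toy model of DynamoDB's key scheme: items live under a partition key
# (pk) and are ordered within the partition by a sort key (sk).
table = defaultdict(dict)  # pk -> {sk: item}

def put_item(pk: str, sk: str, item: dict):
    table[pk][sk] = {"pk": pk, "sk": sk, **item}

def query(pk: str, sk_prefix: str = ""):
    """All items in one partition whose sort key starts with a prefix,
    returned in sort-key order -- the shape of query DynamoDB is fast at."""
    return [table[pk][sk] for sk in sorted(table[pk]) if sk.startswith(sk_prefix)]

# Hypothetical single-table design: one user's profile and orders share
# a partition, distinguished by sort-key prefixes.
put_item("USER#42", "PROFILE", {"name": "Ada"})
put_item("USER#42", "ORDER#2024-01-15", {"total": 30})
put_item("USER#42", "ORDER#2024-03-02", {"total": 99})

orders = query("USER#42", sk_prefix="ORDER#")  # both orders, oldest first
```

Designing those key formats up front is the real work of using DynamoDB well; queries that don't fit the key design require scans or a Global Secondary Index.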

Wide-Column Stores

Apache Cassandra

Cassandra is a distributed, peer-to-peer database designed for massive scale. Every node is equal—no single point of failure.

Key Features:

  • Masterless architecture (every node can accept writes)
  • High write throughput (designed for writes, not reads)
  • Tunable consistency (you choose consistency level per query)
  • Automatic replication across data centers
  • No single point of failure
  • CQL (Cassandra Query Language) similar to SQL
  • Good for time-series and event data

When to Use: When you need to handle millions of writes per second, when you must have multi-datacenter replication, when you accept eventual consistency.

Typical Use Cases: Time-series metrics (monitoring, IoT), event logging, analytics pipelines, activity tracking.

Considerations: Operational complexity is high. Eventual consistency requires application handling of duplicates and out-of-order events. Reads are slower than writes. Need expertise to tune and operate well.
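Tunable consistency comes down to quorum arithmetic: with N replicas, a write acknowledged by W nodes and a read answered by R nodes are guaranteed to overlap (and therefore the read sees the latest acknowledged write) only when R + W > N. A sketch of that rule:

```python
def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    """True when read and write quorums must overlap on at least one
    replica, so a read always touches a node with the latest write."""
    return r + w > n

# N=3 replicas, QUORUM (2) for both reads and writes: overlap -> strong.
assert is_strongly_consistent(3, 2, 2)

# N=3, consistency level ONE for both: no guaranteed overlap -> a read
# may hit a replica that hasn't seen the write yet (eventual consistency).
assert not is_strongly_consistent(3, 1, 1)
```

Cassandra exposes this per query via consistency levels (ONE, QUORUM, ALL, and datacenter-aware variants), so you can pay for strong consistency only where it matters.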

Apache HBase

HBase sits on top of Hadoop’s HDFS and is designed for random read/write access to very large tables.

Key Features:

  • Distributed column-oriented store
  • Strong consistency
  • Excellent for random read/write patterns
  • Integrates with Hadoop ecosystem
  • Good compression ratios for sparse data
  • Replication via HDFS

When to Use: When you’re already in the Hadoop ecosystem, when you need strong consistency with millions of rows, when you want column-oriented compression.

Typical Use Cases: Time-series data in Hadoop, large-scale analytics, messaging platforms.

Considerations: Operational overhead is substantial. Slower than Cassandra for pure write throughput. Usually requires expertise in Hadoop.

Key-Value Stores

Redis

Redis is an in-memory data structure store. It’s fast (sub-millisecond latency), but durability is limited unless persistence is explicitly configured.

Key Features:

  • Ultra-fast (in-memory, single-threaded event loop)
  • Rich data structures (strings, lists, sets, sorted sets, streams, geospatial)
  • Pub/sub messaging
  • Lua scripting for atomic operations
  • Persistence options (RDB snapshots, AOF journaling)
  • Clustering for distribution
  • Replication for high availability
  • Key expiration/TTL built-in

When to Use: Caching, sessions, real-time leaderboards, rate limiting, message queues, feature flags.

Typical Use Cases: Session storage, database caching, leaderboards in gaming, real-time notifications, shopping carts, pub/sub messaging.

Considerations: In-memory means RAM is the bottleneck. Data loss risk if persistence isn’t configured. Not suitable for datasets that don’t fit in memory. Command execution is single-threaded (Redis 6.0+ adds I/O threads, but commands still run serially).

Pro Tip: Use Redis for “hot” data only. It’s perfect as a cache-aside layer in front of a persistent database.
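The cache-aside pattern with TTLs can be sketched without Redis at all. Here a dict with lazy expiration stands in for Redis, and `db_fetch` is a hypothetical database loader:

```python
import time

class TTLCache:
    """Minimal stand-in for a Redis cache with key expiration."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock       # injectable clock makes testing easy
        self.store = {}          # key -> (value, expires_at)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self.store[key]  # lazy expiration, similar to Redis TTLs
            return None
        return value

    def set(self, key, value):
        self.store[key] = (value, self.clock() + self.ttl)

def get_user(cache, db_fetch, user_id):
    """Cache-aside: try the cache, fall back to the database, populate."""
    user = cache.get(user_id)
    if user is None:             # cache miss -> hit the database once
        user = db_fetch(user_id)
        cache.set(user_id, user)
    return user
```

With Redis itself, `SET key value EX ttl` plus `GET` gives the same behavior; the structure of the calling code is identical.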

Memcached

Memcached is simpler than Redis—a distributed memory cache, nothing more.

Key Features:

  • Simple key-value caching
  • Multi-threaded
  • Distributed across servers
  • No persistence
  • Lower CPU overhead than Redis
  • Widely adopted, with mature client libraries across languages and platforms

When to Use: Simple caching, when you don’t need data structures, when you want minimal operational overhead.

Typical Use Cases: Database result caching, rendered page caching, expensive computation caching.

Considerations: No persistence means total data loss on restart. No pub/sub or complex operations. Redis has replaced Memcached for most new projects.

Time-Series Databases

InfluxDB

InfluxDB is purpose-built for metrics and events. Time is a first-class citizen.

Key Features:

  • Optimized for timestamped data ingestion
  • InfluxQL and Flux query languages
  • Downsampling to keep old data compact
  • Retention policies (auto-delete old data)
  • High write throughput
  • Built-in clustering (InfluxDB Enterprise)
  • Schema-less (tags and fields)

When to Use: Metrics collection, monitoring, application telemetry, sensor data.

Typical Use Cases: Infrastructure monitoring, application performance monitoring (APM), IoT sensor data, stock prices.

Considerations: The open-source version is single-node. Clustering requires Enterprise edition. Query performance on large time ranges can be slow.
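Downsampling is the core trick: collapse raw points into fixed-width time buckets and keep only an aggregate per bucket. A minimal sketch of the idea (InfluxDB performs this server-side via retention policies and continuous queries; timestamps here are plain seconds):

```python
from collections import defaultdict

def downsample(points, bucket_seconds):
    """Average raw (timestamp, value) points into fixed-width buckets.

    Returns {bucket_start: mean_value}, the shape a dashboard would
    query instead of the raw stream.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

raw = [(0, 10.0), (30, 20.0), (60, 40.0), (90, 20.0)]
per_minute = downsample(raw, bucket_seconds=60)  # {0: 15.0, 60: 30.0}
```

Keeping raw data for days and downsampled rollups for months is what makes time-series storage affordable at high ingest rates.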

TimescaleDB

TimescaleDB is a PostgreSQL extension for time-series data. You get all of PostgreSQL plus time-series optimizations.

Key Features:

  • PostgreSQL extension (hypertables behave like regular PostgreSQL tables)
  • Full SQL support for time-series queries
  • Automatic partitioning by time
  • Compression for historical data
  • Continuous aggregates for rolling calculations
  • Full relational capabilities

When to Use: When you’re already using PostgreSQL, when you need SQL for time-series analysis, when you want the flexibility of a relational database.

Typical Use Cases: Infrastructure monitoring in PostgreSQL environments, financial time-series, production metrics alongside relational data.

Considerations: Single-node (horizontal scaling is paid/cloud). Less specialized than InfluxDB. Requires PostgreSQL expertise.

Graph Databases

Neo4j

Neo4j stores data as relationships between nodes. The query language Cypher is intuitive for relationship-heavy data.

Key Features:

  • Relationship-first data model
  • Cypher query language (intuitive for graph traversals)
  • ACID transactions
  • Property graph model (nodes and edges have attributes)
  • Pattern matching in queries
  • High performance for relationship queries
  • Community and Enterprise editions

When to Use: Social networks, recommendation engines, knowledge graphs, identity and access management, master data management.

Typical Use Cases: LinkedIn-like networks, “friends of friends” queries, product recommendations, fraud detection (relationship patterns).

Considerations: Operational complexity is moderate. Not ideal for non-relationship data. Licensing can be expensive for Enterprise. Learning Cypher has a curve.

Pro Tip: Graph databases shine when relationships are as important as the data itself. If your queries rarely traverse relationships, a relational database is simpler.
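A “friends of friends” query is just a two-hop traversal. The sketch below uses a plain adjacency dict standing in for a graph database (names are invented); in Neo4j the same question is a two-hop MATCH pattern in Cypher.

```python
# Adjacency sets stand in for a property graph's FRIEND edges.
graph = {
    "ann": {"bob", "cal"},
    "bob": {"ann", "dee"},
    "cal": {"ann", "dee"},
    "dee": {"bob", "cal", "eve"},
    "eve": {"dee"},
}

def friends_of_friends(graph, person):
    """People exactly two hops away: reachable through a friend, but
    excluding the person and their direct friends."""
    direct = graph[person]
    two_hops = set()
    for friend in direct:
        two_hops |= graph[friend]
    return two_hops - direct - {person}

friends_of_friends(graph, "ann")  # {"dee"}
```

The point of a graph database is that deeper traversals (3+ hops, variable-length paths) stay fast and expressible, where the equivalent relational query becomes a pile of self-joins.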

Search Engines

Elasticsearch

Elasticsearch is a distributed search and analytics engine built on Lucene. You can index, search, and analyze massive volumes of data.

Key Features:

  • Full-text search
  • Real-time analytics
  • Distributed and scalable
  • REST API
  • Schema flexibility (mappings)
  • Aggregations for bucketing and statistics
  • Part of the ELK Stack (Elasticsearch, Logstash, Kibana)
  • Machine learning features (anomaly detection, forecasting)

When to Use: Log analysis, full-text search, application search functionality, real-time analytics dashboards.

Typical Use Cases: ELK logging stacks, product search in e-commerce, document search, metrics analysis.

Considerations: Not a primary data store (no ACID transactions). Operational complexity is high (cluster management, shard balancing). Expensive at scale. Licensing changed in 2021 (now source-available with restrictions), which prompted the OpenSearch fork.
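Full-text search rests on an inverted index: a map from each term to the documents containing it. A toy sketch of the idea (real engines like Elasticsearch add text analysis, relevance scoring such as BM25, and distribution on top):

```python
from collections import defaultdict

# A few hypothetical documents to index.
docs = {
    1: "distributed search and analytics engine",
    2: "log analysis with a search engine",
    3: "relational databases and transactions",
}

# Build the inverted index: term -> set of doc ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    """Doc ids containing every query term (AND semantics)."""
    results = None
    for term in query.lower().split():
        postings = index.get(term, set())
        results = postings if results is None else results & postings
    return results or set()

search("search engine")  # {1, 2}
```

Because lookups touch only the postings for the query terms, search cost scales with result size rather than corpus size, which is why dedicated search engines beat `LIKE '%term%'` scans.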

NewSQL Databases

CockroachDB

CockroachDB is a distributed SQL database with strong consistency. It speaks PostgreSQL’s wire protocol but scales horizontally.

Key Features:

  • Distributed and highly available (no single point of failure)
  • Strong consistency with SERIALIZABLE isolation
  • PostgreSQL compatibility (same wire protocol)
  • Automatic sharding
  • Multi-region replication
  • SQL interface (standard relational model)
  • ACID transactions across any number of shards

When to Use: When you need horizontal scalability of NoSQL but want the simplicity and consistency of SQL. Financial systems that need strong consistency.

Typical Use Cases: Geo-distributed financial systems, global product catalogs, systems requiring strict ACID across shards.

Considerations: Still smaller community than PostgreSQL. Less operational knowledge available. Transactions have higher latency than single-node databases.

Google Spanner

Spanner is Google’s globally distributed database with strong consistency. It uses atomic clocks (TrueTime) to ensure external consistency.

Key Features:

  • Globally distributed (strong consistency across datacenters)
  • SQL interface
  • ACID transactions
  • Automatic sharding and replication
  • Synchronous replication (high availability)
  • Extremely high availability (99.999%)

When to Use: When you need global distribution with strong consistency, when you want a fully managed database, when you’re in the GCP ecosystem.

Typical Use Cases: Global financial platforms, globally distributed applications requiring strong consistency.

Considerations: Google Cloud only. High operational cost. Overkill for applications that don’t truly need global distribution with consistency.

Database Comparison Matrix

| Database | Type | Consistency | Scaling | Best For | Hosted Option |
| --- | --- | --- | --- | --- | --- |
| PostgreSQL | Relational | ACID | Vertical (manual sharding) | General-purpose, complex queries | AWS RDS, Azure, GCP Cloud SQL, Heroku |
| MySQL | Relational | ACID | Vertical (manual sharding) | Web applications, legacy systems | AWS RDS, Azure, GCP Cloud SQL, PlanetScale |
| Aurora | Relational | ACID | Horizontal (storage auto-scales) | AWS-native, high availability | AWS Aurora (managed) |
| MongoDB | Document | Strong on primary; multi-document ACID since 4.0 | Horizontal (sharding) | Flexible schema, rapid iteration | MongoDB Atlas (managed) |
| DynamoDB | Document/Key-value | Eventual (strongly consistent reads optional) | Horizontal (automatic) | AWS serverless, unpredictable traffic | AWS DynamoDB (fully managed) |
| Cassandra | Wide-column | Tunable | Horizontal (peer-to-peer) | Massive write throughput, time-series | DataStax Astra (managed) |
| HBase | Wide-column | Strong | Horizontal (via HDFS) | Hadoop integration, analytics | Cloudera (managed) |
| Redis | Key-value | Eventual (with replication) | Horizontal (clustering) | Caching, real-time, sessions | AWS ElastiCache, Redis Cloud |
| Memcached | Key-value | None (cache) | Horizontal | Simple caching | AWS ElastiCache, cloud providers |
| InfluxDB | Time-series | Eventual | Horizontal (Enterprise) | Metrics, monitoring, telemetry | InfluxDB Cloud (managed) |
| TimescaleDB | Time-series | ACID | Vertical (horizontal is paid) | Time-series in PostgreSQL | Timescale Cloud (managed) |
| Neo4j | Graph | ACID | Vertical (horizontal in Enterprise) | Social networks, recommendations | Neo4j Aura (managed) |
| Elasticsearch | Search | Eventual | Horizontal | Full-text search, log analysis | Elastic Cloud (managed) |
| CockroachDB | Relational | ACID | Horizontal | SQL at scale, strong consistency | CockroachDB Cloud (managed) |
| Spanner | Relational | ACID | Horizontal (global) | Global distribution, strong consistency | Google Cloud Spanner (managed) |

Choosing the Right Database

Start with the access pattern. Ask yourself:

  • How will I query the data most frequently?
  • What’s my consistency requirement?
  • How much data will I store?
  • What’s my throughput requirement?

Consider the team. Choose databases your team knows or can learn. Operational expertise matters enormously.

Prefer boring, proven technology. PostgreSQL and MySQL are safe defaults. They solve most problems adequately. Choose a specialized database only when your workload truly demands it.

Plan for scale, but don’t over-engineer. Most systems grow into their database choice over months or years. Start simple, monitor, and evolve.

Evaluate managed services. Cloud-managed databases (RDS, DynamoDB, Atlas) trade control for operational simplicity. The operational savings often justify the cost.

Key Takeaways

  • Relational databases (PostgreSQL, MySQL, Aurora) remain the default choice for most applications. They’re mature, well-understood, and handle complex queries well.
  • Document databases (MongoDB, DynamoDB) suit applications with flexible or evolving schemas and rapid development cycles.
  • Wide-column stores (Cassandra, HBase) are necessary only at extreme scale or with specific patterns (high write throughput, multi-region).
  • Key-value stores (Redis) are essential for caching, sessions, and real-time data—not primary stores.
  • Time-series databases (InfluxDB, TimescaleDB) are purpose-built for metrics and events. Use them if that’s your primary use case.
  • Graph databases (Neo4j) shine when relationships are as important as data itself.
  • Search engines (Elasticsearch) are for full-text search and analytics, not ACID transactions.
  • NewSQL (CockroachDB, Spanner) solve the distributed SQL problem but add operational complexity.

Choose the simplest database that satisfies your requirements. You can always migrate later, but doing so is expensive.