System Design Fundamentals

Estimating Scale and Constraints

Why Back-of-the-Envelope Calculations Matter

After gathering requirements, you need to put numbers on the problem. How many requests per second? How much storage will you need in a year? How much bandwidth? These aren’t asking for precision to the decimal point — they’re asking: what’s the order of magnitude?

Here’s why this matters: A system designed for 100 requests per second looks completely different from one designed for 100,000 requests per second. One might run on a single server with a basic database. The other needs load balancing, database replication, caching layers, and distributed queues. The difference is a factor of 1,000, and it changes everything.

The good news: you don’t need to be exact. Being within an order of magnitude is good enough. If the true answer is 50,000 QPS and you estimate 100,000, you’re still designing a distributed system (which is correct). If you estimate 5,000, you’ve made a 10x error that would lead you astray.

This is a skill that separates junior engineers from senior ones. Junior engineers design systems without thinking about scale. Senior engineers always ask: “At what scale does this break?”

Core Numbers to Memorize

You need to internalize a few fundamental numbers. These are your mental shortcuts for estimation.

Powers of 2

Storage grows in powers of 2. Memorize these:

| Unit | Value |
|---|---|
| 1 KB (kilobyte) | 2^10 bytes ≈ 1,000 bytes |
| 1 MB (megabyte) | 2^20 bytes ≈ 1 million bytes |
| 1 GB (gigabyte) | 2^30 bytes ≈ 1 billion bytes |
| 1 TB (terabyte) | 2^40 bytes ≈ 1 trillion bytes |
| 1 PB (petabyte) | 2^50 bytes ≈ 1 quadrillion bytes |

In practice: A single tweet is about 500 bytes. A high-res photo is 2-5 MB. A one-hour HD video is 1-2 GB. These intuitions help you estimate quickly.
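
To turn those intuitions into quick checks, here is a minimal Python sketch. The object sizes are the rough figures quoted above, not measurements:

```python
# Decimal approximations are close enough for estimation (2^10 ≈ 10^3).
MB, GB, TB = 10**6, 10**9, 10**12

# Rough object sizes quoted above (assumptions, not measurements).
object_sizes = {
    "tweets (text + metadata)": 500,       # bytes each
    "high-res photos": 3 * MB,             # 2-5 MB, call it 3 MB
    "one-hour HD videos": 1.5 * GB,        # 1-2 GB, call it 1.5 GB
}

for name, size_bytes in object_sizes.items():
    # How many of each object fit in a terabyte of storage.
    print(f"~{TB / size_bytes:,.0f} {name} per TB")
```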

Common Latencies

Different operations have vastly different speeds. Memorize the order of magnitude:

| Operation | Latency | Reality Check |
|---|---|---|
| L1 cache reference | 0.5 nanoseconds | Faster than a nanosecond |
| RAM access | 100 nanoseconds | Memory is fast |
| SSD read | 150 microseconds | Disk is much slower than RAM |
| HDD read | 10 milliseconds | Spinning disk is slow |
| Network round trip (cross-continent) | 150 milliseconds | Sending data across the internet has inherent delay |

The key insight: RAM is roughly 1,000x faster than an SSD read, about 100,000x faster than a spinning disk, and a cross-continent network round trip is another order of magnitude slower still. These gaps drive architecture decisions: keep hot data in memory, cache aggressively, and avoid unnecessary disk reads and network hops.
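
A few lines of Python make those ratios concrete (the latencies are the approximate table values above, not benchmarks):

```python
# Ballpark latencies in nanoseconds, from the table above.
RAM_NS = 100
SSD_NS = 150_000               # 150 microseconds
HDD_NS = 10_000_000            # 10 milliseconds
NETWORK_RTT_NS = 150_000_000   # 150 ms cross-continent round trip

# How many times slower each tier is than a RAM access.
for name, ns in [("SSD read", SSD_NS), ("HDD read", HDD_NS),
                 ("cross-continent round trip", NETWORK_RTT_NS)]:
    print(f"{name} is ~{ns / RAM_NS:,.0f}x slower than RAM")
```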

Throughput Rules of Thumb

How much traffic can a single server handle? It depends on the operation:

| Scenario | Requests/Second |
|---|---|
| Single server (simple read) | 10,000 - 100,000 QPS |
| Single server (database write) | 1,000 - 10,000 QPS |
| Single relational database | 5,000 - 10,000 QPS for simple queries |
| NoSQL database (key-value) | 100,000+ QPS |
| Cache (Redis/Memcached) | 100,000 - 1,000,000 QPS |

Pro tip: These aren’t exact. They’re starting points for thinking about whether you need multiple servers or databases.

The Estimation Framework: From Users to Bits

Here’s the repeatable process for taking vague requirements and converting them to concrete numbers:

Step 1: Start With Total Users

Question: How many users does the system have?

Example: Twitter has about 500 million registered users.

Step 2: Calculate Daily Active Users (DAU)

Question: What percentage of users are active on any given day?

This is crucial because not all registered users use the service daily. Twitter’s DAU is around 300 million — about 60% of registered users.

Formula: DAU = Total Users × Daily Active Percentage

Step 3: Derive Requests Per Second (QPS)

Question: How many actions does each user take per day?

For Twitter: each user reads about 200 tweets per day and posts about 2 tweets per day. So each user generates about 202 requests per day for reading/posting.

Formula: Requests/day = DAU × Actions Per User Per Day

Convert to QPS: QPS = Requests/day ÷ 86,400

The shortcuts:

  • 1 million requests per day = ~12 QPS
  • 1 billion requests per day = ~12,000 QPS
  • Remember: there are 86,400 seconds in a day; round to ~100,000 for easy math

Twitter example:

  • DAU: 300 million
  • Requests per user per day: 202
  • Total requests: 300M × 202 = 60.6 billion requests/day
  • QPS: 60.6B ÷ 86,400 = ~700,000 QPS average

Does that match reality? Twitter peaks at around 500K-1M QPS, so we’re in the right ballpark.
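
Here is a minimal sketch of Steps 1-3 in Python, using the assumed Twitter-like figures from above (500M users, 60% daily active, ~202 actions per user per day):

```python
def average_qps(total_users: float, daily_active_pct: float,
                actions_per_user_per_day: float) -> float:
    """Steps 1-3: users -> DAU -> requests/day -> average QPS."""
    dau = total_users * daily_active_pct
    requests_per_day = dau * actions_per_user_per_day
    return requests_per_day / 86_400  # seconds in a day

# Assumed Twitter-like inputs from the text above.
qps = average_qps(total_users=500e6, daily_active_pct=0.60,
                  actions_per_user_per_day=202)
print(f"Average QPS: ~{qps:,.0f}")  # ~701,000
```

The point is not the helper itself but the habit: turn every vague "lots of users" into an explicit DAU and QPS number before you design anything.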

Step 4: Calculate Storage Needs

Question: How much data is created per user action?

For Twitter, a tweet is roughly:

  • Tweet ID: 8 bytes
  • User ID: 8 bytes
  • Timestamp: 8 bytes
  • Text content: ~500 bytes (280 character limit)
  • Metadata (favorites, retweets, etc): ~50 bytes
  • Total: ~600 bytes per tweet

Formula:

  • Storage per day = Data per action × Write actions per day
  • Storage over time = Storage per day × Days retained (project 1 year, then 5 years)

Twitter example:

  • 2 tweets posted per user × 300M DAU = 600M tweets/day
  • 600M tweets × 600 bytes = 360 GB of tweet data per day
  • Per year: 360 GB × 365 = 131 TB just for tweet text
  • Add in indexes, replication, and other data: multiply by 3-5x

That works out to hundreds of terabytes per year for tweet text alone; over several years of retention, Twitter's tweet storage adds up to petabytes.
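
Here is the same Step 4 arithmetic as a sketch, with the 600-byte tweet size and a configurable overhead factor (both are the assumptions above, not measured values):

```python
def yearly_storage_tb(dau: float, writes_per_user_per_day: float,
                      bytes_per_write: float, overhead_factor: float) -> float:
    """Step 4: project one year of storage, padded for replication and indexes."""
    bytes_per_day = dau * writes_per_user_per_day * bytes_per_write
    return bytes_per_day * 365 * overhead_factor / 10**12  # bytes -> TB

# Assumed inputs from the text: 300M DAU, 2 tweets/day, ~600 bytes each.
raw = yearly_storage_tb(300e6, 2, 600, overhead_factor=1.0)
padded = yearly_storage_tb(300e6, 2, 600, overhead_factor=3.0)
print(f"Raw tweet text per year:       ~{raw:,.0f} TB")     # ~131 TB
print(f"With 3x replication + indexes: ~{padded:,.0f} TB")  # ~394 TB
```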

Step 5: Calculate Bandwidth

Question: How much data flows through the network?

Bandwidth = Requests per second × Average response size

Twitter example (reading):

  • 700K QPS average (mostly reads)
  • Average response: 100KB (a page of tweets)
  • Bandwidth: 700K × 100KB = 70 GB/second

That’s a lot. You need serious network infrastructure.
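
The bandwidth formula is a one-liner; here it is as a sketch with the assumed 700K QPS and 100 KB response size:

```python
def egress_gb_per_second(qps: float, avg_response_bytes: float) -> float:
    """Step 5: bandwidth = request rate x average response size."""
    return qps * avg_response_bytes / 10**9  # bytes/s -> GB/s

# Assumed inputs from the text: ~700K QPS, ~100 KB per response.
gb_per_s = egress_gb_per_second(qps=700_000, avg_response_bytes=100_000)
print(f"Average egress: ~{gb_per_s:,.0f} GB/s")               # ~70 GB/s
print(f"In network terms: ~{gb_per_s * 8 / 1000:,.2f} Tbps")  # ~0.56 Tbps
```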

Critical Concept: Peak vs. Average Traffic

Here’s something junior engineers forget: peak traffic is much higher than average.

Most systems are bursty. People use social media in the morning, evening, and during lunch. There’s a peak and a trough. Typical peak traffic is 2-5x the average.

Why this matters: Your system needs to handle peak traffic. If average is 700K QPS and peak is 3.5M QPS, you have to build for 3.5M.

Formula:

  • Peak QPS = Average QPS × Peak multiplier
  • Typical peak multiplier: 2x to 5x (use 3x as a default estimate)
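
A tiny sketch of that formula, sweeping the typical multiplier range against the earlier 700K QPS average:

```python
average_qps = 700_000  # the average figure from the Twitter estimate above

# Typical peak multipliers run 2x-5x; 3x is a reasonable default.
for multiplier in (2, 3, 5):
    print(f"Provision for ~{average_qps * multiplier:,} QPS at a {multiplier}x peak")
```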

The 80/20 Rule: Hotspot Detection

Most systems are skewed. In Netflix, the top 1% of content gets 50% of views. In social media, celebrities have millions of followers; average users have dozens. In e-commerce, a few products sell way more than others.

This affects your design:

If 80% of traffic hits 20% of your data:

  • You need caching for that hot 20%
  • You might need to shard that hot data differently
  • You might need a separate read replica for popular content

How to estimate: Ask the interviewer about distribution. “Are a few celebrities responsible for most tweets, or is traffic pretty distributed?” Their answer changes your caching strategy.

Practical Estimation Examples

Let’s walk through real examples to show the process:

Example 1: Twitter

Given requirements:

  • 500M registered users, 300M DAU
  • Each user reads ~200 tweets/day, posts ~2 tweets/day
  • Design for 1 year

Estimation:

  1. QPS: (300M × 202) ÷ 86,400 = ~700K QPS average (peak: ~2M QPS)

  2. Storage for tweets:

    • 600M tweets/day × 600 bytes = 360 GB/day
    • Per year: 360 GB × 365 = 131 TB
    • With replication (3x): 393 TB
  3. Storage for indexes and metadata: multiply by 1.5x = ~590 TB total

  4. Bandwidth:

    • Peak: 2M QPS × 100KB/response = 200 GB/second of outgoing response traffic (egress)
    • This requires global CDN and massive infrastructure

Architectural implications: You definitely need load balancing, database replication, caching, and global distribution.

Example 2: Video Streaming Service

Given requirements:

  • 100 million registered users, 50M DAU
  • Each user watches 2 hours/day of video
  • Store 1 year of videos at multiple resolutions (720p, 1080p, 4K)
  • Design for 1 year

Estimation:

  1. Video hours created per day:

    • Assume 1 hour of video creation per 1,000 users per day
    • 100M ÷ 1,000 = 100K hours/day
  2. Storage per resolution:

    • 720p: 1 Mbps = 450 MB/hour
    • 1080p: 2 Mbps = 900 MB/hour
    • 4K: 8 Mbps = 3.6 GB/hour
  3. Annual storage (three resolutions):

    • 100K hours/day × 365 = 36.5M hours/year
    • At 1080p: 36.5M × 900MB = 32.85 PB
    • All three resolutions combined (≈5 GB per content-hour): 36.5M × 5 GB ≈ 180 PB per year
  4. Bandwidth (viewing):

    • 50M DAU × 2 hours/day = 100M viewing hours/day
    • 100M viewing hours/day ÷ 24 ≈ 4.2M concurrent streams on average; at 2 Mbps (1080p) that is ≈ 8.3 Tbps of egress on average, with peaks higher still

Architectural implications: This is enormous. You need edge caching in every region, adaptive bitrate streaming, and massive storage infrastructure.
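
Here is a sketch that reproduces the video numbers above (bitrates, upload volume, and viewing hours are all the assumed figures from this example):

```python
# Assumed figures from the example above.
HOURS_UPLOADED_PER_DAY = 100_000        # new content created per day
VIEWING_HOURS_PER_DAY = 100e6           # 50M DAU x 2 hours each
BITRATES_MBPS = {"720p": 1, "1080p": 2, "4K": 8}

SECONDS_PER_HOUR = 3600
PB = 10**15

# Storage: every new content-hour is stored at every bitrate, for a year.
bytes_per_content_hour = sum(
    mbps * 10**6 / 8 * SECONDS_PER_HOUR for mbps in BITRATES_MBPS.values()
)
yearly_storage = HOURS_UPLOADED_PER_DAY * 365 * bytes_per_content_hour
print(f"Storage per year (3 resolutions): ~{yearly_storage / PB:,.0f} PB")  # ~180 PB

# Bandwidth: average concurrent streams x delivery bitrate (assume 1080p).
concurrent_streams = VIEWING_HOURS_PER_DAY / 24
avg_tbps = concurrent_streams * BITRATES_MBPS["1080p"] * 10**6 / 10**12
print(f"Average egress at 1080p: ~{avg_tbps:,.1f} Tbps")  # ~8.3 Tbps
```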

Example 3: Chat Application

Given requirements:

  • 100M users, 10M DAU
  • Each user sends 10 messages/day, each message is 1KB
  • Design for 1 year

Estimation:

  1. Messages/second:

    • 10M users × 10 messages/day = 100M messages/day
    • 100M ÷ 86,400 = ~1,157 QPS average
  2. Storage:

    • 100M messages/day × 1 KB = 100 GB/day
    • Per year: 100 GB × 365 = 36.5 TB
    • With replication: ~110 TB
  3. Message ordering requirement:

    • If users need messages delivered in strict order within a conversation, you need careful handling of write ordering and consistency (for example, per-conversation sequence numbers)

Architectural implications: This is actually manageable on moderate infrastructure. Single database with replication might work, but you’d want a message queue for resilience.

Common Estimation Mistakes

Mistake 1: Over-Precision

You calculate 1,247,683 QPS. Nobody cares about precision to the last digit. Order of magnitude: “about 1-2 million QPS.” That’s good enough.

Mistake 2: Forgetting Peak Traffic

You estimate average QPS but design for average. Your system falls over at peak. Always design for peak.

Mistake 3: Ignoring Data Growth Over Time

“We’ll store 5 years of data at this rate.” At a constant write rate, that is 5x the storage; if the user base keeps growing, it compounds further. Consider how to archive or expire old data.

Mistake 4: Not Connecting Estimates to Architectural Decisions

“So we have 100K QPS. What does that mean for architecture?”

100K QPS of simple reads sits at the upper edge of what a single well-tuned server or cache can serve, so plan for a small fleet. For writes with strong consistency, you need to shard across multiple databases. For global latency, you need a CDN.

Estimates should drive design decisions.

Did You Know?

The most famous back-of-the-envelope calculation in tech: Jeff Dean’s “Numbers Every Programmer Should Know” (2010). Still mostly accurate:

  • L1 cache: 4 cycles (0.5 ns)
  • Main memory reference: 100 ns
  • Network round trip within same data center: 500,000 ns (0.5 ms)
  • Network round trip cross-continent: 150,000,000 ns (150 ms)

The key insight: every order of magnitude shift in latency requires a different architectural approach.

Key Takeaways

  • Back-of-the-envelope estimation is about getting order of magnitude correct, not precision
  • Memorize key numbers: powers of 2 for storage, latencies, common throughput numbers
  • Use the framework: users → DAU → QPS → storage → bandwidth
  • Convert daily traffic to QPS using: requests/day ÷ 86,400
  • Always account for peak traffic (typically 2-5x average)
  • Use the 80/20 rule to identify what needs caching
  • Connect your estimates to architectural decisions — what does 1M QPS mean for database design?
  • Storage grows over time: if you keep data forever at a constant write rate, storage grows linearly with time

Practice Exercise

Estimate scale for three systems:

  1. Instagram-like photo sharing:

    • 500M users, 100M DAU
    • Each user posts 2 photos/day
    • Each photo: 3 MB
    • Store for 5 years

    Calculate: QPS, daily storage needed, total storage for 5 years, bandwidth

  2. Real-time multiplayer game:

    • 10M concurrent players (peak)
    • Each player sends movement updates 30 times/second
    • Each update: 100 bytes

    Calculate: QPS for movement updates, total network bandwidth, peak load

  3. E-commerce platform:

    • 50M users, 5M DAU
    • Each user views 20 products/day, buys 1 product/week
    • Each product has 100KB of data (images, descriptions)
    • 1 million products in catalog

    Calculate: QPS for viewing, QPS for purchases, storage for product catalog, storage for user data

For each, explain how your estimates would drive architectural decisions.
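
If you want to check your own answers, here is a small reusable sketch that strings the framework together. The field names and the assumption that every action writes data are illustrative choices, not a standard; adjust them per system. The demo re-runs the chat numbers from Example 3 so it doesn't spoil the exercises:

```python
from dataclasses import dataclass

@dataclass
class ScaleEstimate:
    """Rough inputs for one system. Every field is an order-of-magnitude guess."""
    total_users: float
    daily_active_pct: float
    actions_per_user_per_day: float
    bytes_per_action: float          # assumes every action writes this much
    avg_response_bytes: float
    peak_multiplier: float = 3.0     # typical range is 2x-5x
    overhead_factor: float = 3.0     # replication + indexes
    retention_years: float = 1.0

    def report(self) -> None:
        dau = self.total_users * self.daily_active_pct
        requests_per_day = dau * self.actions_per_user_per_day
        avg_qps = requests_per_day / 86_400
        peak_qps = avg_qps * self.peak_multiplier
        # Storage: split read vs. write actions yourself for a finer estimate.
        storage_tb = (requests_per_day * self.bytes_per_action * 365
                      * self.retention_years * self.overhead_factor) / 10**12
        egress_mb_s = avg_qps * self.avg_response_bytes / 10**6
        print(f"DAU: ~{dau:,.0f}")
        print(f"Average QPS: ~{avg_qps:,.0f}   Peak QPS: ~{peak_qps:,.0f}")
        print(f"Storage: ~{storage_tb:,.1f} TB over {self.retention_years:g} year(s)")
        print(f"Egress: ~{egress_mb_s:,.2f} MB/s average")

# Demo with the chat numbers from Example 3: 100M users, 10% active,
# 10 messages/day at ~1 KB each, ~1 KB responses.
ScaleEstimate(total_users=100e6, daily_active_pct=0.10,
              actions_per_user_per_day=10, bytes_per_action=1_000,
              avg_response_bytes=1_000).report()
```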

Next: Building the High-Level Architecture

Now that you’ve estimated scale, you’re ready to design. Estimates become constraints that guide your choices. Do you need distributed databases? Multiple servers? Caching? The next section shows you how to sketch the high-level architecture that can handle your estimated scale, and how to draw diagrams that make your thinking clear.