Warehouses vs Data Lakes
Your Data Storage Dilemma
Imagine your company is growing fast. You’re collecting data from everywhere: web analytics tracking user clicks, mobile apps sending session events, your CRM logging customer interactions, payment systems recording transactions, and IoT sensors streaming device telemetry. It’s a deluge of data.
Now the requests start coming in:
- The CEO wants a dashboard showing sales by region, customer lifetime value, and churn trends—delivered every morning at 6 AM.
- The data science team wants raw, unfiltered data to train machine learning models that predict customer behavior.
- The finance team needs structured, auditable reports for compliance and revenue recognition.
- The product team wants to experiment with real-time recommendations based on user behavior.
You sit in the architecture meeting and someone asks: “Do we build a data warehouse, a data lake, or both?”
This is one of the most consequential decisions in data infrastructure. Get it wrong, and you’ll either build an inflexible system that can’t evolve, or a chaotic swamp where nobody knows where data lives or what it means. Get it right, and you’ve unlocked insights that competitors can’t match.
Let’s understand what each approach offers, and when to use them.
What is a Data Warehouse?
A data warehouse is a purpose-built system optimized for analytical queries on structured, historical data. Think of it as a factory floor: raw materials (transactions, events) come in, they’re carefully processed, refined, and organized into finished goods (reports, dashboards) that business stakeholders consume.
Key Characteristics
Schema-on-Write: Data is transformed before it enters the warehouse. A raw transaction event gets parsed, validated, deduplicated, and enriched. Only clean, conforming data lives in the warehouse.
Structured Format: Everything is organized into tables with predefined columns, types, and relationships. A sales transaction has columns like order_id, customer_id, product_id, amount, timestamp.
OLAP Optimization: The warehouse is built for analytical queries (OLAP = Online Analytical Processing). Queries like “total revenue by product category for Q4” scan millions of rows and aggregate. Warehouses use columnar storage (store data column-by-column, not row-by-row) because analytical queries only touch a few columns, not entire rows.
Star and Snowflake Schemas: Data is organized into fact tables (events, transactions) and dimension tables (customers, products, dates). This structure makes queries fast and reporting intuitive.
-- Fact table: Sales
CREATE TABLE sales_fact (
    sale_id INT,
    date_id INT,
    customer_id INT,
    product_id INT,
    amount DECIMAL(10, 2),
    quantity INT
);

-- Dimension table: Customer
CREATE TABLE customer_dim (
    customer_id INT PRIMARY KEY,
    name VARCHAR(255),
    segment VARCHAR(50),
    country VARCHAR(100),
    signup_date DATE
);

-- Dimension table: Date (referenced by the query below)
CREATE TABLE date_dim (
    date_id INT PRIMARY KEY,
    date DATE,
    year INT,
    quarter INT,
    month INT
);

-- Analytical query: monthly revenue by customer segment
SELECT
    c.segment,
    DATE_TRUNC('month', d.date) AS month,
    SUM(f.amount) AS total_revenue
FROM sales_fact f
JOIN customer_dim c ON f.customer_id = c.customer_id
JOIN date_dim d ON f.date_id = d.date_id
WHERE d.date >= '2024-01-01'
GROUP BY c.segment, DATE_TRUNC('month', d.date)
ORDER BY month DESC, total_revenue DESC;
ETL Pipeline: Data flows through Extract (pull from source systems), Transform (clean, deduplicate, enrich), Load (write to warehouse). The Transform step is heavyweight—it’s where data quality is enforced.
Popular Warehouse Solutions
- Snowflake: Cloud-native, separates compute and storage, excellent for scaling, strong for multi-tenant scenarios.
- Amazon Redshift: Columnar, MPP (Massively Parallel Processing), integrates with AWS ecosystem.
- Google BigQuery: Serverless, strong SQL engine, built-in machine learning functions.
- Azure Synapse: Integrates with Microsoft cloud, good for enterprises already in Azure.
What is a Data Lake?
A data lake is a centralized repository for storing any type of data—structured, semi-structured, or unstructured—in its raw form, at massive scale.
Instead of forcing data into a predefined structure, a data lake says: “Store everything first, figure out what it means later.”
Key Characteristics
Schema-on-Read: Data lands as-is (schema is inferred when you query it). A raw JSON event from your mobile app gets stored exactly as it arrived. When an analyst wants to use it, they parse the JSON, extract relevant fields, and handle variations in the schema.
Any Format: Parquet files, ORC, Avro, JSON, XML, images, videos, raw logs. A data lake doesn’t care. This makes it perfect for organizations with diverse data sources.
ELT Pipeline: Data flows through Extract (pull from sources), Load (store raw data), Transform (when you query it). The Transform is lightweight—data is transformed on-demand.
Flexible and Evolving: New data sources? Add them to the lake. New fields in your event schema? They land in the lake without breaking anything. Later, when you understand the data better, you can catalog and organize it.
Cost Efficient: Storage is cheap (S3 costs ~$0.023 per GB/month). Compute is decoupled—you pay for it only when you query. You can store 10 years of raw data and only pay for the gigabytes used.
# Reading raw JSON events from a data lake using Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataLakeAnalysis").getOrCreate()

# Load raw JSON events
events_df = spark.read.json("s3://my-data-lake/mobile-app-events/2024/*/events.json")

# Flexible schema-on-read: parse nested JSON
user_events = events_df.filter(
    (events_df.event_type == "purchase") &
    (events_df.timestamp > "2024-01-01")
).select(
    "user_id",
    "event_type",
    "timestamp",
    "details.product_id",
    "details.amount"
)

user_events.groupBy("product_id").agg({"amount": "sum"}).show()
Architecture: Distributed Storage + Catalog
Data lakes typically sit on distributed file systems (HDFS, S3, Azure Data Lake Storage). A catalog service (AWS Glue, Apache Hive Metastore) tracks what data exists, its schema, and metadata.
graph LR
A["Data Sources<br/>Apps, APIs, Logs, IoT"] -->|ELT| B["Distributed Storage<br/>S3, HDFS, Azure Blob"]
B -->|Query| C["Compute Engines<br/>Spark, Presto, Athena"]
B -->|Catalog| D["Metadata Service<br/>Hive, Glue, Atlas"]
C -->|Results| E["Analytics<br/>BI Tools, ML Pipelines"]
D -->|Schema Info| C
The Lakehouse: Best of Both Worlds
Here’s the problem: data warehouses are governed but inflexible. Data lakes are flexible but chaotic.
What if you could have the governance and performance of a warehouse with the flexibility and cost of a lake?
This is the lakehouse pattern. Technologies like Delta Lake (Databricks), Apache Iceberg (Netflix), and Apache Hudi (Uber) add warehouse-like features to data lakes:
- ACID Transactions: Multiple writers and readers, no corruption.
- Time Travel: Query data as it was at any point in the past (“as of 2024-02-01”).
- Schema Enforcement and Evolution: Define a schema, enforce it, evolve it safely.
- Unified Batch and Streaming: Same table for batch and streaming data.
# Delta Lake: Lake with warehouse features
from delta.tables import DeltaTable
from pyspark.sql import functions as F
# Create a Delta table (ACID, time-travel capable); df is a DataFrame of customer events
df.write.format("delta").mode("overwrite").save("s3://my-lake/customer_events")

# Later: read the table as it was at a specific point in time
historical_df = spark.read.format("delta") \
    .option("timestampAsOf", "2024-02-01") \
    .load("s3://my-lake/customer_events")

# Schema enforcement: this append fails if the incoming columns don't match the table schema
new_events = spark.read.json("s3://raw-events/mobile-app")
new_events.write.format("delta").mode("append").save("s3://my-lake/customer_events")

# Backfill missing customer IDs: an in-place UPDATE, something plain Parquet files don't support
DeltaTable.forPath(spark, "s3://my-lake/customer_events") \
    .update(condition="customer_id IS NULL",
            set={"customer_id": F.monotonically_increasing_id()})
Architecture in Depth
Data Warehouse: MPP and Columnar Storage
Warehouses use Massively Parallel Processing (MPP). Queries are split across many nodes; each processes a partition of data and returns results that are combined.
They store data in columnar format: all values for a single column are stored contiguously. This is optimal for analytical queries that touch a few columns but many rows.
Row-oriented (used by OLTP/transactional databases):
CustomerID | Name | Segment | Revenue
1 | Alice | Premium | $5000
2 | Bob | Standard | $1200
Columnar (used by warehouses):
CustomerID: [1, 2, 3, 4, 5, ...]
Name: [Alice, Bob, Carol, Dave, Eve, ...]
Segment: [Premium, Standard, Premium, Basic, Standard, ...]
Revenue: [$5000, $1200, $8300, $600, $2100, ...]
Query: "Sum of revenue by segment"
Columnar: Only read Segment and Revenue columns
Row-oriented: Read all columns, ignore Name and CustomerID
Columnar layout also compresses extremely well because values within a single column tend to be repetitive: the Segment column might contain only five distinct values, so it can be dictionary- and run-length-encoded far more compactly than row-oriented data.
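To make the intuition concrete, here is a tiny, self-contained Python sketch (not tied to any particular warehouse engine) of run-length encoding a low-cardinality Segment column:

# Run-length encoding: a simplified stand-in for the compression
# that columnar engines apply to repetitive columns
from itertools import groupby

segments = ["Premium", "Standard", "Premium", "Premium", "Basic",
            "Standard", "Standard", "Standard", "Premium", "Basic"]

# Sorting first maximizes run lengths; real engines also combine dictionary
# encoding with general-purpose compression inside each row group
encoded = [(value, len(list(run))) for value, run in groupby(sorted(segments))]

print(encoded)
# [('Basic', 2), ('Premium', 4), ('Standard', 4)]  -- 10 strings collapse to 3 (value, count) pairs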
Data Lake: Distributed File Systems and Catalogs
Data lakes typically use S3 or HDFS. Data is partitioned (split into folders by date, region, etc.) so that queries only touch relevant files.
A catalog service maintains metadata:
{
    "table": "mobile_app_events",
    "location": "s3://my-lake/mobile-app-events",
    "format": "parquet",
    "columns": [
        {"name": "user_id", "type": "string"},
        {"name": "event_type", "type": "string"},
        {"name": "timestamp", "type": "timestamp"},
        {"name": "properties", "type": "struct"}
    ],
    "partitions": ["year", "month", "day"],
    "lineage": ["mobile-app-service (v2.3)"],
    "owner": "analytics-team"
}
When you query, the catalog tells the query engine (Spark, Presto, Athena) where data lives and its schema.
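To make partitioning concrete, here is a short PySpark sketch; the bucket names, paths, and columns are illustrative, not from any real system. Writing events partitioned by date means a later filter on those columns only reads the matching folders:

# Partitioned writes plus partition pruning on a data lake (paths are placeholders)
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("PartitioningDemo").getOrCreate()

events = spark.read.json("s3://my-data-lake/raw/mobile-app-events/")

# Derive partition columns and write Parquet partitioned by year/month/day
events \
    .withColumn("year", F.year("timestamp")) \
    .withColumn("month", F.month("timestamp")) \
    .withColumn("day", F.dayofmonth("timestamp")) \
    .write.partitionBy("year", "month", "day") \
    .mode("append") \
    .parquet("s3://my-data-lake/events_partitioned/")

# A filter on partition columns lets the engine skip every non-matching folder
january = spark.read.parquet("s3://my-data-lake/events_partitioned/") \
    .filter((F.col("year") == 2024) & (F.col("month") == 1))
january.count()  # only the year=2024/month=1 folders are scanned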
Data Pipeline Patterns
Batch ETL (Warehouse-style):
- Every night at 2 AM, pull all transactions from the past 24 hours.
- Deduplicate, validate, enrich with customer data.
- Insert into warehouse fact tables.
- Morning dashboards run on clean data.
# Batch ETL example
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("BatchETL").getOrCreate()

# Extract: yesterday's raw transactions plus the customer master used for enrichment
raw_transactions = spark.read.parquet("s3://raw/transactions/2024-02-11/")
customer_master = spark.read.format("delta").load("s3://warehouse/customer_dim")  # illustrative path

# Transform
clean = raw_transactions \
    .dropDuplicates(["transaction_id"]) \
    .filter("amount > 0") \
    .join(customer_master, "customer_id", "inner") \
    .withColumn("processed_date", F.current_timestamp())

# Load
clean.write.format("delta").mode("append").save("s3://warehouse/transactions_fact")
Streaming Ingestion (Lake-style):
- Kafka topic receives 100K events per second.
- Stream consumer writes each event to the lake (Parquet files in S3).
- Near real-time availability for analysis.
# Streaming to data lake
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("StreamingIngest").getOrCreate()

# Read from Kafka
kafka_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "mobile-events") \
    .load()

# Parse JSON
events = kafka_stream.select(
    F.from_json(F.col("value").cast("string"),
                "user_id STRING, event_type STRING, timestamp TIMESTAMP").alias("data")
).select("data.*")

# Write to Delta Lake (ACID, handles concurrent writes)
events.writeStream \
    .format("delta") \
    .option("checkpointLocation", "s3://lake/checkpoints/mobile-events") \
    .outputMode("append") \
    .start("s3://lake/mobile-events")
Change Data Capture (CDC): Instead of full extracts, capture changes to source systems. This is efficient for large databases.
For example, Debezium can capture every change from a PostgreSQL database and publish it to Kafka as an event carrying the before state, the after state, and the operation (INSERT/UPDATE/DELETE):

{
    "operation": "UPDATE",
    "before": {"customer_id": 123, "email": "old@example.com"},
    "after": {"customer_id": 123, "email": "new@example.com"},
    "timestamp": "2024-02-11T10:30:00Z"
}
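Once change events land in the lake, they are usually applied to a target table with a merge. The sketch below is one way to do that with Delta Lake; it assumes the Debezium payloads above have already been parsed and flattened into a DataFrame named changes_df with customer_id, email, and operation columns (all names here are illustrative).

# Apply CDC events (INSERT / UPDATE / DELETE) to a Delta table with MERGE
from delta.tables import DeltaTable

customers = DeltaTable.forPath(spark, "s3://my-lake/customers")  # spark: active SparkSession

(customers.alias("t")
    .merge(changes_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedDelete(condition="s.operation = 'DELETE'")
    .whenMatchedUpdate(set={"email": "s.email"})
    .whenNotMatchedInsert(condition="s.operation != 'DELETE'",
                          values={"customer_id": "s.customer_id", "email": "s.email"})
    .execute())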
Governance and Data Quality
This is where many data lakes fail.
The Data Swamp Anti-Pattern
A data lake without governance becomes a data swamp:
- Nobody knows what data lives where.
- The same “customer” is defined 10 different ways across tables.
- Tables haven’t been updated in 6 months; nobody knows if they’re stale.
- Data quality issues propagate downstream (garbage in, garbage out).
- Compliance teams can’t trace where PII lives.
To avoid this:
Data Catalog: Maintain an inventory. What’s in each table? Who owns it? What’s the SLA?
Lineage: Track where data comes from. “This revenue metric comes from transactions table, which is fed by the billing system, which syncs every 6 hours.”
Schema Registry: Enforce structure, but allow evolution. “Events table has these 20 columns. New columns can be added, but old columns can’t be removed without deprecation.”
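Table formats give you a concrete enforcement point for that policy. As a minimal sketch (reusing the hypothetical customer_events Delta table from earlier), Delta Lake rejects appends with unexpected columns unless the writer explicitly opts into additive schema evolution:

# Additive schema evolution in Delta: without mergeSchema the append is rejected,
# so new columns are added deliberately, never by accident
enriched_events = spark.read.json("s3://raw-events/mobile-app-v2")  # assumed to carry an extra device_os column

enriched_events.write \
    .format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("s3://my-lake/customer_events")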
Data Quality Checks: Automated tests before data lands. “Transactions can’t have negative amounts. Customer IDs must exist in the customer master.”
# Data quality checks using Great Expectations' Spark dataset API
# (orders_df is the batch about to be loaded; thresholds are illustrative)
from great_expectations.dataset import SparkDFDataset

checks = SparkDFDataset(orders_df)

# Each expectation returns a result whose `success` field says whether the batch passed
status_ok = checks.expect_column_values_to_be_in_set(
    "order_status",
    ["pending", "confirmed", "shipped", "delivered", "cancelled"]
)
amount_ok = checks.expect_column_values_to_be_between("order_amount", min_value=0, max_value=1000000)
row_count_ok = checks.expect_table_row_count_to_be_between(min_value=1000, max_value=10000000)

if not (status_ok.success and amount_ok.success and row_count_ok.success):
    raise ValueError("Data quality checks failed; refusing to load this batch")
Access Control: Who can read sensitive tables? Who can write? Use role-based access (RBAC). PII should be masked for non-authorized users.
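As one small sketch of the masking side (table paths and column names are made up), you can publish a hashed copy of a PII column so analysts can still join and count on it without ever seeing raw values:

# Column-level masking before exposing data to analysts
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("PIIMasking").getOrCreate()

customers = spark.read.format("delta").load("s3://my-lake/customers")

# Replace the raw email with a SHA-256 hash; joins and distinct counts still work
masked = customers.withColumn("email", F.sha2(F.col("email"), 256))

# Publish the masked copy and grant analyst roles access to this table only
masked.write.format("delta").mode("overwrite").save("s3://my-lake/customers_masked")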
Comparison: Warehouse vs Lake vs Lakehouse
| Dimension | Warehouse | Data Lake | Lakehouse |
|---|---|---|---|
| Data Format | Structured, predefined schema | Any format (JSON, images, etc.) | Structured with schema flexibility |
| Schema Approach | Schema-on-write | Schema-on-read | Schema-on-read, enforced on-write |
| Processing Model | ETL (transform before load) | ELT (transform after load) | Both ETL and ELT |
| Query Performance | Very fast (optimized, indexed) | Variable (depends on query engine) | Fast (optimized columnar storage) |
| Cost Model | Expensive compute, reasonable storage | Cheap storage, variable compute | Cheap storage, optimized compute |
| Governance | Built-in, enforced | Must be added (hard to retrofit) | Built-in (ACID, lineage, schema) |
| Flexibility | Low (schema changes are hard) | High (new data types easily added) | High (with enforcement) |
| Time-to-Onboard a New Source | Slow (weeks to model and load) | Fast (days) | Fast (days) |
| Time-to-Insight | Fast (dashboards run on ready data) | Slow (data needs transformation) | Fast (both batch and streaming) |
| ACID Transactions | Yes | No (until recently) | Yes |
| Real-Time Data | Hard (batch-oriented) | Natural (streaming-friendly) | Natural |
| ML Friendly | No (data is transformed away) | Yes (raw data for model training) | Yes (raw + structured) |
Real-World Scenarios
E-Commerce Analytics Platform
Your company has an online store. You need:
- Daily BI Reports: Sales by product, category, and region (updated each morning).
- ML Models: Predict churn, recommend products, detect fraud.
- Real-Time Dashboard: Live conversion rate, cart abandonment, top products right now.
Solution: Lakehouse
- Raw Data Lake: Events stream from your web/app into S3 via Kafka (purchases, clicks, page views, abandoned carts). Cost: ~$2K/month for storage.
- Delta Lake Tables: Transform events into customer, product, and order tables using Spark jobs. Overnight batch + hourly streaming transforms.
- Warehouse Layer: Select high-value tables (orders, customers, products) are synced to Snowflake for BI dashboards. Cost: ~$5K/month for compute.
- ML Pipelines: Data scientists query Delta Lake directly for training data; schema is clean and versioned.
This setup costs less and is more flexible than a warehouse-only design, and it supports real-time workloads without ETL bottlenecks.
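The "warehouse layer" sync in this design might look like the sketch below, which pushes a curated Delta table into Snowflake with the Spark-Snowflake connector. The connection options, warehouse, table, and path names are placeholders, and a real pipeline would pull credentials from a secrets manager.

# Sync a curated orders table from the lake into Snowflake for BI dashboards
# (requires the Snowflake Spark connector; every option value below is a placeholder)
orders = spark.read.format("delta").load("s3://my-lake/curated/orders")  # spark: active SparkSession

sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "etl_user",
    "sfPassword": "***",          # use a secrets manager, not a literal
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "BI_WH",
}

orders.write \
    .format("net.snowflake.spark.snowflake") \
    .options(**sf_options) \
    .option("dbtable", "ORDERS") \
    .mode("overwrite") \
    .save()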
Real-Time Recommendation Engine
You need to recommend products to users as they browse, based on their history and behavior.
Solution: Data Lake + Streaming
# Kafka -> Spark Streaming -> Delta Lake -> Feature Store -> ML Model
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("RTRecommendations").getOrCreate()

# Stream user events
events = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "user-events") \
    .load()

# Extract features in real-time
user_features = events.select(
    F.from_json(F.col("value").cast("string"),
                "user_id STRING, product_id STRING, action STRING, timestamp TIMESTAMP").alias("data")
).select("data.*") \
    .groupBy("user_id") \
    .agg(
        F.count("product_id").alias("products_viewed"),
        F.collect_list("product_id").alias("recent_products"),
        F.max("timestamp").alias("last_activity")
    )

# Write to the feature store; with an unbounded aggregation, the Delta sink uses "complete" output mode
user_features.writeStream \
    .format("delta") \
    .option("checkpointLocation", "s3://lake/checkpoints/user-features") \
    .outputMode("complete") \
    .start("s3://lake/user-features")
# API service queries feature store in real-time
# SELECT * FROM user_features WHERE user_id = ? -> instant recommendations
Serving latency: the feature-store lookup that powers the recommendation takes well under 100ms; feature freshness lags only by the streaming trigger interval (typically seconds).
Multi-Tenant SaaS Analytics
You operate a SaaS platform with 1000s of customers. Each wants to see their own analytics.
Solution: Warehouse with Tenant Isolation
-- Row-level security (PostgreSQL syntax): customers only see their own data
ALTER TABLE sales_fact ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON sales_fact
    FOR SELECT USING (tenant_id = current_setting('app.tenant_id'));

-- Set the tenant context for the session, then query
SET app.tenant_id = 'customer-123';
SELECT * FROM sales_fact;  -- Only sees customer-123's data
-- Cloud warehouses offer the same idea, e.g. Snowflake row access policies
Warehouse works well here because:
- Each tenant’s queries are isolated and predictable.
- You can shard tenants across clusters for cost optimization.
- Data is clean and deduplicated.
- Compliance is easier (audit trails, row-level security).
When to Choose What?
Use a Data Warehouse if:
- Your analytical patterns are well-known and stable.
- You have a small-to-medium number of data sources.
- You need sub-second query performance.
- Your team is SQL-fluent but not Python/Spark-strong.
- You value simplicity and predictability over flexibility.
Use a Data Lake if:
- You’re early-stage or exploratory (don’t know what you’ll need).
- You have diverse data sources (logs, videos, sensor data, etc.).
- You want to preserve raw data for future use cases.
- Your team is strong in Python, Scala, and distributed computing.
- Cost is a primary concern.
Use a Lakehouse if:
- You need both real-time and batch analytics.
- You want ML-friendly raw data and BI-friendly curated tables.
- You’re willing to adopt newer technologies (Delta, Iceberg, Hudi).
- You need ACID transactions and data governance on a lake.
Key Takeaways
- Data warehouses are optimized for known analytical patterns. They enforce structure, guarantee quality, and deliver fast queries. They’re expensive to maintain and inflexible when requirements change.
- Data lakes store raw data cheaply and flexibly. They’re ideal for exploration and ML, but can become chaotic without governance.
- Lakehouses combine the best of both: structure, governance, and ACID transactions from warehouses; flexibility, cost, and streaming support from lakes.
- Schema-on-write (warehouse) catches errors early but limits flexibility. Schema-on-read (lake) is flexible but requires discipline downstream.
- Data governance is critical. A lakehouse without catalogs, lineage, and quality checks becomes a data swamp.
- MPP and columnar storage are why warehouses are so fast. Understanding these architectures helps you design better queries and data models.
- The choice is not binary. Most enterprises use both: a lake for data collection and exploration, a warehouse for trusted analytics.
Practice Scenarios
Scenario 1: The Chaos Problem
Your company has grown from 50 to 500 employees. You built a data lake 3 years ago by throwing all logs, events, and database backups into S3. Now:
- Nobody knows what data is current vs. stale.
- The same “user” is represented differently in 15 different places.
- A SQL analyst wrote a “customer metrics” table 2 years ago; it’s never been updated.
- The finance team can’t trust any reports because they don’t know where data comes from.
What would you do to restore order? What governance components would you add?
Scenario 2: The ML vs BI Tension
Your data team is split. The data scientists want raw, untransformed data for building models. The BI analysts want clean, deduplicated, dimension-enriched tables for dashboards. Currently, you have a warehouse that serves BI perfectly but makes ML hard (data is too transformed). Design an architecture that makes both teams happy.
Scenario 3: The Real-Time Requirement
Your company’s product is a mobile app that generates 1M events per day. Finance wants daily revenue reports (batch, easy). The product team wants real-time dashboards showing user engagement right now (streaming, hard). The ML team wants to retrain a churn model weekly using 2 years of historical data. Design a data infrastructure that supports all three requirements without overbuilding.
You’ve now seen the landscape: warehouses for structured, predictable analytics; lakes for flexible, exploratory work; lakehouses to have your cake and eat it too.
In the next chapter, we’ll zoom in on data replication, partitioning, and consistency patterns. You’ll learn how data actually flows between these systems, and what happens when replication goes wrong. The principles you’ve learned here—schema, governance, data pipelines—will apply directly to those distributed challenges.