System Design Fundamentals

Graph Databases

A

Graph Databases

A Relationship Problem in Disguise

Imagine you’re building a social network. Your product team wants a feature: “Show me friends of friends who also like hiking and live within 50 miles of me.” Simple requirement, right?

In a relational database, you’d need to:

  1. Look up your friends (SELECT from users_friends WHERE user_id = me)
  2. For each friend, look up their friends (JOIN users_friends again)
  3. For each friend-of-friend, check their interests and location (JOIN with interests, JOIN with locations)
  4. Apply distance calculations on retrieved rows

If you have 1 million users, each with 500 friends on average, and each friend-of-friend has their own network? You’re executing millions of join operations, each scanning potentially billions of rows. The query that should return in milliseconds now takes minutes.

This is the fundamental problem that graph databases solve. Relationships are not second-class citizens in a graph database — they are first-class citizens. They’re stored explicitly, indexed directly, and traversed without expensive join operations.

Welcome to graph databases: a data model where relationships matter as much as the entities themselves.

What Is a Graph Database?

A graph database stores and retrieves data by relationships rather than by rows and tables. The core data structure is elegant:

  • Nodes: Entities (users, products, locations, accounts)
  • Edges: Relationships between entities (friend_of, purchased, located_in, manages)
  • Properties: Attributes on nodes and edges (name, created_at, strength of relationship)

This is fundamentally different from relational databases. A relational database models everything as tables with foreign keys. A graph database models everything as an actual graph.

Two Graph Models

Labeled Property Graphs (LPG) are the most common model in production systems:

  • Each node has a label (type) and properties (attributes)
  • Each edge has a type and properties
  • A single graph can contain multiple node and edge types
  • Examples: Neo4j, ArangoDB, Amazon Neptune, JanusGraph

RDF Triple Stores are semantic-focused:

  • Everything is expressed as subject-predicate-object triples
  • More rigid but powerful for knowledge representation
  • Examples: SPARQL endpoint, AllegroGraph, Virtuoso

We’ll focus on labeled property graphs, as they’re more common in system design contexts.

Index-Free Adjacency

Here’s the magic behind graph databases: index-free adjacency. Rather than querying an index to find related nodes (like a foreign key join), each node directly maintains pointers to its neighbors. When you want to find a user’s friends, you don’t scan an index — you directly traverse the relationships stored on that user’s node.

This means:

  • Finding neighbors of a node is O(1) in terms of disk operations (the neighbors are directly referenced)
  • Traversing multiple hops (friends → friends → friends) scales with the path length, not the total data size
  • Join costs are eliminated because relationships are pre-computed and stored

This is why graph databases excel at relationship-heavy queries where relational databases struggle.

A Social Network at a Party

Let’s use an analogy: imagine a physical social network as a party.

In a relational world, you’re the bouncer with a clipboard. To find “friends of friends of John,” you:

  1. Scan your guest list to find all of John’s friends
  2. For each friend, scan the list again to find their friends
  3. Cross-reference against attendance records
  4. Apply filters

With thousands of guests, each with hundreds of connections, your clipboard work is endless.

In a graph world, each person (node) literally holds hands with their friends (edges). You ask John: “Who are you holding hands with?” He points to Alice, Bob, Carol. You ask Alice: “Who are you holding hands with?” She points to Bob, David, Eve. No clipboard scanning needed. Following the chain of hands is direct and fast.

This is why graph databases are so effective at traversal — the relationship data structure is optimized for following connections, not scanning tables.

Technical Architecture and Storage

Graph database engines take two fundamentally different approaches:

Native Graph Engines

Store graph data in a format optimized for graph operations:

  • Each node maintains explicit references to its edges
  • Edges are stored to enable bi-directional traversal
  • Indexes are built around graph structures, not table rows
  • Examples: Neo4j (most mature), JanusGraph, ArangoDB’s graph mode

Advantages:

  • Index-free adjacency means traversals don’t require index lookups
  • Relationship traversal performance is consistent regardless of graph size
  • Query planning is graph-aware

Disadvantages:

  • Scaling writes across multiple nodes is challenging
  • Transactions must coordinate across nodes that own the graph structure

Graph Layers on Other Systems

Store graph data in relational or document stores, adding graph query semantics on top:

  • Amazon Neptune: Graph abstraction over AWS’s proprietary storage layer
  • MongoDB with graph features: Documents with relationship queries
  • PostgreSQL extensions: pgvector for graph embeddings, Apache AGE for property graphs

Advantages:

  • Leverage mature operational infrastructure
  • Often easier to scale writes horizontally
  • Can handle mixed workloads (tabular + graph)

Disadvantages:

  • Graph traversals still involve more joins than native engines
  • Performance degrades more quickly with deep traversals

Query Languages and Traversal

Different graph databases use different query languages, but they share common patterns.

Cypher (Neo4j)

Cypher is the most intuitive graph query language. It uses ASCII art to represent patterns:

// Find a user and their friends
MATCH (user:User {name: "Alice"})
RETURN user

// Find friends of friends
MATCH (user:User {name: "Alice"})
  -[:FRIEND_OF]->
  (friend:User)
  -[:FRIEND_OF]->
  (foaf:User)
RETURN DISTINCT foaf.name

// Shortest path between two users
MATCH (alice:User {name: "Alice"}),
      (bob:User {name: "Bob"}),
      path = shortestPath(
        (alice)-[*]-(bob)
      )
RETURN path

// Create nodes and relationships
CREATE (alice:User {name: "Alice", age: 28})
CREATE (bob:User {name: "Bob", age: 30})
CREATE (alice)-[:FRIEND_OF {since: 2020}]->(bob)

Gremlin (Apache TinkerPop)

Gremlin is more functional/procedural, common in distributed systems:

// Traverse graph
g.V().hasLabel('User').has('name', 'Alice')
  .out('FRIEND_OF')
  .out('FRIEND_OF')
  .distinct()

// With filtering
g.V().hasLabel('User')
  .has('name', 'Alice')
  .out('FRIEND_OF')
  .has('location', 'Seattle')

SPARQL (RDF Stores)

SPARQL is XML-based, verbose but powerful for semantic queries:

SELECT ?foaf
WHERE {
  ?alice foaf:knows ?friend .
  ?friend foaf:knows ?foaf .
}

Graph Algorithms: Beyond Simple Traversal

Modern graph databases include or integrate with algorithms that work on entire graph structures:

Shortest Path (BFS / Dijkstra)

Find the minimum number of relationships between two entities.

Use cases: Navigation networks, LinkedIn “how do you know this person,” fraud rings

MATCH path = shortestPath(
  (source:User)-[*..10]-(target:User)
)
WHERE source.name = "Alice" AND target.name = "Bob"
RETURN path

Complexity: O(V + E) for BFS (unweighted), O((V + E) log V) for Dijkstra (weighted)

PageRank

Identify high-influence nodes based on graph structure. Originally used by Google for web ranking.

Use cases: Recommendation systems, fraud detection (suspicious account networks), finding authority figures in networks

Algorithm: Iteratively scores nodes based on incoming relationships and their scores. Authority spreads through the graph.

Community Detection (Louvain)

Find tightly connected clusters in a graph.

Use cases: Customer segmentation, fraud rings, identifying organizational clusters

Time: O(V) to O(V^2) depending on implementation

Centrality Measures

  • Degree Centrality: How many connections does a node have?
  • Betweenness Centrality: How many shortest paths pass through this node? (bridge finder)
  • Closeness Centrality: How close is this node to all others?
DatabaseModelQuery LanguageScalingMaturityBest For
Neo4jLPGCypherSingle-server or HA clusterVery MatureSocial networks, reference data, knowledge graphs
Amazon NeptuneLPG + RDFGremlin, SPARQL, openCypherManaged, automatic scalingMatureAWS-native apps, multi-model queries
JanusGraphLPGGremlinExcellent (requires external storage)MatureLarge-scale graphs, distributed deployments
ArangoDBMulti-modelAQLGoodMatureMixed workloads, document + graph
DgraphLPGGraphQL+ExcellentMatureAPI-first, GraphQL-friendly applications
TigerGraphLPGGSQLExcellentMaturingReal-time analytics, algorithmic queries

Pro tip: Neo4j dominates mindshare and has the richest ecosystem, but it’s also proprietary and can be expensive at scale. If you need distributed scaling from day one, consider JanusGraph (open-source, pluggable storage) or Dgraph (GraphQL-native).

The Hard Problem: Scaling Graph Databases

Here’s where graph databases get tricky. Scaling a relational database is straightforward: shard by primary key and most queries can hit a single shard. Scaling a graph database is fundamentally harder.

The Partitioning Problem

Imagine partitioning a social network by user ID. User 1-1M goes to Shard A, User 1M-2M goes to Shard B. That works until User 1000 wants to traverse to their friends, who might live in Shard A, B, and C. You now have distributed traversals — network hops, latency, complexity.

There are two main strategies:

Vertex-Cut Partitioning:

  • Shard by edge target (replicate popular nodes across shards)
  • Reduces cross-shard traffic for high-degree nodes
  • Creates replication overhead

Edge-Cut Partitioning:

  • Traditional sharding (shard by node ID)
  • Minimizes replication but creates traversal fan-out
  • Better for low-diameter graphs

Neither is universally better. The right choice depends on your graph’s structure.

Write Performance vs Traversal Performance

Here’s the trade-off: graph databases optimize for traversal reads. Writes are harder, especially at scale:

  • Single-node: Excellent write performance (ACID transactions, the whole graph in memory/disk)
  • Distributed: Write consistency requires coordination across shards, reducing throughput

JanusGraph handles this by delegating storage to pluggable backends (HBase, Cassandra) and accepting eventual consistency. Neptune is managed by AWS and scales transparently but with eventual consistency for writes. Neo4j focuses on single-node performance with optional read replicas for scaling reads.

Real-World Applications

Friend Recommendations

// Recommend friends: users who are friends-of-friends but not already friends
MATCH (me:User {name: "Alice"})-[:FRIEND_OF]->(friend)-[:FRIEND_OF]->(foaf:User)
WHERE NOT (me)-[:FRIEND_OF]->(foaf)
  AND foaf.interests CONTAINS "hiking"
  AND distance(point({latitude: me.lat, longitude: me.lon}),
               point({latitude: foaf.lat, longitude: foaf.lon})) < 50
RETURN foaf.name, count(*) AS mutual_friends
ORDER BY mutual_friends DESC
LIMIT 10

Fraud Detection

Banks use graphs to detect fraudulent transaction rings:

// Find accounts that form a tight cluster (circular money transfers)
MATCH (account1:Account)
  -[:TRANSFERS_TO]->(account2:Account)
  -[:TRANSFERS_TO]->(account3:Account)
  -[:TRANSFERS_TO]->(account1)
WHERE account1.risk_score > 0.5
RETURN account1, account2, account3

Fraudsters form tight circles to move money and obscure origins. Graphs make these patterns obvious — a relational database would require complex self-joins and aggregations.

Knowledge Graphs

E-commerce and search use graphs to understand relationships:

// Recommend products based on graph relationships
MATCH (user:User {id: "user123"})
  -[:VIEWED]->(product:Product)
  -[:SIMILAR_TO]->(similar:Product)
WHERE NOT (user)-[:VIEWED]->(similar)
RETURN similar, count(*) AS recommendation_strength
ORDER BY recommendation_strength DESC

Amazon, Google, and Shopify all use knowledge graphs to power recommendations and search.

Network and Infrastructure Monitoring

// Find critical nodes in network topology
MATCH (router:Router)
RETURN router.id,
       size((router)-[:CONNECTS_TO]->()) AS degree,
       apoc.algo.betweennessCentrality(router) AS criticality
ORDER BY criticality DESC

When to Use and When NOT to Use Graphs

Use Graphs When:

  • Your data is highly relational (5plus levels of relationships matter)
  • You have many-to-many relationships at scale
  • Traversal performance is critical (real-time recommendations, fraud detection)
  • You need to discover patterns across deep paths
  • Your queries often ask “who is connected to whom” rather than “count aggregate values”

Don’t Use Graphs When:

  • You’re primarily doing aggregations (SUM, AVG, GROUP BY)
  • Your relationships are sparse or simple (one-to-many is fine in relational)
  • You need heavy write throughput at extreme scale (millions of writes/second)
  • Your queries are mostly single-entity lookups
  • You need traditional ACID transactions across many shards
  • Your team is unfamiliar with graph thinking (learning curve is real)

Common Mistake

Many teams try to use a single graph database for everything. In reality, successful large-scale systems often use graphs for specific high-value features (recommendations, fraud detection, knowledge graphs) while keeping transactional workloads in relational databases and analytics in data warehouses.

Did you know? Netflix, LinkedIn, and Airbnb all use graph databases extensively, but typically alongside relational and document stores, not instead of them.

Key Takeaways

  • Graphs excel at relationships: Index-free adjacency means traversing relationships is O(1) per hop, not O(n) with joins
  • Choose your model carefully: Labeled property graphs are most common; RDF is for semantic reasoning
  • Scaling is hard: Partitioning graphs is fundamentally more complex than sharding relational data
  • Query language matters: Cypher is intuitive, Gremlin is powerful, SPARQL is semantic—pick based on your use case
  • Algorithms are built-in: Modern graph databases include PageRank, community detection, and shortest path—leverage them
  • It’s a specialist tool: Use graphs for relationship-heavy features, not as a replacement for all data storage

Practice Scenarios

Scenario 1: Real-Time Recommendation Engine

You’re building a recommendation system for an e-commerce platform. You have 100 million products, 50 million users, and billions of viewed/purchased relationships. You need to show “related products” in under 100ms. A relational database’s join performance on this scale is unacceptable.

Design the graph schema. What are your nodes and edges? How would you partition the graph across 10 servers? How do you handle the “thundering herd” problem where popular products get hit with many traversals?

Scenario 2: Detecting Organized Fraud

A payments company sees organized fraud rings where accounts transfer money in circles to hide origins. You need to detect these patterns in real-time on a stream of transactions.

Model this as a graph. What are nodes and edges? Write a query to find tight circular transfer patterns. How do you handle the volume: millions of transactions per second? Would you process in batch or real-time?

Scenario 3: Knowledge Graph for a Multi-Tenant SaaS

You’re building a project management tool. Projects contain tasks; tasks have dependencies, assignees, and related documents. Users need to query: “Show me all tasks on the critical path for projects I own” and “Recommend collaborators based on who worked on similar tasks.”

Design the graph. How do you handle multi-tenancy (ensure users only see their data)? How do you partition? What’s your backup and failover strategy?

Looking Ahead

We’ve explored graph databases as specialized systems for relationship-heavy data. In the next chapter, we’ll shift to the opposite problem: data at rest, designed for analytical queries rather than real-time traversals. Data warehouses and data lakes are built to handle billions of rows and answer questions like “what were our top 10 products this quarter” — completely different from “find me friends of friends.”

The tension between transactional databases (graphs, relational, document) and analytical databases (warehouses, lakes) is one of the foundational architecture decisions you’ll face. Understanding when each excels is crucial to building systems that scale.