System Design Fundamentals

Service Discovery

The Problem: Finding Your Way in a Dynamic World

Imagine you’re building an e-commerce platform with microservices. Your order service needs to call the inventory service to check stock levels before confirming a purchase. In a monolithic system, this is simple—you call a function within the same process. But in microservices, the inventory service lives on a different host, accessible only via HTTP requests to an IP address and port.

Here’s the catch: that IP address and port are constantly changing. Containers crash and restart on new machines. Auto-scaling spins up three new instances to handle load, then scales back down. A deployment pushes a new version to different hosts. Service instances are ephemeral—they live and die unpredictably.

You could hardcode the inventory service’s URL into the order service’s configuration. But the moment that instance dies or moves, your requests start failing. You’d need manual intervention to update configuration files and redeploy the order service. This brittle approach quickly becomes a nightmare at scale.

Service discovery solves this fundamental problem: how do services dynamically find and communicate with each other when their locations constantly change?

What Is Service Discovery?

Service discovery is a runtime mechanism that maintains a living, breathing registry of service instances and their network locations. It answers one essential question: “Which instances of service X are currently running and healthy?”

The system consists of three parts:

  1. The Service Registry — A database or distributed system that stores information about available service instances (hostname, IP address, port, health status, metadata). This is the source of truth.

  2. Service Registration — The mechanism by which services add themselves to the registry when they start, and remove themselves when they shut down gracefully. Services also send regular heartbeats to prove they’re still alive.

  3. Service Discovery — The mechanism by which clients find and use service instances. When a client needs to make a request, it queries the registry, gets back a list of healthy instances, and picks one to contact.

Two fundamental patterns govern how this works: client-side discovery and server-side discovery.

Client-Side Discovery Pattern

In client-side discovery, the client itself is responsible for querying the registry and selecting an instance.

Flow:

  1. Client asks the registry: “Give me all healthy instances of the inventory service”
  2. Registry returns a list: [10.0.1.5:8080, 10.0.2.3:8080, 10.0.3.1:8080]
  3. Client uses a load-balancing algorithm (round-robin, random, least-connections) to pick one
  4. Client makes the request directly to that instance

Advantage: Clients have full control and can implement sophisticated load-balancing logic tailored to their needs.

Disadvantage: Discovery logic is distributed—every client must implement it. If you change your load-balancing strategy, you update every client.
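
To make the flow concrete, here is a minimal sketch of the client-side selection step. The ServiceRegistry interface is a hypothetical stand-in for whatever registry client you actually use (Eureka, Consul, and so on):

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical registry client; in practice this would be Eureka's DiscoveryClient,
// Consul's catalog API, or similar.
interface ServiceRegistry {
    List<String> healthyInstances(String serviceName); // e.g. ["10.0.1.5:8080", ...]
}

class RoundRobinClient {
    private final ServiceRegistry registry;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinClient(ServiceRegistry registry) {
        this.registry = registry;
    }

    // Steps 1-3 of the flow: query the registry, then pick an instance round-robin.
    String resolve(String serviceName) {
        List<String> instances = registry.healthyInstances(serviceName);
        if (instances.isEmpty()) {
            throw new IllegalStateException("No healthy instances of " + serviceName);
        }
        int index = Math.floorMod(counter.getAndIncrement(), instances.size());
        return instances.get(index); // step 4: the caller sends the request here directly
    }
}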

Server-Side Discovery Pattern

In server-side discovery, clients make requests to a known load balancer or router, which queries the registry on their behalf.

Flow:

  1. Client makes a request to a stable endpoint: inventory-service.internal:8080
  2. The load balancer intercepts the request and queries the registry: “Where are the instances of inventory service?”
  3. Load balancer picks an instance and forwards the request
  4. Response flows back through the load balancer to the client

Advantage: Clients are simple—they just call a stable endpoint. The infrastructure handles discovery.

Disadvantage: The load balancer becomes a potential single point of failure and a bottleneck.

An Analogy: The Phone Directory

Think of service discovery like a phone directory:

  • The Registry is the directory itself—the source listing who’s available and their current number.
  • Client-side discovery is like looking up a contact’s number yourself and dialing them directly. You have control, but you need to know how to dial and handle busy signals.
  • Server-side discovery is like calling a receptionist (load balancer) who looks up the number in the directory and patches you through. Simpler for you, but the receptionist is now essential.
  • Self-registration is like adding your own contact information to the directory as soon as you get a new phone number.
  • Third-party registration is like an admin who automatically adds your details whenever they detect you’ve joined the company.

How Services Register Themselves

Self-Registration Pattern

Services register themselves. When a service instance starts, it connects to the registry and says: “I’m here, running on 10.0.1.5:8080, and I’m healthy.”

Services maintain a heartbeat—periodic pings to the registry that say, “I’m still alive.” If a service crashes without gracefully deregistering, the registry eventually notices the missing heartbeat and marks that instance as dead.

Example: Netflix Eureka, Consul (with agent), Kubernetes kubelet registering nodes.
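
As a rough sketch of what self-registration and heartbeating involve, the following assumes a hypothetical registry exposing plain HTTP endpoints /register and /heartbeat/{id}; real registries such as Eureka or Consul wrap this in their own client libraries:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical registry API: POST /register on startup, PUT /heartbeat/{id} periodically.
public class SelfRegistration {
    private static final HttpClient HTTP = HttpClient.newHttpClient();
    private static final String REGISTRY = "http://registry.internal:8500"; // assumed address

    public static void main(String[] args) throws Exception {
        String instanceId = "inventory-10.0.1.5-8080";

        // 1. Register: "I'm here, running on 10.0.1.5:8080, and I'm healthy."
        HttpRequest register = HttpRequest.newBuilder()
                .uri(URI.create(REGISTRY + "/register"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"id\":\"" + instanceId + "\",\"host\":\"10.0.1.5\",\"port\":8080}"))
                .build();
        HTTP.send(register, HttpResponse.BodyHandlers.discarding());

        // 2. Heartbeat every 10 seconds; if these stop, the registry eventually evicts us.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                HttpRequest beat = HttpRequest.newBuilder()
                        .uri(URI.create(REGISTRY + "/heartbeat/" + instanceId))
                        .PUT(HttpRequest.BodyPublishers.noBody())
                        .build();
                HTTP.send(beat, HttpResponse.BodyHandlers.discarding());
            } catch (Exception e) {
                // A real client would log, retry, and possibly re-register here.
            }
        }, 10, 10, TimeUnit.SECONDS);
    }
}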

Third-Party Registration Pattern

An external agent watches for new service instances (by monitoring container orchestration, configuration, or infrastructure events) and registers them automatically.

When a new pod starts in Kubernetes, the kubelet on that node reports its status to the Kubernetes API. The control plane's endpoints controller then adds the pod to the endpoint list of any matching Service. Clients find the service through Kubernetes DNS.

Example: Kubernetes control plane, AWS ECS service discovery, HashiCorp Nomad.

Technical Deep Dive

Client-Side Discovery: Netflix Eureka

Eureka is a service registry popular in Spring Boot and Netflix ecosystems.

How it works:

  1. Registration: Services register with the Eureka server on startup. The service sends its hostname, port, and health check URL.
// Eureka client configuration (Spring Boot)
@EnableEurekaClient
@SpringBootApplication
public class OrderServiceApplication {
    public static void main(String[] args) {
        SpringApplication.run(OrderServiceApplication.class, args);
    }
}

# application.yml
spring:
  application:
    name: order-service
eureka:
  client:
    serviceUrl:
      defaultZone: http://eureka-server:8761/eureka/
  instance:
    hostname: ${hostname:localhost}
    leaseRenewalIntervalInSeconds: 10
    leaseExpirationDurationInSeconds: 30

  2. Heartbeats: Every 10 seconds (configurable), the client sends a heartbeat to Eureka. If Eureka doesn’t receive a heartbeat for 30 seconds, it evicts the instance.

  3. Client caching: Clients fetch the registry every 30 seconds and cache it locally. Even if Eureka is temporarily unavailable, clients keep working with cached data.

  4. Service resolution: When the order service needs to call the inventory service:

@Service
public class OrderService {
    @Autowired
    private DiscoveryClient discoveryClient;

    @Autowired
    private RestTemplate restTemplate; // plain RestTemplate bean; the host is resolved manually below

    public void checkInventory(String itemId) {
        List<ServiceInstance> instances = discoveryClient.getInstances("inventory-service");
        if (instances.isEmpty()) {
            throw new ServiceUnavailableException("No inventory service instances available");
        }

        ServiceInstance instance = instances.get(0); // Simple pick; in production use load balancing
        String url = String.format("http://%s:%d/check", instance.getHost(), instance.getPort());
        restTemplate.getForObject(url, InventoryResponse.class);
    }
}

Or more elegantly with Spring Cloud Load Balancer:

@Service
public class OrderService {
    @Autowired
    private RestTemplate restTemplate; // declared elsewhere as a @LoadBalanced @Bean so service names resolve

    public void checkInventory(String itemId) {
        // RestTemplate with @LoadBalanced automatically resolves "inventory-service"
        String response = restTemplate.getForObject("http://inventory-service/check", String.class);
    }
}

Server-Side Discovery: Kubernetes Services

Kubernetes embeds service discovery in the platform. When you define a Service, Kubernetes automatically:

  1. Assigns it a stable IP address (ClusterIP)
  2. Creates DNS entries (e.g., inventory-service.default.svc.cluster.local)
  3. Configures kube-proxy on every node to route traffic to healthy pods
# Service definition
apiVersion: v1
kind: Service
metadata:
  name: inventory-service
spec:
  selector:
    app: inventory
  ports:
    - port: 8080
      targetPort: 8080
      name: http
  type: ClusterIP
---
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inventory-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inventory
  template:
    metadata:
      labels:
        app: inventory
    spec:
      containers:
      - name: inventory
        image: myregistry/inventory:v1.2
        ports:
        - containerPort: 8080
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5

How it works under the hood:

  • Endpoint tracking: Kubernetes watches the pods matching the Service’s label selector. Each pod gets an IP. These IPs are stored in an Endpoints resource.
  • DNS: CoreDNS watches Service objects and creates DNS A records. inventory-service resolves to the ClusterIP (10.0.0.50).
  • Traffic routing: kube-proxy on each node watches Endpoints and configures iptables (or IPVS) rules. When traffic hits the ClusterIP, iptables rewrites the packet to go directly to a pod IP.
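
The pod IPs tracked in the Endpoints resource can be inspected directly with kubectl; the output below is illustrative (pod IPs and age will differ in your cluster):

$ kubectl get endpoints inventory-service
NAME                ENDPOINTS                                          AGE
inventory-service   10.244.1.12:8080,10.244.2.7:8080,10.244.3.9:8080   5m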

From the order service’s perspective:

// Just use the service name as the hostname—Kubernetes handles the rest
String response = restTemplate.getForObject("http://inventory-service:8080/check", String.class);

Service Mesh: Istio

Service meshes like Istio abstract discovery entirely. You still define Services, but Istio’s sidecar proxies (Envoy) intercept all traffic and handle discovery transparently.

When the order service makes a request to inventory service:

Order Service → Envoy Sidecar (using endpoint data pushed from the Istio control plane) → Envoy picks an instance → Inventory Pod

You write the same code as Kubernetes, but Istio gives you advanced features: traffic splitting, circuit breaking, retries, and observability—all without touching your application code.
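
As an illustration of the traffic-splitting feature, a VirtualService that sends 90% of inventory traffic to v1 and 10% to v2 looks roughly like this (a sketch; the v1/v2 subsets would be defined in a companion DestinationRule):

# VirtualService sketch: 90% of traffic to subset v1, 10% to subset v2
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: inventory-service
spec:
  hosts:
    - inventory-service
  http:
    - route:
        - destination:
            host: inventory-service
            subset: v1
          weight: 90
        - destination:
            host: inventory-service
            subset: v2
          weight: 10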

DNS-Based Discovery

Many systems use DNS as the service registry. Services register with DNS, clients query DNS to resolve service names.

Consul DNS Example:

# Order service queries Consul DNS
$ dig inventory-service.service.consul
; <<>> DiG 9.11.3-1ubuntu1.8-Ubuntu <<>> inventory-service.service.consul
; (1 server found)
;; ANSWER SECTION:
inventory-service.service.consul. 0 IN A 10.0.1.5
inventory-service.service.consul. 0 IN A 10.0.2.3
inventory-service.service.consul. 0 IN A 10.0.3.1

Kubernetes DNS:

$ dig inventory-service.default.svc.cluster.local
inventory-service.default.svc.cluster.local. 30 IN A 10.0.0.50

Pro Tip: DNS is simple but has a challenge—TTL (Time To Live). If a client caches a DNS result for 60 seconds, it won’t learn about new instances or dead instances for that duration. This is why Kubernetes uses headless services for stateful applications—clients discover individual pod IPs directly, not the ClusterIP.
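
For reference, a headless Service is simply a Service with clusterIP set to None; DNS then returns one A record per ready pod instead of a single virtual IP. A sketch based on the inventory example:

# Headless service: no ClusterIP; DNS returns one A record per ready pod
apiVersion: v1
kind: Service
metadata:
  name: inventory-service-headless
spec:
  clusterIP: None
  selector:
    app: inventory
  ports:
    - port: 8080
      name: http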

Health Checks and Failure Detection

The registry must know which instances are healthy. Three approaches:

  1. Heartbeat/TTL: Client sends periodic heartbeats. After TTL expires without a heartbeat, the instance is considered dead.

    • Simple but slow to detect failures (typically 20-60 seconds)
    • Used by Eureka, Consul, etcd
  2. HTTP Health Checks: The registry (or a health check service) periodically makes HTTP requests to a health endpoint.

    • Detects failures faster (5-10 seconds)
    • Used by Kubernetes (liveness and readiness probes), ECS, and cloud load balancers
  3. TCP Checks: The registry tries to open a TCP connection. If it succeeds, the instance is assumed healthy.

    • Very fast but less accurate—a running process might not be handling requests correctly
# Kubernetes health probes
livenessProbe:                    # Is the container alive?
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3

readinessProbe:                   # Is it ready to accept traffic?
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 2
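
These probes assume the service exposes /health and /ready endpoints. A minimal sketch of what those handlers might look like (in a real Spring Boot service you would more likely expose Actuator's built-in health endpoints):

import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class HealthController {

    // Liveness: the process is up and able to respond at all.
    @GetMapping("/health")
    public ResponseEntity<String> health() {
        return ResponseEntity.ok("UP");
    }

    // Readiness: dependencies are reachable, so it is safe to route traffic here.
    @GetMapping("/ready")
    public ResponseEntity<String> ready() {
        boolean dependenciesOk = checkDependencies(); // hypothetical dependency check
        return dependenciesOk
                ? ResponseEntity.ok("READY")
                : ResponseEntity.status(503).body("NOT_READY");
    }

    private boolean checkDependencies() {
        return true; // placeholder; check connection pools, downstream health, etc.
    }
}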

Real-World Patterns in Action

Pattern 1: Eureka with Spring Boot (Client-Side Discovery)

You’re running microservices on VMs or simple containers (not Kubernetes).

Setup:

  • Central Eureka server (usually HA with multiple nodes)
  • Each service registers itself
  • Clients fetch and cache the registry locally, then consult the cached copy when making requests

Characteristics:

  • Mature ecosystem with good Spring Boot integration
  • Heartbeat-based failure detection; intervals are configurable, though detection is slower than active health checks
  • Clients bear the load of discovery logic
  • Works well on-premises and in hybrid clouds

Pattern 2: Kubernetes (Server-Side Discovery Built-In)

You’re running on Kubernetes. Service discovery is provided by the platform.

Setup:

  • Define Services that match pods via labels
  • Kubernetes automatically manages Endpoints
  • CoreDNS resolves service names
  • kube-proxy routes traffic

Characteristics:

  • Zero additional infrastructure—discovery is free
  • DNS-based, works for everything (not just HTTP)
  • Tight integration with deployment, scaling, and observability
  • Ideal if you’re already on Kubernetes

Pattern 3: Service Mesh (Transparent Discovery)

You’re on Kubernetes and want advanced traffic management without changing your code.

Setup:

  • Install Istio (or Linkerd)
  • Inject sidecar proxies into pods
  • Define VirtualServices and DestinationRules to control traffic
  • The service mesh handles discovery and routing

Characteristics:

  • Transparent—application code unchanged
  • Advanced features: traffic splitting, circuit breaking, retry policies, mutual TLS
  • Additional operational complexity and resource overhead
  • Best when you need sophisticated traffic control

Pattern 4: Consul (Flexible, Multi-Datacenter)

You need service discovery across multiple datacenters or cloud providers.

Consul service registration:

service {
  id      = "order-service-1"
  name    = "order-service"
  address = "10.0.1.5"
  port    = 8080

  check {
    id       = "order-service-health"
    name     = "HTTP Health Check"
    http     = "http://10.0.1.5:8080/health"
    interval = "10s"
    timeout  = "5s"
  }
}
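
This definition is handed to the local Consul agent, for example (assuming the file is saved as order-service.hcl):

$ consul services register order-service.hcl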

Clients resolve services via Consul DNS or the HTTP API:

$ curl http://localhost:8500/v1/catalog/service/order-service
[
  {
    "ID": "order-service-1",
    "Node": "node-1",
    "Address": "10.0.1.5",
    "Port": 8080,
    "ServicePort": 8080,
    "ServiceTags": ["primary"],
    "Checks": [{"Status": "passing"}]
  }
]

Comparing Approaches: A Decision Framework

Aspect | Eureka | Kubernetes | Service Mesh | Consul
---|---|---|---|---
Setup Complexity | Moderate | Low (if already on K8s) | High | Moderate-High
Discovery Type | Client-side | Server-side (DNS + kube-proxy) | Server-side (sidecar proxies) | Both options
Failure Detection | Heartbeat-based, tunable | Health probes | Health probes | Heartbeat or checks
Multi-datacenter | Limited | Not built-in | Requires federation | Native
Additional Traffic | Light (periodic heartbeat) | DNS queries + iptables | Sidecar proxies (overhead) | Health checks
Operational Load | Manage Eureka cluster | Built-in to K8s | Learn Istio, troubleshoot proxies | Manage Consul cluster
Best For | Spring Boot on VMs/Docker | Kubernetes clusters | Advanced traffic control on K8s | On-premises, hybrid, multi-cloud

Key Challenges and Trade-Offs

Consistency vs. Availability

The registry is a distributed system and faces the CAP theorem trade-off:

  • CP (Consistent, Partition-tolerant): Consul, etcd. During a network partition, the side without quorum rejects writes (and, for consistent reads, reads as well) until quorum is restored. Safer but less available.
  • AP (Available, Partition-tolerant): Eureka. During partitions, nodes operate independently. Clients might get stale data, but the system keeps working. More available but eventually consistent.

Which to choose? For discovery, availability usually matters more than strict consistency: clients can tolerate slightly stale registry entries for a few seconds, whereas a registry that refuses requests during a partition can block all service-to-service calls. This is why Eureka deliberately chose AP semantics; strictly consistent registries still work, but their guarantees are rarely what discovery actually needs.

DNS TTL Caching Challenges

DNS is elegant but has a subtle issue: caching.

When an application queries DNS for inventory-service, the response includes a TTL (e.g., 30 seconds). The operating system or application might cache this for the full TTL, delaying discovery of new instances.

Solutions:

  • Use short TTLs (5-10 seconds)
  • Use headless services or direct pod IP discovery
  • In application code, query the registry directly rather than relying solely on DNS caching
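
One concrete lever for JVM-based clients is the JVM's own DNS cache, which is separate from the OS cache and applies its own policy rather than the record's TTL; it can be capped at startup:

import java.security.Security;

public class DnsCacheConfig {
    public static void main(String[] args) {
        // Cap the JVM's positive DNS cache at 5 seconds so clients notice
        // new or removed service instances quickly. Set this before any lookups;
        // it can also be set in the JVM's java.security configuration file.
        Security.setProperty("networkaddress.cache.ttl", "5");

        // Failed lookups are cached too; keep that window short as well.
        Security.setProperty("networkaddress.cache.negative.ttl", "1");
    }
}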

Added Infrastructure Complexity

Every discovery pattern adds operational complexity:

  • Eureka: You manage a Eureka cluster, monitor it for failures, handle failover.
  • Kubernetes: The control plane is complex, but it’s one less thing to manage separately.
  • Consul: Full cluster management, monitoring, backup strategies.
  • Service Mesh: Sidecar proxy lifecycle, control plane upgrades, new failure modes to understand.

Pro Tip: Before choosing a discovery solution, ask: “What infrastructure are we already running?” If you’re on Kubernetes, you already have discovery—use it. If you’re on VMs, start with Eureka or Consul.

Load-Balancing Strategies

Client-side discovery requires implementing load balancing in your code:

  • Round-robin: Simplest, but ignores server load
  • Least connections: Favors less-loaded servers
  • Weighted: Some instances get more traffic
  • Ring hash: Consistent hashing for session stickiness

Server-side discovery (load balancer) handles this, but you’re now dependent on the load balancer’s algorithms.

Recommendation: Use client-side load-balancing libraries (Spring Cloud Load Balancer, Ribbon) in client-side discovery, or let the service mesh handle it transparently.
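
For illustration only (a library will usually do this better), a hand-rolled weighted selection might look like the sketch below, with made-up instance addresses and weights:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

class WeightedPicker {
    // Hypothetical instances with weights, e.g. larger machines receive more traffic.
    private final Map<String, Integer> weights = new LinkedHashMap<>();

    WeightedPicker() {
        weights.put("10.0.1.5:8080", 5);
        weights.put("10.0.2.3:8080", 3);
        weights.put("10.0.3.1:8080", 2);
    }

    String pick() {
        int total = weights.values().stream().mapToInt(Integer::intValue).sum();
        int roll = ThreadLocalRandom.current().nextInt(total); // 0 .. total-1
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            roll -= e.getValue();
            if (roll < 0) {
                return e.getKey(); // chosen with probability proportional to its weight
            }
        }
        throw new IllegalStateException("unreachable when total > 0");
    }
}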

Key Takeaways

  • Service discovery is mandatory in microservices. Without it, services can’t find each other dynamically as they scale, deploy, and fail.

  • Client-side discovery (Eureka, Consul) puts control in the client but requires each client to implement discovery logic. Server-side discovery (Kubernetes, load balancers) simplifies clients but centralizes routing.

  • Kubernetes provides service discovery for free. If you’re using Kubernetes, leverage it—Service resources, DNS, and kube-proxy give you a solid foundation.

  • Service mesh (Istio, Linkerd) abstracts discovery and provides advanced traffic control, but adds operational complexity. Use it when you need sophisticated features like traffic splitting or mutual TLS.

  • Health checks are critical. Without them, the registry includes dead instances, causing client failures. Use appropriate health check mechanisms (HTTP probes, heartbeats, TCP checks).

  • Consistency vs. availability trade-off: Registries that favor availability (Eureka's AP model, or Consul configured to allow stale reads) are typically a better fit for microservice discovery than strictly consistent setups, because briefly stale data is easier to tolerate than an unavailable registry.

Practice Scenarios

Scenario 1: Scaling Under Load

You run an order service that queries inventory during peak traffic. The inventory service currently runs 2 instances. Load increases 10x, and you auto-scale inventory to 10 instances over 5 minutes.

  • How does your choice of discovery mechanism affect this scaling event?
  • If you use client-side discovery with Eureka and 30-second registry refresh intervals, how long before all order instances know about the new inventory instances?
  • If you use Kubernetes, how quickly does traffic start reaching the new pods?
  • What are the implications for your auto-scaling configuration?

Scenario 2: Partial Failure in the Registry

Your Eureka cluster has 3 nodes. A network partition splits the cluster: 2 nodes on one side and 1 on the other, and the isolated node can't reach the other two.

  • If Eureka uses AP (available, partition-tolerant) semantics, what happens to clients connected to each side?
  • If you were using Consul (CP semantics) instead, would this be better or worse?
  • How does Kubernetes handle this scenario?

Scenario 3: Mixed Environments

You have some services on Kubernetes and others on VMs. You need them to discover each other.

  • What discovery mechanism do you use?
  • If you choose Consul, how do VM-based services register?
  • If you choose Kubernetes, how do VM services integrate with Service objects?
  • What are the operational trade-offs?

Connection to API Gateway

Service discovery solves the internal problem: how services find each other. But external clients (web browsers, mobile apps) can’t call internal services directly. They need a single entry point—an API Gateway—that routes requests to the appropriate backend services.

The API Gateway itself uses service discovery to locate backend services. It’s the first step in the request path for external traffic, and it relies entirely on the discovery mechanisms we’ve explored here. In the next section, we’ll see how gateways leverage service discovery to provide a facade for internal microservices.