System Design Deep Dive – Part 3: Scaling Infrastructure and Distributed Systems

Designing systems that scale across geographies, workloads, and services is at the core of modern software architecture. In this third part of the System Design Deep Dive series, we explore the infrastructure-level strategies and distributed systems patterns that empower you to build resilient, performant, and scalable architectures.

Whether you're preparing for high-stakes interviews or scaling real-world production systems, mastering these concepts—in the right sequence—builds the mental framework you need to design systems that thrive under pressure.

This post dives into:

  • Distributed Caching – Reduce latency and backend load
  • Bloom Filters – Efficiently check existence before querying caches
  • Database Sharding – Horizontally scale your data layer for throughput
  • SAGA Pattern – Coordinate distributed transactions with eventual consistency
  • Service Discovery with Eureka – Enable dynamic detection of services in distributed setups
  • Auto-Scaling with Kubernetes – Automatically scale infrastructure based on load
  • Content Delivery Networks (CDNs) – Speed up static content delivery across geographies

Scaling isn’t just about adding servers—it’s about making smart trade-offs. By the end of this deep dive, you’ll know how to scale, why it matters, and when to use each strategy.


Distributed Caching

Distributed caching is a system where cached data is spread across multiple nodes (servers), enabling applications to access frequently used data with low latency. Unlike single-node caches, distributed caches offer:

  • Scalability – Easily scale horizontally by adding/removing nodes.
  • High Availability – Avoids single points of failure.
  • Consistent Performance – Handles high traffic loads efficiently.

When to Use Distributed Caching

Use distributed caching when:

  • You run multiple application servers needing shared state or session data.
  • You want to reduce database load and improve read performance.
  • You need fast access to frequently used or semi-static data (e.g., user sessions, product catalogs, config values).

Caching Tools

  • Redis – Popular in-memory key-value store with support for persistence and pub/sub.
  • Memcached – Lightweight and fast; lacks persistence or complex data types.
  • Hazelcast / Ehcache / Apache Ignite – JVM-based solutions with built-in clustering support.

Use Case: Product Catalog API

In an e-commerce app with millions of products, the product catalog changes infrequently but is accessed constantly. Caching this can drastically reduce DB load.

Cache Read Example – Spring Boot + Redis

@Service
public class ProductService {

    @Autowired
    private RedisTemplate<String, Product> redisTemplate;

    @Autowired
    private ProductRepository productRepository;

    public Product getProductById(String id) {
        String key = "product:" + id;

        // Step 1: Check Redis cache
        Product product = redisTemplate.opsForValue().get(key);
        if (product != null) return product;

        // Step 2: Fallback to DB
        product = productRepository.findById(id).orElseThrow();

        // Step 3: Cache the result for 30 minutes
        redisTemplate.opsForValue().set(key, product, Duration.ofMinutes(30));
        return product;
    }
}

For a complete caching system design with architecture diagrams and breakdowns, check out 👉 this detailed guide.


Caching Design Considerations

  • Eviction Policy – Use TTL (Time-to-Live) or LRU (Least Recently Used) to manage memory.
  • Cache Invalidation – Ensure updates in the DB are reflected in the cache. Use TTLs, manual evictions, or message queues (e.g., Kafka).
  • Data Consistency – Eventual consistency is often acceptable; strict consistency may add overhead.
  • Partitioning (Sharding) – Use consistent hashing to distribute keys evenly across nodes.
  • Replication – Adds redundancy and fault tolerance, but also complexity.
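
For the Cache Invalidation concern, the simplest pattern is to evict the cached key whenever the record changes in the database. A minimal sketch, reusing the RedisTemplate and repository from the example above (and assuming Product exposes getId()):

public Product updateProduct(Product product) {
    // Write to the source of truth first
    Product saved = productRepository.save(product);

    // Evict the stale entry so the next read repopulates the cache from the DB
    redisTemplate.delete("product:" + saved.getId());
    return saved;
}

For cross-service invalidation, the same eviction can be triggered by a listener on a product-updated event published to a message queue such as Kafka.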


Bloom Filters

A Bloom Filter is a probabilistic data structure used to test whether an element:

  • Might be in a set
  • Or is definitely not in the set

It is super space-efficient and blazing fast, but allows false positives (never false negatives).

Why Use Bloom Filters?

In large-scale systems, memory and latency are precious. Bloom Filters are commonly used to:

  • Avoid expensive cache/database lookups for missing keys
  • Prevent duplicate URL visits in web crawlers
  • Optimize email spam detection, blockchains, databases, and distributed key-value stores

How They Work

  1. A Bloom filter uses a bit array of size m, all initialized to 0.
  2. It uses k independent hash functions.
  3. To insert an element, it's passed through each hash function, and the corresponding bit indices are set to 1.
  4. To check if an element exists, the same hash functions are used. If all corresponding bits are 1, the item might be in the set. If any bit is 0, the item is definitely not in the set.
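
To make the mechanics concrete, here is a minimal, illustrative implementation using a java.util.BitSet and k salted hash functions (not production code; in practice you would use a library such as Guava, shown below):

import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits;
    private final int m; // size of the bit array
    private final int k; // number of hash functions

    public SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    public void add(String element) {
        for (int i = 0; i < k; i++) {
            bits.set(indexFor(element, i)); // set one bit per hash function
        }
    }

    public boolean mightContain(String element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(indexFor(element, i))) {
                return false; // any 0 bit means the element is definitely not in the set
            }
        }
        return true; // all bits set: the element might be in the set
    }

    // Derives the i-th hash by salting the element's hash with the function index
    private int indexFor(String element, int i) {
        return Math.floorMod(element.hashCode() * 31 + i, m);
    }
}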

Use Case: Cache Lookup Reduction

Imagine you have a caching layer (e.g., Redis) for a product catalog with millions of products.

Problem

Querying Redis for non-existent product IDs puts unnecessary pressure on both Redis and the database.

Solution

Use a Bloom Filter as a first-check before hitting Redis.

Workflow

A flow diagram shows a request flow: "Client" sends a request to a "Bloom Filter." If the filter returns "Definitely Not," the request is dropped early. Otherwise, the request proceeds to "Redis" (cache), and if not found there, it continues to the "Database."
Example – Using Google Guava
public class ProductCacheChecker {
    private static final BloomFilter<Integer> productFilter =
            BloomFilter.create(Funnels.integerFunnel(), 1_000_000, 0.01);

    public static void main(String[] args) {
        // Insert product IDs
        productFilter.put(101);
        productFilter.put(102);

        // Check existence
        if (productFilter.mightContain(101)) {
            System.out.println("Might exist → Query cache");
        } else {
            System.out.println("Definitely doesn't exist → Skip cache");
        }
    }
}
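
Wired into the earlier product lookup, the filter becomes a cheap pre-check. A sketch, assuming the filter is populated with all known product IDs at startup (and that the ID is an Integer to match the Guava funnel above):

public Product getProductById(Integer id) {
    // Definitely not present: skip Redis and the database entirely
    if (!productFilter.mightContain(id)) {
        return null;
    }

    // Might be present: fall back to the usual cache-then-DB lookup
    String key = "product:" + id;
    Product product = redisTemplate.opsForValue().get(key);
    if (product != null) return product;

    product = productRepository.findById(id).orElse(null);
    if (product != null) {
        redisTemplate.opsForValue().set(key, product, Duration.ofMinutes(30));
    }
    return product;
}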

Bloom Filter Trade-offs

Pros:

  • Extremely memory-efficient
  • Super fast insert and lookup
  • Ideal for preliminary filtering

Cons:

  • False positives (tunable error rate)
  • Cannot delete entries (in the standard version)
  • Not 100% accurate

Real-World Applications

  • Google Bigtable – Avoid unnecessary disk reads
  • Apache HBase – Efficient row scanning and filtering
  • Cassandra – Improve SSTable access performance

Database Sharding

Database Sharding is a horizontal scaling technique where large datasets are divided across multiple independent databases called shards. Instead of storing everything in one monolithic database, sharding distributes the data—enabling better performance, scalability, and availability.

Think of a phone book split alphabetically across different people—each person handles a smaller, faster-to-search portion.

Why Use Sharding?

  1. Scalability: Handles more data by distributing it across multiple machines.
  2. Performance: Reduces read/write contention and query latency.
  3. Availability: A failure in one shard doesn't affect the others.
  4. Geographical Distribution: Place shards closer to users (geo-sharding).

Sharding Strategies

1. Range-Based Sharding

Data is divided based on value ranges.

Example:

  • Users with userId 1–10,000 → Shard 1
  • Users with userId 10,001–20,000 → Shard 2
int shardId = (userId <= 10000) ? 1 : 2;
DataSource ds = shardManager.getDataSource(shardId);

Pros:

  • Simple to implement
  • Easy to understand and debug

Cons:

  • Risk of data skew and hotspotting (e.g., all new users go to one shard)
  • Not ideal for write-heavy workloads concentrated in one range

2. Hash-Based Sharding

A hash function distributes data based on the sharding key.

int shardId = Math.floorMod(userId.hashCode(), totalShards); // floorMod avoids negative shard IDs
DataSource ds = shardManager.getDataSource(shardId);

Pros:

  • Good uniform data distribution
  • Minimizes risk of hotspots

Cons:

  • Hard to add/remove shards (requires rehashing all keys)
  • Poor locality — range queries are inefficient

3. Geo-Based Sharding

Sharding based on user location or region.

Example:

  • Users in US → Shard A
  • Users in EU → Shard B

Pros:

  • Ideal for latency-sensitive and compliance-heavy applications
  • Supports data residency and GDPR

Cons:

  • Requires complex routing logic
  • Uneven traffic patterns may still lead to hotspots
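
In its simplest form, the routing is just a region-to-shard lookup, though real deployments also have to handle users who move between regions. A sketch (the region codes, regionShards map, and defaultShard are illustrative):

public class GeoShardRouter {
    // Region code -> DataSource for that region's shard, populated from configuration
    private final Map<String, DataSource> regionShards = new HashMap<>();
    private DataSource defaultShard;

    public DataSource getDataSource(String regionCode) {
        // e.g., "US" -> Shard A, "EU" -> Shard B
        return regionShards.getOrDefault(regionCode, defaultShard);
    }
}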

Sharding Architecture Example

For a SaaS platform with millions of users, suppose we shard both User and Order data by userId. A ShardRouter class is used to route requests to the appropriate DB shard.

  • Shard Manager handles routing
  • Each shard has its own DB server (e.g., MySQL)
  • Application queries are routed to the right shard
public class ShardRouter {
    // Maps shardId -> DataSource; populated from configuration at startup
    private final Map<Integer, DataSource> shardMap = new HashMap<>();

    public DataSource getDataSource(int userId) {
        int shardId = userId % 4; // 4 shards, keyed by userId
        return shardMap.get(shardId); // Return DataSource for the shard
    }
}


Sharding Challenges

  • Cross-shard queries – Expensive joins/aggregations; require app-side logic or distributed SQL engines.
  • Rebalancing shards – Adding/removing shards may need data migration; consider consistent hashing.
  • Global indexes – Hard to maintain across distributed data.
  • Foreign key constraints – Cannot be enforced across shards.
  • Operational complexity – Requires shard-aware backups, monitoring, replication, and failover.
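
As noted above, consistent hashing eases rebalancing: when a shard joins or leaves, only the keys adjacent to it on the ring move. A minimal sketch of a hash ring with virtual nodes (the hash function and node naming are illustrative):

import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
    private final SortedMap<Integer, String> ring = new TreeMap<>();
    private static final int VIRTUAL_NODES = 100; // replicas per shard smooth out the distribution

    public void addShard(String shardName) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(shardName + "#" + i), shardName);
        }
    }

    public void removeShard(String shardName) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.remove(hash(shardName + "#" + i));
        }
    }

    public String shardFor(String key) {
        // First virtual node clockwise from the key's position; wrap around if needed
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private int hash(String value) {
        // A real ring would use a stronger hash (e.g., MurmurHash); String.hashCode() keeps the sketch short
        return Math.floorMod(value.hashCode(), Integer.MAX_VALUE);
    }
}

Calling shardFor("user:12345") returns a shard name, which a router like the ShardRouter above can translate into a DataSource.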

Sharding Best Practices

  • Choose a sharding key that is both uniformly distributed and relevant to queries
  • Use a shard manager to centralize routing logic
  • Monitor and rebalance hot shards
  • Plan for data rebalancing early—pre-split where possible
  • Ensure backup and recovery per shard

SAGA Pattern

In modern microservice architectures, maintaining data consistency across distributed services is challenging — especially when traditional distributed transactions (like 2PC) are impractical.

The SAGA Pattern offers a reliable, event-driven alternative. It breaks a business transaction into a series of local transactions, each managed by a service. If a step fails, a compensating action is triggered to undo the partial work, ensuring eventual consistency.


Choreography vs Orchestration

1. Choreography-Based SAGA

  • Each service listens to events and performs local actions.
  • No central coordinator.

Pros:

  • Simple, decentralized
  • Easy to scale

Cons:

  • Harder to track flow
  • Difficult to manage complex workflows

2. Orchestration-Based SAGA

  • A central orchestrator controls the entire transaction flow.
  • Orchestrator invokes each service and handles compensations.

Pros:

  • Better visibility and control
  • Easier to handle complex logic or branching

Cons:

  • Introduces a central point of control (and potential bottleneck)
  • Adds orchestration logic complexity
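
For contrast with the choreography walkthrough that follows, here is a sketch of what a simple orchestrator might look like (the service clients, their method names, and the failure handling are hypothetical):

public class OrderSagaOrchestrator {

    private final OrderClient orderClient;         // hypothetical clients for the three services
    private final PaymentClient paymentClient;
    private final InventoryClient inventoryClient;

    public OrderSagaOrchestrator(OrderClient o, PaymentClient p, InventoryClient i) {
        this.orderClient = o;
        this.paymentClient = p;
        this.inventoryClient = i;
    }

    public void placeOrder(OrderRequest req) {
        String orderId = orderClient.createOrder(req);
        try {
            paymentClient.charge(orderId);
        } catch (RuntimeException e) {
            orderClient.cancelOrder(orderId);   // compensate step 1
            throw e;
        }
        try {
            inventoryClient.deductStock(orderId);
        } catch (RuntimeException e) {
            paymentClient.refund(orderId);      // compensate step 2
            orderClient.cancelOrder(orderId);   // compensate step 1
            throw e;
        }
    }
}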

Use Case: E-Commerce Orders

Let’s walk through a Choreography-based SAGA across three services:

Services Involved:

  • Order Service → Creates the order
  • Payment Service → Charges the customer
  • Inventory Service → Deducts stock

Each step is asynchronous, triggered by events passed through a messaging system (e.g., Kafka).

Step-by-Step SAGA Flow

Step 1: Create Order (Order Service)
  • Action: Save order to DB
  • Emit: OrderCreated event
public void createOrder(OrderRequest req) {
    orderRepository.save(new Order(...));
    kafka.send("order-created", new OrderCreatedEvent(req.getOrderId()));
}

Step 2: Process Payment (Payment Service)

  • Listen: OrderCreated
  • Action: Charge the customer
  • Emit:
    • PaymentProcessed (on success)
    • PaymentFailed (on failure)
@KafkaListener(topics = "order-created")
public void onOrderCreated(OrderCreatedEvent event) {
    try {
        paymentProcessor.charge(event.getOrderId());
        kafka.send("payment-processed", new PaymentProcessedEvent(event.getOrderId()));
    } catch (Exception e) {
        kafka.send("payment-failed", new PaymentFailedEvent(event.getOrderId()));
    }
}

Step 3: Update Inventory (Inventory Service)
  • Listen: PaymentProcessed
  • Action: Deduct inventory
  • Emit:
    • InventoryUpdated (on success)
    • InventoryUpdateFailed (on failure)
@KafkaListener(topics = "payment-processed")
public void onPaymentProcessed(PaymentProcessedEvent event) {
    try {
        inventoryService.deduct(event.getOrderId());
        kafka.send("inventory-updated", new InventoryUpdatedEvent(event.getOrderId()));
    } catch (Exception e) {
        kafka.send("inventory-update-failed", new InventoryUpdateFailedEvent(event.getOrderId()));
    }
}

Compensation Logic

SAGA ensures rollback via compensating actions when a step fails.

Case 1: Payment Failed
  • Handled by: Order Service
  • Action: Cancel the order
@KafkaListener(topics = "payment-failed")
public void onPaymentFailed(PaymentFailedEvent event) {
    orderService.cancel(event.getOrderId());
}

Case 2: Inventory Update Failed
  • Handled by: Payment Service
  • Action: Refund the customer
@KafkaListener(topics = "inventory-update-failed")
public void onInventoryUpdateFailed(InventoryUpdateFailedEvent event) {
    paymentProcessor.refund(event.getOrderId());
}

SAGA Responsibility Matrix

  • Order Service – Listens to: N/A. Action: Create order. Compensating action: Cancel order on payment failure.
  • Payment Service – Listens to: OrderCreated. Action: Charge payment. Compensating action: Refund on inventory failure.
  • Inventory Service – Listens to: PaymentProcessed. Action: Deduct stock. Compensating action: None (not responsible).


SAGA Benefits

  • No distributed locks or 2PC needed
  • Achieves eventual consistency across services
  • Services remain loosely coupled and independently scalable
  • Supports graceful failure recovery with compensating actions

SAGA Challenges & Practices

  • Make local transactions idempotent – Prevents duplication from retries.
  • Design idempotent compensating actions – Avoids over-corrections or double refunds.
  • Store a Saga state log or status – Enables retries, auditing, and recovery.
  • Use correlation IDs – For traceability across services.
  • Use reliable messaging (Kafka, RabbitMQ) – Ensures delivery of critical events.
  • Handle timeouts, retries, and dead-letter queues – Protects against service outages and network errors.
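
As an example of the first two practices, a consumer can record the IDs of events it has already handled and skip redeliveries. A sketch based on the Payment Service listener above (the eventId field and processedEventStore are hypothetical):

@KafkaListener(topics = "order-created")
public void onOrderCreated(OrderCreatedEvent event) {
    // markIfNew returns false if this eventId was seen before (e.g., a redelivered message)
    if (!processedEventStore.markIfNew(event.getEventId())) {
        return; // already handled: charging again would double-bill the customer
    }
    paymentProcessor.charge(event.getOrderId());
    kafka.send("payment-processed", new PaymentProcessedEvent(event.getOrderId()));
}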

Service Discovery with Eureka

In a modern microservices architecture, services are dynamic — often scaling up/down or changing locations due to containerization or cloud deployments. Hardcoding service URLs or IPs is unreliable and breaks flexibility.

Service Discovery solves this by allowing services to locate and communicate with each other at runtime.

What is Eureka?

Eureka is a Service Discovery solution developed by Netflix, and it’s a core part of the Spring Cloud ecosystem.

Eureka Architecture Overview

A service discovery architecture using Eureka, where Order Service, Payment Service, and Inventory Service are all Eureka Clients registered with a central Eureka Server (Service Registry). The Eureka Server facilitates dynamic service registration and discovery, allowing the Order Service to communicate with Payment and Inventory Services through the registry without hardcoding their network locations.

  • Eureka Server – Acts as a registry where services register and query each other.
  • Eureka Client – Registers with the Eureka Server and discovers other registered services.

Eureka Setup

1. Create Eureka Server

Add Dependency in pom.xml:

<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-netflix-eureka-server</artifactId>
</dependency>

Enable Eureka Server:

@SpringBootApplication
@EnableEurekaServer
public class DiscoveryServerApp {
    public static void main(String[] args) {
        SpringApplication.run(DiscoveryServerApp.class, args);
    }
}

application.yml:
server:
  port: 8761

eureka:
  client:
    register-with-eureka: false
    fetch-registry: false

📍 Access Eureka Dashboard: http://localhost:8761

2. Register a Microservice (e.g., Order Service)

Add Eureka Client Dependency:

<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-netflix-eureka-client</artifactId>
</dependency>

application.yml:
spring:
  application:
    name: order-service

eureka:
  client:
    service-url:
      defaultZone: http://localhost:8761/eureka/

When the app starts, it auto-registers with Eureka and appears in the dashboard under order-service.


Inter-Service Communication

If Payment Service wants to call Order Service, it doesn’t need the IP or port. Instead, use service names and let Eureka handle discovery.

RestTemplate Setup with Load Balancer

@Bean
@LoadBalanced
public RestTemplate restTemplate() {
    return new RestTemplate();
}

Call a Service by Name

@Autowired
private RestTemplate restTemplate;

public Order getOrder(String orderId) {
    return restTemplate.getForObject(
            "http://order-service/orders/" + orderId,
            Order.class
    );
}

order-service is resolved by Eureka and load-balanced between instances.
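
If you prefer a declarative client, Spring Cloud OpenFeign resolves the same logical service name through Eureka. A sketch (the /orders/{id} endpoint mirrors the call above; it requires the spring-cloud-starter-openfeign dependency and @EnableFeignClients on the application class):

@FeignClient(name = "order-service")
public interface OrderClient {

    @GetMapping("/orders/{id}")
    Order getOrder(@PathVariable("id") String id);
}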

Health Checks & Self-Healing

  • Eureka clients send heartbeats every 30 seconds.
  • If a service stops heartbeating, it’s marked DOWN and eventually removed.
  • This keeps the registry accurate and resilient.

Eureka Benefits

  • Dynamic Registration – No need to hardcode service URLs or IPs.
  • Load Balancing – Works with Ribbon or Spring Cloud LoadBalancer.
  • Fault Tolerance – Integrates with Hystrix or Resilience4J for graceful fallbacks.
  • Real-time Monitoring – UI dashboard shows registered services and health status.

Eureka Best Practices

  • Use logical and consistent service names (e.g., order-service, inventory-service).
  • Prefer @LoadBalanced RestTemplate or Feign clients for service communication.
  • Enable retries and fallbacks for resilience (e.g., Resilience4J).
  • Monitor service health via Eureka dashboard and metrics.

Auto-Scaling with Kubernetes

Kubernetes Auto-Scaling dynamically adjusts application resources based on demand—helping you maintain performance without over-provisioning.

With auto-scaling, your application:

  • Scales out during high traffic (adds more pods)
  • Scales in during low usage (removes pods)
  • Optimizes cost and infrastructure performance

Types of Auto-Scaling in Kubernetes

  • HPA (Horizontal Pod Autoscaler) – Scales the number of pods based on CPU, memory, or custom metrics.
  • VPA (Vertical Pod Autoscaler) – Adjusts CPU/memory requests and limits for a pod.
  • Cluster Autoscaler (CA) – Adds/removes nodes in the cluster based on pod scheduling needs.

Horizontal Pod Autoscaler (HPA)

Best For:

  • Stateless microservices
  • APIs and frontend/backend services
  • Dynamic workloads

How HPA Works:

  1. Kubernetes monitors metrics like CPU or memory utilization.
  2. If a defined threshold is exceeded, it increases the number of pod replicas.
  3. When usage drops, it scales back down (respecting min/max limits).

Example: HPA Based on CPU Utilization

Step 1: Define a Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      containers:
        - name: web-api
          image: myorg/web-api:latest
          resources:
            requests:
              cpu: "250m"
            limits:
              cpu: "500m"
          ports:
            - containerPort: 8080

Step 2: Define HPA YAML
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60

This configuration tells Kubernetes:

"If average CPU utilization exceeds 60%, scale up (max 10 pods). If it drops below, scale down (min 2 pods)."

Test HPA with Load

You can simulate high traffic using a tool like hey:

hey -n 10000 -c 100 http://<your-service-ip>/api/resource

Monitor HPA behavior:

kubectl get hpa

Vertical Pod Autoscaler (VPA)

Instead of increasing pod count, VPA adjusts the CPU and memory allocated to each pod.

Best For:

  • Machine learning jobs
  • Batch workloads
  • Memory-intensive services

Example: VPA with Auto Mode

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-api-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: web-api
  updatePolicy:
    updateMode: "Auto"

In Auto mode, Kubernetes will restart the pod with adjusted CPU/memory values based on actual usage patterns.


Cluster Autoscaler (CA)

If Kubernetes cannot schedule a pod due to lack of node-level resources, the Cluster Autoscaler adds new nodes to the cluster.

Setup:

  • Enabled at cluster level (GKE, EKS, AKS, etc.)
  • Integrates with cloud provider’s compute autoscaling
  • Automatically scales nodes, not pods

CA only kicks in when a pod is unschedulable due to resource limits.


Auto-Scaling Best Practices

  • Set resource requests/limits – HPA requires these metrics to determine scaling thresholds.
  • Use custom metrics if needed – CPU may not reflect true load (e.g., I/O-bound or async workloads).
  • Monitor with Prometheus/Grafana – Helps visualize metrics and identify bottlenecks.
  • Combine HPA + VPA cautiously – They scale resources differently and may interfere with each other.

Content Delivery Networks

A Content Delivery Network (CDN) is a globally distributed network of edge servers and proxy nodes that cache and serve your web content—such as HTML, CSS, JavaScript, images, and videos—from locations closer to your users.

Rather than relying solely on your origin server, CDNs drastically improve performance by reducing latency, accelerating load times, and increasing reliability.

Why Use a CDN?

  • Faster Delivery – Reduces time-to-first-byte (TTFB) by serving assets from nearby edge nodes.
  • Lower Origin Load – Offloads bandwidth and compute from your origin server.
  • Higher Availability – Automatically reroutes traffic during node failures.
  • Global Reach – Delivers consistent performance worldwide.
  • Built-in Security – TLS termination, DDoS mitigation, WAFs, and bot protection.

How CDNs Work

  1. A user requests a static asset (e.g., an image or .js file).
  2. DNS directs the request to the nearest CDN edge node.
  3. If the asset is cached (cache hit), it's returned instantly.
  4. If not cached (cache miss), the CDN fetches it from your origin, stores it, and delivers it to the user.

A user sends a request that is first handled by a nearby CDN Edge Node (Point of Presence). If the requested content is cached, it is served immediately; otherwise, the request is forwarded to the Origin Server to retrieve the content, which is then cached for future requests.

Key Components

  • PoP (Point of Presence): CDN data centers placed near users (e.g., major cities).
  • Edge Server: Delivers cached content from the closest PoP.
  • Origin Server: Your backend infrastructure, used only when needed.

Use Case: JS via CDN

Instead of hosting jQuery or Bootstrap on your server:

<!-- Load Bootstrap via CDN -->
<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script>
Why this is better:
  • Likely already cached in the user’s browser or ISP’s edge node.
  • Loads faster than pulling from your server.

CDN Caching & Headers

CDNs follow your origin server’s cache headers. Properly setting them ensures efficient cache behavior:

Recommended Headers

  • Cache-Control – e.g., public, max-age=31536000 caches an asset for 1 year.
  • ETag – Enables cache validation without full downloads.
  • Vary – Enables content negotiation (e.g., by language or encoding).
  • Expires – Deprecated but still respected by many CDNs.
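
If your origin is a Spring Boot application, these headers can be set on static assets in a few lines. A sketch (the /static/** path and resource location are illustrative):

@Configuration
public class StaticCachingConfig implements WebMvcConfigurer {

    @Override
    public void addResourceHandlers(ResourceHandlerRegistry registry) {
        // Assets under /static/** are served with "Cache-Control: max-age=31536000, public"
        registry.addResourceHandler("/static/**")
                .addResourceLocations("classpath:/static/")
                .setCacheControl(CacheControl.maxAge(365, TimeUnit.DAYS).cachePublic());
    }
}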

CDN Security Features

Most leading CDNs (e.g., Cloudflare, Akamai, Fastly) offer:

  • TLS Termination – Serve HTTPS from the edge
  • DDoS Mitigation – Detects and blocks volumetric attacks
  • Web Application Firewall (WAF) – Protects against XSS, SQLi, etc.
  • Bot & Rate Limiting – Control abusive traffic or scraping

CDN Best Practices

  • Serve static assets via CDN – Reduces load on your backend and improves performance.
  • Use compression (GZIP/Brotli) – Shrinks file sizes for faster delivery.
  • Version your files (e.g., .v2.js) – Ensures users get the latest assets after deployment.
  • Set long max-age headers – Improves cache longevity and reduces bandwidth.
  • Monitor via analytics – Optimize cache hit rates, regional usage, and user experience.

Final Thoughts

Scaling distributed systems is a multi-dimensional challenge—one that requires not just clever code, but architectural foresight.

By strategically applying techniques like distributed caching, bloom filters, database sharding, and orchestration patterns, you can design systems that scale horizontally, fail gracefully, and perform under real-world pressure.

Every layer—from service discovery and auto-scaling Kubernetes clusters to global CDNs—adds resilience, responsiveness, and robustness to your infrastructure.

Make these strategies a core part of your system design toolkit. When you internalize their purpose and trade-offs, you’re not just solving today’s problems—you’re future-proofing your systems for the demands of tomorrow.


📌 Enjoyed this post? Bookmark it and drop a comment below — your feedback helps keep the content insightful and relevant!
