
Shuffle Sharding - the secret sauce behind AWS reliability

Learn and implement shuffle sharding 🚀


In a typical client-server architecture, the Application Load Balancer forwards each client request to a random application server. When a request is a poison pill, i.e., it crashes or hangs the server that receives it, and the client keeps retrying, all of your application servers will eventually crash or hang one by one, as the retries are randomly forwarded to every server in turn.

Architecture Diagram with Application Load Balancer and eight worker nodes marked unavailable



Will Simple Sharding help?

One way to isolate this problem is to create virtual shards, i.e., to partition clients so that requests from a given client are served only by a specific subset of instances. This mapping is generally done with hashing, e.g., modulo or consistent hashing.
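As a minimal sketch (totalShards here is a hypothetical config value), modulo hashing maps each client to exactly one shard:

// Simple sharding: each client deterministically maps to a single shard
int shardFor(String customerId, int totalShards) {
    // floorMod keeps the result non-negative even when hashCode() is negative
    return Math.floorMod(customerId.hashCode(), totalShards);
}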

In the diagram below, all requests originating from clients whose names start with A or B go to Worker1 and Worker2. So if Alpha's request is a poison pill, only those two servers get impacted. The remaining servers are unaffected, as the requests never reach them.

Do note that Bravo's requests won't be fulfilled either, because both Worker1 and Worker2 have crashed or hung due to Alpha's poison pill request. One dirty fish has poisoned the entire lake.

This strategy has definitely reduced the blast radius, but requests from Bravo are still not being served. Can we do better?

Architecture Diagram with Application Load Balancer, shards and flow of customer to worker



Can increasing the shards help?

The issue with the above approach was the way we created shards: since each instance can be mapped to only one shard, there were simply too few combinations available.

If we allow each instance to be mapped to multiple shards, then we can increase the number of combinations available and reduce our unavailability.

In the diagram below, we created 8 shards to map 8 clients. Previously, we had only 4 shards.

Architecture Diagram with Application Load Balancer, eight shards with two worker nodes per shard, and each worker being assigned to two different shards


Note that each worker is mapped to multiple shards, e.g., Worker2 is mapped to both Shard1 and Shard2.

Now if requests from Alpha crash or hang Worker1 and Worker2, requests from Bravo would still be served: Bravo is mapped to Worker2 and Worker3, and Worker3 is still alive.

Customer Name | Workers
Alpha         | Worker-1 and Worker-2
Bravo         | Worker-2 and Worker-3
Charlie       | Worker-3 and Worker-4
Delta         | Worker-4 and Worker-5
Echo          | Worker-5 and Worker-6
Foxtrot       | Worker-6 and Worker-7
Golf          | Worker-7 and Worker-8
Hotel         | Worker-8 and Worker-1
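The assignment above follows a simple ring pattern: client i gets Worker-i and Worker-(i+1), wrapping around at the end. A minimal sketch that reproduces the table:

public class RingShardDemo {
    public static void main(String[] args) {
        String[] customers = {"Alpha", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf", "Hotel"};
        int n = customers.length;
        for (int i = 0; i < n; i++) {
            // Client i gets Worker-(i+1) and its ring neighbor, wrapping around
            int w1 = i + 1;
            int w2 = (i + 1) % n + 1;
            System.out.printf("%s -> Worker-%d and Worker-%d%n", customers[i], w1, w2);
        }
    }
}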

Let’s Shuffle!

Hope you were able to grasp the fundamentals from the example above. The idea is to create more shards with as little overlap as possible. Let's see how we can make a generic solution.

Given n application servers, if we randomly choose k instances and make them part of one virtual shard, then each shard would have k instances. The probability of 2 shards with 100% overlap would drastically go down as we increase the value of k.

To simplify, if we have 10 cards and we want to choose 4 of them, there are 10 choose 4 = 210 total combinations. If we randomly generate 2 such combinations, then the probability of them having 100% overlap, i.e., all 4 cards being the same, would be 1/210 ≈ 0.47%.

Now to relate this analogy to our problem statement: given 10 application instances, if we randomly choose 4 instances to create each virtual shard, the probability of 2 shards having 100% overlap would be ~0.47%. This indicates that a poison request can fully take out only ~0.47% of the shards. We can further reduce this by increasing the total number of instances or the shard size.
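A quick sketch to verify the arithmetic (plain combinatorics, nothing assumed beyond the numbers above):

public class OverlapProbability {
    // n choose k, computed iteratively; exact for small inputs
    static long choose(int n, int k) {
        long result = 1;
        for (int i = 1; i <= k; i++) {
            result = result * (n - k + i) / i;
        }
        return result;
    }

    public static void main(String[] args) {
        long combinations = choose(10, 4); // 210 possible shards
        double fullOverlap = 100.0 / combinations; // as a percentage
        // Prints: 210 combinations, P(100% overlap) = 0.476%
        System.out.printf("%d combinations, P(100%% overlap) = %.3f%%%n", combinations, fullOverlap);
    }
}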


Talk is cheap, show me the code!

Shuffle sharding can be implemented in two ways — stateless or stateful.

Stateless sharding, as the name indicates, does not persist any state, i.e., shard information, in a data store. It simply derives the target application instances from an identifier; the target instances are the members of the virtual shard mapped to that request.

Stateful sharding goes a bit further and persists the information of the shards to a database, which allows further customization of the way shards are created, e.g., customizing the shard assignment strategy by tuning weights.

Let’s see how we can implement stateless shuffle sharding using a simple strategy of generating multiple hashes from a unique identifier, followed by mapping that hash to a unique node.

import java.util.HashSet;
import java.util.Set;

public class ShuffleShardAssigner {
    // Number of nodes per virtual shard
    private static final int SHARD_SIZE = 4;
    private final int totalNodes;

    public ShuffleShardAssigner(int totalNodes) {
        this.totalNodes = totalNodes;
    }

    public Set<Integer> assignNodes(String customerId) {
        Set<Integer> assignedNodes = new HashSet<>();

        for (int i = 0; i < SHARD_SIZE; i++) {
            // Create a unique input for each node selection
            String hashInput = customerId + ":" + i;
            // Map the hash to a node id
            int nodeId = hashToNodeId(hashInput);

            // Handle collisions by trying the next available node
            while (assignedNodes.contains(nodeId)) {
                nodeId = (nodeId + 1) % totalNodes;
            }
            assignedNodes.add(nodeId);
        }
        return assignedNodes;
    }

    // Maps a hash of the input string to a node id in [0, totalNodes)
    private int hashToNodeId(String input) {
        return Math.floorMod(input.hashCode(), totalNodes);
    }
}

In case multiple unique inputs can't be generated, we can instead leverage multiple hash functions, similar to a Bloom filter, to generate multiple hashes from the same identifier.
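For instance, here is a minimal sketch of a seeded hash that could stand in for hashToNodeId above; the hash itself is illustrative, not a production-grade function:

// One logical "hash function" per seed, Bloom-filter style
private int seededHash(String input, int seed) {
    int h = seed;
    for (int i = 0; i < input.length(); i++) {
        h = 31 * h + input.charAt(i);
    }
    return Math.floorMod(h, totalNodes);
}

assignNodes would then call seededHash(customerId, i) instead of hashing customerId + ":" + i.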


Practical Use Cases

Shuffle Sharding is a powerful and versatile technique used not only by AWS but also in popular open-source projects like Grafana Loki and Grafana Mimir.

In a recent AWS Tech Talk, I learned that shuffle sharding is extensively used in S3 to introduce decorrelation in the system.

Identify the Drive to Store Data

Shuffle Sharding helps randomly select a drive to store the data for your bucket, instead of directly mapping drives to buckets.

In order to ensure balanced disk utilization, a very clever technique known as the power of two random choices is used.

The rule of two random choices is: randomly pick two drives and choose the one with the lower disk utilization. This avoids having to scan all the drives in the system and maintain information about each drive's utilization.
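A minimal sketch of the rule, where diskUtilization is a hypothetical stand-in for a real utilization metric:

import java.util.concurrent.ThreadLocalRandom;

public class TwoRandomChoices {
    // Hypothetical stand-in: fraction of the drive currently in use
    static double diskUtilization(int drive) {
        return ThreadLocalRandom.current().nextDouble();
    }

    // Pick two distinct drives at random and keep the less utilized one
    static int pickDrive(int totalDrives) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        int a = rnd.nextInt(totalDrives);
        int b = rnd.nextInt(totalDrives);
        while (b == a) {
            b = rnd.nextInt(totalDrives);
        }
        return diskUtilization(a) <= diskUtilization(b) ? a : b;
    }
}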

Resolve DNS Queries

Shuffle Sharding is also used to resolve DNS queries. For example, when you look up s3.amazonaws.com or mybucket.s3.amazonaws.com, the DNS query returns multiple answers.

Bucket requests can randomly go to any server. It doesn't matter which server the request goes to because the buckets aren't tied to any specific server.

Reduce Latencies

Shuffle Sharding is also used in AWS CRT (the Common Runtime library shared across AWS offerings) to improve latencies. CRT dynamically tracks latency distributions and cancels any request that goes beyond the p95 latency, retrying it so the request goes to another host. This is gambling, but it has paid off.

Since S3 leverages Erasure Coding to store shards of the data, Shuffle Sharding is also used to reduce latencies by cancelling requests going to slow shards and retrying them so they go to another shard.
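Here's a rough sketch of that cancel-and-retry pattern; the primary/backup suppliers and the p95 budget are hypothetical inputs, and a real implementation would track the latency distribution continuously:

import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class HedgedRequest {
    // Try the primary host; if it exceeds the p95 budget, cancel and retry elsewhere
    static <T> T fetch(Supplier<T> primaryHost, Supplier<T> backupHost, long p95Millis)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            Future<T> attempt = pool.submit(primaryHost::get);
            try {
                return attempt.get(p95Millis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException slow) {
                attempt.cancel(true);     // abandon the slow host...
                return backupHost.get();  // ...and gamble on another one
            }
        } finally {
            pool.shutdownNow();
        }
    }
}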


Thank you for reading. Hope you learnt something new. If you have any questions, please do comment.

Appendix