<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Distributed Systems]]></title><description><![CDATA[Distributed Systems]]></description><link>https://snehasishroy.com</link><generator>RSS for Node</generator><lastBuildDate>Tue, 12 May 2026 12:35:25 GMT</lastBuildDate><atom:link href="https://snehasishroy.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[The String That Should Never Have Existed]]></title><description><![CDATA[Somewhere in a production system, a random string was born.
It looked innocent. A transaction ID embedded in a metric name. A session token baked into a log field. A request hash used as an analytics ]]></description><link>https://snehasishroy.com/the-string-that-should-never-have-existed</link><guid isPermaLink="true">https://snehasishroy.com/the-string-that-should-never-have-existed</guid><category><![CDATA[distributed system]]></category><category><![CDATA[System Design]]></category><category><![CDATA[System Architecture]]></category><category><![CDATA[interview]]></category><category><![CDATA[computerscience]]></category><dc:creator><![CDATA[Snehasish Roy]]></dc:creator><pubDate>Sat, 04 Apr 2026 19:04:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/64e9d89567deab13f465f320/655923f5-cf78-4a6b-8c6b-5c4e31e504b6.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Somewhere in a production system, a random string was born.</p>
<p>It looked innocent. A transaction ID embedded in a metric name. A session token baked into a log field. A request hash used as an analytics event name. The system accepted it, stored it, and indexed it without question.</p>
<p>Nobody ever queried it. Nobody ever would. But it kept arriving. Millions of them. Each one adding weight to a system that was never designed to carry it.</p>
<p>This is the story of random string pollution. And one way to stop it before it gets in.</p>
<hr />
<h2>The Problem Repeats Itself</h2>
<p>Random strings cause the same class of damage across different parts of your stack. The pattern is always the same. A developer instruments something with a unique identifier baked into a key or a name. The system treats every unique value as a new entry. Data explodes. Nothing useful gets added.</p>
<p>Here is where you have probably seen this play out.</p>
<p><strong>Metrics pipelines.</strong> Your service emits <code>api.response.txn_a3f9b2c1.duration</code> for every transaction. One million transactions a day means one million new time series today alone. Your monitoring system slows down. Storage balloons. Nobody ever queries a dashboard by a specific transaction hash.</p>
<p><strong>Log indexing.</strong> Your application ships structured logs to Elasticsearch. A developer adds a field called <code>request_7f3a9b2c</code> to capture per request context. Elasticsearch creates a new field mapping for every unique key it sees. Your index mapping explodes into thousands of dynamic fields. Queries slow to a crawl. Storage costs spike. The fields never appear in any search.</p>
<p><strong>Distributed tracing.</strong> Your tracing system records span names like <code>process.job_4e8d1a3f.execute</code> for background jobs. Every job run produces a new span name. Your trace search index fills up with spans that share no common name to group or filter by. Aggregations become meaningless. Finding slow jobs requires scrolling through thousands of one off entries.</p>
<p><strong>Analytics event tracking.</strong> Your frontend fires events named <code>checkout.session_9c2f7b1a.started</code> for every user session. Your analytics platform stores each event name as a distinct category. Reports fragment into noise. No analyst can aggregate checkout behavior across sessions because every session has a different event name.</p>
<p>The root cause is the same in every case. A random string got embedded in something that should only contain human readable words. The system did not know the difference.</p>
<hr />
<h2>The First Instinct</h2>
<p>You write rules. You know what random identifiers look like. UUIDs, hex strings, base58 encoded tokens. You write a regex for each and drop it at the ingestion boundary.</p>
<pre><code class="language-plaintext">[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}

[0-9a-f]{12,}
</code></pre>
<p>For a while this works. The obvious offenders get caught. Things calm down.</p>
<p>Then someone switches to a shorter ID format. Or they move to base62 encoding. Or they start using numeric sequences that look nothing like hex. A new pattern slips through every few weeks. Another rule gets added. The blocklist grows. Someone eventually adds a comment that says "do not touch this, nobody knows why it is here."</p>
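<p>To make the gap concrete, here is a minimal Java sketch of the two rules above. The class name <code>RegexBlocklist</code> and the sample metric names are made up for illustration; only the two patterns come from the blocklist itself.</p>
<pre><code class="language-java">import java.util.List;
import java.util.regex.Pattern;

class RegexBlocklist {
    // The two rules from the blocklist above: a UUID and any run of 12+ hex characters.
    private static final List&lt;Pattern&gt; BLOCKED = List.of(
            Pattern.compile("[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"),
            Pattern.compile("[0-9a-f]{12,}"));

    // True if any blocked pattern occurs anywhere in the name.
    static boolean isBlocked(String name) {
        return BLOCKED.stream().anyMatch(p -&gt; p.matcher(name).find());
    }
}
</code></pre>
<p>A name such as <code>cache.0123456789ab.hit</code> is caught by the second rule. The eight character ID in <code>api.response.txn_a3f9b2c1.duration</code> is too short for it, and a base62 token such as <code>session_Zk9Qw2LmX7Pb</code> never forms a long enough hex run, so both pass straight through.</p>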
<p>You are patching holes in a leaking pipe. What you need is a different pipe.</p>
<hr />
<h2>A Different Question</h2>
<p>Instead of asking "does this match a known bad pattern," ask a better question.</p>
<p>Does this string look like something a human would write?</p>
<p>Human readable strings like <code>api.response.duration</code>, <code>checkout.started</code>, or <code>process.job.execute</code> come from natural language. Letters in natural language are not random. They follow patterns. In English, <code>th</code>, <code>er</code>, and <code>in</code> are extremely common consecutive letter pairs. <code>xq</code>, <code>zk</code>, and <code>vj</code> almost never appear in real words.</p>
<p>Random strings like UUIDs and hashes ignore these patterns. Their characters distribute more evenly and more chaotically. You can measure that difference. And once you can measure it, you can act on it.</p>
<hr />
<h2>How It Works</h2>
<p>A bigram is a pair of consecutive characters. In the word <code>metric</code>, the bigrams are <code>me</code>, <code>et</code>, <code>tr</code>, <code>ri</code>, and <code>ic</code>. There are 676 possible bigrams across the lowercase alphabet.</p>
<p><strong>Build a probability map from real English.</strong> Take an English dictionary or a large text corpus. For every word, extract all its bigrams. Count how often each of the 676 possible pairs appears across all words. Normalize the counts to sit between 0 and 1. What you get is a map that tells you how likely any given letter pair is to appear in real English text.</p>
<pre><code class="language-plaintext">"th" -&gt; 0.92
"er" -&gt; 0.87
"xq" -&gt; 0.01
"zk" -&gt; 0.00
</code></pre>
<p>You compute this map once, write it to a file, and load it into memory at startup. It is 676 entries. A few kilobytes. It never changes.</p>
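<p>A rough sketch of that one time build step in Java could look like the following. The class name <code>BigramMapBuilder</code> is hypothetical, and dividing every count by the largest count is just one assumed way to get values between 0 and 1; any other normalization that preserves the ordering would serve the same purpose.</p>
<pre><code class="language-java">import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class BigramMapBuilder {
    // Counts every pair of consecutive lowercase letters in the word list,
    // then divides each count by the largest count (assumed normalization).
    static Map&lt;String, Double&gt; build(List&lt;String&gt; words) {
        Map&lt;String, Integer&gt; counts = new HashMap&lt;&gt;();
        for (String word : words) {
            String w = word.toLowerCase(Locale.ROOT);
            for (int i = 0; i + 1 &lt; w.length(); i++) {
                char a = w.charAt(i), b = w.charAt(i + 1);
                if (isLower(a) &amp;&amp; isLower(b)) {
                    counts.merge("" + a + b, 1, Integer::sum);
                }
            }
        }
        int max = counts.values().stream().mapToInt(Integer::intValue).max().orElse(1);
        Map&lt;String, Double&gt; scores = new HashMap&lt;&gt;();
        counts.forEach((bigram, count) -&gt; scores.put(bigram, (double) count / max));
        return scores;
    }

    private static boolean isLower(char c) {
        return c &gt;= 'a' &amp;&amp; c &lt;= 'z';
    }
}
</code></pre>
<p>Pairs that never occur in the corpus are simply absent from the map, so a lookup should treat a missing key as a score of 0.</p>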
<p><strong>Score incoming strings.</strong> When a string arrives, tokenize it by splitting on <code>.</code>, <code>_</code>, and other delimiters to get individual segments. For each segment, extract its bigrams and look up each one in the map. Any bigram that scores below a per-bigram threshold counts as random looking. Then compare the number of random looking bigrams with the number of normal looking ones across all tokens. If the random share is too high, the string fails.</p>
<pre><code class="language-plaintext">Input: api.response.txn_a3f9b2c1.duration

Tokens: api, response, txn, a3f9b2c1, duration

Bigrams from "a3f9b2c1" (letters only: a, f, b, c):
  "af" -&gt; 0.04   random
  "fb" -&gt; 0.01   random
  "bc" -&gt; 0.02   random

Bigrams from "response":
  "re" -&gt; 0.81   normal
  "es" -&gt; 0.74   normal
  "sp" -&gt; 0.55   normal

Verdict: random bigrams exceed threshold, reject
</code></pre>
<p><strong>Act on the result.</strong> You reject the input, emit a detection event, and surface which services or clients are the biggest contributors. You make the threshold configurable so you can tune sensitivity without touching rule definitions.</p>
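<p>Put together, the scoring step might be sketched like this in Java. Everything here is an assumption made for illustration: the class name <code>RandomStringDetector</code>, the delimiter set, and the choice of two knobs (a per-bigram threshold and a maximum share of random looking bigrams). The digits are dropped before pairing letters, as in the walkthrough above.</p>
<pre><code class="language-java">import java.util.Locale;
import java.util.Map;

class RandomStringDetector {
    private final Map&lt;String, Double&gt; bigramScores;
    private final double bigramThreshold; // below this, a bigram counts as random looking
    private final double maxRandomRatio;  // above this share of random bigrams, reject

    RandomStringDetector(Map&lt;String, Double&gt; bigramScores,
                         double bigramThreshold, double maxRandomRatio) {
        this.bigramScores = bigramScores;
        this.bigramThreshold = bigramThreshold;
        this.maxRandomRatio = maxRandomRatio;
    }

    // True when the share of random looking bigrams across all tokens is too high.
    boolean looksRandom(String name) {
        int random = 0;
        int total = 0;
        for (String token : name.toLowerCase(Locale.ROOT).split("[._\\-/:]+")) {
            String letters = token.replaceAll("[^a-z]", ""); // keep letters only
            for (int i = 0; i + 1 &lt; letters.length(); i++) {
                total++;
                double score = bigramScores.getOrDefault(letters.substring(i, i + 2), 0.0);
                if (score &lt; bigramThreshold) {
                    random++;
                }
            }
        }
        return total &gt; 0 &amp;&amp; (double) random / total &gt; maxRandomRatio;
    }
}
</code></pre>
<p>Both thresholds are constructor arguments, so each ingestion point can carry its own sensitivity while sharing the same map.</p>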
<hr />
<h2>Why This Works Across Your Stack</h2>
<p>You can apply this check anywhere in your system that accepts a string key or name from an external source.</p>
<p>Drop it at your metrics ingestion layer and stop cardinality explosion before it touches your time series database. Drop it at your log pipeline and prevent dynamic field mapping from blowing up your Elasticsearch index. Drop it at your tracing collector and keep your span names human readable and groupable. Drop it at your analytics ingest endpoint and keep your event taxonomy clean enough to actually report on.</p>
<p>The logic is the same in every case. The map does not change. Only the strings being checked are different.</p>
<p>A regex blocklist says "I know exactly what bad looks like." The bigram approach says "I know what readable looks like, and I reject what does not match." That position holds up as your system grows and as the patterns of abuse change over time.</p>
<p>The configurable threshold matters too. Not every string in your system is a clean English word. Abbreviations, domain specific terms, and short tokens sometimes score lower than you expect. You tune the threshold once per use case and move on.</p>
<hr />
<h2>The Gate Holds</h2>
<p>Back to that string. <code>api.response.txn_a3f9b2c1.duration</code>.</p>
<p>It arrives at your ingestion layer. Your system tokenizes the name, extracts bigrams, and runs the probabilities. The random looking pairs dominate. A detection event fires. The string goes nowhere.</p>
<p>Its siblings follow. Thousands of them across your metrics pipeline, your log shipper, your tracing collector, your analytics endpoint. Each one scoring poorly. None of them getting through. Your storage stays predictable. Your indexes stay clean. Your dashboards load.</p>
<p>The services sending them have no idea. That is fine. Someone will notice the detection events and investigate. But the damage stopped the moment it would have started.</p>
<hr />
]]></content:encoded></item><item><title><![CDATA[Introducing AtomDB - A Strongly Consistent, Embedded KeyValue Database built from scratch]]></title><description><![CDATA[Over the past 1 year, I was working on an itch to build something - not to just vibe code - but to actually understand the primitives and build something useful. I thought, what better way to understa]]></description><link>https://snehasishroy.com/introducing-atomdb-a-strongly-consistent-embedded-keyvalue-database-built-from-scratch</link><guid isPermaLink="true">https://snehasishroy.com/introducing-atomdb-a-strongly-consistent-embedded-keyvalue-database-built-from-scratch</guid><category><![CDATA[System Design]]></category><category><![CDATA[Databases]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[technology]]></category><category><![CDATA[apache]]></category><dc:creator><![CDATA[Snehasish Roy]]></dc:creator><pubDate>Sat, 21 Mar 2026 11:57:48 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/64e9d89567deab13f465f320/767d3717-a47b-4392-accc-d9501f5a6b06.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Over the past 1 year, I was working on an itch to build something - not to just vibe code - but to actually understand the primitives and build something useful. I thought, what better way to understand Distributed Systems than by building a Database? So, I started building one!</p>
<p>What was the motivation? Most engineers have felt this pain at least once. You are building a service that needs a strongly consistent, distributed configuration store. You reach for Redis, Etcd, or ZooKeeper — and then spend the next two weeks onboarding a new dependency, wiring up a separate deployment, managing its credentials, and debugging. The data requirements are modest - a few dozen keys, each a small string. The operational weight is disproportionate.</p>
<p>What if the store was just <em>part of your service</em>? Not a sidecar, not a separate container — but a library you integrate into your application. That is exactly the problem AtomDB was built to solve.</p>
<p>AtomDB is a distributed key-value store built on top of <strong>Apache Ratis</strong>, the Java implementation of the Raft consensus protocol. It ships as a <strong>Dropwizard bundle</strong> or <strong>a Standard Java Client</strong>, which can start a fully replicated Raft group within the same JVM process. There is no separate binary to run, no extra Dockerfile, and no extra deploy step. The cluster forms itself, heals itself, and can even grow and shrink automatically while your service is live.</p>
<p>This post is the engineering narrative behind that system: how it works, why it is designed the way it is, and what the code looks like from the inside.</p>
<hr />
<h2>Embedded Model</h2>
<p>Traditional key-value stores like Redis or Etcd are designed as an external cluster which you connect to over a network. Your service is a <em>client</em> of that external system. That model carries a hidden tax - every read or write crosses a network boundary, the store must be independently deployed and monitored, and your service's availability now depends on the health of an entirely separate system.</p>
<p>AtomDB flips the model. Instead of your service being a client of an external store, <strong>the store is embedded inside your service</strong>. Each instance of your application carries its own Raft node. Together those replicas form a Raft cluster and maintain a shared, consistent state without ever leaving the JVM process.</p>
<p>The practical consequence: every instance of your service participates in the cluster. When you scale your service from 3 pods to 5, two new Raft peers come online automatically. When you deploy your service, you deploy the store too.</p>
<p>Registering the AtomDB bundle in your application looks exactly like adding any other Dropwizard bundle.</p>
<pre><code class="language-java">bootstrap.addBundle(new AtomDBBundle&lt;YourConfiguration&gt;() {

    @Override
    public AtomDBBundleConfig getAtomDBBundleConfig(YourConfiguration config) {
        return config.getAtomDBBundleConfig();
    }

    @Override
    public int getApplicationPort(YourConfiguration config) {
        DefaultServerFactory serverFactory = (DefaultServerFactory) config.getServerFactory();
        for (ConnectorFactory connector : serverFactory.getApplicationConnectors()) {
            if (connector instanceof HttpConnectorFactory httpConnector) {
                return httpConnector.getPort();
            }
        }
        return -1;
    }
});
</code></pre>
<p>That single <code>addBundle</code> call wires together the Raft group, the state machine, the HTTP operators, and a Feign-based client — all from the YAML configuration your service already reads at startup.</p>
<p>If you are not using Dropwizard, you can use the provided <code>atomdb-client</code> to integrate AtomDB into your Java application.</p>
<hr />
<h2>Raft Consensus: Why Every Write Needs Majority Agreement</h2>
<p>Before understanding how AtomDB stores a value, you need to understand Raft's core guarantee. Raft is a consensus algorithm that ensures a cluster of nodes agrees on a single, ordered log of operations. In a cluster of <code>N</code> nodes, every write must be <strong>acknowledged by a strict majority (</strong><code>N/2 + 1</code> <strong>nodes) before it is considered committed</strong>. This majority is called the quorum.</p>
<p>For a 3-node AtomDB cluster, that means at least 2 nodes must confirm every <code>PUT</code> before the client sees a success response. For 5 nodes, at least 3 must confirm. Note that if you have a cluster of 100 Dropwizard applications, you don't have to run a 100 node AtomDB cluster. You can form a smaller 3 node cluster where your metadata can reside.</p>
<img src="https://cdn.hashnode.com/uploads/covers/64e9d89567deab13f465f320/0fa7ead0-e464-401b-bc27-7d5886f16d92.png" alt="" style="display:block;margin:0 auto" />

<blockquote>
<p>Source: <a href="https://www.mydistributed.systems/2021/04/raft.html">https://www.mydistributed.systems/2021/04/raft.html</a></p>
</blockquote>
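<p>The quorum arithmetic is small enough to write down. This is a standalone sketch with a hypothetical class name, not code from AtomDB; it relies on Java integer division rounding down.</p>
<pre><code class="language-java">class Quorum {
    // Smallest number of nodes that forms a strict majority of n.
    static int majority(int n) {
        return n / 2 + 1;
    }

    // Number of nodes that can be down while a majority is still reachable.
    static int tolerableFailures(int n) {
        return n - majority(n);
    }
}
</code></pre>
<p>Note that a 4-node cluster needs 3 acknowledgements and still tolerates only one failure, the same as a 3-node cluster, which is why odd cluster sizes are the usual choice.</p>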
<h3>Reads Route to the Leader</h3>
<p>In AtomDB, <strong>reads (</strong><code>GET</code><strong>) are always routed to the current Raft leader</strong>. This is deliberate. Because the leader is the node that manages log replication, it always has the most up-to-date committed state. By routing reads through the leader, AtomDB guarantees <strong>read-after-write consistency</strong>: if your client successfully commits a <code>PUT</code>, any subsequent <code>GET</code> on any node will observe that value.</p>
<p>This is markedly different from eventually-consistent systems where a write to Node 1 might not be immediately visible on a <code>GET</code> to Node 2. In AtomDB, consistency is a hard guarantee, not a probabilistic one.</p>
<p>The Ratis <code>query()</code> path is non-replicated — it answers directly from the leader's in-memory state without creating a log entry, keeping reads fast while preserving the consistency guarantee.</p>
<pre><code class="language-java">// In KeyValueStateMachine.java
@Override
public CompletableFuture&lt;Message&gt; query(Message request) {
    final String logData = request.getContent().toStringUtf8();
    final String[] parts = logData.split(":", 3);

    if ("GET".equals(parts[0]) &amp;&amp; parts.length &gt;= 2) {
        final String key = parts[1];
        final String value = keyValueStore.getOrDefault(key, "NOT_FOUND");
        return CompletableFuture.completedFuture(Message.valueOf(value));
    }
    return CompletableFuture.completedFuture(Message.valueOf("INVALID_QUERY"));
}
</code></pre>
<hr />
<h2>The State Machine: Writes Are Applied Only After Quorum</h2>
<p>The Raft log is not the data store — it is the <strong>ordered record of operations</strong> that transform the data store. The actual key-value pairs live in the <code>KeyValueStateMachine</code>, a class that extends Ratis's <code>BaseStateMachine</code>. Crucially, <code>applyTransaction</code> is called by the Raft runtime only after the log entry has been replicated to a majority of nodes and committed. Your in-memory <code>ConcurrentHashMap</code> is never touched by a write that hasn't cleared quorum.</p>
<pre><code class="language-java">// In KeyValueStateMachine.java
private final ConcurrentHashMap&lt;String, String&gt; keyValueStore = new ConcurrentHashMap&lt;&gt;();

@Override
public CompletableFuture&lt;Message&gt; applyTransaction(TransactionContext trx) {
    final RaftProtos.LogEntryProto entry = trx.getLogEntry();
    final String logData = entry.getStateMachineLogEntry()
            .getLogData().toStringUtf8();

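    // Parse command: PUT:key:value (the limit of 3 keeps any ':' inside the value)
    final String[] parts = logData.split(":", 3);
    // Default reply for unrecognised commands; this literal is an assumption that mirrors query()
    Message reply = Message.valueOf("INVALID_COMMAND");
    if ("PUT".equals(parts[0]) &amp;&amp; parts.length == 3) {
        keyValueStore.put(parts[1], parts[2]);
        reply = Message.valueOf("OK");
    }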

    // This is the contract with Ratis: record the highest applied index
    updateLastAppliedTermIndex(entry.getTerm(), entry.getIndex());
    return CompletableFuture.completedFuture(reply);
}
</code></pre>
<p>The call to <code>updateLastAppliedTermIndex</code> at the end is not optional. It tells Raft which log entry this node has processed, so that in the event of a crash the runtime knows where to resume replay. Forget that call and you break the replay invariant — the node will re-apply entries it already applied, potentially corrupting state.</p>
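<p>One detail of the <code>PUT:key:value</code> format parsed in <code>applyTransaction</code> is worth spelling out: the limit of 3 passed to <code>split</code> keeps any colons inside the value, but a colon inside the key would shift the boundary. The sketch below uses a hypothetical <code>CommandCodec</code> helper to show both sides. How AtomDB's real client encodes commands is not shown in this post, so treat the encoder as an assumption.</p>
<pre><code class="language-java">class CommandCodec {
    // Hypothetical encoder: builds the PUT:key:value payload and rejects
    // keys that the state machine would split in the wrong place.
    static String encodePut(String key, String value) {
        if (key.contains(":")) {
            throw new IllegalArgumentException("key must not contain ':'");
        }
        return "PUT:" + key + ":" + value;
    }

    // Same split call the state machine uses: at most 3 parts,
    // so everything after the second ':' stays in the value.
    static String[] decode(String logData) {
        return logData.split(":", 3);
    }
}
</code></pre>
<p>With this, a value such as <code>jdbc:mysql://host:3306/app</code> survives a round trip intact, while a raw payload like <code>PUT:a:b:c</code> decodes to key <code>a</code> and value <code>b:c</code>.</p>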
<p>Think of it like <strong>Kafka KRaft</strong>, which replaced ZooKeeper with a Raft-based metadata quorum. Every broker metadata change (partition assignment, ISR list update, config change) is a log entry. The brokers' in-memory metadata is applied only after entries commit. <code>applyTransaction</code> in AtomDB is the direct counterpart of KRaft's metadata record application — and <code>takeSnapshot</code> maps cleanly onto KRaft's metadata snapshot mechanism.</p>
<img src="https://cdn.hashnode.com/uploads/covers/64e9d89567deab13f465f320/96e63d4d-5926-4654-a1be-30b9ad30f609.png" alt="" style="display:block;margin:0 auto" />

<hr />
<h2>The Control Plane: Dynamic Peer Management Without Manual Intervention</h2>
<p>Raft itself is a consensus protocol, not an operations framework. It will faithfully replicate log entries once you tell it who the peers are — but it has no notion of <em>go discover three more nodes on these ports and add them to the cluster</em>. That operational intelligence is AtomDB's <strong>control plane</strong>.</p>
<h3>Bootstrap: Who Goes First?</h3>
<p>When multiple nodes start simultaneously, exactly one of them must bootstrap the Raft group; the others must join it. Choosing the wrong bootstrapper (or having two nodes both bootstrap) corrupts the group. AtomDB resolves this with a deterministic election inside <code>ClusterManager</code>:</p>
<ol>
<li><p><code>clusterService.awaitDiscovery()</code> blocks until at least <code>memberSize</code> peers are reachable on their admin ports (via TCP probe).</p>
</li>
<li><p>From the list of reachable peers, sorted deterministically, the <strong>first node in the list</strong> becomes the candidate bootstrapper.</p>
</li>
<li><p>Before bootstrapping, the candidate probes all other peers' HTTP <code>/cluster/v1/peers</code> endpoints with exponential-backoff retries. If any peer returns a non-empty follower list, an existing Raft cluster is detected and this node stands down.</p>
</li>
</ol>
<pre><code class="language-java">// In ClusterManager.java — the safety check before bootstrapping
private boolean isCandidateNode(List&lt;DiscoveryNode&gt; candidateNodes) {
    // Only the first node in the sorted list may bootstrap
    Optional&lt;DiscoveryNode&gt; firstNode = candidateNodes.stream().findFirst();
    String localNodeId = clusterService.getDiscovery().getLocalInstanceInfo().split(":")[0];

    if (!firstNode.map(n -&gt; n.getNodeId().equals(localNodeId)).orElse(false)) {
        return false; // Not first: wait for leader contact
    }

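    // First node: probe all others for an existing cluster before starting
    // (assumed here: "others" means every candidate after the first in the sorted list)
    List&lt;DiscoveryNode&gt; otherCandidates = candidateNodes.stream().skip(1).toList();
    boolean clusterExists = detectExistingCluster(otherCandidates);
    return !clusterExists; // Only bootstrap if no existing cluster found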
}
</code></pre>
<p>Non-candidate nodes start with an <strong>empty Raft group</strong> — <code>RaftGroup.valueOf(raftGroupId, List.of())</code>. This is intentional and follows Ratis's membership-change protocol. An empty-group start means the node participates in no configuration until the leader explicitly adds it via <code>setConfiguration()</code>.</p>
<h3>Adding Nodes: The <code>AddListenerTask</code></h3>
<p>After the initial cluster forms, all subsequent membership changes are driven by a background task called <code>AddListenerTask</code>. It runs on the leader every 10 seconds and implements the following decision tree:</p>
<pre><code class="language-plaintext">AddListenerTask runs (leader only)
        │
        ├── Are any followers unreachable?
        │       │
        │       ├── YES: Is majority quorum still intact?
        │       │       ├── NO  → Log error, do nothing (unsafe to reconfigure)
        │       │       └── YES → Promote available listeners to followers (if flag enabled)
        │       │                 OR find a spare node, POST /ratis/v1/start on it,
        │       │                 then call setConfiguration() adding it as a follower
        │       │
        │       └── NO: Are any listeners unreachable?
        │               ├── YES → Replace with a spare node as listener
        │               └── NO  → Is follower count &lt; expectedMemberSize?
        │                           └── YES → Add spare node as follower
        │                               NO  → Is listener count &lt; maxListeners?
        │                                       └── YES → Add spare node as listener
        │                                           NO  → Cluster at capacity, do nothing
</code></pre>
<p>The quorum math is deliberately based on <code>expectedMemberSize</code> (from config), not the current number of running followers. This is a Raft safety requirement: the quorum threshold must be stable, not a moving target that changes every time a node disappears.</p>
<pre><code class="language-java">// In AddListenerTask.java — quorum safety check
int majorityQuorum = (expectedMemberSize / 2) + 1;

if (reachableFollowers.size() &lt; majorityQuorum) {
    log.error("CRITICAL: Majority quorum lost! Reachable followers: {}, Required: {}",
            reachableFollowers.size(), majorityQuorum);
    // Do nothing — unsafe to change configuration without quorum
    return;
}
</code></pre>
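<p>The decision tree and the quorum check can be condensed into a single pure function, which makes the branches easy to test in isolation. This is a simplified sketch, not AtomDB's actual <code>AddListenerTask</code>: the class, enum, and parameter names are invented, and it assumes a spare node is always available when one is needed.</p>
<pre><code class="language-java">class MembershipPlanner {
    enum Action {
        QUORUM_LOST_DO_NOTHING, PROMOTE_LISTENER, ADD_SPARE_AS_FOLLOWER,
        REPLACE_LISTENER, ADD_SPARE_AS_LISTENER, AT_CAPACITY
    }

    static Action decide(int reachableFollowers, int unreachableFollowers,
                         int reachableListeners, int unreachableListeners,
                         int expectedMemberSize, int maxListeners,
                         boolean promoteListeners) {
        // Quorum is derived from the configured size, not the live follower count.
        int majorityQuorum = expectedMemberSize / 2 + 1;

        if (unreachableFollowers &gt; 0) {
            if (reachableFollowers &lt; majorityQuorum) {
                return Action.QUORUM_LOST_DO_NOTHING; // unsafe to reconfigure
            }
            if (promoteListeners &amp;&amp; reachableListeners &gt; 0) {
                return Action.PROMOTE_LISTENER;
            }
            return Action.ADD_SPARE_AS_FOLLOWER;
        }
        if (unreachableListeners &gt; 0) {
            return Action.REPLACE_LISTENER;
        }
        if (reachableFollowers &lt; expectedMemberSize) {
            return Action.ADD_SPARE_AS_FOLLOWER;
        }
        if (reachableListeners &lt; maxListeners) {
            return Action.ADD_SPARE_AS_LISTENER;
        }
        return Action.AT_CAPACITY;
    }
}
</code></pre>
<p>For a cluster configured with three followers and one listener, losing one follower yields <code>ADD_SPARE_AS_FOLLOWER</code> when promotion is disabled and <code>PROMOTE_LISTENER</code> when it is enabled. Losing two followers yields <code>QUORUM_LOST_DO_NOTHING</code>.</p>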
<p>When a spare node is promoted, AtomDB uses a two-step process. First, it POSTs to the spare node's <code>/ratis/v1/start</code> endpoint to start the Raft server process on that node. Then it calls <code>client.admin().setConfiguration()</code> through the Raft protocol to officially add the peer to the group. The new node joins with an empty group and learns the current log from the leader through the normal Raft log replication mechanism.</p>
<h3>The Peer String Format</h3>
<p>All discovery configuration encodes nodes in a compact string format:</p>
<pre><code class="language-plaintext">nodeId:host:raftPort:appPort:adminPort
n1:0.0.0.0:9000:8080:8081
</code></pre>
<ul>
<li><p><strong>raftPort (9000)</strong>: used by Ratis gRPC for log replication between peers</p>
</li>
<li><p><strong>appPort (8080)</strong>: used by the leader's HTTP client to call <code>/ratis/v1/start</code> on spare nodes</p>
</li>
<li><p><strong>adminPort (8081)</strong>: pinged during <code>awaitDiscovery()</code> to check liveness</p>
</li>
</ul>
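<p>Parsing that format is a matter of splitting on colons. The record below is an illustrative sketch with a made up name, <code>PeerInfo</code>; AtomDB's own parser is not shown in this post. It assumes the host part contains no colons, so a raw IPv6 address would need a different encoding.</p>
<pre><code class="language-java">record PeerInfo(String nodeId, String host, int raftPort, int appPort, int adminPort) {

    // Parses nodeId:host:raftPort:appPort:adminPort into its five fields.
    static PeerInfo parse(String peer) {
        String[] parts = peer.split(":");
        if (parts.length != 5) {
            throw new IllegalArgumentException(
                    "expected nodeId:host:raftPort:appPort:adminPort, got: " + peer);
        }
        return new PeerInfo(parts[0], parts[1],
                Integer.parseInt(parts[2]),
                Integer.parseInt(parts[3]),
                Integer.parseInt(parts[4]));
    }
}
</code></pre>
<p>Calling <code>PeerInfo.parse("n1:0.0.0.0:9000:8080:8081")</code> gives a record whose <code>raftPort()</code> is 9000, <code>appPort()</code> is 8080, and <code>adminPort()</code> is 8081.</p>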
<hr />
<h2>Listeners: A Non-Voting Standby Mechanism</h2>
<p>In Ratis, a listener is a non-voting member of the group. It receives all log entries from the leader and maintains a current copy of the state, but it does not count toward the quorum. This has a useful operational property: a listener can be added to the cluster without changing the quorum size, making it a zero-risk standby.</p>
<p>AtomDB uses listeners as a first tier of high-availability:</p>
<table>
<thead>
<tr>
<th>Config Field</th>
<th>Effect</th>
</tr>
</thead>
<tbody><tr>
<td><code>maxListeners: 1</code></td>
<td>Keep one spare node as a listener at all times</td>
</tr>
<tr>
<td><code>enablePromotionOfListenersToFollowers</code></td>
<td>When <code>true</code>, promote a healthy listener to replace a failed follower</td>
</tr>
</tbody></table>
<p>With <code>enablePromotionOfListenersToFollowers: false</code> (the safe default), the leader will never silently swap a follower for a listener — any such promotion is a deliberate operator action. With it set to <code>true</code>, recovery from a single-follower failure becomes fully automatic, in a matter of seconds.</p>
<pre><code class="language-java">// In AddListenerTask.java — listener promotion gate
boolean enableListenerPromotion = clusterService.getClusterHealthStrategy()
        .isEnablePromotionOfListenersToFollowers();

if (enableListenerPromotion &amp;&amp; !currentListeners.isEmpty()) {
    int promotionCount = Math.min(followersNeeded, currentListeners.size());
    for (int i = 0; i &lt; promotionCount; i++) {
        RaftPeer listenerToPromote = currentListeners.get(i);
        RaftPeer promotedFollower = RaftPeer.newBuilder()
                .setId(listenerToPromote.getId())
                .setAddress(listenerToPromote.getAddress())
                .setStartupRole(RaftPeerRole.FOLLOWER)
                .build();
        currentFollowers.add(promotedFollower);
    }
}
</code></pre>
<p>A concrete scenario with a 5-node cluster (<code>memberSize: 3</code>, <code>maxListeners: 1</code>):</p>
<pre><code class="language-plaintext">Initial state:   n1(follower) n2(follower) n3(follower) n4(listener) n5(spare, not started)

n2 goes down:
  AddListenerTask detects unreachable follower n2
  Quorum check: 2 reachable followers ≥ majorityQuorum(2) — safe to proceed
  enablePromotionOfListenersToFollowers: false → skip listener promotion
  Find spare node n5 → POST /ratis/v1/start to n5
  Call setConfiguration([n1, n3, n5], [n4])

Result:          n1(follower) n3(follower) n5(follower) n4(listener)
                 Quorum restored. n2 can rejoin later as a follower.
</code></pre>
<hr />
<h2>Snapshot Durability: Surviving a Full Cluster Restart</h2>
<p>An in-memory state machine is convenient but fragile. If all nodes lose their disks simultaneously — or if a containerised deployment recycles all pods at once — every committed write is gone. To address this, AtomDB supports uploading snapshots to <strong>S3-compatible object storage</strong>.</p>
<h3>Taking a Snapshot</h3>
<p>A snapshot serialises the entire <code>ConcurrentHashMap</code> to a local file, computes its MD5, and then uploads both the snapshot and its checksum to S3 under a configurable prefix.</p>
<pre><code class="language-java">// In KeyValueStateMachine.java
@Override
public long takeSnapshot() throws IOException {
    final TermIndex last = getLastAppliedTermIndex();

    // 1. Serialise the in-memory map to a local file
    File snapshotFile = s3Storage.getSnapshotFile(last.getTerm(), last.getIndex());
    try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(snapshotFile))) {
        out.writeObject(new ConcurrentHashMap&lt;&gt;(keyValueStore));
    }

    // 2. Compute and save an MD5 checksum alongside the snapshot
    MD5Hash md5 = MD5FileUtil.computeAndSaveMd5ForFile(snapshotFile);

    // 3. Upload the snapshot file to S3 at the configured prefix
    s3Storage.uploadSnapshotToS3(last.getTerm(), last.getIndex(), snapshotFile);
    s3Storage.updateLatestSnapshot(new SingleFileSnapshotInfo(...));

    return last.getIndex();
}
</code></pre>
<p>The snapshot file name encodes the Raft <code>term</code> and <code>index</code> at which it was taken (e.g. <code>snapshot.4_1892</code>). On startup, <code>loadSnapshot</code> downloads the latest snapshot from S3 before the Raft log replay begins. The new node catches up to the snapshot's index instantly, and then only needs to replay the delta between the snapshot and the current log head — which is typically small.</p>
<pre><code class="language-yaml"># Minimal S3 storage config in your YAML
storageConfig:
  type: S3
  bucketName: "my-service-atomdb"
  region: "us-east-1"
  accessKey: "your-access-key"
  secretKey: "unused"                        # token-based auth
  endpoint: "https://your-s3-endpoint"
  pathStyleAccess: true
  snapshotPrefix: "my-service-snapshots-prod/"  # unique per service+env
</code></pre>
<p>The <code>snapshotPrefix</code> is the single most important knob to get right in production: use a unique value per service <em>and</em> per environment. Sharing a prefix between staging and production is the fastest path to a corrupted cluster after an accidental cross-environment snapshot download.</p>
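<p>Because the snapshot file name carries the term and index, picking the newest snapshot under a prefix can be done from the names alone. The following sketch assumes names of the form <code>snapshot.&lt;term&gt;_&lt;index&gt;</code> as in the example above. The class <code>SnapshotNames</code> and the rule "highest index wins" are my own assumptions, not a description of how AtomDB's <code>loadSnapshot</code> chooses.</p>
<pre><code class="language-java">import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SnapshotNames {
    record SnapshotId(long term, long index) {}

    private static final Pattern NAME = Pattern.compile("snapshot\\.(\\d+)_(\\d+)");

    // Extracts term and index from a name like snapshot.4_1892, if it matches.
    static Optional&lt;SnapshotId&gt; parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new SnapshotId(Long.parseLong(m.group(1)), Long.parseLong(m.group(2))));
    }

    // Picks the well-formed name with the highest index, ignoring anything else.
    static Optional&lt;String&gt; latest(List&lt;String&gt; fileNames) {
        return fileNames.stream()
                .filter(n -&gt; parse(n).isPresent())
                .max(Comparator.comparingLong(n -&gt; parse(n).get().index()));
    }
}
</code></pre>
<p>Names that do not match the pattern, such as a checksum file stored next to the snapshot, are skipped rather than treated as errors.</p>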
<hr />
<h2>Integration Tests: Real Servers, Real Raft</h2>
<p>Unit tests that mock Raft are not very useful, because the interesting bugs live in timing, log replication races, and node failure/recovery sequences. AtomDB's integration tests spin up <strong>real in-process Dropwizard servers on dynamic ports</strong> using a JUnit 5 extension.</p>
<pre><code class="language-java">@RegisterExtension
static MultiInstanceAtomDbExtension cluster = MultiInstanceAtomDbExtension.builder()
        .instanceCount(3)
        .quorumSize(3)
        .startupTimeout(Duration.ofSeconds(120))
        .build();
</code></pre>
<p><code>beforeAll</code> calls <code>ConfigGenerator.generateClusterConfigs()</code> to produce dynamic-port YAML configs, starts each instance sequentially with a 3-second gap (to avoid bootstrap races), and then polls <code>ClusterHealthChecker.waitForAllPeersInCluster()</code> until the expected number of followers is visible in the cluster's <code>listPeers()</code> response. Only then does the first test method run.</p>
<h3>Verifying Replication</h3>
<p>The most fundamental correctness property: a <code>PUT</code> on node 1 must be visible on node 2.</p>
<pre><code class="language-java">@Test
void testPutOnOneNodeGetOnAnother() {
    AtomDbClient client1 = cluster.getClient("n1");
    AtomDbClient client2 = cluster.getClient("n2");

    AtomDbResponse&lt;String&gt; putResponse = client1.put("replication-key", "replication-value");
    assertThat(putResponse.isSuccess()).isTrue();

    cluster.waitForReplication();

    AtomDbResponse&lt;String&gt; getResponse = client2.getKey("replication-key");
    assertThat(getResponse.isSuccess()).isTrue();
    assertThat(getResponse.getResponse()).isEqualTo("replication-value");
}
</code></pre>
<h3>Leader Failover</h3>
<p><code>LeaderFailoverTest</code> validates the complete failover lifecycle: stop the leader, verify the remaining two nodes elect a new leader, write new data, restart the former leader, and confirm it rejoins as a follower and catches up.</p>
<pre><code class="language-java">@Test
void testLeaderFailoverAndRecovery() throws Exception {
    verifyInitialClusterFormation();    // all 3 nodes up, listPeers returns 3 followers
    testClusterBeforeFailover();        // PUT/GET works
    stopLeaderNode();                   // cluster.stopNode("n1")
    testClusterAfterLeaderFailure();    // PUT/GET still works via new leader (n2 or n3)
    restartFormerLeaderNode();          // cluster.startNode("n1")
    verifyNodeRejoinsCluster();         // listPeers shows n1 as follower again
    testClusterAfterRecovery();         // all 3 nodes serve reads/writes
}
</code></pre>
<h3>Follower Replacement</h3>
<p><code>FollowerRecoveryTest</code> uses a 5-node cluster (3 followers, 1 listener, 1 spare) and asserts that when a follower dies, the spare is automatically promoted within the <code>AddListenerTask</code> cycle:</p>
<pre><code class="language-java">@RegisterExtension
static MultiInstanceAtomDbExtension cluster = MultiInstanceAtomDbExtension.builder()
        .instanceCount(5)
        .quorumSize(3)
        .expectedPeerCount(4) // n1-n4 join; n5 remains spare until needed
        .startupTimeout(Duration.ofSeconds(180))
        .build();
</code></pre>
<p>After <code>killFollowerAndVerifyPromotion()</code> stops node n2, the test uses Awaitility to poll <code>listPeers()</code> until it sees exactly 3 followers again — with n5 promoted into n2's slot.</p>
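<p>The polling step can be pictured with a small plain-Java stand-in. This is only a sketch of the idea behind the Awaitility call, not the test's actual code: <code>awaitUntil</code>, the simulated follower count, and the timings are all hypothetical.</p>

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

// Hypothetical polling helper; the real test uses Awaitility against listPeers().
class PollUntilDemo {

    // Polls `actual` every `interval` until it equals `expected` or `timeout` elapses.
    static boolean awaitUntil(IntSupplier actual, int expected, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (System.nanoTime() < deadline) {
            if (actual.getAsInt() == expected) {
                return true;
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and give up
                return false;
            }
        }
        return actual.getAsInt() == expected; // one last check at the deadline
    }

    public static void main(String[] args) {
        // Simulated follower count: 2 right after the kill, 3 once the spare is promoted.
        AtomicInteger polls = new AtomicInteger();
        IntSupplier followerCount = () -> polls.incrementAndGet() < 3 ? 2 : 3;

        boolean healed = awaitUntil(followerCount, 3, Duration.ofSeconds(2), Duration.ofMillis(10));
        System.out.println("cluster healed: " + healed);
    }
}
```

<p>Polling with a deadline, rather than sleeping for a fixed time, keeps the test fast when promotion is quick and still fails cleanly when it never happens.</p>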
<h3>Listener Synchronisation</h3>
<p><code>ClusterListenerSyncTest</code> proves the non-voting-member guarantee: a listener keeps a current replica of the key-value store but its failure does not affect the voting quorum. The test simultaneously kills the listener node and one quorum follower, verifying that the remaining 2 quorum nodes still accept writes and serve reads.</p>
<hr />
<h2>Integrating as a Client</h2>
<p>If you are writing a Java service that wants to talk to an AtomDB cluster without embedding the full Raft server (for example, a service that is <em>not</em> part of the cluster but needs to read configuration from it), the <code>atomdb-client</code> module provides a thin Feign interface:</p>
<pre><code class="language-xml">&lt;dependency&gt;
    &lt;groupId&gt;com.snehasishroy&lt;/groupId&gt;
    &lt;artifactId&gt;atomdb-client&lt;/artifactId&gt;
    &lt;version&gt;1.0.0&lt;/version&gt;
&lt;/dependency&gt;
</code></pre>
<pre><code class="language-java">AtomDbClient client = Feign.builder()
        .client(new OkHttpClient())
        .encoder(new PlainTextEncoder(new JacksonEncoder()))
        .decoder(new JacksonDecoder())
        .target(AtomDbClient.class, "http://atomdb-node1:8080/");

// Store a value
AtomDbResponse&lt;String&gt; putResp = client.put("feature.flag.rollout", "true");
assertThat(putResp.isSuccess()).isTrue();

// Read it back
AtomDbResponse&lt;String&gt; getResp = client.getKey("feature.flag.rollout");
System.out.println(getResp.getResponse()); // "true"
</code></pre>
<p>The client interface itself is minimal by design:</p>
<pre><code class="language-java">public interface AtomDbClient {
    @RequestLine("GET /kv/v1/{key}")
    AtomDbResponse&lt;String&gt; getKey(@Param("key") String key);

    @RequestLine("PUT /kv/v1/{key}")
    @Headers("Content-Type: text/plain")
    AtomDbResponse&lt;String&gt; put(@Param("key") String key, String value);

    @RequestLine("GET /cluster/v1/peers")
    AtomDbResponse&lt;ClusterPeersResponse&gt; listPeers();

    @RequestLine("POST /snapshot/v1/")
    void triggerSnapshot();
}
</code></pre>
<p>Because reads are served by the leader and the leader can change during a failover, it is good practice to point your Feign target at a load balancer (or use retry logic with multiple target URLs) rather than hard-coding a single node.</p>
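<p>If a load balancer is not an option, the retry-over-multiple-targets idea might look roughly like the sketch below. <code>callWithFailover</code> is a hypothetical helper, not part of <code>atomdb-client</code>, and the lambda stands in for a real client call; in a real service it would build or look up an <code>AtomDbClient</code> for the given URL.</p>

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical failover wrapper; not part of the atomdb-client module.
class FailoverDemo {

    // Tries each base URL in order and returns the first successful result.
    static <T> T callWithFailover(List<String> baseUrls, Function<String, T> call) {
        RuntimeException last = null;
        for (String url : baseUrls) {
            try {
                return call.apply(url);
            } catch (RuntimeException e) {
                last = e; // node down or mid-election: fall through to the next target
            }
        }
        throw new IllegalStateException("all targets failed", last);
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("http://atomdb-node1:8080/", "http://atomdb-node2:8080/");
        // Stand-in for a real client call; node1 is simulated as unreachable.
        String value = callWithFailover(nodes, url -> {
            if (url.contains("node1")) {
                throw new RuntimeException("connection refused");
            }
            return "true";
        });
        System.out.println(value); // prints true
    }
}
```

<p>This sketch retries on any runtime exception. A production version would probably want to distinguish connection failures from application errors and add a short back-off between attempts.</p>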
<hr />
<h2>Discovery Strategies: From Local Dev to Production</h2>
<p>AtomDB supports three discovery modes to cover the full lifecycle from laptop to production:</p>
<table>
<thead>
<tr>
<th>Mode</th>
<th>Config type</th>
<th>When to use</th>
</tr>
</thead>
<tbody><tr>
<td><strong>STATIC</strong></td>
<td><code>type: STATIC</code></td>
<td>Local Dev with fixed ports; no automatic node promotion</td>
</tr>
<tr>
<td><strong>DYNAMIC</strong></td>
<td><code>type: DYNAMIC</code></td>
<td>Local tests with spare nodes; simulates failover</td>
</tr>
<tr>
<td><strong>DROVE</strong></td>
<td><code>type: DROVE</code></td>
<td>Production; auto-discovers live instances via Drove</td>
</tr>
</tbody></table>
<p>In <strong>DYNAMIC</strong> mode every discoverable node — including future spares — must appear in the <code>peers</code> list upfront. A node not listed can never be discovered by the <code>AddListenerTask</code>.</p>
<p>In <strong>DROVE</strong> mode the control plane queries the Drove orchestration API (<a href="https://github.com/PhonePe/drove-orchestrator">https://github.com/PhonePe/drove-orchestrator</a>) to discover all healthy instances for the service. This is the zero-configuration production path: deploy a new instance, Drove registers it, and AtomDB discovers it on the next <code>AddListenerTask</code> tick.</p>
<pre><code class="language-yaml"># Production DROVE config
atomDBBundleConfig:
  clusterConfiguration:
    discoveryConfig:
      type: DROVE
      droveEndpoint: ${DROVE_ENDPOINT_URL}
      raftPortName: raft
      communicationPortName: main
    clusterHealthConfig:
      type: DYNAMIC
      memberSize: 3
      maxListeners: 1
      enablePromotionOfListenersToFollowers: false
    groupUUID: 02511d47-d67c-49a3-9011-abb3109a44c2
</code></pre>
<hr />
<h2>Conclusion</h2>
<p>AtomDB started as an experiment: what is the minimum viable design for a strongly-consistent, embedded key-value store that a team could adopt with minimal external dependencies? The answer turned out to be a surprisingly thin layer on top of Apache Ratis.</p>
<p>The core insight is the <strong>separation of concerns between Raft (the protocol) and the control plane (the operational intelligence)</strong>. Ratis handles log replication, leader election, and membership changes — but it does not decide when to add nodes, which nodes to promote, or how to bootstrap a fresh cluster. That reasoning lives in <code>ClusterManager</code> and <code>AddListenerTask</code>, and it is the part of the system that is most specific to AtomDB's operational model.</p>
<p>The result is a system where you can:</p>
<ul>
<li><p><code>PUT</code> a key on any node and get a linearisable write-after-quorum guarantee</p>
</li>
<li><p><code>GET</code> a key and always read the latest committed value</p>
</li>
<li><p>Lose a minority of nodes and have the cluster self-heal with spare nodes within seconds</p>
</li>
<li><p>Survive a full cluster restart by downloading a snapshot from S3</p>
</li>
<li><p>Write tests that start real Raft clusters in-process, exercise actual failover sequences, and clean up after themselves in a single JUnit lifecycle</p>
</li>
</ul>
<p>The source code is available at <a href="https://github.com/snehasishroy/atomdb">github.com/snehasishroy/atomdb</a>. If you embed it in a Dropwizard application and find a bug or a missing feature, pull requests are very welcome.</p>
<p>P.S. I have tested extensively to make sure the first release has no known bugs. However, it has not yet been tested in a production environment. Thank you for sticking with it to the end.</p>
<img src="https://cdn.hashnode.com/uploads/covers/64e9d89567deab13f465f320/a94fb9e0-3327-47df-8282-10ad376aa598.png" alt="" style="display:block;margin:0 auto" />]]></content:encoded></item><item><title><![CDATA[Shuffle Sharding - the secret sauce behind AWS reliability]]></title><description><![CDATA[In a typical client-server architecture, requests from the client are forwarded to a random Application Server by the Application Load Balancer. When the request is a poison pill i.e. it crashes/hangs the server receiving the request, then all of you...]]></description><link>https://snehasishroy.com/shuffle-sharding-the-secret-sauce-behing-aws-reliability</link><guid isPermaLink="true">https://snehasishroy.com/shuffle-sharding-the-secret-sauce-behing-aws-reliability</guid><category><![CDATA[2Articles1Week]]></category><category><![CDATA[System Design]]></category><category><![CDATA[AWS]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[technology]]></category><dc:creator><![CDATA[Snehasish Roy]]></dc:creator><pubDate>Sun, 08 Jun 2025 11:28:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/yo8jVsFH5Co/upload/9d93b657bf3dbc856668775ddaef6731.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In a typical client-server architecture, requests from the client are forwarded to a random Application Server by the Application Load Balancer. When the request is a poison pill, i.e. it crashes or hangs the server receiving it, all of your application servers will eventually crash or hang one by one if the client keeps retrying (<em>as the requests will eventually be forwarded to every server at random</em>).</p>
<p><img src="https://static.us-east-1.prod.workshops.aws/public/f33d8266-91f8-4dfd-ad37-44edf7351ace/static/4_failure_management/2_fault_isolation/Fault_Isolation_with_Shuffle_Sharding/Images/RegularFlowBroken.png" alt="Architecture Diagram with Application Load Balancer and eight worker nodes marked unavailable" /></p>
<blockquote>
<p><a target="_blank" href="https://catalog.workshops.aws/well-architected-reliability/en-US/4-failure-management/2-fault-isolation/10-fault-isolation-with-shuffle-sharding/2-impact-of-failures">Source</a></p>
</blockquote>
<hr />
<h2 id="heading-will-simple-sharding-help">Will Simple Sharding help?</h2>
<p>One way to isolate this problem is to create virtual shards, i.e. isolate requests coming from clients so that they are served only by specific instances. This is generally done with a hashing scheme, e.g. modulo or consistent hashing.</p>
<p>In the diagram below, all requests from clients whose names start with A or B go to Worker1 and Worker2. So if the <em>Alpha</em> request is a poison pill, only two servers are impacted. The remaining servers are unaffected because the requests never reach them.</p>
<blockquote>
<p>Do note that the <em>Bravo</em> request won't be fulfilled either because both Worker1 and Worker2 have crashed/hung due to the poison pill request from <em>Alpha</em>. One dirty fish has poisoned the entire lake.</p>
</blockquote>
<p>This strategy has definitely reduced the failure radius, but requests from <em>Bravo</em> are still not being served. Can we do better?</p>
<p><img src="https://static.us-east-1.prod.workshops.aws/public/f33d8266-91f8-4dfd-ad37-44edf7351ace/static/4_failure_management/2_fault_isolation/Fault_Isolation_with_Shuffle_Sharding/Images/ShardedFlow.png" alt="Architecture Diagram with Application Load Balancer, shards and flow of customer to worker" /></p>
<blockquote>
<p><a target="_blank" href="https://catalog.workshops.aws/well-architected-reliability/en-US/4-failure-management/2-fault-isolation/10-fault-isolation-with-shuffle-sharding/3-implement-sharding">Source</a></p>
</blockquote>
<hr />
<h2 id="heading-can-increasing-the-shards-help">Can increasing the shards help?</h2>
<p>The issue with the above was the way we created shards — there were simply too few combinations available — as each instance can be mapped to only one shard.</p>
<p>If we allow each instance to be mapped to multiple shards, then we can increase the number of combinations available and reduce our unavailability.</p>
<p>In the below diagram, we created 8 shards to map 8 clients. Previously we only had 4 shards.</p>
<p><img src="https://static.us-east-1.prod.workshops.aws/public/f33d8266-91f8-4dfd-ad37-44edf7351ace/static/4_failure_management/2_fault_isolation/Fault_Isolation_with_Shuffle_Sharding/Images/Architecture-shuffle-sharding.png" alt="Architecture Diagram with Application Load Balancer, eight shards with two worker nodes per shard, and each worker being assigned to two different shards" /></p>
<blockquote>
<p><a target="_blank" href="https://catalog.workshops.aws/well-architected-reliability/en-US/4-failure-management/2-fault-isolation/10-fault-isolation-with-shuffle-sharding/5-implement-shuffle-sharding">Source</a></p>
</blockquote>
<p>Note that each worker is mapped to multiple shards, e.g. Worker2 is mapped to Shard1 and Shard2.</p>
<p>Now if requests from <em>Alpha</em> crash or hang Worker1 and Worker2, requests from <em>Bravo</em> continue to work because they are mapped to Worker2 and Worker3. Since Worker3 is still alive, requests from <em>Bravo</em> still get served.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Customer Name</strong></td><td><strong>Workers</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Alpha</td><td>Worker-1 and Worker-2</td></tr>
<tr>
<td>Bravo</td><td>Worker-2 and Worker-3</td></tr>
<tr>
<td>Charlie</td><td>Worker-3 and Worker-4</td></tr>
<tr>
<td>Delta</td><td>Worker-4 and Worker-5</td></tr>
<tr>
<td>Echo</td><td>Worker-5 and Worker-6</td></tr>
<tr>
<td>Foxtrot</td><td>Worker-6 and Worker-7</td></tr>
<tr>
<td>Golf</td><td>Worker-7 and Worker-8</td></tr>
<tr>
<td>Hotel</td><td>Worker-8 and Worker-1</td></tr>
</tbody>
</table>
</div><hr />
<h2 id="heading-lets-shuffle">Let’s Shuffle!</h2>
<p>Hope you were able to grasp the fundamentals using the above example. The idea is to create more shards with as few overlaps as possible. Let’s see how we can make a generic solution.</p>
<blockquote>
<p>Given n application servers, if we randomly choose k instances and make them part of one virtual shard, then each shard has k instances. The probability that 2 shards overlap 100% drops sharply as k grows (for k up to n/2, which is where the number of combinations peaks).</p>
<p>To simplify, if we have 10 cards and we want to choose 4 of them, there are 10 choose 4 = 210 possible combinations. If we pick one combination at random and then independently pick another, the probability that they overlap 100%, i.e., all 4 cards are the same, is 1 / 210 ≈ 0.48%.</p>
</blockquote>
<p>To relate this analogy to our problem statement: given 10 application instances, if we randomly choose 4 instances to create a virtual shard, the probability that 2 shards overlap 100% is ~0.48%. This means a poison request from one customer fully takes out another customer's shard with a probability of only ~0.48%; customers whose shards overlap only partially still have healthy instances to fall back on. We can reduce this further by increasing the total number of instances or the shard size.</p>
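<p>The arithmetic is easy to check in code. The following is a small sketch (class and method names are made up for illustration) that computes the number of possible shards and the chance that two independently chosen shards are identical, which for 10 instances and a shard size of 4 is 1 in 210, just under 0.5%.</p>

```java
// Illustrative helper for the combinatorics above; names are hypothetical.
class ShardOverlapDemo {

    // n choose k, computed iteratively; each intermediate division is exact.
    static long choose(int n, int k) {
        long result = 1;
        for (int i = 1; i <= k; i++) {
            result = result * (n - k + i) / i;
        }
        return result;
    }

    // Probability that two independently chosen random k-node shards are identical.
    static double fullOverlapProbability(int n, int k) {
        return 1.0 / choose(n, k);
    }

    public static void main(String[] args) {
        System.out.println(choose(10, 4));                 // 210
        System.out.println(fullOverlapProbability(10, 4)); // roughly 0.00476
        System.out.println(choose(100, 5));                // 75287520
    }
}
```

<p>The last line shows why a larger fleet helps so much: with 100 instances and shards of 5, there are over 75 million distinct shards, so a full overlap between two customers becomes extremely unlikely.</p>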
<hr />
<h2 id="heading-talk-is-cheap-show-me-the-code">Talk is cheap, show me the code!</h2>
<p>Shuffle sharding can be implemented in two ways — stateless or stateful.</p>
<p>Stateless, as the name indicates, does not persist any state data, i.e., shard information in a DB Store. It simply identifies the target application instances from an identifier. The target application instances are the members of the virtual shard mapped to that request.</p>
<p>Stateful sharding goes a bit further and persists the information of the shards to a database, which allows further customization of the way shards are created, e.g., customizing the shard assignment strategy by tuning weights.</p>
<p>Let’s see how we can implement stateless shuffle sharding using a simple strategy of generating multiple hashes from a unique identifier, followed by mapping that hash to a unique node.</p>
<pre><code class="lang-java"><span class="hljs-function"><span class="hljs-keyword">public</span> Set&lt;Integer&gt; <span class="hljs-title">assignNodes</span><span class="hljs-params">(String customerId)</span> </span>{
    Set&lt;Integer&gt; assignedNodes = <span class="hljs-keyword">new</span> HashSet&lt;&gt;();

    <span class="hljs-comment">// Need to find 4 nodes</span>
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">4</span>; i++) {
        <span class="hljs-comment">// Create a unique input for each Node selection</span>
        String hashInput = customerId + <span class="hljs-string">":"</span> + i;
        <span class="hljs-comment">// maps a hash to a NodeId</span>
        <span class="hljs-keyword">int</span> nodeId = hashToNodeId(hashInput); 

        <span class="hljs-comment">// Handle collisions by trying the next available Node</span>
        <span class="hljs-keyword">while</span> (assignedNodes.contains(nodeId)) {
            nodeId = (nodeId + <span class="hljs-number">1</span>) % config.getTotalNodes();
        }
        assignedNodes.add(nodeId);
    }
    <span class="hljs-keyword">return</span> assignedNodes;
}
</code></pre>
<blockquote>
<p>We can also leverage multiple hash functions, similar to a Bloom Filter, to generate multiple hashes instead of generating multiple inputs from the identifier, in case multiple unique inputs can't be generated.</p>
</blockquote>
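<p>The snippet above leaves <code>hashToNodeId</code> undefined. One possible way to write it is sketched below; this is an assumption about its shape, not the original implementation. It takes the node count as an explicit <code>totalNodes</code> parameter (instead of reading it from <code>config</code>) and uses SHA-256 purely as a well-mixed, deterministic hash.</p>

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical implementation of the hashToNodeId helper used in the snippet above.
class HashToNodeDemo {

    // Maps an arbitrary string to a node id in [0, totalNodes).
    static int hashToNodeId(String input, int totalNodes) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            // Fold the first four digest bytes into an int
            int hash = ((digest[0] & 0xFF) << 24) | ((digest[1] & 0xFF) << 16)
                    | ((digest[2] & 0xFF) << 8) | (digest[3] & 0xFF);
            // floorMod keeps the result non-negative even when hash is negative
            return Math.floorMod(hash, totalNodes);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    public static void main(String[] args) {
        // Four candidate nodes for one customer, out of 10 nodes in total
        for (int i = 0; i < 4; i++) {
            System.out.println(hashToNodeId("customer-42:" + i, 10));
        }
    }
}
```

<p>Because the mapping depends only on the input string, every load balancer instance computes the same shard for the same customer without sharing any state.</p>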
<hr />
<h2 id="heading-practical-usecases">Practical Use Cases</h2>
<p>Shuffle Sharding is a powerful and versatile technique used not only by AWS but also in popular open-source projects like <a target="_blank" href="https://grafana.com/docs/loki/latest/operations/shuffle-sharding/">Grafana Loki</a> and <a target="_blank" href="https://grafana.com/docs/mimir/latest/configure/configure-shuffle-sharding/">Grafana Mimir</a>.</p>
<p>In a recent <a target="_blank" href="https://www.youtube.com/watch?v=NXehLy7IiPM">AWS Tech Talk</a>, I learned that shuffle sharding is extensively used in S3 to introduce decorrelation in the system.</p>
<h2 id="heading-identify-drive-to-store-data">Identify drive to store data</h2>
<p>Shuffle Sharding helps randomly select a drive to store the data for your bucket, instead of directly mapping drives to buckets.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750343609710/12f45d8f-c39c-4204-96ad-e34babb1941b.png" alt class="image--center mx-auto" /></p>
<p>In order to ensure a balanced disk utilization, a very clever technique known as <strong><em>Two random choices</em></strong> is used.</p>
<blockquote>
<p><strong><em>The law of two random choices says: pick two drives at random and choose the one with the lower disk utilization. This avoids having to scan every drive in the system or maintain utilization data for all of them.</em></strong></p>
</blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750343766658/fdba3370-62b7-4a90-8342-14bd64b9f90e.png" alt class="image--center mx-auto" /></p>
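<p>The two-random-choices rule fits in a few lines. The sketch below is only an illustration of the rule as stated above, not how S3 implements it; the drive count, the utilization array, and the method name are all hypothetical.</p>

```java
import java.util.Random;

// Illustration of the two-random-choices rule; not S3's actual placement logic.
class TwoChoicesDemo {

    // Samples two drives at random and returns the index of the less-utilized one.
    static int pickDrive(double[] utilization, Random rng) {
        int a = rng.nextInt(utilization.length);
        int b = rng.nextInt(utilization.length);
        return utilization[a] <= utilization[b] ? a : b;
    }

    public static void main(String[] args) {
        double[] utilization = new double[8]; // hypothetical drives, all empty at the start
        Random rng = new Random(42);
        for (int i = 0; i < 8_000; i++) {
            utilization[pickDrive(utilization, rng)] += 1.0; // each write adds one unit
        }
        for (double u : utilization) {
            System.out.println(u);
        }
    }
}
```

<p>Each placement looks at only two drives, yet the comparison steers writes away from the fuller of the two, which tends to keep the drives far more evenly filled than a single random pick would.</p>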
<h2 id="heading-resolve-dns-queries">Resolve DNS Queries</h2>
<p>Shuffle Sharding is also used to resolve DNS queries. For example, when you look up s3.amazonaws.com or mybucket.s3.amazonaws.com, the DNS query returns multiple answers.</p>
<p>Bucket requests can randomly go to any server. It doesn't matter which server the request goes to because the buckets aren't tied to any specific server.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750343883548/39a88000-e7a0-4b27-a6cd-2fcc9ac2828c.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-reduce-latencies">Reduce Latencies</h2>
<p>Shuffle Sharding is also used in AWS CRT (Common Runtime, a base library shared across AWS SDKs) to improve latencies. CRT dynamically tracks the latency distribution, cancels any request that runs beyond the p95 latency, and retries it so that it goes to another host. This is a gamble, but it has paid off.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750344348473/e84102cb-fa33-4f28-b8e0-9034b1e13778.png" alt class="image--center mx-auto" /></p>
<p>Since S3 leverages Erasure Coding to store shards of the data, Shuffle Sharding is also used to reduce latencies by cancelling requests that go to slow shards and retrying them so that they go to another shard.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750344497885/ddcfea3c-da3a-4675-a07e-0ae49e36b73f.png" alt class="image--center mx-auto" /></p>
<hr />
<p>Thank you for reading. Hope you learnt something new. If you have any questions, please do comment.</p>
<p><img src="https://images.unsplash.com/photo-1487712010531-65e9aa8b4b1a?q=80&amp;w=3948&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.1.0&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" alt class="image--center mx-auto" /></p>
<h2 id="heading-appendix">Appendix</h2>
<ul>
<li><p><a target="_blank" href="https://catalog.workshops.aws/well-architected-reliability/en-US/4-failure-management/2-fault-isolation/10-fault-isolation-with-shuffle-sharding">https://catalog.workshops.aws/well-architected-reliability/en-US/4-failure-management/2-fault-isolation/10-fault-isolation-with-shuffle-sharding</a></p>
</li>
<li><p><a target="_blank" href="https://grafana.com/docs/loki/latest/operations/shuffle-sharding/">https://grafana.com/docs/loki/latest/operations/shuffle-sharding/</a></p>
</li>
</ul>
<hr />
]]></content:encoded></item><item><title><![CDATA[Building a Request Coalescer from Scratch]]></title><description><![CDATA[Let’s say you’re building a backend service. Things are going great — until traffic picks up. Suddenly, you notice something odd.
You’re seeing multiple identical requests hitting your database or API. They all ask for the same thing and they all mak...]]></description><link>https://snehasishroy.com/building-a-request-coalescer-from-scratch</link><guid isPermaLink="true">https://snehasishroy.com/building-a-request-coalescer-from-scratch</guid><category><![CDATA[distributed system]]></category><category><![CDATA[technology]]></category><category><![CDATA[Databases]]></category><category><![CDATA[System Design]]></category><category><![CDATA[2Articles1Week]]></category><dc:creator><![CDATA[Snehasish Roy]]></dc:creator><pubDate>Tue, 20 May 2025 10:16:04 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/59yg_LpcvzQ/upload/c697dcbc53380549c8e77cf8369cc0d9.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let’s say you’re building a backend service. Things are going great — until traffic picks up. Suddenly, you notice something odd.</p>
<p>You’re seeing <strong>multiple identical requests</strong> hitting your database or API. They all ask for the same thing and they all make separate, expensive calls.</p>
<p>Sound familiar?</p>
<p>If you’ve ever wondered, <em>“Why are we doing the same thing five times in parallel?”</em>, you’re not alone. This is a classic performance anti-pattern — and request coalescing can fix it.</p>
<hr />
<h2 id="heading-so-whats-the-problem">So, What’s the Problem?</h2>
<p>Imagine multiple users open the same product page at the same time. Or dozens of threads try to load the same user profile. They all hit your service with the same request—say, <code>getUser("123")</code>—within milliseconds of each other.</p>
<p>Now here’s the kicker: instead of sharing the work, <strong>each thread fires off its own request</strong> to the database or a remote API.</p>
<p>Why? Because your service has no idea that others are doing the same thing.</p>
<p>Let’s break it down:</p>
<pre><code class="lang-java">Thread A: fetch(<span class="hljs-string">"user123"</span>) → starts DB/API call
Thread B: fetch(<span class="hljs-string">"user123"</span>) → starts DB/API call
Thread C: fetch(<span class="hljs-string">"user123"</span>) → starts DB/API call
</code></pre>
<p>That’s three expensive calls… for the same data.</p>
<hr />
<h2 id="heading-the-thundering-herd-problem">The Thundering Herd Problem</h2>
<p>This scenario becomes even worse when a <strong>cache expires</strong> or a cold-start occurs. Suddenly, thousands of requests for the same key hit your backend simultaneously. This is known as the <strong>thundering herd problem</strong>.</p>
<p>In systems with shared caching or batch jobs, thundering herds can:</p>
<ul>
<li><p>Spike traffic to your database or upstream API</p>
</li>
<li><p>Cause rate-limiting, timeouts, or failures</p>
</li>
<li><p>Lead to cascading issues in downstream services</p>
</li>
</ul>
<p><strong>Why does this happen?</strong> Because each thread/client sees a cache miss and rushes to fetch the data independently—without knowing others are doing the same.</p>
<hr />
<h2 id="heading-can-we-do-better">Can We Do Better?</h2>
<p>What if you could say:</p>
<blockquote>
<p>“Hey, I see someone is already fetching this. Let me just wait and use their result.”</p>
</blockquote>
<p>That’s what <strong>request coalescing</strong> is all about.</p>
<hr />
<h2 id="heading-what-is-request-coalescing">What Is Request Coalescing?</h2>
<p><strong>Request coalescing</strong> is a technique where multiple concurrent requests for the <strong>same key</strong> are <strong>merged into one</strong>. Instead of all threads doing the same thing, only the <strong>first</strong> one does the work. The others just wait—and then reuse the result.</p>
<p>Here’s how it looks with coalescing:</p>
<pre><code class="lang-java">Thread A: fetch(<span class="hljs-string">"user123"</span>) → starts DB/API call
Thread B: fetch(<span class="hljs-string">"user123"</span>) → waits for result from A
Thread C: fetch(<span class="hljs-string">"user123"</span>) → waits for result from A
</code></pre>
<p>Only <strong>one call</strong> goes through. Everyone else benefits.</p>
<hr />
<h2 id="heading-where-should-you-use-it">Where Should You Use It?</h2>
<p>Request coalescing makes sense when:</p>
<ul>
<li><p>The <strong>same key</strong> is requested often (e.g., trending topics, popular users).</p>
</li>
<li><p>The <strong>backend call is expensive.</strong></p>
</li>
<li><p>You’re dealing with <strong>cold cache</strong> or <strong>frequent expirations.</strong></p>
</li>
<li><p>You use <strong>TTL-based caching</strong> and care about stability.</p>
</li>
</ul>
<hr />
<h2 id="heading-lets-build">Let's Build!</h2>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">User</span> </span>{
    String name;

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">User</span><span class="hljs-params">(String name)</span> </span>{
        <span class="hljs-keyword">this</span>.name = name;
    }
}
</code></pre>
<p>This is a simple POJO representing a <code>User</code>. In real systems, this might come from a database or remote API.</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">UserDao</span> </span>{
    <span class="hljs-comment">// use synchronized to simulate an even higher load by allowing only one thread through at a time</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">synchronized</span> User <span class="hljs-title">fetchByName</span><span class="hljs-params">(String name)</span> </span>{
        <span class="hljs-comment">// simulate db fetch which takes 0.5 sec</span>
        Stopwatch started = Stopwatch.createStarted();
        LockSupport.parkNanos(TimeUnit.MILLISECONDS.toNanos(<span class="hljs-number">500</span>));
        User user = <span class="hljs-keyword">new</span> User(name);
        <span class="hljs-keyword">long</span> elapsed = started.elapsed(TimeUnit.MILLISECONDS);
        log.info(<span class="hljs-string">"Took {} ms to fetch user"</span>, elapsed);
        <span class="hljs-keyword">return</span> user;
    }
}
</code></pre>
<p>This class simulates a <strong>costly database fetch</strong>:</p>
<ul>
<li><p>Uses <code>synchronized</code> to throttle concurrent access (imitating heavy load).</p>
</li>
<li><p>Sleeps for 500ms to simulate latency.</p>
</li>
</ul>
<p>This is important to see the benefit when coalescing kicks in — only one of these slow fetches should happen!</p>
<pre><code class="lang-java">
<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">UserController</span> </span>{
    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> <span class="hljs-keyword">boolean</span> isCoalescingEnabled;
    UserDao userDao;
    RequestCoalescer&lt;User&gt; requestCoalescer;

    UserController(UserDao userDao, <span class="hljs-keyword">boolean</span> isCoalescingEnabled) {
        <span class="hljs-keyword">this</span>.userDao = userDao;
        <span class="hljs-keyword">this</span>.requestCoalescer = <span class="hljs-keyword">new</span> RequestCoalescer&lt;&gt;();
        <span class="hljs-keyword">this</span>.isCoalescingEnabled = isCoalescingEnabled;
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> User <span class="hljs-title">lookupName</span><span class="hljs-params">(String name)</span> </span>{
        <span class="hljs-keyword">if</span> (isCoalescingEnabled) {
            <span class="hljs-keyword">return</span> requestCoalescer.subscribe(name, () -&gt; userDao.fetchByName(name));
        } <span class="hljs-keyword">else</span> {
            <span class="hljs-keyword">return</span> userDao.fetchByName(name);
        }
    }
}
</code></pre>
<p>This class controls how requests are handled:</p>
<ul>
<li><p>You can <strong>toggle coalescing</strong> on/off using <code>isCoalescingEnabled</code>.</p>
</li>
<li><p>If enabled, the controller <strong>delegates to the coalescer</strong>.</p>
</li>
<li><p>If disabled, it simply hits the DAO each time — resulting in multiple slow, redundant fetches.</p>
</li>
</ul>
<p>This makes it easier to benchmark and demonstrate the benefits of coalescing.</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">RequestCoalescer</span>&lt;<span class="hljs-title">T</span>&gt; </span>{
    Map&lt;String, CompletableFuture&lt;T&gt;&gt; inFlightRequests = <span class="hljs-keyword">new</span> ConcurrentHashMap&lt;&gt;();
    <span class="hljs-function"><span class="hljs-keyword">public</span> T <span class="hljs-title">subscribe</span><span class="hljs-params">(String key, Supplier&lt;T&gt; supplier)</span> </span>{
        CompletableFuture&lt;T&gt; future = getOrCreateFuture(key, supplier);
        <span class="hljs-keyword">return</span> future.join();
    }

    <span class="hljs-function"><span class="hljs-keyword">private</span> CompletableFuture&lt;T&gt; <span class="hljs-title">getOrCreateFuture</span><span class="hljs-params">(String key, Supplier&lt;T&gt; supplier)</span> </span>{
        CompletableFuture&lt;T&gt; future = inFlightRequests.get(key);
        <span class="hljs-keyword">if</span> (future != <span class="hljs-keyword">null</span>) {
            <span class="hljs-keyword">return</span> future;
        }
        CompletableFuture&lt;T&gt; newFuture = <span class="hljs-keyword">new</span> CompletableFuture&lt;&gt;();
        CompletableFuture&lt;T&gt; oldFuture = inFlightRequests.putIfAbsent(key, newFuture);
        <span class="hljs-keyword">if</span> (oldFuture != <span class="hljs-keyword">null</span>) {
            <span class="hljs-keyword">return</span> oldFuture;
        } <span class="hljs-keyword">else</span> {
            CompletableFuture.supplyAsync(() -&gt; {
                <span class="hljs-keyword">try</span> {
                    T result = supplier.get();
                    newFuture.complete(result);
                    inFlightRequests.remove(key, newFuture);
                    <span class="hljs-keyword">return</span> result;
                } <span class="hljs-keyword">catch</span> (Exception e) {
                    newFuture.completeExceptionally(e);
                    inFlightRequests.remove(key, newFuture);
                    <span class="hljs-comment">// return value is unused - newFuture is actually used.</span>
                    <span class="hljs-keyword">return</span> <span class="hljs-keyword">null</span>;
                }
            });
            <span class="hljs-keyword">return</span> newFuture;
        }
    }
}
</code></pre>
<p>Step 1: Check if a fetch for this key is already happening. If so, return the existing future.</p>
<p>Step 2: Try to insert a new future. If another thread beat us to it, we return <em>their</em> future instead.</p>
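<p>Step 3: If we won the race, start the fetch asynchronously (<code>supplyAsync</code> runs it on the common <code>ForkJoinPool</code> by default rather than on a dedicated new thread). Once it’s done, we:</p>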
<ul>
<li><p>Complete the future</p>
</li>
<li><p>Remove the entry from the map (This is important to prevent memory leaks and avoid returning stale data)</p>
</li>
</ul>
<pre><code class="lang-java"><span class="hljs-meta">@Slf4j</span>
<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">UserControllerTest</span> </span>{
    <span class="hljs-meta">@ParameterizedTest</span>
    <span class="hljs-meta">@ValueSource(booleans = {true, false})</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">testLookupName</span><span class="hljs-params">(<span class="hljs-keyword">boolean</span> isCoalescingEnabled)</span> <span class="hljs-keyword">throws</span> InterruptedException </span>{
        UserController userController = <span class="hljs-keyword">new</span> UserController(<span class="hljs-keyword">new</span> UserDao(), isCoalescingEnabled);
        CountDownLatch latch = <span class="hljs-keyword">new</span> CountDownLatch(<span class="hljs-number">10</span>);
        Stopwatch timer = Stopwatch.createStarted();
        <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">10</span>; i++) {
            CompletableFuture.runAsync(() -&gt; {
                userController.lookupName(<span class="hljs-string">"test"</span>);
                latch.countDown();
            });
        }
        <span class="hljs-keyword">boolean</span> await = latch.await(<span class="hljs-number">10</span>, TimeUnit.SECONDS);
        Assertions.assertTrue(await);
        <span class="hljs-keyword">long</span> seconds = timer.elapsed(TimeUnit.SECONDS);
        log.info(<span class="hljs-string">"Took {} seconds"</span>, seconds);
        <span class="hljs-keyword">if</span> (isCoalescingEnabled) {
            Assertions.assertTrue(seconds &lt;= <span class="hljs-number">1</span>);
        } <span class="hljs-keyword">else</span> {
            Assertions.assertTrue(seconds &gt;= <span class="hljs-number">5</span>);
        }
    }
}
</code></pre>
<p>No code is complete until its tests are written.<br />In this test, we make 10 concurrent requests to look up the details of the user named <code>test</code>. When coalescing is disabled, the test takes about 0.5 × 10 = 5 seconds to finish.</p>
<p>When coalescing is enabled, the test finishes in ~0.5 seconds because, while the result of the first request is being computed, the remaining requests are <em>virtually short-circuited</em>: they wait on the same future instead of hitting the DAO.</p>
<hr />
<h2 id="heading-gotchas">Gotchas</h2>
<ul>
<li><p><strong>Memory leaks</strong>: Always remove entries from the in-flight map after use.</p>
</li>
<li><p><strong>Timeouts</strong>: What if the fetch never finishes? Add appropriate timeouts.</p>
</li>
<li><p><strong>Error sharing</strong>: If the request fails, make sure others don’t cache a bad result.</p>
</li>
<li><p><strong>Over-coalescing</strong>: Don’t block forever; design with concurrency limits.</p>
</li>
</ul>
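<p>The timeout and error-sharing gotchas can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation shown earlier: the class name <code>BoundedRequestCoalescer</code>, the <code>timeoutMillis</code> parameter, and the choice to surface failures as <code>IllegalStateException</code> are all hypothetical. Each caller bounds its own wait with <code>Future.get(timeout, unit)</code>, and the map entry is removed before the future is completed, so a failed or timed-out fetch is never handed to a later caller.</p>

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical variant of RequestCoalescer: bounded waits, failures never stay in the map.
class BoundedRequestCoalescer<T> {
    private final Map<String, CompletableFuture<T>> inFlight = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    BoundedRequestCoalescer(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    T subscribe(String key, Supplier<T> supplier) {
        CompletableFuture<T> future = getOrStart(key, supplier);
        try {
            // Every caller waits at most timeoutMillis, whether it started the fetch or joined it.
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Drop the stuck entry so the next caller starts a fresh fetch.
            inFlight.remove(key, future);
            throw new IllegalStateException("Timed out waiting for key " + key, e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Fetch failed for key " + key, e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for key " + key, e);
        }
    }

    private CompletableFuture<T> getOrStart(String key, Supplier<T> supplier) {
        CompletableFuture<T> created = new CompletableFuture<>();
        CompletableFuture<T> existing = inFlight.putIfAbsent(key, created);
        if (existing != null) {
            return existing; // another caller already started the fetch
        }
        CompletableFuture.runAsync(() -> {
            try {
                T result = supplier.get();
                // Remove first, then complete: nobody can pick up an already-finished entry.
                inFlight.remove(key, created);
                created.complete(result);
            } catch (Throwable t) {
                inFlight.remove(key, created);
                created.completeExceptionally(t);
            }
        });
        return created;
    }
}
```

<p>Note that a timed-out fetch keeps running in the background in this sketch; cancelling the underlying call is a separate concern that depends on what the supplier does.</p>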
<p>Thank you for reading. Hope you learnt something new today.</p>
<hr />
]]></content:encoded></item><item><title><![CDATA[Understanding the basics of Kafka Binary Protocol]]></title><description><![CDATA[Apache Kafka is a distributed event streaming platform used for high-performance data pipelines. In this article, we will take a look at the under belly of the Kafka and see how communication happens between the Kafka client and server.
Fundamentals
...]]></description><link>https://snehasishroy.com/understanding-the-kafka-communication-protocol-in-detail</link><guid isPermaLink="true">https://snehasishroy.com/understanding-the-kafka-communication-protocol-in-detail</guid><category><![CDATA[distributed system]]></category><category><![CDATA[System Design]]></category><category><![CDATA[System Architecture]]></category><category><![CDATA[kafka]]></category><category><![CDATA[2Articles1Week]]></category><dc:creator><![CDATA[Snehasish Roy]]></dc:creator><pubDate>Mon, 19 May 2025 05:24:56 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/Oaqk7qqNh_c/upload/4b6a837e40c07421c80c8e88f58dfa15.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Apache Kafka is a distributed event streaming platform used for high-performance data pipelines. In this article, we will take a look at the underbelly of Kafka and see how communication happens between the Kafka client and server.</p>
<h3 id="heading-fundamentals">Fundamentals</h3>
<p>Let's start with the basics. Kafka uses a custom binary protocol for sending and receiving messages.</p>
<p>The <a target="_blank" href="https://kafka.apache.org/protocol.html#protocol_messages">specifications</a> define the request header as follows:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747629036180/90c9e412-b852-451d-82ca-0e5444b06505.png" alt class="image--center mx-auto" /></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Field</strong></td><td><strong>Data type</strong></td><td><strong>Description</strong></td></tr>
</thead>
<tbody>
<tr>
<td><code>request_api_key</code></td><td><code>INT16</code></td><td>The API key for the request</td></tr>
<tr>
<td><code>request_api_version</code></td><td><code>INT16</code></td><td>The version of the API for the request</td></tr>
<tr>
<td><code>correlation_id</code></td><td><code>INT32</code></td><td>A unique identifier for the request</td></tr>
<tr>
<td><code>client_id</code></td><td><code>NULLABLE_STRING</code></td><td>The client ID for the request</td></tr>
<tr>
<td><code>TAG_BUFFER</code></td><td><code>COMPACT_ARRAY</code></td><td>Optional tagged fields</td></tr>
</tbody>
</table>
</div><p>The <a target="_blank" href="https://kafka.apache.org/protocol.html#protocol_types">specs</a> define the data types as follows:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Type</td><td>Description</td></tr>
</thead>
<tbody>
<tr>
<td>INT16</td><td>Represents an integer between -2<sup>15</sup> and 2<sup>15</sup>-1 inclusive. The values are encoded using <strong>two bytes</strong> in network byte order (<strong>big-endian</strong>).</td></tr>
<tr>
<td>INT32</td><td>Represents an integer between -2<sup>31</sup> and 2<sup>31</sup>-1 inclusive. The values are encoded using <strong>four bytes</strong> in network byte order (<strong>big-endian</strong>).</td></tr>
<tr>
<td>COMPACT_ARRAY</td><td>Represents a sequence of objects of a given type T. Type T can be either a primitive type (e.g. STRING) or a structure. First, the length N + 1 is given as an UNSIGNED_VARINT. Then N instances of type T follow. A null array is represented with a length of 0. In protocol documentation an array of T instances is referred to as [T].</td></tr>
</tbody>
</table>
</div><p>Here's an example of a request message:</p>
<pre><code class="lang-java"><span class="hljs-number">00</span> <span class="hljs-number">00</span> <span class="hljs-number">00</span> <span class="hljs-number">23</span>  <span class="hljs-comment">// message_size:        35</span>
<span class="hljs-number">00</span> <span class="hljs-number">12</span>        <span class="hljs-comment">// request_api_key:     18</span>
<span class="hljs-number">00</span> <span class="hljs-number">04</span>        <span class="hljs-comment">// request_api_version: 4</span>
<span class="hljs-number">6f</span> <span class="hljs-number">7f</span> c6 <span class="hljs-number">61</span>  <span class="hljs-comment">// correlation_id:      1870644833</span>
...
</code></pre>
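<p>The fixed-size fields in the example above can be decoded with a plain <code>ByteBuffer</code>, which reads big-endian by default and therefore matches Kafka's network byte order. This is a small sketch for illustration; the class name <code>RequestHeaderPrefix</code> and its fields are hypothetical, and it only covers the four fields shown in the hex dump.</p>

```java
import java.nio.ByteBuffer;

// Hypothetical holder for message_size plus the first three header fields.
class RequestHeaderPrefix {
    final int messageSize;
    final short apiKey;
    final short apiVersion;
    final int correlationId;

    RequestHeaderPrefix(int messageSize, short apiKey, short apiVersion, int correlationId) {
        this.messageSize = messageSize;
        this.apiKey = apiKey;
        this.apiVersion = apiVersion;
        this.correlationId = correlationId;
    }

    static RequestHeaderPrefix parse(byte[] bytes) {
        // ByteBuffer is big-endian unless told otherwise, which is what INT16/INT32 require.
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int messageSize = buf.getInt();     // INT32: 4 bytes
        short apiKey = buf.getShort();      // INT16: 2 bytes
        short apiVersion = buf.getShort();  // INT16: 2 bytes
        int correlationId = buf.getInt();   // INT32: 4 bytes
        return new RequestHeaderPrefix(messageSize, apiKey, apiVersion, correlationId);
    }

    public static void main(String[] args) {
        byte[] sample = {0x00, 0x00, 0x00, 0x23, 0x00, 0x12, 0x00, 0x04, 0x6f, 0x7f, (byte) 0xc6, 0x61};
        RequestHeaderPrefix h = parse(sample);
        System.out.println(h.messageSize + " " + h.apiKey + " " + h.apiVersion + " " + h.correlationId);
    }
}
```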
<p>Every Kafka request is an API call. The Kafka protocol defines over 70 different APIs, all of which do different things. Here are some examples:</p>
<ul>
<li><p><code>Produce</code> writes events to partitions.</p>
</li>
<li><p><code>CreateTopics</code> creates new topics.</p>
</li>
<li><p><code>ApiVersions</code> returns the broker's supported API versions.</p>
</li>
</ul>
<p>A Kafka request specifies the API it is calling through the <code>request_api_key</code> header field.</p>
<h3 id="heading-message-body">Message body</h3>
<p>The schemas for the request and response bodies are determined by the API being called.</p>
<p>For example, here are some of the fields that the <code>Produce</code> request body contains:</p>
<ul>
<li><p>The name of the topic to write to.</p>
</li>
<li><p>The index of the partition to write to.</p>
</li>
<li><p>The event data to write.</p>
</li>
</ul>
<p>On the other hand, the <code>Produce</code> response body contains an error code for each partition that was written to. These codes indicate whether the writes succeeded.</p>
<p>As a reminder, requests and responses both have the following format:</p>
<ol>
<li><p><code>message_size</code></p>
</li>
<li><p>Header</p>
</li>
<li><p>Body</p>
</li>
</ol>
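<p>Because every message starts with <code>message_size</code>, a reader can pull exactly one message off the stream before interpreting it. The sketch below illustrates that idea; the class name <code>FrameReader</code> and the <code>maxFrameSize</code> guard are assumptions added for illustration and are not part of the protocol.</p>

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

class FrameReader {
    // Reads one size-prefixed message and returns header + body (without the prefix).
    // maxFrameSize is a hypothetical safety limit against absurd or malicious sizes.
    static ByteBuffer readFrame(DataInputStream in, int maxFrameSize) throws IOException {
        int messageSize = in.readInt(); // INT32, big-endian; throws EOFException on a closed stream
        if (messageSize < 0 || messageSize > maxFrameSize) {
            throw new IOException("Unreasonable message_size: " + messageSize);
        }
        byte[] payload = new byte[messageSize];
        in.readFully(payload); // blocks until the whole message has arrived
        return ByteBuffer.wrap(payload);
    }
}
```

<p>Reading the whole frame up front also means that leftover bytes from one request can never be mistaken for the start of the next one.</p>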
<h3 id="heading-api-versioning"><strong>API versioning</strong></h3>
<p>Each API supports multiple versions, to allow for different schemas. Here's how API versioning works:</p>
<ul>
<li><p>Requests use the header field <code>request_api_version</code> to specify the API version being requested.</p>
</li>
<li><p>Responses always use the same API version as the request. For example, a <code>Produce Request (Version: 3)</code> will always get a <code>Produce Response (Version: 3)</code> back.</p>
</li>
<li><p>Each API's version history is independent. So, different APIs with the same version are unrelated. For example, <code>Produce Request (Version: 10)</code> is not related to <code>Fetch Request (Version: 10)</code>.</p>
</li>
</ul>
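<p>On the server side, versioning boils down to checking the requested version against the range supported for that API. The sketch below is illustrative only: the class <code>ApiVersionCheck</code> and its table are hypothetical, the ranges mirror the POC built later in this article, and throwing on an unknown API key is a choice made for the sketch. Error code 35 is Kafka's <code>UNSUPPORTED_VERSION</code>.</p>

```java
import java.util.Map;

class ApiVersionCheck {
    static final short NONE = 0;
    static final short UNSUPPORTED_VERSION = 35;

    // API key -> {min version, max version}. Each API has its own, independent range.
    static final Map<Short, short[]> SUPPORTED = Map.of(
            (short) 18, new short[] {0, 4},   // ApiVersions
            (short) 75, new short[] {0, 0});  // DescribeTopicPartitions

    static short errorCodeFor(short apiKey, short requestedVersion) {
        short[] range = SUPPORTED.get(apiKey);
        if (range == null) {
            // Illustrative choice: this sketch does not model unknown APIs.
            throw new IllegalArgumentException("Unknown API key: " + apiKey);
        }
        boolean supported = requestedVersion >= range[0] && requestedVersion <= range[1];
        return supported ? NONE : UNSUPPORTED_VERSION;
    }
}
```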
<h3 id="heading-the-apiversions-api"><strong>The</strong> <code>ApiVersions</code> API</h3>
<p>The <code>ApiVersions</code> API returns the broker's supported API versions. For example, <code>ApiVersions</code> may say that the broker supports <code>Produce</code> versions 5 to 11, <code>Fetch</code> versions 0 to 3, etc.</p>
<h3 id="heading-visualizing-the-binary-protocol">Visualizing the Binary Protocol</h3>
<p>Here is a great <a target="_blank" href="https://binspec.org/kafka-api-versions-request-v4">link</a> that can help you visualize the binary protocol.</p>
<h3 id="heading-hands-on">Hands-on</h3>
<p>So let’s build a POC of the Kafka server.</p>
<h2 id="heading-lets-start-our-server">Let’s start our Server</h2>
<pre><code class="lang-java"><span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">main</span><span class="hljs-params">(String[] args)</span> </span>{
    <span class="hljs-keyword">int</span> port = <span class="hljs-number">9092</span>;
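    <span class="hljs-keyword">try</span> (ServerSocket server = <span class="hljs-keyword">new</span> ServerSocket()) {
        <span class="hljs-comment">// SO_REUSEADDR must be set before the socket is bound to take effect reliably</span>
        server.setReuseAddress(<span class="hljs-keyword">true</span>);
        server.bind(<span class="hljs-keyword">new</span> InetSocketAddress(port));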
        log.info(<span class="hljs-string">"Server started on port {}"</span>, port);
        <span class="hljs-keyword">while</span> (<span class="hljs-keyword">true</span>) {
            Socket client = server.accept();
            log.info(<span class="hljs-string">"New client connected"</span>);
            handleClientAsync(client);
        }
    } <span class="hljs-keyword">catch</span> (IOException e) {
        log.error(<span class="hljs-string">"IOException: "</span>, e);
    }
}
</code></pre>
<p>The server starts on port 9092 (the default Kafka port) and enters an infinite loop that waits for client connections. When a client connects, it passes the client socket to <code>handleClientAsync()</code>. The <code>setReuseAddress(true)</code> prevents "address already in use" errors when restarting the server.</p>
<h2 id="heading-handle-multiple-clients-asynchronously">Handle Multiple Clients Asynchronously</h2>
<pre><code class="lang-java"><span class="hljs-function"><span class="hljs-keyword">private</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">handleClientAsync</span><span class="hljs-params">(Socket client)</span> <span class="hljs-keyword">throws</span> IOException </span>{
    <span class="hljs-keyword">new</span> Thread(() -&gt; {
        <span class="hljs-keyword">try</span> {
            processMessage(client);
        } <span class="hljs-keyword">catch</span> (IOException e) {
            log.error(<span class="hljs-string">"Error while handling client: "</span>, e);
        } <span class="hljs-keyword">finally</span> {
            <span class="hljs-keyword">try</span> {
                client.close();
            } <span class="hljs-keyword">catch</span> (IOException e) {
                log.error(<span class="hljs-string">"Error closing client socket: "</span>, e);
            }
        }
    }).start();
}
</code></pre>
<p>This method creates a new thread for each client connection, allowing the server to handle multiple clients simultaneously. It delegates the message handling to <code>processMessage()</code>. This is a trivial implementation and will not scale well: without a thread pool, every connection gets its own thread, so the thread count grows without bound as clients connect.</p>
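<p>One way to cap the number of threads is to hand each connection to a fixed-size <code>ExecutorService</code>. The sketch below is an assumption-laden illustration, not part of the POC: <code>PooledClientDispatcher</code>, its <code>Handler</code> callback (standing in for <code>processMessage()</code>), and <code>shutdownAndWait</code> are hypothetical names. It accepts any <code>Closeable</code>, so a <code>Socket</code> works as the client.</p>

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class PooledClientDispatcher {
    // Hypothetical callback standing in for processMessage(client).
    interface Handler<C> {
        void handle(C client) throws IOException;
    }

    private final ExecutorService pool;

    PooledClientDispatcher(int maxThreads) {
        // At most maxThreads connections are served at once; the rest wait in the queue.
        this.pool = Executors.newFixedThreadPool(maxThreads);
    }

    <C extends Closeable> void handleClientAsync(C client, Handler<C> handler) {
        pool.execute(() -> {
            // try-with-resources closes the client even if the handler throws.
            try (C c = client) {
                handler.handle(c);
            } catch (IOException e) {
                System.err.println("Error while handling client: " + e);
            }
        });
    }

    // Stops accepting new work and waits for queued clients to finish.
    boolean shutdownAndWait(long timeout, TimeUnit unit) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

<p>A fixed pool trades unbounded thread growth for queueing: with long-lived connections, clients beyond <code>maxThreads</code> wait until a worker frees up, so the pool size needs to match the expected number of concurrent connections.</p>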
<h2 id="heading-processing-client-messages">Processing Client Messages</h2>
<pre><code class="lang-java"><span class="hljs-function"><span class="hljs-keyword">private</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">processMessage</span><span class="hljs-params">(Socket client)</span> <span class="hljs-keyword">throws</span> IOException </span>{
    DataInputStream in = <span class="hljs-keyword">new</span> DataInputStream(client.getInputStream());
    DataOutputStream out = <span class="hljs-keyword">new</span> DataOutputStream(client.getOutputStream());
    <span class="hljs-keyword">while</span> (!client.isClosed()) {
        <span class="hljs-comment">// Read the 4-byte message_size prefix</span>
        <span class="hljs-keyword">byte</span>[] messageSize = <span class="hljs-keyword">new</span> <span class="hljs-keyword">byte</span>[<span class="hljs-number">4</span>];
        <span class="hljs-comment">// readNBytes (Java 9+) keeps reading until 4 bytes arrive or the stream ends</span>
        <span class="hljs-keyword">int</span> read = in.readNBytes(messageSize, <span class="hljs-number">0</span>, messageSize.length);
        <span class="hljs-keyword">if</span> (read &lt; <span class="hljs-number">4</span>) {
            log.info(<span class="hljs-string">"Read fewer bytes than expected: {}"</span>, read);
            <span class="hljs-keyword">break</span>;
        }

        <span class="hljs-comment">// Read protocol metadata</span>
        <span class="hljs-keyword">byte</span>[] apiKey = <span class="hljs-keyword">new</span> <span class="hljs-keyword">byte</span>[<span class="hljs-number">2</span>];
        in.readFully(apiKey);
        <span class="hljs-keyword">byte</span>[] apiVersion = <span class="hljs-keyword">new</span> <span class="hljs-keyword">byte</span>[<span class="hljs-number">2</span>];
        in.readFully(apiVersion);
        <span class="hljs-keyword">byte</span>[] correlationId = <span class="hljs-keyword">new</span> <span class="hljs-keyword">byte</span>[<span class="hljs-number">4</span>];
        in.readFully(correlationId);

        <span class="hljs-comment">// Read client identification</span>
        <span class="hljs-keyword">byte</span>[] clientIdLength = <span class="hljs-keyword">new</span> <span class="hljs-keyword">byte</span>[<span class="hljs-number">2</span>];
        in.readFully(clientIdLength);
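        <span class="hljs-comment">// client_id is a NULLABLE_STRING: a length of -1 means null, so treat it as empty</span>
        <span class="hljs-keyword">byte</span>[] clientId = <span class="hljs-keyword">new</span> <span class="hljs-keyword">byte</span>[Math.max(ByteBuffer.wrap(clientIdLength).getShort(), <span class="hljs-number">0</span>)];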
        in.readFully(clientId);

        <span class="hljs-comment">// Decode the binary fields as numbers; wrapping them in new String(...) would log garbage</span>
        log.info(<span class="hljs-string">"Request from clientID {} for apiKey {} apiVersion {} correlationId {}"</span>, 
                <span class="hljs-keyword">new</span> String(clientId), ByteBuffer.wrap(apiKey).getShort(), ByteBuffer.wrap(apiVersion).getShort(), ByteBuffer.wrap(correlationId).getInt());
</code></pre>
<p>This is where we start seeing the protocol details. The method reads the 4-byte message size, followed by the header fields: API key, API version, correlation ID, and client ID. <code>DataInputStream</code> makes it easy to read exactly as many bytes as each fixed-size field needs.</p>
<h2 id="heading-handle-describetopicpartitions-request">Handle DescribeTopicPartitions Request</h2>
<pre><code class="lang-java">        <span class="hljs-keyword">byte</span>[] topicsArrayLength = <span class="hljs-keyword">null</span>;
        <span class="hljs-keyword">byte</span>[] topicNameLength = <span class="hljs-keyword">null</span>;
        <span class="hljs-keyword">byte</span>[] topicName = <span class="hljs-keyword">null</span>;
        <span class="hljs-keyword">if</span> (ByteBuffer.wrap(apiKey).getShort() == <span class="hljs-number">75</span>) {
            log.info(<span class="hljs-string">"Received DescribeTopicPartitions request"</span>);
            <span class="hljs-keyword">byte</span>[] tagBufferLength = <span class="hljs-keyword">new</span> <span class="hljs-keyword">byte</span>[<span class="hljs-number">1</span>];
            in.readFully(tagBufferLength);
            log.info(<span class="hljs-string">"Tag Buffer Length: {} "</span>, ByteBuffer.wrap(tagBufferLength).get());

            topicsArrayLength = <span class="hljs-keyword">new</span> <span class="hljs-keyword">byte</span>[<span class="hljs-number">1</span>];
            in.readFully(topicsArrayLength);
            log.info(<span class="hljs-string">"Topics Array Length: {}"</span>, ByteBuffer.wrap(topicsArrayLength).get());

            topicNameLength = <span class="hljs-keyword">new</span> <span class="hljs-keyword">byte</span>[<span class="hljs-number">1</span>];
            in.readFully(topicNameLength);
            log.info(<span class="hljs-string">"Topic Name Length: {}"</span>, ByteBuffer.wrap(topicNameLength).get());

            topicName = <span class="hljs-keyword">new</span> <span class="hljs-keyword">byte</span>[ByteBuffer.wrap(topicNameLength).get() - <span class="hljs-number">1</span>];
            in.readFully(topicName);
            log.info(<span class="hljs-string">"Topic Name: {}"</span>, <span class="hljs-keyword">new</span> String(topicName));

            <span class="hljs-comment">// Read additional fields</span>
            <span class="hljs-keyword">byte</span>[] tagBufferLength2 = <span class="hljs-keyword">new</span> <span class="hljs-keyword">byte</span>[<span class="hljs-number">1</span>];
            in.readFully(tagBufferLength2);
            log.info(<span class="hljs-string">"Tag Buffer2 Length: {}"</span>, ByteBuffer.wrap(tagBufferLength2).get());

            <span class="hljs-keyword">byte</span>[] responsePartitionLimit = <span class="hljs-keyword">new</span> <span class="hljs-keyword">byte</span>[<span class="hljs-number">4</span>];
            in.readFully(responsePartitionLimit);
            log.info(<span class="hljs-string">"Response Partition Limit: {}"</span>, ByteBuffer.wrap(responsePartitionLimit).getInt());

            <span class="hljs-keyword">byte</span>[] cursor = <span class="hljs-keyword">new</span> <span class="hljs-keyword">byte</span>[<span class="hljs-number">1</span>];
            in.readFully(cursor);
            log.info(<span class="hljs-string">"Cursor: {}"</span>, ByteBuffer.wrap(cursor).get());
        }
</code></pre>
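<p>This block handles a specific request type: API key 75, which is the <code>DescribeTopicPartitions</code> request. It reads the header's tag buffer, the topics array length, the topic name, the response partition limit, and the cursor. To keep the POC short, it assumes a single topic and that every length fits in one byte.</p>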
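<p>The lengths in this request are <code>UNSIGNED_VARINT</code> values, and reading them as a single byte only works while they stay below 128. A more general decoder could look like the sketch below. The class <code>UnsignedVarint</code> and its methods are hypothetical helpers written for this article, not Kafka client code; treating a length of 0 as null follows the nullable compact encoding described in the spec.</p>

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class UnsignedVarint {
    // Each byte carries 7 bits of the value, least significant group first;
    // the high bit says whether another byte follows.
    static int read(ByteBuffer buf) {
        int value = 0;
        int shift = 0;
        while (true) {
            byte b = buf.get();
            value |= (b & 0x7f) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
            shift += 7;
            if (shift > 28) {
                throw new IllegalArgumentException("Varint is longer than 5 bytes");
            }
        }
    }

    // Compact strings store length N + 1 as an unsigned varint, followed by N bytes of UTF-8.
    static String readCompactString(ByteBuffer buf) {
        int lengthPlusOne = read(buf);
        if (lengthPlusOne == 0) {
            return null; // nullable variant: 0 encodes null
        }
        byte[] bytes = new byte[lengthPlusOne - 1];
        buf.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

<p>With this helper, the topic name in the request used for verification (length byte <code>0x12</code>, then 17 bytes) decodes to <code>unknown-topic-saz</code>.</p>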
<h2 id="heading-sending-a-response">Sending a Response</h2>
<pre><code class="lang-java">        log.info(<span class="hljs-string">"Remaining bytes in input stream: {}"</span>, in.available());
        in.skip(in.available());

        <span class="hljs-comment">// Handle the request and send a response</span>
        ByteBuffer responseBuffer = createResponseBuffer(apiKey, apiVersion, correlationId, 
                                                        topicsArrayLength, topicNameLength, topicName);

        out.write(responseBuffer.array(), <span class="hljs-number">0</span>, responseBuffer.position());
        out.flush();
        log.info(<span class="hljs-string">"Response sent to client"</span>);
    }
}
</code></pre>
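<p>After parsing the request, the code discards any bytes that are still buffered and constructs a response using the <code>createResponseBuffer()</code> method. Skipping <code>in.available()</code> bytes is a shortcut that is good enough for this POC; a more robust server would use <code>message_size</code> to know exactly how many bytes belong to the current request. The response is then sent back to the client, which completes one request-response cycle in the continuous communication loop.</p>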
<h2 id="heading-crafting-the-response-using-bytebuffer">Crafting the Response using ByteBuffer</h2>
<p>If you are interested in the JSON schema of the response, the official Kafka client has <a target="_blank" href="https://github.com/apache/kafka/blob/ce4940f9891a96819e54f8db097ce3824876e8e5/clients/src/main/resources/common/message/DescribeTopicPartitionsResponse.json">one</a>.</p>
<pre><code class="lang-java"><span class="hljs-function"><span class="hljs-keyword">private</span> <span class="hljs-keyword">static</span> ByteBuffer <span class="hljs-title">createResponseBuffer</span><span class="hljs-params">(<span class="hljs-keyword">byte</span>[] apiKey, <span class="hljs-keyword">byte</span>[] apiVersion, <span class="hljs-keyword">byte</span>[] correlationID, 
                                               <span class="hljs-keyword">byte</span>[] topicsArrayLength, <span class="hljs-keyword">byte</span>[] topicNameLength, <span class="hljs-keyword">byte</span>[] topicName)</span> </span>{
    log.info(<span class="hljs-string">"Creating response buffer"</span>);
    ByteBuffer responseBuffer = ByteBuffer.allocate(<span class="hljs-number">1024</span>);
    responseBuffer.putInt(<span class="hljs-number">0</span>); <span class="hljs-comment">// Placeholder for message length</span>
    responseBuffer.put(correlationID);

    <span class="hljs-comment">// If API Key == 0x4b (75) (DescribeTopicPartitions)</span>
    <span class="hljs-keyword">if</span> (ByteBuffer.wrap(apiKey).getShort() == <span class="hljs-number">75</span>) {
        <span class="hljs-comment">// Create DescribeTopicPartitions response</span>
        responseBuffer.put((<span class="hljs-keyword">byte</span>) <span class="hljs-number">0</span>); <span class="hljs-comment">// Tag Buffer</span>
        responseBuffer.putInt(<span class="hljs-number">0</span>); <span class="hljs-comment">// Throttle time</span>
        responseBuffer.put(topicsArrayLength);<span class="hljs-comment">// Topic array length</span>
        responseBuffer.putShort((<span class="hljs-keyword">short</span>) <span class="hljs-number">3</span>); <span class="hljs-comment">// Error code 3 (UNKNOWN_TOPIC_OR_PARTITION)</span>
        responseBuffer.put(topicNameLength); <span class="hljs-comment">// Topic name length</span>
        responseBuffer.put(topicName); <span class="hljs-comment">// Topic name</span>
        responseBuffer.put(<span class="hljs-keyword">new</span> <span class="hljs-keyword">byte</span>[<span class="hljs-number">16</span>]); <span class="hljs-comment">// 16-byte null ID</span>
        responseBuffer.put((<span class="hljs-keyword">byte</span>) <span class="hljs-number">0</span>); <span class="hljs-comment">// IsInternal == 0</span>
        responseBuffer.put((<span class="hljs-keyword">byte</span>) <span class="hljs-number">1</span>); <span class="hljs-comment">// partition count + 1</span>
        responseBuffer.putInt(<span class="hljs-number">0x00000DF8</span>); <span class="hljs-comment">// TopicAuthorizedOperations</span>
        responseBuffer.put((<span class="hljs-keyword">byte</span>) <span class="hljs-number">0</span>); <span class="hljs-comment">// compact-encoded empty TAG_BUFFER</span>

        responseBuffer.put((<span class="hljs-keyword">byte</span>) <span class="hljs-number">0xff</span>); <span class="hljs-comment">// Cursor</span>
        responseBuffer.put((<span class="hljs-keyword">byte</span>) <span class="hljs-number">0</span>); <span class="hljs-comment">// Tag Buffer</span>
    } <span class="hljs-keyword">else</span> {
        <span class="hljs-comment">// Create API versions response</span>
        <span class="hljs-keyword">short</span> apiVersionValue = ByteBuffer.wrap(apiVersion).getShort();
        <span class="hljs-comment">// Error code 35 is UNSUPPORTED_VERSION; this POC supports ApiVersions versions 0 to 4</span>
        <span class="hljs-keyword">short</span> errorCode = (apiVersionValue &lt; <span class="hljs-number">0</span> || apiVersionValue &gt; <span class="hljs-number">4</span>) ? (<span class="hljs-keyword">short</span>) <span class="hljs-number">35</span> : (<span class="hljs-keyword">short</span>) <span class="hljs-number">0</span>;
        responseBuffer.putShort(errorCode);

        responseBuffer.put((<span class="hljs-keyword">byte</span>) <span class="hljs-number">3</span>); <span class="hljs-comment">// API keys array length: 2 entries + 1 (compact array)</span>
        <span class="hljs-comment">// First API</span>
        responseBuffer.putShort((<span class="hljs-keyword">short</span>) <span class="hljs-number">18</span>); <span class="hljs-comment">// API Versions 18</span>
        responseBuffer.putShort((<span class="hljs-keyword">short</span>) <span class="hljs-number">0</span>); <span class="hljs-comment">// Min version</span>
        responseBuffer.putShort((<span class="hljs-keyword">short</span>) <span class="hljs-number">4</span>); <span class="hljs-comment">// Max version</span>
        responseBuffer.put((<span class="hljs-keyword">byte</span>) <span class="hljs-number">0</span>); <span class="hljs-comment">// Tagged fields for this API (compact encoded 0)</span>

        <span class="hljs-comment">// Second API</span>
        responseBuffer.putShort((<span class="hljs-keyword">short</span>) <span class="hljs-number">75</span>); <span class="hljs-comment">// DescribeTopicPartitions 75</span>
        responseBuffer.putShort((<span class="hljs-keyword">short</span>) <span class="hljs-number">0</span>); <span class="hljs-comment">// Min version</span>
        responseBuffer.putShort((<span class="hljs-keyword">short</span>) <span class="hljs-number">0</span>); <span class="hljs-comment">// Max version</span>
        responseBuffer.put((<span class="hljs-keyword">byte</span>) <span class="hljs-number">0</span>); <span class="hljs-comment">// Tagged fields for this API (compact encoded 0)</span>

        responseBuffer.putInt(<span class="hljs-number">0</span>); <span class="hljs-comment">// Throttle time</span>
        responseBuffer.put((<span class="hljs-keyword">byte</span>) <span class="hljs-number">0</span>); <span class="hljs-comment">// No tagged fields</span>
    }

    <span class="hljs-comment">// Update the message length at the beginning of the buffer</span>
    <span class="hljs-keyword">int</span> messageLength = responseBuffer.position() - <span class="hljs-number">4</span>;
    log.info(<span class="hljs-string">"Message length: {}"</span>, messageLength);
    responseBuffer.putInt(<span class="hljs-number">0</span>, messageLength);
    <span class="hljs-keyword">return</span> responseBuffer;
}
</code></pre>
<p>This final method builds the response message. It branches on the API key to create either a <code>DescribeTopicPartitions</code> response or an <code>ApiVersions</code> response. The message length is calculated last and written back into the first four bytes of the buffer, a common pattern in length-prefixed binary protocols.</p>
<h3 id="heading-verification">Verification</h3>
<pre><code class="lang-less"><span class="hljs-selector-tag">echo</span> <span class="hljs-selector-tag">-n</span> "<span class="hljs-selector-tag">00000031004b0000589eecfb000c6b61666b612d746573746572000212756e6b6e6f776e2d746f7069632d73617a0000000001ff00</span>" \
| <span class="hljs-selector-tag">xxd</span> <span class="hljs-selector-tag">-r</span> <span class="hljs-selector-tag">-p</span> | <span class="hljs-selector-tag">nc</span> <span class="hljs-selector-tag">192</span><span class="hljs-selector-class">.168</span><span class="hljs-selector-class">.1</span><span class="hljs-selector-class">.6</span> <span class="hljs-selector-tag">9092</span> | <span class="hljs-selector-tag">hexdump</span> <span class="hljs-selector-tag">-C</span>

<span class="hljs-selector-tag">00000000</span>  <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">37</span> <span class="hljs-selector-tag">58</span> <span class="hljs-selector-tag">9e</span> <span class="hljs-selector-tag">ec</span> <span class="hljs-selector-tag">fb</span>  <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">02</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">03</span>  |..<span class="hljs-selector-class">.7X</span>...........|
<span class="hljs-selector-tag">00000010</span>  <span class="hljs-selector-tag">12</span> <span class="hljs-selector-tag">75</span> <span class="hljs-selector-tag">6e</span> <span class="hljs-selector-tag">6b</span> <span class="hljs-selector-tag">6e</span> <span class="hljs-selector-tag">6f</span> <span class="hljs-selector-tag">77</span> <span class="hljs-selector-tag">6e</span>  <span class="hljs-selector-tag">2d</span> <span class="hljs-selector-tag">74</span> <span class="hljs-selector-tag">6f</span> <span class="hljs-selector-tag">70</span> <span class="hljs-selector-tag">69</span> <span class="hljs-selector-tag">63</span> <span class="hljs-selector-tag">2d</span> <span class="hljs-selector-tag">73</span>  |<span class="hljs-selector-class">.unknown-topic-s</span>|
<span class="hljs-selector-tag">00000020</span>  <span class="hljs-selector-tag">61</span> <span class="hljs-selector-tag">7a</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span>  <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span>  |<span class="hljs-selector-tag">az</span>..............|
<span class="hljs-selector-tag">00000030</span>  <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">01</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">0d</span> <span class="hljs-selector-tag">f8</span>  <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">ff</span> <span class="hljs-selector-tag">00</span>                 |...........|
<span class="hljs-selector-tag">0000003b</span>

<span class="hljs-selector-tag">Hexdump</span> <span class="hljs-selector-tag">of</span> <span class="hljs-selector-tag">sent</span> "<span class="hljs-selector-tag">DescribeTopicPartitions</span>" <span class="hljs-selector-tag">request</span>:
<span class="hljs-selector-tag">Idx</span>  | <span class="hljs-selector-tag">Hex</span>                                             | <span class="hljs-selector-tag">ASCII</span>
<span class="hljs-selector-tag">-----</span>+<span class="hljs-selector-tag">-------------------------------------------------</span>+<span class="hljs-selector-tag">-----------------</span>
<span class="hljs-selector-tag">0000</span> | <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">31</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">4b</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">58</span> <span class="hljs-selector-tag">9e</span> <span class="hljs-selector-tag">ec</span> <span class="hljs-selector-tag">fb</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">0c</span> <span class="hljs-selector-tag">6b</span> <span class="hljs-selector-tag">61</span> | ..<span class="hljs-selector-class">.1</span><span class="hljs-selector-class">.K</span>.<span class="hljs-selector-class">.X</span>....<span class="hljs-selector-class">.ka</span>
<span class="hljs-selector-tag">0010</span> | <span class="hljs-selector-tag">66</span> <span class="hljs-selector-tag">6b</span> <span class="hljs-selector-tag">61</span> <span class="hljs-selector-tag">2d</span> <span class="hljs-selector-tag">74</span> <span class="hljs-selector-tag">65</span> <span class="hljs-selector-tag">73</span> <span class="hljs-selector-tag">74</span> <span class="hljs-selector-tag">65</span> <span class="hljs-selector-tag">72</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">02</span> <span class="hljs-selector-tag">12</span> <span class="hljs-selector-tag">75</span> <span class="hljs-selector-tag">6e</span> <span class="hljs-selector-tag">6b</span> | <span class="hljs-selector-tag">fka-tester</span>..<span class="hljs-selector-class">.unk</span>
<span class="hljs-selector-tag">0020</span> | <span class="hljs-selector-tag">6e</span> <span class="hljs-selector-tag">6f</span> <span class="hljs-selector-tag">77</span> <span class="hljs-selector-tag">6e</span> <span class="hljs-selector-tag">2d</span> <span class="hljs-selector-tag">74</span> <span class="hljs-selector-tag">6f</span> <span class="hljs-selector-tag">70</span> <span class="hljs-selector-tag">69</span> <span class="hljs-selector-tag">63</span> <span class="hljs-selector-tag">2d</span> <span class="hljs-selector-tag">73</span> <span class="hljs-selector-tag">61</span> <span class="hljs-selector-tag">7a</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> | <span class="hljs-selector-tag">nown-topic-saz</span>..
<span class="hljs-selector-tag">0030</span> | <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">01</span> <span class="hljs-selector-tag">ff</span> <span class="hljs-selector-tag">00</span>                                  | .....

<span class="hljs-selector-tag">Hexdump</span> <span class="hljs-selector-tag">of</span> <span class="hljs-selector-tag">received</span> "<span class="hljs-selector-tag">DescribeTopicPartitions</span>" <span class="hljs-selector-tag">response</span>:
<span class="hljs-selector-tag">Idx</span>  | <span class="hljs-selector-tag">Hex</span>                                             | <span class="hljs-selector-tag">ASCII</span>
<span class="hljs-selector-tag">-----</span>+<span class="hljs-selector-tag">-------------------------------------------------</span>+<span class="hljs-selector-tag">-----------------</span>
<span class="hljs-selector-tag">0000</span> | <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">37</span> <span class="hljs-selector-tag">58</span> <span class="hljs-selector-tag">9e</span> <span class="hljs-selector-tag">ec</span> <span class="hljs-selector-tag">fb</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">02</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">03</span> | ..<span class="hljs-selector-class">.7X</span>...........
<span class="hljs-selector-tag">0010</span> | <span class="hljs-selector-tag">12</span> <span class="hljs-selector-tag">75</span> <span class="hljs-selector-tag">6e</span> <span class="hljs-selector-tag">6b</span> <span class="hljs-selector-tag">6e</span> <span class="hljs-selector-tag">6f</span> <span class="hljs-selector-tag">77</span> <span class="hljs-selector-tag">6e</span> <span class="hljs-selector-tag">2d</span> <span class="hljs-selector-tag">74</span> <span class="hljs-selector-tag">6f</span> <span class="hljs-selector-tag">70</span> <span class="hljs-selector-tag">69</span> <span class="hljs-selector-tag">63</span> <span class="hljs-selector-tag">2d</span> <span class="hljs-selector-tag">73</span> | <span class="hljs-selector-class">.unknown-topic-s</span>
<span class="hljs-selector-tag">0020</span> | <span class="hljs-selector-tag">61</span> <span class="hljs-selector-tag">7a</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> | <span class="hljs-selector-tag">az</span>..............
<span class="hljs-selector-tag">0030</span> | <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">01</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">0d</span> <span class="hljs-selector-tag">f8</span> <span class="hljs-selector-tag">00</span> <span class="hljs-selector-tag">ff</span> <span class="hljs-selector-tag">00</span>                | ...........

<span class="hljs-selector-class">.ResponseHeader</span>
<span class="hljs-selector-tag">-</span> <span class="hljs-selector-class">.correlation_id</span> (<span class="hljs-number">1486810363</span>)
<span class="hljs-selector-tag">-</span> <span class="hljs-selector-class">.TAG_BUFFER</span>
<span class="hljs-selector-class">.ResponseBody</span>
<span class="hljs-selector-tag">-</span> <span class="hljs-selector-class">.throttle_time_ms</span> (<span class="hljs-number">0</span>)
<span class="hljs-selector-tag">-</span> <span class="hljs-selector-class">.topic</span><span class="hljs-selector-class">.length</span> (<span class="hljs-number">1</span>)
<span class="hljs-selector-tag">-</span> <span class="hljs-selector-class">.Topics</span><span class="hljs-selector-attr">[0]</span>
  <span class="hljs-selector-tag">-</span> <span class="hljs-selector-class">.error_code</span> (<span class="hljs-number">3</span>)
  <span class="hljs-selector-tag">-</span> <span class="hljs-selector-class">.name</span> (unknown-topic-saz)
  <span class="hljs-selector-tag">-</span> <span class="hljs-selector-class">.topic_id</span> (<span class="hljs-number">00000000</span>-<span class="hljs-number">0000</span>-<span class="hljs-number">0000</span>-<span class="hljs-number">0000</span>-<span class="hljs-number">000000000000</span>)
  <span class="hljs-selector-tag">-</span> <span class="hljs-selector-class">.is_internal</span> (false)
  <span class="hljs-selector-tag">-</span> <span class="hljs-selector-class">.num_partitions</span> (<span class="hljs-number">0</span>)
  <span class="hljs-selector-tag">-</span> <span class="hljs-selector-class">.topic_authorized_operations</span> (<span class="hljs-number">3576</span>)
  <span class="hljs-selector-tag">-</span> <span class="hljs-selector-class">.TAG_BUFFER</span>
<span class="hljs-selector-tag">-</span> <span class="hljs-selector-class">.next_cursor</span> (null)
<span class="hljs-selector-tag">-</span> <span class="hljs-selector-class">.TAG_BUFFER</span>
</code></pre>
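<p>As a cross-check, the response bytes above can be decoded by hand. The following is a minimal Python sketch, not a general Kafka client: it assumes the field layout shown in the dump above, assumes all tag buffers are empty, and handles only a topic with zero partitions and a null cursor. The helper names <code>read_uvarint</code> and <code>parse_response</code> are made up for this example.</p>

```python
import struct
import uuid

# Response bytes copied from the hexdump above (4-byte size prefix + 55 bytes).
RESPONSE = bytes.fromhex(
    "00 00 00 37 58 9e ec fb 00 00 00 00 00 02 00 03 "
    "12 75 6e 6b 6e 6f 77 6e 2d 74 6f 70 69 63 2d 73 "
    "61 7a 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "
    "00 00 00 01 00 00 0d f8 00 ff 00"
)


def read_uvarint(buf, pos):
    """Decode an unsigned varint: 7 payload bits per byte, low group first."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def parse_response(buf):
    # Fixed-width fields are big-endian: message size, then correlation id.
    size, correlation_id = struct.unpack_from(">ii", buf, 0)
    pos = 8
    _, pos = read_uvarint(buf, pos)             # header TAG_BUFFER (assumed empty)
    (throttle_time_ms,) = struct.unpack_from(">i", buf, pos)
    pos += 4
    array_len, pos = read_uvarint(buf, pos)     # compact array stores length + 1
    topics = []
    for _ in range(array_len - 1):
        (error_code,) = struct.unpack_from(">h", buf, pos)
        pos += 2
        name_len, pos = read_uvarint(buf, pos)  # compact string stores length + 1
        name = buf[pos:pos + name_len - 1].decode("utf-8")
        pos += name_len - 1
        topic_id = uuid.UUID(bytes=bytes(buf[pos:pos + 16]))
        pos += 16
        is_internal = buf[pos] != 0
        pos += 1
        partitions_len, pos = read_uvarint(buf, pos)
        if partitions_len - 1 != 0:
            raise NotImplementedError("partition entries are not decoded in this sketch")
        (authorized_ops,) = struct.unpack_from(">i", buf, pos)
        pos += 4
        _, pos = read_uvarint(buf, pos)         # topic TAG_BUFFER (assumed empty)
        topics.append({
            "error_code": error_code,
            "name": name,
            "topic_id": str(topic_id),
            "is_internal": is_internal,
            "num_partitions": partitions_len - 1,
            "topic_authorized_operations": authorized_ops,
        })
    if buf[pos] != 0xFF:                        # 0xff (-1) marks a null cursor
        raise NotImplementedError("non-null cursors are not decoded in this sketch")
    pos += 1
    _, pos = read_uvarint(buf, pos)             # body TAG_BUFFER (assumed empty)
    return {
        "size": size,
        "correlation_id": correlation_id,
        "throttle_time_ms": throttle_time_ms,
        "topics": topics,
        "bytes_consumed": pos,
    }


parsed = parse_response(RESPONSE)
```

<p>The decoded values match the field dump above: correlation id 1486810363, one topic named <code>unknown-topic-saz</code> with error code 3, and all 59 bytes consumed.</p>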
<p>Thank you for reading. Hope you learnt something new.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747632182208/6a680047-a766-4cf2-b920-0fd3ef369002.jpeg" alt class="image--center mx-auto" /></p>
<h3 id="heading-appendix">Appendix</h3>
<ul>
<li><p><a target="_blank" href="https://binspec.org/kafka-api-versions-request-v4?highlight=0-3">https://binspec.org/kafka-api-versions-request-v4</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/apache/kafka/blob/ce4940f9891a96819e54f8db097ce3824876e8e5/clients/src/main/resources/common/message/DescribeTopicPartitionsResponse.json">https://github.com/apache/kafka/blob/ce4940f9891a96819e54f8db097ce3824876e8e5/clients/src/main/resources/common/message/DescribeTopicPartitionsResponse.json</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Ensuring Exactly-Once execution at scale in Stateful Distributed Systems]]></title><description><![CDATA[Problem Statement
You have a task and a list of application instances that can execute it. After the task is completed, they must update the database with the result. How can we ensure that only one instance executes the task? We should focus on exac...]]></description><link>https://snehasishroy.com/ensuring-exactly-once-execution-at-scale-in-distributed-systems</link><guid isPermaLink="true">https://snehasishroy.com/ensuring-exactly-once-execution-at-scale-in-distributed-systems</guid><category><![CDATA[2Articles1Week]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[System Design]]></category><category><![CDATA[System Architecture]]></category><category><![CDATA[interview]]></category><dc:creator><![CDATA[Snehasish Roy]]></dc:creator><pubDate>Thu, 15 May 2025 10:52:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/Au6eR7Yg9CY/upload/318dc5c7c013eace49127ec831210e19.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-problem-statement">Problem Statement</h2>
<p>You have a task and a list of application instances that can execute it. After the task is completed, they must update the database with the result. How can we ensure that only one instance executes the task? We should focus on exactly-once execution of the task because each time the task runs it can have a side-effect — it might create a file, insert a row in the database, or perform an action that should happen only once.</p>
<p>Let's look at some of the options that are available to us.</p>
<h2 id="heading-leader-election">Leader election</h2>
<p>Perform a leader election amongst the instances using a coordination service built on a consensus protocol (e.g. ZooKeeper or etcd) and let only the elected leader execute the task.</p>
<h2 id="heading-lock-based-approach">Lock based Approach</h2>
<p>Any instance can choose the task, but before executing it, a distributed lock must be obtained using a unique ID with an external system (e.g., Zookeeper, Redis, or Aerospike). If an instance gets the lock, no other instance has it, making it safe to execute the task. The other instances will either proceed to the next available task or wait a random amount of time before trying again to get the lock on the same task.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Instead of taking a pessimistic lock, you can use optimistic concurrency control: a compare-and-swap (CAS) that succeeds only if the current term/generation/version still equals the expected value (0 for the first term). This allows a read-modify-write without holding a lock. Please refer to the documentation of your lock service for exact details.</div>
</div>
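<p>To make the compare-and-swap idea concrete, here is a minimal in-memory sketch. <code>VersionedStore</code> and <code>try_claim</code> are hypothetical names, not the API of any real lock service; a real store would perform the version check on the server side.</p>

```python
import threading


class VersionedStore:
    """Toy in-memory stand-in for a store that supports compare-and-swap."""

    def __init__(self):
        self._data = {}                  # key -> (version, value)
        self._mutex = threading.Lock()   # only makes this toy atomic in-process

    def get(self, key):
        with self._mutex:
            return self._data.get(key, (0, None))

    def compare_and_swap(self, key, expected_version, new_value):
        """Write new_value only if the stored version equals expected_version."""
        with self._mutex:
            version, _ = self._data.get(key, (0, None))
            if version != expected_version:
                return False
            self._data[key] = (version + 1, new_value)
            return True


def try_claim(store, task_id, instance):
    # Version 0 means nobody has claimed the task yet (the "first term").
    return store.compare_and_swap(task_id, 0, instance)


store = VersionedStore()
claimed_by_alice = try_claim(store, "task-42", "alice")
claimed_by_bob = try_claim(store, "task-42", "bob")
```

<p>Only the first caller sees version 0, so Alice claims the task and Bob's attempt fails without either of them ever holding a lock.</p>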

<p>So you think you are done?</p>
<p><img src="https://media3.giphy.com/media/v1.Y2lkPTc5MGI3NjExMG1tc2Y3MDZkNHNoMm9veGJhYWx1bmc3ZW5ncTFtanN5ajRma3Z0NyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/izywsUxDbxvdJbOSfy/giphy.gif" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-gotchas">Gotchas</h2>
<p>Let's dissect the edge cases in our previous approaches.</p>
<h2 id="heading-leader-election-1">Leader Election</h2>
<p>Performing leader election using a consistent core approach, like Zookeeper, can still have edge cases during network partitions.</p>
<p>Imagine that initially, an instance named Alice was chosen as the leader. Due to a network issue or a GC pause, Alice was disconnected from the Zookeeper cluster long enough for the cluster to promote another instance, Bob, as the leader.</p>
<p>Bob picks up the same task and decides to execute it. However, Alice, the previous leader, is still active and decides to execute the same task, resulting in the task being executed twice.</p>
<h2 id="heading-optimistic-approach">Lock based Approach</h2>
<p>If the instance holding the lock (Alice) dies, no other instance can pick up the task. This can be solved by attaching a TTL (lease period) to the lock, so that the lock is released automatically when the lease expires.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">The lock service must grant the lock and set its lease in a single atomic operation. If these were two separate operations, a client that dies after acquiring the lock but before setting the lease would leave the lock held forever.</div>
</div>

<p>Choosing a TTL means the job must be completed within that time frame. If the TTL is too short, the lock might expire before the task is finished. If the TTL is too long, it takes more time to recover when an instance fails. For example, if Alice goes down, no other instance can get the lock until the lease expires.</p>
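<p>The trade-off can be sketched with a toy lease lock. This is an illustration under assumptions, not a real lock service: <code>LeaseLock</code> and <code>FakeClock</code> are hypothetical, and the clock is injected so that expiry can be shown without sleeping. A real system would rely on an atomic primitive of the lock service instead, such as Redis's <code>SET key value NX PX milliseconds</code>.</p>

```python
class FakeClock:
    """Manually advanced clock so the example is deterministic."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class LeaseLock:
    """Toy lock whose grant and lease are set together in one step."""

    def __init__(self, clock):
        self._clock = clock
        self._holder = None
        self._expires_at = 0.0

    def try_acquire(self, owner, ttl_seconds):
        now = self._clock()
        if self._holder is not None and now < self._expires_at:
            return False                      # someone else holds a live lease
        self._holder = owner                  # grant the lock and ...
        self._expires_at = now + ttl_seconds  # ... set the lease together
        return True

    def holder(self):
        return self._holder if self._clock() < self._expires_at else None


clock = FakeClock()
lock = LeaseLock(clock)

alice_got_it = lock.try_acquire("alice", ttl_seconds=10)
clock.advance(5)
bob_early = lock.try_acquire("bob", ttl_seconds=10)   # lease still live
clock.advance(6)                                      # t = 11, Alice's lease expired
bob_late = lock.try_acquire("bob", ttl_seconds=10)
```

<p>If Alice crashes right after acquiring the lock, Bob is blocked only until the lease runs out, which is exactly why a long TTL slows down recovery.</p>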
<p>There is still a possible issue where Alice thinks it has the lock, but its lease has expired and another instance, Bob, has taken the lock. This can occur if Alice is delayed by external factors (like GC pauses or page swaps) and its lease expires. When Alice resumes, it still believes it holds the lock and proceeds to execute the task.</p>
<p><img src="https://martin.kleppmann.com/2016/02/unsafe-lock.png" alt="Unsafe access to a resource protected by a distributed lock" /></p>
<blockquote>
<p><a target="_blank" href="https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html">Source</a></p>
</blockquote>
<p>Based on the above examples, it's evident that we need another guardrail that can prevent a task from being executed twice.</p>
<p><img src="https://media2.giphy.com/media/v1.Y2lkPTc5MGI3NjExcjYwMGhxNGd0NW5vMnR0ejZmczk0bWh1am1zeXpvdzhmZjU2M283aSZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/jYLXtvRmXDn7uEtBZE/giphy.gif" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-solution">Solution?</h2>
<p>The problem with our earlier methods was the absence of safeguards during task execution. While a task was running, there were no checks to make sure it was safe to proceed.</p>
<p>What could some of these safeguards be?</p>
<h2 id="heading-idempotency-unique-key">Idempotency / Unique Key</h2>
<p>If the side effect introduced during task execution has an idempotency or unique key associated with it, it becomes easier to detect whether something has already happened or is currently being executed.</p>
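<p>One common way to get this behaviour is a uniqueness constraint on the record of the side effect. The sketch below uses an in-memory SQLite database purely for illustration; the table name <code>task_results</code> and the helper <code>record_result</code> are assumptions made for this example.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The primary key acts as the idempotency key: one row per task, ever.
conn.execute("CREATE TABLE task_results (task_id TEXT PRIMARY KEY, result TEXT)")


def record_result(conn, task_id, result):
    """Return True if this call recorded the result, False if one already existed."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO task_results (task_id, result) VALUES (?, ?)",
                (task_id, result),
            )
        return True
    except sqlite3.IntegrityError:
        # Another instance already recorded a result for this task.
        return False


first = record_result(conn, "task-42", "done by alice")
second = record_result(conn, "task-42", "done by bob")
rows = conn.execute("SELECT task_id, result FROM task_results").fetchall()
```

<p>The second insert is rejected by the database itself, so a duplicate execution is detected even if both instances believed they held the lock.</p>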
<blockquote>
<p>If only life were so simple.</p>
</blockquote>
<p>Sometimes, it's not possible to have a unique key linked to the side effect. Why? Because it might be intentional. For example, in an HBase Cluster, the HBase master is in charge of assigning regions to a region server. If the HBase master believes a Region Server is down, it assigns that region (shard) to a new Region Server. Clients check the Region assignment mapping and redirect their read/write requests to the correct Region Server. Everything works fine as long as there is only one HBase master. In a split-brain scenario or network partitioning, two nodes might both think they are the Master and could perform Region reassignment tasks. This can severely affect data consistency. In this case, it's hard to assign a unique key to the region assignment task.</p>
<p><img src="https://media0.giphy.com/media/v1.Y2lkPTc5MGI3NjExZmc1aGw1bDZucXNydG5yYTV0dWl6YTIxeDd4b2RkeTk3MzAwczBlcyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/FcuiZUneg1YRAu1lH2/giphy.gif" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-fencing-to-the-rescue">Fencing to the Rescue</h2>
<p>Fencing is the process of isolating a cluster node or protecting shared resources when a node seems to be malfunctioning.</p>
<p>Previously, we relied on clients to behave correctly, assuming they wouldn't issue tasks unless they were the leaders. However, this assumption led to many edge cases in our design. To address this, we should add a check during task execution. Before running the task, the task execution service itself must verify if the caller has the authority to perform it.</p>
<p>How do we check this? The caller presents a token, which the service validates against either the <em>latest issued token</em> (a strongly consistent check that consults the lock service) or the <em>most recent token seen by the storage engine so far</em> (a weaker, local check that only rejects tokens older than one it has already seen).</p>
<p><img src="https://martin.kleppmann.com/2016/02/fencing-tokens.png" alt="Fencing tokens making access to a lock-protected resource safe" /></p>
<blockquote>
<p><a target="_blank" href="https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html">Source</a></p>
</blockquote>
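<p>The second variant, checking against the highest token seen so far, fits in a few lines. This is a sketch under assumptions: <code>LockService</code> and <code>FencedStorage</code> are hypothetical in-memory classes, and the tokens are plain increasing integers.</p>

```python
class LockService:
    """Hands out a strictly increasing fencing token with every lock grant."""

    def __init__(self):
        self._last_token = 0

    def grant(self):
        self._last_token += 1
        return self._last_token


class FencedStorage:
    """Rejects any write whose token is older than one it has already seen."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            return False              # stale holder: fenced off
        self.highest_token = token
        self.data[key] = value
        return True


locks = LockService()
storage = FencedStorage()

alice_token = locks.grant()           # Alice gets the lock, then stalls (e.g. GC pause)
bob_token = locks.grant()             # lease expires, Bob gets the lock
bob_write = storage.write(bob_token, "result", "written by bob")
alice_write = storage.write(alice_token, "result", "written by alice")
```

<p>Alice still believes she holds the lock, but the storage layer no longer has to trust her: her older token is enough to reject the write.</p>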
<p>Let’s look at some practical examples which will help understand this concept as it can get quite tricky.</p>
<h2 id="heading-split-brain-issue-in-kafka-cluster">Split Brain Issue in Kafka Cluster</h2>
<p>In a Kafka cluster, there are multiple brokers, of which a single broker is selected as the controller of the entire cluster. Kafka historically used ZooKeeper as its consensus core. (Newer versions use KRaft, a Raft-based protocol, but the fundamentals remain the same.)</p>
<p><img src="https://portworx.com/wp-content/uploads/2024/10/Screen-Shot-2024-10-08-at-6.09.14-PM.png" alt="How to Deploy Kafka on Kubernetes | K8s Devops | Portworx" /></p>
<blockquote>
<p><a target="_blank" href="https://portworx.com/kafka-kubernetes/">Source</a></p>
</blockquote>
<p>What happens if the controller broker dies? One of the remaining brokers must be promoted to controller.</p>
<p>Keep in mind that you cannot truly know whether a broker has stopped for good or has experienced an intermittent failure. Nevertheless, the cluster has to move on and pick a new controller. This can lead to a <strong>zombie controller</strong>. A zombie controller can be defined as a controller node which had been deemed dead by the cluster and has come back online. Another broker has taken its place but the zombie controller might not know that yet.</p>
<p>This can easily happen. For example, if a nasty intermittent network partition happens or a controller has a long enough stop-the-world GC pause — the cluster will think it has died and pick a new controller. In the GC scenario, nothing has changed through the eyes of the original controller. The broker does not even know it was paused, much less that the cluster moved on without it. Because of that, it will continue acting as if it is the current controller. This is a common scenario in distributed systems and is called <a target="_blank" href="http://techthoughts.typepad.com/managing_computers/2007/10/split-brain-quo.html?ref=snehasishroy.com">split-brain</a>.</p>
<p>Let’s go through an example. Imagine the active controller (Broker 3) really does go into a long stop-the-world GC pause. Its ZooKeeper session expires and the ephemeral <code>/controller</code> znode it registered is deleted. Every other broker in the cluster is notified of this because they placed ZooKeeper watches on that znode.</p>
<p><img src="https://hackernoon.imgix.net/hn-images/1*qaLzEy0SKCIv0O9eQR_E9g.png?w=800" alt="image" /></p>
<blockquote>
<p><a target="_blank" href="https://hackernoon.com/apache-kafkas-distributed-system-firefighter-the-controller-broker-1afca1eae302">Source</a></p>
</blockquote>
<p>To fix the controller-less cluster, every broker now tries to become the new controller itself. Let’s say Broker 2 won the race and became the new controller by creating the <code>/controller</code> znode first.</p>
<p>Every broker receives a notification that this znode was created and now knows who the latest leader is: Broker 2. Every broker, that is, except Broker 3, which is still in a GC pause. It is also possible that this notification never reaches it for one reason or another (e.g., the OS has too many accepted connections awaiting processing and drops it). In the end, the information about the leadership change does not reach Broker 3.</p>
<p><img src="https://hackernoon.imgix.net/hn-images/1*sAiAkI8rz8W3IqW13no-Yw.png?w=800" alt="image" /></p>
<blockquote>
<p><a target="_blank" href="https://hackernoon.com/apache-kafkas-distributed-system-firefighter-the-controller-broker-1afca1eae302">Source</a></p>
</blockquote>
<p>Broker 3’s garbage collection pause will eventually finish and it will wake up still thinking it is in charge. Remember, nothing has changed through its eyes.</p>
<p><img src="https://hackernoon.imgix.net/hn-images/1*TSMvyzsTS2DDzOiKUhZS8w.png?w=800" alt="image" /></p>
<blockquote>
<p><a target="_blank" href="https://hackernoon.com/apache-kafkas-distributed-system-firefighter-the-controller-broker-1afca1eae302">Source</a></p>
</blockquote>
<p>You now have two controllers which will be giving out potentially conflicting commands in parallel. This is something you do not want in your cluster. If not handled, it can result in critical failures.</p>
<p>If Broker 2 (new controller node) receives a request from Broker 3, how will it know whether Broker 3 is the newest controller or not? For all Broker 2 knows, the same GC pause might have happened to it too!</p>
<p>There needs to be a way to distinguish who the real, current controller of the cluster is.</p>
<p>There is such a way! It is done through the use of an <strong><em>epoch number</em></strong> (also called a fencing token). An epoch number is simply a monotonically increasing number — if the old leader had an epoch number of 1, the new one will have 2. Brokers can now easily differentiate the real controller by simply trusting the controller with the highest number. The controller with the highest number is surely the latest one, since the epoch number is always-increasing. This epoch number is stored in ZooKeeper.</p>
<p><img src="https://hackernoon.imgix.net/hn-images/1*cHsPiCtJoNsTfKV4k7DNGQ.png?w=800" alt="image" /></p>
<blockquote>
<p><a target="_blank" href="https://hackernoon.com/apache-kafkas-distributed-system-firefighter-the-controller-broker-1afca1eae302">Source</a></p>
</blockquote>
<p>Here, Broker 1 stores the latest <code>controllerEpoch</code> it has seen and ignores all requests from controllers with a previous epoch number.</p>
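<p>The broker-side check can be sketched as follows. This is a simplified model, not Kafka's actual request handling: the <code>Broker</code> and <code>ControllerRequest</code> classes, the command strings, and the returned status strings are all made up for illustration.</p>

```python
from dataclasses import dataclass


@dataclass
class ControllerRequest:
    controller_id: int
    controller_epoch: int
    command: str


class Broker:
    """Remembers the highest controller epoch seen and ignores older ones."""

    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.controller_epoch = 0
        self.applied = []

    def handle(self, request):
        if request.controller_epoch < self.controller_epoch:
            return "REJECTED_STALE_EPOCH"     # request from a zombie controller
        self.controller_epoch = request.controller_epoch
        self.applied.append(request.command)
        return "OK"


broker1 = Broker(broker_id=1)

# Broker 3 is the original controller (epoch 1), Broker 2 the new one (epoch 2).
r1 = broker1.handle(ControllerRequest(3, 1, "make partition A leader"))
r2 = broker1.handle(ControllerRequest(2, 2, "make partition A follower"))
# Broker 3 wakes up from its GC pause and keeps sending commands with epoch 1.
r3 = broker1.handle(ControllerRequest(3, 1, "make partition A leader"))
```

<p>Broker 1 never needs to know <em>why</em> Broker 3 is behind; comparing epochs is enough to discard the zombie's command.</p>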
<p>Let’s look at the Kafka code directly from the <a target="_blank" href="https://github.com/apache/kafka/blob/ebb1d6e21cc9213071ee1c6a15ec3411fc215b81/core/src/main/scala/kafka/zk/KafkaZkClient.scala#L108">source</a>:</p>
<pre><code class="lang-scala"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">registerControllerAndIncrementControllerEpoch</span></span>(controllerId: <span class="hljs-type">Int</span>): (<span class="hljs-type">Int</span>, <span class="hljs-type">Int</span>) = {
    <span class="hljs-keyword">val</span> timestamp = time.milliseconds()

    <span class="hljs-comment">// Read /controller_epoch to get the current controller epoch and zkVersion,</span>
    <span class="hljs-comment">// create /controller_epoch with initial value if not exists</span>
    <span class="hljs-keyword">val</span> (curEpoch, curEpochZkVersion) = getControllerEpoch
      .map(e =&gt; (e._1, e._2.getVersion))
      .getOrElse(maybeCreateControllerEpochZNode())

    <span class="hljs-comment">// Create /controller and update /controller_epoch atomically</span>
    <span class="hljs-keyword">val</span> newControllerEpoch = curEpoch + <span class="hljs-number">1</span>
    <span class="hljs-keyword">val</span> expectedControllerEpochZkVersion = curEpochZkVersion

    debug(<span class="hljs-string">s"Try to create <span class="hljs-subst">${ControllerZNode.path}</span> and increment controller epoch to <span class="hljs-subst">$newControllerEpoch</span> with expected controller epoch zkVersion <span class="hljs-subst">$expectedControllerEpochZkVersion</span>"</span>)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">checkControllerAndEpoch</span></span>(): (<span class="hljs-type">Int</span>, <span class="hljs-type">Int</span>) = {
      <span class="hljs-keyword">val</span> curControllerId = getControllerId.getOrElse(<span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-type">ControllerMovedException</span>(
        <span class="hljs-string">s"The ephemeral node at <span class="hljs-subst">${ControllerZNode.path}</span> went away while checking whether the controller election succeeds. Aborting controller startup procedure"</span>))
      <span class="hljs-keyword">if</span> (controllerId == curControllerId) {
        <span class="hljs-keyword">val</span> (epoch, stat) = getControllerEpoch.getOrElse(
          <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-type">IllegalStateException</span>(<span class="hljs-string">s"<span class="hljs-subst">${ControllerEpochZNode.path}</span> existed before but goes away while trying to read it"</span>))

        <span class="hljs-comment">// If the epoch is the same as newControllerEpoch, it is safe to infer that the </span>
        <span class="hljs-comment">// returned epoch zkVersion is associated with the current broker during </span>
        <span class="hljs-comment">// controller election because we already knew that the zk</span>
        <span class="hljs-comment">// transaction succeeds based on the controller znode verification. Other rounds of controller</span>
        <span class="hljs-comment">// election will result in larger epoch number written in zk.</span>
        <span class="hljs-keyword">if</span> (epoch == newControllerEpoch)
          <span class="hljs-keyword">return</span> (newControllerEpoch, stat.getVersion)
      }
      <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-type">ControllerMovedException</span>(<span class="hljs-string">"Controller moved to another broker. Aborting controller startup procedure"</span>)
    }

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">tryCreateControllerZNodeAndIncrementEpoch</span></span>(): (<span class="hljs-type">Int</span>, <span class="hljs-type">Int</span>) = {
      <span class="hljs-keyword">val</span> response = retryRequestUntilConnected(
        <span class="hljs-type">MultiRequest</span>(<span class="hljs-type">Seq</span>(
          <span class="hljs-type">CreateOp</span>(<span class="hljs-type">ControllerZNode</span>.path, <span class="hljs-type">ControllerZNode</span>.encode(controllerId, timestamp), 
defaultAcls(<span class="hljs-type">ControllerZNode</span>.path), <span class="hljs-type">CreateMode</span>.<span class="hljs-type">EPHEMERAL</span>),
          <span class="hljs-type">SetDataOp</span>(<span class="hljs-type">ControllerEpochZNode</span>.path, <span class="hljs-type">ControllerEpochZNode</span>.encode(newControllerEpoch), 
expectedControllerEpochZkVersion)))
      )
      response.resultCode <span class="hljs-keyword">match</span> {
        <span class="hljs-keyword">case</span> <span class="hljs-type">Code</span>.<span class="hljs-type">NODEEXISTS</span> | <span class="hljs-type">Code</span>.<span class="hljs-type">BADVERSION</span> =&gt; checkControllerAndEpoch()
        <span class="hljs-keyword">case</span> <span class="hljs-type">Code</span>.<span class="hljs-type">OK</span> =&gt;
          <span class="hljs-keyword">val</span> setDataResult = response.zkOpResults(<span class="hljs-number">1</span>).rawOpResult.asInstanceOf[<span class="hljs-type">SetDataResult</span>]
          (newControllerEpoch, setDataResult.getStat.getVersion)
        <span class="hljs-keyword">case</span> code =&gt; <span class="hljs-keyword">throw</span> <span class="hljs-type">KeeperException</span>.create(code)
      }
    }

    tryCreateControllerZNodeAndIncrementEpoch()
  }

  <span class="hljs-keyword">private</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">maybeCreateControllerEpochZNode</span></span>(): (<span class="hljs-type">Int</span>, <span class="hljs-type">Int</span>) = {
    createControllerEpochRaw(<span class="hljs-type">KafkaController</span>.<span class="hljs-type">InitialControllerEpoch</span>).resultCode <span class="hljs-keyword">match</span> {
      <span class="hljs-keyword">case</span> <span class="hljs-type">Code</span>.<span class="hljs-type">OK</span> =&gt;
        info(<span class="hljs-string">s"Successfully created <span class="hljs-subst">${ControllerEpochZNode.path}</span> with initial epoch <span class="hljs-subst">${KafkaController.InitialControllerEpoch}</span>"</span>)
        (<span class="hljs-type">KafkaController</span>.<span class="hljs-type">InitialControllerEpoch</span>, <span class="hljs-type">KafkaController</span>.<span class="hljs-type">InitialControllerEpochZkVersion</span>)
      <span class="hljs-keyword">case</span> <span class="hljs-type">Code</span>.<span class="hljs-type">NODEEXISTS</span> =&gt;
        <span class="hljs-keyword">val</span> (epoch, stat) = getControllerEpoch.getOrElse(<span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-type">IllegalStateException</span>(<span class="hljs-string">s"<span class="hljs-subst">${ControllerEpochZNode.path}</span> existed before but goes away while trying to read it"</span>))
        (epoch, stat.getVersion)
      <span class="hljs-keyword">case</span> code =&gt;
        <span class="hljs-keyword">throw</span> <span class="hljs-type">KeeperException</span>.create(code)
    }
  }


  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">getControllerEpoch</span></span>: <span class="hljs-type">Option</span>[(<span class="hljs-type">Int</span>, <span class="hljs-type">Stat</span>)] = {
    <span class="hljs-keyword">val</span> getDataRequest = <span class="hljs-type">GetDataRequest</span>(<span class="hljs-type">ControllerEpochZNode</span>.path)
    <span class="hljs-keyword">val</span> getDataResponse = retryRequestUntilConnected(getDataRequest)
    getDataResponse.resultCode <span class="hljs-keyword">match</span> {
      <span class="hljs-keyword">case</span> <span class="hljs-type">Code</span>.<span class="hljs-type">OK</span> =&gt;
        <span class="hljs-keyword">val</span> epoch = <span class="hljs-type">ControllerEpochZNode</span>.decode(getDataResponse.data)
        <span class="hljs-type">Option</span>(epoch, getDataResponse.stat)
      <span class="hljs-keyword">case</span> <span class="hljs-type">Code</span>.<span class="hljs-type">NONODE</span> =&gt; <span class="hljs-type">None</span>
      <span class="hljs-keyword">case</span> _ =&gt; <span class="hljs-keyword">throw</span> getDataResponse.resultException.get
    }
  }
</code></pre>
<h2 id="heading-key-implementation-details">Key Implementation Details</h2>
<h3 id="heading-atomic-operations">Atomic Operations</h3>
<p>The <code>tryCreateControllerZNodeAndIncrementEpoch()</code> method uses ZooKeeper's <code>MultiRequest</code> to perform two critical operations atomically: either both succeed or both fail.</p>
<ol>
<li><p>Creating a controller znode indicating this broker is the controller <code>CreateOp()</code></p>
</li>
<li><p>Incrementing the controller epoch to prevent split-brain scenarios <code>SetDataOp()</code></p>
</li>
</ol>
<h3 id="heading-optimistic-concurrency">Optimistic Concurrency</h3>
<p><code>SetDataOp()</code> in <code>tryCreateControllerZNodeAndIncrementEpoch()</code> uses the ZooKeeper version number (<code>expectedControllerEpochZkVersion</code>) to ensure that the epoch is only updated if no one else has modified it since it was read - this is a classic example of optimistic updates.</p>
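<p>The version check can be illustrated with a small in-memory stand-in. This is only a sketch: <code>VersionedNode</code> and <code>Snapshot</code> are hypothetical names made up for this example and are not part of the ZooKeeper or Kafka APIs. It shows two would-be controllers reading the same version, after which only the first conditional write succeeds.</p>
<pre><code class="lang-java">// Hypothetical in-memory stand-in for a versioned znode (not the ZooKeeper API).
record Snapshot(int epoch, int version) {}

final class VersionedNode {
  private int epoch;
  private int version; // bumped on every successful write, like a znode's data version

  VersionedNode(int initialEpoch) {
    this.epoch = initialEpoch;
    this.version = 0;
  }

  synchronized Snapshot read() {
    return new Snapshot(epoch, version);
  }

  /** Writes only if expectedVersion is still the current version. */
  synchronized boolean setData(int newEpoch, int expectedVersion) {
    if (expectedVersion != version) {
      return false; // someone else wrote first, the caller must re-read
    }
    epoch = newEpoch;
    version++;
    return true;
  }
}

final class OptimisticUpdateDemo {
  public static void main(String[] args) {
    VersionedNode epochNode = new VersionedNode(5);
    Snapshot seenByA = epochNode.read();
    Snapshot seenByB = epochNode.read(); // B reads the same version as A

    boolean aUpdated = epochNode.setData(seenByA.epoch() + 1, seenByA.version());
    boolean bUpdated = epochNode.setData(seenByB.epoch() + 1, seenByB.version());

    System.out.println("A updated: " + aUpdated + ", B updated: " + bUpdated);
  }
}
</code></pre>
<p>The losing writer does not overwrite anything. It has to read the node again and decide whether its update still makes sense, which is the behaviour the version argument of <code>SetDataOp()</code> relies on.</p>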
<h3 id="heading-ephemeral-nodes">Ephemeral Nodes</h3>
<p>The controller znode is created as an ephemeral node (<code>CreateMode.EPHEMERAL</code>), which means it will be automatically deleted if the controller broker's ZooKeeper session expires - critical for automatic failure detection.</p>
<hr />
<h2 id="heading-takeaways">Takeaways</h2>
<p>If you have read this far, I appreciate your patience. I hope you learnt something new today. Thank you for reading.</p>
<p>To support exactly-once execution, performing leader election or taking a lock isn’t enough: there should be additional checks at the time of task execution to detect whether it’s safe to actually execute the task. A fencing token is one such check, but generating one isn’t straightforward.</p>
<p>Please feel free to ask any questions you might have in the comments.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747306124476/9f44a7e6-588e-4898-bc4c-d89cd336531a.gif" alt class="image--center mx-auto" /></p>
<hr />
<h3 id="heading-references">References</h3>
<ul>
<li><p><a target="_blank" href="https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html">https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html</a></p>
</li>
<li><p><a target="_blank" href="https://hackernoon.com/apache-kafkas-distributed-system-firefighter-the-controller-broker-1afca1eae302">https://hackernoon.com/apache-kafkas-distributed-system-firefighter-the-controller-broker-1afca1eae302</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/apache/kafka/blob/ebb1d6e21cc9213071ee1c6a15ec3411fc215b81/core/src/main/scala/kafka/zk/KafkaZkClient.scala#L108">Kafka Source Code</a></p>
</li>
</ul>
<hr />
]]></content:encoded></item><item><title><![CDATA[Low-Level System Design: Let's build a Distributed Task Scheduler]]></title><description><![CDATA[What do we want?
We want to create a Task Execution Service that can execute tasks in a fault-tolerant manner. The service should dynamically discover workers and assign tasks to them. If a worker dies while executing a task, the service should be ab...]]></description><link>https://snehasishroy.com/build-a-distributed-task-scheduler-using-zookeeper</link><guid isPermaLink="true">https://snehasishroy.com/build-a-distributed-task-scheduler-using-zookeeper</guid><category><![CDATA[System Architecture]]></category><category><![CDATA[System Design]]></category><category><![CDATA[backend]]></category><category><![CDATA[Databases]]></category><category><![CDATA[2Articles1Week]]></category><dc:creator><![CDATA[Snehasish Roy]]></dc:creator><pubDate>Sun, 24 Mar 2024 16:45:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/YcO98VqQlnA/upload/0698226273214cb1e693038ff8a852e3.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-what-do-we-want">What do we want?</h3>
<p>We want to create a Task Execution Service that can execute tasks in a fault-tolerant manner. The service should dynamically discover workers and assign tasks to them. If a worker dies while executing a task, the service should detect that the worker has died (<em>or stopped responding</em>) and reassign the task to a new worker, providing an at-least-once guarantee for job execution. All of this should be done in a highly scalable and distributed manner.</p>
<p>This will be a <em>hands-on guide</em> on implementing a distributed job execution service - so get your coffee mug ready.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">The code used in this article can be found here <a target="_blank" href="https://github.com/snehasishroy/TaskScheduler">https://github.com/snehasishroy/TaskScheduler</a></div>
</div>

<h3 id="heading-architecture">Architecture</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1711297357035/783b1e1d-a0d2-44be-b1a9-ccb9b8777a71.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p>Clients submit job details to Zookeeper and listen for status updates, again via Zookeeper.</p>
</li>
<li><p>Multiple worker instances use Zookeeper to perform leader election.</p>
</li>
<li><p>The leader instance watches the jobs path for incoming jobs and assigns them to the available workers.</p>
</li>
<li><p>Each worker instance watches its assignment mapping path. When a new job is found, it is executed and its completion status is updated.</p>
</li>
<li><p>Client instances are notified by Zookeeper upon task completion.</p>
</li>
</ul>
<h3 id="heading-zookeeper">Zookeeper</h3>
<p><strong>Z</strong>ookeeper is a robust service that aims to <strong>deliver coordination among distributed systems.</strong> It's widely used in open-source projects like Kafka and HBase as the central coordination service.</p>
<p>We will use CuratorFramework in our project as it provides high-level APIs for interacting with Zookeeper.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">This blog won't deep dive into the internals of Zookeeper. Readers are expected to know the basics of Zookeeper before proceeding to the implementation part.</div>
</div>

<h3 id="heading-implementation-details">Implementation Details</h3>
<p>Let's look at the <code>Client</code> resource class, which provides a facade for task submission.</p>
<pre><code class="lang-java"><span class="hljs-meta">@Slf4j</span>
<span class="hljs-meta">@Path("/v1/client")</span>
<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Client</span> </span>{
  <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> ClientService clientService;

  <span class="hljs-meta">@Inject</span>
  <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">Client</span><span class="hljs-params">(CuratorFramework curator)</span> </span>{
    <span class="hljs-keyword">this</span>.clientService = <span class="hljs-keyword">new</span> ClientService(curator);
  }

  <span class="hljs-meta">@POST</span>
  <span class="hljs-function"><span class="hljs-keyword">public</span> String <span class="hljs-title">createSumTask</span><span class="hljs-params">(<span class="hljs-meta">@QueryParam("first")</span> <span class="hljs-keyword">int</span> a, <span class="hljs-meta">@QueryParam("second")</span> <span class="hljs-keyword">int</span> b)</span> </span>{
    Runnable jobDetail =
        (Runnable &amp; Serializable)
            (() -&gt; System.out.println(<span class="hljs-string">"Sum of "</span> + a + <span class="hljs-string">" and "</span> + b + <span class="hljs-string">" is "</span> + (a + b)));
    <span class="hljs-keyword">return</span> clientService.registerJob(jobDetail);
  }
}
</code></pre>
<p>The above code allows clients to submit a sample runnable task that computes the sum of two numbers and prints it, which makes it easy to supply input via Swagger. The design is extensible, though: the client can submit any <code>Runnable</code> that is also <code>Serializable</code> as a job.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Instead of providing a <code>Runnable</code>, we could have designed our service to work with a <code>Dockerfile</code>, leading to a generic task execution system! However, we wanted to focus only on Zookeeper in this article.</div>
</div>

<p>Now let's look at the <code>ClientService</code>:</p>
<pre><code class="lang-java"><span class="hljs-meta">@Slf4j</span>
<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ClientService</span> </span>{
  <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> CuratorFramework curator;

  <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">ClientService</span><span class="hljs-params">(CuratorFramework curator)</span> </span>{
    <span class="hljs-keyword">this</span>.curator = curator;
  }

  <span class="hljs-function"><span class="hljs-keyword">public</span> String <span class="hljs-title">registerJob</span><span class="hljs-params">(Runnable jobDetail)</span> </span>{
    String jobId = UUID.randomUUID().toString();
    syncCreate(ZKUtils.getJobsPath() + <span class="hljs-string">"/"</span> + jobId, jobDetail);
    <span class="hljs-keyword">return</span> jobId;
  }

  <span class="hljs-function"><span class="hljs-keyword">private</span> <span class="hljs-keyword">void</span> <span class="hljs-title">syncCreate</span><span class="hljs-params">(String path, Runnable runnable)</span> </span>{
    <span class="hljs-comment">// create the ZNode along with the Runnable instance as data</span>
    <span class="hljs-keyword">try</span> {
      ByteArrayOutputStream byteArrayOutputStream = <span class="hljs-keyword">new</span> ByteArrayOutputStream();
      ObjectOutputStream objectOutputStream = <span class="hljs-keyword">new</span> ObjectOutputStream(byteArrayOutputStream);
      objectOutputStream.writeObject(runnable);
      curator
          .create()
          .idempotent()
          .withMode(CreateMode.PERSISTENT)
          .forPath(path, byteArrayOutputStream.toByteArray());
    } <span class="hljs-keyword">catch</span> (Exception e) {
      log.error(<span class="hljs-string">"Unable to create {}"</span>, path, e);
      <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> RuntimeException(e);
    }
  }
}
</code></pre>
<p>Once a job is registered, a unique ID is assigned to it and a <strong>Persistent</strong> node is created in Zookeeper with the randomly generated job ID in the path, e.g. <code>/jobs/{job-id}</code>. Notice that the <code>runnable</code> is serialized to a byte array and stored directly in the ZNode.</p>
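<p>To make the serialization step concrete, here is a minimal round trip using only the JDK. It is a sketch, not the project's code: <code>RunnableCodec</code>, <code>sumJob</code> and <code>LAST_SUM</code> are hypothetical names introduced for this example, and the static <code>LAST_SUM</code> field exists only so that the effect of the deserialized job can be observed.</p>
<pre><code class="lang-java">import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicInteger;

final class RunnableCodec {
  // Hypothetical sink so the demo can observe that the deserialized job really ran.
  static final AtomicInteger LAST_SUM = new AtomicInteger();

  /** Turns a job into the byte array that would be stored as ZNode data. */
  static byte[] serialize(Runnable job) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(job); // fails with NotSerializableException unless the job is Serializable
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Rebuilds the job from the bytes read back from the ZNode. */
  static Runnable deserialize(byte[] data) {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
      return (Runnable) in.readObject();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException("Job class is not available to the reader", e);
    }
  }

  /** Same shape as the job built in createSumTask: a lambda cast to Runnable and Serializable. */
  static Runnable sumJob(int a, int b) {
    return (Runnable &amp; Serializable) () -&gt; LAST_SUM.set(a + b);
  }

  public static void main(String[] args) {
    byte[] payload = serialize(sumJob(2, 3));
    deserialize(payload).run();
    System.out.println("Sum computed by the deserialized job: " + LAST_SUM.get());
  }
}
</code></pre>
<p>One consequence of this approach is that whoever deserializes the job needs the class that defined the lambda on its classpath, so it fits best when clients and workers are built from the same codebase.</p>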
<p>Notice that we are creating the ZNode <em>synchronously</em>, i.e. the function <code>syncCreate</code> blocks until the ZNode is created. In later sections, you will see that we use asynchronous operations to improve throughput.</p>
<p>Why are we creating paths? So that we can set up <em>watches</em> on it. Watches allow us to be notified of any changes under the watched path. Zookeeper will invoke the <code>JobsListener</code> whenever a new node is <em>created or destroyed</em> under the <code>/jobs</code> path.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">What would happen if the client is disconnected from the Zookeeper when a new job is registered? In such cases, the watch won't be triggered and the client won't be notified. The Curator will automatically attempt to recreate the watches upon reconnection.</div>
</div>

<pre><code class="lang-java"><span class="hljs-meta">@Slf4j</span>
<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">JobsListener</span> <span class="hljs-keyword">implements</span> <span class="hljs-title">CuratorCacheListener</span> </span>{
  <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> CuratorFramework curator;
  <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> CuratorCache workersCache;
  <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> ExecutorService executorService;
  <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> WorkerPickerStrategy workerPickerStrategy;

  <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">JobsListener</span><span class="hljs-params">(
      CuratorFramework curator,
      CuratorCache workersCache,
      WorkerPickerStrategy workerPickerStrategy)</span> </span>{
    <span class="hljs-keyword">this</span>.curator = curator;
    <span class="hljs-keyword">this</span>.workersCache = workersCache;
    executorService = Executors.newSingleThreadExecutor();
    <span class="hljs-keyword">this</span>.workerPickerStrategy = workerPickerStrategy;
  }

  <span class="hljs-meta">@Override</span>
  <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">event</span><span class="hljs-params">(Type type, ChildData oldData, ChildData data)</span> </span>{
    <span class="hljs-keyword">if</span> (type == Type.NODE_CREATED &amp;&amp; data.getPath().length() &gt; ZKUtils.JOBS_ROOT.length()) {
      String jobID = ZKUtils.extractNode(data.getPath());
      log.info(<span class="hljs-string">"found new job {}, passing it to executor service"</span>, jobID);
      <span class="hljs-comment">// an executor service is used in order to avoid blocking the watcher thread as the job</span>
      <span class="hljs-comment">// execution can be time consuming</span>
      <span class="hljs-comment">// and we don't want to skip handling new jobs during that time</span>
      executorService.submit(
          <span class="hljs-keyword">new</span> JobAssigner(jobID, data.getData(), curator, workersCache, workerPickerStrategy));
    }
  }
}
</code></pre>
<p>When a new job is found, we hand over the Job ID to a different thread because we don't want to block the watcher thread.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">All ZooKeeper watchers are serialized and processed by a single thread. Thus, no other watchers can be processed while your watcher is running. Hence it's vital not to block the watcher thread. <a target="_blank" href="https://cwiki.apache.org/confluence/display/CURATOR/TN1">https://cwiki.apache.org/confluence/display/CURATOR/TN1</a></div>
</div>
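<p>The hand-off described above can be sketched without Zookeeper at all. In the example below, a single-threaded executor named <code>watcher</code> stands in for the ZooKeeper event thread (an assumption made purely for illustration), and <code>WatcherHandOffDemo</code> is a hypothetical class name. The callback running on the watcher only enqueues work, while the actual handling runs on a separate pool.</p>
<pre><code class="lang-java">import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class WatcherHandOffDemo {
  /** Returns "threadName:jobId" for every job, showing where the slow work ran. */
  static List&lt;String&gt; dispatch(List&lt;String&gt; jobIds) {
    List&lt;String&gt; handledBy = new CopyOnWriteArrayList&lt;&gt;();
    // Stand-in for the single ZooKeeper event thread.
    ExecutorService watcher = Executors.newSingleThreadExecutor(r -&gt; new Thread(r, "watcher"));
    ExecutorService workers = Executors.newFixedThreadPool(2);
    try {
      for (String jobId : jobIds) {
        // The callback only enqueues work, so the watcher is free for the next event.
        watcher.execute(
            () -&gt; workers.execute(
                () -&gt; handledBy.add(Thread.currentThread().getName() + ":" + jobId)));
      }
      watcher.shutdown(); // previously submitted callbacks still run
      watcher.awaitTermination(5, TimeUnit.SECONDS);
      workers.shutdown();
      workers.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return handledBy;
  }

  public static void main(String[] args) {
    System.out.println(dispatch(List.of("job-1", "job-2", "job-3")));
  }
}
</code></pre>
<p>None of the recorded entries carries the <code>watcher</code> thread name, which is the property we want: slow work never occupies the thread that delivers events.</p>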

<p>We are setting up the watcher using <code>CuratorCache</code> - which will be explained later on.</p>
<hr />
<h3 id="heading-jobassigner">JobAssigner</h3>
<p>Once a job has been created, we need to execute it by finding an eligible worker based on a strategy. We can choose a worker either randomly or in a round-robin manner. Once a worker is chosen, we need to create an assignment between a job ID and a worker ID. We do so by creating a Persistent ZNode on the path <code>/assignments/{worker-id}/{job-id}</code>. Once the assignment is created, we <em>delete</em> the <code>/jobs/{job-id}</code> path.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Deleting the details of a job once it has been assigned makes recovery easier. If a leader dies and a new leader is elected, the new leader does not have to look at all the jobs under <code>/jobs/</code> and figure out which ones are still unassigned. Any job present under <code>/jobs/</code> is <em>guaranteed</em> to be unassigned, assuming that the assignment and deletion happened <em>atomically</em>. Note that the code below still performs the two steps separately (see its TODO), so the guarantee only holds once they are combined into a single multi-operation.</div>
</div>

<pre><code class="lang-java"><span class="hljs-meta">@Slf4j</span>
<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">JobAssigner</span> <span class="hljs-keyword">implements</span> <span class="hljs-title">Runnable</span> </span>{

  <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> CuratorFramework curator;
  <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> String jobID;
  <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> CuratorCache workersCache;
  <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> WorkerPickerStrategy workerPickerStrategy;
  <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> <span class="hljs-keyword">byte</span>[] jobData;
  <span class="hljs-keyword">private</span> String workerName;

  <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">JobAssigner</span><span class="hljs-params">(
      String jobID,
      <span class="hljs-keyword">byte</span>[] jobData,
      CuratorFramework curator,
      CuratorCache workersCache,
      WorkerPickerStrategy workerPickerStrategy)</span> </span>{
    <span class="hljs-keyword">this</span>.jobID = jobID;
    <span class="hljs-keyword">this</span>.curator = curator;
    <span class="hljs-keyword">this</span>.workersCache = workersCache;
    <span class="hljs-keyword">this</span>.workerPickerStrategy = workerPickerStrategy;
    <span class="hljs-keyword">this</span>.jobData = jobData;
  }

  <span class="hljs-meta">@Override</span>
  <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">run</span><span class="hljs-params">()</span> </span>{
    <span class="hljs-comment">// from the list of workers, pick a worker based on the provided strategy and assign the</span>
    <span class="hljs-comment">// incoming job to that worker</span>
    List&lt;ChildData&gt; workers =
        workersCache.stream()
            .filter(childData -&gt; (childData.getPath().length() &gt; ZKUtils.WORKERS_ROOT.length()))
            .toList();
    ChildData chosenWorker = workerPickerStrategy.evaluate(workers);
    workerName = ZKUtils.extractNode(chosenWorker.getPath());
    log.info(
        <span class="hljs-string">"Found total workers {}, Chosen worker index {}, worker name {}"</span>,
        workers.size(),
        chosenWorker,
        workerName);
    asyncCreateAssignment();
  }

  <span class="hljs-function"><span class="hljs-keyword">private</span> <span class="hljs-keyword">void</span> <span class="hljs-title">asyncCreateAssignment</span><span class="hljs-params">()</span> </span>{
    <span class="hljs-keyword">try</span> {
      curator
          .create()
          .idempotent()
          .withMode(CreateMode.PERSISTENT)
          .inBackground(
              <span class="hljs-keyword">new</span> BackgroundCallback() {
                <span class="hljs-meta">@Override</span>
                <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">processResult</span><span class="hljs-params">(CuratorFramework client, CuratorEvent event)</span> </span>{
                  <span class="hljs-keyword">switch</span> (KeeperException.Code.get(event.getResultCode())) {
                    <span class="hljs-keyword">case</span> OK -&gt; {
                      log.info(
                          <span class="hljs-string">"Assignment created successfully for JobID {} with WorkerID {}"</span>,
                          jobID,
                          workerName);
                      log.info(
                          <span class="hljs-string">"Performing async deletion of {}"</span>, ZKUtils.getJobsPath() + <span class="hljs-string">"/"</span> + jobID);
                      asyncDelete(ZKUtils.getJobsPath() + <span class="hljs-string">"/"</span> + jobID);
                    }
                    <span class="hljs-keyword">case</span> CONNECTIONLOSS -&gt; {
                      log.error(
                          <span class="hljs-string">"Lost connection to ZK while creating {}, retrying"</span>, event.getPath());
                      asyncCreateAssignment();
                    }
                    <span class="hljs-keyword">case</span> NODEEXISTS -&gt; {
                      log.warn(<span class="hljs-string">"Assignment already exists for path {}"</span>, event.getPath());
                    }
                    <span class="hljs-keyword">case</span> NONODE -&gt; {
                      log.error(<span class="hljs-string">"Trying to create an assignment for a worker which does not exist {}"</span>, event);
                    }
                    <span class="hljs-keyword">default</span> -&gt; log.error(<span class="hljs-string">"Unhandled event {} "</span>, event);
                  }
                }
              })
          .forPath(ZKUtils.ASSIGNMENT_ROOT + <span class="hljs-string">"/"</span> + workerName + <span class="hljs-string">"/"</span> + jobID, jobData);
      <span class="hljs-comment">// Storing the job data along with the assignment, so that the respective worker need not</span>
      <span class="hljs-comment">// perform an additional call to get the job details.</span>
      <span class="hljs-comment">// This also simplifies the design - because we can delete the /jobs/{jobID} path once the</span>
      <span class="hljs-comment">// assignment is completed - indicating that if an entry is present under /jobs, its</span>
      <span class="hljs-comment">// assignment is not yet done.</span>
      <span class="hljs-comment">// This makes the recovery/reconciliation process much easier. Once a leader is elected, it</span>
      <span class="hljs-comment">// only has to perform a liveness check for the existing assignments.</span>
      <span class="hljs-comment">// <span class="hljs-doctag">TODO:</span> Use MultiOp to perform assignment and deletion atomically</span>
    } <span class="hljs-keyword">catch</span> (Exception e) {
      log.error(<span class="hljs-string">"Error while creating assignment for {} with {}"</span>, jobID, workerName, e);
      <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> RuntimeException(e);
    }
  }

  <span class="hljs-function"><span class="hljs-keyword">private</span> <span class="hljs-keyword">void</span> <span class="hljs-title">asyncDelete</span><span class="hljs-params">(String path)</span> </span>{
    <span class="hljs-comment">// delete the provided ZNode</span>
    <span class="hljs-keyword">try</span> {
      curator
          .delete()
          .idempotent()
          .guaranteed()
          .inBackground(
              <span class="hljs-keyword">new</span> BackgroundCallback() {
                <span class="hljs-meta">@Override</span>
                <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">processResult</span><span class="hljs-params">(CuratorFramework client, CuratorEvent event)</span> </span>{
                  <span class="hljs-keyword">switch</span> (KeeperException.Code.get(event.getResultCode())) {
                    <span class="hljs-keyword">case</span> OK -&gt; {
                      log.info(<span class="hljs-string">"Path deleted successfully {}"</span>, event.getPath());
                    }
                    <span class="hljs-keyword">case</span> CONNECTIONLOSS -&gt; {
                      log.info(
                          <span class="hljs-string">"Lost connection to ZK while deleting {}, retrying"</span>, event.getPath());
                      asyncDelete(event.getPath());
                    }
                    <span class="hljs-keyword">default</span> -&gt; log.error(<span class="hljs-string">"Unhandled event {}"</span>, event);
                  }
                }
              })
          .forPath(path);
    } <span class="hljs-keyword">catch</span> (Exception e) {
      log.error(<span class="hljs-string">"Unable to delete {} due to "</span>, path, e);
      <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> RuntimeException(e);
    }
  }
}
</code></pre>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">We use asynchronous operations to create the ZNode to increase throughput. Because an asynchronous call returns before the operation completes, we only learn whether it succeeded in the callback, so that is where we have to handle failure scenarios such as <code>ConnectionLoss</code> and the node already existing.</div>
</div>
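<p>The retry-in-callback idea can be sketched with <code>CompletableFuture</code> alone. This is an illustration under stated assumptions, not Curator's API: <code>AsyncRetry</code>, its <code>Result</code> enum and the simulated flaky operation are all hypothetical. Unlike the callbacks above, this sketch caps the number of attempts so that a long outage cannot cause endless retries.</p>
<pre><code class="lang-java">import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class AsyncRetry {
  // Hypothetical outcome codes, loosely modelled on the cases handled in the callbacks above.
  enum Result { OK, CONNECTION_LOSS, NODE_EXISTS }

  /** Re-runs the async operation while it reports CONNECTION_LOSS, up to maxAttempts in total. */
  static CompletableFuture&lt;Result&gt; withRetry(
      Supplier&lt;CompletableFuture&lt;Result&gt;&gt; operation, int maxAttempts) {
    return operation
        .get()
        .thenCompose(
            result -&gt; {
              if (result == Result.CONNECTION_LOSS &amp;&amp; maxAttempts &gt; 1) {
                return withRetry(operation, maxAttempts - 1);
              }
              // OK, NODE_EXISTS, or attempts exhausted: hand the outcome to the caller.
              return CompletableFuture.completedFuture(result);
            });
  }

  public static void main(String[] args) {
    AtomicInteger calls = new AtomicInteger();
    // Simulated backend (an assumption for the demo): loses the connection twice, then succeeds.
    Supplier&lt;CompletableFuture&lt;Result&gt;&gt; flakyCreate =
        () -&gt; CompletableFuture.supplyAsync(
            () -&gt; calls.incrementAndGet() &lt; 3 ? Result.CONNECTION_LOSS : Result.OK);
    System.out.println(withRetry(flakyCreate, 5).join() + " after " + calls.get() + " calls");
  }
}
</code></pre>
<p>After a connection loss, the client cannot tell whether the server applied the request before the connection dropped. A retried create may therefore come back with "node exists" even though this is our own earlier attempt, which is why that outcome is handled separately rather than treated as a hard failure.</p>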

<h3 id="heading-workerpickerstrategy">WorkerPickerStrategy</h3>
<p>We use the <code>Strategy</code> pattern so that the way a worker is chosen can be changed at runtime. The important thing to notice is that <code>RoundRobinWorker</code> uses <em>compare and swap</em> as a form of optimistic locking.</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">interface</span> <span class="hljs-title">WorkerPickerStrategy</span> </span>{
  <span class="hljs-function">ChildData <span class="hljs-title">evaluate</span><span class="hljs-params">(List&lt;ChildData&gt; workers)</span></span>;
}

<span class="hljs-comment">// choose workers based on random strategy</span>
<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">RandomWorker</span> <span class="hljs-keyword">implements</span> <span class="hljs-title">WorkerPickerStrategy</span> </span>{
  <span class="hljs-meta">@Override</span>
  <span class="hljs-function"><span class="hljs-keyword">public</span> ChildData <span class="hljs-title">evaluate</span><span class="hljs-params">(List&lt;ChildData&gt; workers)</span> </span>{
    <span class="hljs-keyword">int</span> chosenWorker = (<span class="hljs-keyword">int</span>) (Math.random() * workers.size());
    <span class="hljs-keyword">return</span> workers.get(chosenWorker);
  }
}

<span class="hljs-comment">// choose workers based on round robin strategy</span>
<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">RoundRobinWorker</span> <span class="hljs-keyword">implements</span> <span class="hljs-title">WorkerPickerStrategy</span> </span>{
  AtomicInteger index =
      <span class="hljs-keyword">new</span> AtomicInteger(<span class="hljs-number">0</span>); <span class="hljs-comment">// atomic because this will be accessed from multiple threads</span>

  <span class="hljs-meta">@Override</span>
  <span class="hljs-function"><span class="hljs-keyword">public</span> ChildData <span class="hljs-title">evaluate</span><span class="hljs-params">(List&lt;ChildData&gt; workers)</span> </span>{
    <span class="hljs-keyword">int</span> chosenIndex;
    <span class="hljs-keyword">while</span> (<span class="hljs-keyword">true</span>) { <span class="hljs-comment">// repeat until the compare-and-set operation succeeds</span>
      chosenIndex = index.get();
      <span class="hljs-keyword">int</span> nextIndex = (chosenIndex + <span class="hljs-number">1</span>) &lt; workers.size() ? (chosenIndex + <span class="hljs-number">1</span>) : <span class="hljs-number">0</span>;
      <span class="hljs-comment">// in case of concurrent updates, this can fail, hence we have to retry until success</span>
      <span class="hljs-keyword">if</span> (index.compareAndSet(chosenIndex, nextIndex)) {
        <span class="hljs-keyword">break</span>;
      }
    }
    <span class="hljs-keyword">return</span> workers.get(chosenIndex);
  }
}
</code></pre>
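<p>One edge case worth keeping in mind: if the worker list shrinks between two calls, the stored index can point past the end of the new list and <code>workers.get(chosenIndex)</code> will throw. The sketch below is one possible variant that guards against this. <code>RoundRobinPicker</code> is a hypothetical name, and it works on a generic <code>List</code> instead of <code>ChildData</code> so that it runs without Curator. It relies on <code>AtomicInteger.getAndUpdate</code>, which performs the same compare-and-set retry loop internally.</p>
<pre><code class="lang-java">import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class RoundRobinPicker {
  private final AtomicInteger next = new AtomicInteger(0);

  /** Picks the next element, wrapping around and staying in range if the list has shrunk. */
  &lt;T&gt; T pick(List&lt;T&gt; workers) {
    if (workers.isEmpty()) {
      throw new IllegalStateException("No workers available");
    }
    // getAndUpdate retries the compare-and-set internally until it succeeds.
    int chosen = next.getAndUpdate(i -&gt; (i + 1) % workers.size());
    // The modulo protects against an index stored while the list was larger.
    return workers.get(chosen % workers.size());
  }

  public static void main(String[] args) {
    RoundRobinPicker picker = new RoundRobinPicker();
    List&lt;String&gt; workers = List.of("worker-1", "worker-2", "worker-3");
    for (int i = 0; i &lt; 4; i++) {
      System.out.println(picker.pick(workers));
    }
  }
}
</code></pre>
<p>The explicit empty-list check turns "no workers registered" into a clear error instead of an arithmetic or index exception, which the caller can then decide how to handle.</p>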
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Optimistic locking is a very powerful construct and can be found in various places, e.g. ElasticSearch natively provides compare and swap operations while updating documents. Zookeeper also maintains a version number with each ZNode, which can be used to perform a conditional update. <a target="_blank" href="https://www.elastic.co/guide/en/elasticsearch/reference/current/optimistic-concurrency-control.html">https://www.elastic.co/guide/en/elasticsearch/reference/current/optimistic-concurrency-control.html</a> <a target="_blank" href="https://zookeeper.apache.org/doc/current/zookeeperProgrammers.html">https://zookeeper.apache.org/doc/current/zookeeperProgrammers.html</a></div>
</div>

<hr />
<h3 id="heading-workerservice">WorkerService</h3>
<p>Since the <code>JobAssigner</code> creates an assignment using a ZNode of the form <code>/assignments/{worker-id}/{job-id}</code>, a worker that wants to listen for upcoming assignments needs to set a watch on the <code>/assignments/{worker-id}</code> path.</p>
<p>Once the worker service is notified of the new assignment, it fetches the job details, deserializes it to an instance of Runnable, and passes it to an <code>ExecutorService</code> which performs the execution.</p>
<p>Once the runnable has been executed, we chain the future to update the status of the job ID. The status is recorded by <em>asynchronously</em> creating an entry in <code>/status/{job-id}</code>. Once the entry is created, we perform the last operation in this orchestra: deletion of the assignment mapping.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">We have deliberately chosen the deletion of the assignment mapping as the last operation. If a worker dies during task execution, the leader can perform failure recovery and reassign all the tasks that were assigned to the dead worker to a new worker instance.</div>
</div>

<pre><code class="lang-java"><span class="hljs-meta">@Slf4j</span>
<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">AssignmentListener</span> <span class="hljs-keyword">implements</span> <span class="hljs-title">CuratorCacheListener</span> </span>{
  <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> CuratorFramework curator;
  <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> ExecutorService executorService;

  <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">AssignmentListener</span><span class="hljs-params">(CuratorFramework curator)</span> </span>{
    <span class="hljs-keyword">this</span>.curator = curator;
    <span class="hljs-keyword">this</span>.executorService = Executors.newFixedThreadPool(<span class="hljs-number">10</span>);
  }

  <span class="hljs-meta">@Override</span>
  <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">event</span><span class="hljs-params">(Type type, ChildData oldData, ChildData data)</span> </span>{
    <span class="hljs-keyword">if</span> (type == Type.NODE_CREATED) {
      <span class="hljs-keyword">if</span> (data.getPath().indexOf(<span class="hljs-string">'/'</span>, <span class="hljs-number">1</span>) == data.getPath().lastIndexOf(<span class="hljs-string">'/'</span>)) {
        <span class="hljs-comment">// This filters out the root path /assignments/{worker-id}, which does not contain any job id</span>
        <span class="hljs-keyword">return</span>;
      }
      String jobId = data.getPath().substring(data.getPath().lastIndexOf(<span class="hljs-string">'/'</span>) + <span class="hljs-number">1</span>);
      log.info(<span class="hljs-string">"Assignment found for job id {}"</span>, jobId);

      <span class="hljs-keyword">try</span> {
        <span class="hljs-keyword">byte</span>[] bytes = data.getData();
        ObjectInputStream objectInputStream =
            <span class="hljs-keyword">new</span> ObjectInputStream(<span class="hljs-keyword">new</span> ByteArrayInputStream(bytes));
        Runnable jobDetail = (Runnable) objectInputStream.readObject();
        log.info(<span class="hljs-string">"Deserialized the JobId {} to {}"</span>, jobId, jobDetail);
        CompletableFuture&lt;Void&gt; future = CompletableFuture.runAsync(jobDetail, executorService);
        <span class="hljs-comment">// Actual execution of the job will be performed in a separate thread to avoid blocking of</span>
        <span class="hljs-comment">// watcher thread</span>
        log.info(<span class="hljs-string">"Job submitted for execution"</span>);
        <span class="hljs-comment">// once the job has been executed, we need to ensure the assignment is deleted and the</span>
        <span class="hljs-comment">// status of job has been updated. Currently there is no guarantee that post the execution,</span>
        <span class="hljs-comment">// this cleanup happens.</span>
        <span class="hljs-comment">// <span class="hljs-doctag">TODO:</span> Implement a daemon service which performs cleanup</span>
        future.thenAcceptAsync(__ -&gt; asyncCreate(jobId, data.getPath()), executorService);
      } <span class="hljs-keyword">catch</span> (Exception e) {
        log.error(<span class="hljs-string">"Unable to fetch data for job id {}"</span>, jobId, e);
      }
    }
  }

  <span class="hljs-function"><span class="hljs-keyword">private</span> <span class="hljs-keyword">void</span> <span class="hljs-title">asyncCreate</span><span class="hljs-params">(String jobId, String assignmentPath)</span> </span>{
    log.info(<span class="hljs-string">"JobID {} has been executed, moving on to update its status"</span>, jobId);
    <span class="hljs-comment">// create the ZNode, no need to set any data with this znode</span>
    <span class="hljs-keyword">try</span> {
      curator
          .create()
          .withTtl(ZKUtils.STATUS_TTL_MILLIS)
          .creatingParentsIfNeeded()
          .withMode(CreateMode.PERSISTENT_WITH_TTL)
          .inBackground(
              <span class="hljs-keyword">new</span> BackgroundCallback() {
                <span class="hljs-meta">@Override</span>
                <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">processResult</span><span class="hljs-params">(CuratorFramework client, CuratorEvent event)</span> </span>{
                  <span class="hljs-keyword">switch</span> (KeeperException.Code.get(event.getResultCode())) {
                    <span class="hljs-keyword">case</span> OK -&gt; {
                      log.info(<span class="hljs-string">"Status updated successfully {}"</span>, event.getPath());
                      log.info(<span class="hljs-string">"Performing deletion of assignment path {}"</span>, assignmentPath);
                      asyncDelete(assignmentPath);
                    }
                    <span class="hljs-keyword">case</span> CONNECTIONLOSS -&gt; {
                      log.error(
                          <span class="hljs-string">"Lost connection to ZK while creating {}, retrying"</span>, event.getPath());
                      asyncCreate(jobId, assignmentPath);
                    }
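                    <span class="hljs-keyword">case</span> NODEEXISTS -&gt; {
                      <span class="hljs-comment">// an earlier attempt (e.g. one retried after CONNECTIONLOSS) already created the</span>
                      <span class="hljs-comment">// status node, so the assignment still needs to be cleaned up</span>
                      log.warn(<span class="hljs-string">"Node already exists for path {}"</span>, event.getPath());
                      asyncDelete(assignmentPath);
                    }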
                    <span class="hljs-keyword">default</span> -&gt; log.error(<span class="hljs-string">"Unhandled event {}"</span>, event);
                  }
                }
              })
          .forPath(ZKUtils.getStatusPath(jobId), <span class="hljs-string">"Completed"</span>.getBytes());
    } <span class="hljs-keyword">catch</span> (Exception e) {
      log.error(<span class="hljs-string">"Unable to create {} due to "</span>, ZKUtils.getStatusPath(jobId), e);
      <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> RuntimeException(e);
    }
  }

  <span class="hljs-function"><span class="hljs-keyword">private</span> <span class="hljs-keyword">void</span> <span class="hljs-title">asyncDelete</span><span class="hljs-params">(String path)</span> </span>{
    <span class="hljs-comment">// delete the provided ZNode</span>
    <span class="hljs-keyword">try</span> {
      curator
          .delete()
          .idempotent()
          .guaranteed()
          .inBackground(
              <span class="hljs-keyword">new</span> BackgroundCallback() {
                <span class="hljs-meta">@Override</span>
                <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">processResult</span><span class="hljs-params">(CuratorFramework client, CuratorEvent event)</span> </span>{
                  <span class="hljs-keyword">switch</span> (KeeperException.Code.get(event.getResultCode())) {
                    <span class="hljs-keyword">case</span> OK -&gt; {
                      log.info(<span class="hljs-string">"Path deleted successfully {}"</span>, event.getPath());
                    }
                    <span class="hljs-keyword">case</span> CONNECTIONLOSS -&gt; {
                      log.info(
                          <span class="hljs-string">"Lost connection to ZK while deleting {}, retrying"</span>, event.getPath());
                      asyncDelete(event.getPath());
                    }
                    <span class="hljs-keyword">default</span> -&gt; log.error(<span class="hljs-string">"Unhandled event {}"</span>, event);
                  }
                }
              })
          .forPath(path);
    } <span class="hljs-keyword">catch</span> (Exception e) {
      log.error(<span class="hljs-string">"Unable to delete {} due to "</span>, path, e);
      <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> RuntimeException(e);
    }
  }
}
</code></pre>
<h3 id="heading-workerslistener">WorkersListener</h3>
<p>When a worker is lost due to a network partition or an application shutdown, the leader instance is notified through a watcher event. The leader then looks at all the tasks that were assigned to the lost worker by iterating over the assignment mappings <code>/assignments/{worker-id}/{job-id}</code>.</p>
<p>Each of those tasks is then resubmitted by re-creating its entry at <code>/jobs/{job-id}</code>. This triggers the entire workflow from the start.</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">WorkersListener</span> <span class="hljs-keyword">implements</span> <span class="hljs-title">CuratorCacheListener</span> </span>{

  <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> CuratorCache assignmentCache;
  <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> CuratorFramework curator;

  <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">WorkersListener</span><span class="hljs-params">(CuratorCache assignmentCache, CuratorFramework curator)</span> </span>{
    <span class="hljs-keyword">this</span>.assignmentCache = assignmentCache;
    <span class="hljs-keyword">this</span>.curator = curator;
  }

  <span class="hljs-meta">@Override</span>
  <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">event</span><span class="hljs-params">(Type type, ChildData oldData, ChildData data)</span> </span>{
    <span class="hljs-keyword">if</span> (type == Type.NODE_CREATED) {
      log.info(<span class="hljs-string">"New worker found {} "</span>, data.getPath());
    } <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (type == Type.NODE_DELETED) {
      <span class="hljs-comment">// notice we have to check oldData because data will be null</span>
      log.info(<span class="hljs-string">"Lost worker {}"</span>, oldData.getPath());
      String lostWorkerID = oldData.getPath().substring(oldData.getPath().lastIndexOf(<span class="hljs-string">'/'</span>) + <span class="hljs-number">1</span>);
      <span class="hljs-comment">// map of job ids -&gt; job data, which was assigned to the lost worker</span>
      Map&lt;String, <span class="hljs-keyword">byte</span>[]&gt; assignableJobIds = <span class="hljs-keyword">new</span> HashMap&lt;&gt;();
      assignmentCache.stream()
          .forEach(
              childData -&gt; {
                String path = childData.getPath();
                <span class="hljs-comment">// paths look like /assignments/{worker-id}/{job-id}; shallower nodes are skipped</span>
                String[] parts = path.split(<span class="hljs-string">"/"</span>);
                <span class="hljs-keyword">if</span> (parts.length == <span class="hljs-number">4</span> &amp;&amp; parts[<span class="hljs-number">2</span>].equals(lostWorkerID)) {
                  String jobID = parts[<span class="hljs-number">3</span>];
                  log.info(<span class="hljs-string">"Found {} assigned to lost worker {}"</span>, jobID, lostWorkerID);
                  assignableJobIds.put(jobID, childData.getData());
                }
              });
      <span class="hljs-comment">// Assuming atomic creation of assignment path and deletion of tasks path (using MultiOp), we</span>
      <span class="hljs-comment">// can safely assume that no entry exists under /jobs for the assigned tasks.</span>
      <span class="hljs-comment">// So we can simulate job creation by recreating an entry in the /jobs entry.</span>
      assignableJobIds.forEach(
          (jobId, jobData) -&gt; asyncCreateJob(ZKUtils.getJobsPath() + <span class="hljs-string">"/"</span> + jobId, jobData));
    }
  }

  <span class="hljs-function"><span class="hljs-keyword">private</span> <span class="hljs-keyword">void</span> <span class="hljs-title">asyncCreateJob</span><span class="hljs-params">(String path, <span class="hljs-keyword">byte</span>[] data)</span> </span>{
    <span class="hljs-keyword">try</span> {
      curator
          .create()
          .idempotent()
          .withMode(CreateMode.PERSISTENT)
          .inBackground(
              <span class="hljs-keyword">new</span> BackgroundCallback() {
                <span class="hljs-meta">@Override</span>
                <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">processResult</span><span class="hljs-params">(CuratorFramework client, CuratorEvent event)</span> </span>{
                  <span class="hljs-keyword">switch</span> (KeeperException.Code.get(event.getResultCode())) {
                    <span class="hljs-keyword">case</span> OK -&gt; {
                      log.info(<span class="hljs-string">"Job repaired successfully for {}"</span>, path);
                    }
                    <span class="hljs-keyword">case</span> CONNECTIONLOSS -&gt; {
                      log.error(
                          <span class="hljs-string">"Lost connection to ZK while repairing job {}, retrying"</span>,
                          event.getPath());
                      asyncCreateJob(event.getPath(), (<span class="hljs-keyword">byte</span>[]) event.getContext());
                    }
                    <span class="hljs-keyword">case</span> NODEEXISTS -&gt; {
                      log.warn(<span class="hljs-string">"Job already exists for path {}"</span>, event.getPath());
                    }
                    <span class="hljs-keyword">default</span> -&gt; log.error(<span class="hljs-string">"Unhandled event {}"</span>, event);
                  }
                }
              },
              data)
          .forPath(path, data);
    } <span class="hljs-keyword">catch</span> (Exception e) {
      log.error(<span class="hljs-string">"Error while repairing job {}"</span>, path, e);
      <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> RuntimeException(e);
    }
  }
}
</code></pre>
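<p>Stripped of the ZooKeeper plumbing, the recovery step is a filter followed by a re-keying. The sketch below assumes assignment paths of the form <code>/assignments/{worker-id}/{job-id}</code> and jobs living under <code>/jobs</code>; <code>LostWorkerRecovery</code> and both of its methods are hypothetical helpers, not code from the repository.</p>

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; assumes paths look like /assignments/{worker-id}/{job-id}. */
final class LostWorkerRecovery {
  /** Returns job id -> job data for every assignment that belonged to the lost worker. */
  static Map<String, byte[]> jobsToReassign(Map<String, byte[]> assignments, String lostWorkerId) {
    Map<String, byte[]> result = new HashMap<>();
    // the trailing slash keeps "worker-a" from matching "worker-ab"
    String prefix = "/assignments/" + lostWorkerId + "/";
    assignments.forEach(
        (path, data) -> {
          if (path.startsWith(prefix) && path.length() > prefix.length()) {
            result.put(path.substring(prefix.length()), data);
          }
        });
    return result;
  }

  /** Maps each recovered job back to the path the leader watches for new jobs. */
  static Map<String, byte[]> asJobsEntries(Map<String, byte[]> jobs) {
    Map<String, byte[]> entries = new HashMap<>();
    jobs.forEach((jobId, data) -> entries.put("/jobs/" + jobId, data));
    return entries;
  }
}
```

<p>Keeping the path parsing in one small, pure function like this also makes it easy to unit test without a running ZooKeeper ensemble.</p>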
<h3 id="heading-workerservice-the-humble-plumber">WorkerService - The humble plumber</h3>
<p>Throughout the article, you might have noticed that we talked about a leader instance performing some work but never explained it. So let's talk about what a leader instance is and how an instance becomes a leader.</p>
<p>When a worker instance comes up - it enqueues itself for a chance to become a leader. We perform leader elections using the Curator framework ensuring that only a single instance can become a leader amongst the members.</p>
<p>The leader is entrusted with critical actions like watching the <code>/jobs/</code> path and the <code>/workers/</code> path. <em>The remaining instances do not set up watches on these paths because we want to ensure a task is assigned to only one worker instance</em>. If multiple instances tried to perform the assignment, it would be difficult to coordinate among them without taking a lock. This is where ZooKeeper comes in and acts as the trusty coordination service.</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">WorkerService</span> <span class="hljs-keyword">implements</span> <span class="hljs-title">LeaderSelectorListener</span>, <span class="hljs-title">Closeable</span> </span>{
  <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">WorkerService</span><span class="hljs-params">(CuratorFramework curator, String path)</span> </span>{
    <span class="hljs-keyword">this</span>.curator = curator;
    workerPickerStrategy = <span class="hljs-keyword">new</span> RoundRobinWorker();
    <span class="hljs-comment">// register the worker and create the base paths before joining the election, so that</span>
    <span class="hljs-comment">// takeLeadership() never runs against half-initialised state</span>
    setup();
    leaderSelector = <span class="hljs-keyword">new</span> LeaderSelector(curator, path, <span class="hljs-keyword">this</span>);
    <span class="hljs-comment">// this is important as it automatically handles failure scenarios i.e. starts leadership after</span>
    <span class="hljs-comment">// the reconnected state</span>
    <span class="hljs-comment">// https://www.mail-archive.com/user@curator.apache.org/msg00903.html</span>
    leaderSelector.autoRequeue();
    <span class="hljs-comment">// the selection for this instance doesn't start until the leader selector is started</span>
    <span class="hljs-comment">// leader selection is done in the background so this call to leaderSelector.start() returns</span>
    <span class="hljs-comment">// immediately</span>
    leaderSelector.start();
  }

  <span class="hljs-function"><span class="hljs-keyword">private</span> <span class="hljs-keyword">void</span> <span class="hljs-title">setup</span><span class="hljs-params">()</span> </span>{
    registerWorker();
    asyncCreate(ZKUtils.getJobsPath(), CreateMode.PERSISTENT, <span class="hljs-keyword">null</span>);
    asyncCreate(ZKUtils.getAssignmentPath(name), CreateMode.PERSISTENT, <span class="hljs-keyword">null</span>);
    asyncCreate(ZKUtils.STATUS_ROOT, CreateMode.PERSISTENT, <span class="hljs-keyword">null</span>);
  }

  <span class="hljs-function"><span class="hljs-keyword">private</span> <span class="hljs-keyword">void</span> <span class="hljs-title">registerWorker</span><span class="hljs-params">()</span> </span>{
    <span class="hljs-keyword">if</span> (registrationRequired.get()) {
      log.info(<span class="hljs-string">"Attempting worker registration"</span>);
      name = UUID.randomUUID().toString();
      log.info(<span class="hljs-string">"Generated a new random name to the worker {}"</span>, name);
      asyncCreate(ZKUtils.getWorkerPath(name), CreateMode.EPHEMERAL, registrationRequired);
      asyncCreate(ZKUtils.getAssignmentPath(name), CreateMode.PERSISTENT, <span class="hljs-keyword">null</span>);
      watchAssignmentPath();
      <span class="hljs-comment">// irrespective of whether this node is a leader or not, we need to watch the assignment path</span>
    }
  }

  <span class="hljs-comment">// only the leader worker will watch for incoming jobs and changes to available workers</span>
  <span class="hljs-function"><span class="hljs-keyword">private</span> <span class="hljs-keyword">void</span> <span class="hljs-title">watchJobsAndWorkersPath</span><span class="hljs-params">()</span> </span>{
    <span class="hljs-comment">// in case leadership is reacquired, repeat the setup of the watches</span>
    workersCache = CuratorCache.build(curator, ZKUtils.WORKERS_ROOT);
    workersCache.start();
    log.info(<span class="hljs-string">"Watching workers root path {}"</span>, ZKUtils.WORKERS_ROOT);
    workersListener = <span class="hljs-keyword">new</span> WorkersListener(assignmentCache, curator);
    workersCache.listenable().addListener(workersListener);

    jobsCache = CuratorCache.build(curator, ZKUtils.JOBS_ROOT);
    log.info(<span class="hljs-string">"Watching jobs root path {}"</span>, ZKUtils.getJobsPath());
    jobsCache.start();
    jobsListener = <span class="hljs-keyword">new</span> JobsListener(curator, workersCache, workerPickerStrategy);
    jobsCache.listenable().addListener(jobsListener);
  }

  <span class="hljs-function"><span class="hljs-keyword">private</span> <span class="hljs-keyword">void</span> <span class="hljs-title">watchAssignmentPath</span><span class="hljs-params">()</span> </span>{
    <span class="hljs-comment">// No need to check for null here because once a session is reconnected after a loss</span>
    <span class="hljs-comment">// we need to start the assignment listener on the new worker id</span>
    assignmentCache = CuratorCache.build(curator, ZKUtils.getAssignmentPath(name));
    log.info(<span class="hljs-string">"Watching {}"</span>, ZKUtils.getAssignmentPath(name));
    assignmentCache.start();
    assignmentListener = <span class="hljs-keyword">new</span> AssignmentListener(curator);
    assignmentCache.listenable().addListener(assignmentListener);
  }

  <span class="hljs-meta">@Override</span>
  <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">takeLeadership</span><span class="hljs-params">(CuratorFramework client)</span> </span>{
    <span class="hljs-comment">// we are now the leader. This method should not return until we want to relinquish leadership,</span>
    <span class="hljs-comment">// which will only happen, if someone has signalled us to stop</span>
    log.info(<span class="hljs-string">"{} is now the leader"</span>, name);
    <span class="hljs-comment">// only the leader should watch the jobs and workers path</span>
    watchJobsAndWorkersPath();
    lock.lock();
    <span class="hljs-keyword">try</span> {
      <span class="hljs-comment">// sleep until signalled to stop</span>
      <span class="hljs-keyword">while</span> (!shouldStop.get()) {
        condition.await();
      }
      <span class="hljs-keyword">if</span> (shouldStop.get()) {
        log.warn(<span class="hljs-string">"{} is signalled to stop!"</span>, name);
        leaderSelector.close();
      }
    } <span class="hljs-keyword">catch</span> (InterruptedException e) { <span class="hljs-comment">// this is propagated from cancel leadership election</span>
      log.error(<span class="hljs-string">"Thread is interrupted, need to exit the leadership"</span>, e);
    } <span class="hljs-keyword">finally</span> {
      <span class="hljs-comment">// finally is called before the method return</span>
      lock.unlock();
    }
  }

  <span class="hljs-meta">@Override</span>
  <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">stateChanged</span><span class="hljs-params">(CuratorFramework client, ConnectionState newState)</span> </span>{
    <span class="hljs-keyword">if</span> (newState == ConnectionState.RECONNECTED) {
      log.info(<span class="hljs-string">"Reconnected to ZK, Received {}"</span>, newState);
      <span class="hljs-comment">// no need to start the leadership again as it is auto requeued but worker re-registration is</span>
      <span class="hljs-comment">// still required which will create a ZNode in /workers and /assignments path</span>
      registerWorker();
    } <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (newState == ConnectionState.LOST) {
      log.error(<span class="hljs-string">"Connection suspended/lost to ZK, giving up leadership {}"</span>, newState);
      registrationRequired.set(<span class="hljs-keyword">true</span>);
      <span class="hljs-comment">// This is required as the assignment cache listens on the {worker id} which is ephemeral</span>
      <span class="hljs-comment">// In case of a lost session, it's guaranteed that the {worker id} would have expired</span>
      <span class="hljs-comment">// Once the session is reconnected, we need to set up the assignment listener again on a new</span>
      <span class="hljs-comment">// worker id</span>
      <span class="hljs-comment">// <span class="hljs-doctag">TODO:</span> Figure out a way to simulate the disconnection from zookeeper only by one instance</span>
      log.info(<span class="hljs-string">"Removing the watcher set on the assignment listener"</span>);
      assignmentCache.listenable().removeListener(assignmentListener);
      assignmentCache.close();
      <span class="hljs-keyword">if</span> (workersCache != <span class="hljs-keyword">null</span>) {
        log.info(<span class="hljs-string">"Removing the watcher set on the workers listener"</span>);
        workersCache.listenable().removeListener(workersListener);
        workersCache.close();
      }
      <span class="hljs-keyword">if</span> (jobsCache != <span class="hljs-keyword">null</span>) {
        log.info(<span class="hljs-string">"Removing the watcher set on the jobs listener"</span>);
        jobsCache.listenable().removeListener(jobsListener);
        jobsCache.close();
      }
      <span class="hljs-comment">// throwing this specific exception interrupts the thread running takeLeadership, which</span>
      <span class="hljs-comment">// surfaces there as an InterruptedException</span>
      <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> CancelLeadershipException();
    } <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (newState == ConnectionState.SUSPENDED) {
      <span class="hljs-comment">// https://stackoverflow.com/questions/41042798/how-to-handle-apache-curator-distributed-lock-loss-of-connection</span>
      log.error(<span class="hljs-string">"Connection has been suspended to ZK {}"</span>, newState);
      <span class="hljs-comment">// <span class="hljs-doctag">TODO:</span> After increasing the time out, verify whether no other instance gets the lock before</span>
      <span class="hljs-comment">// the connection is marked as LOST</span>
    }
  }
}
</code></pre>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">It's critical for the <code>LeaderSelector</code> instances to pay attention to any connection state changes. If an instance becomes the leader, it should respond to being notified that its ZooKeeper session is SUSPENDED or LOST. If the SUSPENDED state is reported, the instance must assume it might no longer be the leader until it receives a RECONNECTED state. If the LOST state is reported, the instance is no longer the leader and its <code>takeLeadership</code> method should exit.</div>
</div>
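<p>The rule in the callout is essentially a small state machine. The sketch below is one way to picture it; <code>LeadershipGuard</code> and its <code>State</code> enum are hypothetical (the enum only borrows the state names used in this article) and the sketch assumes that a RECONNECTED event without an intervening LOST means the session survived.</p>

```java
/** Hypothetical model of the rule above; not the Curator ConnectionState API. */
final class LeadershipGuard {
  enum State { CONNECTED, SUSPENDED, RECONNECTED, LOST }

  private boolean leader;
  private boolean safeToAct;

  void becomeLeader() {
    leader = true;
    safeToAct = true;
  }

  /** Applies one connection-state change and returns whether leader-only work may continue. */
  boolean onStateChange(State newState) {
    switch (newState) {
      case SUSPENDED -> safeToAct = false; // might still be leader, but cannot be sure
      case RECONNECTED -> safeToAct = leader; // only safe if leadership was never lost
      case LOST -> { // session expired: step down unconditionally
        leader = false;
        safeToAct = false;
      }
      case CONNECTED -> { } // initial connection, nothing to revoke
    }
    return safeToAct;
  }

  boolean isLeader() {
    return leader;
  }
}
```

<p>Note that after LOST, a later RECONNECTED does not restore leadership in this model. The instance has to win a fresh election first.</p>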

<p>When we detect that our instance has lost its connection to ZooKeeper, we remove any watches that have been set up and throw a <code>CancelLeadershipException</code>. We then wait until we are reconnected to ZooKeeper.</p>
<p>Once reconnected, we generate a new name for the worker and set up the appropriate watches. Since <code>autoRequeue()</code> was enabled on the leader selector, the instance will enqueue itself again for a chance of becoming the leader.</p>
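<p>A rough mental model for the requeue behaviour is a queue ordered by arrival, where whoever sits at the front is the leader. The sketch below is only that mental model: <code>ElectionQueueSketch</code> is hypothetical, lives entirely in memory, and assumes "lowest sequence number wins", which is a simplification of what the Curator recipe does on top of ZooKeeper.</p>

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Hypothetical in-memory election queue: the lowest sequence number is the leader. */
final class ElectionQueueSketch {
  private final TreeMap<Long, String> queue = new TreeMap<>();
  private long nextSequence = 0;

  /** Enqueues a candidate and returns its sequence number. */
  synchronized long enqueue(String workerName) {
    long sequence = nextSequence++;
    queue.put(sequence, workerName);
    return sequence;
  }

  /** Removes a candidate, e.g. when its session is lost. */
  synchronized void remove(long sequence) {
    queue.remove(sequence);
  }

  /** Returns the current leader, or empty if nobody is enqueued. */
  synchronized Optional<String> leader() {
    Map.Entry<Long, String> first = queue.firstEntry();
    return first == null ? Optional.empty() : Optional.of(first.getValue());
  }
}
```

<p>A worker that reconnects under a new name joins at the back of the queue, so it only becomes the leader again once every instance ahead of it has left.</p>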
<hr />
<h3 id="heading-conclusion">Conclusion</h3>
<p><img src="https://images.unsplash.com/photo-1554830072-52d78d0d4c18?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" alt="dog biting Thank You mail paper" /></p>
<p>If you have read this far, I appreciate your patience. I hope you learnt something new today. Thank you for reading.</p>
<p>Please feel free to ask any questions you might have in the comments.</p>
<hr />
<h3 id="heading-appendix">Appendix</h3>
<ul>
<li><p><a target="_blank" href="https://cwiki.apache.org/confluence/display/CURATOR/TN1">ZooKeeper watches are single-threaded.</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/snehasishroy/TaskScheduler">Link to the Code Repository</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Deep Dive of the Distributed Job Scheduler that powers over 2 Billion daily jobs at PhonePe]]></title><description><![CDATA[While working at PhonePe, I had the privilege of working on Clockwork - the system that powers job scheduling across PhonePe. During an internal knowledge transfer session, I presented an architectural overview of the system to the team. After receiv...]]></description><link>https://snehasishroy.com/deep-dive-of-the-distributed-job-scheduler-that-powers-over-2-billion-daily-jobs-at-phonepe</link><guid isPermaLink="true">https://snehasishroy.com/deep-dive-of-the-distributed-job-scheduler-that-powers-over-2-billion-daily-jobs-at-phonepe</guid><category><![CDATA[System Architecture]]></category><category><![CDATA[System Design]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[2Articles1Week]]></category><category><![CDATA[Databases]]></category><category><![CDATA[backend]]></category><dc:creator><![CDATA[Snehasish Roy]]></dc:creator><pubDate>Sat, 17 Feb 2024 19:26:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/rBPOfVqROzY/upload/4d77c572441701ddeac4fce436bd5cd3.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>While working at <a target="_blank" href="https://www.phonepe.com/">PhonePe</a>, I had the privilege of working on Clockwork - the system that powers job scheduling across PhonePe. During an internal knowledge transfer session, I presented an architectural overview of the system to the team. After receiving very positive feedback post the session, I decided to write a blog post for the PhonePe tech blog, explaining the internals of PhonePe to the external audience. This article is the uncut version of what I initially wrote for the blog.</p>
<h2 id="heading-preface">Preface</h2>
<p>If you have ever missed your morning flight because your alarm never rang, you understand the importance of reliable alarms.</p>
<p>At PhonePe, we ensure that your alarms always ring at the time you want them to ring. Consider a scenario that happens daily at PhonePe - Merchant Settlements. If a merchant has performed multiple transactions during the day, at the end of the day, we want to ensure the final amount gets credited to their account. Any delay in the jobs getting triggered can cause a delay in the merchant settlement which can lead to a loss of customer trust.</p>
<p>Consider another scenario - Coupon invalidation. If you have received a coupon in PhonePe, you must have noticed that the coupon has a fixed validity. Now multiply this by hundreds of thousands or even millions of coupons. This necessitates a platform that can reliably trigger tasks at predetermined times to invalidate all the coupons.</p>
<p>Patterns similar to this keep presenting themselves across different systems and at incredible scale.</p>
<p>The obvious approach would have been to use embedded schedulers, i.e. a scheduler service that runs within the client application and allows the client to schedule jobs in the future. However, this presents a set of challenges, especially the <em>lack of fault tolerance.</em> What will happen to the scheduled job in case the client application goes down? To solve this, we would need <em>persistence</em>, but it would need a <em>lot</em> of complex coordination across hundreds of stateless containers. Why? Because in most cases, we would not want the same job to be executed by multiple containers. Then there is the issue of scale – we can have situations where hundreds of millions of such tasks can be triggered in the span of a few hours which could result in millions of notifications being generated at constant rates nudging users about coupons nearing expiration.</p>
<p>To ensure coordination amongst containers, we would have to employ complicated strategies like partitioning of jobs and leader-election amongst task executor instances in all services handling these kinds of use cases. While possible, this would add a <em>lot</em> of complexity to the service containers themselves, making Garbage Collection, Thread pool, and auto-scaler tuning extremely difficult due to all containers doing mixed workloads, and adding significantly to the storage layer requirements for them.</p>
<p>As a design principle at PhonePe, we keep individual systems simple and <em>build up complexity in layers</em>, similar to building complex command chains by piping simple commands on Unix or GNU/Linux.</p>
<p>As this seemed like a fairly common requirement across many systems, we decided to build a centralized platform that allows any client to onboard and schedule jobs in the future without doing any heavy lifting on their own. In this post, we will take a look at the internals of <strong>Clockwork</strong> - the system that powers job scheduling across various teams at PhonePe.</p>
<h2 id="heading-scale">Scale</h2>
<ul>
<li><p>Over 2 Billion callbacks are made daily as per the schedule defined.</p>
</li>
<li><p>Capability to handle over 100,000 job schedules per second with a single-digit millisecond latency.</p>
</li>
<li><p>No lag in job execution at p99; in the worst case, the lag can extend to 1 minute.</p>
</li>
</ul>
<h2 id="heading-what-is-clockwork">What is Clockwork?</h2>
<p>Clockwork is a <em>Distributed, Durable, and Fault-Tolerant</em> Task Scheduler. Wow, that's a handful! Let's dissect it!</p>
<ul>
<li><p><strong>Task Scheduler</strong> - In Linux, a job that needs execution at a specific point in time can be scheduled using the <code>at</code> command. In the case of recurring jobs, <strong>crontab</strong> can be used. Clockwork was designed around the same idea, allowing clients to submit jobs that are executed as per their schedule (<em>once or repeated</em>). Instead of executing arbitrary Java code, we limit it to making an HTTP callback to the provided URL endpoint at the specified time. Clients can schedule jobs that are executed once, or at fixed intervals, e.g. once every day or daily at 5 PM.</p>
</li>
<li><p><strong>Distributed</strong> - To support high throughput of callbacks (100K RPS), we would need a service that can scale horizontally.</p>
</li>
<li><p><strong>Durable</strong> - Any submitted task is stored in a durable storage allowing Clockwork to recover from failures.</p>
</li>
<li><p><strong>Fault Tolerant</strong> - Any failure during job execution is handled gracefully per the client configuration. If a client wants the job to be retried, it is automatically retried upon failure as per its retry strategy.</p>
</li>
</ul>
<h2 id="heading-architecture-1000-foot-view">Architecture - 1000-foot view</h2>
<ul>
<li><p>Clients schedule Jobs.</p>
</li>
<li><p>Jobs get executed.</p>
</li>
<li><p>Clients receive callbacks.</p>
</li>
<li><p>Profit 💰</p>
</li>
</ul>
<p><img src="https://lh7-us.googleusercontent.com/JY5QnY397x1b9qMrUiMf4AkQ5JLj3tBl6aFu7_wlvyJ1_GYmT9JNj-TxInCssbmWAEou7TS7QiuOfuAYIETges1pp_95nfYK3Ul9JGuY6RJamdZSwBX9OEUPOpNtr5NToRWCXO-z8fUor1s2fAe23bs" alt /></p>
<p>Whenever a client schedules jobs, we store the job details in HBase and <em>immediately</em> send an acknowledgment back to the client.</p>
<p>Asynchronously, Clockwork keeps on performing <a target="_blank" href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">HBase scans</a> to find the list of eligible jobs that need execution. Once found, those jobs are immediately pushed to RabbitMQ (RMQ) to avoid blocking the scanner threads. As the callback is an HTTP callback to a URL endpoint, it can be time-consuming, hence it's important to decouple the job execution from the job extraction. Actual callbacks to the clients are performed in different threads after extracting the Job details from RMQ.</p>
<p>Worker threads subscribe to client-specific queues on RabbitMQ. In case of new messages, they are notified. Upon notification, a callback is made to the client as per the job details present in the message. If the callback is successful, an acknowledgment is sent to the RMQ which removes the message from the queue (acks the message). In case of a failure during callback (<em>either because the downstream is down or an unexpected response code was received</em>), retries are performed based on client-specific configuration.</p>
<h3 id="heading-why-hbase">Why HBase?</h3>
<p>HBase is a key-value store structured as a sparse, distributed, persistent, multidimensional sorted map. This means that each cell is indexed by a RowKey, ColumnKey, and timestamp. Additionally, the rows are sorted by row keys. This allows us to efficiently query for rows by a specific key and run scans based on a start/stop key. We rely heavily on scans to find the candidate set of jobs whose scheduled execution time is less than or equal to the current time, i.e. the jobs for which callbacks are now due.</p>
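<p>To make the start/stop key idea concrete, here is a minimal sketch that uses Java's <code>TreeMap</code> as a stand-in for HBase's sorted rows. The row-key layout (partition ID, then zero-padded execution time, then job ID) and all names are assumptions made up for illustration; this post does not describe Clockwork's actual key format.</p>
<pre><code class="lang-java">import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

// TreeMap keeps keys sorted, much like HBase keeps rows sorted by row key.
final class SortedScanSketch {
    private final TreeMap&lt;String, String&gt; rows = new TreeMap&lt;&gt;();

    // Hypothetical key layout: partition|zero-padded epoch millis|jobId
    static String rowKey(int partition, long executeAtMillis, String jobId) {
        return String.format(Locale.ROOT, "%03d|%013d|%s", partition, executeAtMillis, jobId);
    }

    void put(int partition, long executeAtMillis, String jobId, String callbackUrl) {
        rows.put(rowKey(partition, executeAtMillis, jobId), callbackUrl);
    }

    // Range scan: keys of every job in the partition with executeAt &lt;= nowMillis.
    List&lt;String&gt; eligibleJobs(int partition, long nowMillis) {
        String start = String.format(Locale.ROOT, "%03d|", partition);
        String stop = String.format(Locale.ROOT, "%03d|%013d|", partition, nowMillis + 1);
        // start is inclusive, stop is exclusive, similar to a start/stop row scan
        return new ArrayList&lt;&gt;(rows.subMap(start, true, stop, false).keySet());
    }
}
</code></pre>
<p>Because the execution time is zero-padded, lexicographic key order matches chronological order, so "everything due by now" becomes one contiguous key range instead of a filter over the whole table.</p>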
<h3 id="heading-why-rabbitmq">Why RabbitMQ?</h3>
<p>RabbitMQ is a messaging broker - an intermediary for messaging. It gives applications a common platform to send/receive messages and provides a durable place to store messages until consumed.</p>
<h2 id="heading-architecture-deep-dive">Architecture - Deep Dive</h2>
<p>Clockwork service can be divided into 5 modules - each entrusted to perform a single responsibility.</p>
<p><img src="https://lh7-us.googleusercontent.com/1kUPYArK2uX5WlcriyEi9M_2SMBCl1rtCnu-WGVJRxLXywLtAIjlI1ahR-G5Nt3luGZYY-KrpRvgk0evZAeXIbBuEHH-CRtrRfzrcCSo2afH4F5G5Rd0V37PrRNt7VZcOf6FvwkjE2iQh2lK02szD1Y" alt /></p>
<h3 id="heading-job-acceptor">Job Acceptor</h3>
<p>Job Acceptor is a Client Facing Module. Its responsibility is to accept and validate the incoming requests from clients, persist the job details in HBase, and return an acknowledgment to the client. While persisting job details, a random Partition ID is assigned to the Job ID. We will cover the role of partitions in the subsequent sections.</p>
<h3 id="heading-job-extractor">Job Extractor</h3>
<p>The Job Extractor’s responsibility is to find jobs that are eligible for execution. A job becomes eligible once its scheduled execution time is &lt;= the current time. The extractor finds eligible jobs by running an HBase scan over a time range. Eligible jobs are pushed to RMQ one by one (<em>without waiting for the job execution</em>) so that the next scan can start as soon as possible.</p>
<h3 id="heading-leader-elector">Leader Elector</h3>
<p>At any point in time, there are multiple instances of Clockwork running. Each instance runs job extractors for all clients as all our containers are stateless. This poses a problem - if all of these extractors are trying to find eligible jobs for the same client at a given point in time, they all will get the same data, which will result in duplicate executions of the same job, something that cannot be allowed by any chance. The leader elector’s responsibility is to assign a leader amongst multiple clockwork instances for every client. The leader for a client assigns the partitions to the workers (extractors) running across different instances of clockwork.</p>
<p><img src="https://lh7-us.googleusercontent.com/j1WAtXnZ7f4PQhKHp16NZQwk2bN1gijaWqcnxj9-gm1WMA-uUGbDSrMc7OC5yKMTJfkhnPCoLt9ndynard_y6L7ACRV0eQO8oqEea16eKVM05WOdQ6Lft_Cvpaa866zmSimVcLPq-ZbgNJGuSALyTso" alt /></p>
<ul>
<li><p>During the application startup, the application instance registers itself with Zookeeper with a unique worker ID.</p>
</li>
<li><p>It then proceeds to check if there is a leader already elected for a client. If not, it tries to become a leader of a client.</p>
</li>
<li><p>The client leader moves on to perform the partition assignment amongst existing clockwork instances (workers).</p>
</li>
</ul>
<p>If you are still reading this article (<em>kudos</em>), you will remember the Partition ID mentioned earlier. Why is it required? To support clients that need a lot of concurrent callbacks. Partitioning allows us to increase the number of concurrent scans that can be performed while ensuring a job only comes up in a single scan. If we have 64 partitions, we can perform 64 concurrent scans, allowing higher throughput.</p>
<p>But with great power comes great responsibility! We still have to ensure that no two instances scan the same partition. Otherwise, it will lead to <em>double execution</em> of the same job! This is where the partition assignment comes into the picture. The leader instance is responsible for assigning partitions to the workers - in a round-robin manner. This ensures fairness in partition distribution and ensures no two workers read the same partition.</p>
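<p>The round-robin assignment can be sketched in a few lines. This is an illustration only: the class and method names are made up, worker IDs are assumed to be unique, and the real leader would also have to publish the result (for example through Zookeeper) and reassign when workers join or leave.</p>
<pre><code class="lang-java">import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PartitionAssigner {
    // Deals partitions 0..partitionCount-1 to the workers in turn,
    // so every partition ends up with exactly one owner.
    static Map&lt;String, List&lt;Integer&gt;&gt; assign(List&lt;String&gt; workerIds, int partitionCount) {
        if (workerIds.isEmpty()) {
            throw new IllegalArgumentException("need at least one worker");
        }
        Map&lt;String, List&lt;Integer&gt;&gt; assignment = new LinkedHashMap&lt;&gt;();
        for (String workerId : workerIds) {
            assignment.put(workerId, new ArrayList&lt;&gt;());
        }
        for (int partition = 0; partition &lt; partitionCount; partition++) {
            String owner = workerIds.get(partition % workerIds.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }
}
</code></pre>
<p>With 3 workers and 8 partitions, the first worker gets partitions 0, 3 and 6, the second gets 1, 4 and 7, and the third gets 2 and 5, so the load differs by at most one partition per worker.</p>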
<h3 id="heading-rmq-publisher">RMQ Publisher</h3>
<p>Once the list of eligible jobs is fetched and some validations are performed, the messages are pushed to RMQ, which acts as our message broker. While publishing the messages, we use a rate limiter.</p>
<p>The rate limiter ensures that we don't publish more than we can consume - otherwise, a huge backlog of messages can destabilize the RMQ cluster. This is achieved by dynamically pausing scans when the queue size reaches a configurable threshold. Once the client is able to receive callbacks again and the queue size starts reducing, subsequent scans resume pushing data into the queues.</p>
<p>We had to redesign this rate limiter to handle spiky traffic. Some of the clients schedule <em>a lot of</em> messages (&gt;100k) that need execution at the same time. This leads to the <em>fast producer - slow consumer</em> problem: the <strong>Job Extractor</strong> finds eligible jobs and enqueues them to the RMQ queues so rapidly that the RMQ consumers are unable to catch up - leading to <em>flow control</em> and <em>cluster instability</em> across the RMQ cluster. To combat this, we used the <a target="_blank" href="https://guava.dev/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html">Guava Rate Limiter</a> to cap the publish rate at a level that we know our 5-node RMQ cluster can handle (~100k consumer acknowledgements per second).</p>
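<p>To show what capping the publish rate means, here is a hand-rolled token bucket in plain Java. It is a stand-in, not Guava's implementation and not Clockwork's code; the class name and parameters are made up, and time is passed in explicitly so the logic is easy to follow and test.</p>
<pre><code class="lang-java">// A simple token bucket: permits refill at a steady rate up to a fixed capacity.
final class TokenBucket {
    private final double permitsPerSecond;
    private final double capacity;
    private double available;
    private long lastRefillNanos;

    TokenBucket(double permitsPerSecond, double capacity, long nowNanos) {
        this.permitsPerSecond = permitsPerSecond;
        this.capacity = capacity;
        this.available = capacity;
        this.lastRefillNanos = nowNanos;
    }

    // Returns true if one message may be published at time nowNanos.
    synchronized boolean tryAcquire(long nowNanos) {
        double elapsedSeconds = (nowNanos - lastRefillNanos) / 1_000_000_000.0;
        available = Math.min(capacity, available + elapsedSeconds * permitsPerSecond);
        lastRefillNanos = nowNanos;
        if (available &gt;= 1.0) {
            available -= 1.0;
            return true;
        }
        return false;
    }
}
</code></pre>
<p>A publisher would call <code>tryAcquire(System.nanoTime())</code> before each publish and back off when it returns false. However fast the extractor finds jobs, the broker then never sees more than the configured rate plus a small burst.</p>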
<p>A lot of our workflows are time-sensitive, and delayed callbacks are sometimes not useful. Clockwork provides a way for clients to specify this time limit by setting a <em>relevancy window</em>. If the client is slow in accepting callbacks (<em>slow consumer problem</em>) or the scans were paused/slow during a time interval (<em>slow publisher problem</em>), expired callbacks will <em>not</em> be sent even when the system stabilizes. Besides this, we also support callback sidelining. This provides a way for clients to avoid getting overwhelmed by clockwork callbacks right after they recover.</p>
<h3 id="heading-rmq-consumer">RMQ Consumer</h3>
<p>It listens to the incoming messages in the RMQ Queues and executes them by making an HTTP call to the specific URL endpoint. Any failure while making the call is handled by client-specific retry strategies.</p>
<p>Some clients don't want to retry and just want to drop failed messages, whereas some want to perform retries based on exponential backoff with random jitter. In case the retries are exhausted and the callback still fails, the message is pushed to a Dead-Letter-Queue. It's kept there until the messages are moved back to the main queue or the messages expire.</p>
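<p>Exponential backoff with jitter can be sketched as below. The post does not say which jitter variant Clockwork uses, so this picks one common option (a random delay between zero and the exponential ceiling) purely as an assumption; the names and parameters are illustrative.</p>
<pre><code class="lang-java">import java.util.Random;

final class Backoff {
    // Delay before retry number `attempt` (0-based):
    // a random value in [0, min(capMillis, baseMillis * 2^attempt)].
    static long delayMillis(int attempt, long baseMillis, long capMillis, Random random) {
        int shift = Math.min(attempt, 30); // keeps the shift from overflowing for sane base values
        long ceiling = Math.min(capMillis, baseMillis &lt;&lt; shift);
        return (long) (random.nextDouble() * (ceiling + 1));
    }
}
</code></pre>
<p>The randomness matters when many callbacks fail at the same moment, e.g. because a client went down: without jitter, all of them would retry in lockstep and hit the recovering client together.</p>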
<h2 id="heading-conclusion">Conclusion</h2>
<p>Clockwork's architecture is tailored and scaled to our specific needs and has enabled us to manage the enormous volume of tasks that PhonePe needs to handle efficiently. This system allows us to offer our customers seamless and uninterrupted service.</p>
<p>As engineers, we know that growth comes with challenges to infrastructure and system quality. At PhonePe, we are constantly seeking solutions to maintain that quality while managing costs and accommodating ever-increasing surges in traffic. Our goal is to share our learnings with the larger engineering community so that we can all learn how to address the challenges of growth and adapt our systems accordingly.</p>
<p>The link to the official blog article can be found <a target="_blank" href="https://tech.phonepe.com/clockwork-the-backbone-of-phonepes-2-billion-daily-jobs/">here</a>.</p>
<p>Please feel free to ask any questions you might have in the comments.</p>
]]></content:encoded></item><item><title><![CDATA[Basics of Linux FileSystem]]></title><description><![CDATA[While working on creating a directory-like hierarchy for an application at work, I stumbled upon a series of questions that left me puzzled

How does Linux handle directories and files?
What happens when you try to read a 10GB of file? How does Linux...]]></description><link>https://snehasishroy.com/basics-of-linux-filesystem</link><guid isPermaLink="true">https://snehasishroy.com/basics-of-linux-filesystem</guid><category><![CDATA[Linux]]></category><category><![CDATA[Programming Blogs]]></category><category><![CDATA[technology]]></category><category><![CDATA[System Architecture]]></category><category><![CDATA[2Articles1Week]]></category><dc:creator><![CDATA[Snehasish Roy]]></dc:creator><pubDate>Sun, 10 Dec 2023 18:11:04 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/o6GEPQXnqMY/upload/02525257a64355c2e97c0e37401d2057.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>While working on creating a directory-like hierarchy for an application at work, I stumbled upon a series of questions that left me puzzled</p>
<blockquote>
<p>How does Linux handle directories and files?</p>
<p>What happens when you try to read a 10 GB file? How does Linux maintain the list of blocks to read from?</p>
<p>How is Linux able to list down thousands of files associated with a directory in seconds? Why is an empty directory always 4 KB?</p>
</blockquote>
<p>In this blog, I will try to answer these questions in a simplified manner.</p>
<h3 id="heading-how-is-a-file-stored-on-the-disk">How is a file stored on the Disk?</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1702663483356/30002a90-db2d-4768-86f4-aeaddf026ad4.webp" alt class="image--center mx-auto" /></p>
<p>A disk provides block storage. It only deals with blocks and provides a facility to read and write them. If you ask it to read the contents of <code>my cool movie.mkv</code>, it will get confused, but if you ask it to return the content of block number 20693, it will happily return a series of 0's and 1's.</p>
<h3 id="heading-what-is-a-block">What is a block?</h3>
<p>A block is the smallest unit of data that the disk can read/write in one go. If the block size is 4 KB, a block can contain at most 4 KB of data. If you modify a single bit of that block, the disk will need to rewrite the entire block, as it cannot modify anything smaller than the block size.</p>
<blockquote>
<p>If the block size is too high, it can lead to wasted space because even if you want to write one byte, you will occupy one entire block. Too small a block size can also be problematic, as the data will now be spread across too many blocks, causing random seeks to be slow.</p>
</blockquote>
<hr />
<h3 id="heading-how-to-maintain-a-list-of-blocks-associated-with-a-file">How to maintain a list of blocks associated with a file?</h3>
<p>Enter FileSystem! A filesystem e.g. FAT, NTFS, ext4, and ZFS provides abstraction over the underlying disks - it maintains a list of blocks a file is associated with, so an end user can read a file seamlessly. How does it do that?</p>
<p>The simplest way would be to only assign a contiguous set of blocks to a file, i.e. if you want to write a 4 MB file, the filesystem will ensure that you get 1024 contiguous blocks of 4 KB each. However, this approach isn't practical because of fragmentation. A file can be modified later, which can increase its length. If the adjacent blocks are already taken, we would have to defragment (compact) the entire disk before proceeding with the update - which can be time-consuming.</p>
<p>The second approach would be to use a linked list, where each block links to the next block of the same file. This approach suffers from slow random access. If we want to jump to the 10th block of a file, we would have to read the first 9 blocks from the disk, which can be slow.</p>
<p>The third approach is to maintain a file allocation table. This strategy was used in the FAT file system.</p>
<blockquote>
<p>When you initialize a filesystem, it creates a fixed-size index based on the number of blocks. When you create a file, that spans across multiple blocks, it keeps on updating the index of the current block to the next block.</p>
</blockquote>
<p>This looks very similar to the linked list approach, but the key difference is that this index is kept in memory. If you have to jump to the 100th block, you still need to traverse 99 array indices, but since the index is in memory, random access is much faster.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1702205179692/b53e04e0-6c85-49f7-a0f8-b8722fcbe480.png" alt="File Index used in FAT file system" class="image--center mx-auto" /></p>
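<p>Here is a tiny sketch of following such a table. The array layout and the <code>-1</code> end-of-file marker are simplifications chosen for illustration, not the on-disk FAT format.</p>
<pre><code class="lang-java">final class FatSketch {
    static final int END_OF_FILE = -1;

    // fat[i] holds the number of the block that follows block i in the same file.
    // Returns the block number holding the n-th (0-based) block of the file.
    static int nthBlock(int[] fat, int firstBlock, int n) {
        int block = firstBlock;
        for (int hops = 0; hops &lt; n; hops++) {
            block = fat[block]; // an in-memory array lookup, no disk read needed
            if (block == END_OF_FILE) {
                throw new IndexOutOfBoundsException("file has fewer than " + (n + 1) + " blocks");
            }
        }
        return block;
    }
}
</code></pre>
<p>The traversal is still linear in the number of hops, exactly like the linked list, but every hop is an array access rather than a disk seek.</p>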
<p>The fourth approach is maintaining a file indexing structure. If the file spans across 2000 blocks, the easiest way would be to keep track of all those 2000 blocks in a list-like data structure - but this would be very memory-expensive.</p>
<blockquote>
<p>What if we could introduce some kind of tiered indexes? For small-sized files, we can maintain a 1:1 mapping of blocks but for bigger files, we can maintain a single level of indirection and for even larger files a double/triple layer of indirection can suffice.</p>
</blockquote>
<p>In the diagram below, the 10 direct entries map straight to 10 data blocks, i.e. for a file with size &lt;= 40 KB (10 * 4 KB), any block can be located in a single lookup.</p>
<p>For files larger than 40 KB, we have multiple levels of indirection. Assuming our block size is 4 KB, and a block number can be represented by an integer (32 bits i.e. 4 Bytes), a block can have 1024 such integers (4 KB / 4 Bytes). With a single level of indirection, we can keep track of 1024 other blocks = 1024 * 4 KB = 4096 KB = 4 MB.</p>
<p>As we add multiple levels of indirection, we can keep track of files with sizes of multiple GBs - all with a fixed amount of memory. Isn't that cool!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1702205608126/54e35ec8-e83c-4abf-927c-fc0950bad3f6.png" alt class="image--center mx-auto" /></p>
<blockquote>
<p><a target="_blank" href="https://github.com/suvratapte/Maurice-Bach-Notes/blob/master/4-Internal-Representation-of-Files.md">Image Source</a></p>
</blockquote>
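<p>The arithmetic above extends naturally to deeper levels. The sketch below assumes the classic layout of 10 direct pointers plus one single, one double and one triple indirect block, with 4 KB blocks and 4-byte block numbers; real filesystems differ in the exact counts.</p>
<pre><code class="lang-java">final class InodeCapacity {
    // Largest file addressable with `directPointers` direct pointers plus
    // one single, one double, and one triple indirect block.
    static long maxFileBytes(long blockSize, long pointerSize, long directPointers) {
        long perBlock = blockSize / pointerSize;        // 4096 / 4 = 1024 pointers per block
        long blocks = directPointers
                + perBlock                              // single indirect: 1024 blocks
                + perBlock * perBlock                   // double indirect: 1024^2 blocks
                + perBlock * perBlock * perBlock;       // triple indirect: 1024^3 blocks
        return blocks * blockSize;
    }
}
</code></pre>
<p>With these numbers, the single indirect block covers 4 MB, the double indirect block covers 4 GB, and the triple indirect block covers 4 TB, so <code>maxFileBytes(4096, 4, 10)</code> comes to a little over 4 TB.</p>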
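<p>A file system maintains this information in Inodes, or Index Nodes. Every file has one unique Inode associated with it. Consider the below file of size <code>4.4 GB</code>. It occupies <code>9163680</code> blocks, and those blocks are mapped to the Inode number <code>2098469</code>. (The block count here appears to be in 512-byte units, as <code>stat</code> usually reports it, rather than 4 KB blocks: 9163680 * 512 bytes comes to roughly 4.4 GiB.)</p>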
<blockquote>
<p>P.S. Note that the file name is not associated with the Inode because we can have multiple names of the same file using <a target="_blank" href="https://en.wikipedia.org/wiki/Hard_link">hard links</a>.</p>
</blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1702207208233/f2725983-bf1e-45ff-990d-6ea05e2d48e8.png" alt class="image--center mx-auto" /></p>
<hr />
<h3 id="heading-how-to-maintain-directories">How to maintain directories?</h3>
<p>Directories allow us to maintain files in a hierarchical structure. So far, we have understood how to represent files in a file system, so let's deep dive into how we can model directories too.</p>
<p>In Linux, everything is a file, so a directory can also be thought of as a file: essentially, one that contains the list of file names present under it, each mapped to its Inode number.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1702209350688/3ecc5cfe-2387-4138-b119-5a5058b5493e.png" alt class="image--center mx-auto" /></p>
<blockquote>
<p>When you create a new/empty directory, it still takes 4 KB of space - which is equivalent to the block size. As and when the files keep on increasing, so does the size of the directory itself. (<em>Note that when I talk about the size of the directory, it does not mean the size of the files contained inside it</em>).</p>
</blockquote>
<p>In the below screenshot, I have created 256 0-byte files in the directory, post which the size of the directory increases to 20480 bytes = 5 blocks. Why? To maintain the list of file names present under it.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1702209833119/3525aa66-7bb0-4ccd-8141-86ac9164769e.png" alt class="image--center mx-auto" /></p>
<p>In the original Unix implementation, 16 bytes were reserved for one directory entry: 14 bytes for the file name and 2 bytes for the Inode number. This resulted in a filesystem that supports file names of up to 14 characters and at most 2^16 (65,536) Inodes. This has since changed and varies across file system implementations like ext4 and ZFS.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1702209283656/76c9323c-868c-4830-8d3f-5d143cf081be.png" alt class="image--center mx-auto" /></p>
<blockquote>
<p><a target="_blank" href="https://github.com/suvratapte/Maurice-Bach-Notes/blob/master/4-Internal-Representation-of-Files.md">Image Source</a></p>
</blockquote>
<h3 id="heading-how-to-trace-pathnames">How to trace pathnames?</h3>
<p>With the internal representation of directories in place, let's evaluate the scenario when you perform <code>ls -l /home/movies/action/Interstellar.mkv</code></p>
<p>You start your search from the root directory. Go through the <em>directory entries</em> of the root directory present at Inode 20030 and see if any file name matches <code>home</code>. If yes, go to the Inode of 56969. Recursively, repeat the search until you find your target file which is present at Inode 60025. From the Inode, the file system can easily find the list of blocks the file is present in, and you can view the contents of that file.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1702211709275/abf27dcb-7a9c-496f-b1c4-bc6866467bf6.png" alt class="image--center mx-auto" /></p>
<p>In the above example, we performed a linear search, i.e. we checked the entries of each directory one by one. But what if a directory contains millions of files? In that case, the search will be extremely slow. To optimize for large directories, filesystems maintain an index (<em>similar to the one used by databases</em>) to speed up the search, e.g. ext4 uses a <em>hash tree (HTree)</em> to maintain a directory index.</p>
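<p>The walk from the root to the target file can be modeled in a few lines. This is a toy in-memory model, not how a kernel stores directories: each directory is a map from file name to Inode number, and the class and method names are made up. Using a hash map per directory also hints at why an index beats a linear scan.</p>
<pre><code class="lang-java">import java.util.HashMap;
import java.util.Map;

final class PathResolver {
    // Inode number of a directory -&gt; its entries (file name -&gt; Inode number)
    private final Map&lt;Integer, Map&lt;String, Integer&gt;&gt; directories = new HashMap&lt;&gt;();
    private final int rootInode;

    PathResolver(int rootInode) {
        this.rootInode = rootInode;
        directories.put(rootInode, new HashMap&lt;&gt;());
    }

    // dirInode must refer to a directory that was added earlier.
    void addEntry(int dirInode, String name, int inode, boolean isDirectory) {
        directories.get(dirInode).put(name, inode);
        if (isDirectory) {
            directories.putIfAbsent(inode, new HashMap&lt;&gt;());
        }
    }

    // Walks the path one component at a time; returns -1 if a component is missing.
    int resolve(String absolutePath) {
        int current = rootInode;
        for (String name : absolutePath.split("/")) {
            if (name.isEmpty()) {
                continue; // skips the empty piece before the leading slash
            }
            Map&lt;String, Integer&gt; entries = directories.get(current);
            if (entries == null || !entries.containsKey(name)) {
                return -1; // not a directory, or no such entry
            }
            current = entries.get(name);
        }
        return current;
    }
}
</code></pre>
<p>Registering the root at Inode 20030, <code>home</code> at 56969, the intermediate directories at any made-up Inode numbers, and the movie at 60025 makes <code>resolve("/home/movies/action/Interstellar.mkv")</code> return 60025, one directory lookup per path component.</p>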
<hr />
<h3 id="heading-bonus-section">Bonus Section</h3>
<p>If you are with me so far, thank you. Hope you learned something new today. In this section, I will list down random learnings that I found during the research of this article.</p>
<h3 id="heading-how-to-output-to-another-terminal-window">How to output to another terminal window?</h3>
<p>In Linux, everything is a file, even disks, sockets and processes. When you run a terminal, its input and output are also modeled as a file. Open two terminal windows, find their device paths using <code>tty</code>, and use the <code>echo</code> command to write to the other terminal window. So cool!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1702230394154/5c3e5d3b-5c7d-4168-bd05-5dae7872935d.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1702230467177/772c7f41-88b0-46f3-8c34-806ba0a63a37.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-how-to-find-the-list-of-free-blocks-and-free-inodes">How to find the list of free blocks and free Inodes?</h3>
<p>Whenever you want to write data to a file, you need an available block. How do you quickly find one?</p>
<p>Enter the humble <strong>BitMap.</strong> Filesystems commonly maintain a bitmap to track free blocks. Bitmaps consume only 1 bit for 1 block and hence can be easily kept in memory to perform quick operations.</p>
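<p>Java's <code>BitSet</code> makes it easy to sketch the idea. The class below is illustrative only (names are made up, and a real filesystem keeps such bitmaps per block group on disk), but it shows how cheap the "find a free block" operation becomes.</p>
<pre><code class="lang-java">import java.util.BitSet;

final class BlockBitmap {
    private final BitSet used;      // bit i set means block i is in use
    private final int totalBlocks;

    BlockBitmap(int totalBlocks) {
        this.totalBlocks = totalBlocks;
        this.used = new BitSet(totalBlocks);
    }

    // Claims the lowest-numbered free block, or returns -1 if the disk is full.
    int allocate() {
        int block = used.nextClearBit(0);
        if (block &gt;= totalBlocks) {
            return -1;
        }
        used.set(block);
        return block;
    }

    // Marks a block as free again.
    void free(int block) {
        used.clear(block);
    }
}
</code></pre>
<p>For a sense of scale: a 1 TB disk with 4 KB blocks has 2^28 blocks, so its bitmap needs 2^28 bits, i.e. 32 MB.</p>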
<h3 id="heading-zfs-the-cool-kid-in-town">ZFS - the cool kid in town</h3>
<p>I am not going to write in detail about ZFS, but it's the cool kid in town with a completely new architecture: it has inbuilt snapshots (the <em>ability to version documents as and when they change</em>), has an inbuilt cache layer (<em>ARC, which supposedly provides a better hit ratio than the LRU cache - the default used in page cache</em>), uses Copy on Write semantics, and provides inbuilt checksums using Merkle Trees.</p>
<hr />
<h3 id="heading-references">References</h3>
<ul>
<li><p><a target="_blank" href="https://pages.cs.wisc.edu/~solomon/cs537-old/last/filesys.html">https://pages.cs.wisc.edu/~solomon/cs537-old/last/filesys.html</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/suvratapte/Maurice-Bach-Notes/blob/master/4-Internal-Representation-of-Files.md">https://github.com/suvratapte/Maurice-Bach-Notes/blob/master/4-Internal-Representation-of-Files.md</a></p>
</li>
<li><p><a target="_blank" href="https://en.wikipedia.org/wiki/Hard_link">https://en.wikipedia.org/wiki/Hard_link</a></p>
</li>
<li><p><a target="_blank" href="https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Directory_Entries">https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Directory_Entries</a></p>
</li>
<li><p><a target="_blank" href="https://www.sans.org/blog/understanding-ext4-part-6-directories/">https://www.sans.org/blog/understanding-ext4-part-6-directories/</a></p>
</li>
<li><p><a target="_blank" href="https://teaching.idallen.com/cst8207/13w/notes/450_file_system.html#TOC">https://teaching.idallen.com/cst8207/13w/notes/450_file_system.html#TOC</a></p>
</li>
<li><p><a target="_blank" href="https://metebalci.com/blog/a-minimum-complete-tutorial-of-linux-ext4-file-system/">https://metebalci.com/blog/a-minimum-complete-tutorial-of-linux-ext4-file-system/</a></p>
</li>
<li><p><a target="_blank" href="https://www3.nd.edu/~pbui/teaching/cse.30341.fa18/project06.html">https://www3.nd.edu/~pbui/teaching/cse.30341.fa18/project06.html</a></p>
</li>
<li><p><a target="_blank" href="https://en.wikipedia.org/wiki/ZFS">https://en.wikipedia.org/wiki/ZFS</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Observer Design Pattern explained in 2 minutes]]></title><description><![CDATA[Problem Statement
Imagine, you have a pool of workers and a set of jobs that need execution. How are you going to model it?
class Job {
    List<Worker> workers = new ArrayList<>();

    public void executeJob() {
        workers.forEach(worker -> wo...]]></description><link>https://snehasishroy.com/observer-design-pattern-explained-in-2-minutes</link><guid isPermaLink="true">https://snehasishroy.com/observer-design-pattern-explained-in-2-minutes</guid><category><![CDATA[Java]]></category><category><![CDATA[design patterns]]></category><category><![CDATA[technology]]></category><category><![CDATA[Programming Blogs]]></category><category><![CDATA[System Design]]></category><dc:creator><![CDATA[Snehasish Roy]]></dc:creator><pubDate>Mon, 20 Nov 2023 12:30:10 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/CUY_YHhCFl4/upload/8c8631b1f4d56bb9952c58ccae521073.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-problem-statement">Problem Statement</h3>
<p>Imagine, you have a pool of workers and a set of jobs that need execution. How are you going to model it?</p>
<pre><code class="lang-java"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Job</span> </span>{
    List&lt;Worker&gt; workers = <span class="hljs-keyword">new</span> ArrayList&lt;&gt;();

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">executeJob</span><span class="hljs-params">()</span> </span>{
        workers.forEach(worker -&gt; worker.execute(<span class="hljs-keyword">this</span>));
    }
}
</code></pre>
<p>The simplest strategy is to pass a list of workers to the Job and then invoke them one by one during execution.</p>
<p>The downside of this approach is that it tightly couples two different concepts - job and its execution together in the same class. It's assigning multiple responsibilities to the same class which violates SRP (Single Responsibility Principle).</p>
<p>Observer Design Pattern can simplify this solution!</p>
<pre><code class="lang-java"><span class="hljs-function"><span class="hljs-keyword">public</span> record <span class="hljs-title">Job</span><span class="hljs-params">(String jobId)</span> </span>{
}

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">interface</span> <span class="hljs-title">Observer</span>&lt;<span class="hljs-title">T</span>&gt; </span>{
    <span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">update</span><span class="hljs-params">(T data)</span></span>;
}

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">interface</span> <span class="hljs-title">Subject</span>&lt;<span class="hljs-title">T</span>&gt; </span>{
    <span class="hljs-function"><span class="hljs-keyword">boolean</span> <span class="hljs-title">addObserver</span><span class="hljs-params">(Observer&lt;T&gt; observer)</span></span>;

    <span class="hljs-function"><span class="hljs-keyword">boolean</span> <span class="hljs-title">removeObserver</span><span class="hljs-params">(Observer&lt;T&gt; observer)</span></span>;

    <span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">notifyObservers</span><span class="hljs-params">(T data)</span></span>;
}

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">JobDispatcher</span> <span class="hljs-keyword">implements</span> <span class="hljs-title">Subject</span>&lt;<span class="hljs-title">Job</span>&gt; </span>{
    Set&lt;Observer&lt;Job&gt;&gt; observers = <span class="hljs-keyword">new</span> HashSet&lt;&gt;();

    <span class="hljs-meta">@Override</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">boolean</span> <span class="hljs-title">addObserver</span><span class="hljs-params">(Observer&lt;Job&gt; observer)</span> </span>{
        <span class="hljs-keyword">return</span> observers.add(observer);
    }

    <span class="hljs-meta">@Override</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">boolean</span> <span class="hljs-title">removeObserver</span><span class="hljs-params">(Observer&lt;Job&gt; observer)</span> </span>{
        <span class="hljs-keyword">return</span> observers.remove(observer);
    }

    <span class="hljs-meta">@Override</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">notifyObservers</span><span class="hljs-params">(Job job)</span> </span>{
        observers.forEach(observer -&gt; observer.update(job));
    }
}

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">JobExecutor</span> <span class="hljs-keyword">implements</span> <span class="hljs-title">Observer</span>&lt;<span class="hljs-title">Job</span>&gt; </span>{
    UUID uuid;

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">JobExecutor</span><span class="hljs-params">()</span> </span>{
        uuid = UUID.randomUUID();
    }

    <span class="hljs-meta">@Override</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">update</span><span class="hljs-params">(Job data)</span> </span>{
        System.out.println(uuid + <span class="hljs-string">" is executing jobId "</span> + data.jobId());
    }
}

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Main</span> </span>{
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">main</span><span class="hljs-params">(String[] args)</span> </span>{
        testObserver();
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">testObserver</span><span class="hljs-params">()</span> </span>{
        JobExecutor executor1 = <span class="hljs-keyword">new</span> JobExecutor();
        JobExecutor executor2 = <span class="hljs-keyword">new</span> JobExecutor();
        JobDispatcher jobDispatcher = <span class="hljs-keyword">new</span> JobDispatcher();
        jobDispatcher.addObserver(executor1);
        jobDispatcher.addObserver(executor2);
        jobDispatcher.notifyObservers(<span class="hljs-keyword">new</span> Job(<span class="hljs-string">"job1"</span>));

        jobDispatcher.removeObserver(executor1);
        jobDispatcher.notifyObservers(<span class="hljs-keyword">new</span> Job(<span class="hljs-string">"job2"</span>));
    }
}

<span class="hljs-comment">// Output</span>
<span class="hljs-number">3139088</span>c-<span class="hljs-number">07</span>eb-<span class="hljs-number">4f</span>92-<span class="hljs-number">92d</span>a-<span class="hljs-number">0f</span>b3af53b6be is executing jobId job1
fea0bf8a-<span class="hljs-number">6</span>b2e-<span class="hljs-number">413f</span>-<span class="hljs-number">9</span>ae9-<span class="hljs-number">4</span>a286b142798 is executing jobId job1
fea0bf8a-<span class="hljs-number">6</span>b2e-<span class="hljs-number">413f</span>-<span class="hljs-number">9</span>ae9-<span class="hljs-number">4</span>a286b142798 is executing jobId job2
</code></pre>
<p>We have created two new interfaces, <code>Observer</code> and <code>Subject</code>. The sole responsibility of an <code>Observer</code> is to act when a subject changes. The subject keeps track of its observers and decides how and when to notify them.</p>
<h3 id="heading-whats-the-benefit">What's the benefit?</h3>
<ul>
<li>Adheres to SRP - Each class has one responsibility and is decoupled from the others. The execution logic can change without impacting the subject.</li>
</ul>
<h3 id="heading-whats-the-drawback">What's the drawback?</h3>
<ul>
<li>This design pattern is very similar to the Producer/Consumer problem and hence suffers from all of its issues as well. In production-ready code, we would have to account for conditions such as slow observers, backpressure propagation and task backlog management.</li>
</ul>
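<p>One way to keep a slow observer from stalling the subject is to hand notifications to a bounded worker pool. The sketch below is only an illustration under stated assumptions: <code>JobObserver</code>, <code>BoundedDispatcher</code> and the queue size are hypothetical names and values, not part of the code above.</p>
<pre><code class="lang-java">import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the Observer interface used in this post.
interface JobObserver {
    void update(String jobId);
}

// Hypothetical dispatcher: notifications go through a bounded queue, so a slow
// observer cannot grow the backlog without limit.
class BoundedDispatcher {
    private final List&lt;JobObserver&gt; observers = new CopyOnWriteArrayList&lt;&gt;();
    private final ThreadPoolExecutor pool;

    BoundedDispatcher(int threads, int maxBacklog) {
        // CallerRunsPolicy: when the queue is full, the notifying thread runs the
        // task itself, which slows the producer down (a simple form of backpressure).
        pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue&lt;&gt;(maxBacklog),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    void addObserver(JobObserver observer) {
        observers.add(observer);
    }

    void notifyObservers(String jobId) {
        for (JobObserver observer : observers) {
            pool.execute(() -&gt; observer.update(jobId));
        }
    }

    // Stops accepting work and waits for queued notifications to finish.
    boolean shutdownAndWait(long timeoutMillis) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
</code></pre>
<p>The rejection policy is the main design choice here: running the task on the caller slows the subject down, whereas dropping or failing the notification would keep the subject fast at the cost of lost updates.</p>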
]]></content:encoded></item><item><title><![CDATA[Chain of Responsibility Design Pattern explained in 2 minutes]]></title><description><![CDATA[Problem Statement
In real-world applications, we frequently have to run a series of validations to ensure our model class is properly created before we persist it in our database. Consider the below Employee class
public record Employee(int employeeI...]]></description><link>https://snehasishroy.com/chain-of-responsibility-design-pattern-explained-in-2-minutes</link><guid isPermaLink="true">https://snehasishroy.com/chain-of-responsibility-design-pattern-explained-in-2-minutes</guid><category><![CDATA[2Articles1Week]]></category><category><![CDATA[System Architecture]]></category><category><![CDATA[System Design]]></category><category><![CDATA[low level design]]></category><category><![CDATA[design patterns]]></category><dc:creator><![CDATA[Snehasish Roy]]></dc:creator><pubDate>Sat, 18 Nov 2023 17:00:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/hAVPg-JLGfo/upload/45e2ae284d4e5911d52294b8b886c7ea.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-problem-statement">Problem Statement</h3>
<p>In real-world applications, we frequently have to run a series of validations to ensure our model object is properly created before we persist it in our database. Consider the <code>Employee</code> record below:</p>
<pre><code class="lang-java"><span class="hljs-function"><span class="hljs-keyword">public</span> record <span class="hljs-title">Employee</span><span class="hljs-params">(<span class="hljs-keyword">int</span> employeeId, String firstName, String lastName, <span class="hljs-keyword">int</span> salary, <span class="hljs-keyword">int</span> managerId, <span class="hljs-keyword">int</span> age)</span> </span>{
}
</code></pre>
<p>If we have to run validations to ensure whether the names are properly set, the age is valid or a valid employee ID has been assigned before we persist our POJO into a database, what's the simplest way?</p>
<p>Presenting the humble <code>If/else</code>!</p>
<pre><code class="lang-java">  <span class="hljs-keyword">if</span> (validEmployee.employeeId() != <span class="hljs-number">0</span> 
        &amp;&amp; !validEmployee.firstName().isEmpty() 
        &amp;&amp; !validEmployee.lastName().isEmpty() 
        &amp;&amp; validEmployee.age() &gt;= <span class="hljs-number">18</span>) {
      <span class="hljs-keyword">return</span> <span class="hljs-keyword">true</span>;
  }
</code></pre>
<p>What red flag do you see in the above code? Although it's simple, it violates OCP (Open/Closed Principle): every new validation rule means editing this condition, so the code is not extensible and will require frequent modifications.</p>
<p>Chain of Responsibility Design Pattern can help tidy up the code.</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-keyword">abstract</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Validator</span> </span>{
    <span class="hljs-keyword">public</span> Validator nextValidator;

    <span class="hljs-function"><span class="hljs-keyword">public</span> Validator <span class="hljs-title">setNextValidator</span><span class="hljs-params">(Validator next)</span> </span>{
        <span class="hljs-keyword">this</span>.nextValidator = next;
        <span class="hljs-keyword">return</span> <span class="hljs-keyword">this</span>;
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">abstract</span> <span class="hljs-keyword">boolean</span> <span class="hljs-title">isValid</span><span class="hljs-params">(Employee employee)</span></span>;
}

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">EmployeeIdValidator</span> <span class="hljs-keyword">extends</span> <span class="hljs-title">Validator</span> </span>{
    <span class="hljs-meta">@Override</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">boolean</span> <span class="hljs-title">isValid</span><span class="hljs-params">(Employee employee)</span> </span>{
        System.out.println(<span class="hljs-string">"Running Employee ID Validator"</span>);
        <span class="hljs-keyword">if</span> (nextValidator == <span class="hljs-keyword">null</span>) {
            <span class="hljs-comment">// if there is no next validator in the chain</span>
            <span class="hljs-keyword">return</span> isIdValid(employee.employeeId());
        } <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (isIdValid(employee.employeeId())) {
            <span class="hljs-comment">// delegate to the next validator</span>
            <span class="hljs-keyword">return</span> nextValidator.isValid(employee);
        } <span class="hljs-keyword">else</span> {
            System.out.println(<span class="hljs-string">"Employee ID is invalid"</span>);
            <span class="hljs-keyword">return</span> <span class="hljs-keyword">false</span>;
        }
    }

    <span class="hljs-function"><span class="hljs-keyword">private</span> <span class="hljs-keyword">boolean</span> <span class="hljs-title">isIdValid</span><span class="hljs-params">(<span class="hljs-keyword">int</span> id)</span> </span>{
        <span class="hljs-keyword">return</span> id != <span class="hljs-number">0</span>;
    }
}

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">NameValidator</span> <span class="hljs-keyword">extends</span> <span class="hljs-title">Validator</span> </span>{
    <span class="hljs-meta">@Override</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">boolean</span> <span class="hljs-title">isValid</span><span class="hljs-params">(Employee employee)</span> </span>{
        System.out.println(<span class="hljs-string">"Running Name Validator"</span>);
        <span class="hljs-keyword">if</span> (nextValidator == <span class="hljs-keyword">null</span>) {
            <span class="hljs-keyword">return</span> isNameValid(employee);
        } <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (isNameValid(employee)) {
            <span class="hljs-keyword">return</span> nextValidator.isValid(employee);
        } <span class="hljs-keyword">else</span> {
            System.out.println(<span class="hljs-string">"Employee name is invalid"</span>);
            <span class="hljs-keyword">return</span> <span class="hljs-keyword">false</span>;
        }
    }

    <span class="hljs-function"><span class="hljs-keyword">private</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">boolean</span> <span class="hljs-title">isNameValid</span><span class="hljs-params">(Employee employee)</span> </span>{
        <span class="hljs-comment">// isBlank() also covers the empty string, so a separate isEmpty() check is redundant</span>
        <span class="hljs-keyword">return</span> !employee.firstName().isBlank()
                &amp;&amp; !employee.lastName().isBlank();
    }
}

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">AgeValidator</span> <span class="hljs-keyword">extends</span> <span class="hljs-title">Validator</span> </span>{
    <span class="hljs-meta">@Override</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">boolean</span> <span class="hljs-title">isValid</span><span class="hljs-params">(Employee employee)</span> </span>{
        System.out.println(<span class="hljs-string">"Running Age Validator"</span>);
        <span class="hljs-keyword">if</span> (nextValidator == <span class="hljs-keyword">null</span>) {
            <span class="hljs-keyword">return</span> isAgeValid(employee.age());
        } <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (isAgeValid(employee.age())) {
            <span class="hljs-keyword">return</span> nextValidator.isValid(employee);
        } <span class="hljs-keyword">else</span> {
            System.out.println(<span class="hljs-string">"Age is not valid"</span>);
            <span class="hljs-keyword">return</span> <span class="hljs-keyword">false</span>;
        }
    }

    <span class="hljs-function"><span class="hljs-keyword">private</span> <span class="hljs-keyword">boolean</span> <span class="hljs-title">isAgeValid</span><span class="hljs-params">(<span class="hljs-keyword">int</span> age)</span> </span>{
        <span class="hljs-keyword">return</span> age &gt;= <span class="hljs-number">18</span> &amp;&amp; age &lt;= <span class="hljs-number">70</span>;
    }
}

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Main</span> </span>{
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">main</span><span class="hljs-params">(String[] args)</span> </span>{
        testChainOfResponsibility();
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">testChainOfResponsibility</span><span class="hljs-params">()</span> </span>{
        Employee validEmployee = <span class="hljs-keyword">new</span> Employee(<span class="hljs-number">1</span>, <span class="hljs-string">"Snehasish"</span>, <span class="hljs-string">"Roy"</span>, <span class="hljs-number">100</span>, <span class="hljs-number">100</span>, <span class="hljs-number">20</span>);
        <span class="hljs-comment">// Chain the validators</span>
        Validator validatorChain = <span class="hljs-keyword">new</span> EmployeeIdValidator()
                .setNextValidator(<span class="hljs-keyword">new</span> AgeValidator()
                        .setNextValidator(<span class="hljs-keyword">new</span> NameValidator()));
        System.out.println(validatorChain.isValid(validEmployee));

        Employee invalidEmployee = <span class="hljs-keyword">new</span> Employee(<span class="hljs-number">1</span>, <span class="hljs-string">"Snehasish"</span>, <span class="hljs-string">"Roy"</span>, <span class="hljs-number">100</span>, <span class="hljs-number">100</span>, <span class="hljs-number">10</span>);
        System.out.println(validatorChain.isValid(invalidEmployee));
    }
}

<span class="hljs-comment">// Output</span>
Running Employee ID Validator
Running Age Validator
Running Name Validator
<span class="hljs-keyword">true</span>

Running Employee ID Validator
Running Age Validator
Age is not valid
<span class="hljs-keyword">false</span>
</code></pre>
<p>In the code above, each validator handles exactly one validation. A validator first runs its own check and, only if that check passes, delegates to the next validator (<em>if any</em>).</p>
<h3 id="heading-whats-the-benefit">What's the benefit?</h3>
<ul>
<li><p>Follows OCP - A new validation is added as a new validator class, and a change to an existing validation touches only that validator's class.</p>
</li>
<li><p>Follows SRP (Single Responsibility Principle) - Each validator performs only one task.</p>
</li>
</ul>
<h3 id="heading-whats-the-drawback">What's the drawback?</h3>
<ul>
<li><p>Validators must be chained correctly - if the chain contains a cycle, the validators keep delegating to each other and the call ends in a <code>StackOverflowError</code> at runtime.</p>
</li>
<li><p>The responsibility of initializing validators lies with the client. The client can either create the chain at compile time or build and update it dynamically as per the business logic.</p>
</li>
</ul>
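<p>Both drawbacks can be softened by moving the delegation into the base class and linking validators from an ordered list. The following is a minimal sketch rather than the code used above: <code>ChainValidator</code>, <code>check</code>, <code>link</code> and the trimmed-down <code>Person</code> record are hypothetical names chosen for illustration.</p>
<pre><code class="lang-java">import java.util.HashSet;
import java.util.List;

// Trimmed-down stand-in for the Employee record, for illustration only.
record Person(int id, String name, int age) {}

// Hypothetical base class: the delegation logic lives here once, and subclasses
// only implement their own check.
abstract class ChainValidator {
    private ChainValidator next;

    protected abstract boolean check(Person person);

    public final boolean isValid(Person person) {
        if (!check(person)) {
            return false; // stop at the first failing validator
        }
        return next == null || next.isValid(person);
    }

    // Links validators in list order. Each instance may appear only once, so a
    // chain built this way cannot loop back on itself.
    static ChainValidator link(List&lt;ChainValidator&gt; validators) {
        if (validators.isEmpty()) {
            throw new IllegalArgumentException("At least one validator is required");
        }
        if (new HashSet&lt;&gt;(validators).size() != validators.size()) {
            throw new IllegalArgumentException("A validator appears twice in the chain");
        }
        for (int i = 0; i &lt; validators.size(); i++) {
            validators.get(i).next = (i + 1 &lt; validators.size()) ? validators.get(i + 1) : null;
        }
        return validators.get(0);
    }
}

class IdCheck extends ChainValidator {
    @Override
    protected boolean check(Person person) {
        return person.id() != 0;
    }
}

class AgeCheck extends ChainValidator {
    @Override
    protected boolean check(Person person) {
        return person.age() &gt;= 18 &amp;&amp; person.age() &lt;= 70;
    }
}
</code></pre>
<p>A client would then write something like <code>ChainValidator.link(List.of(new IdCheck(), new AgeCheck()))</code>, and the order of the list is the order of the chain.</p>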
]]></content:encoded></item><item><title><![CDATA[Prototype Design Pattern explained in 2 minutes]]></title><description><![CDATA[Problem Statement
Functional programming heavily prefers Immutability i.e. objects should not be mutated. But what about the cases when we need to modify a specific field of an object? You need to create a copy of the entire object and update that sp...]]></description><link>https://snehasishroy.com/prototype-design-pattern-explained-in-2-minutes</link><guid isPermaLink="true">https://snehasishroy.com/prototype-design-pattern-explained-in-2-minutes</guid><category><![CDATA[2Articles1Week]]></category><category><![CDATA[design patterns]]></category><category><![CDATA[Java]]></category><category><![CDATA[architecture]]></category><category><![CDATA[technology]]></category><dc:creator><![CDATA[Snehasish Roy]]></dc:creator><pubDate>Mon, 13 Nov 2023 18:15:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/7Y0NshQLohk/upload/6df5e34563c899e18fdc84d10fc883c6.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-problem-statement">Problem Statement</h3>
<p>Functional programming heavily prefers immutability, i.e. objects should not be mutated. But what about the cases where we need to change a specific field of an object? We need to create a copy of the entire object with that specific field updated.</p>
<p>Okay, but how do I copy an existing object? Prototype design pattern to the rescue!</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Car</span> <span class="hljs-keyword">implements</span> <span class="hljs-title">Cloneable</span> </span>{
    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> String name;

    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> List&lt;Integer&gt; mileage;

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">Car</span><span class="hljs-params">(String name, List&lt;Integer&gt; mileage)</span> </span>{
        <span class="hljs-keyword">this</span>.name = name;
        <span class="hljs-keyword">this</span>.mileage = mileage;
    }

    <span class="hljs-comment">// Clone using Copy constructor</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">Car</span><span class="hljs-params">(Car existing)</span> </span>{
        <span class="hljs-keyword">this</span>.name = existing.name;
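        <span class="hljs-comment">// deep copy: the clone gets its own (unmodifiable) list; the Integer elements are immutable, so sharing them is safe</span>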
        <span class="hljs-keyword">this</span>.mileage = List.copyOf(existing.getMileage());
    }

    <span class="hljs-comment">// Clone using the default clone method.</span>
    <span class="hljs-meta">@Override</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> Car <span class="hljs-title">clone</span><span class="hljs-params">()</span> </span>{
        <span class="hljs-keyword">try</span> {
            <span class="hljs-comment">// Object.clone() is shallow: fields are copied as-is, so the clone shares the same List instance</span>
            <span class="hljs-keyword">return</span> (Car) <span class="hljs-keyword">super</span>.clone();
        } <span class="hljs-keyword">catch</span> (CloneNotSupportedException e) {
            <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> AssertionError();
        }
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> String <span class="hljs-title">getName</span><span class="hljs-params">()</span> </span>{
        <span class="hljs-keyword">return</span> name;
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> List&lt;Integer&gt; <span class="hljs-title">getMileage</span><span class="hljs-params">()</span> </span>{
        <span class="hljs-keyword">return</span> mileage;
    }
}
</code></pre>
<p>In the above code, we demonstrated two ways to create a clone of an Object in Java - using a custom copy constructor or by implementing the <code>Cloneable</code> interface.</p>
<p><code>Cloneable</code> is a special interface in Java: it is a <em>marker</em> interface and declares no methods. <code>Object.clone()</code> is <code>protected</code>, so a class that wants to expose cloning must override it as <code>public</code>. Implementing <code>Cloneable</code> is what permits <code>Object.clone()</code> to copy the instance; without it, the call throws a <code>CloneNotSupportedException</code>. By default, it produces a <em>shallow</em> clone, i.e. fields are copied as-is, so reference fields point to the same objects as the original.<br />Using <code>Cloneable</code> to create clones is widely discouraged because of these subtleties.</p>
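<p>If <code>clone()</code> has to produce a deep copy, each mutable field must be replaced on the copy after calling <code>super.clone()</code>. A minimal sketch, assuming a hypothetical <code>Garage</code> class; note that the field can no longer be <code>final</code>, which is one of the complications of this approach:</p>
<pre><code class="lang-java">import java.util.ArrayList;
import java.util.List;

// Hypothetical class, for illustration only.
class Garage implements Cloneable {
    // Not final: clone() has to reassign it on the copy.
    private List&lt;String&gt; cars;

    Garage(List&lt;String&gt; cars) {
        this.cars = new ArrayList&lt;&gt;(cars);
    }

    @Override
    public Garage clone() {
        try {
            Garage copy = (Garage) super.clone();      // field-for-field (shallow) copy
            copy.cars = new ArrayList&lt;&gt;(this.cars);    // give the copy its own list
            return copy;
        } catch (CloneNotSupportedException e) {
            // Cannot happen: this class implements Cloneable.
            throw new AssertionError(e);
        }
    }

    void addCar(String car) {
        cars.add(car);
    }

    List&lt;String&gt; getCars() {
        return List.copyOf(cars);
    }
}
</code></pre>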
<p>Copy Constructors on the other hand are very intuitive and allow you to exactly control the behaviour. You can tweak the code to easily use shallow or deep copying as and when required.</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Main</span> </span>{
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">main</span><span class="hljs-params">(String[] args)</span> </span>{
        testPrototype();
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">testPrototype</span><span class="hljs-params">()</span> </span>{
        List&lt;Integer&gt; mileage = <span class="hljs-keyword">new</span> ArrayList&lt;&gt;();
        mileage.add(<span class="hljs-number">10</span>);
        mileage.add(<span class="hljs-number">20</span>);
        Car originalCar = <span class="hljs-keyword">new</span> Car(<span class="hljs-string">"bentley"</span>, mileage);
        Car shallowClone = originalCar.clone();
        Car deepClone = <span class="hljs-keyword">new</span> Car(originalCar);

        List&lt;Integer&gt; clonedMileage = shallowClone.getMileage();
        System.out.println(shallowClone.getName() + <span class="hljs-string">" "</span> + clonedMileage);
        mileage.set(<span class="hljs-number">0</span>, <span class="hljs-number">50</span>);
        <span class="hljs-comment">// updating the mileage also updated the clonedMileage because they both </span>
        <span class="hljs-comment">// point to the same object (shallow copy)</span>
        System.out.println(clonedMileage.get(<span class="hljs-number">0</span>).equals(<span class="hljs-number">50</span>));

        <span class="hljs-comment">// updating the mileage didn't affect the deep cloned object</span>
        System.out.println(deepClone.getMileage().get(<span class="hljs-number">0</span>).equals(<span class="hljs-number">10</span>));
    }
}
<span class="hljs-comment">// Output</span>
bentley [<span class="hljs-number">10</span>, <span class="hljs-number">20</span>]
<span class="hljs-keyword">true</span>
<span class="hljs-keyword">true</span>
</code></pre>
<h3 id="heading-bonus-section">Bonus Section</h3>
<p>If you are using <code>Lombok</code> in your application to generate boilerplate Java code, then you can use <code>@With</code> or <code>@Builder(toBuilder = true)</code> to create clones.</p>
<p><code>@With</code> lets you create a cloned object with a single field changed, whereas <code>toBuilder()</code> lets you override as many fields as you need before building one new object. You can chain multiple <code>with...()</code> calls to update several fields in succession, but each call allocates a new intermediate object that is immediately discarded.</p>
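<p>To make the allocation cost concrete, here is a hand-written equivalent of such "wither" methods. This is a plain-Java sketch and not Lombok's generated output; the <code>Point</code> record and the <code>withX</code>/<code>withY</code> methods are hypothetical.</p>
<pre><code class="lang-java">// Hypothetical immutable type with hand-written "wither" methods.
record Point(int x, int y) {
    // Returns a copy with only x changed; the original is left untouched.
    Point withX(int newX) {
        return new Point(newX, y);
    }

    // Returns a copy with only y changed.
    Point withY(int newY) {
        return new Point(x, newY);
    }
}

class WitherDemo {
    static Point moveTo(Point start, int x, int y) {
        // Each call allocates a new Point, so chaining two withers creates one
        // intermediate object that is discarded immediately.
        return start.withX(x).withY(y);
    }
}
</code></pre>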
<h3 id="heading-whats-the-benefit">What's the benefit?</h3>
<ul>
<li>Domain logic to clone the object does not spill out to the outer application.</li>
</ul>
<h3 id="heading-whats-the-drawback">What's the drawback?</h3>
<ul>
<li><p>Circular references between objects are tricky to copy: a naive deep copy would follow the cycle forever, so you would need to exclude a specific field from the copy to break it.</p>
</li>
<li><p>Need to be aware of the type of cloning being performed - shallow/deep.</p>
</li>
</ul>
<h3 id="heading-references">References</h3>
<ul>
<li><p><a target="_blank" href="https://www.baeldung.com/lombok-builder">https://www.baeldung.com/lombok-builder</a></p>
</li>
<li><p><a target="_blank" href="https://projectlombok.org/features/With">https://projectlombok.org/features/With</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Strategy Design Pattern explained in 2 minutes]]></title><description><![CDATA[Problem Statement
If you want to dynamically make decisions in your class based on certain actions, then what’s the best way to achieve this?
public class OrderedList<T> {
    List<T> list;

    public OrderedList(List<T> list) {
        this.list = ...]]></description><link>https://snehasishroy.com/strategy-design-pattern-explained-in-2-minutes</link><guid isPermaLink="true">https://snehasishroy.com/strategy-design-pattern-explained-in-2-minutes</guid><category><![CDATA[2Articles1Week]]></category><category><![CDATA[Java]]></category><category><![CDATA[architecture]]></category><category><![CDATA[design patterns]]></category><category><![CDATA[technology]]></category><dc:creator><![CDATA[Snehasish Roy]]></dc:creator><pubDate>Sat, 11 Nov 2023 18:06:14 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/7JX0-bfiuxQ/upload/1eadfa8ac992e2b76224b84b71a0dc9c.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-problem-statement">Problem Statement</h3>
<p>If a class needs to choose its behavior at runtime, for example which sorting algorithm to use, what’s the best way to achieve this?</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">OrderedList</span>&lt;<span class="hljs-title">T</span>&gt; </span>{
    List&lt;T&gt; list;

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">OrderedList</span><span class="hljs-params">(List&lt;T&gt; list)</span> </span>{
        <span class="hljs-keyword">this</span>.list = list;
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> List&lt;T&gt; <span class="hljs-title">sort</span><span class="hljs-params">(String algo)</span> </span>{
        <span class="hljs-keyword">if</span> (<span class="hljs-string">"bubble"</span>.equals(algo)) {
            <span class="hljs-keyword">return</span> bubbleSort(list);
        } <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (<span class="hljs-string">"merge"</span>.equals(algo)) {
            <span class="hljs-keyword">return</span> mergeSort(list);
        } <span class="hljs-keyword">else</span> {
            <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> RuntimeException(<span class="hljs-string">"Unsupported algo "</span> + algo);
        }
    }
}
</code></pre>
<p>In the above code, if you need to change the sorting strategy based on the function argument, then the simplest way would be to use <code>if/else</code> and do a dispatch to the correct method body based on the argument.</p>
<p>Why would this be a problem?</p>
<p>Because, in the future, if you need to add a new strategy, you would have to update this class, which would violate OCP (the Open/Closed Principle), one of the SOLID principles.</p>
<h3 id="heading-solution">Solution?</h3>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">OrderedList</span>&lt;<span class="hljs-title">T</span>&gt; </span>{
    List&lt;T&gt; list;
    SortingStrategy&lt;T&gt; strategy;
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">OrderedList</span><span class="hljs-params">(List&lt;T&gt; list, SortingStrategy&lt;T&gt; strategy)</span> </span>{
        <span class="hljs-keyword">this</span>.list = list;
        <span class="hljs-keyword">this</span>.strategy = strategy;
    }

    <span class="hljs-comment">// You can also update the code to take in Strategy at the time of sorting itself</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> List&lt;T&gt; <span class="hljs-title">sort</span><span class="hljs-params">()</span> </span>{
        <span class="hljs-keyword">return</span> strategy.execute(list);
    }
}

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">interface</span> <span class="hljs-title">SortingStrategy</span>&lt;<span class="hljs-title">T</span>&gt; </span>{
    <span class="hljs-function">List&lt;T&gt; <span class="hljs-title">execute</span><span class="hljs-params">(List&lt;T&gt; list)</span></span>;
}

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MergeSort</span>&lt;<span class="hljs-title">T</span>&gt; <span class="hljs-keyword">implements</span> <span class="hljs-title">SortingStrategy</span>&lt;<span class="hljs-title">T</span>&gt; </span>{
    <span class="hljs-meta">@Override</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> List&lt;T&gt; <span class="hljs-title">execute</span><span class="hljs-params">(List&lt;T&gt; list)</span> </span>{
        System.out.println(<span class="hljs-string">"Performing merge sort"</span>);
        <span class="hljs-keyword">return</span> list;
    }
}

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">BubbleSort</span>&lt;<span class="hljs-title">T</span>&gt; <span class="hljs-keyword">implements</span> <span class="hljs-title">SortingStrategy</span>&lt;<span class="hljs-title">T</span>&gt; </span>{
    <span class="hljs-meta">@Override</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> List&lt;T&gt; <span class="hljs-title">execute</span><span class="hljs-params">(List&lt;T&gt; list)</span> </span>{
        System.out.println(<span class="hljs-string">"Performing bubble sort"</span>);
        <span class="hljs-keyword">return</span> list;
    }
}

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Main</span> </span>{
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">main</span><span class="hljs-params">(String[] args)</span> </span>{
        testStrategy();
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">testStrategy</span><span class="hljs-params">()</span> </span>{
        List&lt;Integer&gt; list = Arrays.asList(<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>);
        BubbleSort&lt;Integer&gt; bubbleSort = <span class="hljs-keyword">new</span> BubbleSort&lt;&gt;();
        MergeSort&lt;Integer&gt; mergeSort = <span class="hljs-keyword">new</span> MergeSort&lt;&gt;();
        OrderedList&lt;Integer&gt; orderedList1 = <span class="hljs-keyword">new</span> OrderedList&lt;&gt;(list, bubbleSort);
        orderedList1.sort();
        OrderedList&lt;Integer&gt; orderedList2 = <span class="hljs-keyword">new</span> OrderedList&lt;&gt;(list, mergeSort);
        orderedList2.sort();
    }
}
<span class="hljs-comment">// Output</span>
Performing bubble sort
Performing merge sort
</code></pre>
<p>Using the strategy pattern, you can create new strategies and update the behavior of the class without modifying the class!</p>
<h3 id="heading-whats-the-benefit">What's the benefit?</h3>
<ul>
<li><p>Follows OCP - a class should be open for extension but closed for modification. You can change the behavior of the class by injecting a new strategy at runtime.</p>
</li>
<li><p>Follows the SRP (Single Responsibility Principle) - class should have only one responsibility. Only one strategy is present in one class.</p>
</li>
<li><p>Follows the <code>Composition over Inheritance</code> principle suggested by GoF (Gang Of Four).</p>
</li>
</ul>
<h3 id="heading-whats-the-drawback">What's the drawback?</h3>
<ul>
<li>Creating a new class for every strategy can create clutter if not done judiciously. Many languages let you define a strategy inline, e.g. as a lambda or an anonymous class in Java, which reduces the boilerplate.</li>
</ul>
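<p>Because a strategy with a single method is a functional interface, a lambda can stand in for a dedicated class. A minimal sketch, assuming a hypothetical <code>Sorter</code> interface shaped like <code>SortingStrategy</code>; the helper names are illustrative only:</p>
<pre><code class="lang-java">import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical single-method strategy interface, shaped like SortingStrategy.
@FunctionalInterface
interface Sorter&lt;T&gt; {
    List&lt;T&gt; execute(List&lt;T&gt; list);
}

class LambdaStrategyDemo {
    // Applies whichever strategy the caller passes in.
    static &lt;T&gt; List&lt;T&gt; sortWith(List&lt;T&gt; input, Sorter&lt;T&gt; strategy) {
        return strategy.execute(input);
    }

    static List&lt;Integer&gt; ascending(List&lt;Integer&gt; input) {
        // The strategy is defined inline; no extra class is needed.
        return sortWith(input, list -&gt; {
            List&lt;Integer&gt; copy = new ArrayList&lt;&gt;(list);
            copy.sort(Comparator.naturalOrder());
            return copy;
        });
    }

    static List&lt;Integer&gt; descending(List&lt;Integer&gt; input) {
        return sortWith(input, list -&gt; {
            List&lt;Integer&gt; copy = new ArrayList&lt;&gt;(list);
            copy.sort(Comparator.reverseOrder());
            return copy;
        });
    }
}
</code></pre>
<p>The trade-off is discoverability: a named class is easier to find and reuse, while an inline lambda suits a one-off strategy.</p>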
]]></content:encoded></item><item><title><![CDATA[Visitor Design Pattern explained in 2 minutes]]></title><description><![CDATA[Problem Statement
Consider the below Animal interface. If you want to add new methods to the interface, you have to update Cow and Dog classes to implement those methods.
public interface Animal {
}

public class Cow implements Animal {
}

public cla...]]></description><link>https://snehasishroy.com/visitor-design-pattern-explained-in-2-minutes</link><guid isPermaLink="true">https://snehasishroy.com/visitor-design-pattern-explained-in-2-minutes</guid><category><![CDATA[2Articles1Week]]></category><category><![CDATA[design patterns]]></category><category><![CDATA[Java]]></category><category><![CDATA[System Design]]></category><category><![CDATA[architecture]]></category><dc:creator><![CDATA[Snehasish Roy]]></dc:creator><pubDate>Sun, 05 Nov 2023 17:19:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/_v2aoMh8xf0/upload/fc42ada44c18858969469b51fecf54da.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-problem-statement">Problem Statement</h3>
<p>Consider the below <code>Animal</code> interface. If you want to add new methods to the interface, you have to update <code>Cow</code> and <code>Dog</code> classes to implement those methods.</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">interface</span> <span class="hljs-title">Animal</span> </span>{
}

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Cow</span> <span class="hljs-keyword">implements</span> <span class="hljs-title">Animal</span> </span>{
}

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Dog</span> <span class="hljs-keyword">implements</span> <span class="hljs-title">Animal</span> </span>{
}
</code></pre>
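<p>But what if you want to add new functionality across all the classes without changing them each time? This is where the Visitor design pattern fits. Each class gets a single <code>accept</code> method, added once, and every later piece of functionality lives in a visitor.</p>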
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">interface</span> <span class="hljs-title">Animal</span> </span>{
    &lt;T&gt; <span class="hljs-function">T <span class="hljs-title">accept</span><span class="hljs-params">(AnimalVisitor&lt;T&gt; visitor)</span></span>;
}

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Cow</span> <span class="hljs-keyword">implements</span> <span class="hljs-title">Animal</span> </span>{
    <span class="hljs-meta">@Override</span>
    <span class="hljs-keyword">public</span> &lt;T&gt; <span class="hljs-function">T <span class="hljs-title">accept</span><span class="hljs-params">(AnimalVisitor&lt;T&gt; visitor)</span> </span>{
        <span class="hljs-keyword">return</span> visitor.visit(<span class="hljs-keyword">this</span>);
    }
}

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Dog</span> <span class="hljs-keyword">implements</span> <span class="hljs-title">Animal</span> </span>{
    <span class="hljs-meta">@Override</span>
    <span class="hljs-keyword">public</span> &lt;T&gt; <span class="hljs-function">T <span class="hljs-title">accept</span><span class="hljs-params">(AnimalVisitor&lt;T&gt; visitor)</span> </span>{
        <span class="hljs-keyword">return</span> visitor.visit(<span class="hljs-keyword">this</span>);
    }
}
</code></pre>
<p>In the above scenario, let's say you want to add new functionality like <code>Speak</code> or <code>NumberOfLegs</code>. You create an implementation of <code>AnimalVisitor</code> for each one and put the business logic in that visitor class, leaving <code>Cow</code> and <code>Dog</code> untouched.</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">interface</span> <span class="hljs-title">AnimalVisitor</span>&lt;<span class="hljs-title">T</span>&gt; </span>{
    <span class="hljs-function">T <span class="hljs-title">visit</span><span class="hljs-params">(Cow cow)</span></span>;

    <span class="hljs-function">T <span class="hljs-title">visit</span><span class="hljs-params">(Dog dog)</span></span>;
}

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">LegsVisitor</span> <span class="hljs-keyword">implements</span> <span class="hljs-title">AnimalVisitor</span>&lt;<span class="hljs-title">Integer</span>&gt; </span>{
    <span class="hljs-meta">@Override</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> Integer <span class="hljs-title">visit</span><span class="hljs-params">(Cow cow)</span> </span>{
        <span class="hljs-keyword">return</span> <span class="hljs-number">4</span>;
    }

    <span class="hljs-meta">@Override</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> Integer <span class="hljs-title">visit</span><span class="hljs-params">(Dog dog)</span> </span>{
        <span class="hljs-keyword">return</span> <span class="hljs-number">4</span>;
    }
}

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SpeakVisitor</span> <span class="hljs-keyword">implements</span> <span class="hljs-title">AnimalVisitor</span>&lt;<span class="hljs-title">String</span>&gt; </span>{
    <span class="hljs-meta">@Override</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> String <span class="hljs-title">visit</span><span class="hljs-params">(Cow cow)</span> </span>{
        <span class="hljs-keyword">return</span> <span class="hljs-string">"Moo"</span>;
    }

    <span class="hljs-meta">@Override</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> String <span class="hljs-title">visit</span><span class="hljs-params">(Dog dog)</span> </span>{
        <span class="hljs-keyword">return</span> <span class="hljs-string">"Bark"</span>;
    }
}
</code></pre>
<p>Driver Code</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Main</span> </span>{
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">main</span><span class="hljs-params">(String[] args)</span> </span>{
        Animal cow = <span class="hljs-keyword">new</span> Cow();
        Animal dog = <span class="hljs-keyword">new</span> Dog();
        AnimalVisitor&lt;Integer&gt; legsVisitor = <span class="hljs-keyword">new</span> LegsVisitor();
        AnimalVisitor&lt;String&gt; speakVisitor = <span class="hljs-keyword">new</span> SpeakVisitor();

        System.out.println(cow.accept(legsVisitor));
        System.out.println(cow.accept(speakVisitor));
        System.out.println(dog.accept(legsVisitor));
        System.out.println(dog.accept(speakVisitor));
    }
}

</code></pre>
<p>Output:</p>
<pre><code>4
Moo
4
Bark
</code></pre>
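<p>Why is <code>accept</code> needed at all? Java picks an overloaded method at compile time, based on the declared type of the argument, so calling <code>visit</code> directly with an <code>Animal</code> reference would not compile. The sketch below illustrates this with a simplified, non-generic visitor. The names <code>DoubleDispatchSketch</code>, <code>Visitor</code>, <code>NameVisitor</code> and <code>nameOf</code> are made up for this example and are not part of the code above.</p>
<pre><code class="lang-java">public class DoubleDispatchSketch {
    // Simplified visitor that always returns a String (an assumption for brevity)
    interface Visitor {
        String visit(Cow cow);

        String visit(Dog dog);
    }

    interface Animal {
        String accept(Visitor visitor);
    }

    static class Cow implements Animal {
        @Override
        public String accept(Visitor visitor) {
            // Inside Cow, 'this' has the static type Cow, so visit(Cow) is selected
            return visitor.visit(this);
        }
    }

    static class Dog implements Animal {
        @Override
        public String accept(Visitor visitor) {
            // Inside Dog, 'this' has the static type Dog, so visit(Dog) is selected
            return visitor.visit(this);
        }
    }

    static class NameVisitor implements Visitor {
        @Override
        public String visit(Cow cow) {
            return "cow";
        }

        @Override
        public String visit(Dog dog) {
            return "dog";
        }
    }

    static String nameOf(Animal animal) {
        // new NameVisitor().visit(animal) would not compile here:
        // the interface has no visit(Animal) overload.
        return animal.accept(new NameVisitor());
    }

    public static void main(String[] args) {
        System.out.println(nameOf(new Cow())); // cow
        System.out.println(nameOf(new Dog())); // dog
    }
}
</code></pre>
<p>The first dispatch is the virtual call to <code>accept</code>, the second is the overload chosen inside it. That is all the pattern asks of the model classes.</p>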
<h3 id="heading-whats-the-benefit">What's the benefit?</h3>
<ul>
<li><p>If you look at the code once more, you will notice that we added new functionality to the <code>Cow</code> and <code>Dog</code> classes without changing them. This is critical when working with client libraries.</p>
</li>
<li><p>All the business logic is encapsulated in Visitor classes, which can help in segregating domain logic from model classes e.g. <code>LegsVisitor</code> contains all the logic for calculating the number of legs for all types of animals.</p>
</li>
<li><p>Adherence to the Open-Closed Principle, i.e. a class is open for extension but closed for modification.</p>
</li>
<li><p>Adherence to the Single Responsibility Principle, i.e. a class has only one responsibility.</p>
</li>
</ul>
<h3 id="heading-whats-the-drawback">What's the drawback?</h3>
<ul>
<li><p>If you have a lot of sub-classes, every Visitor implementation must handle all of them, even if the new functionality applies to only one of the classes.</p>
</li>
<li><p>When you add a new class, you must update the visitor interface and every existing visitor implementation to support it.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Building a Distributed Job Scheduler from Scratch (Part 3)]]></title><description><![CDATA[Recap
Welcome back to the third part of our tutorial series on building a distributed job scheduler! In our previous installment, we deep-dived into our storage system by designing a durable storage system to store job details effectively. Now it's t...]]></description><link>https://snehasishroy.com/building-a-distributed-job-scheduler-from-scratch-part-3</link><guid isPermaLink="true">https://snehasishroy.com/building-a-distributed-job-scheduler-from-scratch-part-3</guid><category><![CDATA[2Articles1Week]]></category><category><![CDATA[System Architecture]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[System Design]]></category><category><![CDATA[HBase]]></category><dc:creator><![CDATA[Snehasish Roy]]></dc:creator><pubDate>Sun, 17 Sep 2023 17:50:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/5pQCBcVMTqc/upload/ca3806dc2003ce84450609e2925df81c.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-recap">Recap</h3>
<p>Welcome back to the third part of our tutorial series on building a distributed job scheduler! In our <a target="_blank" href="https://snehasishroy.com/building-a-distributed-job-scheduler-from-scratch-part-2">previous</a> installment, we designed a durable storage system to store job details. Now it's time to model repeated jobs and handle job executions.</p>
<h3 id="heading-modeling-job-execution">Modeling Job Execution</h3>
<p>Recapping the discussion from the <a target="_blank" href="https://snehasishroy.com/building-a-distributed-job-scheduler-from-scratch-part-1">first part</a> of the series:</p>
<blockquote>
<p>Our platform should support three types of jobs:</p>
<ul>
<li><p><strong>Once</strong> - These jobs need to be scheduled only once at a specified date and time, such as scheduling a job for August 1, 2023, at 23:00.</p>
</li>
<li><p><strong>Repeated</strong> - Repeated jobs occur within a defined date range, with a specified time interval between each occurrence. For example, scheduling a job to run every 30 minutes between August 1 and August 31, 2023.</p>
</li>
<li><p><strong>Recurring</strong> - Recurring jobs are scheduled for specific dates and times, e.g., on August 1, 2023, at 16:00, and on August 5, 2023, at 12:30.</p>
</li>
</ul>
</blockquote>
<p>This indicates that there is a <strong>one-to-many</strong> relationship between a job and its executions.</p>
<h3 id="heading-introducing-the-plan-class">Introducing the Plan Class</h3>
<p>Since a job can be executed multiple times, we need to introduce another model to track the execution history of a job.</p>
<p>Whenever a job is scheduled, we will insert a <code>Plan</code> corresponding to the <code>Job</code> in the <code>Plan</code> table. We will monitor the <code>Plan</code> table continuously to fetch eligible <code>Plans</code>. Once a <code>Plan</code> is fetched, we will fetch the corresponding <code>Job</code> details too and generate the next <code>Plan</code>. For example, take a recurring job scheduled on August 1 and on August 5. Initially, only the Plan with the expected execution on August 1 is created. Once that plan is executed, we generate the next plan with the expected execution on August 5. This is known as <code>lazy evaluation</code> and prevents <em>unnecessary insertions</em> into the Plan table, for instance for a repeated job that runs for years.</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Plan</span> <span class="hljs-keyword">implements</span> <span class="hljs-title">Serializable</span> </span>{
    String planId;
    String jobId;
    LocalDateTime expectedExecutionTime;
}
</code></pre>
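<p>The lazy generation described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used later in the series: the class <code>NextPlanSketch</code>, both method names, the use of <code>ChronoUnit</code>, the array parameter, and returning <code>null</code> to signal that no further plan is needed are all hypothetical.</p>
<pre><code class="lang-java">import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

public class NextPlanSketch {
    // Repeated job: add the interval to the last execution time.
    // Returns null once the next occurrence would fall after endTime.
    static LocalDateTime nextRepeated(LocalDateTime lastRun, LocalDateTime endTime,
                                      long interval, ChronoUnit unit) {
        LocalDateTime next = lastRun.plus(interval, unit);
        return next.isAfter(endTime) ? null : next;
    }

    // Recurring job: pick the first configured time strictly after the last run.
    // Assumes sortedTimes is in ascending order; returns null when none is left.
    static LocalDateTime nextRecurring(LocalDateTime lastRun, LocalDateTime[] sortedTimes) {
        for (LocalDateTime candidate : sortedTimes) {
            if (candidate.isAfter(lastRun)) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        LocalDateTime aug1 = LocalDateTime.of(2023, 8, 1, 16, 0);
        LocalDateTime aug5 = LocalDateTime.of(2023, 8, 5, 12, 30);
        LocalDateTime[] times = new LocalDateTime[] {aug1, aug5};

        // After the August 1 plan has run, the August 5 plan is generated
        System.out.println(nextRecurring(aug1, times)); // 2023-08-05T12:30
        // After the August 5 plan has run, nothing is left to schedule
        System.out.println(nextRecurring(aug5, times)); // null

        LocalDateTime end = LocalDateTime.of(2023, 8, 31, 23, 59);
        System.out.println(nextRepeated(aug1, end, 30, ChronoUnit.MINUTES)); // 2023-08-01T16:30
    }
}
</code></pre>
<p>Only one future row per job exists at any time, so a job that repeats every 30 minutes for years never adds more than one pending <code>Plan</code> to the table.</p>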
<h3 id="heading-updating-the-jobdao-class">Updating the JobDAO Class</h3>
<p>The <code>JobDAO</code> class needs to be updated so that registering a <code>Job</code> also creates and stores its first <code>Plan</code>. <code>getJobDetails()</code> has also been updated to return all the <code>PlanIDs</code> mapped to the provided <code>JobId</code>.</p>
<pre><code class="lang-java"><span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">registerJob</span><span class="hljs-params">(Job job)</span> <span class="hljs-keyword">throws</span> IOException </span>{
    <span class="hljs-comment">// Create and store the Job in HBase</span>
    <span class="hljs-keyword">byte</span>[] row = Bytes.toBytes(job.getId());
    Put put = <span class="hljs-keyword">new</span> Put(row);
    put.addColumn(columnFamily.getBytes(), data.getBytes(), SerializationUtils.serialize(job));
    hBaseManager.put(table, put);

    <span class="hljs-comment">// Create and store the Plan in HBase</span>
    Plan plan = planDAO.storePlan(job);
    PlanIDs planIDs = PlanIDs.builder()
            .plans(List.of(plan.getPlanId()))
            .build();

    Put jobPlanPut = <span class="hljs-keyword">new</span> Put(row);
    jobPlanPut.addColumn(columnFamily.getBytes(), plans.getBytes(), SerializationUtils.serialize(planIDs));
    hBaseManager.put(table, jobPlanPut);
    log.info(<span class="hljs-string">"Generated JobId {} and PlanID {}"</span>, job.getId(), plan.getPlanId());
}

<span class="hljs-meta">@VisibleForTesting</span>
<span class="hljs-function"><span class="hljs-keyword">public</span> JobDetails <span class="hljs-title">getJobDetails</span><span class="hljs-params">(String id)</span> <span class="hljs-keyword">throws</span> IOException </span>{
    Result result = hBaseManager.get(table, id);

    <span class="hljs-keyword">byte</span>[] jobValue = result.getValue(columnFamily.getBytes(), data.getBytes());
    Job job = (Job) SerializationUtils.deserialize(jobValue);

    <span class="hljs-keyword">byte</span>[] plansMapping = result.getValue(columnFamily.getBytes(), plans.getBytes());
    PlanIDs planIDs = (PlanIDs) SerializationUtils.deserialize(plansMapping);

    <span class="hljs-keyword">return</span> JobDetails.builder()
            .job(job)
            .planIDs(planIDs)
            .build();
}
</code></pre>
<h3 id="heading-introducing-the-plandao-class">Introducing the PlanDAO Class</h3>
<p>This will be similar to <code>JobDAO</code> and will be responsible for the CRUD operations on the <code>Plan</code> entity.</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">PlanDAO</span> </span>{
    HBaseManager hBaseManager;
    String columnFamily = <span class="hljs-string">"cf"</span>;
    String data = <span class="hljs-string">"data"</span>;
    String tableName = <span class="hljs-string">"planDetails"</span>;
    Table table;

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">PlanDAO</span><span class="hljs-params">()</span> <span class="hljs-keyword">throws</span> IOException </span>{
        hBaseManager = <span class="hljs-keyword">new</span> HBaseManager();
        hBaseManager.ensureTable(tableName, columnFamily);
        table = hBaseManager.getTable(tableName);
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> Plan <span class="hljs-title">storePlan</span><span class="hljs-params">(Job job)</span> <span class="hljs-keyword">throws</span> IOException </span>{
        <span class="hljs-comment">// Use of VISITOR pattern to extend functionality of an existing class</span>
        Plan plan = job.accept(<span class="hljs-keyword">new</span> PlanGenerator());

        <span class="hljs-keyword">byte</span>[] row = Bytes.toBytes(plan.getPlanId());
        Put value = <span class="hljs-keyword">new</span> Put(row);
        value.addColumn(columnFamily.getBytes(), data.getBytes(), SerializationUtils.serialize(plan));
        hBaseManager.put(table, value);
        <span class="hljs-keyword">return</span> plan;
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> Plan <span class="hljs-title">getPlanDetails</span><span class="hljs-params">(String planId)</span> <span class="hljs-keyword">throws</span> IOException </span>{
        Result result = hBaseManager.get(table, planId);
        <span class="hljs-keyword">byte</span>[] value = result.getValue(columnFamily.getBytes(), data.getBytes());
        <span class="hljs-keyword">return</span> (Plan) SerializationUtils.deserialize(value);
    }
}
</code></pre>
<h3 id="heading-visitor-pattern-in-real-life">Visitor Pattern in Real Life</h3>
<p>If you are with me so far, you must have noticed a sneaky new class called <code>PlanGenerator</code>.</p>
<p>What does it do? Based on the type of <code>Job</code>, it returns a <code>Plan</code>.</p>
<p>Why do we need a visitor? If we had not gone with this approach, we would have to either use <code>if/else</code> logic to generate a <code>Plan</code> based on the runtime type of the <code>Job</code> object, or use the <code>Strategy</code> pattern to generate a <code>Plan</code>.</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">interface</span> <span class="hljs-title">JobVisitor</span>&lt;<span class="hljs-title">T</span>&gt; </span>{
    <span class="hljs-function">T <span class="hljs-title">visit</span><span class="hljs-params">(ExactlyOnceJob job)</span></span>;

    <span class="hljs-function">T <span class="hljs-title">visit</span><span class="hljs-params">(RecurringJob job)</span></span>;

    <span class="hljs-function">T <span class="hljs-title">visit</span><span class="hljs-params">(RepeatedJob job)</span></span>;
}

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">PlanGenerator</span> <span class="hljs-keyword">implements</span> <span class="hljs-title">JobVisitor</span>&lt;<span class="hljs-title">Plan</span>&gt; </span>{
    <span class="hljs-meta">@Override</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> Plan <span class="hljs-title">visit</span><span class="hljs-params">(ExactlyOnceJob job)</span> </span>{
        <span class="hljs-keyword">return</span> Plan.builder()
                .jobId(job.getId())
                .planId(getRandomID())
                .expectedExecutionTime(job.getDateTime())
                .build();
    }

    <span class="hljs-meta">@Override</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> Plan <span class="hljs-title">visit</span><span class="hljs-params">(RecurringJob job)</span> </span>{
        <span class="hljs-keyword">return</span> Plan.builder()
                .jobId(job.getId())
                .planId(getRandomID())
                .expectedExecutionTime(job.getDateTimes().first())
                .build();
    }

    <span class="hljs-meta">@Override</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> Plan <span class="hljs-title">visit</span><span class="hljs-params">(RepeatedJob job)</span> </span>{
        <span class="hljs-keyword">return</span> Plan.builder()
                .jobId(job.getId())
                .planId(getRandomID())
                .expectedExecutionTime(getNextExecutionTime(job.getStartTime(), job.getRepeatIntervalTimeUnit(), job.getRepeatInterval()))
                .build();
    }

    <span class="hljs-function"><span class="hljs-keyword">private</span> String <span class="hljs-title">getRandomID</span><span class="hljs-params">()</span> </span>{
        <span class="hljs-keyword">return</span> UUID.randomUUID().toString();
    }

    <span class="hljs-function"><span class="hljs-keyword">private</span> LocalDateTime <span class="hljs-title">getNextExecutionTime</span><span class="hljs-params">(LocalDateTime start, TemporalUnit repeatIntervalTimeUnit, <span class="hljs-keyword">long</span> repeatInterval)</span> </span>{
        <span class="hljs-keyword">return</span> start.plus(repeatInterval, repeatIntervalTimeUnit);
    }
}
</code></pre>
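<p>For comparison, the <code>if/else</code> alternative mentioned above might look like the sketch below. The job classes here are empty stand-ins, and <code>CronJob</code>, <code>planKind</code> and <code>InstanceofSketch</code> are hypothetical names used only to show the failure mode: a job type that nobody handled is discovered at runtime, whereas adding a method to <code>JobVisitor</code> makes every visitor that lacks it fail to compile.</p>
<pre><code class="lang-java">public class InstanceofSketch {
    // Empty stand-ins for the real job classes
    static abstract class Job {}

    static class ExactlyOnceJob extends Job {}

    static class RecurringJob extends Job {}

    static class RepeatedJob extends Job {}

    // A job type added later, which nobody remembered to handle below
    static class CronJob extends Job {}

    static String planKind(Job job) {
        if (job instanceof ExactlyOnceJob) {
            return "single plan";
        } else if (job instanceof RecurringJob) {
            return "plan for the first date";
        } else if (job instanceof RepeatedJob) {
            return "plan for the first interval";
        }
        // Reached only at runtime; the compiler cannot flag the missing branch
        throw new IllegalArgumentException("Unsupported job type: " + job.getClass().getSimpleName());
    }

    public static void main(String[] args) {
        System.out.println(planKind(new RepeatedJob()));
        try {
            planKind(new CronJob());
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
</code></pre>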
<h3 id="heading-tests">Tests</h3>
<p>Finally, the tests to verify the integration.</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">JobDAOTest</span> </span>{

    JobDAO jobDAO;
    PlanDAO planDAO;

    <span class="hljs-meta">@Before</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">setUp</span><span class="hljs-params">()</span> <span class="hljs-keyword">throws</span> IOException </span>{
        jobDAO = <span class="hljs-keyword">new</span> JobDAO();
        planDAO = <span class="hljs-keyword">new</span> PlanDAO();
    }

    <span class="hljs-meta">@Test</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">testRegisterJob</span><span class="hljs-params">()</span> <span class="hljs-keyword">throws</span> IOException </span>{
        <span class="hljs-comment">// jobDAO and planDAO are initialised in setUp()</span>
        String jobId = UUID.randomUUID().toString();
        ExactlyOnceJob exactlyOnceJob = ExactlyOnceJob.builder()
                .id(jobId)
                .callbackUrl(<span class="hljs-string">"http://localhost:8080/test"</span>)
                .successStatusCode(<span class="hljs-number">500</span>)
                .build();

        Assertions.assertDoesNotThrow(() -&gt; jobDAO.registerJob(exactlyOnceJob));
        JobDetails jobDetails = jobDAO.getJobDetails(jobId);
        Assertions.assertTrue(exactlyOnceJob.equals(jobDetails.getJob()));
        PlanIDs planIDs = jobDetails.getPlanIDs();
        Assertions.assertNotNull(planIDs);

        String planId = planIDs.getPlans().get(<span class="hljs-number">0</span>);
        Plan plan = planDAO.getPlanDetails(planId);
        Assertions.assertEquals(planId, plan.getPlanId());
        Assertions.assertEquals(jobId, plan.getJobId());
    }
}
</code></pre>
<hr />
<h3 id="heading-appendix">Appendix</h3>
<p>Project Structure</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1694972940685/b1741ec3-b43b-4ef4-a097-074fbd3e0651.png" alt class="image--center mx-auto" /></p>
<hr />
<h3 id="heading-conclusion">Conclusion</h3>
<p>Congratulations! In this third part of our tutorial series, we've made significant progress. We modeled multiple executions of the same job, implemented the necessary code, used the Visitor pattern in real life and validated it through test cases. But the journey doesn't end here. In part 4, we'll delve into an even more complicated topic: <em>identifying which Plans to execute and executing them</em>. Do take a pause and think about the various challenges that can come up when fetching Plans from the Plan table. Stay tuned for more exciting insights!</p>
<hr />
<h3 id="heading-references">References</h3>
<ul>
<li><p><a target="_blank" href="https://snehasishroy.com/building-a-distributed-job-scheduler-from-scratch-part-2">Part 2 of the series</a></p>
</li>
<li><p><a target="_blank" href="https://snehasishroy.com/lets-get-dirty-building-a-distributed-job-scheduler-part-1">Part 1 of the series</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Building a Distributed Job Scheduler from Scratch (Part 2)]]></title><description><![CDATA[Recap
Welcome back to the second part of our tutorial series on building a distributed job scheduler! In our previous installment, we laid the foundation by defining the functional and non-functional requirements of our job scheduler. Now, it's time ...]]></description><link>https://snehasishroy.com/building-a-distributed-job-scheduler-from-scratch-part-2</link><guid isPermaLink="true">https://snehasishroy.com/building-a-distributed-job-scheduler-from-scratch-part-2</guid><category><![CDATA[2Articles1Week]]></category><category><![CDATA[System Architecture]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[System Design]]></category><category><![CDATA[HBase]]></category><dc:creator><![CDATA[Snehasish Roy]]></dc:creator><pubDate>Sun, 10 Sep 2023 20:48:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/Xe7BUe4uFBA/upload/0213b3ee34ebf5e954cdc5def8a7b638.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-recap">Recap</h3>
<p>Welcome back to the second part of our tutorial series on building a distributed job scheduler! In our <a target="_blank" href="https://snehasishroy.com/building-a-distributed-job-scheduler-from-scratch-part-1">previous</a> installment, we laid the foundation by defining the functional and non-functional requirements of our job scheduler. Now, it's time to dive into the heart of our system by designing a durable storage system to store job details effectively. If you're a software engineer eager to learn new technologies, this tutorial is tailored just for you.</p>
<hr />
<h3 id="heading-modeling-job-class">Modeling Job Class</h3>
<p>Since we have already figured out the various Job types and the ways to configure callbacks, our actual <em>job</em> has become much easier.</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-keyword">abstract</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Job</span> <span class="hljs-keyword">implements</span> <span class="hljs-title">Serializable</span> </span>{
    String id;
    <span class="hljs-comment">// The actual HTTP url where callback will be made.</span>
    String callbackUrl;
    <span class="hljs-keyword">int</span> successStatusCode;
    <span class="hljs-comment">// Defines the maximum window for callback execution</span>
    <span class="hljs-keyword">long</span> relevancyWindow;
    TimeUnit relevancyWindowTimeUnit;
}

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ExactlyOnceJob</span> <span class="hljs-keyword">extends</span> <span class="hljs-title">Job</span> </span>{
    LocalDateTime dateTime;
}

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">RecurringJob</span> <span class="hljs-keyword">extends</span> <span class="hljs-title">Job</span> </span>{
    List&lt;LocalDateTime&gt; dateTimes;
}

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">RepeatedJob</span> <span class="hljs-keyword">extends</span> <span class="hljs-title">Job</span> </span>{
    LocalDateTime startTime;
    LocalDateTime endTime;
    TimeUnit repeatIntervalTimeUnit;
    <span class="hljs-keyword">long</span> repeatInterval;
}
</code></pre>
<hr />
<h3 id="heading-sql-vs-nosql">SQL vs NoSQL</h3>
<p>Before figuring out the database, let's figure out the query patterns.</p>
<ul>
<li><p>Store job details - We need <strong>high write throughput</strong> to store structured data. Additionally, we must be prepared for <strong>possible schema changes</strong> in the future.</p>
</li>
<li><p>Get job details provided an ID - <strong>High read throughput</strong> to get details of a job based on a key.</p>
</li>
<li><p><strong>No transaction guarantees</strong> are required.</p>
</li>
<li><p><strong>No range scans</strong> are required.</p>
</li>
</ul>
<p>Considering these requirements, we can choose a <strong>NoSQL</strong> database like <strong>Cassandra</strong> or <strong>HBase</strong>. For this tutorial, we'll use Apache HBase: it serves key-based reads and writes at high throughput and does not enforce a fixed column schema.</p>
<hr />
<h3 id="heading-hello-hbase">Hello, HBase!</h3>
<p>If you are new to the world of HBase, I recommend reading this crisp and excellent <a target="_blank" href="https://dzone.com/articles/understanding-hbase-and-bigtab">article</a>, which will give you a fair idea of the HBase data model.</p>
<p>Installing HBase is a 5-minute affair and can be completed relatively easily. Just go through this <a target="_blank" href="https://www.linkedin.com/pulse/how-install-apache-hbase-ubuntu-dr-virendra-kumar-shrivastava/">link</a>.</p>
<p>Now it's time to write some boilerplate utility code to interact with our newly created HBase Server.</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">HBaseManager</span> </span>{

    <span class="hljs-keyword">private</span> Admin admin;
    <span class="hljs-keyword">private</span> Connection connection;

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">HBaseManager</span><span class="hljs-params">()</span> <span class="hljs-keyword">throws</span> IOException </span>{
        Configuration config = HBaseConfiguration.create();
        String path = Objects.requireNonNull(<span class="hljs-keyword">this</span>.getClass().getClassLoader().getResource(<span class="hljs-string">"hbase-site.xml"</span>))
                .getPath();
        config.addResource(<span class="hljs-keyword">new</span> Path(path));
        HBaseAdmin.available(config);
        connection = ConnectionFactory.createConnection(config);
        admin = connection.getAdmin();
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">boolean</span> <span class="hljs-title">tableExists</span><span class="hljs-params">(String name)</span> <span class="hljs-keyword">throws</span> IOException </span>{
        TableName table = TableName.valueOf(name);
        <span class="hljs-keyword">return</span> admin.tableExists(table);
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">createTable</span><span class="hljs-params">(String name, String columnFamily)</span> <span class="hljs-keyword">throws</span> IOException </span>{
        <span class="hljs-keyword">if</span> (!tableExists(name)) {
            TableName table = TableName.valueOf(name);
            HTableDescriptor descriptor = <span class="hljs-keyword">new</span> HTableDescriptor(table);
            descriptor.addFamily(<span class="hljs-keyword">new</span> HColumnDescriptor(columnFamily));
            admin.createTable(descriptor);
        }
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> Table <span class="hljs-title">getTable</span><span class="hljs-params">(String name)</span> <span class="hljs-keyword">throws</span> IOException </span>{
        TableName tableName = TableName.valueOf(name);
        <span class="hljs-keyword">return</span> connection.getTable(tableName);
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">put</span><span class="hljs-params">(Table table, Put value)</span> <span class="hljs-keyword">throws</span> IOException </span>{
        table.put(value);
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> Result <span class="hljs-title">get</span><span class="hljs-params">(Table table, String id)</span> <span class="hljs-keyword">throws</span> IOException </span>{
        Get key = <span class="hljs-keyword">new</span> Get(Bytes.toBytes(id));
        <span class="hljs-keyword">return</span> table.get(key);
    }
}
</code></pre>
<p>Now that our utility code is in place, we can proceed to create a Data Access Object (DAO) layer responsible for storing and retrieving job details.</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">JobDAO</span> </span>{
    HBaseManager hBaseManager;
    String columnFamily = <span class="hljs-string">"cf"</span>;
    String data = <span class="hljs-string">"data"</span>;
    String tableName = <span class="hljs-string">"jobDetails"</span>;
    Table table;

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">JobDAO</span><span class="hljs-params">()</span> <span class="hljs-keyword">throws</span> IOException </span>{
        hBaseManager = <span class="hljs-keyword">new</span> HBaseManager();
        hBaseManager.createTable(tableName, columnFamily);
        table = hBaseManager.getTable(tableName);
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">registerJob</span><span class="hljs-params">(Job job)</span> <span class="hljs-keyword">throws</span> IOException </span>{
        <span class="hljs-keyword">byte</span>[] row = Bytes.toBytes(job.getId());
        Put put = <span class="hljs-keyword">new</span> Put(row);
        put.addColumn(Bytes.toBytes(columnFamily), Bytes.toBytes(data), SerializationUtils.serialize(job));
        hBaseManager.put(table, put);
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> Job <span class="hljs-title">getJobDetails</span><span class="hljs-params">(String id)</span> <span class="hljs-keyword">throws</span> IOException </span>{
        Result result = hBaseManager.get(table, id);
        <span class="hljs-keyword">byte</span>[] value = result.getValue(Bytes.toBytes(columnFamily), Bytes.toBytes(data));
        Job job = (Job) SerializationUtils.deserialize(value);
        <span class="hljs-keyword">return</span> job;
    }
}
</code></pre>
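<p>One detail worth calling out: <code>SerializationUtils.serialize</code> accepts a <code>Serializable</code>, so the job class (and everything it references) has to implement <code>java.io.Serializable</code>. The sketch below reproduces the same byte-array round trip with plain JDK streams so it can be run without HBase. <code>SketchJob</code>, <code>SerializationSketch</code> and their members are hypothetical stand-ins for illustration, not the classes used in this series.</p>
<pre><code class="lang-java">import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a job class; it must be Serializable to be stored as bytes.
class SketchJob implements Serializable {
    private static final long serialVersionUID = 1L;
    final String id;
    final String callbackUrl;

    SketchJob(String id, String callbackUrl) {
        this.id = id;
        this.callbackUrl = callbackUrl;
    }
}

class SerializationSketch {
    // Turns the object into the byte[] that would be written into the HBase cell.
    static byte[] toBytes(Serializable value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        }
        return buffer.toByteArray();
    }

    // Rebuilds the object from the bytes read back from the cell.
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }
}
</code></pre>
<p>Keep in mind that a lookup for a row that doesn't exist returns an empty <code>Result</code>, so <code>result.getValue(...)</code> yields <code>null</code>. Checking for that before deserializing is a sensible guard for <code>getJobDetails</code>.</p>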
<p>To ensure the functionality of our system, we'll rely on JUnit tests to validate our code. This step is crucial to confirm that our storage system works as expected.</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">JobDAOTest</span> </span>{
    <span class="hljs-meta">@Test</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">testRegisterJob</span><span class="hljs-params">()</span> <span class="hljs-keyword">throws</span> IOException </span>{
        JobDAO jobDAO = <span class="hljs-keyword">new</span> JobDAO();
        String id = UUID.randomUUID().toString();
        ExactlyOnceJob exactlyOnceJob = ExactlyOnceJob.builder()
                .id(id)
                .callbackUrl(<span class="hljs-string">"http://localhost:8080/test"</span>)
                .successStatusCode(<span class="hljs-number">500</span>)
                .build();
        Assertions.assertDoesNotThrow(() -&gt; jobDAO.registerJob(exactlyOnceJob));
        ExactlyOnceJob job = (ExactlyOnceJob) jobDAO.getJobDetails(id);
        Assertions.assertEquals(exactlyOnceJob, job);
    }
}
</code></pre>
<hr />
<h3 id="heading-appendix">Appendix</h3>
<p>Project Structure</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1694887852919/b1de2e6b-b271-4c43-b636-cb220a27ca36.png" alt class="image--center mx-auto" /></p>
<p>Maven pom.xml</p>
<pre><code class="lang-xml"><span class="hljs-tag">&lt;<span class="hljs-name">project</span> <span class="hljs-attr">xmlns</span>=<span class="hljs-string">"http://maven.apache.org/POM/4.0.0"</span>
         <span class="hljs-attr">xmlns:xsi</span>=<span class="hljs-string">"http://www.w3.org/2001/XMLSchema-instance"</span>
         <span class="hljs-attr">xsi:schemaLocation</span>=<span class="hljs-string">"http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd"</span>&gt;</span>

    <span class="hljs-tag">&lt;<span class="hljs-name">modelVersion</span>&gt;</span>4.0.0<span class="hljs-tag">&lt;/<span class="hljs-name">modelVersion</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">groupId</span>&gt;</span>com.scheduler<span class="hljs-tag">&lt;/<span class="hljs-name">groupId</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">artifactId</span>&gt;</span>scheduler<span class="hljs-tag">&lt;/<span class="hljs-name">artifactId</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">version</span>&gt;</span>1<span class="hljs-tag">&lt;/<span class="hljs-name">version</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">build</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">plugins</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">plugin</span>&gt;</span>
                <span class="hljs-tag">&lt;<span class="hljs-name">groupId</span>&gt;</span>org.apache.maven.plugins<span class="hljs-tag">&lt;/<span class="hljs-name">groupId</span>&gt;</span>
                <span class="hljs-tag">&lt;<span class="hljs-name">artifactId</span>&gt;</span>maven-compiler-plugin<span class="hljs-tag">&lt;/<span class="hljs-name">artifactId</span>&gt;</span>
                <span class="hljs-tag">&lt;<span class="hljs-name">configuration</span>&gt;</span>
                    <span class="hljs-tag">&lt;<span class="hljs-name">source</span>&gt;</span>8<span class="hljs-tag">&lt;/<span class="hljs-name">source</span>&gt;</span>
                    <span class="hljs-tag">&lt;<span class="hljs-name">target</span>&gt;</span>8<span class="hljs-tag">&lt;/<span class="hljs-name">target</span>&gt;</span>
                <span class="hljs-tag">&lt;/<span class="hljs-name">configuration</span>&gt;</span>
            <span class="hljs-tag">&lt;/<span class="hljs-name">plugin</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">plugins</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">build</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">dependencies</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">dependency</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">groupId</span>&gt;</span>org.apache.hbase<span class="hljs-tag">&lt;/<span class="hljs-name">groupId</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">artifactId</span>&gt;</span>hbase-client<span class="hljs-tag">&lt;/<span class="hljs-name">artifactId</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">version</span>&gt;</span>2.5.3<span class="hljs-tag">&lt;/<span class="hljs-name">version</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">dependency</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">dependency</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">groupId</span>&gt;</span>org.apache.hbase<span class="hljs-tag">&lt;/<span class="hljs-name">groupId</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">artifactId</span>&gt;</span>hbase<span class="hljs-tag">&lt;/<span class="hljs-name">artifactId</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">version</span>&gt;</span>2.5.3<span class="hljs-tag">&lt;/<span class="hljs-name">version</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">dependency</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">dependency</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">groupId</span>&gt;</span>org.projectlombok<span class="hljs-tag">&lt;/<span class="hljs-name">groupId</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">artifactId</span>&gt;</span>lombok<span class="hljs-tag">&lt;/<span class="hljs-name">artifactId</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">version</span>&gt;</span>1.18.28<span class="hljs-tag">&lt;/<span class="hljs-name">version</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">scope</span>&gt;</span>provided<span class="hljs-tag">&lt;/<span class="hljs-name">scope</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">dependency</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">dependency</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">groupId</span>&gt;</span>org.apache.commons<span class="hljs-tag">&lt;/<span class="hljs-name">groupId</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">artifactId</span>&gt;</span>commons-lang3<span class="hljs-tag">&lt;/<span class="hljs-name">artifactId</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">version</span>&gt;</span>3.12.0<span class="hljs-tag">&lt;/<span class="hljs-name">version</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">dependency</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">dependency</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">groupId</span>&gt;</span>org.junit.jupiter<span class="hljs-tag">&lt;/<span class="hljs-name">groupId</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">artifactId</span>&gt;</span>junit-jupiter-engine<span class="hljs-tag">&lt;/<span class="hljs-name">artifactId</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">version</span>&gt;</span>5.2.0<span class="hljs-tag">&lt;/<span class="hljs-name">version</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">scope</span>&gt;</span>test<span class="hljs-tag">&lt;/<span class="hljs-name">scope</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">dependency</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">dependency</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">groupId</span>&gt;</span>org.junit.platform<span class="hljs-tag">&lt;/<span class="hljs-name">groupId</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">artifactId</span>&gt;</span>junit-platform-runner<span class="hljs-tag">&lt;/<span class="hljs-name">artifactId</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">version</span>&gt;</span>1.2.0<span class="hljs-tag">&lt;/<span class="hljs-name">version</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">scope</span>&gt;</span>test<span class="hljs-tag">&lt;/<span class="hljs-name">scope</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">dependency</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">dependencies</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">project</span>&gt;</span>
</code></pre>
<hr />
<h3 id="heading-conclusion">Conclusion</h3>
<p>Congratulations! In this second part of our tutorial series, we've made significant progress: we chose a suitable database (HBase), implemented the storage code, and validated it with a test case. But the journey doesn't end here. In part 3, we'll delve into modeling repeated jobs. Do take a pause and think about why they need to be modeled separately. Stay tuned for more exciting insights!</p>
<hr />
<h3 id="heading-references">References</h3>
<ul>
<li><p><a target="_blank" href="https://snehasishroy.com/lets-get-dirty-building-a-distributed-job-scheduler-part-1">Part 1 of the series</a></p>
</li>
<li><p><a target="_blank" href="https://snehasishroy.com/building-a-distributed-job-scheduler-from-scratch-part-3">Part 3 of the series</a></p>
</li>
<li><p><a target="_blank" href="https://dzone.com/articles/understanding-hbase-and-bigtab">Understanding HBase and Bigtable</a></p>
</li>
<li><p><a target="_blank" href="https://www.linkedin.com/pulse/how-install-apache-hbase-ubuntu-dr-virendra-kumar-shrivastava/">How to Install Apache HBase on Ubuntu</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Building a Distributed Job Scheduler from scratch (Part 1)]]></title><description><![CDATA[Distributed job schedulers are essential because they allow us to schedule callbacks without worrying about the scalability and reliability aspects. You can try doing what a distributed job scheduler does using ScheduledThreadPool but that won't guar...]]></description><link>https://snehasishroy.com/building-a-distributed-job-scheduler-from-scratch-part-1</link><guid isPermaLink="true">https://snehasishroy.com/building-a-distributed-job-scheduler-from-scratch-part-1</guid><category><![CDATA[2Articles1Week]]></category><category><![CDATA[System Architecture]]></category><category><![CDATA[System Design]]></category><category><![CDATA[jobschedular]]></category><category><![CDATA[distributed system]]></category><dc:creator><![CDATA[Snehasish Roy]]></dc:creator><pubDate>Fri, 08 Sep 2023 21:07:41 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/eIkbSc3SDtI/upload/eaed639d3da515a705e0695b98d7763e.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Distributed job schedulers are essential because they allow us to schedule callbacks without worrying about the scalability and reliability aspects. You can try doing what a distributed job scheduler does using ScheduledThreadPool but that won't <em>guarantee</em> callbacks as it does not guarantee <em>durability</em> - if the underlying machine crashes, so will the thread pool.</p>
<p>In this multi-part series, we're rolling up our sleeves to build a robust distributed job scheduler from scratch. Get ready to dive into the world of distributed systems!</p>
<hr />
<h3 id="heading-understanding-the-requirements">Understanding the requirements</h3>
<p>Before we jump into the technical details, let's establish a clear understanding of what our distributed job scheduler needs to achieve.</p>
<h3 id="heading-job-types">Job Types</h3>
<p>Our platform should support three types of jobs:</p>
<ul>
<li><p><strong>Once</strong> - These jobs need to be scheduled only once at a specified date and time, such as scheduling a job for August 1, 2023, at 23:00.</p>
</li>
<li><p><strong>Repeated</strong> - Repeated jobs occur within a defined date range, with a specified time interval between each occurrence. For example, scheduling a job to run every 30 minutes between August 1 and August 31, 2023.</p>
</li>
<li><p><strong>Recurring</strong> - Recurring jobs are scheduled for specific dates and times, e.g., on August 1, 2023, at 16:00, and on August 5, 2023, at 12:30.</p>
</li>
</ul>
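<p>To make the difference between the three types concrete, here is a minimal sketch of how each one could expand into a list of trigger times. The class and method names (<code>JobTypesSketch</code>, <code>once</code>, <code>repeated</code>, <code>recurring</code>) are hypothetical and only illustrate the definitions above; they are not the data model built later in the series.</p>
<pre><code class="lang-java">import java.time.Duration;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class JobTypesSketch {
    // Once: a single trigger at the given date and time.
    static List&lt;LocalDateTime&gt; once(LocalDateTime at) {
        return Collections.singletonList(at);
    }

    // Repeated: one trigger every `interval`, from `start` up to and including `end`.
    static List&lt;LocalDateTime&gt; repeated(LocalDateTime start, LocalDateTime end, Duration interval) {
        if (interval.isZero() || interval.isNegative()) {
            throw new IllegalArgumentException("interval must be positive");
        }
        List&lt;LocalDateTime&gt; triggers = new ArrayList&lt;&gt;();
        for (LocalDateTime t = start; !t.isAfter(end); t = t.plus(interval)) {
            triggers.add(t);
        }
        return triggers;
    }

    // Recurring: an explicit list of dates and times, returned in chronological order.
    static List&lt;LocalDateTime&gt; recurring(List&lt;LocalDateTime&gt; times) {
        List&lt;LocalDateTime&gt; sorted = new ArrayList&lt;&gt;(times);
        Collections.sort(sorted);
        return sorted;
    }
}
</code></pre>
<p>For instance, a repeated job running every 30 minutes between 00:00 and 02:00 on August 1, 2023 expands into five trigger times, whereas a recurring job only ever fires at the timestamps the client listed.</p>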
<h3 id="heading-how-to-configure-callbacks">How to configure callbacks?</h3>
<p>To offer flexibility, our system must allow clients to configure various aspects of their callbacks:</p>
<ul>
<li><p>Retry strategies - Defines what happens if a callback fails. Should it be retried, and if so, what should the retry strategy entail?</p>
</li>
<li><p>Auth token - Provide an authentication token for client-side verification during callbacks.</p>
</li>
<li><p>Callback Path/URL - The actual HTTP URL where the callback will be made.</p>
</li>
<li><p>Headers to pass - Any custom application headers to pass during the callback.</p>
</li>
<li><p>Success status codes - How to interpret whether the callback succeeded? Simply relying on 200 won't suffice for all the client use cases.</p>
</li>
<li><p>Relevancy window - Defines the maximum acceptable delay for a callback. For example, the expected callback time is 13:00, but the job somehow got picked up at 13:30. Is that job still valid? The client answers this by providing a relevancy window: if the window is at least 30 minutes, the callback can still be performed; otherwise, it is skipped.</p>
</li>
</ul>
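<p>Two of these options boil down to simple checks, sketched below. A job expected at 13:00 but picked up at 13:30 is executed only if the client's relevancy window covers that 30-minute delay, and a callback counts as successful only if its status code is in the client's configured set. <code>CallbackPolicySketch</code> and its method names are hypothetical, chosen just for this illustration.</p>
<pre><code class="lang-java">import java.time.Duration;
import java.time.LocalDateTime;
import java.util.Set;

class CallbackPolicySketch {
    // A late job is still worth executing only if the delay fits inside the relevancy window.
    static boolean isStillRelevant(LocalDateTime expectedAt, LocalDateTime pickedUpAt, Duration relevancyWindow) {
        Duration delay = Duration.between(expectedAt, pickedUpAt);
        return delay.compareTo(relevancyWindow) &lt;= 0;
    }

    // The client decides which HTTP status codes count as a successful callback.
    static boolean isSuccess(int statusCode, Set&lt;Integer&gt; successStatusCodes) {
        return successStatusCodes.contains(statusCode);
    }
}
</code></pre>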
<p>With these functional requirements in mind, let's move on to considering some non-functional aspects that will shape our system.</p>
<hr />
<h3 id="heading-durability">Durability</h3>
<p>If a client has received a successful acknowledgment of a job being <em>accepted</em> from our platform, job details must be persisted in durable storage.</p>
<h3 id="heading-scalability">Scalability</h3>
<p>Design various components of the overall system keeping scalability in mind. Keep the components loosely coupled so that one can be scaled independently of the other.</p>
<h3 id="heading-callback-guarantees">Callback Guarantees</h3>
<p>In a distributed system, it's difficult to make <em>guarantees</em> - especially <em>exactly once</em>. So let's go with <em>at least once</em>, i.e. ensure at least one callback of the scheduled job to our client. Clients <em>might</em> receive multiple callbacks, but it's up to them to identify duplicates and decide whether or not to process them.</p>
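<p>Because duplicates are possible, a client typically makes its callback handler idempotent. The sketch below shows one way a client might do that by remembering which callbacks it has already processed. It assumes each callback carries a unique identifier (named <code>callbackId</code> here, which is an assumption and not something defined by our platform yet) and keeps state in memory only, so a real client would need a durable store instead.</p>
<pre><code class="lang-java">import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class IdempotentCallbackHandler {
    // IDs of callbacks already processed; in-memory only, so it is lost on restart.
    private final Set&lt;String&gt; processed = ConcurrentHashMap.newKeySet();

    // Returns true if the callback was processed now, false if it was a duplicate.
    boolean handle(String callbackId, Runnable action) {
        if (!processed.add(callbackId)) {
            return false; // already seen: skip the side effect
        }
        try {
            action.run();
        } catch (RuntimeException e) {
            processed.remove(callbackId); // let a retry through if processing failed
            throw e;
        }
        return true;
    }
}
</code></pre>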
<hr />
<h3 id="heading-conclusion">Conclusion</h3>
<p>Congratulations on making it this far! In this first part of our tutorial series on building a distributed job scheduler, we've outlined the essential functional and non-functional requirements. Let's take a pause and think about how we are going to implement a job scheduler based on the above requirements. I will not directly go into drawing boxes and assigning responsibilities to those boxes - instead, we will first figure out what kind of work we actually have to do and then we will decide whether or not we need a component to handle these kinds of tasks.</p>
<p>In the next installment, we will start by constructing a durable storage system that can persist job details efficiently and allow for quick lookups based on job IDs.</p>
<p>Link to part 2: <a target="_blank" href="https://snehasishroy.com/building-a-distributed-job-scheduler-from-scratch-part-2">https://snehasishroy.com/building-a-distributed-job-scheduler-from-scratch-part-2</a></p>
]]></content:encoded></item><item><title><![CDATA[How to easily convert Ebooks using Calibre]]></title><description><![CDATA[Hidden within the canvas of reality, treasures often lie in plain sight, awaiting the patient heart and the perceptive eye to unveil their concealed brilliance.

Amid this whirlwind modern world, mastering EPUB-to-PDF conversion via Calibre can feel ...]]></description><link>https://snehasishroy.com/how-to-easily-convert-ebooks-using-calibre</link><guid isPermaLink="true">https://snehasishroy.com/how-to-easily-convert-ebooks-using-calibre</guid><category><![CDATA[ebook]]></category><category><![CDATA[Calibre]]></category><category><![CDATA[Productivity]]></category><category><![CDATA[pdf]]></category><dc:creator><![CDATA[Snehasish Roy]]></dc:creator><pubDate>Mon, 28 Aug 2023 17:27:01 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/1J8k0qqUfYY/upload/4e78e3bd1705d7c0cb30e1ea81bafd4b.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>Hidden within the canvas of reality, treasures often lie in plain sight, awaiting the patient heart and the perceptive eye to unveil their concealed brilliance.</p>
</blockquote>
<p>Amid this whirlwind modern world, mastering EPUB-to-PDF conversion via <a target="_blank" href="https://calibre-ebook.com/">Calibre</a> can feel like trying to win a game of hide-and-seek against a mischievous squirrel. Sure, Calibre’s got all the bells and whistles, but sometimes we just want to hit that “PDF” button and call it a day, right?</p>
<p>Fret not! I’m here to guide you through the virtual maze and turn those electronic pages into glorious PDFs without losing your sanity. Grab your magnifying glass and let’s dive into these settings that will make your PDFs shine.</p>
<h1 id="heading-where-to-start-your-journey"><strong>Where to start your journey?</strong></h1>
<p>Download and set up Calibre. All the screenshots provided are based on v6.24.</p>
<p>Right-click any book in your library and choose “Convert Individually” present under “Convert Books”.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1500/1*g8yJLUXGfS2pvapCGJ23hw.png" alt class="image--center mx-auto" /></p>
<p>Now select PDF as the output format in the top right-hand corner.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1500/1*ceO7syszd4-FxrX2yIXS4Q.png" alt /></p>
<h1 id="heading-enable-text-justification"><strong>Enable Text Justification</strong></h1>
<p>Ever seen text that’s wandering all over the page like a kid who’s had too much sugar? We’re going to whip it into shape.</p>
<p>Head to the Look and Feel section, then navigate to Text, and there you’ll find the Text Justification option. Choose “Justify Text” like a pro, and watch those lines of text straighten up and fly right. It’s like a font intervention.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1500/1*1koT0OgLUGR20uRZUfA8MA.png" alt /></p>
<blockquote>
<p>Conversion -&gt; Look and Feel</p>
</blockquote>
<h1 id="heading-enable-image-resizing"><strong>Enable Image Resizing</strong></h1>
<p>Images that refuse to fit in their designated boxes are like trying to fit an elephant into a phone booth. To conquer this puzzle, venture into the Search and Replace section. There, you’ll encounter the enigmatic “Search Regular Expression.” It’s like a decoder ring for resizing images dynamically.</p>
<p>Add in the below secret code and watch those images slide into place like puzzle pieces.</p>
<pre><code class="lang-plaintext">\ width=\"(.*?)\" height=\"(.*?)\"
</code></pre>
<p><img src="https://miro.medium.com/v2/resize:fit:1500/1*m2Y1aEVhrnMbe3OPR0LEow.png" alt /></p>
<blockquote>
<p>Conversion -&gt; Search and Replace</p>
</blockquote>
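<p>If you’d like to see what that expression actually catches, here is a small Java sketch that applies an equivalent pattern to a sample image tag. It assumes the replacement is left empty, so the matched attributes are simply removed; the class name and sample markup are made up for illustration. Calibre itself runs the expression with Python’s regex engine, but for a pattern this simple the matching behaves the same way.</p>
<pre><code class="lang-java">import java.util.regex.Pattern;

class ImageAttributeSketch {
    // Equivalent to the expression above; the backslashes here only escape quotes for Java.
    private static final Pattern FIXED_SIZE =
            Pattern.compile(" width=\"(.*?)\" height=\"(.*?)\"");

    // Drops hard-coded width/height attributes from the given markup.
    static String stripFixedSize(String html) {
        return FIXED_SIZE.matcher(html).replaceAll("");
    }
}
</code></pre>
<p>One thing the sketch makes visible: the pattern only matches when <code>width</code> comes immediately before <code>height</code>, so images that declare the attributes in another order are left untouched.</p>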
<h1 id="heading-format-the-pdf"><strong>Format the PDF</strong></h1>
<p>Now, the grand PDF Output stage awaits your artistic touch.</p>
<p>Select your paper size, much like choosing the canvas for your masterpiece. I went with the trusty A4. Do not select the “Use the paper size set in output profile” option.</p>
<p>If you’re into numbers, add page numbers at the bottom, like breadcrumbs for digital travelers.</p>
<p>Choose your Serif, Sans Family and Monospace fonts (like picking the right outfit for your document’s fancy party). I chose the good old Calibri and Courier family.</p>
<p>Adjust font sizes, so your text doesn’t look like ants marching across the page.</p>
<p>Oh, and be sure to let the document’s original margins have their moment in the spotlight. In case you want to override the margins, unselect the “Use Page Margins from the document being converted” option and provide your overrides. <a target="_blank" href="https://www.thecalculatorsite.com/conversions/length/points-to-inches.php">Here</a> is a handy link that converts pts into inches (72 pts is 1 inch).</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1500/1*Ev75DdoNf9or3l0NkFjadA.png" alt /></p>
<blockquote>
<p>Conversion -&gt; PDF Output</p>
</blockquote>
<h1 id="heading-bonus-tip-update-default-preferences"><strong>Bonus Tip: Update Default Preferences</strong></h1>
<p>Who has time to repeat the same process for every single document? Not you, that’s for sure! Head over to Preferences like a boss. Tweak your conversion settings here once, and they’ll swoop in like a superhero whenever you’re converting more files.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1500/1*kCsAkSJ7-fRt1I17CQyc1w.png" alt /></p>
<blockquote>
<p>Preferences Section</p>
</blockquote>
<p><img src="https://miro.medium.com/v2/resize:fit:1500/1*okxicPXfE5NLvyi-SlM2CQ.png" alt /></p>
<blockquote>
<p>Preferences -&gt; Conversion -&gt; Common Options</p>
</blockquote>
<p><img src="https://miro.medium.com/v2/resize:fit:1500/1*ikwt2nG7oKimVn3IjPpwcw.png" alt /></p>
<blockquote>
<p>Preferences -&gt; Conversion -&gt; Output Options</p>
</blockquote>
<h1 id="heading-wrapping-up"><strong>Wrapping Up</strong></h1>
<p>And there you have it, dear PDF pioneers! Calibre might have thrown some curveballs your way, but armed with these settings and a good sense of humor, you’re now equipped to navigate its twists and turns like a pro.</p>
<p>So go forth, and create PDFs that’ll have your readers thinking, “This is the work of a true PDF Picasso.”</p>
<p>In case you find something new and want to share, please comment.</p>
<p>Thank you for reading :)</p>
<h1 id="heading-references"><strong>References</strong></h1>
<ol>
<li><p><a target="_blank" href="https://calibre-ebook.com/">https://calibre-ebook.com/</a></p>
</li>
<li><p><a target="_blank" href="https://maxrohde.com/2015/01/27/rendering-beautiful-pdf-documents-with-calibre">https://maxrohde.com/2015/01/27/rendering-beautiful-pdf-documents-with-calibre</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/lorenzodifuccia/safaribooks/issues/22">https://github.com/lorenzodifuccia/safaribooks/issues/22</a></p>
</li>
<li><p><a target="_blank" href="https://www.thecalculatorsite.com/conversions/length/points-to-inches.php">https://www.thecalculatorsite.com/conversions/length/points-to-inches.php</a></p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Why Buying O’Reilly Books in India Could be a Costly Mistake! Discover the Ultimate Hack for High-Quality Books at Unbeatable Prices!]]></title><description><![CDATA[There is something mystical in physical books that an ebook can never replace.

Picture this: you’re browsing through an e-commerce store in India, eying those coveted O’Reilly paperback books, your excitement building. But hold that thought! There’s...]]></description><link>https://snehasishroy.com/ultimate-hack-for-high-quality-books-at-unbeatable-prices</link><guid isPermaLink="true">https://snehasishroy.com/ultimate-hack-for-high-quality-books-at-unbeatable-prices</guid><category><![CDATA[ebook]]></category><category><![CDATA[life-hack]]></category><category><![CDATA[Calibre]]></category><category><![CDATA[books]]></category><category><![CDATA[technology]]></category><dc:creator><![CDATA[Snehasish Roy]]></dc:creator><pubDate>Sun, 27 Aug 2023 16:58:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1693154898531/70d0b698-551c-48c9-a1b1-0dff4b5c09ff.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>There is something mystical in physical books that an ebook can never replace.</p>
</blockquote>
<p>Picture this: you’re browsing through an e-commerce store in India, eying those coveted O’Reilly paperback books, your excitement building. But hold that thought! There’s a hidden caveat that could dent your wallet and leave you unimpressed. The culprit? The official seller - <a target="_blank" href="https://www.amazon.in/Designing-Data-Intensive-Applications-Reliable-Maintainable/product-reviews/9352135245/ref=cm_cr_arp_d_viewopt_kywd?ie=UTF8&amp;reviewerType=all_reviews&amp;pageNumber=1&amp;filterByKeyword=print">notorious for peddling lackluster books</a> at exorbitant prices. Don’t be fooled by their fancy facade — your hard-earned money deserves better!</p>
<p>Now, before you swear off your book-buying dreams, I'm here to reveal a game-changing solution that'll leave you with premium reads and extra cash in your pocket. It's time to roll up your sleeves and embark on a journey to book nirvana.</p>
<h1 id="heading-step-1-create-an-account-on-oreilly"><strong>Step 1: Create an account on Oreilly</strong></h1>
<p>First things first, we’re sidestepping the Publisher and heading straight to the source. Forge yourself a dummy account on <a target="_blank" href="http://Oreilly.com">Oreilly.com</a>, and voilà! You’ve unlocked 10 days of trial account bliss. Run into expiration trouble? A swift signup shuffle will sort you right out.</p>
<h1 id="heading-step-2-clone-safaribooks-git-repository"><strong>Step 2: Clone SafariBooks Git Repository</strong></h1>
<p>Clone this amazing git repo <a target="_blank" href="https://github.com/lorenzodifuccia/safaribooks">https://github.com/lorenzodifuccia/safaribooks</a> and follow the below instructions</p>
<pre><code class="lang-plaintext">$ git clone https://github.com/lorenzodifuccia/safaribooks.git
Cloning into 'safaribooks'...

$ cd safaribooks/
$ pip3 install -r requirements.txt

$ python3 safaribooks.py --cred "account_mail@mail.com:password01" XXXXXXXXXXXXX

Provide the username and password of your account on oreilly.com followed by the Book ID.
The ID is the digits that you find in the URL of the book description page:
https://www.safaribooksonline.com/library/view/book-name/XXXXXXXXXXXXX/
Like: https://www.safaribooksonline.com/library/view/test-driven-development-with/9781491958698/
</code></pre>
<p>At the end of this step, you will have received a genuine and up-to-date epub copy of the book you are trying to read.</p>
<p><em>P.S: In case you are facing login issues in the script, have a look at this</em> <a target="_blank" href="https://github.com/lorenzodifuccia/safaribooks/pull/351/files"><em>PR</em></a><em>.</em></p>
<h1 id="heading-step-3-download-and-install-calibre"><strong>Step 3: Download and Install Calibre</strong></h1>
<p>Next stop, <a target="_blank" href="https://calibre-ebook.com/download">Calibre</a> — a world of wonder masked by its humble UI. Drop your newly acquired ePub gem into Calibre’s treasure trove.</p>
<h1 id="heading-step-4-convert-epub-into-pdf"><strong>Step 4: Convert EPUB into PDF</strong></h1>
<p>Epub format internally uses HTML/CSS to represent the text, structure and format of the content document. It’s extremely customizable as it’s pretty flexible. Don’t like the font, you can change it. Don’t like the spacing, you can change it. Want to change the background, believe it or not, you can change it too.</p>
<p>PDFs, on the other hand, are static. The format does not offer customization but is extremely good for printing because it doesn’t change. You can print a PDF from any computer and the pages will come out the same, because a PDF is a sort of digital paper. What you see is what you get.</p>
<p>If you need to print a book, you will most likely need its pdf version. So we would need to convert the epub into pdf. You need not download additional software and can use Calibre itself. I have written another <a target="_blank" href="https://snehasishroy.medium.com/unleash-your-inner-pdf-whisperer-with-calibre-conquering-the-epub-to-pdf-expedition-da2a36b83e42?sk=6b4fdca67d92ccd78bb9906c00bd5354">article</a> which provides a step-by-step way to do the same.</p>
<h1 id="heading-step-5-printing-as-a-service"><strong>Step 5: Printing as a Service</strong></h1>
<p>The crescendo approaches, as your customized masterpiece yearns for the tactile embrace of paper. Enter online book printing services like <a target="_blank" href="https://printster.in/">Printster</a>, your partner-in-printing. Get ready to call the shots — spiral binding for frugality? Check. Soft-cover lightness? Check. Hard-cover extravagance? Double check. With options galore and competitive pricing, your dream book metamorphoses into reality.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1050/0*sp4qQGkXOJjlSVLk.jpg" alt /></p>
<blockquote>
<p>Look at that subtle off-white coloring. The tasteful thickness of it.</p>
</blockquote>
<p>In my quest for bookish bliss, I ventured with 85 GSM Bond paper, a robust color cover, monochrome elegance, dual-sided splendor, and the pièce de résistance — hardcover binding. The result? An awe-inspiring tome at a fraction of Amazon’s asking price.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1050/1*6gj5WWKPaIMmDRYLN0XY1w.jpeg" alt /></p>
<blockquote>
<p>Cover image at its glory</p>
</blockquote>
<p><img src="https://miro.medium.com/v2/resize:fit:1050/1*pFo_m2sFqg4fwJWE2pyVkw.png" alt /></p>
<blockquote>
<p>Crisp off-white 85 GSM Paper at its beauty</p>
</blockquote>
<p>Dear O’Reilly, if you’re listening, heed the plea: a new publisher is overdue.</p>
<p>Armed with this revelation, I hope you’re ready to embark on your budget-friendly book-buying escapades. Say goodbye to those inflated prices and lackluster editions — a brave new world of high-quality, wallet-happy reads awaits. Happy reading, and may your literary journey be ever-illuminating!</p>
<h1 id="heading-references"><strong>References</strong></h1>
<ol>
<li><p><a target="_blank" href="https://www.oreilly.com/">https://www.oreilly.com/</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/lorenzodifuccia/safaribooks">https://github.com/lorenzodifuccia/safaribooks</a></p>
</li>
<li><p><a target="_blank" href="https://calibre-ebook.com/download">https://calibre-ebook.com/download</a></p>
</li>
<li><p><a target="_blank" href="https://snehasishroy.medium.com/unleash-your-inner-pdf-whisperer-with-calibre-conquering-the-epub-to-pdf-expedition-da2a36b83e42?source=friends_link&amp;sk=6b4fdca67d92ccd78bb9906c00bd5354">https://snehasishroy.medium.com/unleash-your-inner-pdf-whisperer-with-calibre-conquering-the-epub-to-pdf-expedition-da2a36b83e42?source=friends_link&amp;sk=6b4fdca67d92ccd78bb9906c00bd5354</a></p>
</li>
<li><p><a target="_blank" href="https://printster.in/">https://printster.in/</a></p>
</li>
</ol>
<h1 id="heading-disclaimer"><strong>Disclaimer</strong></h1>
<p>The information provided in this article is for educational and informational purposes only. The author and publisher of this article are not responsible for any actions taken based on the information presented herein. The content of this article does not constitute professional advice, and readers are advised to consult with appropriate professionals before making any decisions or taking any actions.</p>
<p>The references to specific companies, products, or services in this article are purely for illustrative purposes and do not constitute endorsements or recommendations. The author and publisher do not have any affiliation with the companies mentioned, and the information provided is based on publicly available sources as of the time of writing.</p>
<p>Readers are encouraged to independently verify the accuracy and relevance of the information provided in this article. Any reliance on the information contained in this article is at the reader’s own risk. The author and publisher disclaim any liability for any loss, damage, or inconvenience caused by reliance on the information provided herein.</p>
]]></content:encoded></item></channel></rss>