
Introducing AtomDB - A Strongly Consistent, Embedded KeyValue Database built from scratch

Learn how to build self-healing Raft clusters that run inside your JVM, from scratch

Published
17 min read
Experienced software developer with 10+ years of experience in distributed systems.

Over the past year, I have been working on an itch to build something - not just to vibe-code, but to actually understand the primitives and build something useful. I thought: what better way to understand distributed systems than by building a database? So I started building one!

What was the motivation? Most engineers have felt this pain at least once. You are building a service that needs a strongly consistent, distributed configuration store. You reach for Redis, etcd, or ZooKeeper - and then spend the next two weeks onboarding a new dependency, wiring up a separate deployment, managing its credentials, and debugging it. The data requirements are modest - a few dozen keys, each a small string. The operational weight is disproportionate.

What if the store was just part of your service? Not a sidecar, not a separate container — but a library you integrate into your application. That is exactly the problem AtomDB was built to solve.

AtomDB is a distributed key-value store built on top of Apache Ratis, the Java implementation of the Raft consensus protocol. It ships as a Dropwizard bundle or a standard Java client, either of which can start a fully replicated Raft group within the same JVM process. There is no separate binary to run, no extra Dockerfile, and no extra deploy step. The cluster forms itself, heals itself, and can even grow and shrink automatically while your service is live.

This post is the engineering narrative behind that system: how it works, why it is designed the way it is, and what the code looks like from the inside.


Embedded Model

Traditional key-value stores like Redis or etcd are designed as an external cluster that you connect to over a network. Your service is a client of that external system. That model carries a hidden tax - every read or write crosses a network boundary, the store must be independently deployed and monitored, and your service's availability now depends on the health of an entirely separate system.

AtomDB flips the model. Instead of your service being a client of an external store, the store is embedded inside your service. Each instance of your application carries its own Raft node. Together those replicas form a Raft cluster and maintain a shared, consistent state without ever leaving the JVM process.

The practical consequence: every instance of your service participates in the cluster. When you scale your service from 3 pods to 5, two new Raft peers come online automatically. When you deploy your service, you deploy the store too.

Registering the AtomDB bundle in your application looks exactly like adding any other Dropwizard bundle:

bootstrap.addBundle(new AtomDBBundle<YourConfiguration>() {

    @Override
    public AtomDBBundleConfig getAtomDBBundleConfig(YourConfiguration config) {
        return config.getAtomDBBundleConfig();
    }

    @Override
    public int getApplicationPort(YourConfiguration config) {
        DefaultServerFactory serverFactory = (DefaultServerFactory) config.getServerFactory();
        for (ConnectorFactory connector : serverFactory.getApplicationConnectors()) {
            if (connector instanceof HttpConnectorFactory httpConnector) {
                return httpConnector.getPort();
            }
        }
        return -1;
    }
});

That single addBundle call wires together the Raft group, the state machine, the HTTP operators, and a Feign-based client — all from the YAML configuration your service already reads at startup.

If you are not using Dropwizard, you can use the provided atomdb-client to integrate AtomDB into your Java application.


Raft Consensus: Why Every Write Needs Majority Agreement

Before understanding how AtomDB stores a value, you need to understand Raft's core guarantee. Raft is a consensus algorithm that ensures a cluster of nodes agrees on a single, ordered log of operations. In a cluster of N nodes, every write must be acknowledged by a strict majority (N/2 + 1 nodes) before it is considered committed. This majority is called the quorum.

For a 3-node AtomDB cluster, that means at least 2 nodes must confirm every PUT before the client sees a success response. For 5 nodes, at least 3 must confirm. Note that if you have a cluster of 100 Dropwizard applications, you don't have to run a 100 node AtomDB cluster. You can form a smaller 3 node cluster where your metadata can reside.

(Diagram source: https://www.mydistributed.systems/2021/04/raft.html)
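To make the arithmetic concrete, here is a tiny self-contained sketch of the quorum and fault-tolerance math (plain Java for illustration; this is not AtomDB code):

```java
// Quorum arithmetic for a Raft cluster of size n (integer division).
// A write commits once (n / 2) + 1 nodes acknowledge it, so the
// cluster tolerates n - quorum(n) simultaneous node failures.
public class QuorumMath {

    static int quorum(int n) {
        return (n / 2) + 1;
    }

    static int faultTolerance(int n) {
        return n - quorum(n);
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 5, 7}) {
            System.out.printf("n=%d quorum=%d tolerates=%d failure(s)%n",
                    n, quorum(n), faultTolerance(n));
        }
    }
}
```

Note the even-cluster corollary: a 4-node cluster needs 3 acknowledgements and still only tolerates 1 failure, which is why Raft clusters are almost always sized with an odd number of voters.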

Reads Route to the Leader

In AtomDB, reads (GET) are always routed to the current Raft leader. This is deliberate. Because the leader is the node that manages log replication, it always has the most up-to-date committed state. By routing reads through the leader, AtomDB guarantees read-after-write consistency: if your client successfully commits a PUT, any subsequent GET on any node will observe that value.

This is markedly different from eventually-consistent systems where a write to Node 1 might not be immediately visible on a GET to Node 2. In AtomDB, consistency is a hard guarantee, not a probabilistic one.

The Ratis query() path is non-replicated — it answers directly from the leader's in-memory state without creating a log entry, keeping reads fast while preserving the consistency guarantee.

// In KeyValueStateMachine.java
@Override
public CompletableFuture<Message> query(Message request) {
    final String logData = request.getContent().toStringUtf8();
    final String[] parts = logData.split(":", 3);

    if ("GET".equals(parts[0]) && parts.length >= 2) {
        final String key = parts[1];
        final String value = keyValueStore.getOrDefault(key, "NOT_FOUND");
        return CompletableFuture.completedFuture(Message.valueOf(value));
    }
    return CompletableFuture.completedFuture(Message.valueOf("INVALID_QUERY"));
}

The State Machine: Writes Are Applied Only After Quorum

The Raft log is not the data store — it is the ordered record of operations that transform the data store. The actual key-value pairs live in the KeyValueStateMachine, a class that extends Ratis's BaseStateMachine. Crucially, applyTransaction is called by the Raft runtime only after the log entry has been replicated to a majority of nodes and committed. Your in-memory ConcurrentHashMap is never touched by a write that hasn't cleared quorum.

// In KeyValueStateMachine.java
private final ConcurrentHashMap<String, String> keyValueStore = new ConcurrentHashMap<>();

@Override
public CompletableFuture<Message> applyTransaction(TransactionContext trx) {
    final RaftProtos.LogEntryProto entry = trx.getLogEntry();
    final String logData = entry.getStateMachineLogEntry()
            .getLogData().toStringUtf8();

    // Parse command: PUT:key:value
    final String[] parts = logData.split(":", 3);
    Message reply = Message.valueOf("INVALID_COMMAND");
    if ("PUT".equals(parts[0]) && parts.length == 3) {
        keyValueStore.put(parts[1], parts[2]);
        reply = Message.valueOf("OK");
    }

    // This is the contract with Ratis: record the highest applied index
    updateLastAppliedTermIndex(entry.getTerm(), entry.getIndex());
    return CompletableFuture.completedFuture(reply);
}

The call to updateLastAppliedTermIndex at the end is not optional. It tells Raft which log entry this node has processed, so that in the event of a crash the runtime knows where to resume replay. Forget that call and you break the replay invariant — the node will re-apply entries it already applied, potentially corrupting state.

Think of it like Kafka KRaft, which replaced ZooKeeper with a Raft-based metadata quorum. Every broker metadata change (partition assignment, ISR list update, config change) is a log entry. The brokers' in-memory metadata is applied only after entries commit. applyTransaction in AtomDB is the direct counterpart of KRaft's metadata record application — and takeSnapshot maps cleanly onto KRaft's metadata snapshot mechanism.


The Control Plane: Dynamic Peer Management Without Manual Intervention

Raft itself is a consensus protocol, not an operations framework. It will faithfully replicate log entries once you tell it who the peers are — but it has no notion of "go discover three more nodes on these ports and add them to the cluster". That operational intelligence is AtomDB's control plane.

Bootstrap: Who Goes First?

When multiple nodes start simultaneously, exactly one of them must bootstrap the Raft group; the others must join it. Choosing the wrong bootstrapper (or having two nodes both bootstrap) corrupts the group. AtomDB resolves this with a deterministic election inside ClusterManager:

  1. clusterService.awaitDiscovery() blocks until at least memberSize peers are reachable on their admin ports (via TCP probe).

  2. From the list of reachable peers, sorted deterministically, the first node in the list becomes the candidate bootstrapper.

  3. Before bootstrapping, the candidate probes all other peers' HTTP /cluster/v1/peers endpoints with exponential-backoff retries. If any peer returns a non-empty follower list, an existing Raft cluster is detected and this node stands down.

// In ClusterManager.java — the safety check before bootstrapping
private boolean isCandidateNode(List<DiscoveryNode> candidateNodes) {
    // Only the first node in the sorted list may bootstrap
    Optional<DiscoveryNode> firstNode = candidateNodes.stream().findFirst();
    String localNodeId = clusterService.getDiscovery().getLocalInstanceInfo().split(":")[0];

    if (!firstNode.map(n -> n.getNodeId().equals(localNodeId)).orElse(false)) {
        return false; // Not first: wait for leader contact
    }

    // First node: probe all others for an existing cluster before starting
    List<DiscoveryNode> otherCandidates = candidateNodes.subList(1, candidateNodes.size());
    boolean clusterExists = detectExistingCluster(otherCandidates);
    return !clusterExists; // Only bootstrap if no existing cluster found
}

Non-candidate nodes start with an empty Raft group: RaftGroup.valueOf(raftGroupId, List.of()). This is intentional and follows Ratis's membership-change protocol. An empty-group start means the node participates in no configuration until the leader explicitly adds it via setConfiguration().

Adding Nodes: The AddListenerTask

After the initial cluster forms, all subsequent membership changes are driven by a background task called AddListenerTask. It runs on the leader every 10 seconds and implements the following decision tree:

AddListenerTask runs (leader only)
        │
        ├── Are any followers unreachable?
        │       │
        │       ├── YES: Is majority quorum still intact?
        │       │       ├── NO  → Log error, do nothing (unsafe to reconfigure)
        │       │       └── YES → Promote available listeners to followers (if flag enabled)
        │       │                 OR find a spare node, POST /ratis/v1/start on it,
        │       │                 then call setConfiguration() adding it as a follower
        │       │
        │       └── NO: Are any listeners unreachable?
        │               ├── YES → Replace with a spare node as listener
        │               └── NO  → Is follower count < expectedMemberSize?
        │                           └── YES → Add spare node as follower
        │                               NO  → Is listener count < maxListeners?
        │                                       └── YES → Add spare node as listener
        │                                           NO  → Cluster at capacity, do nothing
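The tree above can be condensed into a pure decision function. Here is a simplified, self-contained model of that logic — the names (ClusterAction, MembershipPolicy) are hypothetical and for illustration only, not AtomDB's actual API:

```java
// Hypothetical model of the AddListenerTask decision tree. The real task
// also performs the actions (HTTP calls, setConfiguration); this sketch
// only mirrors the branch order described above.
enum ClusterAction {
    QUORUM_LOST_DO_NOTHING, REPLACE_FAILED_FOLLOWER, REPLACE_FAILED_LISTENER,
    ADD_FOLLOWER, ADD_LISTENER, AT_CAPACITY
}

class MembershipPolicy {
    static ClusterAction decide(int reachableFollowers, int unreachableFollowers,
                                int reachableListeners, int unreachableListeners,
                                int expectedMemberSize, int maxListeners) {
        int majorityQuorum = (expectedMemberSize / 2) + 1;
        if (unreachableFollowers > 0) {
            if (reachableFollowers < majorityQuorum) {
                return ClusterAction.QUORUM_LOST_DO_NOTHING; // unsafe to reconfigure
            }
            // Promote a listener (if the flag allows) or start a spare node
            return ClusterAction.REPLACE_FAILED_FOLLOWER;
        }
        if (unreachableListeners > 0) {
            return ClusterAction.REPLACE_FAILED_LISTENER;
        }
        if (reachableFollowers < expectedMemberSize) {
            return ClusterAction.ADD_FOLLOWER;
        }
        if (reachableListeners < maxListeners) {
            return ClusterAction.ADD_LISTENER;
        }
        return ClusterAction.AT_CAPACITY;
    }
}
```

Modelling the branches as a pure function like this also makes the reconfiguration policy trivially unit-testable, independent of any running Raft cluster.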

The quorum math is deliberately based on expectedMemberSize (from config), not the current number of running followers. This is a Raft safety requirement: the quorum threshold must be stable, not a moving target that changes every time a node disappears.

// In AddListenerTask.java — quorum safety check
int majorityQuorum = (expectedMemberSize / 2) + 1;

if (reachableFollowers.size() < majorityQuorum) {
    log.error("CRITICAL: Majority quorum lost! Reachable followers: {}, Required: {}",
            reachableFollowers.size(), majorityQuorum);
    // Do nothing — unsafe to change configuration without quorum
    return;
}

When a spare node is promoted, AtomDB uses a two-step process. First, it POSTs to the spare node's /ratis/v1/start endpoint to start the Raft server process on that node. Then it calls client.admin().setConfiguration() through the Raft protocol to officially add the peer to the group. The new node joins with an empty group and learns the current log from the leader through the normal Raft log replication mechanism.

The Peer String Format

All discovery configuration encodes nodes in a compact string format:

nodeId:host:raftPort:appPort:adminPort
n1:0.0.0.0:9000:8080:8081
  • raftPort (9000): used by Ratis gRPC for log replication between peers

  • appPort (8080): used by the leader's HTTP client to call /ratis/v1/start on spare nodes

  • adminPort (8081): pinged during awaitDiscovery() to check liveness
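Parsing that format is a single split. A minimal sketch — the PeerInfo record is a hypothetical illustration, not AtomDB's actual parser:

```java
// Hypothetical parser for the nodeId:host:raftPort:appPort:adminPort format.
record PeerInfo(String nodeId, String host, int raftPort, int appPort, int adminPort) {

    static PeerInfo parse(String spec) {
        String[] parts = spec.split(":");
        if (parts.length != 5) {
            throw new IllegalArgumentException("Expected 5 colon-separated fields: " + spec);
        }
        return new PeerInfo(parts[0], parts[1],
                Integer.parseInt(parts[2]),
                Integer.parseInt(parts[3]),
                Integer.parseInt(parts[4]));
    }
}
```

For example, PeerInfo.parse("n1:0.0.0.0:9000:8080:8081") yields nodeId "n1" with raftPort 9000, appPort 8080, and adminPort 8081.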


Listeners: A Non-Voting Standby Mechanism

In Ratis, a listener (a non-voting member, called a learner in much of the Raft literature) receives all log entries from the leader and maintains a current copy of the state, but it does not count toward the quorum. This has a useful operational property: a listener can be added to the cluster without changing the quorum size, making it a zero-risk standby.

AtomDB uses listeners as a first tier of high-availability:

Config Field                             Effect
maxListeners: 1                          Keep one spare node as a listener at all times
enablePromotionOfListenersToFollowers    When true, promote a healthy listener to replace a failed follower

With enablePromotionOfListenersToFollowers: false (the safe default), the leader will never silently swap a follower for a listener — any such promotion is a deliberate operator action. With it set to true, recovery from a single-follower failure becomes fully automatic, in a matter of seconds.

// In AddListenerTask.java — listener promotion gate
boolean enableListenerPromotion = clusterService.getClusterHealthStrategy()
        .isEnablePromotionOfListenersToFollowers();

if (enableListenerPromotion && !currentListeners.isEmpty()) {
    int promotionCount = Math.min(followersNeeded, currentListeners.size());
    for (int i = 0; i < promotionCount; i++) {
        RaftPeer listenerToPromote = currentListeners.get(i);
        RaftPeer promotedFollower = RaftPeer.newBuilder()
                .setId(listenerToPromote.getId())
                .setAddress(listenerToPromote.getAddress())
                .setStartupRole(RaftPeerRole.FOLLOWER)
                .build();
        currentFollowers.add(promotedFollower);
    }
}

A concrete scenario with a 5-node cluster (memberSize: 3, maxListeners: 1):

Initial state:   n1(follower) n2(follower) n3(follower) n4(listener) n5(spare, not started)

n2 goes down:
  AddListenerTask detects unreachable follower n2
  Quorum check: 2 reachable followers ≥ majorityQuorum(2) — safe to proceed
  enablePromotionOfListenersToFollowers: false → skip listener promotion
  Find spare node n5 → POST /ratis/v1/start to n5
  Call setConfiguration([n1, n3, n5], [n4])

Result:          n1(follower) n3(follower) n5(follower) n4(listener)
                 Quorum restored. n2 can rejoin later as a follower.

Snapshot Durability: Surviving a Full Cluster Restart

An in-memory state machine is convenient but fragile. If all nodes lose their disks simultaneously — or if a containerised deployment recycles all pods at once — every committed write is gone. To address this, AtomDB supports uploading snapshots to S3-compatible object storage.

Taking a Snapshot

A snapshot serialises the entire ConcurrentHashMap to a local file, computes its MD5, and then uploads both the snapshot and its checksum to S3 under a configurable prefix.

// In KeyValueStateMachine.java
@Override
public long takeSnapshot() throws IOException {
    final TermIndex last = getLastAppliedTermIndex();

    // 1. Serialise the in-memory map to a local file
    File snapshotFile = s3Storage.getSnapshotFile(last.getTerm(), last.getIndex());
    try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(snapshotFile))) {
        out.writeObject(new ConcurrentHashMap<>(keyValueStore));
    }

    // 2. Compute and save an MD5 checksum alongside the snapshot
    MD5Hash md5 = MD5FileUtil.computeAndSaveMd5ForFile(snapshotFile);

    // 3. Upload the snapshot file to S3 at the configured prefix
    s3Storage.uploadSnapshotToS3(last.getTerm(), last.getIndex(), snapshotFile);
    s3Storage.updateLatestSnapshot(new SingleFileSnapshotInfo(...));

    return last.getIndex();
}

The snapshot file name encodes the Raft term and index at which it was taken (e.g. snapshot.4_1892). On startup, loadSnapshot downloads the latest snapshot from S3 before the Raft log replay begins. The new node catches up to the snapshot's index instantly, and then only needs to replay the delta between the snapshot and the current log head — which is typically small.
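The serialise/deserialise pair is plain Java object serialisation. Here is a self-contained round-trip sketch of the same mechanics — local file only, with the S3 upload and MD5 checksum steps omitted, and the class name (SnapshotRoundTrip) purely illustrative:

```java
import java.io.*;
import java.util.concurrent.ConcurrentHashMap;

// Round-trip a snapshot the way takeSnapshot/loadSnapshot do, minus S3:
// serialise the map to a file, then read it back into a fresh map.
public class SnapshotRoundTrip {

    static void save(ConcurrentHashMap<String, String> store, File file) {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(new ConcurrentHashMap<>(store)); // copy for a stable view
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @SuppressWarnings("unchecked")
    static ConcurrentHashMap<String, String> load(File file) {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (ConcurrentHashMap<String, String>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();
        store.put("feature.flag.rollout", "true");

        // File name mimics the term_index convention described above
        File snapshot = new File(System.getProperty("java.io.tmpdir"), "snapshot.4_1892");
        save(store, snapshot);

        ConcurrentHashMap<String, String> restored = load(snapshot);
        System.out.println(restored.get("feature.flag.rollout")); // prints "true"
        snapshot.delete();
    }
}
```

The production path adds the checksum and S3 transfer around exactly this serialise/deserialise core.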

# Minimal S3 storage config in your YAML
storageConfig:
  type: S3
  bucketName: "my-service-atomdb"
  region: "us-east-1"
  accessKey: "your-access-key"
  secretKey: "unused"                        # token-based auth
  endpoint: "https://your-s3-endpoint"
  pathStyleAccess: true
  snapshotPrefix: "my-service-snapshots-prod/"  # unique per service+env

The snapshotPrefix is the single most important knob to get right in production: use a unique value per service and per environment. Sharing a prefix between staging and production is the fastest path to a corrupted cluster after an accidental cross-environment snapshot download.


Integration Tests: Real Servers, Real Raft

Unit tests that mock Raft are not very useful — the interesting bugs live in timing, log replication races, and node failure/recovery sequences. AtomDB's integration tests spin up real in-process Dropwizard servers on dynamic ports using a JUnit 5 extension.

@RegisterExtension
static MultiInstanceAtomDbExtension cluster = MultiInstanceAtomDbExtension.builder()
        .instanceCount(3)
        .quorumSize(3)
        .startupTimeout(Duration.ofSeconds(120))
        .build();

beforeAll calls ConfigGenerator.generateClusterConfigs() to produce dynamic-port YAML configs, starts each instance sequentially with a 3-second gap (to avoid bootstrap races), and then polls ClusterHealthChecker.waitForAllPeersInCluster() until the expected number of followers are visible in the cluster's listPeers() response. Only then does the first test method run.
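The waitForAllPeersInCluster polling pattern boils down to a deadline loop. A generic, self-contained sketch of that pattern (the Polling class here is a hypothetical illustration; ClusterHealthChecker itself is AtomDB code):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.BooleanSupplier;

// Poll a condition at a fixed interval until it holds or a deadline passes.
public class Polling {

    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration interval) {
        Instant deadline = Instant.now().plus(timeout);
        while (Instant.now().isBefore(deadline)) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false; // stop waiting if interrupted
            }
        }
        return condition.getAsBoolean(); // one final check at the deadline
    }
}
```

In the tests, the condition would be something like "listPeers() reports the expected follower count"; here it is left abstract as a BooleanSupplier.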

Verifying Replication

The most fundamental correctness property: a PUT on node 1 must be visible on node 2.

@Test
void testPutOnOneNodeGetOnAnother() {
    AtomDbClient client1 = cluster.getClient("n1");
    AtomDbClient client2 = cluster.getClient("n2");

    AtomDbResponse<String> putResponse = client1.put("replication-key", "replication-value");
    assertThat(putResponse.isSuccess()).isTrue();

    cluster.waitForReplication();

    AtomDbResponse<String> getResponse = client2.getKey("replication-key");
    assertThat(getResponse.isSuccess()).isTrue();
    assertThat(getResponse.getResponse()).isEqualTo("replication-value");
}

Leader Failover

LeaderFailoverTest validates the complete failover lifecycle: stop the leader, verify the remaining two nodes elect a new leader, write new data, restart the former leader, and confirm it rejoins as a follower and catches up.

@Test
void testLeaderFailoverAndRecovery() throws Exception {
    verifyInitialClusterFormation();    // all 3 nodes up, listPeers returns 3 followers
    testClusterBeforeFailover();        // PUT/GET works
    stopLeaderNode();                   // cluster.stopNode("n1")
    testClusterAfterLeaderFailure();    // PUT/GET still works via new leader (n2 or n3)
    restartFormerLeaderNode();          // cluster.startNode("n1")
    verifyNodeRejoinsCluster();         // listPeers shows n1 as follower again
    testClusterAfterRecovery();         // all 3 nodes serve reads/writes
}

Follower Replacement

FollowerRecoveryTest uses a 5-node cluster (3 followers, 1 listener, 1 spare) and asserts that when a follower dies, the spare is automatically promoted within the AddListenerTask cycle:

@RegisterExtension
static MultiInstanceAtomDbExtension cluster = MultiInstanceAtomDbExtension.builder()
        .instanceCount(5)
        .quorumSize(3)
        .expectedPeerCount(4) // n1-n4 join; n5 remains spare until needed
        .startupTimeout(Duration.ofSeconds(180))
        .build();

After killFollowerAndVerifyPromotion() stops node n2, the test uses Awaitility to poll listPeers() until it sees exactly 3 followers again — with n5 promoted into n2's slot.

Listener Synchronisation

ClusterListenerSyncTest proves the non-voting-member guarantee: a listener keeps a current replica of the key-value store but its failure does not affect the voting quorum. The test simultaneously kills the listener node and one quorum follower, verifying that the remaining 2 quorum nodes still accept writes and serve reads.


Integrating as a Client

If you are writing a Java service that wants to talk to an AtomDB cluster without embedding the full Raft server (for example, a service that is not part of the cluster but needs to read configuration from it), the atomdb-client module provides a thin Feign interface:

<dependency>
    <groupId>com.snehasishroy</groupId>
    <artifactId>atomdb-client</artifactId>
    <version>1.0.0</version>
</dependency>
AtomDbClient client = Feign.builder()
        .client(new OkHttpClient())
        .encoder(new PlainTextEncoder(new JacksonEncoder()))
        .decoder(new JacksonDecoder())
        .target(AtomDbClient.class, "http://atomdb-node1:8080/");

// Store a value
AtomDbResponse<String> putResp = client.put("feature.flag.rollout", "true");
assertThat(putResp.isSuccess()).isTrue();

// Read it back
AtomDbResponse<String> getResp = client.getKey("feature.flag.rollout");
System.out.println(getResp.getResponse()); // "true"

The client interface itself is minimal by design:

public interface AtomDbClient {
    @RequestLine("GET /kv/v1/{key}")
    AtomDbResponse<String> getKey(@Param("key") String key);

    @RequestLine("PUT /kv/v1/{key}")
    @Headers("Content-Type: text/plain")
    AtomDbResponse<String> put(@Param("key") String key, String value);

    @RequestLine("GET /cluster/v1/peers")
    AtomDbResponse<ClusterPeersResponse> listPeers();

    @RequestLine("POST /snapshot/v1/")
    void triggerSnapshot();
}

Because reads are served by the leader and the leader can change during a failover, it is good practice to point your Feign target at a load balancer (or use retry logic with multiple target URLs) rather than hard-coding a single node.
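Without a load balancer, a simple option is a try-each-node wrapper around the per-URL Feign clients. A hedged sketch — the FailoverInvoker class is hypothetical, not part of atomdb-client:

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical failover wrapper: try each node URL in order until one call
// succeeds. The caller supplies a function from URL to result, e.g. one that
// builds (or looks up a cached) Feign client for that URL and invokes it.
public class FailoverInvoker {

    private final List<String> nodeUrls;

    public FailoverInvoker(List<String> nodeUrls) {
        this.nodeUrls = nodeUrls;
    }

    public <T> T invoke(Function<String, T> call) {
        RuntimeException lastFailure = null;
        for (String url : nodeUrls) {
            try {
                return call.apply(url); // e.g. url -> clientFor(url).getKey("k")
            } catch (RuntimeException e) {
                lastFailure = e; // node down or unreachable: try the next one
            }
        }
        throw new IllegalStateException("All AtomDB nodes failed", lastFailure);
    }
}
```

This keeps the client usable through a leader failover at the cost of one extra round trip per dead node; a load balancer in front of the cluster achieves the same with less client-side code.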


Discovery Strategies: From Local Dev to Production

AtomDB supports three discovery modes to cover the full lifecycle from laptop to production:

Mode      Config type       When to use
STATIC    type: STATIC      Local dev with fixed ports; no automatic node promotion
DYNAMIC   type: DYNAMIC     Local tests with spare nodes; simulates failover
DROVE     type: DROVE       Production; auto-discovers live instances via Drove

In DYNAMIC mode every discoverable node — including future spares — must appear in the peers list upfront. A node not listed can never be discovered by the AddListenerTask.

In DROVE mode the control plane queries the Drove orchestration API (https://github.com/PhonePe/drove-orchestrator) to discover all healthy instances for the service. This is the zero-configuration production path - deploy a new instance, Drove registers it, AtomDB discovers it on the next AddListenerTask tick.

# Production DROVE config
atomDBBundleConfig:
  clusterConfiguration:
    discoveryConfig:
      type: DROVE
      droveEndpoint: ${DROVE_ENDPOINT_URL}
      raftPortName: raft
      communicationPortName: main
    clusterHealthConfig:
      type: DYNAMIC
      memberSize: 3
      maxListeners: 1
      enablePromotionOfListenersToFollowers: false
    groupUUID: 02511d47-d67c-49a3-9011-abb3109a44c2

Conclusion

AtomDB started as an experiment: what is the minimum viable design for a strongly consistent, embedded key-value store that a team could adopt with minimal external dependencies? The answer turned out to be a surprisingly thin layer on top of Apache Ratis.

The core insight is the separation of concerns between Raft (the protocol) and the control plane (the operational intelligence). Ratis handles log replication, leader election, and membership changes — but it does not decide when to add nodes, which nodes to promote, or how to bootstrap a fresh cluster. That reasoning lives in ClusterManager and AddListenerTask, and it is the part of the system that is most specific to AtomDB's operational model.

The result is a system where you can:

  • PUT a key on any node, get a linearisable write-after-quorum guarantee

  • GET a key and always read the latest committed value

  • Lose a minority of nodes and have the cluster self-heal with spare nodes within seconds

  • Survive a full cluster restart by downloading a snapshot from S3

  • Write tests that start real Raft clusters in-process, exercise actual failover sequences, and clean up after themselves in a single JUnit lifecycle

The source code is available at github.com/snehasishroy/atomdb. If you embed it in a Dropwizard application and find a bug or a missing feature, pull requests are very welcome.

P.S. I have done extensive testing to make sure the first release has no known bugs. However, it has not been battle-tested in a production environment yet. Thank you for sticking with me to the end.