Building a Distributed Job Scheduler

Distributed job schedulers are essential because they let us schedule callbacks without having to handle scalability and reliability ourselves. You could try to do the same with a ScheduledThreadPool (Java's ScheduledThreadPoolExecutor), but it cannot guarantee callbacks because it is not durable: scheduled tasks live only in memory, so if the underlying machine crashes, the thread pool and every pending callback are lost with it.
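To make the durability problem concrete, here is a minimal sketch. The class and method names are hypothetical, and `shutdownNow()` merely stands in for a crash: it returns the tasks that were still waiting in memory and will now never run.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of any scheduler design in this series.
final class InMemorySchedulerDemo {

    // Schedules one callback an hour from now, then stops the pool immediately.
    // Returns how many scheduled callbacks were dropped without ever running.
    static int pendingCallbacksLostOnShutdown() {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);

        // The pending task exists only on this JVM's heap.
        pool.schedule(() -> System.out.println("callback fired"), 1, TimeUnit.HOURS);

        // Stand-in for a crash: waiting tasks are handed back and never executed.
        List<Runnable> neverRan = pool.shutdownNow();
        return neverRan.size();
    }
}
```

Nothing in this code writes the job anywhere that survives the process, which is exactly the gap a distributed job scheduler has to close.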

In this multi-part series, we're rolling up our sleeves to build a robust distributed job scheduler from scratch. Get ready to dive into the world of distributed systems!

Understanding the requirements

Before we jump into the technical details, let's establish a clear understanding of what our distributed job scheduler needs to achieve.

Job Types

Our platform should support three types of jobs:

Once - These jobs need to be scheduled only once at a specified date and time, such as scheduling a job for August 1, 2023, at 23:00.
Repeated - Repeated jobs occur within a defined date range, with a specified time interval between each occurrence. For example, scheduling a job to run every 30 minutes between August 1 and August 31, 2023.
Recurring - Recurring jobs are scheduled for specific dates and times, e.g., on August 1, 2023, at 16:00, and on August 5, 2023, at 12:30.
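The three job types above differ only in how their execution times are derived. The following is a rough sketch under a few assumptions: the class and method names are hypothetical, the end of the date range is treated as inclusive, and `LocalDateTime` is used for brevity even though a real system would need to settle on a time zone.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper that expands each job type into its execution times.
final class JobSchedules {

    // Once: a single execution at the given date and time.
    static List<LocalDateTime> once(LocalDateTime at) {
        return List.of(at);
    }

    // Repeated: every `interval` from `start` up to and including `end` (assumed inclusive).
    static List<LocalDateTime> repeated(LocalDateTime start, LocalDateTime end, Duration interval) {
        if (interval.isZero() || interval.isNegative()) {
            throw new IllegalArgumentException("interval must be positive");
        }
        List<LocalDateTime> times = new ArrayList<>();
        for (LocalDateTime t = start; !t.isAfter(end); t = t.plus(interval)) {
            times.add(t);
        }
        return times;
    }

    // Recurring: an explicit list of dates and times, returned in chronological order.
    static List<LocalDateTime> recurring(List<LocalDateTime> explicitTimes) {
        List<LocalDateTime> sorted = new ArrayList<>(explicitTimes);
        Collections.sort(sorted);
        return sorted;
    }
}
```

For example, `JobSchedules.repeated(LocalDateTime.of(2023, 8, 1, 0, 0), LocalDateTime.of(2023, 8, 1, 2, 0), Duration.ofMinutes(30))` yields five execution times, from 00:00 to 02:00.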

How to configure callbacks?

To offer flexibility, our system must allow clients to configure various aspects of their callbacks:

Retry strategies - Defines what happens if a callback fails. Should it be retried, and if so, what should the retry strategy entail?
Auth token - Provide an authentication token for client-side verification during callbacks.
Callback Path/URL - The actual HTTP URL where the callback will be made.
Headers to pass - Any custom application headers to pass during the callback.
Success status codes - Which HTTP status codes count as a successful callback? Simply relying on 200 won't suffice for all client use cases.
Relevancy window - Defines the maximum acceptable delay for executing a callback. For example, if a callback was expected at 13:00 but the job somehow got picked up only at 13:30, is that job still valid? The client answers this by configuring a relevancy window: if the delay (30 minutes here) is within the window, the callback is performed; otherwise, it is skipped.
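Two of the settings above, success status codes and the relevancy window, boil down to simple checks. Here is a minimal sketch; the `CallbackPolicy` class and its method names are assumptions made for illustration, not a finalized API.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Set;

// Hypothetical holder for two client-configured callback settings.
final class CallbackPolicy {
    private final Set<Integer> successStatusCodes;
    private final Duration relevancyWindow;

    CallbackPolicy(Set<Integer> successStatusCodes, Duration relevancyWindow) {
        this.successStatusCodes = Set.copyOf(successStatusCodes);
        this.relevancyWindow = relevancyWindow;
    }

    // A callback is still relevant if the delay between its expected time and
    // the time the job was picked up does not exceed the relevancy window.
    boolean isStillRelevant(Instant expectedTime, Instant pickedUpAt) {
        Duration delay = Duration.between(expectedTime, pickedUpAt);
        return delay.compareTo(relevancyWindow) <= 0;
    }

    // The client decides which HTTP status codes mean success.
    boolean isSuccess(int httpStatus) {
        return successStatusCodes.contains(httpStatus);
    }
}
```

With a 30-minute window, a callback expected at 13:00 and picked up at 13:30 is still performed; with a 15-minute window, the same callback is skipped.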

With these functional requirements in mind, let's move on to considering some non-functional aspects that will shape our system.

Durability

If a client has received a successful acknowledgment that our platform accepted a job, the job details must already be persisted in durable storage.
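The ordering is what matters here: persist first, acknowledge second. The sketch below illustrates that ordering with a local append-only file, which is purely an assumption for demonstration. A single machine's disk does not survive the loss of that machine, and the class name and record format are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical append-only job log used only to show "persist, then acknowledge".
final class DurableJobLog {
    private final Path logFile;

    DurableJobLog(Path logFile) {
        this.logFile = logFile;
    }

    // Returns the job ID as the acknowledgment only after the record
    // has been written and forced to the storage device.
    String accept(String jobId, String payload) {
        byte[] record = (jobId + "\t" + payload + "\n").getBytes(StandardCharsets.UTF_8);
        try (FileChannel channel = FileChannel.open(logFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ByteBuffer buffer = ByteBuffer.wrap(record);
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            channel.force(true); // flush file content and metadata before acknowledging
        } catch (IOException e) {
            // No acknowledgment is returned if the write failed.
            throw new UncheckedIOException(e);
        }
        return jobId;
    }
}
```

If the write or the flush fails, the method throws instead of returning, so the client never sees an acknowledgment for a job that was not stored.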

Scalability

Design every component of the system with scalability in mind. Keep the components loosely coupled so that each one can be scaled independently of the others.

Callback Guarantees

In a distributed system, it's difficult to make delivery guarantees, and exactly-once delivery is especially hard. So let's settle for at-least-once delivery, i.e. ensure at least one callback to our client for every scheduled job. Clients might receive duplicate callbacks, and it's up to them to identify the duplicates and decide whether or not to process them.
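On the client side, duplicate callbacks can be handled by making the handler idempotent. The sketch below assumes each callback carries a unique identifier (the `callbackId` parameter is hypothetical; the requirements above do not define one) and keeps the seen identifiers in memory, which a real client would likely replace with persistent storage.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical client-side handler that ignores callbacks it has already processed.
final class IdempotentCallbackHandler {
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    // Returns true if the action ran, false if this callbackId was a duplicate.
    boolean handle(String callbackId, Runnable action) {
        // add() returns false when the ID is already present, i.e. a duplicate.
        if (!processed.add(callbackId)) {
            return false;
        }
        try {
            action.run();
        } catch (RuntimeException e) {
            // Forget the ID so that a retried callback can be processed again.
            processed.remove(callbackId);
            throw e;
        }
        return true;
    }
}
```

Calling `handle` twice with the same identifier runs the action only once, so a repeated callback from the scheduler becomes harmless.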

Conclusion

Congratulations on making it this far! In this first part of our tutorial series on building a distributed job scheduler, we've outlined the essential functional and non-functional requirements. Let's pause and think about how we are going to implement a job scheduler that meets these requirements. I will not jump straight into drawing boxes and assigning responsibilities to them. Instead, we will first figure out what kind of work actually has to be done, and then decide whether we need a component to handle each kind of task.

In the next installment, we will start by constructing a durable storage system that can persist job details efficiently and allow for quick lookups based on job IDs.

Link to part 2: https://snehasishroy.com/building-a-distributed-job-scheduler-from-scratch-part-2