Enterprise systems have a lot of things that need to happen. If they all happen at the same time, you’ll either overload your constrained resources or overload the budget attached to your elastic resources. Plus, some of these things are supposed to occur at a specific time, and others should happen around a specific time! Worst of all, most of these tasks are doing something with inbound data, so there’s no point running them unless that data has arrived. How to make it go?
There are a lot of ways to answer this problem, and many systems don’t have a one-size-fits-all story. Instead, they build different tools for different positions on a range of possible solutions, or at least different operational modes for their one scheduler tool so it can cover multiple positions.
Position 1: “real-time” streaming, or continuous fire. The task execution engine watches the ingest pipeline or the data storage interface or the data modeling product… wherever your system considers data to be properly arrived and ready to use. If new data is arriving or new tasks are needed or whatever, the engine just does the thing right away. What’s great about this is that each task is probably really small (although users might decide to put a hefty look-back window on a task, making the engine work harder than you’d expect). Real-time is in air quotes because nothing is really real-time in our general-purpose computing world; it’s more of a “real enough for humans” sort of thing. What sucks about streaming is that the data pipeline is imperfect. Out-of-order data, late-arriving data (hence the long look-backs), missing data, and just plain wrong data all happen, and real-time streaming can consequently produce more transient noise events than you might expect. The other thing that sucks is of course load: if you get a massive inbound bolus of data, then your scheduling system is going to spend a lot of resources and possibly fall over anyway.
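Here’s a minimal sketch of continuous fire, assuming an in-process queue stands in for wherever your system decides data has properly arrived; `ArrivalEvent`, `run_task`, and the worker loop are all invented names for illustration, not anyone’s actual API:

```python
# Position 1 sketch: run the task the moment data arrives. No clock, no deferral.
import queue
import threading
from dataclasses import dataclass

@dataclass
class ArrivalEvent:
    dataset: str           # which dataset just landed
    lookback_minutes: int  # user-configured look-back window; big values mean big work

arrivals: "queue.Queue[ArrivalEvent]" = queue.Queue()

def run_task(event: ArrivalEvent) -> None:
    # Placeholder for the real work: recompute whatever depends on this dataset,
    # scanning however much history the look-back window demands.
    print(f"processing {event.dataset} with a {event.lookback_minutes}-minute look-back")

def continuous_fire_worker() -> None:
    # The engine just does the thing right away. Under a bolus of arrivals this
    # loop runs hot until it (or the resource budget) falls over.
    while True:
        event = arrivals.get()
        run_task(event)
        arrivals.task_done()

threading.Thread(target=continuous_fire_worker, daemon=True).start()

arrivals.put(ArrivalEvent("orders", lookback_minutes=5))
arrivals.put(ArrivalEvent("clicks", lookback_minutes=1440))  # the hefty look-back case
arrivals.join()
```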
Position 2: Streaming with some slippage allowed. The fallback from “real-time” is to keep the same basic principles, but add a per-job time target that allows the job to be deferred if load is high. This might be called something like a freshness goal or an allowed slippage window. It lets the continuous stream slip so your engine can survive under load, which has the added benefit of smoothing out some data weirdness. What’s great about this is that it turns a sharp cliff into a smooth curve if it’s used properly, meaning you can get more work out of the system. What sucks is that you have to train your users on a whole new level of complexity, and then they have to use it properly, and that means your field engineers end up doing a lot more work.
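A rough sketch of what per-job slippage could look like: each pending job carries a made-up `deadline` (arrival time plus allowed slippage), and `engine_is_busy()` stands in for whatever load signal your engine actually has. None of these names come from a real product.

```python
# Position 2 sketch: run eagerly when the engine is idle, defer when it's busy,
# but never let a job slip past its deadline.
import heapq
import time
from dataclasses import dataclass, field

@dataclass(order=True)
class PendingJob:
    deadline: float                   # arrival time + allowed slippage
    name: str = field(compare=False)  # only the deadline matters for ordering

def engine_is_busy() -> bool:
    # Stand-in for a real load signal (queue depth, CPU, concurrency limits...).
    return False

def slippage_loop(pending: list) -> None:
    heapq.heapify(pending)  # earliest deadline first
    while pending:
        now = time.time()
        job = pending[0]
        if now >= job.deadline or not engine_is_busy():
            heapq.heappop(pending)
            print(f"running {job.name} (slack remaining: {job.deadline - now:+.1f}s)")
        else:
            time.sleep(0.1)  # the stream is allowed to slip, so wait a beat

slippage_loop([
    PendingJob(deadline=time.time() + 30, name="alerting-rollup"),
    PendingJob(deadline=time.time() + 300, name="hourly-ish-summary"),
])
```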
Those two positions are streamers. Both of them suck really badly at another problem: “I just want this task to run once per weekday.” And so… the schedulers.
Position 3: cron-based schedules. There’s a clock, and the task engine wakes up periodically to see what time it is and whether there’s some stuff to do. As a user, you schedule your tasks: “Every five minutes, activate the retroencabulator or process the accumulated data or whatever.” The nice thing here is that it’s way more predictable, though it can still get walloped by surprises. If you plan for processing three hundred data points every five minutes but this period has three hundred million data points… again, your resource budget will see some bad times. Products can encourage users to do that to themselves through poorly planned user experience as well. For instance, if the schedule CRUD (Create, Read, Update, and Delete) form defaults new tasks to run at the top of the hour, then after a while your system will sit idle for 59 minutes and then try to execute five thousand tasks in the 60th minute. Another thing that sucks about scheduling tasks: it’s very clearly not a “real-time” system, and so it can look bad in a sales dogfight with another product that offers “real-time” tasks.
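A quick sketch of that top-of-the-hour trap, assuming a hypothetical schedule form that picks a default minute for each new task. Deterministic jitter (hashing the task name into a minute) spreads the same five thousand tasks across the hour instead of piling them all onto minute zero:

```python
# Position 3 sketch: naive defaults vs. jittered defaults for the schedule form.
import hashlib
from collections import Counter

def default_minute_naive(task_name: str) -> int:
    return 0  # "run at the top of the hour" -- every task piles onto minute 0

def default_minute_jittered(task_name: str) -> int:
    # Deterministic jitter: the same task always lands on the same minute.
    digest = hashlib.sha256(task_name.encode()).hexdigest()
    return int(digest, 16) % 60

tasks = [f"task-{i}" for i in range(5000)]

naive = Counter(default_minute_naive(t) for t in tasks)
jittered = Counter(default_minute_jittered(t) for t in tasks)

print("naive busiest minute:", naive.most_common(1))      # all 5000 tasks at minute 0
print("jittered busiest minute:", jittered.most_common(1)) # roughly 80-110 tasks per minute
```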
Position 4: flexible scheduling. As in Position 2, we add a property to each job that lets the exact launch time slip by a bit. “This task starts at 14:00, but it just needs to run somewhere around there.” If 14:00 rolls around and the task engine is struggling, this task can be delayed so it doesn’t get in the way. This type of system probably has a delay ceiling so it doesn’t run your task forty-eight hours late; say, at fifteen minutes late, it might just error out if it still hasn’t been able to start. What’s cool about this: it squeezes more work from the system by smoothing load curves. What sucks: the system becomes less predictable and more complex for a user to configure.
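Something like the sketch below, where `nominal_start`, `max_delay_seconds`, and the `try_start` states are all invented for illustration; the fifteen-minute ceiling just mirrors the example above:

```python
# Position 4 sketch: a nominal start time, a bit of allowed slack, and a hard
# ceiling after which the job errors out instead of starting very late.
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class FlexibleJob:
    name: str
    nominal_start: float                # e.g. today's 14:00, as a Unix timestamp
    max_delay_seconds: float = 15 * 60  # the delay ceiling

def try_start(job: FlexibleJob, engine_is_busy: bool, now: Optional[float] = None) -> str:
    now = time.time() if now is None else now
    if now < job.nominal_start:
        return "waiting"                          # not due yet
    if now > job.nominal_start + job.max_delay_seconds:
        return "error: missed the delay ceiling"  # too late; give up loudly
    if engine_is_busy:
        return "deferred"                         # "somewhere around 14:00" is fine
    return "started"

fourteen_hundred = time.time()  # pretend "now" is the nominal 14:00 slot
print(try_start(FlexibleJob("weekday-report", fourteen_hundred), engine_is_busy=True))   # deferred
print(try_start(FlexibleJob("weekday-report", fourteen_hundred), engine_is_busy=False))  # started
```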
Schedulers and streamers are both useful, so it’s really nice to have both capabilities. If you’ve only got resources to build one, you need to look at your competitive landscape and determine what your customers and analysts will find most important. If it’s fast alerting, go with a streamer and your reporting will comparatively suck. If it’s predictable reporting, go with a scheduler and your alerting will comparatively suck. If you can’t decide… the default position should be a scheduler. You can do some really janky workarounds, but it’s painful as hell to make a streamer act like a scheduler. A scheduler can act like a streamer much more easily.
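For a sense of why the scheduler-pretending-to-be-a-streamer direction is the easier one, here’s a sketch: schedule a cheap “anything new?” poll on a tight interval and it looks close enough to real time for most humans. `poll_for_new_data` and `process` are hypothetical stand-ins, not a real API.

```python
# A scheduler acting like a streamer: an ordinary scheduled task on a short interval.
import time

def poll_for_new_data() -> list:
    # Stand-in for "ask the storage layer what landed since the last check".
    return []

def process(batch: list) -> None:
    print(f"processing {len(batch)} new records")

def scheduled_pseudo_stream(interval_seconds: float = 60, max_ticks: int = 3) -> None:
    # A plain scheduled task; no streaming machinery required.
    for _ in range(max_ticks):
        batch = poll_for_new_data()
        if batch:
            process(batch)
        time.sleep(interval_seconds)

scheduled_pseudo_stream(interval_seconds=1)  # tighten the interval and it "streams"
```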


