Shewhart Control Charts

[Image: a raccoon shown twice in a pink-and-turquoise 1980s junior-high graduation photo, captioned "I'm sorry for being annoying, it will happen again"]

As a monitor writer, I want to alert when a value has changed quickly and significantly in either direction, but I don't want to set hard-coded thresholds, because the value's range is expected to evolve slowly. My goal is to get useful alerts and avoid false alarms.

Examples:

  • A machine makes washers. If the washers are too thick or too thin, they're no good. This was Shewhart's original use case, and frankly it seems to me that a pair of hard thresholds would probably have worked. But I'm not a math-knowing PhD science guy.
  • A script collects threat intel. If the number of records goes really high or really low, something might be wrong, but the number is expected to rise gradually over time. It might even fall over time, as criminal sites are taken down or threat intelligence platform (TIP) inclusion criteria change.
  • I'm watching the amount of storage left in production. A hard threshold on a numeric or percentage value is easy for one device, but surprisingly tough to generalize. Alerting on rapid, significant changes works for all devices.
[Image: superhero-making-a-choice meme: numeric thresholds in small environments vs. percentage thresholds in large environments]
/*
A Shewhart control chart in OPAL. It calculates a daily count, a running average, and a running standard deviation.
If the daily count is more than two running standard deviations from the running average, it alerts.
This technique is not suitable for things with expected rapid change or human choice involved.
*/
// select some records
//
// replace "block" with the thing you're counting
timechart 1d, daily_count:count_distinct_exact(block), group_by()
make_event
make_col runningAvg:window(avg(daily_count), frame(back:7d)),
runningStdev:window(stddev(daily_count), frame(back:7d))
// flag days more than two running standard deviations from the running average, in either direction
make_col daily_drift:(daily_count - runningAvg > runningStdev*2) OR (runningAvg - daily_count > runningStdev*2)
// and then it should be usable as a monitor. A 24-hour window should do.
// replace ___ with the thing you're monitoring
make_col message:"The ___ has returned an unusual number of events based on recent running averages."
filter daily_drift=true
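
For readers who don't speak OPAL, here's a minimal Python sketch of the same idea. The data layout (a plain list of daily counts, newest last) and the function name are my own assumptions, not anything OPAL gives you, and this baseline excludes the current day, which may differ slightly from how frame(back:7d) buckets things.

from statistics import mean, stdev

def shewhart_alert(daily_counts, baseline_days=7, sigma=2.0):
    # Alert when the newest daily count sits more than `sigma` running
    # standard deviations from the running average of the prior days.
    if len(daily_counts) < baseline_days + 1:
        return False  # not enough history to build a baseline yet
    baseline = daily_counts[-(baseline_days + 1):-1]
    return abs(daily_counts[-1] - mean(baseline)) > sigma * stdev(baseline)

# A slow upward trend stays quiet; a sudden spike alerts.
print(shewhart_alert([100, 103, 101, 105, 104, 107, 106, 240]))  # True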

What It Doesn’t Do

It's not good at things like "how many things do human workers do," because it will alert at the beginning and end of each workday (at least); a toy demonstration follows the lists below. Some obvious fails:

  • How many logins succeeded
  • How many messages sent
  • How many reports generated

And some surprises:

  • How much bandwidth used
  • How many kilowatt-hours drawn by the production data center
  • How much wear on the office HVAC (heating, ventilation, air conditioning) system
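
To make the workday problem concrete, here's a toy demonstration with hypothetical hourly login counts: the quiet overnight hours form the baseline, so the morning surge blows past two standard deviations every single workday.

from statistics import mean, stdev

# Hypothetical hourly login counts: midnight through 7am, then the workday starts.
hourly_logins = [2, 1, 3, 2, 1, 2, 2, 0, 350]
baseline = hourly_logins[:-1]
drift = abs(hourly_logins[-1] - mean(baseline)) > 2 * stdev(baseline)
print(drift)  # True: the start of every workday "alerts"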

Can We Fix It?

No We Can't! If people are involved in a system, you will have surprise surges and drops in load. You could technically prepare for those, or just suppress alerting… but realistically that almost never happens. Do you know your company's holiday schedule? What about its international offices'? How about your partners'? Do you know your customers' holiday schedules? There are some other tricks, but nothing is perfect.

The first thing I've tried is comparing a lattice of time blocks to each other (like Monday morning to Monday morning), but that still fails on holidays. Every three-day weekend becomes a storm of alerts.
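
Sketched in the same Python style, the lattice compares the current hour only against the same weekday-and-hour in prior weeks. The dict-of-hourly-counts layout is a hypothetical of mine, and you can see the failure mode: a holiday Monday's baseline is four busy Mondays.

from datetime import timedelta
from statistics import mean, stdev

def lattice_alert(counts, now, weeks=4, sigma=2.0):
    # counts: dict mapping an hour-truncated datetime to a count.
    slots = [now - timedelta(weeks=w) for w in range(1, weeks + 1)]
    baseline = [counts[t] for t in slots if t in counts]
    if len(baseline) < 2:
        return False  # not enough matching slots yet
    return abs(counts[now] - mean(baseline)) > sigma * stdev(baseline)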

The second thing I've tried is stacking lattices, so you're comparing this January 1st to the typical January 1st, but irregular holidays still burn you… and so do days that are just unusual, because there was a tornado near corporate HQ and no one went to the office for a few days.
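
Stacking lattices looks something like this (same hypothetical layout, now keyed by date): the baseline for January 1st is January 1st in prior years. Notice how thin the baseline is, which is exactly why one tornado year poisons it.

from datetime import date

def stacked_baseline(daily_counts, day, years=3):
    # daily_counts: dict mapping a date to a count.
    # day.replace() raises on Feb 29 in non-leap years; handling that is left out.
    prior = [day.replace(year=day.year - y) for y in range(1, years + 1)]
    return [daily_counts[d] for d in prior if d in daily_counts]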

Another trick is to adjust how long your rolling window is: how much time goes into the current measurement that is compared to the baseline. In the OPAL example, this is the 1d value. Unfortunately, that is a trade-off: longer windows increase the time before you know of a change, and shorter windows increase the likelihood of a false positive.

Lastly, you can adjust your baseline: the look-back dataset that the averages and standard deviations come from. In the OPAL example, these two values are calculated over a 7-day frame. Again, a longer baseline decreases sensitivity by accepting more variability.
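
Here are both knobs in one hedged sketch; window_days and baseline_days are names I made up for the 1d and 7d values in the OPAL above.

from statistics import mean, stdev

def drift_alert(daily_counts, window_days=1, baseline_days=7, sigma=2.0):
    if baseline_days < 2 or len(daily_counts) < window_days + baseline_days:
        return False
    current = mean(daily_counts[-window_days:])  # longer window: slower to notice a change
    baseline = daily_counts[-(window_days + baseline_days):-window_days]
    return abs(current - mean(baseline)) > sigma * stdev(baseline)  # longer baseline: less sensitive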

There's one last trick, I guess: silence the alert on weekends and just put up with false positives on holidays (one for entering the holiday, one for leaving it).
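
The weekend silence really is a one-line guard (in Python, weekday() returns 0 for Monday, so 5 and 6 are the weekend):

from datetime import date

def should_page(day: date, drifted: bool) -> bool:
    return drifted and day.weekday() < 5  # swallow weekends; holidays still page twice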
