Security analysts can’t ever be certain of what they’re seeing and not seeing.
See something, do something
My entire career has been in some form of “see what’s important, then do something about it.” It’s Heisenberg’s world though. Collecting and moving data has impact and cost, which can be hard to continue justifying. That often means gaps and uncertainty in your source data, which AI models just love to fail on.
So it’s easy to end up with a lot of data that isn’t doing much for you yet. You could end up investing quite a lot of effort to make it valuable to anyone but a handful of experts. You can see things, but it’s hard to automate and the automation takes continual maintenance. All of the rules will eventually fail: because they’re not looking at the correct time range, because they don’t have access to the correct data, and/or because they don’t have the current schema for that data.
Recognizing the something that you want to see is costly, particularly in security
The core problem with security is kind of obvious… the bad people don’t want to be caught, so they keep changing tactics to get around your rules. You could still stop anything bad from happening, either by stopping everything from happening or by carefully reviewing everything, but neither of those is a realistic option for most enterprises. Instead, the security industry tends to drift toward CYA and theatrics.
Still, we can’t do nothing, and an effort that’s 75% effective is better than no effort at all. So you write rules for the things you want to see and do something about: *There was a root login on the server after some failed attempts.*
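For concreteness, here’s a minimal sketch of that first rule in Python, assuming syslog-style sshd log lines; the threshold and log path are illustrative, and a real rule would also bound the failure count to a time window.

```python
import re
from collections import defaultdict

# Illustrative patterns for syslog-style sshd lines; a production rule would
# also bound the failure count to a time window and handle log rotation.
FAILED = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+)")
ACCEPTED = re.compile(r"Accepted \S+ for (\S+) from (\S+)")
FAILED_THRESHOLD = 3  # failures before a success becomes interesting

def scan_auth_log(lines):
    """Yield an alert when root logs in after repeated failures from the same IP."""
    failures = defaultdict(int)  # (user, src_ip) -> failed attempt count
    for line in lines:
        if m := FAILED.search(line):
            failures[(m.group(1), m.group(2))] += 1
        elif m := ACCEPTED.search(line):
            user, src = m.group(1), m.group(2)
            if user == "root" and failures[(user, src)] >= FAILED_THRESHOLD:
                yield f"ALERT: root login from {src} after {failures[(user, src)]} failed attempts"
            failures.pop((user, src), None)  # reset the counter on success

with open("/var/log/auth.log") as f:  # path varies by distro
    for alert in scan_auth_log(f):
        print(alert)
```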
That works, but it has a long tail of noise generation — things that on their own are only minor issues, but can’t be ignored because what if they’re part of a larger effort. Root is allowed to log in via ssh at all, and that’s something you’d totally fix if you had the time and patience to explain to Bob why he has to log in as himself and sudo when necessary.
The next step is correlating rules: fancier rules that cross multiple streams of data to alert on really bad combinations of factors. These rules should fire less often and mean more when they fire. This is what SIEMs are for. *Bob is an Admin, Bob has just logged into a laptop, that laptop has a critical severity malware infection on it.* These are awesome! But wait, what if I want it to work across all of our laptops and data centers, matching Windows plus Sophos, Mac plus Bitdefender, and Linux plus ClamAV with one rule?
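Setting the cross-platform question aside for a moment, the single-platform version of that correlation is easy to sketch in plain Python. In production this would be a SIEM query, and every field name below is hypothetical:

```python
def correlate(admins, logins, infections):
    """Alert when an admin logs into a laptop that has a critical infection.

    admins:     set of usernames with admin rights (from the directory)
    logins:     iterable of {"user": ..., "host": ...} auth events
    infections: dict of host -> severity (from the EPP/EDR product)
    """
    for event in logins:
        if event["user"] in admins and infections.get(event["host"]) == "critical":
            yield f"ALERT: admin {event['user']} active on infected host {event['host']}"

alerts = correlate(
    admins={"bob"},
    logins=[{"user": "bob", "host": "laptop-042"}],
    infections={"laptop-042": "critical"},
)
print(list(alerts))
```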
To hit that goal I need an abstraction model: instead of rewriting that rule several times for all your laptop operating systems and EPP or EDR products, you abstract the sources of data into a data model of generic concepts like “authentication” and “infection”. In theory this makes maintaining and operating the rule system easier. If the abstraction is perfect, you can even put data collection and rule maintenance into two different teams of cheaper people, instead of bottlenecking all security operations on a small set of people who can understand the entire data collection to data transformation to rule firing pipeline.
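A toy version of that abstraction layer might look like the following; the raw event shapes are invented for illustration, and real products emit far messier payloads that need constant re-mapping:

```python
# Vendor-specific events in, generic data model events out.

def normalize(source, raw):
    if source == "sophos_windows":
        return {"model": "infection", "host": raw["Endpoint"],
                "severity": raw["ThreatSeverity"].lower()}
    if source == "clamav_linux":
        return {"model": "infection", "host": raw["hostname"],
                "severity": "critical" if raw["found"] else "none"}
    if source == "sshd":
        return {"model": "authentication", "host": raw["host"],
                "user": raw["user"], "outcome": raw["result"]}
    raise ValueError(f"unmapped source: {source}")

# One correlation rule can now match "infection" regardless of vendor.
event = normalize("sophos_windows", {"Endpoint": "laptop-042", "ThreatSeverity": "Critical"})
assert event == {"model": "infection", "host": "laptop-042", "severity": "critical"}
```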
Modeling data for information extraction can maximize system performance when you need to execute a lot of related searches at the same time. In practice, one of the reasons abstraction models reduce cost is that they slow detection down: if you’re putting a scheduled transformation in front of the alerting rule, there’s no point running the rule at any time other than just after the transformation refreshes. If the transformation runs hourly, the rule effectively runs hourly too, and the worst-case detection delay stretches to about an hour. That’s not exactly an advertised benefit of the concept, of course, and an attacker who can figure out the schedule has an advantage against the defenders.
Abstraction models are clearly very successful in security because there are so many to choose from.
All this costs a lot of time and money to set up and operate, not to mention the vendor lock-in aspect. It’s theoretically pretty easy to abstract simplistic tools like firewalls — there’s a clear in and out, and only two actions that matter. (That said, if you’d like to see how bad even simple things can get, take a look into the abyss.) But how much security-relevant commonality is there between Microsoft SQL Server, Oracle Database Server, and Snowflake? They’re all just databases, right? How about Google Workspace and Microsoft Office 365? Is there ever going to be a third vendor in that cloud productivity marketplace, and if so, will it have any overlap with whatever is common between Google Workspace and O365? Or do you end up trying to move thousands of event types from raw [syslog|xml|json|protobuf] into meaninglessly generic English subject-verb-object sentences? By the way, there is no defined boundary to that universe: you’ll get new types of events and changes to existing event types without warning. That event type you don’t see anymore: was it retired, did your collection break, or has it just become uncommon?
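That last question is at least partly automatable. Here’s a sketch of a “silent source” check that flags event types which were common in a baseline period but have stopped arriving; the event type names are made up, and it can only tell you that a human should go ask why:

```python
# Compare event types seen recently against a baseline. This can't tell you
# WHY a type vanished (retired? broken collection? just uncommon?), only
# that someone should investigate.

def silent_event_types(baseline_counts, recent_counts, min_baseline=100):
    """Return event types that were common in the baseline but absent recently."""
    return sorted(
        etype for etype, count in baseline_counts.items()
        if count >= min_baseline and recent_counts.get(etype, 0) == 0
    )

baseline = {"4624_windows_logon": 50_000, "gsuite_drive_share": 900, "rare_audit_event": 3}
recent = {"4624_windows_logon": 4_800}
print(silent_event_types(baseline, recent))  # ['gsuite_drive_share']
```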
Risk modeling helps
Abstraction models and rules aren’t going away any time soon, but there is a way to improve their performance. Using a risk model lets you move to a two-tier alerting system and focus more attention on the alerts that matter. Instead of letting every rule create alerts for people to look at, most rules modify a per-entity risk model instead. For instance, in the last hour *Account=Bob has accessed more than five S3 buckets (10 points), has triggered a Superman distance rule (2 points), has logged in from Albania (2 points), has a static IAM key associated (1 point) and has used a laptop with an infection (10 points)*. Then you have one rule for sending events to the security analysts to look at, which says *if risk points accrued by any user or server in the last six hours >= 25, fire an alert*. There are some challenges to constructing a good alert with the right set of data, but this approach avoids a lot of manually generated combinatorics.
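Here’s a minimal sketch of that two-tier pattern, using the point values from the example above; the in-memory storage, window, and threshold are all illustrative:

```python
import time
from collections import defaultdict

RISK_THRESHOLD = 25
RISK_WINDOW = 6 * 3600  # six hours, in seconds

risk_events = defaultdict(list)   # entity -> [(timestamp, points, reason)]

def add_risk(entity, points, reason, now=None):
    """Called by every contributing rule instead of alerting directly."""
    risk_events[entity].append((now or time.time(), points, reason))

def check_threshold(entity, now=None):
    """The one alerting rule: fire if points in the window cross the threshold."""
    now = now or time.time()
    recent = [(t, p, r) for t, p, r in risk_events[entity] if now - t <= RISK_WINDOW]
    total = sum(p for _, p, _ in recent)
    if total >= RISK_THRESHOLD:
        reasons = ", ".join(r for _, _, r in recent)
        return f"ALERT: {entity} at {total} risk points ({reasons})"
    return None

# The example from above, replayed:
for points, reason in [(10, "accessed >5 S3 buckets"), (2, "superman distance"),
                       (2, "login from Albania"), (1, "static IAM key"),
                       (10, "infected laptop")]:
    add_risk("Account=Bob", points, reason)
print(check_threshold("Account=Bob"))  # 25 points: the alert fires
```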
Another way that system designers try to squeeze more detectors from the same budget is heuristic algorithms, machine learning rules that look at a system or actor’s behavior over time instead of matching specific conditions. These anomaly detection algorithms were used in the old Anti-Virus and EPP (Endpoint Protection Platform) products, as well as the EDR (Endpoint Detection and Response) products that have replaced them. They are also common in the UEBA (User and Entity Behavior Analytics) feature of SIEMs, as well as fraud prevention programs. It’s critical to remember that “anomalous” is not the same as “bad”; used without risk modeling frameworks they just produce a lot of false positive noise to sift through. Used with a risk system, these techniques can usefully increase entity risk scores and push real incidents into an analyst’s attention.
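As a sketch of that hand-off, here’s a deliberately simple z-score detector that converts “how unusual is today?” into a bounded risk contribution rather than an alert of its own; real UEBA features are considerably fancier:

```python
from statistics import mean, stdev

def anomaly_points(history, today, max_points=10):
    """Convert 'how unusual is today?' into a bounded risk contribution."""
    if len(history) < 2:
        return 0                      # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return 0
    z = abs(today - mu) / sigma
    if z < 3:
        return 0                      # within normal variation: stay quiet
    return min(int(z), max_points)    # anomalous != bad, so cap the contribution

# Bob usually touches ~10 files a day; today it was 400.
history = [8, 12, 9, 11, 10, 13, 9]
points = anomaly_points(history, 400)
# add_risk("Account=Bob", points, "file access volume anomaly")  # per the sketch above
print(points)
```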
The cost of seeing each thing comes from your ability to see more things
Detecting security incidents is not free, and there’s only so much budget allocated. Somewhere between zero and a lot, right? So the techniques and tools that can be used in an organization have to fit into the budget that organization allocates to security. Writ large, that’s the range from “Alice spends an hour on log review whenever a meeting is canceled” to “the blue team and the red team have pizza with their favorite vendors every week”. Writ small within a security analysis tool, it’s how many searches of what types at what frequency. Any fancy stuff with scheduled transformations, risk models and algorithms comes off the top before you get to using rules that your entry level responders can follow.
In security, it’s the active response to an observation that typically causes change to a system, so analysts tend to be cautious about responding until they’re sure they know as much as possible. The defender might let the attacker know that they’ve been detected by closing an access route, taking something offline, or increasing logging activity. Closing routes and stopping services have operational impact as well, which is another reason analysts might go easy on that button. The number one use case I’ve seen for SOAR (Security Orchestration, Automation, and Response) is to increase telemetry collection from an EDR agent or sidecar in response to a perceived attack, helping the analyst be sure the attack is real. Absent a response, the observation that rules and transformations typically operate in a zero-based resource budget is not exactly a match for Heisenberg’s principle that observing a particle disturbs its position and/or momentum… maybe more a match for “Werner Heisenberg usually had the budget to pursue his theories”… but that observation still matters a lot.
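That SOAR use case is simple enough to sketch. The endpoint, auth scheme, and “verbose” level below are all hypothetical, since every EDR vendor exposes something different, but the shape is typical: on a triggering alert, turn up the telemetry dial on the affected host rather than touching the host itself.

```python
import requests

EDR_API = "https://edr.example.internal/api/v1"   # hypothetical endpoint

def escalate_telemetry(host_id, api_token, duration_minutes=60):
    """Ask the EDR agent for more detail so the analyst can confirm the attack."""
    resp = requests.post(
        f"{EDR_API}/hosts/{host_id}/telemetry",       # hypothetical route
        headers={"Authorization": f"Bearer {api_token}"},
        json={"level": "verbose", "duration_minutes": duration_minutes},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

# Wired into a SOAR playbook: alert fires -> escalate_telemetry(alert.host, token)
# -> analyst reviews the richer data before anyone closes a route or kills a box.
```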
Bigger budgets mean more ability
More resources mean that you can run more rules… but more importantly, you can run better rules, spending more time on the design, implementation, and tuning of complex systems. For instance, anomaly detection systems are sometimes considered dangerous to use in general cyber security but have done better in finance. This is not because the technology, the use case, or the decisions to make are wildly different; it’s because a fintech or insurance company or bank has more budget allocated to fraud prevention than a retailer or manufacturer or hospital has allocated to stopping cyber attacks. Those links compare the estimated budget for fraud prevention (US$9.3B in 2022) to the estimated total addressable market of SIEMs, of which UEBA is a sub-feature (US$4.21B in 2021). The data manipulation techniques are pretty much the same (abstraction, cluster detection, behavior analysis) and the output is pretty much the same (allow this or don’t, raise this to a human or don’t), but the bank can staff data scientists to tune the system and responders to handle less-than-perfect output. Most other businesses cannot staff security data specialists, but they might be lucky enough to have a security analyst or two with an interest in data science. If those analysts can make the time to manage their rules and tools, they can squeeze more security value from their budget.
That increased value might take a few forms, such as:
- simpler rules being “pushed left” into earlier systems. If you can process known-known badness at the edge, you can increase speed and reduce cost (a sketch follows this list).
- risk based analysis on top of behavior analytics and correlation rules
- automated responses where possible, run-books where they’re not
- frequent review and maintenance of the system
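As promised, a minimal sketch of the “pushed left” idea from the first bullet; the indicator set and event shape are invented for illustration:

```python
# Drop or tag known-known badness at the collection edge so the SIEM never
# has to spend search budget on it.

KNOWN_BAD_IPS = {"203.0.113.7", "198.51.100.23"}   # e.g. from a threat-intel feed

def edge_filter(event):
    """Return (event, verdict) so cheap certainty is handled before ingestion."""
    if event.get("src_ip") in KNOWN_BAD_IPS:
        return event, "block_and_log"    # handled at the edge: fast and cheap
    return event, "forward_to_siem"      # everything else still flows downstream

print(edge_filter({"src_ip": "203.0.113.7", "action": "login"}))
```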