Security products need to detect known knowns, so they build up a corpus of rule content. This corpus grows faster than it shrinks, if it’s maintained at all: new known bad is found at a rapid clip, while software is retired from use very slowly.
There are two constraints on security products’ ability to use that rule content: detection gets slower as more rules are added, and the more alerts produced, the lower the quality of the human response to each one.
First constraint: Detection speed caps come from the resource allocation to this function, which is influenced by stack location (endpoint agent, container sidecar, network or cloud service, SIEM) and budget priorities. For example, an endpoint agent or sidecar container might be constrained to a few percent of CPU usage, or a SIEM might be limited to the documented minimum number of cores. The organization can do some balancing between locations, pushing appropriate detections to the edge or the center. The resource budget of each of these locations is best expressed as the number of concurrent things that can be done, which is to say: you can get lots of rules tested if you’re cool with a 24-hour alert cycle time. If you want to know about an attack in real time, you’ll need more tests running concurrently, and you’ll consequently spend more resources to get that awareness.
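As a rough illustration of that trade-off, here is a back-of-the-envelope sketch. Every number in it is hypothetical, chosen only to show the shape of the math, not a measurement from any product:

```python
# Back-of-the-envelope sketch of the concurrency/latency trade-off described above.
# All numbers are hypothetical illustrations.

def alert_cycle_hours(rule_count: int, seconds_per_rule: float, concurrent_slots: int) -> float:
    """Hours to evaluate the whole rule corpus once, given how many rules can run at a time."""
    sequential_batches = rule_count / concurrent_slots
    return sequential_batches * seconds_per_rule / 3600

# A tightly constrained endpoint agent might only afford a couple of rules at a time.
print(alert_cycle_hours(rule_count=5000, seconds_per_rule=30, concurrent_slots=2))   # ~20.8 hours
# Spend more resources (more concurrent slots) and the same corpus finishes much sooner.
print(alert_cycle_hours(rule_count=5000, seconds_per_rule=30, concurrent_slots=50))  # ~0.8 hours
```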
Second constraint: Analyst capacity to respond to the alerts. This constraint is driven by the obvious factor of how many alerts versus how many analysts. Another critical factor is how much effort each alert consumes. An alert that is fully understood can be automatically responded to, perhaps by a SOAR. An alert that requires hours of attention either gets that attention or is ignored. In between those two extremes, the alerts are probably going to be ignored.
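To make the capacity math concrete, here is a hypothetical calculation (every figure is invented for illustration):

```python
# Hypothetical numbers only: how far analyst capacity stretches depends heavily
# on how much effort each alert consumes.
analysts = 3
shift_minutes = 8 * 60                      # one shift per analyst per day
daily_capacity = analysts * shift_minutes   # 1,440 analyst-minutes per day

quick_triage_minutes = 15                   # runbook-driven alert
deep_dive_minutes = 240                     # alert that needs hours of attention

print(daily_capacity // quick_triage_minutes)   # 96 alerts/day if every alert is a quick triage
print(daily_capacity // deep_dive_minutes)      # 6 alerts/day if every alert is a deep dive
```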
The first constraint is all about customer budget and architecture, which are largely out of a vendor’s hands. But as a security vendor, we can possibly help with the second constraint by allowing human reinforcement, sharing data pipelines, and improving user experience design.
Reinforcement is the necessary component for making useful cognitive computing tools for human-facing problems. A Bayesian model on its own can generate rules and weight those rules, but all it’s doing is reacting to inputs. Without input from the context-owning human users, it will go wrong. Allowing users to provide feedback on rules lets a system move those rules up and down the resource priority scale, so that a lower-quality rule is run less often. Note that this idea can lead to unintended consequences if it’s implemented without administrator oversight, such as rules running far less often than stakeholders expect. I wish that all organizations could accept that their security intentions are not supportable by the resources they’ve allocated, but not all of them can, and a tool that auto-tunes to fit the real capability will produce surprise, anger, and disappointment when alert speeds aren’t what was desired. Counter-intuitively, it’s better to attempt what was asked and fail, even though this produces a terrible experience for everyone involved with the system.
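One possible shape for that feedback loop, sketched with illustrative names and a simple score-to-interval mapping (not any particular product’s implementation), keeping an administrator-set ceiling so a rule can’t silently drift toward never running:

```python
# Minimal sketch of feedback-driven rule scheduling, assuming a simple
# score-to-interval mapping. Class and field names are illustrative only.
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    quality: float = 1.0               # analyst feedback accumulates here
    admin_max_interval_min: int = 60   # administrator oversight: never run less often than this

def record_feedback(rule: Rule, useful: bool) -> None:
    """Analysts nudge a rule's quality up or down; the score stays within [0.1, 2.0]."""
    rule.quality = min(2.0, max(0.1, rule.quality + (0.1 if useful else -0.1)))

def run_interval_minutes(rule: Rule, base_interval_min: int = 15) -> int:
    """Lower-quality rules run less often, but the admin-set ceiling still applies."""
    interval = int(base_interval_min / rule.quality)
    return min(interval, rule.admin_max_interval_min)

noisy = Rule("legacy-ftp-indicator")
for _ in range(8):
    record_feedback(noisy, useful=False)   # repeated "not useful" feedback
print(run_interval_minutes(noisy))         # demoted, but capped at the 60-minute ceiling
```

The design choice that matters here is the ceiling: feedback tunes priority within bounds the administrator still owns.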
A shared data pipeline approach looks at the rules running in a given system and consolidates similar components. For instance, if there are fifteen rules that all look at network flows, the SIEM can save time by calculating the src-dest activity table once and putting those fifteen rules downstream of it. Of course, this means those rules are no longer able to provide real-time data; in addition to waiting for the raw data to be ingested and normalized, they must also wait for the table update to occur before they can look for the known bad indication. In fact, these rules may even be executed less frequently, since there’s little point in running them at any time other than ASAP after the table update. A further drawback of this approach is that it requires human data engineering expertise. Note that use of Threat Intel is a special case of this approach: it’s a single rule that depends on external data updates that are periodically retrieved, but the rule content and the data pipeline can both be updated, so there are multiple useful execution windows.
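A minimal sketch of that consolidation, assuming made-up flow records and thresholds: the src-dest activity table is built once, and the individual rules become cheap lookups against it.

```python
# Sketch of the shared-pipeline idea: aggregate network flows into a src-dest
# activity table once, then run several rules against that table instead of the
# raw flow stream. Field names and thresholds are illustrative only.
from collections import defaultdict

flows = [
    {"src": "10.0.0.5", "dst": "203.0.113.9", "bytes": 1_200},
    {"src": "10.0.0.5", "dst": "203.0.113.9", "bytes": 900_000},
    {"src": "10.0.0.7", "dst": "198.51.100.4", "bytes": 300},
    {"src": "10.0.0.8", "dst": "192.0.2.10", "bytes": 120},
    {"src": "10.0.0.8", "dst": "192.0.2.10", "bytes": 150},
]

# The expensive step, done once per update window rather than once per rule.
activity = defaultdict(lambda: {"count": 0, "bytes": 0})
for f in flows:
    key = (f["src"], f["dst"])
    activity[key]["count"] += 1
    activity[key]["bytes"] += f["bytes"]

# Downstream rules are now cheap scans over the shared table.
def rule_high_volume(table):    # e.g. possible exfiltration
    return [k for k, v in table.items() if v["bytes"] > 500_000]

def rule_beaconing(table):      # e.g. repeated low-volume connections
    return [k for k, v in table.items() if v["count"] > 1 and v["bytes"] < 10_000]

print(rule_high_volume(activity))   # [('10.0.0.5', '203.0.113.9')]
print(rule_beaconing(activity))     # [('10.0.0.8', '192.0.2.10')]
```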
Finally, the system’s user experience can be designed to surface priority in order to drive simplicity, echoing an Anton Chuvakin blog post calling for products to be as simple as possible (but not too simple). While the full output from dozens of known known rules is often requested during the product selection process, it’s usually unusable in production. To help the analysts do their job, a good security product should prioritize the events it produces, putting higher impact issues first so that limited time can be spent where it’s most effective. From a complexity standpoint, this means designing two different experiences in the same product:
- SOC: low-complexity experience for a low-complexity need. Analysts will process the notable events in order of priority and simplicity. Processing means following a predetermined runbook of activities, and the time allocated per task is small.
- Hunt: high-complexity experience for a high-complexity need. Analysts will work directly with the data lake and search language.
Events should also be distinguished by their source: known knowns from the standard rule corpus go into the SOC workflow, and unknown unknowns from analytics, new community rules, and internal analyst notebooks go into the Hunt workflow. A third valuable user experience is self-reporting, tracking the KPIs and SLOs of the SOC for use in driving improvement and budgeting; however, this can often be produced elsewhere or simply ignored.
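A toy sketch of that routing and ordering, with hypothetical event fields and an invented impact scale, just to show the split between the two workflows:

```python
# Known-known detections feed the SOC queue, sorted highest impact first;
# everything exploratory feeds the Hunt queue. All events here are invented.
events = [
    {"title": "EDR: known C2 domain contacted", "source": "rule_corpus", "impact": 9},
    {"title": "Notebook: anomalous service account logins", "source": "analyst_notebook", "impact": 6},
    {"title": "New community rule: suspicious LOLBin chain", "source": "community_rule", "impact": 7},
    {"title": "AV: adware signature match", "source": "rule_corpus", "impact": 2},
]

soc_queue = sorted((e for e in events if e["source"] == "rule_corpus"),
                   key=lambda e: e["impact"], reverse=True)      # runbook-driven, priority order
hunt_queue = [e for e in events if e["source"] != "rule_corpus"]  # open-ended investigation

for e in soc_queue:
    print("SOC :", e["impact"], e["title"])
for e in hunt_queue:
    print("HUNT:", e["title"])
```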