
Using a Data Lake for Business Insight

[Image: a flow chart describing the lyrics of the 1980s Bonnie Tyler hit "Total Eclipse of the Heart" as a business process]

At a former employer, some of us used to joke internally that we made the world's cheapest business intelligence tool and the world's most expensive log search tool. Business intelligence (BI) use cases are cheap from a data platform perspective, because value and volume are inversely proportional. All the work is in complex, customer-specific flows like "how does the cat food get on the shelf?" and "when do we decide to open or close a location?" BI-oriented visualization and analysis tools understand this, focusing on Enterprise Data Warehouse (EDW) or spreadsheet backends instead of data lakes, and licensing per user or as a service instead of by data volume.

That means you won't see a lot of product encouragement for building a BI use case on a data lake backend, which is a shame for customers. If you can invest some time, you can solve BI problems with the cost profile of open source and the support model of enterprise software. Instead, the data lake vendors are going to focus on extracting signal from extremely voluminous data sets, the sort of data that has little value unless you apply effort at a scale only they can provide. Therefore, data lake tools tend to focus on machine data, observability, and security. The line between EDW and data lake can be fuzzy, but I tend to think of the interface as the divider: classical EDWs like Oracle typically just offer SQL (Structured Query Language), while a more modern data lake will have its own search language.

I’ve written many times about enterprise software as an engine needing fuel. In the case of a business events logging tool for use in observability or SIEM (Security Information and Event Management), the fuel starts with a customer-defined list of events, metrics, traces, and reference data, constrained by the effort required to model it and the cost impact to ingest and search it. As hard as that part is (and it is hard, with problems like handling volume, back pressure, complete-but-not-duplicated delivery, redaction of sensitive data, selecting an efficient format, using secure and reliable transport, and routing to the best storage), it’s the relatively simple part. Insofar as the data sources for your BI problems are the same as for your IT problems, data collection is already handled. Where they don’t overlap, you’ll need to get the data in. Side note: yes, technically, you could extract information about the movement of cat food from deep packet inspection of your network traffic… but that would be a dumb use of tremendous amounts of effort, and it’s much simpler to just read database records. Even if that database is Alice’s scary Excel sheet.
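Where you do need to get business data in, "just read database records" can be as small as an export script. Here’s a minimal Python sketch of that idea; the table, columns, and output path are hypothetical, and a real pipeline would still need the delivery, redaction, and routing concerns above:

```python
# Minimal sketch: poll a table and emit one JSON event per row for ingestion.
# The inventory_moves table and its columns are hypothetical examples.
import json
import sqlite3
from datetime import datetime, timezone

def export_inventory_events(db_path: str, out_path: str) -> None:
    """Read inventory movements from a database and write JSON-lines events."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT moved_at, sku, store_id, quantity FROM inventory_moves"
    )
    with open(out_path, "a", encoding="utf-8") as out:
        for row in rows:
            event = {
                "timestamp": row["moved_at"],  # when the cat food moved
                "sku": row["sku"],
                "store_id": row["store_id"],
                "quantity": row["quantity"],
                "collected_at": datetime.now(timezone.utc).isoformat(),
            }
            out.write(json.dumps(event) + "\n")  # one event per line
    conn.close()
```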

Transforming business data into information that powers the reports and KPIs you’re trying to measure is the next problem, and that’s where visualization tools and search interface languages start to matter. BI tools tend to be stellar at visualization, though some bundled freebies only land at "marginally acceptable." Data lakes usually aim at a simpler level of reporting. It’s important to focus on what you need to improve your business instead of flashy visualizations that go beyond that need; you’re not going to build something for the NYT here. Instead, the focus should be on powerful and elegant data transformation, at scale, using search languages.
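To make that concrete, here’s a rough pandas sketch of the filter-then-aggregate shape that a data lake search language would express in a line or two; the event fields and the fill-rate KPI are assumptions for illustration, not a reference design:

```python
# Rough pandas equivalent of a search-language pipeline:
# filter events, bucket by day, report a KPI. Field names are assumed.
import pandas as pd

def daily_fill_rate(events: pd.DataFrame) -> pd.DataFrame:
    """Compute a simple KPI: share of restock requests fulfilled per day."""
    df = events.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)
    df["fulfilled"] = df["status"].eq("fulfilled")
    return (
        df.set_index("timestamp")
          .resample("D")["fulfilled"]
          .mean()  # fraction of requests fulfilled each day
          .rename("fill_rate")
          .reset_index()
    )
```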

A BI use case typically means tiny amounts of structured or semi-structured data, which can fit just fine in spreadsheets, relational databases, EDWs, or Snowflake. Why not hold fast to that status quo and stick to pure-play analytics tools on top? Three reasons. The first is time: data lakes typically do a much better job of putting historical data into time series, making it a lot easier and more accurate to answer quarter-over-quarter questions or do predictive analysis. The second is correlation: it’s much easier to mix your BI data with your unstructured machine data when it’s useful to do so, which gives you the benefits of observability and monitoring in your BI. The third is discoverability: with a BI tool on top of a database, you’ve got to understand the schema before you can solve the problem. If you’re the only one producing and collecting the data, and you intend to stay in that role for the life of the company, everything’s great! But if you or anyone else is going to need to discover schema, learn what’s valuable, and extract information from data in someone else’s playground, it’s not so fun with inadequate tools.
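As a sketch of the time argument: once events carry real timestamps, quarter-over-quarter questions reduce to a short groupby. Column names here are assumed, not prescribed:

```python
# Hedged sketch: summarize an event stream by calendar quarter and
# compute quarter-over-quarter change. Column names are assumptions.
import pandas as pd

def quarter_over_quarter(events: pd.DataFrame, value_col: str = "quantity") -> pd.DataFrame:
    """Total a value per quarter and report the change from the prior quarter."""
    df = events.copy()
    df["quarter"] = pd.to_datetime(df["timestamp"]).dt.to_period("Q")
    summary = df.groupby("quarter")[value_col].sum().to_frame("total")
    summary["qoq_change"] = summary["total"].pct_change()  # 0.05 means +5%
    return summary
```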

Those problems are generally well-solved by the data search and processing capabilities of a data lake tool. Since they’re designed to correlate data, enrich data, organize time streams, and report on the results, they typically excel over the spreadsheet interface of a more classic database or EDW (pun completely intended). I’ve seen teams build amazing solutions this way: monitoring for retail JIT (Just In Time) logistics, vulnerability assessment and patch prioritization that used CMDB context, even R&D team load and throughput assessment.
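The CMDB example maps to a simple enrichment join. Here’s a hedged Python sketch of that pattern only; the join key, fields, and severity-times-criticality scoring rule are my assumptions, not any team’s actual logic:

```python
# Enrichment sketch: join machine-generated vulnerability findings to
# CMDB reference data, then rank. All field names here are assumptions.
import pandas as pd

def prioritize_patches(findings: pd.DataFrame, cmdb: pd.DataFrame) -> pd.DataFrame:
    """Rank findings by severity weighted by business criticality of the asset."""
    enriched = findings.merge(cmdb, on="host", how="left")  # adds owner, tier, etc.
    enriched["priority"] = enriched["cvss"] * enriched["asset_criticality"]
    return enriched.sort_values("priority", ascending=False)
```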

In an ideal world, using common tooling such as the data lake between IT’s big data use cases and the business intelligence needs of planners produces better results for both sides. While the initial benefit of sharing cost should be obvious, sharing the event-based logging mindset and basic security awareness is an even bigger win. The biggest opportunity of all is to improve the data generation end of the pipeline; with IT’s help, the business intelligence analyst may be able to improve the structure of the forms, scripts, and tools that generate data for later use. Closer relations between the data analyst, data producer, and data consumer help everyone involved: less time forcing real data into a model of an abstraction, more time answering questions the business needs answered.
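As a sketch of what improving the generation end can look like: a script that emits schema’d JSON events instead of free-form text saves every downstream analyst a parsing step. The helper and field names here are illustrative only:

```python
# Sketch: emit structured, timestamped JSON events instead of free-form
# print statements, so downstream consumers never have to regex them apart.
import json
import sys
from datetime import datetime, timezone

def emit_event(action: str, **fields) -> None:
    """Write one structured event per line to stdout for collection."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,
        **fields,  # caller-supplied structured detail
    }
    print(json.dumps(event), file=sys.stdout)

# Usage: instead of print("opened store 42"), emit a queryable record.
emit_event("store_opened", store_id=42, region="northwest")
```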

Be warned: there is not a lot of support if you choose to go down this path of driving business intelligence from the data lake. As stated before, the data lake vendor probably won’t have much product support, though you may get a lot of help from interested field engineers. The roll-your-own data solution community also won’t be much help, as they tend to go a step further into Jupyter and Pandas, or even RStudio. More power to those folks, but there’s a hefty gap between “willing to solve a data problem with office tools” and “willing to learn Python or R to develop solutions.” That gap can be filled with a data-lake-based solution.

I’m a big fan of event-based thinking; at the end of the day, that’s what observability is all about. As complex systems begin to show emergent and chaotic behavior, you’ve got to use the events (or metrics which turn into events, or traces which turn into events) that you can reach to understand what’s happening and decide what to do. That’s just as true of the business operations in an organization as it is of the software systems that support them.

