
On Product Compromise

"Have you no feelings!" declaims a large and personified carrot of a young girl who has just pulled some carrots from the ground

Product management is full of compromises, and one of the things about compromises is that they definitionally leave everyone at least a little bit unhappy. How you handle that disappointment is what distinguishes better product management from the less good. For clarity, I’m going to get more detailed about a specific situation than I usually do in this blog; no technical detail is going to be discussed that couldn’t be seen in the public documentation and release notes, but the product and company are identifiable to those who care.

The Problem

The product module in question was a conceptually simple data pump that had picked up a bunch of ETL complexity through years of feature addition. Customers and field engineers complained regularly about its complexity and brittleness. Because it was the only tool for emitting alerts and reports, any breakage caused highly visible failure that affected the entire company’s position with customers. While the architecture was fundamentally scalable enough for its environment and sound enough in execution, it was complex to extend and integrate. Fiendishly subtle bugs were legion, and difficult to prove or disprove. The icing on the cake? A tiny team: three people when I took on the project, handed off to a different two-person team shortly after.

Many engineers reading this will stop right there and say “obviously the right thing to do is to stop maintaining this: allocate a properly sized team to rearchitect the project to better meet customer needs.” To which the business responds “not today”. This story is not about the architecture or team size, though.

My first move was to review support cases, bug tickets, and feature requests, looking for big common problems. And I found one: the most popular and expensive use case for our company’s customers produced a big blob of JSON as an alert. This blob was arbitrarily sized and of unknown depth, with lots of challenging text in it: strings in any human language (thankfully Unicode), Unix or Windows path names, epoch and locale-specific times and formats. It’s complex stuff, and very valuable. Customers would send it into their security data processing pipelines, all of which were natively adept at handling JSON, so everyone should have been happy as long as the pump worked.

Except that high dozens of tickets per month centered on this feature, because lots of customers were embedding the JSON blob into a syslog frame before sending it. Doing so broke the receiving data sink’s ability to recognize the payload as JSON, so they lost native parsing. They’d then write regular expressions to grab out the bits of data they wanted… but remember, the blob’s structure is arbitrary, so the regexes didn’t work well. And any time our upstream data source changed, including because of changes in the Linux or Windows operating systems, these extraction hacks would break. Another nasty symptom: some alerts would not transit the customer’s pipeline at all. That last one is because syslog messages have a maximum size, and our alert blobs would often exceed it. Usually the result was truncation, but sometimes delivery would just fail entirely.
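To make that failure mode concrete, here is a minimal sketch of the framing problem. Everything in it is an illustrative assumption rather than the product’s actual behavior: the field names, the alert size, and the 1024-byte limit (a common RFC 3164 convention) are invented, but the mechanics are the same. Wrap a large JSON blob in a syslog line, let a relay truncate it, and the sink can no longer parse it.

```python
# Minimal sketch (not the vendor's actual pipeline) of why wrapping a JSON
# alert in a syslog frame loses data: many relays cap messages around the
# RFC 3164 convention of 1024 bytes, so large alert blobs get truncated
# into invalid JSON.
import json
from datetime import datetime, timezone

MAX_SYSLOG_LEN = 1024  # illustrative limit; real relays vary

# Hypothetical alert blob: arbitrary depth, arbitrary size, mixed content.
alert = {
    "rule": "suspicious-process",
    "host": "büro-01",  # non-ASCII hostname
    "path": "C:\\Users\\Admin\\AppData\\Local\\Temp\\x.exe",
    "observations": [{"detail": "x" * 50} for _ in range(40)],
}
payload = json.dumps(alert)

# Embedding the blob in a syslog line, as many customers did.
pri, tag = "<134>", "alertpump"
timestamp = datetime.now(timezone.utc).strftime("%b %d %H:%M:%S")
frame = f"{pri}{timestamp} collector {tag}: {payload}"

# A downstream relay truncates the oversized frame.
truncated = frame[:MAX_SYSLOG_LEN]
body = truncated.split(f"{tag}: ", 1)[1]

try:
    json.loads(body)
    print("parsed cleanly")
except json.JSONDecodeError as err:
    print(f"sink can no longer parse the alert: {err}")
```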

How did this situation happen? A mix of poor UX design and mistaken tribal knowledge. Many customers had syslog transit hubs built into the hearts of their data collection systems. Not realizing that both of the important syslog implementations have native JSON support, people assumed they had to embed the JSON alerts into syslog messages before handing them to a syslog forwarder. Not to mention how easy it is to skip syslog entirely and fire JSON into the sink directly. That said, it was easy to end up here without thinking that hard: the user experience workflow encouraged you to re-use previously configured destination sinks, and didn’t make it clear that activating filtration features would also change format and framing.

Besides, if you’re not deeply versed in how the receiving data sink presents different types of data, you’re probably not even going to perceive that there’s a problem: I sent data, data is there, the end. Our field engineers and customers saw text, and often didn’t realize that what they were looking at was a mangled blob that should have been automatically parsed into fields and values. So they’d crack out the regular expressions and get to work building brittle extractions. Have you ever seen the backslash escaping that results from Unicode in JSON in syslog? It’s not pretty.
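For a sense of what that looks like, here is a small, invented illustration (not actual product output) of the escaping that piles up once an alert with Unicode and Windows paths is JSON-encoded, framed as syslog, and then stored as a string field by the receiving collector:

```python
# Invented example of the "backslash soup" field engineers saw: a JSON alert
# with Unicode and Windows paths, escaped once by the JSON encoder and again
# when the syslog line is itself re-serialized downstream.
import json

alert = {
    "user": "Renée Müller",
    "path": "C:\\Windows\\System32\\drivers\\etc\\hosts",
}

once = json.dumps(alert)  # \u00e9 and doubled-backslash escapes appear here
syslog_line = f"<134>Jan 01 00:00:00 host app: {once}"
twice = json.dumps({"raw": syslog_line})  # collector stores the line as a string field

print(once)
print(twice)
# "Ren\u00e9e" becomes "Ren\\u00e9e", and every backslash in the Windows path
# doubles again: that is the text the regex authors were actually matching against.
```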

And so: large numbers of cases after every change anywhere. Customers were unhappy because their alerts were getting lost and mangled, and each case took hours to figure out, leaving field engineers unhappy as well. I saw fixing this as a good bang-for-buck opportunity.

The Solution

Given the opportunity, and remembering that this was an overwhelmed engineering team without a lot of time for experimentation… how to select what to do?

I am fond of simple framework tools, and my go-to in this case is the Decision Matrix: put your best three ideas under Do Nothing (obvious), Incremental (use existing resources and components differently), and Radical (change resources and/or components), then list the single most important pro and con for each. Note that you can and should rerun this matrix a few times, because lots of ideas fit into those three buckets, and the framework helps you identify where the myriad options land in feasibility. The final form of my matrix for this decision was something like this:

  • Do Nothing… I could tweak the docs, write a blog post, or train the field engineers, and leave the developers out of it. The pro is always obvious here: doing nothing usually costs little. The con in this case was that the data showed the status quo was costing us hard money and soft goodwill.
  • Incremental… we considered adding a bunch of in-app warnings explaining the problem, but mock-ups made it clear that there were leaky paths in configuration that would still let the customer do it by accident. A better choice was to simply remove the ability to create the troublesome configuration. Pro: halts production of new support problems. Con: moves the cheese for existing users who’ve worked around the problem.
  • Radical… the radical choice was to start over with a next-generation design that would be easier to maintain. The pro would be shiny, modern software, but the cons were numerous: the opportunity cost made it a non-starter, it wasn’t crystal clear how a new design would actually fix this particular problem, and even if it did, that gratification would take several quarters of sustained effort to reach. Nope.

So, we made a single change: if the source of the data stream was the popular JSON alert blob, then we’d disallow selection of a syslog format. Existing connections were not modified, and we didn’t enforce this in the backend: so if you really wanted to have syslog-encapsulated JSON you could still export the configuration, modify its definition, and reimport.
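In rough pseudocode terms, the change was a UI-level guard rather than a hard backend rule. The names and shapes below are hypothetical, not the product’s actual configuration model, but they capture the behavior described above:

```python
# Hypothetical sketch of the kind of guard described above: the UI stops
# offering syslog framing when the source is the JSON alert blob, while the
# backend keeps accepting imported configurations unchanged.
from dataclasses import dataclass

FORMATS = ["json", "syslog", "cef"]  # illustrative format list


@dataclass
class StreamConfig:
    source: str   # e.g. "json_alert_blob", "audit_log"
    framing: str  # one of FORMATS


def selectable_formats(source: str) -> list[str]:
    """Formats the UI offers when creating a new connection."""
    if source == "json_alert_blob":
        return [f for f in FORMATS if f != "syslog"]
    return list(FORMATS)


def accept_imported(config: StreamConfig) -> bool:
    """The backend still accepts syslog-framed JSON if you import it deliberately."""
    return config.framing in FORMATS


print(selectable_formats("json_alert_blob"))  # ['json', 'cef']: syslog is not offered
print(accept_imported(StreamConfig("json_alert_blob", "syslog")))  # True: the escape hatch stays open
```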

The Result

As always, the incremental compromise move had mixed results. Overall the impact was quite positive, but it made some people unhappy.

Good: the cluster of support cases, bugs, and enhancement requests that clued me into this problem immediately began to decline, and within six months was no longer detectable. Problem solved: many hours per week of field engineering and developer time were recovered, and we produced much better results for customers. This was especially true for the all-important new customers, whose first experience with the product was no longer mediated by a dive into regex101.

Bad: vocal disappointment from existing customers. While I did spend time socializing the change, and we did make sure that existing configurations were not altered, there were still a number of customers who had built massive towers of regular expressions and didn’t want to abandon or revisit that work. There was also a lot of overlap between that type of customer and the type of customer with tightly controlled and infrequent change windows… so they didn’t upgrade to the version with the change for many months, and didn’t create new configurations for many more months. I was still hearing occasional complaints about this change over four years after it was made.

I could have done a number of things differently. I could have phased the change in more slowly and advertised it more: this would have certainly delayed its benefit, and possibly reduced complications. I also could have done nothing and waited for the problem to go away: looking back now from a cloud-native SaaS, the whole conversation about appliances and syslog is a little quaint, though all of the components are still supported and actively used.

Ultimately I made the move and let the decision stand because our model devoted field resources to customer support. Because each customer who was going to be challenged by this change had at least one dedicated field engineer, I made the call, devoted time to internal enablement, and let the results roll. Had the model been more self-serve, I might have spent more effort on enabling but discouraging the “problem” behavior, but I can’t be sure.

