Thinking about how a SALAMI like CoPilot is supposed to make systems programming cheap and easy, in light of my challenge this week…
The general problem statement: as a security analyst I want to know as much about an IP address as possible so I can make an informed decision, write rules, &c.
The specific feature area: I want to efficiently collect all the IaaS vendor IP ranges so I can compare observed addresses against them. Am I seeing unexpected traffic to a system?
That’s actually not so hard for AWS, GCP, and OCI. But Microsoft gonna Microsoft, and Azure is more difficult. Here’s the proof of concept and simple fetcher that I wrote with the help of lots of StackOverflow. Let’s posit that a SALAMI could come up with most of this, at least well enough to make it a debugging problem for a human (momentarily leaving aside that debugging is harder than writing it in the first place).
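For contrast, the easy case really is one GET plus a containment check. Here’s a minimal sketch against AWS’s published list at https://ip-ranges.amazonaws.com/ip-ranges.json (the function names and test address are illustrative, not lifted from my fetcher):

```python
import ipaddress
import requests

AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

def fetch_aws_prefixes():
    """Pull AWS's published IPv4 ranges and parse them into network objects."""
    data = requests.get(AWS_RANGES_URL, timeout=10).json()
    return [ipaddress.ip_network(p["ip_prefix"]) for p in data["prefixes"]]

def in_aws(addr, prefixes):
    """Is this observed address inside any AWS range?"""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in prefixes)

prefixes = fetch_aws_prefixes()
print(in_aws("52.94.76.1", prefixes))  # expect True for an address in AWS space
```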
But now, the specific ticket for this week: I want to use this stuff in a product, not a terminal, and the product environment isn’t going to just shell out and execute my Python. I have to do that Azure fetch work in a Lambda and write the output to a bucket for fetching.
- Hey SALAMI, make this code work in Lambda. 🤣
- Hey SALAMI, write me a Lambda function that finds the download links in the source of these three pages, pulls the JSON from those download links, and posts it into an S3 bucket.
First, drop the local and straight-to-Observe options. Second, drop the authentication, because AWS policies will take care of that (note that SALAMI can’t help with setting those up, and it took 90 minutes of troubleshooting to get IAM, Lambda, S3, and EventBridge singing together, with one false victory lap in the middle related to new object creation).
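To give a sense of what “singing together” involves, here’s a sketch of just the EventBridge-to-Lambda wiring in boto3 (the names and schedule are made up, and the IAM role and bucket policy work, which is where the 90 minutes actually went, isn’t shown):

```python
import boto3

# Hypothetical names; the real pain is the IAM role/policy plumbing not shown here.
RULE_NAME = "refresh-azure-ranges"
FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:azure-range-fetcher"

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Fire the fetcher once a day.
rule_arn = events.put_rule(
    Name=RULE_NAME,
    ScheduleExpression="rate(1 day)",
)["RuleArn"]

# Point the rule at the function...
events.put_targets(Rule=RULE_NAME, Targets=[{"Id": "1", "Arn": FUNCTION_ARN}])

# ...and let EventBridge invoke it. Forget this step and you get silence, not errors.
lambda_client.add_permission(
    FunctionName=FUNCTION_ARN,
    StatementId="allow-eventbridge",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule_arn,
)
```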
Also, AWS’s Lambda Python runtime doesn’t include Beautiful Soup or Requests. (Technically you can load a botocore.vendored version of Requests, but it’s missing the get function, so, umm, why? Even more technically, AWS’s blog apologizes for this state of affairs and promises to reverse the decision as of April 2022, but whoever was going to do the work must have been laid off.) StackOverflow posts and Medium articles suggest building a layer, and GitHub is full of five-plus-year-old pinned layers. Pinning libraries in Internet scrape-and-parse code is a great way to build in a software vulnerability, and we’re not going to do that. So the script has to be rewritten to use urllib3 and to split-hack the returned page for the actual URL to fetch, roughly as sketched below.
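A minimal sketch of that rewrite, assuming the download pages embed an https://download.microsoft.com/… link ending in .json (the page list, bucket name, and structure are illustrative, not my exact production code):

```python
import boto3
import urllib3

# Placeholder page and bucket; the real list covers all three Azure cloud pages.
PAGES = ["https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519"]
BUCKET = "my-ip-ranges-bucket"

http = urllib3.PoolManager()

def find_download_url(page_url):
    """Split-hack the page source for the first download.microsoft.com .json link."""
    html = http.request("GET", page_url).data.decode("utf-8")
    for chunk in html.split("https://download.microsoft.com")[1:]:
        candidate = "https://download.microsoft.com" + chunk.split('"')[0]
        if candidate.endswith(".json"):
            return candidate
    raise ValueError("no JSON download link found in " + page_url)

def lambda_handler(event, context):
    s3 = boto3.client("s3")  # boto3 ships with the Lambda runtime, no layer needed
    for page in PAGES:
        url = find_download_url(page)
        data = http.request("GET", url).data
        s3.put_object(Bucket=BUCKET, Key=url.rsplit("/", 1)[-1], Body=data)
    return {"statusCode": 200}
```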
Next, debugging… I went through at least five rounds of “that looks like it will work, but it doesn’t”. This work does not go away with SALAMI, but does it get faster? I think the answer is no, based on where the debugging time actually goes. For each problem, the process is:
1. Observe the error output.
2. Sigh, and determine what and where it is (complicated by the fact that Lambda sometimes shows you previous error output instead of current, because 🤷).
3. Fix it if it’s obvious (like a typo or an indentation problem) and try again.
4. Search for an answer and/or ask someone else if it’s not obvious. This part involves articulating your problem, just as you would with a SALAMI.
5. Try out the suggested answer and see what happens.
Step 4 can be speedy or time-consuming; it takes more time when the most common answers are wrong. The typing parts are very fast. The thinking parts are fairly fast. The observation and deduction parts are very slow, and complex, because they involve lots of different systems. Where does SALAMI fit into a process that’s crossing half a dozen browser tabs?
32 lines of Python, about two dozen configuration actions in AWS, maybe fifteen minutes of typing, several hours of troubleshooting and configuring services. I don’t think any large language model built from Internet searches could have provided correct code on the first try, because Internet searching consistently produced and preferred incorrect answers at every level of the problem. GIGO.
- How to get the data in the first place: there are some PowerShell scripts for parsing these files after you manually download them, but only one for automating the download, and it relies on a POSH function. To do this from a non-Windows system, you need to start differently.
- How to get the URL from the pages using Python: the Internet is clear that Beautiful Soup is the smart way to do this, and other solutions are discouraged. So, how to use BS in Lambda: the only suggestion is to use a layer. Would a SALAMI suggest split-hacking at all? Unknown.
- How to get the page you’re going to parse for the URL, AND how to get the data-holding page, using Python: searching is divided between Requests and urllib unless you specify Python 3, in which case Requests is the preferred answer. So, how to use Requests in Lambda: the preferred suggestion is to use AWS’s version, which is missing the necessary function. The second suggestion is to build a layer, which means taking on unwanted security and maintenance burden.
- Finding the encoding and ACL problems that made my prototype work for interactive browsing but fail for scripted ingest: I don’t see how SALAMI could do this at all, since you don’t know to ask about the problem until you observe it (the sketch after this list gestures at the shape of the fix).
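I can only gesture at that last fix here; a sketch with illustrative values, assuming the cure is declaring encoding, content type, and ACL explicitly instead of letting S3 and the downstream fetcher guess:

```python
import boto3

s3 = boto3.client("s3")

def put_ranges(bucket, key, body_text):
    """Write the JSON so a scripted fetcher gets predictable bytes and permissions."""
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=body_text.encode("utf-8"),  # pin the encoding explicitly
        ContentType="application/json; charset=utf-8",
        ACL="bucket-owner-full-control",  # illustrative; match your bucket policy
    )
```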
Now, let’s posit that tomorrow Microsoft sees the light and just posts their IP lists on a public page like the other IaaS vendors do; a change in the environment produces a need for new code. The new code is 24 lines of Python, and a SALAMI could easily produce it… but so could I. Delete 8 lines, Deploy, Test, done.
I don’t think faster delivery of wrong but plausible script is a game changer.
