Learning from SolarWinds


ProPublica is a national journalistic treasure, and its recent reporting on the software industry is a terrific impetus for much-needed change. Working at Microsoft over twenty years ago, I sat in on many bug triage discussions, and despite great technology advances, the way these decisions are made appears to have evolved little. My purpose here is not to judge what transpired and who is at fault, but to glean better software practices from the reporting so we can at least learn from these events.

In the wake of the SolarWinds security breach, for perhaps the first time, we have detailed reporting of behind-the-scenes efforts over years to get Microsoft to fix known flaws which, through intentional inaction, directly enabled one of the most notorious and damaging security incidents of this decade. Reading this summary article (which links to the full reporting), I found that a couple of sentences stood out as key insights.

https://www.propublica.org/article/microsoft-solarwinds-what-you-need-to-know-cybersecurity

“The MSRC argued that, because hackers would already need access to an organization’s on-premises servers before they could take advantage of the flaw, it didn’t cross a so-called “security boundary.” Former MSRC members told ProPublica that the center routinely rejected reports of weaknesses using this term, even though it had no formal definition at the time.”

We don't know precisely what the reasoning involving a "security boundary" was: without sufficient details, it's impossible to accurately analyze the decision factors. Presumably the whistleblower filed a bug, so it would be informative to see the bug log (redacted for sensitive details), and there may be email and/or meeting notes. If everything was transacted informally without any documentation, then I'd say that's a big red flag, since that's no way for a big corporation to make decisions that potentially impact countless customers. So if there are records, and they were doing a professional job as they continue to assert, why in the world wouldn't they share those in some form?

I know what a "trust boundary" is, and I think that's what they mean by the expression "security boundary". According to NIST (the closest thing we have to an authority), "security boundary" has another meaning that doesn't make sense here. The final sentence of that quote, about defining our terms, raises an important issue: our field is full of sloppy and inconsistent terminology. We need some leeway in word meanings or everything reads like a math proof, but we also need some rigor and core principles, especially when debating slippery things like bugs. If someone thinks a "security boundary" is good for triaging vulnerabilities, then they should define it as best they can and suggest tests to determine what it does or does not apply to.

In any case, reasoning about whether to fix a bug involves a lot more than citing one technical term or simplistically claiming that the necessary conditions "will never happen". If MSRC has a policy of not fixing vulnerabilities that they believe are safely behind a "security boundary", I would say they are likely piling up a growing collection of "ticking time bombs". Even if they infallibly determine the harmlessness of these flaws today, how is it humanly possible to know what future releases, used by future customers building all kinds of systems for new applications, will expose? Or what if the conditions "should never happen" but somebody makes a mistake?
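The chaining problem can be made concrete with a tiny sketch. This is my own illustration, with hypothetical flaw names and privilege labels; it does not describe any real triage tool or Microsoft's actual process. It models how a flaw that "requires" some access when judged in isolation still matters once another flaw supplies exactly that access.

```python
# Toy reachability model (hypothetical flaws; my illustration only).
# Each flaw is (name, prerequisite privilege, privilege it grants).
flaws = [
    ("initial foothold", "external", "on-prem access"),
    ("token forgery flaw", "on-prem access", "cloud admin"),
]

def reachable(start, flaws):
    """Return every privilege an attacker can reach by chaining flaws."""
    have = {start}
    changed = True
    while changed:
        changed = False
        for _name, prereq, grants in flaws:
            if prereq in have and grants not in have:
                have.add(grants)
                changed = True
    return have

# Triaged in isolation, the second flaw "requires on-prem access", so it
# can look safely behind a boundary. Chained from a purely external start,
# the two flaws together escalate all the way to cloud admin.
print(sorted(reachable("external", flaws)))
# → ['cloud admin', 'external', 'on-prem access']
```

The point of the toy model is that "won't fix because the attacker needs X" is only sound if nothing else on the platform, now or in any future release, can ever grant X.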

Without having researched this particular incident in depth, and with only very limited public information about what happened, I can only speculate. There is another ProPublica report, "Cyber Safety Board Never Probed Causes of SolarWinds Breach", and Microsoft has not been exactly forthcoming. The "security boundary" rationale seems to say that the bug wasn't fixed because they were convinced it couldn't be reached by an attacker: that the "ticking bomb" couldn't possibly explode. In my book I write about vulnerability chains and how just this sort of triage mistake happens. It's remarkable to think that, while I was writing that, these very events were playing out behind the scenes, leading to a massive breach.