CrowdStrike and the threat of friendly fire


Threat modeling methodology centers on asking, “What could go wrong?” and then considering mitigations to address such an eventuality. The unending calamities of history vividly demonstrate how human intuition repeatedly fails to foresee many such events until after they happen, and even then we sometimes fail to learn and act. Consider the 2008 financial crisis: after all the bailout money was handed out around Wall Street, Congress never confronted the glaringly obvious problem of “too big to fail” institutions. As a result, large firms continued to consolidate, concentrating power and risk in still fewer institutions and creating the conditions for a repetition that appears to be a matter of “when” rather than “if”. Traditionally, threat modeling has been deployed exclusively within the context of secure software engineering, but I posit that it is just as effective and important a tool for anticipating potential harms of all kinds, not just malicious exploitation.

The recent CrowdStrike incident was not caused by a malicious actor, but the harm is no less devastating for having been inflicted by the duly authorized vendor pushing a bad update to its own customers. It’s critical that we learn all we can from such debacles, and that begins with maximal disclosure of details by the responsible party or parties; in this case, though, the problem is already evident (at least in part) in the rear-view mirror. It is an excellent example of my claim that threat modeling is valuable well beyond the usual cybersecurity context of malicious attacks.

Let’s assume that in architecting its Falcon product some degree of threat modeling was done. In order to detect and report potential attacks, Falcon must run with kernel-level (root) privileges to fully surveil the system, so pushing updates clearly requires some serious engineering: only authentic updates from the company should be installed, tampering with the code deployed to customers must be prevented, and updates should be applied promptly to narrow the window of attack once known vulnerabilities become public. All of this is standard threat modeling practice, and I have no doubt it was designed for, carefully implemented, and tested.
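
To make the first of those requirements concrete, here is a minimal sketch of what update-authenticity checking might look like, assuming, purely for illustration, an Ed25519 signing key held by the vendor and a detached signature shipped alongside each package; CrowdStrike’s actual mechanism is proprietary and certainly more elaborate.

    # Illustrative only: refuse to install any update package whose detached
    # Ed25519 signature does not verify against the vendor key pinned in the agent.
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import (
        Ed25519PrivateKey,
        Ed25519PublicKey,
    )
    from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

    def verify_update(package: bytes, signature: bytes, vendor_public_key: bytes) -> bool:
        """True only if the package was signed by the pinned vendor key."""
        try:
            Ed25519PublicKey.from_public_bytes(vendor_public_key).verify(signature, package)
            return True
        except InvalidSignature:
            return False

    if __name__ == "__main__":
        # Tiny self-test standing in for the vendor's signing infrastructure.
        signing_key = Ed25519PrivateKey.generate()
        pinned_key = signing_key.public_key().public_bytes(Encoding.Raw, PublicFormat.Raw)
        update = b"hypothetical update package contents"
        good_signature = signing_key.sign(update)
        assert verify_update(update, good_signature, pinned_key)
        assert not verify_update(update + b" tampered", good_signature, pinned_key)

Note that signing only covers authenticity and integrity: a perfectly authentic update can still be the one that bricks the machine, which is exactly the gap discussed next.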

But it appears (in this surely over-simplified treatment) that this was the extent of the threat model; apparently non-malicious threats were never considered. The obvious additional threat, in hindsight: what if the company itself somehow pushed an authentic update that bricked the system once installed? (From past experience, I can easily imagine that anyone raising this issue would have been roundly told that such a thing could never happen. Remember that the RMS Titanic was touted as “unsinkable”, so lifeboat capacity was reduced on the grounds that it would never be needed.) No attacker is involved in this scenario, yet it is a serious threat to system availability, as we all recently saw. Just because the damage is self-inflicted, why is it any less important to anticipate and mitigate this very real threat?

There are many ways this threat could have been handled proactively to reduce the harm we just witnessed, but first the potential threat had to be identified, and the naysayers (“don’t waste time on something that cannot happen!”) quieted long enough to listen and learn. Exactly which mitigations are called for is a matter of debate and depends on proprietary details, but several are easy to sketch.

  • Enforce a more rigorous vetting process so that every update is thoroughly tested before release.
  • Don’t authorize a single individual to push a release; require at least two approvers.
  • Roll out updates gradually, in stages, so there is time to catch mistakes and nip a problem in the bud (see the sketch after this list).
  • Require one or more digital signatures on each release so that the designated authorities must sign off on every update.
  • After pushing any update, require each updated system to report back its status to confirm success, and stop and investigate if no response arrives within a reasonable number of minutes (also covered in the sketch after this list).
  • Design a customizable shim, inserted very early in the boot process, that checks for a digitally signed “roll back” script to repair systems caught in a cycle of crashing and rebooting (BSOD); a skeleton of this idea follows the rollout sketch below.
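
The staged-rollout and report-back bullets can be illustrated together. Here is a rough sketch of a deployment loop that releases an update in waves and halts if too many hosts fail to confirm a healthy state by a deadline; the function names (push_update, reported_healthy), the wave structure, and the thresholds are all invented for illustration, and a real implementation would live in the vendor’s own deployment and telemetry infrastructure.

    # Illustrative staged rollout with a mandatory health check-back after each wave.
    import time
    from typing import Iterable, List

    REPORT_DEADLINE_SECONDS = 15 * 60   # how long each host gets to confirm a healthy boot
    MAX_FAILURE_RATE = 0.01             # halt if more than 1% of a wave goes silent or unhealthy

    def push_update(host: str) -> None:
        """Placeholder: deliver the signed update package to one endpoint."""

    def reported_healthy(host: str, deadline: float) -> bool:
        """Placeholder: did the host confirm a successful update before the deadline?"""
        return True  # stub so the sketch runs end to end

    def rollout(waves: Iterable[List[str]]) -> None:
        for number, wave in enumerate(waves, start=1):
            for host in wave:
                push_update(host)
            deadline = time.time() + REPORT_DEADLINE_SECONDS
            failures = sum(not reported_healthy(host, deadline) for host in wave)
            if failures > MAX_FAILURE_RATE * len(wave):
                raise RuntimeError(
                    f"Halting rollout at wave {number}: {failures} of {len(wave)} hosts "
                    "did not confirm a healthy state; investigate before continuing."
                )

A real pipeline would presumably begin with an internal canary wave before any customer fleet is touched, which is the whole spirit of the slow-rollout bullet.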

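The boot-time shim is the most speculative idea, so what follows is only a skeleton: count consecutive boots that never reach a healthy state and, past a threshold, look for a vendor-signed roll-back script to undo the last update. In reality this would be native code wired into the boot sequence well before the component it protects; the paths, threshold, and helper names below are invented, and the sketch is only meant to show the control flow, including the signature check that keeps the recovery path from becoming an attack vector of its own.

    # Skeleton of a boot-time "roll back if we keep crashing" check (illustrative only).
    import json
    from pathlib import Path
    from typing import Callable

    CRASH_COUNTER = Path("/var/lib/agent/boot_attempts.json")   # hypothetical path
    ROLLBACK_SCRIPT = Path("/var/lib/agent/rollback.sh")        # hypothetical path
    ROLLBACK_SIG = Path("/var/lib/agent/rollback.sh.sig")       # hypothetical path
    MAX_FAILED_BOOTS = 3

    def record_boot_attempt() -> int:
        """Increment and return the count of boots that never reached 'healthy'."""
        count = 0
        if CRASH_COUNTER.exists():
            count = json.loads(CRASH_COUNTER.read_text()).get("failed", 0)
        CRASH_COUNTER.write_text(json.dumps({"failed": count + 1}))
        return count + 1
        # (A boot that later reaches a healthy state would reset this counter; omitted here.)

    def maybe_roll_back(verify_signature: Callable[[bytes, bytes], bool]) -> bool:
        """If the machine is crash-looping, verify a signed roll-back script;
        a real shim would then execute it and reboot."""
        if record_boot_attempt() < MAX_FAILED_BOOTS:
            return False
        if not (ROLLBACK_SCRIPT.exists() and ROLLBACK_SIG.exists()):
            return False
        script, signature = ROLLBACK_SCRIPT.read_bytes(), ROLLBACK_SIG.read_bytes()
        if not verify_signature(script, signature):  # same pinned-key check as for updates
            return False
        # subprocess.run(["/bin/sh", str(ROLLBACK_SCRIPT)], check=True)  # then reboot
        return True
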
Which of these, or of many other possible mitigations, is best for CrowdStrike, its engineers would know best; the point is that preventing this failure was entirely possible, and it begins with acknowledging the potential threat, non-malicious as it may be. There’s no reason to restrict threat modeling to scenarios involving malicious actors; let’s open the scope to all foreseeable major adverse impacts, regardless of human intention.