Threat-based questions to understand the Crowdstrike incident
Every chance I get I’ve been offering this guidance: We understand software security best through specific threats and mitigations, articulated by threat models shared openly. While I doubt the folks at Crowdstrike are interested in my help, this is a great opportunity to test how this works in practice.
To begin, note that the company has insisted that the incident was not a cyberattack. That is a statement about the cause; in terms of impact, it certainly damaged Availability (the “A” of the fundamental software security C-I-A triad).
It’s early to ask questions and expect answers, but in time there should be a full investigation (the Cyber Safety Review Board seems perfectly positioned to do this). Until then, here’s one line of Q&A that I think would shed light; since we have no answers yet, I’ll consider multiple alternative answers to each question.
Q1: Let’s start by looking at the most recent threat model for Falcon at the time of the incident.
-
A: We have no formal threat model. OK, that’s a big problem, you failed to anticipate this risk. [stop here]
-
A: Here is our threat model document. If it isn’t fairly recent that would be a problem. [continue…]
Q2: Does the threat model list the threat of pushing content files (e.g., Channel File 291) causing system crashes?
-
A: No. OK, that’s not a good threat model if it omits critical risks like this incident. Redo it now! [stop here]
-
A: Yes it does. So far, so good, publishing this portion of the threat model would help customers a lot. [continue…]
Q3: Are there mitigations for the risk of crashes due to bad content file updates in the threat model?
-
A: None, we assumed that testing would always prevent this happening. OK, obviously no, try again! [stop here]
-
A: None, the risk was not considered to need any mitigation. OK, clearly you need mitigations! [stop here]
-
A: Yes, mitigations were listed. [we can only guess since no threat model has been published; continue…]
Q4: Which of the listed mitigations (one or more) should have prevented the problem that occurred?
-
A: Actually none of them would have prevented it. OK, you need additional mitigations! [stop here]
-
A: Pre-release testing should have detected the flaw. [go to Q5]
-
A: Releasing content files requires admin privileges in our production cloud systems. [go to Q6]
Q5: Why didn’t testing detect the flaw in the Channel File 291 that caused the incident?
-
A: In this case the testing was never performed. OK, why wasn’t the testing performed? . . .
-
A: The tests detected the problem, but the test results were misrecorded. OK, how can this be prevented? . . .
-
A: The file that was released was different from the file we tested. OK, how can we prevent this error? . . .
-
(and so on …)
Q6: Why didn’t limited access prevent this error from occurring?
-
A: An intern did it by mistake. OK, who is responsible for giving an intern such a large blast radius?
-
A: Over 100 people have this access, “too many cooks spoil the broth”. OK, you need to limit that privilege . . .
-
A: It was human error … OK, you need a redundant system with multiple people approving releases.
-
A: There was miscommunication about which file to release. OK, use hashes to identify file contents.
-
(and so on …)
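Two of the suggested fixes under Q6, identifying file contents by hash and requiring multiple people to approve a release, can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the two-approver threshold, and the idea of recording a digest at test time are hypothetical and say nothing about how Crowdstrike's actual release pipeline works.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def release_allowed(
    tested_digest: str,
    candidate: bytes,
    approvers: set[str],
    min_approvers: int = 2,  # assumed threshold, purely illustrative
) -> bool:
    """Hypothetical release gate.

    Allows a release only if the candidate file is byte-for-byte the file
    that was tested (same digest) and enough distinct people approved it.
    """
    same_content = sha256_hex(candidate) == tested_digest
    enough_approvers = len(approvers) >= min_approvers
    return same_content and enough_approvers


# Usage: record the digest when the file passes testing, check it at release.
tested_file = b"contents of the content file that passed testing"
digest_at_test_time = sha256_hex(tested_file)

print(release_allowed(digest_at_test_time, tested_file, {"alice", "bob"}))
print(release_allowed(digest_at_test_time, tested_file + b"!", {"alice", "bob"}))
print(release_allowed(digest_at_test_time, tested_file, {"alice"}))
```

Using a set for the approvers means the same person approving twice counts once, which is the point of a redundant approval step.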
If we had real answers to these questions we could bring this Q&A to a close, but beyond Q3 it’s all guesswork: the number of possibilities expands quickly, and covering all the bases becomes complicated. Still, this much should convey the idea. Work from the threat model: if it missed a threat, fix that and proceed to put mitigations in place; if the mitigations listed are insufficient, add more; and if a mitigation is in the design but its implementation or execution failed, add redundant mitigations to shore it up. Keep asking questions until all open avenues are closed by one, and ideally more, defenses.
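The line of questioning in Q1 through Q4 amounts to a small decision procedure, which the following sketch tries to make explicit. The class names, fields, and returned strings are invented for illustration, and real threat models are documents rather than dictionaries, so treat this as a rough model of the reasoning rather than a tool.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Mitigation:
    name: str
    # Should this mitigation, as designed, have prevented the incident?
    should_have_prevented: bool


@dataclass
class ThreatModel:
    # Maps each listed threat to the mitigations recorded for it.
    threats: dict[str, list[Mitigation]] = field(default_factory=dict)


def next_step(model: Optional[ThreatModel], threat: str) -> str:
    """Walk Q1 through Q4 for one threat and return the next action."""
    if model is None:                      # Q1: no threat model at all
        return "write a threat model"
    if threat not in model.threats:        # Q2: the threat was missed
        return "add the missed threat, then add mitigations"
    mitigations = model.threats[threat]
    if not mitigations:                    # Q3: no mitigations listed
        return "add mitigations for this threat"
    expected = [m.name for m in mitigations if m.should_have_prevented]
    if not expected:                       # Q4: none would have prevented it
        return "add more mitigations"
    # A mitigation was designed in but failed in implementation or execution.
    return "find out why these failed and add redundancy: " + ", ".join(expected)


model = ThreatModel({"bad content file update": [
    Mitigation("pre-release testing", should_have_prevented=True),
]})
print(next_step(model, "bad content file update"))
```

The final branch is where Q5 and Q6 begin: each mitigation that should have worked gets its own follow-up questions.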
As of this writing we do have the Falcon Content Update Preliminary Post Incident Report which states, “Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data. Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production.” Being unfamiliar with this large and sophisticated product, I won’t pretend to offer a serious analysis here, but the obvious questions this statement raises include:
-
Q7: How did the (March 05) testing not detect the serious problem with this content file?
-
Q8: Why is it safe to rely on four-month-old testing (March to July), and not test with the latest versions?
-
Q9: Did you identify the extreme risk of entrusting this Content Validator to prevent such a massive failure?
With knowledge of the system and more Q&A it should be straightforward to drill down to a solid explanation, but only starting from risk awareness. Adam Shostack’s Four Questions serve to guide the analysis:
-
What are we working on? The Falcon product (we skipped this in the Q&A above)
-
What can go wrong? Q1 & Q2
-
What are we going to do about it? Q3 & Q4
-
Did we do a good job? Q5 through Q9 (and more if we can get answers)
Why is this simple Four Question framework so wickedly effective for such a vast range of problems? Most fundamentally, because every system in the world is subject to threats, and to the extent we can anticipate these effectively the system can perform as designed. Any good analysis must begin by understanding what the system does (1); in this particular case we are focused on the recent problem and on preventing it from recurring. Next we must fully identify potential risks (2) in order to have a chance of protecting against them (3), however imperfectly. Finally, we must assess the completeness of our mitigations (4), which is easy to zero in on when we have an actual lapse to aim for.
No doubt the story of this incident is complicated given the sophistication of the product; however, the calculus of risk and mitigation remains fundamental and it’s quite straightforward. If we identify the relevant threats and follow the risk, it shouldn’t be hard to zero in on where the failure occurred and how to begin remediating effectively.