The CrowdStrike July incident root cause analysis report provides new detail, though it requires some reading between the lines to interpret (I welcome corrections with references if I got anything wrong).
They list six problems in the Findings and Mitigations section: the bug itself, a lack of bounds checking in the kernel code (2); four errors in validation and testing (1, 3, 4, 5); and the lack of staged deployment (6), which would avoid breaking all customers within minutes by starting with a small population, so that if such a problem occurs it can hopefully be noticed and remediated quickly.
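The bounds-checking gap is the easiest of these to picture. As a minimal sketch of the bug class only (the structure and function names here are invented, not CrowdStrike's actual code), the crash comes from trusting a field index that exceeds the number of fields actually supplied:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: a content record carries a count of field values
 * followed by an array of pointers to them. Names are illustrative only. */
typedef struct {
    uint32_t     field_count;   /* how many entries inputs[] actually holds */
    const char **inputs;        /* field values parsed from the channel file */
} content_record;

/* Unsafe: trusts the supplied index, the class of out-of-bounds read
 * described in the report's kernel-code finding. */
const char *get_field_unchecked(const content_record *rec, uint32_t idx) {
    return rec->inputs[idx];    /* idx >= field_count reads past the end */
}

/* Safe: validate the index against the count before dereferencing. */
const char *get_field_checked(const content_record *rec, uint32_t idx) {
    if (rec == NULL || rec->inputs == NULL || idx >= rec->field_count)
        return NULL;            /* caller must handle the failure path */
    return rec->inputs[idx];
}
```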
Canary testing is only mentioned in the context of a new mitigation, so astoundingly they were not doing it before and have only just begun. It seems clear that any test of that kernel code with Channel File 291 would crash, so a canary test would have caught the problem. This should be cheap and easy to do routinely, and I cannot imagine why it wasn't; the report never explicitly acknowledges the lack of canary testing. Relying solely on validators and test suites (which in this case omitted one category of values for one field of a record) is surprising, and it is remarkable that this was the first time such a problem occurred.
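One cheap form of such a check is simply exercising the new content through the same parsing path before shipping it and blocking the rollout if that crashes. A minimal sketch, assuming a hypothetical user-mode harness binary (content_parser_selftest) that parses a candidate channel file the way the sensor would:

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run the harness on the candidate channel file in a child process.
 * Return 0 only if it exits cleanly; a crash (e.g. SIGSEGV) blocks rollout. */
static int canary_check(const char *harness, const char *channel_file) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        execl(harness, harness, channel_file, (char *)NULL);
        _exit(127);                       /* exec failed */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    if (WIFSIGNALED(status)) {
        fprintf(stderr, "canary: harness killed by signal %d, blocking rollout\n",
                WTERMSIG(status));
        return -1;
    }
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}

int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s <harness> <channel-file>\n", argv[0]);
        return 2;
    }
    return canary_check(argv[1], argv[2]) == 0 ? 0 : 1;
}
```

The same idea scales up to true canary deployment: push the content to a small pool of real machines first and watch for crashes before promoting it to everyone.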
It sounds like they have only been fuzz testing since the incident: “We have completed fuzz testing of the Channel 291 Template Type and are expanding it to additional Rapid Response Content handlers in the sensor.” The report states that the Microsoft Windows Hardware Quality Labs (WHQL) program involves fuzz testing, but clearly it did not exercise Channel File 291 effectively.
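A fuzzing harness for this kind of parser is short. Here is a hedged sketch using libFuzzer, where parse_channel_file stands in for the sensor's content interpreter built as a user-mode library (the name and signature are assumptions, not CrowdStrike's API):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry point into the content interpreter under test. */
int parse_channel_file(const uint8_t *data, size_t size);

/* libFuzzer harness: the fuzzer calls this repeatedly with mutated inputs,
 * many of them malformed; any crash or sanitizer report is a finding.
 * Build with: clang -g -fsanitize=fuzzer,address fuzz_channel.c parser.c */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    parse_channel_file(data, size);   /* must tolerate arbitrary bytes */
    return 0;                         /* return value is ignored by libFuzzer */
}
```

Seeding the corpus with known-good channel files helps the fuzzer reach deeper parse states quickly, and an out-of-bounds read like this one would show up immediately under AddressSanitizer.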
Going beyond the actual chain of events, I wish they had detailed how deployment works as well. How does Channel File 291 (and all the rest) get onto customer machines? Are the files timestamped and digitally signed to protect their integrity and foil replay attacks? We’ve seen that millions of machines are at risk, so it’s also important to lock down the entire deployment infrastructure.
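To make the replay concern concrete, here is a minimal sketch of what client-side verification could look like, assuming detached Ed25519 signatures via libsodium plus an embedded timestamp and monotonic version; none of this is known to be how CrowdStrike actually protects channel files:

```c
#include <sodium.h>
#include <stdint.h>
#include <time.h>

#define MAX_CONTENT_AGE_SECS (7 * 24 * 3600)   /* illustrative freshness window */

/* Verify a signed content update: the publisher signs (timestamp, version,
 * payload); the client checks the signature, a freshness window, and that
 * the version is strictly newer than the last one installed, which blocks
 * replaying an older but validly signed file. A real scheme would use a
 * canonical byte encoding for the header fields rather than raw structs. */
int verify_channel_update(const uint8_t *payload, size_t payload_len,
                          uint64_t timestamp, uint64_t version,
                          const uint8_t sig[crypto_sign_BYTES],
                          const uint8_t pubkey[crypto_sign_PUBLICKEYBYTES],
                          uint64_t last_installed_version) {
    if (sodium_init() < 0)
        return -1;

    /* Signature check over timestamp || version || payload. */
    crypto_sign_state st;
    crypto_sign_init(&st);
    crypto_sign_update(&st, (const uint8_t *)&timestamp, sizeof timestamp);
    crypto_sign_update(&st, (const uint8_t *)&version, sizeof version);
    crypto_sign_update(&st, payload, payload_len);
    if (crypto_sign_final_verify(&st, sig, pubkey) != 0)
        return -1;                              /* tampered or wrongly signed */

    /* Freshness: reject content signed too long ago (or from the future). */
    uint64_t now = (uint64_t)time(NULL);
    if (timestamp > now || now - timestamp > MAX_CONTENT_AGE_SECS)
        return -1;

    /* Anti-rollback: only accept strictly newer content. */
    if (version <= last_installed_version)
        return -1;

    return 0;                                   /* safe to install */
}
```

Whether or not this resembles their mechanism, the report would be stronger if it said so explicitly, since the update channel itself is an attractive target.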