CrowdStrike Apocalypse

So a few days ago as of this writing, a security software vendor called CrowdStrike released a configuration update to their product which caused Windows machines running the product to crash on boot with the dreaded Blue Screen of Death. While they did release a fixed update and guidance on fixing the issue, somewhere on the order of 8.5 million machines were impacted, and every one of those required manual intervention to restore operation. But why did this happen, and whose fault is it?

It turns out that Microsoft is entirely blameless in this instance. I do not write that statement lightly. The crash was not caused by Windows itself, despite the fact that it only affected Windows systems. Instead, the crash was caused by CrowdStrike’s device driver for their security system. Apparently, the things their software needs to observe are only visible in a privileged operating system mode, so they provide a driver which handles those details. It was this driver that crashed.

You may be wondering why Windows doesn’t just disable the broken driver and continue operating. Well, for ordinary drivers it can do that. This one, however, is flagged as boot required, which means that if it fails, the system is not allowed to disable it. That actually makes some sense when you think about it, since you wouldn’t want your security software to be disabled by simply corrupting a driver file on the hard drive, would you? As a result, Windows cannot automatically bring the system back to a usable state, which means it never gets far enough to download CrowdStrike’s fixed update, which presumably would have corrected the fault.

You may also be wondering why Windows doesn’t provide an interface to obtain the telemetry needed to run CrowdStrike’s software (or other similar software) without using a system mode driver. Well, it turns out Microsoft did develop a scheme for doing exactly that and was blocked by the European Union, meaning it cannot legally deploy such a solution. I do not have specific details of exactly how that state of affairs came about, so there may have been some detail that would have justified the EU decision, but it is unclear to me what that might have been. It is this detail that leads me to absolve Microsoft here. It will be interesting to see if the EU takes its head out of its ass on this one as a result of this incident. It certainly gives Microsoft a pretty sizeable bludgeon to whack them with.

Anyway, it’s clear that the fault for the crash lies with CrowdStrike. Their device driver was unable to handle a faulty configuration file and, based on analysis by Dave’s Garage, likely dereferenced a NULL pointer. This is clearly not a good thing and they obviously should have done better. However, this is not the entire cause of the blue screen crashes.

Somehow an update file that was all zeros escaped into their update delivery infrastructure. There are a few possibilities for what happened, with different implications for CrowdStrike:

  1. They created the bad file themselves and did no testing before deploying it. If this is the case, then regardless of their terms of service, there is a likely case for gross negligence on their part and it may be enough to bankrupt them with all the potential consequences that carries.
  2. They created the bad file themselves and it did not fail in their testing environment. This may be less bad for them, but it seems likely that they might still face some liability that cannot be disclaimed.
  3. They created a good update file, tested it and it worked, and then sent it into the distribution pipeline where the file was then corrupted somehow. In this instance, their liability disclaimers may hold up depending on exactly how the corruption occurred.

Option (3) above is probably the most interesting since there are a few ways the file could have been corrupted that CrowdStrike could not reasonably expect, or that they could expect but have no effective safeguard against. The most likely situation is that the content delivery network corrupted the file, which could happen if the wrong disk fails at the wrong time.

You would think some sort of signature and checksum scheme on the updates would help and, in fact, it may have in this case, depending on how the file was corrupted. However, it is not certain that it would help in all cases. If the corruption happened before the signing step, the corrupted file itself would be the thing that got signed and checksummed, and it would verify just fine.
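To make that distinction concrete, here is a small Python sketch. It is only an illustration: the payload bytes and function names are made up, and nothing here reflects how CrowdStrike actually packages or verifies its updates. It uses a plain SHA-256 checksum to show both outcomes, corruption after the checksum is published and corruption before it is computed.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected_hex: str) -> bool:
    """True if data matches the checksum that was published for it."""
    return sha256_hex(data) == expected_hex


# Hypothetical update contents, purely for illustration.
good_update = b"\xaa\x01example-config-payload"
zeroed_update = bytes(len(good_update))  # same length, all zeros

# Case 1: the checksum is computed from the good file, and the file is
# corrupted later (say, in the delivery network). The check catches it.
published = sha256_hex(good_update)
caught = not verify(zeroed_update, published)

# Case 2: the file is already corrupted when the checksum is computed.
# The checksum faithfully describes the bad file, so the check passes.
published_too_late = sha256_hex(zeroed_update)
slipped_through = verify(zeroed_update, published_too_late)

print("corruption after checksumming caught:", caught)
print("corruption before checksumming slipped through:", slipped_through)
```

A cryptographic signature behaves the same way in this respect: it proves the file is the one that was signed, not that the signed file was any good.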

So the end result is that CrowdStrike has to do a lot of work to recover from the black eye they got from this. First, they need to fix their driver so it handles erroneous input from these configuration files in a sensible manner. Then they need to study their distribution infrastructure and identify anything they could have done that would have prevented this unfortunate result. But whatever they do, they are looking at a very expensive error which may have existential implications.
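As a rough idea of what "handling erroneous input in a sensible manner" could look like, here is a minimal Python sketch. The file layout (a 4-byte magic value followed by a payload length) is entirely hypothetical and is not CrowdStrike's real format, and a real driver would be kernel-mode C rather than Python. The point is only the shape of the logic: validate everything before using it, and fall back to the last known good configuration instead of crashing.

```python
import struct
from typing import Optional

# Hypothetical format: 4-byte magic, 4-byte little-endian payload length,
# then the payload itself. None of this is CrowdStrike's actual layout.
MAGIC = b"CFG1"
HEADER = struct.Struct("<4sI")


def parse_update(blob: bytes) -> Optional[bytes]:
    """Return the payload, or None if the file is malformed in any way."""
    if len(blob) < HEADER.size:
        return None  # truncated file
    magic, length = HEADER.unpack_from(blob)
    if magic != MAGIC:
        return None  # an all-zero file is rejected here
    payload = blob[HEADER.size:]
    if length == 0 or length != len(payload):
        return None  # declared length does not match what arrived
    return payload


def load_or_keep(blob: bytes, current: bytes) -> bytes:
    """Adopt the new configuration only if it parses; otherwise keep the old one."""
    parsed = parse_update(blob)
    return current if parsed is None else parsed


# Usage: a zero-filled update is ignored and the previous config survives.
last_good = b"previous-config"
active = load_or_keep(bytes(64), last_good)
print(active == last_good)
```

The design choice worth noting is that a bad update becomes a non-event: the software keeps running on the configuration it already had, and the failure can be reported instead of taking the machine down.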
