On Friday, July 19th, 2024, much of the digital world stood still as IT chaos took hold. Business terminals crashed, healthcare systems stalled, and the aviation industry was plunged into a world of endless queues and frustrated customers. Nobody was quite sure what had gone wrong, only that the failure was clearly enormous and widespread enough across industries to indicate mass system failures on a worldwide scale. The incident reportedly impacted 8.5 million Windows devices, many of them in critical industries that could not afford substantial failure at any point.
In actuality, a single routine software update from cybersecurity firm CrowdStrike had caused the mass-scale damage to the world’s digital infrastructure, bringing the level of software-induced panic and disruption that many feared ‘Y2K’ would cause at the turn of the millennium. No malicious intent or cascade of system failures was to blame – this was the fallout of a fairly standard process people undertake with their software and hardware every day. Specifically, a faulty sensor configuration update triggered an “out-of-bounds memory read” that the sensor could not gracefully handle, crashing the Windows systems it ran on. The irony is that this update process is supposed to deliver the latest threat-mitigation mechanisms and data to users; in this case, the threat came from within.
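CrowdStrike’s own post-incident reporting described the crash as the sensor’s content interpreter reading past the end of the input it was given: the update delivered fewer fields than the code expected. As a minimal sketch of this failure class in C (not CrowdStrike’s actual code, which is not public; all names here are hypothetical), consider a parser that trusts a compile-time field count instead of the count actually delivered by a configuration update:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical structures for illustration only. A content update
 * delivers a variable number of fields; field_count says how many. */
typedef struct {
    size_t       field_count;  /* number of fields actually delivered */
    const char **fields;       /* array holding exactly field_count entries */
} content_update;

#define EXPECTED_FIELDS 21     /* what this version of the code was built to read */

/* BUGGY: trusts the compile-time expectation, not the delivered count.
 * If the update ships only 20 fields, reading fields[20] is an
 * out-of-bounds read -- in kernel mode, enough to crash the machine. */
const char *read_field_unsafe(const content_update *u, size_t i)
{
    return u->fields[i];       /* no bounds check against u->field_count */
}

/* SAFER: validates the index against what was actually delivered and
 * fails closed instead of dereferencing memory it does not own. */
const char *read_field_checked(const content_update *u, size_t i)
{
    if (i >= u->field_count) {
        fprintf(stderr, "rejecting update: field %zu missing\n", i);
        return NULL;           /* degrade gracefully rather than crash */
    }
    return u->fields[i];
}

int main(void)
{
    /* An update that delivers one field fewer than the code expects,
     * mirroring the mismatch described in public post-incident reports. */
    content_update u;
    u.field_count = EXPECTED_FIELDS - 1;
    u.fields = malloc(u.field_count * sizeof *u.fields);
    for (size_t i = 0; i < u.field_count; i++)
        u.fields[i] = "ok";

    if (read_field_checked(&u, EXPECTED_FIELDS - 1) == NULL)
        puts("update rejected safely; system keeps running");
    /* read_field_unsafe(&u, EXPECTED_FIELDS - 1) would read past the
     * end of the allocation: undefined behavior, and the crash class
     * behind the outage. */

    free(u.fields);
    return 0;
}

The design point is less about one missing bounds check and more about failing closed: input arriving over an update channel deserves the same validation as any untrusted data, especially in code that runs with kernel-level privileges.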
Regardless of how ‘small’ the error was, the aftershock was substantial and left an important question for the software industry to consider: how exactly do you protect against this? If a single point of failure can bring entire industries to their knees, how high is the threat of something like this at any given point? Companies can invest in substantial disaster recovery plans and protocols, but if one of the foundational components at the heart of an operating system doesn’t play ‘nice’, all of that could be in vain. The impact will be felt most prominently by CrowdStrike itself, which is battling a roughly 20% decline in its share price while competitors such as Palo Alto Networks and SentinelOne benefit from quietly not being the vendor at fault. However, if all of the world’s most important IT systems simply move from one central point to another, will this really resolve the underlying problem? Seamless interoperability is an obvious benefit of putting all your eggs in one basket, but it comes with a hard-to-swallow truth: no architecture is truly 100% reliable, and if you are tightly coupled to a single vendor, you may be at notable risk.
The key learnings from CrowdStrike’s failure will differ from organization to organization, but if this incident prompts you to re-examine your own IT infrastructure, the takeaway should not be ‘oh, that would never happen to me’ – you can never be sure when one little software update could cause a rift as large as the great 2024 CrowdStrike outage…