Cybersecurity vendor CrowdStrike initiated a series of computer system outages across the world on Friday, July 19, disrupting nearly every industry and sowing chaos at airports, financial institutions, and healthcare systems, among others.
At issue was a flawed update to CrowdStrike Falcon, the company’s popular endpoint detection and response (EDR) platform, which crashed Windows machines and sent them into an endless reboot cycle, taking down servers and rendering ‘blue screens of death’ on displays across the world.
Australian businesses were among the first to report encountering difficulties on Friday morning, with some continuing to encounter difficulties throughout the day. Travelers at Sydney Airport experienced delays and cancellations. At 6pm Australian Eastern Standard Time (08:00 UTC), Bank Australia posted an announcement to its home page saying that its contact center services were still experiencing problems.
Businesses across the globe followed suit, as their days began. Travelers at airports in Hong Kong, India, Berlin, and Amsterdam encountered delays and cancellations. The Federal Aviation Administration reported that US airlines grounded all flights for a period of time, according to the New York Times.
As one of the largest cybersecurity companies, CrowdStrike’s software is very popular among businesses across the globe. For example, it is estimated that over half of Fortune 500 companies use its security products.
Because of this, fallout from the flawed update has been widespread and substantial, with some calling it the “largest IT outage in history.”
To provide scope for this, more than 3,000 flights within, into, or out of the US were canceled on July 19, with more than 11,000 delayed. Planes continued to be grounded in the days since, with nearly 2,500 flights canceled within, into, or out of the US, and more than 38,000 delayed, three days after the outage occurred.
The outage also significantly impacted the healthcare industry, with some healthcare systems and hospitals postponing all or most procedures and clinicians resorting to pen and paper, unable to access EHRs.
Given the nature of the fix for many enterprises, and the popularity of CrowdStrike’s software, IT organizations have been working around the clock to restore their systems, with many still mired in doing so days after the initial faulty update was served up by CrowdStrike.
In a blog post on July 19, CrowdStrike CEO George Kurtz apologized to the company’s customers and partners for crashing their Windows systems. Separately, the company provided initial details about what caused the disaster.
According to CrowdStrike, a defective content update to its Falcon EDR platform was pushed to Windows machines at 04:09 UTC (0:09 ET) on Friday, July 19. CrowdStrike typically pushes updates to configuration files (called “Channel Files”) for Falcon endpoint sensors several times a day.
The defect that triggered the outage was in Channel File 291, which is stored in “C:WindowsSystem32driversCrowdStrike” with a filename beginning “C-00000291-” and ending “.sys”. Channel File 291 passes information to the Falcon sensor about how to evaluate “named pipe” execution, which Windows systems use for intersystem or interprocess communication. These commands are not inherently malicious but can be misused.
“The update implemented at 04:09 UTC aimed to address newly detected, malicious named pipes utilized by common C2 [command and control] frameworks in cyberattacks,” the technical post detailed.
Yet, as CrowdStrike reported, “The configuration update caused a logic error that led to an operating system crash.”
After an automatic reboot, the Windows systems with the faulty Channel File 291 installed would crash repeatedly, resulting in an infinite reboot loop.
The problematic update affected only machines running Windows. Linux and MacOS machines using CrowdStrike were not impacted, the company stated.
According to the company, CrowdStrike pushed out a fix removing the defective content in Channel File 291 just 79 minutes after the initial flawed update was sent. Machines that had not yet updated to the faulty Channel File 291 update would not be impacted by the flaw. But those machines that had already downloaded the defective content weren’t so lucky.
To remediate those systems caught up in endless reboot, CrowdStrike published another blog post with a far longer set of actions to perform. Included were suggestions for remotely detecting and automatically recovering affected systems, with detailed sets of instructions for temporary workarounds for affected physical machines or virtual servers, including manual reboots.
For many organizations, recovering from the outage is an ongoing issue. With one suggested solution for remedying the defective content being to reboot each machine manually into safe mode, deleting the defective file, and restarting the computer, doing so at scale will remain a challenge.
It has been noted that some organizations with hardware refresh plans in place are considering accelerating those plans as a remedy to replace affected machines rather than commit the resources necessary to conduct the manual fix to their fleets.
CrowdStrike Falcon is endpoint detection and response (EDR) software that monitors end-user hardware devices across a network for suspicious activities and behavior, reacting automatically to block perceived threats and saving forensics data for further investigation.
Like all EDR platforms, CrowdStrike has deep visibility into everything happening on an endpoint device — processes, changes to registry settings, file and network activity — which it combines with data aggregation and analytics capabilities to recognize and counter threats by either automated processes or human intervention.
Because of this, Falcon is privileged software with deep administrative access to the systems it monitors, making it tightly integrated with core operating systems, with the ability to shut down activities that it deems malicious. This tight integration proved to be a weakness for IT organizations in this instance, rendering Windows machines inoperable due to the flawed Falcon update.
The company has also introduced AI-powered automation capabilities into Falcon for IT, to help bridge the gap between IT and security operations, according to the company.
In addition to dealing with fixing their Windows machines, IT leaders and their teams are evaluating lessons that can be gleaned from the incident, with many looking at ways to avoid single points of failure and re-evaluating their cloud strategies.
As for CrowdStrike, US Congress has called on CEO Kurtz to testify at a hearing about the tech outage. According to the New York Times, Kurtz was sent a letter by Representative Mark Green (R-Tenn.), chairman of the Homeland Security Committee, and Representative Andrew Garbarino (R-NY).
Americans “deserve to know in detail how this incident happened and the mitigation steps CrowdStrike is taking,” they wrote, according to the New York Times.