Lessons from the CrowdStrike Crisis: Organizational Strategies for Future Cyber Resilience


In July 2024, the world witnessed a major cyber crisis that affected an estimated 8.5 million Microsoft Windows devices around the globe. The disruption hit both end-user devices and servers, causing significant financial losses for organizations relying on the CrowdStrike EDR (Endpoint Detection and Response) solution. On Friday, July 19, 2024, at 04:09 UTC, CrowdStrike released a content configuration update for the Windows sensor to gather telemetry on possible novel threat techniques. Instead, the update caused the Windows hosts that received it to crash.

Mac and Linux hosts were unaffected, both because of differences in operating system architecture and because the update specifically targeted telemetry gathering on Windows systems. Although Microsoft and CrowdStrike moved to revert the update and provide recovery solutions, recovery was slow: many devices were stuck at the BSOD (Blue Screen of Death) caused by the initial update, had dropped off the network, and required physical access to apply the fix. It took roughly three to four days for the majority of organizations to restore their impacted devices.

The incident grew into a global cyber crisis and is a reminder that organizations should rethink their IT operations so they are more resilient to similar events in the future.

Reasons Behind the Occurrence

The crisis was triggered by an update intended to enhance security by gathering telemetry on new threat techniques. Such updates are a regular part of the dynamic protection mechanisms of the Falcon platform. CrowdStrike delivers security content configuration updates to its sensors in two ways:

Sensor Content: Shipped directly with the sensor

Rapid Response Content: Designed to respond to the changing threat landscape at operational speed. The July 19 issue originated from a Rapid Response Content update that was pushed to Windows hosts running the Falcon sensor.

Technical Details:

Rapid Response Content is delivered as Template Instances. Each Template Instance maps to specific behaviors for the sensor to observe, detect, or prevent. Template Instances are created and configured through the use of the Content Configuration System, which includes the Content Validator that performs validation checks on the content before it is published.

On July 19, 2024, CrowdStrike deployed two Template Instances. Due to a bug in the Content Validator, one of the two passed validation despite containing problematic content data. When the sensor processed this content, it performed an out-of-bounds memory read, which triggered an exception. The exception could not be handled gracefully and caused the Windows operating system to crash (BSOD).
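To make this class of failure concrete, the following is a minimal Python sketch of a bounds check applied both at validation time and at read time. It is an illustration only and does not reflect CrowdStrike's actual Content Validator or sensor code: `TemplateInstance`, `referenced_fields`, `validate_instance`, and `evaluate` are hypothetical names, and the idea that content refers to input fields by index is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class TemplateInstance:
    """Hypothetical content item: a name plus the input-field indices it reads."""
    name: str
    referenced_fields: Tuple[int, ...]


def validate_instance(instance: TemplateInstance, supplied_field_count: int) -> List[str]:
    """Return validation errors; an empty list means the content may be published."""
    errors = []
    for index in instance.referenced_fields:
        if index < 0 or index >= supplied_field_count:
            errors.append(
                f"{instance.name}: field index {index} is outside "
                f"0..{supplied_field_count - 1}"
            )
    return errors


def evaluate(instance: TemplateInstance, fields: List[str]) -> Optional[List[str]]:
    """Consumer side: re-check bounds and skip bad content instead of reading past the end."""
    if validate_instance(instance, len(fields)):
        return None  # reject the content rather than crash
    return [fields[i] for i in instance.referenced_fields]


if __name__ == "__main__":
    inputs = ["process", "path", "user"]
    good = TemplateInstance("good", (0, 2))
    bad = TemplateInstance("bad", (0, 5))  # index 5 does not exist in `inputs`
    print(validate_instance(good, len(inputs)))
    print(validate_instance(bad, len(inputs)))
    print(evaluate(bad, inputs))
```

The point of checking in both places is defense in depth: if the publishing-side validator has a bug, the consumer still refuses to read beyond the data it was given.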

Impact Across Different Industries

The impact of the crisis was extensive, affecting many industries that rely on Windows-based systems. Financial institutions, healthcare providers, airlines, metro systems, government agencies, and many others experienced significant disruptions. More than 50 companies in the aviation and transport industry were badly hit, which affected passengers traveling both domestically and internationally. The inability to access critical systems and data led to operational downtime, financial losses, customer dissatisfaction, and compromised security postures. The crisis underscored the risk of relying heavily on a single security solution, in this case the CrowdStrike EDR.

How to Avoid/Minimize Future Crises

Although this crisis was caused by a faulty update from CrowdStrike, it raises questions about how prepared organizations are to handle such cyber crises in the future. We cannot ignore the possibility that similar events will occur again. Below are some strategies that organizations should follow to avoid or minimize the risk of a cyber crisis:

  1. Implement Comprehensive Backups: Regularly back up all critical systems and data to ensure quick recovery in case of a failure. These backups should be stored in multiple secure locations. To protect end-user devices, use cloud storage solutions such as OneDrive so that users' files are continuously synced and backed up in the cloud. For servers, establish a robust backup strategy that includes secondary backups stored offsite or in the cloud. Employ automated backup solutions to regularly capture system images, application data, and configurations.
    If the backup server can run a different operating system (OS), that adds an extra layer of resilience by diversifying the technology stack. This approach reduces the risk that an issue affecting one OS will also impact the backup system. Routinely test backup restoration processes to verify data integrity and accessibility.
  2. Thorough Testing: While most organizations follow an SDLC or Secure SDLC process to test products before launching them, they must also ensure that all updates undergo rigorous and comprehensive regression testing in a controlled environment before deployment. Companies benefit greatly from automating these test cases and from consistently keeping the automated suite updated and extended with new scenarios. Overkill is never a bad thing when it comes to regression testing, especially when it is automated. Testing should also include compatibility checks across different systems and configurations.
  3. Incident Response Planning: Develop and regularly update an incident response plan that includes protocols for various types of cyber incidents. Conduct regular drills to ensure preparedness, including worst-case scenarios such as remotely restoring system images. Tools that support this kind of recovery, some operating even at the BIOS level, are available on the market.
  4. Vendor Communication and SLAs: Maintain open communication channels with service providers and suppliers, and establish clear SLAs that outline vendor responsibilities for liabilities, uptime, and response times in case of failures. These agreements should explicitly state that the vendor or its suppliers will be held liable in case of business loss. Ensure the SLAs include provisions for regular performance reviews, penalties for non-compliance, and detailed protocols for incident response. This proactive approach not only holds vendors accountable but also reinforces the organization's commitment to maintaining robust cybersecurity and operational continuity.
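The routine restore tests recommended under the first strategy can be partly automated. The following is a minimal Python sketch, assuming backups are restored to an ordinary directory tree: it records a SHA-256 checksum for every file at backup time and compares the restored copy against that record. The function names (`build_manifest`, `verify_restore`) and the example paths are hypothetical, and a real backup product will usually provide its own verification feature.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(manifest: Dict[str, str], restored_root: Path) -> List[str]:
    """Return a list of problems; an empty list means the restore matches the manifest."""
    restored = build_manifest(restored_root)
    problems = []
    for rel_path, expected in manifest.items():
        actual = restored.get(rel_path)
        if actual is None:
            problems.append(f"missing: {rel_path}")
        elif actual != expected:
            problems.append(f"checksum mismatch: {rel_path}")
    return problems


if __name__ == "__main__":
    # Hypothetical locations: the live data and a test restore of its backup.
    manifest = build_manifest(Path("/srv/data"))
    for problem in verify_restore(manifest, Path("/mnt/restore-test/data")):
        print(problem)
```

Storing the manifest separately from the backup itself, ideally on a system running a different OS as suggested above, means a failure that corrupts the backup is less likely to corrupt the record used to check it.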

References:

  1. https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/
  2. https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/

Authored by Rahul Singh
