Brian Sims
Editor

Cyber sector professionals respond to CrowdStrike-induced global IT outage

THE MASS global IT outage that occurred on 18 July, which was caused by a defect in a content update for Microsoft Windows hosts issued by CrowdStrike, hit businesses worldwide, forcing banks and media broadcasters offline and grounding flights. Industry practitioners have subsequently issued their verdicts on the hugely disruptive episode.

The outage caused worldwide travel disruption, with flights delayed or cancelled in many countries. It also temporarily forced broadcasters offline, delayed global port and rail transport, caused payment systems to fail and, in Alaska, interrupted the 911 emergency systems.

Cyber security company CrowdStrike, itself a software provider for Microsoft, issued this statement on its website: “CrowdStrike is actively working with customers impacted by a defect found in a single content update for Windows hosts. Mac and Linux hosts are not impacted. This is not a security incident or cyber attack. The issue has been identified, isolated and a fix has been deployed. We refer customers to the support portal for the latest updates and will continue to provide complete and continuous updates on our website.”

In a detailed review of the incident, CrowdStrike reported that the problem occurred due to a ‘bug’ in the system meant to check that software updates were working properly. The glitch meant that CrowdStrike’s system didn’t identify “problematic content data” in a file. The faulty update crashed circa 8.5 million Microsoft Windows computers worldwide. George Kurtz, founder and CEO of CrowdStrike, apologised for the impact of the outage.

This global outage has served to underline how reliant organisations are on technology to carry out ‘business as usual’. Because the outage was due to a software update provided by a supplier, it also highlights the challenges organisations face when working with third party suppliers.

Ryan Thornley, security practice lead at Appsbroker CTS, informed Security Matters: “For organisations around the world, this incident underscores the crucial importance of having a clear and thorough disaster recovery plan in place. In an era where we are so completely dependent upon interconnected devices, organisations need to plan for the modern risks that come with modern IT practices and account for third party providers and services.”

Thornley added: “The recovery phase for businesses and public organisations is going to be huge and very costly. Potentially millions of machines around the globe, from hospital computers to supermarket checkouts, have received a defective content update to CrowdStrike and in many situations will need to be physically accessed to make them bootable again. This will be a mammoth undertaking and a real stress test for companies’ disaster recovery plans.”

Community approach

Tony Law, IT infrastructure manager at CovertSwarm, commented: “Even well governed software release and change management processes sometimes fail. What end user businesses can do to best protect themselves against these thankfully rare and unfortunate occurrences is to ensure any auto-update and other software release practices thoroughly test any changes firstly within pre-production environments prior to any push to production. Now is not the time to throw stones, but rather pull together as a technology community so that we can learn from this episode and support one another.”

Douglas Wadkins, vice-president of product management and technology at Opengear, explained: “The sheer scale of this incident is a stark reminder of the risks associated with a single point of failure. Identifying and mitigating single points of failure within an IT system is crucial for the level of continuity planning that could have kept systems up-and-running. This was an operating system issue, but tomorrow it could be a network failure.”

Wadkins added: “When a software misconfiguration such as this one occurs, secure remote network access plays a vital role in swiftly addressing the issue and remediating it before the network goes down. The financial impact this episode will have cannot be overstated. Ensuring network resilience across the entire IT stack is imperative when it comes to safeguarding against such widespread disruptions in the future.”

Mark Grindey, CEO of Zeus Cloud, believes that lessons are to be learned from the CrowdStrike incident. “It’s clear that adequate testing for updates should be conducted in a safe environment before they’re issued company-wide. Companies should never have auto-updates set in a live environment and always test an update in a safe environment before releasing it live to minimise potential risks. This global outage highlights the need for businesses to not blindly trust their suppliers when it comes to updates before testing.”

Grindey concluded: “The only fix now is to re-boot in safe mode and remove the erroneous file. Unfortunately, this cannot be done remotely. It could so easily have been a security incident or cyber attack and this manual intervention required to be back up-and-running opens the door for other potential security risks and vulnerabilities. The only course of action now is to manually and safely re-boot the thousands of computers affected. It’s a task that will undoubtedly be challenging and time-consuming.”

Hurdle to national resilience

The Business Continuity Institute (BCI) and the British Computer Society’s joint report on ‘Service Resilience and Software Risk 2023’ revealed how the risk from software failure is a hurdle to national resilience and outlined how there’s insufficient shared understanding of the actual and potential risk of software failures and their impact.

Indeed, the BCI’s Horizon Scan Report 2023 showed that the greatest single disruption for organisations in the past 12 months was IT and telecoms outages, and also that the shift to remote and hybrid working emphasised the need to implement mitigation strategies to deal with them.

In order to mitigate the effects of IT outages, practitioners should conduct an audit of ICT systems, as well as the critical processes and systems reliant on them, in order to uncover the challenges posed by technology failure or cyber attack. They should then partner with top management to ensure a shared understanding of ICT risks and to adopt adequate policies, budget and processes in preparation for software failures.

Practitioners can also look to regulation pertaining to other sectors such as the Digital Operational Resilience Act, which focuses on digital third party suppliers in order to prevent and manage disruption of entities in the financial sector. Although this is a European Union regulation, practitioners could extract strategies to mitigate and manage IT disruption and align with good practice.

The global IT outage that occurred on 18 July highlights the reliance organisations have on their suppliers. BCI research points to the risk of reputational damage posed by third party suppliers. Indeed, the fall-out from this event has already caused reputational damage for Microsoft and CrowdStrike, which will require a robust reputational resilience strategy.

The BCI’s own Cyber Resilience Special Interest Group supports cyber resilience and invites subject matter experts to share their insights and offer guidance to organisations. Ultimately, the primary goal is to leverage collective experience and explore new concepts to improve the field of cyber resilience. Practitioners are welcome to join this group, which is hosted on LinkedIn.
