November 14, 2024
The CrowdStrike outage and global software's single-point failure problem
The CrowdStrike software bug that took down Microsoft operating system-based IT around the globe exposed a 'single-point' failure risk that is rising.

The frequency of large-scale attacks on corporate enterprise IT is increasing. That’s not unusual or unexpected as companies spend heavily on cyber defense in an asymmetric war against hackers who can string together a few lines of code and wreak havoc.

But the largest IT outage ever, which occurred on Friday and resulted from a flawed CrowdStrike software update sent to computers running Microsoft operating systems rather than from any malicious attack, shows a type of tech threat that has been increasing alongside hacks but gets less attention: the single-point failure. That is an error in one part of a system that creates a technical disaster across industries, functions, and interconnected communications networks, in a massive domino effect.

Earlier this year, AT&T had a nationwide outage attributed to a technical update. Last year, the FAA had an outage that occurred after a single individual replaced a critical file in a route update (the FAA now has a backup system to prevent that from happening again).

“It’s more frequent even when it’s just routine patching and updates,” Chad Sweet, The Chertoff Group co-founder and CEO and former Chief of Staff at the Department of Homeland Security, told CNBC on Friday.

Digital billboards in Times Square in New York City on July 19, 2024, during the global outage caused by CrowdStrike, which provides cybersecurity services to U.S. technology company Microsoft. Some billboards displayed a blue screen and others went completely black.

Selcuk Acar | Anadolu | Getty Images

Single-point failure risk management is an issue that companies need to plan for and protect against. There’s no software in the world that gets released and doesn’t later need to be patched or updated, and there are security best practices that cover ongoing software maintenance well after a production release, Sweet said.

Companies that the Chertoff Group works with are closely reviewing software development and update standards in the wake of the CrowdStrike outage. Sweet pointed to a set of protocols the government already provides, the SSDF (Secure Software Development Framework), that may give the market an idea of what to expect as Congress starts looking at the issue more closely. Closer congressional scrutiny is likely after the recent string of incidents, from AT&T to the FAA and CrowdStrike, since this type of technical failure has now been shown to affect the lives of citizens and the operations of critical infrastructure on a widespread basis.

“Get ready on the corporate side,” Sweet said.

Aneesh Chopra, Arcadia chief strategy officer and former White House chief technology officer, told CNBC on Friday that critical sectors including energy, banking, health care and airlines have separate regulations overseeing risk, and measures may be unique in the most regulated sectors. But for any business leader the question now is, “Assuming systems go down, what is plan B? We will see lots more scenario planning and if this is not Job No. 1, it is Job No. 2 or 3 to have those scenarios outlined,” he said. 


Unlike many issues in D.C., Chopra noted there is a bipartisan commitment to issues of critical infrastructure and systemic risk, and technical standards are a “hallmark” of the U.S. system. There may now be efforts he described as designed with the goal of “improving competition” as a means to strengthen accountability.

“If there is a mechanism to update in a more open and competitive way there might be pressure to make sure that that is done in a manner that has i’s and t’s dotted and crossed,” Chopra said.

Sweet said that will inevitably lead to business world concerns about the risk of overregulation. While there is no way to know for sure now whether there was a way for CrowdStrike to operate using a more open process that allowed for detection of the single-point failure, he said it is a legitimate question to ask.

The best method to avoid overregulation, according to Sweet, is to look to market-reinforcing mechanisms, such as the insurance industry. “The short answer is, ‘Let the free market do it, through things like the insurance industry, which will reward good actors with lower premiums,’” he said.

Sweet also said more companies should embrace the idea of “anti-fragile” organizations, a term coined by risk analyst Nassim Nicholas Taleb, as he encourages his clients to do. “Not just an organization that is resilient after a disruption, but ones that thrive and innovate and outpace competitors,” he said. In his view, any single piece of legislation or regulation would be hard-pressed to keep up with both malicious attacks and technical updates that are pushed through with unintended consequences.

“It’s a wakeup call for sure,” Chopra said.