Network problems are the leading cause of IT service-related outages, according to Uptime Institute. Power issues are the top cause of data center downtime, and cyberattacks are a growing contributor to the most severe outages.
According to the latest annual outage analysis by Uptime Institute, networking and connectivity issues are the leading cause of IT service-related outages. When it comes to data center outages, however, the chief culprit is power.
In the Uptime Institute Data Center Resiliency Survey 2024, 31% of 442 respondents named networking and connectivity issues as the main cause of IT service-related outages. IT systems and software came second, cited by 22%. Other significant causes were power (18%), third-party IT services (10%), and cooling (7%).
Uptime reviewed some of the most significant publicly reported outages. It also ran surveys on IT service-related outages and data center downtime to understand what is affecting enterprise networks and data centers. According to Uptime's Annual Outage Analysis 2024, the major causes of publicly reported IT service outages include:
<p>“We have identified that IT software is the single biggest cause. But if we add network software and configuration to fiber connectivity, that becomes the biggest single cause,” said Andy Lawrence, executive director at Uptime Institute Research during a <a href="https://uptimeinstitute.com/webinars/webinar-annual-outage-analysis-2024" target="_blank" rel="nofollow">webinar</a> sharing the report results.</p>
<p>Uptime Institute’s Annual Outage Analysis 2024 features data that incorporates responses from the Uptime Intelligence Annual Global Data Center Survey conducted in Q2 and Q3 of 2023 with 850 respondents; the Uptime Intelligence Data Center Resiliency Survey conducted in Q1 2024 also with 850 respondents; and the Uptime Intelligence Public Outage Tracking report that monitored more than 750 outages between 2016 and 2023.</p>
<p>Uptime analysts said that overall outage frequency and severity continue to decline, but cyber-related incidents are increasing—and “are responsible for many of the most severe outages … causing extensive and serious disruption,” the report states. </p>
<p>“We’ve seen that [cyberattack/ransomware] is a fast-growing component accounting for 11% of serious outages. One of the notable features of a ransomware attack is they usually last days, some have lasted weeks. And in a few rare instances, the company involved has never recovered their business, so that does open up a new, very serious category,” Lawrence explained.</p>
The collected data indicates a significant change in the impact of cyberattacks compared with a few years ago. Uptime pointed out that most control systems in data centers are now IP-based, making them more exposed to attack and more likely to be involved in an outage. In previous years, operational technology (OT) systems used private serial communication, separate from the company network. With IP-based OT systems, network security comes into focus: adversaries who gain entry can disrupt operations.
“Most IP systems get regular security patches. However, many pieces of equipment, such as chillers, generators, building management systems, etc., are not updated frequently for security, and their security features are usually not that strong or sophisticated. They usually depend on a secure network as their primary line of defense,” said Chris Brown, chief technical officer at Uptime Institute.
Uptime noted that the largest share of operators reported only negligible outages in the past three years, suggesting that most companies did not suffer substantial damage from downtime. By outage classification, 41% reported negligible outages with no significant impact on services, and a further 32% reported minimal outages that caused minimal disruption to users, customers, or reputation. Only 17% experienced significant outages that caused some disruption to customer or user services, with minimal or no financial impact but some reputational or compliance consequence.
Some 6% reported serious outages that included service or operational disruption, financial losses, compliance violations, safety worries, reputational damage, and potential customer losses. Lastly, 4% reported severe outages resulting in a major disruption of services or operations causing substantial financial losses, possible safety issues, compliance violations, customer losses, and reputation damage.
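As a quick arithmetic check on the severity breakdown above, the five shares can be tallied in a few lines of Python. This is only an illustrative sketch: the variable names are made up for the example, and the figures are simply those quoted in the text.

```python
# Shares of respondents by outage severity, as quoted in the text (percent).
severity_share = {
    "negligible": 41,
    "minimal": 32,
    "significant": 17,
    "serious": 6,
    "severe": 4,
}

# The five categories should cover every respondent.
total = sum(severity_share.values())

# Group the two mildest and the two worst categories.
low_impact = severity_share["negligible"] + severity_share["minimal"]
high_impact = severity_share["serious"] + severity_share["severe"]

print(f"total: {total}%")
print(f"negligible or minimal: {low_impact}%")
print(f"serious or severe: {high_impact}%")
```

On these figures, the categories account for all respondents, with roughly three-quarters in the two mildest categories and one in ten in the two worst.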
“There is no question that the data seems to show that the outage severity is improving. In other words, a lower proportion falls into that very severe category of serious, or severe that means our financial reputation, or other extreme consequences,” Lawrence explained.
Uptime pointed to a few public outages that severely affected an organization. For instance, the U.S. Federal Aviation Administration suffered an outage attributed to an IT software configuration error: files mistakenly deleted in a pilot-alert system affected more than 30,000 flights and hit the stocks of major airlines. Australian telecommunications provider Optus experienced a costly outage due to a network issue that caused transport delays, resulted in banking issues, and cut hospital phone lines for 12 hours, affecting more than 10 million users and 400,000 businesses. In another example, a ransomware attack on Dish Network, in which cybercriminals encrypted critical data, disrupted services for nearly 300,000 users and caused the company’s share price to drop by more than 6%.
Despite improvements in data center design and redundancy, power remains the top contributor to data center outages, according to Uptime. Its surveys found that 30% of respondents experienced an outage directly caused by a power problem. Among those, 42% pointed to uninterruptible power supply (UPS) failure, the leading cause of power issues. Another 30% named the transfer switch to a generator, which continues to be problematic for organizations. Generator failures accounted for 28% of power-related outages, and close to one-fifth (18%) said a failed transfer switch between A/B power paths led to an outage.
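The four power-related shares quoted above add up to more than 100%. One plausible reading, which is an assumption here and not something the report excerpt states, is that respondents could name more than one cause. A minimal Python sketch of the sum, with illustrative labels:

```python
# Power-related outage causes, as quoted in the text (percent of respondents
# who experienced a power-related outage). Labels are illustrative.
power_causes = {
    "UPS failure": 42,
    "transfer switch to generator": 30,
    "generator failure": 28,
    "A/B path transfer switch failure": 18,
}

combined = sum(power_causes.values())

# Anything above 100 cannot come from single-choice answers; the excess is
# consistent with multi-select responses (an assumption, not a stated fact).
excess = combined - 100

print(f"combined share: {combined}%")
print(f"excess over 100%: {excess} points")
```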
“Everything requires power, and power is so binary, and tolerance to power fluctuations can be very small,” Brown said. “The one thing that most people forget about is testing. They’ll have redundant systems, but they don’t test those on a regular basis. It’s important to test these systems, and it’s important to test them under real-world conditions.”
Uptime also observed that the number of organizations investing in physical site redundancy has been trending upwards. According to their findings, 39% of enterprise participants reported heightened redundancy in power, and 37% reported the same for cooling. Increases in power (35%) and cooling (33%) redundancy were also reported by colocation and data center providers, while 37% of cloud/hosting/SaaS providers boosted power redundancy and 33% increased cooling redundancy.
Though some of the publicized outages can be attributed in part to communications and cloud providers, nearly 40% of respondents could link an outage directly to human error. For example, 48% of those who experienced outages said data center staff failing to follow procedures caused an outage. Another 45% blamed flawed staff processes or procedures, while 23% cited installation problems as the source of outage-causing human error. Other contributors to human error include:
Douglas Donnellan, a research analyst at Uptime Institute, notes, “It’s important to acknowledge that human error, whether directly or indirectly, figures into virtually every outage. Any system that’s designed, installed, or constructed by humans inherently has a potential for failure.”