The Costly Consequences of Crashes in the Clouds

Service outage detection website Downdetector reported on Nov. 19 that the Amazon Web Services (AWS) Internet infrastructure service, which many websites and apps use as a backbone, was having problems.

Downdetector’s historical data shows that AWS had also experienced problems on Nov. 2 and Nov. 16.

“Crashes in the cloud are highly unusual,” Kristin Brown, senior PR manager at Amazon Web Services, told the E-Commerce Times when asked about the Nov. 19 crash.

Amazon’s service health dashboard “indicates that everything has been operating normally…with no widespread disruptions,” Brown said. “We have millions of customers. If there had really been a disruption of service, we would probably see more reports, in addition to the service health dashboard reporting the disruption.”

The AWS global infrastructure is divided into regions and availability zones for reliability, Brown added.

Amazon “often sees misreports on sites like Downdetector for a number of reasons,” Brown remarked. “There is a lot of redundancy and security built into cloud infrastructure, AWS in particular.”

Downdetector defended the accuracy of its data.

The company “collects status reports from a series of sources, including Twitter and reports submitted on our websites and mobile apps,” Adriane Blum, VP, marketing and communications at Ookla, the parent company of Downdetector, told the E-Commerce Times.

“Our system validates and analyzes these reports in real time, allowing us to automatically detect outages and service disruptions in their very early stages,” she explained. “We do not have problems with misreporting.”

Subsequently, on Nov. 25, an AWS outage took out “thousands of online services,” ZDNet reported.

Importance of Cloud Services

“Workloads are being shifted to public clouds even more quickly than anticipated, and hosted software apps are especially attractive for enterprises navigating their way through a worldwide pandemic,” said John Dinsdale, a chief analyst at market intelligence firm Synergy Research Group.

“Rapid adoption is also being helped by a plethora of hybrid cloud services which are helping to smooth the path towards greater usage of public clouds.”

Enterprise spending on cloud services increased by $1.5 billion in the third quarter of the year because of the pandemic, which sped up the transition from on-premises operations to cloud-based services, according to SRG.

Infrastructure as a service (IaaS), platform as a service (PaaS) and software as a service (SaaS), all of which are offered on a subscription basis, grew about three percentage points more than expected.

Companies offering their services or platforms on the cloud on a subscription basis include Microsoft, with its Office 365 and other services; customer relationship management (CRM) giant Salesforce; Google; and Amazon’s AWS.

Market research firm IDG’s 2020 Cloud Computing Survey, published in June, reported that 81 percent of more than 550 organizations polled are already using cloud infrastructure or have applications in the cloud.

There are public clouds, such as those offered by Google, Amazon and Microsoft; private clouds such as IBM’s cloud service; and hybrid clouds, which are a combination of the two.

Recent Outages

When users cannot access a cloud service, what’s the real cost?

Thousands of users worldwide lost access to Gmail, Google Drive, Google Docs, Google Meet and Google Voice on Aug. 20, when Google cloud services worldwide went down for hours.

In late September, a global outage took down Azure Active Directory (AD), Microsoft’s cloud-based enterprise identity and access management solution, which is the backbone of its cloud-based Office 365 system.

Customers could not access Teams, Microsoft 365 and the company’s other online services.

The Nov. 25 AWS crash, which lasted for hours, impacted thousands of online services, ranging from Adobe Spark, Roku and Flickr to smart devices, cryptocurrency portals, and streaming and podcast services.

Private cloud services did not fare any better.

In June, the IBM Cloud suffered a worldwide outage. In July, a router on the global backbone of the domain name system (DNS) service run by Cloudflare, a Web infrastructure and website security provider, misrouted Internet traffic for about half an hour, disrupting a large part of the Internet.

Downtime can cost enterprises that depend solely on a data center’s ability to deliver IT and networking services to customers — such as e-commerce companies — up to $11,000 a minute, according to Evolven, a technology company that provides IT Operations Analytics (ITOA) solutions for enterprise businesses.

The cost to businesses, entrepreneurs and individuals who use subscription services in their work has yet to be calculated.

Evolven suggests this equation for calculating revenue lost due to downtime:

Lost revenue = (GR / TH) x I x H, where GR = gross yearly revenue; TH = total yearly business hours; I = percentage impact; and H = number of hours of the outage.
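
As a rough illustration, the equation above can be sketched in a few lines of Python. The function name, the parameter names and the sample figures below are hypothetical and do not come from Evolven; the percentage impact is assumed to be expressed as a fraction between 0 and 1.

```python
def downtime_revenue_loss(gross_yearly_revenue, yearly_business_hours,
                          impact_fraction, outage_hours):
    """Estimate revenue lost to an outage as (GR / TH) x I x H.

    impact_fraction is the percentage impact written as a fraction
    (an assumption of this sketch): 0.5 means 50 percent of revenue
    was affected while the service was down.
    """
    if yearly_business_hours <= 0:
        raise ValueError("yearly_business_hours must be positive")
    if not 0 <= impact_fraction <= 1:
        raise ValueError("impact_fraction must be between 0 and 1")
    if outage_hours < 0:
        raise ValueError("outage_hours cannot be negative")

    # GR / TH: average revenue earned per business hour
    hourly_revenue = gross_yearly_revenue / yearly_business_hours
    # Scale by the share of revenue affected and the outage length
    return hourly_revenue * impact_fraction * outage_hours


# Hypothetical company: $10 million in yearly revenue, 2,000 business
# hours a year, half of revenue affected, four-hour outage.
loss = downtime_revenue_loss(10_000_000, 2_000, 0.5, 4)
print(f"Estimated lost revenue: ${loss:,.0f}")  # Estimated lost revenue: $10,000
```

The estimate covers lost revenue only. It leaves out other costs of an outage, such as recovery work or damage to reputation, which the equation does not attempt to measure.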

Gargantuan Task

Crashes in cloud services cannot be prevented because “these are complex systems undergoing maintenance at a component level and almost always under attack,” Rob Enderle, principal at the Enderle Group, told the E-Commerce Times.

For example, AWS’ Nov. 25 crash occurred because Amazon added capacity to the front-end cluster of its Kinesis service and the back-end servers did not pick up on the changes fast enough for technical reasons.

Kinesis enables the real-time processing of streaming data and is used directly by AWS customers as well as by other AWS services.

Still, crashes can be mitigated, and redundancy built in, so users rarely see them, Enderle noted.

That said, “Increasing redundancy, resiliency and security is an ongoing process with cloud providers,” he pointed out. “But budgets aren’t unlimited so some acceptance that failures will occur is understood and, as long as they are brief, largely accepted.”

This is where risk management — the process of identifying, assessing and controlling threats to an organization’s capital and earnings — comes in.

The threats or risks could include financial uncertainty, legal liabilities, strategic management errors, accidents and natural disasters.

“Crashes will never go away,” Enderle said. “These systems are both too complex and too attractive a target to fully eliminate the risk.”



Richard Adhikari has been an ECT News Network reporter since 2008. His areas of focus include cybersecurity, mobile technologies, CRM, databases, software development, mainframe and mid-range computing, and application development. He has written and edited for numerous publications, including Information Week and Computerworld. He is the author of two books on client/server technology.