Tuesday, June 9th was a bad day for IBM’s flagship Cloud offering, whose network suffered a catastrophic outage that knocked its entire fleet of products offline for hours. According to user reports on Twitter and Downdetector, the problem seemed to start at 22:00 UTC, with the Cloud remaining completely unstable until 00:10, when products slowly started to come back to life. Reports continued until 05:00 UTC, however, when it seems IBM finally put a fix in place. The outage comes at a particularly ironic time for the company, which struggled with an outage at its Dallas facility in March, then published a paper about maintaining ‘high availability’ and ‘zero downtime’ in April, which its marketing team is still riding.
IBM published a summary of the outage, which blames an “external network provider [who] flooded the IBM Cloud network with incorrect routing”. However, Techzim received a slew of information from a technical source who was monitoring the outage from start to finish, and it shows the issue happening within the IBM Cloud network itself.
“I got a handful of down alerts for IBM hosted services including Softlayer’s nameservers, then saw Twitter blow up shortly after. Tracing traffic to IBM’s IPs showed traffic hitting the local PoP as normal but then being routed into Washington DC, rather than the intended closer/local source. It was very reminiscent of the Google issue last year. Blaming an external provider is nonsensical; if true, IBM could have disabled advertisements or BGP session to such provider and resolved the problem,” said Olly of ISP Backbone Ltd, a managed network and connectivity specialist.
On June 9th 2019, Google suffered a huge outage that impacted most of its services across the Americas and some of Europe. The outage was described as “network congestion” while it unfolded; Google had routed a disproportionate amount of traffic to its US-East data centres by mistake. Eerily, almost a year to the day later, IBM Cloud seems to have suffered the same fate. They were effectively DDoSing themselves by sending all their internet data through a single point on the network.
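To make the “DDoSing themselves” point concrete, here is a rough back-of-the-envelope sketch. Every figure in it is invented for illustration: the site names, traffic demands and link capacities are assumptions, not IBM or Google data, and the `link_utilisation` helper is hypothetical. It simply shows how a network that is comfortable when each region is served locally becomes overloaded once all traffic is steered to one site.

```python
def link_utilisation(demand_gbps, capacity_gbps, routes):
    """Return load/capacity per site, given which site serves each region."""
    load = {site: 0.0 for site in capacity_gbps}
    for region, site in routes.items():
        load[site] += demand_gbps[region]
    return {site: load[site] / capacity_gbps[site] for site in capacity_gbps}


# Hypothetical traffic demand per region and capacity per site (Gbps).
demand = {"london": 40.0, "frankfurt": 30.0, "dallas": 50.0, "washington": 60.0}
capacity = {site: 100.0 for site in demand}

# Normal operation: each region's traffic is served by its local site.
normal = {region: region for region in demand}

# Faulty routing: every region's traffic is sent to a single site.
misrouted = {region: "washington" for region in demand}

normal_util = link_utilisation(demand, capacity, normal)
bad_util = link_utilisation(demand, capacity, misrouted)

for site in capacity:
    print(f"{site:<11} normal {normal_util[site]:.0%}  misrouted {bad_util[site]:.0%}")
```

With these made-up numbers no site exceeds 60% utilisation in normal operation, while the misrouted case pushes the single receiving site to 180% of its capacity and leaves the others idle, which is the kind of self-inflicted congestion described above.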
It’s common for cloud providers to suffer outages and network blips, but it’s unusual for an issue of such severity to be so prolonged. IBM Cloud’s Twitter feed was inundated with @mentions yet remained completely silent for hours, which doesn’t bode well for a company of Big Blue’s size. What’s worse, its status page is hosted on the same network and was completely offline throughout most of the outage, leaving struggling customers stranded with no way of knowing what was going on. IBM Cloud hosts many large corporations including accountancy specialist Deloitte, technology giant Panasonic, American Airlines, Vodafone, Bitly, Allianz and more, so the outage undoubtedly hit millions of end users around the globe. The company also hosts several US Government units, all of which suffered too.
IBM Cloud stems from IBM’s purchase of hosting provider Softlayer in 2013, which underpins its current service offering. Softlayer was a dedicated server provider with data centers in US and EU locations, but its popularity began to decline after the cloud compute boom of 2011. IBM’s healthy bank balance and cloud-envisaged future seemed a perfect match, which saw Softlayer renamed under an IBM moniker in 2016 and then fully morph into IBM Cloud in 2018. It has been a mostly turbulent and downward-spiraling decade for the company, however: its revenue has dropped over 27% since a 2011 peak, and its Q1 2020 financial report shows no end to Big Blue’s blues. We predict a mass exodus of IBM Cloud customers in the next quarter due to this outage, leaving IBM and its Cloud division in an interesting predicament.