Customer Configuration Change Triggers Fastly Outage


Cloud computing company Fastly suffered an outage on Tuesday that temporarily shut down websites worldwide, including Amazon, Reddit, CNN, and the BBC. The outage lasted less than one hour but had significant reverberations.

Within hours, Nick Rockwell, Senior Vice President of Engineering and Infrastructure at Fastly, confirmed a customer inadvertently caused the outage.

“We experienced a global outage due to an undiscovered software bug that surfaced on June 8 when a valid customer configuration change triggered it.”

That bug entered Fastly’s systems on May 12 during a software deployment. A specific customer configuration could trigger the bug under a specific set of circumstances. On Tuesday, a customer configuration change activated the bug, and 85% of Fastly’s network experienced outages. Fastly says its network was back up within 49 minutes. But Rockwell admits the impact was severe.

Fastly provides a content delivery network (CDN) to more than 2,000 customers. Lotem Finkelstein, Head of Threat Intelligence at Check Point Software Technologies, says CDNs generate replicas of original websites for the website owners to allow load balancing.

“Instead of everyone all over the world accessing one centralised server and causing an overload, what they do is actually spread the load between different replicas,” Finkelstein said. “The original server could sit in San Francisco, but there are replicas in Paris, Manhattan, Tel Aviv and Hong Kong. Everyone is routed to the nearest server to their device. When a CDN fails, it means that all the replicas are unavailable, and no one can see the content from the original server.”
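The routing idea Finkelstein describes can be illustrated with a minimal Python sketch. It assumes purely geographic selection by great-circle distance, which is a simplification: production CDNs typically also weigh network topology, load and server health. The coordinates are approximate, and the names `REPLICAS`, `haversine_km` and `nearest_replica` are hypothetical, not part of any Fastly API.

```python
import math

# Approximate (latitude, longitude) for the replica cities named in the quote.
# Illustrative values only, not real CDN points of presence.
REPLICAS = {
    "Paris": (48.86, 2.35),
    "Manhattan": (40.78, -73.97),
    "Tel Aviv": (32.09, 34.78),
    "Hong Kong": (22.32, 114.17),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_replica(user, replicas, healthy=None):
    """Return the name of the closest available replica, or None if none is up.

    `healthy` is an optional set of replica names that are currently reachable;
    when it is None, every replica is assumed to be available.
    """
    candidates = {
        name: loc
        for name, loc in replicas.items()
        if healthy is None or name in healthy
    }
    if not candidates:
        # Every replica is down: the visitor cannot be served from the CDN.
        return None
    return min(candidates, key=lambda name: haversine_km(user, candidates[name]))


if __name__ == "__main__":
    london = (51.51, -0.13)
    print(nearest_replica(london, REPLICAS))                 # normal operation
    print(nearest_replica(london, REPLICAS, healthy=set()))  # CDN-wide failure
```

In this sketch a visitor in London is sent to the Paris replica, while an empty `healthy` set models the failure case in the quote, where no replica can answer.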

Akamai, Cloudflare and Amazon’s CloudFront all provide CDN services similar to Fastly’s. While Fastly’s customer base is relatively small, it includes many of the world’s best-known websites and many major news publications. The ubiquity of these websites helps explain why the outage garnered so much attention.

Fastly says there were specific events that triggered the outage, but the business accepts they should have anticipated it.

“We have been, and will continue to, innovate and invest in fundamental changes to the safety of our underlying platforms,” says Rockwell. In the aftermath of the outage, Fastly says it is taking several steps to prevent a recurrence.

That includes deploying the bug fix across the Fastly network as quickly as possible, running a post-mortem of the processes and practices during the outage, establishing why the bug was not detected earlier, and working on improving remediation time.

Experts suggest the incident highlights the risks run when so many of the world’s biggest websites rely on the same CDN provider. Many large organisations have backup systems, but switching over can take time and often gets done manually. While the Fastly outage was short-lived, one SEO agency estimates it cost global retailer Amazon US$32 million in lost sales.

“Although it seems they weren’t down for long, the impact it would have had will be huge, especially on e-commerce sites,” said Naomi Aharony, CEO of United Kingdom-based Reboot. “Our research estimated Amazon could have potentially lost $6,803 every second it was down.”
