Cloudflare Outage Causes Grief For Some Of The World’s Biggest Websites For About 30 Minutes

Cloudflare, one of the world’s largest cloud network platforms, experienced an outage lasting around 30 minutes yesterday. Websites for some of the world’s biggest companies went down, and visitors received 502 errors. The outage was caused by a massive spike in CPU utilisation across Cloudflare’s network. Cloudflare claims to provide services to over 16,000,000 “internet properties”.

According to Cloudflare, the cause of this outage was the deployment of a single misconfigured rule in its Web Application Firewall (WAF).

This incident is separate from the recent BGP route leak, which also affected access to some AWS services. Verizon was the prime offender in that case, but many Cloudflare users were impacted.

“The intent of these new rules was to improve the blocking of inline JavaScript. These rules were being deployed in a simulated mode. Here issues are identified and logged by the new rule but no customer traffic is actually blocked. We can measure false positive rates. We also ensure that the new rules do not cause problems when they are deployed into full production.”
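To make the idea of a simulated (log-only) mode concrete, here is a minimal Python sketch. It is an illustration under assumptions, not Cloudflare's WAF: the `Rule` class, the `simulate` flag and the `evaluate` function are hypothetical names invented for this example, and the `<script` pattern is only a stand-in for an inline JavaScript rule.

```python
import re
from dataclasses import dataclass
from typing import List, Pattern, Tuple


@dataclass
class Rule:
    name: str
    pattern: Pattern[str]
    simulate: bool = True  # hypothetical flag: True means log only, never block


def evaluate(rules: List[Rule], body: str) -> Tuple[bool, List[str]]:
    """Return (blocked, names of rules that matched) for one request body."""
    blocked = False
    matched: List[str] = []
    for rule in rules:
        # The pattern is evaluated in both modes; only the outcome differs.
        if rule.pattern.search(body):
            matched.append(rule.name)  # always recorded, so false positives can be measured
            if not rule.simulate:
                blocked = True  # only an enforcing rule affects traffic
    return blocked, matched


if __name__ == "__main__":
    rule = Rule("inline-js", re.compile(r"<script\b", re.IGNORECASE))
    print(evaluate([rule], "<script>alert(1)</script>"))
```

In this sketch a simulated rule never blocks a request, but its pattern still runs against every request body, so the cost of evaluating the rule is paid whether or not it is enforcing.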

“Unfortunately, one of these rules contained a regular expression that caused CPU to spike to 100% on our machines worldwide. This 100% CPU spike caused the 502 errors that our customers saw. At its worst traffic dropped by 82%.”
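For readers wondering how a single regular expression can pin a CPU, the usual explanation is excessive backtracking. The article does not reproduce the offending expression, so the sketch below uses the textbook pattern `(a+)+$` instead. It is a toy model of a backtracking matcher written for this example (the function name `count_backtracking_steps` is hypothetical), not the engine Cloudflare runs, and it simply counts how many positions the matcher has to try.

```python
def count_backtracking_steps(text: str) -> int:
    """Count the attempts a naive backtracking matcher makes for (a+)+$ on text."""
    steps = 0

    def after_group(pos: int) -> bool:
        # At least one group has matched: succeed at end of input,
        # otherwise try to match another `a+` group from here.
        nonlocal steps
        steps += 1
        if pos == len(text):
            return True
        return group(pos)

    def group(pos: int) -> bool:
        # One `a+` group: take the longest run of 'a' first (greedy),
        # then back off one character at a time if the rest fails.
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        for stop in range(end, pos, -1):
            if after_group(stop):
                return True
        return False

    group(0)
    return steps


if __name__ == "__main__":
    for n in (5, 10, 15):
        # A trailing 'b' guarantees failure, forcing every split to be tried.
        print(n, count_backtracking_steps("a" * n + "b"))
```

For an input of n letters followed by a character that cannot match, this model makes 2^n - 1 attempts (31, 1023 and 32767 for the three sizes above), while an input that does match is accepted on the first attempt. Work that doubles with every extra character is how a short pattern can exhaust a CPU.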

The outage followed a software deployment. Cloudflare deploys software across its network constantly, and it has automated systems to run test suites and a procedure for deploying progressively to prevent incidents. Unfortunately, these WAF rules were deployed globally in one go, which caused the outage.
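As a sketch of what deploying progressively can look like, the Python below rolls a change out one stage at a time and stops at the first stage that looks unhealthy. The stage names and the `deploy` and `is_healthy` callbacks are hypothetical placeholders for this example; the article does not describe Cloudflare's actual procedure in any detail.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def progressive_rollout(
    stages: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Deploy stage by stage; return (healthy stages, first failing stage or None)."""
    completed: List[str] = []
    for stage in stages:
        deploy(stage)
        if not is_healthy(stage):
            # Stop here so later, larger stages never receive the change.
            return completed, stage
        completed.append(stage)
    return completed, None


if __name__ == "__main__":
    touched: List[str] = []
    result = progressive_rollout(
        ["canary", "one region", "global"],  # hypothetical stage names
        touched.append,                      # stand-in for a real deploy step
        lambda stage: False,                 # pretend the change is bad everywhere
    )
    print(result, touched)
```

With a change that fails its health check, only the first stage is ever touched, whereas a rollout that goes to every stage at once has no such checkpoint.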

Cloudflare admitted that its testing processes were insufficient in this case and said it is reviewing and changing its testing and deployment process to avoid incidents like this in the future.
