Cloudflare, one of the world’s largest cloud network platforms, experienced an outage lasting around 30 minutes yesterday. Websites for some of the world’s biggest companies went down, with visitors receiving 502 errors. The errors were caused by a massive spike in CPU utilisation across Cloudflare’s network. Cloudflare claims to provide services to over 16,000,000 “internet properties”.
The outage was caused by the deployment of a single misconfigured rule in Cloudflare’s Web Application Firewall (WAF).
This was a separate incident from the route leak, in which Verizon was the prime offender. That leak impacted many Cloudflare users and also affected access to some AWS services.
“Unfortunately, one of these rules contained a regular expression that caused CPU to spike to 100% on our machines worldwide. This 100% CPU spike caused the 502 errors that our customers saw. At its worst traffic dropped by 82%.”
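To illustrate how a single regular expression can pin a CPU, here is a minimal Python sketch of so-called catastrophic backtracking. The patterns and the helper `seconds_to_reject` are assumptions chosen for illustration and are not the rule Cloudflare deployed. In a backtracking regex engine such as Python's `re`, nested quantifiers can make the time needed to reject an input grow roughly exponentially with its length.

```python
import re
import time

# Illustrative patterns only; NOT the rule Cloudflare deployed.
BACKTRACKING = re.compile(r"^(a+)+$")  # nested quantifiers: many ways to split the same run of 'a's
LINEAR = re.compile(r"^a+$")           # accepts the same strings without nesting


def seconds_to_reject(pattern, n):
    """Time how long `pattern` takes to reject n 'a' characters followed by '!'."""
    text = "a" * n + "!"
    start = time.perf_counter()
    result = pattern.match(text)
    elapsed = time.perf_counter() - start
    assert result is None  # the trailing '!' can never match either pattern
    return elapsed


if __name__ == "__main__":
    # Each extra character roughly doubles the work for the nested pattern,
    # while the simple pattern stays fast.
    for n in (10, 14, 18, 20):
        slow = seconds_to_reject(BACKTRACKING, n)
        fast = seconds_to_reject(LINEAR, n)
        print(f"n={n:2d}  nested: {slow:.4f}s  simple: {fast:.6f}s")
```

Both patterns accept exactly the same strings, which is why this kind of problem is easy to miss in functional tests: the rule behaves correctly and only its worst-case cost differs.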
The outage followed a software deployment. Cloudflare deploys software across its network constantly, using automated systems that run test suites and a procedure for deploying progressively to prevent incidents. Unfortunately, these WAF rules were deployed globally in one go, which caused the outage.
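The idea of deploying progressively can be sketched as follows. This is a simplified illustration and not Cloudflare's actual tooling: the stage names, the `cpu_limit` threshold, and the `deploy`, `cpu_utilisation` and `rollback` callbacks are all hypothetical. The point is that a change reaches a small part of the network first, and a health check decides whether it goes any further.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stage names, ordered from smallest to largest blast radius.
STAGES = ["canary", "one-region", "global"]


def progressive_rollout(
    stages: Sequence[str],
    deploy: Callable[[str], None],
    cpu_utilisation: Callable[[str], float],
    rollback: Callable[[str], None],
    cpu_limit: float = 0.8,  # assumed threshold, as a fraction of total CPU
) -> List[str]:
    """Deploy stage by stage; roll everything back if CPU exceeds cpu_limit.

    Returns the stages on which the change remains deployed.
    """
    deployed: List[str] = []
    for stage in stages:
        deploy(stage)
        deployed.append(stage)
        if cpu_utilisation(stage) > cpu_limit:
            # Unhealthy: undo the change in reverse order and stop here.
            for done in reversed(deployed):
                rollback(done)
            return []
    return deployed


def simulate(bad_rule: bool) -> Tuple[List[str], List[str]]:
    """Run a rollout against a fake fleet; returns (stages touched, stages kept)."""
    touched: List[str] = []

    def deploy(stage: str) -> None:
        touched.append(stage)

    def cpu(stage: str) -> float:
        # A bad rule drives CPU to 100% wherever it is deployed.
        return 1.0 if bad_rule else 0.3

    def rollback(stage: str) -> None:
        pass  # a real system would restore the previous rule set here

    kept = progressive_rollout(STAGES, deploy, cpu, rollback)
    return touched, kept


if __name__ == "__main__":
    print(simulate(bad_rule=True))
    print(simulate(bad_rule=False))
```

In this sketch a bad rule only ever reaches the first stage, whereas deploying to every stage at once, as happened here, exposes the whole network before any check can intervene.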
Cloudflare admitted that its testing processes were insufficient in this case, and it is reviewing and changing its testing and deployment process to avoid similar incidents in the future.