
How Workers powers our internal maintenance scheduling pipeline
Cloudflare has data centers in over 330 cities globally , so you might think we could easily disrupt a few at any time without users noticing when we plan data center operations. However, the reality is that disruptive maintenance requires careful planning, and as Cloudflare grew, managing these complexities through manual coordination between our infrastructure and network operations specialists became nearly impossible. It is no longer feasible for a human to track every overlapping maintenance request or account for every customer-specific routing rule in real time. We reached a point where manual oversight alone couldn't guarantee that a routine hardware update in one part of the world wouldn't inadvertently conflict with a critical path in another. We realized we needed a centralized, automated "brain" to act as a safeguard - a system that could see the entire state of our network at once. By building this scheduler on Cloudflare Workers , we created a way to programmatically enforce safety constraints, ensuring that no matter how fast we move, we never sacrifice the reliability of the services on which our customers depend. In this blog post, we’ll explain how we built it, and share the results we’re seeing now. Building a system to de-risk critical maintenance operations Picture an edge router that acts as one of a small, redundant group of gateways that collectively connect the public Internet to the many Cloudflare data centers operating in a metro area. In a populated city, we need to ensure that the multiple data centers sitting behind this small cluster of routers do not get cut off because the routers were all taken offline simultaneously. Another maintenance challenge comes from our Zero Trust product, Dedicated CDN Egress IPs, which allows customers to choose specific data centers from which their user traffic will exit Cloudflare and be sent to their geographically close origin servers for low latency. (For the purpose of brevity in this post, we'll refer to the Dedicated CDN Egress IPs product as "Aegis," which was its former name.) If all the data centers a customer chose are offline at once, they would see higher latency and possibly 5xx errors, which we must avoid. Our maintenance scheduler solves problems like these. We can make sure that we always have at least one edge router active in a certain area. And when scheduling maintenance, we can see if the combination of multiple scheduled events would cause all the data centers for a customer’s Aegis pools to be offline at the same time. Before we created the scheduler, these simultaneous disruptive events could cause downtime for customers. Now, our scheduler notifies internal operators of potential conflicts, allowing us to propose a new time to avoid overlapping with other related data center maintenance events. We define these operational scenarios, such as edge router availability and customer rules, as maintenance constraints which allow us to plan more predictable and safe maintenance. Every constraint starts with a set of proposed maintenance items, such as a network router or...
Preview: ~500 words
Continue reading at Cloudflare
Read Full Article