Cloudflare BYOIP customers need a rollback plan, not just trust

By Greg Nowak. Last updated 2026-06-18.

Cloudflare's February 20, 2026 BYOIP outage is the kind of incident that should reset how customers think about resilience. In Cloudflare's own postmortem, the trigger was an internal change, not an attack, and the result was about 1,100 customer prefixes being withdrawn from the Internet for 6 hours and 7 minutes. If those prefixes front production traffic through CDN, Spectrum, Magic Transit, or egress, the business implication is plain: a capable provider can still leave you unreachable.

That is why BYOIP needs a rollback plan, not just vendor trust. Cloudflare deserves credit for publishing the postmortem and for the broader Code Orange: Fail Small work on safer rollouts, better failure handling, and cleaner emergency access. But none of that removes the customer's job. Your team still needs to know which prefixes matter, which services are bound to them, who can re-advertise them, what can be fixed internally, and where recovery depends on Cloudflare.

What the outage actually exposed

The failure mode was not subtle. Cloudflare says some BYOIP customers became unreachable because routes were withdrawn over BGP. It also says the impact extended to any product that depended on those advertisements reaching the Internet: core CDN and security services, Spectrum, Magic Transit, and dedicated egress use. ITPro fills in the customer-facing picture with examples including Uber, Workday, Minecraft, Wikipedia, and Microsoft Outlook. For operators, the lesson is blunt: a BYOIP failure is not a slow-edge problem. It can be a hard reachability failure on the production path.

The incident also separated two layers that many teams still treat as one. Prefix advertisement is one thing; service binding is another. Cloudflare's postmortem lays out the sequence: customers signal advertisement or withdrawal through the Addressing API or BGP Control, routers update BGP once enough machines process the change, and then BYOIP service bindings decide which Cloudflare product handles traffic for those IPs. The current BYOIP documentation makes that binding layer explicit: bindings map traffic for your IP space to Magic Transit, CDN, or Spectrum, and a default binding is required when a prefix is onboarded.

Why re-advertising a prefix is not a full rollback

Not every affected customer landed in the same failure state on February 20. Some lost route advertisement only, and those customers could restore service by toggling advertisements in the dashboard. Some lost route advertisement and some bindings, which meant only partial recovery. Others lost prefixes and all service bindings. In that last case, the dashboard was not enough because there was no longer a service attached to the ranges; Cloudflare had to push a global configuration update to restore bindings across the edge.

That distinction matters because it changes what a useful runbook looks like. A rollback step that just says "re-advertise the prefix" assumes the bindings are still there, assumes the right people know what should be attached, and assumes the recovery path is under your control. Cloudflare's own account does not support those assumptions. The slowest recovery path in the incident was not route advertisement on its own. It was the cases where service bindings had also been removed.

Failure state	What users saw	What helped on February 20	What your runbook should already define
Prefix withdrawn, bindings intact	Traffic stops reaching Cloudflare; users hit timeouts or failed connections.	Customers could toggle advertisements in the dashboard.	Named owner, re-advertisement procedure, validation checks, and a communication trigger.
Prefix withdrawn, some bindings removed	Recovery is uneven; some IPs return and others do not.	Only part of the estate could be self-remediated.	Binding inventory by prefix and subnet, plus a fallback path for mixed-state recovery.
Prefix withdrawn, all bindings removed	Ranges have no active service attached, so self-recovery is limited.	Cloudflare had to reapply service bindings globally.	Escalation path, provider case template, affected-service map, and predefined incident roles.
Planned binding create or delete	Bindings can take four to six hours to propagate, and Cloudflare warns of likely disruption for IPs in scope.	The docs do not describe a fast, invisible cutover.	Maintenance window, blast-radius review, rollback decision point, and approval for API-only binding work.

A BYOIP rollback plan has to cover both route advertisement and service-binding state. The February 20, 2026 outage showed they do not fail or recover in one clean, uniform way.
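One way to make the table operational is to classify each prefix from two observations: whether it is advertised, and which of its expected bindings are still present. The following is a minimal sketch under stated assumptions, not a Cloudflare API client. The names FailureState, PrefixObservation, classify, and FIRST_MOVE are hypothetical, and the observations are assumed to come from your own inventory and monitoring.

```python
from dataclasses import dataclass
from enum import Enum


class FailureState(Enum):
    HEALTHY = "advertised, bindings intact"
    ADVERTISEMENT_ONLY = "prefix withdrawn, bindings intact"
    SOME_BINDINGS_REMOVED = "some bindings removed"
    ALL_BINDINGS_REMOVED = "all bindings removed"


# First runbook move per state, paraphrasing the table above.
FIRST_MOVE = {
    FailureState.HEALTHY: "no action",
    FailureState.ADVERTISEMENT_ONLY: "named owner re-advertises, then validates",
    FailureState.SOME_BINDINGS_REMOVED: "self-remediate what you can, escalate the rest",
    FailureState.ALL_BINDINGS_REMOVED: "escalate to the provider with the affected-service map",
}


@dataclass(frozen=True)
class PrefixObservation:
    cidr: str
    advertised: bool
    expected_bindings: frozenset  # service names from your own inventory
    observed_bindings: frozenset  # service names you can currently see


def classify(obs: PrefixObservation) -> FailureState:
    """Map one prefix observation onto the failure states in the table."""
    missing = obs.expected_bindings - obs.observed_bindings
    if not missing:
        if obs.advertised:
            return FailureState.HEALTHY
        return FailureState.ADVERTISEMENT_ONLY
    # Assumption: binding loss is handled the same way whether or not the
    # prefix is still advertised; the table only covers the withdrawn case.
    if missing == obs.expected_bindings:
        return FailureState.ALL_BINDINGS_REMOVED
    return FailureState.SOME_BINDINGS_REMOVED


if __name__ == "__main__":
    expected = frozenset({"magic_transit", "cdn"})
    obs = PrefixObservation("203.0.113.0/24", False, expected, frozenset({"cdn"}))
    state = classify(obs)
    print(state.name, "->", FIRST_MOVE[state])
```

The point of the mapping is that the first move is decided before the incident, so nobody debates during an outage whether a given state is self-serviceable.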

The timing problem is worth underlining. Cloudflare's current BYOIP binding documentation says binding operations are API-only, and that created or deleted bindings take four to six hours to propagate across its global network, with likely disruption for IPs in scope during that window. The same documentation also shows how much can sit behind a single prefix strategy. Magic Transit can act as the default binding across an entire prefix, while more specific IPs or ranges can be directed to CDN or Spectrum. For dedicated CDN egress, a prefix can be used for ingress or egress, but not both. That is a lot of production dependency packed into configuration that does not change instantly.
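To see how much can hang off one prefix, it helps to model the layout described above locally: a default binding across the whole prefix, with more specific ranges pointed at other services. The sketch below assumes the most specific matching range decides the service, which is this example's reading of the documentation rather than a description of how Cloudflare's edge resolves bindings. The function resolve_service, the BINDINGS list, and the documentation-range addresses are all hypothetical.

```python
import ipaddress

# Hypothetical binding layout for one onboarded prefix (RFC 5737 test range).
BINDINGS = [
    (ipaddress.ip_network("203.0.113.0/24"), "magic_transit"),  # default binding
    (ipaddress.ip_network("203.0.113.16/28"), "cdn"),           # more specific range
    (ipaddress.ip_network("203.0.113.200/32"), "spectrum"),     # single IP
]


def resolve_service(ip, bindings):
    """Return the service of the most specific binding covering ip, or None."""
    addr = ipaddress.ip_address(ip)
    matches = [(net, service) for net, service in bindings if addr in net]
    if not matches:
        return None
    # Longest prefix length means the most specific range.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


if __name__ == "__main__":
    for ip in ("203.0.113.1", "203.0.113.20", "203.0.113.200", "198.51.100.1"):
        print(ip, "->", resolve_service(ip, BINDINGS))
```

Running a model like this against your real inventory shows which addresses would be left to the default binding, or to nothing at all, if a specific binding disappeared.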

Provider-side improvements help, but they do not carry your operations

Cloudflare's broader resilience work is worth paying attention to. Code Orange: Fail Small is about tighter configuration rollouts, better-tested failure modes, and clearer break-glass procedures. In the BYOIP postmortem, Cloudflare also describes specific follow-up work: standardizing the Addressing API schema, separating configured state from operational state, adding snapshot-based rollback, and building circuit-breaker behavior for large withdrawal actions. Those are the right kinds of controls.

They are also not a reason to relax locally. Cloudflare says the fast rollback system it wants was not in production when the outage happened. It also says some customers could only be fully restored after global configuration updates reattached lost bindings. And the current BYOIP docs still tell customers to expect a four-to-six-hour propagation window when bindings are created or deleted. The sensible reading is shared responsibility. Provider controls matter. So do your own change control, dependency mapping, and rehearsed recovery steps.

What belongs in the runbook

Start with the inventory most teams never finish. List every prefix and write down what production traffic it fronts today: CDN ingress, Spectrum services, Magic Transit protection, or egress use. Record which sub-ranges override the default service. Without that map, you cannot judge blast radius when a prefix is withdrawn or a binding is changed.
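The inventory does not need special tooling to be useful. As a rough sketch, assuming a hand-maintained list of records, a few lines of code can answer the blast-radius question for any prefix or sub-range. InventoryEntry, INVENTORY, blast_radius, and the sample workloads and owners are hypothetical; the check is deliberately conservative and reports every entry whose range overlaps the one being changed.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class InventoryEntry:
    cidr: str      # prefix or sub-range
    service: str   # what it is bound to
    workload: str  # production traffic it fronts
    owner: str     # team to page


# Hypothetical inventory using documentation address ranges.
INVENTORY = [
    InventoryEntry("203.0.113.0/24", "magic_transit", "datacenter ingress", "netops"),
    InventoryEntry("203.0.113.16/28", "cdn", "public website", "web-platform"),
    InventoryEntry("198.51.100.0/24", "spectrum", "game servers", "realtime"),
]


def blast_radius(inventory, changed_cidr):
    """Return every inventory entry whose range overlaps changed_cidr."""
    changed = ipaddress.ip_network(changed_cidr)
    hits = []
    for entry in inventory:
        net = ipaddress.ip_network(entry.cidr)
        # Only compare like with like (IPv4 vs IPv4, IPv6 vs IPv6).
        if net.version == changed.version and net.overlaps(changed):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    for entry in blast_radius(INVENTORY, "203.0.113.0/24"):
        print(entry.workload, "owned by", entry.owner)
```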

Then get specific about ownership. Because service bindings are API-only, the rollback path cannot live in one engineer's head or inside half-finished automation. Document who can advertise, withdraw, or rebind prefixes; what approval is required; which checks confirm traffic has returned; and when the team stops troubleshooting locally and escalates because the issue is no longer just advertisement loss.

Set expectations around time as well. Cloudflare's BGP zombie analysis is a useful reminder that routing problems do not clean up neatly just because a corrective change has been issued. When more specific prefixes are withdrawn, routers can hunt for alternate paths, traffic can loop, and misleading visibility can linger. Cloudflare observed zombie behavior that could affect traffic for more than ten minutes, and in one test it could still see zombie routes more than 30 minutes after a withdrawal. That does not mean rollback failed. It means your runbook should distinguish between three states: change issued, routing converged, and user experience recovered.
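Keeping those stages apart is easier when the runbook names the evidence for each one. The sketch below is one possible shape, assuming you have some count of external routing vantage points that see the expected route and some count of synthetic user checks that pass. RecoveryStage, recovery_stage, and the 0.95 threshold are hypothetical choices for illustration, not values taken from Cloudflare.

```python
from enum import Enum


class RecoveryStage(Enum):
    NOT_STARTED = 0
    CHANGE_ISSUED = 1
    ROUTING_CONVERGED = 2
    USER_RECOVERED = 3


def _ratio(ok, total):
    # No data counts as "not yet healthy" rather than as success.
    return ok / total if total > 0 else 0.0


def recovery_stage(change_issued, routes_ok, routes_total,
                   probes_ok, probes_total, threshold=0.95):
    """Report how far recovery has progressed based on observed evidence.

    routes_ok / routes_total: vantage points that see the expected route.
    probes_ok / probes_total: synthetic user checks that currently pass.
    """
    if not change_issued:
        return RecoveryStage.NOT_STARTED
    if _ratio(routes_ok, routes_total) < threshold:
        return RecoveryStage.CHANGE_ISSUED
    if _ratio(probes_ok, probes_total) < threshold:
        return RecoveryStage.ROUTING_CONVERGED
    return RecoveryStage.USER_RECOVERED


if __name__ == "__main__":
    # Change went out, most vantage points agree, users still failing.
    print(recovery_stage(True, 19, 20, 5, 20).name)
```

Tying customer updates to the reported stage keeps "we issued the fix" from being announced as "service is restored".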

Finally, put communication inside the rollback plan rather than treating it as a separate workstream. If a prefix fronts customer or partner traffic, the right update depends on whether you are dealing with simple re-advertisement, partial binding loss, or a provider-side recovery path. The February 20 timeline mattered because self-mitigation started before full restoration, and different customers recovered by different routes.

The working standard is straightforward. If BYOIP is attached to production traffic, your team should be able to answer four questions quickly: what is bound where, what breaks if a prefix disappears, what can you restore yourselves, and how long do you wait before escalating. Greg at GrN can help turn that into an operating document your team will actually use: a live dependency inventory, documented binding and fallback paths, tested self-remediation steps, tighter approval around Addressing API changes, and a runbook that covers re-advertisement, rollback, and customer communication. After February 20, 2026, that is basic operational hygiene.