Cloudflare outage on February 20, 2026 (blog.cloudflare.com)
CommonGuy 46 minutes ago [-]
Insufficient mock data in the staging environment? Like no BYOIP prefixes at all? Even one prefix should have shown that it would be deleted by that subtask...

Judging by all the recent outages, it sounds like Cloudflare is barely tested at all. Maybe they have lots of unit tests etc., but they do not seem to test their system as a whole... I get that their setup is vast, but even testing that subtask manually would have surfaced the bug.

dabinat 31 minutes ago [-]
I think Cloudflare does not sufficiently test lesser-used options. I lurk in the R2 Discord and a lot of users seem to have problems with custom domains.
asciii 37 minutes ago [-]
It was also merged 15 days prior to the production release... However, you're spot on about the empty test case. That's a basic scenario, and if it returns everything... that's an "oh no" moment.
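
A rough sketch of the kind of test being described, against a toy handler that imitates the pattern quoted from the postmortem further down the thread; all names here are illustrative, not Cloudflare's code:

    // Toy handler reproducing the buggy pattern: a value-less
    // ?pending_delete falls through to the "return everything" default.
    package prefixes

    import (
        "net/http"
        "net/http/httptest"
        "testing"
    )

    func listPrefixes(w http.ResponseWriter, r *http.Request) {
        if v := r.URL.Query().Get("pending_delete"); v != "" {
            w.Write([]byte(`[]`)) // nothing is pending deletion in this toy store
            return
        }
        w.Write([]byte(`["192.0.2.0/24","198.51.100.0/24"]`)) // every prefix
    }

    // This test fails against the handler above, which is exactly the
    // point: it would have caught the empty-value scenario.
    func TestBarePendingDeleteReturnsOnlyPendingPrefixes(t *testing.T) {
        req := httptest.NewRequest(http.MethodGet, "/v1/prefixes?pending_delete", nil)
        rec := httptest.NewRecorder()
        listPrefixes(rec, req)

        if got := rec.Body.String(); got != `[]` {
            t.Fatalf("bare pending_delete returned %s; want only pending prefixes", got)
        }
    }
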
atty 46 minutes ago [-]
I do not work in the space at all, but it seems like Cloudflare has been having more network disruptions lately than they used to. To anyone who deals with this sort of thing, is that just recency bias?
Icathian 42 minutes ago [-]
It is not. They went about 5 years without one of these, and had a handful over the last 6 months. They're really going to need to figure out what's going wrong and clean up shop.
NinjaTrance 34 minutes ago [-]
Engineers have been vibe coding a lot recently...
dakiol 9 minutes ago [-]
No joke. In my company we "sabotaged" the AI initiative led by the CTO. We used LLMs to deliver features as requested by the CTO, but we intentionally introduced a couple of bugs here and there. As a result, the quarter ended with more time allocated to fixing bugs and tons of customer complaints. The CTO is now undoing his initiative. We all now have a bit more time to keep our jobs.
jsheard 25 minutes ago [-]
The featured blog post where one of their senior engineering PMs presented an allegedly "production grade" Matrix implementation, in which authentication was stubbed out as a TODO, says it all really. I'm glad a quarter of the internet is in such responsible hands.
dana321 21 minutes ago [-]
That's a classic Claude move; even the new Sonnet 4.6 still does this.
bonesss 14 minutes ago [-]
It’s almost as classic as just short circuiting tests in lightly obfuscated ways.

I could be quite the kernel developer if making the test green was the only criterion.

dazc 19 minutes ago [-]
Launching a new service every 5 minutes is obviously stretching their resources.
candiddevmike 15 minutes ago [-]
Wait till you see the drama around their horrible Terraform provider update/rewrite:

https://github.com/cloudflare/terraform-provider-cloudflare/...

lysace 24 minutes ago [-]
It has been roughly five and a half years since the IPO. The original CTO (John Graham-Cumming) left about a year ago.
jacquesm 20 minutes ago [-]
They coasted on momentum for half a year. I don't even think it says anything negative about the current CTO; it says more about what an exception JGC is relative to what is normal. A CTO leaving would never show up in the stats the next day, since the position is strategic after all. But you'd expect to see the effect after a while. Six months is longer than I would have expected, but short enough that cause and effect are undeniable.

Even so, it is a strong reminder not to rely on any one vendor for critical stuff, in case that wasn't clear enough yet.

dazc 18 minutes ago [-]
I wondered what happened to him?
brcmthrowaway 11 minutes ago [-]
He's on a yacht somewhere
tedd4u 4 minutes ago [-]
For real
Betelbuddy 30 minutes ago [-]
Cloudflare outages are as predictable as the sun coming up tomorrow. It's their engineering culture.

https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

anurag 2 minutes ago [-]
The one redeeming feature of this failure is staged rollouts. As someone advertising routes through CF, we were quite happy to be spared from the initial 25%.
jaboostin 3 minutes ago [-]
Hindsight is 20/20 but why not dry run this change in production and monitor the logs/metrics before enabling it? Seems prudent for any new “delete something in prod” change.
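
A minimal sketch of that dry-run idea, with hypothetical names rather than Cloudflare's actual sub-task: gate the destructive step behind a flag that logs the would-be deletions, so the blast radius is visible in production metrics before anything is withdrawn.

    // Hypothetical dry-run gate for a "delete something in prod" task.
    package main

    import (
        "flag"
        "log"
    )

    type prefix struct{ CIDR string }

    func main() {
        dryRun := flag.Bool("dry-run", true, "log deletions instead of executing them")
        flag.Parse()

        // In the real sub-task this list would come from the API call
        // discussed elsewhere in the thread.
        pending := []prefix{{CIDR: "192.0.2.0/24"}}

        for _, p := range pending {
            if *dryRun {
                log.Printf("dry-run: would delete prefix %s", p.CIDR)
                continue
            }
            log.Printf("deleting prefix %s", p.CIDR) // real deletion would go here
        }
    }
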
boarush 46 minutes ago [-]
While neither I nor the company I work for was directly impacted by this outage, I wonder how long Cloudflare can take these hits and keep apologizing for them. I truly appreciate them being transparent about it, but businesses care more about SLAs and uptime than about the incident report.
llama052 32 minutes ago [-]
I’ll take clarity and actual RCAs over Microsoft’s approach of not notifying customers and keeping their status page green until enough people notice.

One thing I do appreciate about Cloudflare is their actual use of their status page. That’s not to say these outages are okay. They aren’t. However, I’m pretty confident in saying that a lot of providers would have a big paper trail of outages if they were as honest as Cloudflare, or more so. At least that’s what I’ve noticed, especially this year.

boarush 28 minutes ago [-]
Azure straight up refuses to show me if there's even an incident even if I can literally not access shit.

But the last few months have been quite rough for Cloudflare, including a few outages on their Workers platform that didn't quite make the headlines. Can't wait for Code Orange to get to production.

jacquesm 16 minutes ago [-]
Bluntly: they expended that credit a while ago. Those that can will move on. Those that can't have a real problem.

As for your last sentence:

Businesses really do care about the incident reports because they give good insight into whether they can trust the company going forward. Full transparency and a clear path to non-repetition due to process or software changes are called for. You be the judge of whether or not you think that standard has been met.

boarush 11 minutes ago [-]
I might be looking at it differently, but aren't decisions about a given service provider made by management? In my experience, incident reports never reach that level.
blibble 29 minutes ago [-]
is this blog post LLM generated?

the explanation makes no sense:

> Because the client is passing pending_delete with no value, the result of Query().Get(“pending_delete”) here will be an empty string (“”), so the API server interprets this as a request for all BYOIP prefixes instead of just those prefixes that were supposed to be removed. The system interpreted this as all returned prefixes being queued for deletion.

client:

     resp, err := d.doRequest(ctx, http.MethodGet, `/v1/prefixes?pending_delete`, nil)
server:

    if v := req.URL.Query().Get("pending_delete"); v != "" {
        // ignore other behavior and fetch pending objects from the ip_prefixes_deleted table
        prefixes, err := c.RO().IPPrefixes().FetchPrefixesPendingDeletion(ctx)
        if err != nil {
            api.RenderError(ctx, w, ErrInternalError)
            return
        }

        api.Render(ctx, w, http.StatusOK, renderIPPrefixAPIResponse(prefixes, nil))
        return
    }
even if the client had passed a value it would have still done exactly the same thing, as the value of "v" (or anything from the request) is not used in that block
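
To see the stdlib behaviour in isolation, a standalone snippet (not from the post): a value-less query parameter parses as present-but-empty, so a check like v != "" cannot distinguish it from the parameter being absent entirely.

    // Demonstrates how net/url treats "?pending_delete" with no value.
    package main

    import (
        "fmt"
        "net/url"
    )

    func main() {
        u, err := url.Parse("https://api.example.com/v1/prefixes?pending_delete")
        if err != nil {
            panic(err)
        }
        q := u.Query()

        // Get returns "" whether the key is missing or merely value-less,
        // so the quoted v != "" check skips the pending-deletion branch.
        fmt.Printf("Get(%q) = %q\n", "pending_delete", q.Get("pending_delete")) // ""

        // Has (Go 1.17+) distinguishes the two cases.
        fmt.Printf("Has(%q) = %v\n", "pending_delete", q.Has("pending_delete")) // true
    }
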
bretthoerner 23 minutes ago [-]
> even if the client had passed a value it would have still done exactly the same thing, as the value of "v" (or anything from the request) is not used in that block

If they passed in any value, they would have entered the block and returned early with the results of FetchPrefixesPendingDeletion.

From the post:

> this was implemented as part of a regularly running sub-task that checks for BYOIP prefixes that should be removed, and then removes them.

They expected to drop into the block of code above, but since they didn't, they returned all routes.
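
One defensive variant of that handler (a sketch of the general pattern, not Cloudflare's actual fix): treat a bare or unrecognised pending_delete value as a client error instead of silently falling through to the unfiltered listing.

    // Stricter toy handler: ambiguous filter values become a 400.
    package main

    import (
        "fmt"
        "log"
        "net/http"
    )

    func listPrefixes(w http.ResponseWriter, r *http.Request) {
        q := r.URL.Query()
        if q.Has("pending_delete") {
            switch q.Get("pending_delete") {
            case "true", "1":
                fmt.Fprint(w, `[]`) // stand-in for the pending-deletion query
            default:
                http.Error(w, "pending_delete must be true or 1", http.StatusBadRequest)
            }
            return
        }
        fmt.Fprint(w, `["192.0.2.0/24"]`) // stand-in for the unfiltered listing
    }

    func main() {
        http.HandleFunc("/v1/prefixes", listPrefixes)
        log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
    }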

blibble 9 minutes ago [-]
okay so the code which returned everything isn't there

actual explanation: the API server by default returns everything. the client attempted to make a request to return "pending_deletes", but as the request was malformed, the API returned everything. then the client deleted everything.

makes sense now

but that explanation is even worse

because that means the code path was never tested?
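
That failure mode is also why destructive batch jobs often carry a client-side guard. A sketch under assumed names (the cap and fields are made up, not Cloudflare's): refuse to act when the result set looks implausibly broad instead of trusting the filter blindly.

    // Client-side guard for a bulk-deletion sub-task (illustrative only).
    package main

    import (
        "fmt"
        "log"
    )

    type prefix struct {
        CIDR          string
        PendingDelete bool
    }

    // maxDeletionsPerRun is an assumed safety cap, not a real Cloudflare value.
    const maxDeletionsPerRun = 50

    func deletePending(prefixes []prefix) error {
        if len(prefixes) > maxDeletionsPerRun {
            return fmt.Errorf("refusing to delete %d prefixes (cap %d); the filter may be broken",
                len(prefixes), maxDeletionsPerRun)
        }
        for _, p := range prefixes {
            if !p.PendingDelete {
                return fmt.Errorf("prefix %s is not marked pending_delete; aborting run", p.CIDR)
            }
            log.Printf("deleting %s", p.CIDR) // real withdrawal/deletion would go here
        }
        return nil
    }

    func main() {
        // Simulate the incident: the API handed back every prefix, none
        // of them actually marked for deletion.
        everything := []prefix{{CIDR: "192.0.2.0/24"}, {CIDR: "198.51.100.0/24"}}
        if err := deletePending(everything); err != nil {
            log.Println("guard tripped:", err)
        }
    }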

himata4113 20 minutes ago [-]
yep, and no mention that re-advertised prefixes kept being withdrawn again during the entire impact window, even after they shut it down.
bstsb 26 minutes ago [-]
doesn't look AI-generated. even if they have made a mistake, it's probably just from the rush of getting a postmortem out prior to root cause analysis
himata4113 21 minutes ago [-]
This blog post is inaccurate; the prefixes were being revoked over and over. To keep your prefixes advertised, you had to have a script that would re-add them, or else they would be withdrawn again. The way they worded it seems really dishonest.
NinjaTrance 29 minutes ago [-]
The irony is that the outage was caused by a change from the "Code Orange: Fail Small initiative".

They definitely failed big this time.

ssiddharth 25 minutes ago [-]
The eternal tech outage aphorism: It's always DNS, except for when it's BGP.
VirusNewbie 10 minutes ago [-]
If you track large SaaS and cloud uptime, it seems to correlate pretty highly with compensation at big companies. Is Cloudflare getting top talent?
bombcar 5 minutes ago [-]
Based on IPO date and lockups, I suspect top talent is moving on.
henning 11 minutes ago [-]
Sure, vibe-coded slop that has not been properly peer reviewed or tested prior to deployment is leading to major outages, but the point is that they are producing lots of code. More code is good; that means you are a good programmer. Reading code would just slow things down.
dryarzeg 40 minutes ago [-]
DaaS - Downtime as a Service©

Just joking, no offence :)
