We found a bug in Go's ARM64 compiler (blog.cloudflare.com)

835 points by jgrahamc 122 days ago | 140 comments

Neywiny 122 days ago [-]

That's an incredible find and once I saw the assembly I was right along with them on the debug path. Interestingly it doesn't need to be assembly for this to work, it's just that that's where the split was. The IR could've done it, it just doesn't for very good reasons. So another win for being able to read arm assembly.

Unsure if this would be another way to do it but to save an instruction at the cost of a memory access you could push then pop the stack size maybe? Since presumably you're doing that pair of moves on function entry and exit. I'm not really sure what the garbage collector is looking for so maybe that doesn't work, but I'd be interested to hear some takes on it

Veserv 122 days ago [-]

You would normally use the “LDR Rd, =expr” pseudo-instruction form [1]. For immediates not directly constructible, it puts a copy of the immediate value in a PC-relative memory location, then does a PC-relative load into register.

So that would turn the whole sequence of “add constant to SP” into 2 executable instructions, 1 for constructing immediate and 1 for adding for a total of 8 bytes, and a 4 byte data area for the 17-bit immediate for a total of 12 bytes of binary which is 3 executable instructions worth.

[1] https://developer.arm.com/documentation/dui0801/l/A64-Data-T...

comex 122 days ago [-]

I've usually seen compilers handle large constants with MOV/MOVK sequences (encoding 16 bits of data per 32-bit instruction) instead of loading them from memory. Loading from memory was more common on 32-bit ARM.
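To make the MOV/MOVK idea concrete, here is a rough Go sketch that splits a constant into 16-bit chunks and prints one instruction per non-zero chunk. The function name `movSequence` and the register `x16` are made up for illustration, and real compilers have more tricks (MOVN, bitmask immediates) that this ignores.

```go
package main

import "fmt"

// movSequence returns an illustrative A64 sequence that materializes v in
// reg: one MOVZ for the first non-zero 16-bit chunk, then one MOVK for each
// remaining non-zero chunk. Hypothetical helper, not taken from any compiler.
func movSequence(reg string, v uint64) []string {
	var out []string
	for shift := uint(0); shift < 64; shift += 16 {
		chunk := (v >> shift) & 0xffff
		if chunk == 0 {
			continue // MOVZ already zeroes every bit it does not set
		}
		op := "movk" // MOVK keeps the other bits of the register
		if len(out) == 0 {
			op = "movz" // the first instruction clears the rest of the register
		}
		out = append(out, fmt.Sprintf("%s %s, #0x%x, lsl #%d", op, reg, chunk, shift))
	}
	if len(out) == 0 {
		out = append(out, fmt.Sprintf("movz %s, #0x0, lsl #0", reg))
	}
	return out
}

func main() {
	// 0x12340 needs 17 bits, so it takes two instructions.
	for _, ins := range movSequence("x16", 0x12340) {
		fmt.Println(ins)
	}
}
```

For a 17-bit value that comes to two 4-byte instructions and no data area, versus the load plus literal pool entry of the LDR form.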

pklausler 122 days ago [-]

I'm a little surprised that this bug wasn't fixed in the assembler as a special case for immediate adds to RSP. If the patch was to the compiler only, other instances of the bug could be lurking out there in aarch64 assembly code.

moefh 122 days ago [-]

Would that be wise? The implemented solution uses a temporary register to hold the full value being added to rsp.

I don't know enough about how people use the go assembler, but I imagine it would be very surprising if `add $imm, rsp, rsp` clobbered an unrelated register when `$imm` is large enough. Especially since what's clobbered is the designated "temporary register", which I imagine is used all the time in handwritten go assembly.

pklausler 122 days ago [-]

Some architectures, and I believe aarch64 is one, have scratch registers reserved for being clobbered in special situations required by the assembler.

anyfoo 122 days ago [-]

Not really, or at least not that I know of in the case of arm64. What you have is calling conventions that specify what one function/procedure/whatever can expect both from the caller and the callee's side. I.e. some registers are callee-saved and some are caller-saved, and the latter basically means the called function can treat them as "scratch".

Additionally, they call out interactions with the OS/execution environment. For example, x18 is the "platform register", and it's unspecified what the OS does with it. It's entirely possible that it clobbers it on context switch or during an interrupt or whatever. So don't use that one unless you have a contract with the OS itself.

But locally, i.e. "from instruction to instruction", no such convention exists to my knowledge, and you probably don't want to have registers that pseudo-instructions might trash inadvertently in general, because it means you can't optimally use these registers.

It's possible for pseudo-instructions or generally macros to be documented as, e.g., "this macro uses x3 as a temporary register and trashes it", but in my experience most macros that need additional temporary registers actually ask you to specify them as part of the macro invocation.

E.g. suppose you have a macro "weirdhash" that takes two registers and saves some kind of hash of them in a third register, but that also needs an extra register to perform its work. You would call it with:

    weirdhash x9, x10, x11, x0

Where x0 would be the scratch register you don't care about.

adastra22 121 days ago [-]

There are some architectures that do, but they're all old RISC chips.

immibis 121 days ago [-]

On some architectures there is a register reserved for assembler use, and even registers reserved for kernel use which can be changed in interrupt handlers and not changed back.

anyfoo 121 days ago [-]

Yeah, I mentioned x18 on arm64 as an example for the latter one. Didn't know about the register reserved for assembly use, apparently MIPS indeed had that: $1, also known as $at, is the "assembler temporary".

saagarjha 122 days ago [-]

No, I think that’s just a MIPS thing.

Someone 121 days ago [-]

Is that possible? I think you would have [1] to use a register to build up the immediate value. The assembler cannot/should not default to one, so I think the best one could do is having another macro for ADD that takes that helper register as an argument. That wouldn’t fix other instances in the AArch64 assembly code.

[1] I’m not familiar with AMD64, but maybe, you could use a thread local (edit: wouldn’t work with M:N threads. You’d need a coroutine-local. That would tie the assembler to golang, and thus would, even on that alone, be a very bad idea) or reserve space in the stack frame for it, too, but I don’t see those as realistic options

bloak 122 days ago [-]

> So another win for being able to read arm assembly.

Yes, though that weird stuff with dollars in it is not normal AArch64 assembly!

The article could have mentioned the "stack moves once" rule.

pjmlp 122 days ago [-]

It is due to the Plan 9 Assembly dialect most likely, because it wasn't enough that we already have differences between AT&T and Intel.

https://go.dev/doc/asm

Still, I find it great that Go brought back the 1990s tradition of compiled languages having an assembler as part of their tooling, regardless of the syntax.

Neywiny 122 days ago [-]

I've never heard of that rule (though tbh I'm not allocating > 64KB of stack when I'm in assembly) and it seems Google hasn't either. While I'm sure it makes sense, I don't think I've ever seen that be enforced. At least in C/C++. Maybe it makes more sense for these stack inspecting garbage collectors but I've also heard of ones that just scan the stack without unwinding anything. I did a test asking Google's AI to generate a complicated C function, put it in godbolt, and there's plenty of push push push push ..... Pop Pop Pop Pop going on

mananaysiempre 122 days ago [-]

> While I'm sure [bumping the stack pointer atomically] makes sense, I don't think I've ever seen that be enforced. At least in C/C++.

That’s because the C ABI supports unwinding with a fairly expressive set of tools for describing stack-pointer state on a per-instruction level. Even the simpler Microsoft ABI essentially uses bytecode for that[1]; and on the more complicated Itanium ABI, you get DWARF CFI instructions, which make the correct way to preserve a(n x86) register in the function prologue look like

  push rbx
  .cfi_adjust_cfa_offset 8
  .cfi_rel_offset rbx, 8

which are impossible to miss when reading compiler-generated assembly because of the sheer amount of annoying noise they create.

The Go authors decided to sidestep all of this complexity, which is understandable to a degree, but apparently they did not think through all the ramifications of doing so.

[1] https://learn.microsoft.com/en-us/cpp/build/exception-handli...

dwattttt 122 days ago [-]

MS's ARM64 unwinding ABI looks even more complicated: https://learn.microsoft.com/en-us/cpp/build/arm64-exception-...

mananaysiempre 121 days ago [-]

Ehh I wouldn’t say so (thanks for the correct link for ARM64 though in any case). What you need to be comparing to here is DWARF[1,2] section 6.4, and while it’s not as bad as other parts of DWARF, I still think it’s plenty complicated.

[1] https://dwarfstd.org/doc/DWARF5.pdf#page=171

[2] Slightly modified by psABI[3] section 3.7 for x86-64 or the LSB[4] section 11.6 for ARM64, but at this point that’s a drop in the bucket as far as overall complexity is concerned.

[3] https://gitlab.com/x86-psABIs/x86-64-ABI/-/jobs/artifacts/ma...

[4] https://refspecs.linuxfoundation.org/LSB_4.0.0/LSB-Core-gene...

dwattttt 121 days ago [-]

I was actually looking to point out MS's x64 ABI requires a standardised function epilog since this bug occurred during an epilog, only to find ARM64's epilogues are also described by bytecode (at least at a cursory glance).

JdeBP 122 days ago [-]

You need to look at non-x86 architectures. It was common years ago on MIPS.

* https://jdebp.uk/FGA/function-perilogues.html#StandardMIPS

I wrote up the x86 equivalent of doing just two read-modify-write operations on the stack pointer over 16 years ago.

* https://jdebp.uk/FGA/function-perilogues.html#Standardx86

rcxdude 122 days ago [-]

Did you compile with optimisations? I think GCC will do a bunch of activity on the stack with -O0, but it'll generally coalesce everything into one push/pop per function with optimisations (not because of any rule, but just because it's faster). alloca and other dynamic stack allocation may break this, but normal variables should in pretty much all cases just get turned into one block on the stack (with appropriate re-use of space if variable lifetimes don't overlap)

ori_b 122 days ago [-]

It will generate code to touch each page of the stack, because otherwise a very large stack allocation controlled by users (eg, in the case of a variable sized array) can be turned into a pointer to any location in memory by an attacker. Faulting in each page of the stack turns that into a crash.
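As a rough illustration of "touch each page", here is a Go sketch that only computes which offsets below the current stack pointer such a probe loop would hit. `probeOffsets` is a made-up name, the page size is passed in as an assumption, and actual compilers differ in the details of where and how they probe.

```go
package main

import "fmt"

// probeOffsets returns the distances below the current stack pointer that a
// probing prologue would touch before moving SP down by size bytes: one per
// page, then the far end of the allocation. Illustrative only; assumes
// pageSize > 0.
func probeOffsets(size, pageSize uint64) []uint64 {
	var offs []uint64
	for off := pageSize; off < size; off += pageSize {
		offs = append(offs, off)
	}
	return append(offs, size) // finally touch the end of the new frame
}

func main() {
	// A 10000-byte allocation with 4 KiB pages is probed in three steps,
	// so a guard page cannot be jumped over in one move.
	fmt.Println(probeOffsets(10000, 4096))
}
```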

There was a userspace thread library I came across a long time ago that used variable length arrays to switch between thread stacks; the scheduler would allocate an array of the right size to bump the stack pointer to the different thread's stack.

saagarjha 122 days ago [-]

Wow, that’s horrible.

i80and 121 days ago [-]

The engineers were so preoccupied with whether or not they could that they didn't stop to think if they should

Neywiny 122 days ago [-]

Yes

freep1zza 121 days ago [-]

> Yes, though that weird stuff with dollars in it is not normal AArch64 assembly!

See the AT&T vs Intel syntax since you aren't familiar with assembly:

https://en.wikipedia.org/wiki/X86_assembly_language#Syntax

dpassens 121 days ago [-]

That's an x86 thing, though.

indrora 121 days ago [-]

There are more assembler dialects than I care to remember.

The 2A06 assembler that people who write NES code (and later on SNES/GB/etc) use has some real quirks: $ prefixes a literal hex value but % is binary, # in front of either marks an immediate rather than an address, registers are baked into the opcode (ldx -> load into X), and more.

Playstation folks all just used MIPS dialects which are mostly AT&Tish but the PS2 used an Intel style assembler.

pjmlp 122 days ago [-]

Usually in runtimes like Java and .NET there are safepoints exactly to avoid changing context in the middle of a set of instructions.

andygocke 122 days ago [-]

Yeah but we have codegen bugs in .NET as well. The biggest difference that stood out to me in this write up, is we would have gone straight for “coredump” instead of other investigation tools. Our default mode of investigating memory corruption issues is dumps.

pjmlp 122 days ago [-]

Sure, I have experienced them, e.g. once in 2006 using IBM's JVM implementation with Websphere.

However it is probably not as problematic due to the way Go allows for Assembly being used directly.

While the JVM and CLR don't allow for direct access to Assembly code, Go does, thus I assume expecting safepoints everywhere is not an option, as any subroutine call can land on code that was manually written.

yvdriess 121 days ago [-]

Go users can only insert assembly wrapped in a function call. That might be safety related, I am not entirely sure.

(Well technically there is a way to inject assembly without the function call overhead. That's what https://pkg.go.dev/runtime/internal/atomic is doing. But you will need to modify the runtime and compiler toolchain for it.)

pjmlp 121 days ago [-]

If you look at the docs, they expect the developer to add specific information and use the registers in a specific way, otherwise Go will face runtime issues.

Whereas when you go over CGO, you get a marshaling layer similar to how JNI, P/Invoke work, that take care of those issues.

titzer 122 days ago [-]

I think the right fix is that the compiler should, e.g. load the constant into a register using two moves and then emit a single add. It's one more instruction, but then the adjustment is atomic (i.e. a single instruction). Another option is to do the arithmetic in a temp register and then move it back.
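For anyone wondering why the adjustment gets split at all: the A64 ADD (immediate) form only holds a 12-bit unsigned value, optionally shifted left by 12. A small Go sketch of that constraint follows. The function names and the 0x12340 frame size are invented for illustration and are not what the Go toolchain uses internally.

```go
package main

import "fmt"

// fitsAddImm reports whether v can be encoded in a single A64 ADD
// (immediate): a 12-bit unsigned value, optionally shifted left by 12.
func fitsAddImm(v uint64) bool {
	return v < 1<<12 || (v&0xfff == 0 && v < 1<<24)
}

// splitAddImm splits v (assumed < 1<<24) into the low 12 bits and the
// remaining high part, i.e. the two immediates of a two-ADD sequence.
func splitAddImm(v uint64) (lo, hi uint64) {
	return v & 0xfff, v &^ 0xfff
}

func main() {
	const frame = 0x12340 // hypothetical frame size, larger than 1<<12
	fmt.Println("single ADD possible:", fitsAddImm(frame))

	lo, hi := splitAddImm(frame)
	// Split form: after the first ADD, SP is only partway to its final
	// value, which is the window a preemption can land in.
	fmt.Printf("add sp, sp, #0x%x\n", lo)
	fmt.Printf("add sp, sp, #0x%x, lsl #12\n", hi>>12)
}
```

With the constant built in a scratch register first, the stack pointer itself only ever changes in one instruction.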

huflungdung 121 days ago [-]

[dead]

pengaru 122 days ago [-]

For the impatient, here's the fix: https://github.com/golang/go/commit/f7cc61e7d7f77521e073137c...

cmckn 122 days ago [-]

I noticed this when reviewing the linked issue: https://github.com/golang/go/issues/73259#issuecomment-31004...

Does the Go team have a natural language bot or is this just comment.contains(“backport”) type stuff?

kbolino 122 days ago [-]

The latter: https://github.com/golang/build/blob/master/cmd/gopherbot/go...

(found via https://go.dev/wiki/gopherbot)

etra0 122 days ago [-]

Kinda funny that it requires both "please" and "backport" for it to be considered haha.
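Something along these lines, as a guess at the shape of it. This is a hypothetical keyword check written for this comment, not gopherbot's actual code, and the real matching in the linked file may have additional conditions.

```go
package main

import (
	"fmt"
	"strings"
)

// wantsBackport is a hypothetical stand-in for a keyword-based trigger:
// the comment has to contain both "please" and "backport", in any case.
func wantsBackport(comment string) bool {
	c := strings.ToLower(comment)
	return strings.Contains(c, "please") && strings.Contains(c, "backport")
}

func main() {
	fmt.Println(wantsBackport("@gopherbot please backport this to the release branches"))
	fmt.Println(wantsBackport("we should backport this"))
}
```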

9rx 122 days ago [-]

Although also the former (gabyhelp): https://github.com/golang/oscar/tree/master/internal/gaby

chavi2 122 days ago [-]

One thing I worry about, probably unnecessarily, is anything with a sense of urgency.

HEY GUYS WE JUST FOUND A GOLANG COMPILER BUG AND FATAL PANICS!

Everyone is like “Hmm. I need to fix this now.”

So, 99% probability it’s what it is. 1% it’s some secret defensive thing because there was a bad stupid zero day someone would get fired over or that could leave the world in shambles if uncovered, or maybe something else needed to be swept under the rug, or maybe someone wants to distract while they introduce a new vulnerability.

I don’t think this with CVEs, but when someone’s like “install this patch everybody!” the dim red light flickers on.

jimsmart 121 days ago [-]

It's an open source project — and quite a popular one, at that — and you are literally replying to a comment that specifies the changes made to fix this particular issue — you can see for yourself what is occurring here. Anyone can.

This issue, and the fix, has perfectly good visibility. Even if you personally can't understand the code, plenty of others can and do.

All of which makes your claims seem like quite unnecessary paranoia — to a lot of folk... and I suspect that is probably why your comment is getting heavily downvoted.

pengaru 121 days ago [-]

the account was created 22 hours ago, it's pointless to engage

Vipsy 122 days ago [-]

One thing that often gets missed is how hard it is to even suspect the compiler as the root cause. Most engineers waste hours chasing bugs in their own code because we’re trained to trust our tools. This mindset alone can make these rare compiler bugs much trickier to find.

pjmlp 122 days ago [-]

In the early PC days we suspected them a lot, given that manually written Assembly was still much better in many cases.

I found a bug in Turbo Pascal 6, where if you declare a variable with the same name as the function name, then the result was random garbage.

For those that don't know Pascal, the function name has to be assigned for the result value, so if a local variable with the same name is possible, then you cannot set the return value.

Something like this https://godbolt.org/z/s6srhTW66

    (* In Turbo Pascal 6 this would compile *)

    function Square(num: Integer): Integer;
    var
        Square: Integer;

    begin
        Square := num * num; (* Here the local variable gets used instead *)
    end;

peterfirefly 117 days ago [-]

succ(seg(x)) and pred(seg(x)) turned out to be equivalent of just seg(x) in TP6.

Earlier versions of Turbo Pascal (and Poly Pascal) generated poor code for "... + 1" but better code with succ(...) and doing memory access via memw[s:o] was common for speed for certain kinds of code. Allocating whatever size you needed + 16 guaranteed you had allocated enough to have paragraph aligned allocation (16 instead of 15 so you could just use the segment + 1).

I think it took a day or two to find this bug in some text-mode windowing code I'd written.
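The "+ 16" arithmetic can be checked with a small Go sketch of real-mode segment:offset addressing. It assumes the allocator hands back normalized pointers (offset between 0 and 15), and the helper names are made up for this comment.

```go
package main

import "fmt"

// linear converts a real-mode segment:offset pair to a linear address.
func linear(seg, off uint16) uint32 {
	return uint32(seg)<<4 + uint32(off)
}

// alignedStart returns the paragraph-aligned pointer (seg+1):0 for a block
// allocated at seg:off, plus how many bytes of the block that skips.
// Assumes off is in the range 0..15 (a normalized pointer).
func alignedStart(seg, off uint16) (alignedSeg uint16, skipped uint32) {
	alignedSeg = seg + 1
	return alignedSeg, linear(alignedSeg, 0) - linear(seg, off)
}

func main() {
	for _, off := range []uint16{0, 8, 15} {
		s, skipped := alignedStart(0x1234, off)
		fmt.Printf("%04X:%04X -> %04X:0000, skips %d bytes\n", 0x1234, off, s, skipped)
	}
}
```

The skipped amount is between 1 and 16 bytes, so 15 extra bytes would not be enough when the offset is already 0 and you still bump the segment by one.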

Tor3 122 days ago [-]

In the past it was more common to suspect the compiler, as others mention here. On a minicomputer I worked with in the late eighties, early nineties, I occasionally found errors in the compiler output. This was a Pascal compiler and because of that it didn't take too long to figure out that the code was actually correct and something else must be going on. Then firing up the debugger/tracer and scrutinizing and analyzing what happens in the disassembly.. when the problem was found, send a fax (yes!) to the head designer of the compiler, get a fixed test compiler back on a set of floppies.. went through this several times. I still have a printout somewhere with my pen marks pointing out a bug in the generated code.

SuperQue 122 days ago [-]

Yup, I had an issue filed against an open source project I work on. Was a crazy weird crash.

The reporter actually spent the effort to track it down, turns out it _was_ a Go compiler bug. (https://github.com/golang/go/issues/20427)

kmarc 122 days ago [-]

There are certain professions where the compilation process is (ab)used to optimize to a point where these bugs seemingly surface more often.

In the HFT sphere I haven't talked to a company that hasn't reported (bragged about finding) a super weird gcc/clang bug.

Well, also, at my last job we used a snapshot version of the compiler, bc... Any nanoseconds matters.

hshdhdhehd 121 days ago [-]

In HFT, might you keep the bug fix secret so other HFTs can't benefit from it?

kmarc 121 days ago [-]

I saw both. One of the top firms wanted that, another I worked at we did report (of course with a scratched minimal reproducible example)

The thing is, it's quite unlikely that your competitor hits the exact same bug. The cost of us having to keep upstream patched, tested isn't justified.

Also in HFT world there are some very similar patterns across competing companies, yet, we just saw TernFS coming out from XTX, with not much fear of competitors benefiting from it more than they do.

renewiltord 122 days ago [-]

Great technical blog. Good pathway for narrative, tight examples, description so clear it makes me feel smarter than I am because so easy to follow though the last time I even read assembly seriously was x86 years ago.

Also, fulfills the marketing objective because I cannot help but think that this team is a bunch of hotshots who have the skill to do this on demand and the quality discipline to chase down rare issues.

I assume these are Ampere Altra? I was considering some of those for web servers to fill out my rack (more space than power) but ended up just going higher on power and using Epyc.

quotemstr 122 days ago [-]

This problem strikes me more as a debuginfo generation bug than a "compiler" bug.

> After this change, stacks larger than 1<<12 will build the offset in a temporary register and then add that to rsp in a single, indivisible opcode. A goroutine can be preempted before or after the stack pointer modification, but never during. This means that the stack pointer is always valid and there is no race condition.

Seems silly to pessimize the runtime, even slightly, to account for the partial register construction. DWARF bytecode ought to be powerful enough to express the calculations needed for restoring the true stack pointer if we're between immediate adjustments.

sauercrowd 122 days ago [-]

> This problem strikes me more as a debuginfo generation bug than a "compiler" bug.

But isn't that the same thing here? The bug occurred in their production workflows, not in some specific debug builds, so with that seems pretty reasonable to call it a compiler bug?

quotemstr 122 days ago [-]

Thanks. I think of unwinder information as debuginfo even though, as you point out, it's used outside of debugging contexts all the time. :-)

As for the actual bug:

Unless you're unwinding the stack by walking the linked list of frames threaded through the frame pointer, then each time you unwind a level of the stack, you need to consult a table keyed on instruction pointer to look up how to compute the register contents of the previous frame based on register content of the current frame. One of the registers you can compute this way is the previous frame's stack pointer.

I haven't looked in depth at what the Go runtime is doing exactly, but at a glance, I don't see mention of frame pointers in the linked article, so I'm guessing Go uses the SP-and-unwind-table approach? If so, the real bug here is that the table didn't have separate entries for the two ADDs and so gave incorrect reconstruction instructions for one of them.

If, however, frame pointers are a load-bearing part of the Go runtime, and that runtime failed to update frame pointer (not just the stack pointer) in the contractually mandatory manner, well, that's a codegen bug and needs a codegen fix.

I guess I just don't like, as a matter of philosophy if not practical engineering, having frame pointers at all. Without the frame pointer, the program already contains all the information you need to unwind, at no runtime cost --- you pay for table lookups only when you unwind, not all the time, on straight-line code.

The purist in me doesn't like burning a register for debugging, but you have to use the right tool for the job I guess.
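To illustrate the table-driven idea, here is a minimal Go sketch of a PC-keyed lookup with a separate entry for the point between two ADDs. The table layout, addresses, and names are invented for illustration. This is not Go's actual unwind data format, and I have not checked how the runtime really stores it.

```go
package main

import (
	"fmt"
	"sort"
)

// entry says: from address pc up to the next entry's pc, the caller's stack
// pointer is the current SP plus spDelta. Hypothetical layout.
type entry struct {
	pc      uint64
	spDelta int64
}

// lookup returns the spDelta in effect at pc. table must be sorted by pc.
func lookup(table []entry, pc uint64) (int64, bool) {
	// Find the first entry that starts beyond pc, then step back one.
	i := sort.Search(len(table), func(i int) bool { return table[i].pc > pc })
	if i == 0 {
		return 0, false // pc precedes the first entry
	}
	return table[i-1].spDelta, true
}

func main() {
	// Hypothetical epilogue releasing a 0x12340-byte frame with two ADDs.
	table := []entry{
		{0x1000, 0x12340}, // before the first ADD: whole frame still allocated
		{0x1004, 0x12000}, // between the ADDs: 0x340 already released
		{0x1008, 0},       // after the second ADD: frame fully released
	}
	d, ok := lookup(table, 0x1004)
	fmt.Printf("%#x %v\n", d, ok)
}
```

If the middle entry is missing, a lookup at 0x1004 returns the full-frame delta and the unwinder computes a caller SP that is off by 0x340.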

riobard 122 days ago [-]

What ARM64 machines are you using and what are they used for? Last year you were announcing Gen 12 servers on AMD EPYC (https://blog.cloudflare.com/gen-12-servers/), but IIRC there weren’t any mentions of ARM64. But now it seems you’re running ARM64 in full production.

zamadatix 122 days ago [-]

I'm not Cloudflare, I just read their blog too much. As they hint in the article when mentioning secure boot, they've been deploying Ampere in parallel to AMD for several years now. Purpose wise it seems to be Edge related for efficiency reasons, but maybe they use them for other things too. You can read some more here https://blog.cloudflare.com/designing-edge-servers-with-arm-... and here https://blog.cloudflare.com/arms-race-ampere-altra-takes-on-... along with the original evaluation of Qualcomm here https://blog.cloudflare.com/arm-takes-wing/

riobard 122 days ago [-]

Yeah but those are pretty dated. I was under the impression those old Ampere servers are not efficient compared to modern EPYC anymore. So I’m wondering what their current generation of arm64 servers look like :p

EE84M3i 122 days ago [-]

I seem to recall Cloudflare hosts some of their non-edge compute on public clouds? Like control plane stuff. Could be that.

MarkSweep 122 days ago [-]

I wonder if Go had a mode where you make it single step every instruction and trigger a GC interrupt on every opcode. That would make it easier to find these kinds of bugs.

defleopold 122 days ago [-]

[flagged]

wy1981 121 days ago [-]

Great find and writeup.

As an aside, this is the type of a problem that I think model checkers can't help with. You can write perfect and complicated TLA+/Lean/FizzBee models and even if somehow these models can generate code for you from your correct models you can still run into bugs like these due to platform/compiler/language issues. But, thankfully, such bugs are rare.

jraph 121 days ago [-]

Yep. Model checking is for checking that your design is sound, basically, not at all the implementation.

For the implementation, you can use certified compilers like CompCert [1], but:

- you still have to show your code is correct

- there are still parts of CompCert that are not certified

[1] https://compcert.org/

alberth 122 days ago [-]

I thought Cloudflare was 100% Rust, and x86 (EPYC) these days.

Interesting to hear Go & ARM in use.

surajrmal 122 days ago [-]

I doubt any company is mono language at that scale. Using ARM usually makes sense for a lot of horizontal scaling workloads so it's also not that surprising.

steveklabnik 122 days ago [-]

Cloudflare has long kept Arm builds of everything even when they deployed to x86 only, to make it easy to switch when it made sense.

And yeah, a lot of Rust but also a lot of Go.

dreamcompiler 122 days ago [-]

Always adjust your stack pointer atomically, kids.

whizzter 122 days ago [-]

I guess those that wrote the preemption were on X86 where this doesn't happen thanks to variable length instructions being able to hold the constant and thus relied on the code-gen to do it atomically, then the ARM port had an automatic "split" from a higher level to make things "easy" thus giving us this bug.

Nobody's fault really, but bad results ensued.

Sesse__ 122 days ago [-]

> Nobody's fault really, but bad results ensued.

Uh, the fault is entirely in writing an assembler _that is not an assembler_, but rather something that is _almost_ like one but then 1% like an IR instead. It's an unforced error.

wbl 122 days ago [-]

Assemblers used to do a ton of stuff back in the day

anyfoo 122 days ago [-]

Oh yeah. S/360 assembly almost looks like a high level language sometimes. In MVS, functions of the OS and standard libraries (or its equivalent) were implemented as elaborate macros, with their own invocation syntax, whereas nowadays you'd expect a function that you'd call (dynamically linked or not), with parameters passed in registers.

At least in the 90s, there were actually macro assemblers that supported OOP in assembly. Borland Turbo Assembler 5.0 comes to mind; it was kind of fun.

pjmlp 122 days ago [-]

Those are still around if you go for Assemblers with background in PC culture like NASM, YASM, MASM (still part of MSVC).

By the way, Embarcadero still has Turbo Assembler.

https://docwiki.embarcadero.com/RADStudio/Athens/en/Turbo_As...

Now a thing of the past, but Assemblers for game consoles were also quite powerful in their macro capabilities.

I never liked the UNIX Assembly culture, because naturally as soon as C became a thing, they became the bare minimum required to assemble the generated Assembly out of the C compiler, as another step into the compilation pipeline.

All the niceties of macro assemblers came through the other platforms, like being able to use NASM instead of the platform assembler, not even GNU AS nor clang are that great in their abilities as Assemblers beyond the basic stuff.

whizzter 121 days ago [-]

It doesn't even need to be an error in the "assembler"; it could be in another part that converts from some internal high-level IR. Also, in most cases split ops don't matter for register-manipulating instructions (which you might want generated as compactly as possible), since regular atomics are a separate concern that applies to memory addresses.

Even then, if the code-gen was written BEFORE the preemption, then it was fairly sloppy of those implementing the preemption not to consider the function epilogue. Granted, statically adjusting the stack/frame pointer by more than 4 KB is probably a bit of an edge case.

yvdriess 121 days ago [-]

Hands up, the dozens of us pedants that have used a relaxed atomic add in situations like these. Updating the SP in the most paranoid way possible is the reason that sort of thing exists.

(You cannot express relaxed atomics in golang, but you could technically add support in the compiler for use in the runtime code)

drob518 122 days ago [-]

Exactly what ran through my mind.

brcmthrowaway 122 days ago [-]

I don't get it, how were the machine threads being stopped in the middle of two instructions? This is baremetal, right?

adgjlsfhk1 122 days ago [-]

go uses interrupts for GC notifications

purplesyringa 122 days ago [-]

Signals.

ahoka 122 days ago [-]

That's why the old advice was not to use signals and threads together, if you can avoid it.

Agingcoder 122 days ago [-]

Excellent article as always from the cloudflare blog - engineering without magic infrastructure and ml. One day I will apply !

Compiler bugs are actually quite common ( I used to find several a year in gcc ), but as the author says, some of them only appear when you work at a very large scale, and most people never dive that far.

jgrahamc 122 days ago [-]

What's stopping you applying today?

Agingcoder 122 days ago [-]

Fair question. Location primarily ( nothing in France ), and I’m not sure how ‘we’re looking for people who enjoy doing that kind of thing’( I very much do ) relates to the actual job offers, ie what job offer should I actually apply to.

My background is not networking ( it’s math then hpc then broader stuff ) but I keep stumbling on similar problems ( including a beautiful one related to intel NICs a few years ago which led me into a rabbit hole of ebpf and kernel network layer and which surfaced later on the cloudflare blog), and the only tech company with which this seems to be a regular occurrence is cloudflare. Their space is a bit unknown to me so I guess I’m having a hard time projecting something onto the job offers.

I’d happily chat to someone working for cloudflare though - I guess this would help me understand what it is that actually happens over there. I guess I’m a bit intimidated by this unknown yet really good looking world :-)

sauercrowd 122 days ago [-]

I've interned at Cloudflare back in 2020 and had a great time- would highly recommend!

Can't speak to the locations but the stuff you're interested/experienced in seems extremely likely to overlap with what they do. They do a lot of very deep technical things in all kinds of areas.

my recommendation if you want to talk to someone about it: search github/twitter/linkedin for ppl who work there on stuff you like, and just send them a message and ask for a 20 minute call!

have done it plenty of times, has always been extremely positive

jgrahamc 121 days ago [-]

You can email me jgc@ Cloudflare and I'll forward your details to the right people.

nevon 122 days ago [-]

Similar to the previous commenter, every time I read a blog post from Cloudflare I end up checking the careers page thinking "this is exactly the kind of work I'd like to be doing". Sadly no openings in my country. I'll keep checking!

moomoo11 122 days ago [-]

Pretty sure location is not a factor for these companies. You should apply anyway. I’ve worked with people living in active war zones.

If you have the skills, they have the coin.

They won’t hire some react guy in X country but someone who can find compiler bugs and save them XX+ million dollars a year? Heck yeah.

Degorath 122 days ago [-]

Unfortunately, in 95% cases location IS a factor with bigger companies.

I'm in a similar position where I'd like to do something a lot more interesting, but the places where the interesting companies have offices and the places where I'd be willing to live do not really overlap enough to justify uprooting my life.

(Unless we're talking about "too good to ignore", that's a different story.)

moomoo11 122 days ago [-]

I was explicitly talking about too good to ignore.

Anyone who can optimize a company’s bottom line will be hired.

Like I said, no random average mid react guy or dime a dozen Java developer is getting hired as a remote employee in some flyover country.

But if someone can provide like 50x value then hell yeah..

I thought that was obvious in my message considering we are discussing compiler optimization

Degorath 122 days ago [-]

(Yeah, I'd say your messaging was reasonably clear, but in the context of the whole thread it wasn't obvious whether the poster was putting themselves in that skill bucket.)

I think there's also quite a big spectrum of skill, even when we're talking about compiler optimization and highly skilled software developers. I'd put myself up there, but still I'm no Lars Bak (for whom Google allegedly created an office in Denmark).

ptsneves 122 days ago [-]

How do you rate yourself as higher than dime a dozen? I work as a full remote dev but I am not sure I am anything special. I mean, how do you know that you are objectively good?

moomoo11 122 days ago [-]

Where did I say anything about myself? Sounds like projection or some deep insecurities if you meant it _that_ way.

If you're asking what would constitute someone being special, it would depend on the role and skillset. As I said in my earlier comment, someone who is a beast and can find and fix bugs in compilers is a rare person. Especially if that skillset can help the company save boatloads of money that can be deployed elsewhere.

There are probably only a handful of people in the world who understand and can push the AI landscape forward. A lot of them are Chinese immigrants, and yet OpenAI/Meta/etc are paying them boatloads of money.

As for remote roles, I once worked on a project where we hired some dude for like $500/hr as a contractor because he was one of the few people who knew postgres and oracle rdbms inside out, and we were doing a very important migration.

stronglikedan 122 days ago [-]

With seemingly the whole world rolling out new RTO mandates, location may not have been a factor until recently, but it may be one now.

kccqzy 122 days ago [-]

Low compensation relative to many other companies. (It didn't stop me from applying, but it stopped me from accepting.)

pfdietz 122 days ago [-]

I see something like this and I wonder "what testing methodology would have found this?" It has to be general, not something that would involve knowing what the bug was ahead of time.

syncsynchalt 122 days ago [-]

When your scale is large enough, you move to "what monitoring methodology will find this?"

When you're doing enough transactions you start to see a noise floor of e.g. bit flips from cosmic rays, and looking for issues involves correlating/categorizing possible software failures and distinguishing them from the misbehavior of hardware.

pfdietz 120 days ago [-]

I meant, what testing methodology could the compiler writers have used, so it was caught before it went to users.

The feedback loop here should be: novel bug comes in ==> determine how existing testing was deficient ==> modify the testing in a general way that would have found this bug ==> run these modified tests in the background to see if anything similar was missed. Bugs should be used as indicators that regions (as large as possible) of bug space have been inadequately covered.

javierhonduco 122 days ago [-]

Really enjoyed reading this. Thanks for writing it!

maguro_01 119 days ago [-]

An x86-64 Windows 11 machine trying to access a previously available Website now always produces a Cloudflare "obsolete protocol" error on the ordinary attempt. Al browsers get the same error. Did your fix break something?

maguro_01 119 days ago [-]

Sorry, all browsers.

Bengalilol 122 days ago [-]

I always appreciate articles like this, where you can clearly see the engineer’s way of thinking.

I was just puzzled by the middle part of the article, where they start investigating their code but seem to overlook the fact that it only happens on ARM64.

Still, I understand that it’s professional to proceed step by step logically.

Great article, it was a pleasure reading it!

mixedbit 122 days ago [-]

Hard-to-reproduce bugs often depend on the order of events or on timing. A different architecture can trigger a different order of execution, but this doesn't mean the bug is not in the application.

bradley13 121 days ago [-]

I find it interesting how rare it has become to find a compiler bug. For me, at least, it used to be a regular event.

Even for Java, as widespread as it is, I have made half a dozen reports. None in the last several years, though.

Better testing? The sheer scale of software being produced?

lou1306 121 days ago [-]

Linus's law [1]? When it comes to compilers for mainstream languages, the userbases are so large that they will explore a surprisingly large portion of the compiler's state space.

But definitely, better engineering and QA practices must also help here.

[1] https://en.wikipedia.org/wiki/Linus%27s_law

wat10000 122 days ago [-]

I would have thought that unwinding would use the frame pointer and this wouldn't be a problem.

mperham 122 days ago [-]

The frame pointer was updated non-atomically in two asm ops. An async interruption between the two ops would lead to a corrupt frame pointer.

wat10000 122 days ago [-]

So it was. The article never mentions the frame pointer and I'm familiar with compilers that load the saved value from the stack in the epilog, rather than adjusting it arithmetically. But they do have an assembly listing showing the two-step arithmetic adjustment for both the stack pointer and frame pointer.

But I'm not sure that matters, because the unwind code they show uses the stack pointer rather than the frame pointer anyway.

mperham 122 days ago [-]

Did they ever explain why netlink was involved? Or was that a red herring?

Sesse__ 122 days ago [-]

The stack in that specific function was big enough to trigger the bug.

drob518 122 days ago [-]

Seemed like a red herring. They were able to reproduce it without any libraries. Might have just been netlink forcing the stacks to a certain size and that made the bug visible.

syncsynchalt 122 days ago [-]

The netlink function uses a larger stack than most.

Their repro case required a stack adjustment larger than 1<<12 (4kiB).
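
To make the 1<<12 threshold concrete, here is a minimal Go sketch of why a big stack adjustment cannot always be one instruction. It relies on the A64 `ADD` (immediate) encoding, which takes a 12-bit immediate, optionally shifted left by 12 bits. The helper names `fitsAddImm` and `splitAddImm` and the sample frame sizes are made up for illustration; they are not taken from the Go toolchain or the article.

```go
package main

import "fmt"

// fitsAddImm reports whether v can be the immediate of a single
// A64 ADD (immediate): a 12-bit value, optionally shifted left by 12.
// Hypothetical helper, not part of any real assembler.
func fitsAddImm(v uint64) bool {
	if v < 1<<12 {
		return true // plain imm12
	}
	return v&0xfff == 0 && v>>12 < 1<<12 // imm12 with LSL #12
}

// splitAddImm splits v (assumed to be below 1<<24) into a low part and
// a high part, each of which fits a single ADD immediate on its own.
func splitAddImm(v uint64) (lo, hi uint64) {
	return v & 0xfff, v &^ 0xfff
}

func main() {
	// Sample frame sizes chosen for illustration only.
	for _, v := range []uint64{80, 4095, 4096, 4176, 16<<12 + 80} {
		lo, hi := splitAddImm(v)
		fmt.Printf("%#x: single ADD=%v lo=%#x hi=%#x\n", v, fitsAddImm(v), lo, hi)
	}
}
```

Any size where both `lo` and `hi` are nonzero needs two adds (or a constant built in a scratch register first), and the gap between those two adds is where an interruption can see a half-adjusted stack pointer.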

yalok 122 days ago [-]

Classic problem of non-atomic stack pointer modification.

Used to have a lot of fun with those 3 decades ago.

lordnacho 122 days ago [-]

> This was a very fun problem to debug.

I'm sure it was a relief to find a thorough solution that addressed the root cause. But it doesn't seem plausible that it was fun while it was unexplained. When I have this kind of bug it eats my whole attention.

Something this deep is especially frustrating. Nobody suspects the standard library or the compiler. Devs have been taught from a young age that it's always you, not the tools you were given, and that's generally true.

One time, I actually did find a standard library bug. I ended up taking apart absolutely everything on my side, because of course the last hypothesis you test is that the pieces you have from the SDK are broken. So a huge amount of time is spent chasing the wrong lead when it actually is a fundamental problem.

On top of this, the thing is a race condition, so you can't even reliably reproduce it. You think it's gone like they did initially, and then it's back. Like cancer.

akerl_ 122 days ago [-]

It feels like this comment was almost a purely additive anecdote of your own experience with a similar kind of issue, but you've spoiled it by deciding to tell the author that they're incorrect about how they felt during the process?

Maybe different people find different things fun.

lordnacho 122 days ago [-]

Not saying he's wrong, sometimes the word "fun" connotes something slightly different from what it literally means. "Satisfying" is something I'd use for the end state. Maybe "challenging" for the intermediate state. But while you're in a high-pressure situation that you don't understand, that is rarely "fun" in the literal sense.

You wouldn't pay to be given compiler race condition bugs, right?

klausa 122 days ago [-]

I wouldn't pay to be given any kind of work, but there are some aspects of my job that I find more or less 'fun'.

Hunting bugs that people have given up on or have no ideas on how to tackle is near the top of that list.

Agingcoder 122 days ago [-]

I like these bugs. They’re intricate, technical puzzles, that can take weeks to figure out. You need a proper strategy to figure them out, cannot rely on simple tactics, and when you finally understand what’s going on, it’s immensely satisfying.

This, and now there’s pernosco which makes everything much easier.

Now, under pressure, this is going to be a nightmare unless you have a high tolerance to stress.

akerl_ 122 days ago [-]

Maybe stop digging here and just let it be fun for the author?

a10c 122 days ago [-]

> Not saying he's wrong

https://heinen.dev/ - I’m Thea “Teddy” Heinen (she/her or they/them)!

dylan604 122 days ago [-]

Some people are perverse individuals and actually enjoy debugging very esoteric things. What might be frustrating to you might be the very thing that gets someone else very excited.

commandersaki 122 days ago [-]

Probably just meant satisfying instead of fun. I found a bug in sscanf for the gcc arm toolchain that ships with Ubuntu (and Debian), and it wasn't fun since I had deadlines to deal with. Workaround was to use the official ARM one. But after 2 days, it was satisfying to nail the exact problem and write a regression test.

anyfoo 122 days ago [-]

> I'm sure it was a relief to find a thorough solution that addressed the root cause. But it doesn't seem plausible that it was fun while it was unexplained. When I have this kind of bug it eats my whole attention.

Yeah, and that's fun for me. Some of my most fun bugs to debug have been compiler, or even CPU issues.

secondcoming 122 days ago [-]

It becomes fun when you narrow down to the solution. Before that it's hell.

I don't think I'd be allowed to spend weeks debugging something like this. Credit to Cloudflare's PMs.

maples37 122 days ago [-]

Apparently they've had an "unexplained crashes must have an explanation determined" policy ever since there was a trend of uninvestigated unexplained crashes that were canaries in the mine for a security issue.

https://blog.cloudflare.com/however-improbable-the-story-of-...

> But [the Cloudbleed sensitive information disclosure security incident] wasn’t the only consequence of the bug. Sometimes it could lead to an invalid memory read, causing the NGINX process to crash, and we had metrics showing these crashes in the weeks leading up to the discovery of Cloudbleed. So one of the measures we took to prevent such a problem happening again was to require that every crash be investigated in detail.

Since then, they have a "no crashes go uninvestigated" policy, which for the scale Cloudflare operates at, seems pretty impressive.

jgrahamc 121 days ago [-]

Yes, and we set up all the tooling for that and I would look at the output every single day and keep an eye on what was happening. Any team that didn't fix a crash quickly got a personal message from me. That responsibility has been taken over by others now.

saagarjha 122 days ago [-]

The people who find the fun are often good at identifying when it is the standard library or the compiler.

wat10000 122 days ago [-]

I find this sort of thing to be tremendously fun. It can be frustrating as well, but overall it’s my favorite part of my job. I don’t see why this would be implausible. Different people enjoy different things.

alfalfasprout 122 days ago [-]

> Devs have been taught from a young age that it's always you, not the tools you were given, and that's generally true.

That's not been my experience at all FWIW. Tools get things wrong all the time.

It's simply that more mature, heavily used projects like gcc or clang/llvm generally tend to have had major bugs stamped out by this point. They do still happen though.

More nascent language and compiler ecosystems are more likely to run into issues. Especially languages with runtimes.

LoganDark 122 days ago [-]

Hey; it could've been type-3 fun.

btbuilder 122 days ago [-]

Segfaults with no use of “Unsafe” equivalents in managed languages can give immediate indication it’s not a code problem.

afdbcreid 122 days ago [-]

They explicitly mention there was usage of unsafe, and they weren't sure that's not the cause.

btbuilder 120 days ago [-]

Right, in a 3rd-party library.

rectang 122 days ago [-]

Although I’m good enough at it, like you I hate this kind of debugging experience, and try hard to avoid putting myself in a position where I have to do it. It’s not fun for me at all.

I also don’t like many puzzle games, like Sudoku, because to me they feel like this kind of work. Many colleagues of mine have expressed bafflement that I don’t find such puzzles fun and give me all kinds of grief about how I ought to enjoy them, since they do.

It’s the same thing here, just flipped around: this person seems to enjoy the debugging experience; just let them be. Or recruit them, because that temperament is valuable.

neuroelectron 122 days ago [-]

I've seen only one race condition in my career and it always surprises me how it is even found.

anthk 121 days ago [-]

I miss the Delve debugger for OpenBSD 386 BTW.

gok 122 days ago [-]

The real lesson here should be that doing crazy shit like swizzling the program counter in a signal handler and writing your own assembler is not a good idea.

themafia 122 days ago [-]

Neither of those are "crazy shit." It's just complex because the environment offers specific features like automatic GC with async preemption in a compiled language which pretty much requires it.

Complex engineering isn't something to be avoided by default.

Diggsey 121 days ago [-]

Agree, but I think there is a point to be made here: Go as a language has more subtle runtime invariants that must be upheld compared to other languages, and this has led to a relatively large number of really nasty bugs (eg. there have also been several bugs relating to native function calling due to stack space issues and calling convention differences). By "nasty" I mean ones that are really hard to track down if you don't have the resources that a company like CF does.

To me this points to a lack of verification, testing, and most importantly awareness of the invariants that are relied on. If the GC relies on the stack pointer being valid at all times, then the IR needs a way to guarantee that modifications to it are not split into multiple instructions during lowering. It means that there should be explicit testing of each kind of stack layout, and tests that look at the real generated code and step through it instruction by instruction to verify that these invariants are never broken...
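
A rough Go sketch of what "step through it instruction by instruction" could look like, under heavy simplification. Assumptions, all mine: instructions are modeled as plain deltas to the stack pointer, the invariant is "SP equals either its value before the sequence or its value after it at every instruction boundary", and the instruction strings, the `Rtmp` register, and the frame size are illustrative labels. This is not how the Go toolchain tests itself.

```go
package main

import "fmt"

// instr is one simulated instruction that adds delta to the stack pointer.
// Hypothetical model; real instructions do far more than this.
type instr struct {
	name  string
	delta uint64
}

// firstTornBoundary runs seq starting from sp and returns the index of the
// first instruction after which sp is neither the entry value nor the final
// value, i.e. a boundary where an asynchronous interruption would observe a
// half-adjusted stack pointer. It returns -1 if no such boundary exists.
func firstTornBoundary(sp uint64, seq []instr) int {
	entry, exit := sp, sp
	for _, in := range seq {
		exit += in.delta
	}
	for i, in := range seq {
		sp += in.delta
		if sp != entry && sp != exit {
			return i
		}
	}
	return -1
}

func main() {
	const frame = 16<<12 + 80 // illustrative frame size that needs splitting

	// Adjustment split across two immediates.
	split := []instr{
		{"ADD $80, RSP, RSP", 80},
		{"ADD $(16<<12), RSP, RSP", 16 << 12},
	}
	// Constant built in a scratch register, then a single add to SP.
	single := []instr{
		{"MOVD $frame, Rtmp", 0},
		{"ADD Rtmp, RSP, RSP", frame},
	}

	fmt.Println("split: ", firstTornBoundary(0x40000000, split))
	fmt.Println("single:", firstTornBoundary(0x40000000, single))
}
```

A real version would have to decode actual generated code for every stack layout instead of a hand-written list, but the check itself stays the same: look at the invariant after every single instruction, not just at function entry and exit.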

wat10000 122 days ago [-]

The general wisdom is that you shouldn't do this stuff yourself, and you should instead rely on tried and tested implementations. But sometimes you're the one who provides the tried and tested implementations. Implementing a compiled language is often one of those times.

achierius 122 days ago [-]

Sorry, how exactly do you think compilers are supposed to work if not by 'writing [their] own assembler'? Someone has to write the assembler, and different compilers have different needs.

platinumrad 122 days ago [-]

Those are both completely normal things to do when you're implementing a programming language. For example, the Hotspot JVM uses SIGSEGV to stop the world for garbage collection.

blinkingled 122 days ago [-]

This^. Keith W on Dtrace blog said it a decade ago https://wesolows.dtrace.org/2014/12/29/golang-is-trash/

I like Go but I don't really like their NIH / replace everything with our stuff stance - esp on system tools like assemblers and linkers.

me2too 121 days ago [-]

Great write-up

berz01 122 days ago [-]

[flagged]
