This is neither praise nor criticism of the current CPU paradigm; it's just something you need to understand if you want the best performance out of our machines.
A different paradigm, like a concatenative-paradigm-based program, might naively be more inclined to compile into code that looks more like what the author tried: jumping between implementations of the stack operators without them actually being "functions". One can imagine processors that would be "happier" with that, and would instead be more bothered by things that look like function returns. But that's not the CPUs we have.
But generalized branch prediction is expensive (multiple entries in the BTB hashed by the previous program flow, and mispredicts if the next function in the chain changes); the point of an RAS is that it’s a very cheap way to keep some 100% predictable branches from using those general resources. Concatenation pretty much requires branches without unique characteristics, so no fast path shortcuts and everything is a bit slower.
A discussion the other day about optimizing dispatch overhead in lightweight-word VMs where JIT isn't allowed raised a brainstormy idea: keep multiple copies of some words, and have the compiler juggle them with awareness of the varied branch-predictor state of their tail-call-to-the-next-word jumps. Sort of a dynamic version of reducing dispatch by identifying hot phrases and creating new multi-word composite words for them.
Could this allow for translating WASM or emulator workloads to something that runs fast with those restrictions?
And C is hardly the first or only procedural language.
Thankfully nowadays we have Compiler Explorer supporting almost every flavour of mainstream compiled languages to dispel myths.
While today's compilers do a lot more transformations, which can result in surprising output, it's still fairly straightforward to map C statements to CPU instructions.
> People often guess that they were created to use the auto-increment and auto-decrement address modes provided by the DEC PDP-11 on which C and Unix first became popular. This is historically impossible, since there was no PDP-11 when B was developed. The PDP-7, however, did have a few `auto-increment' memory cells, with the property that an indirect memory reference through them incremented the cell. This feature probably suggested such operators to Thompson; the generalization to make them both prefix and postfix was his own. Indeed, the auto-increment cells were not used directly in implementation of the operators, and a stronger motivation for the innovation was probably his observation that the translation of ++x was smaller than that of x=x+1.
--- Dennis Ritchie
Additionally, modern processors have execution models that are entirely different from the C machine model, but this is often hidden even from the ISA. For example, modern x86-64 cores execute code by splitting up instructions into micro-instructions, determining data dependencies between them, then queueing them up on any of the available execution units (of various kinds) in parallel - which is about as far from the C model of executing instructions 1 by 1 in a series as it is from Haskell.
But at the same time it's the lowest-level high-level language (that's popular at least).
I'm also not aware of any high-level programming languages that allow access to the CPU's status flags (for instance, to check for overflow).
(there are a couple of interesting 'mid-level' languages for 8-bit processors though, like Millfork: https://github.com/KarolS/millfork)
I'd be all over a proper mid-level language that's closer to modern CPUs than C and has less 'optimization magic'. But this idea doesn't seem to be very popular amongst the compiler writer crowd.
The seeming non-sequitur referring to the ISO C standard was because flohofwoe responded to the rhetorical question as if it were a normal question, leading pjmlp to restate the rhetorical question in a stronger form.
There are two types of projects: those that are so big that any compiler writer will test against them to guard against Hyrum's Law, and those that are not. The former have no need for the C standard, as compiler writers would be loath to break compatibility with them. The latter have a strong need for the C standard, as anything outside the C standard may be broken unknowingly by compiler writers.
> What matters for compiler users is what specific compilers actually implement
What matters for me is the likelihood that a compiler upgrade will break my code. If I stay within the standard, then any such breakage is a bug. If I stray outside the standard, then all I know is that this specific compiler, in this specific context, on this specific machine, for this specific compilation, has produced some machine code.
My home projects are not so large that gcc/clang/msvc authors would test their upgrades against my project, so the bounds of the standard are the limits that I can trust.
> the Windows headers alone are full of MSVC idiosyncrasies
I'd put the Windows headers in the category of sufficiently-large projects that compiler writers would test against them, rather than the other way around.
You'll need to check your code against each new compiler release you want to support anyway, because each release adds a shitton of new warnings which are also not covered by the C standard but should be fixed nonetheless.
“Not as good lately”, doesn’t change that status without another language superseding it, right?
The main alternative paradigms that are ever mooted are LISP machines, which elevate the importance of functions, and maybe stack-based programming languages like Forth, which emphasize the feature of functions seen here (that they keep popping things off the stack).
It's hard to argue that C has led us down to a local minimum because it's just too functional.
In this case the implementation of the ISA stores information on each 'ret' depending on some 'bl' that came before. One can imagine that a different optimization technique which actually leads to a speedup exists. Without a branch-predictor, the program that Matt wrote might've been faster.
Imo, this has nothing to do with paradigms, but how control-flow interacts with the system. This interaction is different between different implementations of an architecture.
Code written for cache locality, and the paradigms that work well with it, for example, only became "winners" after caches were widely implemented. Before that, the cost of sequential array access and random array access was identical.
With caches, a linear search for an element can be faster than a binary search, even though it requires a lot more memory accesses. Thus, the optimal implementation for finding an element in an array is now dependent on size as well. (I.e. after an ordered array reaches a certain size, binary search should become faster on average.)
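Something like this sketch of the size-dependent choice (the name and the cutoff constant are purely illustrative; the real threshold has to be measured on the target CPU):

    // Below some size, a branch-predictable, cache-friendly linear scan over a
    // sorted slice can beat binary_search; above it, binary search wins.
    fn find(sorted: &[u32], needle: u32) -> Option<usize> {
        const LINEAR_CUTOFF: usize = 64; // illustrative only, tune per CPU
        if sorted.len() <= LINEAR_CUTOFF {
            sorted.iter().position(|&x| x == needle)
        } else {
            sorted.binary_search(&needle).ok()
        }
    }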
There’s its very existence. If you’re looking for a general purpose indirect jump, use that.
While it's not something that the example code in the article demonstrates, based on the description of how it confuses the branch predictor, it sounds like having multiple BLs without a matching RET for each would also be a problem.
Also, I'm not sure why you believe that there is any kind of language where this pattern isn't common. Even something like Forth relies heavily on jumping into subroutines and back.
they're so close to what they are in principle.
IMO, C functions are the answer to the problem exposed in the famous quote "goto considered harmful".
assembly is all about GOTOs... C provides a *structured* way to deal with them without getting bored to death (it all becomes way too much, too soon)
but I think (for reasons that I wish I could get into) that in the end, the goto-based assembly code can go up to multiplication, but then with C and their computer-functions one can go beyond exponentiation.
What there is beyond that, I can only name by reference but wouldn't say I understand it... which is tetration.
so riddle me this: why is 2[op]2=4 for all these operations: addition, multiplication, exponentiation, tetration, pentation!?
imma go keep being insane. thxbai
Structured programming is traditionally considered subroutines + conditionals + loops. In its purest form, it also means no branches other than the ones implicit in this list (i.e. not only no goto, but also no return, no break/continue, etc.).
RET is documented as a subroutine return hint in the official ARM architecture reference manual. That's the only reason it has a distinct opcode from BR.
BL is also documented as a subroutine call hint.
This is RISC wishful thinking that makes it sound like BL is not CALL. Maybe in ARM1 it wasn't; it is now.
So were most of the structured programming languages from 1960's.
PL/I, ESPOL, NEWP, ALGOL, Pascal, ...
Most processors have accumulated functionality beyond minimal instructions for the simplification and acceleration of operating system services, hardware interfacing, cryptography, vector & matrix integer and floating-point math, arbitrary precision math, string processing, virtualization (IO and CPU), secure computing, and virtual memory management; just to name a few. :)
A future era of green field development closer to the technological singularity will look across the software-hardware interface boundaries to optimize both silicon (or superconductors) and compilers to generate fast, small, and/or power efficient designs and code without as many limitations. Generative AI and holistic algorithms with enormous computing power will make this possible. It's almost possible now, it's just a few leaps beyond what EDA and compilers are already doing.
An interesting take on this is the (vaporware) Mill CPU, in which the machine code defines EBBs (Extended Basic Blocks), which can only be entered at the top. You cannot express jumping into the middle of a block from the outside, in their ISA.
There's no need for "probably" here. The micro-architectural mechanism is known as a return stack buffer and is generally separate from the branch predictor unit, though the processor may make use of indirect branch prediction entries for returns as well.
 It is, indeed, a tiny little stack of return addresses and indeed, the article hit performance issues by misaligning it. The (Intel chips') RSB is behind the Retbleed vulnerabilities.
Another consequence of the RAS is that co-routines switching is more expensive than they might appear at first. RISC-V has encoding hints to mark call(jal)/returns that are actually co-routine switching but the full cost can't be avoided.
Of course there is always going to be a penalty for stackful coroutines that yield deep into a callstack.
(Real retpolines have a little more magic, naturally.)
What if you set -funsafe-math-optimizations, which allows "optimizations that allow arbitrary reassociations and transformations with no accuracy guarantees"?
So, putting it into https://rust.godbolt.org/z/jT6Mb1K13 , it seems to indeed be using the SIMD instructions.
A newtype wrapper for f32, called something like FastFloat, marked #[repr(transparent)], and if necessary (not sure) have the compiler promise you that you're getting the same in-memory representation as an actual f32.
Implement Sum over FastFloat by having it use the faster SIMD intrinsics for this work to give you an answer, accepting the potential loss of accuracy.
Now, unsafely transmute the f32 slice into a FastFloat slice (in principle this is zero instructions; it just satisfies the type checker) and an ordinary sum() goes real fast because it's now Sum on the slice of FastFloats (rough sketch below).
ETA: The attraction of a newtype plus trait impl is that it's reusable. You could imagine (particularly if it was stable, which your approach isn't yet) packaging up several speed-ups like this in a crate, enabling people to get faster arithmetic where they can afford an accuracy trade-off, without them needing to know anything about SIMD and without (unlike the C or C++ compiler flags) affecting unrelated code where accuracy may be critical.
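A rough sketch of the shape I have in mind (FastFloat and the multi-accumulator Sum impl are just illustrative; a real version might use explicit std::arch intrinsics, this one only does the reassociation that a vectorized sum would do):

    #[repr(transparent)]
    #[derive(Clone, Copy)]
    struct FastFloat(f32);

    impl std::iter::Sum for FastFloat {
        fn sum<I: Iterator<Item = FastFloat>>(iter: I) -> FastFloat {
            // Independent accumulators break the serial dependency chain, which
            // is the same reassociation a SIMD sum performs.
            let mut acc = [0.0f32; 4];
            for (i, FastFloat(x)) in iter.enumerate() {
                acc[i % 4] += x;
            }
            FastFloat(acc.iter().sum())
        }
    }

    fn fast_sum(xs: &[f32]) -> f32 {
        // Safety: FastFloat is #[repr(transparent)] over f32, so the layouts match.
        let xs: &[FastFloat] = unsafe { std::mem::transmute(xs) };
        xs.iter().copied().sum::<FastFloat>().0
    }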
Most hardware architectures (x86 and ARM fall into this category) pick the rule that the payload is taken from the Nth operand (usually the first, sometimes the second) when multiple inputs are NaN. I believe there's some hardware where the lesser of the payloads (converted to an integer) is picked instead. RISC-V dispenses with NaN payload propagation entirely. There is theoretically the ability for hardware to generate a new, unique NaN with the address of the faulting instruction in the payload, but I don't believe any hardware actually does it.
Most programming languages do not generally model NaN with greater fidelity than "there is a NaN value" or maybe "there is a distinction between qNaN and sNaN."
The sign bit is NOT the signalling/quiet bit. Bit 51 (edit: or bit 50 - damn ***** IEEE for not publishing important standards for free public access) is, according to the first result I looked at: https://craftinginterpreters.com/optimization.html
Edit 2 from IEEE 754 (2008 version):
6.2.1 NaN encodings in binary formats
This subclause further specifies the encodings of NaNs as bit strings when they are the results of operations. When encoded, all NaNs have a sign bit and a pattern of bits necessary to identify the encoding as a NaN and which determines its kind (sNaN vs. qNaN). The remaining bits, which are in the trailing significand field, encode the payload, which might be diagnostic information (see above).
All binary NaN bit strings have all the bits of the biased exponent field E set to 1 (see 3.4). A quiet NaN bit string should be encoded with the first bit (d1) of the trailing significand field T being 1. A signaling NaN bit string should be encoded with the first bit of the trailing significand field being 0. If the first bit of the trailing significand field is 0, some other bit of the trailing significand field must be non-zero to distinguish the NaN from infinity. In the preferred encoding just described, a signaling NaN shall be quieted by setting d1 to 1, leaving the remaining bits of T unchanged.
6.3 The sign bit
When either an input or result is NaN, this standard does not interpret the sign of a NaN. Note, however, that operations on bit strings—copy, negate, abs, copySign—specify the sign bit of a NaN result, sometimes based upon the sign bit of a NaN operand. The logical predicate totalOrder is also affected by the sign bit of a NaN operand. For all other operations, this standard does not specify the sign bit of a NaN result, even when there is only one input NaN, or when the NaN is produced from an invalid operation.
When neither the inputs nor result are NaN, the sign of a product or quotient is the exclusive OR of the operands’ signs; the sign of a sum, or of a difference x−y regarded as a sum x+(−y), differs from at most one of the addends’ signs; and the sign of the result of conversions, the quantize operation, the roundToIntegral operations, and the roundToIntegralExact (see 5.3.1) is the sign of the first or only operand. These rules shall apply even when operands or results are zero or infinite.
When the sum of two operands with opposite signs (or the difference of two operands with like signs) is exactly zero, the sign of that sum (or difference) shall be +0 in all rounding-direction attributes except roundTowardNegative; under that attribute, the sign of an exact zero sum (or difference) shall be −0. However, x + x = x − (−x) retains the same sign as x even when x is zero.
When (a×b)+c is exactly zero, the sign of fusedMultiplyAdd(a, b, c) shall be determined by the rules above for a sum of operands. When the exact result of (a × b) + c is non-zero yet the result of fusedMultiplyAdd is zero because of rounding, the zero result takes the sign of the exact result.
Except that squareRoot(−0) shall be −0, every numeric squareRoot result shall have a positive sign.
Ranting edit #755: and while we are at it, -0 is an abomination in the eyes of the Great Architect in the Matrix - it should never have been allowed - perhaps -0 should have been NaN with signalling bits in the exponent (even though that would prevent some language virtual machine optimisations where 53 bits of NaN get used to pack other information, but the win would be compelling because reducing bugs due special cases is huge IMHO). How many developers understand IEEE corner cases: fuck all in my long experience.
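For anyone who wants to poke at the encoding directly, a quick sketch for binary64 (bit 51 is d1, the first bit of the trailing significand field, i.e. the quiet bit in the preferred encoding):

    fn describe(x: f64) -> &'static str {
        let bits = x.to_bits();
        let exp_all_ones = ((bits >> 52) & 0x7ff) == 0x7ff;
        let significand = bits & ((1u64 << 52) - 1);
        if !(exp_all_ones && significand != 0) {
            "not a NaN"
        } else if bits & (1u64 << 51) != 0 {
            "quiet NaN"
        } else {
            "signaling NaN"
        }
    }

    fn main() {
        println!("{}", describe(f64::NAN));      // quiet NaN (on common platforms)
        println!("{}", describe(0.0_f64 / 0.0)); // quiet NaN
        println!("{}", describe(1.0));           // not a NaN
    }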
A random stackoverflow answer without sources seems to confirm your pov. Now I don't know what to believe. How is the sign bit used in NaNs? Would they really waste that bit?
“A quiet NaN bit string should be encoded with the first bit (d1) of the trailing significand field T being 1. A signaling NaN bit string should be encoded with the first bit of the trailing significand field being 0.“
PS: I might be complaining, but I do sincerely thank you for your hard work. I certainly would not want to be on a standards committee, so I sincerely appreciate the efforts of those committed people who fight so hard to make things better. Your comment just goes to show how many corner cases there are to the corner cases!!
Citing the person who came up with posits for anything related to IEEE 754 is a pretty poor decision; he is prone to a lot of misunderstanding here. There are a lot of decent explanations of IEEE 754 (including Wikipedia), so you shouldn't need to resort to the explanations of someone whose major shtick is telling people IEEE 754 sucks.
> How is the sign bit used in NaNs? Would they really waste that bit?
Sign bits have no meaning for NaNs, although some C libraries (e.g., glibc) will distinguish between a NaN with the sign bit set and not set when printing floating-point numbers, which gives the illusion that it matters more than any other bit. Using the sign bit for qNaN-versus-sNaN is a bad idea from a design perspective, since there are a few FP operations explicitly specified to modify only the sign bit (fneg, fabs), and this would mean that you get regular operations that could generate sNaNs, which defeats the design goal of sNaN.
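A quick illustration of that: only the sign-bit operations touch a NaN's sign bit, and that bit is what makes glibc's printf render "-nan" instead of "nan":

    fn main() {
        let nan = f64::NAN;
        let pos = nan.copysign(1.0);  // sign bit cleared
        let neg = nan.copysign(-1.0); // sign bit set
        println!("{} {}", pos.is_sign_negative(), neg.is_sign_negative()); // false true
        println!("{:#018x}", pos.to_bits() ^ neg.to_bits()); // 0x8000000000000000: only the sign bit differs
    }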
"we gave you enough different NaNs to uniquely represent every grain of sand on the planet uniquely"
how does math work on them?
"we have no idea"
how do we tell what type of nan we have?
what should these bajillion values represent?
"I don't know, probably something"
the fact that they didn't just make there be a single NaN and make it equal to itself (and while you're at it make it so the real numbers have a total ordering)
For almost all operations, if any input is a NaN, the result is a NaN. It's pretty well-specified by IEEE 754, the only thing that isn't is what the payload is, but there's no general expectation that the payload is preserved through computation.
> how do we tell what type of nan we have?
> what should these bajillion values represent?
Diagnostic information that tells you which operation caused the NaN in the first place. (This is Kahan's standard argument for why having many NaNs can be useful.)
> the fact that they didn't just make there be a single NaN
What's the alternative? You'd either have to have lots of illegal floating point values, or you'd have to make the number of finite floating-point numbers with the highest exponent value slightly smaller than for all other exponent values, which complicates a good deal of numerical analysis.
> make it equal to itself
Yeah, this is the big, nasty mistake of IEEE 754.
> and while you're at it make it so the real numbers have a total ordering
There is a totalOrder predicate specified by IEEE 754. Representing floating-point comparisons as a partial order is arguably better than giving it a total order, but the lack of reflexivity of equals makes the existing predicates not a partial order.
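Rust exposes it as f64::total_cmp, for example, which is what lets you sort a slice containing NaNs at all:

    fn main() {
        let mut v = [3.0_f64, f64::NAN, -0.0, 0.0, f64::NEG_INFINITY, -f64::NAN];
        v.sort_by(|a, b| a.total_cmp(b));
        // totalOrder puts the negative NaN first and the positive NaN last:
        // [NaN, -inf, -0.0, 0.0, 3.0, NaN]
        println!("{:?}", v);
    }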
Make a number with 3 bits of mantissa, and go add a thousand repetitions of 1. You can get hundreds of different results depending on the order you add them up.
You're describing a non-associative operation.
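Even in plain f64 with no contrived precision, the grouping alone changes the answer:

    fn main() {
        let (a, b, c) = (0.1_f64, 0.2, 0.3);
        println!("{}", (a + b) + c); // 0.6000000000000001
        println!("{}", a + (b + c)); // 0.6
    }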
    void branchLoop(unsigned int condition, unsigned int &sum)
    {
        // put something suitably large here
        unsigned int loopCount = 0x0fffffff;
        unsigned int i;
        // compile with -O0 or this gets optimized away
        for (i = 0; i < loopCount; i++)
            if ((i & condition) == 0)
                sum++; // branch taken only when the masked bits are zero
    }
> for reasons
So, if we had the time and energy, we could probably make all of computing at least 10x faster.
But we don't have the time or energy to dedicate that much effort to every line of code... But perhaps AI does?
The reason people think "Even if you do, you are only going to get a few percentage points" is that it generally takes 5-50x the developer time to optimize such code. If it takes half a day to write naive code to do something like validate UTF-8, it probably takes ~25 workdays to make a fast SIMD version. If you instead spend an extra half a day, there is a good chance you get a 10-50% speedup using normal code.
And most of the speedup is because of tricks to avoid doing branches. There is a great blog post from one of the authors of simdjson discussing this.
I'm on mobile; there is a link to the blog post on the simdjson GitHub repository.
The blog post I mentioned:
Paper: Parsing Gigabytes of JSON per Second
Another related post from Lemire:
Ridiculously fast unicode (UTF-8) validation
Those algorithms are fast. But to put them in perspective: a single x86 CPU can write 64 B per cycle. At 5 GHz, the theoretical maximum bandwidth is 320 GB/s. IIRC, the read bandwidth is twice that.
There are other bottlenecks, and it is very hard to write code that writes on every cycle.
An interesting consequence is that the theoretical maximum bandwidth is logarithmic in the number of cycles. Again, talking about branchless streaming applications.
well, it depends on where you start from. the speedups can become arbitrarily impressive sounding when the starting point is arbitrarily inefficient.
e.g. if you're starting with a python script that wasn't written with performance in mind and has ended up being compute-bound in pure python number crunching code, if you rewrite the thing in C while thinking a little about appropriate data structures and memory allocation/access -- i.e. replacing a festival of python dict lookups and very frequent memory allocation for tiny intermediate results with indexing into preallocated arrays or so on -- it's quite common to see speedups of 500x - 1000x. this is before micro-optimisation, before introducing SIMD or multi-threading or setting the compiler to build for your exact CPU or so on.
If you can't, then it's really a crap shoot as to whether the output is at all valid. If we want to use more opaque processes, I think we need more transparent outputs. If a neural net can produce a machine checkable proof or supply it with the optimisation that's great, but otherwise it's just hot air.
So it was micro-optimization after all!
I thought it would be interesting to compare the behaviour of (very) different AArch64 processors on this code.
I ran your code on an Oracle Cloud Ampere Altra A1:
sum_slice time: [677.45 ns 684.25 ns 695.67 ns]
sum_ptr time: [689.11 ns 689.42 ns 689.81 ns]
sum_ptr_asm_matched time: [1.3773 µs 1.3787 µs 1.3806 µs]
sum_ptr_asm_mismatched time: [1.0405 µs 1.0421 µs 1.0441 µs]
sum_ptr_asm_mismatched_br time: [699.79 ns 700.38 ns 701.02 ns]
sum_ptr_asm_branch time: [695.80 ns 696.61 ns 697.56 ns]
sum_ptr_asm_simd time: [131.28 ns 131.42 ns 131.59 ns]
And on the relatively (very) slow Rockchip RK3399 on OrangePi 4 LTS (1.8GHz Cortex-A72):
sum_slice time: [1.7149 µs 1.7149 µs 1.7149 µs]
sum_ptr time: [1.7165 µs 1.7165 µs 1.7166 µs]
sum_ptr_asm_matched time: [3.4290 µs 3.4291 µs 3.4292 µs]
sum_ptr_asm_mismatched time: [1.7284 µs 1.7294 µs 1.7304 µs]
sum_ptr_asm_mismatched_br time: [1.7384 µs 1.7441 µs 1.7519 µs]
sum_ptr_asm_branch time: [1.7777 µs 1.7980 µs 1.8202 µs]
sum_ptr_asm_simd time: [421.93 ns 422.63 ns 423.30 ns]
I guess absolute numbers are not too meaningful here, but a bit interesting that Ampere Altra is also the fastest of the 3 except in SIMD where M1 wins. I would have expected that with 80 of these cores on die they'd be more power constrained than M1, but I guess not.
Edit: I took the liberty of allowing LLVM to do the SIMD vectorization rather than OP's hand-built code (using the fadd_fast intrinsic and fold() instead of sum()). It is considerably faster still:
sum_slice time: [86.382 ns 86.515 ns 86.715 ns]
sum_slice time: [306.94 ns 306.94 ns 306.95 ns]
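For reference, the shape of that change is roughly this (nightly-only; per the intrinsic's documentation it is undefined behaviour if any input or result is non-finite, hence the unsafe):

    #![feature(core_intrinsics)]

    fn sum_fast(xs: &[f32]) -> f32 {
        // fadd_fast tells LLVM the additions may be reassociated, which is what
        // lets it turn the reduction into SIMD code.
        xs.iter()
            .fold(0.0_f32, |acc, &x| unsafe { std::intrinsics::fadd_fast(acc, x) })
    }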
One potential explanation is that a lot of processors have "meta" predictors. That is, they have multiple distinct predictors, plus a second-level predictor that decides when and how each one is used. This is really useful because some predictors perform very well in certain cases but very poorly in others. Therefore, what may be happening is that the RAS is getting overridden by another structure, since the meta-predictor detects that the RAS is frequently wrong but another predictor is frequently right.
With 10x the number of cores as the M1 Max and a TDP of 250W (maybe 5x? Apple doesn't publish numbers), the average power limit per core is likely significantly less, and the M1 might be able to leverage 'turbo' here for this short benchmark.
Still, this is not really a meaningful benchmark, just interesting.
Compilers and microprocessors are way more complex than they were back in the 80s or 90s, and compiler engineers know way more about how instructions are actually executed than the vast majority of programmers.
If I ask you to write assembler faster than what the compiler emits for one very specific CPU model, you'll pretty much always be able to do that (given enough time).
But as CPUs become more complex, and compilers get better, achieving that win requires more and more "overfitting" to implementation details of the specific hardware, in ways that are not helpful or even counterproductive on older or newer models of the same CPU family.
It's still worth it sometimes, of course. It's just more work and a lower value proposition on average.
I was stunned by the first one, which I first did in C and then moved to ASM. The C code was pretty straightforward. But it was trivial to beat in ASM by something like 120%.
That course taught me how bad compilers (or rather GCC) are at high-level register optimization and using the barrel shifter. Basically all the performance I squeezed out of ASM was because the compiler just wasn't figuring basic stuff out. Mind you, I was (and am) not an ASM expert. That piece of code was the first thing I ever wrote in ASM. And yet I was able to easily beat a decades-old, world-class compiler.
Using intrinsics with the control flow in C or C++ works really well - you get the right instruction selection from the intrinsics, easy to reason about control flow and the compiler deals with register allocation (and possibly instruction scheduling) which are a pain to do by hand.
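The same style works in Rust via std::arch; here's a sketch with AArch64 NEON intrinsics (matching the article's target), where the loop and tail handling are ordinary code and only the instruction selection is explicit:

    #[cfg(target_arch = "aarch64")]
    fn sum_neon(xs: &[f32]) -> f32 {
        use std::arch::aarch64::*;
        let mut chunks = xs.chunks_exact(4);
        let mut acc = unsafe { vdupq_n_f32(0.0) }; // four-lane accumulator
        for chunk in &mut chunks {
            unsafe {
                let v = vld1q_f32(chunk.as_ptr());
                acc = vaddq_f32(acc, v);
            }
        }
        let mut total = unsafe { vaddvq_f32(acc) }; // horizontal add of the lanes
        for &x in chunks.remainder() {
            total += x; // scalar tail
        }
        total
    }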
Assuming prior knowledge of assembly/C but without much experience decompiling or testing speed.
Here's a fun article from the Cloudflare blog that gives an example of using perf to diagnose the performance of a small utility: https://blog.cloudflare.com/when-bloom-filters-dont-bloom/
Matt Godbolt's compiler explorer is also worth checking out: https://godbolt.org/
My guess would be that the author is porting some C code to Rust and making sure not to regress performance along the way (and hopefully trying to increase it). Likely their program was written in Rust and the section they were trying to optimize called some old C code. Sounds like they rewrote the section in Rust, since Rust <-> C FFI calls break out of the happy realm the Rust compiler likes and end up causing a performance hit themselves. You can write inline assembly in Rust using the asm! macro: https://doc.rust-lang.org/reference/inline-assembly.html.
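A minimal sketch of what that looks like (AArch64; the compiler still does register allocation around the asm! block):

    #[cfg(target_arch = "aarch64")]
    fn add_asm(a: u64, b: u64) -> u64 {
        let result: u64;
        unsafe {
            std::arch::asm!(
                "add {res}, {a}, {b}",
                a = in(reg) a,
                b = in(reg) b,
                res = out(reg) result,
            );
        }
        result
    }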
Somewhat OT, but I miss Phil Hartman. You may remember him from shows like Saturday Night Live, The Simpsons and "Planet of the Apes."
A traditional switch/goto loop can thrash the branch predictor. Separating into different tail calling functions gives you more slots and allows the branch predictor to learn relationships between ops.
Not to discount the many other benefits of tail calls.
*Edit: I misspoke slightly; computed gotos can also split the dispatch jump, but less reliably.
Once upon a time(tm), you could write self-modifying code or guess at keeping pipelines occupied by manual instruction reordering, but cache line invalidation and OOOE make these moot.
The problem is that with a pipeline stall (wrong branch predicted) in hyper-deep pipelines, the penalty is enormous: waiting for the other condition calculation to percolate through the pipeline or independent stages.
Processors are optimized for the mainstream, usually the current or last generation of compilers when the processors were designed.
To generate the generally fastest bitcode, it would require an incremental JIT with history that can permute and mutate bitcode from runtime metrics. That's beyond HotSpot(tm), LLVM, or anything of the sort.
And this is why optimizing compilers are some of the most complex programs there are. (or so I have been taught)
The rule is a boiled-down version of the larger notion "Don't optimize by writing your own assembly, because even with domain knowledge of the problem you're trying to solve, you're probably not more clever than the engineer-decades that went into building your compiler toolchain and processor architecture, unless you're an expert in both fields, in which case good luck and shoulder that maintenance burden."
The rule of thumb drops a lot of detail on the ground but is a good first approximation.
Trying to beat the compiler at optimizing normal code is kind of like trying to beat a calculator with paper and pencil - computers are just better at that sort of thing than people are.
One use case is where you want to use bizarre CPU functions (popcount, encryption, load CR3, etc.) that no one's taught the compiler how to generate, although for some of them you might be better off using compiler intrinsics.
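For popcount specifically, you usually don't even need the intrinsic; count_ones() is how you ask the compiler for it from plain Rust, and with the right target features it becomes a single instruction:

    // With e.g. -C target-cpu=native on x86-64, this compiles to popcnt.
    fn bits_set(x: u64) -> u32 {
        x.count_ones()
    }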
Another is when you're dealing with things underneath the language abstraction, like the Boost co-routines mentioned in a link a few comments above. Of course, if the language folks decide to add the capability you want (e.g. C++20 coroutines), you're back to being better off using the compiler.
Finally there are pedagogical reasons, e.g. sometimes I show my classes the world's simplest "Hello World", using a few lines of assembler to invoke the write and exit syscalls.
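For flavour, roughly that example expressed with Rust's inline asm! instead of a standalone .s file (Linux x86-64 ABI assumed: write is syscall 1, exit is 60):

    #[cfg(all(target_arch = "x86_64", target_os = "linux"))]
    fn main() {
        let msg = b"Hello, world!\n";
        unsafe {
            std::arch::asm!(
                "syscall",
                inout("rax") 1usize => _, // SYS_write; rax also receives the return value
                in("rdi") 1usize,         // fd 1 = stdout
                in("rsi") msg.as_ptr(),
                in("rdx") msg.len(),
                out("rcx") _,             // clobbered by syscall
                out("r11") _,             // clobbered by syscall
            );
            std::arch::asm!(
                "syscall",
                in("rax") 60usize,        // SYS_exit
                in("rdi") 0usize,
                options(noreturn),
            );
        }
    }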
OTOH this reminds of another saying, "don't roll your own crypto". But all those "don't" are a bit frustrating.
In a professional setting there's a responsibility to the end user which generally precludes doing these dangerous things - that is, one should feel free to take up woodworking as a hobby but shouldn't offer to build someone's house unless you're a licensed professional.
Based on the content of his recent articles, we could assume he is continuing his development of a JIT for a Rust port of his graphics engine. Given that assumption, I would argue that the compiler writers are lagging here--specifically the lack of dynamic compilation. For example, is there a JIT compiler for the Rust language? I was thinking about reproducing his experiment in SBCL, which does have dynamic compilation--although it wouldn't be a real "apples to apples" comparison because my work machine is an x86_64.
Also obviously the compiler writers (AOT or JIT) need to know about and understand very minute details of how a cpu is behaving. I was responding to a person saying it was worrying that devs need that kind of knowledge and experience by saying that that isn't true because the tools exist (I feel that there's some analogy to what knowledge you need for a car: changing oil, servicing engine, replacing engine, designing an engine...). Once you are writing a JIT you're in the "making the tools" group, so you now need to know more than the vast majority of devs (not just more than the "average" dev) that's inescapable.
I've gained a lot of performance by replacing handwritten assembly functions with plain code versions, just because CPUs and compilers have evolved over the last 15-20 years, while that assembly code is what it is.
There's also a sense in which they're really simple. The input format and output format are (usually) well defined. The user interface is largely printing things to stderr and giving up, possibly in a retry loop when there's an editor involved. The program dependency graph is often quite small so the bug you're looking at is probably in the source code you checked out. Security is not the dominant concern you have elsewhere.
You've got a few compilers (Java, .NET, et. al.) which are capable of re-compiling hot path code during live execution and then seamlessly transitioning to those paths. This recompilation can be based upon the statistics of the live process, so it's almost like a sort of adaptive AI. Which paths are hot in production does not need to be known at compile time with these approaches.
In general I think they aren't that complicated since they have a pass structure that's relatively easy to inspect and helps avoid spaghetti code.
Still, this seems like function inlining. But why not just inline and use a regular branch loop? Is foo() also being called from elsewhere? Is space at a premium?
> The answer is "because this is a didactic example, not code that's trying to go as fast as possible".
You don't want to be forced to trick the compiler into using the SIMD instructions you are aware of so you write an assembly function.
EDIT: Reading the documentation after the fact, it appears that that was what br x30 was - naively, I had interpreted the hex as a fixed offset to a label.