I've recently been contemplating the idea of putting a small ALU with each register. Simplest would do just AND, OR, XOR. A bigger one might add and subtract. The idea is that any instruction that clobbered a source operand would only need one read port (and no write port) from the register file. As routing near registers gets more complicated it might make sense to put a bit of execution right there with the data.
Thoughts on this?
Tuna-Fish 12 days ago [-]
That's not how registers work. Register names are not registers. Unless you are in the deepest of embedded, all operations that write to a register allocate a new one, and do not write to their input register.
edit: OoO primer, assuming x86_64 but everything else works the same way too:
Your CPU has hundreds of physical registers. There are 16 register names managed in the frontend. Whenever the frontend sees an instruction that writes to RAX, it allocates a fresh register that does not contain a value, sets it as pending, writes that register number under RAX in the register alias table (RAT), and sends the instruction forward. Once the instruction is in the scheduler, it waits until all its inputs are ready, and then issues to an execution unit. Once it gets executed, it writes its value to the register allocated for it and sets it as ready.
If you have an instruction that uses the same register name as input and output, those are guaranteed not to be the same register. Used registers get garbage collected once no name or pending instruction refers to them anymore.
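To make that concrete, here is a minimal C sketch of the rename step described above. Every name in it (PhysReg, rat, alloc_phys) is invented for illustration, and real hardware does all of this in parallel logic rather than sequential code:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_NAMES 16    /* architectural names: RAX..R15            */
    #define NUM_PHYS  300   /* physical registers: "hundreds" of them   */

    typedef struct {
        uint64_t value;
        bool     ready;     /* false = result still pending             */
        bool     in_use;    /* referenced by the RAT or an in-flight op */
    } PhysReg;

    static PhysReg phys[NUM_PHYS];
    static int     rat[NUM_NAMES];  /* register alias table: name -> phys index
                                       (initialization and the "garbage
                                       collection" of dead registers elided) */

    /* Allocate a fresh, not-yet-ready physical register. */
    static int alloc_phys(void) {
        for (int i = 0; i < NUM_PHYS; i++) {
            if (!phys[i].in_use) {
                phys[i] = (PhysReg){ .value = 0, .ready = false, .in_use = true };
                return i;
            }
        }
        return -1;  /* none free: the frontend would stall here */
    }

    /* Frontend handling of e.g. "add rax, rbx": look up the current mappings
     * of the sources, then point the destination name at a brand-new register.
     * The destination's *old* physical register is untouched, which is why an
     * instruction's input and output are guaranteed to be different registers. */
    void rename_add(int dst_name, int src1_name, int src2_name,
                    int *src1_phys, int *src2_phys, int *dst_phys) {
        *src1_phys = rat[src1_name];
        *src2_phys = rat[src2_name];
        *dst_phys  = alloc_phys();
        rat[dst_name] = *dst_phys;
    }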
LeifCarrotson 12 days ago [-]
As someone deep in an embedded cave, that's illuminating. You have a completely different meaning abstracted on top of the word "register".
Over here, in AVR and MSP430 and Cortex-M0 land, there are about 16 registers, instructions take at most 2 registers, and write to the destination (often also a source argument, ADD R1 R2 does R1 := R1 + R2). There is no 'frontend', no renaming, no aliasing, no garbage collector; what you see is what you get.
I assume that sometime long ago, the words had the original meaning, but they gradually added abstractions that were still mostly indiscernible at each step of the journey until they had completely diverged.
Neywiny 12 days ago [-]
While I share your passion, there's still a front end. The front end vs back end of a cpu core exists as long as the logic for decoding an instruction is separate from that of executing it. In an AVR core, the "instruction decode" and "instruction register" blocks are the front end, and the ALU is the backend. The first image result for "msp430 cpu core block diagram" for me was [1], which has the first block saying "front end". So like I get it, but just because we work on the smallest cores doesn't mean they're alien tech compared to the newest Neoverse or AVX-512 monsters.
[1] https://www.researchgate.net/profile/Ch-Srinivas/publication...
The different meaning is for the word "register name", which is a separate, reified concept distinct from "register". The big difference is that RAX or R1 no longer refers to a register, but to a register name. Registers are still registers.
mystified5016 12 days ago [-]
Yeah I'm deep into AVR, never got into the fancy architectures like x86 and this sounded bananas to me. In my world a register is a register is hardware.
Hell, by sidestepping GCC's register allocation one can substantially increase performance if you arrange your data into registers that your ALU or other peripheral likes best. Prime example, RAM access can be twice as fast if you load your target address into the X/Y/Z register with LD instead of letting GCC emit the LDS instruction. You'd do something like store a pointer to a very important struct that you want to access many times very quickly.
I honestly have no clue if a similar concept exists on modern x86. I'd assume that you are explicitly meant to not care about things at this level and it's all handled behind the scenes. It's crazy how much the x86 CPU does for you, but I find it much more fun to tweak things close to the bare metal.
dzaima 12 days ago [-]
All registers are basically always treated the exact same in OoO-land (though on x86-64 for some instruction forms it can add an extra byte of instruction encoding if there's some operand in r8..r15).
But there's still a massive amount of performance tuning. Some fun examples being how Haswell can only do one FP add per cycle, but two FMAs, so you can sometimes improve throughput by replacing some 'a+b's with 'a*1+b' (at the cost of higher latency). Or how, on older Intel, three-addend 'lea' can execute on only one port (as opposed to two for two-addend 'lea', or 3 or 4 for 'add') and has 3-cycle latency (vs. 1 cycle for two-addend; so two back-to-back 'lea's have lower latency than one three-addend one), but still uses only one port and thus can sometimes be better for throughput.
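A rough C illustration of that first trick, assuming the compiler maps fma() onto a hardware FMA instruction; whether it actually wins depends on the exact core, so treat it as a sketch to benchmark, not a recipe:

    #include <math.h>
    #include <stddef.h>

    /* Plain sum: every iteration needs the FP-add port (one per cycle on
     * Haswell), so even two independent accumulators fight over it. */
    double sum_add(const double *x, size_t n) {
        double a = 0.0, b = 0.0;
        size_t i;
        for (i = 0; i + 1 < n; i += 2) {
            a += x[i];
            b += x[i + 1];
        }
        if (i < n) a += x[i];   /* odd tail */
        return a + b;
    }

    /* Same reduction, but one accumulator goes through fma(x, 1.0, b),
     * i.e. 'a*1+b', which can issue on an FMA port instead -- higher
     * latency per op, potentially higher throughput when the add port
     * is the bottleneck. */
    double sum_fma(const double *x, size_t n) {
        double a = 0.0, b = 0.0;
        size_t i;
        for (i = 0; i + 1 < n; i += 2) {
            a += x[i];
            b = fma(x[i + 1], 1.0, b);
        }
        if (i < n) a += x[i];   /* odd tail */
        return a + b;
    }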
gpderetta 12 days ago [-]
It is not really my speciality, but there are equivalent things in OoO land. Although CPUs are very good at running reasonable code efficiently, peak FP performance is still the realm of hand-written ASM, or at the very least copious use of intrinsics.
It can also be very microarchitecture specific. Because FP code often needs significant unrolling, the number of architectural registers needed to store partial results can be a bottleneck, especially if the compiler doesn't do a perfect job.
ip26 11 days ago [-]
x86 is the same way for programmers. All the renaming and so forth is completely hidden from the program. It is only detectable through performance differences.
Symmetry 12 days ago [-]
Just as importantly, most values on critical dependency chains travel via the bypass network, forwarded from one execution port directly to the next without ever making the round trip to the registers.
phkahler 12 days ago [-]
Thank you, that was a very good explanation of "register renaming". The way you described it, "renaming" is really more like allocation, and is pretty simple - not done on an as-needed basis, just done automatically for every instruction, making it SSA (static single assignment).
Tuna-Fish 11 days ago [-]
Yep. And if you are using clang (or something else that does SSA register allocation), you basically first build the SSA form, compile it down to a fixed register set, and then the CPU reconstructs the SSA form.
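A tiny worked example of that round trip; the assembly and the physical register numbers in the comments are made up for illustration:

    /* One pair of C statements and their life through the SSA round trip. */
    long f(long a, long b, long c) {
        long x = a + b;   /* SSA:  x1 = a + b                                  */
        x = x + c;        /* SSA:  x2 = x1 + c  (every definition gets a name) */
        return x;
        /* After register allocation, x1 and x2 both end up in one register
         * (registers chosen arbitrarily here):
         *     add rax, rbx
         *     add rax, rcx
         * After renaming in the CPU, each write of rax gets its own physical
         * register again, e.g.  p17 = p12 + p13  then  p23 = p17 + p14. */
    }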
Another thing that's often misunderstood is branch prediction -- branch prediction is not looking at branches and picking which way to go. Branch prediction runs easily at least 3 cycles before the CPU can see the instructions, so it operates just on memory addresses, and tries to guess the next address to fetch. Generally, before it's built any statistics, it just guesses "next address". When there are jumps, it builds history out of those and makes decisions based on that. On various architectures, there used to be hinted branches where the predictor could be tweaked to always first attempt to follow a taken branch, and fall back to prediction if that fails a few times. The hints are all no-ops today, simply because there is no way for the predictor to make use of instruction contents -- it doesn't see those until its work is already done.
Findecanor 12 days ago [-]
I definitely think it is worth exploring other models than the traditional randomly accessed register file.
Your idea reminded me a bit of Scheduled Dataflow [0] architecture where every register is a dedicated input to a unit. Instead of an instruction having source and destination register parameters there are only destination registers.
The Mill does it the other way around: There are only source operands and the result is stored in each unit's dedicated output register.
[0] https://www.semanticscholar.org/paper/Scheduled-Dataflow%3A-...
The video where it's revealed the belt in the Mill architecture is completely conceptual is something I randomly think of at night when trying to sleep. Always been fascinated by it.
gumby 12 days ago [-]
A modern core has a bunch of resources (ALUs, register files and so on) to assign/draw upon as dynamically needed. Not every one is needed for every instruction of course. In the old days when there was a single ALU, and maybe a single address calculator, that was that.
Now the resources can be scheduled in a very complex, out of order, and subinstruction fashion. The chip designers guess what the instruction mix will likely be and hopefully make the right call as to how many Xs are required and how many Ys. Too few and operations in flight will wait. Too many and the chip becomes more complex for no benefit, and maybe those unused resources crowd out space for something else useful.
If you stick an ALU on every register you’re guaranteeing to use some area on something not used all the time.
CalChris 12 days ago [-]
Onur Mutlu has a similar (definitely not the same) idea of Processing in Memory. Basically the idea is to put some operations nearer to the data. Your idea is nearer to the register and his is in the memory controller, nearer to memory: https://arxiv.org/pdf/2012.03112
I could see memory getting a vector FPU that takes an entire DRAM row (64Kbit these days) and does things like scalar/vector MAC. Since DRAM is so slow it could be a relatively slow FPU (10 cycles or more). The biggest issue would be standardization. How do you send it instructions and operands? How do you manage the internal state? A standard would be difficult and seems premature given the rapid changes happening even with FP data formats. Oh, looks like that paper goes into some depth on this stuff!
peterfirefly 12 days ago [-]
Registers and ALUs are sort of already coupled like that in practice.
Longer, less wrong version: modern CPUs have forwarding networks with latches in them. Those latches store ALU results. Those results are (usually) the equivalent of the content-to-be of a specific version of a register -- and by register, I actually mean register name.
So, "registers" that are currently active get to live in the forwarding network where all the routing is and "registers" that aren't as active get to live in the register file, away from the ALUs and the forwarding network.
(And the "registers" used in machine instructions are really just register names. There's a lot renaming going on in order to keep many versions (and potential versions) of register values live at the same time to enable superscalar execution and out-of-order execution.)
ip26 12 days ago [-]
The entries in this register file would be larger & slower, which means you will not be able to squeeze in as many. This reduces your performance.
Also, the hardware to distribute any entry as the second source to any other entry is effectively a read port, followed by a write port.
eigenform 12 days ago [-]
Presumably in that kind of machine, a "register" explicitly means "the link between one execution unit and another" and an "instruction" configures how a specific unit is linked to others. There are no "names of registers," only "names of execution units".
1. When you program this machine, you have to compute the schedule at compile-time. Exactly when and how long do you keep certain units linked?
2. You probably have a worse routing problem? - if you want to perform arbitrary instructions in a chain, all execution units need point-to-point physical wires to carry data to/from all the other units that they might need data from (and most of them will necessarily be unused?)
Instead of having a massively distributed network of execution units, it's probably more efficient to have "virtual" links between units: which you implement as shared access to a centralized memory (a "register file").
o11c 12 days ago [-]
That breaks OoO users of the register before the operation.
Also, although the linked article complains about "too many ports", remember that the useful size of the register file is ultimately limited by how many in-flight operations are possible, which is determined by pipeline depth and number of instructions between branches.
phkahler 12 days ago [-]
>> That breaks OoO users of the register before the operation.
How so? The register can be used the same as before but clobber operations don't have to be sent over to a separate ALU and "back".
dzaima 12 days ago [-]
If you have, say, 'add r1, r2, r3; add r2, r2, 1' and want to do the latter instruction in-register-file, you can't reorder them (e.g. if r2 is known before r3), as after having run the second instruction the first one's r2 operand would be nowhere to be found. You'd need to track whether there are any unfinished usages of each register (or maybe the simpler option of just tracking whether it has been used as any operand), which isn't traditionally required.
Perhaps a bigger issue is that, if you have a speculative decision (incl. all loads & stores) between having written a register and the clobbering update, you can't do it in-register either, as that'd make it impossible to roll back.
phkahler 12 days ago [-]
Thanks! This and some of the other comments have been helpful in understanding how these things work in more detail. So conceptually register names should be seen as result names - the result of a load or an operation gets called r2 for example. The ISA only provides so many result names, but the CPU may have any number of physical registers to store results. It's a nice model, and modifying a result in-register would really mess things up. It would destroy a previous result which would require a very different (more complex) approach to scheduling and clean up.
Add and subtract are too expensive to be multiplied.
The thing referenced above effectively reduces pipeline length by one step and is quite useful in scoreboarding-based implementations of OoO instruction issue.
gen3 12 days ago [-]
It’s been a few years, but my understanding is that most time spent by the CPU is waiting for data, not logic. I wonder if there would be a real impact on execution speed
The_Colonel 12 days ago [-]
Most of the crazy parts of CPUs (out-of-order, speculative execution, branch predictors, cache hierarchy) are ways of trying to work around the slow memory. Improving the logic execution can allow going farther speculatively and thus pre-fetch sooner. Compressing instruction encoding can lower the need to fetch instructions.
fulafel 12 days ago [-]
Most of those, except cache, are attempts to work around the bottleneck of single flow of control and thus limited available parallelism.
tsimionescu 12 days ago [-]
Unfortunately all experience shows that both programmers and compilers are much worse at parallelizing code than CPUs are. There have been many attempts at architectures that allowed compilers or programmers to express code in more parallel ways (VLIW architectures, Intel's Itanium, IBM's Cell used in the PS3) and they all categorically failed.
Successful processors offer a sequential execution model, and handle the parallelism internally. Even CUDA is designed somewhat like this: you express your code in a largely linear fashion, and rely on the NVIDIA-created compiler and on the GPU itself to run it in parallel.
adrian_b 12 days ago [-]
CUDA is quite different. In CUDA you do not express parallel code in a linear fashion, a.k.a. sequential fashion, hoping that CUDA will determine the dependence chains and extract them from the sequential code and execute them in parallel.
The main feature of CUDA is that in order to describe an algorithm that is applied to an array, you just write the code that applies the algorithm to an element of that array, like writing only the body of a "for" loop.
Then the CUDA run-time will take care of creating an appropriate number of execution threads, taking into account the physical configuration of the available GPU, e.g. how many array elements can be processed by a single instruction, how many execution threads can share a single processing core, how many processing cores exist in the GPU, and so on. When there are more array elements than the GPU resources can process at once, CUDA will add some appropriate looping construct, to ensure the processing of the entire array.
The CUDA programmer writes the code that processes a single array element, thus informing the CUDA run-time that this code is independent of its replicas that process any other array element, except when the programmer explicitly references other array elements, which is normally avoided.
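In plain C terms (deliberately not real CUDA syntax), the contrast is roughly the following; the runtime and hardware supply the iteration that the second form leaves out:

    /* Ordinary sequential code: you write the whole loop yourself. */
    void scale_all(float *y, const float *x, float a, int n) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i];
    }

    /* CUDA-style: you write only what happens to one element -- "the body
     * of the for loop" -- and declare it independent of every other element.
     * How many copies run at once, and any residual looping when the array
     * is larger than the machine, is the runtime's and hardware's problem. */
    void scale_one(float *y, const float *x, float a, int i) {
        y[i] = a * x[i];
    }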
The task of a CPU able to do non-sequential instruction execution (a.k.a. out-of-order execution) and simultaneous initiation of multiple instructions (a.k.a. superscalar instruction execution) is quite different.
The main problem for such a CPU is the determination of the dependence relationships between instructions, based on examining the register numbers encoded in the instructions. Based on the detected dependencies and on the availability of the operands, the CPU can schedule the execution of the instructions, in parallel over multiple execution units.
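A toy C sketch of that dependency detection, with an invented three-operand instruction encoding; real hardware does these comparisons with parallel comparators against all older in-flight instructions rather than with loops:

    #include <stdio.h>

    /* Invented encoding, for illustration only: dst = f(src1, src2). */
    typedef struct { int dst, src1, src2; } Insn;

    /* For one source register, find the most recent earlier instruction in
     * the window that writes it; that is the producer to wait for. */
    static int producer_of(const Insn *code, int i, int reg) {
        for (int j = i - 1; j >= 0; j--)
            if (code[j].dst == reg)
                return j;
        return -1;   /* value predates the window, i.e. already available */
    }

    /* Register numbers are the only dependency information the CPU gets. */
    void find_deps(const Insn *code, int n) {
        for (int i = 0; i < n; i++)
            printf("insn %d waits on insns %d and %d\n", i,
                   producer_of(code, i, code[i].src1),
                   producer_of(code, i, code[i].src2));
    }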
There is an open research problem: is there any better way to pass the information about the dependencies between instructions from the compiler, which already knows them, to the CPU that runs the machine code, other than by using fake register numbers whose only purpose is to express the dependencies, and which must be replaced for execution with the real numbers of the physical registers by the renaming mechanism of the CPU?
tsimionescu 12 days ago [-]
Agreed, CUDA is very different. But still, it's another example where programmers just write sequential single flow of execution code, and it gets executed in parallel according to "external" rules.
The_Colonel 12 days ago [-]
Out-of-order execution doesn't imply parallelism, it was designed from the beginning to work around the input data availability problem.
In speculative execution and branch predictors, prefetch may seem like just a nice bonus, but given that nowadays CPU performance is largely bottlenecked by memory access, the prefetching that falls out of these techniques will often come out as the dominant performance factor.
tsimionescu 12 days ago [-]
It's still a form of parallelism, that could in principle be written into the program instead of being automatically implemented by the processor.
For example, in hand-crafted assembly programs, it's common to know how long a fetch operation takes, and to manually schedule operations such that they can be executed in parallel with the fetch.
Theoretically a high-level language could also be designed to expose this kind of logic to the programmer. A program in such a language would be expressed as a set of very, very short threads that can be interleaved by the compiler given precise knowledge of instruction timings.
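A common C-level pattern in that spirit is to issue the load for the next iteration before doing the work on the current one; an optimizing compiler may reschedule it anyway, so this is only literal in assembly or on a simple in-order core:

    #include <stddef.h>

    /* The load of x[i + 1] is started before the math on x[i], so the
     * fetch can overlap the computation instead of stalling it. */
    double process(const double *x, size_t n) {
        if (n == 0) return 0.0;
        double acc = 0.0;
        double cur = x[0];
        for (size_t i = 0; i + 1 < n; i++) {
            double next = x[i + 1];   /* start the next fetch early... */
            acc += cur * cur + 1.0;   /* ...while this work runs       */
            cur = next;
        }
        return acc + cur * cur + 1.0;
    }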
fulafel 11 days ago [-]
OoO came about after multi-issue architectures were starved for instructions to execute due to on-chip blockers like execution unit availability, data hazards, pipeline bubbles, register availability, branching, cache ports. You can call those input data availability problems but it's not availability from offchip memory. So in actual history, yes it was for parallelism (keeping multiple execution units busy).
OoO did have the side benefit of possibly executing past a few memory stalls, but those were secondary. OoO reordering resources were sized for addressing the stalls from on-chip timescale things. Today the resources are bigger, but even bigger is the relative memory latency (how many insns you could execute in the time it takes to service a main memory fetch that your pipeline depends on).
_nalply 12 days ago [-]
Would it make sense to compress code and data with something like zstd and let the CPU decompress?
Of course this means a large change of how computers work but perhaps it is possible to make this opt-in (i.e. backwards compatible) for software?
tsimionescu 12 days ago [-]
This would make memory read performance much, much more unpredictable, so it is a no-go from the start. And beyond that, the problem is not one of bandwidth, it is one of latency. This would increase bandwidth sometimes, but it would increase latency always, which is a terrible trade-off. Missed branch predictions would cost even more than they do today, for example.
The_Colonel 12 days ago [-]
This idea isn't about compressing in-flight, but in the instruction cache, so that more instructions will fit into the cache, and you don't need to fetch as often (and thus incur latency) from main memory / L2.
Zstd is impractical, but I can imagine some sort of storage efficient microcode? (current Intel CPUs store original x86 instructions in the L1 instruction cache).
simiones 12 days ago [-]
You then need extra memory to store the decompressed instructions, since you can't predict before running the decompression how many instructions you'll get. And you still have the problem of an unpredictably-sized instruction cache.
Plus, the larger the instruction cache is, the worse every branch mis-prediction is. As far as I know, the size of the instruction cache is not really limited because of space issues; it's limited for precisely this reason.
nullc 12 days ago [-]
Some instruction sets are variable length, there are also things like thumb and the 'compressed' riscv extension.
If you compress each instruction one at a time into a variable number of bits the ability to jump to any instruction is preserved but compression is hurt by not being able to use cross-instruction conditional probabilities and having to pack things into integer numbers of bits.
One could imagine a compression format that had 'jump points' -- compiler-selected locations that it was possible to jump to and decode from, so that you'd only take the above losses at potential jump targets.
You could go further and imagine the instruction set having some kind of constrained short jump forward a limited distance (by simply letting the decoder decode that way), or one that can go back a limited distance without using a jump point, so long as there was no way for control flow to change out from under it.
I wonder what percentage of instructions would need to be jump points in such a system?
But I think this is mostly of academic interest: except in very small embedded devices code size is not a huge performance driver and like anything else you get diminishing returns from further optimizations. Variable length multiple-of-bytes instruction sets like x86 probably get a significant fraction of the potential compression gains and do so without a lot of complexity.
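Purely to make the 'jump point' idea concrete, here is a hypothetical C layout for the side table; every name and field is invented:

    #include <stdint.h>

    /* Branch targets are restricted to positions where the decoder can
     * start from a clean state; everything in between may use
     * cross-instruction context and non-byte-aligned packing. */
    typedef struct {
        uint32_t insn_index;   /* which instruction this jump point is    */
        uint32_t bit_offset;   /* where its encoding starts in the stream */
    } JumpPoint;

    typedef struct {
        const uint8_t   *bits;       /* compressed instruction stream */
        const JumpPoint *points;     /* sorted by insn_index          */
        uint32_t         num_points;
    } CompressedCode;

    /* A branch may only target an instruction with a jump point; decode
     * resumes there with no dependence on earlier bits. */
    static const JumpPoint *find_jump_point(const CompressedCode *c,
                                            uint32_t target) {
        for (uint32_t i = 0; i < c->num_points; i++)
            if (c->points[i].insn_index == target)
                return &c->points[i];
        return 0;  /* not a legal branch target in this scheme */
    }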
pkhuong 12 days ago [-]
x86 is a variable-length encoding where more frequently used instructions tend to have shorter encodings. Thumb doesn't go as far, with only 2 and 4 -byte instructions. You can find old papers on Huffman encoding instructions.
More effective block compression schemes are harder to pull off because of branches.
adgjlsfhk1 12 days ago [-]
The problem here is branches. Since you can jump to pretty much any instruction, every instruction needs to be a valid place to start decoding, which makes the idea untenable.
phkahler 12 days ago [-]
I wasn't really thinking of faster execution speed overall. Just faster for a given number of ports on the register file. It may also eliminate the need for result forwarding hardware, since the registers would just be the ALU output latch.
NohatCoder 12 days ago [-]
It is probably easier to add more functionality to existing ALUs. Various instruction sets have added "free" bonus operations, but one could go a step further and allow any two functions from a wide set to be combined, the intermediate value would not hit the registers, and thus save a store and a load relative to two individual instructions.
sroussey 12 days ago [-]
Why stop there? Move the ALU to the RAM
rcxdude 12 days ago [-]
There are some proposals of that form: basically if you have a RAM interface that can support commands like "fill this range of memory with this pattern" or "add this constant to all values in this range" it can help speed up a decent amount of code, but also drastically reduce power consumption, as the memory interface is a pretty significant source of power drain on mobile systems. It does significantly complicate the interfaces involved, though, so it would require a lot of effort to implement.
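Purely as a sketch of what such a command interface could look like to software -- every name here is invented, and the "hardware" side is written as ordinary C just to show which loop would never cross the memory bus:

    #include <stdint.h>

    enum mem_op { MEM_FILL, MEM_ADD_CONST };

    struct mem_cmd {
        enum mem_op op;
        uint64_t    base;     /* start of the range, 8-byte aligned */
        uint64_t    len;      /* length in bytes                    */
        uint64_t    operand;  /* fill pattern, or constant to add   */
    };

    /* What the memory-side logic would do.  The point of the proposal is
     * that none of this data ever travels over the CPU's memory interface. */
    void mem_execute(struct mem_cmd c, uint64_t *ram) {
        uint64_t first = c.base / sizeof(uint64_t);
        uint64_t words = c.len  / sizeof(uint64_t);
        for (uint64_t i = 0; i < words; i++) {
            if (c.op == MEM_FILL) ram[first + i]  = c.operand;
            else                  ram[first + i] += c.operand;
        }
    }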
There's already an x86-64 extension for this: RAO-INT (remote atomics).
vasco 12 days ago [-]
We've been doing the reverse and bringing the RAM to the ALU.
weinzierl 12 days ago [-]
This has been tried. I think the IBM System/360 could do arithmetic directly in RAM.
_a_a_a_ 12 days ago [-]
err, any refs for that? I really doubt it but could well be wrong. I'm not aware of any mainstream arch that can/could do that, that I can think of.
weinzierl 12 days ago [-]
At least the IBM System/370 (not 360 as I said above) had assembly instructions that took memory addresses directly without a register or immediate being in the instruction.
Here is an example of such an instruction, from http://www.simotime.com/asmins01.htm#AP :
"AP, Add Packed (Decimal)
The data string located at the storage address specified by operand-2 (b2+d2) is added to the data string located at the storage address specified by operand-1 (b1+d1). Operand-2 remains unchanged."
Now, how this was actually done in hardware I don't know, but I would not be surprised if hidden registers were involved. So, maybe this is not really an example that fits the original premise of this subthread, but I think it is interesting nevertheless.