Condor's Cuzco RISC-V Core at Hot Chips 2025 (chipsandcheese.com)

157 points by rbanffy 161 days ago | 49 comments

1024bees 161 days ago [-]

It's nice to see a microarchitecture take a risk, and getting perspective on how this design performs with respect to performance, power and area would be interesting.

Very unlikely to me that this design would have comparable "raw" performance to a design that implements something closer to Tomasulo's algorithm. The assumption that the latency of a load will be an L1 hit is a load-bearing abstraction; I can imagine scenarios where this acts as a "double jeopardy" causing scheduling to lock up because the latency was mispredicted, but one could also speculate that isn't important because the workload is already memory bound.

There's an intuition in computer architecture that designs that lean on "static" instruction scheduling mechanisms are less performant than more dynamic mechanisms for general purpose compute, but we've had decades of compiler development since Itanium "proved" this. Efficient Computer (or whatever their name is) is doing something cool too; it's exciting to see where this will go.

brucehoult 160 days ago [-]

> Very unlikely to me that this design would have comparable "raw" performance to a design that implements something closer to Tomasulo's algorithm.

The point appears to be losing maybe a few percent (5%-7%) of performance, in exchange for saving tens of percent of energy consumption.

> The assumption that the latency of a load will be an L1 hit is a load-bearing abstraction

That's just the initial assumption, that load results will appear 4 cycles after the load is started. If it gets to that +4 time and the result has not appeared then it looks for an empty execution slot starting another 14 cycles along (+18 in total) for a possible L2 cache return.

The original slot result is marked as "poison" so if any instruction is reached that depends on it, it will also be moved further along the RoB and its original slot marked as poison, and so on.

If the dependent instruction was originally at least 18 cycles along from the load being issued then I think it will just pick up the result of the L2 hit (if that happens), and not have to be moved.

If L2 also misses and the result still has not been returned when you get to the moved instruction, a spare execution slot will again be searched for starting at another 20 cycles along (+38 in total), in preparation for an L3 hit.

The article says when searching for an empty slot only a maximum of 8 cycles worth of slots are searched. The article doesn't say what happens if there are no empty execution slots within that 8 cycle window. I suspect the instruction just gets moved right to the end, where all slots are empty.

It also doesn't say what happens if the load doesn't hit in L3. As the main memory latency is under control of the SoC vendor and/or the main board or system integrator (for sure not Condor), I suspect that L3 misses are also moved to the very end.
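
The retry sequence described above can be sketched as a toy model. This is only an illustration of the comment's reading of the article, not Condor's actual logic: the single-issue `busy` set, the function names, and the "move to the end of the queue" fallback are assumptions (the fallback is the commenter's own guess).

```python
# Offsets from load issue at which a dependent is (re)tried, per the comment:
# +4 (L1 hit), +18 (L2 return), +38 (L3 return).
RETRY_OFFSETS = [4, 18, 38]
SEARCH_WINDOW = 8  # cycles of slots searched per attempt, as quoted from the article


def find_free_slot(busy, start, window=SEARCH_WINDOW):
    """Return the first free cycle in [start, start + window), or None."""
    for cycle in range(start, start + window):
        if cycle not in busy:
            return cycle
    return None


def schedule_dependent(busy, issue_cycle, actual_latency, end_of_queue):
    """Hypothetical model: the cycle at which a load's dependent finally executes.

    busy           -- set of cycles whose (single) execution slot is taken
    actual_latency -- cycles until the load data really arrives
    end_of_queue   -- assumed fallback position when no slot or level fits
    """
    for offset in RETRY_OFFSETS:
        slot = find_free_slot(busy, issue_cycle + offset)
        if slot is None:
            return end_of_queue      # assumption: no free slot in the window
        busy.add(slot)               # the slot is consumed even if it ends up poisoned
        if issue_cycle + actual_latency <= slot:
            return slot              # data arrived in time, dependent executes here
        # otherwise this slot is "poison"; try the next cache level's offset
    return end_of_queue              # assumption: missed L3 as well
```

For example, `schedule_dependent(set(), 0, 10, 100)` models an L1 miss that returns from L2: the +4 slot is poisoned and the dependent runs in the slot at +18.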

bri3d 161 days ago [-]

> we've had decades of compiler development since Itanium "proved" this

Sure, but until someone doesn't do "The assumption that the latency of a load will be an L1 hit," they're in trouble for most of what we think of as "general purpose" computing.

I think you get it, but there's this overall trope that the issue with Itanium was purely compiler-related: that we didn't have the algorithms or compute resource to parallelize enough of a single program's control flow to correctly fill the dispatch slots in a bundle. I really disagree with this notion: this might have been _a_ problem, but it wasn't _the_ problem.

Even an amazing compiler which can successfully resolve all data dependencies inside of a single program and produce a binary containing ideal instruction bundling has no idea what's in dcache in the case of an interrupt/context switch, and therefore every load and all of its dependencies risks a stall (or in this case, replay) for a statically scheduled architecture, while a modern out-of-order architecture can happily keep going, even speculatively taking both sides of branches.

The modern approach to optimize datacenter computing is to aggressively pack in context switches, with many execution contexts (processes, user groups/containers, whatever) per guest domain and many guest domains per hypervisor.

Basically: I have yet to see someone successfully use the floor plan they took back from not doing out-of-order to effectively fill in for memory latency in a "general purpose" datacenter computing scenario. Most designers just add more cores, which only makes the problem worse (even adding more cache would be better than more cores!).

VLIW and this kind of design have a place: I could see a design like this being useful in place of Cortex-A or even Cortex-X in a lot of edge compute use cases, and of course GPUs and DSPs already rely almost exclusively on some variety of "static" scheduling already. But as a stated competitor to something like Neoverse/Graviton/Veyron in the datacenter space, the "load-bearing load" (I like your description!) seems like it's going to be a huge problem.

acdha 161 days ago [-]

> we've had decades of compiler development since Itanium "proved" this.

I think an equally large change is the enormous rise of open source and supply chain focus. When Itanium came out, there was tons of code businesses ran which had been compiled years ago, lots of internal reimplementation of what would now be library code, and places commonly didn’t upgrade for years because that was also often a licensing purchase. Between open source and security, it’s a lot more reasonable now to think people will be running optimized binaries from day one and in many cases the common need to support both x86 and ARM will have flushed out a lot of compatibility warts along with encouraging use of libraries rather than writing as many things on their own.

jasonwatkinspdx 161 days ago [-]

This is still using a Tomasulo-like algorithm, it's just been shifted from the backend to the front end. And instructions don't lock up on an L1 miss. Instead the results of that instruction are marked as poisoned, and the front end replays their micro-ops forward in the execution stream once the L1 miss is resolved. As the article points out, this replay is likely to fill out otherwise unused execution slots on general purpose code, as OoO CPUs rarely sustain their full execution width.

It's a smart idea, and has some parallels to the Mill CPU design. The backend is conceptually similar to a statically scheduled VLIW core, and the front end races ahead using its matrix scorecard trying to queue up as much as it can for it vs the presence of unpredictable latencies.

quantummagic 161 days ago [-]

> Mill CPU design

There were some fascinating concepts being explored in that project. It's a shame nothing came of it.

Findecanor 160 days ago [-]

In the last post on their forum a month ago, they claimed that they were alive and making progress, but I dunno ...

What I'm afraid of is that perhaps they have been shifting what their goal is a little too often, which of course would delay their time to market.

For example, I think they have shifted from straightforward fixed-SIMD to scalable vectors of some sort, and last I heard they were talking about AI .. which usually means that there's some kind of support for matrix multiplication.

bri3d 161 days ago [-]

Interesting idea. It's like putting a VLIW compilation pass into the scheduler, but without an intermediate microcode cache like NV Denver did. Without handling memory dependencies / cache hazards, I'm not so sure how well it will do in general-purpose use cases. They don't have the same code locality / second-layer icache problem that Denver had, but data loads are still going to be a mess.

I guess the notion is that data cache misses will basically lead to what could be called "instruction amplification," where an instruction will miss its scheduled time slot and have to be replayed, possibly repeatedly, until its dependencies are available. The article asserts that this is the rough equivalent of leaving execution ports unoccupied in a "traditional" OoO architecture, but I'm not so sure. I'm curious about how well this works in practice; I would worry that cache misses would rapidly multiply into a cascading failure case where the entire pipeline basically stalls and the architecture reverts to in-order level performance - just like most general-purpose VLIW architectures.

gchadwick 161 days ago [-]

Andes (Condor is owned by Andes) seems to get relatively little press vs. other RISC-V outfits. My sense is they've been quietly building a very solid RISC-V CPU business with a great IP portfolio.

This latest core looks very interesting, can't wait to see it hit silicon and see what it can really do!

pclmulqdq 161 days ago [-]

Andes has won a lot of sockets already with their lower-power cores. They have almost become the #1 choice for RISC-V cores.

brucehoult 160 days ago [-]

I think they've been doing more RISC-V deals than SiFive for quite a few years, due largely I think to their proximity to and established relationships with (for NDS32) a lot of the current customers for RISC-V.

sylware 160 days ago [-]

It is amazing that RISC-V is getting performant large implementations in 'a market' completely dominated by others (with extremely toxic IP locks).

It is really refreshing to see that people with significant resources are trying very hard to do the right thing. This is some sort of beacon of hope in the silicon world.

Now, the hard part: will they manage to access the state-of-the-art silicon process?

And we all know the even harder part: migrating the whole software stack, including closed source applications like games...

bee_rider 161 days ago [-]

The static schedule part seems really interesting. They note that it only works for some instructions, but I wonder if it would be possible to have a compiler report “this section of the code can be statically scheduled.” In that case, could this have a benefit for real-time operation? Or maybe some specialized partially real-time application—mark a segment of the program as desiring static scheduling, and don’t allow memory loads, etc, inside there.

clamchowder 161 days ago [-]

(author here) they try for all instructions, just that it's a prediction w/replay because inevitably some instructions like memory loads are variable latency. It's not like Nvidia where fixed latency instructions are statically scheduled, then memory loads/other variable latency stuff is handled dynamically via scoreboarding.

IshKebab 161 days ago [-]

I don't think that would help - the set of instructions that have dynamic latencies is basically fixed. Anything memory-related (loads, stores, cache management, fences, etc.) and complex maths (division, sqrt, transcendental functions, etc.)

So you know what code can be statically scheduled just from the instructions already.

bee_rider 161 days ago [-]

I’m just spitballing. But, what if we had some system that went:

1) load some model and set the system into “ready” mode

2) wait for an event from a sensor

3) when the event occurs, trigger some response

4) do other stuff: bookkeeping, update the model, etc.

5) reset the system to “ready” mode and goto 2

Is it possible we might want some hard time bounds on steps 2 and 3, but be fine with 1, 4, and 5 taking however long? (Assuming the device can be inactive while it is getting ready). Then, we could make sure steps 2 and 3 don’t include any non-static instructions.

IshKebab 161 days ago [-]

Not sure what you're getting at tbh... Do you know about interrupts?

namibj 160 days ago [-]

Given that Nvidia Maxwell / Pascal (mostly GTX 900 / GTX 1000 series) had a bit for each ISA read operand slot that said whether to cache that register file access for reuse by a subsequent instruction, and ARM and RISC-V have thumb/compressed encodings, I'd expect frontend support for blocks of pre-scheduled code (that could be loaded into something like AMD Zen3's μOP cache, as a sizable chunk to allow sufficient loop unrolling for efficiency) to be practical.

Whether the market segment (that could utilize that much special sauce effectively enough to be worth the software engineering) would be large enough to warrant the hardware design and bespoke silicon (which such a project entails)......

I'd probably favor spending the silicon on scatter/gather or maybe some way to span a large gap between calculating an address and using the value fetched from that address, so prefetching wouldn't need to re-calculate the address (expensive) or block off a GPR with the address (precious resource). It could also make load atomicity happen anytime between the address provision (/prefetch-request) and load-completing (destination data register provision).

Prior art: recent (iirc it came with H100) Nvidia async memcpy directly from global to "shared" (user-managed partition of L1D$) memory bypassing the register file.

usrusr 161 days ago [-]

What would the CPU do with the parts not marked as "can be statically scheduled"? I read it as they try it anyways and may get some stalling ("replay") if the schedule was overoptimistic. Not sure how a compiler marking sections could be of help?

IshKebab 161 days ago [-]

Stalling and replay are not the same btw. Stalling is when you wait a bit before continuing, replay is when you try an operation multiple times.

usrusr 161 days ago [-]

So the difference is block everything until the dependency is available and then continue immediately, vs give up on time slots already reserved for downstream dependencies while continuing with those parts in the current schedule that are not blocked and copy the blocked parts at the end of the queue? Sounds like a trade-off that can go one way or the other? But yeah, I was using the term "stalling" in a broader sense, as the superset of both. No idea how incorrect that is.

IshKebab 161 days ago [-]

Yeah I think even traditional OoO designs use replay for missed loads rather than stalling. The performance would be too bad if it actually stalled for every load.

I think stalling is used for rarer more awkward things like changing privilege modes or writing certain CSRs (e.g. satp) where you don't want to have to maintain speculative state.

monocasa 161 days ago [-]

Traditional OoO designs don't stall for loads per se, but will stall for a full ROB that has a chain of dependencies waiting on the results of the load.

IshKebab 160 days ago [-]

Good point, but I guess that's the sort of delay that you can't avoid. If there's literally no work to do until a load is available you have to wait. This design can't avoid that either.

imtringued 160 days ago [-]

Assuming a parallel programming language and an SMT-aware compiler, the CPU could just switch to another block of static instructions while it is waiting.

namibj 160 days ago [-]

You mean like e.g. Nvidia Maxwell?

(There's decent 3rd party documentation from nervana systems from when they squeezed all they could out of f32 dense matrix multiply, at the time substantially faster than Nvidia's cuBLAS library; this is very not exclusive to that architecture, though.)

tliltocatl 160 days ago [-]

> Assuming a parallel programming language

Assuming a parallelizable workload, which is often not the case.

dlcarrier 161 days ago [-]

It's interesting that a high-performance computing core has added instructions for bit manipulation. They're really common on low-power embedded cores, where bit manipulating inputs and outputs is more common. They can save a lot of instructions when needed, though. For example, clearing a bit in a variable, without an express instruction, requires raising two to the power of the bit, inverting the result, anding that with the variable, then writing the result back to the variable. Depending on the language, it looks something like this:

    Variable &=~(2^Bit)

The series of bitwise operators looks more grawlix (https://en.wikipedia.org/wiki/Grawlix) than instructions, as though yelling pejoratives at the bit is what clears it.
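
As a concrete version of the bit-clear idiom above, here is a minimal sketch in Python. Note that in Python and the C family `^` is XOR rather than exponentiation, so "two to the power of the bit" is written as `1 << bit`. The function name `clear_bit` is just illustrative; on a RISC-V core with the Zbs extension, the mask-and-AND sequence corresponds to a single `bclr` instruction.

```python
def clear_bit(value, bit):
    """Return `value` with bit number `bit` (0 = least significant) cleared."""
    mask = 1 << bit          # two to the power of the bit
    return value & ~mask     # invert the mask, then AND it with the variable


flags = 0b1111
flags = clear_bit(flags, 2)  # flags is now 0b1011
```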

bri3d 161 days ago [-]

The bit manipulation instructions are a required part of the RVA23 baseline standard, so we're likely to see them in almost all general purpose RISC-V cores in the future.

Arnavion 161 days ago [-]

Since RVA22 actually. B = Zba + Zbb + Zbs and RVA22 requires those individually.

0x000xca0xfe 161 days ago [-]

Bit manipulation instructions are great for high-performance code, too, because they allow conditional computing without branching.

Some real-world examples in simdjson: https://arxiv.org/pdf/1902.08318
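
To make "conditional computing without branching" concrete, here is a small Python sketch of two common idioms: a mask-based select, and walking the set bits of a word by repeatedly isolating and clearing the lowest one. The function names are made up for illustration, and this is not taken from simdjson's code; it only shows the general kind of trick that count-trailing-zeros and similar instructions make cheap.

```python
def branchless_select(cond, a, b):
    """Return a if cond else b, using a mask instead of an if/else."""
    mask = -int(bool(cond))            # all ones (-1) if cond, else 0
    return (a & mask) | (b & ~mask)


def set_bit_positions(word):
    """List the indices of the set bits in a non-negative integer."""
    positions = []
    while word:
        lowest = word & -word                      # isolate the lowest set bit
        positions.append(lowest.bit_length() - 1)  # its index (a count-trailing-zeros)
        word &= word - 1                           # clear the lowest set bit
    return positions
```

The loop in `set_bit_positions` runs once per set bit rather than once per bit position, so there is no per-bit "is this bit set?" branch to mispredict.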

Findecanor 161 days ago [-]

M68K also had single-bit instructions. Way back when I wrote M68K assembly, I used them a lot.

I'd think there are quite a few data structures and algorithms where there can be benefits of using powers of two, or to count bits in a word.

RISC-V without the B(itmanip) extension is otherwise quite spartan. B also contains many instructions that other ISAs have in their base set, such as address calculation, and/or/xor not, rol/ror, and even some zero/sign-extension ops.

IshKebab 161 days ago [-]

Very interesting design. I guess replaying loads is the really awkward bit. Also how do variable-latency arithmetic instructions work?

daneel_w 161 days ago [-]

"L2$, L3$, I$, D$". Well, OK.

0x000xca0xfe 161 days ago [-]

It's just shorthand for "level 2 cache", "level 3 cache", "instruction cache" and "data cache".

daneel_w 161 days ago [-]

Yes, obviously. It's just the first time I've seen a CPU designer/manufacturer use such relaxed "informality" in a spec sheet.

Findecanor 161 days ago [-]

I follow RISC-V and see it all the time.

CPU manufacturers also aren't using Unicode, using the letter u instead of µ (micro), and the letter A instead of Å (the unit Ångström).

jasonwatkinspdx 161 days ago [-]

The slides are for Hot Chips, which is a very engineering focused venue. It's not your normal marketing stuff.

DiabloD3 161 days ago [-]

I've been seeing it more and more, especially with vendors that don't speak a western language on their spec sheets.

Everyone can tell what L1$ means, but what would L1 缓存 mean?

Brian_K_White 158 days ago [-]

I got it, and hate it. It's approximately like putting an emoticon that looks like money in there.

Hey maybe the emoticon can be a pile of gold coins, then it would still be a cache as well as cash.

bitwize 161 days ago [-]

Ooh, I wonder what strings were put in those BASIC variables...

daneel_w 161 days ago [-]

LET L2$="256 KiB": LET L3$="8 MiB"

bee_rider 161 days ago [-]

Hmm?

daneel_w 161 days ago [-]

Cache is pronounced like cash, which the $ symbol is supposed to allude to.

xxpor 161 days ago [-]

Wow, how have I never put 2 and 2 together on that.

robinsonb5 161 days ago [-]

You're not alone - it took me way longer than it should have done to figure that one out!

starkruzr 161 days ago [-]

leading to the unfortunate abbreviation sometimes drawn on blackboards, "$hit"

grg0 161 days ago [-]

It is an apt abbreviation if you visualize shit tightly packed in a container. And when you thrash the cache, shit hits the fan (and spills to VRAM).

bee_rider 161 days ago [-]

Yes, they are obviously caches. I just didn’t understand your comment.
