Hello again HN, I'm bunnie! Unfortunately, time zones strike again...I'll check back when I can, and respond to your questions.
dmitrygr 22 minutes ago [-]
very cool. tiny processors everywhere. but be nice to PIO. PIO is good :)
bunnie 4 minutes ago [-]
Agreed! The PIO is great at what it does. I drew a lot of inspiration from it.
dmitrygr 1 minutes ago [-]
What are your thoughts on efficiency? BIO vs PIO implementing, say, 68k 16-bit-wide bus slave. I know i can support 66MHz 68K bus clock with PIO at 300MHz. How much clock speed would BIO need?
mrlambchop 23 minutes ago [-]
I loved this article and had wanted to play with PIO for a long time (or at least, learn from it through playing!).
One thing jumped out here - I assumed CISC inside PIO had a mental model of "one instruction by cycle" and thus it was pretty easy to reason about the underlying machine (including any delay slots etc...).
For this RISC model using C, we are now reasoning about compiled code which has a somewhat variable instruction timing (1-3 cycles) and that introduces an uncertainty - the compiler and understanding its implementation.
I think this means that the PIO is timing-first, as timing == waveform where BIO is clarity-first with C as the expression and then explicit hardware synchronization.
I like both models! I am wondering about the quantum delays however that are being used to set the deadlines - here, human derived wait delays are utilized knowledge of the compiled instructions to set the timing.
Might there not be a model of 'preparing the next hardware transaction' and then 'waiting for an external synchronization' such as an external signal or internal clock, so we don't need to count the instruction cycles so precisely. On the external signal side, I guess the instruction is 'wait for GPIO change' or something, so the value is immediately ready (int i = GPIO_read_wait_high(23) or something) and the external one is doing the same, but synchronizing (GPIO_write_wait_clock( 24, CLOCK_DEF)) as an alternative to the explicit quantum delays.
This might be a shadow register / latch model in more generic terms - prep the work in shadow, latch/commit on trigger.
Anyway, great work Bunnie!
bunnie 4 minutes ago [-]
The idea of the wait-to-quantum register is that it gets you out of cycle-counting hell at the expense of sacrificing a few cycles as rounding errors. But yes, for maximum performance you would be back to cycle counting.
That being said - one nice thing about the BIO being open source is you can run the verilog design in Verilator. The simulation shows exactly how many cycles are being used, and for what. So for very tight situations, the open source RTL nature of the design opens up a new set of tools that were previously unavailable to coders. You can see an example of what it looks like here: https://baochip.github.io/baochip-1x/ch00-00-rtl-overview.ht...
Of course, there's a learning curve to all new tools, and Verilator has a pretty steep curve in particular. But, I hope people give the Verilator simulations a try. It's kind of neat just to be able to poke around inside a CPU and see what it's thinking!
dmitrygr 1 hours ago [-]
> Above is the logic path isolated as one of the longest combination paths in the design, and below is a detailed report of what the cells are.
which is an argument that "fpga_pio" is badly implemented or that PIO is unsuitable for FPGA impls. Real silicon does not need to use a shitton of LUT4s to implement this logic and it can be done much more efficiently and closes timing at higher clocks (as we know since PIO will run near a GHz)
Retr0id 52 minutes ago [-]
PIO is unsuitable for FPGA impls, that's what the article says.
> If you’re thinking about using it in an FPGA, you’d be better off skipping the PIO and just implementing whatever peripherals you want directly using RTL.
dmitrygr 46 minutes ago [-]
Yes, my point is that the article throws a lot of shade at PIO while the real issue is that the author is trying to shove a third-party FPGA reimpl of it into a place it never belonged. PIO itself is a perfectly good design for what it does and where it does it.
bunnie 18 minutes ago [-]
Actually, the PIO does what it does very well! There is no "worse" or "better" - just different.
Because it does what it does so well, I use the PIO as the design study comparison point. This requires taking a critical view of its architecture. Such a review doesn't mean its design is bad - but we try to take it apart and see what we can learn from it. In the end, there are many things the PIO can do that the BIO can't do, and vice-versa. For example, the BIO can't do the PIO's trick of bit-banging DVI video signals; but, the PIO isn't going to be able to protocol processing either.
In terms of area, the larger area numbers hold for both an ASIC flow as well as the FPGA flow. I ran the design through both sets of tools with the same settings, and the results are comparable. However, it's easier to share the FPGA results because the FPGA tools are NDA-free and everyone can replicate it.
That being said, I also acknowledge in the article that it's likely there are clever optimizations in the design of the actual PIO that I did not implement. Still, barrel shifters are a fairly expensive piece of hardware whether in FPGA or in ASIC, and the PIO requires several of them, whereas the BIO only has one. The upshot is that the PIO can do multiple bit-shifts in a single clock cycle, whereas the BIO requires several cycles to do the same amount of bit-shifting. Again, neither good or bad - just different trade-offs.
One thing jumped out here - I assumed CISC inside PIO had a mental model of "one instruction by cycle" and thus it was pretty easy to reason about the underlying machine (including any delay slots etc...).
For this RISC model using C, we are now reasoning about compiled code which has a somewhat variable instruction timing (1-3 cycles) and that introduces an uncertainty - the compiler and understanding its implementation.
I think this means that the PIO is timing-first, as timing == waveform where BIO is clarity-first with C as the expression and then explicit hardware synchronization.
I like both models! I am wondering about the quantum delays however that are being used to set the deadlines - here, human derived wait delays are utilized knowledge of the compiled instructions to set the timing.
Might there not be a model of 'preparing the next hardware transaction' and then 'waiting for an external synchronization' such as an external signal or internal clock, so we don't need to count the instruction cycles so precisely. On the external signal side, I guess the instruction is 'wait for GPIO change' or something, so the value is immediately ready (int i = GPIO_read_wait_high(23) or something) and the external one is doing the same, but synchronizing (GPIO_write_wait_clock( 24, CLOCK_DEF)) as an alternative to the explicit quantum delays.
This might be a shadow register / latch model in more generic terms - prep the work in shadow, latch/commit on trigger.
Anyway, great work Bunnie!
That being said - one nice thing about the BIO being open source is you can run the verilog design in Verilator. The simulation shows exactly how many cycles are being used, and for what. So for very tight situations, the open source RTL nature of the design opens up a new set of tools that were previously unavailable to coders. You can see an example of what it looks like here: https://baochip.github.io/baochip-1x/ch00-00-rtl-overview.ht...
Of course, there's a learning curve to all new tools, and Verilator has a pretty steep curve in particular. But, I hope people give the Verilator simulations a try. It's kind of neat just to be able to poke around inside a CPU and see what it's thinking!
which is an argument that "fpga_pio" is badly implemented or that PIO is unsuitable for FPGA impls. Real silicon does not need to use a shitton of LUT4s to implement this logic and it can be done much more efficiently and closes timing at higher clocks (as we know since PIO will run near a GHz)
> If you’re thinking about using it in an FPGA, you’d be better off skipping the PIO and just implementing whatever peripherals you want directly using RTL.
Because it does what it does so well, I use the PIO as the design study comparison point. This requires taking a critical view of its architecture. Such a review doesn't mean its design is bad - but we try to take it apart and see what we can learn from it. In the end, there are many things the PIO can do that the BIO can't do, and vice-versa. For example, the BIO can't do the PIO's trick of bit-banging DVI video signals; but, the PIO isn't going to be able to protocol processing either.
In terms of area, the larger area numbers hold for both an ASIC flow as well as the FPGA flow. I ran the design through both sets of tools with the same settings, and the results are comparable. However, it's easier to share the FPGA results because the FPGA tools are NDA-free and everyone can replicate it.
That being said, I also acknowledge in the article that it's likely there are clever optimizations in the design of the actual PIO that I did not implement. Still, barrel shifters are a fairly expensive piece of hardware whether in FPGA or in ASIC, and the PIO requires several of them, whereas the BIO only has one. The upshot is that the PIO can do multiple bit-shifts in a single clock cycle, whereas the BIO requires several cycles to do the same amount of bit-shifting. Again, neither good or bad - just different trade-offs.