Building the perfect memory bandwidth beast (nextplatform.com)
nickdothutton 11 days ago [-]
Some of you might appreciate reading about the original memory bandwidth monster. 1995 vintage.

https://en.wikipedia.org/wiki/Cray_T90

newman314 11 days ago [-]
We just got 4 E1080s delivered. It'll be interesting to see the difference between the existing 980s and the 1080s once the 1080s are up and running.
rbanffy 11 days ago [-]
Where do you work and how much does it cost to work with you? ;-)
modeless 11 days ago [-]
Isn't this what Tesla's Dojo machine is trying to tackle? I wonder about the relative importance of FLOPS vs memory bandwidth in far future AI machines.
justahuman74 11 days ago [-]
What's an example of a workload that scales well with memory bandwidth rather than compute?
haimez 11 days ago [-]
Maybe a low-latency hybrid cache & query engine would benefit: data is mostly in memory, but some computation (e.g. aggregation at different percentiles on the fly) has to happen under low-latency requirements.

Generally speaking, something where:

1. The data is in memory or can be made to be, for the duration of the need.

2. The overall throughput isn't limited by some external I/O operation (e.g. cache servers might seem like "memory hungry" things, but will bottleneck on network throughput before memory throughput [note: latency is definitely not throughput here]).

3. The CPU operations involved once data is fetched from memory are very cheap, but also generate a high volume of sequential writes. Maybe an ideal example is incrementing every integer in a large array by 1, since reads and writes are predictable and SIMD instructions can push the theoretical per-clock CPU throughput even higher.
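A minimal C sketch of that last example (hypothetical code, just to show the access pattern; the function name is made up):

    /* Increment every element of a large array by 1.
       Reads and writes are perfectly sequential and the arithmetic is
       trivial, so for arrays much larger than cache the loop runs at
       whatever speed memory can stream data, not at ALU speed.
       Compilers will typically auto-vectorize this with SIMD. */
    #include <stddef.h>
    #include <stdint.h>

    void increment_all(int32_t *a, size_t n) {
        for (size_t i = 0; i < n; i++)
            a[i] += 1;   /* 4 bytes read + 4 bytes written per trivial add */
    }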

Disclaimer: I might be missing some other scenarios from a lack of creativity. The question you asked got more interesting the longer I thought about it, and I think it might have something to do with why this “Memory Bandwidth Beast” hasn’t yet had its day in the sun.

adamcharnock 11 days ago [-]
Factorio, interestingly. The dev blog has talked about how the game is typically constrained by memory bandwidth.
hansvm 10 days ago [-]
Memory bandwidth is roughly 1000x lower than CPU bandwidth, so as a rule of thumb any algorithm whose work scales linearly with the amount of data being processed will be memory-bandwidth bound, as will any algorithm that can't be structured to do a lot of work on one memory region before moving on to the next.

Examples (for large enough inputs that it's relevant) include shuffling, sorting, k-means clustering, branch-and-bound sudoku solving, vector addition, dot products, and so on.
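For instance, a dot product (hypothetical sketch, not a tuned implementation) does one multiply-add per element but has to stream 16 bytes of operands for it, so for vectors much larger than cache the memory system, not the FPU, sets the speed limit:

    #include <stddef.h>

    double dot(const double *x, const double *y, size_t n) {
        double acc = 0.0;
        for (size_t i = 0; i < n; i++)
            acc += x[i] * y[i];   /* 1 multiply-add per 16 bytes of memory traffic */
        return acc;
    }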

Moreover, writing a particular piece of code is often easier if you ignore memory bandwidth as a constraint. The classic example is matrix multiplication -- it can be structured such that even disk bandwidth isn't relevant compared to CPU bandwidth, but doing so is a little fiddly compared to the naive n^2 dot products approach, so writing it yourself usually results in a memory bandwidth bound solution for large matrices.
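A rough illustration of that contrast (hypothetical, untuned C; the block size is arbitrary): the naive version keeps re-fetching operands from DRAM, while the blocked version works on tiles small enough to stay in cache, so each element makes the trip from DRAM far fewer times.

    #include <stddef.h>

    /* Naive triple loop: for large n, the column of B is re-read from
       DRAM for every row of A, so this ends up memory-bandwidth bound. */
    void matmul_naive(size_t n, const double *A, const double *B, double *C) {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++) {
                double s = 0.0;
                for (size_t k = 0; k < n; k++)
                    s += A[i*n + k] * B[k*n + j];
                C[i*n + j] = s;
            }
    }

    /* Blocked version: work on BS x BS tiles that fit in cache, so each
       element is fetched from DRAM roughly n/BS times instead of n times. */
    #define BS 64
    void matmul_blocked(size_t n, const double *A, const double *B, double *C) {
        for (size_t i = 0; i < n * n; i++)
            C[i] = 0.0;
        for (size_t ii = 0; ii < n; ii += BS)
            for (size_t kk = 0; kk < n; kk += BS)
                for (size_t jj = 0; jj < n; jj += BS)
                    for (size_t i = ii; i < ii + BS && i < n; i++)
                        for (size_t k = kk; k < kk + BS && k < n; k++) {
                            double a = A[i*n + k];
                            for (size_t j = jj; j < jj + BS && j < n; j++)
                                C[i*n + j] += a * B[k*n + j];
                        }
    }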

The same goes for writing two passes over your data rather than one mega-loop, choosing classic k-means rather than one of its approximations (when an approximation would be appropriate), or not enforcing sortedness at some reasonable boundary and having to do additional passes over your data. It's easy to write code that hoovers up way more bandwidth than it needs to, and often the faster algorithms that come out don't do anything different except access the right data at the right time to reduce that pressure, like a trinity rotation [0].
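A hypothetical sketch of the two-passes point: both functions do the same arithmetic, but the first streams the whole array through memory twice while the fused version streams it once.

    #include <stddef.h>

    /* Two passes: easy to write, but the array crosses the memory bus twice. */
    void scale_then_shift(double *a, size_t n, double s, double b) {
        for (size_t i = 0; i < n; i++) a[i] *= s;
        for (size_t i = 0; i < n; i++) a[i] += b;
    }

    /* Fused mega-loop: same arithmetic, roughly half the memory traffic. */
    void scale_and_shift(double *a, size_t n, double s, double b) {
        for (size_t i = 0; i < n; i++) a[i] = a[i] * s + b;
    }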

Caveat: Benchmark everything, especially as you're building intuition. Trying to fix what you think is a memory bandwidth issue can result in pipeline stalls and all sorts of fun things, especially when your server has more and faster caches than your dev machine, when data in prod doesn't match your micro-benchmark, and so on.

[0] https://github.com/scandum/rotate

the_svd_doctor 11 days ago [-]
Fast Fourier Transforms
marginalia_nu 11 days ago [-]
Looking up many items in a large hash table. So something like an in-memory K-V store could probably be designed to work well on such an architecture.
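A hypothetical sketch of why that pattern leans on the memory system: with a table far larger than cache, each probe lands on an effectively random cache line, so almost every lookup costs a DRAM access while the per-lookup arithmetic is just a multiply and a mask. (Open addressing with key 0 as the empty marker; the table is assumed never full.)

    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint64_t key; uint64_t value; } slot_t;

    /* mask = table_size - 1, with table_size a power of two. */
    uint64_t lookup(const slot_t *table, size_t mask, uint64_t key) {
        size_t i = (size_t)(key * 0x9E3779B97F4A7C15ULL) & mask;  /* Fibonacci hash */
        while (table[i].key != key && table[i].key != 0)
            i = (i + 1) & mask;                                   /* linear probe */
        return table[i].key == key ? table[i].value : 0;
    }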
jeffreygoesto 11 days ago [-]
Volume rendering.
marktangotango 11 days ago [-]
Memory-hard crypto PoWs such as Equihash or Autolykos spring to mind.
Avlin67 10 days ago [-]
So disruptive memory bandwidth could add more effective compute than many extra cores, maybe with 5-dimensional quantum memory cells.