1x MI300x has 192GB HBM3.
1x MI325x has 256GB HBM3e.
They cost less, you can fit more into a rack, and you can buy/deploy at least the 300s today and the 325s early next year. AMD's software and library performance for AI is improving daily [0].
I'm still trying to wrap my head around how these companies think they are going to do well in this market without more memory.
[0] https://blog.vllm.ai/2024/10/23/vllm-serving-amd.html
> I'm still trying to wrap my head around how these companies think they are going to do well in this market without more memory.
Cerebras and Groq provide the fastest (by an order of magnitude) inference. This is very useful for certain workflows that require low-latency feedback: audio chat with an LLM, robotics, etc.
Outside that narrow niche, AMD stuff seems to be the only contender to NVIDIA, at the moment.
latchkey 2 days ago [-]
> Cerebras and Groq provide the fastest (by an order of magnitude) inference.
Only on smaller models; their numbers in the article are all for 70B.
Those numbers also need to be adjusted for comparable capex+opex costs. If the costs are so high that they have to subsidize the usage/results, then they are just going to run out of money, fast.
krasin 2 days ago [-]
> Only on smaller models; their numbers in the article are all for 70B.
No, they are 5x-10x faster for all the model sizes (because it's all just running from SRAM and they have more of it than NVIDIA/AMD), even though they benchmarked just up to 70B.
> Those numbers also need to be adjusted for comparable capex+opex costs. If the costs are so high that they have to subsidize the usage/results, then they are just going to run out of money, fast.
True. Although, for some workloads, fast enough inference is a strict prerequisite and GPUs just don't cut it.
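Some intuition for why SRAM-resident serving is so much faster: single-stream decode of a dense model is roughly memory-bandwidth bound, since the weights have to be streamed from memory for every token. A crude roofline sketch, using approximate vendor bandwidth figures that are my assumptions, not numbers from the article:

    # Roofline ceiling: tokens/sec <= memory_bandwidth / bytes_of_weights_per_token.
    # Bandwidth figures are approximate vendor numbers, used only for intuition.
    weight_bytes = 70e9 * 2      # ~140 GB of fp16/bf16 weights for a 70B model

    bandwidth = {                # bytes/second
        "H100 (HBM3, ~3.35 TB/s)":        3.35e12,
        "MI300x (HBM3, ~5.3 TB/s)":       5.3e12,
        "CS-3 (on-wafer SRAM, ~21 PB/s)": 21e15,
    }

    for name, bw in bandwidth.items():
        print(f"{name:32s} ceiling ~{bw / weight_bytes:>12,.0f} tok/s single-stream")

Real systems land far below these ceilings (the model is sharded across devices, and KV-cache traffic, interconnect, and compute all intrude), but the orders-of-magnitude gap in memory bandwidth is the mechanism behind the 5x-10x claim.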
latchkey 1 days ago [-]
576 CS-3 nodes cost around $900 million, which is $1.56 million per node.
It takes 4 nodes to serve one 70B model, or about $6.24M. It is unclear how many requests they can serve concurrently for that; they only report token throughput.
A cluster of 128 MI300x (4 racks total), with a combined 24,576GB of HBM, can serve a whole ton of models and users and is in the ~$5M range if you don't go big on networking/disk/RAM (which you don't need for inference anyway).
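A quick sketch of that arithmetic, using only the figures quoted in this comment (the ~$5M cluster price is the rough estimate above, not a list price):

    # Back-of-the-envelope comparison using the figures quoted in this thread.
    cs3_cluster_price = 900e6                     # ~$900M for 576 CS-3 nodes
    cs3_nodes = 576
    cs3_per_node = cs3_cluster_price / cs3_nodes  # ~$1.56M per node
    cs3_for_70b = 4 * cs3_per_node                # ~$6.25M; the $6.24M above is the per-node rounding

    mi300x_gpus = 128
    mi300x_hbm_gb = mi300x_gpus * 192             # 24,576 GB of HBM3
    mi300x_cluster_price = 5e6                    # rough ~$5M estimate, 4 racks

    print(f"CS-3 per node:       ${cs3_per_node / 1e6:.2f}M")
    print(f"CS-3, one 70B model: ${cs3_for_70b / 1e6:.2f}M (4 nodes)")
    print(f"128x MI300x cluster: ${mi300x_cluster_price / 1e6:.2f}M, {mi300x_hbm_gb:,} GB HBM")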
While speed might be an issue here, I don't think people are going to be able to justify the price for the speed (always the tradeoff) unless they can get their costs down significantly.
campers 1 days ago [-]
On Google Cloud a server with 8 TPU v5e will do 2175 tokens/second on Llama 2 70B: https://cloud.google.com/blog/products/compute/updates-to-ai...
From https://cloud.google.com/tpu/pricing and https://cloud.google.com/vertex-ai/pricing#prediction-prices (search for ct5lp-hightpu-8t on the page), the cost for that appears to be $11.04/hr, which is just under $100k for a year, or half that on a 3-year commit.
That seems like a better deal than millions for a few CS-3 nodes.
And they've just announced the v6 TPU: https://cloud.google.com/blog/products/compute/trillium-sixt...
Compared to TPU v5e, Trillium delivers:
Over 4x improvement in training performance
Up to 3x increase in inference throughput
A 67% increase in energy efficiency
An impressive 4.7x increase in peak compute performance per chip
Double the High Bandwidth Memory (HBM) capacity
Double the Interchip Interconnect (ICI) bandwidth
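Rough math on those v5e numbers (my arithmetic, assuming full utilization of the on-demand rate and the quoted Llama 2 70B throughput):

    # Implied cost of the 8x TPU v5e setup quoted above, at full utilization.
    hourly_rate = 11.04        # $/hr for ct5lp-hightpu-8t, on-demand
    throughput = 2175          # tokens/second on Llama 2 70B (quoted above)

    yearly_cost = hourly_rate * 24 * 365
    cost_per_million_tokens = hourly_rate / (throughput * 3600 / 1e6)

    print(f"Yearly on-demand cost: ${yearly_cost:,.0f}")             # ~$96,710
    print(f"Cost per 1M tokens:    ${cost_per_million_tokens:.2f}")  # ~$1.41

Real utilization will be lower, but it gives a sense of the $/token bar that dedicated hardware has to clear.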
You are right if model capabilities are determined only by model size. But consider that OpenAI is saying they have a way of scaling intelligence with inference-time compute, not just model size. If that proves out, reducing latency per output token becomes as valuable as, or possibly more valuable than, scaling model size. Speed becomes intelligence. And Cerebras has 1/10 the latency per token of anything else.
krasin 1 days ago [-]
You're correct on $/bandwidth. The point about low latency continues to be ignored, though.
menaerus 1 days ago [-]
Maybe it's because the assumption that latency is low because everything fits in SRAM is not valid?
CS-1 had 18GB of SRAM, CS-2 extended that to 40GB, and CS-3 has 44GB. None of these is sufficient to run inference on Llama 70B, much less on even larger models.
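A quick sizing sketch of why (assuming 16-bit weights; that precision is my assumption, not a figure from the thread):

    import math

    # Can the Llama 70B weights fit in one CS-3 wafer's SRAM?
    params = 70e9
    bytes_per_param = 2                            # assuming fp16/bf16 weights
    weight_gb = params * bytes_per_param / 1e9     # ~140 GB

    sram_per_wafer_gb = 44                         # CS-3 on-wafer SRAM
    wafers_needed = math.ceil(weight_gb / sram_per_wafer_gb)

    print(f"Weights: ~{weight_gb:.0f} GB -> at least {wafers_needed} CS-3 wafers")
    # ~140 GB -> at least 4 wafers, matching the "4 nodes per 70B model" figure upthread.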
latchkey 1 days ago [-]
Exactly. Latency is less relevant if you need 4 literal servers (each taking up a whole rack) to push out a single 70B model, and we don't know how many concurrent user requests that actually serves (probably 1).
arisAlexis 1 days ago [-]
Is that a benchmark or a shower thought?
cma 2 days ago [-]
Cerebras had less chip perimeter available to hook up external memory I/O and is memory-capacity limited with just SRAM. SRAM cell size hasn't been scaling nearly as well as logic on recent nodes, but if SRAM scaling had continued the way it was going when Cerebras started, it might have worked out better.
They'll probably still have to do advanced packaging putting HBM on top to save things.
They could maybe enable some cool real time inference stuff like VR SORA, but that doesn't seem like much of a product market for the cost yet.
Maybe something heavy on inference iteration, like an o1-style model that trades training time for more inference, used to process earnings reports the fastest, or some similar zero-sum latency war, will turn out to be a viable market. A real-time use case that may be viable with Cerebras first could be flexible robotics in ad hoc, latency-sensitive environments, maybe warfare.
If models keep lasting ~year timescales could we ever see people going with ROM chips for the weights instead of memory? Have density and speed kept up there? Lots of stuff uses identical elements to help make the masks more cheaply, so I don't think you could use something like EUV for ROM where every few um^2 of die is distinct.
rbanffy 1 days ago [-]
> They'll probably still have to do advanced packaging putting HBM on top to save things.
This is where the interesting wafer-scale packaging TSMC does for the Dojo D1 supercomputer comes in. Cerebras has demonstrated what can be a superior process for inter-element bandwidth, because connections can be denser than they are with an interposer, but the ability to combine elements from different processes is also important, and it is used on the D1 slab. Stacking HBM modules on top of a Cerebras wafer might help with that. I'm sure the smart people there are not sleeping on these ideas.
For ultra-low-latency uses such as robotics or military applications, I believe a more integrated approach, similar to the Telum processors from IBM, is better - putting the inference accelerator on the same die as the CPUs gives them that, and they are also much smaller than a Cerebras wafer (and its cooling).
Gene Amdahl would have loved to see them.
krasin 1 days ago [-]
> If models keep lasting ~year timescales could we ever see people going with ROM chips for the weights instead of memory?
Before ROM, there's a step where HBM for weights is replaced with Flash or Optane (but still high-bandwidth, stacked on top of the chip) and the KV cache lives in SRAM - for small batch sizes, that would actually be decently cheap. In that case, even if the weights change weekly, it's not a big deal at all.
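For a sense of scale, a rough KV-cache sizing sketch for a Llama-70B-class model (80 layers, 8 KV heads of dimension 128 under GQA, 16-bit cache; these architecture figures are my approximations, not from the thread):

    # Rough KV-cache footprint, to gauge whether "KV cache lives in SRAM"
    # is plausible once the weights move off-chip.
    layers = 80
    kv_heads = 8                 # grouped-query attention
    head_dim = 128
    bytes_per_value = 2          # fp16/bf16 cache

    kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
    context = 8192
    per_sequence_gb = kv_bytes_per_token * context / 1e9   # ~2.7 GB per 8k sequence

    sram_gb = 44                 # one CS-3 wafer
    print(f"KV cache: ~{kv_bytes_per_token / 1e6:.2f} MB/token, "
          f"~{per_sequence_gb:.1f} GB per {context}-token sequence")
    print(f"~{int(sram_gb / per_sequence_gb)} such sequences fit in {sram_gb} GB of SRAM")

So with the weights streamed from Flash, a single wafer's SRAM could plausibly hold the KV cache for a modest number of concurrent long-context sessions, which is the "small batch sizes" regime above.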
hhdhdbdb 1 days ago [-]
How is this a narrow niche?
Chain-of-thought-style operations are in this "niche".
Also anything where the value is in the follow-up chat, not the one-shot.
wmf 2 days ago [-]
Groq and Cerebras only make sense at massive scale, which I guess is why they pivoted to being API providers, so they can amortize the hardware over many customers.
latchkey 2 days ago [-]
Correct, except that massive scale doesn't work because it just uses up exponentially more power/space/resources.
They also have a very limited use case... if things ever shift away from LLMs and into another form of engineering that their hardware does not support, what are they going to do? Just keep deploying hardware?
Slippery slope.
arisAlexis 1 days ago [-]
The article explains the issues with memory in depth. Did you read through?
YetAnotherNick 1 days ago [-]
2x 80GB A100 is better in all the metrics than MI300x while being cheaper.
asdf1145 2 days ago [-]
clickbait title: inference is not training
mentalically 2 days ago [-]
The value proposition of Cerebras is that they can compile existing graphs to their hardware and allow inference at lower costs and higher efficiencies. The title does not say anything about creating or optimizing new architectures from scratch.
germanjoey 2 days ago [-]
the title says "Cerebras Trains Llama Models"...
mentalically 2 days ago [-]
That's correct, and if you read the whole thing you will realize that it is followed by "... to leap over GPUs", which indicates that they're not literally referring to optimizing the weights of a graph on a new architecture, or freshly initialized variables on an existing one.
pama 2 days ago [-]
This is as clickbaity as it gets.
"Trains" has no other sensible interpretation in the context of LLMs. My impression was that they had trained the models to be better than the models trained on GPUs, presumably because they trained faster and managed to train for longer than Meta, but this interpretation was far from the content.
Also interesting to see the omission of deepinfra from the price table, presumably because it would be cheaper than Cerebras, though I didn't even bother to check at that point because I hate these cheap clickbaity pieces that attempt to enrich some player at the cost of everyone's time or money.
Good luck with their IPO. We need competition but we don't need confusion.
mentalically 2 days ago [-]
What are you confused about? Their value proposition is very simple and obvious: custom hardware with a compiler that transforms existing graphs into a format that can run at lower cost and higher efficiency because it utilizes a special instruction set only available on Cerebras silicon.
fancyfredbot 1 days ago [-]
The title is clickbait, but that's how marketing works whether we like it or not. The achievement is real - Cerebras improved their software and the inference is much faster as a result. I find it easy to forgive annoying marketing tactics when they're being used to promote something cool.
pama 1 days ago [-]
It is textbook bait-and-switch. If the achievement is important, use the correct title. An advance in actual training performance or a better model is very important, and it interests a different set of people, with deeper pockets, than those who care about inference.
htrp 2 days ago [-]
Title is about training... article is about inference.
KTibow 2 days ago [-]
Why is nobody mentioning that there is no such thing as a Llama 3.2 70B?
pk-protect-ai 1 days ago [-]
Wow, 44GB SRAM, not HBM3 or HBM3e, but actual SRAM ...
7e 2 days ago [-]
"It would be interesting to see what the delta in accuracy is for these benchmarks."
^ the entirety of it
7e 2 days ago [-]
"So, the delta in price/performance between Cerebras and the Hoppers in the cloud when buying iron is 2.75X but for renting iron it is 5.2X, which seems to imply that Cerebras is taking a pretty big haircut when it rents out capacity. That kind of delta between renting out capacity and selling it is not a business model, it is a loss leader from a startup trying to make a point."
As always, it is about TCO, not who can make the biggest monster chip.
asdf1145 2 days ago [-]
Did they release MLPerf data yet, or would that not help their IPO?