I remember my intro CS class in college back in 1998, where I heard the story of building the first computer that could perform at 1 TFLOPS. It cost $46 million and took up 1,600 square feet. Now a $600 Mac Mini will do double that.
...that's probably Koomey's law, which looks well on track to hold for the rest of our careers. But eventually, as computing approaches the Landauer limit, it must asymptotically level off as well, probably starting around 2050. Then we'll need to actually start "doing more with less" and minimize the number of computations done for specific tasks. That will kick off a very productive era for highly task-specialized custom silicon and for low-level algorithmic optimization.
The slide [0] shows that Moore's law (green line) is expected to start leveling off soon, but it has not yet slowed down. It also shows Koomey's law (orange line) holding indefinitely. Fun fact: if Koomey's law holds, we'll have exaflop power in <20 W in about 20 years. That's equivalent to a whole OpenAI/DeepMind's worth of compute in every smartphone.
0: (Slide 13) https://www.sec.gov/Archives/edgar/data/937966/0001193125212...
1: "The constant rate of doubling of the number of computations per joule of energy dissipated" https://en.wikipedia.org/wiki/Koomey%27s_law
2: "The thermodynamic limit for the minimum amount of energy theoretically necessary to perform an irreversible single-bit operation." https://en.wikipedia.org/wiki/Landauer%27s_principle
Nvidia's performance gains in recent years have come more from scaling chip size and making more efficient use of each transistor, both in terms of power and count, than from anything else. A large part of that is minimizing how far data physically moves for a given workload, via things like HBM, memory compression, and smarter/larger caches.
In fact, Nvidia doesn't even really try to be on the bleeding-edge nodes anymore, because per-transistor costs have been trending flat or upward on bleeding-edge nodes for at least five years now.
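To put a number on the "minimize how far data moves" point, here's a quick roofline-style estimate. Both figures below are approximate A100-class specs, used only as assumptions for illustration:

    peak_flops = 312e12   # ~312 TFLOP/s BF16 tensor-core peak (approximate, assumed)
    hbm_bw = 2.0e12       # ~2 TB/s HBM bandwidth (approximate, assumed)

    # FLOPs you must do per byte of HBM traffic before compute, not memory, is the bottleneck
    ridge = peak_flops / hbm_bw
    print(f"~{ridge:.0f} FLOPs per byte of HBM traffic to be compute-bound")

    # An elementwise op (activation, bias add, etc.) does only a few FLOPs per ~8 bytes
    # moved, so it sits far below that ridge point -- which is why bigger caches and
    # keeping data on-chip buy so much.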
They didn’t make anything.
This is just speculative benchmarking.
I am deeply not interested in multiplying the numbers on your pricing sheet by the estimated numbers on the stable diffusion model card.
I have zero interest in your (certainly excellent) Proprietary Special Sauce (TM) that makes spending money on your service a good idea.
This just reads as spam that got past the spam filter.
Did you actually train a diffusion model?
Are you going to release the model file?
Where is the actual code someone could use to replicate your results?
Given the lack of example outputs, I guess not.
> Try out our Stable Diffusion code here!
It seems worth noting that the $160k scenario wasn't actually measured.
> 256 A100 throughput was extrapolated using the other throughput measurements
Is it an indictment of their service that they couldn't afford 256 GPUs on their own cloud?
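To be fair, that kind of extrapolation is usually just a curve fit over the measured points. Something like the sketch below (the throughput numbers are made-up placeholders, not the post's measurements) is presumably all that's going on:

    import math

    # Hypothetical measured throughputs (images/sec) -- placeholders, NOT the post's data
    measured = {8: 1000.0, 16: 1950.0, 32: 3800.0, 64: 7400.0, 128: 14500.0}

    # Assume per-GPU throughput degrades roughly linearly in log2(GPU count)
    per_gpu = {n: t / n for n, t in measured.items()}
    ns = sorted(per_gpu)
    slope = (per_gpu[ns[-1]] - per_gpu[ns[0]]) / (math.log2(ns[-1]) - math.log2(ns[0]))
    est_256 = (per_gpu[ns[-1]] + slope * (math.log2(256) - math.log2(ns[-1]))) * 256
    print(f"Extrapolated 256-GPU throughput: ~{est_256:.0f} images/sec (hypothetical)")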
It turns out that, in these kinds of large-scale experiments, hardware failures are a constant fact of life, and we have tools to manage these hardware failures and allow runs to continue anyway.
Unfortunately, it would mess up our throughput calculations for getting clean baselines here, so we're waiting for our cloud provider to kindly replace the bad A100. Expect those numbers in the next day or so.
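For anyone curious what "tools to manage hardware failures" usually amounts to in practice, the generic pattern is periodic checkpointing plus automatic resume on relaunch. This is just a sketch of that pattern in plain PyTorch, not their actual tooling:

    import os
    import torch

    CKPT = "latest.pt"

    def save_checkpoint(model, optimizer, step):
        # Written every N steps so a node failure only loses a little progress
        torch.save({"model": model.state_dict(),
                    "optim": optimizer.state_dict(),
                    "step": step}, CKPT)

    def load_checkpoint(model, optimizer):
        # On (re)launch, resume from the latest checkpoint if one exists
        if not os.path.exists(CKPT):
            return 0
        state = torch.load(CKPT, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optim"])
        return state["step"]

The job scheduler relaunches the run when a node dies, and training picks up from the returned step.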
I wonder if AMD is as over-the-top brutal as Nvidia with legal control over where their GPUs can be used. With energy costs you might still want to stick with A100s anyway, but you can afford quite a lot of RX 7900s for $100k (if you can find them).
General model training https://github.com/learning-at-home/hivemind
Stable diffusion specific https://github.com/chavinlo/distributed-diffusion
Inference only stable diffusion https://stablehorde.net/
I highly doubt you could get anywhere near the same price.
So you would need at least 2,716,796 hours to train on CPU.
An m6a.12xlarge is roughly equivalent to an EPYC 7352 24-core; it currently costs $0.5028 an hour on spot.
So that works out to a cost of $1,366,005.
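The arithmetic, for anyone following along (the hour count is the estimate above, the spot price as quoted):

    cpu_hours = 2_716_796     # estimated CPU-hours to train, from above
    spot_rate = 0.5028        # m6a.12xlarge spot price, $/hour
    print(f"${cpu_hours * spot_rate:,.0f}")   # -> $1,366,005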