Training Stable Diffusion from Scratch Costs <$160k
mullingitover 11 days ago [-]
Interesting to think about where the cost will go in a few years.

I remember in college intro to CS class back in 1998, where I heard the story of building the first computer that could perform at 1 TFLOPS[1]. It cost $46 million and took up 1600 square feet. Now a $600 Mac Mini will do double that.


DSingularity 11 days ago [-]
And on a much more programmable platform - which is just as important in terms of accessibility!
miohtama 11 days ago [-]
It is not going to go down much anymore, because the end of Moore's law has been reached as physical limitations become a factor. You cannot scale transistors down toward widths of a single atom.
runnerup 9 days ago [-]
Moore's law isn't dead. Only Dennard scaling is. See slide 13 here[0]. Moore's law stated that the number of transistors per chip will double every n months. That's still happening. Besides, neither Moore's law nor Dennard scaling is even the most critical scaling law to be concerned about...

...that's probably Koomey's law[1], which looks well on track to hold for the rest of our careers. But as computing approaches the Landauer limit[2], it must eventually level off asymptotically as well, probably starting around 2050. Then we'll need to actually start "doing more with less" and minimize the number of computations performed for specific tasks. That will begin a very, very productive era for highly task-specialized custom silicon and low-level algorithmic optimization.

[0] Shows that Moore's law (green line) is expected to start leveling off soon, but it has not yet slowed down. It also shows Koomey's law (orange line) holding indefinitely. Fun fact, if Koomey's law holds, we'll have exaflop power in <20W in about 20 years. That's equivalent to a whole OpenAI/DeepMind-worth of power in every smartphone.

0: (Slide 13)

1: "The constant rate of doubling of the number of computations per joule of energy dissipated"

2: "The thermodynamic limit for the minimum amount of energy theoretically necessary to perform an irreversible single-bit operation."
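As a back-of-envelope check of the "exaflop in <20W in about 20 years" claim, here is a sketch. The starting efficiency (an A100-class GPU at ~312 TFLOPS FP16 and ~400W) and the Koomey doubling time of ~1.6 years are my assumptions, not figures from the slides:

```python
import math

# How many Koomey doublings separate today's GPU efficiency from
# "1 exaflop in a 20 W envelope"? Assumptions (not from the slides):
#   - today: A100-class, ~312 TFLOPS (FP16) at ~400 W
#   - Koomey doubling time: ~1.6 years (the historical figure)
today_flops_per_watt = 312e12 / 400    # ~0.78 TFLOPS per watt
target_flops_per_watt = 1e18 / 20      # 1 exaflop in 20 W

doublings = math.log2(target_flops_per_watt / today_flops_per_watt)
years = doublings * 1.6

print(f"{doublings:.1f} doublings -> ~{years:.0f} years")
```

With these assumptions it works out to roughly 16 doublings, i.e. on the order of 25 years - in the same ballpark as the comment's "about 20 years."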

runnerup 8 days ago [-]
Also, even clock speeds have made a bit of a comeback lately, with the fastest mid-2000s Pentium 4s having reached 3.8GHz and the latest Ryzen 7000s reaching 6GHz.
Our_Benefactors 11 days ago [-]
I’ll take that 10-year bet. You really think Nvidia is just going to stop releasing new revisions? “Moore’s law is dead” is way over-memed; at this point it’s more an axiom about how computers continually improve than a claim about transistor count.
andromeduck 11 days ago [-]
Moore's law and, more importantly, Dennard scaling both died in the mid-2000s. Nvidia is in fact successful because of the end of Dennard scaling, and the shift to more mission-specialized silicon like TPUs, codec accelerators, and inference engines is also a consequence of that.

Nvidia's performance gains in recent years have come from scaling chip size and making more efficient use of each transistor, both in terms of power and count, more than anything else. A large part of that is minimizing how far data physically moves for any given workload via things like HBM, memory compression, and smarter/larger caches.

In fact, Nvidia doesn't even really try to be on the bleeding-edge nodes anymore, because per-transistor cost has been trending level or upward on bleeding-edge nodes for at least 5 years now.

wokwokwok 11 days ago [-]
Is this just an ad for a service?

They didn’t make anything.

This is just speculative benchmarking.

I am deeply not interested in multiplying the numbers on your pricing sheet by the estimated numbers on the stable diffusion model card.

I have zero interest in your (certainly excellent) Proprietary Special Sauce (TM) that makes spending money on your service a good idea.

This just reads as spam that got past the spam filter.

Did you actually train a diffusion model?

Are you going to release the model file?

Where is the actual code someone could use to replicate your results?

Given the lack of example outputs, I guess not.

ml_hardware 11 days ago [-]
Did you actually read the blog? The very first sentence is:

> Try out our Stable Diffusion code here!

abeppu 11 days ago [-]
> *256 A100 throughput was extrapolated using the other throughput measurements.

It seems worth noting that the $160k scenario wasn't actually measured.

landanjs 11 days ago [-]
Hi, one of the authors of the post here. We will update the post with numbers from the 256-GPU run within the next few days. We estimated the 256-GPU run to be the fastest (13 days) but also the most expensive at $160k. The measured 128-GPU run would take 21 days but cost $125k, if you are interested in lower costs.
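The quoted run lengths and prices imply a per-GPU-hour rate of roughly $2. A quick sanity check (the configurations are from the comment above; everything else is arithmetic):

```python
# Implied cost per A100-hour for the two quoted configurations.
runs = {
    "256xA100": {"gpus": 256, "days": 13, "cost": 160_000},  # estimated
    "128xA100": {"gpus": 128, "days": 21, "cost": 125_000},  # measured
}

for name, r in runs.items():
    gpu_hours = r["gpus"] * r["days"] * 24
    rate = r["cost"] / gpu_hours
    print(f"{name}: {gpu_hours:,} GPU-hours at ~${rate:.2f}/GPU-hour")
```

Note the 256-GPU run burns about 24% more total GPU-hours (79,872 vs 64,512) - that overhead is the price of finishing 8 days sooner.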
varispeed 11 days ago [-]
That is not necessarily a saving. If you have, say, a team of five people each costing $1000 a day, those 6 idle days (not counting the weekend) add up to $30k of wasted money. And if you are working on something the competition is also working on, those lost days accumulate and could cost you your edge - potentially quite expensive, or even fatal to the business.
landanjs 11 days ago [-]
Agreed. When I mentioned lower costs, this was exclusive to model training, but there are so many other factors that influence the cost to a business.
epicycles33 11 days ago [-]
Glad to see this - you can even get reasonable-ish results on lower-res images with ~2 hours of training time on a P100 GPU. See my try here:
gedy 11 days ago [-]
Still pretty pricey for the average person, but these costs will trend cheaper, which is why I think it's futile to "regulate" AI. Someone somewhere will train models on anything visible to the public, licensed or not. Feels like Pandora's box has been opened and we need to deal with it.
adam_arthur 11 days ago [-]
The companies that wait for a 100% "clean" model are going to get left behind. e.g. ChatGPT launching despite Google, Meta and others already having very similar technology internally
sp332 11 days ago [-]
There is already a class action lawsuit. The companies that move forward with "dirty" models can be wiped out by legal fees before they get off the ground.
mullingitover 11 days ago [-]
This will likely just entrench the companies with pockets deep enough to satisfy the lawyers in the class action suits. It might burn down a startup. On the other hand, if Microsoft thinks it's a potential $500 billion business, a $1 billion settlement is just table stakes.
Sebguer 11 days ago [-]
This ignores that a legal remedy can be "you can no longer offer this as a product".
mullingitover 11 days ago [-]
That likely wouldn't be on the table in a settlement offer, and it might be a tough sell for the plaintiff class to get nothing or a protracted legal battle instead of an easy and significant payout. Anything is possible, but I don't think a lawsuit's likely outcome in this situation would be a scorched earth fight.
dfadsadsf 11 days ago [-]
Has this ever happened in practice for a well-funded company, outside the Napster case? I am skeptical that training on publicly accessible data can be ruled illegal. Too many side effects, including making Google Search problematic.
mensetmanusman 11 days ago [-]
That just means it will happen outside the US, wherever laws aren’t enforced (China, Russia, etc.)
operatingthetan 11 days ago [-]
Doesn't that kind of action just cause organizations to be less transparent about the sources of their models?
11 days ago [-]
anon291 11 days ago [-]
As usual, AI has no agency. I believe we should view AI as simply an extension of our own agency. Thus, if you prompt an AI to generate copyrighted work, that's fine. Viewing it yourself is like imagination. However, just as you can draw Mickey Mouse for your own fun all you want, you cannot sell such images.
colejohnson66 11 days ago [-]
That’s kind of already the case. Ignoring the legality of GitHub Copilot itself, if it suggests code that violates copyright and you use it, you are (most likely[a]) still infringing. You can’t hide behind “the AI did it.”

[a]: Of course, IANAL, and it’s up to the courts

pizza 11 days ago [-]
5 bucks says within a year there’ll be some innovation that shrinks this by 2 orders of magnitude. Either from much cheaper compute cost (eg OPUs) or much more efficient training. Hell, there ought to be some way to leapfrog these innovations in such a way that the huge model of yesteryear becomes a more powerful optimizer/loss function itself. That’d just about solve the “hands off my unique shapes!” problem of acceptable training data trawling too :)
odyssey7 11 days ago [-]
How many tries does it take for an expert to succeed at training a custom Stable Diffusion?
ipsum2 11 days ago [-]
Note that this doesn't take into account the numerous iterations required to dial in the correct hyperparameters and model architecture, which could easily increase cost 5-10x.

> 256 A100 throughput was extrapolated using the other throughput measurements

Is it an indictment of their service that they couldn't afford 256 GPUs on their own cloud?

jfrankle 11 days ago [-]
It's an indictment of the A100 node that died on us yesterday, leaving us with 248 GPUs in the particular cluster where we were running the experiments :(

It turns out that, in these kinds of large-scale experiments, hardware failures are a constant fact of life, and we have tools to manage them and allow runs to continue anyway.

Unfortunately, it would mess up our throughput calculations for getting clean baselines here, so we're waiting for our cloud provider to kindly replace the bad A100. Expect those numbers in the next day or so.

ipsum2 11 days ago [-]
Getting reliable GPUs is a difficult problem, I empathize. I've spent a decent amount of time and money because there was one failing GPU on an AWS cluster.
jfrankle 11 days ago [-]
We've come to accept that it's an impossible problem at this point. Instead, we're getting good at automatically detecting hardware failures and rapidly restarting runs on fewer nodes. We're also exploring batch sizes that are (where possible) divisible by N nodes and N-1 nodes. Fault tolerant system design is unfortunately an evergreen topic in CS.
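The divisibility idea above can be sketched as follows. The helper is hypothetical (not from the comment), and the 8-GPUs-per-node figure is an assumption: pick a global batch size that divides evenly across both the full cluster and the cluster minus one failed node, so a restart doesn't change the effective batch size.

```python
from math import lcm

def fault_tolerant_batch_sizes(n_nodes: int, gpus_per_node: int = 8,
                               limit: int = 8192) -> list[int]:
    """Global batch sizes divisible by both the full cluster (N nodes)
    and the degraded cluster (N-1 nodes), so a run can restart on N-1
    nodes without changing the effective batch size."""
    full = n_nodes * gpus_per_node
    degraded = (n_nodes - 1) * gpus_per_node
    step = lcm(full, degraded)  # smallest size both GPU counts divide
    return list(range(step, limit + 1, step))

# With 8 nodes of 8 GPUs: lcm(64, 56) = 448, so candidates are
# 448, 896, ..., up to the limit.
print(fault_tolerant_batch_sizes(8))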
choxi 11 days ago [-]
Data truly is the new oil. When it’s all done the compute costs and code will be cheap or free. There’s a lot hinging on how we interpret copyright laws or what kind of data rights laws we enact.
ralph84 11 days ago [-]
The copyright on Mickey Mouse is due to expire next year, so there will definitely be some attempts at copyright "reform" this year.
11 days ago [-]
rektide 11 days ago [-]
This route requires a bit more work than I'd want, but I'd also point out that $100k can buy ~9 A100's, which are good for ~7k hours of work a month (through not entirely reputable channels, so there's a chance some might die early or have to be returned). That might not train Stable Diffusion in a fast enough time for you (~50k hours estimated training time), but it's still damned impressive. And you get to keep the hardware.

I wonder if AMD is as over-the-top brutal with legal control over where their GPUs can be used as Nvidia is. Maybe with energy cost you might possibly still want to stick with the A100's anyways, but you can afford quite a lot of RX 7900's with $100k (if you can find em).
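Taking the comment's own figures (~9 GPUs for $100k, ~7k GPU-hours per month, ~50k GPU-hours of training), the wall-clock time for the buy-your-own route works out as:

```python
# Rough timeline for owning the A100s, using the comment's figures.
gpu_hours_per_month = 7_000    # ~9 GPUs running near-continuously
training_gpu_hours = 50_000    # estimated Stable Diffusion training time

months = training_gpu_hours / gpu_hours_per_month
print(f"~{months:.1f} months of wall-clock training")  # ~7.1 months
```

So roughly seven months instead of two to three weeks on a rented 128-256 GPU cluster, but the hardware is yours afterwards.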

xnx 11 days ago [-]
It's interesting to compare the cost of cloud GPUs vs. buying the hardware outright. At ~$10,000 per Nvidia A100 GPU, it seems like this cloud provider would break even on the hardware after about 5 months at these rates. There are certainly other costs involved (racking, power, etc.), but that's not too bad. I'm almost surprised Nvidia doesn't cannibalize its hardware sales by running its own cloud.
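The ~5-month break-even figure can be checked with simple arithmetic. The ~$2.75/hour rental rate and the 100% utilization below are my assumptions (a rate in that ballpark is consistent with the run costs discussed elsewhere in the thread):

```python
# Break-even time for a cloud provider renting out a ~$10k A100.
# The rental rate and utilization are assumed, not published figures.
gpu_price = 10_000      # USD, rough A100 street price
hourly_rate = 2.75      # USD/hour, assumed rental rate
utilization = 1.0       # fraction of hours actually billed

hours_to_break_even = gpu_price / (hourly_rate * utilization)
months = hours_to_break_even / 730   # ~730 hours per month
print(f"~{months:.1f} months at 100% utilization")  # ~5.0 months
```

At lower utilization the break-even stretches proportionally, e.g. 50% billed hours would mean ~10 months.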
coding123 11 days ago [-]
There are some large AWS customers that probably burn that in idle time on a bunch of unused machines per week (probably per day).
a9h74j 11 days ago [-]
Can the training be parallelized in a manner similar to SETI-at-home?
nodja 11 days ago [-]
Yes, hivemind trained a 6B-parameter GPT model like this.

General model training

Stable diffusion specific

Inference only stable diffusion

mensetmanusman 11 days ago [-]
This ignores all the runtime costs for LLMs that aren’t operating effectively :)
marcooliv 11 days ago [-]
Is there any value of this cost at which we can say "this is dangerous", for any reason?
capableweb 11 days ago [-]
You could probably take any technology and come up with some sort of reasoning about why it is dangerous.
11 days ago [-]
xwdv 11 days ago [-]
We can do it for way less using spot instances on AWS, though it takes longer.
WatchDog 11 days ago [-]
AWS doesn't have any GPU spot instances, right?

I highly doubt you could get anywhere near the same price.

WatchDog 11 days ago [-]
Very, very rough estimate, using inference benchmarks (which can't necessarily be extrapolated to training): if an A100 takes 6.49 seconds to generate an image and an EPYC 7352 24-core CPU takes 223.19 seconds[0], that's 34 times slower.

So you would need at least 2,716,796 hours to train on CPU.

An m6a.12xlarge is roughly equivalent to an EPYC 7352 24-core[1]; it currently costs $0.5028 an hour on spot.

So that works out to a cost of $1,366,005.
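The arithmetic behind these totals, reproduced as a sketch. The benchmark times and spot price are the comment's figures; the ~79,000 A100-hour training estimate is the value implied by the comment's totals, not a number stated directly:

```python
# Reproducing the CPU-training cost estimate from the benchmark numbers.
a100_secs_per_image = 6.49      # A100 inference benchmark [0]
epyc_secs_per_image = 223.19    # EPYC 7352 24-core, same benchmark [0]
gpu_training_hours = 79_000     # A100-hours implied by the quoted totals
spot_price = 0.5028             # USD/hour, m6a.12xlarge spot [1]

slowdown = epyc_secs_per_image / a100_secs_per_image   # ~34x
cpu_hours = gpu_training_hours * slowdown              # ~2.72M CPU-hours
cost = cpu_hours * spot_price                          # ~$1.37M
print(f"{slowdown:.1f}x slower -> {cpu_hours:,.0f} CPU-hours -> ${cost:,.0f}")
```

That reproduces the quoted ~2.7M hours and ~$1.37M, roughly 10x the cost of the $125k-$160k GPU runs, before even accounting for the absurd wall-clock time.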



graphe 11 days ago [-]
I've downloaded anime models for free. I'm sure they were <$160 without the k.
O__________O 11 days ago [-]
Those models were not trained from scratch.
manimino 11 days ago [-]
It is a fair point though - there's no utility in training an openly available model from scratch. Finetuning is far more practical.
O__________O 11 days ago [-]
There are numerous reasons why someone might want to train a model from scratch; for example, copyright provenance and licensing control.