I've been looking around the documentation on Huggingface, but all I could find was either how to train unconditional U-Nets, or how to use the pretrained Stable Diffusion model to process image prompts (which I already know how to do). Writing a training loop for CLIP manually wound up with me banging against all sorts of strange roadblocks and missing bits of documentation, and I still don't have it working. I'm pretty sure I also need some other trainables at some point, too.
 Specifically, Wikimedia Commons images in the PD-Art-100 category, because the images will be public domain in the US and the labels CC-BY-SA. This would rule out a lot of the complaints people have about living artists' work getting scraped into the machine; and probably satisfy Debian's ML guidelines.
 Which actually does work
Honestly it baffles me that in all this discussion, I rarely see people discussing how to do this with appropriately licensed images. There are some pretty large datasets out there of public images, and doing so might even help encourage more people to contribute to open datasets.
Also if the big ML companies HAD to use open images, they would be forced to figure out sample efficiency for these models. Which is good for the ML community! They would also be motivated to encourage the creation of larger openly licensed datasets, which would be great. I still think if we got twitter and other social media sites to add image license options, then people who want to contribute to open datasets could do so in an easy and socially contagious way. Maybe this would be a good project for mastodon contributors, since that is something we actually have control over. I'd be happy to license my photography with an open license!
It is really a wonderful idea to try to do this with open data. Maybe it won't work very well with current techniques, but that just becomes an engineering problem worth looking at (sample efficiency).
Then there's the fact that humanity has been able to develop and share art and literary works for thousands of years without the modern copyright system.
It would be interesting to see if this technology can erode the copyright concept a bit. Maybe not remove it completely, but perhaps influence people to create wider definitions for "fair use", and undo the extensions that Disney lobbyists have created.
But they developed Cubism in parallel. There were periods where their work was almost indistinguishable. "Houses at l'Estaque", the trope namer for Cubism thanks to the remarks of a critic, was in fact by Braque.
You can generate infinite recognizable Basquiat from an AI, but is it Basquiat? No, of course not, because Basquiat's style operates within the context of a specific individual human making a point about expectations and the interface between his race and his artistic boldness and audacity as experienced by his wealthy audience. Making an AI 'ape' (!) his art style is itself quite the artistic statement, but it's not the same thing in the slightest.
You can generate infinite Rothko as 512x512 squares, but if you don't understand how the gallery hangings work and their ability to fill your entire visual field with first carefully chosen color, and then a great deal of detail at the threshold of perception of distinctions between color shades meant to further drive home the reaction to the basic color's moods, what you generate is basically arbitrary and nothing. Rothko isn't 'just a random color', Rothko is about giving you a feeling through means that aren't normal or representational, and the unusualness of this (reasonably successful) effort is what gave the work its valuation.
Ownership of the experience by a particular artist isn't the point. Rothko isn't solely celebrity worship and speculation. Picasso isn't all of Cubism. Art is things other than property of particular artists.
What makes it awkward is the great ease by which AI can blindly and unhelpfully wear the mask of an artist, such as Basquiat, to the detriment of art. It's HOW you use the tools, and it's possible to abuse such tools.
I'm not sure how I feel about this - I agree with the conclusion, but not the reasoning. For me, AI-generated Basquiat is not Basquiat simply because he had no ownership or agency in the process of its creation.
It feels like an overly romantic notion that art requires specific historical/cultural context at the moment of its creation to be valid.
If I could hypothetically pay Basquiat $100 to put his own work into a stable diffusion model that created a Basquiat-esque work, that's still a Basquiat. If I could pay him to draw a circle with a pencil, that's his work - and if I used it in an AI model, then it's not.
It's about who held the paintbrush, or who delegated holding the paintbrush, not a retrospectively applied critical theory.
Even more than that, you couldn't do Rothko that way: the man would be beyond offended and would not deal with you at all. But by contrast, you ABSOLUTELY are doing a Warhol if you train an AI on him and have it generate infinite works, and furthermore I think he'd be absolutely delighted at the notion, and would love exploring the unexplored conceptual space inside the neural net.
In a sense, an AI Warhol is megaWarhol, an unexplored level of Warholiness that wasn't attainable within his lifetime.
Context and intent matter. All of modern art ended up exploring these questions AS the artform itself, so boiling it down to 'did a specific person make a mark on a thing' won't work here.
Any drawing Basquiat did is a piece of art by Basquiat, whether or not it fits into the narrative of a book/thesis/lecture/exhibition. The circle metaphor isn't important - replace it with anything else. Artists regularly throw their own work away. Some of this is saved and celebrated posthumously, some never sees the light of day in accordance with their wishes. Scraps that fell on Picasso's floor sell for huge amounts of money.
Does everything he did fit the "brand" that some art historians have labelled him with, or the "brand" that auction houses promote to increase value, or the "brand" which a fashion label licenses for t-shirts? No, but I suspect this is probably what you are talking about ie. a "classic" Basquiat™ with certificate of authenticity?
Is it by Basquiat? vs Is it a Basquiat?
These arguments come up in every thread, and I'm baffled that people don't think the scale matters.
You may also be observed in public areas by police, but it would be an orwellian dystopia to have millions of cameras in spaces analyzing everyone's behavior in public.
(But I'm indeed in favor of weaker copyright laws! But preferably to take power away from the copyright monopolies than the individual artists who barely get by with their profits)
Aren't there already 80M+ surveillance cameras in the US?
Outside of the US, London seems to have a lot of CCTV cameras.
Do privacy laws restrict how they can be used and whether they can be monitored by AI systems?
Copyright law (especially in US) only ever changes in the direction that suits corporations. So - no.
What I expect instead is artists being sued by a big tech company for copyright violations because that big tech company used the artist Public Domain image for training their copyrighted AI and as a result it created a copyrighted copy of the original artist's image.
You can already see the rather strange, toned-down language they use on their sites. (And, for some, the revealing reversal from "we license to you" to "you license to us".)
Some smaller AI companies might believe they own a clear-cut copyright and sue, but it would make sense that they would either be thrown out or lose.
However, even if an image is not copyrightable, it can still infringe copyright. For example, mechanical reproductions of images are not copyrightable in the US - which is why you even can have public domain imagery on the web. However, if I scan a copyrighted image into my computer, that doesn't launder the copyright away, and I can still be sued for having that image on my website.
Likewise, if I ask an AI to give me someone else's copyrighted work, it will happily regurgitate its training set and do that, and that's infringement. This is separate from the question of training the AI itself; even if that is fair use, that does nothing for the people using the AI because fair use is not transitive. If I, say, take every YouTube video essay and review on a particular movie and just clip out and re-edit all the movie clips in those reviews, that doesn't make my re-edit fair use. You cannot "reach through" a fair use to infringe copyright.
 In Europe there's a concept of neighboring rights, where instead of issuing you a full copyright you get 20 years of ownership instead. This is intended for things like databases and the like. This also applies to images; copyright over there distinguishes between artistic photography (full copyright) and other kinds of photography (20 years neighboring right only). This is also why Wikimedia Commons has a hilarious amount of Italian photos from the 80s in a special PD-Italy category.
 Which is not too difficult to do
 My current guess is that it is fair use, because the AI can generate novel works if you give it novel input.
That’s because only humans can own copyrights. People can and have registered copyrights for Midjourney outputs.
There's certainly arguments to be made in this direction, for example corporations tending to have the most money they can afford to spend on lobbying to get their way, but the attitude of "it hasn't been good up 'til now so it definitely can't ever be good" is pretty defeatist and would imply that positive change is impossible in any area.
There is nothing that the generative AI can do in this process that's legally different from copy-pasting the image, editing it a bit by hand, and somehow claiming intellectual property of the _initial_ image, no?
Just objectively false.
That seems to be the core of the issue, and a much more interesting conversation to have. So why do I keep seeing a version of your first paragraph everywhere and not an explanation on why the assumption can be made?
Human endeavor is inherently collaborative. The idea that my art is my virgin creation is an illusion perpetuated by capitalists. My art is the work of thousands who came before me with my slight additions and tweaks.
Your (and in general, our) suggestion that we should be concerned with respecting or even expanding these protections is incorrect if you want human creativity to flourish.
But I am absolutely not in favor of keeping IP restrictions in place and then letting big corporations scoop up the works of small independent artists for their ML models.
Think of it in terms of software licenses. The people who write GPL protected software are leveraging existing copyright laws to enforce distribution of their code. They would probably be in favor of abolishing the entire IP rights system. But if a big corporation was copying a project from an independent creator that was GPL licensed, they’d sure as hell want to prosecute.
I believe strongly that IP restrictions are harmful. But keeping them in place while letting big corporations benefit from the work of independent artists who don’t want their work used in this way seems wrong to me. As long as artists wouldn’t expect anyone else to be able to copy their works, I’d like them to be able to consent to their work being used in these systems.
> But keeping them in place while letting big corporations benefit from the work of independent artists who don’t want their work used in this way seems wrong to me.
I see what you're saying here. My concern is that should copyright style protection be extended to the "vibe" or "style" of a painting it is going to be twisted in a way that ends up being used to silence/abuse artists in the same way that copyright strikes are already.
I think the idea that art is mostly individually creative vs mostly drawing upon the work of all the artists and art-appreciators around you and before you is already really problematic. The corrupting power of the idea is what I worry about. Similarly to crypto/NFTs, the idea that scarcity should exist in the digital world is the most dangerous thing, most of the other bad stems from that.
IMO the most important thing to work on is getting people to reject the idea itself as harmful.
I worry that any short term fix to try to prop up artists' rights in response to this changing landscape will become a long term anchor on our society's equity and cultural progress in the exact same way copyright is.
Then came the brutal reality: creating high-quality artwork needs time. Some can be created after work, but not that much. Some forms of art require expensive instruments. Some, like filmmaking, require collaboration and coordination of many people. So yes, I could do some forms of art part-time using the money from my day job, but I knew it was a far cry from what I could do when working on it full time. It's not capitalism, it's just reality.
If all artists are "weekend warriors", they will still produce a lot of art, and some of it will be the best in that world. But the quality will be far from what we enjoy today.
That said, there are of course other ways to pay artists than the capitalist way of having customers pay for what they like. But I think the track record firmly favors a capitalist system.
Can you point to a system that worked well before that you'd like to go back to?
I just wanted to point out that capitalism is in fact a specific economic system. It's not a law of nature, or another word for "markets" or "freedom", or a realization that some other system doesn't work.
So, yes, capitalism in the sense of the freedom to trade one's labor does appear to be naturally and universally emergent in advanced human societies, in the absence of violent interference.
To the extent that certain aspects of capitalism lead to violence, those are elements that other parties -- generally corporations or governments rather than writers or philosophers -- added to the ideology.
People die trying to break out of non-capitalist countries, while they die trying to break in to capitalist ones. That's one possible way to tell the good guys from the bad guys.
Ahahah, I absolutely love this sentence. You might have said the quiet part out loud though.
“You gots to understand”, said Fat Tony, “I'm not a violent man. The violence simply comes in when you interfere with my business.”
But capitalism prevails and may be the best system there is for now because I cannot fathom a change in system overnight that would not result in mass suffering for (almost) everyone.
That reason has nothing to do with intellectual property or how it's created, it's a consequence of living in a capitalist society.
Then there is still the question of attribution, which 100% of real artists care about.
So no, it's not at all clear where the legal lines are drawn. There have been no court cases yet, regarding the training of ML models. People are trying to draw analogies from other types of cases, but this has not been tried in court yet. And then the answer will likely differ based on country.
Not if Google honors the robots.txt like they say they do. Hosting content with a robots.txt saying "index me please" is essentially an implicit contract with Google for full access to your content in return for showing up in their search results.
Hosting an image/code repository with a very specific license attached and then having that license ignored by someone who repackages that content and redistributes it is not the same as sites explicitly telling Google to index their content.
A much closer comparison IMO would be someone compressing a massive library of copyrighted content and then redistributing it and arguing it's legal because "the content has been processed and can't be recovered without a specific setup". I don't think we'd need prior court cases to argue that would most likely be illegal, so I don't see how machine learning models differ.
You can also watch the fast.ai MOOC titled Deep Learning from Scratch to Stable Diffusion.
You can also look at open source implementations of text2image models like Dall-E Mini or the work of lucidrains.
I worked on the Dall-E Mini project, and the technical knowhow that you need isn’t closely taught at MOOCs. You need to know, on top of Deep Learning theory, many tricks, gotchas, workarounds, etc.
You could follow the work of EleutherAI, and follow Boris Dayma (project leader of Dall-E Mini) and Horace Ho on Twitter, along with anyone else who has significant experience in practical AI and regularly shares their tricks. The PyTorch forums are also a good place.
Learn PyTorch and/or JAX/Flax really well.
If you're talking about training from scratch and not fine tuning, that won't be cheap or easy to do. You need thousands upon thousands of dollars of GPU compute and a gigantic data set.
I trained something nowhere near the scale of Stable Diffusion on Lambda Labs, and my bill was $14,000.
 Assuming you rent GPUs hourly, because buying the hardware outright will be prohibitively expensive.
That argument also makes little sense when you consider that the model is a couple gigabytes itself, it can't memorize 240TB of data, so it "learned".
But if you want to create custom versions of SD, you can always try out dreambooth: https://github.com/XavierXiao/Dreambooth-Stable-Diffusion, that one is actually feasible without spending millions of dollars on GPUs.
1. https://www.youtube.com/watch?v=cdiD-9MMpb0 Lex Fridman podcast with Andrej Karpathy
P.S. To counteract that (unintentionally, actually; likely because of a simple optimization of the instruments' duty cycle), people in astronomy came up with the concept of an "observatory" (like Hubble, JWST) instead of an "experiment" (like LHC, HESS telescopes), where outside people can submit proposals and, if selected, get observation time. Along with the raw data, the authors of the proposals get the expertise they need from the collaboration to process and analyze that data.
This is just lossy compression with a large and well-tuned (to the expected problem domain) dictionary.
Video compression codecs can achieve a 500x compression ratio, and they are general-purpose.
Uncompressed, LAION-5B would be 4PB, for a compression ratio into SD of ~780kx, or one byte per picture.
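As a rough sanity check on those numbers, here is the arithmetic spelled out. The 4PB and ~780kx figures are the ones quoted above; the image count of about 5 billion and the use of decimal units are my assumptions, so treat the result as order-of-magnitude only.

```python
# Back-of-the-envelope check of the figures quoted above.
# Assumptions (not from the thread): decimal units, ~5 billion images.
PB = 10**15

dataset_bytes = 4 * PB        # "Uncompressed, LAION-5B would be 4PB"
ratio = 780_000               # "~780kx"
num_images = 5_000_000_000    # assumed image count for LAION-5B

model_bytes = dataset_bytes / ratio
bytes_per_image = model_bytes / num_images

print(f"implied model size: {model_bytes / 10**9:.1f} GB")
print(f"bytes per image:    {bytes_per_image:.2f}")
```

Under those assumptions the implied model size comes out at roughly 5 GB, which is where the "one byte per picture" figure comes from.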
The only practical limit is the amount of information entropy in the source material, and if you're going to claim that internet pictures are particularly information-dense I'd need some evidence, because I don't believe you.
Both are just simple statistical relationships between parameters and random variables.
Most human behavior is easy to describe with only a few underlying parameters, but there are outlier behaviors where the number of parameters grows unboundedly.
("AI" hasn't even come close to modeling these outliers.)
Internet pictures squarely fall into the "few underlying parameters" bucket.
We can speculate they apply to certain models of slices of human behaviour based on our vague understanding of how we work, but not nearly to the same degree.
When a human looks at a picture and then creates a duplicate, even from memory, we consider that a copyright violation. But when a human looks at a picture and then paints something in the style of that picture, we don't consider that a copyright violation. However we don't know how the brain does it in either case.
How is this different to Stable Diffusion imitating artists?
> That argument also makes little sense when you consider that the model is a couple gigabytes itself, it can't memorize 240TB of data, so it "learned".
The matter is really very nuanced and trivialising it that way is unhelpful.
If I recompress 240TB as super low quality jpgs and manage to zip them up as single file that is significantly smaller than 240TB (because you can), does the fact they are not pixel perfect matches for the original images mean you’re not violating copyright?
If an AI model can generate statistically significantly similar images from the training data, with a trivial guessable prompt (“a picture by xxx” or whatever) then it’s entirely arguable that the model is similarly infringing.
The exact compression algorithm, be it model or jpg or zip is irrelevant to that point.
It’s entirely reasonable to say, if this is so good at learning, why don’t you train it without the art station dataset.
…because if it’s just learning techniques, generic public domain art should be fine right? Can’t you just engineer the prompting better so that it generates “by Greg Rutkowski“ images without being trained on actual images by Greg?
If not, then it’s not just learning technique, it’s copying.
So; tldr: there’s plenty of scope for trying to train a model on an ethically sourced dataset, and investigation of techniques vs copying in generative models.
It is 100% not something we can just brush off.
If you compress them down to two or three bytes each, which is what the process effectively does, then yes, I would argue that we stand to lose a LOT as a technological society by enforcing existing copyright laws on IP that has undergone such an extreme transformation.
Does that mean it’s worthless to try to train an ethical art model?
Is it not helpful to show that you can train a model that can generate art without training it on copyrighted material?
Maybe it’s good. Maybe not. Who cares if people waste their money doing it? Why do you care?
It certainly feels awfully convenient that there are no ethically trained models, because it means no one can say “you should be using these; you have a choice to do the right thing, if you want to”.
I’m not judging; but what I will say is that there’s only one benefit in trying to avoid and discourage people training ethical models:
…and that is the benefit of people currently making and using unethically trained models.
You couldn't teach a human to do that without them having seen Greg's art. There are elements of stroke, palette, lighting and composition that can't be fully captured by natural language (short of encoding an ML model, which defeats the point).
However, copyright doesn't prevent someone from looking at the work and studying it, even learning it by heart. Infringement comes only if that someone makes a reproduction of that work. Also, there are provisions for fair use, etc.
Is it fair to hold it to a higher standard than humans though? To some degree it's the whole "xxx..... on a computer!" thing all over again if we go that way
Can you please rewrite this in the writing style of Socrates?
Harping about copyrights in the Age of Diffusion Models is unhelpful (for artists) like protesting against a tsunami. It's time to move up the ladder.
ML engineers have a similar predicament - GPT-3 like models can solve at first try, without specialised training, tasks that took a whole team a few years of work. Who dares still use LSTMs now like it's 2017? Moving up the ladder, learning to prompt and fine-tune ready made models is the only solution for ML eng.
The reckoning is coming for programmers and for writers as well. Even scientific papers can be generated by LLMs now - see the Galactica scandal where some detractors said it will empower people to write fake papers. It also has the best ability to generate appropriate citations.
The conclusion is that we need to give up some of the human-only tasks and hop on the new train.
Oh and also I second the fast.ai suggestion, part 2 is 100% focused on implementing stable diffusion from scratch in the python standard library and it's amazing all around. The course is still actively coming out but the first few lessons are freely available already and the rest sounds like it will be made freely available soon.
I trained from scratch with 4x3090 and while it’s not as good as SD it’s surprisingly better with hands.
I'm particularly interested in the image generation part (the DDPM/SGM).
There’s code on my GitHub (glid3)
edit: The architecture is identical to SD except I trained on 256px images with cosine noise schedule instead of linear. Using the cosine schedule makes the unet converge faster but can overfit if overtrained.
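For anyone wondering what the schedule difference looks like in practice, here is a minimal sketch of the two. It assumes the plain DDPM-style linear schedule (betas from 1e-4 to 0.02) and the cosine schedule with offset s=0.008 and clipping at 0.999; those constants are the usual published defaults, not necessarily what SD or glid3 actually use, and the function names are made up for illustration.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Noise variance rises linearly from beta_start to beta_end."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def cosine_betas(T, s=0.008, max_beta=0.999):
    """Betas derived from a squared-cosine curve for the cumulative signal level."""
    def alpha_bar(u):
        return math.cos((u + s) / (1 + s) * math.pi / 2) ** 2

    betas = []
    for t in range(T):
        beta = 1 - alpha_bar((t + 1) / T) / alpha_bar(t / T)
        betas.append(min(beta, max_beta))  # clip to avoid a degenerate last step
    return betas

def cumulative_alpha(betas):
    """Running product of (1 - beta): how much of the original image survives."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1 - b
        out.append(prod)
    return out

T = 1000
linear_ab = cumulative_alpha(linear_betas(T))
cosine_ab = cumulative_alpha(cosine_betas(T))

# Halfway through, the cosine schedule has destroyed much less of the image.
print(f"linear alpha_bar at t=500: {linear_ab[499]:.3f}")
print(f"cosine alpha_bar at t=500: {cosine_ab[499]:.3f}")
```

The cosine curve keeps more signal through the middle timesteps instead of spending many steps on near-pure noise, which is one plausible reason the unet would converge faster.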
edit 2: Just tried it again and my model is also pretty bad at hands actually. It does get lucky once in a while though.
I use an open air rig like the ones used for crypto mining. 4x3090 would normally trip the breakers without mods but if you under volt the cards the power draw is just under the limit for a home AC outlet.
Doesn't the "BY" part of the license mean you have to provide attribution along with your models' output? I feel you'll have the equivalent of Github Copilot problem: it might be prohibitive to correctly attribute each output, and listing the entire dataset in attribution section won't fly either. And if you don't attribute, your model is no different than Stable Diffusion, Copilot and other hot models/tools: it's still a massive copyright violation and copyright laundering tool.
 - https://creativecommons.org/licenses/by-sa/4.0/
Isn't it a bit anthropomorphic to compare the two algorithms by "how a human believes they work" instead of "what they're actually doing differently to the inputs to create the outputs"?
These are algorithms and we can look at how they work, so it feels like a cop-out to not do that.
The attribution requirement would absolutely apply to the model weights themselves, and if I ever get this thing to train at all I plan to have a script that extracts attribution data from the Wikimedia Commons dataset and puts it in the model file. This is cumbersome, but possible. A copyright maximalist might also argue that the prompts you put into the model - or at least ones you've specifically engineered for the particular language the labels use - are derivative works of the original label set and need to be attributed, too. However, that's only a problem for people who want to share text prompts, and the labels themselves probably only have thin copyright.
Also, there's a particular feature of art generators that makes the attribution problem potentially tractable: CLIP itself was originally designed to do image classification. Guiding an image diffuser is just a cool hack. This means that we actually have a content ID system baked into our image generator! If you have a list of what images were fed into the CLIP trainer and their image-side outputs, then you can feed a generated image back into CLIP and compare the distance in the output space to the original training set and list out the closest examples there.
 A US copyright doctrine in which courts have argued that collections of uncopyrightable elements can become copyrightable, but the resulting protection is said to be "thin".
 CLIP uses a "dual headed" model architecture, in which both an image and text classifier are co-trained to output data into the same output parameter space. This is what makes art generators work, and it can even do things like "zero-shot classification" where you ask it to classify things it was never trained on.
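The lookup described above (embed the generated image, then list the nearest training images in CLIP's output space) can be sketched roughly like this. The embeddings here are tiny made-up vectors and the file names are hypothetical; in a real setup they would come from the CLIP image encoder and from the Wikimedia Commons dataset, and you would want a proper nearest-neighbour index rather than a linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def closest_training_images(query, training_embeddings, k=3):
    """Return the k training images most similar to the query, best first."""
    scored = [
        (cosine_similarity(query, emb), name)
        for name, emb in training_embeddings.items()
    ]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:k]]

# Hypothetical image-side embeddings saved at training time.
training_embeddings = {
    "File:Example_A.jpg": [0.9, 0.1, 0.0],
    "File:Example_B.jpg": [0.0, 1.0, 0.0],
    "File:Example_C.jpg": [0.7, 0.7, 0.1],
}

# Hypothetical embedding of a freshly generated image.
generated = [1.0, 0.2, 0.0]

matches = closest_training_images(generated, training_embeddings, k=2)
for name, score in matches:
    print(f"{name}: {score:.3f}")
```

Note that this only ranks training images by similarity in embedding space; whether that counts as meaningful attribution is a separate question.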
Just to be correct, SD generates labels on images sometimes, so, we need to worry ;)
This is not possible because the model is smaller than the input weights. Just as any new image it generates is something it made up, any attributions it generated would also be made up.
CLIP can provide “similarity” scores but those are based on an arbitrary definition of “similarity”. Diffusion models don’t make collages.
"— If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original"
There is working training code for openCLIP https://github.com/mlfoundations/open_clip
But training multi-modal text-to-image models is still a _very_ new thing, in terms of the software world. Given that, my experience has been that it's never been easier to get to work on this stuff from the software POV. The hardware is the tricky bit (and preventing bandwidth issues on distributed systems).
That isn't to say that there isn't code out there for training. Just that you're going to run into issues and learning how to solve those issues as you encounter them is going to be a highly valuable skill soon.
I'm seeing in a sibling comment that you're hoping to train your own model from scratch on a single GPU. Currently, at least, scaling laws for transformers mean that the only models that perform much of anything at all need a lot of parameters. The bigger the better - as far as we can tell.
Very simply - researchers start by making a model big enough to fill a single GPU. Then, they replicate the model across hundreds/thousands of GPUs, but feed each a different subset of the data. Model updates are then synchronized, hopefully taking advantage of some sort of pipelining to avoid bottlenecks. This is referred to as data-parallel.
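To make the data-parallel idea concrete, here is a toy sketch that simulates it in plain Python: four "workers" hold the same one-parameter model, each computes a gradient on its own shard, and the gradients are averaged before every update. The model, data and function names are invented for illustration; real setups use something like an all-reduce across GPUs rather than a Python loop.

```python
def local_gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr):
    """One synchronized update: per-worker gradients, then an average."""
    grads = [local_gradient(w, shard) for shard in shards]  # each "GPU" works alone
    avg_grad = sum(grads) / len(grads)                      # stand-in for all-reduce
    return w - lr * avg_grad                                # same update on every replica

# Toy dataset generated from y = 3x, split evenly across 4 simulated workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, lr=0.01)

print(f"learned w = {w:.4f}")
```

Because the shards are the same size, averaging the per-worker gradients gives the same update as one big batch over all the data, which is why this scales out without changing the maths.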
Demoing even the v1 of stable diffusion to the non-technical general users blows them away completely.
Now that v2 is here, it’s clear we’re not able to keep pace in developing products to take advantage of it.
The general public still is blown away by autosuggest in mobile OS keyboards. Very few really know how far AI tech has evolved.
Huge market opportunity for folks wanting to ride the wave here.
This is exciting for me personally, since I can keep plugging in newer and better versions of these models into my app and it becomes better.
Even some of the tech folks I demo my app to, are simply amazed how I can manage to do this solo.
Let's take a deterministic algorithm that predictably corrects your typos and build it on AI. It will offer you no benefits, but it will completely destroy the utility since it will never work predictably or accurately.
Suggest puts up options for the next word.
They both serve the same purpose of helping the user quickly and accurately communicate on a cell phone. Like auto-suggest, I rely on auto-correct to fix things that I know I commonly mistype. When it doesn't work predictably, it's useless.
Honestly I was quite surprised at how regular people are impressed by this tech. I was also surprised by how little regular people are aware of this tech even existing.
We, on hackernews, on a thread about Stable Diffusion, are of course not too unimpressed.
But that’s not the vast majority of people.
>> Yeah! Don't they make a trillion dollars a year? How is it so crappy?
And for some damn reason they refuse to stop changing "ok" to "OK" like we're all octogenarians on Facebook.
Hence, I'm working on http://diffudle.com/ which is a mix of Wheel Of Fortune + Stable Diffusion + Wordle. I can't figure it out, but it feels to me like it's lacking something.
That's awesome, I love it!
It’s possible this time is different, but people at my company were entertained by DALLE for all of 5 minutes before no one ever mentioned it again. The value proposition is simply low.
I guess you've never seen my drawings...
This revolution is allowing us to conduct the orchestra instead of playing each instrument.
I've already seen tools that support workflows where you compose art by iteratively generating a piece of it, performing some correction, and repeating. So, I think there's room in the art world for less than perfectly generated art. That said, let's not kid ourselves that the typical failure modality of ML today (99% correct enough, 1% disastrously incorrect) doesn't either cause it to be entirely useless in many applications or end up wreaking havoc on end users in others.
What do the results have to do with "non-technical" people? I am blown away by the images I get out of Stable Diffusion every time I run it.
what precisely is the market here?
Not looking to monetize at all. But inference is expensive. So might have something to cover costs.
When I was growing up in the early 90s, my dad took me into his office over the weekends when he was doing some overtime paperwork. I would be on his IBM Windows 3.1 workstation. He didn’t have any games on his work computer, so I would spend the entire day “playing” with MS Paint. I couldn’t read yet (3-4 years old), but I was able to figure it out.
We didn’t have a computer at home. But seeing how I was so good at it, my parents bought one. I eventually got into coding etc. All of this defined who I am today.
So I wanted to recreate some of this magic, for my own son. He’s 3 months old, so not quite the right age. But I have some free time on parental leave. So why not. Might be useful for parents with 3-5 year olds.
>Be kind. Don't be snarky. Have curious conversation; don't cross-examine. Please don't fulminate. Please don't sneer, including at the rest of the community. Edit out swipes.
Comments should get more thoughtful and substantive, not less, as a topic gets more divisive.
You are making changes to a product's UX based on graphical inference?
I could see a decent business supporting the logic problems a UX designed from AI graphics would introduce ;)
Is that a "large" organization market or not depends on your metric and what the market positioning of the offering is. I would see applications in both specialist content creation tools as well as "stock photos and merch".
In terms of finding stock photos, if you add a better text API that is easier to control, this can probably compete with static stock photos, in the sense that people can tune their images as much as they like. For example, with their corporate merch: imagine producing a slideset at Acme Co. ("Please give me an elephant and walrus wearing Acme caps").
Ad agencies already love that they can train a model to quickly iterate product shot ideas extremely rapidly.
Then we have "the usual" effect automation has on market demand: automation increases the productivity of a task requiring labour, which reduces the cost of a unit of production, which generally increases demand. I.e. creative stuff will be cheaper to do; you won't replace artists, but suddenly the dude or dudette who spent hours just tweaking stuff has their own art studio at their fingertips to command. They can get so much more done much faster.
The tech is not 100% bulletproof yet, but at this pace it will be good enough soon (or probably already is for several applications, if there were just some UX sugar targeting a specific domain workflow).
And it's still not available in my language on iOS... :( (Norwegian)
HuggingFace Space (currently overloaded unsurprisingly): https://huggingface.co/spaces/stabilityai/stable-diffusion
Doing a 2.0 release on a (US) 2-day holiday weekend is an interesting move.
It seems a tad more difficult to set up the model than the previous version.
The docs aren't good though, it tells you to download two things when actually I think you only need one. If you do need two then it doesn't tell you at all where to put the second.
You really need xformers if you're doing it at home, I've got a 3090 and it blew through the ram without it. However, the instructions didn't work for me for compiling and there's an incompatibility if you try and install from conda. You can have it work but you need to upgrade python from 3.8.5 to 3.9 in the yaml file first, then you can install it (xformers needs 3.9+, and something else in SD breaks on 3.10+ so 3.9 works).
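For a rough sense of why memory-efficient attention matters at the new resolution, here is a back-of-the-envelope sketch. It assumes the 8x VAE downsampling Stable Diffusion uses, fp16 values, and a naive implementation that materializes the full attention matrix for a single head at the highest latent resolution. Real usage depends on head count, batch size and implementation, so treat the numbers as illustrative only.

```python
def naive_attention_matrix_mib(image_size: int, vae_factor: int = 8,
                               bytes_per_value: int = 2) -> float:
    """Size in MiB of one full (tokens x tokens) self-attention matrix for one head."""
    latent_side = image_size // vae_factor   # e.g. 512 px -> 64 latent cells per side
    tokens = latent_side * latent_side       # one token per latent cell
    return tokens * tokens * bytes_per_value / 2**20


for size in (512, 768):
    print(size, naive_attention_matrix_mib(size), "MiB per head")
```

Going from 512 to 768 pixels multiplies that one matrix by roughly 5x, which is the kind of growth kernels like xformers sidestep by not materializing it in full.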
This needs the classic "sit next to a new person installing it by following the docs and see what problems they hit, fix the docs and start from scratch again" process.
Looks good, though so far the images I've made don't look as nice as with 1.4, but I guess that's largely down to finding the right tweaks for the model and right magic wording for the prompts.
Their HQ seems to be located in London.
You can see nobody likes this new model in any of the Stable Diffusion communities. It's a big flop, and for a good reason. The reason it was so successful in the first place was because you could combine artist names to get the model to the outcome you want.
I'll again remind anyone who thinks they might want to use this to download a working version of SD now. They might break their own libraries in the future, and getting SD1.4 could be a real hassle in a year or so. Getting the right .ckpt file, which can have pickled python malware, is not so trivial, and this will get worse in time.
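To make the pickle risk concrete, here is a minimal sketch that lists what a pickle would import without ever loading it, using only the standard library's `pickletools`. The `Suspicious` class is a hypothetical payload made up for the demo, and the `STACK_GLOBAL` handling is a heuristic (it assumes the module and attribute names are the two most recent string constants), so this is not a replacement for a real scanner. A real .ckpt is typically a zip archive wrapping a pickle, so you would have to point this at the pickle inside it; that step is left out here.

```python
import pickle
import pickletools


def list_pickle_imports(data: bytes) -> list:
    """List the callables/classes a pickle would import, without unpickling it."""
    imports = []
    strings = []  # string constants seen so far (heuristic for STACK_GLOBAL)
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name in ("GLOBAL", "INST"):
            # pickletools reports these as "module name"
            imports.append(arg.replace(" ", ".", 1))
        elif opcode.name == "STACK_GLOBAL" and len(strings) >= 2:
            # module and attribute name were pushed as the two preceding strings
            imports.append(f"{strings[-2]}.{strings[-1]}")
        elif isinstance(arg, str):
            strings.append(arg)
    return imports


class Suspicious:
    """Hypothetical payload: unpickling this would call eval()."""
    def __reduce__(self):
        return (eval, ("1 + 1",))


benign = pickle.dumps({"weights": [0.1, 0.2]})
hostile = pickle.dumps(Suspicious())  # dumping is safe; only loading runs the call

print(list_pickle_imports(benign))
print(list_pickle_imports(hostile))
```

Plain data should list no imports at all, so anything like `builtins.eval` or `os.system` showing up in a weights file is a reason not to load it.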
It's going to diverge into a castrated official model that intentionally breaks the older models, and older models from unofficial shady sources that might contain malware.
Social media/"AI ethics" pressure groups will eventually come from these organizations (see Meta's recent debacle with Galactica). Being an unknown org without these pressures was a big reason Stable Diffusion got so popular in the first place.
I suspect that, for liability issues similar to SD 2.0's, they will not start embedding sub-2.0 weights into consumer electronics.
All these models are pretty good as that community is strong on art, styles, art skill, and tagging, causing the models to be a serious test case for what's possible. The model with artist names was indeed capable of invoking their styles (for instance, an artist with exceptional anatomy rendering had it translate into the AI version). The more-trained model without the artist names was much more intelligent. It was simply more capable of quality output, so long as your intention wasn't 'remind me of this artist'.
I think that's likely to be true in the general case, too. This tech is destined for artist/writer/creator enhancement, so it needs to get smarter at divining INTENT, not just blindly generating 'knock-offs' with little guidance.
What you want is better tagging in the dataset, and more personalized. If I have a particular notion of an 'angry sky', this tech should be able to deliver that unfailingly, in any context I like. Greg Rutkowski not required or invoked :)
It does look like artist names have a significantly reduced effect.
StableDiffusion 1.0 used CLIP released by OpenAI. 2.0 uses a CLIP retrained from scratch by Stability.
We don’t know OpenAI’s dataset so don’t know what was in it or how to recreate it. Nothing was “removed”.
I suspect that people will find keywords that would improve the aesthetics further again, or that fine-tuning will also take place.
Seems the 768-v model, if used properly, can substantially speed up generation, but I'm not exactly sure yet. It seems straightforward to switch to the 512-base model for my app next week.
I think they're looking into larger models later though
768x768 native models (v1.x maxed out at 512x512)
a built-in 4x upscaler: "Combined with our text-to-image models, Stable Diffusion 2.0 can now generate images with resolutions of 2048x2048–or even higher."
Depth-to-Image Diffusion Model: "infers the depth of an input image, and then generates new images using both the text and depth information." Depth-to-Image can offer all sorts of new creative applications, delivering transformations that look radically different from the original but which still preserve the coherence and depth of that image (see the demo gif if you haven't looked)
Better inpainting model
Trained with a stronger NSFW filter on training data.
For me the depth-to-image model is a huge highlight and something I wasn't expecting. The NSFW filter is a nothing (it's trivially easy to fine-tune the model on porn if you want, and porn collections are surprisingly easy to come by...).
The higher resolution features are interesting. HuggingFace has got the 1.x models working for inference in under 1G of VRAM, and if those optimizations can be preserved it opens up a bunch of interesting possibilities.
Not really surprised they did this, but be sure some communities will have it fine tuned on porn now-ish. So probably they did it for legal reasons in case illegal materials are generated and they are real companies/people with their names on the release?
Reading between the lines, it likely generated CSAM-like images even without explicit prompting for it.
>> it can't memorize 240TB of data, so it "learned"
P.S. if you want to buy a graphics card, make sure to have at least 12GB VRAM
Is it realistic to make use of this on the command line, feeding it my own images? Or has someone wrapped it in an app or online service?
Anyway, I think it would be fun to play with; it just depends on the content of the image and the artist's preferences. I still haven’t printed a full page of the upscaled photo but I do want to try that and see how it looks in comparison!
 The new Photo AI, on the other hand, is slow, clunky, and not infrequently glitches out wildly. But on the plus side it does combine sharpening and denoising into one workflow.
Note that on some image types it tends to make things look digitally painted rather than detailed. I recommend you try a few different tools and see what works best for the type of photography you do.
A confrontation is inevitable, though. Right now it costs moderate sums of money to do this level of training. Not always will this be so. If I were an AI-centric organization, I would be racing to position myself as a trustworthy actor in my particular corner of the AI space so that when legislators start asking questions about the explosion of bad actors, I can engage in a little bit of regulatory capture, and have the legislators legislate whatever regulations I've already implemented, to the disadvantage of my competitors.
For people who say "people can make whatever images they like in photoshop," I will remind you of this:
and they will lose it, just like they've lost the war on encryption.
As I explained: This kind of mandated restriction is looming over AI. Companies are trying to get out in front of these restrictions so they can implement them on their own terms.
But images of boobs are still legal, so this NSFW filter seems to go well beyond what the law asks. Is the issue that even if you do not train with CP, you might get the model to output something that some random person will be offended by and label as CP? I assume that other companies can focus on NSFW and have their lawyers figure this out. IMO it would be cool if someone sued the governments and made them reveal the facts behind their concern that "CP of fake or cartoon people is dangerous"; I think they could focus on saving real children rather than cartoon ones.
Indeed, what they're already doing is already hobbling the models.
Emad is right that we learn new things from the creativity unleashed by accessible models that can be run (and even fine tuned) on consumer hardware.
But judging from what people post, one thing we learn is that it seems models fine tuned on porn (such as the notorious f222 and its derivative Hassan's blend) can be quite a bit better at non-porn generation of diverse, photorealistic faces and hands too.
I'm not sure I understand this. A possible implementation could be a neural net that blanked the screen with a frown face any time it detected something it thinks was "bad". What purpose/need would pre-screening serve?
I think this is making the assumption that all frames are blocked.
> then it would be disastrous for latency
We're talking about the future here. I'm not sure it makes sense to use current tech to say it's not going to happen, or come up with latency numbers. But, "real time" inference is definitely a possibility, and is in active use for video moderation (Youtube, etc) and object detection (Tesla, etc). Nobody will notice a system running at 2000fps.
Also seems problematic to approach this from a purely capitalistic and consumerist angle. There is a lot of opportunity here besides just launching the next AI unicorn.
I will say that while the government backlash is inevitable just like it was with encryption, these image generation models are so easy to train on consumer hardware that the cat is hopelessly out of the bag. It might as well be thoughtcrime.
It will, however, definitely not affect the more-common use case of anime women with very large breasts. And people will be able to finetune SD 2.0 on NSFW images anyways.
At least if people have to finetune the model on that shit, then you can argue that it's not your fault because someone had to do extra steps to put stuff in there.
Diffusion models don't need any CSAM in the training dataset to generate CSAM. All they need is any random NSFW content alongside any safe content that includes children.
That said, part of the problem with the general ignorance about machine learning and how it works is that there will be totally unreasonable demands for technical solutions to social problems. “Just make it impossible to generate CP” I’m sure will succeed just as effectively as “just make it impossible to Google for CP.”
So... very, very well? I obviously don't have numbers, but I imagine CSAM would be a lot more popular if Google did nothing to try to hide it in search results.
The underlying idea you have is that the artificial CSAM is a viable substitute good - i.e. that pedophiles will use that instead of actually offending and hurting children. This isn't borne out by the scientific evidence; instead of dissuading pedophiles from offending it just trains them to offend more.
This is opposite of what we thought we learned from the debate about violent video games, where we said stuff like "video games don't turn people violent because people can tell fiction from reality". This was the wrong lesson. People confuse the two all the time; it's actually a huge problem in criminal justice. CSI taught juries to expect infallible forensic sci-fi tech, Perry Mason taught juries to expect dramatic confessions, etc. In fact, they literally call it the Perry Mason effect.
The reason why video games don't turn people violent is because video game violence maps poorly onto the real thing. When I break someone's spine in Mortal Kombat, I input a button combination and get a dramatic, slow-motion X-ray view of every god damned bone in my opponent's back breaking. When I shoot someone in Call of Duty, I pull my controller's trigger and get a satisfyingly bassy gun sound and a well-choreographed death animation out of my opponent. In real life, you can't do any of that by just pressing a few buttons, and violence isn't nearly that sexy.
You know what is that sexy in real life? Sex. Specifically, the whole point of porn is to, well, simulate sex. You absolutely do feel the same feelings consuming porn as you do actually engaging in sex. This is why therapists who work with actual pedophiles tell them to avoid fantasizing about offending, rather than to find CSAM as a substitute.
I don't believe this is the reason. By practicing martial arts, which map well to real-life violence, I do not see an increase in violent behaviour. Similarly, playing FPS games in VR, which maps much more closely than flat-screen games, does not make me want to go shoot people in real life. I don't think people playing paintball or airsoft will turn violent from partaking in those activities. The majority of people are just normal people, not bad people who would ever shoot someone or rape someone.
>You know what is that sexy in real life? Sex.
Why is any porn legal then? If porn turned everyone into sexual abusers I would believe your argument, but that just isn't true. If it were true that a small percentage of people who see porn will turn into sexual abusers I don't think that makes it worth banning porn altogether. I feel there should be a better way that doesn't restrict people's freedom of speech.
I can't believe someone says this. It's so not true in my experience. These feelings have a lot in common, but they are definitely not the same.
If someone is generating sketchy cartoons from a training set of sketchy cartoons... well, gross, but there's no victims there.
I don't think the fact that it's artificially generated has any bearing for some important purposes.
The model does not contain the images themselves though. I think it would not be classified as that.
This is a huge missed opportunity to actually help society.
What do you propose? The FBI releases a CSAM data set for devs to use for “training”?
Would you be the one to create the model? Would you run a business that sells synthetic CSAM?
That being said, this is a question for sociologists/psychologists IMO. Would giving people with these kinds of tendencies that kind of material make them more or less likely to cause harm? Is there a way to answer that question without harming anybody?
In the mean time, stay away from 4chan.
Anyway, one obvious application: FBI could run a darknet honeypot site selling AI-generated child porn. Eliminate the actual problem without endangering children.
It's very unlikely AI generated child porn would even be illegal. Drawn or photoshopped photos aren't so I don't think AI generated would be.
Don't forget that pornographic images and videos featuring children may be used for grooming purposes, socializing children into the idea of sexual abuse. There's a legitimate social purpose in limiting their production.
Apparently Facebook has a huge problem with distribution through messenger.
I can't judge how likely that is.
I guess I also don't care much as I only really care about stopping production using real children, simulated CSAM gets a shrug and even use of old CSAM only gets a frown.
My (lol now flagged) opinion is that it’s kind of weird to advocate for the CSAM archive to move into [literally any private company?] to turn it into some sort of public good based on… frowns?
There’s a lot of important social questions to ask about the future of pornography, but I’m sure not going to be the one to touch that with a thousand foot pole.
Now, if you meant gross cartoons, yes, those get posted daily. But there are no children being abused by the creation or sharing of those images, and conflating the two types of image is dishonest.
This is not a game release. It doesn't matter if it's cracked tomorrow or in a year. With open source, no less, it's going to happen sooner rather than later.
As disgusting as it is, somebody is going to feed CP to an AI model, and that's just the reality of it. It's going to happen one way or another, and it's not any of these AI companies' fault.
Isn't the main feature of Stable Diffusion that it doesn't?
I very much doubt the police will look at AI this way when such models do eventually hit the web (assuming they haven't already) but at some point someone will get caught through this stuff and the arrest itself may have damning consequences throughout the AI space.
I swear every time I find myself thinking “Hey, stop being so cynical and jaded all the time”, I stumble across something like this.
The scary thing is that you can then train it further with things like DreamBooth to start producing porn of celebrities… or, even more worrying, people you know.
Seriously folks, we are within a year or less of this being trivial. It’s already possible with a lot of work today.
All the difficult parts (poses, backgrounds, art styles) have already been done by the SD researchers; the porn network only needs reference material for the NSFW description/tags/details. This is significantly cheaper.
A similar project, training SD to output images in the style of Arcane, is incredibly successful in replicating the animation style with what seems to be very little actual training data.
I don't think you need to start from scratch at all if you use the SD model as a base, all you need to do is to train it on specific concepts, styles and key words that the original doesn't have.
Have you ever seen a non-porn DVD that had multiple camera angles (a feature defined in the standard DVD spec)?
This is an urban myth.
I'll even say that the high-bandwidth push in the public was highly related to that. Even with HTML5 video players, adult websites were faster to implement them than big streaming websites that still used Flash or similar tech.
What's more interesting, is that there's evidence (from public posts, I haven't tried these models myself) that models trained on some porn get better at non-porn images too.
Deceitful extremists and vengeful criminals fabricating lies seem to be a far more serious problem than fantasy porno.
Also lexica.art is swarming with celebrity fantasy porn that just has a thin stylistic filter of paintings from the 19th century. And a plethora of furry daddies that you can't not love.
I get why these models should be curated but I also like that the sketchy porn possibilities keep them feeling un-padded / interesting / dangerous.
Then again this all is probably really dangerous so maybe that's silly.
I thought that was Justice Stewart? And then he answered it "I know it when I see it."
They can force model upgrades too:
> The New AI Model Licenses Have a Legal Loophole (OpenRAIL-M of Stable Diffusion)
5. and 7. make it not open source
If you're trying to build an app based on SD, then not being open source matters. But seems like the majority of use cases are just "I want to run the model locally". And at that point HF can't stop me from just ripping the Wi-Fi card out of my computer.
I guess SD is betting on saving $ on compute being more important in this space than the ability to gatekeep certain queries. And the tradeoff is that you need to do nsfw filtering in your released model.
It will be interesting to see who's right in 2 years.
It works with Pytorch -> torch-mlir -> MLIR / IREE -> vulkan. Works on both Windows and Linux. And has a simple gradio web UI https://github.com/nod-ai/SHARK/tree/main/web but we plan to enable better UI integrations very soon.
Join us on discord https://discord.gg/RUqY2h2s9u if you have any trouble. Appreciate any / all feedback.
Similar to those games, anyone is also able to distribute their own open data files if they so wish
It's unlikely anyone actually will start training an open source AI model from scratch because doing so costs insane amounts of money, but the same can be said about the many hours of work recreating game assets can take for open source game engines.
Yes, someone else could spend the millions of dollars to create a model that actually is open source, but shouldn't the people advertising their models as open source do that?
Does not openly distributing their data files make their code any less open source? I don't think so. The code is open and licensed with a FOSS license. They spend time and money on creating a model and give the world the ability to replicate their model if it can collect the necessary funds. There are plenty of other open source projects that require vast arrays of server racks and compute power to be useful, that doesn't change anything about the openness of the code.
Ran into a few errors with the default instructions related to CUDA version mismatches with my nvidia driver. Now I'm trying without conda at all. Made a venv. I upgraded to the latest that Ubuntu provides and then downloaded and installed the appropriate CUDA from .
That got me farther. Then ran into the fact that the xformers binaries I had from my earlier attempts are now incompatible with my current drivers and CUDA, so rebuilding that one. I'm in the 30-minute compile, but did the `pip install ninja` as recommended by  and it's running on a few of my 32 threads now. Ope! Done in 5 mins. Test info from `python -m xformers.info` looks good.
Damn, still hitting CUDA out of memory issues. I knew I should have bought a bigger GPU back in 2017. Everyone says I have to downgrade pytorch to 1.12.1 for this to not happen. But oh dang, that was compiled with a different CUDA, oh groan. Maybe I should get conda to work after all.
`torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.00 MiB (GPU 0; 5.93 GiB total capacity; 5.62 GiB already allocated; 15.44 MiB free; 5.67 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF`
Guess I better go read those docs... to be continued.
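For anyone landing here with the same error, the setting it mentions is an environment variable read by PyTorch's CUDA allocator. A minimal sketch of setting it from Python follows; the value 128 is an arbitrary starting point chosen for illustration, not something recommended in this thread.

```python
import os

# Must be set before PyTorch initializes CUDA, so do it before `import torch`.
# 128 is an arbitrary starting value (an assumption), not a tested recommendation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

Note that the error above shows 5.67 GiB reserved against 5.62 GiB allocated, so reserved is not much larger than allocated. By the message's own rule of thumb, fragmentation is probably not the main problem here and the model may simply not fit in ~6 GiB without other memory savings.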
Then again, it's their model, they can do whatever they want with it, but it still leaves me with a weird feeling.
Though, previous fine-tunings/textual inversions won’t work since the CLIP encoder has been replaced too. I’d be interested in knowing if it needs to be retrained too for this case.
Edit: Looks like AUTOMATIC1111 can merge three checkpoints. I still don't know how it works technically, but I guess that's how it's done?
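On the "how it works technically" part: checkpoint merging is usually described as plain per-weight arithmetic over tensors with matching names. The sketch below shows the two commonly described modes on toy data. It is not taken from AUTOMATIC1111's code, the function names are made up for illustration, and it assumes every checkpoint has the same keys and shapes (lists of floats stand in for tensors).

```python
def weighted_sum(a: dict, b: dict, alpha: float) -> dict:
    """Interpolate two checkpoints key by key: (1 - alpha) * A + alpha * B."""
    return {k: [(1 - alpha) * x + alpha * y for x, y in zip(a[k], b[k])] for k in a}


def add_difference(a: dict, b: dict, c: dict, alpha: float) -> dict:
    """Three-way merge: A + alpha * (B - C), grafting what B learned over C onto A."""
    return {k: [x + alpha * (y - z) for x, y, z in zip(a[k], b[k], c[k])] for k in a}


# Toy "checkpoints": one tensor each, flattened to a list of floats.
base = {"layer.weight": [1.0, 2.0]}
tuned = {"layer.weight": [3.0, 2.0]}
other = {"layer.weight": [1.0, 4.0]}

print(weighted_sum(base, tuned, 0.5))
print(add_difference(other, tuned, base, 1.0))
```

The three-checkpoint case is the second function: take what `tuned` changed relative to `base` and add it to `other`.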
Someone even figured out they could get great compression of specialized model files by first subtracting the base model from the specialized model (using plain arithmetic) before zipping it. Of course, you need the same base file handy when you go to reverse the process.
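A toy sketch of that trick, under loud assumptions: it uses byte-wise subtraction mod 256 so the round trip is exact (subtracting floats directly may not be bit-exact when reversed), the "weights" are random bytes, and the fake fine-tune only touches one byte in a thousand. Real fine-tunes change many weights slightly, so the actual gain depends on how small and regular the differences are.

```python
import random
import zlib


def make_delta(specialized: bytes, base: bytes) -> bytes:
    """Byte-wise (specialized - base) mod 256; zero wherever the two files agree."""
    return bytes((s - b) % 256 for s, b in zip(specialized, base))


def apply_delta(delta: bytes, base: bytes) -> bytes:
    """Reverse make_delta: (delta + base) mod 256."""
    return bytes((d + b) % 256 for d, b in zip(delta, base))


rng = random.Random(0)
base = bytes(rng.randrange(256) for _ in range(100_000))  # stand-in for base weights
tuned = bytearray(base)
for i in range(0, len(tuned), 1000):  # pretend fine-tuning touched 1 byte in 1000
    tuned[i] = (tuned[i] + 1) % 256
specialized = bytes(tuned)

delta = make_delta(specialized, base)
print(len(zlib.compress(specialized)), len(zlib.compress(delta)))
assert apply_delta(delta, base) == specialized  # lossless, given the same base file
```

The specialized file on its own barely compresses, while the delta is mostly zeros and shrinks to almost nothing.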
TL;DR very far fetched and a bit pointless to go looking for these non-obvious alternative meanings, in my opinion
Typo or Freudian slip?
Just kidding of course, nice project!
I'm asking because I'm running SD locally but my GPU is not good enough to train new checkpoints, and until I get the time to work on improving that, I wanted to use this API to generate some models for an illustration book I am working on.
Ideally one that states that the uploaded images are deleted after generating the model and not used for anything else in any fashion whatsoever.
Also, let people download the models and delete them afterwards with the same handling. Then it gets very interesting indeed!
There are a lot of tools available, but I haven't found anything where the result isn't just another kind of bad, so if the upscaling and inference in this model is good, it should in theory be possible to restore images by using the old photos as the seed, right?
Couldn't we train a very good model by distributing the dataset along with the computing power using something similar to folding@home?
Mind you, theoretically that is a limitation of our current network architectures. If we could conceive a learning approach that was localised, to the point of being "embarrassingly parallel", perhaps. It would probably be less efficient, but if it is sufficiently parallel to compensate for Amdahl's law, who knows?
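For reference, Amdahl's law as a few lines of code. The 0.95 parallel fraction is a made-up number for illustration, not a measurement of any real training setup.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: overall speedup when only part of the work can be parallelized."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Assumed 95% parallelizable workload: adding volunteers quickly stops helping.
for workers in (10, 1_000, 100_000):
    print(workers, round(amdahl_speedup(0.95, workers), 2))
```

With any serial part at all, the speedup is capped at 1 / serial_fraction no matter how many machines join, which is why the approach would need to be close to fully parallel for a folding@home-style swarm to pay off.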
Less theoretically, one could imagine that we use the same approach that we use in systems engineering in general: functional decomposition. Instead of having one Huge Model To Rule Them All, train separate models that each perform a specific, modular function, and then integrate them.
In a sense this is already happening. Stable Diffusion has one model for img2depth, which estimates which parts of a picture are far away from the lens. They have another model to upscale low-res images to high-res images, etc. This is also how the brain works.
But it is difficult to see how this sort of approach could be applied to very small scale, low contextual tasks, like folding@home.
Is it now possible to generate higher resolution images with less memory?
> It is our pleasure to announce the open-source release of Stable Diffusion Version 2.
> The original Stable Diffusion V1 led by CompVis changed the nature of open source AI models and spawned hundreds of other models and innovations all over the world. It had one of the fastest climbs to 10K Github stars of any software, rocketing through 33K stars in less than two months.
Newbie question, why can’t someone just take a pre-trained model/network with all the settings/weights/whatever and run it on a different configuration (at a heavily reduced speed)?
Isn’t it like a Blender/3D Studio/AutoCAD file, where you can take the original 3D model and then render it using your own hardware? With my single GPU it will take days to raytrace a big scene, whereas someone with multiple higher-specced GPUs will need a few minutes.
But I think maybe you mean, can they make a model which normally needs a lot of RAM run more slowly on a machine that only has a little RAM?
It sounds like there are some tricks to allow the use of smaller amounts of ram by making specific algorithmic tweaks, so if a model normally needs 12GB of VRAM then, depending on the model, it may be possible to modify the algorithm to use 1/2 the RAM for example. But I don’t think it’s the same as other rendering tasks where you can use arbitrarily less compute and just run it longer.
Maybe I’m wrong though.
If you're willing to wait more (30 seconds per image, assuming limited image sizes) there are repositories that will run the model on the CPU instead, leveraging your much cheaper RAM.
In theory you could swap VRAM in and out in the middle of the rendering process, but this would make the entire process incredibly slow. I think you'll have more success just running the CPU version if you're willing to accept slowdowns.
Most people I know don't have a desktop in the first place, and on average I wouldn't guess that desktop users build a new one more often than once every ~4 years. And that's among people who build their own; if you buy pre-built, you have to spend a lot extra to get those top of the line specs.
It's possible to now go out and buy this on a whim if you have a tech job or equivalent salary, though.
In the price range it is the only Nvidia card with 12GiB and the 3080 starts at 10GiB.
So you can certainly get a 12GiB card without spending 3080+ money, but if you want any more power than a 3060 and keep the 12GiB then you would need to spring for a 3080 12GiB which is a big jump in price.
Side note: The 4x upscaler model is showing as unavailable if you follow the hugging face link to it.
 - https://app.gooey.ai/FaceInpainting/
Here's the bug that caused this too - https://bugzilla.mozilla.org/show_bug.cgi?id=1689099
There are some specialized third party performance optimizations you might miss out on though, but nothing major IMO.
Nothing personal against the work, I think it’s brilliant, and cheap. Just like a Kinkade.
- lexica.art for prompt inspiration
It doesn't output color though, only B/W.
Stable Diffusion is amazing at generating art. Something similar but specialized in UI could be too. Maybe one could make a custom model, but with my lack of design knowledge I’m not even sure where to start…
It would surely save my monkey brain from pouring many more hours into looking at existing websites/UI libraries/Dribbble and drawing inspiration (copying) from them.
These new updates are quite great but are they so game-changing that it is considered 2.0?
I hope AI gets these programmers jobs soon.
Then we all can go to the woods and have a good life, finally.
Your beloved corporations don't have a metric called “creativity”, they have a bottom line, and she has all the powers.
I am an artist by education and can confirm that creativity is overrated; the processes that an artist follows and repetition towards a given goal deliver the results.
Whatever feelings or ideas you have, the actual craft is the medium in which you will deliver.
Reducing *The Path* to text input is not an artistic or craftsmanship process.
There is no creativity involved. Maybe someone with more knowledge about the real process and broader visual culture will make more aesthetically right choices. But this can be automated too.
This is not a “tool”, like Photoshop. This is something else.
And all of you know this.
More than 50 percent of frontend code is boilerplate. CRUD apps follow similar logic.
Why not automate these repetitive processes first?
No. Corporations are starting the automation from the lowest risk crowd—the digital artists, they have low representation, no coherent community and are always ready to sell themselves for pennies.
Now they will compete with the machines.
And your time in this battle will come. Soon.