hi friends! burkay from fal.ai here. would like to clarify that the model is NOT built by fal. all credit should go to Black Forest Labs (https://blackforestlabs.ai/) which is a new co by the OG stable diffusion team.
what we did at fal is take the model and run it on our inference engine optimized to run these kinds of models really really fast. feel free to give it a shot on the playgrounds. https://fal.ai/models/fal-ai/flux/dev
metadat 260 days ago [-]
The playground is a drag. After accepting being forced to sign up, attach my GitHub account, and hand over my email address, I entered the desired prompt and waited with anticipation... only to see a black screen and how much it's going to cost per megapixel.
Bummer. After seeing what was generated in the blog post I was excited to try it! Now feeling disappointed.
My go-to test for these tools so far has been the seven-horned, seven-eyed lamb mentioned in the Book of Revelation. Every tool I've tried has failed at this task.
cpfohl 259 days ago [-]
Ah. I try the following:
> A Gary Larson, "Far Side" comic of a raccoon disguising itself by wearing a fedora and long trench coat. The raccoon's face is mostly hidden by the fedora. There are extra paws sticking out of the front of the trench coat from between the buttons, suggesting that the raccoon is in fact a stack of several raccoons.
Every human I've ever described this to has no problem picturing what I mean. It's a classic comic trope. AIs still struggle.
jiggawatts 259 days ago [-]
A rough rule of thumb is that if a text-generator AI model of some size would struggle to understand your sentence, then an image-generator model a couple of times the size or even bigger would also struggle.
The intelligence just doesn't "fit" in there.
Personally I'm curious to see what would happen if someone burnt $100M of compute time on training a truly enormous image generator model, something the same-ish size as GPT4...
shagie 259 days ago [-]
This gets interesting. One approach that I've used with image generation before is to find an image of the sort that I want and have DALL-E describe it... and then modify the prompt that it provides into one with the elements that I want.
The image shows an imaginative, whimsical illustration of a character composed of two parts. The upper part features a man dressed in a long, elegant gray coat, wearing a bowler hat and round sunglasses, with a sophisticated white polka-dot ascot tie. His face has a subtle smile. The lower part of the character transitions seamlessly into a smaller figure of a cat, appearing to wear striped pants, with its tail visible. The entire character combines human and feline elements, creating a surreal, anthropomorphic appearance. The illustration is in black and white, emphasizing a stylized, cartoon-like design.
The image captures a whimsical and secretive scene featuring three dwarves stacked in a totem formation, each attempting to conceal their nature under a large brown cloak. The top dwarf has a bright, cheerful expression and blond hair, holding the cloak wide to mimic wings, and is dressed in black armor adorned with teal gems and matching earrings. The middle dwarf displays a fierce expression, sporting a bushy orange beard, and is also clad in similar dark armor with teal embellishments. The bottom dwarf, an older figure with a long white beard, is adorned in a royal dark outfit with gold accents and a small crown, clasping a glowing white orb. This trio of dwarves, each with distinctive fantasy armor, unites in a playful attempt to disguise their stature and nature, adding an element of adventure and mystery to the scene.
Working off of that idea of the totem formation ... "Create an image featuring three children in a totem pole formation that are trying to conceal their nature in a single oversized trench coat."
I suspect the orange beard came from the previous part in the session. But that might be an approach to take: have the model describe the trope, then edit that description into a prompt it can actually use.
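A rough sketch of that describe-then-edit loop, assuming the OpenAI Python client (the model names, the reference URL, and the replaced phrase are placeholders):

    from openai import OpenAI

    client = OpenAI()

    # Step 1: have a vision model describe a reference image in prompt-ready prose.
    description = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image in detail, as an image-generation prompt."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/reference.jpg"}},
            ],
        }],
    ).choices[0].message.content

    # Step 2: swap in the elements you actually want, then regenerate.
    prompt = description.replace(
        "three dwarves", "three children in a single oversized trench coat")
    result = client.images.generate(model="dall-e-3", prompt=prompt)
    print(result.data[0].url)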
whywhywhywhy 259 days ago [-]
Current generation image generators don't understand text as instructions the way you're trying to use it: describing an object, then placing it, then setting the scene.
It’s more like a giant telescope of many lenses (the latents from the prompts) and you’re adjusting the lenses to bring a possible reality of many into focus.
Taek 259 days ago [-]
It looks like imgur is blocking Mullvad VPN connections
krapp 259 days ago [-]
>Every human I've ever described this to has no problem picturing what I mean. It's a classic comic trope. AIs still struggle.
But AIs learn and therefore create in exactly the same way as humans, ostensibly on the same data. How can this be possible? /s
conception 259 days ago [-]
That classic “bowl of ramen without chopsticks” prompt also fails. Haven't seen any model get that right yet either.
andybak 259 days ago [-]
Negation is a known weak spot. Aren't you just retesting that again and again? Does it tell you much beyond that?
tiborsaas 259 days ago [-]
Can you share the exact prompt you used?
TheAceOfHearts 259 days ago [-]
Sure. Normally I try a few variants, but "lamb with seven horns" was what I tried when I made that post.
For what it's worth, I've previously asked in the Stable Diffusion Discord server for help generating a "lamb with seven horns and seven eyes" but the members there were also unsuccessful.
CGamesPlay 260 days ago [-]
I mean, you can use a fork to make whipped cream, but it won't be easy and it's not the right tool for the job. Does that mean that the fork is useless?
TheAceOfHearts 260 days ago [-]
I never said it was useless, just that it fails at this specific problem. One of my complaints with many of these image generation tools is that there's not much communication as to what should be expected from them, nor do they explain the areas where they're expected to succeed or fail.
Recently Claude began to allow generation of SVG drawings, and asking it to draw a unicorn and later add extra tails or horns worked correctly.
A fork exists in physical space and it's pretty intuitive to understand what it can do. These models exist within digital space and are incredibly opaque by comparison.
lukan 260 days ago [-]
"Recently Claude began to allow generation of SVG drawings, and asking it to draw a unicorn and later add extra tails or horns worked correctly."
That sounds interesting! Were the results somewhat clean and clear SVG or rather a mess that just looked decent?
TheAceOfHearts 260 days ago [-]
Here's the screenshot [0] that was shared with me. It's obviously pretty basic, but Claude understood where the horns and tails should be located. This looks like a clear iterative improvement over older models.
You also might want to "clarify" that it is not open source (and neither are any of the other "open source" models). If you want to call it something, try "open weights", although the usage restrictions make even that a HUGE FUCKING STRETCH.
Also, everybody should remember that these models are not copyrightable and you should never agree to any license for them...
DHolzer 259 days ago [-]
When I read "open source" I thought they were actually doing open source instead of "open weights" this time. Surely they would expect to be called out on hackernews if they labeled it incorrectly...
Thanks for pointing that out @Hizonner
Morphiak 260 days ago [-]
A personal bugbear is AI companies' fascination with calling their models open source; virtue signalling, I guess. Open weights is exactly right. Source code and, arguably more important, datasets are both required to replicate the work, which is more in the spirit of open source (and science). I think Meta is especially egregious here, given their history.
Never underestimate the value of getting hordes of unpaid workers to refine your product. (See also React, others)
andybak 259 days ago [-]
> virtue signalling
I'd prefer "false advertising" - it's more direct and without the culture war baggage.
seanw444 259 days ago [-]
"Open source" is perceived as a virtue, and their claim is false. Thus false virtue claim. Or... virtue signaling.
HPsquared 259 days ago [-]
Indeed. The data is the main "information source" from which the model is trained.
astrange 259 days ago [-]
It's certainly not true that models are not copyrightable; databases have copyright protection if creativity was involved in creating them.
That said, I don't think outputs of the model are derivative works of it, any more than the model is a derivative of its training data, so it's not clear to me they can actually enforce what you do with them.
filleokus 259 days ago [-]
> It's certainly not true that models are not copyrightable; databases have copyright protection if creativity was involved in creating them.
I'm no IP lawyer, but I've always thought that copyright put "requirements" on the artefact (i.e the threshold of originality), not the process.
In my jurisdiction we have database rights, meaning that you get IP protections for the artefact based on the work put into the process. For example a database of distances between address pairs or something is probably not copyrightable, but can be protected under database rights if enough work was done to compile the data.
EDIT: Saw another place in the thread mention the https://en.wikipedia.org/wiki/Sweat_of_the_brow doctrine, which relates to database rights. (Notably, neither is applicable in the U.S.)
kube-system 259 days ago [-]
These models are a product of much more creativity than simply a list of phone numbers in a phone book. I don't see how they wouldn't meet the modicum of creativity required for US copyright protection.
Hizonner 259 days ago [-]
The software that creates the model is the product of creativity. The model itself is the product of mechanically applying that software to datasets that are (a) assembled with minimal, if any creativity, and (b) definitely not assembled with any eye to the specific form of the resulting model. The whole point is to get the software to form the model without you having to worry about what the result is going to look like. So you can't turn around and claim that the model is a creative work because of the choice of training data.
The only thing that's really specified about the model itself is its architecture, which is (1) dictated by function, and (2) usually deeply stereotyped.
kube-system 259 days ago [-]
> mechanically applying that software to datasets that are (a) assembled with minimal, if any creativity, and (b) definitely not assembled with any eye to the specific form of the resulting model.
Fair enough, but those datasets are also primarily copyrighted material. If the software here merely transforms the input material (which I agree it does), then the output is a derivative work.
Hizonner 259 days ago [-]
But the pitch-shifted song is still recognizably a creative work. It has identifiable, humanly comprehensible forms of all the original creative elements that Swift originally put into it (plus I guess a de minimis amount of extra creativity from the choice to pitch shift it).
If I take a string of data from a true hardware RNG, XOR it with a Taylor Swift song, and throw away the original random stream, is the resulting fundamentally random bit string still a derivative work of the song? As with the ML model, you can't recognize the song in it. And as with at least some training examples in the inputs of most ML models, you can't recover the song from it either.
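(Concretely, that construction is just a one-time pad; a toy Python sketch, with the filename as a placeholder:)

    import os

    song = open("song.mp3", "rb").read()   # the copyrighted bitstream
    pad = os.urandom(len(song))            # true-ish hardware randomness
    masked = bytes(a ^ b for a, b in zip(song, pad))
    del pad
    # With the pad discarded, `masked` is indistinguishable from uniform
    # noise, and the song is information-theoretically unrecoverable from
    # it alone.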
It feels like the test for whether X is derivative for copyright purposes should include some kind of attention to whether X is a creative work at all. Maybe not, but then what test do you use?
I do recognize the possibility that the models might not themselves be eligible for copyright as independent works, yet still infringe copyright in the training inputs. It seems messy, but not impossible.
... and as I said elsewhere, it's also messy that while you generally can't recover every training input from the model, you can usually recover something very close to some of the training inputs.
astrange 259 days ago [-]
> If I take a string of data from a true hardware RNG, XOR it with a Taylor Swift song, and throw away the original random stream, is the resulting fundamentally random bit string still a derivative work of the song? As with the ML model, you can't recognize the song in it. And as with at least some training examples in the inputs of most ML models, you can't recover the song from it either.
It's not a copy of it, and when you distribute it you're not distributing the original. So it's not a derivative for copyright purposes.
It can still be a derivative for other legal purposes. Judges don't appreciate it when you do funny math tricks like that and will see through them.
> It feels like the test for whether X is derivative for copyright purposes should include some kind of attention to whether X is a creative work at all. Maybe not, but then what test do you use?
Yes, that's how US copyright law works. (well sort of…)
Being a transformative work of something makes it less of a copy of it, the more transformed it is, since it falls under fair use exemptions or is clearly a different category of thing.
If a model was a derivative of its training data, then Google snippets/thumbnails would be derivatives of its search results and would be illegal too. Unless you wrote a new law to specifically allow them.
In other countries (Germany, Japan) fair use is weaker, but model training has laws specifically making it legal in certain circumstances, and presumably so do Google snippets.
Hizonner 259 days ago [-]
> It's not a copy of it, and when you distribute it you're not distributing the original.
A compressed (or normally encrypted) version wouldn't be a copy that way, either, but I would still absolutely go down for distributing it. The difference is that the compression can be reversed to recover the original. Even lossy compression would create such a close derivative that nobody would probably even bother to make the distinction.
You're right that "math games" don't work in the law, but that cuts both ways. If you do something that truly makes the original unrecoverable and in fact undetectable, and if nothing salient to the legal issues at hand about the new version derives from the original, then judges are going to "see through" the "math trick" of pretending that it is a derivative.
> then Google snippets/thumbnails would be derivatives of its search results
Thumbnails are legally derivative works, in the US and probably most other places. In the US, they're protected by the fair use defense, and in other places they're protected by whatever carveouts those places have. But that doesn't mean they're not derivative works.
In fact, if I remember the US "taxonomy" correctly, thumbnails are infringing. It's just that certain kinds of infringement are accepted because they're fair use.
If thumbnails weren't derivative works at all, then the question of fair use wouldn't arise, because there can be no infringement to begin with if the putatively infringing work isn't either derivative or a direct copy.
Where thumbnails are different from ML models is that they're clearly works of authorship. In a thumbnail, you can directly see many of the elements that the author put into the original image it's derived from.
The questions are (a) whether ML models are works of authorship to begin with (I say they're not), and (b) whether something that's not a work of authorship can still be a derivative work for purposes of copyright infringement (I'm not sure about that).
So far as I know, neither one is the subject of either explicit legislation or definitive precedent in most of the world, including the US.
astrange 259 days ago [-]
There's a lot of effort involved in the training runs too, and you might be able to get away with saying the ops engineers made creative choices too (of which checkpoints were good etc.)
Since it costs millions to produce one of these models, it's not just taking the software and running it to compile them.
andymcsherry 259 days ago [-]
I get the sentiment, but one of their models, albeit the worst one, is licensed under Apache without usage restrictions. The source to run the models is also open source.
astromaniak 260 days ago [-]
> it is not open source
It would be nice if you gave some examples of what you call an open source model. Please ;) Because the impression is that these things do not exist; it's just a dream which does not deserve such a nice term...
Hizonner 260 days ago [-]
As far as I know, none have been released. And it doesn't even really make sense, because, as I said, the models aren't copyrightable to begin with and therefore aren't licensable either.
However, plenty of open source software exists. The fact that open source models don't exist doesn't excuse attempts to falsely claim the prestige of the phrase "open source".
kube-system 259 days ago [-]
> the models aren't copyrightable to begin with
What criteria for copyright protection are they missing?
astromaniak 259 days ago [-]
> As far as I know, none have been released.
I can tell you a secret. What you call 'open source' models are impossible, because massive randomness is a part of the training process. They are not reproducible. Even having everything in hand, you cannot tell whether a given model was trained on a given dataset. Copyright is a different thing.
And the bad news is that what's coming is even worse: whole entities with self-awareness and personal experience. They can be copied, but not reproduced. Moreover, it's hard or almost impossible to detect whether something undeclared was planted in their 'minds'.
All together, this means an 'open source' model in the strict interpretation is a myth, a great idea which happens not to hold up. Like the Turing test.
> However, plenty of open source software exists.
Attempt to switch topic detected.
PS: as for that massive downvote, I wasn't even rude, but I don't care. This account will be abandoned soon regardless, like all before and after.
jillesvangurp 259 days ago [-]
> models aren't copyrightable to begin with
You are wrong about that. It's a file with numbers. Which makes it a database or dataset and very much protected by copyright. That's why licenses are needed. For the phone book, things like open street maps, and indeed AI models.
> The fact that open source models don't exist
The fact that many people (myself included) routinely download and use models distributed under OSI approved licenses (Apache V2, MIT, etc.) makes that statement verifiably wrong. And yes, I do check the license of stuff that I use as I work with companies that care about such matters.
> As far as I know ...
Now you know better.
JimDabell 259 days ago [-]
> You are wrong about that. It's a file with numbers. Which makes it a database or dataset and very much protected by copyright. That's why licenses are needed. For the phone book, things like open street maps, and indeed AI models.
This is only true in jurisdictions that follow the sweat of the brow doctrine, where effort alone without creativity is considered enough for copyright. In other places, such as the USA, collections of facts are not copyrightable and a minimal amount of creativity is required for something to qualify as copyrightable. The phone book is an example that is often used, actually, to demonstrate the difference.
> Which makes it a database or dataset and very much protected by copyright.
Not every collection of numbers is a database, and a database is not the same thing as a dataset.
Databases have limited copyright-like protection in some places. Under TRIPS, that extends to only databases that are "creative by virtue of the selection or arrangement of their contents" or something along those lines. In the US they talk specifically about curation.
ML models do not meet either requirement by any reasonable interpretation.
> The fact that many people (myself included) routinely download and use models distributed under OSI approved licenses (Apache V2, MIT, etc.) makes that statement verifiably wrong.
The "source code" of an ML model is most reasonably interpreted as including all of the training data, which are never, ever available.
Now you know better.
[On edit: By the way, the people creating these works had better hope they're outside copyright, because if not, each one of them is a derivative work of (at least some large and almost impossible to identify subset of) its training data, so they need licenses from all the copyright holders of that training material, which few of them have or can get.]
kube-system 259 days ago [-]
If we stop unnecessarily anthropomorphizing software, I think it is plainly obvious these are derivative works. You take the training material, run it through a piece of software, and it produces an output based on that input. Just because the black box in the middle is big and fancy doesn't mean that somehow the output isn't a result of the input.
However, transformativeness is a factor in whether or not there is a fair-use exception for the derivative work. And these models are highly transformative, so this is a strong argument for their fair-use.
Hizonner 259 days ago [-]
Maybe, but...
"Fair use" is pretty much entirely a US concept, and similar concepts in other countries aren't isomorphic to it.
The model does have a radically different form from its inputs. So you could easily imagine that being "transformative enough" for US fair use. A lot of the other fair use elements look pretty easy to apply, too. Although there's still the question of whether all the intermediate copies you made to create the model were fair use...
In fact, I'll even concede that a court could find that a model wasn't a derivative work of its inputs to begin with, and not even have to get to the fair use question. The argument would be that the model doesn't actually reproduce any of the creative elements of any particular training input.
I do think a finding like that would be a much bigger stretch than a finding that the model was copyrightable. I could easily see a world where the model was found derivative but was not found copyrightable. And it's actually not clear to me at all that the model has to be copyrightable to infringe the copyright in something else, so that's another mess.
Somewhat related, even if the model itself isn't infringing, it's definitely possible to have most models create outputs that are very similar to (some specific examples in) their training data... in ways that obviously aren't transformative. Outputs that might compete with the original training data and otherwise fail to be fair use. So even if the model is in the clear, users might still have to watch out.
simonw 260 days ago [-]
I'm personally comfortable calling a model "open source" if the license is compatible with the https://opensource.org/ definition.
The Llama models aren't. Some of the Mistral models are (the Apache 2 ones). Microsoft Phi-3 is - it's MIT.
dagaci 259 days ago [-]
Open source must include source material so that another party can reproduce the model. I would expect that to be a minimum.
simonw 259 days ago [-]
I agree, but that can't happen with the vast majority of these models because they're trained on unlicensed data so they can't slap an open source license on the training data and distribute it.
I've decided to draw my personal line at Open Source Initiative compliance for the license they release the model itself under.
I respect the opinion that it's not truly open source unless they release the training data as well, but I've decided not to make that part of my own personal litmus test here.
My reasoning is that knowing something is "open source" helps me decide what I legally can or cannot do with it when building my own software. Not having access to the training data doesn't affect my legal rights; it just affects my ability to recompile it myself. And I don't have millions of dollars of GPUs, so that isn't so important to me, personally.
Hizonner 259 days ago [-]
> that can't happen with the vast majority of these models because they're trained on unlicensed data
Tough beans? There's lots of actual software that can't be open source because it embeds stuff with incompatible restrictions, but nobody tries to redefine "open source" because of that.
... and, on a vaguely similar-flavored note, you'd better hope that the models you're using end up found to be noninfringing or fair use or something with respect to those "unlicensed data", because otherwise you're in a world of hurt. It's actually a lot easier to argue that the models aren't copyrightable than it is to argue that they're not derivative of the input.
> I've decided to draw my personal line at Open Source Initiative compliance for the license they release the model itself under.
You're allowed to draw your personal line about what you'll use anywhere you want, but that doesn't mean that you should try to redefine "open source" or support anybody who does.
tikkun 260 days ago [-]
> We are excited to introduce Flux
I'd suggest re-wording the blog post intro, it reads as if it was created by Fal.
Specific phrases to change:
> Announcing Flux
(from the title)
> We are excited to introduce Flux
> Flux comes in three powerful variations:
This section also comes across as if you created it
> We invite you to try Flux for yourself.
Reads as if you're the creator
burkaygur 260 days ago [-]
Thanks for the feedback! Made some updates.
tikkun 260 days ago [-]
Way better, nice
nextos 260 days ago [-]
The name is a bit unfortunate given that Julia's most popular ML library is called Flux. See: https://fluxml.ai.
There was a looong distracting thread a month ago about something similar, niche language, might have been Julia, had a package with the same name as $NEW_THING.
I hope this one doesn't stir as much discussion. It has 4000 stars; there isn't a large mass of people who view the world through the lens of "Flux is the ML library". No one will end up in a "Who's on First?" discussion because of it. If this line of argument is held sacrosanct, it ends up in an infinite loop until everyone gives up and starts using UUIDs.
jachee 260 days ago [-]
Eagerly waiting for this to happen in the medication names space. :)
djbusby 260 days ago [-]
Like the Go language that existed before Google Go.
I would give them a break; so many things exist in the tech sector that being completely original is basically impossible, unless you name your thing something nonsensical.
Also, search engines are context-aware: if your search history is full of Julia questions, they will know what you're searching for.
msikora 260 days ago [-]
Also Flux is a now obsolete application architecture for ReactJS.
frognumber 260 days ago [-]
It would be nice to understand limits of the free tier. I couldn't find that anywhere. I see pricing, but I'm generating images without swiping my credit card.
If it's unlimited or "throttled for abuse," say that. Right now, I don't know if I can try it six times or experiment to my heart's desire.
vessenes 260 days ago [-]
Congrats Burkay - the model is very impressive. One area I'd like to see improved in a Flux v2 is knowledge of artist styles. Flux cannot respond to requests asking for paintings in the style of David Hockney, Norman Rockwell, or Edgar Degas; it seems to have no fine art training at all.
I’d bet that fine art training would further improve the compositional skills of the model, plus it would open up a range of uses that are (to me at least) a bit more interesting than just illustrations.
astrange 259 days ago [-]
It's "just" another diffusion model, although a very good one. Those people are probably in there even if its text encoder doesn't know about them. So you can find them with textual inversion.
whywhywhywhy 259 days ago [-]
>Flux cannot respond to requests asking for paintings in the style of David Hockney, Norman Rockwell
Does it respond to any names? I noticed SD3 removed all names to prevent recreating famous people but as a side effect lost the very powerful ability to infer styles from artist names too.
warkdarrior 260 days ago [-]
Have those artists given permission for their styles to be slurped up into a model?
vessenes 260 days ago [-]
Florentine art schools would like a word - they've been teaching painters by having them copy the masters since the 16th century.
GaggiX 260 days ago [-]
Give me a sec, I will contact Edgar Degas with my telegraph.
vessenes 260 days ago [-]
Truly an API call I would pay for
dabeeeenster 260 days ago [-]
The unsubscribe links in your emails don't work
shubik22 260 days ago [-]
thanks for hosting the model! I created an account to try it out; you started emailing me with "important notice: low account balance - action required" and now it seems like there's no way for me to unsubscribe or delete my account. Is that the case? Thanks!
RobotToaster 260 days ago [-]
If you are using the dev model, the licence isn't open source.
It is very fast and very good at rendering text, and appears to have a text encoder such that the model can handle both text and positioning much better: https://x.com/minimaxir/status/1819041076872908894
It's not really fair to conclude that the training data contains Vanity Fair images, since the prompt includes "by Vanity Fair".
I could write "with text that says Shutterstock" in the prompt, but that doesn't necessarily mean the dataset contains that.
minimaxir 260 days ago [-]
The logo has the exact same copyrighted typography as the real Vanity Fair logo. I've also reproduced the same copyrighted typography with other brands, with composition identical to copyrighted images. Just asking it for "Vanity Fair cover story about Shrek" at a 3:2 ratio very consistently gives a composition identical to a Vanity Fair cover (the subject is in front of the logo typography, partially obscuring it).
The image linked has a traditional www watermark in the lower-left as well. Even something as innocuous as a "Super Mario 64" prompt shows a copyright watermark: https://x.com/minimaxir/status/1819093418246631855
fennecbutt 255 days ago [-]
What if the training data includes a public blog post which has a screenshot of a Vanity Fair piece?
It's like GRRM complaining that LLMs can reproduce chunks of text from his books "they fed my novels into it" Oh yeah? It's definitely not all the parts of your book quoted in millions of places online, including several dedicated wiki style sites? That wouldn't be it, right?
Carrok 260 days ago [-]
On my list of AI concerns, whether or not Vanity Fair has its copyright infringed does not appear.
minitoar 259 days ago [-]
First they came for fashion magazines, and I said nothing.
jMyles 259 days ago [-]
Just to be clear: you're comparing the collapse of the creative restrictions which the state has cleverly branded "intellectual property" to... the Holocaust?
Of all of the instances on HN of Godwin's law playing out that I've ever seen, this one is the new cake-taker.
minitoar 258 days ago [-]
No, I'm not making a comparison to the Holocaust. Thanks for asking.
amarant 259 days ago [-]
Must we always jump to Nazis?
This is like the fifth time I've seen someone paraphrase Niemöller in an AI context, and it's exhausting. It's also near impossible to take the paraphraser seriously.
More to the point, AI is a tool. I could just as well infringe on Vanity Fair's IP using MS Paint. Someone more artistic than me could make an oil-on-canvas copy of their logo too.
Or, to turn your own annoying "argument" against you:
First they came for AI models, and I did not speak out, because I wasn't using them. Then they came for Photoshop, and I did not speak out, because I had never learned to use it. Then they came for oil and canvas, and now there are no art forms left for me.
minitoar 258 days ago [-]
This isn’t paraphrasing, it’s referencing. The reference has become synonymous with saying “this is a slippery slope for X”.
As to your use of the argument in the other direction, I’d say it doesn’t work very well because no one with any power is coming for those things.
amarant 253 days ago [-]
Nobody at all is "coming for" fashion magazines, but you sure seem to be "coming for" AI. Whether you have any power or not is beside the point.
Whether you are paraphrasing or referencing a famous confessional poem dealing with the Holocaust, the only reasonable interpretation is that you're making a comparison with the Holocaust. Even if you were unaware of the phrase's origins, that's how anyone who does know where it comes from will interpret it. See other comments drawing the same conclusion for reference.
Again: AI is a tool. It can produce illegal material, just like a pencil can, or a brush with oil and canvas. How are they different? They are not.
whywhywhywhy 259 days ago [-]
All journalism is just duplicating the works and performance of others without their permission for profit anyway.
minitoar 258 days ago [-]
True, but I guess some societies decided there was a greater good in that very specific context.
smith7018 260 days ago [-]
Are you suggesting that the model independently came up with Vanity Fair's logo, including font and kerning?
You don't need an A100, you can get a used 32GB V100 for $2K-$3K. It's probably the absolute best bang-for-buck inference GPU at the moment. Not for speed but just the fact that there are models you can actually fit on it that you can't fit on a gaming card, and as long as you can fit the model, it is still lightyears better than CPU inference.
Morphiak 260 days ago [-]
Why this versus two 3090s (with NVLink for marginal gains) and 48GB for $2K?
CuriouslyC 260 days ago [-]
3090 TIs should be able to handle it without much in the way of tricks for a "reasonable" (for the HN crowd) price.
fl0id 260 days ago [-]
Higher-RAM Apple Silicon should be able to run it too, if they don't use some ancient PyTorch version or something.
phkahler 260 days ago [-]
Why not on a CPU with 32 or 64 GB of RAM?
holoduke 260 days ago [-]
Much slower memory and limited parallelism. A GPU has around 8k or more CUDA cores vs around 16 cores on a regular CPU, and there's less memory swapping between operations. The GPU is much, much faster.
CuriouslyC 260 days ago [-]
Performance, mostly. It'll work but image generation is shitty to do slowly compared to text inference.
s-macke 259 days ago [-]
Got it running. But it is a special setup.
* NVIDIA Jetson AGX Orin Dev. Kit with 64 GB shared RAM.
* Default configuration for flux-dev. (FP16, 50 steps)
* 33GB GPU RAM usage.
* 4 minutes 20 seconds per image at around 50 Watt power usage.
Well, I was wondering about bias in the model, so I entered "a president" as the prompt. Looks like it has a bias alright, but it's even more specific than I expected...
teamspirit 260 days ago [-]
You weren’t kidding. Tried three times and all three were variations of the same[0].
What is the difference between schnell and dev? Just the kind of distillation?
weberer 259 days ago [-]
>FLUX.1 [dev]: The base model
>FLUX.1 [schnell]: A distilled version of the base model that operates up to 10 times faster
It should also be noted that "schnell" is the German word for "fast".
Kerbonut 259 days ago [-]
Not quite right; per their GitHub repo:
> Models
> We are offering three models:
> FLUX.1 [pro] the base model, available via API
> FLUX.1 [dev] guidance-distilled variant
> FLUX.1 [schnell] guidance and step-distilled variant
schleck8 260 days ago [-]
Schnell is definitely worse in quality, although still impressive (it gets text right). Dev is the really good one that arguably outperforms the new Midjourney 6.1
ilkke 258 days ago [-]
Also worth mentioning schnell is a 4-step model, so comparable to SD lightning in that regard
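For anyone who wants to poke at the 4-step behaviour locally, a minimal sketch, assuming the Hugging Face diffusers FluxPipeline wrapper and a large CUDA card (prompt borrowed from upthread):

    import torch
    from diffusers import FluxPipeline

    # schnell is step-distilled: a handful of steps, no classifier-free guidance.
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
    ).to("cuda")
    image = pipe(
        "a lamb with seven horns and seven eyes",
        num_inference_steps=4,
        guidance_scale=0.0,
    ).images[0]
    image.save("lamb.png")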
Aardwolf 260 days ago [-]
What's the difference between pro and dev? Is the pro one also 12B parameters? Are the example images on the site (the patagonia guy, lego and the beach potato) generated with dev or pro?
treesciencebot 260 days ago [-]
I think they are mainly -dev and -schnell. Both models are 12B. -pro is the most powerful and raw, -dev is guidance distilled version of it and -schnell is step distilled version (where you can get pretty good results with 2-8 steps).
Aardwolf 257 days ago [-]
What does guidance distilled mean?
Something about pro must be better than dev or it wouldn't be made API-only, but what exactly? How does guidance distilling affect it, and what quality remains in dev?
layer8 260 days ago [-]
Requires sign-in with a GitHub account, unfortunately.
wavemode 260 days ago [-]
I think they may have turned on the gating some time after this was submitted to HackerNews. Earlier this morning I definitely ran the model several times without signing in at all (not via GitHub, not via anything). But now it says "Sign in to run".
treesciencebot 260 days ago [-]
i just updated the links to clarify which models require sign-in and which don't!
smusamashah 260 days ago [-]
Tested it using prompts from ideogram (login-walled), which has great prompt adherence. Flux generated very, very good images. I have been playing with ideogram, but I don't want their filters and want a similarly powerful system running locally.
If this runs locally, it is very, very close to ideogram in terms of both image quality and prompt adherence.
> A captivating and artistic illustration of four distinct creative quarters, each representing a unique aspect of creativity. In the top left, a writer with a quill and inkpot is depicted, showcasing their struggle with the text "THE STRUGGLE IS NOT REAL 1: WRITER". The scene is comically portrayed, highlighting the writer's creative challenges. In the top right, a figure labeled "THE STRUGGLE IS NOT REAL 2: COPY ||PASTER" is accompanied by a humorous comic drawing that satirically demonstrates their approach. In the bottom left, "THE STRUGGLE IS NOT REAL 3: THE RETRIER" features a character retrieving items, complete with an entertaining comic illustration. Lastly, in the bottom right, a remixer, identified as "THE STRUGGLE IS NOT REAL 4: THE REMI
Otherwise, the quality is great. I stopped using Stable Diffusion a long time ago; the tools and tech around it became very messy, and it's not fun anymore. I've been using ideogram for fun, but I want something like ideogram that I can run locally without any filters. This is looking perfect so far.
This is not ideogram, but it's very, very good.
benreesman 260 days ago [-]
Ideogram handles text really well but I don’t want to be on some weird social network.
If this thing can mint memes with captions in it on a single node I guess that’s the weekend gone.
Thanks for the useful review.
smusamashah 260 days ago [-]
Flux is amazing actually. See my other comment where I verified a prompt on their fastest model. Check the linked reddit thread too.
You can run it locally in ComfyUI. I was able to run it with 12GB of vram and reportedly even 8GB is doable, albeit very slow.
seveibar 260 days ago [-]
whenever I see a new model I always see if it can do engineering diagrams (e.g. "two square boxes at a distance of 3.5mm"), still no dice on this one. https://x.com/seveibar/status/1819081632575611279
Would love to see an AI company attack engineering diagrams head on, my current hunch is that they just aren't in the training dataset (I'm very tempted to make a synthetic dataset/benchmark)
roenxi 260 days ago [-]
It'll probably come suddenly. It has been fascinating to me watching the journey from Stable Diffusion 1 to 3. SD1 was a very crude model, where putting a word in the prompt might or might not add representations of the word to the image. Eg, using the word "hat" somewhere in the prompt might do literally nothing or suddenly there were hats everywhere. The context of the word didn't mean much to SD1.
SD2 was more consistent about the word appearing in the image. "hat" would add hats more reliably. Context started to matter a little bit.
SD3 seems to be getting a lot better at the idea of scene composition, so now specific entities can be prompted to wear hats. Not perfect, but noticeably improved from SD2.
Extrapolating from that, we're still a few generations from being able to describe things with the precision of an engineering diagram - but we're heading in the right direction at a rapid clip. I doubt there needs to be any specialist work yet, just time and the improvement of general purpose models.
fennecbutt 255 days ago [-]
At a rapid clip is a great unintentional pun here.
napoleongl 260 days ago [-]
Can't you get this done via an LLM and have it generate code for mermaid or D2 or something? I've been fiddling around with that a bit in order to create flowcharts and datamodels, and I'm pretty sure I've seen at least one of those languages handle absolute positioning of objects.
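And even where the layout language can't, plain code trivially can; a throwaway sketch (hypothetical Python emitting raw SVG, where the viewBox makes 1 user unit = 1mm):

    # Two 10mm square boxes whose facing edges are exactly 3.5mm apart.
    svg = (
        '<svg xmlns="http://www.w3.org/2000/svg" width="40mm" height="20mm" '
        'viewBox="0 0 40 20">'
        '<rect x="5" y="5" width="10" height="10" fill="none" stroke="black"/>'
        '<rect x="18.5" y="5" width="10" height="10" fill="none" stroke="black"/>'
        '</svg>'
    )
    with open("boxes.svg", "w") as f:
        f.write(svg)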
seveibar 260 days ago [-]
It usually isn't accurate. LLMs generally have very little spatial awareness.
zellyn 260 days ago [-]
I have likewise been utterly unable to get it to generate images that look like preliminary rapid pencil sketches. Suggestions by experienced prompters welcome!
phkahler 260 days ago [-]
>> Would love to see an AI company attack engineering diagrams head on, my current hunch is that they just aren't in the training dataset (I'm very tempted to make a synthetic dataset/benchmark)
That seems like a good use for a speech-driven assistant that knows how to use PC desktop software. Just talk to a CAD program and say what you want. This seems like a long way off but could be very useful.
It appears the model does have some "sanity" restrictions, from whatever its training process is, that limit some of the super weird outputs.
"A horse sitting on a dog" doesn't work but "A dog sitting on a horse" works perfectly.
bboygravity 260 days ago [-]
a zebra on top of an elephant worked fine for me
PoignardAzur 260 days ago [-]
Am I missing something? The beach image they give still fails to follow the prompt in major ways.
swatcoder 260 days ago [-]
You're not. I'm surprised at their selections, because neither the cooking one nor the beach one adheres to the prompt very well, and the first one only does because its prompt largely avoids detail altogether. Overall, the announcement gives the sense that it can make pretty pictures but not very precise ones.
astrange 259 days ago [-]
Well, that's nothing new, but it doesn't matter to dedicated users because they don't control it just by typing in text prompts. They use ComfyUI, which is a node editor.
fennecbutt 255 days ago [-]
I'd say automatic1111 is more popular. Comfy seems like a rat's nest, unreal shader node flashbacks.
ilkke 258 days ago [-]
Does this afford better prompt adherence control in some way?
spywaregorilla 257 days ago [-]
Not directly, but it encourages iteration on the same seed and then on specific details, rather than just trying different prompts on different seeds from scratch over and over.
Jackson__ 259 days ago [-]
Sounds to me like it's an issue with their VLM captions creating very "pretty" but not actually useful captions. Like one of the example image prompts includes this absolute garbage:
> Convey compassion and altruism through scene details.
perstablintome 259 days ago [-]
The quality is difficult to judge consistently, as there's variance among seeds with the same prompt. And then there's the problem of cherry-picked examples making the news. So I'm building a community gallery to generate Pro images for free; hope this at least increases the sample size: https://fluxpro.art/
SV_BubbleTime 260 days ago [-]
Wow.
I have seen a lot of promises made by diffusion models.
This is in a whole different world. I legitimately feel bad for the people still at StabilityAI.
The playground testing is really something else!
The licensing model isn’t bad, although I would like to see them promise to open up their old closed source models under Apache when they release new API versions.
The prompt adherence and the breadth of topics it seems to know without a finetune and without any LORAs, is really amazing.
Havoc 260 days ago [-]
Bit annoying signup... GitHub only... and GitHub account creation is currently broken ("Something went wrong"). Took two tries and two browsers...
fernly 260 days ago [-]
I had the same "something went wrong" experience, but on retrying the "sign in to run" button, it was fine and had logged me in.
Gave me a credit of 2 USD to play with.
vunderba 260 days ago [-]
The vast majority of comparisons aren't really putting these new models through their paces.
The best prompt adherence on the market right now BY FAR is DALL-E 3 but it still falls down on more complicated concepts and obviously is hugely censored - though weirdly significantly less censored if you hit their API directly.
I quickly mocked up a few weird/complex prompts and did some side-by-side comparisons with Flux and DALL-E 3. Flux is impressive and significantly performant, particularly since both the dev/schnell models have been confirmed by Black Forest to be runnable via ComfyUI.
Your comparisons are all with the flux schnell model:
> The fastest image generation model tailored for local development and personal use
Versus flux pro or dev models
vunderba 260 days ago [-]
I did put them through pro/dev as well just to be safe. The quality changes and you can play with guidance (cranking it all the way to 10) but it doesn't make a significant difference for these prompts from what I could tell.
Several iterations and these were the best I got out of schnell, dev and pro respectively for the following prompt:
"a fantasy creature with the body of a dragon and a beachball for a head, hybrid, best quality, shadows and lighting, fantasy illustration muted"
How long until NSFW fine-tunes? Don't pretend like it's not on all of y'all's minds, since over half of all the models on Civit.ai are NSFW. That's what folks in the real world actually do with these models.
throwoutway 260 days ago [-]
> Nearby, anthropomorphic fruits play beach volleyball.
This is missing from the image. The generated image looks good, but while reading the prompt I was surprised it was missing.
fl0id 260 days ago [-]
Mmmh, trying my recent test prompts, still pretty shit. E.g., whereas Midjourney or SD have no problem creating a pencil sketch, with this model (pro) it always looks more like a black and white photograph, digital illustration, or render. It is also, like all the others, apparently not able to follow instructions on the position of characters (i.e. X and Y are turned away from each other).
viraptor 260 days ago [-]
Censored a bit, but not completely. I can get occasional boobs out of it, but sometimes it just gives the black output.
refulgentis 260 days ago [-]
This gives you no info on how the model works. What is being applied is fal's post-inference "is this NSFW?" filter model.
So your censorship investigation (via boobs) is testing a completely different, unrelated model.
viraptor 260 days ago [-]
It does provide information. Regardless of whether they use a post-inference filter, we now know that the model itself was trained on and can produce NSFW content. Compare this to SD3 which produces a noise pattern if you request naked bodies.
(Also you can download the model itself to check the local behaviour without extra filters. Unfortunately I don't have time to do it right now, but I'd love to know)
refulgentis 260 days ago [-]
Right, that (the black bars) gives no info on how the model works. Thus, you'd love to "know more". ;)
The rest is groping for a reason to make "the model is censored [a classifier made the POST return a black image instead of boobs]" something sensical.
yjftsjthsd-h 260 days ago [-]
> FLUX.1 [dev]: The base model, open-sourced with a non-commercial license
...then it's not open source. At least the others are Apache 2.0 (real open source) and correctly labeled proprietary, respectively.
Hey, great work over at fal.ai running this on your infrastructure, and for the free $2 in credits to try before buying. For those thinking of running this at home, I'll save you the trouble: Black Forest Flux did not run easily on my Apple Silicon MacBook at this time. (Please let me know if you have gotten this to run on similar hardware.) Specifically, it falls back to using the CPU, which is very slow. Changing the device to 'mps' causes the error "BFloat16 is not supported on MPS".
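For anyone else on Apple Silicon, the obvious thing to try is loading in float16 instead; an untested sketch, assuming the Hugging Face diffusers FluxPipeline wrapper:

    import torch
    from diffusers import FluxPipeline

    # MPS has no bfloat16 kernels here, so load the weights as float16 instead.
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.float16
    ).to("mps")
    image = pipe("a watercolor fox", num_inference_steps=4,
                 guidance_scale=0.0).images[0]
    image.save("fox.png")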
zarmin 260 days ago [-]
WILD
Photo of teen girl in a ski mask making an origami swan in a barn. There is caption on the bottom of the image: "EAT DRUGS" in yellow font. In the background there is a framed photo of obama
Anyone know why text-to-image models have so many fewer parameters than text models? Are there any large image models (>70b, 400b, etc)?
Sohcahtoa82 259 days ago [-]
The way someone explained it to me is that text-to-image models are essentially just de-noisers.
They train them by taking an image with a label, e.g. "cat", then adding some noise to it, running a training step, adding more noise, running another step, and so on until the image is total (or near-total) noise and the model is still being told it's a cat.
Then, when you want to generate "cat", you start with noise, and it finds a cat in the noise and cancels some of the noise repeatedly. If you're able to watch an image get generated, sometimes you'll even see two cats on top of each other, but one ends up fading away.
Turns out, these denoisers don't require that many parameters, and if your resulting image has a few pixels that are just a tiny bit off color, you won't even notice.
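To make that concrete, here's a self-contained toy of the same train-by-noising, sample-by-denoising loop (plain PyTorch on 2-D points instead of images; all names and hyperparameters are mine and untuned):

    import torch
    import torch.nn as nn

    # Toy DDPM: the "images" are 2-D points on a unit circle.
    T = 100
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    model = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                          nn.Linear(64, 64), nn.ReLU(),
                          nn.Linear(64, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Training: add noise to clean data, teach the net to predict that noise.
    for step in range(3000):
        theta = torch.rand(256, 1) * 6.2832
        x0 = torch.cat([theta.cos(), theta.sin()], dim=1)   # clean data
        t = torch.randint(0, T, (256,))
        eps = torch.randn_like(x0)
        ab = alpha_bar[t].unsqueeze(1)
        xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps         # noised data
        pred = model(torch.cat([xt, t.unsqueeze(1) / T], dim=1))
        loss = ((pred - eps) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

    # Sampling: start from pure noise, repeatedly remove the predicted noise.
    x = torch.randn(8, 2)
    with torch.no_grad():
        for ti in reversed(range(T)):
            tcol = torch.full((8, 1), ti / T)
            eps = model(torch.cat([x, tcol], dim=1))
            x = (x - betas[ti] / (1 - alpha_bar[ti]).sqrt() * eps) / alphas[ti].sqrt()
            if ti > 0:
                x = x + betas[ti].sqrt() * torch.randn_like(x)
    print(x.norm(dim=1))   # should land near 1.0: "finding the circle in the noise"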
minimaxir 260 days ago [-]
Diffusion is very efficient encoding/decoding.
The only reason that diffusion isn't used for text is because text requires discrete outputs.
If a pixel is just slightly the wrong shade of green, nobody really cares.
dinobones 260 days ago [-]
I wonder if the key behind the quality of the MidJourney models, and this models, is less about size + architecture and more about the quality of images trained on.
It looks like this is the case for LLMs, that the training quality of the data has a significant impact on the output quality of the model, which makes sense.
So the real magic is in designing a system to curate that high quality data.
CuriouslyC 260 days ago [-]
Midjourney unquestionably has heavy data set curation and uses RLHF from users.
You don't have to speculate on this as you can see that custom models for SDXL for instance perform vastly better than vanilla SDXL at the same number of parameters. It's all data set and tagging.
spywaregorilla 260 days ago [-]
custom models perform vastly better at the tasks they are finetuned to do
CuriouslyC 260 days ago [-]
That is technically true, but when the base model is wasting parameter information on poorly tagged, watermarked stock art and other garbage images, it's not really a meaningful distinction. Better data makes for better models, nobody cares about how well a model outputs trash.
spywaregorilla 260 days ago [-]
Ok, but you're severely misrepresenting the importance of things. Base SDXL is a fine model. Base SDXL is going to be much better than a materially smaller model that you've retrained with "good data".
cma 259 days ago [-]
SDXL used RLHF too
42lux 260 days ago [-]
It's the quality of the image-text pair, not the image alone. But Midjourney is not a model; it's a suite of models that work in conjunction. They have an LLM in front to optimize the user prompts, they use SAM models, ControlNet models for poses that are in high demand, and so much more. That's why you can't really compare foundation models anymore, because there are none.
jncfhnb 260 days ago [-]
No, it's definitely the size. Tiny LLMs are shit. Stable Diffusion 3's problem is not that its training set was wildly different; it's that it's just too small (because the one released so far is not the full size).
You can get better results with better data, for sure. And better architecture, for sure. But raw size is really important; the difference in quality for models, all else held equal, is HUGE and obvious if you play with them.
pzo 260 days ago [-]
I would agree - Midjourney is getting free labour, since many of their generations are not in secret mode (which requires a pro/mega subscription), so prompts and outputs are visible to everyone. Midjourney rewards users for rating those generations. I wouldn't be surprised if there are some bots on their Discord that are scraping that data for training their own models.
ilkke 258 days ago [-]
Are the prompts of pro users secret to Midjourney?
astrange 259 days ago [-]
Flux-schnell is still incapable of generating "horse riding an astronaut" or "upside-down mini cooper".
sztanko 259 days ago [-]
I don't think any of them can.
virtualritz 259 days ago [-]
I enter an elaborate prompt, press "Sign in to Run", sign in with my GH, get taken back to the previous page and my prompt text has reset to some default with no way to get back what I entered before.
Complete and utter UX/first-impression fail. I had no desire to actually try the model after this.
fngjdflmdflg 260 days ago [-]
Is the architecture outlined anywhere? Any publications, or word on whether they will publish something in the future? To be fair to them, they seem to have launched this company today, so I doubt they have a lot of time right now. Or maybe I just missed it?
I don't have anything to compare it to, as I'm not that familiar with other diffusion models in the first place. I was kind of hoping to read the key changes they made to the diffusion architecture and how they collected and curated their dataset. I'd assume they are also using LAION, but I wonder if they are doing anything to filter out low quality images (separate from what LAION-Aesthetics already does). Or maybe they have their own dataset.
Oras 260 days ago [-]
This is great and unbelievably fast! I noticed a small note saying how much this would cost and how many images you can create for $1.
I assume you're offering this as an API? Would be nice to have a pricing page, as I didn't see one on your website.
smusamashah 260 days ago [-]
Holy crap, this is amazing. I saw an image with a prompt on reddit and didn't believe it was a generated image. I thought it must be a joke, that people were sharing non-generated images in the thread.
> Photo of Criminal in a ski mask making a phone call in front of a store. There is caption on the bottom of the image: "It's time to Counter the Strike...". There is a red arrow pointing towards the caption. The red arrow is from a Red circle which has an image of Halo Master Chief in it.
Some of the images I generated using schnell model with 8-10 steps using this prompt. https://imgur.com/a/3mM9tKf
j1mmie 260 days ago [-]
I'm really impressed at its ability to output pixel art sprites. Maybe the best general-purpose model I've seen capable of that. In many cases it's better than purpose-built models.
ilkke 258 days ago [-]
Any examples? I wasn't able to get any good results
mlboss 260 days ago [-]
These venture funded startups keep releasing models for free without a business model in sight. I am all for open source but worry it is not sustainable long term.
qball 259 days ago [-]
At this point the only thing an AI startup has to do to get people to spend money on the model is to:
-not censor it
-not be doing prompt injection
It's very easy, which is why no other firm is capable of it.
wmf 260 days ago [-]
The free models are for practice and advertising. Once they get good they start charging. We've already seen this with Mistral and Stability.
djbusby 260 days ago [-]
It's not. These VC firms are currently just blasting money+AI at lots of things, exploring to find what sticks. Expensive discovery.
NeckBeardPrince 259 days ago [-]
I think we have enough stuff out there called flux.
rty32 259 days ago [-]
Just came to say I didn't see the "for homeless" part in that LEGO example. The prompt is a bit funny, almost ridiculous.
julienlafond 259 days ago [-]
Same with the potato example, no "Nearby, anthropomorphic fruits play beach volleyball."
kennethwolters 260 days ago [-]
It is very good at "non-human subjects in photos with shallow focus".
Really curious to see what other low-hanging fruit people are finding.
CuriouslyC 260 days ago [-]
Check out reddit.com/r/stablediffusion, it's been handling everything people have thrown at it so far.
robotnikman 260 days ago [-]
This is amazing! I thought it would be a few more years before we would have such a high quality model we could run locally.
SirMaster 260 days ago [-]
I tried: "Moe from The Simpsons, waving" several times. But it only ever drew Lisa from The Simpsons waving.
Noam45 259 days ago [-]
I recently heard of FLUX and began reading about this. It's a remarkable technology.
seu 259 days ago [-]
So I'm forced to signup and give my email for a supposed trial, only to be immediately told by email that I have a "Low Account Balance - Action Required"? Seriously?
xmly 260 days ago [-]
Nice one. Are there plans to support both text-to-image and image-to-image?
ilkke 258 days ago [-]
Someone on Reddit did promptless img2img in Comfy by passing an image into VAE decode and then through the schnell model as a kind of refiner, with great results
asadm 260 days ago [-]
This is actually really good! I fear it's much better than SD3 even!
vishalk_3 259 days ago [-]
Great product. BTW, I am new to this technology. Can you please tell me what parameter is given to the model to make the output look like a real-life image?
ilkke 258 days ago [-]
Try something like "Photo of...", "as photography" or "photorealistic". You can even specify the camera model and lens/exposure settings. You can find these in the metadata of your phone photos, for example.
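For instance, a quick way to pull those fields out of one of your own photos with Pillow (a sketch; "photo.jpg" is a placeholder, and it assumes Pillow >= 9.4 for ExifTags.IFD):

    from PIL import Image, ExifTags

    exif = Image.open("photo.jpg").getexif()
    tags = dict(exif)                              # base IFD: camera Model, etc.
    tags.update(exif.get_ifd(ExifTags.IFD.Exif))   # exposure data lives in the Exif sub-IFD
    for tag_id, value in tags.items():
        name = ExifTags.TAGS.get(tag_id, tag_id)
        if name in ("Model", "LensModel", "FNumber", "ExposureTime", "FocalLength"):
            print(name, value)

Dropping values like these straight into the prompt ("shot on <Model> at f/1.8, 50mm") tends to nudge the output toward a photographic look.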
vishalk_3 245 days ago [-]
Thanks for the reply, buddy. But I am still not able to understand how camera model and lens/exposure settings can be used to make the photo look real.
Let's say that you took an image of a flower in a garden and the AI has also generated an image of the same flower. When we see these pics side by side, we find a lot of differences between them. The origin of my question was: how can we minimize this difference? How can we tell the machine that the greater the magnitude of a certain parameter, the more real the image is? I'm not sure camera settings could help in this case.
260 days ago [-]
jncfhnb 260 days ago [-]
Looks like a very promising model. Hope to see the comfyui community get it going quickly
lovethevoid 260 days ago [-]
It's already available for ComfyUI
jncfhnb 260 days ago [-]
It’ll need time for the goodies beyond the base model though I would guess
lovethevoid 260 days ago [-]
Works great as is right now, I can see some workflows being affected or having to wait for an update, but even those can do with some temporary workarounds (like having to load another model for later inpainting steps).
So if you're wanting to experiment and have a 24GB card, have at it!
jncfhnb 260 days ago [-]
Yeah I mean like ControlNet / IPAdapter / AnimateDiff / inpainting stuff
I don’t feel like base models are super useful. Most real use cases depend on being able to iterate on consistent outputs imo.
I have had a very bad experience trying to use other models to modify images but I mostly do anime shit and maybe styles are less consistently embedded into language for those models
EternalFury 260 days ago [-]
Impressive quality
mikejulietbravo 260 days ago [-]
What's the tl;dr on the difference between this and SD?
minimaxir 260 days ago [-]
tl;dr: better quality even with the least powerful model, and it can be much faster
"Photo of Criminal in a ski mask making a phone call in front of a store. There is caption on the bottom of the image: "It's time to Counter the Strike...". There is a red arrow pointing towards the caption. The red arrow is from a Red circle which has an image of Halo Master Chief in it."
I'm really looking forward to exploring its capabilities and seeing how it compares to other models.
KennyBlanken 260 days ago [-]
[flagged]
wavemode 260 days ago [-]
Hmm, I was able to try the model myself (generated a handful of images) without logging in.
SV_BubbleTime 260 days ago [-]
It’s a fal login, which is linked/federated with GitHub if you have done that before.
It takes zero time to “make an account”.
wavemode 260 days ago [-]
I think they may have turned on the gating some time after this was submitted to HackerNews. Earlier this morning I definitely ran the model several times without signing in at all (not via GitHub, not via anything). But now it says "Sign in to run".
260 days ago [-]
I was hoping it'd be more like https://play.go.dev.
Good luck.
The first attempt at this based on https://reductress.com/post/my-boyfriends-are-always-two-kid... ... really misunderstood the image. This may also be part of the problem.
I then went to the image from https://www.reddit.com/r/DnD/comments/c6fdw4/oc_introducing_... And that provided:
Working off of that idea of the totem formation ... "Create an image featuring three children in a totem pole formation that are trying to conceal their nature in a single oversized trench coat." That produced https://imgur.com/a/Of9FsJl
I suspect the orange beard came from the previous part in the session. But that might be an approach to take in trying to describe it in a way that can be used.
It’s more like a giant telescope of many lenses (the latents from the prompts) and you’re adjusting the lenses to bring a possible reality of many into focus.
But AIs learn and therefore create in exactly the same way as humans, ostensibly on the same data. How can this be possible? /s
For what it's worth, I've previously asked in the Stable Diffusion Discord server for help generating a "lamb with seven horns and seven eyes" but the members there were also unsuccessful.
Recently Claude began to allow generation of SVG drawings, and asking it to draw a unicorn and later add extra tails or horns worked correctly.
A fork exists in physical space and it's pretty intuitive to understand what it can do. These models exist within digital space and are incredibly opaque by comparison.
That sounds interesting! Were the results somewhat clean and clear SVG or rather a mess that just looked decent?
[0] https://imgur.com/Cc5uJNg
"a woman lying on her back wearing a blouse and shorts."
But it wouldn't render the image - I instead got a NSFW warning. That's one way to hide the fact that it cannot render it properly, I guess...
PS: after a few tries it rendered "a woman lying on her back" correctly.
Remarkably better than the "DrawThings" iPhone app (my only reference point).
[0] https://github.com/bbkane/envelope/issues/44
Also, everybody should remember that these models are not copyrightable and you should never agree to any license for them...
Thanks for pointing that out @Hizonener
Never underestimate the value of getting hordes of unpaid workers to refine your product. (See also React, others)
I'd prefer "false advertising" - it's more direct and without the culture war baggage.
That said, I don't think outputs of the model are derivative works of it, any more than the model is a derivative of its training data, so it's not clear to me they can actually enforce what you do with them.
Are you talking about https://en.wikipedia.org/wiki/Database_right or plain old copyright?
I'm no IP lawyer, but I've always thought that copyright put "requirements" on the artefact (i.e. the threshold of originality), not the process.
In my jurisdiction we have database rights, meaning that you get IP protections for the artefact based on the work put into the process. For example, a database of distances between address pairs or something is probably not copyrightable, but can be protected under database rights if enough work was done to compile the data.
EDIT: Saw in another place in the thread mention of the https://en.wikipedia.org/wiki/Sweat_of_the_brow doctrine, which relates to database rights. (Notably, neither is applicable in the U.S.)
The only thing that's really specified about the model itself is its architecture, which is (1) dictated by function, and (2) usually deeply stereotyped.
Fair enough, but those datasets are also primarily copyrighted material. If the software here merely transforms the input material (which I agree it does), then the output is a derivative work.
If I take a string of data from a true hardware RNG, XOR it with a Taylor Swift song, and throw away the original random stream, is the resulting fundamentally random bit string still a derivative work of the song? As with the ML model, you can't recognize the song in it. And as with at least some training examples in the inputs of most ML models, you can't recover the song from it either.
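Concretely, that construction is just a one-time pad with the key thrown away; a tiny sketch (the file name is hypothetical):

    import os

    song = open("song.mp3", "rb").read()   # hypothetical copyrighted input
    pad = os.urandom(len(song))            # random stream of the same length
    cipher = bytes(s ^ p for s, p in zip(song, pad))
    # discard `pad`: without it, `cipher` is statistically indistinguishable
    # from uniform random bytes, yet it was "derived" from the song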
It feels like the test for whether X is derivative for copyright purposes should include some kind of attention to whether X is a creative work at all. Maybe not, but then what test do you use?
I do recognize the possibility that the models might not themselves be eligible for copyright as independent works, yet still infringe copyright in the training inputs. It seems messy, but not impossible.
... and as I said elsewhere, it's also messy that while you generally can't recover every training input from the model, you can usually recover something very close to some of the training inputs.
It's not a copy of it, and when you distribute it you're not distributing the original. So it's not a derivative for copyright purposes.
It can still be a derivative for other legal purposes. Judges don't appreciate it when you do funny math tricks like that and will see through them.
> It feels like the test for whether X is derivative for copyright purposes should include some kind of attention to whether X is a creative work at all. Maybe not, but then what test do you use?
Yes, that's how US copyright law works. (well sort of…)
Being a transformative work of something makes it less of a copy of it, the more transformed it is, since it falls under fair use exemptions or is clearly a different category of thing.
If a model was a derivative of its training data, then Google snippets/thumbnails would be derivatives of its search results and would be illegal too. Unless you wrote a new law to specifically allow them.
In other countries (Germany, Japan) fair use is weaker, but model training has laws specifically making it legal in certain circumstances, and presumably so do Google snippets.
A compressed (or normally encrypted) version wouldn't be a copy that way, either, but I would still absolutely go down for distributing it. The difference is that the compression can be reversed to recover the original. Even lossy compression would create such a close derivative that nobody would probably even bother to make the distinction.
You're right that "math games" don't work in the law, but that cuts both ways. If you do something that truly makes the original unrecoverable and in fact undetectable, and if nothing salient to the legal issues at hand about the new version derives from the original, then judges are going to "see through" the "math trick" of pretending that it is a derivative.
> then Google snippets/thumbnails would be derivatives of its search results
Thumbnails are legally derivative works, in the US and probably most other places. In the US, they're protected by the fair use defense, and in other places they're protected by whatever carveouts those places have. But that doesn't mean they're not derivative works.
In fact, if I remember the US "taxonomy" correctly, thumbnails are infringing. It's just that certain kinds of infringement are accepted because they're fair use.
If thumbnails weren't derivative works at all, then the question of fair use wouldn't arise, because there can be no infringement to begin with if the putatively infringing work isn't either derivative or a direct copy.
Where thumbnails are different from ML models is that they're clearly works of authorship. In a thumbnail, you can directly see many of the elements that the author put into the original image it's derived from.
The questions are (a) whether ML models are works of authorship to begin with (I say they're not), and (b) whether something that's not a work of authorship can still be a derivative work for purposes of copyright infringment (I'm not sure about that).
So far as I know, neither one is the subject of either explicit legislation or definitive precedent in most of the world, including the US.
Since it costs millions to produce one of these models, it's not just taking the software and running it to compile them.
It would be nice if you gave some examples of what you call an open source model. Please ;) Because the impression is that these things do not exist; it's just a dream which does not deserve such a nice term.
However, plenty of open source software exists. The fact that open source models don't exist doesn't excuse attempts to falsely claim the prestige of the phrase "open source".
What criteria for copyright protection are they missing?
I can tell you a secret. What you call 'open source' models are impossible, because massive randomness is part of the training process. They are not reproducible. Even having everything, you cannot tell whether a given model was trained on a given dataset. Copyright is a different thing.
And the bad news: what's coming is even worse. Those will be whole things with self-awareness and personal experience. They can be copied, but not reproduced. Moreover, it's hard or almost impossible to detect if something undeclared was planted in their 'minds'.
All together this means an 'open source' model in the strict interpretation is a myth, a great idea which happens not to be one. Like the Turing test.
> However, plenty of open source software exists.
Attempt to switch topic detected.
PS: as for that massive downvote, I wasn't even rude; don't care. This account will be abandoned soon regardless, like all before and after.
You are wrong about that. It's a file with numbers. Which makes it a database or dataset and very much protected by copyright. That's why licenses are needed. For the phone book, things like open street maps, and indeed AI models.
> The fact that open source models don't exist
The fact that many people (myself included) routinely download and use models distributed under OSI approved licenses (Apache V2, MIT, etc.) makes that statement verifiably wrong. And yes, I do check the license of stuff that I use as I work with companies that care about such matters.
> As far as I know ...
Now you know better.
This is only true in jurisdictions that follow the sweat of the brow doctrine, where effort alone without creativity is considered enough for copyright. In other places, such as the USA, collections of facts are not copyrightable and a minimal amount of creativity is required for something to qualify as copyrightable. The phone book is an example that is often used, actually, to demonstrate the difference.
https://en.wikipedia.org/wiki/Sweat_of_the_brow
Not every collection of numbers is a database, and a database is not the same thing as a dataset.
Databases have limited copyright-like protection in some places. Under TRIPS, that extends to only databases that are "creative by virtue of the selection or arrangement of their contents" or something along those lines. In the US they talk specifically about curation.
ML models do not meet either requirement by any reasonable interpretation.
> The fact that many people (myself included) routinely download and use models distributed under OSI approved licenses (Apache V2, MIT, etc.) makes that statement verifiably wrong.
The "source code" of an ML model is most reasonably interpreted as including all of the training data, which are never, ever available.
Now you know better.
[On edit: By the way, the people creating these works had better hope they're outside copyright, because if not, each one of them is a derivative work of (at least some large and almost impossible to identify subset of) its training data, so they need licenses from all the copyright holders of that training material, which few of them have or can get.]
However, transformativeness is a factor in whether or not there is a fair-use exception for the derivative work. And these models are highly transformative, so this is a strong argument for their fair-use.
"Fair use" is pretty much entirely a US concept, and similar concepts in other countries aren't isomorphic to it.
The model does have a radically different form from its inputs. So you could easily imagine that being "transformative enough" for US fair use. A lot of the other fair use elements look pretty easy to apply, too. Although there's still the question of whether all the intermediate copies you made to create the model were fair use...
In fact, I'll even concede that a court could find that a model wasn't a derivative work of its inputs to begin with, and not even have to get to the fair use question. The argument would be that the model doesn't actually reproduce any of the creative elements of any particular training input.
I do think a finding like that would be a much bigger stretch than a finding that the model was copyrightable. I could easily see a world where the model was found derivative but was not found copyrightable. And it's actually not clear to me at all that the model has to be copyrightable to infringe the copyright in something else, so that's another mess.
Somewhat related, even if the model itself isn't infringing, it's definitely possible to have most models create outputs that are very similar to (some specific examples in) their training data... in ways that obviously aren't transformative. Outputs that might compete with the original training data and otherwise fail to be fair use. So even if the model is in the clear, users might still have to watch out.
The Llama models aren't. Some of the Mistral models are (the Apache 2 ones). Microsoft Phi-3 is - it's MIT.
I've decided to draw my personal line at Open Source Initiative compliance for the license they release the model itself under.
I respect the opinion that it's not truly open source unless they release the training data as well, but I've decided not to make that part of my own personal litmus test here.
My reasoning is that knowing something is "open source" helps me decide what I legally can or cannot do with it when building my own software. Not having access to the training data doesn't affect my legal rights, it just affects my ability to recompile it myself. And I don't have millions of dollars of GPUs, so that isn't so important to me, personally.
Tough beans? There's lots of actual software that can't be open source because it embeds stuff with incompatible restrictions, but nobody tries to redefine "open source" because of that.
... and, on a vaguely similar-flavored note, you'd better hope that the models you're using end up found to be noninfringing or fair use or something with respect to those "unlicensed data", because otherwise you're in a world of hurt. It's actually a lot easier to argue that the models aren't copyrightable than it is to argue that they're not derivative of the input.
> I've decided to draw my personal line at Open Source Initiative compliance for the license they release the model itself under.
You're allowed to draw your personal line about what you'll use anywhere you want, but that doesn't mean that you should try to redefine "open source" or support anybody who does.
I'd suggest re-wording the blog post intro, it reads as if it was created by Fal.
Specific phrases to change:
> Announcing Flux
(from the title)
> We are excited to introduce Flux
> Flux comes in three powerful variations:
This section also comes across as if you created it
> We invite you to try Flux for yourself.
Reads as if you're the creator
This library is quite well known, 3rd most starred project in Julia: https://juliapackages.com/packages?sort=stars.
It has been around since at least 2016: https://github.com/FluxML/Flux.jl/graphs/code-frequency.
I hope this one doesn't stir as much discussion. It has 4000 stars; there isn't a large mass of people who view the world through the lens of "Flux is an ML library". No one will end up in a "who's on first?" discussion because of it. If this line of argument is held sacrosanct, it ends up in an infinite loop until everyone gives up and starts using UUIDs.
https://en.wikipedia.org/wiki/Go!_(programming_language)
Disclosure: I work at Google but not on the Go team.
Flux A is the ML library
Flux B is the T2I model
Flux C is the React library
Flux D is the physics concept of power per unit area
Flux E is the goo you put on solder
Also, search engines are context-aware: if your search history is full of Julia questions, they will know what you're searching for.
If it's unlimited or "throttled for abuse," say that. Right now, I don't know if I can try it six times or experiment to my heart's desire.
I’d bet that fine art training would further improve the compositional skills of the model, plus it would open up a range of uses that are (to me at least) a bit more interesting than just illustrations.
Does it respond to any names? I noticed SD3 removed all names to prevent recreating famous people but as a side effect lost the very powerful ability to infer styles from artist names too.
It is very fast and very good at rendering text, and appears to have a text encoder such that the model can handle both text and positioning much better: https://x.com/minimaxir/status/1819041076872908894
A fun consequence of better text rendering is that it means text watermarks from its training data appear more clearly: https://x.com/minimaxir/status/1819045012166127921
I could write “with text that says Shutterstock” in the prompt but that doesn’t necessarily mean the dataset contains that
The image linked has a traditional www watermark in the lower-left as well. Even something as innocuous as a "Super Mario 64" prompt shows a copyright watermark: https://x.com/minimaxir/status/1819093418246631855
It's like GRRM complaining that LLMs can reproduce chunks of text from his books "they fed my novels into it" Oh yeah? It's definitely not all the parts of your book quoted in millions of places online, including several dedicated wiki style sites? That wouldn't be it, right?
Of all of the instances on HN of Godwin's law playing out that I've ever seen, this one is the new cake-taker.
This is like the fifth time I've seen someone paraphrase Niemöller in an AI context, and it's exhausting. It's also near impossible to take the paraphraser seriously.
More to the point, AI is a tool. I could just as well infringe on Vanity Fair's IP using MS Paint. Someone more artistic than me could make an oil-on-canvas copy of their logo too.
Or, to turn your own annoying "argument" against you:
First they came for AI models, and I did not speak out, because I wasn't using them. Then they came for Photoshop, and I did not speak out, because I had never learned to use it. Then they came for oil and canvas, and now there are no art forms left for me.
As to your use of the argument in the other direction, I’d say it doesn’t work very well because no one with any power is coming for those things.
Whether you are paraphrasing or referring to a famous confessional poem dealing with the Holocaust, the only reasonable interpretation is that you're comparing this with the Holocaust. Even if you were unaware of the phrase's origins, that's how anyone who does know where it comes from will interpret it. See other comments drawing the same conclusion for reference.
Again: AI is a tool. It can produce illegal material, just like a pencil can, or a brush with oil and canvas. How are they different? They are not.
https://www.vanityfair.com/verso/static/vanity-fair/assets/l...
There is a PR to that repo for a diffusers implementation, which may run on a cheap L4 GPU w/ enable_model_cpu_offload(): https://huggingface.co/black-forest-labs/FLUX.1-schnell/comm...
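If that PR lands, usage would presumably look like any other diffusers pipeline; a rough sketch under that assumption (class and argument names are not guaranteed until the PR is merged):

    import torch
    from diffusers import FluxPipeline  # assumes the linked diffusers PR is merged

    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
    )
    pipe.enable_model_cpu_offload()  # parks idle submodules in CPU RAM to fit smaller GPUs

    image = pipe(
        "a cat holding a sign that says hello world",
        num_inference_steps=4,  # schnell is step-distilled, so a handful of steps suffices
        guidance_scale=0.0,     # schnell is also guidance-distilled; no CFG needed
    ).images[0]
    image.save("flux-schnell.png")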
* NVIDIA Jetson AGX Orin Dev. Kit with 64 GB shared RAM.
* Default configuration for flux-dev. (FP16, 50 steps)
* 33GB GPU RAM usage.
* 4 minutes 20 seconds per image at around 50 Watt power usage.
(available without sign-in) FLUX.1 [schnell] (Apache 2.0, open weights, step distilled): https://fal.ai/models/fal-ai/flux/schnell
(requires sign-in) FLUX.1 [dev] (non-commercial, open weights, guidance distilled): https://fal.ai/models/fal-ai/flux/dev
FLUX.1 [pro] (closed source [only available thru APIs], SOTA, raw): https://fal.ai/models/fal-ai/flux-pro
Well, I was wondering about bias in the model, so I entered "a president" as the prompt. Looks like it has a bias alright, but it's even more specific than I expected...
[0] https://fal.media/files/elephant/gu3ZQ46_53BUV6lptexEh.png
https://imgur.com/a/fgf6Jt3
>FLUX.1 [schnell]: A distilled version of the base model that operates up to 10 times faster
It should also be noted that "schnell" is the German word for "fast".
> Models
> We are offering three models:
> FLUX.1 [pro] the base model, available via API
> FLUX.1 [dev] guidance-distilled variant
> FLUX.1 [schnell] guidance and step-distilled variant
Something about pro must be better than dev or it wouldn't be made API-only, but what exactly? How does guidance distillation affect it, and what quality remains in dev?
If this runs locally, this is very very close to that in terms of both image quality and prompt adherence.
It did fail at writing text clearly when the text was a bit complicated. This ideogram image's prompt for example: https://ideogram.ai/g/GUw6Vo-tQ8eRWp9x2HONdA/0
> A captivating and artistic illustration of four distinct creative quarters, each representing a unique aspect of creativity. In the top left, a writer with a quill and inkpot is depicted, showcasing their struggle with the text "THE STRUGGLE IS NOT REAL 1: WRITER". The scene is comically portrayed, highlighting the writer's creative challenges. In the top right, a figure labeled "THE STRUGGLE IS NOT REAL 2: COPY ||PASTER" is accompanied by a humorous comic drawing that satirically demonstrates their approach. In the bottom left, "THE STRUGGLE IS NOT REAL 3: THE RETRIER" features a character retrieving items, complete with an entertaining comic illustration. Lastly, in the bottom right, a remixer, identified as "THE STRUGGLE IS NOT REAL 4: THE REMI
Otherwise, the quality is great. I stopped using Stable Diffusion a long time ago; the tools and tech around it became very messy, and it's not fun anymore. I've been using Ideogram for fun, but I want something like Ideogram that I can run locally without any filters. This is looking perfect so far.
This is not Ideogram, but it's very, very good.
If this thing can mint memes with captions in it on a single node I guess that’s the weekend gone.
Thanks for the useful review.
https://news.ycombinator.com/item?id=41132515
See: https://www.reddit.com/r/StableSwarmUI/comments/1ei86ar/flux... (SwarmUI is cross platform and runs on macs, and linux)
Would love to see an AI company attack engineering diagrams head on, my current hunch is that they just aren't in the training dataset (I'm very tempted to make a synthetic dataset/benchmark)
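A synthetic set like that seems mechanically easy to bootstrap; here's a toy sketch of the idea with PIL (all names and dimensions made up):

    import random
    from PIL import Image, ImageDraw

    def make_sample(idx, size=512):
        # Draw two squares with a known side length and gap, and emit a caption
        # describing exactly those dimensions -- a perfectly labeled pair.
        img = Image.new("RGB", (size, size), "white")
        d = ImageDraw.Draw(img)
        side = random.randint(40, 120)
        gap = random.randint(10, 80)
        x, y = 60, (size - side) // 2
        d.rectangle([x, y, x + side, y + side], outline="black", width=2)
        d.rectangle([x + side + gap, y, x + 2 * side + gap, y + side],
                    outline="black", width=2)
        img.save(f"diagram_{idx:05d}.png")
        return f"two {side}px square boxes separated by a {gap}px gap"

    captions = [make_sample(i) for i in range(10000)]

Whether current architectures can actually learn metric precision from pairs like these is exactly the open question.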
SD2 was more consistent about the word appearing in the image. "hat" would add hats more reliably. Context started to matter a little bit.
SD3 seems to be getting a lot better at the idea of scene composition, so now specific entities can be prompted to wear hats. Not perfect, but noticeably improved from SD2.
Extrapolating from that, we're still a few generations from being able to describe things with the precision of an engineering diagram - but we're heading in the right direction at a rapid clip. I doubt there needs to be any specialist work yet, just time and the improvement of general purpose models.
That seems like a good use for a speech driven assistant that know how to use PC desktop software. Just talk to a CAD program and say what you want. This seems like a long way off but could be very useful.
Prompt: two square boxes at a distance of 3.5mm. Both boxes have the same size, 10cm.
"An upside down house" -> regular old house
"A horse sitting on a dog" -> horse and dog next to eachother
"An inverted Lockheed Martin F-22 Raptor" -> yikes https://fal.media/files/koala/zgPYG6SqhD4Y3y_E9MONu.png
"A horse sitting on a dog" doesn't work but "A dog sitting on a horse" works perfectly.
> Convey compassion and altruism through scene details.
I have seen a lot of promises made by diffusion models.
This is in a whole different world. I legitimately feel bad for the people still at StabilityAI.
The playground testing is really something else!
The licensing model isn’t bad, although I would like to see them promise to open up their old closed source models under Apache when they release new API versions.
The prompt adherence and the breadth of topics it seems to know without a finetune and without any LORAs, is really amazing.
Gave me a credit of 2 USD to play with.
The best prompt adherence on the market right now BY FAR is DALL-E 3 but it still falls down on more complicated concepts and obviously is hugely censored - though weirdly significantly less censored if you hit their API directly.
I quickly mocked up a few weird/complex prompts and did some side-by-side comparisons with Flux and DALL-E 3. Flux is impressive and notably fast, particularly since both the dev/schnell models have been confirmed by Black Forest to be runnable via ComfyUI.
https://mordenstar.com/blog/flux-comparisons
> The fastest image generation model tailored for local development and personal use
Versus flux pro or dev models
Several iterations and these were the best I got out of schnell, dev and pro respectively for the following prompt:
"a fantasy creature with the body of a dragon and a beachball for a head, hybrid, best quality, shadows and lighting, fantasy illustration muted"
https://gondolaprime.pw/pictures/schnell-dev-pro.jpg
This is missing from the image. The generated image looks good, but while reading the prompt I was surprised that part was missing
So your censorship investigation (via boobs) is testing a completely different, unrelated, model.
(Also you can download the model itself to check the local behaviour without extra filters. Unfortunately I don't have time to do it right now, but I'd love to know)
The rest is groping for a reason to make "the model is censored [a classifier made the POST return a black image instead of boobs]" into something sensical.
...then it's not open source. At least the others are Apache 2.0 (real open source) and correctly labeled proprietary, respectively.
Photo of teen girl in a ski mask making an origami swan in a barn. There is caption on the bottom of the image: "EAT DRUGS" in yellow font. In the background there is a framed photo of obama
https://i.imgur.com/RifcWZc.png
Donald Trump on the cover of "Leopards Ate My Face" magazine
https://i.imgur.com/6HdBJkr.png
They train them by taking an image with a label, e.g. "cat", then adding some noise to it, running a training step, adding more noise, running another step, and so on until the image is total (or near-total) noise and is still being told it's a cat.
Then, when you want to generate "cat", you start with noise, and it finds a cat in the noise and cancels some of the noise repeatedly. If you're able to watch an image get generated, sometimes you'll even see two cats on top of each other, but one ends up fading away.
Turns out, these denoisers don't require that many parameters, and if your resulting image has a few pixels that are just a tiny bit off color, you won't even notice.
The only reason that diffusion isn't used for text is because text requires discrete outputs.
If a pixel is just slightly the wrong shade of green, nobody really cares.
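A rough DDPM-style sketch of that add-noise/predict-noise loop (illustrative only, not FLUX's actual objective; `model` stands in for any network that predicts the added noise from the noisy image, timestep, and label):

    import torch
    import torch.nn.functional as F

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bar = torch.cumprod(1 - betas, dim=0)

    def train_step(model, opt, x0, cond):
        # pick a random noise level and mix the clean "cat" image with noise
        t = torch.randint(0, T, (x0.shape[0],))
        noise = torch.randn_like(x0)
        ab = alpha_bar[t].view(-1, 1, 1, 1)
        x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * noise
        # train the model to predict exactly the noise that was added
        loss = F.mse_loss(model(x_t, t, cond), noise)
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

    @torch.no_grad()
    def sample(model, cond, shape):
        # generation: start from pure noise and repeatedly cancel predicted noise
        x = torch.randn(shape)
        for t in reversed(range(T)):
            eps = model(x, torch.full((shape[0],), t), cond)
            x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / (1 - betas[t]).sqrt()
            if t > 0:
                x = x + betas[t].sqrt() * torch.randn_like(x)  # re-inject a little noise
        return x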
It looks like this is the case for LLMs, that the training quality of the data has a significant impact on the output quality of the model, which makes sense.
So the real magic is in designing a system to curate that high quality data.
You don't have to speculate on this as you can see that custom models for SDXL for instance perform vastly better than vanilla SDXL at the same number of parameters. It's all data set and tagging.
You can get better results with better data, for sure. And better architecture, for sure. But raw size is really important; the difference in quality between models, all else held equal, is HUGE and obvious if you play with them.
Complete and utter UX/first-impression fail. I had no desire to actually try the model after this.
Reddit message: https://www.reddit.com/r/StableDiffusion/comments/1ehh1hx/an...
Linked image: https://preview.redd.it/dz3djnish2gd1.png?width=1024&format=...
Result (distilled schnell model) for
"Photo of Criminal in a ski mask making a phone call in front of a store. There is caption on the bottom of the image: "It's time to Counter the Strike...". There is a red arrow pointing towards the caption. The red arrow is from a Red circle which has an image of Halo Master Chief in it."
https://www.reddit.com/r/StableDiffusion/s/SsPeQRJIkw