From https://arxiv.org/html/2409.11340v1:

> Unlike popular diffusion models, OmniGen features a very concise structure, comprising only two main components: a VAE and a transformer model, without any additional encoders.
> OmniGen supports arbitrarily interleaved text and image inputs as conditions to guide image generation, rather than text-only or image-only conditions.
> Additionally, we incorporate several classic computer vision tasks such as human pose estimation, edge detection, and image deblurring, thereby extending the model’s capability boundaries and enhancing its proficiency in complex image generation tasks.
This enables prompts for edits like:
"|image_1| Put a smile face on the note."
or
"The canny edge of the generated picture should look like: |image_1|"
> To train a robust unified model, we construct the first large-scale unified image generation dataset X2I, which unifies various tasks into one format.
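The interleaved text-and-image conditioning format can be illustrated with a small sketch. This parser is purely hypothetical, written against the `|image_N|` placeholder syntax shown in the example prompts above; OmniGen's actual handling of these placeholders is internal to the model.

```python
import re

def split_prompt(prompt):
    """Split an interleaved prompt into ('text', s) and ('image', name) segments.

    Illustrative only: assumes image references use the |image_N| placeholder
    syntax from the example prompts; the real model consumes these internally.
    """
    segments = []
    pos = 0
    for m in re.finditer(r"\|image_(\d+)\|", prompt):
        if m.start() > pos:
            segments.append(("text", prompt[pos:m.start()]))
        segments.append(("image", f"image_{m.group(1)}"))
        pos = m.end()
    if pos < len(prompt):
        segments.append(("text", prompt[pos:]))
    return segments
```

A prompt like `"|image_1| Put a smile face on the note."` would split into one image segment followed by one text segment, which is the "arbitrarily interleaved" structure the paper describes.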
nairoz 131 days ago [-]
> trained from scratch
Not exactly. They mention starting from the VAE from Stable Diffusion XL and the Transformer from Phi3.
Looks like these LLMs can really be used for anything
yieldcrv 131 days ago [-]
Pretty cool. ComfyUI and its community are too cumbersome for me and still produce too much throwaway content.
lelandfe 131 days ago [-]
I left all the defaults as is, uploaded a small image, typed in "cafe," and 15 minutes later I'm still waiting for it to finish.
Same, I left it running for half an hour but nothing happened.
bob_1200 131 days ago [-]
The author updated their code a couple of days ago, and it runs smoothly on my end, producing results in about one minute. https://github.com/VectorSpaceLab/OmniGen
Citizen_Lame 130 days ago [-]
Left it running for an hour; nothing happened. Maybe this is a social experiment.
zamadatix 130 days ago [-]
Seems far more likely it's a transient unhandled exception which isn't bubbling up to let the frontend know.
grvbck 130 days ago [-]
Left it running overnight. Set output to 512x512 in an attempt to speed things up. Nothing.
freilanzer 130 days ago [-]
Second attempt, no output no matter how long it's left running.
bob_1200 130 days ago [-]
I think there might be an issue with this website; it doesn't seem to be their official site. It's recommended to use the official code and demo: https://vectorspacelab.github.io/OmniGen/
Took a few minutes to load, some assets download at less than 1kbps. The first 3 times I got a "Connection error" after 30s. The 4th time has now been running for 5m.
[1]: https://huggingface.co/Shitao/OmniGen-v1
phromo 130 days ago [-]
On an A100 running 512x512 takes roughly 20s for one image+text input (50 iterations)
I think this type of capability will make a lot of image generation stuff obsolete eventually. In a year or two, 75%+ of what people do with ComfyUI workflows might be built into models.
freilanzer 130 days ago [-]
Well, at the moment it seems it's not working at all.
sswz 131 days ago [-]
Using a single model to unify all image generation tasks, including many computer vision tasks and visual language reasoning, could transform future image generation models. Although some capabilities, like text-to-image, aren't perfect, it's a significant advancement. The model's ability to integrate so many tasks with strong instruction-following skills is impressive. I'm excited about the broad impact OmniGen could have on future research.
wwwtyro 131 days ago [-]
With consistent representation of characters, are we now on the precipice of a Cambrian explosion of manga/graphic novels/comics?
Multicomp 131 days ago [-]
I sure hope so - at the very least I will use it for tabletop illustrations instead of having to describe a party's scenario result - I can give them a character-accurate image showing their success (or epic lack thereof).
jowday 130 days ago [-]
It’s not really consistent, or any more consistent than, say, SDXL with an IP-Adapter. Even in their example images, the character they’ve input comes out wearing different clothes.
haccount 130 days ago [-]
I would say we already had one of those. There's more hand-crafted, human-made content available than anyone cares to read.
While this will enable a certain amount of additional spam, it will, more importantly, democratize the creative process for those who want to tell a story in images but lack the skill and resources to produce one traditionally.
fullstackwife 131 days ago [-]
not yet, still can't generate transparent images
Vt71fcAqt7 131 days ago [-]
From the controlnet author:
Transparent Image Layer Diffusion using Latent Transparency
https://arxiv.org/abs/2402.17113
https://github.com/lllyasviel/sd-forge-layerdiffuse
Why do you need that? For manga specifically, generate in greyscale and convert luminance to alpha; then composite; then color.
Or, if you need solid regions that overlap and mask out other regions, then generate objects over a chroma-keyable flat background.
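The greyscale workaround above can be sketched in a few lines. This is a toy, pure-Python version operating on flat pixel lists (a real pipeline would use Pillow or NumPy); the function names and conventions are my own.

```python
def luminance_to_alpha(gray_pixels):
    """Map greyscale line art (0 = black ink, 255 = white paper) to RGBA:
    darker pixels become more opaque black ink."""
    return [(0, 0, 0, 255 - g) for g in gray_pixels]

def composite_over(fg, bg):
    """Alpha-composite RGBA foreground pixels over opaque RGB background pixels,
    using standard 'over' blending per channel."""
    out = []
    for (fr, fg_g, fb, fa), (br, bg_g, bb) in zip(fg, bg):
        a = fa / 255.0
        out.append((round(fr * a + br * (1 - a)),
                    round(fg_g * a + bg_g * (1 - a)),
                    round(fb * a + bb * (1 - a))))
    return out
```

Pure black ink stays black after compositing, pure white paper becomes fully transparent (leaving the background untouched), and mid-grey strokes blend proportionally, which is exactly what the "generate in greyscale, convert luminance to alpha, then composite, then color" workflow relies on.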
block_dagger 131 days ago [-]
This looks promising. I love how you can reference uploaded images with markup - this is exactly what the field needs more of. After spending the last two weeks generating thousands of album cover images using DALL-E and being generally disappointed with the results (especially with the variations feature of DALL-E 2), I'm excited to give this a try.
101008 131 days ago [-]
I am working on an API to generate avatars/profile pics based on a prompt. I looked into training my own model, but I think it's a titanic task, impossible to do myself. Is my best option to use an external API and then crop the face from what was generated?
ncoronges 131 days ago [-]
The simplest commercial product for finetuning your own model is probably Adobe Firefly, although there’s no API access support yet. But there are cheap and only slightly more involved options like Replicate or Civit.ai. Replicate has solid API support. Check out: https://replicate.com/blog/fine-tune-flux
Is it possible to download Flux 1 and deploy it to my own server? (And make a simple API on top of it?) I don't need fine-tuning.
spaceman_2020 131 days ago [-]
The easiest flux api I’ve seen is with Fal.ai
It is expensive though; Flux dev images are like $0.035/image.
handfuloflight 131 days ago [-]
If you have GPUs on your server that can handle it.
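For the "simple API on top of it" part of the question, a minimal self-hosted wrapper can be sketched with the Python standard library alone. Everything here is a hypothetical sketch: `generate_fn` is a stub standing in for whatever actually runs the model (Flux, OmniGen, or anything else), and the endpoint shape and paths are invented for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_generate(body_bytes, generate_fn):
    """Parse a JSON request body like {"prompt": "..."} and return a JSON
    response payload; generate_fn is whatever actually runs the model."""
    req = json.loads(body_bytes)
    image_path = generate_fn(req["prompt"])
    return json.dumps({"image": image_path}).encode()

class GenHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        # Stub generator: a real deployment would call the diffusion model here.
        payload = handle_generate(self.rfile.read(length),
                                  generate_fn=lambda p: f"/outputs/{abs(hash(p))}.png")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

# To actually serve (blocks forever):
# HTTPServer(("0.0.0.0", 8080), GenHandler).serve_forever()
```

As the parent comment notes, this only makes sense if the server has GPUs that can handle the model; otherwise a hosted API like Fal.ai or Replicate is the simpler route.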
haccount 130 days ago [-]
You can use a few controlnet templates and then whatever model you like and consistently get the posture correct. The diffusion plugin for Krita is a great playground for exploring this.
KerryJones 131 days ago [-]
Love this idea -- you have a typo in tools "Satble Diffusion"
gremlinsinc 130 days ago [-]
Anyone know how it handles text? That's kind of my deal breaker. I like Ideogram for its ability to do really cool fonts, etc.
empath75 131 days ago [-]
It seems like there's a lot of potential for abuse if you can get it to reliably generate AI images of real people.
hnbad 130 days ago [-]
We literally already had AI fake porn of Taylor Swift making the rounds a while ago. Prepare for women in public positions to face that kind of bullshit more frequently.
CamperBob2 130 days ago [-]
Eh, once it's ubiquitous, nobody will care.
cubefox 130 days ago [-]
Once fakes in politics are ubiquitous, people will stop trusting the real evidence.
CamperBob2 129 days ago [-]
That appears to have already happened, no AI required.
cubefox 129 days ago [-]
Trust in video evidence can certainly fall much lower than it is now.
CamperBob2 129 days ago [-]
It's more an issue of indifference than trust. For instance, you can show Trump supporters any number of legitimate videos that depict Trump and his associates saying, doing, and promising all kinds of outrageous, offensive, and destructive things, and they won't care in the slightest. It's not that they don't trust the video, it's that they've been programmed not to care. The leader cannot fail.
That's the ultimate purpose of disinformation -- it's not to make you believe false things, it's to make you believe nothing.
So yes, AI fakery will contribute to that phenomenon on behalf of numerous bad actors, but it was always going to happen anyway. You don't need Hinton and Sutskever on your side if you have Ailes and Murdoch.
cubefox 128 days ago [-]
> So yes, AI fakery will contribute to that phenomenon on behalf of numerous bad actors, but it was always going to happen anyway.
That's like saying: "Yes, crime might increase, but we will always have crime anyway." What will happen anyway is irrelevant precisely because it happens anyway. What's relevant is the expected increase in media distrust once everything might be a fake.
oatsandsugar 131 days ago [-]
I mean, I struggle even getting DALL-E to iterate on one image without changing everything, so this is pretty cool.
anyi09881 131 days ago [-]
Curious what's the actual cost for each edit? Will this infra always be reliable?
CamperBob2 131 days ago [-]
I was able to clone the repo and run it locally, even on a Windows machine, with only minimal Python dependency grief. Takes about a minute to create or edit an image on a 4090.
It's pretty impressive so far. Image quality isn't mind-blowing, but the multi-modal aspects are almost disturbingly powerful.
Not a lot of guardrails, either.
hyuuu 127 days ago [-]
could you elaborate on the multi modal aspect of this model?
illumanaughty 131 days ago [-]
We've been manipulating photos as long as we've been taking them.
handfuloflight 131 days ago [-]
Art is what you can get away with. (Andy Warhol)