Arbitrary-Scale Super-Resolution with Neural Heat Fields (therasr.github.io)
jiggawatts 41 days ago [-]
The learned frequency banks reminded me of a notion I had: Instead of learning upscaling or image generation in pixel space, why not reuse the decades of effort that has gone into lossy image compression by generating output in a psychovisually optimal space?

Perhaps frequency space (discrete cosine transform) with a perceptually uniform color space like UCS. This would allow models to be optimised so that they spend more of their compute budget outputting detail that's relevant to human vision. Color spaces that split brightness from chroma would allow increased contrast detail and lower color detail. This is basically what JPG does.
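Concretely, the kind of transform I have in mind looks something like this (a rough sketch using scipy's DCT and the standard BT.601 RGB-to-YCbCr matrix; none of this is tied to any particular model):

    import numpy as np
    from scipy.fft import dctn

    def rgb_to_ycbcr(rgb):
        # Standard BT.601 full-range conversion: luma in channel 0, chroma in 1 and 2
        m = np.array([[ 0.299,   0.587,   0.114 ],
                      [-0.1687, -0.3313,  0.5   ],
                      [ 0.5,    -0.4187, -0.0813]])
        return rgb @ m.T + np.array([0.0, 0.5, 0.5])

    def blockwise_dct(channel, block=8):
        # 8x8 block DCT, the same decomposition JPEG uses
        h, w = channel.shape
        out = np.zeros_like(channel)
        for y in range(0, h - h % block, block):
            for x in range(0, w - w % block, block):
                out[y:y+block, x:x+block] = dctn(
                    channel[y:y+block, x:x+block], norm="ortho")
        return out

    img = np.random.rand(64, 64, 3)             # stand-in for a real image
    ycc = rgb_to_ycbcr(img)
    luma_coeffs   = blockwise_dct(ycc[..., 0])  # spend the detail budget here
    chroma_coeffs = blockwise_dct(ycc[..., 1])  # can be quantized much harder

A model emitting coefficients in this space could weight its loss per frequency band and per channel, which is essentially what JPEG's quantization tables do by hand.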

mturnshek 41 days ago [-]
You may already know this, but image generators like Stable Diffusion and Flux already do this in the form of “latent diffusion”.

Rather than operate on pixel space directly, they learn to operate on images that have been encoded by a VAE (latents). To generate an image with them, you run the reverse diffusion (actually flow in the case of flux) process they’ve learned and then decode the result using the VAE.

These VAE-encoded latent images are 8x smaller in width/height, and have 4 channels in the case of Stable Diffusion and 16 in the case of Flux.
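For anyone curious what that round trip looks like in practice, here is a minimal sketch with the diffusers library (shapes are for Stable Diffusion's VAE; the real pipelines additionally apply a scaling factor to the latents before diffusion):

    import torch
    from diffusers import AutoencoderKL

    # Stable Diffusion's VAE: 512x512x3 pixels <-> 64x64x4 latents
    vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

    pixels = torch.rand(1, 3, 512, 512) * 2 - 1            # stand-in image in [-1, 1]
    with torch.no_grad():
        latents = vae.encode(pixels).latent_dist.sample()  # shape (1, 4, 64, 64)
        recon   = vae.decode(latents).sample               # shape (1, 3, 512, 512)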

I do think it would be more useful if it worked more like you said, though - if the channels weren’t encoded arbitrarily but some of them had pretty clear, useful human meaning like lightness, it would be another hook to control image generation.

To some extent, you can control the existing VAE channels, but it is pretty finicky.

sigmoid10 41 days ago [-]
If there's one thing that neural networks have shown, it's that they are much better at picking up encoding patterns for realistic tasks than humans. There are so many aspects that could be used in dimensional reduction tasks that it seems pretty wild that we've come this far with human-designed patterns. From a top down engineering perspective, it might seem like a disadvantage to have algorithms that are not tailored to particular cases. But when you want things like general purpose image generation, it's simply much more economical to let ML figure out which dimensions to focus on. Because humans would spend years coming up with the details of certain formats and still not cover half the cases.
orbital-decay 40 days ago [-]
>You may already know this, but image generators like Stable Diffusion and Flux already do this in the form of “latent diffusion”.

They... don't. Latents don't meaningfully represent human perception, they represent correlations in the dataset. Parent is talking about the function aligned with actual measured human perception (UCS is an example of that). Whether it's a good idea, and how trivial it is for the model to fit this function automatically, is another question.

crazygringo 41 days ago [-]
> by generating output in a psychovisually optimal space? Perhaps frequency space (discrete cosine transform)

I've never understood the DCT to be psychovisually optimal at all. At lower bitrates, it degrades into ringing and blockiness that don't match a "simplified perception" at all.

The frequency domain models our auditory space well, because our ears literally process frequencies. Bringing that over to the visual side has never been about "psychovisual modeling" but about existing mathematical techniques that happen to work well, despite their glaring "psychovisual" flaws.

On the other hand, yes, an HSV color space could make more sense than RGB, for example. But I'm not sure it's going to provide significant savings? I'd certainly be curious. It also might create problems though, because hue is undefined when saturation is zero, saturation is undefined when brightness is zero, etc. It's not smooth and continuous at the edges the way RGB is. And while something like CIELAB doesn't have that problem, you have the problem of keeping valid value combinations "in bounds".
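(The discontinuity is easy to demonstrate with the standard library, for what it's worth:)

    import colorsys

    # Two nearly identical grey pixels...
    print(colorsys.rgb_to_hsv(0.500, 0.500, 0.501))  # hue ~0.67 (blue side)
    print(colorsys.rgb_to_hsv(0.501, 0.500, 0.500))  # hue  0.0  (red side)
    # ...land on completely different hues, because hue becomes
    # meaningless as saturation approaches zero.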

pizza 41 days ago [-]
JPEG is good for when you want a picture to look reasonably good while throwing away ~90-95% of the data. In fact, there's a relatively new JPEG variant that lets you get even better psychovisual fidelity for the same compression level by just doing JPEG in the XYB color space, xybjpeg. JPEG is also a very simple algorithm, when compared to the ones that'd be noticeably better near 99% compression.

To beat blockiness/banding across very gradually varying color gradients (think eg the gradient of a blue sky), JPEG XL has to whip out a lot of tricks, like handling sub-LF DCT coefficients between blocks, heterogeneous block sizes, deblocking filters for smoothing, and heterogeneous quantization maps.

BTW, one of the ways different camera manufacturers aimed to position themselves as having cameras that generated the best pictures was by using custom proprietary quantization tables to optimize for psychovisual quality.
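Pillow actually lets you experiment with this directly (a toy example; real manufacturer tables vary per frequency rather than being flat, and the file names are just placeholders):

    from PIL import Image

    # One 64-entry quantization table per channel group (luma, chroma).
    # Larger values mean coarser quantization, i.e. more detail thrown away.
    luma_q   = [16] * 64   # keep luma relatively fine
    chroma_q = [64] * 64   # quantize chroma much harder

    img = Image.open("input.png").convert("RGB")
    img.save("custom_q.jpg", "JPEG", qtables=[luma_q, chroma_q], subsampling=2)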

crazygringo 41 days ago [-]
No disagreements.

I do suspect that at some point we will make a major compression breakthrough that is based on something more "psychovisual". Not Gaussian splatting, but something more akin to that -- something that directly understands geometric areas of gradating colors as primitive objects, textures as primitives, and motion as assigned to those rather than to pixels.

On the other hand, it may very well be a form of AI-based compression that does this, rather than us explicitly designing it.

nullc 41 days ago [-]
Lossy image compression has mostly targeted an entirely different performance envelope.

E.g. in the image you can see a diagonal bands basis function. Image codecs don't generally have those-- not because they wouldn't be useful but because codec developers favor separable transforms that have fast factorizations for significant performance improvements.

I don't think we know and can really make good comparisons between traditional tools and ML-powered compression because of this. We just don't have decades of effort where the engineers were allowed a million multiplies and a thousand memory accesses per pixel.
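For reference, "separable" means the 2D transform factors into independent 1D passes over rows and columns, which is where the speed comes from; a diagonal-band basis generally can't be factored that way. A quick sketch:

    import numpy as np
    from scipy.fft import dct

    def dct2_separable(block):
        # 2D DCT as two 1D passes (rows, then columns) -- the fast path codec
        # designers optimize for; a non-separable basis needs a full dense
        # matrix multiply per block instead.
        return dct(dct(block, axis=0, norm="ortho"), axis=1, norm="ortho")

    coeffs = dct2_separable(np.random.rand(8, 8))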

cma 41 days ago [-]
There is definitely work out there that deals directly in dct blocks from jpeg:

https://arxiv.org/abs/1907.11503

https://arxiv.org/abs/2308.09110

With generative ai they tend to have a learned compressed representation instead (VAE)

pizza 41 days ago [-]
We do; see e.g. LPIPS loss.
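(A sketch of how that's typically used, via the reference lpips package; inputs are expected as NCHW tensors in [-1, 1]:)

    import torch
    import lpips

    # Perceptual distance computed from deep features rather than raw pixels
    loss_fn = lpips.LPIPS(net="alex")

    img0 = torch.rand(1, 3, 256, 256) * 2 - 1   # stand-in images
    img1 = torch.rand(1, 3, 256, 256) * 2 - 1
    d = loss_fn(img0, img1)                     # one perceptual distance per image
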
littlestymaar 41 days ago [-]
> why not reuse the decades of effort that has gone into lossy image compression by generating output in a psychovisually optimal space

I've been wondering exactly this for a while, if somebody more knowledgeable knows why we're not doing that I'd be happy to hear it.

dahart 41 days ago [-]
Interesting thoughts! First thing to mention is that if you look at the code, it uses SSIM, which is a perceptual image metric. Second is that it may be using sRGB, which isn’t a perceptually uniform color space, but is closer to one than linear RGB. I say that simply because most images these days are sRGB encoded. Whether Thera is depends on the dataset.
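(For reference, the SSIM in question is the standard metric; a sketch with scikit-image, which has it built in, assuming float images in [0, 1]:)

    import numpy as np
    from skimage.metrics import structural_similarity as ssim

    ref  = np.random.rand(128, 128, 3)                               # stand-in ground truth
    test = np.clip(ref + 0.05 * np.random.randn(128, 128, 3), 0, 1)  # noisy reconstruction

    score = ssim(ref, test, channel_axis=-1, data_range=1.0)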

Aren’t Thera’s frequency banks pretty darn close to a DCT or Fourier transform already? This is a frequency space decomposition & reconstruction, and their goal is similar to JPG in that it aims to capture the low frequencies accurately and skimp on the frequencies that matter less, either because they are less visible or because they lead to error (aliasing artifacts). It doesn’t seem entirely accurate to frame this paper as learning in pixel space.

As far as perceptual color spaces, yeah that might be worth trying. It’s not clear exactly what the goal is or how it would help, but it might. Thera does use the same color spaces that JPG encoding uses: RGB and YCbCr, which are famously bad. Perceptual color spaces save some bits in the file format, and like frequency space, they are convenient and help with perceptual decisions, but it’s less common to see them used to save work, at least outside of research. Notably, image generation often needs to work in linear color space anyway, and convert to a perceptual color space at the end. For example, CG rendering is all done in linear space, even when using a perceptual color metric to guide adaptive sampling.
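(For anyone following along, the sRGB/linear distinction in concrete terms; this is just the standard sRGB transfer function, nothing from the paper:)

    import numpy as np

    def srgb_to_linear(c):
        # Standard sRGB decoding: a linear toe below 0.04045, then a ~2.4 gamma curve
        c = np.asarray(c, dtype=np.float64)
        return np.where(c <= 0.04045, c / 12.92, ((c + 0.055) / 1.055) ** 2.4)

    def linear_to_srgb(c):
        c = np.asarray(c, dtype=np.float64)
        return np.where(c <= 0.0031308, c * 12.92, 1.055 * c ** (1 / 2.4) - 0.055)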

Another question worth asking is whether in general a neural network already learns the perceptual factors. When it comes to black box training, if the data and loss function capture what a viewer needs to see, then the network will likely learn what it needs and use its own notion of perceptual metrics in its latent space. In that case, it may not help to use inputs and outputs that are encoded in a perceptual space, and we might be making incorrect assumptions.

In this case with Thera, the paper’s goal may be difficult to pin down perceptually. Doesn’t the arbitrary in ‘arbitrary-scale super resolution’ toss viewing conditions and the notion of an ideal viewer out the window? If we don’t even want to know what the solid angle of a pixel is, we can’t know very much about how they’re perceived.

WhitneyLand 41 days ago [-]
Seems like a nice result, but it wouldn’t have hurt for them to give a few performance benchmarks. I understand that the point of the paper was a quality improvement, but it’s always nice to reference a baseline for practicality.
vessenes 41 days ago [-]
Not disagreeing, but the parameter count is listed in the single-digit millions (which surprised me). So I would expect this to be very fast on modern hardware.
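(Back-of-the-envelope, assuming fp16 weights; a generic PyTorch snippet, nothing model-specific:)

    import torch.nn as nn

    def weight_megabytes(model: nn.Module, bytes_per_param: int = 2) -> float:
        # A few million fp16 parameters is only on the order of 10 MB of weights,
        # so memory is unlikely to be the bottleneck; per-pixel compute might be.
        return sum(p.numel() for p in model.parameters()) * bytes_per_param / 1e6
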
KeplerBoy 41 days ago [-]
Very fast is a bit vague in a space where you might have a millisecond per frame.
cubefox 41 days ago [-]
I doubt that this is a technique for real-time applications. They don't say anything about that on the website.
KeplerBoy 41 days ago [-]
True, especially because they would probably talk about things like temporal coherency if it were meant to be applied to video/game feeds.
cubefox 41 days ago [-]
This seems more in line with potential real-time applications, though it is still a lot slower than DLSS: https://dl.acm.org/doi/10.1145/3641519.3657439
i5heu 41 days ago [-]
Very good work!

Sadly this model really does not like noisy images that have codec compression artifacts, at least with my few test images.

LoganDark 41 days ago [-]
I wonder if there is a de-artifacting model out there.
earslap 41 days ago [-]
I think the company named Topaz had a photoshop plugin to remove "jpeg artifacts" - I don't know if they are using a neural model for it though.
dimatura 40 days ago [-]
Yes, there are plenty of them. Not sure what the SOTA is, though. Similar to super-resolution, it is relatively simple to create a nearly-infinite dataset for these; pick a clean image, then introduce JPG artifacts. Then train a model to invert the process.
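A minimal sketch of that degradation pipeline (using Pillow; the quality range and the file path are just placeholders):

    import io
    import random
    from PIL import Image

    def make_training_pair(path):
        clean = Image.open(path).convert("RGB")
        buf = io.BytesIO()
        # Re-encode at a random low quality to introduce blocking/ringing artifacts
        clean.save(buf, "JPEG", quality=random.randint(10, 50))
        degraded = Image.open(io.BytesIO(buf.getvalue())).convert("RGB")
        return degraded, clean   # model input, training target
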
LoganDark 39 days ago [-]
> Similar to super-resolution, it is relatively simple to create a nearly-infinite dataset for these; pick a clean image, then introduce JPG artifacts. Then train a model to invert the process.

Yep, exactly what I was thinking. The thing is it's hard to find enough clean images!

numba888 40 days ago [-]
I did it some time ago. It works at low levels. Doing it in hard cases is non-trivial but possible. It's difficult to monetize, which is why nobody is doing it seriously.
flerchin 41 days ago [-]
I'd like to see the results in something like Wing Commander Privateer.
nandometzger 41 days ago [-]
Try it yourself. Here is the demo: https://huggingface.co/spaces/prs-eth/thera
flerchin 40 days ago [-]
I tried it on these, and the results were really great!

https://www.wcnews.com/chatzone/threads/all-your-base-s-with...

karmakaze 41 days ago [-]
Tried it on this image[0] and it was blurry while still being pixelated.

[0] https://en.wikipedia.org/wiki/Wing_Commander:_Privateer#/med...

0x12A 40 days ago [-]
That's an interesting example! Perhaps too out-of-distribution, though. For fair comparison with other methods, we used the DIV2K training set in our paper, which only comprises 800 images. Would be cool to train a version on a much bigger set, potentially including images similar to what you tried :)
grumbel 40 days ago [-]
As others have mentioned, this model just puts emphasis on pixels and compression artifacts, so it's not of much use for improving old or low-quality images.

I tried doing some pixelart->HD conversion with Gemini2.0Flash instead and the results look quite promising:

* https://imgur.com/a/t9F94F1

The images are however all over the place, as it doesn't seem to stick very close to the prompt. Trying to fine-tune the image with further chatting often leads to overexposed-looking pictures.

All the results are done with prompts along the lines of "here is a pixelart image, convert it into a photo" or some variation thereof. No img2img, LoRA or anything here, all plain Gemini chat.

mastax 41 days ago [-]
It just looks like a strong sharpening filter.
smusamashah 41 days ago [-]
It does not work with pixel art very well.
earthnail 41 days ago [-]
@0x12A what’s the difference between this version and v1 of the paper from November 2023?
0x12A 40 days ago [-]
Hi, this is a complete rework, though the core idea remains the same. Results are now much better due to improved engineering, and we compare to recent SOTA methods up until 2025. Also we have some new experiments and worked a lot on figures and presentation :)
varispeed 41 days ago [-]
Could such method be adapted to audio? For instance to upscale 8-bit samples to 16-bit in Amiga mods?
rini17 41 days ago [-]
I tried photos of animals, and it was okayish except the eyes were completely off.
mrybczyn 41 days ago [-]
hrm. on nature portrait photography 600x600 upscale, it has a LOT of artifacts. Perhaps too far out of distribution?

That said, your examples are promising, and thank you for posting a HF space to try it out!

0x12A 41 days ago [-]
Hi, author here :) It shouldn't be OOD, unless it's too noisy maybe? And what scaling factor did you use? Single image SR is a highly ill-posed problem, so at higher upscaling factors it just becomes really difficult…
throwaway314155 41 days ago [-]
Perhaps parent comment used a .jpg as input? The model seems to artifact a lot on existing compression artifacts.
saddat 41 days ago [-]
Why do these algorithms not include prompting to guide the scaling?
nthingtohide 41 days ago [-]
DLSS will benefit greatly from research in this area. DLSS 4 uses transformers.

DLSS 3 vs DLSS 4 (Transformer)

https://www.youtube.com/watch?v=CMBpGbUCgm4

seanalltogether 41 days ago [-]
I would love to see this kind of work applied to old movies from the 30s and 40s like the Marx Brothers.
throwaway2562 41 days ago [-]
Just curious: why?

It wouldn’t be more funny ha-ha, just more funny strange.

Hizonner 41 days ago [-]
Where are the ground truth images?
WhitneyLand 41 days ago [-]
Click through to the actual paper and they are in the last column labeled “GT”.
flufluflufluffy 41 days ago [-]
Was anyone else expecting infinitely zoomable pictures from that title? I am disappoint
imoreno 41 days ago [-]
You were imagining something where you give it one grey pixel, then zoom in infinitely and read the Magna Carta? Where did you imagine it would get the information from?
BriggyDwiggs42 40 days ago [-]
Wait but that’s the point. It doesn’t have the information, it makes a best guess to match its training.
p1mrx 41 days ago [-]
the cloud
adhoc32 41 days ago [-]
Instead of training on vast amounts of arbitrary data that may lead to hallucinations, wouldn't it be better to train on high-resolution images of the specific subject we want to upscale? For example, using high-resolution modern photos of a building to enhance an old photo of the same building, or using a family album of a person to upscale an old image of that person. Does such an approach exist?
0x12A 41 days ago [-]
Author here -- Generally in single image super-resolution, we want to learn a prior over natural high-resolution images, and for that a large and diverse training set is beneficial. Your suggestion sounds interesting, though it's more reminiscent of multi-image super-resolution, where additional images contribute additional information, which has to be registered appropriately.

That said, our approach is actually trained on a (by modern standards) rather small dataset, consisting only of 800 images. :)

112233 40 days ago [-]
It feels like there's multishot NL-means, then immediately those pre-trained "AI upscale" things like Topaz, with nothing in between. Like, if I have 500 shots from a single session and I would like to pile the data together to remove noise and increase detail, preferably starting from the raw data, then - nothing? The only people doing something like that are astrophotographers, but their tools are... specific.

But for "normal" photography, it is either pre-trained ML, pulling external data in, or something "dumb" like anisotropic blurring.
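A bare-bones version of that stacking idea, for illustration (grayscale frames, translation-only alignment via scikit-image's phase correlation; real tools add sub-pixel registration, warping, and outlier rejection):

    import numpy as np
    from scipy.ndimage import shift as nd_shift
    from skimage.registration import phase_cross_correlation

    def stack(frames):
        # Align every frame to the first by integer translation, then average;
        # noise drops roughly as 1/sqrt(N) while shared detail is preserved.
        ref = frames[0].astype(np.float64)
        aligned = [ref]
        for f in frames[1:]:
            offset, _, _ = phase_cross_correlation(ref, f)
            aligned.append(nd_shift(f.astype(np.float64), offset))
        return np.mean(aligned, axis=0)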

adhoc32 41 days ago [-]
I'm not a data scientist, but I assume that having more information about the subject would yield better results. In particular, upscaling faces doesn't produce convincing outcomes; the results tend to look eerie and uncanny.
MereInterest 41 days ago [-]
Not a data scientist, but my understanding is that restricting the set of training data for the initial training run often results in poorer inference due to a smaller data set. If you’re training early layers of a model, you’re often recognizing rather abstract features, such as boundaries between different colors.

That said, there is a benefit to fine-tuning a model on a reduced data set after the initial training. The initial training with the larger dataset means that it doesn’t get entirely lost in the smaller dataset.

crazygringo 41 days ago [-]
That is how Hollywood currently de-ages famous actors, by training on their photos and stills from when they were around the desired age.

But it's extremely time-consuming and currently expensive.

imoreno 41 days ago [-]
That is effectively what it's doing already. If you examine the artifacts, there is obviously a bias towards certain types of features.