Friends don’t let friends train small diffusion models (nonint.com)
Scene_Cast2 701 days ago [-]
In my experience tinkering with plain vanilla neural networks, training quality is quite sensitive to the width of the first couple of base layers. In particular, using a rhombus / "diamond" shape works best.

Here's my intuition. The output of the first layer is mostly a linear matrix transform. These linearities form the basis for the more complex features. If the network doesn't have enough of these base linear features, it can't compose the higher-level (and lower-dimensional) features later down the line.

duped 701 days ago [-]
You might want to just make the first layer a DCT of the input vector, which will split it into decorrelated components.
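A fixed DCT "first layer" like this is a one-liner with SciPy; a minimal sketch (the orthonormal DCT-II here is my choice, and the input shapes are made up for illustration):

```python
import numpy as np
from scipy.fft import dct

def dct_features(x):
    # x: (batch, n) input vectors -> (batch, n) DCT-II coefficients.
    # norm="ortho" makes the transform orthonormal, so it's a fixed
    # (non-learned) rotation of the input that decorrelates smooth signals.
    return dct(x, type=2, norm="ortho", axis=-1)

x = np.random.randn(4, 64)
feats = dct_features(x)
```

You'd then feed `feats` into the network instead of the raw vectors; because the transform is orthonormal, no information is lost.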
wardedVibe 701 days ago [-]
I see we are re-deriving conv nets
duped 700 days ago [-]
Not really; the first principle of not-too-bad ANNs is to use some kind of preprocessed, hopefully decorrelated feature vectors as the inputs. Conceptually that is way easier to do with intuition and know-how than blindly training an arbitrary structure to do what you want.

Ideally the training just derives non-intuitive relationships between these decorrelated input features to give you something useful.

ausbah 700 days ago [-]
by rhombus or diamond shaped do you mean the first layer being wider than the input layer - with each following hidden layer being smaller than the previous hidden layer?
Scene_Cast2 700 days ago [-]
By rhombus, I meant something like input -> 20 -> 50 -> 200 -> 50 -> 20 -> output.
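In PyTorch terms, something like this (a sketch only: the widths are the example above, and the input/output dims and ReLU choice are placeholders):

```python
import torch
import torch.nn as nn

def rhombus_mlp(in_dim, out_dim, widths=(20, 50, 200, 50, 20)):
    # Widths widen toward the middle then narrow again ("rhombus" shape).
    dims = (in_dim, *widths, out_dim)
    layers = []
    for a, b in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(a, b), nn.ReLU()]
    return nn.Sequential(*layers[:-1])  # drop the ReLU on the output layer

model = rhombus_mlp(10, 3)
out = model(torch.randn(5, 10))
```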
raihansaputra 699 days ago [-]
I'm still learning the basics of ML, but I assume the advantage of rhombus instead of just having 200*5 layers is training time? Would it have other advantages?
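One concrete difference is parameter count. A quick back-of-the-envelope comparison (weights only, biases ignored, and a hypothetical 20-dim input/output just for scale):

```python
def n_weights(dims):
    # total weight count of an MLP with the given layer widths
    return sum(a * b for a, b in zip(dims[:-1], dims[1:]))

rhombus = n_weights([20, 20, 50, 200, 50, 20, 20])   # 22,800
uniform = n_weights([20] + [200] * 5 + [20])         # 168,000
print(rhombus, uniform)
```

So the rhombus uses roughly 7x fewer weights than five uniform 200-wide layers, which means less compute per step and fewer parameters to fit, on top of whatever inductive-bias effects the shape has.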
WhyOhWhyQ 701 days ago [-]
Can it be proved that any work generated with one of these models (i.e., music generation, visual art generation, writing generation) isn't blatantly stealing from something in the dataset? If the dataset is large enough, how can you know that you're not going to be sued at a later date for publishing work generated with one of these models?
leereeves 701 days ago [-]
The models do sometimes generate verbatim copies of their training data. Here's an example, Copilot regurgitating Quake code, including sweary comments:

https://news.ycombinator.com/item?id=27710287

astrange 701 days ago [-]
You can't, which is why using general web scrapes is problematic. You're fine if you track the copyright status of your training set.
smegsicle 701 days ago [-]
of course you can't, see: the episode of seinfeld where elaine accidentally rips off a ziggy
forgingahead 700 days ago [-]
This blog is by the author of Tortoise-TTS (stylised as TorToiSe): https://github.com/neonbjb/tortoise-tts

Very cool work being done in ML these days, both inside large corps, and also by independent researchers/"hobbyists" (quotation marks because there is some really expert work being produced by them).

And from yesterday, an open-source version of Google's Imagen by lucidrains: https://github.com/lucidrains/imagen-pytorch

bitforger 701 days ago [-]
Wow! Super cool results.

This is just begging to be text-guided with the lyrics. Then you could put in the lyrics and have the model generate a clip it thinks would go with the lyrics. :)

ausbah 700 days ago [-]
>I got to thinking, though: what if there was some minimum bound to the number of channels for the base layer of a diffusion network?

>To test out this theory, I needed a way to compress the output of the diffusion model. One idea that came to mind is using the concepts behind PixelShuffle. So I compressed the waveform by 16x using a 1D form of PixelShuffle, increased the base dimensionality to 256, and started another training run.

I'm not familiar with the author's previous projects or the terminology here. Do they just mean he increased the size of the first hidden layer (the base dimensionality, to 256) and compressed the output by 16x?
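As I read it, the 1D PixelShuffle in the quote is just a reshape that trades sequence length for channels. A hypothetical reshape-based sketch (the post's actual implementation may differ; r=16 as in the quote):

```python
import numpy as np

def pixel_unshuffle_1d(x, r=16):
    # (C, L) -> (C*r, L//r): pack r consecutive samples into r channels
    c, l = x.shape
    assert l % r == 0
    return x.reshape(c, l // r, r).transpose(0, 2, 1).reshape(c * r, l // r)

def pixel_shuffle_1d(x, r=16):
    # exact inverse: (C*r, L//r) -> (C, L)
    cr, l = x.shape
    return x.reshape(cr // r, r, l).transpose(0, 2, 1).reshape(cr // r, l * r)

wave = np.random.randn(1, 64)       # toy mono waveform
packed = pixel_unshuffle_1d(wave)   # shape (16, 4): 16x shorter, 16x more channels
```

No information is lost; the model just sees a 16x shorter sequence with 16x the channels, which is presumably what "increased the base dimensionality" refers to.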

i_like_apis 701 days ago [-]
Hang on, is this thing actually generating halfway intelligible rap?

Utterly amazing.

svantana 701 days ago [-]
As I read it, this is just a spectrogram-->pcm converter (in speech known as a vocoder), which you can do decently without a model, just pure optimization. The hard part is to generate the spectrogram, which I assume they will tackle next.
woodson 701 days ago [-]
Well, there’s Griffin-Lim, but there is a reason most recent work uses a trained neural vocoder.
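Griffin-Lim itself is just iterative phase projection, no learned model at all. A minimal SciPy sketch (nperseg, n_iter, and the test signal are illustrative choices):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=50, nperseg=256):
    # Recover a waveform from a magnitude-only spectrogram: start from
    # random phase, then alternate iSTFT -> STFT, keeping the fixed
    # magnitudes and only updating the phase estimate each round.
    rng = np.random.default_rng(0)
    angles = np.exp(2j * np.pi * rng.random(mag.shape))
    y = istft(mag * angles, nperseg=nperseg)[1]
    for _ in range(n_iter):
        Z = stft(y, nperseg=nperseg)[2]
        # crop in case the round trip added a frame of padding
        angles = np.exp(1j * np.angle(Z[:, : mag.shape[1]]))
        y = istft(mag * angles, nperseg=nperseg)[1]
    return y

t = np.arange(8192) / 8192.0
x = np.sin(2 * np.pi * 220 * t)              # toy test tone
mag = np.abs(stft(x, nperseg=256)[2])        # magnitude spectrogram only
y_rec = griffin_lim(mag)
```

It works, but the phase estimates are only locally consistent, which is the metallic artifact sound that pushed people toward trained neural vocoders.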
gwern 701 days ago [-]
No, he's still just experimenting with the audio representation. However, you certainly can generate halfway intelligible rap, and much more, if you check out https://openai.com/blog/jukebox/ (with GPT-3 lyrics) or consider other kinds of audio synthesis like https://speechbot.github.io/dgslm/ (fully end to end).
zone411 701 days ago [-]
If you connect various steps of what can already be done well, such as melody generation (my project), lyrics generation, singing voice synthesis, chord generation, and various semi-automated VST plugins with generation features, you can create some good stuff already. Automating them completely is not even a huge leap forward but you'll still need to do some cherry picking of the results.
moralestapia 701 days ago [-]
I also have this question, and it's a bit more complicated for me as I'm not a native speaker + the accent is heavily skewed because of the music style.

So, are there some real words in there? Do the lyrics make sense? Or is it just "random" vocalizations that go along with the song?

jjoonathan 701 days ago [-]
Is it halfway intelligible rap or fully intelligible ghost rap?
songeater 701 days ago [-]
shameless plug going down this route: OpenAI Jukebox creating my singer/songwriter project... https://soundcloud.com/songshtr/sets/machinecroon
durnygbur 701 days ago [-]
lol, thisrapdoesntexist.com
lostmsu 701 days ago [-]
What's the size of the final model? What was the dataset? Training source code?
wardedVibe 701 days ago [-]
Any chance you'll post the code? Neat project