I don't find this very convincing, from either a mathematical or an experimental standpoint.
It seems their method is equivalent to SGD where the learning rate of each tensor is scaled by the number of elements in the tensor. The supposed "Signal-to-Noise ratio" they use is just gSNR = norm(g)/RMS(g - mean(g)), where g is the gradient w.r.t. a d-dimensional tensor and the mean is computed across the elements of g. For a zero-mean iid random gradient, the elementwise mean(g) ≈ 0, and a similar argument probably holds even for arbitrary (not completely random) high-dimensional gradients. In that case gSNR ≈ sqrt(d), which explains why it is constant over time and how it varies across the components of the network.
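A quick numerical check (plain NumPy, using the gSNR formula as I read it above, not their code):

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (64, 4096, 1_000_000):
        g = rng.normal(size=d)  # stand-in for a roughly iid, zero-mean gradient tensor
        gsnr = np.linalg.norm(g) / np.sqrt(np.mean((g - g.mean()) ** 2))
        print(f"d={d:>9}  gSNR={gsnr:9.1f}  sqrt(d)={np.sqrt(d):9.1f}")

The two columns agree to within sampling noise, which is consistent with the sqrt(d) behaviour.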
It also seems the optimal value in their hyperparameter sweeps lies at the edge of the searched range in almost every case, and a granularity of 10x for the learning rate and weight decay is too coarse to make direct comparisons anyway.
spenrose 2 days ago [-]
Something we need is no more papers titled " ... All You Need"
fastneutron 2 days ago [-]
“All you need” considered harmful.
webmaven 2 days ago [-]
"Considered Harmful" Considered Harmful...
upghost 2 days ago [-]
"All You Need" in the title is apparently All You Need.
0xdeadbeefbabe 2 days ago [-]
"All you need is love" can be a recipe for producing offspring. I've been wondering about AI parallels.
Smar 2 minutes ago [-]
Technically, you don't actually need love there, but yeah.
joshdavham 2 days ago [-]
Yeah it’s way too much of a cliché at this point.
etiam 2 days ago [-]
I think we're more at the point (or beyond) of it being deliberately obnoxious as a failed attempt at humor. But maybe I'm underestimating just how idolized that original paper is.
At any rate, by now I'm erring on the side of not promoting or citing them.
unnah 2 days ago [-]
Based on a quick googling, apparently the original paper is "One kitchen is all you need" by Sister Eudocia, Isabelle DeVerneil, and Jane Hildebrandt, published in Modern Hospital vol. 79 issue 3, pages 120-122, 1952. https://pubmed.ncbi.nlm.nih.gov/12992940/
glial 2 days ago [-]
Funny, I always assumed it was a Beatles joke.
brookst 2 days ago [-]
“Attention is all you need” is definitely a Beatles joke.
eden-u4 2 days ago [-]
Tried the source code on a toy model: Adam took 2 epochs to train a 10k-parameter model; this didn't achieve anything useful in 20.
Tweaked the hyperparameters a bit and such, but nothing. Probably a bogus implementation?
johndough 2 days ago [-]
I tried it to train a CNN-based CIFAR10 classifier, which worked well (only a tiny bit worse than Adam, though the difference might go away with hyperparameter tuning), but the optimizer totally failed (loss -> infinity) when training a U-Net for an image segmentation task. I had to increase eps to 1e-4 and decrease lr to 1e-3 so it would not explode, but that made it very slow to converge.
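My guess at why the larger eps helped, assuming eps plays its usual role of guarding a near-zero denominator in a normalized update (an assumption on my part, I haven't checked their code):

    # If an update is roughly g / (rms + eps) and rms collapses towards zero,
    # the effective step is capped at about 1/eps. Toy numbers:
    rms = 1e-7
    for eps in (1e-8, 1e-4):
        print(eps, 1.0 / (rms + eps))  # 1e-8 -> ~9.1e6 (explodes), 1e-4 -> ~1e4 (tame)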
My summary is that the memory savings might be great if it works, but it does not work everywhere.
jszymborski 2 days ago [-]
Yeah, I mean, that's the rub with SGD... you need to spend a non-trivial compute budget on hyperparameter tuning, after which it sometimes beats Adam.
Adam, on the other hand, generally gets you pretty good results without futzing too much with hyperparameters.
eden-u4 2 days ago [-]
ah, numerical instability in the warmup stage might be the issue then?
akos23 2 days ago [-]
More likely a bogus paper, neither their mathematical reasoning nor their experiments seem to hold up if you look at them closely.
Der_Einzige 2 days ago [-]
A single main conference publication at a top AI conference has ROI in the millions for the first author. I watched someone in the middle of their undergrad with a single ACL workshop publication get a 150K starting offer. It's remarkable that anything real at all is published given how perverse the incentives are to blatantly make shit up.
cma 2 days ago [-]
Did you set them to use the same memory budget? Adam holds more state.
They do say it consistently matches or outperforms despite its simplicity, and I think that statement is at the lower memory budget of their approach; but if it is at least promising, a fair comparison would take advantage of the lower memory use to add more params to their version in the comparison.
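Back-of-envelope version of that point (fp32 everywhere, ignoring activation memory, and assuming their optimizer keeps only a momentum buffer as state, which I haven't verified):

    BYTES = 4  # fp32
    def training_bytes_per_param(n_state_tensors):
        # weights + gradients + optimizer state, all parameter-sized tensors
        return (2 + n_state_tensors) * BYTES

    budget = 16e9                                        # hypothetical 16 GB budget
    params_adam = budget / training_bytes_per_param(2)   # Adam: m and v -> 16 B/param
    params_msgd = budget / training_bytes_per_param(1)   # momentum only -> 12 B/param
    print(params_adam / 1e9, params_msgd / 1e9)          # ~1.0 vs ~1.33 billion params

So at a fixed budget you could fit roughly a third more parameters, which would be the more interesting comparison.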
Also the paper says slow initial convergence, under limitations:
> Moreover, our methods ensure a steady and stable update during training, allowing the model to converge better in a given task with sufficient training steps. Thus, we might observe that the convergence speed is relatively lower than Adam’s in the early stage of training; as our primary focus is to investigate the effectiveness of the SaI approach, we left the acceleration of convergence speed in future work.
dist-epoch 2 days ago [-]
Was the toy model a transformer?
Maybe it's just way too small, you wouldn't use Karatsuba multiplication to do 3*5.
eden-u4 2 days ago [-]
that's a wrong simile given that you would get the same end result in both cases.
I'm not using a transformer, just a plain feedforward net with ReLU and dropout for a simple classifier.
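Roughly this kind of model (exact sizes here are just illustrative, but in the ~10k-parameter ballpark):

    import torch.nn as nn

    # plain MLP classifier: Linear -> ReLU -> Dropout -> Linear (~9.6k params)
    model = nn.Sequential(
        nn.Linear(64, 128),
        nn.ReLU(),
        nn.Dropout(0.2),
        nn.Linear(128, 10),
    )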
I don't know, I could be wrong. I hope some toy experiment shows that even at low parameter counts it works as well as Adam.
amunozo 2 days ago [-]
It's time to stop the "All You Need" titles. This one does not even sound good.
v3ss0n 2 days ago [-]
Need to write an article: `All you need should be considered harmful`.
scotty79 2 days ago [-]
"Considered harmful articles are all you need"
0xdeadbeefbabe 2 days ago [-]
Goto is all you need.
cuuupid 2 days ago [-]
It's one of the most irritating snowclones because most of the time the papers are not presenting some dramatic leap forward like attention.
rob_c 2 days ago [-]
Interesting take but:
After a reread, it's nice to see the optimizer is faster, but how long is actually spent in the optimizer, and can AdamW be tuned for a low-memory environment, given that it's greedy in trying to reduce the impact of statistical noise on gradient calculations?
Note that when training on ImageNet-1k it only becomes comparable to AdamW after many epochs and in fact performs measurably worse for most of the training session.
(How significant that is is up for debate, and depends on model/task/data.)
Why not incorporate second-order changes into AdamW directly?
The lower memory footprint is nice, but it's not immediately obvious why this is the case. Is the batch size reduced? Model changed? I'll reread this after a 2nd coffee and see if it is more obvious...
Still promising if true.
yobbo 2 days ago [-]
I haven't read more than the abstract of this particular paper, but it is expected that training behaves differently with/without Adam.
The problem with Adam is that it keeps one more statistic than momentum SGD (the same size as the model parameters) in memory. It also adds a little computation.
The way to deal with this otherwise is to tune the momentum parameters and clip/limit the gradient in various ways.
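For reference, the extra state and computation are easy to see if you write out one step of each in PyTorch-style code (minimal sketch, ignoring weight decay, dampening, and amsgrad):

    import torch

    def adam_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        # Adam carries two parameter-sized state tensors (m, v) plus a few extra ops
        m.mul_(b1).add_(g, alpha=1 - b1)          # first moment
        v.mul_(b2).addcmul_(g, g, value=1 - b2)   # second moment
        m_hat = m / (1 - b1 ** t)                 # bias correction
        v_hat = v / (1 - b2 ** t)
        p.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)

    def momentum_sgd_step(p, g, buf, lr=1e-2, momentum=0.9):
        # momentum SGD carries a single parameter-sized buffer
        buf.mul_(momentum).add_(g)
        p.add_(buf, alpha=-lr)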