Using as few computational resources (memory and/or FLOPS) as possible as an additional optimization criterion when training NNs is an interesting avenue. I think the current state of pre-trained model families is weird. Take Llama 3.1 or Segment Anything 2: you get tiny/small/medium/larger/huge models, where for each tier the model size was predefined, and they are trained somewhat (completely?) independently. This feels iffy, patchy, and like we haven't really arrived yet.
I'd want a model that scales up and down depending on the task given at inference, and a model that doesn't have a fixed size when starting the training. Shouldn't it specialize over training progress, when seeing more tokens, and grow larger where needed? Without some human fixing a size beforehand?
Self-organization is a fascinating topic to me. This last year I've been working on Self-Organizing Gaussian Splats [0]. With a lot of squinting, this lives in a similar space as the Self-Compressing Neural Networks from the link above. The idea of the Gaussians was to build on Self-Organizing Maps (lovely 90s concept, look for some GIFs if you don't know it), and use that to represent 3D scenes in a memory-efficient way. By mapping attributes into a locally smooth 2D grid. It's quite a simple algorithm, but works really well, and better than many quite complicated coding schemes. So this has me excited that we'll (re-)discover great methods in this space in the near future.
Afaik, they aren't really trained independently -- for most models, e.g. DINO, etc., the smaller sizes are actually distilled from larger models. It's much easier to generate performant models at smaller size via distillation.
And I'd be curious about the utility of a model that scales up and down at inference - if this was the case you'd still need to have storage that is the same as the maximum model size. This would essentially be useless for embedded applications, etc., unless you have heavy quantization - but quantization in a small parameter space would probably make the smaller modes useless. I could see the benefit here in terms of optimizing latency for different applications but maybe you have other ideas.
Given all that, I think training for a smaller number of parameters, as noted in OP, would kind of beat out some model that scales at inference time - especially when most people know what kind of application they are aiming to build and the required level of performance.
nwoli 163 days ago [-]
One elegant approach for this I’ve found is this https://github.com/mit-han-lab/gan-compression They basically train an “all in one” network from which you can extract small or large models afterwards (with optional additional finetuning to improve the selected channel size combinations)
idontknowmuch 161 days ago [-]
Ahh that's an interesting paper, I must have missed that one - thanks for the link. I think another paper that recently got a lot of hype has been the Matryoshka representation learning paper -- essentially training models with different parameters and output embedding sizes at the same time, basically distillation during training rather than post-training (https://arxiv.org/abs/2205.13147).
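To make the nested-embedding idea a bit more concrete, here is a minimal sketch in plain Python. It assumes the core trick is that every prefix of one embedding vector should be usable on its own, so training sums a loss over several prefix lengths. The names `truncate_and_normalize` and `nested_loss` and the toy numbers are made up for illustration and are not taken from the paper's code.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` coordinates and rescale them to unit length."""
    prefix = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]

def cosine(a, b):
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))

def nested_loss(loss_fn, embedding, dims):
    """Sum a per-prefix loss over several nested sizes (hypothetical helper)."""
    return sum(loss_fn(embedding[:d]) for d in dims)

# Toy 8-dimensional embeddings (made-up numbers, not from a real model).
query = [0.9, 0.1, 0.3, -0.2, 0.05, 0.02, -0.01, 0.03]
doc = [0.8, 0.2, 0.25, -0.1, -0.04, 0.01, 0.02, -0.02]

# The same vectors are compared at three different "model sizes".
for dim in (2, 4, 8):
    q = truncate_and_normalize(query, dim)
    d = truncate_and_normalize(doc, dim)
    print(dim, round(cosine(q, d), 4))
```

At inference you would pick the prefix length that fits your latency or storage budget. Note that you still store the full vector unless you truncate ahead of time.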
throwup238 164 days ago [-]
I think this might be the first step to making neural networks that actually mimic biological brains. IMO the biggest piece missing from NN architectures is a mechanism like neuroplasticity that modifies the topology of neurons. Brains reorganize themselves around the things they learn.
This paper is a long way from implementing synaptic pruning/strengthening/weakening, neurogenesis, or synaptogenesis but it’s the first one I’ve seen where the network is self optimizing.
nyrikki 164 days ago [-]
Unfortunately dendritic compartmentalization, spike timing etc are still not present. All efforts at models of SNNs that I know of have hit problems like riddled basins so far, that is what to look for to move past the limits of perceptron based networks IMHO.
As PAC learning with autograd and perceptrons is just compression, or set shattering, this paper is more of an optimization method that reduces ANN expressiveness through additional compression. Being able to control loss of precision is exciting though.
It may help in some cases, especially for practical use cases, but the potential problems with noisy loss functions, which they mention without resolving, need to be addressed.
Another example: human biological neurons can do XOR in the dendrites without hitting the soma at all.
If you haven't heard about dendritic compartmentalization and plasticity, here is a paper.
> In conclusion our results support the view that experience can drive clustered synaptic enhancement onto neuronal dendritic subcompartments, providing fundamental architecture to circuit development and function
IIAOPSW 163 days ago [-]
You piqued my curiosity, so I looked for a paper. I found something tangential but fascinating.
"Naud and Sprekeler (2018) suggest that this could be achieved using a synaptic strategy that facilitates summation for simple action potentials arriving on the basal dendrites and depresses faster burst-like events arriving on the distal tuft"
Oh, it's frequency multiplexing with a band pass filter. Same trick the analog phone system used to reduce the amount of wire needed in the network. Same problem, same solution. Convergent evolution.
I wonder if there's ways to do phreaking on neurons.
Are dendritic sub-compartments necessary to explicitly model, or does this work just imply that biological neurons are complicated and are better modeled as a multi-layered artificial network, rather than a single simple computational unit?
Similarly, do you think that spiking networks are important, or just a specific mechanism used in the brain to transmit information, which dense (or sparse) vectors of floats do in artificial neural networks?
nyrikki 163 days ago [-]
If the goal was to create an artificial neural network that better approximated the biological human brain, yes the perceptron model is insufficient.
If your goal is to produce a useful model on real hardware and it works...no
Remember the constraints of ANNs being universal approximators (in theory):
1) The function you are learning needs to be continuous
2) Your model is over a closed, bounded subset of R^n
3) The activation function is bounded and monotonic
Obviously those are the theoretical UAT constraints. For gradient descent, as typically used in real ML models, the constraint of finding only smooth approximations of continuous functions can be problematic depending on your needs.
But people leveraged phlogiston theory for beer brewing with great success and obviously Newtonian Mechanics is good enough for many tasks.
SNNs in theory should be able to solve problems that are challenging for perceptron models, but as I said, features like riddled basins are problematic so far.
> 1) The function you are learning needs to be continuous
Seems like a bad limitation when you try to model reasoning based on facts and logic, there are many things there that are just true or false and no spectrum to it. There is no "kinda true" in those circumstances, you should only get 1 or 0 and never any value between.
nyrikki 163 days ago [-]
Perceptrons are binary classifiers, that output 0 or 1, based on a threshold.
While not practical to find or use, any supervised feed-forward network is effectively a parametric linear regression.
Think of an Excel line graph, drawing lines between points, with the above the line being 'true', or when the soma fires.
That is how perceptrons work.
Single layer perceptrons cannot represent linearly inseparable functions like XOR or band pass.
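A small sketch of the XOR claim, in plain Python. The grid search only illustrates that no single threshold unit in the searched range fits XOR (the actual reason is that XOR is not linearly separable), and the two-layer weights are hand-picked for illustration rather than learned.

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def threshold_unit(w1, w2, b, x1, x2):
    """Classic perceptron: fire (1) when the weighted sum exceeds zero."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

def single_unit_solves_xor(grid):
    """Try every (w1, w2, b) on the grid; True if any setting matches XOR."""
    for w1, w2, b in product(grid, repeat=3):
        if all(threshold_unit(w1, w2, b, *x) == y for x, y in XOR.items()):
            return True
    return False

def two_layer_xor(x1, x2):
    """Two hidden units (OR and AND) feeding one output unit."""
    h_or = threshold_unit(1, 1, -0.5, x1, x2)   # fires if x1 OR x2
    h_and = threshold_unit(1, 1, -1.5, x1, x2)  # fires if x1 AND x2
    return threshold_unit(1, -1, -0.5, h_or, h_and)  # OR but not AND

grid = [i / 4 for i in range(-20, 21)]  # weights in [-5, 5], step 0.25
print(single_unit_solves_xor(grid))
print([two_layer_xor(*x) for x in XOR])
```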
A single biological neuron can use the timing of pulses, band pass, change the rate of pulses etc... before it ever reaches the soma.
Not all problems can be reduced to decision problems and not all of them can use constant depth threshold circuits, which hard attention is.
An LLM can be a very reliable threshold or majority gate as an example, but cannot generalize PARITY.
Basically statistical learning inherited the same limits of statistics.
"This statement is 'False'" is a good paradox to use as a lens.
derefr 163 days ago [-]
> reduces ANN expressiveness
But does it? It’s been my hypothesis for a while that every grad-trained NN is hauling around a lot of “nascent” nodes — nodes that were on their way to being useful, but haven’t received enough input yet to actually have their outputs be distinguishable from noise / ever influence the output. Sort of the neuroplastic equivalent of an evolutionary pre-adaptation.
If such nodes exist in NNs, they would be important to decreasing training time to learning new concepts given further training; but if there will be no more training, then they could be pruned for literally no change in expressivity (i.e. the optimality of the NN as an autoencoder of the existing training data.)
nyrikki 163 days ago [-]
Easiest way I can figure out how to explain my claim.
Consider when you use 'partial connectivity', e.g. convolution or pooling layers for local feature extraction on, say, MNIST.
While useful, those partial connection layers are explicitly used because fully connected layers do not have translational invariance.
So with a fully connected network, shifting the letter 'i' a few pixels to the right wouldn't match.
We choose to discard some of those connections for local feature detection. But the reason the fully connected model lacks translational invariance is that it maintains that position data.
Note how that is more 'expressive', even if counterproductive for the actual use case.
Another lens is the fact that neural networks have extreme simplicity bias. In that they learn only the simplest features to solve a task at hand.
If you want to recognize an i, irrespective of the translational location, that bias is useful. But you 'throw away' (in a very loose sense) the positional data to do so.
Horses for courses, not good vs bad.
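A toy 1D illustration of the point above, in plain Python. The "dense" scorer is a fixed template tied to absolute positions, while the convolution plus global max pooling responds the same wherever the pattern sits. Strictly speaking the convolution alone is only shift-equivariant; it is the pooling that discards the position. All values are made up.

```python
def dense_score(weights, signal):
    """Fully connected unit: one weight per input position."""
    return sum(w * x for w, x in zip(weights, signal))

def conv1d_valid(kernel, signal):
    """Slide the same small kernel over every position (no padding)."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def conv_then_global_max(kernel, signal):
    """Global max pooling throws away where the best match occurred."""
    return max(conv1d_valid(kernel, signal))

pattern = [1, 2, 1]
original = [0, 1, 2, 1, 0, 0, 0, 0]
shifted = [0, 0, 0, 1, 2, 1, 0, 0]   # same pattern, two steps to the right

dense_weights = original[:]  # template tuned to the original position
print(dense_score(dense_weights, original), dense_score(dense_weights, shifted))
print(conv_then_global_max(pattern, original), conv_then_global_max(pattern, shifted))
```

The dense unit keeps the positional information (and could use it), which is the sense in which it is more 'expressive' here.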
eru 163 days ago [-]
> I think this might be the first step to making neural networks that actually mimic biological brains.
I don't think that's actually a good goal. I suspect the whole term 'neural network' is just misleading and leads to these kinds of misconceptions.
'Neural networks' are mostly just matrix multiplications interleaved with some simple non-linear functions like \x -> max(0, x). Nothing biological about that.
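For anyone who hasn't seen it spelled out, a minimal sketch of "matrix multiplications interleaved with max(0, x)" in plain Python. Biases are left out and the weights are arbitrary made-up numbers.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    """Elementwise max(0, x)."""
    return [max(0.0, x) for x in vector]

def mlp_forward(layers, x):
    """Apply matvec + ReLU for every layer except the last, which stays linear."""
    *hidden, last = layers
    for weights in hidden:
        x = relu(matvec(weights, x))
    return matvec(last, x)

layers = [
    [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]],  # 2 inputs -> 3 hidden units
    [[1.0, 1.0, -1.0]],                      # 3 hidden units -> 1 output
]
print(mlp_forward(layers, [2.0, 1.0]))  # [2.5]
```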
Salgat 163 days ago [-]
If you remove the speech centers of the brain all at once, you completely and permanently lose the ability to speak or understand speech. Do the same but very slowly, and the brain is able to compensate, even if you completely lose your original speech centers in the process. NN with pruning is the same thing, where you prune iteratively while retraining to regain most of the lost regressions. If you prune too much all at once, you have to restart from scratch.
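A rough sketch of that prune-then-retrain loop on a toy linear model, in plain Python. It assumes simple magnitude pruning (drop the smallest surviving weight each round), which is one common choice and not necessarily what any particular paper does. The data and the "true" weights are synthetic.

```python
import random

def predict(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))

def mse(weights, data):
    return sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data)

def train(weights, mask, data, steps=500, lr=0.05):
    """Gradient descent on mean squared error; masked weights are held at zero."""
    weights = [w * m for w, m in zip(weights, mask)]
    for _ in range(steps):
        grads = [0.0] * len(weights)
        for x, y in data:
            err = predict(weights, x) - y
            for i, xi in enumerate(x):
                grads[i] += 2 * err * xi / len(data)
        weights = [(w - lr * g) * m for w, g, m in zip(weights, grads, mask)]
    return weights

def prune_smallest(weights, mask):
    """Magnitude pruning: disable the surviving weight closest to zero."""
    alive = [i for i, m in enumerate(mask) if m]
    victim = min(alive, key=lambda i: abs(weights[i]))
    return [0 if i == victim else m for i, m in enumerate(mask)]

rng = random.Random(0)
true_weights = [3.0, 0.1, -2.0, 0.0]  # synthetic target, two weights matter
data = []
for _ in range(64):
    x = [rng.uniform(-1, 1) for _ in range(4)]
    data.append((x, predict(true_weights, x)))

mask = [1, 1, 1, 1]
weights = train([0.0] * 4, mask, data)
for _ in range(2):
    mask = prune_smallest(weights, mask)
    weights = train(weights, mask, data)  # retrain so the rest can compensate
    print(mask, round(mse(weights, data), 4))
```

Pruning one weight per round and retraining in between is the "slowly" part; removing many weights at once skips the chance to compensate.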
TheDudeMan 163 days ago [-]
Stop trying to mimic brains. Do what works best for transistors.
sva_ 163 days ago [-]
It would be foolish not to look for inspiration in a system that had billions of years of evolution invested in it.
p1esk 163 days ago [-]
We already found the inspiration. That’s how we invented neural networks. Now we need to focus on what works.
dkersten 163 days ago [-]
How do we know that current artificial neural networks aren’t the local maximum of modelling, and there isn’t a better model (biologically inspired or otherwise) that we haven’t explored yet?
We need both to work on improving what we have that works, and to explore other avenues and inspirations (both to try entirely new things, and to improve the things we already have working in new ways). I don’t think it wise to throw out what we have working to try again with something biologically inspired, but I also don’t think it wise to say ok we’ve learned enough from biology, let’s focus purely on what we have now, when we don’t understand so much about biological brains, intelligence, and consciousness.
p1esk 162 days ago [-]
> let’s focus purely on what we have now
That’s not what I said.
Try different things, choose what works, as opposed to trying to imitate biology for the sake of imitating biology.
dkersten 159 days ago [-]
I think we should imitate biology for the sake of imitating biology, though. Alongside other approaches.
Right now, we know that biology works, because of animal and human intelligence. We don’t yet know if our other approaches have the ability to eventually lead to that.
eru 163 days ago [-]
It's good for some people to keep looking for more inspiration. Not just from brains, but also from all other aspects of the world.
Eg dropout was (allegedly) inspired by our doubled up chromosomes and evolutionary selection.
hoseja 162 days ago [-]
It's also full of legacy cruft. We can barely see three colors because some of our ancestors were nocturnal and we never regained the lost receptors [0] [1]. Look at that beautiful bird coverage of the spectrum, look at that awful barely orthogonal human M and L receptor. The blood vessels obstruct the retina in vertebrates! Methanogenesis hasn't been invented outside of Archaea! Also brain with higher cognitive functions (which everyone demands from these neural networks) is barely a couple million years old. Evolution works with copy-paste, rotate and change-single-character, and every single edit has to be compilable and viable.
Except copying biology (neural networks) has worked better than the other approaches tried so far (reasoning as search, symbolic and semantic nets, expert systems...), so stick with what's working. We have a working reference model to study, and we can build optimized hardware to match if it keeps working better than other methods.
kromem 163 days ago [-]
Or grow beyond both with optics.
wigster 164 days ago [-]
I know nothing of which I speak... but the theme reminded me of insect brains, which manage pretty extraordinary feats with relatively few neurons.
I guess random evolutionary pruning happens, and if there is no detrimental effect, cheerio.
kklisura 163 days ago [-]
> mechanism like neuroplasticity that modifies the topology of neurons
Isn't this already accomplished via weights?
dpkingma 163 days ago [-]
There seems to be some relevant prior work that is not referenced by this paper, such as our work on training sparse neural networks:
Abstract: "We propose a practical method for L0 norm regularization for neural networks: pruning the network during training by encouraging weights to become exactly zero. Such regularization is interesting since (1) it can greatly speed up training and inference, and (2) it can improve generalization. [...]"
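For readers wondering how "encouraging weights to become exactly zero" can be made trainable, here is a sketch of a stochastic gate of the hard-concrete kind that is usually associated with L0 regularization. This is written from memory under stated assumptions: the constants `BETA`, `GAMMA`, `ZETA` are commonly quoted defaults, the function names are hypothetical, and it is plain Python without any gradient machinery.

```python
import math
import random

# Assumed typical constants: temperature and stretch interval (GAMMA, ZETA).
BETA, GAMMA, ZETA = 2 / 3, -0.1, 1.1

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sample_gate(log_alpha, rng):
    """Sample a gate in [0, 1] that can be exactly 0 or exactly 1."""
    u = rng.uniform(1e-6, 1 - 1e-6)
    s = sigmoid((math.log(u) - math.log(1 - u) + log_alpha) / BETA)
    stretched = s * (ZETA - GAMMA) + GAMMA  # stretch to (GAMMA, ZETA)
    return min(1.0, max(0.0, stretched))    # clamp back into [0, 1]

def prob_nonzero(log_alpha):
    """Probability that the gate is open; differentiable in log_alpha."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))

def expected_l0(log_alphas):
    """Penalty term: expected number of weights with an open gate."""
    return sum(prob_nonzero(a) for a in log_alphas)

rng = random.Random(0)
gates = [sample_gate(0.0, rng) for _ in range(5000)]
print(round(sum(g > 0 for g in gates) / len(gates), 3), round(prob_nonzero(0.0), 3))
print(round(expected_l0([-10.0, 0.0, 10.0]), 3))
```

Each weight is multiplied by its gate, and the expected L0 term is added to the task loss, so the optimizer can trade accuracy against the number of surviving weights.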
This is super cool. It's surprising to me that it took so long for someone to try this. It seems like such an obvious idea (in hindsight). But I guess that's easy to say now that someone came up with it.
If this turns out to work well even for much larger models, then we might see loss functions that incorporate ever more specific performance metrics, conceivably even actual execution times on specific hardware.
szcs 163 days ago [-]
Author here. I actually came up with the idea a long time ago, I first experimented with variants of this in Caffe (before Tensorflow was a thing).
xpe 163 days ago [-]
There was related work that happened before, as mentioned in the paper.
Version467 163 days ago [-]
Whoops, missed that. Thanks.
spacemanspiff01 164 days ago [-]
So this was published a year and a half ago? Is there a reason it did not catch on?
svantana 164 days ago [-]
It's not really that innovative. As the paper notes, there are several similar previous works. Also, it sounds like they have done a bunch of tweaking to reduce the "irreversible forgetting" specifically for this particular dataset and network, which is not very scientific. Further testing is required to see if this method really has legs.
szcs 163 days ago [-]
Author here, I just noticed this. If you have any questions I can try answering them.
smusamashah 163 days ago [-]
Do you have plans to apply this on a small open LLM as a POC to show that there is no loss of performance?
szcs 163 days ago [-]
Yes, LLMs are planned too.
fooblaster 161 days ago [-]
I don't think I quite understand how to apply your technique to gpus, which have discrete formats like fp4, fp8, fp16, int8, etc. Can you shed some light on that?
adipandas 161 days ago [-]
Beautiful idea.
My take on it: I find it difficult to generalize the notion of layer removal when the bit depth of that layer goes to zero. It wouldn't be straightforward, although the authors provide equation 5. It feels like a lot of information is missing from this work to even reproduce it. And the authors do only one case study.
I believe some implementation is required to understand the authors completely. For example, how the optimizer is modified for a layer when it is removed during training.
bilsbie 164 days ago [-]
dynamic quantization-aware training that puts size (in bytes) of the model in the loss
> This is one of the coolest papers I've seen in a while. "Self-Compressing Neural Networks" is dynamic quantization-aware training that puts size (in bytes) of the model in the loss!
> My implementation (in @__tinygrad__):
If i can sort the lines of the matrix, which is probably defined by how the token embedding is setup, i could potentially zero out weights which do not contribute at all and have areas of zeros i could mark and skip?
octocop 163 days ago [-]
What advantages does this have over applying neural network compression methods after training?
szcs 163 days ago [-]
Here is a quick graph I just generated from data I had. Horizontal axis is average bit depth, vertical is accuracy. PTQ is Post Training Quantisation, QAT is Quantisation Aware Training. It's a simple ResNet trained on CIFAR-10, I don't have the resources to train anything bigger.
If you do it during training you automatically check that the compressed model still does what you want after compression, and you can even correct some small issues by continuing the training on the smaller model.
Edit: In general the more a compression function understands of what your goals are the better, so it is naturally advantageous to make the compression function look like training since then it is fully aware of what to optimize for.
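To illustrate what "making the compression function look like training" can mean, here is a forward-only sketch of putting model size into the loss, in plain Python. It is an assumption-laden reading of the idea discussed in this thread, not the paper's code: the quantizer form, the per-channel `(weights, bits, exponent)` layout, and the names `quantize`, `model_size_bits`, `total_loss` are all illustrative. In real training the bit depths would be continuous and the rounding would need something like a straight-through estimator so gradients can flow.

```python
def quantize(x, bits, exponent):
    """Fake-quantize x to a signed `bits`-bit integer grid with step 2**exponent."""
    if bits <= 0:
        return 0.0  # zero bits: the weight carries no information
    lo, hi = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    scaled = x / 2 ** exponent
    return 2 ** exponent * round(min(max(scaled, lo), hi))

def model_size_bits(channels):
    """channels: list of (weights, bits, exponent), one entry per output channel."""
    return sum(len(w) * max(bits, 0) for w, bits, _ in channels)

def total_loss(task_loss, channels, gamma):
    """Task loss plus gamma times the average number of bits per weight."""
    n_weights = sum(len(w) for w, _, _ in channels)
    return task_loss + gamma * model_size_bits(channels) / n_weights

# Made-up channels: 4-bit, 2-bit, and a channel whose bit depth reached zero.
channels = [
    ([0.51, -0.26, 0.13], 4, -2),
    ([0.9, 0.02, -0.4], 2, -1),
    ([0.01, -0.02, 0.0], 0, 0),
]
for weights, bits, exponent in channels:
    print(bits, [quantize(w, bits, exponent) for w in weights])
print(model_size_bits(channels), total_loss(0.3, channels, gamma=0.01))
```

A channel at zero bits always outputs zero, so it could be dropped from the network, which is where the size reduction would come from in this reading.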
alecco 163 days ago [-]
(Jan 2023)
andrewflnr 164 days ago [-]
This kind of thing, much more than LLMs, makes me worry about AGI takeoff.
tazu 164 days ago [-]
Why? Do you think lossless compression is intelligence?
Anyway, OP is about lossy compression. I can't fully follow it but they talk about techniques for mitigating loss later in the paper.
luckystarr 164 days ago [-]
Parent's thinking was probably: If you can achieve similar results with a fraction of memory/compute usage then capability at the same hardware level will increase even more.
HeatrayEnjoyer 163 days ago [-]
"Hardware overhang" is the term of art.
My meek opinion is this is obvious. Human-level intelligence requires at most 20 watts and substrate no more complicated than can be constructed from simple organic molecules in a dirty environment.
What is possible with 20 kilowatts and wafer fabricators?
andrewflnr 164 days ago [-]
It's specifically the fact that the network is directing its own optimization. Which yes, could then potentially be used to get more capability from the hardware, but that's true of manually optimized networks as well. Needing less human help is the... interesting part.
idiotsecant 164 days ago [-]
Compressing understanding (not just information) in a way that uses semantic links in information is a big part of intelligence, I'd say.
visarga 163 days ago [-]
We're doing a double search - searching for experience outside, collecting data - and searching for understanding inside, by compressing the data. Search and learn, they define both AI and us.
[0]: https://fraunhoferhhi.github.io/Self-Organizing-Gaussians/
https://www.cell.com/neuron/fulltext/S0896-6273(11)00993-7
https://www.sciencedirect.com/science/article/pii/S030645222...
https://arxiv.org/abs/1711.02160
[0] https://en.wikipedia.org/wiki/File:BirdVisualPigmentAbsorban... [1] https://en.wikipedia.org/wiki/File:Cones_SMJ2_E.svg
https://arxiv.org/abs/1712.01312
https://x.com/realGeorgeHotz/status/1819963680739512550
https://github.com/geohot/ai-notebooks/blob/master/mnist_sel...
https://i.imgur.com/yiVLljh.png