This does feel a bit like an undergrad introduction to statistical analysis, and it's surprising anyone felt the need to explain these things. But I also suspect most AI people out there nowadays have limited math skills, so maybe it’s helpful?
godelski 23 hours ago [-]
As an ML researcher who started in physics (this seems common among physics/math people turned ML people, Evan included), I cannot tell you how bad it is... One year at CVPR, when diffusion models hit the scene, I was asking people what their covariance was (I had overestimated the model complexity), and the most common answer I got was "how do I calculate that?" People do not understand things like what "pdf" means. People at top schools! I've been told I'm "gatekeeping" for saying that you should learn math (I say "you don't need math to build good models, but you do to understand why they're wrong"). Not that you need to, but you should. (I guess this explains why Mission Impossible Language Models won best paper...)
I swear, the big reason models are black boxes is because we _want_ them to be. There's a clear sentiment against people doing theory, and the result of this shows. I remember not too long ago Yi Tay (under @agihippo but main is @YiTayML) said "fuck theorists". I guess it's not a surprise Deep Mind recently hired him after that "get good" stuff.
Also, I'd like to point out, the author uses "we" but the paper only has one author on it. So may I suggest adding their cat as a coauthor? [0]
Personal sad story, but hopefully relevant: during my recent PhD I worked on a problem where I used a Dirichlet Process in my solution. That paper has been bouncing around for the past few years getting rejected from every venue I have submitted it to. My interpretation is that most reviewers (there are exceptions - too few to impact the final voting) don't understand any non-DL theory anymore and are not willing to read up for the sake of a fair review. This is based on their comments, where we have been told that our solution is complex (maybe? - but no one suggests an alternative), exposition is not clear (we have rewritten the paper a few times - we rewrite it based on comments from venue i to submit to venue i+1 - it's a wild goose chase), and in one case, someone said the paper is derivative because it uses Blackwell-MacQueen sampling; their evidence? - they skimmed through a paper we had cited that also used the sampling algorithm. This is like saying a paper is derivative because it uses SGD.
I am on the review panel of some conferences too and it is not uncommon to be assigned a paper outside of my comfort zone. That doesn't mean I cut and bail. You set aside time, read up on the area, ask authors questions, and judge accordingly. Unfortunately this doesn't happen most of the time - people seem to be in a rush to finish their review no matter the quality. At this point, we just mechanically keep resubmitting the paper every once in a while.
Sorry, end of rant :)
somethingsome 5 hours ago [-]
Just a note
> exposition is not clear (we have rewritten the paper a few times - we rewrite it based on comments from venue i to submit to venue i+1 - it's a wild goose chase)
That does not mean the paper is invalid, but maybe the storyline is difficult to follow, the results are not easy to interpret, or it is overall badly written or missing justifications.
Even if you take the reviews into account when you rewrite it, that doesn't mean the paper is clear and easy to understand.
As you noted, researchers need to read material outside of their comfort zone, and the publications have shifted in focus. Before, you could expect a reader to be familiar with the topic; now you need to educate them as clearly as possible.
I picked a random text inside the paper
> The workings of the technique itself are presented at a high-level in Figure 2.
Annoying to read.
> Instead of learning the training distribution directly, which might be expensive because of the dimensionality of the data, we first project the data down to one dimension.
Why is that good enough? Justification missing
> This is done just once, and is shown in the left panel in Figure 2. Since we are solving for classification, we pick this dimension to be a numeric indicator of how close an instance is to a class boundary.
Why is it a good indicator? Justification missing
> As a convenient proxy, we train a separate highly accurate probabilistic
Ok, references on previous research that show it can work?
So in essence, I'm not saying you need to explain everything, but the text could be clearer on the choices and why they make sense.
My gut feeling is that you know and understand what you are doing, but you are missing too many of the justifications that prove your work valuable.
I didn't read the whole thing, so maybe I'm missing the picture, but from random sampling of the text I expect the rest to follow the same pattern.
When I read the introduction, I don't want to read 'we did that and that and that', but 'there was this issue, and we solve it in this way for this reason'.
And following issue -> solution -> why should give me enough understanding of what you are trying to achieve.
Follow-up sections should refine the solutions.
aspenmayer 12 hours ago [-]
Is a preprint of your paper available?
I looked at your blog a bit and was able to find this, which may be it?
> Learning Interpretable Models Using Uncertainty Oracles
I copied the DOI for convenience but they’re the same paper.
I have no formal math background really so I can’t speak to your methods but I appreciate that you have shared your work freely.
Did you have any issues defending your thesis due to the issues you described above related to publishing?
Noticed a typo in your abstract:
“Maybe” should be “may be” in sentence below (italics):
> We show that this technique addresses the above challenges: (a) it arrests the reduction in accuracy that comes from shrinking a model (in some cases we observe ~ 100% improvement over baselines), and also, (b) that this maybe applied with no change across model families with different notions of size; results are shown for Decision Trees, Linear Probability models and Gradient Boosted Models.
abhgh 12 hours ago [-]
Yes, it did come up during my defense, but it was deemed not to be a concern since I had one prior paper [1] (the original one in this thread of work, the paper I linked above was an improvement over it), and my advisor (co-author on both papers) vouched for the quality of the work.
Thank you for pointing out the typo - will fix it!
Don't be. Issues like this are the reason I haven't defended yet. The fact that an AC didn't laugh at that "critique" is itself indicative of a problem. They're as checked out as the reviewers. I was doing work in a more mathy area and could not get assigned reviewers that understood what was being done. To try to get something through I tried a more popular domain, won a bet with my advisor that I could get SOTA on a very popular dataset in a few months, but I have no compute left. I can beat big labs on one dataset with far less compute, but how can I compete when reviewers want dozens? Even if others weren't held to that standard... There's not enough compute for that. You can always have "more experiments"
For review, I set aside hours for each paper, and more the further out of my domain that they are (I'm also very happy to increase my score with a rebuttal and mark lower confidence (I frequently write what would change my mind to help authors). My best post rebuttal ever was "The authors answered all my questions, but due to the lack of novelty I'm lowering my score"). I'll keep doing this, but to be honest, after I defend I have no intention to push to conferences or journals. I just fail to see the value. It has caused me to spend more time rewriting and taking away from research. It just makes me upset and isn't making me a better researcher. I crave for someone to actually _criticize_ my work. I have a good citation count and h-index. My best paper is "unpublished", has hundreds of citations, resulted in a very popular blog post, and years later people are still messaging me for using it in their work. I don't think I'm a top researcher, but I don't think I'm well below the pack.
I just hate that my research directions are pigeonholed. That you need to do topics that people care about. That you need to evaluate with large scale. As if we can't have conclusions beyond the empirical. As if this isn't about communicating our work. That I need to write to those that are not "peers" (niche domain experts, as opposed to broader domain experts). As if experiments aren't proxies, but are demonstrations of a product. I think this significantly slows down the progress to AGI since it causes us to railroad to build from large models from big companies, and there is so little interest in anything else. How can we explore more architectures, learning methods, and all that if we're required to get SOTA out of the gate?
I don't want to say too much about my work since it is still bouncing around in review and I don't want to dox myself. But I'll say something about a work that I __reviewed__. It was for Neural PDEs. Review was for a workshop, and it was clear to me that this paper was rejected from the main conference. What was not clear is why. Until I got to see the reviews from my peers. Their complaints had the standard "novelty" and "not well written" (it was very well written btw), but the kicker for them was that the datasets were synthetic... Like... what?! Why does that even matter? They're solving equations! Luckily they had low confidence and I got the paper through. I wasn't surprised when a few months later I stumbled upon the paper again and found out it was from Welling's group.
> At this point, we just mechanically keep resubmitting the paper every once in a while.
I really wonder how long it will take conference organizers to recognize that the noise in the review process is a significant contributor to the increasing number of submissions. This seems a rather obvious connection but I rarely hear it discussed. Not to mention that it can damage paper quality (this certainly happened to mine, and I suspect yours). Reviews can improve the papers if the review contains actual critiques. But hey, why do work when no one questions a reject?
I feel like mine was more ranty lol. But it helps to not feel alone.
mturmon 22 hours ago [-]
The front matter in Vladimir Vapnik’s book “Statistical Learning Theory” (first edition published 1995) has this quote:
> During the last few years at various computer science conferences, I heard reiteration of the following claim:
> “Complex theories do not work; simple algorithms do.”
> One of the goals of this book is to show that, at least in the problems of statistical inference, this is not true. I would like to demonstrate that in the area of science a good old principle is valid:
> “Nothing is more practical than a good theory.”
Vladimir was a friend during this time, and I think about this quote a lot with regards to ML tinkering.
godelski 15 hours ago [-]
I haven't had a chance to read that, but that quote suggests I should (especially considering the author and the editors).
I often refer to "Elephant Fitting" w.r.t. these systems. I suspect you understand this, but I think most people think it is just about overfitting. But the problem isn't about the number of parameters; it's that parameters need to be justified. As explained by Dyson here[0]. Vladimir's quote really reminds me of this. Fermi likewise was stressing the importance of theory.
I think it is a profound quote, and you were (are?) lucky to have that friendship. I do think abstraction is at the heart of intelligence. François Chollet discusses it a lot, and he's far from alone. It seems to be well agreed upon in the neuroscience and cognitive science communities. I think this is essential to understand in our path forward to developing intelligent systems, because there are plenty of problems that need to be solved in which there is no algorithmic procedure. Where there is no explicit density function. Intractable, doubly intractable, and more problems. Maybe we're just too dumb, but it's clear there are plateaus where luck is needed to advance. I do not believe our current machines would be capable of closing a gap.
As someone who had questions about some of what you said and feels legitimately scared to ask what you meant out of fear of being judged:
> I've been told I'm "gatekeeping"
I mean...when the alternative is politely (better yet - excitedly) answering the question asked? You kind of are.
> I swear, the big reason models are black boxes is because we _want_ them to be.
Talk is cheap.
> I guess it's not a surprise Deep Mind recently hired him after that "get good" stuff.
I agree "fuck theorists" is in no way constructive. But, Deep Mind has objectively helped move the field forward. And your criticism of "get good" stuff? Did you not just tell people to "learn math" rather than help them to understand it yourself? That's the _exact_ meaning of the phrase "get good" on the internet. At best you're both being about as toxic (at least from your own description).
godelski 15 hours ago [-]
> when the alternative is politely (better yet - excitedly) answering the question asked? You kind of are.
Gatekeeping is controlling access. Not to be confused with hurdles. I'm more than happy to have more people in "the party." No one is being excluded in the way that isn't also true for any other field. You unfortunately need some level of expertise to be able to understand discussions between experts. But am I stopping you from getting that expertise? No, in fact I'm very happy to lend a hand! Those aren't gates, they're hurdles. You don't need a specific PhD or to go to a good school or anything. It's about the knowledge. If you need a helping hand to get over, ask, because others may not know or may not know if you're struggling fruitlessly or struggling as part of the process of improving.
But yes, hurdles exist and they are not bad. I sure as hell don't want someone that can't do calculus designing rocket engines. And you probably don't want a rocket engineer performing surgery on you. Call them what you will, but it's not a bouncer at the door telling you you're not "pretty enough", which is what gatekeeping is generally used to refer to.
> Talk is cheap.
Sure, but we actually know a lot more about the inner workings of networks than most people realize. Sure, they aren't transparent, but that doesn't mean they are completely opaque either.
But I have no idea how to respond to this comment. What I said was fairly broad and this response is broader. Are you asking for "proof"? Of what? Interpretability? Is not the article proof of it to some degree? Or Golden Gate Claude?
> Did you not just tell people to "learn math" rather than help them to understand it yourself?
No? I think you misunderstand. Mind you, this is hacker news. Would you like some books for reference? A roadmap? If you have suggestions for how I should phrase my venting differently, I'm all ears. But it feels like that would be out of left field to just drop a bunch of random books and requires a lot of words to explain how all these things connect. I've written many "walls of text" here and frankly, anything longer than a paragraph often gets skipped over. It's fine, it's HN after all.
> you're both being about as toxic
Are you aware of the things I'm referencing? It seems like you are not. Given that, I think you should reserve your judgement and accusations until you know more about the context. (e.g.
So I will add more context to clarify my complaints, for any of those interested.
I specifically called out Mission Impossible Language Models[0], so what's that about? I suggest reading the paper. The authors create a continuum of difficulties in impossible languages. The hardest being a random word ordering. The claim is that LLMs can't learn impossible languages just as well as natural languages. It's fairly easy to understand the error in this work. They use perplexity, which is sometimes called "surprisal." You take it conditioned on the previous words and you calculate what is likely to come next. But perplexity doesn't tell you that the model didn't learn the language, or even efficiently. The metric isn't going to work for a one-to-one comparison with a structured language. The reason being that there is naturally more entropy in the impossible language. Frankly, because there are more words that are equally likely to come next. It's comparing coin flips to dice throws.
Let's use an example: our target sentence will be "My name is godelski." In a random shuffle language we have 4! (24) ways to represent that sentence that are all __equivalent__. That's the key part. In natural language, all I can think of is 2 ("Godelski, my name is" as a highly unlikely alternative). So in natural language if we have "My name" and are predicting the next word, "is" is pretty likely. But in the random language "is" is just as likely as "<name>". This isn't proof that the language isn't learned, it is just that the language isn't structured. "My name is godelski" and "My name godelski is" are equivalent sentences in a random ordering language. But actually, this gets even harder because the tokenization was trained on natural word order. If you look at Table 1 you'll see how this gets messy (notice that "bookshelf" is the tokens " books" (space intentional) "he" "lf"). The picture gets clearer when you look at how they prepared the data (it isn't shuffled each time the model gets the data, it is shuffled once and then the model is trained on that. This is not the same as the random language and unless you're really lucky, there's going to be certain patterns more common than others and so that'll just make it more difficult for the model. The dataloader should shuffle sentences, which will teach the model to ignore the patterns. You should also measure perplexity against all valid predictions, not a single one. This one is a killer for me).
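To make the counting above concrete, here is a small Python sketch. It is not code from the paper: `shuffle_perplexity_floor` and `valid_set_mass` are hypothetical helper names, and it assumes the model already knows which (distinct) words are in the sentence and only has to guess their order.

```python
import math


def shuffle_perplexity_floor(n_words: int) -> float:
    """Lowest per-token perplexity reachable on a sentence of n distinct
    words whose order is drawn uniformly at random (illustrative only)."""
    # A perfect model spreads its mass evenly over the n! equivalent
    # orderings, so the sentence costs log(n!) nats however well it learned.
    sentence_nats = math.lgamma(n_words + 1)  # log(n!)
    return math.exp(sentence_nats / n_words)


def valid_set_mass(next_token_probs: dict, remaining_words: set) -> float:
    """Probability mass the model puts on any valid continuation,
    instead of on one fixed target token."""
    return sum(p for tok, p in next_token_probs.items() if tok in remaining_words)


# After "My name", both "is" and "godelski" are valid in the shuffle language.
probs = {"is": 0.5, "godelski": 0.5}
print(shuffle_perplexity_floor(4))
print(valid_set_mass(probs, {"is", "godelski"}))
```

For the four-word example the floor is 24^(1/4), roughly 2.21 per token, even for a model that has learned the shuffle language perfectly. Scoring against the set of remaining words rather than a single fixed target removes that penalty.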
Side note:
> fear of being judged:
You're always going to be judged. Stand up for yourself and what you believe in. Don't be afraid of being wrong either. Truth is, you're always wrong. It's a matter of degree. The problem isn't being wrong, it is not being able to change your mind. Even if things get heated between people, there typically isn't ill will between them if they believe the other person is capable of changing their mind.
What's your objection to Mission Impossible Language Models?
godelski 13 hours ago [-]
I see you're one of the authors.
I disagree with the conclusions of the paper. Maybe I have some misunderstandings, and if so, please do correct me. But my reading of it is that the experiments and evaluations are insufficient to support the conclusion made. I think the results even make sense with Chomsky's claim. (I'll stick to the random shuffle for clarity)
It does not appear that the evaluations are considering all possible valid outputs for the next token. Perplexity is not actually a measure of language performance, though it is wonderful that it has worked out so well so far (I suspect due to the structure in languages). The perplexity being higher is not necessarily indicative of poorer performance. I view this as analogous to comparing sequences of coin flips (our natural language) to sequences of dice rolls (our shuffle). One naturally has more randomness than the other. A model that successfully learns the former will have lower perplexity than the model that learns the latter.
To properly evaluate we need to consider if the model is able to produce valid sentences, and consistently. With our coin and dice analogy let's assume we have a sequence of 3 events. Our model conditions on a single flip of heads and we can estimate likelihoods for the sequences HHH, HHT, HTH, HTT, THH, THT, TTH, TTT. Our successful model will tell us that the last 4 are not possible, but that the others are equally likely. Now if we compare to a dice roll, conditioned on a roll of a 1, then the model is not invalid for suggesting higher entropy. That is exactly what we want our model to do. There are just more _valid_ answers. In the same way if we're predicting (conditionally) next token, then we should expect a higher perplexity in the "more impossible" languages, but that does not tell us the success of learning the language (I would also expect these models to take longer to converge due to this, just as with coins and dice. I'll leave "learn just as well" to Chomsky, as this is ambiguous).
Entropy isn't enough. Our metric needs to be based on the mass distribution. To compare against one another, we'd have to normalize the values to their distributions. A direct comparison to one another will always lead to the random shuffle model having higher perplexity (just as with coins and dice), so it is an unfair comparison. Without the normalization we'd expect to find exactly what is shown in Figure 2.
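A quick Python sketch of the coin versus dice comparison. It assumes that "normalizing" means dividing the entropy by its maximum for that number of outcomes, which is only one reading of the suggestion above; the function names are made up for illustration.

```python
import math


def entropy_nats(probs):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def perplexity(probs):
    """Perplexity is the exponential of the entropy."""
    return math.exp(entropy_nats(probs))


def normalized_entropy(probs):
    """Entropy divided by the maximum possible for len(probs) outcomes
    (an assumed normalization, chosen here for illustration)."""
    if len(probs) < 2:
        return 0.0
    return entropy_nats(probs) / math.log(len(probs))


coin = [0.5, 0.5]
die = [1 / 6] * 6

# Raw perplexity makes the perfectly learned die look "worse" than the coin.
print(perplexity(coin), perplexity(die))
# After normalization, both perfectly learned distributions score the same.
print(normalized_entropy(coin), normalized_entropy(die))
```

The raw perplexities come out at about 2 and about 6, while the normalized entropies are both about 1, which is the sense in which a direct comparison is unfair to the higher-entropy source.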
As I understand the writing and the code, you do not compare against all valid tokens, but rather the fixed ones. I'm just seeing the perplexities counted in the usual way (I see loop over batch, but not for valid permutations). I see the line in the text
> dataset shuffling during training.
So I assume that this means the dataloader is shuffling the selected sentences? I don't see this in the code but I'm happy to trust you if you say yes. But the code makes me think this was generated beforehand (I'm having dependency issues so can't verify). But if you are generating the perturbations beforehand, then I think the results are irrelevant because you haven't been implicitly teaching the model that ordering doesn't matter. The fact that results get worse for the models without positional encoding is suggestive of concern here. If position does not matter, why does the positional information increase the model's ability to learn? It should be irrelevant to a non-deterministic shuffled language. I am also suspicious since the "no shuffle" model appears to have identical learning capabilities w.r.t. Fig 2 and 6. (I'm also seeing a lot of reference to error bars but it isn't clear to me what the variance is. Is the bar smaller than the markers? Scaling could really help here as well as placing horizontal bars at the bounds given the visualization of the markers in the legends).
As for limitations, I also suspect there's a bias introduced by tokenization, since the tokenization embedding is generated from the expected ordering. I think this adds additional complexity that could be reduced, but not eliminated, by shuffling words instead of tokens. Not eliminated because tokens depend not only on single words but on the sentences themselves. Word pairs and sequences matter.
Fwiw, I don't agree with Chomsky. Clearly LLMs are extracting structure in language and I think it is obtuse to claim that a system designed for pattern matching won't identify these patterns. One doesn't need reasoning or abstraction to converge to this, one simply needs sufficient sampling and for structure to exist. Clearly structure exists in the language, so we should expect a sufficient pattern matcher to be able to extract these patterns.
canjobear 5 hours ago [-]
Thanks for the feedback! The point about perplexity is totally valid for the nondeterministic shuffle baseline. This seems to have misled a lot of people. But for all the other baselines, we're applying some one-to-one transformation function to the original training set, and so not increasing the inherent entropy of the distribution being learned.
As for tokenization, good point: it's worth retokenizing based on the altered datasets to see if that changes anything. I think it might not, because the tokenizers we use are based on the frequency distribution of "words" identified by whitespaces, but we'll have to check.
godelski 37 minutes ago [-]
Thanks for the reply!
> one-to-one transformation function to the original training set, and so not increasing the inherent entropy of the _distribution being learned_.
I disagree. Entropy of the model? The language? The sentence? The tokens? The distinction is subtle but important. A one-to-one mapping is not structure preserving. For a trivial example: {a,b,c} -> {c,a,b} doesn't preserve alphabetical ordering. The distribution the LM is learning is that of the intractable(?) distribution of the language itself. Certainly the entropy here changes, and I believe your results demonstrate this. I think the entropy would only be the same if we preserved all structure[0], but my understanding of the impossible languages is that they remove (all) structure. I'm not sure if that'd yield worthwhile results, but I think it could be a good sanity check -- or at least assumption check -- to do deterministic permutations based on syntactic structure. E.g. replace all S,V,O -> O,V,S for all sentences.
> because the tokenizers we use are based on the frequency distribution of "words" identified by whitespaces, but we'll have to check.
Maybe I was misled by Table 1. Noting that "bookshelf" -> {books,he,lf}. That isn't whitespace delimited. The examples only show "bookshelf" being preserved in the Shuffle (s=21) and the HOP cases (well... split by the hop token). Partial reverse showing {books, [R], ., lf, he}. I think this is a good example of token bias, as our token "he" can hold multiple meanings, but I suspect that the statistics change when we consider the word structure and how these tokens will appear conditionally. I think this is a good example of where Figure 2 doesn't do a great job at proving the conclusion. The differences are small, so are they completely offset by the bias? HOP seems even more convincing as the results converge and this bias should be much more easily accounted for. What is unclear to me here is why there's a significant difference in the beginning of training. I am a bit surprised that training from scratch would have this initialization bias.
(Also, just to note, I wouldn't reject this paper were it to come across my desk, but I bias towards accepting works. I am picking on you because you won best paper and a lot of the narrative that formed around the paper)
[0] I think we have "natural experiments" here w.r.t. learning other languages and translation. Though not all structure is preserved. Some is lost and some is gained. But this again can be affected by the tokenizer and whether you are doing things like stripping accents. Clearly that removes structure, but isn't going to have a big effect on English.
runeblaze 22 hours ago [-]
Maths also mean different things. Your average number theorist or algebraic geometer will most likely not touch statistical techniques day-to-day. Reading this Anthropic article was helpful because I am constantly catching up on my statistical background
lukev 19 hours ago [-]
I don't know what it's true to suspect, since clearly a lot of very smart people are working in the field, in places.
It is empirically true that none of the industry discourse around leaderboards and benchmarks uses any of the techniques this article discusses.
fsndz 24 hours ago [-]
AI engineers just use "vibes" currently haha
Unlisted6446 18 hours ago [-]
All things considered, although I'm in favor of Anthropic's suggestions, I'm surprised that they're not recommending more (nominally) advanced statistical methods. I wonder if this is because more advanced methods don't have any benefits or if they don't want to overwhelm the ML community.
For one, they could consider using equivalence testing for comparing models, instead of significance testing. I'd be surprised if their significance tests were not significant given 10000 eval questions and I don't see why they couldn't ask the competing models 10000 eval questions?
My intuition is that multilevel modelling could help with the clustered standard errors, but I'll assume that they know what they're doing.
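For anyone unfamiliar with the equivalence testing mentioned above, here is a minimal sketch of the two one-sided tests (TOST) procedure on paired per-question score differences. It is an illustration under assumptions, not anything from the Anthropic article: the helper name `tost_paired`, the normal (z) approximation, and the equivalence margin are all choices made here, and the approximation only makes sense with a large number of questions whose differences are not all identical.

```python
import math
from statistics import NormalDist, mean, stdev


def tost_paired(diffs, margin, alpha=0.05):
    """Two one-sided z-tests on paired score differences (model A - model B).

    Returns (p_value, equivalent). `margin` is the largest difference that
    would still count as "practically the same"; picking it is a judgment call.
    """
    n = len(diffs)
    se = stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    d = mean(diffs)
    z = NormalDist()
    p_lower = 1 - z.cdf((d + margin) / se)  # H0: true difference <= -margin
    p_upper = z.cdf((d - margin) / se)      # H0: true difference >= +margin
    p_value = max(p_lower, p_upper)         # both nulls must be rejected
    return p_value, p_value < alpha


# Toy data: 1000 paired differences that average out to zero.
diffs = [0.01, -0.01] * 500
print(tost_paired(diffs, margin=0.02))    # wide margin
print(tost_paired(diffs, margin=0.0001))  # very tight margin
```

The design point is that the burden of proof flips: a significance test asks whether the models differ at all, while TOST asks whether any difference is smaller than a margin you commit to in advance.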
intended 15 hours ago [-]
Since when the heck did evals change what they referred to? Evals were what you did to check if the output of a model was correct. What happened?
ipunchghosts 22 hours ago [-]
I have been promoting this and saying it since at least 2018. You can see my publication record as evidence!!!
"Random seed xxx is all you need" was another demonstration of this need.
You actually want a Wilcoxon rank-sum test, as many metrics are not Gaussian, especially as they get to their limits!! I.e. accuracy of roughly 99 or 100! Then it becomes highly sub-Gaussian.
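As a rough illustration of the rank-sum idea, here is a standard-library Python sketch using the large-sample normal approximation. The function name `rank_sum_test` is made up, ties get average ranks but the variance is not tie-corrected, and the accuracy numbers are invented toy values, so treat it as a sketch and not a replacement for a statistics package.

```python
import math
from statistics import NormalDist


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Returns (z, p_value). Ties receive average ranks; the variance formula
    below does not apply a tie correction.
    """
    pooled = sorted((v, i) for i, v in enumerate(list(x) + list(y)))
    ranks = [0.0] * len(pooled)
    start = 0
    while start < len(pooled):
        end = start
        while end + 1 < len(pooled) and pooled[end + 1][0] == pooled[start][0]:
            end += 1
        avg_rank = (start + end) / 2 + 1  # 1-based average rank of the tie group
        for k in range(start, end + 1):
            ranks[pooled[k][1]] = avg_rank
        start = end + 1
    n1, n2 = len(x), len(y)
    w = sum(ranks[:n1])                   # rank sum of the first sample
    mu = n1 * (n1 + n2 + 1) / 2           # expected rank sum under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Invented per-seed accuracies for two models near the ceiling.
model_a = [0.991, 0.993, 0.995, 0.997, 0.999]
model_b = [0.981, 0.983, 0.985, 0.987, 0.989]
print(rank_sum_test(model_a, model_b))
```

Because the test only uses ranks, it does not care that accuracies pile up against 100%, which is the situation where a Gaussian assumption is least believable.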
[0] https://en.wikipedia.org/wiki/F._D._C._Willard
> Learning Interpretable Models Using Uncertainty Oracles
https://arxiv.org/abs/1906.06852
https://doi.org/10.48550/arXiv.1906.06852
I have no formal math background really so I can’t speak to your methods but I appreciate that you have shared your work freely.
Did you have any issues defending your thesis due to the issues you described above related to publishing?
Noticed a typo in your abstract:
“Maybe” should be “may be” in sentence below (italics):
> We show that this technique addresses the above challenges: (a) it arrests the reduction in accuracy that comes from shrinking a model (in some cases we observe ~ 100% improvement over baselines), and also, (b) that this maybe applied with no change across model families with different notions of size; results are shown for Decision Trees, Linear Probability models and Gradient Boosted Models.
Thank you for pointing out the typo - will fix it!
[1] https://www.frontiersin.org/journals/artificial-intelligence...
For review, I set aside hours for each paper, and more the further out of my domain that they are (I'm also very happy to increase my score with a rebuttal and mark lower confidence (I frequently write what would change my mind to help authors). My best post rebuttal ever was "The authors answered all my questions, but due to the lack of novelty I'm lowering my score"). I'll keep doing this, but to be honest, after I defend I have no intention to push to conferences or journals. I just fail to see the value. It has caused me to spend more time rewriting and taking away from research. It just makes me upset and isn't making me a better researcher. I crave for someone to actually _criticize_ my work. I have a good citation count and h-index. My best paper is "unpublished", has hundreds of citations, resulted in a very popular blog post, and years later people are still messaging me for using it in their work. I don't think I'm a top researcher, but I don't think I'm well below the pack.
I just hate that my research directions are pigeonholed. That you need to do topics that people care about. That you need to evaluate with large scale. As if we can't have conclusions beyond the empirical. As if this isn't about communicating our work. That I need to write to those that are not "peers" (niche domain experts, as opposed to broader domain experts). As if experiments aren't proxies, but are demonstrations of a product. I think this significantly slows down the progress to AGI since it causes us to railroad to build from large models from big companies, and there is so little interest in anything else. How can we explore more architectures, learning methods, and all that if we're required to get SOTA out of the gate?
I don't want to say too much about my work since it is still bouncing around in review and I don't want to dox myself. But I'll say something about a work that I __reviewed__. It was for Neural PDEs. Review was for a workshop, and it was clear to me that this paper was rejected from the main conference. What was not clear is why. Until I got to see the reviews from my peers. Their complaints had the standard "novelty" and "not well written" (it was very well written btw), but the kicker for them was that the datasets were synthetic... Like... what?! Why does that even matter? They're solving equations! Luckily they had low confidence and I got the paper through. I wasn't surprised when a few months later I stumbled upon the paper again and found out it was from Welling's group.
I really wonder how long it will take conference organizers to recognize that the noise in the review process is a significant contributor to the increasing number of submissions. This seems a rather obvious connection but I rarely hear it discussed. Not to mention that it can damage paper quality (this certainly happened to mine, and I suspect yours). Reviews can improve the papers if the review contains actual critiques. But hey, why do work when no one questions a reject?
I feel like mine was more ranty lol. But it helps to not feel alone.
*
During the last few years at various computer science conferences, I heard reiteration of the following claim:
“Complex theories do not work; simple algorithms do.”
One of the goals of this book is to show that, at least in the problems of statistical inference, this is not true. I would like to demonstrate that in the area of science a good old principle is valid:
“Nothing is more practical than a good theory.”
*
It’s seen in page xii of the front matter at: https://link.springer.com/content/pdf/bfm:978-1-4757-3264-1/...
Vladimir was a friend during this time, and I think about this quote a lot with regards to ML tinkering.
I often refer to "Elephant Fitting" w.r.t. these systems. I suspect you understand this, but I think most think it is just about overfitting. But the problem isn't about the number of parameters, but that parameters need to be justified. As explained by Dyson here[0]. Vladimir's quote really reminds me of this. Fermi likewise was stressing the importance of theory.
I think it is a profound quote, and you were (are?) lucky to have that friendship. I do think abstraction is at the heart of intelligence. François Chollet discusses it a lot, and he's far from alone. It seems to be well agreed upon in the neuroscience and cognitive science communities. I think this is essential to understand in our path forward to developing intelligent systems, because there are plenty of problems that need to be solved in which there is no algorithmic procedure. Where there is no explicit density function. Intractable, doubly intractable, and more problems. Maybe we're just too dumb, but it's clear there are plateaus where luck is needed to advance. I do not believe our current machines would be capable of closing a gap.
[0] https://www.youtube.com/watch?v=hV41QEKiMlM
> I've been told I'm "gatekeeping"
I mean...when the alternative is politely (better yet - excitedly) answering the question asked? You kind of are.
> I swear, the big reason models are black boxes are because we _want_ them to be.
Talk is cheap.
> I guess it's not a surprise Deep Mind recently hired him after that "get good" stuff.
I agree "fuck theorists" is in no way constructive. But, Deep Mind has objectively helped move the field forward. And your criticism of "get good" stuff? Did you not just tell people to "learn math" rather than help them to understand it yourself? That's the _exact_ meaning of the phrase "get good" on the internet. At best you're both being about as toxic (at least from your own description).
But yes, hurdles exist and they are not bad. I sure as hell don't want someone that can't do calculus designing rocket engines. And you probably don't want a rocket engineer performing surgery on you. Call them what you will, but it's not a bouncer at the door telling you you're not "pretty enough", which is what gatekeeping is generally used to refer to.
Sure, but we actually know a lot more about the inner workings of networks than most people realize. Sure, they aren't transparent, but that doesn't mean they are completely opaque either.
But I have no idea how to respond to this comment. What I said was fairly broad and this response is broader. Are you asking for "proof"? Of what? Interpretability? Is not the article proof of it to some degree? Or Golden Gate Claude?
No? I think you misunderstand. Mind you, this is Hacker News. Would you like some books for reference? A roadmap? If you have suggestions for how I should phrase my venting differently, I'm all ears. But it feels like that would be out of left field to just drop a bunch of random books and requires a lot of words to explain how all these things connect. I've written many "walls of text" here and frankly, anything longer than a paragraph often gets skipped over. It's fine, it's HN after all. Are you aware of the things I'm referencing? It seems like you are not. Given that, I think you should reserve your judgement and accusations until you know more about the context. (e.g.
So I will add more context to clarify my complaints, for any of those interested. I specifically called out Mission Impossible Language Models[0], so what's that about? I suggest reading the paper. The authors create a continuum of difficulties in impossible languages. The hardest being a random word ordering. The claim is that LLMs can't learn impossible languages just as well as natural languages. It's fairly easy to understand the error in this work. They use perplexity, which is sometimes called "surprisal." You take it conditioned on the previous words and you calculate what is likely to come next. But perplexity doesn't tell you that the model didn't learn the language, or even that it learned it less efficiently. The metric isn't going to work for a one-to-one comparison with a structured language. The reason being that there is naturally more entropy in the impossible language. Frankly, because there are more words that are equally likely to come next. It's comparing coin flips to dice throws.
Let's use an example: our target sentence will be "My name is godelski." In a random shuffle language we have 4! (24) ways to represent that sentence that are all __equivalent__. That's the key part. In natural language, all I can think of is 2 ("Godelski, my name is" as a highly unlikely alternative). So in natural language if we have "My name" and are predicting the next word, "is" is pretty likely. But in the random language "is" is just as likely as "<name>". This isn't proof that the language isn't learned, it is just that the language isn't structured. "My name is godelski" and "My name godelski is" are equivalent sentences in a random ordering language. But actually, this gets even harder because the tokenization was trained on natural word order. If you look at Table 1 you'll see how this gets messy (notice that "bookshelf" is the tokens " books" (space intentional) "he" "lf"). The picture gets clearer when you look at how they prepared the data (it isn't shuffled each time the model gets the data, it is shuffled once and then the model is trained on that. This is not the same as the random language and unless you're really lucky, there's going to be certain patterns more common than others and so that'll just make it more difficult for the model. The dataloader should shuffle sentences, which will teach the model to ignore the patterns. You should also measure perplexity against all valid predictions, not a single one. This one is a killer for me).
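To make the counting in this example concrete, here is a minimal Python sketch. It assumes a toy "shuffle language" in which every ordering of the four words is equally likely; the names (`orderings`, `next_word_probs`) are made up for illustration and nothing here is taken from the paper's code.

```python
from collections import Counter
from itertools import permutations

# Toy assumption: every ordering of the sentence is an equally valid rendering.
words = ("My", "name", "is", "godelski")
orderings = list(permutations(words))
num_orderings = len(orderings)  # 4! orderings


def next_word_probs(prefix):
    """Next-word distribution after `prefix` under the uniform-shuffle assumption."""
    k = len(prefix)
    counts = Counter(o[k] for o in orderings if o[:k] == tuple(prefix))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


shuffled_next = next_word_probs(("My", "name"))
print(num_orderings)
print(shuffled_next)
```

Under this assumption even a perfect model of the shuffled language has to split its probability mass between "is" and "godelski" after "My name", whereas in natural word order nearly all of the mass can go to "is".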
Side note:
You're always going to be judged. Stand up for yourself and what you believe in. Don't be afraid of being wrong either. Truth is, you're always wrong. It's a matter of degree. The problem isn't being wrong, it is not being able to change your mind. Even if things get heated between people, there typically isn't ill will between them if they believe the other person is capable of changing their mind.
Clearly, you have judged me quite harshly.
[0] https://arxiv.org/abs/2401.06416
I disagree with the conclusions of the paper. Maybe I have some misunderstandings, and if so, please do correct me. But my reading of it is that the experiments and evaluations are insufficient to support the conclusion made. I think the results even make sense with Chomsky's claim. (I'll stick to the random shuffle for clarity)
It does not appear that the evaluations are considering all possible valid outputs for the next token. Perplexity is not actually a measure of language performance, though it is wonderful that it has worked out so well so far (I suspect due to the structure in languages). The perplexity being higher is not necessarily indicative of poorer performance. I view this as analogous to comparing sequences of coin flips (our natural language) to sequences of dice rolls (our shuffle). One naturally has more randomness than the other. A model that successfully learns the former will have lower perplexity than a model that successfully learns the latter.
To properly evaluate we need to consider if the model is able to produce valid sentences, and consistently. With our coin and dice analogy let's assume we have a sequence of 3 events. Our model conditions on a single flip of heads and we can estimate likelihoods for the sequences HHH, HHT, HTH, HTT, THH, THT, TTH, TTT. Our successful model will tell us that the last 4 are not possible, but that the others are equally likely. Now if we compare to a dice roll, conditioned on a roll of a 1, then the model is not invalid for suggesting higher entropy. That is exactly what we want our model to do. There are just more _valid_ answers. In the same way if we're predicting (conditionally) next token, then we should expect a higher perplexity in the "more impossible" languages, but that does not tell us the success of learning the language (I would also expect these models to take longer to converge due to this, just as with coins and dice. I'll leave "learn just as well" to Chomsky, as this is ambiguous).
Entropy isn't enough. Our metric needs to be based on the mass distribution. To compare against one another, we'd have to normalize the values to their distributions. A direct comparison to one another will always lead to the random shuffle model having higher perplexity (just as with coins and dice), so it is an unfair comparison. Without the normalization we'd expect to find exactly what is shown in Figure 2.
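The coin-versus-dice point can be checked numerically. The sketch below is only an illustration: the "bad coin" model is invented, and dividing by the perplexity of a perfect model is one possible reading of "normalize the values to their distributions", not something spelled out above.

```python
import math


def entropy_bits(p):
    """Shannon entropy of the true distribution p, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def perplexity(p, q):
    """Perplexity of model q on data drawn from p: 2 ** cross-entropy(p, q)."""
    cross_entropy = -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)
    return 2 ** cross_entropy


def normalized_perplexity(p, q):
    """Perplexity relative to the best achievable value for p (1.0 means perfect)."""
    return perplexity(p, q) / 2 ** entropy_bits(p)


coin = [0.5, 0.5]
die = [1 / 6] * 6
bad_coin_model = [0.9, 0.1]  # hypothetical, badly miscalibrated model of the coin

ppl_perfect_coin = perplexity(coin, coin)
ppl_perfect_die = perplexity(die, die)
ppl_bad_coin = perplexity(coin, bad_coin_model)

print(ppl_perfect_coin, ppl_perfect_die, ppl_bad_coin)
print(normalized_perplexity(die, die), normalized_perplexity(coin, bad_coin_model))
```

On raw perplexity the miscalibrated coin model still looks better than a perfect model of the die; only after dividing by the floor set by each source's entropy does the perfect die model come out ahead.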
As I understand the writing and the code, you do not compare against all valid tokens, but rather the fixed ones. I'm just seeing the perplexities counted in the usual way (I see loop over batch, but not for valid permutations). I see the line in the text
So I assume that this means the dataloader is shuffling the selected sentences? I don't see this in the code but I'm happy to trust you if you say yes. But the code makes me think this was generated beforehand (I'm having dependency issues so can't verify). But if you are generating the perturbations beforehand, then I think the results are irrelevant because you haven't been implicitly teaching the model that ordering doesn't matter. The fact that results get worse for the models without positional encoding is suggestive of concern here. If position does not matter, why does the positional information increase the model's ability to learn? It should be irrelevant to a non-deterministic shuffled language. I am also suspicious since the "no shuffle" model appears to have identical learning capabilities w.r.t. Fig 2 and 6. (I'm also seeing a lot of reference to error bars but it isn't clear to me what the variance is. Is the bar smaller than the markers? Scaling could really help here as well as placing horizontal bars at the bounds given the visualization of the markers in the legends).
As for limitations, I also suspect there's a bias introduced due to tokenization, since the tokenization embedding is generated from the expected ordering. I think this adds additional complexity that could be reduced, but not eliminated, by shuffling words instead of tokens. Not eliminated because tokens are not only dependent on single words, but on the sentences themselves. Word pairs and sequences matter.
Fwiw, I don't agree with Chomsky. Clearly LLMs are extracting structure in language and I think it is obtuse to claim that a system designed for pattern matching won't identify these patterns. One doesn't need reasoning or abstraction to converge to this, one simply needs sufficient sampling and for structure to exist. Clearly structure exists in the language, so we should expect a sufficient pattern matcher to be able to extract these patterns.
As for tokenization, good point: it's worth retokenizing based on the altered datasets to see if that changes anything. I think it might not, because the tokenizers we use are based on the frequency distribution of "words" identified by whitespaces, but we'll have to check.
(Also, just to note, I wouldn't reject this paper were it to come across my desk, but I bias towards accepting works. I am picking on you because you won best paper and because of a lot of the narrative that formed around the paper)
[0] I think we have "natural experiments" here w.r.t. learning other languages and translation. Though not all structure is preserved. Some is lost and some is gained. But this again can be affected by the tokenizer and if you are doing things like stripping accents. Clearly that removes structure, but isn't going to have a big effect on English.
It is empirically true that none of the industry discourse around leaderboards and benchmarks uses any of the techniques this article discusses.
For one, they could consider using equivalence testing for comparing models, instead of significance testing. I'd be surprised if their significance tests were not significant given 10000 eval questions and I don't see why they couldn't ask the competing models 10000 eval questions?
My intuition is that multilevel modelling could help with the clustered standard errors, but I'll assume that they know what they're doing.
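For anyone unfamiliar with the equivalence-testing suggestion, here is a rough standard-library sketch of two one-sided tests (TOST) on two accuracies. Everything in it is an assumption made for illustration: the counts are invented, the 3-point margin is arbitrary, and it treats the two models' answers as independent samples with a normal approximation, even though a paired analysis would be more appropriate when both models answer the same questions.

```python
import math
from statistics import NormalDist

_NORMAL = NormalDist()


def compare_accuracies(k1, n1, k2, n2, margin):
    """Return (diff, p_difference, p_equivalence) for two accuracies k/n.

    p_difference: two-sided z-test of H0 "the accuracies are equal".
    p_equivalence: TOST p-value for H0 "|difference| >= margin".
    """
    p1, p2 = k1 / n1, k2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    p_difference = 2 * (1 - _NORMAL.cdf(abs(diff / se)))
    p_lower = 1 - _NORMAL.cdf((diff + margin) / se)  # H0: diff <= -margin
    p_upper = _NORMAL.cdf((diff - margin) / se)      # H0: diff >= +margin
    return diff, p_difference, max(p_lower, p_upper)


# Hypothetical leaderboard numbers: 86.0% vs 84.5% on 10000 questions each.
diff, p_difference, p_equivalence = compare_accuracies(
    8600, 10000, 8450, 10000, margin=0.03
)
print(f"diff={diff:.3f} p_difference={p_difference:.4f} p_equivalence={p_equivalence:.4f}")
```

With these made-up counts the difference is "significant" and, at the same time, the two models are equivalent within three points, which is the situation the comment above is worried about.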
"Random seed xxx is all you need" was another demonstration of this need.
You actually want a Wilcoxon rank-sum test, as many metrics are not Gaussian, especially as they get to their limits!! I.e. accuracy roughly 99 or 100! Then it becomes highly sub-Gaussian.
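A minimal sketch of such a rank-sum comparison, written with only the standard library so the mechanics are visible. It uses the normal approximation with a tie correction and no continuity correction, which is a simplification for small samples; the per-seed accuracies are invented, and in practice a library routine such as SciPy's `mannwhitneyu` would be the usual choice.

```python
import math
from statistics import NormalDist


def rank_sum_test(x, y):
    """Wilcoxon rank-sum / Mann-Whitney U test (two-sided, normal approximation)."""
    combined = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # tied values share their average rank
        for k in range(i, j + 1):
            ranks[k] = average_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - mean_u) / math.sqrt(var_u)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p_value


# Hypothetical accuracies over six random seeds for two models near the ceiling.
model_a = [0.991, 0.993, 0.995, 0.996, 0.998, 0.999]
model_b = [0.981, 0.984, 0.986, 0.988, 0.990, 0.992]
u_stat, p_value = rank_sum_test(model_a, model_b)
print(u_stat, p_value)
```

Because the test only uses ranks, it does not care that the accuracies are squeezed against 1.0 and skewed, which is the reason given above for preferring it to a t-test here.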