The fundamental challenge of using log probabilities to measure LLM certainty is the mismatch between how language models process information and how semantic meaning actually works. Current models analyze text token by token, in fragments that don't necessarily align with complete words, let alone complex concepts or ideas.
This creates a gap between the mechanical measurement of certainty and true understanding, much like mistaking the map for the territory or confusing the finger pointing at the moon with the moon itself.
I've done some work in this space before, trying to derive useful measures from the logprobs, such as Shannon entropy over a sliding window, or even bzip compression ratio as a proxy for information density. But I didn't find anything semantically useful or reliable to exploit.
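For concreteness, those two measures can be sketched like this (the window size and sample values are made up; the entropy here is average surprisal of the sampled tokens, a proxy for local predictability):

```python
import bz2
import math

def sliding_window_surprisal(token_probs, window=8):
    """Mean surprisal (bits) of sampled-token probabilities over a sliding window.

    token_probs: probabilities the model assigned to each token it emitted.
    Returns one average per window position; spikes suggest locally
    'surprising' (less predictable) stretches of text.
    """
    surprisals = [-math.log2(p) for p in token_probs]
    return [
        sum(surprisals[i:i + window]) / window
        for i in range(len(surprisals) - window + 1)
    ]

def compression_ratio(text: str) -> float:
    """bzip2 compressed size / raw size, a crude information-density proxy."""
    raw = text.encode("utf-8")
    return len(bz2.compress(raw)) / len(raw)

# Made-up per-token probabilities from some generation
probs = [0.9, 0.8, 0.95, 0.4, 0.3, 0.85, 0.9, 0.7, 0.6, 0.99]
print(sliding_window_surprisal(probs, window=4))
print(compression_ratio("the quick brown fox " * 20))  # repetitive text compresses well
```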
The best approach I found was just multiple-choice questions: "Does X entail Y? Please output [A] True or [B] False." Then measure the logprobs of the next token, which should be `[A` (90%) or `[B` (10%). Then we might make a statement like: the LLM thinks there is a 90% probability that X entails Y.
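The extraction step is basically a renormalization over the allowed answer tokens. A minimal sketch (the logprob values are made up, and the `{token: logprob}` shape stands in for whatever top-logprobs field your API returns):

```python
import math

def choice_probabilities(next_token_logprobs, choices=("[A", "[B")):
    """Renormalize next-token logprobs over the allowed answer prefixes.

    next_token_logprobs: {token: logprob} for the position right after the
    prompt. Mass on tokens outside `choices` is discarded before normalizing.
    """
    mass = {c: math.exp(next_token_logprobs[c])
            for c in choices if c in next_token_logprobs}
    total = sum(mass.values())
    return {c: p / total for c, p in mass.items()}

# Hypothetical logprobs after 'Please output [A] True or [B] False:'
lp = {"[A": math.log(0.88), "[B": math.log(0.09), " I": math.log(0.02)}
print(choice_probabilities(lp))  # '[A' gets ~0.91 after renormalizing
```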
activatedgeek 18 hours ago [-]
That has been my understanding too. More generally, a verifier at the end certainly helps.
In our paper [1], we find that asking a follow-up question like "Is the answer correct?" and taking the normalized probability of the "Yes" or "No" token (or more generally any such token trained for) seems to be the best bet so far to get well-calibrated probabilities out of the model.
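A rough sketch of that normalization step (the follow-up prompt wording and the exact "Yes"/"No" token strings are assumptions; the paper's recipe may differ in detail):

```python
import math

def p_true(yes_logprob: float, no_logprob: float) -> float:
    """Normalized probability of 'Yes' given the logprobs of the 'Yes' and
    'No' tokens at the position following a self-evaluation prompt such as
    '... Is the answer correct? Answer Yes or No:'."""
    y, n = math.exp(yes_logprob), math.exp(no_logprob)
    return y / (y + n)

print(p_true(math.log(0.6), math.log(0.2)))  # ~0.75: model 75% confident
```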
In general, the log-probability of tokens is not a good indicator of anything other than satisfying the pre-training loss function of predicting the "next token" (it is likely very well calibrated on that task, though). Semantics of language are a much less tamable object, especially when we don't quite have a good way to estimate a normalizing constant, because every answer can be paraphrased in many ways and still be correct. The volume of correct answers in the generation space of a language model is just too small.
There is work that shows one way to approximate the normalizing constant via SMC [2], but I believe we are more likely to benefit from having a verifier at train-time than any other approach.
And there are stop-gap solutions to make log probabilities more reliable by only computing them on "relevant" tokens, e.g. only final numerical answer tokens for a math problem [3]. But this approach kind of side-steps the problem of actually trying to find relevant tokens. Perhaps something more in the spirit of System 2 attention which selects meaningful tokens for the generated output would be more promising [4].
You and the OP talk a lot of smack about logprobs, but we show that using them in even the simple case of dynamically truncating your cutoff point (the min_p sampler vs static top_p/top_k) leads to extreme performance improvements (especially on small models) and unlocks very high-temperature sampling (for more creativity/less slop/better synthetic data-gen): https://arxiv.org/abs/2407.01082 [1].
Indeed, ultra-high-temperature sampling should be studied in its own right. I can set top_k = 2 and temperature = system.maxint and get decent results which are extraordinarily creative (with increasing probability of token-related spelling issues as top_k goes up).
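The core min-p idea (keep only tokens whose probability is at least min_p times the top token's probability, then sample the survivors) can be sketched as below. This is an illustrative toy, not the paper's implementation, and the ordering of truncation relative to temperature scaling varies across implementations:

```python
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def min_p_sample(logits, min_p=0.1, temperature=1.0, rng=random):
    """Min-p truncation sketch: keep tokens with prob >= min_p * p_max on the
    base distribution, then sample among survivors with temperature applied.
    High temperature flattens the survivors but can never resurrect the
    long tail that min-p already cut off."""
    probs = softmax(logits)
    threshold = min_p * max(probs)
    keep = [i for i, p in enumerate(probs) if p >= threshold]
    scaled = softmax([logits[i] / temperature for i in keep])
    r, acc = rng.random(), 0.0
    for i, p in zip(keep, scaled):
        acc += p
        if r <= acc:
            return i
    return keep[-1]

# Even at an absurd temperature, only the two plausible tokens survive
print(min_p_sample([10.0, 9.5, 0.0, -5.0], min_p=0.5, temperature=1e6))
```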
I'm convinced that the model's logprobs hold so much bloody value and knowledge that I unironically do not care about how many "theoretical guarantees" they lack or about their non-correspondence to our usage of language.
[1]: Btw, this paper is now ICLR 2025 accepted and likely going to get an oral/honorable mention since we are ranked #18 out of all submissions by scores and have extremely favorable meta-review. Peer review seems to agree with our claims of extreme performance improvements.
activatedgeek 15 hours ago [-]
Congratulations on the strong reception of min-p. Very clever!
We may be talking about two orthogonal things here. And also to be clear, I don't care about theoretical guarantees either.
Now, min-p is solving for the inadequacies of standard sampling techniques. It is almost like a clever adaptive search which other sampling methods fail at (despite truncations like top-k/top-p).
However, one thing that I noticed in the min-p results was that lower temperatures were almost always better in the final performance (and quite expectedly the inverse for creative writing). This observation makes me think that the underlying model is generally fairly good at ranking the best tokens. What sampling allows us is a margin for error in cases where the model ranked a relevant next token not at the top, but slightly lower.
Therefore, my takeaway from min-p is that it solves for deficiencies of current samplers but its success is not in contradiction to the fact that logprobs are bad proxies for semantics. Sampling is the simplest form of search, and I agree with you that better sampling methods are a solid ingredient to extract information from logprobs.
deoxykev 17 hours ago [-]
Interesting, I had never heard about min-p until now. From what I understand, it's like a low-pass filter for the token sampling pool which boosts semantic coherence. Like removing static from the radio.
Do you have any benchmarks of min-p sampling with the new reasoning models, such as QwQ and R1?
mrciffa 17 hours ago [-]
Unfortunately, LLMs are a gigantic monster to understand. We were considering the same sliding-window approach as yours, and we will try to keep the library updated with better and more reliable approaches based on new research papers and our internal tests.
Yes, reasoning models can potentially be optimized with our uncertainty estimations. We are currently testing the library with DeepSeek R1.
itssimon 10 hours ago [-]
I ended up playing with the background animation on your website for 10 minutes, it was fun.
Folcon 18 hours ago [-]
Hey, you say in the README.md:
MIT License. See LICENSE for more information.
But the LICENSE is Apache-2.0 license.
Which is it?
mrciffa 17 hours ago [-]
Apache-2.0 is the correct one.
siliconc0w 19 hours ago [-]
It seems like it would be easy to upgrade existing benchmarks to include uncertainty as a dimension. Then if a model is less certain it could maybe spend more time reasoning or route to a bigger model.
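That routing idea can be sketched in a few lines (the model callables and the 0.75 threshold are hypothetical, standing in for any uncertainty estimate such as the normalized Yes/No probability discussed above):

```python
def route(question, small_model, big_model, threshold=0.75):
    """Escalate to a bigger model when the small model's self-reported
    confidence falls below the threshold.

    small_model / big_model: hypothetical callables returning
    (answer, confidence in [0, 1]).
    Returns (answer, which_model_answered).
    """
    answer, confidence = small_model(question)
    if confidence >= threshold:
        return answer, "small"
    answer, _ = big_model(question)
    return answer, "big"
```

The same gate could instead trigger a longer reasoning budget on the one model rather than a handoff; the threshold would need tuning against a calibration set either way.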
mrciffa 17 hours ago [-]
Exactly! Uncertainty is critical for correctly evaluating LLM performance, and we don't need reasoning models to spend thousands of tokens on simple questions.
KTibow 21 hours ago [-]
Why does the example code use a base model to generate the analysis input?
mrciffa 21 hours ago [-]
In the example I'm using the instruction tuned version of Qwen2.5-7B to generate the insights
kurisufag 21 hours ago [-]
this seems neat but you really need to work on commit messages other than "update code". it makes it harder to get a bearing on the codebase.
mrciffa 21 hours ago [-]
Oh damn, you are right. It's my first open-source project and I didn't think about it.
dleeftink 20 hours ago [-]
You'll get there! Even if a commit doesn't have particulars, just try to include the reason for making the change.
wruza 20 hours ago [-]
Not all people (and/or not in all development phases) granulate commits to something easily describable that is not “update code”. Having mass changes or flow of consciousness style refactorings in a single commit is absolutely normal.
An author doesn’t need to please a repo reader until they see a good reason to do so.
zxvkhkxvdvbdxz 18 hours ago [-]
Indeed, that's how most of my project commit logs look in the startup phase. Eventually I make a commit with an "MVP" message and then try to go from there with meaningful messages.
dleeftink 19 hours ago [-]
Agree! The 'clean commit' is an ideal, not a reality. I just know that looking back on some of my own repo's that I should've included a little more reasoning context, if only intermittently..
andreakl 20 hours ago [-]
Very interesting approach!! Which models are you currently considering integrating?
mrciffa 20 hours ago [-]
We want to integrate reasoning models as a next step because we see a lot of value in better understanding CoT behaviour (DeepSeek R1 & co).
andreakl 19 hours ago [-]
Okay, thanks, that sounds great. Have you also thought about extending the scope beyond language models?
thomastjeffery 22 hours ago [-]
On your website, "Learn More" links to a meeting invite? That's... a decision...
I think most people clicking that button would be better served by scrolling down, but that's not made very obvious.
[1]: https://arxiv.org/abs/2406.08391 [2]: https://arxiv.org/abs/2404.17546 [3]: https://arxiv.org/abs/2402.10200 [4]: https://arxiv.org/abs/2311.11829