Definitely curious, this looks very similar to Coconut, even down to the CoT encoding process in Figure 2. They go into a lot more detail though, seems like parallel innovation.
singularity2001 19 hours ago [-]
I wonder whether even the models that do emit thinking tokens in reality do most of the work within the latent space, so that the difference is only superficial.
esafak 1 days ago [-]
I'm behind on reading but don't all models use continuous embeddings to represent reasoning?
winwang 1 days ago [-]
I believe the "continuous" in Coconut means that the CoT is in the continuous latent space, instead of being on output tokens (see Fig. 1).
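Roughly, the difference looks like this - a toy sketch, not Coconut's actual code, with a GRU standing in for the transformer decoder and all names made up:

```python
# Toy contrast between token-space CoT and continuous/latent CoT.
# Ordinary CoT: each step is projected to a discrete token and re-embedded.
# Coconut-style CoT: the last hidden state is fed back directly, so the
# intermediate "thoughts" never pass through the vocabulary at all.
import torch
import torch.nn as nn

vocab, d_model = 100, 32
embed = nn.Embedding(vocab, d_model)
lm_head = nn.Linear(d_model, vocab)
model = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for a transformer decoder

x = embed(torch.randint(0, vocab, (1, 5)))           # prompt embeddings

# Token-space reasoning step: hidden -> logits -> argmax token -> embedding
h, _ = model(x)
tok = lm_head(h[:, -1]).argmax(-1)                   # a visible thinking token
next_inp_tokens = embed(tok).unsqueeze(1)

# Latent-space reasoning step: the hidden state itself becomes the next input
next_inp_latent = h[:, -1:]                          # a continuous "thought", nothing emitted

x = torch.cat([x, next_inp_latent], dim=1)           # reasoning continues in latent space
```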
moolimon 1 days ago [-]
I feel like this is the obvious next step for chain-of-thought reasoning. Excited to see work on models that try to transform the intermediate thinking-space tokens back down to language, allowing us to still try to see what's happening inside the "mind" of the LLM, if that process is even possible to map to language anymore. I also wonder what the implications of this research are for chain-of-thought reasoning with reinforcement learning, since from my understanding many of the reward mechanisms set up during reinforcement learning are built around the structure of the thought process.
Davidzheng 1 days ago [-]
WRT the last sentence: I think the recent breakthroughs come precisely from not caring at all about the CoT itself and evaluating only the end product, allowing the model to develop a method of reasoning which is not necessarily drawn from the human data distribution (this has the benefit of allowing it to collapse to a "personalized" reasoning pattern).
Syzygies 19 hours ago [-]
I'm trying to sort out whether this article is relevant to a problem I've been working on for the last few days. (It's staggering how deploying AI changes time scales.)
We need "Centaur" documentation that can efficiently transfer information formerly targeting humans, to AI. To fit within current token windows, one needs semantic compression. What data representation would be ideal for this?
This seems so obvious once you consider it that it becomes impossible to explain why OpenAI or Anthropic or Cursor or Windsurf don't offer "knowledge packs" that can transfer their documentation to AI. Of course, it's frequently the case that people who make tools don't "get" them.
My immediate need is to condense the Lean 4 website into a Claude 3.5 Sonnet context window. No AI can code in Lean 4 reliably (not that many humans either ;-) but I don't want the Lean / AI choice to be either / or.
rlupi 6 hours ago [-]
> No AI can code in Lean 4 reliably
I wonder if this is due to the nature of the language.
Lean 4 lets you redefine its syntax in ways that most other languages do not allow [1], so effectively you are dealing with a recursive language that could require a Turing-complete token representation system.
[1] https://leanprover-community.github.io/lean4-metaprogramming... What other language lets you redefine the meaning of digits? The mix of syntax + macros + elaboration makes it really flexible, but hard to treat reliably.
LLMs based on transformers are not Turing-complete (nitpick: they are, but only if you use arbitrary-precision math, which is not the case in practical implementations: https://arxiv.org/abs/1901.03429).
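For a concrete taste of that flexibility, a tiny (made-up) Lean 4 example of user-defined notation:

```lean
-- User-defined infix syntax: new surface notation that elaborates to an ordinary function.
def addMod3 (a b : Nat) : Nat := (a + b) % 3

infixl:65 " +₃ " => addMod3

#eval 2 +₃ 2   -- 1
```

Macros and elaboration go much further than this, which is part of why it's hard for tooling (or an LLM) to predict what a given piece of Lean source will mean.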
knowaveragejoe 1 days ago [-]
I'm a noob hobbyist, but in theory couldn't SAEs or similar mechanistic-interpretability constructs learn to decode the "thinking" tokens into something more closely resembling the CoT they originally encoded? Or am I completely off the mark?
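For concreteness, this is the kind of thing I mean - a minimal sparse-autoencoder sketch (purely hypothetical; `H` stands in for hidden states collected at the thinking-token positions of whatever model you're probing):

```python
# Minimal sparse autoencoder: learn a sparse, overcomplete basis for hidden states;
# interpretation would then happen on the learned features, not the raw activations.
import torch
import torch.nn as nn

d_model, d_features = 768, 3072
H = torch.randn(4096, d_model)      # placeholder for collected thinking-token hidden states

enc = nn.Linear(d_model, d_features)
dec = nn.Linear(d_features, d_model)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

for _ in range(50):                 # a few steps, just to show the objective
    f = torch.relu(enc(H))          # sparse feature activations
    recon = dec(f)
    loss = (recon - H).pow(2).mean() + 1e-3 * f.abs().mean()  # reconstruction + L1 sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()
```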
thom 1 days ago [-]
Very importantly, here they provide a way of decoding the encoded thought tokens, so you're not really losing explanatory power or debuggability. As much as OpenAI want to present hidden chain of thought as some sort of long-term advantage or safety feature, it's horrible when you want to understand how a model came to some insane conclusion.
byschii 1 days ago [-]
isn't this dangerous?
isn't the efficiency given at the expense of safety and interpretability?
https://arxiv.org/abs/2412.14093 (Alignment faking in large language models)
https://joecarlsmith.com/2024/12/18/takes-on-alignment-fakin...
PS: I'm definitely not an expert.
> isn't this dangerous? isn't the efficiency given at the expense of safety and interpretability?
Final text is only a small part of the model's thinking. It's produced from embeddings, which probably have much more in them. Each next token depends not only on the previous tokens but on all the intermediate values for all tokens. We don't know them; they are actually important and represent the inner 'thinking'. So the LLM is still a black box. The result is usually A because of B - sort of an explanation for A, but where B came from we can only guess.
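A concrete way to see how little of that internal state the visible text exposes (a small sketch using Hugging Face transformers with GPT-2 as a stand-in for any causal LM):

```python
# The sampled token is one integer; each step also produces a full stack of per-layer
# hidden states that never show up in the output text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The answer is", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

next_id = out.logits[0, -1].argmax().item()   # the one visible piece of "thinking"
print(tok.decode(next_id))
print(len(out.hidden_states), out.hidden_states[-1].shape)  # (layers + embedding) x (1, seq, 768)
```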
winwang 1 days ago [-]
Depends on whether we can interpret the final hidden layer. It's plausible we evolve models to _have_ interpretable (final/reasoning) hidden layers, just that they aren't constrained to the (same representation of the) input/output domains (i.e. tokens).
swagmoney1606 1 days ago [-]
We should always be able to clearly understand and interpret all of the thinking leading up to an action taken by an AI. What would the point be if we don't know what it's doing, just that it is doing "something"?
IshKebab 1 days ago [-]
I don't see how it is any more dangerous than the already existing black-box nature of DNNs.
nowittyusername 1 days ago [-]
The hidden tokens can be decoded to English if the user wants to see the thinking process.
patcon 1 days ago [-]
Yeah, agreed. The limits of human minds constrain language. Allowing these things to reason outside words is, to my intuition, a tactic with more abundant paths toward superintelligence, and exactly the sort of path we'll have a harder time monitoring (we'll need fancy tools to introspect instead of just watching it think).
My current thinking is that I would support a ban on this style of research. Really hard to set lines for regulation, but this feels like an easy and intuitive place to exercise caution
Davidzheng 1 days ago [-]
Reasoning in latent space is probably not needed in the end. Unless constrained by human preference/SFT data, RL spontaneously should create new additions to language to help with new reasoning methods/new concepts invented by the system.
numba888 1 days ago [-]
> RL spontaneously should create new additions to language to help with
Yes, but it may take millions of years. One of the main reasons for LLMs' success is their amazing trainability: for every input token there is a predictable target output, i.e. a loss, while most RL techniques step through 'states' one by one. For non-tokenized output we cannot predict what it should be, so it can only be trained through the next tokens, which probably makes it unstable and expensive to train and limits the length of the 'continuous' part. But it still looks like a good idea to have.
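The contrast, as a toy sketch (toy tensors, not a real model): with tokenized output every position has a supervised target, while a continuous 'thought' vector has no label to compute a loss against.

```python
# Teacher forcing gives a per-position cross-entropy target; a continuous "thought" doesn't.
import torch
import torch.nn.functional as F

vocab, seq, d = 50, 8, 16
logits = torch.randn(1, seq, vocab, requires_grad=True)  # model outputs over tokens
targets = torch.randint(0, vocab, (1, seq))              # a next-token label exists at every position
loss = F.cross_entropy(logits.view(-1, vocab), targets.view(-1))
loss.backward()                                          # dense supervision at every step

thought = torch.randn(1, d, requires_grad=True)          # continuous latent "thought"
# ...no per-step label to compare against; any learning signal has to flow back from
# whatever tokens the model eventually emits (or from an RL-style reward on the end product).
```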
pishpash 1 days ago [-]
Can definitely create new math concepts, for example.
"Let two dhdud and three otincjf be called a Uhehjfj"
vessenes 16 hours ago [-]
I’ve been thinking a bit about this lately - reasoning in latent space - especially because it looks like that’s what R1-Zero does: the researchers mention that its <think> sections switch back and forth between Chinese and English, but the <say> sections are coherent.
The paper raises a few more questions than it answers, though.
Do they hard code a certain set of CoT token types upfront to train on? While the results are good, they are not ‘great’ - other methods seem to provide better outcomes, based on their own charts.
The interpretability does not seem ‘strong’ to me either - they train decoders on latent space encodings by sort of guessing what must be going on based on text prompts.
That said, this is a fairly sweet ‘hack’ in my mind - training hidden layers to do the reasoning. I guess I’m skeptical that it’s the way forward, though. It feels like until your CoT token can specify it needs more thinking time, you’re stuck without extensibility / deep thinking when needed.
Overall, very cool. Probably not “the future”. More research in latent space reasoning would be very welcome.
another_poster 1 days ago [-]
Is “multimodal reasoning” as big a deal as it sounds? Does this technique mean LLMs can generate chains of thought that map to other modalities, such as sound and images?
ygouzerh 4 hours ago [-]
From what I understood (not an expert), it seems that's the goal: to see whether the knowledge in one modality can be translated into another. For example, if a model trained on sound could leverage knowledge of music theory, that would be quite interesting.
exclipy 1 days ago [-]
It'd be cool to see its reasoning for solving visual puzzles, as imagery.
deoxykev 1 days ago [-]
I don't think autoregressive models have a fundamental difference in reasoning capability between latent space and token space. Latent space enables abstract reasoning and pattern recognition, while token space acts both as the discrete interface for communication and as an interaction medium to extend, refine, and synthesize higher-order reasoning over latent space.
Intuitively speaking, most people think of writing as a communication tool. But it's actually also a thinking tool that helps create deeper connections over discrete thoughts, which can only occupy a fixed slice of our attention at any given time. Attentional capacity is the primary limitation - for humans and LLMs. So use the token space as extended working memory. Besides, even the Coconut paper got mediocre results. I don't think this is the way.
bravura 1 days ago [-]
I appreciate your argument, but I'd add the following nuance:
Latent space reasoning can represent and manipulate UNCERTAINTY more concisely and elegantly than token space reasoning.
nullc 16 hours ago [-]
If uncertainty is an important signal, then a model RL-conditioned to perform good CoT should be expected to learn how to encode an uncertainty side channel in its CoT.
If we're fortunate, it'll do so using language choices that would also convey uncertainty to humans. Before you complain that English expressions of uncertainty have poor precision, consider that nothing prevents the LLM from overloading them with a more precise meaning, like how "MAY" in an RFC means something much more concrete than it does in general English. Though unless somehow conditioned for it, the uncertainty signal could be something else entirely (including, perhaps, sounding more certain).
This also goes for pretty much any other side information you might hope could be conveyed.
nullc 16 hours ago [-]
Keeping the thinking interpretable makes it easier to impose conditions on it, both at runtime and as part of reinforcement. It opens the door to manually injecting relevant thoughts triggered by supervision ("I must remember to say nothing that could offend the party."), search results, or access to APIs like calculators.
Those advantages are easily worth some efficiency.
I'm skeptical of the safety/security arguments some have made. Models RL-trained while seeing their own CoT may (and in fact almost certainly will) develop hidden context embedded in their word choices that carries through data we're not aware of; the fact that the CoT appears to be English (or some other human language) doesn't mean that we necessarily really understand it.
Consider how a game of Hanabi between long-time partners might look to an outsider.
_KnighT_ 1 days ago [-]
I'm new to this topic. Can someone help me understand this sentence?
"Meanwhile, through the next-token prediction constraint, the explicit
textual symbols of the hidden representations for Heima
Encoder are aligned to the text of the corresponding special
tokens {<CoT>(k)} in vocabulary, while the hidden representations contained in hidden states of thinking tokens remain distinct and variable depending on the inputs"
I understand that they have fine-tuned the MLLM to produce, in response to each query and image input, the CoT "thinking tokens" in addition to the answer.
How does that establish an association between the thinking tokens and the original plain-English CoT statements?
The second clause seems to say that the thinking tokens encode information that is "distinct and variable depending on the inputs." Is my interpretation correct?
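My best guess at what that alignment amounts to, sketched as hypothetical code (tiny made-up dimensions, definitely not the authors' implementation): the next-token-prediction loss pins each thinking position's output to the literal special token <CoT>(k), while the hidden state at that position stays input-dependent and is what later stages actually consume. Does that reading sound right?

```python
# At each thinking position, the LM head is supervised to predict the special token id,
# so its textual "symbol" is fixed; the hidden state feeding that prediction still varies
# with the query/image, and that state is the actual "thinking token".
import torch
import torch.nn.functional as F

vocab, d_model, k = 1000, 64, 4
cot_token_ids = torch.arange(vocab - k, vocab)   # made-up ids for <CoT>(1..k)

hidden = torch.randn(1, k, d_model)              # input-dependent hidden states at thinking positions
lm_head = torch.nn.Linear(d_model, vocab)

logits = lm_head(hidden)                                              # (1, k, vocab)
align_loss = F.cross_entropy(logits.view(-1, vocab), cot_token_ids)   # push each position toward its <CoT>(k)
# `hidden` is what downstream modules would read; the loss only pins down its projection.
```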
gunalx 1 days ago [-]
I would be interested in seeing how a combined latent-space and traditional GRPO CoT approach would perform vs. just one of the two.
My intuition is still that latent space would be better at emulating larger models with fewer params, and CoT would help refine the output after the latent-space stage.
Combined, it would kind of be able to think about a problem: throw down a draft, then refine it.
aradox66 1 days ago [-]
Could someone ELI5? It sounds like they generate a compressed token which represents a whole "thought" rather than elaborating the entire "thought" in actual language. Is that right?
ipunchghosts 1 days ago [-]
Currently, when AI models solve problems, they write out long chains of thought (like showing their work in math). While helpful, this takes up a lot of computing power.
Heima does something clever - instead of writing out long explanations, it compresses each step of thinking into a single "thinking token." Think of it like using a shorthand symbol instead of writing out a full sentence.
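A toy before/after of the sequence the model processes (numbers and token names invented for illustration, not taken from the paper):

```python
# Before: the reasoning is spelled out token by token.
explicit_cot = (
    "Step 1: the image shows 3 red and 4 blue marbles. "
    "Step 2: 3 + 4 = 7. "
    "Step 3: so there are 7 marbles in total."
)
# After: one placeholder "thinking token" per reasoning step; the content lives in
# the hidden state at each of these positions instead of in the text.
compressed_cot = ["<CoT_1>", "<CoT_2>", "<CoT_3>"]

print(len(explicit_cot.split()), "words vs", len(compressed_cot), "thinking tokens")
```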
Ancapistani 1 days ago [-]
I've been doing a lot of introspection lately about how I think. I lack the terminology here unfortunately, but your description here sounds super familiar:
> instead of writing out long explanations, it compresses each step of thinking into a single "thinking token." Think of it like using a shorthand symbol instead of writing out a full sentence.
I have clear memories of how cognition worked for me before I understood spoken language. I recall thinking in concepts - kind of a weird mix of forms, motions, and intent. I know this sounds metaphysical, but that's not my intent. I just don't have the words to explain it.
I wish I did, though, because my very early memories of self-awareness certainly seem to map well onto the current state of AI development.
fzzzy 1 days ago [-]
Plenty of people don't think with an internal monologue or internal imagery.
antirez 1 days ago [-]
Cool, but isn't this encoding a potentially very long thinking process into a fixed-size embedding? Intuitively, that shouldn't work as well.
pishpash 22 hours ago [-]
That's already the case with visible text. There's an embedding inside the model as it spits out the next token.
jakobschwich 1 days ago [-]
Seems like a promising next step.
[1] https://arxiv.org/abs/2412.06769 (Coconut: Training Large Language Models to Reason in a Continuous Latent Space)
We need "Centaur" documentation that can efficiently transfer information formerly targeting humans, to AI. To fit within current token windows, one needs semantic compression. What data representation would be ideal for this?
This seems so obvious once you consider it, it becomes impossible to explain why OpenAI or Anthropic or Cursor or Windsurf don't offer "knowledge packs" that can transfer their documentation to AI. Of course, it's frequently the case that people who make tools don't "get" them.
My immediate need is to condense the Lean 4 website into a Claude 3.5 Sonnet context window. No AI can code in Lean 4 reliably (not that many humans either ;-) but I don't want the Lean / AI choice to be either / or.
I wonder if this is due to the nature of the language.
Lean 4 lets you redefine its syntax in ways that most other languages do not allow[1], so effectively you are dealing with a recursive language, that could require a Touring complete token representation system.
[1] https://leanprover-community.github.io/lean4-metaprogramming... What other language lets you redefine the meaning of digits? The mix of syntax + macros + elaboration makes it really flexible, but hard to treat reliably.
LLMs based on transformers are not Touring complete (nitpick: they are but only if you use arbitrary precision math, which is not the case in practical implementation https://arxiv.org/abs/1901.03429).
https://arxiv.org/abs/2412.14093 (Alignment faking in large language models)
https://joecarlsmith.com/2024/12/18/takes-on-alignment-fakin...
PS I m definitely not an expert
Final text is only a small part of model's thinking. It's produced from embeddings which probably have much more in them. Each next token depends not only on previous, but all the intermediate values for all tokens. We don't know them, they are actually important and represent inner 'thinking'. So, LLM is still a black box. The result is usually A because of B. Sort of explanation for A, but where B came from we can only guess.
My current thinking is that I would support a ban on this style of research. Really hard to set lines for regulation, but this feels like an easy and intuitive place to exercise caution
Yes, but it may take millions of years. One of the main reasons of LLMs success is their amazing trainability. For every input token it produces predictable output. I.e. loss. While most RL techniques go one by one 'state'. For not tokenized output we cannot predict what it should be. Thus it can be trained only through the next tokens. Which makes it probably unstable and expensive to train, limiting the length of 'continuous' part. But looks like it's still a good idea to have.
"Let two dhdud and three otincjf be called a Uhehjfj"
The paper raises a few more questions than it answers, though.
Do they hard code a certain set of CoT token types upfront to train on? While the results are good, they are not ‘great’ - other methods seem to provide better outcomes, based on their own charts.
The interpretability does not seem ‘strong’ to me either - they train decoders on latent space encodings by sort of guessing what must be going on based on text prompts.
That said, this is a fairly sweet ‘hack’ in my mind - training hidden layers to do the reasoning. I guess I’m skeptical that it’s the way forward, though. It feels like until your CoT token can specify it needs more thinking time, you’re stuck without extensibility / deep thinking when needed.
Overall, very cool. Probably not “the future”. More research in latent space reasoning would be very welcome.
Intuively speaking, most people think of writing as a communication tool. But actually it's also a thinking tool that helps create deeper connections over discrete thoughts which can only occupy a fixed slice of our attention at any given time. Attentional capacity the primary limitation-- for humans and LLMs. So use the token space as extended working memory. Besides, even the Coconut paper got mediocre results. I don't think this is the way.
Latent space reasoning can represent and manipulate UNCERTAINTY more concisely and elegantly than token space reasoning.
If we're fortunate it'll do so using language choice that would also convey uncertainty to humans. Before you complain that English uncertainty has poor precision, consider that nothing prevents the LLM from overloading it with a more precise meaning. Like how "MAY" in an RFC means something much more concrete than in general English. Though unless somehow conditioned for it the uncertainty signal could be something else entirely (including, perhaps, sounding more certain).
This also goes for pretty much any other side information you might hope could be conveyed.
Those advantages are easily worth some efficiency.
I'm skeptical of the safety/security arguments some have made. Models RL trained seeing their own COT may (and in fact almost certainly) will develop hidden context embedded into their word choices that carry through data that we're not aware of, the fact that the COT appears to be English (or some other human language) doesn't mean that we necessarily really understand it.
Consider how a game of Hanabi between long time partners might look to an outsider.
"Meanwhile, through the next-token prediction constraint, the explicit textual symbols of the hidden representations for Heima Encoder are aligned to the text of the corresponding special tokens {<CoT>(k)} in vocabulary, while the hidden representations contained in hidden states of thinking tokens remain distinct and variable depending on the inputs"
I understand that they have fine-tuned the MLLM to produce, in response to each query and image input, the CoT "thinking tokens" in addition to the answer.
How does that establish an association between the thinking tokens and the original plain-English CoT statements?
The second clause seems to say that the thinking tokens encode information that is "distinct and variable depending on the inputs." Is my interpretation correct?
My intuition is still that latent space would be better at emulating larger models with fewer params, and cot helping refining the output after latent space.
Combined it would kinda being able to think about a problem. Throw down a draft then refine it.
Heima does something clever - instead of writing out long explanations, it compresses each step of thinking into a single "thinking token." Think of it like using a shorthand symbol instead of writing out a full sentence.
> instead of writing out long explanations, it compresses each step of thinking into a single "thinking token." Think of it like using a shorthand symbol instead of writing out a full sentence.
I have clear memories of how cognition worked for me before I understood spoken language. I recall thinking in concepts - kind of a weird mix of forms, motions, and intent. I know this sounds metaphysical, but that's not my intent. I just don't have the words to explain it.
I wish I did, though, because my very early memories of self-awareness certainly seem to map well onto the current state of AI development.