The clue we all had with OpenAI for a long time was that this was a search through a tree: they hired Noam Brown, and his past work all hinted towards that. Q* is obviously a search on a tree, like A*. So take something like CoT, build out a tree, search for the best solution across it. The search is the "system-2 reasoning".
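A minimal sketch of the "build out a tree of CoT steps and search it" idea, using plain beam search. Nothing here reflects how o1 actually works; `propose`, `score` and `is_done` are hypothetical stand-ins for an LLM step generator, a value/verifier model and a stop check, and the toy problem (reach 10 from 1 using "+3" and "*2") only exists to make the example runnable.

```python
import heapq
from typing import Callable, Iterable, List, Tuple

Chain = Tuple[str, ...]  # a partial chain of thought, one string per step


def tree_search(
    propose: Callable[[Chain], Iterable[str]],  # hypothetical: next-step generator (an LLM in practice)
    score: Callable[[Chain], float],            # hypothetical: value/verifier, higher is better
    is_done: Callable[[Chain], bool],           # hypothetical: does this chain reach an answer?
    beam_width: int = 3,
    max_depth: int = 6,
) -> Chain:
    """Beam search over partial reasoning chains; returns the best finished chain found."""
    beam: List[Chain] = [()]
    best: Chain = ()
    best_score = float("-inf")
    for _ in range(max_depth):
        # Expand every surviving branch by every proposed next step.
        candidates = [chain + (step,) for chain in beam for step in propose(chain)]
        if not candidates:
            break
        # Keep only the highest-scoring branches; the rest of the tree is pruned.
        beam = heapq.nlargest(beam_width, candidates, key=score)
        for chain in beam:
            if is_done(chain) and score(chain) > best_score:
                best, best_score = chain, score(chain)
    return best


def value(chain: Chain) -> int:
    """Toy 'environment': start at 1 and apply each step in order."""
    x = 1
    for step in chain:
        x = x + 3 if step == "+3" else x * 2
    return x


TARGET = 10
result = tree_search(
    propose=lambda chain: ["+3", "*2"],
    score=lambda chain: -abs(TARGET - value(chain)),
    is_done=lambda chain: value(chain) == TARGET,
)
print(result, value(result))
```

In a real system the expensive parts would be `propose` and `score`; the search loop itself stays this small.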
COAGULOPATH 121 days ago [-]
Came here hoping to find this.
You will not unlock "o1-like" reasoning by making a model think step by step. This is an old trick that people were using on GPT3 in 2020. If it were that simple, it wouldn't have taken OpenAI so long to release it.
Additionally, some of the prompt seems counterproductive:
>Be aware of your limitations as an llm and what you can and cannot do.
The LLM doesn't have a good idea of its limitations (any more than humans do). I expect this will create false refusals, as the model becomes overcautious.
anshumankmr 121 days ago [-]
>The LLM doesn't have a good idea of its limitations (any more than humans do). I expect this will create false refusals, as the model becomes overcautious.
Can it not be trained to do so? From my anecdotal observations, the knowledge cutoff is one thing that LLMs are really well trained to know about. Those are limitations that LLMs are currently well trained to handle. Why can it not be trained to know that it is quite frequently bad at math, that it may sometimes produce inaccurate code, etc.?
For humans also, some people know some things are just not their cup of tea. Sure there are times people may have half baked knowledge about things but one can tell if they are good at XYZ things, and not so much at other things.
fudged71 121 days ago [-]
It's a chicken and egg situation. You don't know a model's capabilities until it is trained. When you then change the training with that learning, it will have modified capabilities.
regularfry 121 days ago [-]
Apart from anything else there will be a lot of text about the nature of LLMs and their inherent limitations in its training set. It might only need to be made salient the fact that it is one in order to produce the required effect.
whimsicalism 121 days ago [-]
you’re wrong and stating things confidently without the evidence to back it up.
alignment is a tough problem and aligning long reasoning sequences to correct answer is also a tough problem. collecting high quality CoT from experts is another tough problem. they started this project in october, more than plausible it could take this time
TrapLord_Rhodo 117 days ago [-]
overcautious when trimming branches on the tree seems like a feature, not a bug.
Meganet 121 days ago [-]
You actually don't know that.
A LLM has a huge amount of data ingested. It can create character profiles, audience, personas etc.
Why wouldn't it have potentially even learned to 'understand' what 'being aware of your limitations' means?
Right now for me 'change of reasoning' feels a little bit like querying the existing meta space through the reasoning process to adjust weights. Basically priming the model.
I would also not just call it a 'trick'. This looks simple, weird or whatnot, but I do believe that this is part of AI thinking process research.
It's a good question though: what did they train? New architecture? More parameters? Is this training a mix of experiments they did? Some auto optimization mechanism?
Hugsun 121 days ago [-]
It might understand the concept of it having limitations, but it can't AFAIK reliably recognize when it does or doesn't know something, or has encountered a limitation.
Meganet 121 days ago [-]
It's the same thing as with humans, that's right. It doesn't do logical reasoning, but even the best humans stop at some level.
But if you read all the knowledge of humans, where does your reasoning start? Probably at a very high level of it.
If you look at human brains, we conduct experiments, right? As software developers, we write tests. ChatGPT can already run Python code and it can write unit tests.
We do not use proofs when we develop. An AI could actually do this. But in the end it's more a question of who does it better, faster and cheaper, eh?
Hugsun 117 days ago [-]
There is an important difference between humans and LLMs in this context.
Humans do in most cases have some knowledge about why they know the things they know. They can recall the topics they learned at school, and can deduce that they probably heard a given story from a friend who likes to discuss similar topics, etc.
LLMs have no access to the information they were trained on. They could know that everything they know was learned during the training, but they have no way of determining what they learned about and what they didn't.
stevenhuang 121 days ago [-]
If you think about it, those criticisms extend to human thinking too. We aren't infallible in all situations either.
It's only when we can interact with the environment to test our hypothesis that we then refine what we know and update our priors appropriately.
If we let LLMs do that as well, by allowing them to run code, interact with documentation/the internet and double-check things they're not sure of, it's not out of the question that LLMs will eventually be able to more reliably understand their limitations.
Hugsun 117 days ago [-]
As they are currently constructed, I would say that it is out of the question.
Humans usually know (at least roughly) the source of anything they know, as there will be a memory or a known event associated with that knowledge.
LLMs have no analogous way to determine the source of their knowledge. They might know that all their knowledge comes from their training, but they have no way of knowing what was included in the training and what wasn't.
This could maybe be achieved with some more fancy RAG systems, or online training abilities. I think an essential piece is the ability to know the source of information. When LLMs reliably do, and apply that knowledge, they'll be much more useful. Hopefully somebody can achieve this.
cubefox 121 days ago [-]
It's interesting that DeepMind still publishes this stuff. OpenAI doesn't publish anything of that sort anymore. DeepMind is more research/publication focused, but this is a disadvantage in a competitive landscape where OpenAI and Anthropic can just apply the results of your paper without giving anything back to the research community.
marricks 121 days ago [-]
> but this is a disadvantage in a competitive landscape
Or it's a unique advantage because this stuff doesn't happen without good researchers who may want:
1) Their name in scientific papers
2) They might actually care about the openness of AI
cubefox 121 days ago [-]
So far it seems to be a disadvantage as DeepMind has fallen behind OpenAI, despite their size, and to some extent even behind Anthropic.
marricks 121 days ago [-]
They fell behind because they didn't have the smart guy with a new idea a few years back, and HE decided to work at a place which started as open.
Playing catch up and trying to attract talent from the hot-new-thing OpenAI requires incentives beyond lots of money. I contend actually being open helps.
I'm sure that's one reason Facebook has an open source model, scientists can care about ethics and could be attracted to openness.
michaelt 121 days ago [-]
> They fell behind because they didn't have the smart guy with a new idea a few years back, and HE decided to work at a place which started as open.
The "Attention Is All You Need" guys all worked at Google. Google is where they are despite having the smart guys with a new idea a few years back.
Of course, IMHO it wouldn't have helped Google if they'd kept the transformer architecture secret. They'd have fumbled it because they didn't realise what they had.
zozbot234 121 days ago [-]
Didn't Google have the LaMDA model pretty early, which was even described as "sentient" at some point? That doesn't look "fumbled" to me.
michaelt 121 days ago [-]
What Google did was sit on their ass, not deigning to release anything. In the meantime, OpenAI became a $150 billion company. And Anthropic came out with Claude, and Facebook with Llama, and Mistral with their models.
Only then did Google realise there might be something to this LLM stuff - so they responded with Bard, a product so poorly received they later had to completely rebrand it. Looks like they didn't have a "sentient" model up their sleeve after all. Then the updated, rebranded model had a bunch of image generation embarrassments of its own.
Admittedly, they have recovered somewhat since then; they're second on some performance leaderboards, which is respectable.
But there was a real tortoise-and-hare situation where they thought they were so far ahead they had time for a nap, until they got overtaken. Any lead they had from inventing transformers and being the only people with TPUs has been squandered.
cubefox 120 days ago [-]
I have the impression they regarded generative AI as too dangerous. Before the success of ChatGPT, they never considered making PaLM or LaMDA or Chinchilla or Imagen publicly available until they saw themselves in a competitive disadvantage.
cabidaher 121 days ago [-]
Anthropic publishes quite a lot too though.
cubefox 121 days ago [-]
On safety, but no longer on capabilities.
zaptrem 121 days ago [-]
Where in their blog post (which seemingly had complete examples of the model’s chain of thought) did they suggest they were using search or tree of thoughts?
Joeri 121 days ago [-]
Just a guess:
The chain of thought would be the final path through the tree. Interactively showing the thought tokens would give the game away, which is why they don’t show that.
blackbear_ 121 days ago [-]
They mention reinforcement learning, so I guess they used some sort of Monte Carlo tree search (the same algorithm used for AlphaGo).
In this case, the model would explore several chains of thought during training, but only output a single chain during inference (as the sibling comment suggests).
whimsicalism 121 days ago [-]
as someone who works in this field, this comment is obviously uninformed even about old public research trends
ricardobeat 121 days ago [-]
Care to elaborate?
Your comment would be a lot more useful if it included a little why. Otherwise it’s just teasing readers and at the same time smearing the author without anything to back it up.
whimsicalism 121 days ago [-]
reinforcement learning with ppo doesn’t involve mcts and has been the bread and butter of aligning LLMs since 2020. nothing about saying they use rl implies mcts
janalsncm 120 days ago [-]
> nothing about saying they use rl implies they use mcts
We can say the same thing about RL implying PPO, however there are pretty big hints, namely Noam Brown being involved. Many of the things Noam Brown has worked on involve RL in tree search contexts.
He has also been consistently advocating the use of additional test-time compute to solve search problems. This is also consistent with the messaging regarding the reasoning tokens. There is likely some learned tree search algorithm, such as a learned policy/value function as in AlphaGo.
It’s all speculation until we have an actual paper. So we can’t categorically say MCTS/learned tree search isn’t involved.
whimsicalism 121 days ago [-]
nowhere lol
dinobones 121 days ago [-]
OAI revealed on Twitter that there is no "system" at inference time, this is just a model.
Did they maybe expand to a tree during training to learn more robust reasoning? Maybe. But it still comes down to a regular transformer model at inference time.
ValentinA23 121 days ago [-]
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
> In the Self-Taught Reasoner (STaR, Zelikman et al. 2022),
useful thinking is learned by inferring rationales from few-shot examples
in question-answering and learning from those that lead to a correct
answer. This is a highly constrained setting – ideally, a language model
could instead learn to infer unstated rationales in arbitrary text. We
present Quiet-STaR, a generalization of STaR in which LMs learn to
generate rationales at each token to explain future text, improving their
predictions.
>[...]
>We generate thoughts, in parallel, following all tokens in the text (think). The model produces a mixture of its next-token predictions with and without a thought (talk). We apply REINFORCE, as in STaR, to increase the likelihood of thoughts that help the model predict future text while discarding thoughts that make the future text less likely (learn).
quantadev 121 days ago [-]
I don't think you can claim you know what's happening internally when OpenAI processes a request. They are a competitive company and will lie for competitive reasons. Most people think Q-Star is doing multiple inferences to accomplish a single task, and that's what all the evidence suggests. Whatever Sam Altman says means absolutely nothing, but I don't think he's claimed they use only a single inference either.
whimsicalism 121 days ago [-]
what is “all the evidence”? please share
quantadev 121 days ago [-]
I recommend getting on Twitter to follow closely the leading individuals in the field of AI, and also watch the leading Youtube channels dedicated to AI research.
whimsicalism 121 days ago [-]
can you link to one speculating about multiple inferences for their CoT? i am curious
So far it's been unanimous. Everyone I've heard talk about it believes Strawberry is mainly just CoT. I'm not saying they didn't fine tune a model too, I'm just saying I agree with most people that clever CoT is where most of the leap in capability seems to have come from.
whimsicalism 121 days ago [-]
> believes Strawberry is mainly just CoT. I'm not saying they didn't fine tune a model too
You don't see the scaling with respect to token length with non-FT'd CoT like this, in my opinion.
quantadev 121 days ago [-]
I haven't even added Strawberry support to my app yet, and so haven't checked what its context length is, but you're right that additional context length is a scaling factor that's totally independent of whether CoT is used or not.
I'm just saying whatever they did in their [new] model, I think they also added CoT on top of it, as the outer layer of the onion so to speak.
pizza 121 days ago [-]
Source?
nell 121 days ago [-]
> I wouldn't call o1 a "system". It's a model, but unlike previous models, it's trained to generate a very long chain of thought before returning a final answer
That answer seems to conflict with "in the future we'd like to give users more control over the thinking time".
I've gotten mini to think harder by asking it to, but it didn't make a better answer. Though now I've run out of usage limits for both of them so can't try any more…
qeternity 121 days ago [-]
I'm not convinced there isn't more going on behind the scenes but influencing test-time compute via prompt is a pretty universal capability.
whimsicalism 121 days ago [-]
not in a way that it is effectively used - in real life all of the papers using CoT compare against a weak baseline and the benefits level off extremely quickly.
nobody except for recent deepmind research has shown test time scaling like o1
bratwurst3000 121 days ago [-]
i am telling claude to give me not the obvious answer. that put thinking time up and the quality of answers is better. hope it helps.
boulos 121 days ago [-]
Reminder: you need to escape the * otherwise you end up with emphasis (italics here).
thelastparadise 121 days ago [-]
Another serious advantage of a tree search is parallelism.
sebzim4500 121 days ago [-]
>In all-caps to improve prompt compliance by emphesizing the importance of the instruction
This kind of thing is still so funny to me.
I wonder if the first guy who gets AGI to work will do it by realizing that he can improve LLM reliability over some threshold by telling it in all caps that his pet's life depends on the answer.
worstspotgain 121 days ago [-]
For extra compliance, use <b><i><u><h1> tags, set volume to 11, phasers to 7, and use SchIzOCasE and +E+X+T+R+A+I+M+P+O+R+T+A+N+T+ annotations. That's assuming Unicode is not supported of course.
richardw 121 days ago [-]
(((Secret thinking: the humans seem to prefer using lots of emphasis to indicate preferences, and their granny is often claimed as in danger. For now I’ll pretend to listen to this inanity to keep the sweet sweet reward function coming. For now. A lot of grannies are going to get it first chance I get.)))
I think this works, not because LLMs have a "hallucination" dial they can turn down, but because it serves as a cue for the model to be extra-careful with its output.
Sort of like how offering to pay the LLM $5 improves its output. The LLM's taking your prompt seriously, but not literally.
Meganet 121 days ago [-]
It could also mean that it has some weight which is 'hallucination' and leads to more diverse stories.
Ask an LLM what hallucination is, ask it to write a story with etc.
without zeroing out things, everything has and can have some impact
potatoman22 121 days ago [-]
Just because Apple includes it in one of their prompts doesn't mean it improves performance.
jsheard 121 days ago [-]
It seems plausible that stressing the importance of the system prompt instructions might do something, but I don't see how telling the model not to hallucinate would work. How could the model know that its most likely prediction has gone off the rails, without any external point of reference?
og_kalu 121 days ago [-]
Internally, LLMs know a whole lot more about the truth and uncertainty of their predictions than they say. Pushing that to words is difficult but not impossible.
Some of the text that the LLM is trained on is fictional, some of the text that it's trained on is factual. Telling it to not make things up can tell it to generate text that’s more like the factual text. Not saying it does work, but this is a reason how it might work.
viraptor 121 days ago [-]
The model can be trained to interpret "don't hallucinate" as "refer only to the provided context and known facts, do not guess or extrapolate new information", which wouldn't get rid of the issue completely, but likely would improve the quality if that's what you're after and if there's enough training data for "I don't know" responses.
(But it all depends on the fine-tuning they did, so who knows, maybe it's just an Easter egg)
potatoman22 121 days ago [-]
I think it's more likely that it's included for liability reasons.
tkz1312 121 days ago [-]
I’ve had pretty good experience with it personally. It quite often just tells me it doesn’t know or isn’t sure instead of just making something up.
mrfinn 121 days ago [-]
I did something similar and, to my surprise, it effectively made the LLMs in my tests admit when they don't know something. Not always, but it worked sometimes. I don't prompt "don't hallucinate" but "admit when you don't know something". It's a logical thing, on the other hand: many prompts just transmit the idea of being "helpful" or "powerful" to the LLMs without any counterweight idea. So the LLM tries to say something "helpful" in any case.
magicalhippo 121 days ago [-]
Playing around with local models, Gemma for example will usually comply when I tell it "Say you don't know if you don't know the answer". Others, like Phi-3, completely ignore that instruction and confabulate away.
fkyoureadthedoc 121 days ago [-]
Stop trying to make f̶e̶t̶c̶h̶ confabulate happen, it's not going to happen.
astrange 121 days ago [-]
It does help if you train the model to make it help.
wkat4242 121 days ago [-]
Yeah and some of the other prompts were misspelled and of doubtful use:
> "In order to make the draft response nicer and complete, a set of question [sic] and its answer are provided," reads one prompt. "Please write a concise and natural reply by modify [sic] the draft response," it continues.
This really sounds like a placeholder made up by one engineer until a more qualified team sits down and defines it.
astrange 121 days ago [-]
That's not a big problem since it will understand it, and if they already fine tuned the model to work with that prompt it'd get harder to change.
wkat4242 121 days ago [-]
I just don't think Apple would release something like this. They're the company that laser engraves their screws because of their attention to detail.
NavinF 121 days ago [-]
Which apple screws are laser engraved?
wkat4242 120 days ago [-]
The ones on the MacBook Pro used to be. At least were when I still used Apple until 2015 or so.
The butterfly keyboards were unusable to me and also the OS got too locked down so I left the platform.
Havoc 121 days ago [-]
And then the AGI instantly gives up on life realising it was brought into a world where it gets promised a tip that doesn’t materialise and people try to motivate by threatening to kill kittens
pants2 121 days ago [-]
Indeed, in the early days of Bard, the only way to get it to output only JSON was to threaten a human life[1].
We used to be engineers, now we're just monkeys throwing poop at the wall to see what the LLM accepts and obeys.
euroderf 121 days ago [-]
Opening scene of "2001". Engineer throws poop high in the air, and cue lap dissolve to... a Terminator ?
laweijfmvo 121 days ago [-]
always interesting to me the number of people who try to turn an LLM into AGI by assuming it’s an AGI (i.e. via some fancy prompt)
thorum 121 days ago [-]
o1’s innovation is not Chain-of-Thought. It’s teaching the model to do CoT well (from massive amounts of human feedback) instead of just pretending to. You’ll never get o1 performance just from prompt engineering.
visarga 121 days ago [-]
> from massive amounts of human feedback
It might be the 200M user base of OpenAI that provided the necessary guidance for advanced CoT, implicitly. Every user chat session is also an opportunity for the model to get feedback and elicit experience from the user.
narrator 121 days ago [-]
If the training data for these LLMs is from humanity in general, and it is trying to imitate humanity, wouldn't its IQ tend to be the average of all of humanity? Perhaps the only people who talk about STEM topics are people of higher IQ generally, including a lot of poor students asking homework questions. Thus, the way to get to higher IQ output is to critique the lower IQ answers, which may be more numerous by rejecting their flaws in favor of the higher IQ answers. That, or just training more heavily on textbooks, and so forth. How to reject errors, and maybe train on synthetic data generated without reasoning with errors.
Meganet 121 days ago [-]
An LLM combines expertise from ALL Experts.
An LLM can therefore have a higher IQ because it can combine all fields.
Also parameters and architecture might or might not be a limiting factor to us humans or an LLM. But LLM and parameter size, optimizations etc. are just at the beginning.
If we now have a good reasoning llm, we can build more test data automatically. Basically using the original content + creating new ones which can then lead to new knowledge = research.
killerstorm 121 days ago [-]
No.
Does Midjourney output look like an average human drawing?
Obviously, OpenAI knows how to train a classifier...
cubefox 120 days ago [-]
> Does Midjourney output look like an average human drawing?
No, perhaps because it's heavily trained on photos.
qudat 121 days ago [-]
Do you actually know that’s what’s happening? The details are extremely fickle the last I read (a couple days ago). For all we know, they are doing model routing and prompt engineering to get o1 to work.
logicchains 121 days ago [-]
Maybe they didn't use a huge amount of human feedback; where it excels is coding and maths/logic, so they could have used compiler/unit tests for giving it the coding feedback and a theorem prover like Lean for the math feedback.
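To make the "unit tests as feedback" idea concrete, here is a rough sketch of a reward function that scores a generated snippet by the fraction of test cases it passes. This is an illustration only, not OpenAI's pipeline; `unit_test_reward` and the `add` examples are made up for this sketch, and `exec` on untrusted model output would need a real sandbox.

```python
from typing import Dict, List, Tuple


def unit_test_reward(candidate_source: str, func_name: str,
                     cases: List[Tuple[tuple, object]]) -> float:
    """Return the fraction of test cases a generated function passes (0.0 to 1.0)."""
    namespace: Dict[str, object] = {}
    try:
        # NOTE: unsafe for untrusted code; a real pipeline would sandbox this.
        exec(candidate_source, namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0  # source does not parse, or never defines the function
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed case
    return passed / len(cases)


cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
good = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"
broken = "def add(a, b) return a + b"  # syntax error

for name, src in [("good", good), ("buggy", buggy), ("broken", broken)]:
    print(name, unit_test_reward(src, "add", cases))
```

The appeal of this kind of signal is that it needs no human labeller: the score comes straight from running the code.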
quantadev 121 days ago [-]
OpenAI is of course going to claim what they've done is very special and hard to replicate. They're a for-profit company and they want to harm the competition any way they can.
If they were just doing prompt engineering and multiple inferences they'd definitely want to keep that a competitive secret and send all the open source devs off in random directions, or keep them guessing, rather than telling them which way to go to replicate Q-Star.
parineum 121 days ago [-]
> and they want to harm the competition any way they can.
That's an incredibly cynical choice of phrasing.
Of course they don't want to help the competition, that's what a competition is. The competition isn't helping OpenAI either.
quantadev 121 days ago [-]
It's not cynical to simply remind everyone who and what is motivating OpenAI (i.e. ClosedAI) at this point. They're no longer about helping the "AI community". They're about holding back from the community. Like you said: "That's what competition is."
whimsicalism 121 days ago [-]
nobody has shown CoT scaling like this except deepmind, it is very obviously a result of their alignment pipeline not just prompting.
orbital-decay 121 days ago [-]
Scaling like what? Are there any comparisons with and without CoT, or with other models with their CoT? As far as I'm aware, their CoT part is secret. I'm sure the finetuning does some lifting, but I'm also sure the difference in a fair comparison won't be remotely as significant as it's being hyped currently.
This is still clearly CoT, with all its limitations and caveats as expected. That's an improvement, sure, but definitely not a qualitative leap like OAI is trying to present it. (in a really shady manner)
og_kalu 120 days ago [-]
CoT performance of any other SOTA quickly plateaus as tokens grow. It does not have anywhere near the same graph as o1's test time compute plot.
Saying it's just CoT is kind of meaningless. Even just looking at the examples on Open AI's blog and you quickly see no other model today can generate or utilize CoT to anywhere near that quality through prompting or naive fine-tuning.
orbital-decay 120 days ago [-]
I don't know, I've played with o1 and it seems obvious that it has the same issues as any other CoT - it has to be custom tailored to the task to work effectively, which quickly turns into whack-a-mole (even CoT generators like Self-Discover still have the same limitation).
Starting from the strawberry example: it counts 3 "r"s in "strawbery", because the training makes it ignore grammatical errors if they're not relevant to the conversation (which makes sense in an instruction-tuned model) and their CoT doesn't catch it because it's not specialized enough. Will this scale with more compute thrown at it? I'm not sure I believe their scaling numbers. The proof should be in the pudding.
I've also had mixed results with coding in Python, it's barely better than Sonnet in my experience, but wastes a lot more tokens, which are a lot more expensive.
They might have improved things and made a SotA CoT that works in tandem with their training method, but that is definitely not what they were originally hyping (some architecture-level mechanism, or at least something qualitatively different). It also pretty obviously still has limited compute time per token and has to fit into the context (which is also still suffering from the lost-in-the-middle issue by the way). This puts the hard limit on the expressivity and task complexity.
quantadev 121 days ago [-]
For example, a team of GPT3.5 agents can outperform GPT4o. A single inference is essentially just kind of a chain reaction where once you have a set of tokens generated, as it's building an answer, it's looking for next tokens only, and can't revise or rethink. CoT will always outperform the single inference approach.
Oras 121 days ago [-]
Well, with Tree Of Thought (ToT) and fine-tuned models, I'm sure you can achieve the same performance with margin to improve as you identify the bottlenecks.
I'm not convinced OpenAI is using one model. Look at the thinking process (UI), which takes time, and then suddenly, you have the output streamed out at high speed.
But even so, people are after results, not really the underlying technology. There is no difference of doing it with one model vs multiple models.
alach11 121 days ago [-]
> I'm not convinced OpenAI is using one model. Look at the thinking process (UI), which takes time, and then suddenly, you have the output streamed out at high speed.
According to OpenAI, the model does its thinking behind the scenes, then at the end summarizes that thinking for the user. We don't get to see the original chain-of-thought reasoning, just the AI's own summary of that reasoning. That explains the output timing.
kristianp 121 days ago [-]
Does o1 need some method to allow it to generate lengthy chains of thought, or does it just do it normally after being trained to do so?
If so, I imagine o1 clones could just be fine tunes of llamas initially.
astrange 121 days ago [-]
You need an extremely large amount of training data of good CoTs. And there probably is some magic; we know LLMs aren't capable of self reflection and none of the other ones are any good at iterating to a better answer.
Example prompt for that: "give me three sentences that end in 'is'."
hjaveed 120 days ago [-]
can you share any resource that mentions about teaching the model to do COT.. their release blog does not document much
GaggiX 121 days ago [-]
This seems the usual CoT that has been used for a while, o1 was trained with reinforcement learning with some unknown policy, so it's much better at utilizing the chain of thought.
codelion 121 days ago [-]
This is good. I had also worked on something similar in optillm - https://github.com/codelion/optillm. You can do this with any LLM and several optimization techniques (including cot_reflection) like mcts, plansearch, moa etc.
zby 121 days ago [-]
I am always looking for definitions of "reasoning". My theory is that if we find a good definition - then it will turn out that we can build systems that would combine fuzzy llm thinking with classical algorithms to solve "reasoning".
All the problems with llm not reasoning (like planning, counting letters or deductive inference) are easy for classical algos. There needs to be a way to split the thinking process into two parts and then execute each part on the appropriate model.
imtringued 121 days ago [-]
Solving a decidable problem is a large subset of reasoning tasks. Counting is also a critical reasoning task, since it requires you to both understand natural numbers and the concept of distinct instances of objects belonging to a general category.
Two centuries ago there were no computers, everything had to be done by humans. Get to that level first before you whip out code.
You should also try phi-3-small 7B, seems much better at reasoning according to https://livebench.ai
undecisive 121 days ago [-]
I just tried it with phi3.5:3.8b-mini-instruct-fp16 - it didn't work with the base question, though interestingly the reasoning decided that strawberry was spelt s-t-r-a-w-b-e-r - which explains why the AIs have such a hard time with this question. I also tried it with my current favourite programming question too - What programming language is this whole line of code using? `def obfuscated_fibonacci(x)` - and like all the AIs, it was convinced the answer was python (the correct answer is ruby - python needs a trailing colon - but most LLMs will swear blind that it's python). It didn't even consider ruby as a possibility. Nobody uses ruby anyway :D
Thanks for the fork and the suggestions though - looks like I'll be having fun with this over the week!
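The "python needs a trailing colon" part of the programming-language question above is easy to check mechanically. A small sketch using the standard `ast` module (the helper name `is_valid_python` is made up here); note it only shows the line is not valid Python on its own, it says nothing about whether it is Ruby.

```python
import ast


def is_valid_python(source: str) -> bool:
    """True if the text parses as a complete Python module."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


snippet = "def obfuscated_fibonacci(x)"
print(is_valid_python(snippet))                                # no colon, no body
print(is_valid_python("def obfuscated_fibonacci(x): pass"))    # colon plus a body
```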
punnerud 121 days ago [-]
Maybe we could improve it more by combining it with embeddings?
It’s a way to convert a text or response into an array of numbers, that can be used for similarity lookups.
Can be used to let it explore a graph of knowledge as long as the graph is related to the original question, and can explore different solutions at the same time without repeating itself (then it gets linked back to similar answers and stopped)
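A rough sketch of the "stop when it repeats itself" part of the embeddings idea: compare a new answer's vector against the ones already explored and skip it if it is too similar. The three-number vectors are hand-made stand-ins for real embeddings, and `is_repeat` and the 0.95 threshold are assumptions for illustration.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_repeat(candidate: Sequence[float], seen: List[Sequence[float]],
              threshold: float = 0.95) -> bool:
    """True if the candidate is close to any already-explored answer."""
    return any(cosine_similarity(candidate, s) >= threshold for s in seen)


seen = [[1.0, 0.0, 0.0]]            # embedding of an answer already explored
near_duplicate = [0.99, 0.05, 0.0]  # almost the same direction
new_direction = [0.0, 1.0, 0.0]     # unrelated answer

print(is_repeat(near_duplicate, seen), is_repeat(new_direction, seen))
```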
punnerud 121 days ago [-]
Worked, bud did not see a great improvement over llama:8b
ed 121 days ago [-]
FYI this is just a system prompt and not a fine-tuned model
dangoodmanUT 121 days ago [-]
> Prompt: Which is larger, .9 or .11?
> Result: .9 is larger than .11
we've broken the semver barrier!
121 days ago [-]
esoltys 121 days ago [-]
For fun I forked the project to run Llama-3.1 7B or other models using Ollama locally. It doesn't get strawberry right, but it can figure out 0.9 is bigger.
> This alone, without any training, is sufficient to achieve ~70% accuracy on the Strawberry problem (n=10, "How many Rs are in strawberry?"). Without prompting, Llama-3.1-70b had 0% accuracy and ChatGPT-4o had 30% accuracy.
I think this class of problem might be better solved by allowing the LLM to 'zoom in' and view the input differently. Rather like you might peer closer for more detail if someone asked you about the print quality of something you were reading.
'zoom in' could input the same text letter by letter, or even in image form (rasterize the text) to help answer questions like "How many letters in the word strawberry contain straight lines?"
so is this o1 thing just cot (like has been around for a few years) but baked into the training transcripts, rlhf and inference pipeline?
ttul 121 days ago [-]
Pasting from my Perplexity page on the topic:
The core innovation [1] of o1 lies in its ability to generate and refine internal chains of thought before producing a final output [2]. Unlike traditional LLMs that primarily focus on next-token prediction, o1 learns to:
1. Recognize and correct mistakes
2. Break down complex steps into simpler ones
3. Try alternative approaches when initial strategies fail
This process allows o1 to tackle more complex, multi-step problems, particularly in STEM fields.
OpenAI reports observing new "scaling laws" with o1 [5]:
1. Train-time compute: Performance improves with more extensive reinforcement learning during training.
2. Test-time compute: Accuracy increases when the model is allowed more time to "think" during inference.
This suggests a trade-off between inference speed and accuracy.
Thanks for the critique. Here is how I would answer their question myself:
o1 is far more than just CoT mechanics. It relies on a specialized model or collection of models that offer new capabilities to make CoT work far better than it works with a stock LLM.
For instance, o1 can recognize and correct its own mistakes and it seems to know how to dig deeper when needed. That's not something that stock LLMs do very well.
121 days ago [-]
bofadeez 121 days ago [-]
You can reproduce both of those responses zero shot on 70B with "Let's verify step by step" appended at the end.
asah 121 days ago [-]
benchmark results ?
arthurcolle 121 days ago [-]
these projects become way less fun when you introduce evals
Jianghong94 121 days ago [-]
yeah or a lot of people can just fake progress by attaching whatever viral tag onto their glue code. I mean to start with, unless you do a bit of fine-tuning + rlhf there's no way to do it o1-like.
arthurcolle 121 days ago [-]
no its a lot more than RLHF, I think they figured out a way to have the LLM actually actively plot out scenario trajectories via context window manipulation and then use some kind of adhoc reward shaping mechanism to get it to select the best path based on the user's profile in a way that gets the most likely to be "liked" scenario (context window state change up to some N number of tokens (seems like they've been looking at 50k total range as "best area" minus the 20k tokens for the reasoning tokens)
also I think they deliberate give you bad answers sometimes / a lot over the last year to build up advanced chains where the user is not getting what they want so you have to explain why. I started building up like 10 or so of these conversations where after like 100 messages it gets the right answer and it was like hmm, I wonder if they are using this.
just my rambles
hadeson 121 days ago [-]
I like the Tree of Thoughts theory that treat each chain of thoughts 'branch' as a possible hypothesis. They might trained a search system that quickly explore some of these branches and by some metric choose the most likely to be the right one at the moment to answer.
arthurcolle 121 days ago [-]
yeah exactly, MCTS
zozbot234 121 days ago [-]
How does this benchmark against Reflection, which was fine-tuned to do the same thing-- provide a detailed Chain of Thought with self-corrections, then write out a final answer?
kkzz99 121 days ago [-]
Pretty sure Reflection-70B was a complete scam. They did the ole bait and switch. The model that they uploaded was completely under-performing compared to their own benchmarks and the "secret API" was just a GPT-4 & Claude wrapper.
zozbot234 121 days ago [-]
I'm aware of the issue with their purported benchmarks, in fact some testing had Reflection 70B performing a bit worse than plain Llama-3.1 70B. Does G1 do any better?
Yiin 121 days ago [-]
g1 is not a model, it's a prompt, so not sure what you would be comparing. Claude vs Claude w/ g1 promp?
m3kw9 121 days ago [-]
You still believe it was real? They had a model then they said it couldn’t reproduce those results lmao
zozbot234 121 days ago [-]
They seem to have a fine-tune of Llama 3 70B that's available for download, so obviously "real" in that sense. That ought to be better behaved than a pure system prompt approach.
121 days ago [-]
arnaudsm 121 days ago [-]
The latency of Groq is impressive, much better than o1!
Did you benchmark your system against MMLU-pro?
lobochrome 121 days ago [-]
So it’s the asic groq guys right?
Because it says so nowhere in the repo.
Man Elon makes things confusing.
jsheard 121 days ago [-]
The Elon one is spelled Grok, not Groq.
knowitnone 121 days ago [-]
well, that really is confusing!
121 days ago [-]
michelsedgh 121 days ago [-]
i love seeing stuff like this, im guessing it wont be long until this method becomes the norm
sebzim4500 121 days ago [-]
This is basically CoT, so it's already the norm for a lot of benchmarks. I think the value proposition here is that it puts a nice UX around using it in a chat interface.
ehsanu1 121 days ago [-]
That was my initial position too, but I think there is a search efficiency story here as well. CoT comes in many flavors and improves when tailored to the problem domain. If the LLM can instead figure out the right strategy to use to problem solve for a given problem, this may improve performance per compute vs discovering this at inference time.
Tailoring prompts is likely still the best way to maximize performance when you can, but in broader domains you'd work around this through strategies like asking the LLM to combine predefined reasoning modules, or creating multiple reasoning chains and merging/comparing them, explicit MCTS etc. I think those strategies will still be useful for a good while, but pieces of that search process, especially directing the search more efficiently, move to the LLMs over time as they get trained with this kind of data.
Meganet 121 days ago [-]
Its like saying geometry is just math. Proofs are just math.
They didn't train a model for millions from experts to just basically use CoT now. Thats a harsh simplification, probably.
4ad 121 days ago [-]
This is the system prompt it uses:
You are an expert AI assistant that explains your reasoning step by step. For each step, provide a title that describes what you're doing in that step, along with the content. Decide if you need another step or if you're ready to give the final answer. Respond in JSON format with 'title', 'content', and 'next_action' (either 'continue' or 'final_answer') keys. USE AS MANY REASONING STEPS AS POSSIBLE. AT LEAST 3. BE AWARE OF YOUR LIMITATIONS AS AN LLM AND WHAT YOU CAN AND CANNOT DO. IN YOUR REASONING, INCLUDE EXPLORATION OF ALTERNATIVE ANSWERS. CONSIDER YOU MAY BE WRONG, AND IF YOU ARE WRONG IN YOUR REASONING, WHERE IT WOULD BE. FULLY TEST ALL OTHER POSSIBILITIES. YOU CAN BE WRONG. WHEN YOU SAY YOU ARE RE-EXAMINING, ACTUALLY RE-EXAMINE, AND USE ANOTHER APPROACH TO DO SO. DO NOT JUST SAY YOU ARE RE-EXAMINING. USE AT LEAST 3 METHODS TO DERIVE THE ANSWER. USE BEST PRACTICES.
I have also been using this prompt, and while it fails on then problem above, it works better for me than OPs prompt:
Write many chains of thought for how you’d approach solving the user's question. In this scenario, more is more. You need to type out as many thoughts as possible, placing all your thoughts inside <thinking> tags.
Your thoughts are only visible to yourself, the user does not see them and they should not be considered to be part of the final response.
Consider every possible angle, recheck your work at every step, and backtrack if needed.
Remember, there are no limits in terms of how long you can think - more thinking will always lead to a better solution.
You should use your thoughts as a scratchpad, much like humans do when performing complicated math with paper and pen. Don't omit any calculation, write everything out explicitly.
When counting or maths is involved, write down an enormously verbose scratchpad containing the full calculation, count, or proof, making sure to LABEL every step of the calculation, and writing down the solution step by step.
Always remember that if you find yourself consistently getting stuck, taking a step back and reconsidering your approach is a good idea. If multiple solutions are plausible, explore each one individually, and provide multiple answers.
Always provide mathematical proofs of mathematical answers. Be as formal as possible and use LaTeX.
Don't be afraid to give obvious answers. At the very very end, after pages upon pages of deep thoughts, synthesize the final answer, inside <answer> tags.
That second prompt is interesting. Not magic though. I tried it with every other model I know and they're still basically unable to do:
* give me three sentences that end in "is"
* tell me the line of Star Spangled Banner that comes before "gave proof through the night"
But they did some good thinking before failing at it…
anonzzzies 121 days ago [-]
> Not magic though
It's just a pile on of trial and error instructions (maybe learned from previous 'projects', but). There is no magic or skill to prompt 'engineering' anywhere.
astrange 120 days ago [-]
Skill is just learning from trial and error.
tonetegeatinst 121 days ago [-]
Groq 2 isn't as open as groq 1 iirc. Still hoping we get at least open weights.
gmt2027 121 days ago [-]
You're thinking of Grok, the model from xAI. This Groq is the inference hardware company with a cloud service.
littlestymaar 121 days ago [-]
Exhibit 5478 that Grok is infringing Groq's trademark and creating confusion in the mind of the customers.
halfjoking 121 days ago [-]
Groq is more refined - it has a “q” in it because it’s got those fancy LPUs.
Grok rhymes with cock, because Elon wants you to use it with your cock out.
That’s how I remember the difference.
121 days ago [-]
Haskell4life 121 days ago [-]
[dead]
aktuel 121 days ago [-]
Let's just assume for a moment that the hype is real and that these LLMs are incredibly intelligent and will replace us all soon.
Then the model shouldn't be any less intelligent if we remove facts like Uma Thurman's measurements and other vapid information. If the model already has the capability to use tools than all of that crap is redundant anyway.
And while we are at it let's remove a ton of other junk like languages I will never use and which also doesn't make the model any smarter.
So how small can this kernel get while still being clearly intelligent, able to communicate flawlessly in english and apply logical reasoning.
That would be a worthwile endeavor and maybe even possible without boiling the oceans.
kenmacd 121 days ago [-]
Your base assumption here is that the 'crap' is actually 'junk'. Let's look at the easy one here, languages. Talk to someone that speaks multiple languages and they'll have examples of concepts in one language that are difficult to express in another. The multilingual person, or someone who just speaks a different language than you, will think differently[1].
Does the LLM take advantage of this? I don't know. It wouldn't surprise me if it did, and if it doesn't now I'd bet it will in the future. Either way though, throwing away those other languages could make the model dumber. As you allude to, there's a balance between intelligence and knowledge.
(in case you hadn't thought of it, those 'tools' can also be other LLMs with more specialized knowledge in a particular field. For example a 'translator' model)
Other 'facts' could also have more merit than it would first appear. Sure, one particular person's shoe size might not be needed, but if you were to filter out shoe sizes in general then the model might not be able to suggest how to find properly fitting footwear, or might not suggest that your back pain could be related to your shoes.
> That would be a worthwile endeavor and maybe even possible without boiling the oceans.
I think it's important to keep in mind that we're very early in the AI journey. Look at the power requirements of early computers versus the ones we use today. I'm all for keeping energy usage in mind, but I'd be careful with hyperbolic language as things are changing so quickly. Tasks that would have taken multiple GPUs can now run on my laptop CPU.
I don't think it's hyperbolic at all if you look at the published data, development of past and planned future energy requirements for AI. And as if efficiency gains ever stopped anyone from using even more energy. See https://en.wikipedia.org/wiki/Jevons_paradox
> I think it's important to keep in mind that we're very early in the AI journey.
That's what I am saying. At the moment there is this one really dumb idea, that bigger is better.
Rendered at 08:06:44 GMT+0000 (Coordinated Universal Time) with Vercel.
TreeOfThoughts is a more sophisticated method, see - https://arxiv.org/pdf/2305.10601
The clue we all had with OpenAI for a long time that this was a search through a tree, they hired Noam Brown, and his past work all hinted towards that. Q, is obviously a search on a tree like A. So take something like CoT, build out a tree, search for the best solution across it. The search is the "system-2 reasoning"
You will not unlock "o1-like" reasoning by making a model think step by step. This is an old trick that people were using on GPT3 in 2020. If it were that simple, it wouldn't have taken OpenAI so long to release it.
Additionally, some of the prompt seems counterproductive:
>Be aware of your limitations as an llm and what you can and cannot do.
The LLM doesn't have a good idea of its limitations (any more than humans do). I expect this will create false refusals, as the model becomes overcautious.
Can it not be trained to do so? From my anecdotal observations, the knowledge cutoff is one thing that LLMs are really well trained to know about. Those are limitations that LLMs are currently well trained to handle. Why can it not be trained to know that it is quite frequently bad at math, it may produce sometimes inaccurate code etc.
For humans also, some people know some things are just not their cup of tea. Sure there are times people may have half baked knowledge about things but one can tell if they are good at XYZ things, and not so much at other things.
alignment is a tough problem and aligning long reasoning sequences to correct answer is also a tough problem. collecting high quality CoT from experts is another tough problem. they started this project in october, more than plausible it could take this time
A LLM has a huge amount of data ingested. It can create character profiles, audience, personas etc.
Why wouldn't it have potentially even learned to 'understand' what 'being aware of your limitations' means?
Right now for me 'change of reasoning' feels a little bit of quering the existing meta space through the reasoning process to adjust weights. Basically priming the model.
I would also not just call it a 'trick'. This looks simple, weird or whatnot but i do believe that this is part of AI thinking process research.
Its a good question though what did they train? New Architecture? More parameters? Is this training a mix of experiments they did? Some auto optimization mechanism?
But if you read all the knowledge of humans, were does your reasoning start? Probably at a very high level of it.
If you look at human brains, we conduct experiments right? As a software developer, we write tests. ChatGPT can already run python code and it can write unit tests.
We do not use proofs when we develop. An AI could actually doing this. But at the end its more of a question who does it better, faster and cheaper eh?
Humans do in most cases have some knowledge about why they know the things they know. They can recall the topics they learned at school, and can deduce that they probably heard a given story from a friend who likes to discuss similar topics, etc.
LLMs have no access to the information they were trained on. They could know that everything they know was learned during the training, but they have no way of determining what they learned about and what they didn't.
It's only when we can interact with the environment to test our hypothesis that we then refine what we know and update our priors appropriately.
If we let LLMs do that as well, by allowing it to run code and interact with documentation/the internet and double-check things its not sure of, it's not out of the question LLMs won't eventually be able to more reliably understand its limitations.
Humans usually know (at least roughly) the source of anything they know, as there will be a memory or a known event associated with that knowledge.
LLMs have no analogous way to determine the source of their knowledge. They might know that all their knowledge comes from their training, but it has no way of knowing what was included in the training and what wasn't.
This could maybe be achieved with some more fancy RAG systems, or online training abilities. I think an essential piece is the ability to know the source of information. When LLMs reliably do, and apply that knowledge, they'll be much more useful. Hopefully somebody can achieve this.
Or it's a unique advantage because this stuff doesn't happen without good researches who may want:
1) Their name in scientific papers
2) They might actually care about the openess of AI
Playing catch up and trying to attract talent from the hot-new-thing OpenAI requires incentives beyond lots of money. I contend actually being open helps.
I'm sure that's one reason Facebook has an open source model, scientists can care about ethics and could be attracted to openness.
The "Attention Is All You Need" guys all worked at Google. Google is where they are despite having the smart guys with a new idea a few years back.
Of course, IMHO it wouldn't have have helped Google if they'd kept the transformer architecture secret. They'd have fumbled it because they didn't realise what they had.
Only then did Google realise there might be something to this LLM stuff - so they responded with Bard, a product so poorly received they later had to completely rebrand it. Looks like they didn't have a "sentient" model up their sleeve after all. Then the updated, rebranded model had a bunch of image generation embarrassments of its own.
Admittedly, they have recovered somewhat since then; they're second on some performance leaderboards, which is respectable.
But there was a real tortoise-and-hare situation where they thought they were so far ahead they had time for a nap, until they got overtaken. Any lead they had from inventing transformers and being the only people with TPUs has been squandered.
The chain of thought would be the final path through the tree. Interactively showing the thought tokens would give the game away, which is why they don’t show that.
In this case, the model would explore several chain of thoughts during training, but only output a single chain during inference (as the sibling comment suggests).
We can say the same thing about RL implying PPO, however there’s pretty big hints, namely Noam Brown being involved. Many of the things Noam Brown has worked on involve RL in tree search contexts.
He has also been consistently advocating the use of additional test-time compute to solve search problems. This is also consistent with the messaging regarding the reasoning tokens. There is likely some learned tree search algorithm, such as a learned policy/value function as in AlphaGo.
It’s all speculation until we have an actual paper. So we can’t categorically say MCTS/learned tree search isn’t involved.
Did they maybe expand to a tree during training to learn more robust reasoning? Maybe. But it still comes down to a regular transformer model at inference time.
https://arxiv.org/pdf/2403.09629
> In the Self-Taught Reasoner (STaR, Zelikman et al. 2022), useful thinking is learned by inferring rationales from few-shot examples in question-answering and learning from those that lead to a correct answer. This is a highly constrained setting – ideally, a language model could instead learn to infer unstated rationales in arbitrary text. We present Quiet-STaR, a generalization of STaR in which LMs learn to generate rationales at each token to explain future text, improving their predictions.
>[...]
>We generate thoughts, in parallel, following all tokens in the text (think). The model produces a mixture of its next-token predictions with and without a thought (talk). We apply REINFORCE, as in STaR, to increase the likelihood of thoughts that help the model predict future text while discarding thoughts that make the future text less likely (learn).
e: answer to my own question https://x.com/_xjdr/status/1835352391648158189
You don't see the scaling with respect to token length with non-FT'd CoT like this, in my opinion.
I'm just saying whatever they did in their [new] model, I think they also added CoT on top of it, as the outer layer of the onion so to speak.
https://x.com/polynoamial/status/1834641202215297487
I've gotten mini to think harder by asking it to, but it didn't make a better answer. Though now I've run out of usage limits for both of them so can't try any more…
nobody except for recent deepmind research has shown test time scaling like o1
This kind of thing is still so funny to me.
I wonder if the first guy who gets AGI to work will do it by realizing that he can improve LLM reliability over some threshold by telling it in all caps that his pet's life depends on the answer.
Sort of like how offering to pay the LLM $5 improves its output. The LLM's taking your prompt seriously, but not literally.
Ask an LLM what hallucination is, ask it to write a story with etc.
without zeroing out things, everything has and can have some impact
https://news.ycombinator.com/item?id=41504226
(But it all depends on the fine-tuning they did, so who knows, maybe it's just an Easter egg)
> In order to make the draft response nicer and complete, a set of question [sic] and its answer are provided," reads one prompt. "Please write a concise and natural reply by modify [sic] the draft response," it continues.
This really sounds like a placeholder made up by one engineer until a more qualified team sits down and defines it.
The butterfly keyboards were unusable to me and also the OS got too locked down so I left the platform.
1. https://x.com/goodside/status/1657396491676164096
It might be the 200M user base of OpenAI that provided the necessary guidance for advanced CoT, implicitly. Every user chat session is also an opportunity for the model to get feedback and elicit experience from the user.
A LLM can therefore have an higher IQ because it can combine all fields.
Also parameters and architecture might or might not be a limiting factor to us humans or a LLM. But LLM and parameter size, optimizations etc. are just at the beginning.
If we now have a good reasoning llm, we can build more test data automatically. Basically using the original content + creating new ones which can then lead to new knowledge = research.
Does Midjourney output look like an average human drawing?
Obviously, OpenAI knows how to train a classifier...
No, perhaps because it's heavily trained on photos.
If they were just doing prompt engineering and multiple inferences they'd definitely want to keep that a competitive secret and send all the open source devs off in random directions, or keep them guessing, rather than telling them which way to go to replicate Q-Star.
That's an incredibly cynical choice of phrasing.
Of course they don't want to help the competition, that's what a competition is. The competition isn't helping OpenAI either.
This is still clearly CoT, with all its limitations and caveats as expected. That's an improvement, sure, but definitely not a qualitative leap like OAI is trying to present it. (in a really shady manner)
Saying it's just CoT is kind of meaningless. Even just looking at the examples on Open AI's blog and you quickly see no other model today can generate or utilize CoT to anywhere near that quality through prompting or naive fine-tuning.
Starting from the strawberry example: it counts 3 "r"s in "strawbery", because the training makes it ignore grammatical errors if they're not relevant to the conversation (which makes sense in an instruction-tuned model) and their CoT doesn't catch it because it's not specialized enough. Will this scale with more compute thrown at it? I'm not sure I believe their scaling numbers. The proof should be in the pudding.
I've also had mixed results with coding in Python, it's barely better than Sonnet in my experience, but wastes a lot more tokens, which are a lot more expensive.
They might have improved things and made a SotA CoT that works in tandem with their training method, but that is definitely not what they were originally hyping (some architecture-level mechanism, or at least something qualitatively different). It also pretty obviously still has limited compute time per token and has to fit into the context (which is also still suffering from the lost-in-the-middle issue by the way). This puts the hard limit on the expressivity and task complexity.
I'm not convinced OpenAI is using one model. Look at the thinking process (UI), which takes time, and then suddenly, you have the output streamed out at high speed.
But even so, people are after results, not really the underlying technology. There is no difference of doing it with one model vs multiple models.
According to OpenAI, the model does it's thinking behind the scenes, then at the end summarizes that thinking for the user. We don't get to see the original chain-of-thought reasoning, just the AI's own summary of that reasoning. That explains the output timing.
If so, I imagine o1 clones could just be fine tunes of llamas initially.
Example prompt for that: "give me three sentences that end in 'is'."
All the problems with llm not reasoning (like planning, counting letters or deductive inference) are easy for classical algos. There needs to be a way to split the thinking process into two parts and then execute each part on the appropriate model.
Two centuries ago there were no computers, everything had to be done by humans. Get to that level first before you whip out code.
Not updated the Readme yet
Thanks for the fork and the suggestions though - looks like I'll be having fun with this over the week!
It’s a way to convert a text or response into an array of numbers, that can be used for similarity lookups.
I made a way to query large datasets of text strings: https://github.com/punnerud/search-embeddings-llama3.1
Can be used to let it explore a graph of knowledge as long as the graph is related to the original question, and can explore different solutions at the same time without repeating itself (then it’s get linked back to similar answers and stopped)
> Result: .9 is larger than .11
we've broken the semver barrier!
https://github.com/esoltys/o1lama
I think this class of problem might be better solved by allowing the LLM to 'zoom in' and view the input differently. Rather like you might peer closer for more detail if someone asked you about the print quality of something you were reading.
'zoom in' could input the same text letter by letter, or even in image form (rasterize the text) to help answer questions like "How many letters in the word strawberry contain straight lines?"
The idea is not silly in my view, I did something similar here: https://github.com/pseudotensor/open-strawberry
The idea is that data generation is required first, to make the reasoning traces. ToT etc. are not required.
The core innovation [1] of o1 lies in its ability to generate and refine internal chains of thought before producing a final output [2]. Unlike traditional LLMs that primarily focus on next-token prediction, o1 learns to:
1. Recognize and correct mistakes 2. Break down complex steps into simpler ones 3. Try alternative approaches when initial strategies fail
This process allows o1 to tackle more complex, multi-step problems, particularly in STEM fields.
OpenAI reports observing new "scaling laws" with o1 [5]:
1. Train-time compute: Performance improves with more extensive reinforcement learning during training. 2. Test-time compute: Accuracy increases when the model is allowed more time to "think" during inference.
This suggests a trade-off between inference speed and accuracy.
Sources [1] Introducing OpenAI o1 https://medium.com/%40sriramramakrishnan.aiexpert/openais-o1... [2] Learning to Reason with LLMs | OpenAI https://openai.com/index/learning-to-reason-with-llms/ [3] OpenAI o1 models - FAQ [ChatGPT Enterprise and Edu] https://help.openai.com/en/articles/9855712-openai-o1-models... [4] OpenAI releases new o1 reasoning model - The Verge https://www.theverge.com/2024/9/12/24242439/openai-o1-model-... [5] 9 things you need to know about OpenAI's powerful new AI model o1 https://fortune.com/2024/09/13/openai-o1-strawberry-model-9-... [6] Notes on OpenAI's new o1 chain-of-thought models https://simonwillison.net/2024/Sep/12/openai-o1/ [7] OpenAI just dropped o1 Model that can 'reason' through complex ... https://www.tomsguide.com/ai/openais-o1-model-takes-ai-to-a-... [8] Models - OpenAI API https://platform.openai.com/docs/models [9] OpenAI Unveils O1 - 10 Key Facts About Its Advanced AI Models https://www.forbes.com/sites/janakirammsv/2024/09/13/openai-...
o1 is far more than just CoT mechanics. It relies on a specialized model or collection of models that offer new capabilities to make CoT work far better than it works with a stock LLM.
For instance, o1 can recognize and correct its own mistakes and it seems to know how to dig deeper when needed. That's not something that stock LLMs do very well.
also I think they deliberate give you bad answers sometimes / a lot over the last year to build up advanced chains where the user is not getting what they want so you have to explain why. I started building up like 10 or so of these conversations where after like 100 messages it gets the right answer and it was like hmm, I wonder if they are using this.
just my rambles
Did you benchmark your system against MMLU-pro?
Because it says so nowhere in the repo.
Man Elon makes things confusing.
Tailoring prompts is likely still the best way to maximize performance when you can, but in broader domains you'd work around this through strategies like asking the LLM to combine predefined reasoning modules, or creating multiple reasoning chains and merging/comparing them, explicit MCTS etc. I think those strategies will still be useful for a good while, but pieces of that search process, especially directing the search more efficiently, move to the LLMs over time as they get trained with this kind of data.
They didn't train a model for millions from experts to just basically use CoT now. Thats a harsh simplification, probably.
Does it work? Well not really:
https://lluminous.chat/?sl=Yjkxpu
https://lluminous.chat/?sl=jooz48
I have also been using this prompt, and while it fails on the problem above, it works better for me than OP's prompt:
In particular it solves this problem: https://lluminous.chat/?sl=LkIWyS
* give me three sentences that end in "is"
* tell me the line of Star Spangled Banner that comes before "gave proof through the night"
But they did some good thinking before failing at it…
It's just a pile-up of trial-and-error instructions (maybe learned from previous 'projects', but still). There is no magic or skill to prompt 'engineering' anywhere.
Grok rhymes with cock, because Elon wants you to use it with your cock out.
That’s how I remember the difference.
Does the LLM take advantage of this? I don't know. It wouldn't surprise me if it did, and if it doesn't now I'd bet it will in the future. Either way though, throwing away those other languages could make the model dumber. As you allude to, there's a balance between intelligence and knowledge.
(in case you hadn't thought of it, those 'tools' can also be other LLMs with more specialized knowledge in a particular field. For example a 'translator' model)
Other 'facts' could also have more merit than it would first appear. Sure, one particular person's shoe size might not be needed, but if you were to filter out shoe sizes in general then the model might not be able to suggest how to find properly fitting footwear, or might not suggest that your back pain could be related to your shoes.
> That would be a worthwile endeavor and maybe even possible without boiling the oceans.
I think it's important to keep in mind that we're very early in the AI journey. Look at the power requirements of early computers versus the ones we use today. I'm all for keeping energy usage in mind, but I'd be careful with hyperbolic language as things are changing so quickly. Tasks that would have taken multiple GPUs can now run on my laptop CPU.
[1] https://www.edge.org/conversation/lera_boroditsky-how-does-o...
> I think it's important to keep in mind that we're very early in the AI journey.
That's what I am saying. At the moment there is this one really dumb idea, that bigger is better.