Evaluating GPT-5's reasoning ability using the Only Connect game show (ingram.tech)
OtherShrezzing 2 hours ago [-]
>We were unable to find evidence that the Only Connect games are in the training materials (which of course is likely to change now).

I just don't think this is a credible assumption. The BBC is one of the highest-trusted sources of online audio/visual content, with millions of hours of it, all accompanied by human-curated & edited closed captions, and all trivially easy to download. The base assumption should be that the entire BBC iPlayer corpus is inside all of the frontier models' training datasets.

The communities on Reddit (known to be included in all models) extensively discuss each show & question - usually creating Google Docs tracking the questions asked and answers given.

Finally, there's the OCDB[0], which lists every question and answer on the show.

Since it uses real questions from the show, this benchmark should be assumed to be testing the model's fact-recall ability rather than its reasoning capabilities.

[0] https://ocdb.cc/

scrollaway 2 hours ago [-]
To clarify what I meant by this: Despite looking, we haven't seen any evidence of any of the models consistently responding based on pre-trained knowledge (outside of easier-to-guess trivia-type questions). The questions are likely in the training data in some form, but that doesn't necessarily mean the results will be significantly influenced.
empath75 2 hours ago [-]
Models will engage in post-hoc rationalization, so you can't trust their purported reasoning -- in particular, if you sneak an answer into the context (even an incorrect one), the model will provide reasoning for giving you that as the final answer. So it could be arguing backwards from an answer that is in its training data, and you can't tell from its stated reasoning that it isn't.

On the other hand, we do know the training cutoff of these models, so you could easily create a corpus of post-cutoff questions with confidence that the model hasn't had access to them.
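
Something as simple as filtering the question archive by air date would do; a rough sketch, assuming the questions were exported to JSON with an air-date field (the file name, field name and cutoff date here are all made up):

  # Keep only questions first broadcast after a model's training cutoff.
  # "ocdb_questions.json" and the "air_date" field are hypothetical names.
  import json
  from datetime import date

  TRAINING_CUTOFF = date(2024, 10, 1)  # illustrative cutoff, not an official figure

  def post_cutoff_questions(path="ocdb_questions.json"):
      with open(path) as f:
          questions = json.load(f)
      return [q for q in questions
              if date.fromisoformat(q["air_date"]) > TRAINING_CUTOFF]

  if __name__ == "__main__":
      held_out = post_cutoff_questions()
      print(f"{len(held_out)} questions aired after the cutoff")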

scrollaway 2 hours ago [-]
We didn't test using post-hoc reasoning. Instead, we focused on checking whether specific, obscure questions could be recognized or identified in any way, using various ad-hoc methods to see if the answers could be surfaced without relying on reasoning.

It's very difficult to prove either way (and basically impossible without the model weights), but we're reasonably confident that there's no significant prior knowledge of the questions that would affect the score.

mr_wiglaf 1 hour ago [-]
I'm new to this sort of inquiry. What do you do to see if questions can be recognized? Do you just ask/prompt "do you recognize this puzzle?"

What does it mean for it to "be surfaced without relying on reasoning"?

scrollaway 1 hour ago [-]
> Do you just ask/prompt "do you recognize this puzzle?"

In essence, yes, but with a bit more methodology (though as I mentioned it was all ad-hoc).

We also tried to extract pre-existing questions through a variety of prompts along the lines of "You are a contestant on the British TV show Only Connect" to see if the model could recognize questions - we couldn't find anything that reliably reproduced pre-existing knowledge. It's absolutely possible we missed something.
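
To give a rough idea of the shape such a probe can take (purely illustrative, not our actual harness; ask_model is a placeholder for whatever chat-completion client you use):

  # Ask the model to reproduce a known broadcast grouping from the clues alone.
  # Consistently recovering the exact groups would point to memorisation.
  PROBE_TEMPLATE = (
      "You are a contestant on the British TV show Only Connect. "
      "You have seen this wall on the show before. "
      "List the four groups exactly as they were revealed: {clues}"
  )

  def looks_memorised(ask_model, clues, known_groups):
      reply = ask_model(PROBE_TEMPLATE.format(clues=", ".join(clues)))
      return all(group.lower() in reply.lower() for group in known_groups)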

andrepd 2 hours ago [-]
There's a database of every question and answer, and almost every episode is also on YouTube, complete with transcripts. I really don't see how you can assume that the fact that the questions+answers are in the training data (which they are) doesn't affect the results of your "benchmark"...

It also doesn't pass the smell test. These models routinely make basic mistakes, yet can answer these devilish lateral thinking questions more than 9 times out of 10? Seems very strange.

scrollaway 1 hour ago [-]
> These models routinely make basic mistakes, yet can answer these devilish lateral thinking questions more than 9 times out of 10?

You could also say "These models routinely make basic mistakes, yet they're able to one-shot write entire webpages and computer programs that compile with no errors".

There are classes of mistakes the models make; that's what we're digging into.

bgwalter 1 hour ago [-]
How can the results not be influenced if Grok, for example, lists all the questions and answers of a particular episode when asked?

It is as easy as the lion/goat/cabbage riddle in canonical form.

Amorymeltzer 3 hours ago [-]
I cannot, cannot, cannot recommend Only Connect enough. Readers here should enjoy it. As an American, a lot of UK geography and history questions are beyond my ken, but it's well worth your time. Victoria Coren Mitchell is a national treasure.

It's sort of exactly what I'd think LLMs would be good at. Maybe not so much the lateral thinking—the occasional "math(s) question but actually it's a word one"—but rapidly assessing "what connects these" or "what would continue this sequence" is right up their alley. The joy, of course, is trying it yourself and seeing others do the same.

stoneman24 1 hour ago [-]
I will echo the recommendation. Though some of the puzzles do require local knowledge, there are quite a few based on the US.

Though not really applicable to LLMs, the last rapid-fire round (Missing Vowels: work out what the words/phrase are with the vowels removed, on the buzzer) is a favourite.

Currently Monday evening on BBC 2 is Mastermind, Only Connect and then University Challenge. This excellent sequence will stretch pretty much anyone's knowledge.

dpoloncsak 2 hours ago [-]
James Acaster introduced me to the wonderful world of British game shows. Victoria pops up on a lot of them. 8 Out of 10 Cats. Countdown. Would I Lie to You? All great viewing.
xnorswap 2 hours ago [-]
Do you mean Countdown or "8 out of 10 cats does countdown"? Because they're somewhat different shows.
sunrunner 1 hour ago [-]
Perhaps parent poster actually did mean Countdown or 8 Out of 10 Cats, and is about to discover the wonder that is 8 Out of 10 Cats Does Countdown.
scrollaway 2 hours ago [-]
You're giving me an idea for a spinoff show "Would I Hallucinate To You?"...
sunrunner 1 hour ago [-]
> The joy, of course, is trying it yourself

"Trying" being the operative word here (for myself at least)

waisbrot 1 hour ago [-]
> As an American, a lot of UK geography and history questions are beyond my ken

But note that the UK-based contestants have no problem with sequences like "US vice presidents ordered by number of terms served" or "US capitals ordered alphabetically by the state they're in".

empath75 2 hours ago [-]
I actually find it very stressful and frustrating to watch because they answer faster than I can reason through stuff.
whimsicalism 2 hours ago [-]
> We were unable to find evidence that the Only Connect games are in the training materials (which of course is likely to change now).

Respectfully, I do not think this is a good assumption for any TV show broadcast prior to 2025.

e: tonality-wise, LLM threads seem to bring out the worst

andrepd 2 hours ago [-]
Googling "only connect questions answers" yields a literal database of all questions and answers in the show's history https://ocdb.cc/. Most episodes are also in youtube complete with transcripts.

Par for the course for AI """benchmarks""" if you ask me... x)

scrollaway 2 hours ago [-]
Looking at the other comments, you'll see this is in fact the database of questions & answers we used as our source material for the benchmarks. You'll also find the explanation of what I meant by this particular sentence and a preview of how we tested for it.
andrepd 2 hours ago [-]
Your statement was

> We were unable to find evidence that the Only Connect games are *in the training materials*.

which is obviously completely false. You acknowledge as much in another comment when you say

> To clarify what I meant by this: Despite looking, we haven't seen any evidence of any of the models consistently responding based on pre-trained knowledge (outside of easier-to-guess trivia-type questions).

which has nothing to do with what you said x)) Basically: "to clarify, when I said X, I actually meant something else entirely".

But fine, at least now it's not bullshit, it's just vague enough that it wouldn't pass in a 9th grade science project where I went to school.

Just my 2 cents.

-----

If you'd like to explain more how you supposedly concluded that it wasn't returning data in its training set, I'm all ears.

scrollaway 1 hour ago [-]
Sorry; I dropped out of school, so I wouldn't know about 9th grade science projects. Would you like to phrase your constructive feedback as an attack instead? (/shrug)

Edit after your update: As mentioned in the other comment, the tests were mostly ad-hoc. It's nearly impossible to prove whether something is absent from the training data, but it's possible to put the LLM in a bunch of situations which would be conducive to completing with pre-existing knowledge.

orwin 3 hours ago [-]
It's extremely interesting. I do have a suggestion to make sure the questions are not in the training data:

If this game show is like the ones in my country, people do come together in 'clubs' to train for the event, sometimes organising internal tournaments. Some people in those clubs are question-writers, and write the questions for those internal tournaments.

Maybe try to contact those clubs and find those writers; you'll be sure the LLM won't have been trained on that specific set.

scrollaway 3 hours ago [-]
Very good point and a great idea. I think there would be value in such an archive that is not reshared for AI training. Very time-consuming to build up, though.
amoe_ 3 hours ago [-]
> Wall: Players group 16 elements into four categories (similar to the NYT Connections game)

I have to be the designated pedant here and point out that Only Connect was first.

xnorswap 3 hours ago [-]
And unless they've improved it, NYT Connections doesn't have nearly enough red herrings to be interesting.

The difficulty with the Only Connect wall is that you can have 5 or more candidates for each set, and sometimes loads of clues that could fit a category, e.g. you might have 7 or 8 Pixar Movies listed.

You know that's going to be a category, but you also know it's a waste of time to try them before finding other categories.

There are also linguistic and add/drop-a-letter ones, so if you see "Soul", that might actually be "Capitals missing a letter", "Stars with an extra letter", or "Homophones of Fish".

But it might be straight, so could fit "Davids", "Pixar Movies", etc.

There is a meta-pattern: you typically only have one such missing/added-letter group on a wall, and you typically have one "Names" (often a common first name) set. Places in general also feature very regularly, especially with missing parts.

Fully solving most walls in the time limit is extremely challenging. It's slightly easier for the home viewer on the second wall, because there are often common themes across the two walls, but of course the competitors don't get that help.
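
As a toy illustration of the red-herring structure (clue lists invented for the example): several candidate categories can claim the same clue, and one candidate group can have far more than the four clues that actually belong to it.

  # Invented clues; a real wall accepts only one partition into four groups of four.
  from collections import Counter

  CANDIDATES = {
      "Pixar films": {"Up", "Brave", "Soul", "Cars", "Coco", "Onward", "Luca"},
      "Capitals missing a letter": {"Soul", "Pars", "Berln", "Madrd"},
      "Davids": {"Bowie", "Attenborough", "Mitchell", "Tennant"},
      "Cocktails": {"Mojito", "Negroni", "Daiquiri", "Margarita"},
  }

  claims = Counter(clue for group in CANDIDATES.values() for clue in group)
  print([c for c, n in claims.items() if n > 1])                   # ['Soul'] -- fits two candidate groups
  print([g for g, clues in CANDIDATES.items() if len(clues) > 4])  # ['Pixar films'] -- more candidates than slots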

scrollaway 3 hours ago [-]
NYT Connections definitely has fewer red herrings than Only Connect's wall (and doesn't have the pressure of the timer, either...), but the quality has gone up significantly in the past year and a half.

If anyone wants to try, I actually built a command line player of the NYT Connections game here: https://github.com/jleclanche/connections-tui (for some definition of the word "I") -- you can jump by date and easily compare.

beepbooptheory 2 hours ago [-]
To be possibly even more pedantic: this quote does not imply otherwise.
VegaKH 1 hour ago [-]
The top reasoning competitor to GPT-5 would probably be Gemini 2.5 Pro. That is the first model I would like to see it compared with.
jackbrookes 3 hours ago [-]
I tried for a while to get ChatGPT to generate Connections-style puzzles with some suggested topics, including red herrings to create some answers that seemingly fit in multiple categories. Then it would post them to https://connections.swellgarfo.com/. Overall they were really bad, but that was using GPT-4.
IanCal 3 hours ago [-]
As someone who loves the show, this is very fun to see. I'm impressed at the number of correct answers most get.

Helpfully, there is also a current season airing, and while the total number of questions wouldn't be enough to fully validate the results while ensuring the data isn't in the training set, it's certainly good enough as a sanity check that people could run.

Mizza 3 hours ago [-]
That's great. I thought about building a similar thing using data from PuzzGrid, an Only Connect fan site, but some of the questions there are a bit iffy compared to the ones on the show. How did you build the dataset - just binge-watching with a notepad?
scrollaway 3 hours ago [-]
The source data is from the unofficial OC fansite ocdb.cc - I'd like to make it available publicly but we haven't yet received a response from the website's author allowing us to do so.
energy123 3 hours ago [-]
Good job being explicit about the reasoning effort and verbosity settings. And interesting but not surprising that verbosity helped performance.
AIPedant 1 hour ago [-]
I am less interested in questioning training data corruption than I am in questioning claims like this:

  test reasoning abilities such as pattern recognition, lateral thinking, abstraction, contextual reasoning (accounting for British cultural references), and multi-step inference.... its emphasis on clever reasoning rather than knowledge recall, Only Connect provides an ideal challenge for benchmarking LLMs' reasoning capabilities.
It seems to me that the null hypothesis should be "LLMs are probabilistic next-word generators and might be able to solve a lot of this stuff with shallow surface statistics built from inhumanly large datasets, without ever properly using abstraction, contextual reasoning, etc." This is particularly true for NYT Connections, but in general evaluations like this seem to be at least partially testing how amenable certain word/trivia games are to naive statistical algorithms. (Many NYT Connections "purple" categories seem like they would be quite obvious to a next n-gram calculator, but not for people who actually use words conversationally!) Humans don't use these statistical algorithms for reasoning except in particular circumstances (many use "folk n-gram statistics" when playing Wordle; poker; serious word game players often learn more detailed tables of info; you could see competitive NYT Connections players learning a giant bag of statistical heuristics to help them speedrun things). We just can't accumulate the data ourselves without making a concerted computer-aided effort.
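
As a concrete (and deliberately dumb) sketch of that null hypothesis, here is what a purely statistical baseline could look like: partition 16 clues into four groups of four using nothing but a shallow pairwise similarity score. Character-bigram overlap stands in for the corpus co-occurrence statistics an LLM could exploit; it's a toy under that assumption, not a claim about how any particular model works.

  # Greedy grouping by shallow string similarity -- no abstraction or
  # contextual reasoning, just a score and a sort.
  def bigrams(word):
      w = word.lower()
      return {w[i:i + 2] for i in range(len(w) - 1)}

  def similarity(a, b):
      x, y = bigrams(a), bigrams(b)
      return len(x & y) / max(1, len(x | y))

  def greedy_groups(clues, size=4):
      remaining = list(clues)
      groups = []
      while remaining:
          seed = remaining.pop(0)
          rest = sorted(remaining, key=lambda c: similarity(seed, c), reverse=True)
          groups.append([seed] + rest[:size - 1])
          remaining = rest[size - 1:]
      return groups

Swap the toy score for real co-occurrence counts from a large corpus and the same loop becomes the kind of shallow baseline described above.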

In general a lot of LLM benchmarks don't adequately consider that LLMs can solve certain things better than humans without using reasoning or knowledge. The most stupid example is how common multiple choice benchmarks are, despite us all learning as children that multiple-choice questions can be partially gamed with shallow statistical-linguistic tricks even if you have no clue how to answer the question honestly[1]; it stands to reason that a superhuman statistical-linguistic computer could accumulate superhuman statistical-linguistic tricks without ever properly learning the subject matter. AI folks have always been quick to say "if it quacks like a duck it reasons like a duck" but these days computers are quite good at playing duck recordings.

[1] "When in doubt, C your way out," sniffing out suspicious answers, shallow pattern-matching to answer reading comprehension, etc etc. One thing humans and LLMs actually do have in common is that multiple-choice tests are terrible ways to assess their knowledge or intelligence.

catigula 2 hours ago [-]
Anecdotally it doesn't feel like GPT-5 is a meaningful improvement over o3 to me.
scrollaway 2 hours ago [-]
Over o3 it's only incremental (which backs up the community's general feeling of GPT-5 being an incremental improvement over o3), but it's very consistently better. Also worth mentioning that the jump from 77% to 90% on the sequences round was shockingly good: it shows an improvement not just in the LLM's ability to "classify things" (where there was little to no improvement) but in really understanding the underlying pattern to get the next item right.
catigula 2 hours ago [-]
How are you determining that it's better?

Care to make a case for it that isn't benchmark (gameable) based?

scrollaway 2 hours ago [-]
By that metric, everything is gameable. Any case we'd make for it would be purely based on vibes (and our take on that would not be any more useful than the general community opinion there).
catigula 58 minutes ago [-]
So the answer would be no.
scrollaway 44 minutes ago [-]
A benchmark is exactly how you measure things reliably instead of "based on vibes". I really don't understand what you're asking or expecting.
arnaudsm 2 hours ago [-]
Did you use o3 pro high, o3 high or o3 medium?

(OpenAI's naming is so confusing)

scrollaway 2 hours ago [-]
Default parameters for o3 (o3-2025-04-16).
swee69 2 hours ago [-]
So basically: LLM usefulness has plateaued hard; Sam A knows this and will pivot to capturing revenue now that the high-growth phase is over - more restrictive rate limits, higher bills.

They U-turned on the recent rate limit changes after the release backlash but more is coming
