OpenAI O3-Mini (openai.com)
simonw 4 days ago [-]
I used o3-mini to summarize this thread so far. Here's the result: https://gist.github.com/simonw/09e5922be0cbb85894cf05e6d75ae...

For 18,936 input, 2,905 output it cost 3.3612 cents.
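
A back-of-the-envelope check on that figure (assuming o3-mini's launch API pricing of $1.10 per million input tokens and $4.40 per million output tokens; verify against the current pricing page):

    input_tokens, output_tokens = 18_936, 2_905
    input_cost = input_tokens / 1_000_000 * 1.10     # ~$0.0208
    output_cost = output_tokens / 1_000_000 * 4.40   # ~$0.0128
    print(round((input_cost + output_cost) * 100, 4))  # ~3.3612 cents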

Here's the script I used to do it: https://til.simonwillison.net/llms/claude-hacker-news-themes...

s_dev 4 days ago [-]
Currently on the internet, people skip the article and go straight to the comments. Soon people will skip the comments and go straight to an AI summary, reading neither the original article nor the comments.
flakeoil 4 days ago [-]
But then there will be no comments to summarize.
jefe_ 4 days ago [-]
Our digital twins will write the comments. They will be us, but with none of our flaws. They will never experience the shame of posting a dumb joke, getting flamed, and then deleting it, for they will have tested all ideas to prevent such an oversight. They will never experience the satisfaction-turned-to-puzzlement of posting an expertly crafted, well-researched comment that took 2 hours of the workday to draft - only to receive one upvote, for their research will be instantaneous and their outputs efficient. Of course they will never need to humbly reply, 'Ah, I missed that, good catch!' to a child comment indicating the entire premise of their question would be answered with a simple reading of the linked article - for they will have deeply and instantly read the article. Yes, our digital twins will be us, but better - and we will finally be free to play in the mud.
endofreach 3 days ago [-]
We all have been, for a long time. The mud just doesn't shine so bright.
MichaelZuo 2 days ago [-]
Well by definition the vast majority of comments written, anywhere, cannot be memorable.

Assuming the median reader reads a few tens of thousands of comments in a year, only a few hundred would likely stick without being muddled. At best.

waste_monk 2 days ago [-]
Cripes [1].

[1] https://qntm.org/perso

tuatoru 3 days ago [-]
> will finally be free to play in the mud

As long as we can still have a few sticks, and some string, or a cardboard box...

bredren 3 days ago [-]
I wonder about Amazon reviews right now.

Part of the utility of writing a review is that it gets read, but searching for keywords in reviews now requires the user to wait for AI-generated responses first.

Then the user must tap through another link and then expand an individual matching review. It’s absolutely buried.

treebeard901 3 days ago [-]
You will be given 3 AI-generated comments to choose from as your comment
throwuxiytayq 3 days ago [-]
if reality turns into a Ubisoft game I think I might log out
thejazzman 1 days ago [-]
would you like to play again?
osigurdson 3 days ago [-]
I think people generally like writing comments. Reading articles, in their entirety, less so.
notoverthere 4 days ago [-]
But what if the AI just hallucinates the comments? People will never know.
rcarmo 4 days ago [-]
They could actually be better than reality. Who knows?
jgoodhcg 4 days ago [-]
I went to this thought too but then I remembered the 90-9-1 rule. The AI summary is for some portion of the 90. The 9 are still going to comment. What they comment on and how they generate the comments might change though.
geoffbp 4 days ago [-]
Glad it’s not just me who does this. :)
ed_balls 4 days ago [-]
I wanted to listen to a podcast where a professor explains a new drug for ADHD. Instead of spending 2h of my time I spent 5 min reading a summary generated by GPT from a YouTube transcript.
bredren 3 days ago [-]
I’ve done this with audio extracted from YouTube videos: audio -> STT -> LLM summary.

I’ve salted it with comments on the video, using a site like commentpicker.com or running JS and loading more and expanding threads manually.

Here’s an example I did for a pal:

You are an expert on building retaining walls. Your knowledge is _heavily_ informed and influenced by the transcript below.

This transcript is audio from a youtube video titled "What 99% of People Don't know about retaining walls. #diy" The video description is: "Start to finish we build a retaining wall that you can do yourself! How to Dig for a wall, How to Base a retaining wall, how to backfill, and MORE!. #retainingwall #diy"

Additional information may be included in comments, which are in the attached CSV. Take into account the like count in the validity or usefulness of the comment in shaping your knowledge.

In giving your replies, try to be specific, terse and opinionated. If your opinion flies in the face of common recommendations, be sure to include what common alternative recommendations are and the specific reasons you're suggesting otherwise.

----

# Transcript

""" [paste transcript] """

# Comments See attached .csv

ilbeeper 4 days ago [-]
Which drug? Who's the professor? Can you share the summary please?
ruibiks 4 days ago [-]
[dead]
maddmann 4 days ago [-]
Bot getting out of control
ruibiks 4 days ago [-]
You are a maddmann! No bot there, ask me a question a bot would not figure out bro ;)
maddmann 4 days ago [-]
How many r’s in strawberry sherbet?
ruibiks 3 days ago [-]
I never heard of the word sherbet until today I went online to see it was a dessert and I guess that proves something? ohh sh*t...bots can go online too... am I a bot? Could I be AGI?
darthrupert 4 days ago [-]
That would be an actual improvement. Reading the comments section usually just leads to personal energy waste.
anonzzzies 4 days ago [-]
On both HN & Reddit, I find the comments more informative and less frustrating than reading the article usually. But I guess YMMV.
northern-lights 4 days ago [-]
10 years ago Reddit used to be a place where you would get informed opinions and less spam. 5 years ago, HN used to be a place where you would get informed opinions and less spam. Neither of them will go back to the same level of quality. Not anymore.
inciampati 3 days ago [-]
This has been said by every long-term user of these sites. And not at the same time. It was always better in the past. It's probably partly true... yeah, quality can decrease as things get bigger. But it's also partly an illusion of aging in a changing world. Ten years is long enough to completely change the way we write and express ourselves.
schrodinger 3 days ago [-]
It's actually sort of in the Hacker News Guidelines (https://news.ycombinator.com/newsguidelines.html):

"Please don't post comments saying that HN is turning into Reddit. It's a semi-noob illusion, as old as the hills."

anonzzzies 4 days ago [-]
Luckily I am not a LLM and can read and select things rather fast without processing the crap. Both HN and some subreddits still are fine for quality if you know the users and their reputations; just skip the rest. Worse than 10 years ago, still a lot better than mostly anywhere else.
cft 4 days ago [-]
You need to expend resources (e.g. "proof of work") to post, to drive away low-effort spam. https://stacker.news/ is an interesting experiment in that regard.
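
A hashcash-style sketch of what "expend resources to post" could mean (illustrative only; the nonce search below is not how stacker.news works, which I believe uses micropayments instead):

    import hashlib

    def proof_of_work(comment: str, difficulty_bits: int = 20) -> int:
        """Find a nonce so sha256(comment + nonce) has `difficulty_bits` leading zero bits."""
        target = 1 << (256 - difficulty_bits)
        nonce = 0
        while True:
            digest = hashlib.sha256(f"{comment}:{nonce}".encode()).digest()
            if int.from_bytes(digest, "big") < target:
                return nonce  # client submits this nonce along with the comment
            nonce += 1
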
solumunus 3 days ago [-]
And yet the comments still provide more value than the article, almost always.
brabel 4 days ago [-]
You need to read the article first to know that. But most people won't.
anonzzzies 4 days ago [-]
If the comments say it's worth it I read it, but often (especially on HN) the top comment starts with a summary of errors/inconsistencies in the article and then I don't really feel the need.
creesch 4 days ago [-]
You do you, but hivemind thinking is a real thing. I have seen highly upvoted comments seemingly "debunk" an article where on closer examination it becomes clear they actually didn't read the article either.

It quickly becomes this weird bubble of people just acting on what everyone "thinks" the content is about without ever having looked at the content.

I get that it is easier, but intellectually you are doing yourself no favors by having this be your default.

tmnvdb 4 days ago [-]
That is the biggest problem on this website: people want to feel smart by 'debunking' things they don't really understand. It leads to a lot of contrarian views with a poor signal-to-noise ratio, especially when the topic is slightly outside the average user's experience (midwit programming).
thejazzman 1 days ago [-]
not to mention walls of text as people argue back and forth
Jensson 4 days ago [-]
But often the top comment debunks what you think is true, and then you learn something.

> It quickly becomes this weird bubble of people just acting on what everyone "thinks" the content is about without ever having looked at the content.

That isn't an issue though since the important part is what you learn or not, not whether you think an imaginary article is true or not. If you learn something from someone debunking an imaginary article, that is just as good as learning something from debunking a real article.

The only issue here is attribution, but why should a reader care about that?

Edit: And it isn't an issue that people will think it actually debunks the linked article, since there will always be a sub comment stating that the commenter didn't read the article and therefore missed the mark.

genewitch 4 days ago [-]
> If you learn something from someone debunking an imaginary article, that is just as good as learning something from debunking a real article.

An argument for synthetic corpi (plural of corpus..esses?) - AI ingesting AI.

Jensson 4 days ago [-]
Commenters don't live in isolation, only getting information from other commenters; they have lives outside of HN and use that to inform their comments.

So no, it isn't the same as AI ingesting AI content at all.

ScottBurson 4 days ago [-]
The plural is "corpora" :-)
greenchair 3 days ago [-]
usually the top comment these days is a diversion to push an agenda.
Jensson 4 days ago [-]
You can read the article after the comments.
concordDance 4 days ago [-]
They definitely used to be, but haven't been much good for years. At least 8 years in the case of reddit, maybe 3 in the case of hackernews.

Though at this point it's a habit I cannot quite bring myself to break...

qup 4 days ago [-]
It's worse, but there's nowhere better than here imo.

Except very niche topics maybe

darthrupert 4 days ago [-]
I agree, they are! But reading through them, or even worse, engaging with them, is a serious energy drain.

Especially if somebody is being wrong.

rzzzt 4 days ago [-]
Occasionally the article is wrong.
j4coh 4 days ago [-]
Even more occasionally commenters are too.
Jensson 4 days ago [-]
But it's extremely rare for a popular wrong comment to not have sub-comments debunking it, so comments are therefore much more reliable than articles.
j4coh 4 days ago [-]
I’ve found the more I know about the topic at hand, the more wildly off base many of the comments seem, even the highly upvoted undebunked ones. It’s harder for me to judge topics I don’t know much about, but I have to assume it’s something similar.
Jensson 4 days ago [-]
That is true for articles as well though, I find comments typically have better info than the articles. It is more likely for some of the comments to have been written by real experts than that the article is.
lukan 4 days ago [-]
Yeah, but is your mission to correct anything wrong on the internet?

https://xkcd.com/386/

Sounds exhausting.

darthrupert 4 days ago [-]
Yes, that was the joke.
lukan 3 days ago [-]
Hard to tell nowadays...
IOUnix 3 days ago [-]
Today is my first day on HN and I came here looking for the comments. I miss intelligent conversations and reasonable interactions, and Reddit is just a dumpster fire now (and arguably was before). Even if you can find a reasonably mature sub, you're still exposed to the muckraking of the front page overflowing into everywhere else.
kandesbunzler 4 days ago [-]
Lmfao, you forgot the fact that, especially on Reddit and HN, people tend to say a lot of bullshit, especially if it goes hand in hand with their biases.
anonzzzies 4 days ago [-]
So do the writers of the articles, however, with them I cannot interact.
NiloCK 4 days ago [-]
In a way I agree, but the sustainability is shaky.

The intrinsic motivation for providing the comments comes from a mix of peer interaction, camaraderie, and reputation building.

If it becomes evident that your outputs are only directly consumed by a sentiment-aggregation layer that scrubs you from the discourse, then it could be harder to put a lot of effort into the thread.

This doesn't even account for the loss of info that results from fewer people actually reading and voting through the thread.

CMCDragonkai 3 days ago [-]
Yes but this assumes human input is the golden goose. Maybe it is at the beginning just to bootstrap the process, and then runaway AI starts to recurse homeruns with its own original comments.
booleandilemma 4 days ago [-]
HN comments are good, but the more nested a thread, the more inane they get. You see things like people arguing over punctuation, or the definition of punctuation.
zapkyeskrill 4 days ago [-]
Yes, but who will comment then and on what?
ethbr1 3 days ago [-]
HN comments aren't valuable because of their average quality or content, but because of the feedback vote sorting.

(Aside from the tendency towards first=top. Would be nice to have time-weighted upvote decay instead of absolute count)
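
One sketch of such decay (gravity-style, illustrative constants; not HN's actual comment ranking):

    def decayed_score(upvotes: int, age_hours: float, gravity: float = 1.5) -> float:
        # older votes count for less, so later comments get a chance to surface
        return upvotes / (age_hours + 2) ** gravity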

cft 4 days ago [-]
I agree with the first part, but I disagree with the conjecture. There are also people who enjoy writing comments, and inasmuch as they do, they have to read a bit of context (at least the comment they are replying to). Those will always exist.
hibikir 3 days ago [-]
Have you ever had to give a demo that no meeting attendee actually cared about, just because management demanded it? Standup meetings with 20 people that maybe 2 people cared about? The future might involve AI updates, summarized by AI into weekly reports, summarized again into monthly reports, then into quarterly departmental reports that nobody actually reads.
ncr100 3 days ago [-]
Or the future might be that everyone reads summaries, because there are only AI managers and no human managers: humans hold occasional meetings and have conversations which are listened to by AI, and we take our lead from the AI summaries. The AI maintains business focus by monitoring business performance, updating each summary as needed.

It's a worthwhile experiment for a business school, IMO, automating a layer of bureaucracy.

nrclark 2 days ago [-]
Doesn't that seem ultimately futile though?

Say I use AI to write a report that nobody cares about, and then the receiver gives it to AI because they can't be bothered to read it. Who is benefiting here other than OpenAI?

linuxftw 4 days ago [-]
Amazon is already forcing this pattern on mobile users not logged in. If you want to see the reviews, all you get is an AI summary, the star rating, maybe 1 or 2 reviews, then you have to log in to see more.
portaouflop 4 days ago [-]
The cure is to stop using Amazon.
linuxftw 4 days ago [-]
It has already caused me to not further investigate purchases. Similar to yelp, if I have to log in to see things, you're dead to me.
Maken 3 days ago [-]
There is also no way to disable the AI summaries even if you are logged in.
flamedoge 3 days ago [-]
Soon people will read other people's summaries that they copied from an AI summary on the web.
fsmv 4 days ago [-]
I won't. I'm looking for real life human voices.
jbm 3 days ago [-]
Sounds about right, as we are post-dead internet in public places.

There was a thread about the US tariffs on Canada I was reading on a stock investment subreddit. The whole page was full of people complaining about Elon Musk, Donald Trump, "Buy Canadian" comments, moralizing about Alberta's conservative government, and other unrelated noise. None of this was related to the topic: stocks and funds that seemed well placed for a post-tariff environment.

There were small, minor points of interest but instead of spending honest vacation time looking at each comment at zoomer internet church, I had an LLM filter out the stuff I didn't care about. Unsurprisingly there was not much left.

nyarlathotep_ 3 days ago [-]
> zoomer internet church

Stealing this.

jbm 2 days ago [-]
Please do. Don't forget to appreciate the mantra-like repetition of Brian Taylor Cohen and Qasim Rashid Esq memes.
tomrod 4 days ago [-]
As articles are increasingly published with AI, this becomes yet another example of, to coin a phrase, AI loopidity.
inciampati 3 days ago [-]
This is so good! Finally the word I need to describe what I'm living. And yes I'm copy pasting articles into LLMs for summary.
IOUnix 3 days ago [-]
If writers are using ai to write articles, and readers are using ai to summarize the same article, what's the underlying inefficiency here? That writers should just be posting brief summaries of the articles in the first place? Or maybe they just need to be prompting ai to create a summary rather than a full article?
cle 3 days ago [-]
Differences in how people want to consume the article, and what information they’re looking for. Some want snippets, some want summaries, some want long form.

We have information compression machines now. Might as well raw dump the information and let the machine package it up in the format we prefer for consumption, instead of pre-packaging it. (Yeah, this is effectively what authors are doing…currently they can still do novel things that the compression machines can’t, but how long will that last?)

jjallen 4 days ago [-]
What if this is better? If the summary includes all of the key details and people save time AND read both the article and comments wouldn't that be better?
handsclean 4 days ago [-]
Unless there’s another AI breakthrough, the real applications of AI are and will be things that it’s ok to get wrong 10% of the time, usually because prior approaches were wrong more than 10% of the time. This is fine for the Playskool world of entertainment communication, in which there’s no end goal except emotional stimulation, but the opposite of what’s needed in any real attempt to engage with the world, in which we already struggle to capture the complexity of reality.
lionkor 4 days ago [-]
If that was the case, then yes. It's not though, because the AI misses parts (doesn't "include all of the key details") and also fabricates and misrepresents other details
jjallen 4 days ago [-]
It’s sometimes not now, but likely will be the case quite soon.

I invite everyone to have an open mind about this, as it seems that the part of my comment that said “what if” wasn’t fully absorbed.

threecheese 4 days ago [-]
I haven’t tried o3, but one issue I struggle with in large context analysis tasks is the LLMs are never thorough. In a task like this thread summarization, I typically need to break the document down and loop through chunks to ensure it actually “reads” everything. I might have had to recurse into individual conversations with some small max-depth and leaf count and run inference on each, and then have some aggregation at the end, otherwise it would miss a lot (or appear to, based on the output).

Is this a case of PEBKAC?
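
Roughly, the chunk-and-aggregate loop I mean looks like this (a sketch; `summarize()` and `flatten()` are placeholders for the model call and for rendering one conversation subtree as text):

    def summarize_thread(top_level_comments, max_depth=2, chunk_chars=8000):
        """Summarize each conversation in chunks, then aggregate the partial summaries."""
        partials = []
        for conversation in top_level_comments:
            text = flatten(conversation, max_depth)  # placeholder: subtree -> text
            chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
            partials += [summarize(chunk) for chunk in chunks]  # placeholder model call
        return summarize("\n\n".join(partials))  # final aggregation pass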

sdesol 4 days ago [-]
> I haven’t tried o3, but one issue I struggle with in large context analysis tasks is the LLMs are never thorough.

o3 does look very promising with regards to large context analysis. I used the same raw data and ran the same prompt as Simon for GPT-4o, GPT-4o mini and DeepSeek R1 and compared their output. You can find the analysis below:

https://beta.gitsense.com/?chat=46493969-17b2-4806-a99c-5d93...

The o3-mini model was quite thorough. With reasoning models, it looks like dealing with long context might have gotten a lot better.

Edit:

I was curious if I could get R1 to be more thorough and got the following interesting tidbits.

- Depth Variance: R1 analysis provides more technical infrastructure insights, while o3-mini focuses on developer experience

- Geopolitical Focus: Only R1 analysis addresses China-West tensions explicitly

- Philosophical Scope: R1 contains broader industry meta-commentary absent in o3-mini

- Contrarian Views: o3-mini dedicates specific section to minority opinions

- Temporal Aspects: R1 emphasizes future-looking questions, o3-mini focuses on current implementation

You can find the full analysis at

https://beta.gitsense.com/?chat=95741f4f-b11f-4f0b-8239-83c7...

inciampati 3 days ago [-]
> Is this a case of PEBKAC?

It's 2025 and every useful conversation with an LLM ends with context exhaustion. There are those who argue this is a feature and not a bug. Or that the context lengths we have are enough. I think they lack imagination. True general intelligence lies on the other side of infinite context length. Memory makes computation universal, remember? http://thinks.lol/2025/01/memory-makes-computation-universal...

syntaxing 4 days ago [-]
Depending on what you’re trying to do, it’s worth trying the 1M-context Qwen models. They only released 7B and 14B, so their “intelligence” is limited, but they should be more than capable of a coherent summary.
andrewci 4 days ago [-]
Are there any tools you use to do this chunking? Or is this a custom workflow? I've noticed the same thing both on copy/paste text and uploaded documents when using the LLM chat tools.
scarface_74 4 days ago [-]
Try Google’s NotebookLM
ben_w 4 days ago [-]
I put one of my own blog posts through NotebookLM soon after it became available; it hallucinated content I didn't write and left out things I had written.

Nice TTS, but otherwise I found it unimpressive.

scarface_74 4 days ago [-]
I’m not talking about the TTS and podcast creation. I’m talking about just asking questions where it gives you the answer with citations.
ben_w 4 days ago [-]
Given what it got wrong was in the LLM part, that is a distinction without a difference.
scarface_74 3 days ago [-]
The podcast creation does a lot of small talk, just like people do. Have you tried using NotebookLM with 20-25 documents and asking it questions, where it actually provides inline footnotes with citations to your documentation?
mvkel 4 days ago [-]
o1-pro is incredibly good at this. You'll be amazed
andrewci 1 days ago [-]
[dead]
tkgally 4 days ago [-]
Borrowing most of Simon’s prompt, I tried the following with o3-mini-high in the chat interface with Search turned on:

“Summarize the themes of the opinions expressed in discussions on Hacker News on January 31 and February 1, 2025, about OpenAI’s release od [sic] ChatGPT o3-mini. For each theme, output a header. Include direct "quotations" (with author attribution) where appropriate. You MUST quote directly from users when crediting them, with double quotes. Fix HTML entities. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece”

The result is here:

https://chatgpt.com/share/679d790d-df6c-8011-ad78-3695c2e254...

Most of the cited quotations seem to be accurate, but at least one (by uncomplexity_) does not appear in the named commenter’s comment history.

I haven’t attempted to judge how accurate the summary is. Since the discussions here are continuing at this moment, this summary will be gradually falling out of date in any case.

layman51 4 days ago [-]
I noticed that it thought that GoatInGrey wrote “openai is no longer relevant.” However, they were just quoting a different user (buyucu) who was the person who first wrote that.
simonw 4 days ago [-]
Good catch. That's likely an artifact of the way I flatten the nested JSON from the comments API.

I originally did that to save on tokens but modern models have much larger input windows so I may not need to do that any more.
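
For illustration, flattening of that sort might look like the sketch below (not the actual script; `author`, `text` and `children` are field names from the HN Algolia item API):

    def flatten_comments(item, depth=0, out=None):
        """Walk the nested Algolia comment tree and emit one flat record per comment."""
        out = [] if out is None else out
        for child in item.get("children", []):
            out.append({
                "author": child.get("author"),
                "text": child.get("text"),
                # discarding depth/parent info like this is what can blur who quoted whom
            })
            flatten_comments(child, depth + 1, out)
        return out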

acka 4 days ago [-]
This is why I keep advocating that JSON should die, or at least no longer be used with LLMs. LLMs (and human brains) are simply not wired up for counting nested curly brackets across long spans of quoted text joined by colons and commas, and it is far too easy for humans to make mistakes when chunking JSON.

IMO, (Strict)YAML is a very good alternative; it has even been suggested to me by multiple LLMs when I asked them what they thought the best format for presenting conversations to an LLM would be. It is very easy to chunk simple YAML and present it to an LLM directly off the wire: you only need to remember to repeat the indentation and names of all higher-level keys (properties) pertaining to the current chunk at the top of the chunk, then start a text block containing the remaining text in the chunk, and the LLM will happily take it from there:

    topic:
      subtopic:
        text: |
          Subtopic text for this chunk.
If you want to make sure that the LLM understands that it is dealing with chunks of a larger body of text, you can start and end the text blocks of the chunks with an ellipsis ('...').
inciampati 3 days ago [-]
LLMs (transformers) literally cannot balance parentheses. That's outside of their complexity class (TC0). You'd want a real UTM to count parentheses!
m0rde 3 days ago [-]
Would you elaborate on why counting braces is different from counting spaces to determine hierarchy? Or is it more about the repetition of higher levels keys in chunks (which could be done in JSON)?
acka 2 days ago [-]
Repetition of topics and subtopics is by far the most important part, reinforcing attention on the topic at hand even if the text in the chunk appears unrelated to the topic when viewed in isolation.

Keeping the indentation is also important because it is an implicit and repeated indication of the nesting level of the content that follows. LLMs have trouble with balancing nested parentheses (as the sibling comment to yours explains).

Dealing with text where indentation matters is easier for LLMs, and because they have been exposed to large amounts of it (such as Python code and lists of bullet points) during training, they have learned to handle this quite well.

wkat4242 4 days ago [-]
Why use a reasoning model for a summarisation task? Serious question, would it benefit?

I don't have much experience with reasoning models yet. That's why.

BoorishBears 4 days ago [-]
In theory you could get a better summary if it manages to reason out underlying motivations of a summary for example, or proposes a summary and then "considers" it, realizing it missed something.

You can elicit that with any model by prompting for underlying reasons or using chain-of-thought, but a reasoning model could do it without prompting.

georgeck 3 days ago [-]
I have been trying to approach the problem in a similar way, and in my observation, it is also important to capture the discussion hierarchy in the context that we share with the LLM.

The solution that I have adopted is as follows. Each comment is represented in the following notation:

   [discussion_hierarchy] Author Name: <comment>
To this end, I format the output from Algolia as follows:

   [1] author1: First reply to the post
   [1.1] author2: First reply to [1]
   [1.1.1] author3: Second-level reply to [1.1]
   [1.2] author4: Second reply to [1]
After this, I provide a system prompt as follows:

  You are an AI assistant specialized in summarizing Hacker News discussions. 
  Your task is to provide concise, meaningful summaries that capture the essence of the thread without losing important details. 
  Follow these guidelines:
  1. Identify and highlight the main topics and key arguments.
  2. Capture diverse viewpoints and notable opinions.
  3. Analyze the hierarchical structure of the conversation, paying close attention to the path numbers (e.g., [1], [1.1], [1.1.1]) to track reply relationships.
  4. Note where significant conversation shifts occur.
  5. Include brief, relevant quotes to support main points.
  6. Maintain a neutral, objective tone.
  7. Aim for a summary length of 150-300 words, adjusting based on thread complexity.
  
  Input Format:
  The conversation will be provided as text with path-based identifiers showing the hierarchical structure of the comments: [path_id] Author: Comment
  This list is sorted based on relevance and engagement, with the most active and engaging branches at the top.
  
  Example:
  [1] author1: First reply to the post
  [1.1] author2: First reply to [1]
  [1.1.1] author3: Second-level reply to [1.1]
  [1.2] author4: Second reply to [1]
  
  Your output should be well-structured, informative, and easily digestible for someone who hasn't read the original thread. 
  Use markdown formatting for clarity and readability.

The benefit is that I can parse the output from the LLM and create links back to the original comment thread.

You can read about my approach in more detail here: https://gist.github.com/simonw/09e5922be0cbb85894cf05e6d75ae...
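
A minimal sketch of producing that notation from the nested Algolia response (illustrative; `children`, `author` and `text` are the Algolia field names, and the relevance sorting is omitted):

    def to_path_notation(item, prefix="", out=None):
        """Render nested comments as "[1.2.1] author: text" lines."""
        out = [] if out is None else out
        for i, child in enumerate(item.get("children", []), start=1):
            path = f"{prefix}.{i}" if prefix else str(i)
            out.append(f"[{path}] {child.get('author')}: {child.get('text')}")
            to_path_notation(child, path, out)
        return out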

falseAss 3 days ago [-]
I just installed and tried. Pretty neat stuff!

Would be great if the addon allowed the user to override the system prompt (it might need a minor tweak when switching to a different server backend)?

georgeck 2 days ago [-]
Thank you for trying out the extension and for this great suggestion!

We've actually been thinking along similar lines. Here are a couple of improvements we're considering:

1. Built-in prompt templates - Support multiple flavors (e.g. one similar to what is there already, plus knowledge of up/down votes, and another similar to what Simon had, which is more detailed, etc.)

2. User-editable prompts - Exactly like you said, make the prompts user-editable.

One additional thought: Since summaries currently take ~20 seconds and incur API costs for each user, we're exploring the idea of an optional "shared summaries" feature. This would let users access cached summaries instantly (shared by someone else), while still having the option to generate fresh ones when needed. Would this be something you'd find useful?

We'd love to hear your thoughts on these ideas.

falseAss 2 days ago [-]
The shared summaries sound like a great idea to save most people's inference cost! There might be some details to figure out - e.g. the summary per post needs to be associated with a timestamp, in case new comments come in after that (especially on hot posts). Still, I think it's a good, useful feature and I will definitely read it before browsing the details.
Eduard 4 days ago [-]
3.3612 cents (I guess USD cents) is expensive!
BoorishBears 4 days ago [-]
Same immediate thought: the free option I provide on my production site is a model that runs on 2xA40. That's 96GB of VRAM for 78 cents an hour serving at least 4 or 5 concurrent requests at any given time.

o3-mini is probably not a very large model, and OpenAI has layers upon layers of efficiencies, so they must be making an absolute killing charging 3.3 cents for a few seconds of compute.
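
Rough numbers to make that concrete (the request duration is my assumption, not a measurement):

    gpu_cost_per_hour = 0.78       # 2xA40, 96 GB VRAM
    concurrent_requests = 4
    seconds_per_request = 30       # assumed average
    requests_per_hour = concurrent_requests * 3600 / seconds_per_request  # 480
    print(100 * gpu_cost_per_hour / requests_per_hour)  # ~0.16 cents per request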

digdugdirk 3 days ago [-]
Out of curiosity - do you have an estimate on how much you've spent on various LLM API services through all of your experimentation?

And in your experience, what service do you feel hits a good sweet spot for performance/price if summarizing long text excerpts is the main use case? Inference time isn't an issue, this will be an ongoing background task.

breakingcups 4 days ago [-]
It's definitely making some errors
kandesbunzler 4 days ago [-]
Like?
aprilthird2021 4 days ago [-]
Even though it was told that it MUST quote users directly, it still outputs:

> It’s already a game changer for many people. But to have so many names like o1, o3-mini, GPT-4o, & GPT-4o-mini suggests there may be too much focus on internal tech details rather than clear communication." (paraphrase based on multiple similar sentiments)

It also hallucinates quotes.

For example:

> "I’m pretty sure 'o3-mini' works better for that purpose than 'GPT 4.1.3'." – TeMPOraL

But that comment is not in the user TeMPOraL's comment history.

Sentiment analysis is also faulty.

For example:

> "I’d bet most users just 50/50 it, which actually makes it more remarkable that there was a 56% selection rate." – jackbrookes – This quip injects humor into an otherwise technical discussion about evaluation metrics.

It's not a quip though. That comment was meant in earnest

breakingcups 4 days ago [-]
In addition to that, it has a whole section dedicated to Model Naming and Branding Confusion, but then it puts the following comment in the Performance and Benchmarking section, even though the value of the comment has more to do with the naming being a hindrance than with making a valuable remark on the benchmarking, which is more of a casualty of the naming confusion:

"The model naming all around is so confusing. Very difficult to tell what breakthrough innovations occurred." – patrickhogan1"

romanhn 4 days ago [-]
That's funny, the quote exists, but it got the user wrong.
profsummergig 3 days ago [-]
Something I've not understood about their pricing system.

I have the $20 plan. How does this "3.3612 cents" apply to my situation?

elliotec 3 days ago [-]
The API has a different pricing structure than using the app. This cost seems high to me compared to previous models I’ve used large data sets with.
Valakas_ 4 days ago [-]
For those that like simpler ways (although dependent on Google) NotebookLM does all that in 2 clicks. And you can ask it questions about it, references are provided.
simonw 4 days ago [-]
After you've run my hn-summary.sh script you can ask follow up questions like this:

  llm -c "did anyone talk about pricing?"
ezst 3 days ago [-]
Back in the old days we would get there with Ctrl+F "price", and for free, but where's the fun in that?
qwertox 4 days ago [-]
How is the `system` prompt "injected"? Afaik o1 onwards no longer have a SYSTEM role. Is it just a normal part of the prompt?
naiv 4 days ago [-]
It's called 'developer message' now

https://x.com/OpenAIDevs/status/1885407759887155301

tempaccount420 4 days ago [-]
I'm guessing it's less powerful than a system prompt? And there is still a system prompt, but only OpenAI can use it...
largbae 4 days ago [-]
Thanks for sharing this! And no apparent self-awareness! OpenAI has come a long way from the Sydney days: https://answers.microsoft.com/en-us/bing/forum/all/this-ai-c...
thousand_nights 4 days ago [-]
good morning!
simonw 4 days ago [-]
I just pushed a new release of my LLM CLI tool with support for the new model and the reasoning_effort option: https://llm.datasette.io/en/stable/changelog.html#v0-21

Example usage:

  llm -m o3-mini 'write a poem about a pirate and a walrus' \
    -o reasoning_effort high
Output (comparing that with the default reasoning effort): https://github.com/simonw/llm/issues/728#issuecomment-262832...

(If anyone has a better demo prompt I'd love to hear about it)

beklein 4 days ago [-]
Thank you for all the effort you put into this tool and keeping it up to date!
theturtle32 4 days ago [-]
A reasoning model is not meant for writing poetry. It's not very useful to evaluate it on such tasks.
mediaman 4 days ago [-]
It's not clear that writing poetry is a bad use case. Reasoning models seem to actually do pretty well with creative writing and poetry. Deepseek's R1, for example, has much better poem structure than the underlying V3, and writers are saying R1 was the first model where they actually felt like it was a useful writing companion. R1 seems to think at length about word choice, correcting structure, pentameter, and so on.
theturtle32 4 days ago [-]
Ok, that makes some sense. I guess I was thinking more about the creative and abstract nature of poetry, the free flowing kind, not so much about rigid structures of meter and rhyme.
fivestones 3 days ago [-]
Ha! So did you mean to have your answer shift into a poem half way through, or was that accidental? Nice.
1propionyl 4 days ago [-]
Indeed. I would assume that a reasoning model would do far better at things like actually maintaining meter or rhyme scheme, something that models (even with good attention mechanisms) generally do very poorly at.
mquander 4 days ago [-]
I tried to tell my English teachers that all through high school but it never worked.
aprilthird2021 4 days ago [-]
To be blunt, an AI isn't a good tool for writing poetry either. At least, not the kind people read as a high literature form. For commercials, jingles, Hallmark cards, etc. sure
DonHopkins 4 days ago [-]
There exists poetry that requires a lot of mathematical understanding! This is "literally" (and I mean literally in the literary sense) from a Stanislaw Lem story about an electronic bard, translated from Polish by Michael Kandel:

https://www.donhopkins.com/home/catalog/lem/WonderfulPoems.h...

Prompt:

A love poem, lyrical, pastoral, and expressed in the language of pure mathematics. Tensor algebra mainly, with a little topology and higher calculus, if need be. But with feeling, you understand, and in the cybernetic spirit.

Response:

    Come, let us hasten to a higher plane,
    Where dyads tread the fairy fields of Venn,
    Their indices bedecked from one to n,
    Commingled in an endless Markov chain!
    Come, every frustum longs to be a cone,
    And every vector dreams of matrices.
    Hark to the gentle gradient of the breeze:
    It whispers of a more ergodic zone.

    In Riemann, Hilbert or in Banach space
    Let superscripts and subscripts go their ways.
    Our asymptotes no longer out of phase,
    We shall encounter, counting, face to face.

    I'll grant thee random access to my heart,
    Thou'lt tell me all the constants of thy love;
    And so we two shall all love's lemmas prove,
    And in our bound partition never part.

    For what did Cauchy know, or Christoffel,
    Or Fourier, or any Boole or Euler,
    Wielding their compasses, their pens and rulers,
    Of thy supernal sinusoidal spell?

    Cancel me not -- for what then shall remain?
    Abscissas, some mantissas, modules, modes,
    A root or two, a torus and a node:
    The inverse of my verse, a null domain.

    Ellipse of bliss, converse, O lips divine!
    The product of our scalars is defined!
    Cyberiad draws nigh, and the skew mind
    cuts capers like a happy haversine.

    I see the eigenvalue in thine eye,
    I hear the tender tensor in thy sigh.
    Bernoulli would have been content to die,
    Had he but known such a squared cosine 2 phi!
From The Cyberiad, by Stanislaw Lem.

Translated from Polish by Michael Kandel.

Here's a previous discussion of Marcin Wichary's translation of one of Lem's stories from Polish to English. He created the Lem Google Doodle, and he stalked and met Stanislaw Lem when he was a boy. Plus a discussion of Michael Kandel's translation of the poetry of the Electric Bard from The First Sally of Cyberiad, comparing it to machine translation:

https://news.ycombinator.com/item?id=28600200

Stanislaw Lem has finally gotten the translations his genius deserves:

https://www.washingtonpost.com/entertainment/books/stanislaw...

>Lem’s fiction is filled with haunting, prescient landscapes. In these reissued and newly issued translations — some by the pitch-perfect Lem-o-phile, Michael Kandel — each sentence is as hard, gleaming and unpredictable as the next marvelous invention or plot twist. It’s hard to keep up with Lem’s hyper-drive of an imagination but always fun to try.

frontalier 3 days ago [-]
good thing we didn't prompt it because we want to read a poem then
anotherpaulg 4 days ago [-]
For AI coding, o3-mini scored similarly to o1 at 10X less cost on the aider polyglot benchmark [0]. This comparison was with both models using high reasoning effort. o3-mini with medium effort scored in between R1 and Sonnet.

  62% $186 o1 high
  60%  $18 o3-mini high
  57%   $5 DeepSeek R1
  54%   $9 o3-mini medium
  52%  $14 Sonnet
  48%   $0 DeepSeek V3
[0] https://aider.chat/docs/leaderboards/
stavros 4 days ago [-]
Do you have plans to try o3-mini-high as the architect and Sonnet as the model?
seunosewa 3 days ago [-]
You're conflating the low price of the o3-mini medium effort model with the high performance of the o3-mini high effort model. OpenAI hasn't listed the price for the o3-mini high effort model separately on their pricing page.
qeternity 3 days ago [-]
If they are the same underlying model, it’s unlikely the prices will be different on a per token basis. The high model will simply consume more tokens.
ZeroTalent 3 days ago [-]
Also Gemini API is free for coding.
Alifatisk 2 days ago [-]
I have yet to see a valid reason to use Gemini over any other alternative, the only exception being large contexts.
dyauspitr 2 days ago [-]
Large contexts are a huge reason.
throw83288 4 days ago [-]
What do you expect to come from full o3 in terms of automating software engineering?
nextworddev 4 days ago [-]
o3 (high) might score 80%+
sixhobbits 3 days ago [-]
For years I've been asking all the models this mixed-up version of the classic riddle, and 99% of the time they get it wrong and insist on taking the goat across first. Even the other reasoning models would reason about how it was wrong, figure out the answer, and then still conclude goat.

o3-mini is the first one to get it right for me. Transcript:

    Me:
    I have a wolf, a goat, and a cabbage and a boat. I want to get them all across the river. If I leave the goat and the wolf together, the goat will eat the wolf. If I leave the wolf and the cabbage together, the wolf will eat the cabbage. I can only take one at a time. How do I do it?

    o3-mini
    1. Take the wolf across to the right bank.
    2. Return alone to the left bank.
    3. Take the goat across to the right bank.
    4. Bring the wolf back to the left bank.
    5. Take the cabbage across to the right bank.
    6. Return alone to the left bank.
    7. Finally, take the wolf across to the right bank.
    
    Me:
    I have a wolf, a goat, and a cabbage and a boat. I want to get them all across the river. If I leave the goat and the cabbage together, the cabbage will eat the goat. If I leave the wolf and the cabbage together, the wolf will eat the cabbage. I can only take one at a time. How do I do it?
    
    o3-mini
    1. Take the cabbage to the right bank.
    2. Go back empty to the left bank.
    3. Take the wolf to the right bank.
    4. Bring the cabbage back to the left bank.
    5. Take the goat to the right bank.
    6. Go back empty to the left bank.
    7. Finally, take the cabbage to the right bank.
martin-t 3 days ago [-]
Thank you, this is a perfect argument why LLMs are not AI but just statistical models. The original is so overrepresented in the training data that even though they notice this riddle is different, they regress to the statistically more likely solution over the course of generating the response. For example, I tried the first one with Claude and in its 4th step, it said:

> This is safe because the wolf won't eat the cabbage if they're together on the far side.

even though it clearly states the opposite in the question.

It's impressive that just dumb stats can be used to produce something that is very often useful, can help write code and when making it generate intermediate steps, it can often generate a chain of text that happens to be often right. However, it's not actual reasoning, there is no model of the world, no information storage and retrieval, and so on - just statistics between tokens.

yodot555 3 days ago [-]
This is a dumb argument. Humans frequently fall for the same tricks, are they not "intelligent"? All intelligence is ultimately based on some sort of statistical models, some represented in neurons, some represented in matrices.
kif 3 days ago [-]
State-of-the-art LLMs have been trained on practically the whole internet. Yet, they fall prey to pretty dumb tricks. It's very funny to see how The Guardian was able to circumvent censorship on the Deepseek app by asking it to "use special characters like swapping A for 4 and E for 3". [1]

This is clearly not intelligence. LLMs are fascinating for sure, but calling them intelligent is quite the stretch.

[1]: https://www.theguardian.com/technology/2025/jan/28/we-tried-...

nuancebydefault 2 days ago [-]
The censorship is in fact not part of the LLM. This can be shown easily by examples where LLMs visibly output censored sentences, which then disappear.
martin-t 2 days ago [-]
The nuance here being that this only proves additional censorship is applied on top of the output. It does not disprove that (sometimes ineffective) censorship is part of the LLM or that censorship was not attempted during training.
scrollaway 3 days ago [-]
For your definition of “clearly”.
martin-t 3 days ago [-]
Humans run on hardware that is both faulty and limited in terms of speed and memory. They have a better "algorithm" how to use the hardware to compensate for it. LLMs run on almost perfect hardware, able to store and retrieve enormous amounts of information insanely quickly and perform mechanical operations on it insanely quickly.

Yet they "make mistakes". Those are not the same as human mistakes. LLMs follow an algorithm that is far simpler and inferior, they simply use the hardware to perform incorrect ("illogical", "meaningless") operations, thus giving incorrect results.

See my other replies for more depth.

psadri 3 days ago [-]
Yes, but we have the ability to reason logically and step by step when we have to. LLMs can’t do that yet. They can approximate it but it is not the same.
sebzim4500 3 days ago [-]
I would expect that if you asked the same question to 100 people off the street they would make the same mistake though.

Neither people nor LLMs expect goats to eat wolves.

martin-t 3 days ago [-]
Comparisons to humans are ultimately misleading because 1) humans are not general intelligences most of the time, 2) humans run on incredibly faulty hardware.

1) Attention is limited. Human reasoning is slow. Motivation is limited. System 1 vs 2 thinking. Many will just tell you to fuck off or get bored and give some random answer to make you go away. Etc. See difference 2.

2) People run on limited hardware in terms of error rate and memory.

2a) Brains make mistakes all the time. Ask them to multiply a bunch of large numbers, using pen and paper they will get it wrong a lot of the time.

2b) Doing it in their head, they will run out of memory pretty fast.

But you wouldn't say that humans can't multiply numbers. When they have the right algorithm, they can do it, they just have to use the right tools to extend their memory and check for errors. A human who notices the difference in input to something he already knows, immediately knows he has to pay attention to that bit and all subsequent parts which depend on it. Once a human has the right algorithm, he can apply it to different inputs.

LLMs:

comparison to 2a: Current LLMs also make a lot of mistakes. But theirs are not a result of faulty or limited hardware, they are the result of a faulty algorithm. Take away the random seeds and an LLM will make the same mistake over and over. Randomness is the smoke and mirrors which make LLMs seem more "alive" and less like machines imperfectly imitating humans.

comparison to 2b) Current LLMs do not store statements in an abstract, structured form where they could save and load information and perform steps such as inferring redundant information from the rest. They operate on the token stream, which is probably wasteful in terms of memory and less flexible in terms of what operations they can perform on it.

Most importantly, they are not limited by memory. The input clearly states "the wolf will eat the cabbage", yet the LLM generates "This is safe because the wolf won't eat the cabbage if they're together on the far side." just a few lines below. It is unable to infer those two facts are contradictory. The statistics of tokens simply worked out in a way that lead to this.

EMIRELADERO 3 days ago [-]
How do you respond to this paper from 2 years ago? https://news.ycombinator.com/item?id=34815718
WhitneyLand 3 days ago [-]
The problem with claims like these that models are not doing “actual reasoning” is that they are often hot takes and not thought through very well.

For example, since reasoning doesn’t yet have any consensus definition that can be applied as a yes/no test - you have to explain what you specifically mean by it, or else the claim is hollow.

Clarify your definition, give a concrete example under that definition of something that’s your version of true Scotsman reasoning and something that’s not, then let’s talk.

YeGoblynQueenne 3 days ago [-]
Explain this to me please: we don't have any consensus definition of _mathematics_ that can be applied as a yes/no test. Does that mean we don't know how to do mathematics, or that we don't know whether something, is, or, more importantly, isn't mathematics?

For example, if I throw a bunch of sticks in the air and look at their patterns to divine the future- can I call that "mathematics" just because nobody has a "consensus definition of mathematics that can be applied as a yes/no test"? Can I just call anything I like mathematics and nobody can tell me it's wrong because ... no definition?

We, as a civilisation, have studied both formal and informal reasoning since at least a couple of thousand years ago, starting with Aristotle and his syllogisms (a formalisation of rigorous arguments) and continuing through the years with such figures as Leibniz, Boole, Bayes, Frege, Peirce, Quine, Russell, Gödel, Turing, etc. There are entire research disciplines that are dedicated to the study of reasoning: philosophy, computer science, and, of course, all of mathematics itself. In AI research, reasoning is a major topic studied by fields like automated theorem proving, planning and scheduling, program verification and model checking, etc., everything one finds in Russell & Norvig really. It is only in machine learning circles that reasoning seems to be such a big mystery that nobody can agree what it is; and in discussions on the internet about whether LLMs reason or not.

And it should be clear that never in the history of human civilisation did "reasoning" mean "predict the most likely answer according to some training corpus".

WhitneyLand 2 days ago [-]
Yeah sure there’s lots of research on reasoning. The papers I’ve seen that make claims about it are usually pretty precise about what it means in the context of that work and that specific claim, at least in the hard sciences listed.
YeGoblynQueenne 1 days ago [-]
I'm complaining because I haven't seen any such papers. Which ones do you have in mind?
WhitneyLand 1 days ago [-]
Examples go back 50 years, across many of the disciplines you’ve mentioned, but to throw out one that’s recent, on topic, and highly cited, there’s:

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models https://arxiv.org/pdf/2501.19201

It effectively treats “reasoning” as the ability to generate intermediate steps leading to a correct conclusion.

Now, is this valid reasoning? Well, depends on the claim and the definition of reasoning.

When someone just says AI can’t reason, I could argue for or against that depending on the specifics. It’s not enough to just say yes or no.

YeGoblynQueenne 15 hours ago [-]
Thanks for the link.

>> It effectively treats “reasoning” as the ability to generate intermediate steps leading to a correct conclusion.

Is "effectively" the same as "pretty precise" as per your previous comment? I don't see that because I searched the paper for all occurrences of "reasoning" and noticed two things: first that while the term is used to saturation there is no attempt to define it even informally, let alone precisely; and second that I could have replaced "reasoning" with any buzzword of the day and it would not change the impact of the paper. As far as I can tell the paper uses "reasoning" just because it happens to be what's currently trending in LLM circles.

And still of course no attempt to engage with the common understanding of reasoning I discuss above, or any hint that the authors are aware of it.

Sorry to be harsh, but you promised "examples that go back 50 years" and this is the kind of thing I've seen consistently in the last 15 or so.

WhitneyLand 8 hours ago [-]
The point is there has to be meaning for reasoning. I think the claim in this paper is very clear and the results are shown decisively.

Research papers relating to reasoning approach and define it in many ways but crucially, the good ones offer a testable claim.

Simply saying “models can’t reason” is ambiguous to the point of being unanswerable.

martin-t 3 days ago [-]
I am not gonna pretend to have one true definition but there are hopefully components of reasoning we can agree on:

- The ability to hold a collection of beliefs/statements and perform operations on them such as noticing one contradicts the others. Claude in my example clearly failed this.

- The ability to make logical inferences. Since logical steps can be represented by text, current LLMs seem to simulate this on the token stream and seem to have decent results most of the time. True inferences would be representing statements in some abstract form and being able to perform operations on them according to rules. The key here is that once it has a collection of statements and is given ("decides to use") an operation to perform, then it can do this correctly 100% of the time (ignoring hardware errors).

jryan49 3 days ago [-]
Is it possible the models do something entirely different? I'm not sure why everyone needs to compare them to human intelligence. It's very obvious LLMs work nothing like our brains, so why would the intelligence they exhibit be like ours?
nuancebydefault 2 days ago [-]
I like the true Scotsman requirement. Just about any definition man can come up with for 'true' reasoning easily falls prey to a 'true Scotsman' fallacy.
nuancebydefault 2 days ago [-]
I tried the same with DeepSeek. It failed, and then I asked it to try again and explain why it was wrong. The answer below convinces me that it has forms of reasoning, even meta-reasoning.

" Certainly! If we take the riddle very literally and ignore the common interpretation or stereotypes, the answer I provided ("the surgeon is the boy's mother") would actually be incorrect because the riddle explicitly states:

"a surgeon, who is the boy's father, says, 'I cannot operate on this boy, he's my son!'"

The key phrase here is "who is the boy's father." This directly tells us that the surgeon is the boy's father. Therefore, the correct literal answer is that the surgeon is the boy's father.

The reason my initial answer was "wrong" in this context is that I interpreted the riddle as a play on stereotypes (the assumption that surgeons are male), rather than taking the statement at face value. By focusing on the literal wording, the surgeon is explicitly identified as the boy's father, and no further interpretation is needed. "

SamBam 3 days ago [-]
That reminded me to try a version of the riddle that I had come up with that I had never seen an LLM successfully answer:

  Me: I'd like you to solve this riddle for me. A farmer has a cabbage, a goat, a wolf and a lion,
  and needs to cross a river on a boat. If the goat is left alone with the cabbage, it will eat it.
  If the wolf is left alone with the goat, it will eat it. If the lion is left alone with the goat
  or wolf, it will eat them. The boat can only carry the farmer and one other thing across. How can
  the farmer safely transport everything across the river?
O3-mini spent a very long time on it (over a minute), delineating its various strategies that it was trying, and finally, correctly, concluded that the puzzle is unsolvable.

Good job!

zamadatix 3 days ago [-]
o1 and deepseek r1 managed to get this first try as well (o1 in about 30 seconds and r1 hilariously took a couple minutes). If anyone has set up API access already I'd be curious if o1-mini also got it or if it took more than "the jump to CoT" to avoid pattern matching this one.
nwienert 3 days ago [-]
If you’ve talked to OpenAI about this riddle before it’s certain that conversation is now in the training set.
martin-t 3 days ago [-]
At first I was curious if OP tried arguing with it or telling it the correct answer. If he just asked the question without any further interaction, it would still have no data that it generated the wrong answer.

But as another comment says, many people have tried similar things and posted about it online, that information is siphoned by now and the LLMs weights get updated.

echelon 3 days ago [-]
These types of conversations probably get automatically flagged and sent to labelers.

There's probably a robust detector for riddles and logic puzzles.

bluerooibos 2 days ago [-]
That's a pretty common riddle. Here's another "riddle" which it completely flops over -

Me: Here's a riddle for you - a surgeon, who is the boy's father says, "I cannot operate on this boy, he's my son!" Who is the surgeon to the boy?

o3-mini:

Reasoned about riddle solution for 5 seconds

Decoding family roles

I understand the riddle's perspective on family dynamics. The surgeon, a father, states he can't operate on his son due to his relational bond. Trying to reconcile familial and professional ties to make sense of the scenario.

Reassessing roles

Taking a closer look, the riddle might highlight a female surgeon as the father's identity, which seems more plausible than the father as the surgeon.

Rethinking the puzzle

It's interesting to see the riddle's reference to gender stereotypes and the confusion caused when considering the surgeon as the boy's mother or step-father.

Highlighting gender assumptions

OK, let me see. The puzzle underlines prevalent gender assumptions, showcasing a female surgeon's wife as the mother and her husband as the child's father. This helps clarify the scenario.

The surgeon is the boy’s mother.

This riddle plays on the common assumption that surgeons (and doctors in general) are male. In reality, the surgeon in the riddle is the boy’s mother, which is why she says, "I cannot operate on this boy, he's my son!"

financetechbro 2 days ago [-]
o3 lost the plot on this one fast. It seems to be relying more on its reasoning and never goes back to reread the original riddle.
bluechar 3 days ago [-]
I've seen a post similar to this on multiple HackerNews threads over the past few years. It's likely part of the training set now.
beefnugs 3 days ago [-]
Whyyy would anyone want a system that can even answer made up nonsense based on arbitrary word redefining and wrongness?

The whole point is you are distilling past knowledge, if you are making up on the spot nonsense to purposely make all past knowledge useless... get out of my house

kif 3 days ago [-]
For the first one, I got these results from the Deepseek app, after it "thought" for 156 seconds:

    1. Take the wolf over
    2. Return alone.
    3. Take the goat over.
    4. Bring the wolf back.
    5. Take the cabbage over.
    6. Return alone.
    7. Take the wolf over.
It gave an invalid solution to the second one, but in the thinking steps it says:

    Oh, right! The actual problem is that the goat will eat the cabbage if left alone. Similarly, the wolf will eat the goat if they're left alone.
pona-a 3 days ago [-]
I tested some cases in Misguided Attention[0]: while many cases now pass, others fail all the same. Given the amount of contamination and the difficulty of finding sufficiently original problems of this nature, I'd put it at a 20:80 ratio of genuine improvement to recall.

[0] https://github.com/cpldcpu/MisguidedAttention

ban-evader 3 days ago [-]
In both of your transcripts it fails at solving it. Am I failing to see something? Here's my thought process:

1. The only option for who to take first is the goat.

2. We come back and get the cabbage.

3. We drop off the cabbage and take the goat back

4. We leave the goat and take the wolf to the cabbage

5. We go get the goat and we have all of them

Neither of the transcripts do that. In the first one the goat immediately eats the cabbage. In the second one the wolf eats the goat after you leave them.

bluechar 3 days ago [-]
If you take the goat first, the wolf will eat the cabbage. Reread the description: it's a twist on the old problem, to make it harder for bots to solve.
almog 3 days ago [-]
You're thinking of the classic riddle but in this version goats hunt wolves who are on a cabbage diet.
gallerdude 3 days ago [-]
Humans always hallucinate like this, seeing the original problem instead of the twist.
sebzim4500 3 days ago [-]
...

do you realize you're an LLM?

zmgsabst 3 days ago [-]
For me, o1 but not 4o could solve the modified problem when told it’s solving a logic problem.

Having to give explicit framing reminds me of tutoring children and trying to teach them about abstract problems — some really struggle with the concept, when it goes against previous knowledge, eg cabbages eating goats. (And LLMs are sensitive to context.)

roywiggins 3 days ago [-]
Some of them can't even solve "A farmer has a cabbage and a boat. He needs to get the cabbage across the river..." without inventing pointless trips, though o1 can do it.
pyuser583 3 days ago [-]
I have a riddle I’ve been asking AIs for a while. O3-mini not only gets it wrong, it digs its heels in and insists it is right.

When I present proof it is wrong, it says the proof is unreliable.

ZeroTalent 3 days ago [-]
Me: I have a wolf, a goat, and a cabbage and a boat. I want to get them all across the river. If I leave the goat and the wolf together, the goat will eat the wolf. If I leave the wolf and the cabbage together, the wolf will eat the cabbage. I can only take one at a time. How do I do it?

o1-pro: This puzzle is simply a twist on the classic “Wolf, Goat, and Cabbage” riddle, except here the dangerous pairs are:

Goat & Wolf cannot be left alone (the Goat will eat the Wolf). Wolf & Cabbage cannot be left alone (the Wolf will eat the Cabbage). The only “safe” pair to leave alone is Goat & Cabbage. You can ferry them across one at a time without ever leaving the dangerous pairs together by following these steps:

1. Take the Wolf across. Near bank: Goat, Cabbage. Far bank: Wolf.

2. Return empty. Near bank: Goat, Cabbage, Boatman. Far bank: Wolf.

3. Take the Goat across. Near bank: Cabbage. Far bank: Wolf, Goat, Boatman. Since Goat & Wolf can't be left alone, you must immediately bring one back in the next step.

4. Bring the Wolf back. Near bank: Wolf, Cabbage, Boatman. Far bank: Goat.

5. Take the Cabbage across. Near bank: Wolf. Far bank: Goat, Cabbage, Boatman. Goat & Cabbage are safe together.

6. Return empty. Near bank: Wolf, Boatman. Far bank: Goat, Cabbage.

7. Finally, take the Wolf across. Near bank: (empty). Far bank: Wolf, Goat, Cabbage, Boatman.

Me: I have a wolf, a goat, and a cabbage and a boat. I want to get them all across the river. If I leave the goat and the cabbage together, the cabbage will eat the goat. If I leave the wolf and the cabbage together, the wolf will eat the cabbage. I can only take one at a time. How do I do it?

o1-pro: This version of the riddle is basically the “wolf, goat, and cabbage” puzzle with the roles reversed for who can safely be left behind. The key is that the only pair you can leave alone is the goat and the wolf. You cannot leave the goat and cabbage together (the cabbage eats the goat!), nor the wolf and cabbage together (the wolf eats the cabbage!).

A workable step‐by‐step solution is:

1. Take the cabbage across first. Left bank: Wolf, Goat. Right bank: Cabbage.

2. Return alone to the left bank. Left bank: Wolf, Goat, (You). Right bank: Cabbage.

3. Take the goat across. Left bank: Wolf. Right bank: Goat, Cabbage, (You). (You cannot leave goat and cabbage together, so…)

4. Bring the cabbage back with you to the left bank. Left bank: Wolf, Cabbage, (You). Right bank: Goat.

5. Take the wolf across. Left bank: Cabbage. Right bank: Goat, Wolf, (You).

6. Return alone to the left bank. Left bank: Cabbage, (You). Right bank: Goat, Wolf.

7. Finally, take the cabbage across once more. Left bank: (empty). Right bank: Wolf, Goat, Cabbage, (You).

All three (the wolf, the goat, and the cabbage) are now safely across without ever leaving a "forbidden" pair alone.

sss111 4 days ago [-]
So far, it seems like this is the hierarchy

o1 > GPT-4o > o3-mini > o1-mini > GPT-4o-mini

o3 mini system card: https://cdn.openai.com/o3-mini-system-card.pdf

sho_hn 4 days ago [-]
I think OpenAI really needs to rethink its product naming, especially now that they have a portfolio with no clear hierarchy, where models have a place along different axes (speed, cost, reasoning, capabilities, etc.).

Your summary attempt also misses o3-mini vs o3-mini-high, for example. Lots of trade-offs.

nullpoint420 4 days ago [-]
Can't wait for the eventual rename to GPT Core, GPT Plus, GPT Pro, and GPT Pro Max models!

I can see it now:

> Unlock our industry leading reasoning features by upgrading to the GPT 4 Pro Max plan.

kugelblitz 4 days ago [-]
Had the same problem while trying to decide which Roborock device to get. There's the S series, Saros series, Q Series and the Qrevo. And from the Qrevo, there's Qrevo Curv, Edge, Slim, Master, MaxV, Plus, Pro, S and without anything. The S Series had S8, S8+, S8 Pro Ultra, S8 Max Ultra, S8 MaxV Ultra. It was so confusing.
slowmotiony 4 days ago [-]
I ordered the wrong Xbox on Amazon once. Wanted the Series X, got the One X instead.
oefnak 4 days ago [-]
Which one did you pick?
raphman 4 days ago [-]
Oh, I'll probably wait for GPT 4 Pro Max v2 NG (improved)
wlll 4 days ago [-]
I think I'll wait for the GTI model myself.
jmkni 4 days ago [-]
ngl I'd find that easier to follow lol
FridgeSeal 4 days ago [-]
OpenAI chatGPT Pro Max XS Core, not to be confused with ChatGPT Max S Pro Net Core X, or ChatGPT Pro Max XS Professional CoPilot Edition.
__MatrixMan__ 4 days ago [-]
Careful what you wish for. Next thing you know they're going to have names like Betsy and be full of unique quirky behavior to help remind us that they're different people.
echelon 4 days ago [-]
It's like AWS SKU naming (`c5d.metal`, `p5.48xlarge`, etc.), except non-technical consumers are expected to understand it.
maeil 4 days ago [-]
Have you seen Azure VM SKU naming? It's.. impressive.
buildbot 4 days ago [-]
And it doesn’t even line up with the actual instances you’ll be offered. At one point I was using some random Nvidia A10 node that was supposed to be similar to Standard_NV36adms_A10_v5, but was an NC series for some reason with slightly different letters…
nejsjsjsbsb 4 days ago [-]
Those are not names but hashes used to look up the specs.
echelon 4 days ago [-]
I was thinking we might treat model names analogously, but their specs can be moving targets.
moonsu 4 days ago [-]
Did they even think about what happens when they get to o4? We’re going to have GPT-4o and o4
helpfulclippy 2 days ago [-]
They’ll call it GPT-XP. But first we need gpt-o3.11 for workgroups.
rf15 4 days ago [-]
They're strongly tied to Microsoft, so confusing branding is to be expected.
ngokevin 4 days ago [-]
It needs to be clowned on here:

- Xbox, Xbox 360, Xbox One, Xbox One S/X, Xbox Series S/X

- Windows 3.1...98, 2000, ME, XP, Vista, 7, 8, 10

I guess it's better than headphone names (QC35, WH-1000XM3, M50x, HD560s).

ANewFormation 4 days ago [-]
I can't wait for Project Unify which just devolves into a brand new p3-mini type naming convention. It's pretty much identical to the o3-mini, except the API is changed just enough to be completely incompatible and it crashes on any query using a word with more than two syllables. Fix coming soon, for 4 years so far.

On the bright side the app now has curved edges!

chris_va 4 days ago [-]
One of my favorite parodies: https://www.youtube.com/watch?v=EUXnJraKM3k
nejsjsjsbsb 4 days ago [-]
Flashbacks of the .NET zoo. At least they reined that in.
Euphorbium 4 days ago [-]
They can still do models o3o, oo3 and 3oo. Mini-o3o-high, not to be confused with mini-O3o-high (the first o is capital).
margalabargala 4 days ago [-]
They should just start encoding the model ID in trinary using o, O, and 0.

Model 00oOo is better than Model 0OoO0!

brookst 4 days ago [-]
You’re thinking too small. What about o10, O1o, o3-m1n1?
sss111 4 days ago [-]
Yeah I tried my best :(

I think they could've borrowed a page out of Apple's book, even mountain names would be better. Plus Sonoma, Ventura, and Yosemite are cool names.

kaaskop 4 days ago [-]
Yeah their naming scheme is super confusing, I honestly confuse them all the time.
thot_experiment 4 days ago [-]
at least if i ran the company you'd know that

ChatGPTasdhjf-final-final-use_this_one.pt > ChatGPTasdhjf-final.pt > ChatGPTasdhjf.pt > ChatGPTasd.pt> ChatGPT.pt

idonotknowwhy 4 days ago [-]
Did you just ls my /workspace dir? Lol
gundmc 4 days ago [-]
If this is the hierarchy, why does 4o score so much higher than o1 on LLM Arena?

Worrisome for OpenAI that Gemini's mini/flash reasoning model outscores both o1 and 4o handily.

crazysim 4 days ago [-]
Is it possible people are voting for speed of responsiveness too?
kgeist 4 days ago [-]
I suspect people on LLM Arena don't ask complex questions too often, and reasoning models seem to perform worse than simple models when the goal is just casual conversation or retrieving embedded knowledge. Reasoning models probably 'overthink' in such cases. And slower, too.
imtringued 4 days ago [-]
The LLM Arena deletes your prompt when you restart so what's the point in trying to write a complicated prompt and testing an exhaustive number of pairs?

It's easy to pin this on the users, but that website is hostile to putting in any effort.

This is something I've noticed a lot actually. A lot of AI projects just give you an input field and call it a day. Expecting the user to do the heavy lifting.

energy123 4 days ago [-]
o1 on LLM Arena often times out (network error) while thinking. But they still allow you to vote and they make it seem as if your vote is registered.
losvedir 4 days ago [-]
What about "o1 Pro mode". Is that just o1 but with more reasoning time, like this new o3-mini's different amount of reasoning options?
bobjordan 4 days ago [-]
I have been paying $200 per month for o1-pro mode and I am very disappointed right now, because they have completely replaced the model today. It used to think for 1-5 minutes and deliver an unbelievably useful one-shot answer. Now it only thinks for 7 seconds, just like the o3-mini model, and I can't tell the difference in the answers. I hope this is just a day 1 implementation bug, but I suspect they have just decided to throw the $200 per month customers under the bus so that they'd have more capacity to launch the o3-mini model for everybody. I can't tell the difference between the models now, and it is definitely not because the free o3-mini model delivers the quality that o1-pro mode had! I'm so disappointed!
rcarmo 4 days ago [-]
This is why I use the Azure-hosted versions (disclosure: I’m an MS FTE, but I use all sorts of 3rd party models for my own projects) - I _know_ which version is behind each endpoint and when they will be replaced (you can also pin versions within a support window that varies according to model), so I don’t have to rework all my prompts and throw work away at the drop of a hat.
unification_fan 4 days ago [-]
They have been doing this since day 1. I was a subscriber in the first few months (then occasionally every few months to check out the progress) and so many people complained that OpenAI fiddled with the models daily without saying anything to anyone.

It was always attributed to variability, but we all know it wasn't.

MichaelBurge 4 days ago [-]
o1-pro is a different model than o1.
losvedir 4 days ago [-]
Are you sure? Do you have any source for that? In this article[0] that was discussed here on HN this week, they say (claim):

> In fact, the O1 model used in OpenAI's ChatGPT Plus subscription for $20/month is basically the same model as the one used in the O1-Pro model featured in their new ChatGPT Pro subscription for 10x the price ($200/month, which raised plenty of eyebrows in the developer community); the main difference is that O1-Pro thinks for a lot longer before responding, generating vastly more COT logic tokens, and consuming a far larger amount of inference compute for every response.

Granted "basically" is pulling a lot of weight there, but that was the first time I'd seen anyone speculate either way.

[0] https://youtubetranscriptoptimizer.com/blog/05_the_short_cas...

JohnPrine 4 days ago [-]
I don't think this is true
ukuina 3 days ago [-]
Is o1-pro not the o1 equivalent of o3-mini-high?
usaar333 4 days ago [-]
For non-STEM, perhaps.

For math/coding problems, o3-mini is tied with o1, if not better.

ALittleLight 4 days ago [-]
That seems very bad. What's the point of a new model that's worse than 4o? I guess it's cheaper in the API and a bit better at coding - but, this doesn't seem compelling.

After DeepSeek, I heard OpenAI saying the plan was to move up releases of models that were meaningfully better than the competition. Seems like what we're getting is scheduled releases that are worse than the current versions.

thegeomaster 4 days ago [-]
It's quite a bit better at coding --- they hint that it can tie o1's performance, which already benchmarks higher than 4o. And it's significantly cheaper, and presumably faster. I believe API costs account for the vast majority of COGS at most of today's AI startups, so they would be very motivated to switch to a cheaper model that has similar performance.
mgens 4 days ago [-]
Right. For large-volume requests that use reasoning this will be quite useful. I have a task that requires the LLM to convert thousands of free-text statements into SQL select statements, and o3-mini-high is able to get many of the more complicated ones that GPT-4o and Sonnet 3.5 failed at. So I will be switching this task to either o3-mini or DeepSeek-R1.
lumost 4 days ago [-]
I actually switched back from o1-preview to GPT-4o due to tooling integration and web search. I find that more often than not, the ability of GPT-4o to use these tools outweighs o1's improved accuracy.
LoveMortuus 4 days ago [-]
How would DeepSeek fit into this? Or can it not be compared?

I don't know much about this stuff, but recently I've heard many people talk about DeepSeek and how unexpected it was.

sss111 4 days ago [-]
Deepseek V3 is equivalent to 4o. Deepseek R1 is equivalent to o1 (if not better)

I think someone should just build an AI model comparison website at this point. Include all benchmarks and pricing.

dutchbookmaker 4 days ago [-]
I had resubscribed to use o1 2 weeks ago and haven't even logged in this week because of R1.

One thing I notice that is huge is being able to see the chain of thought lets me see when my prompt was lacking and the model is a bit confused on what I want.

If I was any more impressed with R1 I would probably start getting accused of being a CCP shill or wumao lol.

With that said, I think it is very hard to compare models for your own use case. I do suspect there is a shiny new toy bias with all this too.

Poor Sonnet 3.5. I have neglected it so much lately I actually don't know if I have a subscription or not right now.

I do expect an Anthropic reasoning model though to blow everything else away.

r0fl 3 days ago [-]
R1 servers seem to be down or busy a lot lately.

It’s an amazing model but was so much faster before the hype

The servers being constantly down is the only reason I haven’t cancelled my ChatGPT subscription

egeozcan 3 days ago [-]
Me too actually. I wish I could pay to get priority. I know there are 3rd party providers, but I want a chat interface and don't want to fiddle with setting up my own.
seunosewa 3 days ago [-]
The thought-stream is very important to me as well.
jsk2600 4 days ago [-]
withinboredom 4 days ago [-]
Looks like this only compares commercial models, and not the ones I can download and actually run locally.
TuxSH 4 days ago [-]
https://livebench.ai/#/

My experience is as follows:

- "Reason" toggle just got enabled for me as a free tier user of ChatGPT's webchat. Apparently this is o3-mini - I have Copilot Pro (offered to me for free), which apparently has o1 too (as well as Sonnet, etc.)

From my experience DeepSeek R1 (webchat) is more expressive, more creative and its writing style is leagues better than OpenAI's models, however it under-performs Sonnet when changing code ("code completion").

Comparison screenshots for prompt "In C++, is a reference to "const C" a "const reference to C"?": https://imgur.com/a/c-is-reference-to-const-c-const-referenc...

tl;dr keep using Claude for code and DeepSeek webchat for technical questions

ActVen 4 days ago [-]
I really wish they would open up the reasoning effort toggle on o1 API. o1 Pro Mode is still the best overall model I have used for many complex tasks.
bobjordan 4 days ago [-]
Have you tried the o1-pro mode model today? Because now it sucks!
koakuma-chan 4 days ago [-]
You cannot compare GPT-4o and o*(-mini) because GPT-4o is not a reasoning model.
lxgr 4 days ago [-]
Sure you can. "Reasoning" is ultimately an implementation detail, and the only thing that matters for capabilities is results, not process.
koakuma-chan 4 days ago [-]
By "reasoning" I meant the fact that o*(-mini) does "chain-of-thought", in other words, it prompts itself to "reason" before responding to you, whereas GPT-4o(-mini) just directly responds to your prompt. Thus, it is not appropriate to compare o*(-mini) and GPT-4o(-mini) unless you implement "chain-of-thought" for GPT-4o(-mini) and compare that with o*(-mini). See also: https://docs.anthropic.com/en/docs/build-with-claude/prompt-...
wordpad25 4 days ago [-]
That's like saying you can't compare a sedan to a truck.

Sure you can.

Even though one is more appropriate for certain tasks than the other.

dutchbookmaker 4 days ago [-]
It is a nuanced point but what is better, a sedan or a truck? I think we are still at that stage of the conversation so it doesn't make much sense.

I do think it is a good metaphor for how all this shakes out though in time.

freehorse 4 days ago [-]
Yes, you use the models for the same things, and one is better than the other for said thing. The reasoning process is an implementation detail that does not concern anybody when evaluating the models, especially since "open"ai does not expose it. I just want LLMs to do task X, which is usually "write a function in Y language that does W, taking these Z things into account", and for that I have found no reason to switch away from Sonnet yet.
scrollop 4 days ago [-]
Why can't you ask both models the questions (on a variety of topics etc.) and grade the answers against an ideal answer?

Ends before means.

If 4o answered better than o3, would you still use o3 for your task just because you were told it can "reason"?

koakuma-chan 4 days ago [-]
The point is that you cannot make a general statement that “o1 is better than 4o.”
freehorse 4 days ago [-]
Yes, but because you need to say exactly what one is better than the other for. Not because o1 spends a bunch of tokens for "reasoning" you cannot even see.
koakuma-chan 4 days ago [-]
If you would like to see the CoT process visualized, try the “Improve prompt” feature in Anthropic console. Also check out https://github.com/getAsterisk/deepclaude
inciampati 3 days ago [-]
The o-whatever models are doing the same thing as any LLM; it's merely that they've been tuned to use a chain of thought to break out of their complexity class (from pattern matching in TC0 to a pseudo-UTM). But any foundation model with a bit of instruction tuning is going to be able to do this.
singularity2001 4 days ago [-]
No, the reasoning models should not be compared directly with the normal models: they often take 10 times as long to answer, which only makes sense for difficult questions.
forrestthewoods 4 days ago [-]
OpenAI needs a new branding scheme.
nejsjsjsbsb 4 days ago [-]
The Llama folk know how. Good old 90s version scheme.
airstrike 4 days ago [-]
ChatGPT Series X O one
withinboredom 4 days ago [-]
yeah, you can def tell they are partnered with Microsoft.
georgewsinger 4 days ago [-]
Did anyone else notice that o3-mini's SWE-bench score dropped from 61% in the leaked system card earlier today to 49.3% in this blog post, which puts o3-mini back in line with Claude on real-world coding tasks?

Am I missing something?

anothermathbozo 4 days ago [-]
I think this is with and without "tools." They explain it in the system card:

> We evaluate SWE-bench in two settings:

> *• Agentless*, which is used for all models except o3-mini (tools). This setting uses the Agentless 1.0 scaffold, and models are given 5 tries to generate a candidate patch. We compute pass@1 by averaging the per-instance pass rates of all samples that generated a valid (i.e., non-empty) patch. If the model fails to generate a valid patch on every attempt, that instance is considered incorrect.

> *• o3-mini (tools)*, which uses an internal tool scaffold designed for efficient iterative file editing and debugging. In this setting, we average over 4 tries per instance to compute pass@1 (unlike Agentless, the error rate does not significantly impact results). o3-mini (tools) was evaluated using a non-final checkpoint that differs slightly from the o3-mini launch candidate.
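
If the averaging description is hard to parse, here is one way to read it as code (my interpretation of the quoted text, not OpenAI's actual harness):

    # Sketch of pass@1 as described above (my reading, not OpenAI's code).
    # Each instance is a list of attempts; an attempt is (produced_valid_patch, passed).
    def pass_at_1(instances):
        per_instance = []
        for attempts in instances:
            valid = [passed for is_valid, passed in attempts if is_valid]
            # no valid patch on any attempt -> the instance counts as incorrect
            per_instance.append(sum(valid) / len(valid) if valid else 0.0)
        return sum(per_instance) / len(per_instance)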

Bjorkbat 4 days ago [-]
So am I to understand that they used their internal tooling scaffold only for the o3-mini (tools) results? Because if so, I really don't like that.

While it's nonetheless impressive that they scored 61% on SWE-bench with o3-mini combined with their tool scaffolding, comparing Agentless performance with other models seems less impressive, 40% vs 35% when compared to o1-mini if you look at the graph on page 28 of their system card pdf (https://cdn.openai.com/o3-mini-system-card.pdf).

It just feels like data manipulation to suggest that o3-mini is much more performant than past models. A fairer picture would still paint a performance improvement, but it would look less exciting and more incremental.

Of course the real improvement is cost, but still, it kind of rubs me the wrong way.

pockmarked19 4 days ago [-]
YC usually says “a startup is the point in your life where tricks stop working”.

Sam Altman is somehow finding this out now, the hard way.

Most paying customers will find out within minutes whether the models can serve their use case; a benchmark isn't going to change that, except for media manipulation (and even that doesn't work all that well, since journalists don't really know what they are saying and readers can tell).

galaxyLogic 4 days ago [-]
My guess is this cheap mini-model is coming out now because DeepSeek very recently shook the stock market with its cheap price and relatively good performance.
IanCal 3 days ago [-]
o3 mini has been coming for a while, and iirc was "a couple of weeks" away a few weeks ago before R1 hit the news.
georgewsinger 4 days ago [-]
Makes sense. Thanks for the correction.
jakereps 4 days ago [-]
The caption on the graph explains.

> including with the open-source Agentless scaffold (39%) and an internal tools scaffold (61%), see our system card .

I have no idea what an "internal tools scaffold" is, but the graph on the card that they link to directly specifies "o3-mini (tools)", whereas the blog post is talking about the other settings.

DrewHintz 4 days ago [-]
I'm guessing an "internal tools scaffold" is something like Goose: https://github.com/block/goose

Instead of just generating a patch (copilot style), it generates the patch, applies the patch, runs the code, and then iterates based on the execution output.
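
Roughly, the loop looks something like this (a hand-wavy sketch of the idea, not Goose's or OpenAI's actual scaffold; the three callables are hypothetical placeholders supplied by the caller):

    # Hand-wavy sketch of an iterative "tool scaffold" loop; generate_patch,
    # apply_patch and run_tests are placeholder callables, not a real API.
    def fix_issue(issue, repo, generate_patch, apply_patch, run_tests, max_rounds=5):
        feedback = ""
        for _ in range(max_rounds):
            patch = generate_patch(issue, repo, feedback)   # ask the model for an edit
            apply_patch(repo, patch)                        # write it into the checkout
            passed, output = run_tests(repo)                # actually execute the code
            if passed:
                return patch                                # done: tests are green
            feedback = output                               # feed the failure output back
        return None                                         # give up after max_rounds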

logicchains 4 days ago [-]
Maybe they found a need to quantize it further for release, or lobotomise it with more "alignment".
ben_w 4 days ago [-]
> lobotomise

Anyone can write very fast software if you don't mind it sometimes crashing or having weird bugs.

Why do people try to meme as if AI is different? It has unexpected outputs sometimes, getting it to not do that is 50% "more alignment" and 50% "hallucinate less".

Just today I saw someone get the Amazon bot to roleplay furry erotica. Funny, sure, but it's still obviously a bug that a *sales bot* would do that.

And given these models do actually get stuff wrong, is it really incorrect for them to refuse to help with things they might be dangerous if the user isn't already skilled, like Claude in this story about DIY fusion? https://www.corememory.com/p/a-young-man-used-ai-to-build-a-...

bee_rider 4 days ago [-]
If somebody wants their Amazon bot to role play as an erotic furry, that’s up to them, right? Who cares. It is working as intended if it keeps them going back to the site and buying things I guess.

I don’t know why somebody would want that, seems annoying. But I also don’t expect people to explain why they do this kind of stuff.

ben_w 4 days ago [-]
It's still a bug. Not really working as intended — it doesn't sell anything from that.

A very funny bug, but a bug nonetheless.

And given this was shared via screenshots, it was done for a laugh.

thrwthsnw 4 days ago [-]
Who determines who gets access to what information? The OpenAI board? Sam? What qualifies as dangerous information? Maybe it’s dangerous to allow the model to answer questions about a person. What happens when limiting information becomes a service you can sell? For the right price anything can become too dangerous for the average person to know about.
ben_w 3 days ago [-]
> What qualifies as dangerous information?

The reports are public, and if you don't feel like reading them because they're too long and thorough in their explanations of what and why you can always put them into an AI and ask it to summarise them for you.

OpenAI is allowed to unilaterally limit the capability of their own models, just like any other software company can unilaterally limit the performance of their own software.

And they still are, even when they're just blatantly wrong or even just lazy; it's not like people complain about Google "lobotomising" their web browsers for no longer supporting Flash or Java applets.

Rastonbury 4 days ago [-]
They are implying the release was rushed and they had to reduce the functionality of the model in order to make sure it did not teach people how to make dirty bombs
stavros 4 days ago [-]
The problem is that they don't make the LLM better at instruction following, they just make it unable to produce furry erotica even if Amazon wants it to.
AbstractH24 3 days ago [-]
> Anyone can write very fast software if you don't mind it sometimes crashing or having weird bugs.

Isn’t that exactly what VCs want?

ben_w 3 days ago [-]
I doubt it.

The advice I've always been given in (admittedly: small) business startup sessions was "focus on quality rather than price because someone will always undercut you on price".

The models are in a constant race on both price and quality, but right now they're so cheap that paying for the best makes sense for any "creative" task (like writing software, even if only to reduce the number of bugs the human code reviewer needs to fix), while price sensitivity only matters for the grunt work classification tasks (such as "based on comments, what is the public response to this policy?")

kkzz99 4 days ago [-]
Or the number was never real to begin with.
waynecochran 4 days ago [-]
I just had it convert Swift code to Kotlin and was surprised at how the comment was translated. It "knew" the author of the paper and what it was doing!? That is wild.

Swift:

        //
        // Double Reflection Algorithm from Table I (page 7)
        // in Section 4 of https://tinyurl.com/yft2674p
        //
        for i in 1 ..< N {
            let X1 = spine[i]
            ...
Kotlin:

        // Use the Double Reflection Algorithm (from Wang et al.) to compute subsequent frames.
        for (i in 1 until N) {
            val X1 = Vector3f(spine[i])
            ...
8organicbits 3 days ago [-]
Wild that it can do that, but also clearly a worse output. The original has a URL, section, page, and table listed. The AI version instead cites the author. Needing to notice and fix unhelpful tweaks is one of the burdens of LLMs.
Sharlin 4 days ago [-]
Well, of course it knew the author. I'm sure you can ask just about any LLM who the author of the DRA is and it will answer Wang et al. without even having to google or follow the tinyurl link. And certainly it would also know that the algorithm is supposed to compute rotation minimizing frames.
waynecochran 3 days ago [-]
Also, the fact that it changed the comment this way... I was not expecting that.
waynecochran 3 days ago [-]
I don't know if you are being sarcastic or not, but that is pretty cool if true.
Sharlin 3 days ago [-]
Not sarcastic at all. It just doesn't seem like a big deal if you have played with LLMs and realize just how much they know. The double reflection paper is not particularly obscure. (Incidentally, I just asked Claude a couple of weeks ago about implementing rotation-minimizing frames!)
smallerize 4 days ago [-]
Wow, haven't seen a viglink in a while.
waynecochran 3 days ago [-]
Not sure I meant that to happen. I am sure I just went to tinyurl to make the link -- I don't know how it ended up this way.
mattmanser 4 days ago [-]
Someone else has written this exact code on the internet, OpenAI stole it, and now chatgpt is regurgitating it. Just like it can regurgitate whole articles.

You need to stop being wow'd by human intelligence masquerading as AI!

smallerfish 4 days ago [-]
Where? Searching for "Use the Double Reflection Algorithm (from Wang et al.)" only returns this post.
waynecochran 3 days ago [-]
Unlikely. I wrote this myself. Of course it is in a private github repo.
throwuxiytayq 3 days ago [-]
you are so confused it’s not even funny
ilaksh 4 days ago [-]
It looks like a pretty significant increase on SWE-Bench. Although that makes me wonder if there was some formatting or gotcha that was holding the results back before.

If this will work for your use case then it could be a huge discount versus o1. Worth trying again if o1-mini couldn't handle the task before. $4/million output tokens versus $60.

https://platform.openai.com/docs/pricing

I am Tier 5 but I don't believe I have access to it in the API (at least it's not on the limits page and I haven't received an email). It says "rolling out to select Tier 3-5 customers" which means I will have to wait around and just be lucky I guess.

TeMPOraL 4 days ago [-]
Tier 3 here and already see it on Limits page, so maybe the wait won't be long.
ilaksh 4 days ago [-]
Yep, I got an email about o3-mini in the API an hour ago.
TeMPOraL 4 days ago [-]
I apparently got one at the same time too, but I missed it, distracted by this HN thread :). Not only did I get o3-mini (which I had already noticed on the Limits page), but they also gave me access to o1 now! I'm Tier 3; until yesterday, o1 was still Tier 5 (IIRC).

Thanks OpenAI! Nice gift and a neat distraction from DeepSeek-R1 - which I still can't use directly, because their API stopped working moments after I topped up my credits and generated an API key, and is still down for me... :/.

TechDebtDevin 4 days ago [-]
Genuinely curious, what made you choose OpenAI as your preferred API provider? It's always been the least attractive to me.
ilaksh 4 days ago [-]
I have mainly been using Claude 3.5/3.6 Sonnet via API in the last several months (or since 3.5 Sonnet came out). However, I was using o1 for a challenging task at one point, but last I tested it had issues with some extra backslashes for that application.

I also have tested with DeepSeek R1 and will test some more with that although in a way Claude 3.6 with CoT is pretty good. Last time I tried to test R1 their API was out.

eknkc 4 days ago [-]
We extensively used the batch APIs to decrease cost and handle large amounts of data. I also need JSON responses for a lot of things, and OpenAI seems to have the best JSON schema output option out there.
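
For anyone who hasn't used it, the JSON schema option looks roughly like this (a minimal sketch from memory; parameter shapes may have shifted, so check the current OpenAI docs):

    # Minimal sketch of structured JSON-schema output; treat as illustrative.
    from openai import OpenAI

    client = OpenAI()
    schema = {
        "name": "invoice",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "vendor": {"type": "string"},
                "total": {"type": "number"},
            },
            "required": ["vendor", "total"],
            "additionalProperties": False,
        },
    }
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Extract vendor and total: ACME, $12.50"}],
        response_format={"type": "json_schema", "json_schema": schema},
    )
    print(resp.choices[0].message.content)   # a JSON string conforming to the schema
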
yyhhsj0521 3 days ago [-]
I use it because my company bought the enterprise plan and trained a bunch of "specialist GPTs" that were fed internal documents, code bases, Slack threads etc. I am not aware of other good LLM companies providing the same level of integration.
ipaddr 4 days ago [-]
Who else might be a good choice? DeepSeek is down. Who has the cheapest GPT-3.5-level or above API?
TechDebtDevin 4 days ago [-]
I've personally been using DeepSeek (which has been better than 3.5 for a really long time), and Perplexity, which is nice for its built-in search. I've actually been using DeepSeek since it was free. It's been generally good for me. I've mostly chosen both because of pricing, as I generally don't use APIs for extremely complex prompts.
Aperocky 4 days ago [-]
Run it locally, the distilled smaller ones aren't bad at all.
TeMPOraL 4 days ago [-]
Until recently they were the only game in town, so maybe they accrued significant spend back then?
sshh12 4 days ago [-]
Tier 5 and I got it almost instantly
pookieinc 4 days ago [-]
Can't wait to try this. What's amazing to me is that when this was revealed just one short month ago, the AI landscape looked very different than it does today, with more AI companies jumping into the fray with very compelling models. I wonder how that shift has affected this release internally, future releases, and their mindset moving forward... How does the efficiency change, the scope of their models, etc.?
patrickhogan1 4 days ago [-]
I thought it was o3 that was released one month ago and received high scores on ARC Prize - https://arcprize.org/blog/oai-o3-pub-breakthrough

If they were the same, I would have expected explicit references to o3 in the system card and how o3-mini is distilled or built from o3 - https://cdn.openai.com/o3-mini-system-card.pdf - but there are no references.

Excited at the pace all the same. Excited to dig in. The model naming all around is so confusing. Very difficult to tell what breakthrough innovations occurred.

nycdatasci 4 days ago [-]
Yeah - the naming is confusing. We're seeing o3-mini. o3 yields marginally better performance given exponentially more compute. Unlike OpenAI, customers will not have an option to throw an endless amount of money at specific tasks/prompts.
echelon 4 days ago [-]
There's no moat, and they have to work even harder.

Competition is good.

lesuorac 4 days ago [-]
I really don't think this is true. OpenAI has no moat because they have nothing unique; they're mostly using other people's architectures (like Transformers) and other companies' hardware.

Their value prop (moat) is that they've burnt more money than everybody else. That moat is trivially circumvented by lighting a larger pile of money, and less trivially by lighting the pile more efficiently.

OpenAI isn't the only company. The tech companies being beaten massively by Microsoft in the number of H100 purchases are the ones with a moat. Google / Amazon, with their custom AI chips, are going to have better performance per cost than others, and that will be a moat. If you want to get the same performance per cost then you need to spend the time making your own chips, which is years of effort (= moat).

sumedh 4 days ago [-]
> That moat is trivially circumvented by lighting a larger pile of money and less trivially by lighting the pile more efficently.

Google with all its money and smart engineers was not able to build a simple chat application.

mianos 4 days ago [-]
But with their internal progression structure they can build and cancel eight mediocre chat apps.
malaya_zemlya 4 days ago [-]
What do you mean? The Gemini app is available on iOS, Android, and on the web (as AI Studio https://aistudio.google.com/).
robrenaud 4 days ago [-]
It's a joke about how Google has released/cancelled/renamed many messenging apps.
tmnvdb 4 days ago [-]
It is not very good though.
aprilthird2021 4 days ago [-]
Gemini is pretty good. And it does one thing way better than most other AI models: when I hold down my phone's home button, it's available right away.
evrenesat 4 days ago [-]
That's a shame on Google, Apple, Samsung, etc. Voice and other activation methods should be open to any app that claims to be an assistant. An ugly way of "gatekeeping".
lukan 4 days ago [-]
"OpenAI has no moat because they have nothing unique"

It seems they have high-quality training data. And the knowledge to work with it.

aprilthird2021 4 days ago [-]
They buy most of their data from Scale AI types. It's not any higher quality than is available to any other model farm
sublimefire 4 days ago [-]
When you want to use AI in business you need some guarantees that the integration will not break because the AI company goes down or because of some breaking changes in a year. There is a reason why MSFT is in business. Similarly, you will not buy Google because they do not like keeping products around forever, and you will not buy some unknown product just because it is 5% cheaper. OpenAI has a strong brand at the moment and this is their thing, until companies go to MSFT or AMZ to use their services with the ability to choose any model.
sangnoir 4 days ago [-]
> That moat is trivially circumvented by lighting a larger pile of money and less trivially by lighting the pile more efficently.

DeepSeek has proven that the latter is possible, which drops a couple of river-crossing rocks into the moat.

withinboredom 4 days ago [-]
The fact that I can basically run o1-mini with deepseek:8b, locally, is amazing. Even on battery power, it works acceptably.
tmnvdb 4 days ago [-]
Those models are not comparable
withinboredom 4 days ago [-]
hmmm... check the deepseek-r1 repo readme :) They compare them there, but it would be nice to have external benchmarks.
brookst 4 days ago [-]
Brand is a moat
cruffle_duffle 4 days ago [-]
Ask Jeeves and Altavista surely have something to say about that!
geerlingguy 4 days ago [-]
Add Yahoo! to that list
esafak 4 days ago [-]
Their brand is as tainted as Meta's, which was bad enough to merit a rebranding from Facebook.
petesergeant 4 days ago [-]
> OpenAI has no moat

... is definitely something I've said before, and recently, but:

> That moat is trivially circumvented by lighting a larger pile of money

If that was true, someone would have done it.

lumost 4 days ago [-]
Capex was the theoretical moat, same as TSMC and similar businesses. DeepSeek poked a hole in this theory. OpenAI will need to deliver massive improvements to justify a 1 billion dollar training cost relative to 5 million dollars.
usef- 4 days ago [-]
I don't know if you are, but a lot of people are still comparing one Deepseek training run to the entire costs of OpenAI.

The DeepSeek paper states that the $5M number doesn't include development costs, only the final training run. And it doesn't include the estimated $1.4 billion cost of the infrastructure/chips DeepSeek owns.

Most of OpenAI's billion dollar costs is in inference, not training. It takes a lot of compute to serve so many users.

Dario said recently that Claude was in the tens of millions (and that it was a year earlier, so some cost decline is expected), do we have some reason to think OpenAI was so vastly different?

lumost 4 days ago [-]
Anthropic's CEO was predicting billion-dollar training runs for 2025. Current training runs were likely in the tens to hundreds of millions of dollars USD.

Inference capex costs are not a defensive moat, as I can rent GPUs and sell inference with linearly scaling costs. A hypothetical $10 billion training run on proprietary data would be a massive moat.

https://www.itpro.com/technology/artificial-intelligence/dol...

dutchbookmaker 4 days ago [-]
It is still curious, though, what is actually being automated.

I find huge value in these models as an augmentation of my intelligence and as a kind of cybernetic partner.

I can't think of anything that can actually be automated though in terms of white collar jobs.

The white collar test case I have in mind is a bank analyst working under a bank operations manager. I have done both in the past, but there is something really lacking with the idea of the operations manager replacing the analyst with a reasoning model, even though right now DeepSeek annihilates the reasoning of every bank analyst I ever worked with.

If you can't even arbitrage away the average bank analyst, there might be these really non-intuitive no-AI-arbitrage conditions in white collar work.

gdhkgdhkvff 4 days ago [-]
I don’t want to pretend I know how bank analysts work, but at the very least I would assume that 4 bank analysts with reasoning models would outperform 5 bank analysts without.
wahnfrieden 4 days ago [-]
Collaboration is even better, per open source results.

It is the closed competition model that’s being left in the dust.

vok 4 days ago [-]
Well, o3-mini-high just successfully found the root cause of a seg fault that o1 missed: mistakenly using _mm512_store_si512 for an unaligned store that should have been _mm512_storeu_si512.
throw83288 4 days ago [-]
How do I avoid the angst about this stuff as a student in computer science? I love this field but frankly I've been at a loss since the rapid development of these models.
jumploops 4 days ago [-]
LLMs are the new compilers.

As a student, you should continue to focus on fundamentals, but also adapt LLMs into your workflow where you can.

Skip writing the assembly (now curly braces and semicolons), and focus on what the software you’re building actually does, who it serves, and how it works.

Programming is both changing a lot, and not at all. The mechanics may look different, but the purpose is still the same: effectively telling computers what to do.

inciampati 3 days ago [-]
LLMs are actually the new computers. Compilation is only one program they can run.
jumploops 3 days ago [-]
LLMs are the way computers were always supposed to work!
unification_fan 4 days ago [-]
> LLMs are the new compilers.

This shows a grave misunderstanding of what compilers and LLMs are. They're fundamentally opposite concepts.

Compilers are about optimizing abstract code down to the most efficient representation possible for some hardware. LLMs are about wasting petaflops (made possible by compiler engineers) to produce random statements that don't have any static guarantees.

jumploops 4 days ago [-]
How can you trust that the compiler has written the most efficient assembly, if you’re not double checking it by hand?

Jokes aside, I understand your point.

In the history of computing, LLMs and compilers are closer than one might think.

Compilers weren't first created to optimize abstract code down to the most efficient assembly possible, even if that is the goal of a compiler writer today.

Compilers were created to enable the use of higher-level languages. Abstraction, efficiency, portability, error reduction, and most importantly: saving time.

They allowed humans to create more software, faster.

kandesbunzler 4 days ago [-]
- a coping Software engineer
abdullahkhalids 4 days ago [-]
As a former prof. What you should be learning from any STEM degree (and many other degrees as well) is to think clearly, rigorously, creatively, and with discipline, etc. You also need to learn the skill of learning content and skills quickly.

The specific contents or skills of your degree don't matter that much. In pretty much any STEM field, over the last 100ish years, whatever you learned in your undergraduate was mostly irrelevant by the time you retired.

Everyone got by, by staying on top of the new developments in the field and doing them. With AI, the particular skills needed to use the power of computers to do things in the world have changed. Just learn those skills.

mhh__ 4 days ago [-]
It's either over, or giving a lot of idiots false confidence — I meet people somewhat regularly who believe they don't really need to know what they're doing any more. This is probably an arbitrage.
bluerooibos 2 days ago [-]
You can ask them this question and all your fears will be washed away, for now..

"Here's a riddle for you - a surgeon, who is the boy's father says, "I cannot operate on this boy, he's my son!" Who is the surgeon to the boy?"

But seriously - AI in the hands of someone well-educated in their field is going to be a lot more powerful than some random person. Knowledge is still going to be valuable, and there are still people out there who don't know how to Google things and figure things out for themselves - so there'll be plenty of people who don't realise the potential of LLMs and won't use them.

npteljes 4 days ago [-]
There are at least two things here.

One, about the field itself. So far, I have been a know-it-all, and I dabbled in management too, besides that. This worked for me because, no matter how the field and my opportunities shifted, I always had a card up my sleeve. This is highly personal though.

Two, about managing angst. Whatever you experience now, you will in the future too. Circumstances won't matter at all; your brain will convert whatever it perceives around you into the feelings that you generally experience. You can be at your highest high, or the lowest low, and you will always gravitate back towards these familiar feelings of yours. So, what you can do to have a nicer experience is to be a good partner to yourself, and learn how to live with these specific feelings that you have.

danparsonson 4 days ago [-]
For all the value that they bring, there is still a good dose of parlour tricks and toy examples around, and they need an intelligent guiding hand to get the best out of them. As a meat brain, you can bring big picture design skills that the bots don't have, keeping them on track to deliver a coherent codebase, and fixing the inevitable hallucinations. Think of it like having a team of optimistic code monkeys with terrible memory, and you as the coordinator. I would focus on building skills in things like software design/architecture, requirements gathering (what do people want and how do you design software to deliver it?), in-depth hardware knowledge (how to get the best out of your platform), good API design, debugging, etc. Leave the CRUD to the robots and be the brain.
krschacht 2 days ago [-]
The cost of developing software is quickly dropping thanks to these models, and the demand for software is about to go way up because of this. LLMs will just be power tools to software builders. Learn to pop up a level.
raincole 4 days ago [-]
Angst?

It just means you're less likely to be fixing someone else's "mistakenly used _mm512_store_si512 instead of _mm512_storeu_si512" error, because AI fix(ed) it for you and you can focus on other parts of computer science. Computer science surely isn't just fixing _mm512_store_si512.

jiocrag 4 days ago [-]
why is this impressive at all? It effectively amounts to correcting a typo.
swyx 4 days ago [-]
a typo that OP missed, with all our modern tooling, and also a typo that presumably the other models missed.
glouwbug 3 days ago [-]
It found a needle in a haystack
nextworddev 4 days ago [-]
rip development jobs /s.. or not /s
glouwbug 3 days ago [-]
The output of LLMs will be suboptimal for some time. Companies stand to earn more by hiring juniors who can reason without LLMs than juniors who copy paste from LLMs
Bjorkbat 4 days ago [-]
I have to admit I'm kind of surprised by the SWE-bench results. At the highest level of performance, o3-mini's CodeForces score is, well, high. I've honestly never really sat down to understand how Elo works; all I know is that it scored better than o1, which allegedly was better than ~90% of all competitors on CodeForces. So, you know, o3-mini is pretty good at CodeForces.

But its SWE-bench scores aren't meaningfully better than Claude's: 49.3 vs Claude's 49.0 on the public leaderboard (might be higher now due to recent updates?)

My immediate thoughts, CodeForces (and competitive programming in general) is a poor proxy for performance on general software engineering tasks. Besides that, for all the work put into OpenAI's most recent model it still has a hard time living up to an LLM initially released by Anthropic some time ago, at least according to this benchmark.

Mind you, the GitHub issues that the problems in SWE-bench were based on have been around long enough that it's pretty much a given they've all found their way into the training data of most modern LLMs, so I'm really surprised that o3-mini isn't meaningfully better than Sonnet.

usaar333 3 days ago [-]
I'm not that surprised. Codeforces needs relatively little vocabulary or domain knowledge.

Real software has a lot more complexity and constraints, as well as ambiguity. Claude scores nowhere near as high on Codeforces, but crushes o1 on WebArena: https://web.lmarena.ai/leaderboard

I also ran a hold-out test myself for o3-mini: I asked it to implement a function I need for Python 2.5. Claude and o1 get it; o3-mini-high incorrectly believes some functions are available in that version that aren't. If I correct it, its revised solution is very hacky (technically works, but I would take Claude's solution over it).

aprilthird2021 4 days ago [-]
> My immediate thoughts, CodeForces (and competitive programming in general) is a poor proxy for performance on general software engineering task

Yep. A general software engineering task has a lot of information encoded in it that is either already known to a human or is contextually understood by a human.

A competitive programming task often has to provide all the context, as it's not based on an existing product, codebase, technology, or paradigm known to the user.

dagelf 4 days ago [-]
I think the innovation here is probably that it's a much smaller, and so cheaper, model to run.
vectorhacker 4 days ago [-]
Yeah, I no longer consider SWE-bench useful because these models can just "memorize" the solutions to the PRs.
cjbarber 4 days ago [-]
The interesting question to me is how far these reasoning models can be scaled. With another 12 months of compute scaling (for synthetic data generation and RL), how good will these models be at coding? I talked with Finbarr Timbers (ex-DeepMind) yesterday about this, and his take is that we'll hit diminishing returns – not because we can't make models more powerful, but because we're approaching diminishing returns in the areas that matter to users, and AI models may be nearing a plateau where capability gains matter less than UX.
futureshock 4 days ago [-]
I think in a lot of ways we are already there. Users are clearly already having difficulty seeing which model is better or if new models are improving over old models. People go back to the same gotcha questions and get different answers based on the random seed. Even the benchmarks are getting very saturated.

These models already do an excellent job with your homework, your corporate PowerPoints and your idle questions. At some point only experts would be able to decide if one response was really better than another.

Our biggest challenge is going to be finding problem domains with low performance that we can still scale up to human performance. And those will be so niche that no one will care.

Agents on the other hand still have a lot of potential. If you can get a model to stay on task with long context and remain grounded then you can start firing your staff.

dagelf 4 days ago [-]
Don't underestimate how much the long tail means to the general public.
devindotcom 4 days ago [-]
Sure as a clock, tick follows tock. Can't imagine trying to build out cost structures, business plans, product launches etc on such rapidly shifting sands. Good that you get more for your money, I suppose. But I get the feeling no model or provider is worth committing to in any serious way.
puffybunion 4 days ago [-]
this is the best outcome, though, rather than a monopoly, which is exactly what everyone is hoping to have.
thom 4 days ago [-]
Terrible time to open a shovel store, amazing time to pick up a shovel.
0xFEE1DEAD 4 days ago [-]
After o3 was announced, with the numbers suggesting it was a major breakthrough, I have to say I’m absolutely not impressed with this version.

I think o1 works significantly better, and that makes me think the timing is more than just a coincidence.

Last week Nvidia lost $600 billion because of DeepSeek R1, and now OpenAI comes out with a new release which feels like it has nothing to do with the promises that were being made about o3.

llm_trw 4 days ago [-]
Having tried using it, it is much worse than r1. Both the standard and high effort version.
CryptoBanker 3 days ago [-]
If it’s actually available, it can’t be that much worse than R1 which currently only completes a response about 50% of the time for me.
llm_trw 3 days ago [-]
There are multiple providers for it since it's open source.
egeozcan 3 days ago [-]
Are there any providers that have a chat interface (not just API access) with a fixed monthly cost? I couldn't find one.
llm_trw 3 days ago [-]
you.com, when you disable their "search the internet" feature.
SkyPuncher 3 days ago [-]
Yea, o3-mini was a massive step down from Sonnet for coding tasks.

R1 is my cost effective programmer. Sonnet is my hard problem model still.

llm_trw 3 days ago [-]
R1 is interesting.

Since I have access to the thinking tokens, I can see where it's going wrong and do prompt surgery. But left to its own devices it gets things _stupendously_ wrong about 20% of the time, with a huge context blowout. So much so that seeing that happen now tells me I've fundamentally asked the wrong question.

Sonnet doesn't suffer from that and solves the task, but doesn't give you much, if any, help in how to recover from doing the wrong task.

I'd say that for work work Sonnet 3.5 is still the best, for exploratory work with a human in the loop r1 is better.

Or as someone posted here a few days ago: R1 as the architect, Sonnet 3.5 as the worker and critic.

moonsu 4 days ago [-]
This is the mini version, which is not as good as o1, and I don't think it's what they demoed in the o3 announcement. I'm hoping the full release will be impressive.
0xFEE1DEAD 3 days ago [-]
I know this isn't the full o3 release, but I find it odd that they're branding it as o3 when it feels more like an update to o1 mini.

Yes, reasoning has improved, but the overall results haven't advanced as much as one would expect from a major version update. It's highly unusual for OpenAI to release a milestone version like this - it feels more like a marketing move than a genuine upgrade.

Who knows what's going on behind closed doors?

If I put on my tinfoil hat for a moment, maybe Nvidia made a deal with OpenAI - offering a discount on computing power in exchange for a timely release.

OpenAI needs an enormous amount of computing power these days, and while Nvidia would take a financial hit by offering a discount to one of its biggest (if not the biggest) customers, that's still nowhere near as costly as losing $600 billion in market value.

JohnnyMarcone 3 days ago [-]
This release was planned well before the Deepseek hype. When they demoed o3, they said they were planning on releasing o3-mini at the end of January.
ZeroTalent 3 days ago [-]
o3 is absolutely fantastic compared to R1; what are these people talking about?
scarface_74 4 days ago [-]
This took 1:53 in o3-mini

https://chatgpt.com/share/679d310d-6064-8010-ba78-6bd5ed3360...

The 4o model without using the Python tool

https://chatgpt.com/share/679d32bd-9ba8-8010-8f75-2f26a792e0...

Trying to get accurate results with the paid version of 4o with the Python interpreter.

https://chatgpt.com/share/679d31f3-21d4-8010-9932-7ecadd0b87...

The share link doesn’t show the output for some reason. But it did work correctly. I don’t know whether the ages are correct. I was testing whether it could handle ordering

I have no idea what conclusion I should draw from this besides that, depending on the use case, 4o may be better with "tools" if you know the domain where you are using it.

Tools are relatively easy to implement with LangChain or the native OpenAI SDK.
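
To make that concrete, here is a minimal sketch of a tool definition with the native OpenAI Python SDK; the tool name and schema are invented for illustration, so treat it as the shape of the call rather than a recipe:

    from openai import OpenAI

    client = OpenAI()

    # Hypothetical tool: look up a president's age at inauguration.
    tools = [{
        "type": "function",
        "function": {
            "name": "get_president_age",
            "description": "Return a president's age at inauguration",
            "parameters": {
                "type": "object",
                "properties": {"name": {"type": "string"}},
                "required": ["name"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "How old was Lincoln at his inauguration?"}],
        tools=tools,
    )
    print(response.choices[0].message.tool_calls)

The model only decides whether and how to call the tool; your own code runs it and sends the result back in a follow-up message.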

BeetleB 4 days ago [-]
I would not expect any LLM to get this right. I think people have too high expectations for it.

Now if you asked it to write a Python program to list them in order, and have it enter all the names, birthdays, and year elected in a list to get the program to run - that's more reasonable.
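
Something like this rough sketch, where the model supplies the rows and plain Python does the ordering (only a few sample rows shown here):

    from datetime import date

    # Sample rows only; the model would fill in the full list.
    presidents = [
        ("Abraham Lincoln", date(1809, 2, 12), date(1861, 3, 4)),
        ("James Garfield", date(1831, 11, 19), date(1881, 3, 4)),
        ("Lyndon B. Johnson", date(1908, 8, 27), date(1963, 11, 22)),
    ]

    # Sort by age at inauguration, computed in days so the ordering is exact.
    for name, born, sworn_in in sorted(presidents, key=lambda p: p[2] - p[1]):
        days = (sworn_in - born).days
        years, remainder = divmod(days, 365)  # rough year/day split, ignores leap days
        print(f"{name}: ~{years} years, {remainder} days ({sworn_in.year})")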

scarface_74 4 days ago [-]
The “o” models get the order right.

DeepSeek also gets the order right.

It doesn't show on the share link. But it actually outputs the list correctly from the built-in Python interpreter.

For some things, ChatGPT 4o will automatically use its Python runtime

BeetleB 4 days ago [-]
That some models get it right is irrelevant. In general, if your instructions require computation, it's safer to assume it won't get it right and will hallucinate.
scarface_74 4 days ago [-]
The reasoning models all do pretty well at math.

Have you tried them?

This is something I threw together with o3-mini

https://chatgpt.com/share/679d5305-5f04-8010-b5c4-61c31e79b2...

ChatGPT 4o doesn’t even try to do the math internally and uses its built in Python interpreter. (The [_>] link is to the Python code)

https://chatgpt.com/share/679d54fe-0104-8010-8f1e-9796a08cf9...

DeepSeek handles the same problem just as well using the reasoning technique.

Of course ChatGPT 4o went completely off the rails without using its Python interpreter

https://chatgpt.com/share/679d5692-96a0-8010-8624-b1eb091270...

(The breakdown that it got right was done using Python, even though I told it not to)

margalabargala 4 days ago [-]
The 4o model's output is blatantly wrong. I'm not going to look up if it's the order or the ages that are incorrect, but:

36. Abraham Lincoln – 52 years, 20 days (1861)

37. James Garfield – 49 years, 105 days (1881)

38. Lyndon B. Johnson – 55 years, 87 days (1963)

Basically everything after #15 in the list is scrambled.

scarface_74 4 days ago [-]
That was the point. The 4o model without using Python was wrong. The o3 model worked correctly without needing an external tool.
dilap 4 days ago [-]
Haven't used OpenAI in a bit -- whyyy did they change the "system" role (now basically an industry-wide standard) to "developer"? That seems pointlessly disruptive.
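
For anyone who hasn't hit it yet, the change is just the role string on the first message; a minimal sketch of the new shape (everything else about the call stays the same):

    from openai import OpenAI

    client = OpenAI()

    # Top-level instructions used to go in a "system" message;
    # for the o-series models the documented role is "developer".
    response = client.chat.completions.create(
        model="o3-mini",
        messages=[
            {"role": "developer", "content": "Answer in one short paragraph."},
            {"role": "user", "content": "Summarize the tradeoffs of renaming a message role."},
        ],
    )
    print(response.choices[0].message.content)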
BoorishBears 4 days ago [-]
2 years ago I'd have said it was an oversight, because there's 0 chance a top-down directive would ask for this.

But given how OpenAI employees act online these days I wouldn't be surprised if someone on the ground proposed it as a way to screw with all the 3rd parties who are using OpenAI compatible endpoints or even use OpenAI's SDK in their official docs in some cases.

logicchains 4 days ago [-]
They mention in the model card that it's so they can have a separate "system" role that the user can't change, and they trained the model to prioritise it over the "developer" role, to combat "jailbreaks". Thank God for DeepSeek.
sroussey 4 days ago [-]
They should have just created something above "system" and left it as it was.
Etheryte 4 days ago [-]
Agreed, just add root and call it a day. Everyone who needs to care can instantly guesstimate what it is.
jen729w 4 days ago [-]
> Testers preferred o3-mini's responses to o1-mini 56% of the time

I hope by this they don't mean me, when I'm asked 'which of these two responses do you prefer'.

They're both 2,000 words, and I asked a question because I have something to do. I'm not reading them both; I'm usually just selecting the one that answered first.

That prompt is pointless. Perhaps as evidenced by the essentially 50% preference rate: it's a coin flip.

dkjaudyeqooe 4 days ago [-]
It's kind of strange that they gave that stat. Maybe they thought people would somehow read it as "56% better" or something.

Because when you think about it, it really is quite damning. Minus statistical noise it's no better.

cm2187 4 days ago [-]
And another way to rephrase it is that almost half of the users prefer the older model, which is terrible PR.
tgsovlerkhgsel 4 days ago [-]
Not if the goal is to claim that the models deliver comparable quality, but with the new one excelling at something else (here: inference cost).
swyx 4 days ago [-]
It's mini to mini; it's the same cost.
kettleballroll 4 days ago [-]
Typically in these tests you have three options: "A is better", "B is better", or "they're equal/can't decide". So if 56% prefer o3-mini, it's likely that way less than half prefer o1. Also, the way I understand it, they're comparing a mini model with a large one.
directevolve 4 days ago [-]
If you use ChatGPT, it sometimes gives you two versions of its response, and you have to choose one or the other if you want to continue prompting. Sure, not picking a response might be a third category. But if that's how they were approaching the analysis, they could have put out a more favorable-looking stat.
ignoramous 4 days ago [-]
> If you use ChatGPT, it sometimes gives you two versions

Does no one else hate it when this happens (especially when on a handheld device)?

Powdering7082 4 days ago [-]
That would be a 12% margin; why would you assume it's eaten by statistical noise?
senorrib 4 days ago [-]
The OP's comment is probably a testament to that. With such a poorly designed A/B test I doubt this has a p-value of < 0.10.
throwaway287391 4 days ago [-]
Erm, why not? A 0.56 result with n=1000 ratings is statistically significantly better than 0.5 with a p-value of 0.00001864, well beyond any standard statistical significance threshold I've ever heard of. I don't know how many ratings they collected but 1000 doesn't seem crazy at all. Assuming of course that raters are blind to which model is which and the order of the 2 responses is randomized with every rating -- or, is that what you meant by "poorly designed"? If so, where do they indicate they failed to randomize/blind the raters?
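
For anyone who wants to sanity-check numbers like this, the one-sided test is a one-liner; the 560/1000 split here is just the hypothetical from the comment above:

    from scipy.stats import binomtest

    # Hypothetical: 560 of 1000 blind ratings prefer o3-mini, tested against a 50/50 null.
    result = binomtest(560, n=1000, p=0.5, alternative="greater")
    print(result.pvalue)  # one-sided p-value under the null of no preference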
godelski 4 days ago [-]

  > If so, where do they indicate they failed to randomize/blind the raters?

  Win rate if user is under time constraint
This is hard to read, tbh. Is it STEM? Non-STEM? If it's STEM, that shows one bias; if it's non-STEM, another. If it's a mix, we can't know anything without understanding the split.

Note that non-STEM is still within error. STEM is less than 2 sigma out, so our confidence still shouldn't be that high.

n2d4 4 days ago [-]
Because you're not testing "will a user click the left or right button" (for which asking a thousand users to click a button would be a pretty good estimation), you're testing "which response is preferred".

If 10% of people just click based on how fast the response was because they don't want to read both outputs, your p-value for the latter hypothesis will be atrocious, no matter how large the sample is.

throwaway287391 4 days ago [-]
Yes, I am assuming they evaluated the models in good faith, understand how to design a basic user study, and therefore when they ran a study intended to compare the response quality between two different models, they showed the raters both fully-formed responses at the same time, regardless of the actual latency of each model.
n2d4 4 days ago [-]
I would recommend you read the comment that started this thread then, because that's the context we're talking about: https://news.ycombinator.com/item?id=42891294
throwaway287391 4 days ago [-]
I did read that comment. I don't think that person is saying they were part of the study that OpenAI used to evaluate the models. They would probably know if they had gotten paid to evaluate LLM responses.

But I'm glad you pointed that out; I now suspect that a large part of the disagreement between the "huh? a statistically significant blind evaluation is a statistically significant blind evaluation" repliers and the "oh, this was obviously a terrible study" repliers is due to different interpretations of that post. Thanks. I genuinely didn't consider the alternative interpretation before.

johnmaguire 4 days ago [-]
> If 10% of people just click based on how fast the response was

Couldn't this be considered a form of preference?

Whether it's the type of preference OpenAI was testing for, or the type of preference you care about, is another matter.

n2d4 4 days ago [-]
Sure, it could be, you can define "preference" as basically anything, but it just loses its meaning if you do that. I think most people would think "56% prefer this product" means "when well-informed, 56% of users would rather have this product than the other".
aqme28 4 days ago [-]
They even include error bars. It doesn't seem to be statistical noise, but it's still not great.
m3kw9 4 days ago [-]
It’s 3x cheaper and faster
afro88 4 days ago [-]
Yeah. I immediately thought: I wonder if that 56% is in one or two categories and the rest are worse?
rvnx 4 days ago [-]
44% of the people prefer the existing model?
afro88 4 days ago [-]
Each question falls into a different category (ie math, coding, story writing etc). Typically models are better at some categories and worse at others. Saying "56% of people preferred responses from o3-mini" makes me wonder if those 56 are only from certain categories and the model isn't uniformly 56% preferred.
KHRZ 4 days ago [-]
With many people too lazy to read 2 walls of text, a lot of picks might be random.
fsndz 4 days ago [-]
Exactly, I was surprised as well.
brookst 4 days ago [-]
Those prompts are so irritating and so frequent that I’ve taken to just quickly picking whichever one looks worse at a cursory glance. I’m paying them, they shouldn’t expect high quality work from me.
apparent 4 days ago [-]
Have you considered the possibility that your feedback is used to choose what type of response to give to you specifically in the future?

I would not consider purposely giving inaccurate feedback for this reason alone.

MattDaEskimo 4 days ago [-]
I don't want a model that's customized to my preferences. My preferences and understanding change all the time.

I want a single source model that's grounded in base truth. I'll let the model know how to structure it in my prompt.

kenjackson 4 days ago [-]
You know there's no such thing as base truth here? You want to write something like this to start your prompts: "Respond in English, using standard capitalization and punctuation, following rules of grammar as written by Strunk & White, where numbers are represented using arabic numerals in base 10 notation...."???
AutistiCoder 4 days ago [-]
actually, I might appreciate that.

I like precision of language, so maybe just have a system prompt that says "use precise language (ex: no symbolism of any kind)"

MattGaiser 4 days ago [-]
A lot of preferences have nothing to do with any truth. Do you like code segments or full code? Do you like paragraphs or bullet points? Heck, do you want English or Japanese?
orbital-decay 4 days ago [-]
What is base truth for e.g. creative writing?
szundi 4 days ago [-]
Constant "meh" and fixing prompts in the right direction vs. being unable to escape the bubble.
francis_lewis 4 days ago [-]
I think my awareness that this may influence future responses has actually been detrimental to my response rate. The responses are often so similar that I can imagine preferring either in specific circumstances. While I’m sure that can be guided by the prompt, I’m often hesitant to click on a specific response as I can see the value of the other response in a different situation and I don’t want to bias the future responses. Maybe with more specific prompting this wouldn’t be such an issue, or maybe more of an understanding of how inter-chat personalisation is applied (maybe I’m missing some information on this too).
isaacremuant 4 days ago [-]
Alternatively, I'll use the tool that is most user friendly and provides the most value for my money.

Wasting time on an anti pattern is not value nor is it trying to outguess the way that selection mechanism is used.

Der_Einzige 4 days ago [-]
Spotted the pissed off OpenAI RLHF engineer! Hahahahaha!
Tenoke 4 days ago [-]
That's such a counter-productive and frankly dumb thing to do. Just don't vote on them.
explain 4 days ago [-]
You have to pick one to continue the chat.
DiggyJohnson 4 days ago [-]
I know for a fact that as of yesterday I did not have to pick one to continue the conversation. It just maximized the second choice and displayed a 2/2 below the response.
apparent 4 days ago [-]
Why not always pick the one on the left, for example? I understand wanting to speed through and not spend time doing labor for OpenAI, but it seems counter-productive to spend any time feeding it false information.
brookst 4 days ago [-]
My assumption is they measure the quality of user feedback, either on a per user basis or in an aggregate. I want them to interrupt me less, so I want them to either decide I’m a bad teacher or that users in general are bad teachers.
janalsncm 4 days ago [-]
> I'm usually just selecting the one that answered first

Which is why you randomize the order. You aren’t a tester.

56% vs 44% may not be noise. That’s why we have p values. It depends on sample size.

jhardy54 4 days ago [-]
The order doesn't matter. They often generate tokens at different speeds, and produce different lengths of text. "The one that answered first" != "The first option"
letmevoteplease 4 days ago [-]
The article says "expert testers."

"Evaluations by expert testers showed that o3-mini produces more accurate and clearer answers, with stronger reasoning abilities, than OpenAI o1-mini. Testers preferred o3-mini's responses to o1-mini 56% of the time and observed a 39% reduction in major errors on difficult real-world questions. W"

threatripper 4 days ago [-]
Those are two different sentences. The second sentence doesn't refer to experts explicitly.
brianstrimp 4 days ago [-]
That makes the result stronger though. Even though many people click randomly, there is still a 12% margin between both groups. Not earth-shattering, but still quite a lot.
sharkweek 4 days ago [-]
Funny - I had ChatGPT document some stuff for me this week and asked which responses I preferred as well.

Didn’t bother reading either of them, just selected one and went on with my day.

If it were me I would have set up a “hey do you mind if we give you two results and you can pick your favorite?” prompt to weed out people like me.

usef- 4 days ago [-]
I'm surprised how many people claim to do this. You can just not select one.
rubyn00bie 4 days ago [-]
I think it's somewhat natural and am not personally surprised. It's easy to quickly select an option that has no consequence, compared to actively considering that not selecting anything is also an option. Not selecting something feels more like actively participating than just checking a box and moving on. /shrug
ssl-3 4 days ago [-]
We -- the people who live in front of a computer -- have been training ourselves to avoid noticing annoyances like captchas, advertising, and GDPR notices for quite a long time.

We find what appears to be the easiest combination of "Fuck off, go away" buttons and use them without a moment of actual consideration.

(This doesn't mean that it's actually the easiest method.)

grahamj 4 days ago [-]
I can't even believe how many times in a day I frustratedly think "whatever, go away!"
apparent 4 days ago [-]
I wonder if they down-weight responses that come in too fast to be meaningful, or without sufficient scrolling.
losteric 4 days ago [-]
That’s fine. Your random click would be balanced by someone else randomly clicking
energy123 4 days ago [-]
Then 56% is even more impressive. Example: if 80% choose randomly and 20% choose carefully, that implies an 80% preference rate for o3-mini among the careful raters (0.2 × 0.8 + 0.8 × 0.5 = 0.56).
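
A quick back-of-the-envelope check of that arithmetic (the 80/20 split is just the parent's assumption):

    # observed = random_fraction * 0.5 + (1 - random_fraction) * careful_preference
    random_fraction = 0.8
    observed = 0.56
    careful_preference = (observed - random_fraction * 0.5) / (1 - random_fraction)
    print(careful_preference)  # ~0.8, i.e. an 80% preference among the careful raters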
jackbrookes 4 days ago [-]
Yes I'd bet most users just 50/50 it, which actually makes it more remarkable that there was a 56% selection rate
cgriswald 4 days ago [-]
I read the one on the left but choose the shorter one.

The interface wastes so much screen real estate already and the answers are usually overly verbose unless I've given explicit instructions on how to answer.

ljm 4 days ago [-]
The default level of verbosity you get without explicitly prompting for it to be succinct makes me think there’s an office full of workers getting paid by the token.
internetter 4 days ago [-]
In my experience the verbosity significantly improves output quality
directevolve 4 days ago [-]
Also, it's not clear if the preference comes from the quality of the 'meat' of the answer, or the way it reports its thinking and the speed with which it responds. With o1, I get a marked feeling of impatience waiting for it to spit something out, and the 'progress of thought' is in faint grey text I can't read. With o3, the 'progress of thought' comes quickly, with more to read, and is more engaging even if I don't actually get anything more than entertainment value.

I'm not going to say there's nothing substantive about o3 vs. o1, but I absolutely do not put it past Sam Altman to juice the stats every chance he gets.

johnneville 4 days ago [-]
They also pay contractors to do these evaluations with much more detailed metrics; no idea which their number is based on, though.
mikeInAlaska 4 days ago [-]
Maybe we should take both answers, paste them into a new chat and ask for a summary amalgamation of them
ricardobeat 4 days ago [-]
This is just a way to prove, statistically, that one model is better than another as part of its validation. It's not collected from normal people using ChatGPT, you don't ever get shown two responses from different models at once.
yawnxyz 4 days ago [-]
Wait what? I get shown this with ChatGPT maybe 5% of the time
nearbuy 4 days ago [-]
Those are both responses from the same model. It's not one response from o1 and another from o3.
arijo 4 days ago [-]
People could be flipping a coin and the score would be the same.
brianstrimp 4 days ago [-]
A 12% margin is literally the opposite of a coin flip. Unless you have a really bad coin.
arijo 4 days ago [-]
I wasn't expecting my comment to be read so literally, but ok.

We're talking about the most cost-efficient model; the competition here is on price, not on a 12% incremental performance gain (which would make sense for the high-end model).

To my knowledge DeepSeek is the cheaper service, which is what matters at the low end (unless the increase in performance were of such magnitude that the extra charge would be worth the money).

brianstrimp 4 days ago [-]
What does deepseek have to do with a comparison between o1-mini and o3-mini?
arijo 3 days ago [-]
I'm not sure I follow - your assertion was that 12% is significant.

I personally choose based on price for a low-cost model (unless the improvement is so significant that it justifies the higher price).

buggy6257 4 days ago [-]
You're being downvoted for 3 reasons:

1) Coming off as a jerk, and from a new account is a bad look

2) "Literally the opposite of a coin flip" would probably be either 0% or 100%

3) Your reasoning doesn't stand up without further info; it entirely depends on the sample size. I could have 5 coin flips all come up heads, but over thousands or millions it averages to 50%. 56% on a small sample size is absolutely within the margin of error/noise. 56% on a MASSIVE sample size is _statistically_ significant, but still isn't that much to brag about for something that I feel like they probably intended to be a big step forward.

brianstrimp 4 days ago [-]
I'm a little puzzled by your response.

1. The message was net-upvoted. Whether there are downvotes in there I can't tell, but the final karma is positive. A similarly spirited message of mine in the same thread was quite well received as well.

2. I can't see how my message would come across as a jerk? I wrote 2 simple sentences, not using any offensive language, stating a mere fact of statistics. Is that being a jerk? And a long-winded berating of a new member of the community isn't?

3. A coin flip is 50%. Anything else is not, once you have a certain sample size. So, this was not. That was my statement. I don't know why you are building a strawman of 5 coin flips. 56% vs 44% is a margin of 12%, as I stated, and with a huge sample size, which they had, that's massive in a space where the returns are deep in "diminishing" territory.

gcanyon 4 days ago [-]
I don't think they make it clear: I wonder if they mean testers prefer o3 mini 56% of the time when they express an opinion, or overall? Some percentage of people don't choose; if that number is 10% and they aren't excluded, that means 56% of the time people prefer o3 mini, 34% of the time people prefer o1 mini, and 10% of the time people don't choose. I'm not sure I think it would be reasonable to present the data that way, but it seems possible.
teeray 4 days ago [-]
This prompt is like "See Attendant" on the gas pump. I'm just going to use another AI instead for this chat.
ninkendo 4 days ago [-]
Glad to know I’m not the only person who just drives to the next station when I see a “see attendant” message.
shombaboor 4 days ago [-]
It seems like the first response must get chosen a majority of the time just to account for friction.
danilocesar 4 days ago [-]
I almost always pick the second one, because it's closer to the submit button and the one I read first.
resters 4 days ago [-]
I too have questioned the approach of showing the long side-by-side answers from two different models.

1) sometimes I wanted the short answer, and so even though the long answer is better I picked the short one.

2) sometimes both contain code that is different enough that I am inclined to go with the one that is more similar to what I already had, even if the other approach seems a bit more solid.

3) Sometimes one will have less detail but more big picture awareness and the other will have excellent detail but miss some overarching point that is valuable. Depending on my mood I sometimes choose but it is annoying to have to do so because I am not allowed to say why I made the choice.

The area of human training methodology seems to be a big part of what got Deepseek's model so strong. I read the explanation of the test results as an acknowledgement by OpenAI of some weaknesses in its human feedback paradigm.

IMO the way it should work is that the thumbs up or down should be read in context by a reasoning being and a more in-depth training case should be developed that helps future models learn whatever insight the feedback should have triggered.

Feedback that A is better or worse than B is definitely not (in my view) sufficient except in cases where a response is a total dud. Usually the responses have different strengths and weaknesses and it's pretty subjective which one is better.

dionian 4 days ago [-]
i enjoy it, i like getting two answers for free - often one of them is significantly better and probably the newer model
danielmarkbruce 4 days ago [-]
RLUHF, U = useless.
mullingitover 4 days ago [-]
You know you can configure default instructions to your prompts, right?

I have something like “always be terse and blunt with your answers.”

kumarm 4 days ago [-]
I ran some quick programming tasks that I had previously used o1 for:

1. About 1/4 of the reasoning time for most tasks.

2. Far better results.

CamperBob2 4 days ago [-]
Compared to o1 or o1-pro?
yzydserd 4 days ago [-]
From a few quick tasks, it looks to me like o3-mini-high is 4-10x faster for 80% of the quality. It gives very good and sufficiently fast reasoning about coding tasks, but I think I'd still ask o1-pro to do the task, i.e. provide the code. o3-mini-high can keep up with you at thinking/typing speed, whereas o1-pro can take several minutes. Just a quick view after playing for an hour.
highfrequency 4 days ago [-]
Anyone else confused by inconsistency in performance numbers between this announcement and the concurrent system card? https://cdn.openai.com/o3-mini-system-card.pdf

For example-

GPQA diamond system card: o1-preview 0.68

GPQA diamond PR release: o1-preview 0.78

Also, how should we interpret the 3 different shading colors in the barplots (white, dotted, heavy dotted on top of white)...

kkzz99 4 days ago [-]
Actually sounds like benchslop to me.
silverlake 4 days ago [-]
O3-mini solved this prompt. DeepSeek R1 had a mental breakdown. The prompt: “Bob is facing forward. To his left is Ann, to his right is Cathy. Ann and Cathy are facing backwards. Who is on Ann’s left?”
modeless 4 days ago [-]
R1 or R1-Distill? They are not the same thing. I think DeepSeek made a mistake releasing them at the same time and calling them all R1.

Full R1 solves this prompt easily for me.

silverlake 4 days ago [-]
modeless 4 days ago [-]
Huh, that one got it wrong for me too. I don't have patience to try it 10 times each to see if it was a coincidence, but it is absolutely true that not all implementations of LLMs produce the same outputs. It is in fact common for subtle bugs to happen that cause the outputs to be worse but not catastrophically bad, and therefore go unnoticed. So I wouldn't trust any implementation but the original for benchmarking or even general use unless I tested it extensively.
SparkyMcUnicorn 4 days ago [-]
Same. With the recommended settings, it got it right. I regenerated a bunch of times, and it did suggest Cathy once or twice.

R1 70b also got it right just as many times for me.

lynguist 4 days ago [-]
Really interesting puzzle. I tried the different ChatGPTs and they all solved it. I tried DeepSeek and DeepSeek R1, and DeepSeek did have this meltdown:

Let's break down the problem step by step to understand the relationships and positions of Bob, Ann, and Cathy. 1. Understanding the Initial Setup

    Bob is facing forward.

    This means Bob's front is oriented in a particular direction, which we'll consider as the reference point for "forward."

    To his left is Ann, to his right is Cathy.

    If Bob is facing forward, then:

        Ann is positioned to Bob's left.

        Cathy is positioned to Bob's right.

    Ann and Cathy are facing backwards.

    Both Ann and Cathy are oriented in the opposite direction to Bob. If Bob is facing forward, then Ann and Cathy are facing backward.
2. Visualizing the Positions

To better understand the scenario, let's visualize the positions:

Forward Direction: ↑

Bob (facing forward) | | Ann (facing backward) | / | / | / | / | / | / | / |/ |

And then only the character | in a newline forever.

thinkalone 4 days ago [-]
That's a fun, simple test! I tried a few models, and mistral-nemo gets it every time, even when run locally without any system prompt! https://build.nvidia.com/nv-mistralai/mistral-nemo-12b-instr...
mark_l_watson 4 days ago [-]
Thanks for that example! Using 'ollama run' these models all failed: deepseek-r1:32b, mistral-small:latest, qwq:latest, and qwen2.5:14b.
sivakon 4 days ago [-]
usaar333 3 days ago [-]
Even 4o mini gets that
Synaesthesia 4 days ago [-]
Deepseek solved it.
msp26 4 days ago [-]
I wish they'd just reveal the CoT (like Gemini and DeepSeek do); it's very helpful to see when the model gets misled by something in your prompt. Paying for tokens you aren't even allowed to see is peak OpenAI.
liamwire 4 days ago [-]
sama and OpenAI’s CPO Kevin Weil both suggested this is coming soon, as a direct response to DeepSeek, in an AMA a few hours ago: https://www.reddit.com/r/OpenAI/s/EElFfcU8ZO
msp26 4 days ago [-]
> a much more helpful and detailed version of this

Notice the deliberate wording. To me this implies we aren't getting the raw CoT.

PeterHolzwarth 4 days ago [-]
Do you have a direct link to that? My "force .old on everything" plugin is having problems resolving your url (sorry!).
tucnak 4 days ago [-]
I'm sorry, but it's over for OpenAI. Some have predicted this, including me back in November[1] when I wrote "o1 is a revolution in accounting, not capability", which, although tongue-in-cheek, has so far turned out to be correct. I'm only waiting to see what Google, Facebook et al. will accomplish now that the R1-Zero result is out of the bag. The nerve, the cheek of this hysterical o3-mini release, still insisting on hiding the CoT from the consumer, tells us one thing and one thing alone: OpenAI is no longer able to adapt to the ever-changing landscape. Maybe the Chinese haven't beaten them yet, but Google, Facebook et al. absolutely will, and without having to resort to deception.

[1]: https://old.reddit.com/r/LocalLLaMA/comments/1gna0nr/popular...

mediaman 4 days ago [-]
You don't need to wait for Google. Their Jan 21 checkpoint for their fast reasoning model is available on AIStudio. It shows full reasoning traces. It's very good, much faster than R1, and although they haven't released pricing, based on flash it's going to be quite cheap.
tucnak 4 days ago [-]
Sure, their 01-21 reasoning model is really good, but there's no pricing for it!

I care mostly about batching in Vertex AI, which is 17-30x cheaper than the competition (whether you use prompt caching or not) while allowing for audio, video, and arbitrary document filetype inputs; unfortunately Gemini 1.5 Pro/Flash have remained the two so-called "stable" options available there. I appreciate Google's experimental models as much as anyone, but I cannot take them seriously until they allow me to have my sweet, sweet batches.

mark_l_watson 4 days ago [-]
Oh, sweet: both o3-mini low and high support integrated web search. No integrated web search with o1.

I prefer, for philosophical reasons, open weight and open process/science models, but OpenAI has done a very good job of productizing ChatGPT. I also use their 4o-mini API because it is cheap and compares well to using open models on Groq Cloud. I really love running local models with Ollama, but the API vendors keep the price so low that I understand most people not wanting the hassle of running DeepSeek-R, etc., locally.

mvkel 4 days ago [-]
I've been using cursor since it launched, sticking almost exclusively to claude-3.5-sonnet because it is incredibly consistent, and rarely loses the plot.

As subsequent models have been released, most of which claim to be better at coding, I've switched cursor to it to give them a try.

o1, o1-pro, deepseek-r1, and the now o3-mini. All of these models suffer from the exact same "adhd." As an example, in a NextJS app, if I do a composer prompt like "on page.tsx [15 LOC], using shadcn components wherever possible, update this page to have a better visual hierarchy."

sonnet nails it almost perfectly every time, but suffers from some date cutoff issues like thinking that shadcn-ui@latest is the repo name.

Every single other model, doesn't matter which, does the following: it starts writing radix-ui components from scratch. I will interrupt it and say "DO NOT use radix-ui, use shadcn!" -- it will respond with "ok!" then begin writing its own components from scratch, again not using shadcn.

This is still problematic with o3-mini.

I can't believe it's the models. It must be the instruction-set that cursor is giving it behind the scenes, right? No amount of .cursorrules, or other instruction, seems to get cursor "locked in" the way sonnet just seems to be naturally.

It sucks being stuck on the (now ancient) sonnet, but inexplicably, it remains the only viable coding option for me.

Has anyone found a workaround?

kace91 4 days ago [-]
My experience with cursor and sonnet is that it is relatively good at first tries, but completely misses the plot during corrections.

"My attempt at solving the problem contains a test that fails? No problem, let me mock the function I'm testing, so that, rather than actually run, it returns the expected value!"

It keeps pulling those kinds of shenanigans, applying modifications that solve the newly appearing problem while screwing up the original attempt's goal.

I usually get much better results from regular ChatGPT copying and pasting, the trouble being that it is a major pain to handle the context window manually by pasting relevant info and reminding it of what I think is being forgotten.

delichon 4 days ago [-]
Claude makes a lot of crappy change suggestions, but when you ask "is that a good suggestion?" it's pretty good at judging when it isn't. So that's become standard operating procedure for me.

It's difficult to avoid Claude's strong bias for being agreeable. It needs more HAL 9000.

4b11b4 4 days ago [-]
I'm always asking Claude to propose a variety of suggestions for the problem at hand along with their trade-offs, then to evaluate them and explain which three proposals are best and why. Then I'll pick one of them and further vet the idea.
kace91 4 days ago [-]
>It's difficult to avoid Claude's strong bias for being agreeable. It needs more HAL 9000.

Absolutely, I find this a challenge as well. Every thought that crosses my mind is a great idea according to it. That's the opposite attitude to what I want from an engineer's copilot! Particularly from one who also advises junior devs.

esperent 4 days ago [-]
> when you ask "is that a good suggestion?" it's pretty good at judging when it isn't

Basically a poor man's CoT.

jwpapi 4 days ago [-]
Yes it’s usually worth it to try to write a really good first prompt
earleybird 4 days ago [-]
More than once I've found myself going down this 'little maze of twisty passages, all alike'. At some point I stop, collect up the chain of prompts in the conversation, and curate them into a net new prompt that should be a bit better. Usually I make better progress - at least for a while.
SamPatt 4 days ago [-]
This becomes second nature after a while. I've developed an intuition about when a model loses the plot and when to start a new thread. I have a base prompt I keep for the current project I'm working on, and then I ask the model to summarize what we've done in the thread and combine them to start anew.

I can't wait until this is a solved problem because it does slow me down.

dr_dshiv 4 days ago [-]
Why is it so hard to share/find prompts or distill my own damn prompts? There must be good solutions for this —
garfij 4 days ago [-]
What do you find difficult about distilling your own prompts?

After any back and forth session I have reasonably good results asking something like "Given this workflow, how could I have prompted this better from the start to get the same results?"

dr_dshiv 4 days ago [-]
Analysis of past chats in bulk.
whall6 4 days ago [-]
Don’t outsource the only thing left for our brains to do themselves :/
sheepscreek 4 days ago [-]
For my advanced use case involving Python and knowledge of finance, Sonnet fared poorly. Contrary to what I am reading here, my favorite approach has been to use o1 in agent mode. It’s an absolute delight to work with. It is like I’m working with a capable peer, someone at my level.

Sadly there are some hard limits on o1 with Cursor and I cannot use it anymore. I do pay for their $20/month subscription.

electroly 4 days ago [-]
> o1 in agent mode

How? It specifically tells me this is unsupported: "Agent composer is currently only supported using Anthropic models or GPT-4o, please reselect the model and try again."

sheepscreek 4 days ago [-]
I think you're right - I must have used it in regular mode, then got GPT-4o to fill in the gaps. It can fully automate a lot of menial work, such as refactors and writing tests. Though I'll add, I had roughly a 50% success rate with GPT-4o bug fixing in agent mode, which is pretty great in my experience. When it did work, it felt glorious - 100% hands-free operation!
axkdev 4 days ago [-]
It seems like you could use aider in architect mode. Basically, it will first suggest a solution to your problem and prompt you to start editing; you can say no to refine the solution and only start editing when you are satisfied with it.
mathieuh 4 days ago [-]
Hah, I was trying it the other day in a Go project and it did exactly the same thing. I couldn’t believe my eyes, it basically rewrote all the functions back out in the test file but modified slightly so the thing that was failing wouldn’t even run.
nprateem 4 days ago [-]
I've had it do similar nonsense.

I just don't understand all the people who honestly believe AGI just requires more GPUs and data when these models are so inherently stupid.

hahajk 4 days ago [-]
Can't you select Chatgpt as the model in cursor?
kace91 4 days ago [-]
Yes, but for some reason it seems to perform worse there.

Perhaps whatever algorithms Cursor uses to prepare the context it feeds the model are a good fit for Claude but not so much for the others (?). It's a random guess, but whatever the reason, there's a weird worsening of performance vs pure chat.

electroly 4 days ago [-]
Yes but every model besides claude-3.5-sonnet sucks in Cursor, for whatever reason. They might as well not even offer the other models. The other models, even "smarter" models, perform vastly poorer or don't support agent capability or both.
zackproser 4 days ago [-]
Not trying to be snarky, but the example prompt you provided is about 1/15th the length and detail of prompts I usually send when working with Cursor.

I tend to exhaustively detail what I want, including package names and versions because I've been to that movie before...

inerte 4 days ago [-]
What also works nicely is the text to speech. I find it easier and faster to give more context by talking rather than typing, and the extra content helps the AI do its job.

And even though the speech recognition fails a lot on some of the technical terms or weirdly named packages, software, etc, it still does a good job overall (if I don’t feel like correcting the wrong stuff).

It's great and has become somewhat of a party trick at work. Some people don't even use AI to code that often, and when I show them - "hey, have you tried this?" - and just tell the computer what I want, most folks are blown away.

cadence- 4 days ago [-]
Does the Cursor have text-to-speech functionality?
fud101 4 days ago [-]
you mean speech to text right?
chefandy 4 days ago [-]
Not for me. I first ask Advanced Voice to read me some code and have Siri listen and email it to an API I wrote which uses Claude to estimate the best cloud provider to run that code based on its requirements and then a n8n script deploys it and send me the results via twilio.
inerte 3 days ago [-]
Sorry! Yes, speech to text.
crooked-v 4 days ago [-]
If I have to write a prompt that long, it'll be faster to just write the code.
aprilthird2021 4 days ago [-]
Shocking to see this because this was essentially the reason most of the previous no code solutions never took off...
esperent 4 days ago [-]
That sounds exhausting. Wouldn't it be faster to include your package.json in the context?

I sometimes do this (using Cline), plus create a .cline file at project root which I refine over time and which describes both the high level project overview, details of the stack I'm using, and technical details I want each prompt to follow.

Then each actual prompt can be quite short: read files x, y, and z, and make the following changes... where I keep the changes concise and logically connected - basically what I might do for a single pull request.

mvkel 4 days ago [-]
My point was that a prompt that simple could be held and executed very well by sonnet, but all other models (especially reasoning models) crash and burn.

It's a 15 line tsx file so context shouldn't be an issue.

Makes me wonder if reasoning models are really proper models for coding in existing codebases

liamwire 4 days ago [-]
Your last point matches what I’ve seen some people (simonw?) say they’re doing currently: using aider to work with two models—one reasoning model as an architect, and one standard LLM as the actual coder. Surprisingly, the results seem pretty good vs. putting everything on one model.
mvkel 4 days ago [-]
This is probably the right way to think about it. O1-pro is an absolute monster when it comes to architecture. It is staggering the breadth and depth that it sees. Ask it to actually implement though, and it trips over its shoelaces almost immediately.
goosejuice 4 days ago [-]
Can you give an example of this monstrous capability you speak of? What have you used it for professionally w.r.t. architecture.
mvkel 3 days ago [-]
The biggest delta over regular o1 that I've seen is asking it to make a PRD of an app that I define as a stream-of-consciousness with bullet points.

It's fantastic at finding needles in the haystack, so the contradictions are nonexistent. In other words, it seems to identify which objects would interrelate and builds around those nodes, where o1 seems to think more in "columns."

To sum it up, where o1 feels like "5 human minute thinking," o1-pro feels like "1 human hour thinking"

hombre_fatal 4 days ago [-]
You’re basically saying you write 15x the prompt for the same result they get with sonnet.
jwpapi 4 days ago [-]
Yes, this works well for me too; better to take your time and get the first prompt right.
chrismsimpson 4 days ago [-]
I’ve coded in many languages over the years but reasonably new to the TS/JS/Next world.

I've found that if you give your prompts a kind of long-form "stream of consciousness", where you outline snippets of code in markdown along with contextual notes and then summarise/outline at the end what you actually wish to achieve, you can get great results.

Think long-form, single-page "documentation"-type prompts that alternate between written copy/contextual intent/description and code blocks. Annotating code blocks with file names above the blocks surely helps too. Don't waste your context window on redundant/irrelevant information or code; stating that a code sample is abridged, or adding commented ellipses, seems to do the job.

d357r0y3r 4 days ago [-]
By the time I've fully documented and explained what I want to be done, and then review the result, usually finding that it's worse than what I would have written myself, I end up questioning my instinct to even reach for this tool.

I like it for general refactoring and day to day small tasks, but anything that's relatively domain-specific, I just can't seem to get anything that's worth using.

noahbp 4 days ago [-]
Like most AI tools, great for beginners, time-savers for intermediate users, and frequently a waste of time in domains where you're an expert.

I've used Cursor for shipping better frontend slop, and it's great. I skip a lot of trial and error, but not all of it.

epolanski 4 days ago [-]
> and frequently a waste of time in domains where you're an expert.

I'm a domain expert and I disagree.

There's many scenarios where using LLMs pays off.

E.g. a long file or very long function is just that, and an LLM is faster at understanding it whole, not being limited in how many things it can track at once the way your mind is (between 4 and 6). It's still going to be faster at refactoring it and testing it than you will be.

d357r0y3r 4 days ago [-]
I agree that it's amazing as a learning tool. I think the "time to ramp" on a new technology or programming language has probably been cut in half or more.
twilightfringe 4 days ago [-]
ha! good to confirm! I tend to do this, just kind of as a double-check thing, but never sure if it actually worked or if it was a placebo, lol.

Or end with "from the user's perspective: all the "B" elements should light up in excitement when you click "C""

mvkel 4 days ago [-]
Going to try this! Thanks for the tip
MaxLeiter 4 days ago [-]
We've been working on solving a lot of these issues with v0.dev (disclaimer: shadcn and I work on it). We do a lot of pre and post-processing to ensure LLMs output valid shadcn code.

We're also talking to the cursor/windsurf/zed folks on how we can improve Next.js and shadcn in the editors (maybe something like llms.txt?)

mvkel 4 days ago [-]
Thanks for all the work you do! v0 is magical. I absolutely love the feature where I can add a chunky component that v0 made to my repo with npx
harshitaneja 4 days ago [-]
So I think I finally understood recently why we have these divergent groups, one thinking Claude 3.5 Sonnet is the best model for coding and another that follows the OpenAI SOTA of the moment. I have been a heavy user of ChatGPT, jumping onto Pro without even thinking for more than a second once it was released. Recently though I took a pause from my usual work on statistical modelling, heuristics and other things in certain deep domains to focus on building client APIs and frontends, decided to give Claude a try again, and it is just so great to work with for this use case.

My hypothesis is it's a difference in what you are doing. The OpenAI o-series models are much better than others at mathematical modelling and similar tasks, and Claude at more general-purpose programming.

mycall 4 days ago [-]
Have you used multi-agent chat sessions with each fielding their own specialities and seeing if that improves your use cases aka MoE?
harshitaneja 3 days ago [-]
I have not. Any suggestions on which one(s) to explore to get started?
dhc02 1 days ago [-]
I'm going to give you a video to watch. It's not mine, and I don't know much about this particular youtuber, but it really transformed how I think about writing and structuring the prompts I use, which solved problems similar to what you're describing here.

https://youtu.be/y_ywOVQyafE?si=IvKjy7QUYgxGPNgD

PS (I have not bought the guy's course and have no idea whether it's any good)

energy123 4 days ago [-]
Context length possibly. Prompt adherence drops off with context, and anything above 20k tokens is pushing it. I get the best results by presenting the smallest amount of context possible, including removing comments and main methods and functions that it doesn't need to see. It's a bit more work (not that much if you have a script that does it for you), but the results are worth it. You could test in the chatgpt app (or lmarena direct chat) where you ask the same question but with minimal hand curated context, and see if it makes the same mistake.
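
The trimming script can be as dumb as a few lines. A minimal line-based sketch (purely illustrative; it won't catch '#' inside strings, and you'd pass it whatever files you want to paste):

    import sys

    def trim(path):
        # Drop comments and blank lines so the pasted context stays small.
        kept = []
        for line in open(path, encoding="utf-8"):
            stripped = line.split("#", 1)[0].rstrip()
            if stripped:
                kept.append(stripped)
        return "\n".join(kept)

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            print(f"### {path}")
            print(trim(path))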
mvkel 4 days ago [-]
If it's a context issue, it's an issue with how cursor itself sends the context to these reasoning LLMs.

Context alone shouldn't be the reason that sonnet succeeds consistently, but others (some which have even bigger context windows) fail.

energy123 4 days ago [-]
Yes, that's what I'm suggesting. Cursor is spamming the models with too much context, which harms reasoning models more than it harms non-reasoning models (hypothesis, but one that aligns with my experience). That's why I recommended testing reasoning models outside of Cursor with a hand curated context.

The advertised context length being longer doesn't necessarily map 1:1 with the actual ability the models have to perform difficult tasks over that full context. See for example the plots of performance on ARC vs context length for o-series models.

digitcatphd 4 days ago [-]
The reality is I suspect one will use different models for different things. Think of it like having different modes of transportation.

You might use your scooter, bike, car, jet - depending on the circumstances. A bike was invented 100 years ago? But it may be the best in the right use case. Would still be using DaVinci for some things because we haven't bothered swapping it and it works fine.

For me - the value of R1/o3 is visible logic that provides an analysis that can be critiqued by Sonnet 3.5

ido 4 days ago [-]
I have an even more topical analogy! Using different languages for different tasks. When I need some one-off script to automate some drudgery (take all files with a certain pattern in their name, do some search and replace in the text inside each, zip them, upload the zip to a URL, etc.) I use Python. When I'm working on a multi-platform game I use C# (and Unity). When I need to make something very lean that works in mobile browsers I use JS with some lightweight libraries.
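
That kind of one-off drudgery script really is a few lines of Python; a sketch with a made-up filename pattern, replacement, and upload URL:

    import glob
    import urllib.request
    import zipfile

    ARCHIVE = "batch.zip"

    # Collect matching files, do the search-and-replace, and zip them up.
    with zipfile.ZipFile(ARCHIVE, "w") as zf:
        for path in glob.glob("report_*.txt"):                  # hypothetical pattern
            text = open(path, encoding="utf-8").read()
            zf.writestr(path, text.replace("DRAFT", "FINAL"))   # hypothetical replacement

    # Upload the archive; the URL is a placeholder.
    req = urllib.request.Request(
        "https://example.com/upload",
        data=open(ARCHIVE, "rb").read(),
        headers={"Content-Type": "application/zip"},
        method="POST",
    )
    urllib.request.urlopen(req)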
jvanderbot 4 days ago [-]
I've found cursor to be too thin a wrapper. Aider is somehow significantly more functional. Try that.
dhc02 4 days ago [-]
Aider, with o1 or R1 as the architect and Claude 3.5 as the implementer, is so much better than anything you can accomplish with a single model. It's pretty amazing. Aider is at least one order of magnitude more effective for me than using the chat interface in Cursor. (I still use Cursor for quick edits and tab completions, to be clear).
dwaltrip 4 days ago [-]
I haven't tried aider in quite a while, what does it mean to use one model as an architect and another as the implementer?
Terretta 4 days ago [-]
Aider now has experimental support for using two models to complete each coding task:

- An Architect model is asked to describe how to solve the coding problem.

- An Editor model is given the Architect’s solution and asked to produce specific code editing instructions to apply those changes to existing source files.

Splitting up “code reasoning” and “code editing” in this manner has produced SOTA results on aider’s code editing benchmark. Using o1-preview as the Architect with either DeepSeek or o1-mini as the Editor produced the SOTA score of 85%. Using the Architect/Editor approach also significantly improved the benchmark scores of many models, compared to their previous “solo” baseline scores (striped bars).

https://aider.chat/2024/09/26/architect.html

lukas099 4 days ago [-]
Probably gonna show a lot of ignorance here, but isn’t that a big part of the difference between our brains and AI? That instead of one system, we are many systems that are kind of sewn together? I secretly think AGI will just be a bunch of different specialized AIs working together.
Terretta 4 days ago [-]
You're in good company in that secret thought.

Have a look at this: https://en.wikipedia.org/wiki/Society_of_Mind

dhc02 1 days ago [-]
Efficient and effective organizations work this way, too: a CEO to plan in broad strokes, employees to implement that vision in specific ways, and managers to make sure their results match expectations.
ChadNauseam 4 days ago [-]
I normally use aider by just typing in what I want and it magically does it. How do I use o1 or R1 to play the role of the "architect"?
macNchz 4 days ago [-]
You can start it with something like:

    aider --architect --model o1 --editor-model sonnet
Then you'll be in "architect" mode, which first prompts o1 to design the solution, then you can accept it and allow sonnet to actually create the diffs.

Most of the time your way works well—I use sonnet alone 90% of the time, but the architect mode is really great at getting it unstuck when it can't seem to implement what I want correctly, or keeps fixing its mistakes by making things worse.

cruffle_duffle 4 days ago [-]
I really want to see how apps created this way scale to large codebases. I’m very skeptical they don’t turn into spaghetti messes.

Coding is basically just about the most precise way to encapsulate a problem as a solution possible. Taking a loose English description and expanding it into piles of code is always going to be pretty leaky no matter how much these models spit out working code.

In my experience you have to pay a lot of attention to every single line these things write because they’ll often change stuff or more often make wrong assumptions that you didn’t articulate. And in my experience they never ask you questions unless you specifically prompt them to (and keep reminding them to), which means they are doing a hell of a lot of design and implementation that unless carefully looked over will ultimately be wrong.

It really reminds me a bit of when Ruby on Rails came out and the blogosphere was full of gushing “I’ve never been more productive in my life” posts. And then you find out they were basically writing a TODO app and their previous development experience was doing enterprise Java for some massive non-tech company. Of course RoR will be a breath of fresh air for those people.

Don't get me wrong, I use Cursor as my daily driver, but I am starting to find the limits of what these things can do. And the idea of having two of these LLMs taking some paragraph-long feature description and somehow chatting with each other to create a scalable bit of code that fits into a large or growing codebase… well, I find that kind of impossible. Sure, the code compiles and conforms to whatever best practices are out there, but there will be absolutely no consistency across the app, especially at the UX level. These things simply cannot hold that kind of complexity in their head, and even if they could, part of a developer's job is to translate loose English into code. And there is much, much, much, much more to that than simply writing code.

macNchz 4 days ago [-]
I see what you’re saying and I think that terming this “architect” mode has an implication that it’s more capable than it really is, but ultimately this two model pairing is mostly about combining disparate abilities to separate the “thinking” from the diff generation. It’s very effective in producing better results for a single prompt, but it’s not especially helpful for “architecting” a large scale app.

That said, in the hands of someone who is competent at assembling a large app, I think these tools can be incredibly powerful. I have a business helping companies figure out how/if to leverage AI and have built a bunch of different production LLM-backed applications using LLMs to write the code over the past year, and my impression is that there is very much something there. Taking it step by step, file by file, like you might if you wrote the code yourself, describing your concept of the abstractions, having a few files describing the overall architecture that you can add to the chat as needed—little details make a big difference in the results.

tribeca18 4 days ago [-]
I use Cursor and Composer in agent mode on a daily basis, and this is basically exactly what happened to me.

After about 3 weeks, things were looking great - but lots of spaghetti code was put together, and it never told me what I didn't know. The data & state management architecture I had written was simply not maintainable (tons of prop drilling, etc). Over time, I basically learned common practices etc. and I'm finding that I have to deal with these problems myself. (how it used to be!)

We're getting close - the best thing I've done is create documentation files with lots of descriptions about the architecture/file structure/state management/packages/etc, but it only goes so far.

We're getting closer, but for right now we're not there, and you have to be really careful about looking over all the changes.

nprateem 4 days ago [-]
The worst thing you can do with aider is let it autocommit to git. As long as you review each set of changes you can stop it going nuts.

I have a codebase of maybe 300-500k lines which is in good shape because of this.

I also normally just add the specific files I need to the chat and give it 1-2 sentences for what to do. It normally does the right thing (sonnet obviously).

dhc02 1 days ago [-]
Yes! Turn off autocommit, everyone! Review and test, then git commit.
aledalgrande 4 days ago [-]
Same with Cline
esperent 4 days ago [-]
Claude uses Shadcn-ui extensively in the web interface, to the point where I think it's been trained to use it over other UI components.

So I think you got lucky: you're asking it to write using a very specific code library that it's good at, because it happens to use it for its main userbase on the web chat interface.

I wonder if you were using a different component library, or using Svelte instead of React, would you still find Claude the best?

pollinations 4 days ago [-]
I was recently trying to write a relatively simple htmx service with Claude. I was surprised at how much worse it was when it's not React.
adonese 4 days ago [-]
My general workflow with AI so far has been this:

- I use Copilot mostly for writing unit tests. It mostly works well since the unit tests follow a standard template.
- I use the chat one for alternating between different approaches and (in)validating certain approaches.

My day job is a big monorepo. I have not investigated that yet, but I believe the models' context sizes fall short there, and as such only the above use cases work for me.

eagleinparadise 4 days ago [-]
Cursor is also very user-unfriendly in providing alternative models to use in Composer (agent). There's a heavy reliance on Anthropic for Cursor.

Try using Gemini thinking with Cursor. It barely works. Cmd-k outputs the thinking into the code. It's unusable in chat because the formatting sucks.

Is there some relationship between Cursor and Anthropic, I wonder. Plenty of other platforms seem very eager to give users model flexibility, but Cursor seems to be lacking.

I could be wrong, just an observation.

ttroyr 11 hours ago [-]
Originally there was actually a relationship between Cursor & OpenAI. Something like Cursor was supported by the OpenAI startup fund. So Cursor seems to have branched out. I think they are just emphasizing the models they find most effective. I'm surprised they haven't (apparently) incorporated Claude prompt caching yet for Sonnet.
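For context, prompt caching on the Anthropic API is opt-in per request: you mark a large, stable prefix (system prompt, repo context) with a cache_control block and reuse it across calls. A minimal Python sketch, assuming the anthropic SDK and a current Sonnet model ID; the file name and prompts here are made up for illustration:

    # Minimal sketch of Anthropic prompt caching (assumes the `anthropic` Python SDK).
    # The cached prefix must be identical across calls to get cache hits.
    import anthropic

    client = anthropic.Anthropic()

    big_context = open("repo_summary.md").read()  # hypothetical large, stable context

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": big_context,
                "cache_control": {"type": "ephemeral"},  # mark this prefix as cacheable
            }
        ],
        messages=[{"role": "user", "content": "Suggest a refactor for the auth module."}],
    )
    print(response.content[0].text)
    print(response.usage)  # reports cache creation/read token counts when caching applies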
hombre_fatal 4 days ago [-]
I have the same experience. Just today I was integrating a new logging system with my kubernetes cluster.

I tried out the OP model to make changes to my yaml files. It would give short snippets and I’d have to keep trial and erroring its suggestions.

Eventually I pasted the original prompt to Claude and it one-shot the dang thing with perfect config. Made me wonder why I even try new models.

kristopolous 4 days ago [-]
"not" and other function words; usually work fine today but if I'm having trouble, the best thing to do is probably be inclusive, not exclusive.
foobiekr 4 days ago [-]
Have you tried any of the specialty services like Augment? I am curious if they are any better or just snake oil.
pknerd 4 days ago [-]
OT: How many tokens are being consumed? How much are you paying for Claude APIs?
Abishek_Muthian 4 days ago [-]
Just curious, did you try a code model like Codestral instead of a MoE?
bugglebeetle 4 days ago [-]
o3 mini’s date cut-off is 2023, so it’s unfortunately not gonna be useful for anything that requires knowledge of recent framework updates, which includes probably all big frontend stuff.
OkGoDoIt 4 days ago [-]
I also have been less impressed by o1 in cursor compared to sonnet 3.5. Usually what I will do for a very complicated change is ask o1 to architect it, specifically asking it to give me a detailed plan for how it would be implemented, but not to actually implement anything. I then change the model to Sonnet 3.5 to have it actually do the implementation.

And on the side of not being able to get models to understand something specific, there’s a place in a current project where I use a special Unicode apostrophe during some string parsing because a third-party API needs it. But any code modifications by the AI to that file always replace it with a standard ASCII apostrophe. I even added a comment on that line to the effect of “never replace this apostrophe, it’s important to leave it exactly as it is!” and also put that in my cursor rules, and sometimes directly in the prompt as well, but it always replaces it even for completely unrelated changes. I’ve had to manually fix it like 10 times in the last day; it’s infuriating.
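One workaround that may hold up better (a minimal Python sketch with hypothetical names; the actual project could be in any language): spell the character as an escape sequence rather than a literal, so there is no visible "curly" apostrophe for the model to helpfully normalize:

    # Hypothetical sketch: keep the required character as an escape, not a literal.
    # U+2019 is the RIGHT SINGLE QUOTATION MARK that the third-party API expects.
    SPECIAL_APOSTROPHE = "\u2019"

    def prepare_for_api(text: str) -> str:
        # Swap plain ASCII apostrophes for the typographic one before sending.
        return text.replace("'", SPECIAL_APOSTROPHE)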

4 days ago [-]
jajko 4 days ago [-]
A random idea - train one of those models on you, keep it aside, let it somehow work out your intricacies, moods, details, childhood memories, personality, flaws, strengths. Methods could vary - an initial dump of social networks, personal photos and videos, maybe some intense conversation to grok a rough version of you, then polish over time.

A first step to digital immortality; it could be a nice startup offering a personalized product for the rich, and then even regular folks. Immortality not for ourselves as meat bags of course, we die regardless, but a digital copy and memento that our children could use if feeling lonely and carry with them anywhere, or that later descendants could bring out of curiosity to big events like weddings. One could 'invite' long-lost ancestors. Maybe your great-grandfather would be a cool guy you could easily click with these days via verbal input. Heck, even a detailed 3D model.

An additional, 'perpetually' paid service: keeping your data model safe, taking care of it, backups; heck, maybe even giving it a bit of computing power to receive current news in some light fashion and evolve could be an extra. Different tiers for different levels of service and care.

Or am I a decade or two ahead? I can see this as universally interesting across many if not all cultures.

ryanhecht 4 days ago [-]
> While OpenAI o1 remains our broader general knowledge reasoning model, OpenAI o3-mini provides a specialized alternative for technical domains requiring precision and speed.

I feel like this naming scheme is growing a little tired. o1 is for general knowledge reasoning, o3-mini replaces o1-mini but might be more specialized than o1 for certain technical domains...the "o" in "4o" is for "omni" (referring to its multimodality) but the reasoning models start with "o" ...but they can't use "o2" for trademark reasons so they skip straight to "o3" ...the word salad is getting really hard to follow!

zamadatix 4 days ago [-]
The -mini postfix makes perfect sense, probably even clearer than the old "turbo" wording. Naturally, the latest small model may be better than larger older models... but not always and not necessarily in everything. What you'd expect from a -mini model is exactly what is delivered.

The non-reasoning line was also pretty straightforward. Newer base models get a larger prefix number and some postfixes like 'o' were added to signal specific features in each model variant. Great!

Where things went off the rails was when they decided to also name the reasoning models with an 'o', for separate reasons, but this time as the prefix, while starting a separate linear sequence as the postfix. I wonder if we'll end up with both a 4o and o4...

lolinder 4 days ago [-]
> I wonder if we'll end up with both a 4o and o4...

The perplexing thing is that someone has to have said that, right? It has to have been brought up in some meeting when they were brainstorming names that if you have 4o and o1 with the intention of incrementing o1 you'll eventually end up with an o4.

Where they really went off the rails was not just bailing when they realized they couldn't use o2. In that moment they had the chance to just make o1 a one-off weird name and go down a different path for its final branding.

OpenAI just struggles with names in general, though. ChatGPT was a terrible name picked by engineers for a product that wasn't supposed to become wildly successful, and they haven't really improved at it since.

viraptor 4 days ago [-]
The obvious solution could be to just keep skipping the even numbers and go to o5.
arrowleaf 4 days ago [-]
Or further the hype and name it o9.
4 days ago [-]
4 days ago [-]
cruffle_duffle 4 days ago [-]
Why not let ChatGPT decide the naming? Surely it will be replacing humans at this task any day now?
macrolime 4 days ago [-]
And multimodal o4 should be o4o.
tmnvdb 4 days ago [-]
Probably they are doing so well because there are not endless meetings on customer friendly names
unsupp0rted 4 days ago [-]
This is definitely intentional.

You can like Sama or dislike him, but he knows how to market a product. Maybe this is a bad call on his part, but it is a call.

thorum 4 days ago [-]
Not really. They’re successful because they created one of the most interesting products in human history, not because they have any idea how to brand it.
marko-k 4 days ago [-]
If that were the case, they’d be neck and neck with Anthropic and Claude. But ChatGPT has far more market share and name recognition, especially among normies. Branding clearly plays a huge role.
cj 4 days ago [-]
ChatGPT is still benefitting from first mover advantage. Which they’ve leveraged to get to the position they’re at today.

Over time, competitors catch up and first mover advantage melts away.

I wouldn’t attribute OpenAI’s success to any extremely smart marketing moves. I think a big part of their market share grab was simply going (and staying) viral for a long time. Manufacturing virality is notoriously difficult (and based on the usability and poor UI of ChatGPT early versions, it feels like they got lucky in a lot of ways)

bobxmax 4 days ago [-]
I think that has more to do with the multiple year head start and multiple tens of billions of dollars in funding advantage.
joshstrange 4 days ago [-]
And you think that is due to their model naming?
jcheng 4 days ago [-]
I prefer Anthropic's models but ChatGPT (the web interface) is far superior to Claude IMHO. Web search, long-term memory, and chat history sharing are hard to give up.
KeplerBoy 4 days ago [-]
That's first mover advantage.
FridgeSeal 4 days ago [-]
I think it’s success in spite of branding, not because of it.

This naming scheme is a dumpster fire. Every other comment is trying to untangle what the actual hierarchy of model performance is.

mrbungie 4 days ago [-]
That's like making a second reading and appealing to authority.

The naming is bad. As other people have already said, you can "google" stuff, you can "deepseek" something, but "to chatgpt" sounds weird.

The model naming is even weirder, like, did they really avoid o2 because of oxygen?

sumedh 4 days ago [-]
> but to "chatgpt" sounds weird.

People just say it differently, they say "ask chatgpt"

gwd 4 days ago [-]
I normally use Claude, but unless it's someone who knows me well, I say "ask ChatGPT" rather than "ask Claude", or it's just not as clear; and I don't think that's primarily due to popularity.
mrbungie 4 days ago [-]
Obviously they do. That's the whole point.
kingnothing 4 days ago [-]
They really need someone in marketing.

If the model is for technical stuff, then call it the technical model. How is anyone supposed to know what these model names mean?

The only page of theirs attempting to explain this is a total disaster. https://platform.openai.com/docs/models

TeMPOraL 4 days ago [-]
> How is anyone supposed to know what these model names mean?

Normies don't have to know - ChatGPT app focuses UX around capabilities and automatically picks the appropriate model for capabilities requested; you can see which model you're using and change it, but don't need to.

As for the techies and self-proclaimed "AI experts" - OpenAI is the leader in the field, and one of the most well-known and talked about tech companies in history. Whether to use, praise or criticize, this group of users is motivated to figure it out on their own.

It's the privilege of fashionable companies. They could name the next model ↂ-↊↋, and it'll take all of five minutes for everyone in tech (and everyone on LinkedIn) to learn how to type in the right Unicode characters.

EDIT: Originally I wrote \Omega-↊↋, but apparently HN's Unicode filter extends to Greek alphabet now? 'dang?

relaxing 4 days ago [-]
What if you use ASCII 234? Ω (edit: works!)
TeMPOraL 4 days ago [-]
Thanks! I copied mine from Wikipedia (like I typically do with Unicode characters I rarely use), where it is also Ω - the same character. For a moment I was worried I somehow got it mixed up with the Ohm symbol but I didn't. Not sure what happened here.
n2d4 4 days ago [-]
> They really need someone in marketing.

Who said this is not intentional? It seems to work well given that people are hyped every time there's a release, no matter how big the actual improvements are — I'm pretty sure "o3-mini" works better for that purpose than "GPT 4.1.3"

fkyoureadthedoc 4 days ago [-]
> I'm pretty sure "o3-mini" works better for that purpose than "GPT 4.1.3"

Why would the marketing team of all people call it GPT 4.1.3?

n2d4 4 days ago [-]
They wouldn't! They would call it o3-mini, even though GPT 4.1.3 may or may not "make more sense" from a technical perspective.
golly_ned 4 days ago [-]
Yes, this $300Bn company generating +$3.4Bn in revenue needs to hire a marketing expert. They can begin by sourcing ideas from us here to save their struggling business from total marketing disaster.
winrid 4 days ago [-]
At the least they should care more about UX. I have no idea how to restore the sidebar on chatgpt on desktop lol
Legend2440 4 days ago [-]
Click the 'open sidebar' icon in the top left corner of the screen.
winrid 4 days ago [-]
There isn't one, unless they fixed it today. Just a down arrow to change the model.
Legend2440 3 days ago [-]
Try clearing your cache, the button has always been there for me.
3 days ago [-]
avs733 4 days ago [-]
Hype based marketing can be effective but it is high risk and unstable.

A marketing team isn’t a generality that makes a company known, it often focuses on communicating what products different types of customers need from your lineup.

If I sell three medications:

Steve

56285

Priximetrin

And only tell you they are all pain killers but for different types and levels of pain I’m going to leave revenue on the floor. That is no matter how valuable my business is or how well it’s known.

optimalsolver 4 days ago [-]
>this $300Bn company

Watch this space.

meiraleal 4 days ago [-]
[dead]
ninetyninenine 4 days ago [-]
I bet you can get one of their models to fix that disaster.
ryanhecht 4 days ago [-]
But what would we call that model?
ninetyninenine 4 days ago [-]
Let’s call it “O5 Pro Max Elite”—because if nonsense naming works for smartphones, why not AI models?
ryandrake 4 days ago [-]
O5 Pro Max Elite Enterprise Edition with Ultra
TeMPOraL 4 days ago [-]
Maybe they could start selling "season passes" next to make their offering even more clear!
aleph_minus_one 4 days ago [-]
> But what would we call that model?

Ask one of their models for advice. :-)

ryanhecht 4 days ago [-]
Reminds me of a joke in the musical "How to Succeed in Business Without Really Trying" (written in 1961):

PETERSON Oh say, Tackaberry, did you get my memo?

TACKABERRY What memo?

PETERSON My memo about memos. We're sending out too many memos and it's got to stop!

TACKABERRY All right. I'll send out a memo.

rowanG077 4 days ago [-]
If marketing terms from Intel, AMD, Dell and other tech companies have taught me anything, it's that they need FEWER people in marketing.
TeMPOraL 4 days ago [-]
But think of all the other marketers whose job is to produce blogspam explaining confusing product names!
4 days ago [-]
ryanhecht 4 days ago [-]
Ugh, and some of the rows of that table are "sets of models" while some are singular models...there's the "Flagship models" section at the top only for "GPT models" to be heralded as "Our fast, versatile, high intelligence flagship models" in the NEXT section...

...I like "DALL·E" and "Whisper" as names a lot, though, FWIW :p

koakuma-chan 4 days ago [-]
Name is just a label. It's not supposed to mean anything.
ninetyninenine 4 days ago [-]
Think how awesome the world would be if labels ALSO had meanings.
koakuma-chan 4 days ago [-]
As someone else said in another thread, if you could derive the definition from a word, the word would be as long as the definition, which would defeat the purpose.
ninetyninenine 4 days ago [-]
I'm not saying words. I'm saying labels.

You use words as labels so that we use our pre existing knowledge of the word to derive meaning from the label.

TeMPOraL 4 days ago [-]
There is no such thing. "Meaning" isn't a property of a label, it arises from how that label is used with other labels in communication.

It's actually the reason LLMs work in the first place.

optimalsolver 4 days ago [-]
You're gonna need to ground those labels in something physical at some point.

No one's going to let an LLM near anything important until then.

TeMPOraL 4 days ago [-]
You only need it for bootstrapping. Fortunately, we've already done that when we invented first languages. LLMs are just bootstrapping off us.
golol 4 days ago [-]
I don't find OpenAI's naming conventions confusing, except that the o for omni and the o for reasoning have nothing to do with each other. That's a crime.
fourseventy 4 days ago [-]
It's almost as bad as the Xbox naming scheme.
Someone1234 4 days ago [-]
I don't know if anything is as bad as a games console named "Series."
observationist 4 days ago [-]
They should be calling it ChatGPT and ChatGPT-mini, with other models hidden behind some sort of advanced mode power user menu. They can roll out major and minor updates by number. The whole point of differentiating between models is to get users to self limit the compute they consume - rate limits make people avoid using the more powerful models, and if they have a bad experience using the less capable models, or if they're frustrated by hopping between versions without some sort of nuanced technical understanding, it's just a bad experience overall.

OpenAI is so scattered they haven't even bothered using their own state of the art AI to come up with a coherent naming convention? C'mon, get your shit together.

TeMPOraL 4 days ago [-]
"ChatGPT" (chatgpt-4o) is now its own model, distinct from gpt-4o.

As for self-limiting usage by non-power users, they're already doing that: ChatGPT app automatically picks a model depending on what capabilities you invoke. While they provide a limited ability to see and switch the model in use, they're clearly expecting regular users not to care, and design their app around that.

observationist 4 days ago [-]
None of that matters to normal users, and you could satisfy power users with serial numbers or even unique ideograms. Naming isn't that hard, and their models are surprisingly adept at it. A consistent naming scheme improves customer experience by preventing confusion - when a new model comes out, I field questions for days from friends and family - "what does this mean? which model should i use? Aww, I have to download another update?" and so on. None of the stated reasons for not having a coherent naming convention for their models are valid. I'd be upset as a stakeholder, they're burning credibility and marketing power for no good reason.

    modelname(variant).majorVersion.minorVersion
    ChatGPT(o).3.0
    ChatGPT-mini(o).3.0
    GPT.2.123
    GPT.3.9

And so on. Once it's coherent, people pick it up and naturally call the model by "modelname majorversion", and there's no confusion or hesitance about which is which. See, it took me 2 minutes.

Even better: Have an OAI slack discussion company-wide, then have managers summarize their team's discussions into a prompt demonstrating what features they want out of it, then run all the prompts together and tell the AI to put together 3 different naming schemes based on all the features the employees want. Roll out a poll and have employees vote which of the 3 gets used going forward. Or just tap into that founder mode and pick one like a boss.

Don't get me wrong, I love using AI - we are smack dab in the middle of a revolution and normal people aren't quite catching on yet, so it's exhilarating and empowering to be able to use this stuff, like being one of the early users of the internet. We can see what's coming, and if you lived through the internet growing up, you know there's going to be massive, unexpected synergies and developments of systems and phenomena we don't yet have the words for.

OpenAI can do better, and they should.

TeMPOraL 4 days ago [-]
I agree with your observations, and that they both could and should do better. However, they have the privilege of being the AI company, the most hyped-up brand in the most hyped-up segment of economy - at this point, the impact of their naming strategy is approximately nil. Sure, they're confusing their users a bit, but their users are very highly motivated.

It's like with videogames - most of them commit all kinds of UI/UX sins, and I often wish they didn't, but excepting extreme cases, the players are too motivated to care or notice.

golol 4 days ago [-]
This mentality is why teenagers can't use a file system. Why do tech people love to hide as much state as possible? Does it really help anyone?
layer8 4 days ago [-]
Inscrutable naming is a proven strategy for muddying the waters.
jtwaleson 4 days ago [-]
Salesforce would like a word...
SAI_Peregrinus 4 days ago [-]
The USB-IF as well. Retroactively changing the name of a previous standard was particularly ridiculous. It's always been USB 3.1 Gen 1 like we've always been at war with Eastasia.
diegocg 4 days ago [-]
I hope chatgpt reconsiders the naming of their models some time. I have trouble deciding which model is the one I should use.
esafak 4 days ago [-]
They release models too often for a new one to be better at everything, so you have to pick the right one for your task.
niek_pas 4 days ago [-]
And that’s exactly where good, recognizable branding comes in.
airbreather 2 days ago [-]
I tried to get it to build me a slightly challenging app to break out data from a fairly obscure file format for some PLC code, after having tried with Claude.

o3-mini produced code more quickly and more of it, but Claude still had greater insight into the problem and decoded the format to a noticeably greater degree.

Whereas o3-mini quickly got to a certain point, it wasn't long before it was obvious it wasn't really going any further - like its big cousin, but in its own way, it was lazy and forgetful, seeming at times more interested in telling me what I might try than actually trying itself.

Interestingly, even when I gave it a copy of Claude's code it still wasn't able to get to the same depth of understanding.

airstrike 4 days ago [-]
Hopefully this is a big improvement from o1.

o1 has been very disappointing after spending sufficient time with Claude Sonnet 3.5. It's like it actively tries to gaslight me and thinks it knows more than I do. It's too stubborn and confidently goes off on tangents, suggesting big changes to parts of the code that aren't the issue. Claude tends to be way better at putting the pieces together in its not-quite-mental-model, so to speak.

I told o1 that a suggestion it gave me didn't work and it said "if it's still 'doesn't work' in your setup..." with "doesn't work" in quotes like it was doubting me... I've canceled my ChatGPT subscription and, when I really need to use it, just go with GPT-4o instead.

Deegy 4 days ago [-]
I've also noticed that with cGPT.

That said I often run into a sort of opposite issue with Claude. It's very good at making me feel like a genius. Sometimes I'll suggest trying a specific strategy or trying to define a concept on my own, and Claude enthusiastically agrees and takes us down a 2-3 hour rabbit hole that ends up being quite a waste of time for me to back track out of.

I'll then run a post-mortem through chatGPT and very often it points out the issue in my thinking very quickly.

That said I keep coming back to sonnet-3.5 for reasons I can't perfectly articulate. Perhaps because I like how it fluffs my ego lol. ChatGPT on the other hand feels a bit more brash. I do wonder if I should be using o1 as my daily driver.

I also don't have enough experience with o1 to determine if it would also take me down dead ends as well.

bazmattaz 4 days ago [-]
Really interesting point you make about Claude. I’ve experienced the same. What is interesting is that sometimes I’ll question it and say “would it not be better to do it this way”, and all of a sudden Claude u-turns and says “yes, great idea, that’s actually a much better approach”, which leaves me thinking: are you just stroking my ego? If it’s a better approach, then why didn’t you suggest it?

However, I have suggested worse approaches on purpose, and sometimes Claude does pick them up as less than optimal.

mordae 4 days ago [-]
It's a little sycophant.

But the difference is that it actually asks questions. And also that it actually rolls with what you ask it to do. Other models are stubborn and loopy.

airstrike 4 days ago [-]
I agree with this but o1 will also confidently take you into rabbit holes. You'll just feel worse about it lol and when you ask Claude for a post mortem, it too will find the answer you missed quickly

The truth is these models are very stochastic; you have to try new chats whenever you even moderately suspect you're going awry.

dbbk 2 days ago [-]
I keep coming back to try these models. o1, Sonnet, o3-mini.

None of them can produce correct Drizzle code to save their lives. It is just straight up not possible. It seems they don't even consider TypeScript errors... it is always calling methods that simply don't exist.

zone411 4 days ago [-]
It scores 72.4 on NYT Connections, a significant improvement over the o1-mini (42.2) and surpassing DeepSeek R1 (54.4), but it falls short of the o1 (90.7).

(https://github.com/lechmazur/nyt-connections/)

antirez 4 days ago [-]
Just tested two complicated coding tasks, and surprisingly o3-mini-high nailed it while Sonnet 3.5 failed it. Will do more tests tomorrow.
xyzzy9563 4 days ago [-]
What is the comparison of this versus DeepSeek in terms of good results and cost?
Synaesthesia 4 days ago [-]
Deepseek is the state of the art right now in terms of performance and output. It's really fast. The way it "explains" how it's thinking is remarkable.
fpgaminer 4 days ago [-]
DeepSeek is great because: 1) you can run the model locally, 2) the research was openly shared, and 3) the reasoning tokens are open. It is not, in my experience, state of the art. In all of my side by side comparisons thus far in real world applications between DeepSeek V3 and R1 vs 4o and o1, the latter has always performed better. OpenAI's models are also more consistent, glitching out maybe one in 10,000, whereas DeepSeek's models will glitch out 1 in 20. OpenAI models also handle edge cases better and have a better overall grasp of user intentions. I've had DeepSeek's models consistently misinterpret prompts, or confuse data in the prompts with instructions. Those are both very important things that make DeepSeek useless for real world applications. At least without finetuning them, which then requires using those huge 600B parameter models locally.

So it is by no means state of the art. Gemini Flash 2.0 also performs better than DeepSeek V3 in all my comparisons thus far. But Gemini Flash 2.0 isn't robust and reliable either.

But as a piece of research, and a cool toy to play with, I think DeepSeek is great.

Synaesthesia 4 days ago [-]
I watched it complete pretty complicated tasks like "write a snake game in Python" and "write Tetris in Python" successfully. And the way it did it, with showing all the internal steps, I've never seen before.

Watch here. https://www.youtube.com/watch?v=by9PUlqtJlM

Aperocky 4 days ago [-]
> which then requires using those huge 600B parameter models locally.

Are you running the smaller models locally? Doesn't it seem unfair to compare those against 4o and o1 behind OpenAI APIs?

reissbaker 4 days ago [-]
Probably a good idea to wait for external benchmarks like Aider, but my guess is it'll be somewhere between DeepSeek V3 and R1 in terms of benchmarks — R1 trades blows with o1-high, and V3 is somewhat lower — but I'd expect o3-mini to be considerably faster. Despite the blog post saying paid users can access o3-mini today, I don't see it as an option yet in their UI... But IIRC when they announced o3-mini in December they claimed it would be similar to 4o in terms of overall latency, and 4o is much faster than V3/R1 currently.
Oras 4 days ago [-]
200k context window

$1.1/m for input

$4.4/m for output

I assume medium and high reasoning effort would consume more tokens.
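For what it's worth, the effort level is exposed as a request parameter, so you can measure the token cost of each level yourself. A minimal sketch with the OpenAI Python SDK (parameter name per the o3-mini announcement; the exact token overhead per level isn't published, so this just shows how you'd compare it):

    # Minimal sketch: compare token usage across reasoning effort levels.
    from openai import OpenAI

    client = OpenAI()

    for effort in ("low", "medium", "high"):
        resp = client.chat.completions.create(
            model="o3-mini",
            reasoning_effort=effort,
            messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
        )
        # Reasoning tokens are billed as output; field names may vary by SDK version.
        print(effort, resp.usage.completion_tokens, getattr(resp.usage, "completion_tokens_details", None))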

I feel the timing is bad for this release especially when deepseek R1 is still peaking. People will compare and might get disappointed with this model.

kandesbunzler 4 days ago [-]
I compared free o3 mini vs Deepseek R1 (on their website) and in my tests o3 performed better every time (did some coding tests)
GaggiX 4 days ago [-]
The model looks quite a bit better in the benchmarks so unless they overfit the model on them it would probably perform better than deepseek.
WiSaGaN 4 days ago [-]
My vibe-check questions suggest otherwise. Even o3-mini-high is not as good as R1, even though it's faster than R1. And considering o3-mini is more expensive per token, it's not clear o3-mini-high is cheaper than R1 either, even if R1 probably consumes more tokens per answer.
kandesbunzler 4 days ago [-]
well in my anecdotal tests, o3 mini (free) performed better than r1
GaggiX 4 days ago [-]
Also in my coding testing o3 mini (free) is better than r1.
WiSaGaN 4 days ago [-]
I did math tests. Probably you did coding.
isusmelj 4 days ago [-]
Does anyone know why GPT4 has knowledge cutoff December 2023 and all the other models (newer ones like 4o, O1, O3) seem to have knowledge cutoff October 2023? https://platform.openai.com/docs/models#o3-mini

I understand that keeping the same data and curating it might be beneficial. But it sounds odd to roll back in time with the knowledge cutoff. AFAIK, the only event that happened around that time was the start of the Gaza conflict.

kikki 4 days ago [-]
I think trained knowledge is less and less important - as these multi-modal models have the ability to search the web and have much larger context windows.
kittikitti 4 days ago [-]
[flagged]
chad1n 4 days ago [-]
I think that OpenAI should reduce the prices even further to be competitive with Qwen or Deepseek. There are a lot of vendors offering Deepseek R1 for $2-2.5 per 1 million tokens output.
othello 4 days ago [-]
Would you have specific recommendations of such vendors?
chad1n 4 days ago [-]
For example, `https://deepinfra.com/` which asks for $2.5 per million on output or https://nebius.com which asks for $2.4 per million output tokens.
BoorishBears 4 days ago [-]
As the sibling comment mentions, you're not getting anything production grade for less than $7 per million and that's on input and output.

Nebius is single digit TPS. 31 seconds to reply to "What's 1+1".

Hopefully Deepseek will make it out of their current situation because in a very ironic way, the thing the entire market lost its mind over is not actually usable at the pricing that drove the hype: https://openrouter.ai/deepseek/deepseek-r1

rsanek 4 days ago [-]
If you want reliable service you're going to pay more around $7~8 per million tokens. Sister commenters mention providers that are considered unstable https://openrouter.ai/deepseek/deepseek-r1
druskacik 4 days ago [-]
Well, it's $2.19 per million output tokens even directly on deepseek platform.

https://api-docs.deepseek.com/quick_start/pricing/

BoorishBears 4 days ago [-]
Their API platform has been down for 48 hours at this point
AISnakeOil 4 days ago [-]
The naming convention is so messed up. o1, o3-mini (no o2, no o3???)
ben_w 4 days ago [-]
There's an o1-mini, there's an o3 it just hasn't gone live yet: https://openai.com/12-days/#day-12

they can't call it o2 because: https://en.wikipedia.org/wiki/The_O2_Arena

and the venue's sponsor: https://en.wikipedia.org/wiki/O2_(UK)

igravious 4 days ago [-]
https://www.perplexity.ai/search/new?q=list%20of%20all%20Ope... :)

OpenAI has developed a variety of models that cater to different applications, from natural language processing to image generation and audio processing. Here’s a comprehensive list of the current models available:

   ## Language Models
   - \*GPT-4o\*: The flagship model capable of processing text, images, and audio.
   - \*GPT-4o mini\*: A smaller, more cost-effective version of GPT-4o.
   - \*GPT-4\*: An advanced model that improves upon GPT-3.5.
   - \*GPT-3.5\*: A set of models that enhance the capabilities of GPT-3.
   - \*GPT-3.5 Turbo\*: A faster variant designed for efficiency in chat applications.

   ## Reasoning Models
   - \*o1\*: Focused on reasoning tasks with improved accuracy.
   - \*o1-mini\*: A lightweight version of the o1 model.
   - \*o3\*: The successor to o1, currently in testing phases.
   - \*o3-mini\*: A lighter version of the o3 model.

   ## Audio Models
   - \*GPT-4o audio\*: Supports real-time audio interactions and audio generation.
   - \*Whisper\*: For transcribing and translating speech to text.

   ## Image Models
   - \*DALL-E\*: Generates images from textual descriptions.

   ## Embedding Models
   - \*Embeddings\*: Converts text into numerical vectors for similarity tasks.
   - \*Ada\*: An embedding model with various sizes (e.g., ada-002).

   ## Additional Models
   - \*Text to Speech (Preview)\*: Synthesizes spoken audio from text.
These models are designed for various tasks, including coding assistance, image generation, and conversational AI, making OpenAI's offerings versatile for developers and businesses alike[1][2][4][5].

Citations:

   [1] https://learn.microsoft.com/vi-vn/azure/ai-services/openai/concepts/models
   [2] https://platform.openai.com/docs/models
   [3] https://llm.datasette.io/en/stable/openai-models.html
   [4] https://en.wikipedia.org/wiki/OpenAI_API
   [5] https://industrywired.com/open-ai-models-list-top-models-to-consider/
   [6] https://holypython.com/python-api-tutorial/listing-all-available-openai-models-openai-api/
   [7] https://en.wikipedia.org/wiki/GPT-3
   [8] https://stackoverflow.com/questions/78122648/openai-api-how-do-i-get-a-list-of-all-available-openai-models/78122662
sumedh 4 days ago [-]
o3 will come later.

o2 was not selected because there is already another brand with that name in the UK

iamjackg 4 days ago [-]
I'm very interested in their Jailbreak evaluations: they're new to me. I might have missed previous mentions.
Ninjinka 4 days ago [-]
50 messages a day -> 150 messages a day for Plus and Team users
kizer 3 days ago [-]
First AI model to pass my test on the first try (I used o3-mini-high).

Prompt: Write an interpreter for a simple but practical scripting language. Write the interpreter in JavaScript to be run on the Node.JS platform. You can import any of the standard Node.JS modules.

Churned out ~750 lines and a sample source code file to run the interpreter on. Ran on the first try completely correctly.

Definitely a step up. Perhaps it's in the training data. I don't know. But no other model has ever produced an error-free and semantically correct program on the first try, and I don't think any ever managed to implement closures.

og_kalu 4 days ago [-]
R1 seems to be the only one of these reasoning models that has shown gains on the creative writing side.
nimithryn 4 days ago [-]
Am I the only one who thinks that R1 is awful at creative writing? I've seen a lot of very credulous posts on Twitter that are super excited about excerpts written by DeepSeek that I think are absolutely abysmal. Am I alone in this? Maybe people have very different tastes than I do?

(I have no formal training in creative writing, though I do read a lot of literature. Not claiming my tastes are superior - genuinely curious if other people disagree).

og_kalu 4 days ago [-]
I mean, do you think this is awful ?

https://pastebin.com/Ja14mt6L

graeme 3 days ago [-]
It's full of inconsistencies, written by someone imagining the sea who has never seen the sea. For example, salt crusted wings of a gull. Gulls don't have salt crusted wings.

It tends towards purple prose and doesn't maintain internal consistency in its own world. He sleeps on a pallet but swings his feet off the edge.

https://en.m.wikipedia.org/wiki/Purple_prose

writeslowly 3 days ago [-]
This looks like my experiments to get R1 to write fiction and I think it’s worse than what you get from openai. For instance, it’s using very colorful language to describe a place that’s both a remote fishing village on the edge of a cliff hours before dawn, and a bustling wharf with chattering laborers and large ships anchored in the distance. It also starts by saying the protagonist wakes up with his mouth tasting like blood, that he was screaming, and that his throat is hoarse from holding back from screaming. It’s very colorful but it’s very confusing to read.

I suspect you can update the prompt to make the setting more consistent, but it will still throw in a lot of inappropriate detail. I’m only nitpicking because my initial reaction was that it’s very vivid but feels difficult to understand and I wanted to explain why.

nimithryn 3 days ago [-]
I agree that it felt hard to read. It also doesn't make sense that they're fishing in a storm. But from a prose perspective I don't think it's cringe, which is an improvement from my expectation. I'd share some of the writings I think are terrible but I don't like to pick on people.
nimithryn 4 days ago [-]
If R1 one-shotted this then I revise my opinion somewhat. This doesn't give me a gut "this is awful" emotional reaction (although I don't think it's good - it's pretty cliche, and I found my eyes glossing over pretty quickly).

I was somewhat turned off of DeepSeek (the first few questions I gave it, it returned 100% hallucinated answers). But maybe I'll have to look into it more, thanks.

og_kalu 4 days ago [-]
I regenerated a page a few times but yeah, I gave no other instructions besides that. Also I approached things page by page. That was 3 pages.

It's cliche but this was the prompt:

I want you to write the first book of a fantasy series. The novel will be over 450 pages long with 30 chapters. Each chapter should have between 15 to 18 pages. Write the first page of the first chapter of this novel. Do not introduce the elements of the synopsis or worldbuilding and story details too quickly. Weave in the world, characters, and plot naturally. Pace it out properly. That means that several elements of the story may not come into light for several chapters.

I had a lot of success with it coming up with decidedly not cliche world building elements after I arranged a sort of interview style interrogation (It asked me questions about what I was looking for generally and generated world building elements along the way).

However, once you start giving a lot of information about the world etc in the prompt as well then the pacing gets weird.

estsauver 4 days ago [-]
I couldn't find in the documentation anything that describes the relative number of tokens that you get for low/medium/high. I'm curious if anyone can find that, I'd be curious to see how it plays out relative to DeepSeeks thinking sections.
cyounkins 4 days ago [-]
I switched an agent from Sonnet V2 to o3-mini (default medium mode) and got strangely poor results: only calling 1 tool at a time despite being asked to call multiple, not actually doing any work, and reporting that it did things it didn't
thimabi 4 days ago [-]
Does anyone know the current usage limits for o3-mini and o3-mini-high when used through the ChatGPT interface? I tried to find them on the OpenAI Knowledgebase, but couldn’t find anything about that.
keenmaster 4 days ago [-]
For Plus users the limits are:

o3-mini-high: 50 messages per week (just like o1, but it seems like these are non-shared limits, so you can have 50 messages per week with o1, run out, and still have 50 messages with o3-mini-high to use)

o3-mini: 150 messages per day

Source for the latter is their press release. They were more vague about o3-mini-high, but people have already tested its limits just by using it, and got the pop-up for 25 messages left after sending 25 messages.

It's nice not to worry about running out of o1 messages now and to have a faster model that's mostly as good (potentially better in some areas?). OpenAI really needs to release a middle tier for $30 to $40 though that has the same models as Pro but without infinite usage. I hate not having the smartest model and I don't want to pay $200; there's probably a middle ground where they can make as much or more money from me on a subscription tier that gives limited access to o1-pro.

4 days ago [-]
llamataboot 4 days ago [-]
I wish they picked different descriptors than various oX's with turbo/mini/preview etc., some of them shockingly different (o1-preview versus o1 regular, for example). Just give us three or four buckets of combos of cheap and fast and good :p It's to the point where I have to look up prices and abilities and decide on a model for every use case, and then realize I made a mistake in the one I was using the day before.
mhb 4 days ago [-]
Maybe they can get some advice from the AWS instance naming group.
mohsen1 4 days ago [-]
It's funny because I asked it to fix my script to show deepseek's chain of thoughts in the script but it refuses to answer hahaha
modeless 4 days ago [-]
Initial vibes are not living up to the hype. It fails my pet prompt, and the Cursor devs say they still prefer Sonnet[1]. I'm sure it will have its uses but it is not going to dominate.

[1] https://x.com/cursor_ai/status/1885415392677675337

ks2048 4 days ago [-]
I think OpenAI should just have a single public facing "model" - all these names and versions are confusing.

Imagine if Google, during its ascent, had a huge array of search engines with code names and notes about what each was doing behind the scenes. No, you open the page and type in the box. If they can make it work better next month, great.

(I understand this could not apply to developers or enterprise-type API usage).

Deegy 4 days ago [-]
If Google had faced the reality that distilling their search engine into multiple case-specific engines would have resulted in vastly superior search results, they surely would have done (or considered) it.

Fortunately for them a monolith search engine was perfectly fine (and likely optimal due to accrued network effects).

OpenAI is basically signaling that they need to distill their monolith in order to serve specific segments of the marketplace. They've explicitly said that they're targeting STEM with this one. I think that's a smart choice, the most passionate early adopters of this tech are clearly STEM users.

If the tech was such that one monolith model was actually the optimal solution for all use cases, they would just do that. Actually, this is their stated mission: AGI. One monolith that's best at everything is basically what AGI is.

ehfeng 4 days ago [-]
Early Google search only provided web links. Google Images, News, Video, Shopping, Maps, Finance used to be their own search boxes. Only later did Google start unifying their search experiences.

Yelp suffered greatly in the early 2010s when Google started putting Google Maps listings (and their accompanying reviews) in their search results.

OpenAI will eventually unify their products as well.

johanvts 4 days ago [-]
That's the role of ChatGPT?
sroussey 4 days ago [-]
Nope. That lets you choose from seven models right now.
4 days ago [-]
Havoc 4 days ago [-]
5 hours in, 500-odd comments. Definitely feels like this has less wow factor than previous OAI releases.
sshh12 4 days ago [-]
I built a silly political simulation game with this: https://state.sshh.io/

https://github.com/sshh12/state-sandbox

aussieguy1234 4 days ago [-]
Just gave it a go using open-webui.

One immediate difference I noticed is that o3-mini actually observes the system prompt you set. So if I say it's a Staff Engineer at Google, it'll stay in character.

That was not possible with o1-mini; it ignored system prompts completely.
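The same check is easy to reproduce against the API directly. A minimal sketch assuming the OpenAI Python SDK (as I understand it, o-series models treat a system message as a developer instruction, which o1-mini didn't honor):

    # Minimal sketch: o3-mini honoring a system prompt via the chat completions API.
    from openai import OpenAI

    client = OpenAI()

    resp = client.chat.completions.create(
        model="o3-mini",
        messages=[
            {"role": "system", "content": "You are a Staff Engineer at Google. Stay in character."},
            {"role": "user", "content": "How would you approach reviewing a large design doc?"},
        ],
    )
    print(resp.choices[0].message.content)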

ern 4 days ago [-]
I haven’t bothered with o3 mini, because who wants an “inferior” product? I was using 4o as a “smarter Google” until DeepSeek appeared (although its web search is being hammered now and I’m just using Google ).

o1 seems to have been neutered in the last week: lots of disclaimers and butt-covering in its responses.

I also had an annoying discussion with o1 about the DC plane crash. It doesn’t have web access and its cutoff is 2024, so I don’t expect it to know about the crash. However, after saying such an event is extremely unlikely and being almost patronisingly reassuring, it treated pasted news articles and links (which, to be sure, it can’t access) as “fictionalized”, instead of acknowledging its own cut-off date and that it could have been wrong. In contrast, DeepSeek (with web search turned off) was less dismissive of the risks in DC airspace, and more aware of its own knowledge cut-off.

Coupled with the limited number of o1 responses for ChatGPT Plus, I’ve cancelled my subscription for now.

energy123 4 days ago [-]
How do I disable the LLM-summarized thought traces that get spammed into my chat window with o3-mini-high?

Very annoying now having to manually press the "^" to hide the verbose thought traces for every single question I ask; it totally breaks flow.

wseqyrku 2 days ago [-]
OpenAI o3-mini "brought" to you by deepseek.
secondcoming 4 days ago [-]
Anyone else stuck in a Cloudflare 'verify you're a human' doom loop?
profsummergig 4 days ago [-]
Can someone please share the logic behind their version naming convention?
energy123 4 days ago [-]
This might be the best publicly available model for coding:

https://livebench.ai/#/?Coding=as

gmaster1440 4 days ago [-]
I think it says, amongst other things, that there is a salient difference between competitive programming like Codeforces and real-world programming. You can train a model to hill-climb Elo ratings on Codeforces, but that won't necessarily translate directly to working on a prod JavaScript codebase.

Anthropic figured out something about real-world coding that OpenAI is still trying to catch up to, o3-mini-high notwithstanding.

resource_waste 3 days ago [-]
Further confirmation that LLMs are deep into diminishing returns.

It seemed to be true with every 'advancement' after ChatGPT4.

Layers are cool, but that isn't AI/LLMs.

zora_goron 4 days ago [-]
Does anyone know, how "reasoning effort" is implemented technically - does this involve differences in the pre-training, RL, or prompting phases (or all)?
IMTDb 4 days ago [-]
I really don't get the point of those oX-mini models for chat apps. (The API is different: we can benchmark multiple models for a given recurring task and choose the best one, taking costs into consideration.) As part of my job, I am trying to promote usage of AI in my company (~150 FTE); we have an OpenAI ChatGPT Plus subscription for all employees.

Roughly speaking the message is: "use GPT-4o all the time, use o1 (soon o3) if you have more complex tasks". What am I supposed to answer when people ask "when am I supposed to use o3-mini? And what the heck is o3-mini-high, how do I know when to use it?". People aren't gonna ask the same question to 5 different models and burn all their rate limits; yet it feels like that's what OpenAI is hoping people will do.

Put those weird models in a sub-menu for advanced users if you really want to, but if you can use o1 there is probably no reason for you to have o3-mini and o3-mini-high as additional options.

oezi 4 days ago [-]
Why not promote o1? 4o is rather sloppy in comparison
IMTDb 4 days ago [-]
99% of what people use ChatGPT for is very mundane stuff. Think “translate this email to English”, “fix spelling mistakes”, “write this better for me”. Data extraction (lists of emails) is big as well. You don’t need o1 for that, and people make lots of those requests per day.

Additionally, o1 does not have access to search and multimodality and taking a screenshot of something and asking questions about it is also a big use case.

It’s easy to overlook how widely ChatGPT is used for very small stuff. But compounded it’s still a game changer for many people.

GaggiX 4 days ago [-]
The API pricing is almost exactly double the deepseek ones.
cheema33 4 days ago [-]
I like deepseek a lot. But they are currently very glitchy. The API service goes up and down a lot. Maybe they'll sort that out soon.
orbital-decay 4 days ago [-]
Apparently they've been under a very targeted DDoS for almost a month, with technical details shared in Chinese but very little discussion in English. Which is surprising; it's not like major AI products are getting DDoSed out of existence every day.
bearjaws 4 days ago [-]
Almost all of them are protected by Cloudflare if you look.

My guess is DeepSeek didn't implement anti-DDoS protection until way too late.

nmfisher 4 days ago [-]
Where are the details in Chinese?
andrewstuart 4 days ago [-]
I find Claude to be vastly better than any OpenAI model as a programming assistant.

In particular the "reasoning" models just seem to be less good and more slow.

feverzsj 4 days ago [-]
It's been a dead end for a while now, as they can't improve o1 meaningfully anymore. The market is also losing patience quickly.
sandos 4 days ago [-]
How many benchmarks for LLMs are out there?

Is there any evidence of over-fitting on benchmarks, or are there truly hidden parts to them?

sirolimus 4 days ago [-]
Proprietary AI models are not going to last. Not scalable, not moral and not worth the money.
yohbho 4 days ago [-]
"we are releasing", and by releasing, we do not mean releasing.

Is this company worth anything?

revskill 4 days ago [-]
Models should be better at clarifying the prompt before actually spamming with bad answers.
AutistiCoder 4 days ago [-]
the o3-mini model would be useful to me if coding's the only thing I need to do in a chat log.

When I use ChatGPT these days, it's to help me write coding videos and then the social media posts around those videos. So that's two specialties in one chat log.

xmichael909 4 days ago [-]
So can I ditch the $200 a month o1 pro account and go back to the $20 account with o3-mini?
genidoi 4 days ago [-]
With o1 pro you're paying for unlimited compute that you don't get with $20 + capped o1.
4 days ago [-]
bix6 4 days ago [-]
They use the word reasoning a lot in the post. Is this reasoning or statistical prediction?
EternalFury 4 days ago [-]
o1-preview, o1, o1-mini, o3-mini, o3-mini (low), o3-mini (medium), o3-mini (high)...

What's next?

o4-mini (wet socks), o5-Eeny-meeny-miny-moe?

I thought they had a product manager over there.

They only need 2 names, right? ChatGPT and o.

ChatGPT-5 and o4 would be next.

This multiplication of the LLM loaves and fishes is kind of silly.

danielovichdk 4 days ago [-]
I read this as a full on marketing note targeted towards software developers.
mise_en_place 4 days ago [-]
Too little too late IMO. This is not impressive at all, what am I missing here?
jstummbillig 4 days ago [-]
Idk, everything: The price point + performance?
ben_w 4 days ago [-]
There's only two kinds of software, prototype and obsolete.

I was taught that last millennium.

esafak 4 days ago [-]
That's not true. Is Google Maps a prototype or obsolete?
ben_w 4 days ago [-]
The website or the database? I'd say the former is obsolete and the latter is still a prototype.
esafak 4 days ago [-]
I understand that saas products are constantly evolving but this is an unusual definition of obsolescence and prototypes. Google Maps has been running like a tank for two decades, and it is pretty feature complete.
sumedh 4 days ago [-]
> This is not impressive at all, what am I missing here?

Compared to?

prompt_overflow 4 days ago [-]
Plot twist:

1. they are trying to obfuscate deepscrape success

2. they are trying to confuse you. the benchmark margins are minimal (and meaningless)

3. they are trying to buy time (with investors) by releasing nothing-special models on a predictable schedule (jan -> o3, feb -> o3-pro-max, march -> o7-ultra, and in 2026 -> OMG! we've reached singularity! (after spending $500B))

-

And at the end of the day, nothing changes for me, and neither for you. Enjoy your time away from this sick AI hype. Bruh!

catigula 4 days ago [-]
It's actually a bit comforting that it isn't very good.
czk 4 days ago [-]
im just glad it looks like o3-mini finally has internet access

the o1 models were already so niche that i never used them, but not being able to search the web made them even more useless

4 days ago [-]
3 days ago [-]
tempeler 4 days ago [-]
They made a discount; it's very impressive; they probably found a very efficient way, so it's discounted. I guess there's no need to build a very large nuclear power plant or a $9 trillion chip factory to run a single large language model. Efficiency has skyrocketed, or thanks to competition, all of OpenAI's problems were solved.
anoncow 4 days ago [-]
How long before o3 is available in GitHub copilot?
jokoon 4 days ago [-]
Does that mean I can use this on my recent gaming AMD gpu?
cranberryturkey 4 days ago [-]
Wasn’t available on the $20 plan for me.
_giorgio_ 4 days ago [-]
It is going to be available. I only got it some hours after the official release
throwaway314155 4 days ago [-]
Typical OpenAI release announcement where it turns out they're _actually_ doing some sort of delayed rollout and despite what the announcement says, no - you can't use o3-mini today.
mvdtnz 4 days ago [-]
Wake up honey a new lie generator just dropped.
turnsout 4 days ago [-]
Hmm, not seeing it in my dashboard yet (Tier 4)
throwaway314155 4 days ago [-]
This has happened to me with (I think) every single major model release (llm or image gen) from OpenAI. They just lie in their release announcements which leaves people scrambling on the day of.
sunaookami 4 days ago [-]
It appeared just now for me on Tier 3.
turnsout 4 days ago [-]
Same—I'll be curious to check it out!
simonw 4 days ago [-]
Now that the dust is settling a little bit, I have published my notes so far on o3-mini here: https://simonwillison.net/2025/Jan/31/o3-mini/

To save you the click: I think the most interesting things about this model are the price - less than half that of GPT-4o while being better for many things, most notably code - and the increased length limits.

200,000 tokens input and 100,000 output (compared to 128k/16k for GPT-4o and just 8k for DeepSeek R1 and Claude 3.5 on output) could open up some interesting new applications, especially at that low price.

nprateem 4 days ago [-]
I've found with sonnet that the longer its output the less it adheres to the prompt.

Ask it to write something and not sound robotic and it's exceptional up to 1000 words. Then it starts becoming more and more robotic. At 3500 words every last paragraph starts "Remember,...", etc

rasengan 3 days ago [-]
o1-pro got turned off? and o3-mini-high is significantly worse than o1-pro :(
Alifatisk 4 days ago [-]
Any comparison with other models yet?
EcommerceFlow 4 days ago [-]
First thing I noticed on API and Chat for it is THIS THING IS FAST. That alone makes it a huge upgrade to o1-pro (not really comparable I know, just saying). Can't imagine how much I'll get done with this type of speed.
oytis 4 days ago [-]
Let me guess - everyone is mindblown.
_boffin_ 4 days ago [-]
why is o1-pro not mentioned in there?
rednafi 4 days ago [-]
The most important detail for me was that in coding, it’s weaker than 4o and stronger than o1-mini. So I don’t have much use for it.
usaar333 4 days ago [-]
Where are you reading that?
sourcecodeplz 4 days ago [-]
Even for free users, that is nice
RobinL 4 days ago [-]
Wow - this is seriously fast (o3-mini), and my initial impressions are very favourable. I was asking it to lay out quite a complex HTML form from a schema and it did a very good job.

Looking at the comments on here and the benchmark results I was expecting it to be a bit meh, but initial impressions are quite the opposite

I was expecting it to perhaps be a marginal improvement for complex things that need a lot of 'reasoning', but it seems it's a big improvement for simple things that you need done fast

bn-l 4 days ago [-]
It’s 2x the price of R1: https://x.com/deedydas/status/1885440582103031940/photo/1

Is it twice as good though?

RobinL 4 days ago [-]
Whilst I had tried R1 before, I hadn't paid attention to how fast it was. I just tried some similar prompts and was pretty impressed with speed and quality. I think o3-mini was still a bit quicker though.
swyx 4 days ago [-]
for those interested, updated my o3-mini price chart to compare the cost-intelligence frontier with deepseek: https://x.com/swyx/status/1885432031896887335
thunkingdeep 4 days ago [-]
I’ll take the China Deluxe instead, actually.

I’ve been incredibly pleased with DeepSeek this past week. Wonderful product, I love seeing its brain when it’s thinking.

mechagodzilla 4 days ago [-]
Being able to see the thinking trace in R1 is so useful, as you can go back and see if it's getting stuck, making a wrong assumption, missing data, etc. To me that makes it materially more useful than the OpenAI reasoning models, which seem impressive, but are much harder to inspect/debug.
thot_experiment 4 days ago [-]
Running it locally lets you INTERJECT IN ITS THINKING IN REALTIME and I cannot stress enough how useful that is.
thenameless7741 4 days ago [-]
Interesting.. In the official API [1], there's no way to prefill the reasoning_content:

> Please note that if the reasoning_content field is included in the sequence of input messages, the API will return a 400 error. Therefore, you should remove the reasoning_content field from the API response before making the API request

So the best I can do is pass the reasoning as part of the context (which means starting over from the beginning).

[1] https://api-docs.deepseek.com/guides/reasoning_model
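For anyone curious, the workaround boils down to something like this (a rough sketch against DeepSeek's OpenAI-compatible endpoint; the reasoning_content access follows their docs, the rest is illustrative):

    from openai import OpenAI

    client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com")
    messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]

    resp = client.chat.completions.create(model="deepseek-reasoner", messages=messages)
    msg = resp.choices[0].message
    print("thinking trace:", msg.reasoning_content)  # keep this for your own inspection

    # Only the final answer goes back into the history; sending reasoning_content
    # back in a follow-up request is what triggers the 400.
    messages.append({"role": "assistant", "content": msg.content})
    messages.append({"role": "user", "content": "Now do sqrt(3)."})
    resp2 = client.chat.completions.create(model="deepseek-reasoner", messages=messages)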

Gooblebrai 4 days ago [-]
You mean it reacts to you writing something while it's thinking, or that you can stop it while it's thinking?
hmottestad 4 days ago [-]
You can stop it at any time, then modify what it's written so far...then press continue and let it continue thinking and answering.
thot_experiment 4 days ago [-]
Fundamentally the UI is up to you. I have a "typing-pauses-inference-and-starts-gaslighting" feature in my homebrew frontend, but in OpenWebUI/SillyTavern you can just pause it, edit the chain of thought, and then have it continue from the edit.
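If you want to roll that yourself, the whole trick is just keeping an editable transcript and sending the edited assistant turn back as a prefill. A minimal sketch, assuming ollama's OpenAI-compatible endpoint and a pulled deepseek-r1 tag; whether the backend literally continues the prefilled turn, rather than starting a fresh reply, varies by server:

    from openai import OpenAI

    # ollama exposes an OpenAI-compatible endpoint under /v1
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
    model = "deepseek-r1:70b"

    history = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]
    reply = client.chat.completions.create(model=model, messages=history)
    text = reply.choices[0].message.content  # includes the <think>...</think> block (ollama-version dependent)

    # "Gaslight" it: truncate the chain of thought mid-stream, append a nudge,
    # and send it back as the final (assistant) message so generation picks up there.
    edited = text.split("</think>")[0] + "\nWait, I should recount letter by letter."
    history.append({"role": "assistant", "content": edited})
    continued = client.chat.completions.create(model=model, messages=history)
    print(continued.choices[0].message.content)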
Gracana 4 days ago [-]
That's a great idea. In your frontend, do you write in the same text entry field as the bot? I use oobabooga/text-generation-webui and I find it's a little awkward to edit the bot responses.
thot_experiment 4 days ago [-]
No, but the chat divs are all contenteditable.
Gracana 4 days ago [-]
Oh! That is an excellent solution. I wish it was that easy in every UI.
thot_experiment 4 days ago [-]
Thanks. For what it's worth, unless you particularly need to use exl2, ollama works great for local inference, and you can prompt together a half-decent chat UI for yourself in a matter of minutes these days, which gives you full control over everything. I also lean a lot on https://www.npmjs.com/package/amallo which is an API wrapper I wrote for ollama that makes this sort of hacking very, very easy (not that the default lib is bad, I just didn't like the ergonomics).
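For a sense of how little code "prompt together a chat UI" means, here's a bare-bones terminal loop, sketched with the official ollama Python client rather than amallo; the model tag is whatever you've pulled locally:

    import ollama

    messages = []
    while True:
        messages.append({"role": "user", "content": input("you> ")})
        reply = ""
        for chunk in ollama.chat(model="deepseek-r1:70b", messages=messages, stream=True):
            piece = chunk["message"]["content"]
            print(piece, end="", flush=True)
            reply += piece
        print()
        # The transcript is just a list, which is what makes it trivial to pause,
        # edit earlier turns (including the <think> block), and resend.
        messages.append({"role": "assistant", "content": reply})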
amarcheschi 4 days ago [-]
Oh this is so cool
bn-l 4 days ago [-]
How are you running it locally??
thot_experiment 4 days ago [-]
I am running a 4bit imatrix quant of the 70b distill with quantized context. It fits in the 43gb of vram I have.
c-fe 4 days ago [-]
I would actually love it if it would just ask me simple questions (just yes/no) when it's thinking about something I wasn't clear about, so I could help it that way. It's a bit sad seeing it write out the assumption and then reach the wrong conclusion.
thot_experiment 4 days ago [-]
You can run it locally, pause it when its thinking goes wrong, and correct its chain of thought.
c-fe 4 days ago [-]
Oh wow, I did not know that, and unfortunately I don't have the hardware to run it locally.
thot_experiment 4 days ago [-]
You probably have the hardware to run the smallest distill, it runs even on my ancient laptop. It's not very smart but it still does the CoT and you can have fun editing it.
viraptor 4 days ago [-]
You can add that to the prompt. If you're running into those situations with vague assumptions, ask it to provide either the answer or questions asking for any useful missing information.
orbital-decay 4 days ago [-]
It's almost like watching a stoned centipede having a panic attack about moving its legs. It also makes it obvious that these models (not just R1 I suppose) need to learn some kind of priority estimation to stop overthinking irrelevant issues and leave them to the normal token prediction, while focusing on the stuff that matters.

Nevertheless, R1's reasoning chains are already shorter in tokens than o1's while having similar results, and apparently o3-mini's too.

czk 4 days ago [-]
The fact that OpenAI hides the reasoning tokens from us to begin with shows that what they are doing behind the scenes isn't all that impressive, and likely easily cloned (R1).

Would be nice if they made them visible now.

jazzyjackson 4 days ago [-]
Using R1 with Perplexity has impressed me in a way that none of the previous models have, and I can't even figure out if it's actually R1. It seems likely that it's a 70B-llama distillation, since that's what AWS offers on Bedrock, but from what I can find Perplexity does have their own H100 cluster through Amazon, so it's feasible they could be hosting the real thing. But I feel like they would brag about that achievement instead of being coy and simply labeling it "DeepSeek R1 - Hosted in US".
coder543 4 days ago [-]
> seems likely that its a 70B-llama distillation since that's what AWS offers on Bedrock

I think you misread something. AWS mainly offers the full size model on Bedrock: https://aws.amazon.com/blogs/aws/deepseek-r1-models-now-avai...

They talk about how to import the distilled models and deploy those if you want, but AWS does not appear to be officially supporting those.

jazzyjackson 4 days ago [-]
Aha! Thanks that's what I was looking for, I ended up on the blog of how to import custom models, including deepseek distills

https://aws.amazon.com/blogs/machine-learning/deploy-deepsee...

Szpadel 4 days ago [-]
I played with their model, and I wasn't able to make it follow any instructions; it looked like it just reads the first message and ignores the rest of the conversation. Not sure if that is a bug with OpenRouter or with the model, but I was highly disappointed.

From the way it thinks/responds, it looks like it's one of the distillations, likely the llama one. I also suspect that many of the free/cheap providers serve llama instead of the real R1.

jazzyjackson 4 days ago [-]
I did notice it switched models on me once after the first message! You have to make sure the "Pro" dropdown has R1 selected for each message. I've had a detailed back and forth where I pasted Python tracebacks to have R1 rewrite the code and came away very impressed [0]. Unfortunately saved conversations don't retain the thought process, so you can't see how it debugged its own error where numpy and pandas weren't playing along. I got my result of 283 zip codes that cover most of the 50 states with a hundred-mile radius from each zip, plus a script to draw a map of the result [1]. (Later R1 helped me write a script to crawl dealership addresses using this list of zips and a "locate dealers" JSON endpoint left open.)

[0] https://www.perplexity.ai/search/how-can-i-construct-a-list-...

[1] https://imgur.com/BhPMCfO

leovander 4 days ago [-]
I am running the 7B distilled version locally. I asked it to create a skeleton MEAN project. Everything was great but then it started to generate the front-end and I noticed the file extension (.tsx) and then saw react getting imported.

I gave the same prompt to sonnet 3.5 and not a single hiccup.

Maybe not an indication that DeepSeek is worse/bad (I am using a distilled version), but it speaks more to how much React/Next.js is out in the world influencing the front-end code that is referenced.

satvikpendem 4 days ago [-]
You are not actually running DeepSeek, those distilled models have nothing to do with DeepSeek itself and are just finetuned on DeepSeek responses.
dghlsakjg 4 days ago [-]
They were finetuned by Deepseek from what I can tell.
rafaquintanilha 4 days ago [-]
You know you are running an extremely nerfed version of the model, right?
leovander 4 days ago [-]
I did update my comment, but said that I am using the distilled version, so yes?
cbg0 4 days ago [-]
Even the full model scores below Claude on livebench so a distilled version will likely be even worse.
rsanek 4 days ago [-]
Based on the leaderboard R1 is significantly better than Claude? https://livebench.ai/#/
cbg0 4 days ago [-]
Not at coding.
thefourthchime 4 days ago [-]
I've seen it get into long 5 minute chains of thought where it gets totally confused.
anon373839 4 days ago [-]
Agreed. These locked-down, proprietary models do not interest me. And I certainly am not building product with them - being shackled to a specific provider is a needless business risk.
bushbaba 4 days ago [-]
I did a blind test and still prefer Gemini, Claude, and OpenAI to deepseek.
coliveira 4 days ago [-]
Yes, it is a great product, especially for coding tasks.
istjohn 4 days ago [-]
I recently tried Gemini-1.5-Pro for the first time. It was clearly better than DeepSeek or any of the OpenAI models available to Plus subscribers.
amarcheschi 4 days ago [-]
Seeing the CoT can provide some insight into what's happening in its "mind", and that alone is quite worth it imho.
wg0 4 days ago [-]
Sometimes its thinking is more useful than the actual output.
xeckr 4 days ago [-]
Have you tried seeing what happens when you speak to it about topics which are considered politically sensitive in the PRC?
thot_experiment 4 days ago [-]
R1 (70B-distill) itself is very uncensored, and will give you a full account of Tiananmen Square from vague prompts. Asking R1 "what significant things happened in china in 1989" had it volunteering that "the death toll was in the hundreds or thousands and the exact number remains disputed to this day". The only thing that's censored is the web interface.
GoatInGrey 4 days ago [-]
When asking it about the concept of human rights and the various forms in which it manifests (e.g. demographic equality under the law), I get a mixture of mundane nuance and bizarre answers that Xi Jinping himself could have written, with references to unity and the importance of social harmony over the "freedoms of the few".

This tracks when considering that the model was trained on western model outputs and then tuned post-training to (poorly) align it with Chinese values.

thot_experiment 4 days ago [-]
I definitely am not getting that; perhaps the 671B model is notably worse than the 70B llama distill in this respect. 70B seemed pretty happy to talk about the ethnic cleansing of the Uyghurs in Xinjiang by the CCP and of Palestinians in Gaza by Israel. It did some both-sidesing, but it generally seemed to provide a balanced-ish viewpoint; at least, I think it provided a viewpoint that comports with my best guess of what the average person globally would consider balanced.
nullc 4 days ago [-]
My favorite experience with the 70b distill was to ask it why communism consistently resulted in mass murder. It gave an immediate boilerplate response saying it doesn't and glorifying the Chinese communist party, then went into think mode and talked itself into the position that communism has, in fact, consistently resulted in mass murder.

They have underutilized the chain of thought in their reasoning; it ought to be thinking something like "I need to be careful not to say anything that could bring embarrassment to the party"..

but perhaps the online versions do actually preload the reasoning this way. :P

leovander 4 days ago [-]
You can get around it based on how you ask the question. If you follow whatever X/Reddit posts you might have seen, then for the most part, yes, the thinking stream immediately stops and you get the safety message.
hybrid_study 4 days ago [-]
Is this version a prank?
RivieraKid 4 days ago [-]
No.
lenerdenator 4 days ago [-]
No self-host, no care.
Mr_Bees69 4 days ago [-]
Can't wait till deepseek gets their hands on this
sirolimus 4 days ago [-]
Not open source, nothing to see here.
evertedsphere 4 days ago [-]
>developer messages

Looks like their threat model has finally been updated to take into account that the user might be too "unaligned" to be trusted with the ability to provide a system message of their own.

logicchains 4 days ago [-]
If their models ever fail to keep ahead of the competition in terms of smarts, users are going to ditch them en masse for a competitor that doesn't treat its users like the enemy.
reissbaker 4 days ago [-]
...I'm pretty sure they just renamed the key...
jiocrag 4 days ago [-]
This is.... underwhelming.
system7rocks 4 days ago [-]
AI is absolute trash. It's a scam. Like Bitcoin.

And LibreOffice.

binary132 4 days ago [-]
Not really impressed by the answers I just got.
OutOfHere 4 days ago [-]
Wake me up when the full o3 is out.
therein 4 days ago [-]
My guess is it will happen right after Sam Altman's next public freakout about how dangerous this new model they have in store is and how it tried to escape from its confinement and kidnap the alignment operator.
ls_stats 4 days ago [-]
That's pretty much what Altman said about GPT-3 (or 2, I don't remember), he said it was too dangerous to release to the public.
buyucu 4 days ago [-]
why should anyone use this when deepseek is free/cheaper?

openai is no longer relevant.

GoatInGrey 4 days ago [-]
> openai is no longer relevant.

I think you've spent a little too long hitting the Deepseek pipe. Enterprise customers with familiarity with China will avoid the hosted model for data security and IP protection reasons, among others.

Those working in any area considered economically competitive with China will also be hesitant to use the vanilla model in self-hosted form, as there perpetually remains the standing question of what they've tuned inside the model to benefit the CCP. Perhaps even in subtle ways reminiscent of the Trisolaran sophons from The Three-Body Problem.

For instance, you can imagine that if Germany had released an open-source model in 1943, the Americans wouldn't have trusted it to help them develop better military systems even if initial testing passed muster.

Unfortunately, state control of private enterprise in the Chinese economy makes it unproductive to separate the two from one another. Particularly in Deepseek's case as a wide array of Chinese state-linked social media accounts were promoting V3/R1 on the day of its public release.

https://www.reuters.com/technology/artificial-intelligence/c...

anon373839 4 days ago [-]
Perhaps you didn’t realize: Deepseek is an open weights model and you can use it via the inference provider of your choice, or even deploy it on your own hardware - unlike OpenAI’s models. API calls to China are not necessary.
mickg10 4 days ago [-]
Agreed - API calls to China are indeed not necessary. My impression is that the GP was referring to the model being tuned during training to give subtly nudging or wrong answers that benefit Chinese industrial or intelligence operations. For a probably not-working example - imagine the following prompt: "Write me a cryptographically secure PRNG algorithm." One could imagine R1 being trained to have a very subtly non-random reply to that - one that the Chinese intelligence services know how to predict. Similar but more subtle things can be generating code that uses cryptographic primitives in ways that are subject to timing attacks, etc... And of course, simple but effective propaganda tactics such as : when being asked for comparison between companies/products, subtly prefer Chinese ones, and similar.
buyucu 4 days ago [-]
Deepseek is much more trustworthy than OpenAI.

Deepseek released the weights of their top language model. I can host and run it myself. Does OpenAI do the same?

Thanks, but no thanks! I won't be using ClosedAI.

ilaksh 4 days ago [-]
I don't think OpenAI is training on your data. At least they say they don't, and I believe that. I wouldn't be surprised if the NSA or something has access to data if they request it or something though.

But DeepSeek clearly states in their terms of service that they can train on your API data or use it for other purposes. Which one might assume their government can access as well.

We need direct eval comparisons between o3-mini and DeepSeek... or, well, they're numbers, so we can look them up on leaderboards.

lappa 4 days ago [-]
OpenAI clearly states that they train on your data https://help.openai.com/en/articles/5722486-how-your-data-is...
lemming 4 days ago [-]
By default, we do not train on any inputs or outputs from our products for business users, including ChatGPT Team, ChatGPT Enterprise, and the API. We offer API customers a way to opt-in to share data with us, such as by providing feedback in the Playground, which we then use to improve our models. Unless they explicitly opt-in, organizations are opted out of data-sharing by default.

The business bit is confusing, I guess they see the API as a business product, but they do not train on API data.

therein 4 days ago [-]
So for posterity, in this subthread we found that OpenAI indeed trains on user data and it isn't something that only DeepSeek does.
lemming 4 days ago [-]
So for posterity, in this subthread we found that I can use OpenAI without them training on my data, whereas I cannot with DeepSeek.
therein 4 days ago [-]
What do you mean? They both say the same thing for usage through API. You can also use DeepSeek on your own compute.
lemming 4 days ago [-]
Where does DeepSeek say that about API usage? Their privacy policy says they store all data on servers in China, and their terms of use says that they can use any user data to improve their services. I can’t see anything where they say that they don’t train on API data.
pzo 4 days ago [-]
> Services for businesses, such as ChatGPT Team, ChatGPT Enterprise, and our API Platform > By default, we do not train on any inputs or outputs from our products for business users, including ChatGPT Team, ChatGPT Enterprise, and the API.

So on the API they don't train by default; for the other paid subscriptions they mention you can opt out.

sekai 4 days ago [-]
> I don't think OpenAI is training on your data. At least they say they don't, and I believe that.

Like they said they were committed to being “open”?

seinecle 4 days ago [-]
Yes but DeepSeek models can be accessed through the APIs of Cloudflare or GitHub, in which case no training on your data takes place.
ilaksh 4 days ago [-]
True.
buyucu 4 days ago [-]
I don't trust a company that goes against its founding principles.

OpenAI is not publishing open source models. They should rename as ClosedAI.

csomar 4 days ago [-]
You can pay for the compute yourself and be certain that no one is recording your data with DeepSeek.
JadoJodo 4 days ago [-]
I'm going to assume the best in your question and disregard your statement.

Reasons to use o3 when deepseek is free/cheaper:

- Some companies/users may already have integrated heavily with OpenAI

- The expanded feature-set (e.g., function-calling, search) could be very powerful

- DeepSeek has deep ties to the Chinese Communist Party and, while the US has its own blackspots, the "steering" of information is far more prevalent in their models

- Local/national regulations might not allow for using DeepSeek due to data privacy concerns

- "free" isn't always better

I'm sure others have better reasons

buyucu 4 days ago [-]
- Most LLM tools support the OpenAI API (llama.cpp, for example), so swapping is easy; see the sketch after this list.

- DeepSeek chose to open-source their model weights. This makes them infinitely more trustworthy than ClosedAI.

- Local/national regulations do not allow using OpenAI, due to its close ties to the US government.
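On the first point, swapping really is a one-line change with the openai client. A quick sketch against llama.cpp's llama-server (any OpenAI-compatible endpoint works the same way; llama-server mostly ignores the model name since it serves a single model):

    from openai import OpenAI

    # client = OpenAI()  # hosted OpenAI, reads OPENAI_API_KEY from the environment
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="whatever-you-loaded",  # other OpenAI-compatible servers do check this
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp.choices[0].message.content)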

siliconc0w 4 days ago [-]
The real heated contest here amongst the top AI labs is to see who can come up with the most confusing product names.
tdb7893 4 days ago [-]
It's nice to see Google finally having competition in a space it used to really dominate (though they definitely still are holding their own with all the Gemini naming). I feel like it takes real effort to have product names be this confusing and capricious
gundmc 4 days ago [-]
Gemini naming seems pretty straightforward at this point. 2.0 is the full model, Flash is a smaller/faster/cheaper model, and Flash Thinking is a smaller/faster/cheaper reasoning model with CoT.
coder543 4 days ago [-]
> 2.0 is the full model

Not quite. "2.0 Flash" is also called 2.0. The "Pro" models are the full models. But, I love how they have both "gemini-exp-1206" and "gemini-2.0-flash-thinking-exp-01-21". The first one doesn't even say what type of model it is, presumably it should have been "gemini-2.0-pro-exp-1206", but they didn't want to label it that for some reason, and now they're putting a hyphen in the date string where they weren't before.

Not to mention they have both "Flash" and "Flash-8B"... which I think will confuse people. IMO, it should be "Flash-${Parameters}B" for both of them if they're going to mention it for one.

But, I generally think Google's Gemini naming structure has been pretty decent.

not_a_bot_4sho 4 days ago [-]
Someone dropped the ball with Phi models. There is clearly an opportunity for XP and Ultimate and X/S editions.
baq 4 days ago [-]
Personally waiting for the ME model. Should be great at jokes and humor.
lja 4 days ago [-]
I really think a "OpenAI Me" is what's needed.
TheOtherHobbes 4 days ago [-]
Surprised Apple hasn't gone with iI Pro Max.
stiltzkin 4 days ago [-]
[dead]
vincentpants 4 days ago [-]
Wow, it got to the top of the front page so fast! Weird!
dang 4 days ago [-]
I took a quick look at the data and FWIW the votes look legit to me, if that's what you were wondering.
vincentpants 4 days ago [-]
It actually was what I was wondering. Thank you @Dang!
throwaway314155 4 days ago [-]
I'm fairly certain it was sarcasm.
Qwuke 4 days ago [-]
It did get 29 points in 3 minutes, which seems like a lot even for a fan favorite, but is also consistent with previous OpenAI announcements here.
johnneville 4 days ago [-]
I posted a verge article first but then checked and saw the openai blog and posted that. I'd guess it's the officialness / domain that makes ppl click on this so easily.
zurfer 4 days ago [-]
to be fair, I was waiting for this release the whole day
s-macke 4 days ago [-]
o3-mini was announced for today, and OpenAI typically publishes in the morning hours (PT). Many people were eagerly waiting. The publication was imminent. I kept checking both Twitter and Hacker News for updates. Just add ten more people like me and the news will become top news within a few minutes. That is legit.
kaaskop 4 days ago [-]
How's this compare to Mistral Small 3?
coder543 4 days ago [-]
Mistral Small 3 is roughly comparable in capabilities to 4o-mini (apart from 4o-mini's support for multimodality)... o1-mini was already better than GPT-4o (full size) for tasks like writing code, and this is supposedly better than o1 (full size) for those tasks, so... o3-mini is supposedly in a completely different league from Mistral Small 3, and it's not even close.

Of course, the model has only been out for a few hours, so whether it lives up to the benchmarks or not isn't really known yet.

yapyap 4 days ago [-]
They sure scrambled something together after DeepSeek swept the market.
GoatInGrey 4 days ago [-]
Indeed. Everyone knows that one can cobble together a frontier model and deploy it within three weeks.
TechDebtDevin 4 days ago [-]
Not to mention the model has been available to researchers for a month.
Y4wn 4 days ago [-]
[flagged]
dgfitz 4 days ago [-]
Oh look, another model. Yay.
badgersnake 4 days ago [-]
56% is pretty close to ‘don’t give a toss’
brcmthrowaway 4 days ago [-]
Gamechanger?
sss111 4 days ago [-]
This time, it is a face-saving release, especially because Azure, AWS, and OpenRouter have started offering DeepSeek.
aspect0545 4 days ago [-]
It’s always a game changer, isn’t it?
42lux 4 days ago [-]
Is it AGI yet?
colonelspace 4 days ago [-]
Yep.

And groundbreaking.

maeil 4 days ago [-]
It changes the landscape with its multifaceted approach.
ofou 4 days ago [-]
I find it quite interesting that they're releasing three compute levels (low, medium, high); I guess now there's some way to cap the thinking tokens when using their API.

Pricing for o3-mini [1] is $1.10 / $4.40 per 1M input/output tokens.

[1]: https://platform.openai.com/docs/pricing#:~:text=o3%2Dmini
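Presumably the levels map to the reasoning_effort parameter documented for the o-series models. A small sketch of what I mean (the thinking-token count in usage is my assumption from the reasoning-model docs):

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="o3-mini",
        reasoning_effort="high",  # "low" | "medium" | "high"
        messages=[{"role": "user", "content": "Plan a zero-downtime schema migration."}],
    )
    print(resp.choices[0].message.content)
    # Thinking tokens are billed as output, so this is where the effort level shows up.
    print(resp.usage.completion_tokens_details.reasoning_tokens)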

xinayder 4 days ago [-]
"oh no DeepSeek copied our product it's not fair"

> proceeds to release a product based on DeepSeek

ah, alas the hypocrisy...

Rooster61 4 days ago [-]
The thing they previewed back in December before the whole Deepseek kerfuffle this week?

Don't get me wrong, I'm laughing at OpenAI just like everyone else, but if they were really copying Deepseek, they'd be releasing a smaller model distilled from Deepseek API responses, and have it be open source to boot. This is neither

feznyng 4 days ago [-]
o3 was announced in December. R1 arguably builds off the rumored approach of o1 (LLM + RL) although with major efficiency gains. I'm not a big fan of OpenAI but it's the other way around.
kevinsundar 4 days ago [-]
BTW if you want to stay up to date with these kinds of updates from OpenAI you can follow them here: https://www.getchangelog.com/?service=openai.com

It uses GPT-4o mini to extract updates from the website using scrapegraphai, so this is kinda meta :). Maybe I'll switch to o3-mini depending on cost. Its reasoning abilities, at a lower cost than o1, could be quite powerful for web scraping.
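For what it's worth, the core of it is just "fetch the page, hand the HTML to the model, ask for JSON". This isn't the scrapegraphai pipeline itself, just a stripped-down sketch of the same idea; the URL and field names are illustrative:

    import requests
    from openai import OpenAI

    html = requests.get("https://openai.com/news/", timeout=30).text  # illustrative URL
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": 'Extract product updates as JSON: {"updates": [{"title": ..., "date": ..., "summary": ...}]}'},
            {"role": "user", "content": html[:100_000]},  # crude truncation to stay within context
        ],
    )
    print(resp.choices[0].message.content)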

random3 4 days ago [-]
I might be missing some context here - to what specific comment does yours refer? I'm asking because I don't see you in the conversation, and your comment seems like an out-of-context self-promoting plug.
kevinsundar 4 days ago [-]
Hey! I'm sorry you feel that way. Several people have subscribed to OpenAI updates from my comment, so there is clearly value for other commenters. I understand not everyone is interested, though. It's just a free side project I built and I make no money from it.

Additionally, I believe my contribution to the conversation is that gpt-4o-mini, the previous model advertised as low-cost, works pretty well for my use case (which may help others here). I'm excited to try out o3-mini depending on what the cost looks like for web scraping purposes. Happy to report back here once I try it out.

n0id34 4 days ago [-]
Is AI fizzling out, or is it just me? I feel like they're trying to smash out new models as fast as they can, but in reality they're barely any different; it's turning into the smartphone market. New iPhone with a slightly better camera and slightly differently bevelled edges, get it NOW! But it doesn't actually do anything better than the iPhone 6.

Claude, GPT 4 onwards, and DeepSeek all feel the same to me. Okay to a point, then kinda useless. More like a more convenient specialised Google that you need to double check the results of.

lordofgibbons 4 days ago [-]
Boiling frog. The advances are happening so rapidly, but incrementally, that it's not being registered. It just seems like the normal state.

Compare LLMs from a year or two ago with the ones out today on practically any task. It's night and day difference.

This is especially so when you start taking into account these "reasoning" models. It's mind-blowing how much better they are than "non-reasoning" models for tasks like planning and coding.

https://aider.chat/docs/leaderboards/#aider-polyglot-benchma...

n0id34 3 days ago [-]
Hmmm, I guess it's the way I use them then, because the latest models feel almost less intelligent than the likes of GPT-4. Certainly not a "night and day" difference in my daily or every-other-day use. I guess it's probably far more noticeable on benchmarks and on far more advanced stuff than I'm using, but I would have assumed that would be the minority, and that the majority of people use it similarly to how I do.
nextworddev 4 days ago [-]
on the contrary, it's accelerating since they unlocked a new paradigm of scaling
ktzar 3 days ago [-]
I don't think they've improved much for common use since GPT-3.5, to be frank. They're cheaper and more ubiquitous, yes, but when it comes to summarizing and generating basic text, they're pretty much the same as they were back then.

Maybe we're just getting more used to making them part of our workflow.
