To the authors of the site, please know that your current "Cookiebot by Usercentrics" is old and pretty much illegal. You shouldn't need to click 5 times to "Reject all" if accepting all is one click. Newer versions have a "Deny" button.
esskay 1 hour ago [-]
Weirdly this site also requested bluetooth access on my mac.
azalemeth 52 minutes ago [-]
That would be browser fingerprinting in action. I often get requests to use Widevine in DDG's browser on Android (which notifies you about them), for what I suspect are similar reasons.
esskay 49 minutes ago [-]
Interesting. I'm on Brave and have never had a site request Bluetooth access before; in fact, I'd never even granted Brave Bluetooth access, which is why it popped up as a system notification this time around.
nic547 44 minutes ago [-]
Doesn't Brave disable WebBluetooth by default via a flag?
sharken 15 minutes ago [-]
Brave does indeed block WebBluetooth by default, but it can be turned on via flags.
It's by no means a new feature, but the privacy concerns outlined in this post are still valid 10 years later: https://blog.lukaszolejnik.com/w3c-web-bluetooth-api-privacy...
Just set up your browser to never even load that BS.
anothernewdude 17 minutes ago [-]
Or you could just reject all third party cookies, see no sites break and enjoy your privacy.
LouisSayers 40 minutes ago [-]
I wish we could pin down not only the model but also the way the UI works.
Last week Claude seemed to have a shift in the way it works. The way it summarises and outputs its results is different, and for me it's gotten worse: slower, worse results, and more effort narrowing down what actually changed, etc.
Long story short, I wish I was able to checkpoint the entire system and just revert to how it was previously. I feel like it had gotten to a stage where I felt pretty satisfied, and whatever got changed ... I just want it reverted!
KronisLV 3 hours ago [-]
For development use cases, I switched to Sonnet 4.5 and haven't looked back. I mean, sure, sometimes I also use GPT-5 (and mini) and Gemini 2.5 Pro (and Flash), and also Cerebras Code just switched to providing GLM 4.6 instead of the previous Qwen3 Coder so those as well, but in general the frontier models are pretty good for development and I wouldn't have much reason to use something like Sonnet 4 or 3.7 or whatever.
JanSt 2 hours ago [-]
I have canceled my Claude Max subscription because Sonnet 4.5 is just too unreliable. For the rest of the month I'm using Opus 4.1 which is much better but seems to have much lower usage limits than before Sonnet 4.5 was released. When I hit 4.1 Opus limits I'm using Codex. I will probably go through with the Codex pro subscription.
CuriouslyC 1 hour ago [-]
Definitely do it. You get a lot of deep research, access to GPT-5 Pro and Sora, and the Codex limits are MUCH higher.
Shank 40 minutes ago [-]
I think this is one of the many indicators that even though these models get “version upgrades” it’s closer to switching to a different brain that may or may not understand or process things the way you like. Without a clear jump in performance, people test new models and move back to ones they know work if the new ones aren’t better or are actually worse.
breezk0 16 minutes ago [-]
Interesting to use a term like brain in the context of LLMs.
LoganDark 12 minutes ago [-]
Neural networks are quite brain-like.
teaearlgraycold 8 minutes ago [-]
I describe all of the LLM "upgrades" as more akin to moving the dirt around than actually cleaning.
blitzar 2 hours ago [-]
GPT-5 usage is 20% higher on days that start with "S"
Nevertheless, 7 datapoints does not a trend make (and the data presented certainly doesn't explain why). The daily variation is more than I would have expected, but it could also come down to which day of the week the pizza party or the weekly scrum meeting falls on at a few of their customers' workplaces.
bambax 38 minutes ago [-]
> Each model appears to emphasize a different balance between reasoning and execution. Rather than seeking one “best” system, developers are assembling model alloys—ensembles that select the cognitive style best suited to a task.
This (as well as the table above it) matches my experience. Sonnet 4.0 answers SO-type questions very fast and mostly accurately (if not on a niche topic), Sonnet 4.5 is a little bit more clever but can err on the side of complexity for complexity's sake, and can have a hard time getting out of a hole it dug for itself.
ChatGPT 5 is excellent at finding sources on the web; Gemini simply makes stuff up and continues to do so even when told to verify; ChatGPT provides links that work and are generally relevant.
tifa2up 2 hours ago [-]
We tried GPT-5 for a RAG use case, and found that it performs worse than 4.1. We reverted and didn't look back.
sigmoid10 2 hours ago [-]
4.1 is such an amazing model in so many ways. It's still my nr. 1 choice for many automation tasks. Even the mini version works quite well and it has the same massive context window (nearly 8x GPT-5). Definitely the best non-reasoning model out there for real world tasks.
HugoDias 1 hour ago [-]
Can you elaborate on that? In which part of the RAG pipeline did GPT-4.1 perform better? I would expect GPT-5 to perform better on longer context tasks, especially when it comes to understanding the pre-filtered results and reasoning about them
tifa2up 1 hour ago [-]
For large contexts (up to 100K tokens in some cases), we found that GPT-5:
a) has worse instruction following; it doesn't follow the system prompt
b) produces very long answers, which resulted in a bad UX
c) has a 125K context window, so extreme cases resulted in an error
Xmd5a 9 minutes ago [-]
Ah, 100K out of a 125K window is what poses problems, I believe. GPT-5's scores should go up if you process contexts that are 10 times shorter.
Shank 44 minutes ago [-]
ChatGPT when using 5 or 5-Thinking doesn’t even follow my “custom instructions” on the web version. It’s a serious downgrade compared to the prior generation of models.
internet_points 44 minutes ago [-]
Interesting. https://www.robert-glaser.de/prompts-as-programs-in-gpt-5/ claims GPT-5 has amazing!1!! instruction following. Is your use-case very different, or is this yet another case of "developer A got lucky, developer B tested more things"?
teekert 2 hours ago [-]
So… You did look back then didn’t look forward anymore… sorry couldn’t resist.
s1mplicissimus 2 hours ago [-]
Seems to completely ignore usage of local/free models as well as anything but Sonnet/ChatGPT. So my confidence in the good faith of the author is... heavily restricted.
nicce 2 hours ago [-]
Most people can't afford the GPUs for local models if you want to get close to cloud capabilities.
s1mplicissimus 2 hours ago [-]
Most people I know can't afford to leak business insider information to 3rd party SaaS providers, so it's unfortunately not really an option.
ruszki 15 minutes ago [-]
But… they do, all the time. Almost everybody uses some mix of Office, Slack, Notion, random email providers, random "security" solutions, etc. Not doing so is the exception. The only thing preventing info from leaking is the ToS, and there are options for that even with LLMs. Nothing has changed in that regard.
rhdunn 2 hours ago [-]
A 4090 has 24GB of VRAM allowing you to run a 22B model entirely in memory at FP8 and 24B models at Q6_K (~19GB).
A 5090 has 32GB of VRAM allowing you to run a 32B model in memory at Q6_K.
You can run larger models by splitting layers between what runs in VRAM and what stays in system RAM. That is slower, but still viable.
This means that you can run the Qwen3-Coder-30B-A3B model locally on a 4090 or 5090. That model is a Mixture of Experts model with only 3B active parameters, so the per-token compute is light enough that even a 3090 can run it.
The Qwen3-Coder-480B-A35B model could also be run on a 4090 or 5090 by splitting the active 35B parameters across VRAM and RAM.
Yes, it will be slower than running it in the cloud. But you can get a long way with a high-end gaming rig.
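For anyone who wants to try the VRAM/RAM split described above, here is a minimal sketch using llama-cpp-python; the GGUF filename, layer count, and context size are placeholders you'd tune to your own download and card, not values from this thread.

    # Sketch: offload part of a GGUF model to the GPU, keep the remaining layers in RAM.
    from llama_cpp import Llama

    llm = Llama(
        model_path="Qwen3-Coder-30B-A3B-Q6_K.gguf",  # hypothetical local GGUF file
        n_gpu_layers=30,  # layers kept in VRAM; -1 offloads everything that fits
        n_ctx=8192,       # context window; larger values cost more memory
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])

Raising n_gpu_layers until you run out of VRAM is the usual way to find the sweet spot between speed and memory.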
iberator 1 hour ago [-]
That's out of reach for 90% of developers worldwide.
brazukadev 10 minutes ago [-]
Today. But what about in 5 years? Would you bet we will be paying hundreds of billions to OpenAI yearly, or buying consumer GPUs? I know which I'll be doing.
Foobar8568 2 hours ago [-]
Yes, but they are much less performant than Claude Code or Codex.
I really cried with the 20-25GB models (30B Qwen, Devstral, etc.). They really don't hold a candle to them; I didn't think the gap was this large, or maybe Claude Code and GPT perform much better than I imagined.
ashirviskas 1 hour ago [-]
How much context do you get with 2GB of leftover VRAM on an Nvidia GPU?
jen729w 2 hours ago [-]
Honestly though how many people reading this do you think have that setup vs. 85% of us being on a MBx?
> The Qwen3-Coder-480B-A35B model could also be run on a 4090 or 5090 by splitting the active 35B parameters across VRAM and RAM.
Reminds me of running Doom when I had to hack config.sys to forage 640KB of memory.
Less than 0.1% of the people reading this are doing that. Me, I gave $20 to some cloud service and I can do whatever the hell I want from this M1 MBA in a hotel room in Japan.
EagnaIonat 1 hour ago [-]
The more recent LLMs work fine on an M1 mac. Can't speak for Windows/Linux.
There was even a recent release of Granite4 that runs on a Raspberry Pi.
https://github.com/Jewelzufo/granitepi-4-nano
For my local work I use Ollama. (M4 Max 128GB)
- gpt-oss. 20b or 120b depending on complexity of use cases.
- granite4 for speed and lower complexity (around the same as gpt20b).
Tepix 2 hours ago [-]
Isn't the point that you don't need SOTA capabilities all the time?
pistoriusp 2 hours ago [-]
Do you use a local/ free model?
s1mplicissimus 2 hours ago [-]
Yes, for the little it's good for, I'm currently using LM Studio with varying models.
busymom0 2 hours ago [-]
I am currently using a local model qwen3:8b running on a 2020 (2018 intel chip) Mac mini for classifying news headlines and it's working decently well for my task. Each headline takes about 2-3 seconds but is pretty accurate. Uses about 5.3 gigs of ram.
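A setup like that can be only a few lines of code. Here is a rough sketch, assuming the model is served through Ollama (the qwen3:8b tag suggests it) and using a made-up label set for illustration:

    # Rough sketch of local headline classification via the Ollama Python client.
    import ollama

    LABELS = ["politics", "business", "tech", "sports", "other"]

    def classify(headline: str) -> str:
        prompt = (
            f"Classify this news headline into one of {LABELS}.\n"
            f"Headline: {headline}\n"
            "Answer with the label only."
        )
        resp = ollama.chat(model="qwen3:8b", messages=[{"role": "user", "content": prompt}])
        return resp["message"]["content"].strip()

    print(classify("Central bank holds interest rates steady"))

On older CPU-only hardware, a couple of seconds per headline is roughly in line with what an 8B model typically manages.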
darkwater 2 hours ago [-]
Can you expand a bit on your software setup? I thought running local models was restricted to having expensive GPUs or latest Apple Silicon with unified memory. I have a Intel 11th gen home server which I would like to use to run some local model for tinkering if possible.
NumerousProcess 2 hours ago [-]
Augment doesn't support local models or anything else other than Claude/GPT
moffkalast 1 hour ago [-]
I think it's also true for many local models. People still use NeMo, QwQ, and Llama3 for use cases that fit them despite there being replacements that do better on "benchmarks". Not to mention relics like BERT that are still tuned for classification even today. ML models always have weird behaviours and a successor is unlikely to be better in literally every way; once you have something that works well enough, it's hard to upgrade without facing different edge cases.
Inference for new releases is routinely bugged for at least a month or two as well, depending on how active the devs of a specific inference engine are and how much model creators collaborate. Personally, I hate how data from GPT's few-week (and arguably somewhat ongoing) sycophancy rampage has leaked into datasets that are used for training local models, making a lot of new LLM releases insufferable to use.
xiphias2 2 hours ago [-]
Even for non-developer use cases, o3 is a much better model for me than GPT-5 on any setting.
30 seconds to 1 minute is just the time I'm patient enough to wait, as that's the time I spend writing the question.
Faster models just make too many mistakes / don't understand the question.
nusl 41 minutes ago [-]
I use both Codex and Claude, mostly cuz it's cheaper to jump between them than to buy a Max sub for my use case. My subjective experience is that Codex is better with larger or weird, spaghetti-ish codebases, or codebases with more abstract concepts, while Claude is good for more direct uses. I haven't spent significant time fine-tuning the tools for my codebases.
Once, I set up a proxy that allowed Claude and Codex to "pair program" and collaborate, and it was cool to watch them talk to each other, delegate tasks, and handle different bits and pieces until the task was done.
sbinnee 1 hour ago [-]
I don't get the point of this post. Personally, I think that the thinking process is essential for accurate tool usage. Whenever I interact with Claude family models, either on a web chat or via a coding agent CLI, I believe that this thinking process is what makes Claude more accurate in using tools.
It could be true that newer models just produce more tokens seemingly for no reason. But with the increasing number of tool definitions, in the long run, I think it will pay off.
Just a few days ago, I read "Interleaved Thinking Unlocks Reliable MiniMax-M2 Agentic Capability"[1]. I think they have a valid point that this thinking process has significance as we are moving towards agents.
[1] https://www.minimax.io/news/why-is-interleaved-thinking-impo...
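A minimal sketch of what enabling that thinking next to tool definitions looks like with the Anthropic Python SDK; the model id, token budgets, and the example tool are placeholders rather than anything from the post:

    # Sketch: extended thinking enabled alongside a tool definition (Anthropic SDK).
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=4096,            # must exceed the thinking budget
        thinking={"type": "enabled", "budget_tokens": 2048},
        tools=[{
            "name": "get_weather",  # hypothetical tool
            "description": "Get the current weather for a city.",
            "input_schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }],
        messages=[{"role": "user", "content": "Should I bring an umbrella in Berlin today?"}],
    )

    # The response interleaves thinking blocks with tool_use blocks.
    for block in response.content:
        print(block.type)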
Just one week of data right after the release, when it is already one month later?
This data is basically meaningless, show us the latest stats.
mrasong 48 minutes ago [-]
I usually switch models depending on the situation. For simpler stuff, I lean toward 4o since it's faster to get answers.
But when things get more complex, I prefer GPT-5, talking with it often gives me fresh ideas and new perspectives.
ACCount37 41 minutes ago [-]
You might be the first technical user spotted out in the wild who actually prefers 4o for anything.
Manfred 3 hours ago [-]
It could be an interesting data point, but without correcting for absolute usage figures and their customers it's kind of hard to make general statements.
BluSyn 17 minutes ago [-]
grok-code-fast-1 is my current pick, found accuracy and speed better than Sonnet 4.5 for day-to-day usage.
frabia 1 hour ago [-]
Tangential to this: what are the most reliable benchmarks for LLMs in coding these days?
falcor84 54 minutes ago [-]
I found Terminal-Bench [0] to be the most relevant for me, even for tasks that go far outside the terminal. It's been very interesting to see tools climb up there, and it matches my own experimentation that they generally get the most out of Sonnet (even those that use a mix of models, like Warp, typically default to Sonnet).
[0] https://www.tbench.ai/?ch=1
Isn't this obvious? When you have a task you think is hard, you give it to a cleverer model. When a task is straightforward, you give it to an older one.
PeterStuer 2 hours ago [-]
Not really. Most developers would prefer one model that does everything best. That is the easiest: set it and forget it, no manual decision required.
What is unclear from the presentation is whether they do this or not. Do teams that use Sonnet 4.5 just always use it, and teams on Sonnet 4.0 likewise? Or do individuals decide which model to use on a per-task basis?
Personally I tend to default to just one, and only go to an alternative if it gets stuck or doesn't get me what I want.
hn_throw2025 2 hours ago [-]
Not sure why you were downvoted... I think you are correct.
As evidenced by furious posters on r/cursor, who send every prompt to super-opus-thinking-max+++ and are astonished when they have blown their monthly request allowance in about a day.
If I need another pair of (artificial) eyes on a difficult debugging problem, I’ll occasionally use a premium model sparingly. For chore tasks or UI layout tweaks, I’ll use something more economical (like grok-4-fast or claude-4.5-haiku - not old models but much cheaper).
l5870uoo9y 2 hours ago [-]
To those who complain about GPT-5 being slow: I recently migrated https://app.sqlai.ai and found that setting service_tier = "priority" makes it reason twice as fast.
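For reference, a hedged sketch of what that looks like with the OpenAI Python SDK; the model name and prompt are placeholders, and availability and pricing of the priority tier depend on your account:

    # Sketch: request priority processing for a chat completion (OpenAI SDK).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    resp = client.chat.completions.create(
        model="gpt-5",
        service_tier="priority",  # the setting mentioned above; typically billed at a higher rate
        messages=[{"role": "user", "content": "Generate a SQL query that lists overdue invoices."}],
    )
    print(resp.choices[0].message.content)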
iLoveOncall 39 minutes ago [-]
My team still uses Sonnet 3.5 for pretty much everything we do because it's more than enough and it's much, much faster than newer models. The only reason we're switching is because the models are getting deprecated...
rcarmo 2 hours ago [-]
I think this is somewhat disingenuous since not everyone uses the latest thing, and people tend to stick to “what works” for them.
Models are picky enough about prompting styles that changing to a new model every week/month becomes an added chunk of cognitive overload, testing and experimentation, plus even in developer tooling there have been minor grating changes in API invocations and use of parameters like temperature (I have a fairly low-level wrapper for OpenAI, and I had to tweak the JSON handling for GPT-5).
Also, there are just too many variations in API endpoints, providers, etc. We don’t really have a uniform standard. Since I don’t use “just” OpenAI, every single tool I try out requires me to jump through a bunch of hoops to grab a new API key, specify an endpoint, etc.—and it just gets worse if you use a non-mainstream AI endpoint.
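To illustrate the kind of grating parameter differences meant here, a tiny wrapper along these lines can help; the model-name check and parameter choices below are assumptions for illustration, not an official compatibility matrix:

    # Sketch: a thin wrapper that adapts request parameters per model family.
    from openai import OpenAI

    client = OpenAI()

    def complete(model: str, prompt: str, max_out: int = 512, temperature: float = 0.2) -> str:
        kwargs = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }
        if model.startswith(("gpt-5", "o1", "o3")):
            # Assumed: reasoning models ignore custom temperature and use a different token-limit parameter.
            kwargs["max_completion_tokens"] = max_out
        else:
            kwargs["temperature"] = temperature
            kwargs["max_tokens"] = max_out
        resp = client.chat.completions.create(**kwargs)
        return resp.choices[0].message.content

    print(complete("gpt-5", "Summarize what nginx try_files does in one sentence."))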
rafaelmn 2 hours ago [-]
> I think this is somewhat disingenuous since not everyone uses the latest thing, and people tend to stick to “what works” for them.
They say that the number of users on Claude 4.5 spiked and then a significant number of users reverted to 4.0, with that trend going up, and they are talking about their own usage metrics. So I don't get how your comment is relevant to the article?
dotancohen 2 hours ago [-]
His comment is relevant to the headline. You must be new here.
gptfiveslow 2 hours ago [-]
GPT5 is HELLISHLY slow. That's all there is to it.
It loves doing a whole bunch of reasoning steps and proclaiming what a very good job it did clearing up its own todo steps and all that mumbo jumbo, but at the end of the day, I only asked it for a small piece of information about nginx try_files that even GPT-3 could answer instantly.
Maybe before you make reasoning models that go on funny little sidequests where they multiply numbers by 0 a couple of times, make them good at identifying the length of a task. Until then, I'll ask little bro and move up only when necessity arises. And if it ends up gathering dust, well... yeah.
rho4 43 minutes ago [-]
This. Speed determines whether I (like to) use a piece of software.
Imagine waiting for a minute until Google spits out the first 10 results.
My prediction: All AI models of the future will give an immediate result, with more and more innovation in mechanisms and UX to drill down further on request.
Edit: After reading my reply I realize that this is also true for interactions with other people. I like interacting with people who give me a 1 sentence response to my question, and only start elaborating and going on tangents and down rabbit holes upon request.
philipwhiuk 8 minutes ago [-]
> All AI models of the future will give an immediate result, with more and more innovation in mechanisms and UX to drill down further on request.
I doubt it. In fact I would predict the speed/detail trade-off continues to diverge.
confirmmesenpai 36 minutes ago [-]
> Imagine waiting for a minute until Google spits out the first 10 results.
what if the instantaneous responses make you waste 10 min realizing they were not what you searched for?
rho4 7 minutes ago [-]
I understand your point, but I still prefer instantaneous responses.
Only when the immediate answers become completely useless will I want to look into slower alternatives.
But first "show me what you've got so far", and let me decide whether it's good enough or not.
EagnaIonat 1 hour ago [-]
> It loves doing a whole bunch of reasoning steps
If you are talking about local models, you can switch that off. The reasoning is a common technique now to improve the accuracy of the output where the question is more complex.
szundi 2 hours ago [-]
[dead]
Tepix 2 hours ago [-]
The article(§) talks about going from Sonnet 4.5 back to Sonnet 4.0.
(§) You know that it's a hyperlink, do you? /s