LLMs exist on a logarithmic performance/cost frontier. It's not really clear whether Opus 4.5+ represents a level shift on this frontier or just inhabits a place on that curve which delivers higher performance, but at rapidly diminishing returns to inference cost.
To me, it is hard to reject this hypothesis today. The fact that Anthropic is rapidly trying to increase price may betray the fact that their recent lead is at the cost of dramatically higher operating costs. Their gross margins in this past quarter will be an important data point on this.
I think the tendency for graphs of model assessment to display the log of cost/tokens on the x axis (e.g. Artificial Analysis' site) has obscured this dynamic.
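To make the log-axis point concrete, here is a minimal sketch. It assumes a purely hypothetical frontier of the form score = a + b * log10(cost); the constants `a` and `b` are made up for illustration and are not fitted to any real model or benchmark. On such a curve every 10x in cost buys the same number of points, which looks like a straight line on a log x axis, while the gain per extra dollar keeps shrinking.

```python
import math

def score(cost, a=50.0, b=10.0):
    """Hypothetical frontier: score grows with log10 of cost (a, b are made up)."""
    return a + b * math.log10(cost)

def marginal_gain_per_dollar(cost, b=10.0):
    """Derivative of a + b*log10(cost) with respect to cost."""
    return b / (cost * math.log(10))

costs = [1, 10, 100, 1000]
for c in costs:
    # Equal steps on a log axis, but each step costs 10x more than the last.
    print(f"cost={c:>5}  score={score(c):5.1f}  "
          f"gain per extra dollar={marginal_gain_per_dollar(c):.4f}")
```

Plotted against log(cost), those four points sit on a straight line; plotted against cost itself, the same points flatten out quickly.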
_pdp_ 6 minutes ago [-]
IMHO there is a point where incremental model quality will hit diminishing returns.
It is like comparing an 8K display to a 16K display: at normal viewing distance the difference is imperceptible, but 16K comes at a significant premium.
The same applies to intelligence. Sure, some users might register a meaningful bump, but if 99% can't tell the difference in their day-to-day work, does it matter?
A 20-30% cost increase needs to deliver a proportional leap in perceivable value.
snek_case 39 seconds ago [-]
It probably depends what you're using the models for. If you use them for web search, summarizing web pages, I can imagine there's a plateau and we're probably already hitting it.
For coding though, there is kind of no limit to the complexity of software. The more invariants and potential interactions the model can be aware of, the better presumably. It can handle larger codebases. Probably past the point where humans could work on said codebases unassisted (which brings other potential problems).
atonse 19 minutes ago [-]
Just yesterday I was happy to have gotten my weekly limit reset [1]. And although I've been doing a lot of mockup work (so a lot of HTML getting written), I think the 1M token stuff is absolutely eating up tokens like CRAZY.
I'm seeing the opposite. With Opus 4.7 and xhigh, I'm seeing less session usage, it's moving faster, and my weekly usage is not moving that much on a Team Pro account.
aray07 12 minutes ago [-]
yeah similar for me - it uses a bunch more tokens and I haven’t been able to tell the ROI in terms of better instruction following
it seems to hallucinate a bit more (anecdotal)
uberman 55 minutes ago [-]
On actual code, I see what you see: a 30% increase in tokens, which is in line with what they claim as well. I personally don't tend to feed technical documentation or random prose into LLMs.
Given that Opus 4.6 and even Sonnet 4.6 are still valid options, for me the question is not "Does 4.7 cost more than claimed?" but "What capabilities does 4.7 give me that 4.6 did not?"
Yesterday 4.6 was a great option and it is too soon for me to tell if 4.7 is a meaningful lift. If it is, then I can evaluate if the increased cost is justified.
pier25 41 minutes ago [-]
haven't people been complaining lately about 4.6 getting worse?
solenoid0937 34 minutes ago [-]
People complain about a lot of things. Claude has been fine:
I will be the first to acknowledge that humans are a bad judge of performance and that some of the allegations are likely just hallucinations...
But... Are you really going to completely rely on benchmarks that have time and time again been shown to be gamed as the complete story?
My take: It is pretty clear that the capacity crunch is real and the changes they made to effort are in part to reduce that. It likely changed the experience for users.
Majromax 20 minutes ago [-]
While that's a nice effort, the inter-run variability is too high to diagnose anything short of catastrophic model degradation. The typical 95% confidence interval runs from 35% to 65% pass rates, a full factor of two performance difference.
Moreover, on the companion codex graphs (https://marginlab.ai/trackers/codex-historical-performance/), you can see a few different GPT model releases marked yet none correspond to a visual break in the series. Either GPT 5.4-xhigh is no more powerful than GPT 5.2, or the benchmarking apparatus is not sensitive enough to detect such changes.
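A rough back-of-the-envelope check of the interval width mentioned above. This is a sketch using the normal approximation to a binomial proportion and assuming independent pass/fail tasks; it is not the tracker's actual methodology, which is not described here. Under those assumptions, a 95% interval of 35% to 65% around a 50% pass rate corresponds to only a few dozen tasks per run, and resolving a 5-point change would take several hundred.

```python
import math

Z_95 = 1.96  # two-sided 95% quantile of the standard normal

def half_width(p, n):
    """Half-width of the normal-approximation 95% CI for a pass rate p over n tasks."""
    return Z_95 * math.sqrt(p * (1 - p) / n)

def samples_needed(p, target_half_width):
    """Smallest n whose 95% CI half-width is at most target_half_width."""
    return math.ceil(Z_95**2 * p * (1 - p) / target_half_width**2)

# Interval of 35%..65% around 50% means a half-width of 15 points.
n_implied = samples_needed(0.5, 0.15)
# Tasks needed to pin the pass rate down to +/- 5 points instead.
n_for_5pt = samples_needed(0.5, 0.05)

print(f"tasks implied by a +/-15 point interval: {n_implied}")
print(f"tasks needed for a +/-5 point interval:  {n_for_5pt}")
```

The required sample size grows with the inverse square of the half-width, which is why tightening the interval by 3x needs about 9x the tasks.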
cbg0 19 minutes ago [-]
That performance monitor is super easy to game if you cache responses to all the SWE bench questions.
ed_elliott_asc 35 minutes ago [-]
No we increased our plans
grim_io 34 minutes ago [-]
How long will they host 4.6? Maybe longer for enterprise, but if you have a consumer subscription, you won't have a choice for long, if at all anymore.
Jeremy1026 14 minutes ago [-]
I was trying to figure out earlier today how to get 4.6 to run in Claude Code, as part of the output it included "- Still fully supported — not scheduled for retirement until Feb 2027." Full caveat of, I don't know where it came up with this information, but as others have said, 4.5 is still available today and it is now 5, almost 6 months old.
nfredericks 27 minutes ago [-]
Opus 4.5 is still available
iknowstuff 27 minutes ago [-]
Interesting because I already felt like current models spit out too much garbage verbose code that a human would write in a far more terse, beautiful and grokable way
aray07 11 minutes ago [-]
yeah opus 4.7 feels a lot more verbose - i think they changed the system prompt and removed instructions to be terse in its responses
rafram 13 minutes ago [-]
Pretty funny that this article was clearly written by Claude.
jmward01 15 minutes ago [-]
Yeah. I just did a day with 4.7 and I won't be going back for a while. It is just too expensive. On top of the tokenization the thinking seems like it is eating a lot more too.
aray07 13 minutes ago [-]
yeah i am still not clear why there are 5 effort modes now on top of more expensive tokenization
markrogersjr 13 minutes ago [-]
4.7 one-shot rate is at least 20-30% higher for me
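One way to weigh a higher one-shot rate against higher token usage is cost per successful task. The sketch below assumes each attempt succeeds independently with probability `p`, so the expected number of attempts is 1/p. The baseline rate of 0.60, the 25% relative improvement, and the 30% token increase are illustrative numbers picked from the ranges mentioned in this thread, not measurements.

```python
def cost_per_success(cost_per_attempt, success_rate):
    """Expected cost to get one success when attempts are independent retries."""
    return cost_per_attempt / success_rate

# All numbers below are hypothetical, chosen only to illustrate the trade-off.
old_cost, old_rate = 1.00, 0.60          # baseline: unit cost, 60% one-shot rate
new_cost = old_cost * 1.30               # 30% more tokens per attempt
new_rate = old_rate * 1.25               # 25% relative lift in one-shot rate

ratio = cost_per_success(new_cost, new_rate) / cost_per_success(old_cost, old_rate)
print(f"cost per success, new vs old: {ratio:.2f}x")
```

With these particular numbers the two effects roughly cancel; a smaller lift in one-shot rate or a larger token increase would tip it the other way.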
dallen33 34 minutes ago [-]
I'm still using Sonnet 4.6 with no issues.
risyachka 26 minutes ago [-]
How does this solve the issue? 4.6 will be disabled after one or more releases like any other legacy model.
stefan_ 7 minutes ago [-]
I don't know anything about tokens. Anthropic says Pro has "more usage*", Max has 5x or 20x "more usage*" than Pro. The link to "usage limits" says "determines how many messages you can send". Clearly no one is getting billed for tokens.
bcjdjsndon 12 minutes ago [-]
Because those brainiacs added 20-30% more system prompt
CodingJeebus 10 minutes ago [-]
The fundamental problem with these frontier model companies is that they're incentivized to create models that burn through more tokens, full stop. It's a tale as old as capitalism: you wake up every day and choose to deliver more value to your customers or your shareholders, you cannot do both simultaneously forever.
People love to throw around "this is the dumbest AI will ever be", but the corollary to that is "this is the most aligned the incentives between model providers and customers will ever be" because we're all just burning VC money for now.
I'm already at 27% of my weekly limit in ONE DAY.
https://news.ycombinator.com/item?id=47799256
https://marginlab.ai/trackers/claude-code-historical-perform...
1. https://github.com/juliusbrussee/caveman
Much of the token usage is in reasoning, exploring, and code generation rather than outputs to the user.
Does making Claude sound like a caveman actually move the needle on costs? I am not sure anymore whether people are serious about this.
To me, caveman sounds bad and is not as easy to understand compared to normal English.