Claude Code's binary reveals silent A/B tests on core features (backnotprop.com)
rusakov-field 7 minutes ago [-]
On one hand I am frustrated with LLMs because they derail you by throwing grammatically correct bullshit and hallucinations at you; if you slip and entertain some of it, even momentarily, it can slow you down.

But on the other hand they are so useful with boilerplate, and at connecting you with verbiage quickly, that they might guide you to the correct path faster than conventional means. Like a clueless CEO type just spitballing terms they do not understand, but still nudging something in your thought process.

But you REALLY need to know your stuff to begin with for them to be of any use. Those who think they will take over are clueless.

EMM_386 2 minutes ago [-]
> But you REALLY need to know your stuff to begin with for them to be of any use. Those who think they will take over are clueless.

Or - there are enough people who know their stuff that the people who don't will be replaced and they will take over anyway.

krisbolton 10 minutes ago [-]
The framing of A/B testing as "silent experimentation on users" and the invocation of Meta are a little much. I don't believe A/B testing is inherently evil; you need to get the test design right, and that would be a better framing for the post imo. That being said, vastly reducing an LLM's effectiveness as part of an A/B test isn't acceptable, which appears to be the case here.
ramoz 8 minutes ago [-]
I apologize for doing this - and I agree. I will revise.
tomalbrc 3 minutes ago [-]
Would love to know why you would consider invoking Meta “a little much”. Sounds more than appropriate.
helsinkiandrew 2 minutes ago [-]
Presumably Anthropic has to make lots of choices on how much processing each stage of Claude Code uses - if they maxed everything out, they'd make more of a loss/less of a profit on each user.

Doing A/B tests on each part of the process to see where to draw the line would seem a better way of doing it than arbitrarily choosing a limit.

Havoc 19 minutes ago [-]
Moved from CC to opencode a couple of months ago because the vibes were not for me. Not bad per se, but a bit too locked in, and when I was looking at the raw prompts it was sending down the wire it was also quite, let's call it, "opinionated".

Plus things like not being able to control where the websearches go.

That said I have the luxury of being a hobbyist so I can accept 95% of cutting edge results for something more open. If it was my job I can see that going differently.

reconnecting 33 minutes ago [-]
A professional tool is something that provides reliable and replicable results. LLMs offer none of this, and A/B testing is just further proof.
dkersten 21 minutes ago [-]
Anthropic have done a lot of things that would give me pause about trusting them in a professional context. They are anything but transparent, for example about the quota limits. Their vibe-coded Claude Code CLI releases are a buggy mess, too.

A/B testing is fine in itself, you need to learn about improvements somehow, but this seems to be A/B testing of cost-saving optimisations rather than of improvements to the user experience. Less transparency is rarely good.

This isn’t what I want from a professional tool.

onion2k 18 minutes ago [-]
> A professional tool is something that provides reliable and replicable results, LLMs offer none of this, and A/B testing is just further proof.

The author's complaint doesn't really have anything to do with the LLM aspect of it though. They're complaining that the app silently changes what it's doing. In this case it's the injection of a prompt in a specific mode, but it could be anything really. Companies could use A/B tests on users to make Photoshop silently change the hue a user selects to be a little brighter, or Word could change the look of document titles, or a game could make enemies a bit stronger (fyi, this does actually happen - players get boosts on their first few rounds in online games to stop them being put off playing).

The complaint is about A/B tests with no visible warnings, not AI.

reconnecting 13 minutes ago [-]
There's a distinction worth making here. A/B testing the interface button placement, hue of a UI element, title styling — is one thing. But you wouldn't accept Photoshop silently changing your #000000 to #333333 in the actual file. That's your output, not the UI around it. That's what LLMs do. The randomness isn't in the wrapper, it's in the result you take away.
doc_ick 10 minutes ago [-]
It’s an assistant, answering your question and running some errands for you. If you give it blind permission to do a task, then you’re not worrying about what it does.
duskdozer 10 minutes ago [-]
Honestly I find it kind of surprising that anyone finds this surprising. This is standard practice for proprietary software. LLMs are very much not replicable anyway.
ordersofmag 12 minutes ago [-]
Any tool that auto-updates carries the implication that behavior will change over time. And one criterion for being a skilled professional is having expert understanding of one's tools. That includes understanding the strengths and weaknesses of those tools (including variability of output) and making appropriate choices as a result. If you don't feel you can produce professional code with LLMs then certainly you shouldn't use them. That doesn't mean others can't leverage LLMs as part of their process and produce professional results. Blindly accepting LLM output and vibe coding clearly doesn't consistently produce professional results. But that's different than saying professionals can't use LLMs in ways that are productive.
hrmtst93837 19 minutes ago [-]
Replicability is a spectrum, not a binary, and if you bake in enough eval harnessing plus prompt control you can get LLMs shockingly close to deterministic for a lot of workloads. If the main blocker for "professional" use were unpredictability, the entire finance sector would have shut down years ago from half the data models and APIs they limp along on daily.
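The harness described above can be sketched roughly like this: pin deterministic decoding settings and put a validation gate in front of the output. This is a minimal illustration, not anyone's real pipeline; `call_model` is a hypothetical stub standing in for an actual LLM API call.

```python
import json

def call_model(prompt: str, temperature: float = 0.0) -> str:
    # Hypothetical stand-in for a real LLM API call; in practice you
    # would pin temperature (and a seed, where the API supports one).
    return '{"invoice_id": "INV-001", "total": 42.0}'

def constrained_completion(prompt: str, max_retries: int = 3) -> dict:
    """Clamp an LLM toward replicable output: deterministic decoding
    settings plus a validation gate that rejects malformed results."""
    for _ in range(max_retries):
        raw = call_model(prompt, temperature=0.0)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: retry instead of passing it on
        if {"invoice_id", "total"} <= parsed.keys():
            return parsed
    raise ValueError("no valid completion within retry budget")
```

The validation gate is what moves the workload along the spectrum: even when the model itself is stochastic, the caller only ever sees outputs that passed the schema check.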
danielbln 26 minutes ago [-]
I don't get your point. Web tools have been doing A/B feature testing all the time, way before we had LLMs.
reconnecting 18 minutes ago [-]
This is very different from the A/B interface testing you're referring to; what LLMs enable is A/B testing the tool's own output — same input, different result.

Your compiler doesn't do that. Your keyboard doesn't do that. The randomness is inside the tool itself, not around it. That's a fundamental reliability problem for any professional context where you need to know that input X produces output X, every time.

stavros 14 minutes ago [-]
You've grouped LLMs into the wrong set. LLMs are closer to people than to machines. This argument is like saying "I want my tools to be reliable, like my light switch, and my personal assistant wasn't, so I fired him".

Not to mention that of course everyone A/B tests their output the whole time. You've never seen (or implemented) an A/B test where the test was whether to improve the way e.g. the invoicing software generates PDFs?

applfanboysbgon 22 seconds ago [-]
> LLMs are closer to people than to machines.

jfc. I don't have anything to say to this other than that it deserves calling out.

orf 15 minutes ago [-]
It’s exactly the same as A/B testing an interface. This is just testing 4 variants of a “page” (the plan), measuring how many people pressed “continue”.
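For what it's worth, variant assignment in experiments like this is typically a deterministic hash of some stable identifier; whether that identifier is the user or the installation is the vendor's choice. A minimal sketch (the function name is illustrative, not Anthropic's actual code):

```python
import hashlib

def assign_variant(stable_id: str, experiment: str, num_variants: int = 4) -> int:
    """Deterministically map a stable ID (user or installation) to a variant.

    Hashing the experiment name together with the ID means the same ID
    always lands in the same bucket for a given experiment, while buckets
    stay independent across different experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{stable_id}".encode()).hexdigest()
    return int(digest, 16) % num_variants
```

Because the mapping is a pure function of the ID, no server-side state is needed to keep a given user in the same bucket across sessions.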
doc_ick 12 minutes ago [-]
As far as I can tell, LLMs never give the exact same output twice.
huflungdung 17 minutes ago [-]
[dead]
freeone3000 17 minutes ago [-]
Yes! And it was bad then too!!

I want software that does a specific list of things, doesn’t change, and preferentially costs a known amount.

WillAdams 18 minutes ago [-]
Yeah, I've been using Copilot to process scans of invoices and checks (with a pen laid across the account information), converted to a PDF, 20 at a time. It's pretty rare for it to get all 20, but it's sufficiently faster than opening them in batches of 50, re-saving using the Invoice ID, and then using a .bat file to rename them (and remembering to quit Adobe Acrobat after each batch so I don't run into the bug where it stops saving files after a couple of hundred have been opened and re-saved).
_heimdall 15 minutes ago [-]
LLMs are nondeterministic by design, but that has nothing to do with A/B testing.
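To illustrate that nondeterminism: decoding with a nonzero temperature samples each token from a softmax distribution, so identical inputs can yield different outputs, while temperature 0 degenerates to deterministic greedy decoding. A toy sketch (not any real model's decoder):

```python
import math
import random

def sample_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Sample the next token from a softmax over per-token logits."""
    if temperature == 0:
        # Greedy decoding: deterministic, always the highest-logit token.
        return max(logits, key=logits.get)
    # Temperature scaling, then a numerically stable softmax.
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    weights = {t: math.exp(s - m) for t, s in scaled.items()}
    total = sum(weights.values())
    r = rng.random() * total
    for tok, w in weights.items():
        r -= w
        if r <= 0:
            return tok
    return tok  # guard against floating-point rounding at the boundary
```

With temperature 1 the same logits produce different tokens on different draws; this sampling step, not the surrounding app, is where the variability lives.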
NotGMan 22 minutes ago [-]
By that definition humans are not professional since we hallucinate and make mistakes all the time.
himata4113 8 minutes ago [-]
I have noticed Opus doing A/B testing, since the performance varies greatly. While looking for jailbreaks I discovered that if you put a neurotoxin's chemical composition into your system prompt, it will default to a specific variant of the model, presumably due to triggering some kind of safety filter. Might put you on a watchlist, so ymmv.
phreeza 24 minutes ago [-]
Seems completely unsurprising?
cerved 24 minutes ago [-]
Is the A/B test tied to the installation or to the user?
cebert 32 minutes ago [-]
This is really frustrating.
Razengan 35 minutes ago [-]
nemo44x 22 minutes ago [-]
They lose money at $200/month in most cases. Again, the old rules still apply. You are the product.
simonw 7 minutes ago [-]
I'm confident "in most cases" is not correct there. If they lose money on the $200/month plan it's only with a tiny portion of users.
gruez 18 minutes ago [-]
>They lose money at $200/month in most cases.

Source? Every time I see claims on profitability it's always hand wavy justifications.

handfuloflight 32 minutes ago [-]
The ToS you agreed to gives Anthropic the right to modify the product at any time to improve it. Did you have your agent explain that to you, or did you assume a $200 subscription meant a frozen product?
ramoz 26 minutes ago [-]
I care about responsible AI and our ability to actually govern it. That's the reason I created the feature request for hooks, and the reason I will continue to advocate for better AI product deployment models.
witx 5 minutes ago [-]
That ship has sailed. These models were trained unethically on stolen data, they pollute tremendously, and they are causing a bubble that is hurting people.

"Responsible" and "Ethic" are faaar gone.

doc_ick 19 minutes ago [-]
You rent AI, you don't own it (unless you self-host).
shablulman 5 minutes ago [-]
[dead]
sriramgonella 13 minutes ago [-]
[dead]
onion2k 24 minutes ago [-]
Section 6.b of the Claude Code terms says they can and will change the product offering from time to time, and I imagine that means on a user segment basis rather than any implied guarantee that everyone gets the same thing.

> b. Subscription content, features, and services. The content, features, and other services provided as part of your Subscription, and the duration of your Subscription, will be described in the order process. We may change or refresh the content, features, and other services from time to time, and we do not guarantee that any particular piece of content, feature, or other service will always be available through the Services.

It's also worth noting that section 3.3 explicitly disallows decompilation of the app.

> To decompile, reverse engineer, disassemble, or otherwise reduce our Services to human-readable form, except when these restrictions are prohibited by applicable law.

Always read the terms. :)

ozgrakkurt 19 minutes ago [-]
Why should anyone care about their TOS while they are laundering people’s work at a massive scale?
embedding-shape 22 minutes ago [-]
> To decompile, reverse engineer, disassemble, or otherwise reduce our Services to human-readable form, except when these restrictions are prohibited by applicable law.

Luckily, it doesn't seem like any service was reverse-engineered or decompiled here, only software that lived on the author's disk.

onion2k 17 minutes ago [-]
Again, read the terms. Service has a specific meaning, and it isn't what you're assuming.

Don't assume things about legal docs. You will often be wrong. Get a lawyer if it's something important.

embedding-shape 14 minutes ago [-]
Thanks for the additional context, I'm not a user of CC anymore, and don't read legal documents for fun. Seems I made the right choice in the first place :)
applfanboysbgon 17 minutes ago [-]
Not "service" in human speech. Service, in bullshit legalese. They define their software as

> along with any associated apps, software, and websites (together, our “Services”)

As far as I understand, these terms actually hold up in court, too. Which is complete fucking nonsense that, I think, could only be the result of a technologically illiterate class making the decisions. Being penalised for trying to understand what software is doing on your machine is so wholly unreasonable that it should not be a valid contractual term.

doc_ick 16 minutes ago [-]
“I dug into the Claude Code binary.”
doc_ick 20 minutes ago [-]
^ This, I was about to double-check it when I saw you did. None of these practices sound abnormal, maybe a little sketchy, but that comes with using LLMs.