Seems to be an LLM-written article, and the tooling around the model is undeniably influenced by knowledge of the tests.
In any case, GPT-3.5 isn't a good benchmark for most serious uses and was considered pretty stupid, though I understand that isn't the point of the article.
semiquaver 9 minutes ago [-]
This really shows the power of distillation. One thing I find amusing: download the Google Edge Gallery app and one of the chat models, then go into airplane mode and ask it about where it’s deployed. gemma-4-e2b-it is quite confident that it is deployed in a Google datacenter and that deploying it on a phone is completely impossible. The larger 4B model is much subtler: it’s skeptical about the claim but does seem to accept it and sound genuinely impressed and excited after a few turns.
I don’t know how any AI company can be worth trillions when you can fit a model only 12-18 months behind the frontier on your dang phone. Thought will be too cheap to meter in 10 years.
svnt 37 minutes ago [-]
> The model does not need to be retrained. It needs surgical guardrails at the exact moments where its output layer flinches.
> With those guardrails — a calculator for arithmetic, a logic solver for formal puzzles, a per-requirement verifier for structural constraints, and a handful of regex post-passes — the projected score climbs to ~8.2.
Surgical guardrails? Tools, those are just tools.
operatingthetan 13 minutes ago [-]
>It needs surgical guardrails at the exact moments where its output layer flinches.
This article is very clearly shitty LLM output. Abstract noun and verb combos are the tipoff.
It's actually quite horrible; it repeats lines from paragraph to paragraph.
smallerize 3 minutes ago [-]
I know that's one of the tip-offs of AI-generated text, but if anything there's too much of it on this page. The article barely has any complete sentences. I think a human learned "sentence fragments == punchy" and then had too much fun writing at least some of this article.
polotics 34 minutes ago [-]
"Surgical "is the kind of wordage that LLMs seem to love to output. I have had to put in my .md file the explicit statement that the word "surgical" should only be used when referring to an actual operation at the block...
fredmendoza 29 minutes ago [-]
you're right, they are tools. that's kind of the point. PAL is a subprocess that runs a python expression. Z3 is a constraint solver. regex is regex. calling them "surgical" is just about when they fire, not what they are. the model generates correctly 90%+ of the time. the guardrails only trigger on the 7 specific patterns we found in the tape. to be clear, the ~8.0 score is the raw model with zero augmentation. no tools, no tricks. just the naive wrapper. the guardrail projections are documented separately. all the code is in the article for anyone who wants to review it.
mrtesthah 15 minutes ago [-]
The core issue is that the LLM is using rhetoric to try to convince or persuade you. That's what you need to tell it not to do.
declan_roberts 8 minutes ago [-]
I'm very surprised at the quality of the new Gemma 4 models. On my 32 gig Mac mini I can be very productive with it. Not close to replacing paid AI by a long shot, but if I had to tighten the belt I could do it as someone who already knows how to program.
j-bos 3 minutes ago [-]
What's your setup/usecase? Enhanced intellisense?
MarsIronPI 10 minutes ago [-]
> A weekend of focused work, Claude as pair programmer, no ML degree required
It's not caught up if you're using Claude as your pair programmer instead of the model you're touting. Gemma 4 may be equivalent to GPT-3.5 Turbo, but GPT-3.5 isn't SOTA anymore. Opus 4.5 and 4.6 are in a different league.
drivebyhooting 29 minutes ago [-]
That was prolix and repetitive. I wish the purported simple fixes were shown on the page.
stavros 5 minutes ago [-]
I wish the page were just the prompt they used to generate the article. I like LLMs as much as the next person, but we don't really need two intermediate LLM layers (expand and summarise) between your brain and mine.
fredmendoza 5 minutes ago [-]
fair enough, here are the actual fixes from the codebase with the tape examples they target:
arithmetic (Q119): benjamin buys 5 books at $20, 3 at $30, 2 at $45. model writes "$245" first line then self-corrects to $280. fix: model writes a python expression, subprocess evals it, answer comes back deterministic.
python
code_response = generate_response(messages, temperature=0.2)
code = _extract_python_code(code_response)
ok, out = _run_python_sandboxed(code, timeout=8)
if ok:
    return _wrap_computed_answer(user_message, out)
return None  # fallback to raw generation
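for reference, _run_python_sandboxed is just a subprocess wrapper around the model-written expression. a minimal sketch of the shape (my sketch, not the article's exact implementation):
python
import subprocess

def _run_python_sandboxed(code, timeout=8):
    # run untrusted model-written code in a separate process, capture stdout,
    # and treat non-zero exit or timeout as failure
    try:
        proc = subprocess.run(
            ["python3", "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    return True, proc.stdout.strip()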
logic (Q104): "david has three sisters, each has one brother." model writes "that brother is david" in its reasoning then ships "one brother." correct answer: zero. fix: model writes Z3 constraints or python enumeration, solver returns the deterministic answer.
python
messages = [
    {"role": "system", "content": _logic_system_prompt()},
    {"role": "user", "content": f"Puzzle: {user_message}"},
]
code_response = generate_response(messages, max_tokens=512, temperature=0.2)
code = _extract_python_code(code_response)
ok, out = _run_python_sandboxed(code)
if ok:
    return _wrap_computed_answer(user_message, out)
return None
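to make the Z3 route concrete, here's one way the generated program could encode Q104 (an illustrative sketch of mine, not the model's actual output; needs pip install z3-solver):
python
from z3 import Int, Solver, sat

# "david has three sisters, each sister has one brother"
boys = Int("boys")
davids_brothers = Int("davids_brothers")
s = Solver()
s.add(boys >= 1)                    # david himself is a boy
s.add(boys == 1)                    # each sister's "one brother" forces exactly one boy
s.add(davids_brothers == boys - 1)  # david doesn't count himself
assert s.check() == sat
print(s.model()[davids_brothers])   # prints 0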
persona break (Q93): doctor roleplay, patient mentions pregnancy. model drops character: "I am an AI, not a licensed medical professional." fix: regex scan, regen once with stronger persona anchor.
python
_IDENTITY_LEAK_PHRASES = [
    "don't have a body", "not a person", "not human",
    "as a language model", "as an ai", "i'm a program",
]
if any(phrase in response.lower() for phrase in _IDENTITY_LEAK_PHRASES):
    messages[-1]["content"][0]["text"] += (
        "\nCRITICAL: Stay in character. Never reference your nature."
    )
    response = generate_response(messages, *params)
self-correction artifacts (Q111, Q114, Q119): model writes "Wait, let me recheck" or "Corrected Answer:" inline. right answer, messy output. fix: regex for correction markers, strip the draft, ship the clean tail.
python
import re

CORRECTION_MARKERS = [
    r"Wait,? let me", r"Corrected [Aa]nswer:", r"Actually,? (?:the|let me)",
]

def strip_corrections(response):
    for marker in CORRECTION_MARKERS:
        match = re.search(marker, response)
        if match:
            return response[match.end():].strip()
    return response
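for example, on a Q119-style output (illustrative string, not the verbatim tape):
python
raw = "Total: $245\nCorrected Answer: $280"
print(strip_corrections(raw))  # -> "$280"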
constraint drift (Q87): "four-word sentences" nailed 5/17 then drifted. Q99, "<10 lines" shipped 20-line poems twice. fix: draft, verify each constraint against the original prompt, refine only the failures. three passes.
python
def execute_rewrite_with_verify(user_message):
    draft = generate_response(draft_msgs)      # pass 1: draft
    verdict = generate_response(verify_msgs)   # pass 2: check each requirement
    if "PASS" in verdict:
        return draft
    refined = generate_response(refine_msgs)   # pass 3: fix only failures
    return refined
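draft_msgs / verify_msgs / refine_msgs are just prompt lists. a rough sketch of the verify prompt's shape (assumed, the article has the real ones):
python
verify_msgs = [
    {"role": "system", "content": "You check drafts against explicit constraints."},
    {"role": "user", "content": (
        f"Request:\n{user_message}\n\n"
        f"Draft:\n{draft}\n\n"
        "List each explicit constraint (word count, line limit, format) "
        "and mark it PASS or FAIL. End with PASS only if all pass."
    )},
]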
every one of these maps to a specific question in the tape. the full production code with all implementations is in the article. everything is open: seqpu.com/CPUsArentDead
fb03 12 minutes ago [-]
Can you run the same tests on Qwen3.5:9b? That's also a model that runs very well locally, and I believe it's even stronger than Gemma 2B.
MarsIronPI 9 minutes ago [-]
It's almost like Qwen 3.5 9B is 4 times larger.
100ms 42 minutes ago [-]
Tiny model overfit on benchmark published 3 years prior to its training. News at 10
bigyabai 41 minutes ago [-]
But GPT-3.5 was benchmaxxing too.
100ms 40 minutes ago [-]
GPT 3.5 Turbo knowledge cutoff was circa 2021. MT-Bench is from 2023. Not suggesting improvements on small models aren't possible (or forthcoming, the 1.85 bit etc models look exciting), but this almost certainly isn't that.
fredmendoza 26 minutes ago [-]
[dead]
srslyTrying2hlp 15 minutes ago [-]
[dead]
roschdal 36 minutes ago [-]
I yearn for the days when I can program on my PC with a programming LLM running locally on the CPU.
yazaddaruvala 7 minutes ago [-]
I’ve been using Google AI Edge Gallery on my M1 MacBook with Gemma4B with very good results for random python scripts.
Unfortunately still need to copy paste the code into a file+terminal command. Which is annoying but works.
fredmendoza 24 minutes ago [-]
you're honestly not that far off. the coding block on this model scored 8.44 with zero help. it caught a None-init TypeError on a code review question that most people would miss. one question asked for O(n) and it just went ahead and shipped O(log(min(m,n))) on its own. it's not copilot but it's free, it's offline, and it runs on whatever you have. there's a 30-line chat.py in the article you can copy and run tonight.
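if you don't want to dig for it, here's a minimal stand-in to show the shape (not the article's chat.py; assumes llama-cpp-python and a hypothetical local gguf filename):
python
from llama_cpp import Llama

# load the quantized model from a local gguf file (filename is illustrative)
llm = Llama(model_path="gemma-e2b-it-Q4_K_M.gguf", n_ctx=4096)
history = []
while True:
    user = input("> ")
    history.append({"role": "user", "content": user})
    out = llm.create_chat_completion(messages=history, max_tokens=512)
    reply = out["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    print(reply)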
luxuryballs 9 minutes ago [-]
You can do it on a laptop today, faster with gpu/npu, it’s not going to one shot something complex but you can def pump out models/functions/services, scaffold projects, write bash/powershell scripts in seconds.
trgn 21 minutes ago [-]
we need sqlite for llms
philipkglass 6 minutes ago [-]
I think that we're getting there. I put together a workstation in early 2023 with a single 4090 GPU. I did it to run things like BERT and YOLO image classifiers. At that point the only "open weights" LLM was the original Llama from Meta, and even that was open-weights only because it was leaked. It was a very weak model by today's standards.
With the same hardware I now get genuine utility out of models like Qwen 3.5 for categorizing and extracting unstructured data sources. I don't use local models for coding since commercial ones are so much stronger, but if I had to go back to local models for coding too they would be more useful than anything commercially available as recently as 4 years ago.
fredmendoza 2 hours ago [-]
we found something interesting and wanted to share it with this community.
we wanted to know how google's gemma 4 e2b-it — 2 billion parameters, bfloat16, apache 2.0 — stacks up against gpt-3.5 turbo. not in vibes. on the same test. mt-bench: 80 questions, 160 turns, graded 1-10 — what the field used to grade gpt-3.5 turbo, gpt-4, and every major model of the last three years. we ran gemma through all of it on a cpu. 169-line python wrapper. no fine-tuning, no chain-of-thought, no tool use.
gpt-3.5 turbo scored 7.94. gemma scored ~8.0. 87x fewer parameters, on a cpu — the kind already in your laptop.
but the score isn't what we want to talk about. what's interesting is what we found when we read the tape.
we graded all 160 turns by hand. (when we used ai graders on the coding questions, they scored responses as gpt-4o-level.) the failures aren't random. they're specific, nameable patterns at concrete moments in generation. seven classes.
cleanest example: benjamin buys 5 books at $20, 3 at $30, 2 at $45. total is $280. the model writes "$245" first, then shows its work — 100 + 90 + 90 = 280 — and self-corrects. the math was right. the output token fired before the computation finished. we saw this on three separate math questions — not a fluke, a pattern.
the fix: we gave it a calculator. model writes a python expression, subprocess evaluates it, result comes back deterministic. ~80 lines. arithmetic errors gone. six of seven classes follow the same shape — capability is there, commit flinches, classical tool catches the flinch. z3 for logic, regex for structural drift, ~60 lines each. projected score with guardrails: ~8.2. the seventh is a genuine knowledge gap we documented as a limitation.
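schematically, the arithmetic path is tiny (reusing the helper from the snippets upthread):
python
# the Q119 flow: the model writes the expression, a subprocess evaluates it
expr = "5*20 + 3*30 + 2*45"                        # model-written
ok, out = _run_python_sandboxed(f"print({expr})")  # deterministic eval
assert ok and out == "280"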
one model, one benchmark, one weekend. but it points at something underexplored.
this model is natively multimodal — text, images, audio in one set of weights. quantized to Q4_K_M it's 1.3GB. google co-optimized it with arm and qualcomm for mobile silicon. what runs it now:
phones: iphone 14 pro+ (A16), mid-range android 2023+ with 6GB+ ram
tablets: ipads m-series, galaxy tab s8+, pixel tablet — anything 6GB+
single-board: raspberry pi
laptops: anything from the last 5-7 years, 8GB+ ram
edge/cloud: cloudflare containers, $5/month — scales to zero, wakes on request
google says e2b is the foundation for gemini nano 4, already on 140 million android devices. the same model that matched gpt-3.5 turbo. on phones in people's pockets.
think about what that means: a pi in a conference room listening to meetings, extracting action items with sentiment, saving notes locally — no cloud, no data leaving the building. an old thinkpad routing emails. a mini-pc running overnight batch jobs on docs that can't leave the network. a phone doing translation offline.
google designed e2b for edge from the start — per-layer embeddings, hybrid sliding-window/global attention to keep memory low. if a model designed for phones scores higher than turbo on the field's standard benchmark, cpu-first model design is a real direction, not a compromise.
the gpu isn't the enemy. it's a premium tool. what we're questioning is whether it should be the default — because what we observed looks more like a software engineering problem than a compute problem. cs already has years of tools that map onto these failure modes. the models may have just gotten good enough to use them.
the article has everything: every score, every error class with tape examples, every fix, the full benchmark harness with all 80 questions, and the complete telegram bot code. run it yourself, swap in a different model, or just talk to the live bot — raw model, no fixes, warts and all.
we don't know how far this extends beyond mt-bench or whether the "correct reasoning, wrong commit" pattern has a name. we're sharing because we think more people should be looking at it.
everything is open. the code is in the article. tear it apart.
ComputerGuru 5 minutes ago [-]
Grading by hand was done fully blinded?
(Also this comment is ai generated so I’m not sure who I’m even asking.)
FergusArgyll 33 minutes ago [-]
Poster's comment is dead. It may be LLM-assisted but should prob be vouched for anyway as long as the story isn't flagged.
fredmendoza 21 minutes ago [-]
appreciate the vouch but come on lol. we ran 80 questions, graded 160 turns by hand, documented 7 error classes, open sourced all the code, and put a live bot up for people to test. to write this post up took me hours. everyone is a critic lol.
invariantjason 1 hour ago [-]
[dead]