Impressive model, for sure. I've been running it on my Mac, now I get to have it locally in my iPhone? I need to test this. Wait, it does agent skills and mobile actions, all local to the phone? Whaaaat? (Have to check out later! Anyone have any tips yet?)
I don't normally do the whole "abliterated" thing (dealignment) but after discovering https://github.com/p-e-w/heretic , I was too tempted to try it with this model a couple days ago (made a repo to make it easier, actually) https://github.com/pmarreck/gemma4-heretical and... Wow. It worked. And... Not having a built-in nanny is fun!
It's also possible to make an MLX version of it, which runs a little faster on Macs, but won't work through Ollama unfortunately. (LM Studio maybe.)
Runs great on my M4 Macbook Pro w/128GB and likely also runs fine under 64GB... smaller memories might require lower quantizations.
I specifically like dealigned local models because if I have to get my thoughts policed when playing in someone else's playground, like hell am I going to be judged while messing around in my own local open-source one too. And there's a whole set of ethically-justifiable but rule-flagging conversations (loosely categorizable as things like "sensitive", "ethically-borderline-but-productive" or "violating sacred cows") that are now possible with this, and at a level never before possible until now.
Note: I tried to hook this one up to OpenClaw and ran into issues
To answer the obvious question- Yes, this sort of thing enables bad actors more (as do many other tools). Fortunately, there are far more good actors out there, and bad actors don't listen to rules that good actors subject themselves to, anyway.
eloisant 1 hours ago [-]
I tried it on my mac, for coding, and I wasn't really impressed compared to Qwen.
I guess there are things it's better at?
nkohari 40 minutes ago [-]
You're comparing apples to oranges there. Qwen 3.5 is a much larger model at 397B parameters vs. Gemma's 31B. Gemma will be better at answering simple questions and doing basic automation, and codegen won't be its strong suit.
kgeist 33 minutes ago [-]
Qwen3.5 comes in various sizes (including 27B), and judging by the posts on HN, /LocalLlama etc., it seems to be better at logic/reasoning/coding/tool calling compared to Gemma 4, while Gemma 4 is better at creative writing and world knowledge (basically nothing changed from the Qwen3 vs. Gemma3 era)
Mil0dV 18 minutes ago [-]
Does this also apply to Gemma's 26B-A4B vs., say, Qwen's 35B-A3B?
I'm not sure if I can make the 35B-A3B work with my 32GB machine
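A rough back-of-the-envelope sketch for the 32GB question, assuming the only thing that matters is the size of the weights: parameters times bits per weight, divided by 8. The function name `weight_memory_gib` is made up for illustration, and the estimate ignores the KV cache, runtime overhead, and the fact that real quantization formats usually spend somewhat more than their nominal bits per weight. Note that for an MoE model like 35B-A3B, the "A3B" is the active parameter count per token; all 35B weights still have to be resident.

```python
def weight_memory_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB.

    Hypothetical helper: ignores KV cache, activations and runtime overhead.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # 35B total parameters (MoE: every expert must be loaded, not just the active ~3B)
    for bits in (16, 8, 4):
        print(f"35B @ {bits}-bit: ~{weight_memory_gib(35, bits):.1f} GiB")
```

By this estimate a 4-bit quantization of a 35B model needs about 16 GiB for weights, so it should be plausible on a 32GB machine with room left for context, while 8-bit would not fit.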
tredre3 33 minutes ago [-]
Gemma 4 31B is still not impressive at coding compared to even Qwen 3.5 27B. It's just not its strong suit.
So far gemma 4 seems excellent at role playing, document analysis, and decent at making agentic decisions.
gigatexal 14 minutes ago [-]
This has been my experience as well, Qwen via Ollama locally has been very very impressive.
barbazoo 2 hours ago [-]
> And there's a whole set of ethically-justifiable but rule-flagging conversations (loosely categorizable as things like "sensitive", "ethically-borderline-but-productive" or "violating sacred cows") that are now possible with this, and at a level never before possible until now.
I checked the abliterate script and I don't yet understand what it does or what the result is. What are the conversations this enables?
pmarreck 60 minutes ago [-]
1) Coming up with any valid criticism of Islam at all (for some reason, criticisms of Christianity or Judaism are perfectly allowed even with public models!).
2) Asking questions about sketchy things. Simply asking should not be censored.
3) I don't use it for this, but porn or foul language.
4) Imitating or representing a public figure is often blocked.
5) Asking security-related questions when you are trying to do security.
6) For those who have had it, people who are trying to use AI to deal with traumatic experiences that are illegal to even describe.
Many other instances.
peyton 5 minutes ago [-]
The manufacturing of biologics can be heavily censored to an absurd degree. I don’t know about Gemma 4 in particular.
spijdar 2 hours ago [-]
Realistically, a lot of people do this for porn.
In my experience, though, it's necessary to do anything security related. Interestingly, the big models have fewer refusals for me when I ask e.g. "in <X> situation, how do you exploit <Y>?", but local models will frequently flat out refuse, unless the model has been abliterated.
tredre3 30 minutes ago [-]
From what I've seen gemma 4 doesn't refuse a lot regarding sex, it only needs little nudging in the right direction sometimes.
But it does refuse being critical of the usual topics: israel, islam, trans, or race.
So wanting to discuss one of those is the real reason people would use an uncensored model.
throwuxiytayq 2 hours ago [-]
The in-ter-net is for porn
rav3ndust 1 hours ago [-]
that song is going to be stuck in my head all day now. lol
c2k 3 hours ago [-]
I run mlx models with omlx[1] on my mac and it works really well.
Haven't built anything on the agent skills platform yet, but it's pretty cool imo.
On Android the sandbox loads an index.html into a WebView, with standardized string I/O to the harness via some window properties. You can even return a rendered HTML page.
Definitely hacked together, but feels like an indication of what an edge compute agentic sandbox might look like in future.
bossyTeacher 32 minutes ago [-]
>there's a whole set of ethically-justifiable but rule-flagging conversations (loosely categorizable as things like "sensitive", "ethically-borderline-but-productive" or "violating sacred cows") that are now possible with this, and at a level never before possible until now.
Mind giving us a few of the examples that you plan to run in your local LLM? I am curious.
jackp96 3 hours ago [-]
[flagged]
potsandpans 2 hours ago [-]
I'm tired of this concern trolling.
jackp96 2 hours ago [-]
People are allowed to disagree with you, mate. This is a real concern that will affect real people's lives. I'm all for freedom, but that doesn't mean we ought to let just anyone own a nuke.
That said, show me where I'm wrong. I'd love to change my mind on this.
PullJosh 3 hours ago [-]
This is awesome!
1) I am able to run the model on my iPhone and get good results. Not as good as Gemini in the cloud, but good.
2) I love the “mobile actions” tool calls that allow the LLM to turn on the flashlight, open maps, etc. It would be fun if they added Siri Shortcuts support. I want the personal automation that Apple promised but never delivered.
3) I am so excited for local models to be normalized. I build little apps for teachers and there are stringent privacy laws involved that mean I strongly prefer writing code that runs fully client-side when possible. When I develop apps and websites, I want easy API access to on-device models for free. I know it sort of exists on iOS and Chrome right now, but as far as I’m aware it’s not particularly good yet.
buzzerbetrayed 50 minutes ago [-]
For me the hallucination and gaslighting is like taking a step back in time a couple of years. It even fails the “r’s in strawberry” question. How nostalgic.
It’s very impressive that this can run locally. And I hope we will continue to be able to run couple-year-old-equivalent models locally going forward.
karimf 2 hours ago [-]
This app is cool and it showcases some use cases, but it still undersells what the E2B model can do.
I just made a real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B. I posted it on /r/LocalLLaMA a few hours ago and it's gaining some traction [0]. Here's the repo [1]
I'm running it on a Macbook instead of an iPhone, but based on the benchmark here [2], you should be able to run the same thing on an iPhone 17 Pro.
Parlor is so cool, especially since you’re offering it for free. And a great use case for local LLMs.
karimf 1 hours ago [-]
Thanks! Although, I can't claim any credit for it. I just spent a day gluing what other people have built. Huge props to the Gemma team for building an amazing model and also an inference engine that's focused for edge devices [0]
OP Here. It is my firm belief that the only realistic use of AI in the future is either locally on-device for almost free, or in the cloud but way more expensive than it is today.
The latter option will only be used for tasks where humans are more expensive or much slower.
This Gemma 4 model gives me hope for a future Siri or other with iPhone and macOS integration, “Her” (as in the movie) style.
crazygringo 1 hours ago [-]
> or in the cloud but way more expensive than it is today.
Why? It's widely understood that the big players are making profit on inference. The only reason they still have losses is because training is so expensive, but you need to do that no matter whether the models are running in the cloud or on your device.
If you think about it, it's always going to be cheaper and more energy-efficient to have dedicated cloud hardware to run models. Running them on your phone, even if possible, is just going to suck up your battery life.
mbesto 1 hours ago [-]
> It's widely understood that the big players are making profit on inference.
This is most definitely not widely understood. We still don't know yet. There's tons of discussions about people disagreeing on whether it really is profitable. Unless you have proof, don't say "this is widely understood".
zozbot234 1 hours ago [-]
The big players are plausibly making profits on raw API calls, not subscriptions. These are quite costly compared to third-party inference from open models, but even setting that up is a hassle and you as an end user aren't getting any subsidy. Running inference locally will make a lot of sense for most light and casual users once the subsidies for subscription access cease.
Also while datacenter-based scaleout of a model over multiple GPUs running large batches is more energy efficient, it ultimately creates a single point of failure you may wish to avoid.
huijzer 1 hours ago [-]
Laptop/desktop could work. Most systems are on charger most of time anyway
nothinkjustai 1 hours ago [-]
> It's widely understood that the big players are making profit on inference.
Are they? Or are they just saying that to make their offerings more attractive to investors?
Plus I think most people using agents for coding are using subscriptions which they are definitely not profitable in.
Locally running models that are snappy and mostly as capable as current sota models would be a dream. No internet connection required, no payment plans or relying on a third party provider to do your job. No privacy concerns. Etc etc.
zozbot234 1 hours ago [-]
You can pick models that are snappy, or models that are as capable as SOTA. You don't really get both unless you spend extremely unreasonable amounts of money on what is essentially a datacenter-scale inference platform of your own, meant to service hundreds of users at once. (I don't care how many agent harnesses you spin up at once, you aren't going to get the same utilization as hundreds of concurrent users.)
This assessment might change if local AI frameworks start working seriously on support for tensor-parallel distributed inference, then you might get away with cheaper homelab-class hardware and only mildly unreasonable amounts of money.
jrflowers 42 minutes ago [-]
> It's widely understood that the big players are making profit on inference.
I love the whole “they are making money if you ignore training costs” bit. It is always great to see somebody say something like “if you look at the amount of money that they’re spending it looks bad, but if you look away it looks pretty good” like it’s the money version of a solar eclipse
_pdp_ 27 minutes ago [-]
If you can run free models on consumer devices why do you think cloud providers cannot do the same except better and bundled with a ton of value worth paying?
amelius 1 hours ago [-]
A local model running on a phone owned and controlled by the vendor is still not really exciting, imho.
It may be physically "local" but not in spirit.
0dayman 2 hours ago [-]
this is not that first step towards your dream
kennywinker 2 hours ago [-]
Did you really watch “Her” and think this is a future that should happen??
Seriously????
jfreds 1 hours ago [-]
I don’t think OP’s point has anything to do with AI companions.
The big benefit of moving compute to edge devices is to distribute the inference load on the grid. Powering and cooling phones is a lot easier than powering and cooling a datacenter
sambapa 1 hours ago [-]
Torment Nexus sounds fun
aninteger 55 minutes ago [-]
Having Scarlett Johansson's voice might not be so bad or even something less robotic.
Nice! Tried on iPhone 16 pro with 30 TPS from Gemma-4-E2B-it model.
Although the phone got considerably hot while inferencing. It’s quite an impressive performance and cannot wait to try it myself in one of my personal apps.
dhbradshaw 15 minutes ago [-]
My son just started using 2B on his Android. I mentioned that it was an impressively compact model and next thing I knew he had figured out how to use it on his inexpensive 2024 Motorola and was using it to practice reading and writing in foreign languages.
I assume it is the 26B A4B one, if it runs locally?
deckar01 2 hours ago [-]
It doesn’t render Markdown or LaTeX. The scrolling is unusable during generation. E4B failed to correctly account for convection and conduction when reasoning about the effects of thermal radiation (31b was very good). After 3 questions in a session (with thinking) E4B went off the rails and started emitting nonsense fragments before the stated token limit was hit (unless it isn’t actually checking).
TGower 2 hours ago [-]
These new models are very impressive. There should be a massive speedup coming as well, AI Edge Gallery is running on GPU, but NPUs in recent high end processors should be much faster. A16 chip for example (Macbook Neo and iphone 16 series) has 35 TOPS of Neural Engine vs 7 TFLOPS gpu. Similar story for Qualcomm.
api 2 hours ago [-]
That’s nuts actually for such a low power chip. Can’t wait to see the M series version of that.
I’m sure very fast TPUs in desktops and phones are coming.
zozbot234 2 hours ago [-]
The Apple Silicon in the MacBook Neo is effectively a slimmed down version of M4, which is already out and has a very similar NPU (similar TFLOPS rating). It's worth noting however that the TFLOPS rating for Apple Neural Engine is somewhat artificial, since e.g. the "38 TFLOPS" in the M4 ANE are really 19 TFLOPS for FP16-only operation.
hadrien01 3 hours ago [-]
Is it me or does the App Store website look... fake? The text in the header ("Productiviteit", "Alleen voor iPhone") looks pixelated, like it was edited on Paint, the header background is flickering, the app icon and screenshots are very low quality, the title of the website is incomplete ("App Store voor iPho...")
giarc 3 hours ago [-]
It's the dutch version, see /nl/ in the url.
If you just go to https://apps.apple.com/ it does look better, but I agree, still a bit "off".
It looks like there is some sort of glow effect on the text that isn't rendering right on your browser? It arguably doesn't have the best contrast, but seems to be as intended in Safari 26.3. Looks similar on Chrome macOS too: https://imgur.com/yq5PrKm.
t-sauer 2 hours ago [-]
Renders equally weird for me on Firefox on Windows 11. Firefox on MacOS looks good though.
Everything renders crystal clear with Firefox on GrapheneOS.
ezfe 3 hours ago [-]
Nothing weird on my side
burnto 2 hours ago [-]
My iPhone 13 can’t run most of these models. A decent local LLM is one of the few reasons I can imagine actually upgrading earlier than typically necessary.
garff 44 minutes ago [-]
How new of an iPhone model is needed?
__natty__ 2 hours ago [-]
That's a great project! I just wondered whether Google would have a problem with you using their trademark
tech234a 1 hours ago [-]
This is an app published by Google itself
carbocation 2 hours ago [-]
It would be very helpful if the chat logs could (optionally) be retained.
dwa3592 2 hours ago [-]
I think with this google starts a new race- best local model that runs on phones.
dwa3592 2 hours ago [-]
I wonder why the cut off date for 3n-E4B-it is Oct, 2023. That's really far in the past.
rickdg 2 hours ago [-]
How do these compare to Apple's Foundation Models, btw?
simonw 1 hours ago [-]
So much better. Hard to quantify, but even the small Gemma 4 models have that feels-like-ChatGPT magic that Apple's models are lacking.
snarkyturtle 1 hours ago [-]
AFM had a 4096 token context window and this can be configured to have a 32k+ token context window, for one.
dzhiurgis 1 hours ago [-]
I recently got to a first practical use of it. I was on a plane, filling landing card (what a silly thing these are). I looked up my hotel address using qwen model on my iPhone 16 Pro. It was accurate. I was quite impressed.
After some back and forth the chat app started to crash tho, so YMMV.
beeflet 1 hours ago [-]
Isn't this already possible in a much more open-ended way with PocketPal?
[1] https://github.com/jundot/omlx
[0] https://www.reddit.com/r/LocalLLaMA/comments/1sda3r6/realtim...
[1] https://github.com/fikrikarim/parlor
[2] https://huggingface.co/litert-community/gemma-4-E2B-it-liter...
[0] https://github.com/google-ai-edge/LiteRT-LM
Also on Android: https://play.google.com/store/apps/details?id=com.google.ai....
It's a demo app for Google's Edge project: https://ai.google.dev/edge
The design quality is still poor. But that's the new Apple. Design is no longer one of their core strengths.
On my iPhone it opens on the App Store app, so it looks fine to me.
Screenshot of the header: https://i.imgur.com/4abfGYF.png
Edit: Seems like mix-blend-mode: plus-lighter is bugged in Firefox on Windows https://jsfiddle.net/bjg24hk9/
https://github.com/a-ghorbani/pocketpal-ai
https://apps.apple.com/us/app/pocketpal-ai/id6502579498
https://play.google.com/store/apps/details?id=com.pocketpala...