Run an incredible 400B parameters on a handheld device.
0.6 t/s, wait 30 seconds to see what these billions of calculations get us:
"That is a profound observation, and you are absolutely right ..."
intrasight 1 hours ago [-]
Better than waiting 7.5 million years to have a computer tell you the answer is 42.
bartread 12 minutes ago [-]
Looked at a certain way it's incredible that a 40-odd year old comedy sci-fi series is so accurate about the expected quality of (at least some) AI output.
Which makes it even funnier.
It makes me a little sad that Douglas Adams didn't live to see it.
whyenot 36 minutes ago [-]
Should have used a better platform. So long and thanks for all the fish.
thinkingtoilet 1 hours ago [-]
Maybe you should have asked a better question. :P
patapong 53 minutes ago [-]
What do you get if you multiply six by nine?
ctxc 24 minutes ago [-]
Tea
RuslanL 28 minutes ago [-]
67?
xeyownt 45 minutes ago [-]
54?
ep103 19 minutes ago [-]
Someone should let Douglas Adams know the calculation could have been so much faster if the machine just lied.
lesam 12 minutes ago [-]
I think Adams was prescient, since in his story the all powerful computer reaches the answer '42' via incorrect arithmetic.
WarmWash 1 hours ago [-]
I don't think we are ever going to win this. The general population loves being glazed way too much.
baal80spam 1 hours ago [-]
> The general population loves being glazed way too much.
This is 100% correct!
WarmWash 1 hours ago [-]
Thanks for the short warm blast of dopamine, no one else ever seems to grasp how smart I truly am!
timcobb 1 hours ago [-]
That is an excellent observation.
otikik 34 minutes ago [-]
The other day, I got:
"You are absolutely right to be confused"
That was the closest AI has been to calling me "dumb meatbag".
winwang 11 minutes ago [-]
It would be much worse if it had said "You are absolutely wrong to be confused", haha.
Terretta 23 minutes ago [-]
"Carrot: The Musical" in the Carrot weather app, all about the AI and her developer meatbag, is on point.
tombert 1 hours ago [-]
That's an astute point, and you're right to point it out.
actusual 1 hours ago [-]
You are thinking about this exactly the right way.
9dev 51 minutes ago [-]
You’re absolutely right!
Aurornis 1 hours ago [-]
I thought you were being sarcastic until I watched the video and saw those words slowly appear.
Emphasis on slowly.
r_lee 40 minutes ago [-]
I too thought you were joking
laughed when it slowly began to type that out
amelius 57 minutes ago [-]
I mean size says nothing, you could do it on a Pi Zero with sufficient storage attached.
So this post is like saying that yes an iPhone is Turing complete. Or at least not locked down so far that you're unable to do it.
zozbot234 43 minutes ago [-]
You need fast storage to make it worthwhile. PCIe x4 5.0 is a reasonable minimum. Or multiple PCIe x4 4.0 accessed in parallel, but this is challenging since the individual expert-layers are usually small. Intel Optane drives are worth experimenting with for the latter (they are stuck on PCIe 4.0) purely for their good random-read properties (quite aside from their wearout resistance, which opens up use for KV-cache and even activations).
vntok 21 minutes ago [-]
2 years ago, LLMs failed at answering coherently. Last year, they failed at answering fast on optimized servers. Now, they're failing at answering fast on underpowered handheld devices... I can't wait to see what they'll be failing to do next year.
dv_dt 32 minutes ago [-]
CPU, memory, storage, time tradeoffs rediscovered by AI model developers. There is something new here, add GPU to the trade space.
alephnerd 18 minutes ago [-]
It's been known to people working in the space for a long time. Heck, I was working on similar stuff for the Maxwell and later Pascal over a decade ago.
You do have a lot of "MLEs" and "Data Scientists" who only know basic PyTorch and SKLearn, but that kind of fat is being trimmed industry wide now.
Domain experience remains gold, especially in a market like today's.
firstbabylonian 2 hours ago [-]
> SSD streaming to GPU
Is this solution based on what Apple describes in their 2023 paper 'LLM in a flash' [1]?
A similar approach was recently featured here: https://news.ycombinator.com/item?id=47476422 Though iPhone Pro has very limited RAM (12GB total) which you still need for the active part of the model. (Unless you want to use Intel Optane wearout-resistant storage, but that was power hungry and thus unsuitable to a mobile device.)
Aurornis 1 hours ago [-]
> Though iPhone Pro has very limited RAM (12GB total) which you still need for the active part of the model.
This is why mixture of experts (MoE) models are favored for these demos: Only a portion of the weights are active for each token.
zozbot234 27 minutes ago [-]
Yes but most people are still running MoE models with all experts loaded in RAM! This experiment shows quite clearly that some experts are only rarely needed, so you do benefit from not caching every single expert-layer in RAM at all times.
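A minimal sketch of the idea, assuming a plain least-recently-used policy: keep only a fixed number of expert-layers in RAM and read the rest from storage on demand. `ExpertCache`, `fake_load`, and the capacity are hypothetical names and values for illustration, not the code used in the demo.

```python
from collections import OrderedDict


class ExpertCache:
    """LRU cache keyed by (layer, expert); loads from slow storage on a miss."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity      # max expert-layers held in RAM (assumed value)
        self.load_fn = load_fn        # hypothetical: reads one expert's weights from storage
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, layer, expert):
        key = (layer, expert)
        if key in self.entries:
            self.entries.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        weights = self.load_fn(layer, expert)  # slow path: storage read
        self.entries[key] = weights
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
        return weights


def fake_load(layer, expert):
    # Stand-in for an SSD read; returns a placeholder instead of real weights.
    return f"weights[{layer}][{expert}]"


cache = ExpertCache(capacity=2, load_fn=fake_load)
for expert in [0, 1, 0, 2, 0, 1]:
    cache.get(layer=0, expert=expert)
print(cache.hits, cache.misses)
```

If routing is skewed toward a few experts, the hit rate stays high with a small cache. If routing is close to uniform, most lookups fall through to storage.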
simonw 2 hours ago [-]
Yeah, this new post is a continuation of that work.
A year ago this would have been considered impossible. The hardware is moving faster than anyone's software assumptions.
cogman10 2 hours ago [-]
This isn't a hardware feat, this is a software triumph.
They didn't make special purpose hardware to run a model. They crafted a large model so that it could run on consumer hardware (a phone).
pdpi 2 hours ago [-]
It's both.
We haven't had phones running laptop-grade CPUs/GPUs for that long, and that is a very real hardware feat. Likewise, nobody would've said running a 400b LLM on a low-end laptop was feasible, and that is very much a software triumph.
bigyabai 53 minutes ago [-]
> We haven't had phones running laptop-grade CPUs/GPUs for that long
Agree to disagree, we've had laptop-grade smartphone hardware for longer than we've had LLMs.
smallerize 2 hours ago [-]
The iPhone 17 Pro launched 8 months ago with 50% more RAM and about double the inference performance of the previous iPhone Pro (also 10x prompt processing speed).
SV_BubbleTime 22 minutes ago [-]
>triumph
It’s been a lot of years, but all I can hear after reading that is … I’m making a note here, huge success
GorbachevyChase 7 minutes ago [-]
There’s no use crying over every mistake. You just keep on trying until you run out of cake.
breggles 9 minutes ago [-]
It's hard to overstate my satisfaction!
Aurornis 1 hours ago [-]
It wasn't considered impossible. There are examples of large MoE LLMs running on small hardware all over the internet, like giant models on Raspberry Pi 5.
It's just so slow that nobody pursued it seriously. It's fun to see these tricks implemented, but even on this 2025 top spec iPhone Pro the output is 100X slower than output from hosted services.
zozbot234 53 minutes ago [-]
If the bottleneck is storage bandwidth that's not "slow". It's only slow if you insist on interactive speeds, but the point of this is that you can run cheap inference in bulk on very low-end hardware.
The software has real software engineers working on it instead of researchers.
Remember when people were arguing about whether to use mmap? What a ridiculous argument.
At some point someone will figure out how to tile the weights and the memory requirements will drop again.
snovv_crash 1 hours ago [-]
The real improvement will be when the software engineers get into the training loop. Then we can have MoE that use cache-friendly expert utilisation and maybe even learned prefetching for what the next experts will be.
zozbot234 51 minutes ago [-]
> maybe even learned prefetching for what the next experts will be
Experts are predicted by layer and the individual layer reads are quite small, so this is not really feasible. There's just not enough information to guide a prefetch.
snovv_crash 46 minutes ago [-]
Manually no. It would have to be learned, and making the expert selection predictable would need to be a training metric to minimize.
zozbot234 40 minutes ago [-]
Making the expert selection more predictable also means making it less effective. There's no real free lunch.
ottah 37 minutes ago [-]
I mean, by any reasonable standard it still is. Almost any computer can run an LLM; it's just a matter of how fast, and 0.4k/s (peak before first token) is not really considered running. It's a demo, but practically speaking it's entirely useless.
alephnerd 6 minutes ago [-]
Devil's advocate - this actually shows how promising TinyML and EdgeML capabilities are. SoCs comparable to the A19 Pro are highly likely to be commodified in the next 3-5 years in the same manner that SoCs comparable to the A13 already are.
_air 2 hours ago [-]
This is awesome! How far away are we from a model of this capability level running at 100 t/s? It's unclear to me if we'll see it from miniaturization first or from hardware gains
Tade0 1 hours ago [-]
Only way to have hardware reach this sort of efficiency is to embed the model in hardware.
This exists[0], but the chip in question is physically large and won't fit on a phone.
I think for many reasons this will become the dominant paradigm for end user devices.
Moore's law will shrink it to 8mm soon. I think it'll be like a microSD card you plug in.
Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.
bigyabai 49 minutes ago [-]
One big bottleneck is SRAM cost. Even an 8b model would probably end up being hundreds of dollars to run locally on that kind of hardware. Especially unpalatable if the model quality keeps advancing year-by-year.
> Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.
It's amazing to me that people consider this to be more realistic than FAANG collaborating on a CUDA-killer. I guess Nvidia really does deserve their valuation.
intrasight 32 minutes ago [-]
> bottleneck is SRAM cost
Not for this approach
ottah 42 minutes ago [-]
That's actually pretty cool, but I'd hate to freeze a model's weights into silicon without having an incredibly specific and broad use case.
tclancy 44 minutes ago [-]
I think you're ignoring the inevitable march of progress. Phones will get big enough to hold it soon.
originalvichy 1 hours ago [-]
On smartphones? It’s not worth it to run a model this size on a device like this. A smaller fine-tuned model for specific use cases is not only faster, but possibly more accurate when tuned to specific use cases. All those gigs of unnecessary knowledge are useless to perform tasks usually done on smartphones.
svachalek 15 minutes ago [-]
A long time. But check out Apollo from Liquid AI, the LFM2 models run pretty fast on a phone and are surprisingly capable. Not as a knowledge database but to help process search results, solve math problems, stuff like that.
ottah 44 minutes ago [-]
Probably 15 to 20 years, if ever. This phone is only running this model in the technical sense of running, but not in a practical sense. Ignore the 0.4tk/s, that's nothing. What really makes this example bullshit is the fact that there is no way the phone has enough RAM to hold any reasonable amount of context for that model. Context requirements are not insignificant, and as the context grows, the output will get even slower.
Realistically you need 300+ GB/s of fast-access memory feeding the accelerator, with enough capacity to fully hold quants of more than 4 bits. That's at least 380GB of memory. You can gimmick a demo like this with an SSD, but the SSD is just not fast enough to meet the minimum specs for anything more than showing off a neat trick on Twitter.
The only hope for a handheld execution of a practical, and capable AI model is both an algorithmic breakthrough that does way more with less, and custom silicon designed for running that type of model. The transformer architecture is neat, but it's just not up for that task, and I doubt anyone's really going to want to build silicon for it.
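For readers who want to check the memory and bandwidth figures in the comment above, here is a back-of-the-envelope sketch. It assumes a weights-only footprint (no KV cache, activations, or quantization metadata) and a dense forward pass that reads every weight once per token. The function names are made up for illustration.

```python
def weight_footprint_gb(params: float, bits_per_weight: float) -> float:
    """Weights-only footprint in decimal GB (ignores KV cache and activations)."""
    return params * bits_per_weight / 8 / 1e9


def bandwidth_bound_tps(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on tokens/s if memory bandwidth is the only bottleneck."""
    return bandwidth_gb_s / gb_read_per_token


PARAMS = 400e9  # parameter count quoted in the thread

for bits in (4, 8, 16):
    size_gb = weight_footprint_gb(PARAMS, bits)
    tps = bandwidth_bound_tps(300, size_gb)  # 300 GB/s is the figure from the comment
    print(f"{bits:2d}-bit: {size_gb:6.1f} GB of weights, at most {tps:.2f} t/s if read densely")
```

A mixture-of-experts model reads only its active experts per token, so `gb_read_per_token` can be far smaller than the full footprint. That is the gap the SSD-streaming demo exploits.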
russellbeattie 50 minutes ago [-]
I have some macro opinions about Apple - not sure if I'm correct, but tell me what you think.
Apple has always seen RAM as an economic advantage for their platform: Make the development effort to ensure that the OS and apps work well with minimal memory and save billions every year in hardware costs. In 2026, iPhones still come with 8GB of RAM, Pro/Max come with 12GB.
The problem is that AI (ML/LLM training and inference) are areas where you can't get around the need for copious amounts of fast working memory. (Thus the critical shortage of RAM at the moment as AI data centers consume as many memory chips as possible.)
Unless there's something I don't know (which is more than possible) Apple can't code their way around this problem, nor create specialized SoCs with ML cores that obviate the need for lots and lots of RAM.
So, it's going to be interesting whether they accept this reality and we start seeing the iPhones in the future with 16GB, 32GB or more as standard in order to make AI performant. And if they give up on adding AI to the billions of iPhones with minimal RAM already out there.
As a side note, 8GB of RAM hasn't been enough for a decade. It prevents basic tasks like keeping web tabs live in the background. My pet peeve is having just a few websites open, and having the page refresh when swapping between them because of aggressive memory management.
To me, Apple's obvious strength is pushing AI to the edge as much as possible. While other companies are investing in massive data centers which will have millions of chips that will be outdated within the next couple years, Apple will be able to incrementally improve their ML/AI features by running on the latest and greatest chips every year. Apple has a huge advantage in that they can design their chips with a mega high speed bus, which is just as important as the quantity of RAM.
But all that depends on Apple's willingness to accept that RAM isn't an area they can skimp on any more, and I'm not sure they will.
Sorry for the brain dump. I'd love to be educated on this in case I'm totally off base.
ecshafer 9 seconds ago [-]
In a recent episode of Dwarkesh the guest who is a semiconductor industry analyst predicted that an iPhone will increase in price by about $250 for the same stuff due to increased ram/chip costs from AI. Apple will not be able to afford to put a bunch more RAM into the phones and still sell them.
zozbot234 30 minutes ago [-]
RAM is just too expensive. We need to bring back non-DRAM persistent memory that doesn't have the wearout issues of NAND.
ottah 34 minutes ago [-]
Possibly this just isn't the generation of hardware to solve this problem in? We're like, what three or four years in at most, and only barely two in towards AI assisted development being practical. I wouldn't want to be the first mover here, and I don't know if it's a good point in history to try and solve the problem. Everything we're doing right now with AI, we will likely not be doing in five years. If I were running a company like Apple, I'd just sit on the problem until the technology stabilizes and matures.
bigyabai 30 minutes ago [-]
If I was running a company like Apple, I'd be working with Khronos to kill CUDA since yesterday. There are multiple trillions of dollars that could be Apple's if they sign CUDA drivers on macOS, or create a CUDA-compatible layer. Instead, Apple is spinning their wheels and promoting nothingburger technology like the NPU and MPS.
It's not like Apple's GPU designs are world-class anyways, they're basically neck-and-neck with AMD for raster efficiency. Except unlike AMD, Apple has all the resources in the world to compete with Nvidia and simply chooses to sit on their ass.
zozbot234 23 minutes ago [-]
CUDA is not the real issue, AMD's HIP offers source-level compatibility with CUDA code, and ZLUDA even provides raw binary compatibility. nVidia GPUs really are quite good, and the projected advantages of going multi-vendor just aren't worth the hassle given the amount of architecture-specificity GPUs are going to have.
bigyabai 22 minutes ago [-]
Okay, then don't kill CUDA, just sign CUDA drivers on macOS instead and quit pretending like MPS is a world-class solution. There are trillions on the table, this is not an unsolvable issue.
This has nothing to do with Apple, and everything to do with MoE and that everyone forgot you can re-read the necessary bits of the model from disk for each token.
This is extremely inefficient though. For efficiency you need to batch many requests (like 32+, probably more like 128+), and when you do that with MoE you lose the advantage of only having to read a subset of the model during a single forward pass, so the trick does not work.
But this did remind me that with dense models you might be able to use disk to achieve high throughput at high latency on GPUs that don't have a lot of VRAM.
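The batching point can be made concrete with a small estimate. This is a sketch under a strong assumption: each token picks `top_k` of `num_experts` experts per layer, uniformly and independently. Real routers are skewed, so treat it as a rough guide only. The expert counts below are hypothetical, not taken from the model in the post.

```python
def fraction_of_experts_touched(batch_size: int, num_experts: int, top_k: int) -> float:
    """Expected fraction of a layer's experts read for one batch.

    Assumes uniform, independent routing: a given expert is skipped by one
    token with probability 1 - top_k / num_experts.
    """
    p_unused_by_one_token = 1.0 - top_k / num_experts
    return 1.0 - p_unused_by_one_token ** batch_size


NUM_EXPERTS = 128  # hypothetical experts per layer
TOP_K = 8          # hypothetical experts activated per token

for batch in (1, 8, 32, 128):
    frac = fraction_of_experts_touched(batch, NUM_EXPERTS, TOP_K)
    print(f"batch={batch:4d}  experts read per layer ~ {frac:.1%}")
```

Under these assumptions a single request reads about 6% of each layer's experts, while a batch of 32 or more reads most of them. This is why streaming experts from disk helps single-stream inference far more than batched serving.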
rwaksmunski 2 hours ago [-]
Apple might just win the AI race without even running in it. It's all about the distribution.
dzikimarian 1 hours ago [-]
Because someone managed to run an LLM on an iPhone at unusable speed, Apple won the AI race? Yeah, sure.
naikrovek 1 hours ago [-]
whoa, save some disbelief for later, don't show it all at once.
raw_anon_1111 2 hours ago [-]
Apple is already one of the winners of the AI race. It’s making much more profit (ie it ain’t losing money) on AI off of ChatGPT, Claude, Grok (you would be surprised at how many incels pay to make AI generated porn videos) subscriptions through the App Store.
It’s only paying Google $1 billion a year for access to Gemini for Siri
detourdog 2 hours ago [-]
Apple’s entire yearly capex is a fraction of the AI spend of the presumed AI winners.
foobiekr 1 hours ago [-]
Fantasy buildouts of hundreds of billions of dollars for gear that has a 3 year lifetime may be premature.
Put another way, there is no demonstrated first mover advantage in LLM-based AI so far and all of the companies involved are money furnaces.
devmor 1 hours ago [-]
Which is mostly insane amounts of debt leveraged entirely on the moonshot that they will find a way to turn a profit on it within the next couple years.
Apple’s bet is intelligent, the “presumed winners” are hedging our economic stability on a miracle, like a shaking gambling addict at a horse race who just withdrew his rent money.
qingcharles 1 hours ago [-]
Plus all those pricey 512GB Mac Studios they are selling to YouTubers.
giobox 26 minutes ago [-]
Most of the influencer content I saw demonstrating LLMs on multiple 512GB Mac Studios over Thunderbolt networking used Macs borrowed from Apple PR that were returned afterwards - NetworkChuck, Jeff Geerling et al didn't actually buy the 4 or 5 512GB Mac Studios used in their corresponding local LLM videos.
The financial math on actually buying over $40k worth of Mac for 1 to 2 youtube videos probably doesn't work that well, even for the really big players.
icedchai 1 hours ago [-]
They don't offer the 512 gig RAM variant anymore. Outside of social media influencers and the occasional AI researcher, the market for $10K desktops is vanishingly small.
spacedcowboy 19 minutes ago [-]
Huh, interesting. I wonder if there's a premium price right now for the one on my desk...
Pretty sure the M5 Ultra will be out after WWDC, so my M3 Ultra is (while still completely capable of fulfilling my needs) looking a bit long in the tooth. If I can get a good price for it now, I might be able to offset most of the M5 post WWDC...
criddell 34 minutes ago [-]
The best desktop you could get has been around $10k going all the way back to the PDP-8e (it could fit on most desks!).
Multiplayer 52 minutes ago [-]
My understanding is that the 512gb offering will likely return with the new M5 Ultra coming around WWDC in June. Fingers crossed anyway!
simopa 2 hours ago [-]
It's crazy to see a 400B model running on an iPhone. But moving forward, as the information density and architectural efficiency of smaller models continue to increase, getting high-quality, real-time inference on mobile is going to become trivial.
volemo 57 minutes ago [-]
> moving forward, as the information density and architectural efficiency of smaller models continue to increase
If they continue to increase.
vessenes 20 minutes ago [-]
They will. Either new architectures will come out that give us greater efficiency, or we will hit a point where the main thing we can do is shove more training time onto these weights to get more per byte. Similar thing is already happening organically when it comes to efficient token use; see for instance https://github.com/qlabs-eng/slowrun.
simopa 10 seconds ago [-]
Thanks for the link.
1: https://arxiv.org/abs/2312.11514
iPhone 17 Pro outperforms AMD’s Ryzen 9 9950X per https://www.igorslab.de/en/iphone-17-pro-a19-pro-chip-uebert...
[0] https://www.anuragk.com/blog/posts/Taalas.html