I'm not well versed in LLMs, can someone with more experience share how this compares to Ollama (https://ollama.com/)? When would I use this instead?
Star_Ship_1010 235 days ago [-]
Best answer to this is from Reddit
"how does a smart car compare to a ford f150? its different in its intent and intended audience.
Ollama is someone who goes to walmart and buys a $100 huffy mountain bike because they heard bikes are cool. Torchchat is someone who built a mountain bike out of high quality components chosen for a specific task/outcome with the understanding of how each component in the platform functions and interacts with the others to achieve an end goal."
https://www.reddit.com/r/LocalLLaMA/comments/1eh6xmq/comment...
A longer answer with some more details:
If you don't care which quant you're using, only plan to use Ollama, and want easy integration with desktop/laptop-based projects, use Ollama.
If you want to run on mobile, integrate natively into your own apps or projects, don't want to use GGUF, want to do your own quantization, or want to extend a PyTorch-based solution, use torchchat.
Right now Ollama (based on llama.cpp) is a faster way to get performance on a laptop or desktop, and a number of projects are pre-integrated with Ollama thanks to the OpenAI spec (a quick sketch of that integration follows below). It's also more mature, with more fit and polish.
That said, the commands that make everything easy use 4-bit quant models, and you have to do extra work to go find a GGUF model with a higher (or lower) bit quant and load it into Ollama.
Also worth noting is that Ollama "containerizes" the models on disk, so you can't share them with other projects without going through Ollama, which is a hard pass for some users and use cases, since duplicating model files on disk isn't great.
https://www.reddit.com/r/LocalLLaMA/comments/1eh6xmq/comment...
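For a concrete sense of what that OpenAI-spec integration looks like, here's a minimal sketch, assuming a default Ollama install listening on localhost:11434 and an already-pulled llama3.1 tag (both assumptions; adjust to your setup):

    # Talk to a local Ollama server through its OpenAI-compatible endpoint.
    # Assumes `pip install openai`, a running `ollama serve`, and `ollama pull llama3.1`.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
        api_key="ollama",                      # required by the client, ignored by Ollama
    )

    resp = client.chat.completions.create(
        model="llama3.1",  # assumed model tag; use whatever you've pulled
        messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
    )
    print(resp.choices[0].message.content)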
dagaci 235 days ago [-]
If you're running Windows anywhere, then you're better off using Ollama, LM Studio, and/or LLamaSharp for coding; these are all cross-platform too.
lostmsu 234 days ago [-]
I found LlamaSharp to be quite unstable with random crashes in the built-in llama.cpp build.
sunshinesfbay 235 days ago [-]
Pretty cool! What are the steps to use these on mobile? Stoked about using ollama on my iPhone!
dagaci 235 days ago [-]
>> "If running windows" << All of these have web interfaces actually, and all of these implement the same openai api.
So you get to browse locally and remotely if you are able to expose the service remotely adjusting your router.
So you can also run on any LLM privately with ollama, lmstudio, and or LLamaSharp with windows, mac and iphone, all are opensource and customizable too and user friendly and frequently maintained.
JackYoustra 235 days ago [-]
Probably if you need any esoteric flags that PyTorch supports. FlashAttention 2, for example, was supported way earlier in PyTorch than in llama.cpp, so if FlashAttention 3 follows the same path, it'll probably make more sense to use this when targeting NVIDIA GPUs.
sunshinesfbay 235 days ago [-]
It would appear that FlashAttention-3 already exists for PyTorch, based on this joint blog post between NVIDIA, Together.ai, and Princeton about enabling FlashAttention-3 for PyTorch: https://pytorch.org/blog/flashattention-3/
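For reference, PyTorch routes attention through whichever flash kernels your build ships via scaled_dot_product_attention; here's a minimal sketch, assuming a recent PyTorch (roughly 2.3+ for the sdpa_kernel context manager) and a supported NVIDIA GPU. Which FlashAttention generation you actually get depends on your build and hardware:

    # Force the flash backend for scaled_dot_product_attention.
    import torch
    import torch.nn.functional as F
    from torch.nn.attention import SDPBackend, sdpa_kernel

    # (batch, heads, seq_len, head_dim) in fp16, which the flash kernels require on CUDA
    q, k, v = (torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16) for _ in range(3))

    with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print(out.shape)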
JackYoustra 235 days ago [-]
Right - my point about "follows the same path" mostly revolves around llama.cpp's latency in adopting it.
jerrygenser 236 days ago [-]
Ollama currently has only one "supported backend", which is llama.cpp. It enables downloading and running models on CPU, and it might have a more mature server.
This allows running models on GPU as well.
Zambyte 236 days ago [-]
I have been running Ollama on AMD GPUs (support for which came after NVIDIA GPUs) since February. Llama.cpp has supported them even longer.
tarruda 236 days ago [-]
How well does it run on AMD GPUs these days compared to NVIDIA or Apple Silicon?
I've been considering buying one of those powerful Ryzen mini PCs to use as an LLM server on my LAN, but I've read before that the AMD backend (ROCm, IIRC) is kinda buggy.
SushiHippie 235 days ago [-]
I have an RX 7900 XTX and never had AMD-specific issues, except that I needed to set some environment variable.
But it seems like integrated GPUs are not supported: https://github.com/ollama/ollama/issues/2637
Not sure about Ollama, but llama.cpp supports Vulkan for GPU compute.
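If you want to sanity-check what a ROCm build of PyTorch actually sees, here's a minimal sketch (ROCm builds reuse the torch.cuda API, so the CUDA calls are the right ones here):

    # Check whether a ROCm build of PyTorch detects your AMD GPU.
    import torch

    print("HIP/ROCm version:", torch.version.hip)       # None on CUDA-only or CPU-only builds
    print("GPU available:   ", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Device name:     ", torch.cuda.get_device_name(0))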
darkteflon 236 days ago [-]
Ollama runs on GPUs just fine - on Macs, at least.
Kelteseth 236 days ago [-]
Works fine on Windows with an AMD 7600 XT
amunozo 236 days ago [-]
I use it on Ubuntu and it works fine too.
ekianjo 236 days ago [-]
it runs on GPUs everywhere. On Linux, on Windows...
gleenn 236 days ago [-]
This looks awesome: the instructions are basically a one-liner to get a Python chat program started, and it's optimized for a lot of hardware you can run locally, like an NVIDIA GPU or an Apple M-series processor. Super cool work bringing this functionality to local apps and making it easy to just play with a lot of popular models. Great work.
boringg 235 days ago [-]
Can someone explain the use case? Is it so that I can run LLMs more readily in terminal instead of having to use a chat interface?
I'm not saying it isn't impressive being able to swap, but I have trouble understanding how this integrates into my workflow, and I don't really want to put much effort into exploring, given that there are so many things to explore these days.
sunshinesfbay 235 days ago [-]
It's an end-to-end solution that supports the same model from server (including the OpenAI API!) to mobile. To the extent that you just want to run on one specific platform, other solutions might work just as well?
ipunchghosts 236 days ago [-]
I have been using Ollama and am generally not that impressed with these models for doing real work. I can't be the only person who thinks this.
diggan 236 days ago [-]
Same conclusion here so far. I've tested out various open source models, maybe once or twice per month, comparing them against GPT-4, and nothing has come close so far. Even closed source models seem to not fare very well; so far Claude has maybe come the closest to GPT-4, but I've yet to find something that could surpass GPT-4 for coding help.
Of course, it could be that I've just got used to GPT-4 and my prompting has been optimized for it, and I try to apply the same techniques to other models where those prompts wouldn't work as well.
wongarsu 236 days ago [-]
They won't beat Claude or GPT-4. If you want a model that writes code or answers complex questions, use one of those. But for many simpler tasks like summarization, sentiment analysis, data transformation, text completion, etc., self-hosted models are perfectly suited.
And if you work on something where the commercial models are trained to refuse answers and lecture the user instead, some of the freely available models are much more pleasant to work with. With 70B models you even get a decent amount of reasoning capability.
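As an example of what those simpler tasks look like against a self-hosted model, here's a minimal summarization sketch against Ollama's native generate endpoint; the model tag and input file are assumptions, so adjust them to whatever you run locally:

    # One-shot summarization against a local model via Ollama's native API.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.1",  # assumed tag
            "prompt": "Summarize the following in two sentences:\n\n" + open("article.txt").read(),
            "stream": False,
        },
        timeout=120,
    )
    print(resp.json()["response"])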
ekianjo 236 days ago [-]
> various open source models
What models did you try? There are a ton of new ones every month these days.
diggan 236 days ago [-]
Most recently: Llama-3.1, Codestral, Gemma 2, Mistral NeMo.
codetrotter 236 days ago [-]
Which parameter counts, and which quantization levels?
bboygravity 236 days ago [-]
I wrote an automated form-filling Firefox extension and tested it with Llama 3.1 (via Ollama). Not perfect, and quite slow, but better than any other form filler I tested.
I also tried hooking it up to Claude, and so far it's flawless (I didn't do a lot of testing, though).
Dowwie 235 days ago [-]
Can you share what kind of real work you're trying?
derefr 235 days ago [-]
What's your example of "real work"?
Most "well-known-name" open-source ML models, are very much "base models" — they are meant to be flexible and generic, so that they can be fine-tuned with additional training for task-specific purposes.
Mind you, you don't have to do that work yourself. There are open-source fine-tunes as well, for all sorts of specific purposes, that can be easily found on HuggingFace / found linked on applicable subreddits / etc — but these don't "make the news" like the releases of new open-source base models do, so they won't be top-of-mind when doing a search for a model to solve a task. You have to actively look for them.
Heck, even focusing on the proprietary-model Inference-as-a-Service space, it's only really OpenAI that purports to have a "general" model that can be set to every task with only prompting. All the other proprietary-model Inf-aaS providers also sell Fine-Tuning-as-a-Service of their models, because they know people will need it.
---
Also, if you're comparing e.g. ChatGPT-4o (~200b) with a local model you can run on your PC (probably 7b, or maybe 13b if you have a 4090) then obviously the latter is going to be "dumber" — it's (either literally, or effectively) had 95+% of its connections stripped out!
For production deployment of an open-source model with "smart thinking" requirements (e.g. a customer-support chatbot), the best-practice open-source-model approach would be to pay for dedicated and/or serverless hosting where the instances have direct-attached, dedicated, server-class GPUs that can therefore host the largest-parameter-size variants of the open-source models. Larger-parameter-size open-source models fare much better against the proprietary hosted models.
IMHO the models in the "hostable on a PC" parameter-size range, mainly exist for two use-cases:
• doing local development and testing of LLM-based backend systems (Due to the way pruning+quantizing parameters works, a smaller spin of a larger model will be probabilistically similar in behavior to its larger cousin — giving you the "smart" answer some percentage of the time, and a "dumb" answer the rest of the time. For iterative development, this is no problem — regenerate responses until it works, and if it never does, then you've got the wrong model/prompt.)
• "shrinking" an AI system that doesn't require so much "smart thinking", to decrease its compute requirements and thus OpEx. You start with the largest spin of the model; then you keep taking it down in size until it stops producing acceptable results; and then you take one step back.
The models of this size-range don't exist to "prove out" the applicability of a model family to a given ML task. You can do it with them — especially if there's an existing fine-tuned model perfectly suited to the use-case — but it'll be frustrating, because "the absence of evidence is not evidence of absence." You won't know whether you've chosen a bad model, or your prompt is badly structured, or your prompt is impossible for any model, etc.
When proving out a task, test with the largest spin of each model you can get your hands on, using e.g. a serverless Inf-aaS like Runpod. Once you know the model family can do that task to your satisfaction, then pull a local model spin from that family for development.
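A minimal sketch of that "start big, step down" loop; the model tags and the evaluation function here are hypothetical placeholders, not a real benchmark:

    # Step down through model spins until results stop being acceptable,
    # then keep the last size that passed.
    candidate_tags = ["llama3.1:70b", "llama3.1:8b"]  # largest first (assumed tags)

    def passes_my_eval(tag: str) -> bool:
        # Run your task's test prompts against `tag` and score the outputs (placeholder).
        return True

    chosen = None
    for tag in candidate_tags:
        if passes_my_eval(tag):
            chosen = tag   # still acceptable, keep stepping down
        else:
            break          # first failure: stick with the previous size
    print("smallest acceptable spin:", chosen)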
simonw 235 days ago [-]
"There are open-source fine-tunes as well, for all sorts of specific purposes"
Have you had good results from any of these? I've not tried a model that's been fine-tuned for a specific purpose yet, I've just worked with the general purpose ones.
daghamm 236 days ago [-]
Does pytorch have better acceleration on x64 CPUs nowadays?
Last time I played with LLMs on CPU with pytorch you had to replace some stuff with libraries from Intel otherwise your performance would be really bad.
gleenn 236 days ago [-]
I can't find it again in the doc, but I'm pretty sure it supports MKL, which at least is Intel's faster math library. Better than a stick in the eye. Also certainly faster than plain CPU code, but almost certainly way slower than something with more massively parallel matrix processing.
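If you want to check what your own build was compiled against, here's a quick sketch (these checks exist in current PyTorch, though the underlying library names have shifted over time):

    # Inspect which CPU math backends this PyTorch build includes.
    import torch

    print("MKL available:   ", torch.backends.mkl.is_available())
    print("oneDNN available:", torch.backends.mkldnn.is_available())
    print(torch.__config__.show())  # full build config, including the BLAS backend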
sva_ 236 days ago [-]
x86_64*
ein0p 235 days ago [-]
Selling it as a “chat” is a mistake, IMO. Chatbots require very large models with a lot of stored knowledge about the world. Small models are useful for narrow tasks, but they are not, and will never be, useful for general-domain chat.
suyash 235 days ago [-]
This is cool. How can I go about using this with my own dataset (.pdf, .html files, etc.)?
jiratemplates 236 days ago [-]
looks great
aklgh 236 days ago [-]
A new PyTorch feature. Who knew!
How about making libtorch a first class citizen without crashes and memory leaks? What happened to the "one tool, one job" philosophy?
As an interesting thought experiment: Should PyTorch be integrated into systemd or should systemd be integrated into PyTorch? Both seem to absorb everything else like a black hole.
smhx 236 days ago [-]
it's not a new PyTorch feature.
It's just a showcase of existing PyTorch features (including libtorch) as an end-to-end example.
On the server side it uses libtorch, and on mobile it uses PyTorch's ExecuTorch runtime (which is optimized for edge devices): https://github.com/pytorch/executorch
BaculumMeumEst 236 days ago [-]
Did not know executorch existed! That's so cool! I have it on my bucket list to tinker with running LLMs on wearables after I'm a little further along in learning, great to see official tooling for that!
I think this is not about new PyTorch features, although it requires the latest PyTorch and ExecuTorch, which makes me think that some features in PyTorch and ExecuTorch got extended or optimized for this use case?
What makes this cool is that you can use the same model and the same library and apply them to server, desktop, laptop, and mobile on iOS and Android, with a variety of quantization schemes and other features.
Definitely still some rough edges as I'd expect from any first software release!