I am a medical student with thousands of pdfs, various anki databases, video conferences, audio recordings, markdown notes etc. It can query into all of them and return extremely high quality output with sources to each original document.
It's still in alpha though, and there's only 0.5 users besides me that I know of, so there are bugs that have yet to be found!
brumar 57 days ago [-]
Med student working on a sophisticated RAG system... What kind of beast are you? Thanks for sharing anyway, I'll keep tabs on it.
Ey7NFZ3P0nzAe 56 days ago [-]
Thanks for the kind words. My burning desire to advance computational psychiatry is what's keeping me up late at night :). But don't be impressed, I might be a terrible psychiatrist that's only half competent at coding after all! I'll try my best not to though and keep on improving!
Llamamoe 56 days ago [-]
Could you include information about the hardware necessary to run it?
Ey7NFZ3P0nzAe 56 days ago [-]
Hi, thank you very much for the interest. I think as long as your hardware can run Python 3.11, it should be fine. You don't really have to use any local LLMs or anything. You can just rely on external APIs and be fine.
pamelafox 58 days ago [-]
I’ve got this RAG repo working entirely locally (Ollama/Postgres) but it doesn't RAG on documents like you want.
You can use BerryDB for this use case at scale. BerryDB is a JSON-native database that can ingest PDFs, images, etc., and it has a built-in semantic layer (for labeling) so you can build your knowledge database with entities and relationships. This grounds your knowledge with entities, and accuracy scales very well with a large number of documents.
It provides APIs to extract paragraphs or tables from your PDFs in bulk. You can also separately do bulk labeling (say classification, NER and other labeling types). Once you have a knowledge database, it creates 4 indexes on top of your JSON data layer (a db index for metadata search, a full text search index, an annotation index and a vector index), so you can perform any search operation, including hybrid search.
Because your data layer is in JSON, it gives you infinite flexibility to add new snippets of knowledge or new labels and improve accuracy over time.
It doesn't show on desktop either. It's a hash link to no anchor.
imrantech 57 days ago [-]
The key to accuracy is use case specific knowledge graphs. Here is a YouTube video of how to do it. https://youtu.be/iWtF1Qe7QkM
The key benefits are
- Improved data quality of the data available for genAI
- Reduction in risk associated with genAI’s known problems
- Increasing business value due to being able to hit use case driven accuracy/reliability, security, and transparency metrics
shadowmnifold 56 days ago [-]
Thanks for this. I am going to watch the whole thing. Use case specific knowledge graphs sounds right to me.
Yes! We can definitely help with this. Khoj lets you chat with your documents, indexing your private knowledge base for local RAG with any open source (or foundation) model.
You can make it as 'fancy' as you want, and use speech-to-text, image generation, web scraping, custom agents.
Let me know if you run into any issues. I'd love to get this set up for senior citizens! You can reach me at saba at khoj.dev.
kingkongjaffa 60 days ago [-]
> expected the more documents we feed the lower the accuracy
Not surprising!
The LLM itself is the least important bit as long as it’s serviceable.
Depending on your goal you need to have a specific RAG strategy.
How are you breaking up the documents? Are the documents consistently formatted to make breaking them up uniform?
Do you need to do some preprocessing to make them uniform?
When you retrieve documents how many do you stuff into your prompt as context?
Do you stuff the same top N chunks from a single prompt or do you have a tailored prompt chain retrieving different resources based on the prompt and desired output?
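To make the chunking question concrete, here is a minimal sketch of one common baseline: fixed-size word chunks with some overlap. The function name `chunk_text` and the default sizes are assumptions for illustration, not something any tool in this thread is confirmed to use.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to `chunk_size` words.

    Consecutive chunks share `overlap` words, so a sentence that straddles
    a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Whether word counts are the right unit depends on the documents. Uniformly formatted documents are often better split on headings or paragraphs, which is why the preprocessing question matters.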
generalizations 58 days ago [-]
> How are you breaking up the documents? Are the documents consistently formatted to make breaking them up uniform? Do you need to do some preprocessing to make them uniform?
> When you retrieve documents how many do you stuff into your prompt as context?
> Do you stuff the same top N chunks from a single prompt or do you have a tailored prompt chain retrieving different resources based on the prompt and desired output?
Wouldn't these questions be answered by the RAG solution the OP is asking for?
kingkongjaffa 55 days ago [-]
Not really. You need to try different strategies for different use cases, in my experience.
specproc 58 days ago [-]
Fascinating there doesn't seem to be a consensus "just use this" answer here.
stavros 58 days ago [-]
My sentiments exactly, and given how widespread a need RAG is, I'm extremely surprised that we don't have something solid and clearly a leader in the space yet. We don't even seem to have two or three! It's "pick one of these million side-projects".
LunaSea 57 days ago [-]
Because RAGs are simply a list of vectors and a similarity search with some variations trying to use knowledge graphs.
So everybody is roughly using the same method with some tweaks here and there and thus getting a similar quality in results.
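As a rough illustration of the "list of vectors plus a similarity search" core, here is a sketch using only the standard library. The `embed` function is a toy bag-of-words stand-in (an assumption for the example). A real system would call an embedding model there, but the ranking step looks the same.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, docs: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Return the k documents most similar to the query, best first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(doc)), doc) for doc in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

Most of the "tweaks here and there" live outside this loop: how the chunks are made, which embedding is used, and what gets filtered or re-ranked before the prompt is built.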
disgruntledphd2 57 days ago [-]
Yeah, I'm coming to believe that this is a much, much, much harder problem than it looks. Getting it running is pretty easy, but actually tuning the results to make them better is tricky, especially if you're not a domain expert in the area you're working on.
Evals seem like a solution, but they're very tied to specific examples, so it looks like that might be most of the issue in getting this to work, as with a good set of evals, one can actually measure performance and test out different approaches.
Embedding also seems to be a bit of a dark art in that every tutorial uses something small, but I haven't seen a lot of work on comparing the performance of particular embeddings for domain specific tasks.
bronco21016 55 days ago [-]
This was our experience trying to deploy a knowledge base for our Org.
You have to get the domain experts to help you build evals and you need a good pipeline for testing the LLM against those as you make changes. We were never able to get there before the project was killed. Our use-case was potentially giving career altering legal advice and we only made it to roughly 80% accuracy from our very informal eval. The domain experts wanted nothing to do with actually helping build the tool. Their idea of "testing" was asking 3 softball questions and saying "yea, it's good to go".
I think on a personal level you could probably get a usable tool that works well enough most of the time. But for anything going to production where people actually depend on it, this isn't an easy problem to solve. Although, I do think it's doable.
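For what an eval pipeline can look like at its smallest, here is a hedged sketch of a retrieval check: a list of (question, expected document ID) pairs written with domain experts, scored as recall@k. The names `recall_at_k` and `retrieve` are hypothetical. `retrieve` stands for whatever search function your own pipeline exposes.

```python
from typing import Callable


def recall_at_k(
    eval_set: list[tuple[str, str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose expected document ID is in the top-k results.

    `eval_set` holds (question, expected_doc_id) pairs.
    `retrieve(question, k)` must return a list of document IDs, best first.
    """
    if not eval_set:
        return 0.0
    hits = sum(
        1 for question, expected_id in eval_set
        if expected_id in retrieve(question, k)
    )
    return hits / len(eval_set)
```

Re-running the same eval set after each change to chunking, embeddings, or prompts gives a number to compare, instead of three softball questions. It only measures retrieval, not whether the generated answer is correct.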
0xbadc0de5 57 days ago [-]
Is it possible that you're using RAG in a nonstandard way? Thousands of documents seems like a lot to feed into a single query. Have you tried using collections and tags to narrow down the field prior to performing the semantic search? You may also want to consider using a larger model or one with a larger context window.
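A minimal sketch of that narrowing step, assuming each document carries a collection name and a set of tags. The `Doc` class, `filtered_search`, and the word-overlap score are all made up for illustration. The score is only a stand-in for a real semantic search.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    text: str
    collection: str
    tags: set[str] = field(default_factory=set)


def overlap_score(query: str, text: str) -> float:
    """Stand-in for semantic similarity: Jaccard overlap of lowercase words."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def filtered_search(query, docs, collection=None, required_tags=frozenset(), k=3):
    """Filter by collection and tags first, then rank only the survivors."""
    candidates = [
        d for d in docs
        if (collection is None or d.collection == collection)
        and set(required_tags) <= d.tags
    ]
    ranked = sorted(candidates, key=lambda d: overlap_score(query, d.text), reverse=True)
    return ranked[:k]
```

The point of filtering first is that the similarity search only competes among documents that could plausibly be relevant, which tends to matter more as the corpus grows.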
jerpint 57 days ago [-]
Just remember that retrieval is the real bottleneck in RAG, so your problem could very much be related to how you create and embed your chunks and not the model you’re using
ivan_ah 57 days ago [-]
I recently watched a talk[1] on this exact use case: a RAG system that runs on localhost with a simple web UI, built on very powerful text processing and a simple backend (PHP and sqlite3 with FTS and vector search extensions).
I recently had some luck turning an Excel tracker that lists multiple locations and their services into markdown for RAG. It worked great as a natural language lookup, way better than digging through a big Excel sheet.
But things got a bit messy when I handed it off to someone else. They started using synonyms for locations, like abbreviated addresses to refer to certain columns, which didn't return the right documents.
Followed a friend's suggestion to try NotebookLM, so I uploaded the same docs there, and it was awesome. Some cloud-hosted vector DB tools only handle PDFs, but NotebookLM accepted my Markdown and chunked the docs better than the Supabase library I was using. It just "worked".
I would swap over to NotebookLM because their document chunking and RAG performance is working for my use case, but they just don’t offer an API yet.
Am I overhyping NotebookLM? I’d love to know how to get on-par document chunking, because that seems to deliver fantastic RAG right out of the box. I’m planning to try some other suggestions I’ve seen here, but any insights into how NotebookLM does its magic would be super helpful.
depingus 58 days ago [-]
txtai was brought up in a discussion yesterday. I saved it to look at later. But you might find it useful.
https://github.com/neuml/txtai
I've been working on this, which allows you to build a full RAG pipeline in Postgres: https://github.com/neondatabase-labs/pgrag. You can compile it yourself, or it's available on Neon (including the free plan) since yesterday.
kylecazar 58 days ago [-]
I would look at articles on building an open source RAG pipeline. Generation (model) is the last in a series of important steps -- you have options to choose from (retrieval, storage, etc) in each component step. Those decisions will affect the accuracy you mention.
LangChain and LlamaIndex had good resources on building such a pipeline the last I checked.
obelos 57 days ago [-]
You might find Rag to Riches' (R2R) built-in use of Unstructured for doc parsing, hybrid search, knowledge graphs, and HyDE queries improves the quality of your retrievals. https://github.com/SciPhi-AI/R2R
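For readers unfamiliar with hybrid search, one common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The sketch below shows that general technique only. It is an assumption for illustration, not a description of how R2R or any other tool in this thread combines results.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in
    (rank starts at 1), and documents are returned by descending total.
    The constant k = 60 is a conventional default, not a tuned value.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

Because RRF only uses ranks, it sidesteps the problem that full text scores and cosine similarities are on different scales.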
teleforce 57 days ago [-]
Check this Manning Early Access Program (MEAP) soon to be published book on AI-Powered Search:
Does anyone have experience using any of these for scientific paper PDFs, in particular containing equations (I'm guessing graphs are still well beyond their reach)? The workflow for these seems to involve converting PDF->text...
https://github.com/Azure-Samples/rag-postgres-openai-python
I’d like to make that version when I have the time, probably just using LlamaIndex for the ingestion.
My tips for getting SLMs working well for RAG: http://blog.pamelafox.org/2024/08/making-ollama-compatible-r...
I have a few tabs open that I haven't had a chance to try:
https://github.com/Mintplex-Labs/anything-llm
https://github.com/Bin-Huang/chatbox
https://github.com/saeedezzati/superpower-chatgpt
https://berrydb.io
You can see the project page here: https://textualization.com/ragged/
src and scripts here: https://github.com/Textualization/the-ragged-edge-box
[1] video presentation about the project https://www.youtube.com/watch?v=_fJFuL2pLvw
I uploaded them through Supabase Embeddings Generator if you're curious. https://github.com/supabase/embeddings-generator
I also gave Gemini a shot using this guide, but didn’t get the results I was hoping for. https://codelabs.developers.google.com/multimodal-rag-gemini...
Here is that thread. https://news.ycombinator.com/item?id=41981907
https://www.manning.com/books/ai-powered-search
Helpful for building a scalable, local RAG solution tailored to your group’s needs. Plus, it’s open source-friendly, if I'm correct.