This glosses over a fundamental scaling problem that undermines the entire argument. The author's main example is Claude Code searching through local codebases with grep and ripgrep, then extrapolates this to claim RAG is dead for all document retrieval. That's a massive logical leap.
Grep works great when you have thousands of files on a local filesystem that you can scan in milliseconds. But most enterprise RAG use cases involve millions of documents across distributed systems. Even with 2M token context windows, you can't fit an entire enterprise knowledge base into context. The author acknowledges this briefly ("might still use hybrid search") but then continues arguing RAG is obsolete.
The bigger issue is semantic understanding. Grep does exact keyword matching. If a user searches for "revenue growth drivers" and the document discusses "factors contributing to increased sales," grep returns nothing. This is the vocabulary mismatch problem that embeddings actually solve. The author spent half the article complaining about RAG's limitations with this exact scenario (his $5.1B litigation example), then proposes grep as the solution, which would perform even worse.
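For concreteness, a tiny sketch of that vocabulary mismatch (illustrative only; it assumes the sentence-transformers package and the public all-MiniLM-L6-v2 checkpoint, neither of which the comment mentions):

    # Exact keyword matching vs. embedding similarity on a paraphrase pair.
    from sentence_transformers import SentenceTransformer
    import numpy as np

    query = "revenue growth drivers"
    passage = "Factors contributing to increased sales include new store openings."

    # What grep effectively does: substring/keyword match -> no hit.
    print(query.lower() in passage.lower())  # False

    # What embeddings do: the same pair scores as closely related.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    q_vec, p_vec = model.encode([query, passage], normalize_embeddings=True)
    print(float(np.dot(q_vec, p_vec)))  # noticeably higher than for unrelated text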
Also, the claim that "agentic search" replaces RAG is misleading. Recent research shows agentic RAG systems embed agents INTO the RAG pipeline to improve retrieval, they don't replace chunking and embeddings. LlamaIndex's "agentic retrieval" still uses vector databases and hybrid search, just with smarter routing.
Context windows are impressive, but they're not magic. The article reads like someone who solved a specific problem (code search) and declared victory over a much broader domain.
_the_inflator 2 days ago [-]
I agree.
A great many pundits don't get that RAG means: "a technique that enables large language models (LLMs) to retrieve and incorporate new information"

So RAG is a pattern that is applied, as a principle, to almost every process. Context windows? OK, I won't get into all the nitty-gritty details here (embedded devices, small storage, security, RAM defects, cost and storage of contexts for different contexts, etc.), just a hint: the act of filling a context is what? Applied RAG.
RAG is not an architecture, it is a principle. A structured approach. There is a reason why many nowadays refer to RAG as a search engine.
For all we know about knowledge, there is only one entity with an infinite context window. We still call it God, not the cloud.
larodi 2 days ago [-]
Indeed, the name is Retrieval Augmented Generation... so this is generation (synthesis of text) augmented by retrieval (of data from external systems). The goal is to augment the generation, not to improve retrieval.
The improvements needed for the retrieval part are then another topic.
CuriouslyC 2 days ago [-]
Agentic retrieval is really more a form of deep research (from a product standpoint there is very little difference). The key is that LLMs > rerankers, at least when you're not at webscale where the cost differential is prohibitive.
nbstme 2 days ago [-]
LLMs > rerankers. Yes! I don't like rerankers. They are slow, their context windows are small (4096 tokens), and they're expensive... It's better when the LLM reads the whole file versus some top_chunks.
janalsncm 2 days ago [-]
Rerankers are orders of magnitude faster and cheaper than LLMs. Typical latency out of the box on a decent sized cross encoder (~4B) will be under 50ms on cheap gpus like an A10G. You won’t be able to run a fancy LLM on that hardware and without tuning you’re looking at hundreds of ms minimum.
More importantly, it’s a lot easier to fine tune a reranker on behavior data than an LLM that makes dozens of irrelevant queries.
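A minimal sketch of the reranking step described here (assumes the sentence-transformers package; the checkpoint name is just a common public cross-encoder, not a recommendation from the comment):

    # Score (query, passage) pairs with a cross-encoder, keep the top few.
    from sentence_transformers import CrossEncoder

    query = "What drove revenue growth this quarter?"
    candidates = [
        "Revenue increased 12% driven by subscription renewals.",
        "The company relocated its headquarters to Austin.",
        "Higher ad pricing contributed to top-line growth.",
    ]

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, c) for c in candidates])

    # Keep the top-k passages for the LLM context.
    top_k = [c for _, c in sorted(zip(scores, candidates), reverse=True)][:2]
    print(top_k)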
CuriouslyC 2 days ago [-]
This is worth emphasizing. At scale, and when you have the resources to really screw around with them to tune your pipeline, rerankers aren't bad, they're just much worse/harder to use out of the box. LLMs buy you easy robustness, baseline quality and capabilities in exchange for cost and latency, which is a good tradeoff until you have strong PMF and you're trying to increase margins.
deepsquirrelnet 1 days ago [-]
More than that, adding longer context isn’t free either in time or money. So filling an LLM context with k=100 documents of mixed relevance may be slower than reranking and filling with k=10 of high relevance.
Of course, the devil is in the details and there’s five dozen reasons why you might choose one approach over the other. But it is not clear that using a reranker is always slower.
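Back-of-the-envelope arithmetic for the point above; every number here is a made-up assumption purely for the comparison:

    CHUNK_TOKENS = 800          # assumed average tokens per retrieved chunk
    PRICE_PER_MTOK = 3.00       # assumed $ per million input tokens

    def prompt_cost(k: int) -> float:
        return k * CHUNK_TOKENS * PRICE_PER_MTOK / 1_000_000

    print(f"k=100 unranked chunks: ${prompt_cost(100):.4f} per query")
    print(f"k=10 reranked chunks:  ${prompt_cost(10):.4f} per query")
    # The reranker adds some latency, but the LLM call reads 10x fewer tokens.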
kaycey2022 7 hours ago [-]
>The author spent half the article complaining about RAG's limitations with this exact scenario (his $5.1B litigation example), then proposes grep as the solution, which would perform even worse.
Yeah I found this very confusing. Sad to see such a poor quality article being promoted to this extent.
grafmax 2 days ago [-]
RAG doesn’t just mean word vectors but can include keyword search. Claude using grep is a form of RAG.
edanm 2 days ago [-]
In practice this is not how the term is used.
It bugs me, because the acronym should encompass any form of retrieval - but in practice, people use RAG to specifically refer to embedding-vector-lookups, hence it making sense to say that it's "dying" now that other forms of retrieval are better.
mk_stjames 1 days ago [-]
This was essentially my response as well, but the other replies to you also have a point, and I think the key here is the 'Retrieval' in RAG is very vague, and depending on who you were and what you were getting into RAG for, the term means different things.
I am definitely more aligned with needing what I would rather call 'Deep Semantic Search and Generation': the ability to query text chunk embeddings of, say, 100k PDFs, using the semantics to search for the closeness of the 'ideas', feed those into the context of the LLM, and then have the LLM generate a response to the prompt citing the source PDF(s) the closest-matched vectors came from...
That is the killer app of a 'deep research' assistant IMO and you don't get that via just grepping words and feeding related files into the context window.
The downside is: how do you generate embeddings of massive amounts of mixed-media files and store them in a database quickly and cheaply compared to just grepping a few terms from said files? A CPU grep of text in files in RAM is like five orders of magnitude faster than an embedding model on the GPU generating semantic embeddings of the chunked file and then storing those for later.
bjornsing 2 days ago [-]
But couldn’t an LLM search for documents in that enterprise knowledge base just like humans do, using the same kind of queries and the same underlying search infrastructure?
z3dd 2 days ago [-]
I wouldn't say humans are efficient at that, so there's no reason to copy them, other than as a starting point.
carlmr 2 days ago [-]
Maybe not efficient, but if the LLMs can't even reach this benchmark then I'm not sure.
zwaps 2 days ago [-]
Yes but that would be worse than many RAG approaches, which were implemented precisely because there is no good way to cleanly search through a knowledge base for a million different reasons.
At that point, you are just doing Agentic RAG, or even just Query Review + RAG.
I mean, yeah, agentic RAG is the future. It's still RAG though.
rightbyte 2 days ago [-]
I don't get it. Isn't grep RAG?
lossolo 2 days ago [-]
In RAG, you operate on embeddings and perform vector search, so if you search for "fat lady", it might also retrieve text like "huge queen", because they're semantically similar. Grep, on the other hand, only matches exact strings, so it would not find it.
saberience 10 hours ago [-]
This isn't the case.
RAG means any kind of data lookup which improves LLM generation results. I work in this area and speak to tons of companies doing RAG and almost all these days have realised that hybrid approaches are way better than pure vector searches.
Standard understanding of RAG now is simply adding any data to the context to improve the result.
gk1 2 days ago [-]
R in RAG is for retrieval… of any kind. It doesn’t have to be vector search.
lossolo 2 days ago [-]
Sure, but vector search is the dominant form of RAG, the rest are niche. Saying "RAG doesn’t have to use vectors" is like saying "LLMs don't have to use transformers". Technically true, but irrelevant when 99% of what's in use today does.
hannasanarion 1 days ago [-]
How are they niche? The default mode of search for most dedicated RAG apps nowadays is hybrid search that blends classical BM25 search with some HNSW embedding search. That's already breaking the definition.
A search is a search. The architecture doesn't care if it's doing a vector search or a text search or a keyword search or a regex search; it's all the same. Deploying a RAG app means trying different search methods, or using multiple methods simultaneously or sequentially, to get the best performance for your corpus and use case.
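A minimal sketch of the hybrid blend described here (a sketch only; it assumes the rank_bm25 and sentence-transformers packages, and the weights are arbitrary):

    # Blend a lexical BM25 score with an embedding similarity score.
    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer
    import numpy as np

    docs = [
        "Factors contributing to increased sales in fiscal 2024.",
        "Registration of tartans is handled by the Scottish Register.",
        "Vendor dependencies and sourcing risks in the supply chain.",
    ]
    query = "revenue growth drivers"

    bm25 = BM25Okapi([d.lower().split() for d in docs])
    lexical = bm25.get_scores(query.lower().split())

    model = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = model.encode(docs, normalize_embeddings=True)
    q_vec = model.encode(query, normalize_embeddings=True)
    semantic = doc_vecs @ q_vec

    # Normalize each signal to [0, 1] and take a weighted blend.
    def norm(x):
        x = np.asarray(x, dtype=float)
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    blended = 0.4 * norm(lexical) + 0.6 * norm(semantic)
    print(docs[int(np.argmax(blended))])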
lossolo 1 days ago [-]
Most hybrid stacks (BM25 + dense via HNSW/IVF) still rely on embeddings as a first-class signal. So in practice the vector side carries recall on paraphrase/synonymy/OOV vocab, while BM25 stabilizes precision on exact-term and short-doc cases. So my point still stands.
> The architecture doesn't care
The architecture does care because latency, recall shape, and failure modes differ.
I don't know of any serious RAG deployments that don't use vectors. I'm referring to large scale systems, not hobby projects or small sites.
lemonlearnings 1 days ago [-]
Isn't grep + LLM a form of RAG anyway?
kaycey2022 7 hours ago [-]
I guess, but with a very basic form of exact-match retrieval. Embedding-based RAG tries to augment the prompt with extra data that is semantically similar instead of just exactly the same.
tedivm 1 days ago [-]
It really depends on what you mean by RAG. If you take the acronym at face value yeah.
However, RAG has been used as a stand-in for a specific design pattern where you retrieve data at the start of a conversation or request and then inject that into the request. This simple pattern has benefits compared to just sending a prompt by itself.
The point the author is trying to make is that this pattern kind of sucks compared to Agentic Search, where instead of shoving a bunch of extra context in at the start you give the model the ability to pull context in as needed. By switching from a "push" to a "pull" pattern, we allow the model to augment and clarify the queries it's making as it goes through a task which in turn gives the model better data to work with (and thus better results).
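A schematic contrast of the two patterns described here. This is a sketch under stated assumptions: llm() and search() are hypothetical stand-ins, not any real API.

    def classic_rag(question, llm, search):
        # "Push": retrieve once up front, stuff it into the prompt.
        context = search(question, k=10)
        return llm(f"Context:\n{context}\n\nQuestion: {question}")

    def agentic_search(question, llm, search, max_steps=5):
        # "Pull": the model decides what to look up, reads, and refines.
        notes = []
        query = question
        for _ in range(max_steps):
            results = search(query, k=3)
            notes.append(results)
            decision = llm(
                f"Question: {question}\nFindings so far: {notes}\n"
                "Reply with either SEARCH: <new query> or ANSWER: <final answer>."
            )
            if decision.startswith("ANSWER:"):
                return decision.removeprefix("ANSWER:").strip()
            query = decision.removeprefix("SEARCH:").strip()
        return llm(f"Answer using these findings: {notes}\nQuestion: {question}")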
peab 1 days ago [-]
Yeah 100%
Almost all tool calls would result in RAG.
"RAG is dead" just means rolling your own search and manually injecting results into context is dead (just use tools). It means the chunking techniques are dead.
hannasanarion 1 days ago [-]
Chunking is still relevant, because you want your tool calls to return results specific to the needs of the query.
If you want to know "how are tartans officially registered" you don't want to feed the entire 554kb wikipedia article on Tartan to your model, using 138,500 tokens, over 35% of gpt-5's context window, with significant monetary and latency cost. You want to feed it just the "Regulation>Registration" subsection and get an answer 1000x cheaper and faster.
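A sketch of the idea above: split on heading structure so a tool call can return just the relevant subsection instead of the whole article. The splitter and the toy article below are illustrative, not anyone's production chunker.

    import re

    def split_by_heading(text: str) -> dict[str, str]:
        """Map 'Heading > Subheading' paths to their body text."""
        sections, path = {}, []
        current = []
        key = "Preamble"
        for line in text.splitlines():
            m = re.match(r"^(#{1,6})\s+(.*)", line)
            if m:
                if current:
                    sections[key] = "\n".join(current).strip()
                level, title = len(m.group(1)), m.group(2).strip()
                path = path[: level - 1] + [title]
                key = " > ".join(path)
                current = []
            else:
                current.append(line)
        if current:
            sections[key] = "\n".join(current).strip()
        return sections

    article = "# Tartan\nIntro...\n## Regulation\n### Registration\nHow tartans are registered..."
    print(split_by_heading(article)["Tartan > Regulation > Registration"])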
peab 1 days ago [-]
But you could. For that example, you could just use a much cheaper model, since it's not that complicated a question, and pass the entire article. Just use Gemini Flash, for example. Models will only get cheaper and context windows will only get bigger.
mvieira38 1 days ago [-]
I've seen it called "agentic search" while RAG seems to have become synonymous with semantic search via embeddings
hannasanarion 1 days ago [-]
That's a silly distinction to make, because there's nothing stopping you from giving an agent access to a semantic search.
If I make a semantic search over my organization's Policy As Code procedures or whatever and give it to Claude Code as an MCP, does Claude Code suddenly stop being agentic?
iamleppert 1 days ago [-]
Yes, this guy's post came up on my LinkedIn. I think it's helpful to consider the source in these types of articles, written by a CEO at a fintech startup (looks AI-generated too). It's obvious from reading the article that he doesn't understand what he's talking about and has likely never created any kind of RAG or other system. He has very limited experience, basically a single project, building a system around rudimentary ingestion of SEC filings; that's his entire breadth of technical experience on the subject. So take what you read with a grain of salt, and do your own research and testing.
alansaber 1 days ago [-]
Well yeah RAG just specifies retrieval augmented, not that vector retrieval or decoder retrieval was used
nbstme 2 days ago [-]
Appreciate the feedback. I’m not saying grep replaces RAG. The shift is that bigger context windows let LLMs just read whole files, so you don’t need the whole chunk/embed pipeline anymore. Grep is just a quick way to filter down candidates.
From there the model can handle 100–200 full docs and jot notes into a markdown file to stay within context. That’s a very different workflow than classic RAG.
visarga 2 days ago [-]
I think the most important insight from your article, which I also felt, is that agentic search is really different. The ability to retarget a search iteratively fixes both the issues of RAG and grep approaches - they don't need to be perfect from the start, they only need to get there after 2-10 iterations. This really changes the problem. LLMs have become so smart they can compensate for chunking and not knowing the right word.
But on top of this I would also use AI to create semantic maps, like hierarchical structure of content, and put that table of contents in the context, let the AI explore it. This helps with information spread across documents/chapters. It provides a directory to access anything without RAG, by simply following links in a tree. Deep Research agents build this kind of schema while they operate across sources.
To explore this I built a graph MCP memory system where the agent can search both by RAG and text matching, and when it finds top-k nodes it can expand out by links. Writing a node implies having the relevant nodes loaded up first and, when generating the text, placing contextual links embedded [1] like this. So simply writing a node also connects it to the graph at all the right points. This structure fits better with the kind of iterative work LLMs do.
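A small sketch of the "semantic map" idea above: keep a compact table of contents the model can see in full, and fetch node bodies only on demand. The node layout and doc ids below are made-up examples, not the commenter's system.

    toc = {
        "10-K 2024": {
            "Risk Factors": {"Supply Chain": "doc_017", "Litigation": "doc_018"},
            "MD&A": {"Revenue Drivers": "doc_031", "Liquidity": "doc_032"},
        }
    }

    def render_toc(node, depth=0):
        """Compact outline the agent keeps in context instead of full text."""
        lines = []
        for name, child in node.items():
            ref = f" [{child}]" if isinstance(child, str) else ""
            lines.append("  " * depth + f"- {name}{ref}")
            if isinstance(child, dict):
                lines.extend(render_toc(child, depth + 1))
        return lines

    print("\n".join(render_toc(toc)))
    # The agent reads this outline, then requests e.g. doc_031 via a fetch tool.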
davidmckayv 2 days ago [-]
That's fair, but how do you grep down to the right 100-200 documents from millions without semantic understanding? If someone asks "What's our supply chain exposure?" grep won't find documents discussing "vendor dependencies" or "sourcing risks."
You could expand grep queries with synonyms, but now you're reimplementing query expansion, which is already part of modern RAG. And doing that intelligently means you're back to using embeddings anyway.
The workflow works great for codebases with consistent terminology. For enterprise knowledge bases with varied language and conceptual queries, grep alone can't get you to the right candidates.
pjm331 2 days ago [-]
The agent greps for the obvious term or terms, reads the resulting documents, discovers new terms to grep for, and the process repeats until it's satisfied it has enough info to answer the question.
> You could expand grep queries with synonyms, but now you're reimplementing query expansion, which is already part of modern RAG.
in this scenario "you" are not implementing anything - the agent will do this on its own
this is based on my experience using claude code in a codebase that definitely does not have consistent terminology
it doesn't always work but it seemed like you were thinking in terms of trying to get things right in a single grep when it's actually a series of greps that are informed by the results of previous ones
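A rough sketch of that "series of greps" loop: grep, read the hits, pull new candidate terms out of them, grep again. It assumes ripgrep (rg) is on PATH, and the term-extraction step is deliberately naive; a real agent would let the LLM pick the next terms.

    import re
    import subprocess

    def rg_files(term: str, root: str = ".") -> list[str]:
        out = subprocess.run(["rg", "-l", "-i", term, root],
                             capture_output=True, text=True)
        return out.stdout.splitlines()

    def expand_terms(path: str, seed: str) -> set[str]:
        text = open(path, errors="ignore").read()
        # Grab capitalized phrases near the seed term as new things to search for.
        pos = text.lower().find(seed.lower())
        snippet = text[max(0, pos - 500): pos + 500]
        return set(re.findall(r"[A-Z][a-zA-Z]+(?: [A-Z][a-zA-Z]+)+", snippet))

    searched, frontier, seen = set(), {"supply chain"}, set()
    for _ in range(3):                      # a few grep rounds, not one shot
        new_terms = set()
        for term in frontier:
            searched.add(term)
            for path in rg_files(term)[:5]:
                seen.add(path)
                new_terms |= expand_terms(path, term)
        frontier = new_terms - searched
    print(sorted(seen))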
cwyers 2 days ago [-]
Classical search
Spivak 2 days ago [-]
Which is RAG. How you decide to take a set of documents too large for an LLM context window and narrow it down to a set that does fit is an implementation issue.
The chunk, embed, similarity search method was just a way to get a decent classical search pipeline up and running with not too much effort.
glenngillen 2 days ago [-]
I was previously working at https://autonomy.computer, and building out a platform for autonomous products (i.e., agents) there. I started to observe a similar opportunity. We had an actor-based approach to concurrency that meant it was super cheap performance-wise to spin up a new agent. _That_ in turn meant a lot of problems could suddenly become embarrassingly parallel, and that rather than pre-computing/caching a bunch of stuff into a RAG system you could process whatever you needed in a just-in-time approach. List all the documents you've got, spawn a few thousand agents and give each a single document to process, aggregate/filter the relevant answers when they come back.
Obviously that's not the optimal approach for every use case, but there's a lot where IMO it was better. In particular I was hoping to spend more time exploring it in an enterprise context where you've got complicated sharing and permission models to take into consideration. If you have agents simply passing through the permissions of the user executing the search, whatever you get back is automatically constrained to only the things they had access to in that moment. As opposed to other approaches where you're storing a representation of data in one place, then trying to work out the intersection of permissions from one or more other systems, and sanitising the results on the way out. That always seemed messy and fraught with problems and the risk of leaking something you shouldn't.
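A minimal sketch of that fan-out pattern: list only the documents the requesting user can already read, give each to its own worker, then aggregate. ask_agent() is a hypothetical stand-in for a per-document agent/LLM call; nothing here is the commenter's actual platform.

    from concurrent.futures import ThreadPoolExecutor

    def ask_agent(question: str, doc_text: str) -> str | None:
        # Placeholder: a real system would call an LLM here and return either
        # a relevant finding or None.
        return doc_text if question.split()[0].lower() in doc_text.lower() else None

    def answer(question: str, docs_user_can_read: dict[str, str]) -> list[str]:
        # Permissions are enforced upstream: we only ever see this user's docs,
        # so nothing retrieved can leak across users.
        with ThreadPoolExecutor(max_workers=32) as pool:
            findings = pool.map(lambda d: ask_agent(question, d),
                                docs_user_can_read.values())
        return [f for f in findings if f]

    docs = {"memo1": "Supply chain risks rose in Q3.", "memo2": "Holiday party notes."}
    print(answer("supply chain exposure?", docs))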
closeparen 1 days ago [-]
Cursor’s use of grep is bad. It finds definitions way slower and less accurately than I do using IDE indexing, which is frustratingly “right there.” Crazy that there’s not even LSP support in there.
Claude Code is better, but still frustrating.
madeofpalk 2 days ago [-]
What exactly is RAG? Is it a specific technology, or a technique?
I'm not a super smart AI person, but grepping through a codebase sounds exactly like what RAG is. Isn't tool use just (more sophisticated) RAG?
hannasanarion 1 days ago [-]
Yes, you are right. The OP has a weirdly narrow definition of what RAG is.
Only the most basic "hello world" type RAG systems rely exclusively on vector search. Everybody has been doing hybrid search or multiple simultaneous searches exposed through tools for quite some time now.
lossolo 2 days ago [-]
RAG is a technique, so instead of string matching (like grep), it uses embeddings + vector search to retrieve semantically similar text (car ≈ automobile), then feeds that into the LLM. Tool use is broader, RAG is one pattern within that, but not the same as grep.
taneq 2 days ago [-]
Is letting an agent use grep not a form of RAG? I know usually RAG is done with vector databases but grep is definitely a form of retrieval, and it’s augmenting the generation.
torginus 1 days ago [-]
Yeah, 'RAG' is quite literal tool use, where the tool is a vector search engine more or less.
What was described as 'RAG' a year ago now is a 'knowledge search in vector db MCP', with the actual tool and mechanism of knowledge retrieval being the exact same.
jgalt212 2 days ago [-]
> Grep works great when you have thousands of files on a local filesystem that you can scan in milliseconds. But most enterprise RAG use cases involve millions of documents across distributed systems
Great point, but this grep-in-a-loop probably falls apart (i.e. becomes non-performant) at thousands of docs and tens of simultaneous users, not millions.
nbstme 2 days ago [-]
Why does grep in a loop fall apart? It’s expensive, sure, but LLM costs are trending toward zero. With Sonnet 4.5, we’ve seen models get better at parallelization and memory management (compacting conversations and highlighting findings).
adrianbooth17 2 days ago [-]
"LLM costs are trending toward zero". They will never be zero for the cutting edge. One could argue that costs are zero now via local models but enterprises will always want the cutting edge which is likely to come with a cost
jgalt212 1 days ago [-]
If LLM costs are trending towards zero, please explain the $600B OpenAI deal with Oracle and the $100B deal with Nvidia.
And if you think those deals are bogus, like I do, you still need to explain surging electricity prices.
flyinglizard 1 days ago [-]
They're not trending toward zero; they're just aggressively subsidized with oil money.
voidhorse 2 days ago [-]
Not to mention, unless you want to ship entire containers, you are beholden to the unknown quirks of tools on whatever system your agent happens to execute on. It's like taking something already nondeterministic and extremely risky and ceding even more control—let's all embrace chaos.
Generative AI is here to stay, but I have a feeling we will look back on this period of time in software engineering as a sort of dark age of the discipline. We've seemingly decided to abandon almost every hard won insight and practice about building robust and secure computational systems overnight. It's pathetic that this industry so easily sold itself to the illogical sway of marketers and capital.
rightbyte 2 days ago [-]
> It's pathetic that this industry so easily sold itself to the illogical sway of marketers and capital.
What are you implying. Capital always owned the industry except some really small coops and FOSS communities.
queenkjuul 2 days ago [-]
Mostly, i agree, except that the industry (from where I'm standing) has never done much else but sell itself to marketers and capital.
cmenge 2 days ago [-]
We're processing tenders for the construction industry - this comes with a 'free' bucket sort from the start, namely that people practically always operate only on a single tender.
Still, that single tender can be on the order of a billion tokens. Even if the LLM supported that insane context window, it's roughly 4GB that need to be moved and with current LLM prices, inference would be thousands of dollars. I detailed this a bit more at https://www.tenderstrike.com/en/blog/billion-token-tender-ra...
And that's just one (though granted, a very large) tender.
For the corpus of a larger company, you'd probably be looking at trillions of tokens.
While I agree that delivering tiny, chopped up parts of context to the LLM might not be a good strategy anymore, sending thousands of ultimately irrelevant pages isn't either, and embeddings definitely give you a much superior search experience compared to (only) classic BM25 text search.
elliotto 2 days ago [-]
I work at an AI startup, and we've explored a solution where we preprocess documents to make a short summary of each document, then provide these summaries with a tool call instruction to the bot so it can decide which document is relevant. This seems to scale to a few hundred documents of 100k-1m tokens, but then we run into issues with context window size and rot. I've thought about extending this as a tree based structure, kind of like an LLM file system, but have other priorities at the moment.
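A sketch of the summary-index approach described here. summarize() and choose_docs() are placeholders for the offline summarization pass and the tool-calling decision the comment describes; none of this is the startup's actual code.

    def summarize(doc_text: str) -> str:
        # Placeholder for an offline LLM summarization pass.
        return doc_text[:200]

    def build_index(corpus: dict[str, str]) -> dict[str, str]:
        return {doc_id: summarize(text) for doc_id, text in corpus.items()}

    def choose_docs(question: str, index: dict[str, str]) -> list[str]:
        # Placeholder for the tool-calling step where the model picks doc ids
        # whose summaries look relevant to the question.
        return [doc_id for doc_id, summary in index.items()
                if any(w in summary.lower() for w in question.lower().split())]

    corpus = {"manual_a": "Pump maintenance schedule and torque specs...",
              "manual_b": "Network switch configuration guide..."}
    index = build_index(corpus)
    print(choose_docs("pump torque", index))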
Embeddings had some context size limitations in our case - we were looking at large technical manuals. Gemini was the first to have a 1m context window, but for some reason its embedding window is tiny. I suspect the embeddings might start to break down when there's too much information.
codyb 1 days ago [-]
For anyone unfamiliar, construction tenders are part of the project bidding process and appear to be a structured and formal manner in which contractors submit bids for large projects.
themanmaran 2 days ago [-]
I'm always amazed at Claude Code's ability to build context by just putting grep in a for loop.
It's pretty much the same process I would use in an unfamiliar code base. Just ctrl+f the file system till I find the right starting point.
eru 2 days ago [-]
That's what I used to use as a human, but then I finally overcame my laziness in setting up integration between my editor and compiler (and similar) and got 'jump to definition' working.
(Well, I didn't overcome my laziness directly. I just switched from being lazy and not setting up vim and Emacs with the integrations, to trying out vscode where this was trivial or already built in.)
lukaslalinsky 2 days ago [-]
Do you trust 'jump to definition'? Obviously it depends on the language server, but it's best effort. I'm often frustrated when it doesn't work because I broke the code in some way. Or it jumps to a specific definition, but there are multiple. If I were as quick at opening and reading files as Claude Code, I'd prefer grep with context around the searched term.
1718627440 6 hours ago [-]
You can instruct the compiler to tell you, for example by passing -save-temps, and then it's a trivial text search in a single file.
eru 2 days ago [-]
> Do you trust 'jump to definition'.
It depends, for some languages 'jump to definition' tools ask the same compiler/interpreter that you use to build your code, so it's as accurate as it gets, and it's not 'best effort'.
It also depends a bit on your project, some project are more prone to re-using names or symbols.
> If I was as quick at opening and reading files as claude code, I'd prefer grep with context around the searched term.
Well, Claude probably also doesn't want to have to 'learn' how to use all kinds of different tools for different languages and eco-systems.
codyb 1 days ago [-]
In ViM with the CoC code completion plugin JTD gives me multiple options when there are many and I can choose the sensible one in a popover before the actual jump occurs.
I believe that was my experience with IDEs too?
guipsp 1 days ago [-]
In java, for example, jump to definition is pretty flawless.
robmccoll 1 days ago [-]
Unless I'm in an interface and inheritance heavy codebase. Then the first place it jumps to is rarely the one I wanted.
codyb 1 days ago [-]
Jump to definition works when you have the starting point already.
I use both grep and JTD fairly frequently for different use cases.
eru 23 hours ago [-]
Yes.
I meant 'Jump to Definition' as one clear example, not as a definitive enumeration of everything that compiler integration can help you with.
Eg compiler integration is also really useful to show you the inferred types. Even dinosaurs like old-school Java and C have (limited) type inference: inside of expressions. But of course in a language like Haskell or Rust (or even Python) this becomes much more important.
nbstme 2 days ago [-]
It's mind blowing. It's so simple, elegant and... effective! Grep+glob and a lot of iterations is all we need.
Analemma_ 2 days ago [-]
We always suspected find+grep+xargs was Turing-complete, and now Claude is proving it.
nbstme 2 days ago [-]
Exactly. AGI implies minimal tooling and very primitive tools.
SV_BubbleTime 2 days ago [-]
AGI implies that a system is financially viable to let run 24 hours a day with little to no direction.
No amount of find+grep+LLM is even remotely there yet.
delusional 2 days ago [-]
That's one of the most nonsensical comments on all of Hacker News. A Markov chain could have written it.
What do you mean Turing complete? Obviously all 3 programs are running on a Turing complete machine. Xargs is a runner for other commands, obviously those commands can be Turing complete.
I haven't heard of anybody working on a _proof_ for the Turing completeness of xargs, and I think the only conference willing to publish it would be Sigbovik.
hyperbovine 2 days ago [-]
Can’t tell which of these two comments is the joke …
EdwardDiego 2 days ago [-]
It was a joke.
kingjimmy 2 days ago [-]
Has it not dawned on the author how ironic it is to call embeddings and retrieval pipelines "a nightmare of edge cases" when talking about LLMs?
nbstme 2 days ago [-]
Haha! LLMs themselves are pure edge cases because they are non-deterministic. But if you add a 7-step pipeline on top of that, it's edge cases on top of edge cases.
leopoldj 1 days ago [-]
The author is conflating RAG with vector search. I think.
One can use any and all available search mechanisms, SQL, graph db, regex, keyword and so on, for the retrieval part.
rohansood15 1 days ago [-]
I don't get why folks are so dismissive here.
If you ever saw Claude Code/Codex use grep, you will find that it constructs complex queries that encompass a whole range of keywords which may not even be present in the original user query. So the 'semantic meaning' isn't actually lost.
And nobody is putting an entire enterprise's knowledge base inside the context window. How many enterprise tasks are there that need to reference more than a dozen docs? And even those that do can be broken down into sub-tasks of manageable size.
Lastly, nobody here mentions how much of a pain it is to build, maintain and secure an enterprise vector database. People spend months cleaning the data, chunking and vectorizing it, only for newer versions of the same data to make it redundant overnight. And good luck recreating your entire permissioning and access control stack on top of the vector database you just created.
The RAG obituary is a bit provocative, and maybe that's intentional. But it's surprising how negative/dismissive the reactions in this thread are.
innagadadavida 1 days ago [-]
The article does not make a proper distinction of scale, probably because of the small-scale problem they solved. Small scale, <10K documents/files, can easily be processed with grep, find, etc. For something at larger scale, >1M documents, you will need to use search-engine technology. You can definitely take the same agent approach for the large-scale problem - we essentially need to search, look at the results, and issue follow-up queries to get documents of interest.
All that said, for the types of problem the OP is solving, it might just be better to create a project in Claude/ChatGPT and throw in the files there and get done with it. That approach has been working for over 2 years now and is nothing new.
CuriouslyC 2 days ago [-]
RAG isn't dead, RAG is just fiddly, you need to tune retrieval to the task. Also, grep is a form of RAG, it just doesn't use embeddings.
nbstme 2 days ago [-]
Yes, my point is that the entire RAG pipeline (ingest, chunk, embed, search with Elastic, rerank) is in decline. Grep is far simpler. It's trivial.
tw1984 1 days ago [-]
No, grep is not RAG. RAG is all about embeddings + vector search + LLM working under a fixed workflow.
Saying grep is also RAG is like saying ext4 + grep is a database.
CuriouslyC 1 days ago [-]
So you're saying grep isn't a form of information retrieval?
tw1984 1 days ago [-]
information retrieval is a much larger superset of RAG.
grep + agentic LLM is not RAG.
leobg 6 hours ago [-]
Retrieval Augmented Generation. How you do the retrieving is irrelevant. You can do it manually and it’ll still be RAG. Also, most RAG pipelines combine multiple approaches - BM25, embeddings, etc..
keeganpoppen 1 days ago [-]
RAG isn't dead, it just isn't being used correctly. Agents get the correct answer, yes, just not fast. RAG is a performance (and cost) optimization, same as it always has been. As agents improve, the nature of what and how to RAG will change, but that's no reason for it to stop existing; quite the opposite.
malshe 1 days ago [-]
> Table Integrity: Financial tables are never split—income statements, balance sheets, and cash flow statements remain atomic units with headers and data together
In 10-Ks and 10-Qs there are often no table headers. This is particularly true for the consolidated notes to financial statements section. Standalone tables can be pretty much meaningless because you won't even know what they are reporting. For example, a table that simply mentions terms like beginning balance and ending balance could be reporting inventory, warranty, or short-term debt. But the table does not mention these metrics at all and there are no headers. So I am curious to know how Fintool uses standalone tables. Do you retain the text surrounding the tables in the same chunk as the table?
stoneyhrm1 2 days ago [-]
I'm free to be corrected because I'm no expert in the field, but isn't RAG just enriching context? It doesn't have to be semantic search; it could be an API call or grabbing info from a database.
zwaps 2 days ago [-]
I am so tired of these undifferentiated takes.
These types of articles regularly come from people who don't actually build at-scale systems with LLMs. Or people who want to sell you on a new tech.
And the frustrating thing is: They ain't even wrong.
Top-K RAG via vector search is not a sufficient solution. It never really was for most interesting use-cases.
Of course, take the easiest and most structured - in a sense, perfectly indexed - data (code repos) and claim that "RAG is dead". Again.
Now try this with billions of unstructured tokens where the LLM really needs to do something with the entire context (like, confirm that something is NOT in the documents), where even the best LLM loses context coherence after like 64k tokens for complex tasks.
Good luck!
The truth is: Whether its Agentic RAG, Graph RAG, or a combination of these with ye olde top-k RAG - it's still RAG.
You are going to Retrieve, and then you are going to use a system of LLM agents to generate stuff with it. You may now be able to do the first step smarter. It's still RAG tho.
The latest Anthropic whoopsy showed that they also haven't solved the context-rot issue. Yes, you can get a 1M-context scaled version of Claude, but then the small-scale/detail performance is so garbage that misrouted customers lose their effin' minds.
"My LLM is just gonna ripgrep through millions of technical doc pdfs identified only via undecipherable number-based filenames and inconsistent folder structures"
lol, and also, lmao
gengstrand 1 days ago [-]
I agree. Permit me to rephrase. From this learning adventure https://www.infoq.com/articles/architecting-rag-pipeline/ I came to understand what many now call context rot. If you want quality answers, you still need relevance reranking and filtering no matter how big your context window becomes. Whether that happens in a search that is upfront in a one shot prompt or iteratively in a long session through an agentic system is merely an implementation detail.
selcuka 2 days ago [-]
I don't find this surprising. We are constantly finding workarounds for technical limitations, then ditch them when the limitation no longer exists. We will probably be saying the same thing for LLMs in a few years (when a new machine learning related TLA becomes the hype).
nbstme 2 days ago [-]
100%. The speed of change is wild. With each new model, we end up deleting thousands of lines of code (old scaffolding we built to patch the models’ failures.)
maerch 1 days ago [-]
> The agent follows references like a human analyst would. No chunks. No embeddings. No reranking. Just intelligent navigation.
I think this sums it up well. Working with LLMs is already confusing and unpredictable. Adding a convoluted RAG pipeline (unless it is truly necessary because of context size limitations) only makes things worse compared to simply emulating what we would normally do.
msukhareva 1 days ago [-]
RAG was always somewhat of a Frankenstein, combining two things that should not be combined: information retrieval based on string matching enhanced with embeddings, and an LLM that needs not string matching but informative text. If the string matching is good but the information is poor or provides the wrong context, it only reinforces hallucinations. Search, tool calling, and connections should be part of the system and trained together with the LLM.
Imanari 2 days ago [-]
I get the reasoning behind “letting the agent use grep in a loop”; after all, it is very similar to how humans would explore a document base with ctrl-f. But wouldn’t humans also use vector search all the time if it were as available as ctrl-f? So maybe not ditch vector search but provide it as a tool to the agent. Increased complexity aside, letting the agent explore a huge document base with “vector search in a loop” should be more powerful than with grep in a loop. Overall I liked the article.
regularfry 2 days ago [-]
My mental model (in the "all models are wrong, some are useful" sense) is that vector search is the thing that gives you the terms to grep for.
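A small sketch of that mental model: use vector search to find a few semantically close chunks, mine concrete terms from them, then grep for those terms. vector_search() is a hypothetical stand-in; the term mining is deliberately crude and assumes rg is on PATH.

    import collections
    import re
    import subprocess

    def terms_to_grep(query: str, vector_search, top_k: int = 5, n_terms: int = 5):
        chunks = vector_search(query, k=top_k)           # semantically similar text
        words = re.findall(r"[a-zA-Z]{5,}", " ".join(chunks).lower())
        common = collections.Counter(words).most_common(n_terms)
        return [w for w, _ in common]

    def grep_for(terms, root="."):
        hits = set()
        for t in terms:
            out = subprocess.run(["rg", "-l", "-i", t, root],
                                 capture_output=True, text=True)
            hits.update(out.stdout.splitlines())
        return sorted(hits)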
Imanari 1 days ago [-]
I like it. With this approach it feels like you also don’t need to fiddle as much with the details of your vector search and DB as that portion just gets you going and the actual retrieval happens with grep in a loop.
te_chris 2 days ago [-]
Or the thing that ranks the term based result. That’s the fun these days: it’s all whatever fits your problem.
alansaber 1 days ago [-]
Conceptually yes but practically speaking vector search results are generally not sufficiently good
sakoht 2 days ago [-]
People say “agents not RAG”, but one framing is that this describes RAG where the database is a file system and bash is the query language (with other cli tools installed it can use, including curl, jq, grep). With writing its own notes on the filesystem structure and maintaining them as a way to “index the database”
It is still using code to selectively grab the chunks of data it needs rather than putting everything in context. It's just better RAG?
masterkram 2 days ago [-]
After building a few RAG based apps I was curious to try the ClaudeCode based approach that is mentioned by the author. So I built a python service that exposes ripgrep to a rest api: https://github.com/masterkram/jaguar
This makes it possible to quickly deploy this on coolify and quickly build an agent that can use ripgrep on any of your uploaded files.
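For readers who want a feel for what such a service looks like, here is a sketch of a ripgrep-over-REST endpoint in the same spirit (this is not the linked project's code; it assumes FastAPI, uvicorn, and rg on PATH, and the /data/uploads path is made up):

    import subprocess
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class SearchRequest(BaseModel):
        pattern: str
        root: str = "/data/uploads"
        max_results: int = 50

    @app.post("/search")
    def search(req: SearchRequest):
        out = subprocess.run(
            ["rg", "--line-number", "--max-count", "3", req.pattern, req.root],
            capture_output=True, text=True,
        )
        lines = out.stdout.splitlines()[: req.max_results]
        return {"matches": lines}

    # Run with: uvicorn service:app --port 8000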
BrokenLButton 1 days ago [-]
I worked at a startup that relied heavily on RAG, and this article definitely articulates most of the same issues we ran into. I do think RAG still has its place, but it is definitely becoming a harder case to justify.
bze12 1 days ago [-]
This post was definitely written by an llm.
sublimefire 2 days ago [-]
Saying that RAG alone is complex and should be superseded by agentic search is a bit weak. Agentic search makes more sense when your pipeline becomes more complicated: RAG+MCP+Client calls, it is then when you can see that LLM starts behaving erratically and cannot answer the question well. You then want better control over streams of content and intents which could be solved by smaller agents looping over the data.
intalentive 2 days ago [-]
Agentic search with a handful of basic tools (drawn from BM25, semantic search, tags, SQL, knowledge graph, and a handful of custom retrieval functions) blows the lid off RAG in my experience.
The downside is it takes longer. A single “investigation” can easily use 20-30 different function calls. RAG is like a static one-shot version of this and while the results are inferior the process is also a lot faster.
nsomaru 2 days ago [-]
Hey, I’m interested in what you call “agentic search”. Did you roll your own or are you using a set of integrated tools?
I’ve used LightRAG and am looking to integrate it with OpenWebUI, and possibly Airweave, which was a Show HN earlier.
My data is highly structured and has references between documents, so I wanted to leverage that structure for better retrieval and reasoning.
intalentive 1 days ago [-]
Rolled my own in Python.
For graph/tree document representations, it’s common in RAG to use summaries and aggregation. For example, the search yields a match on a chunk, but you want to include context from adjacent chunks — either laterally, in the same document section, or vertically, going up a level to include the title and summary of the parent node.
How you integrate and aggregate the surrounding context is up to you. Different RAG systems handle it differently, each with its own trade offs. The point is that the system is static and hardcoded.
The agentic approach is: instead of trying to synthesize and rank/re-rank your search results into a single deliverable, why not leave that to the LLM, which can dynamically traverse your data. For a document tree, I would try exposing the tree structure to the LLM. Return the result with pointers to relevant neighbor nodes, each with a short description. Then the LLM can decide, based on what it finds, to run a new search or explore local nodes.
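A sketch of that idea: return the matched node plus pointers to its neighbors (parent, siblings, children) with short descriptions, and let the model decide which to expand next. The node layout and ids below are illustrative, not the commenter's system.

    NODES = {
        "sec-3.2": {"title": "Warranty reserves", "body": "Beginning balance...",
                    "parent": "sec-3", "children": ["sec-3.2.1"]},
        "sec-3":   {"title": "Notes to financial statements", "body": "...",
                    "parent": None, "children": ["sec-3.1", "sec-3.2"]},
        "sec-3.2.1": {"title": "Reserve rollforward", "body": "...",
                      "parent": "sec-3.2", "children": []},
        "sec-3.1": {"title": "Inventory", "body": "...",
                    "parent": "sec-3", "children": []},
    }

    def expand(node_id: str) -> dict:
        node = NODES[node_id]
        neighbors = [node["parent"], *node["children"]]
        if node["parent"]:
            neighbors += NODES[node["parent"]]["children"]   # siblings
        pointers = {n: NODES[n]["title"] for n in neighbors if n and n != node_id}
        return {"id": node_id, "body": node["body"], "neighbors": pointers}

    print(expand("sec-3.2"))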
mscbuck 1 days ago [-]
I've found this hybrid approach pretty good for the majority of use cases. BM25 (maybe SPLADE if you want a blend of BoW/keyword) + vectors + RRF + re-rank works pretty damn well.
The trick that has elevated RAG, at least for my use cases, has been having different representations of your documents, as well as sending multiple permutations of the input query. Do as much as you can in the VectorDB for speed. I'll sometimes have 10-11 different "batched" calls to our vectorDB that are lightning quick. Then also being smart about what payloads I'm actually pulling so that if I do use the LLM to re-rank in the end, I'm not blowing up the context.
TLDR: Yes, you actually do have to put in significant work to build an efficient RAG pipeline, but that's fine and probably should be expected. And I don't think we are in a world yet where we can just "assume" that large context windows will be viable for really precise work, or that costs will drop to 0 anytime soon for those context windows.
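A sketch of the RRF step mentioned above: fuse several ranked lists (one per query permutation or document representation) with reciprocal rank fusion, where each list contributes 1 / (k + rank). The doc ids are made up; k=60 is the commonly used constant.

    from collections import defaultdict

    def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
        scores = defaultdict(float)
        for ranking in ranked_lists:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    bm25_hits   = ["doc7", "doc2", "doc9"]
    vector_hits = ["doc2", "doc7", "doc4"]
    splade_hits = ["doc2", "doc9", "doc7"]
    print(rrf([bm25_hits, vector_hits, splade_hits]))  # doc2 and doc7 float to the top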
redwood 2 days ago [-]
Weird to see that the use case referenced is specifically code search, when that's a very targeted one rather than what general-purpose agent (or RAG) use cases might target.
nbstme 2 days ago [-]
The main use case I referenced is SEC filings search, which is quite different from code. Filings are much longer, less structured, and more complex, with tables and footnotes.
hluska 2 days ago [-]
I’m sure that was your intent but why did you get bogged down talking about code?
nbstme 2 days ago [-]
Hmm, because Claude Code pioneered the 'grep/glob/read' paradigm, I felt the need to explain that what works well for coding files can also be applied to more complex documents.
hluska 2 days ago [-]
Did you consider using words to explain that? I don’t think you pay yourself by the word.
kohlerm 1 days ago [-]
Cursor's search is still better (faster and cheaper) than Claude Code's. I just did some tests. It looks like they do agentic searches with query rewriting.
athrowaway3z 1 days ago [-]
I had seen RAG mentioned a lot before I had gotten into LLM agents. I assumed it was tricky and required real deep model training knowledge.
My first real AI use (beyond copy-paste ChatGPT) was Claude Code. I figured out in a few days to just write scripts and CLAUDE.md how to use them. For instance, one that prints comments and function names in a file is a few lines of python. MCP seemed like context bloat when a `tools/my-script -h` would put it in context on request.
Eventually stumbled on some more RAG a few weeks later, so decided to read up on it and... what? That's it? A 'prelude function' to dump 'probably related' things into the context?
It seems so obviously the wrong way to go from my perspective, so am I missing something here?
alastairr 2 days ago [-]
Isn't 'agentic search' just another form of RAG? information still gets retrieved and added to the prompt, even if the 'prompt' is levels down in the product and not visible to the user.
jimbohn 2 days ago [-]
Feels like saying Elasticsearch (and similar) tools are dead because we can just grep our way through things. I'd love to see more data on this.
beastman82 1 days ago [-]
> No need for similarity when you can use exact matches
This is a weakness, not a strength of agentic search
kixiQu 2 days ago [-]
This is a great example of a piece with enough meaningful and useful content in it that it's very clear the author had something of value to deliver, and I'm grateful for that... but enough repetitive LLM-output that I'm very annoyed by the end.
Actually, let me be specific: everything from "The Rise of Retrieval-Augmented Generation" up to "The Fundamental Limitations of RAG for Complex Documents" is good and fine as given, then from "The Emergence of Agentic Search - A New Paradigm" to "The Claude Code Insight: Why Context Changes Everything" (okay, so the tone of these generated headings is cringey but not entirely beyond the pale) is also workable. Everything else should have been cut. The last four paragraphs are embarrassing and I really want to caution non-native English speakers: you may not intuitively pick up on the associations that your reader has built with this loudly LLM prose style, but they're closer to quotidian versions of the [NYT] delusion reporting than you likely mean to associate with your ideas.
This is just wrong. As many here have said, grep is RAG; just the most primitive kind. It means you miss typos, synonyms, semantic matches (e.g., "the payment service"), and AST matches. I have to deal with this when I use grep-based agents by handholding them and overpaying. grep is just something that enabled CLI-based tools to get to market faster. grep's dominance will fade as the landscape matures. The current pattern seems to be to outsource RAG to an MCP.
djoldman 2 days ago [-]
... for this specific use case (financial documents).
These corpora have a high degree of semantic ambiguity among other tricky and difficult to alleviate issues.
Other types of text are far more amenable to RAG and some are large enough that RAG will probably be the best approach for a good while.
For example: maintenance manuals and regulation compendiums.
nbstme 2 days ago [-]
Why? What if LLMs could parallelize much of their reading and then summarize the findings into a markdown file, eliminating the need for complicated search?
jgalt212 2 days ago [-]
I'm not feeling it. Constantly pinging these yuge LLMs is not economic and not good for sensitive docs.
nbstme 2 days ago [-]
But don’t you think LLM pricing is heading toward zero? It seems to halve every six months. And on privacy, you can hope model providers won’t train on your data, (but there’s no guarantee)
queenkjuul 2 days ago [-]
I don't see how it can trend to zero when none of the vendors are profitable. Uber and doordash et. al. increased in price over time. The era of "free" LLM usage can't be permanent
dangoodmanUT 2 days ago [-]
Google’s inference is profitable
jgalt212 1 days ago [-]
Not on the SERP page. The zero click Internet is bad for content producers and for those who sell ads (Google).
imiric 2 days ago [-]
Oh, it's going to be "free" alright, in the same way that most web services are today. I.e., you will pay for it with your data and attention.
The only difference is that the advertising will be much more insidious and manipulative, the data collection far easier since people are already willingly giving it up, and the business much more profitable.
I can hardly wait.
aussieguy1234 2 days ago [-]
grep was invented at a time when computers had very small amounts of memory, so small that you might not even be able to load a full text file. So you had tools that would edit one line at a time, or search through a text file one line at a time.
LLMs have a similar issue with their context windows. Go back to GPT-2 and you wouldn't have been able to load a text file into its memory. Slowly the memory is increasing, same as it did for the early computers.
nbstme 2 days ago [-]
Agree. It's a context/memory issue. Soon LLMs will have a 10M context window and they won't need to search. Most codebases are less than 10M tokens.
pcthrowaway 2 days ago [-]
When dependencies are factored in, I don't know if this is true.
anshumankmr 1 days ago [-]
Honestly, I am not sure when RAG had its heyday to be deserving an obituary. I still think that it is in its early days.
findjashua 1 days ago [-]
RAG != EBR
devmor 2 days ago [-]
This reads like someone AI-generated prose to defend something they want to invest in and decry something it competes with. It does not come off as honest, written by a human, or useful to anyone outside of the specific, narrow contexts the "author" sees for the technologies mentioned.
Frankly, reading through this makes me feel as though I am a business analyst or engineering manager being presented with a project proposal from someone very worried that a competing proposal will take away their chance to shine.
As it reaches the end, I feel like I'm reading the same thing, but presented to a Buzzfeed reader.
cantor_S_drug 2 days ago [-]
How come this isn't the top comment? This post screams AI.
cyberax 2 days ago [-]
I wonder if something like LSP or IntelliJ's reverse index would work better for AI than RAG.
sergiotapia 2 days ago [-]
>The winners will not be the ones who maintain the biggest vector databases, but the ones who design the smartest agents to traverse abundant context and connect meaning across documents.
So if one were building say a memory system for an AI chat bot, how would you save all the data related to a user? Mother's name, favorite meals, allergies? If not a Vector database like pinecone, then what? Just a big .txt file per user?
Exactly. Just a markdown file per user. Anthropic recommends that.
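A minimal sketch of that markdown-per-user memory: append dated facts, read the whole file back into context. Paths and filenames are illustrative; this is not an Anthropic-specified format.

    from datetime import date
    from pathlib import Path

    MEMORY_DIR = Path("memory")

    def remember(user_id: str, fact: str) -> None:
        MEMORY_DIR.mkdir(exist_ok=True)
        path = MEMORY_DIR / f"{user_id}.md"
        with path.open("a") as f:
            f.write(f"- {date.today().isoformat()}: {fact}\n")

    def recall(user_id: str) -> str:
        path = MEMORY_DIR / f"{user_id}.md"
        return path.read_text() if path.exists() else ""

    remember("alice", "Mother's name is Ruth; allergic to peanuts.")
    print(recall("alice"))   # gets prepended to the chat context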
queenkjuul 2 days ago [-]
Any kind of database is far too efficient for an LLM, just take all your markdown and turn it into less markdown.
diamondfist25 1 days ago [-]
What's everyone's RAG pipeline?
I was using Qdrant, but I'm considering moving to OpenSearch since I want something more complete with a dashboard that I can muck around with.
OutOfHere 2 days ago [-]
That's quite the over-generalization. RAG fundamentally is: topic -> search -> context -> output. Agents can enhance it by iterating in a loop, but what's inside the loop is not going away.
dkga 2 days ago [-]
RAG is the new US dollar, now every year someone will predict its looming death…
nbstme 2 days ago [-]
HAHAHA. Ok let's call it "transformation." As i wrote "The next decade of AI search will belong to systems that read and reason end-to-end. Retrieval isn’t dead—it’s just been demoted."
catlover76 2 days ago [-]
The Agents are just RAG
thenewwazoo 2 days ago [-]
[flagged]
tomhow 2 days ago [-]
We've been asking people not to comment like this on HN. We can never know exactly how much an individual's writing is LLM-generated, and the negative consequences of a false accusation outweigh the positive consequences of a valid one.
We don't want LLM-generated content on HN, but we also don't want a substantial portion of any thread being devoted to meta-discussion about whether a post is LLM-generated, and the merits of discussing whether a post is LLM-generated, etc. This all belongs in the generic tangent category that we're explicitly trying to avoid here.
If you suspect it, please use the established approaches for reacting to inappropriate content: if it's bad content for HN, flag it; if it's a bad comment, downvote it; and if there's evidence that it's LLM-generated, email us to point it out. We'll investigate it the same way we do when there are accusations of shilling etc., and we'll take the appropriate action. This way we can cut down on repetitive, generic tangents and unfair accusations.
dymk 2 days ago [-]
I don’t mind articles that have a hint of “an AI helped write this” as long as the content is actually informationally dense and well explained. But this article is an obvious ad, has almost no interesting information or summaries or insights, and has the… weirdly chipper? tone that AI loves to glaze readers with.
tptacek 2 days ago [-]
How is this an ad? It's a couple thousand words about how they built something complicated that was then obsoleted.
serf 2 days ago [-]
In the same vein that a "Behind the Scenes Look at the Making of Jurassic Park" is, in fact, an ad.
Having a company name pitched at you within the first two sentences is a pretty good giveaway.
tptacek 2 days ago [-]
3/4 of what hits the front page is an "ad" by that standard. I don't see how you can get less promotional than a long-form piece about why your tech is obsolete. Seems just mean-spirited.
dymk 2 days ago [-]
It’s because the article’s main goal is to sell me the company’s product, not inform me about RAG. It’s a zero calorie article.
SV_BubbleTime 2 days ago [-]
> 3/4 of what hits the front page is an "ad" by that standard.
Is anyone disagreeing with that?
nbstme 2 days ago [-]
haha so true!
nbstme 2 days ago [-]
Why call it an ad? It’s not even on the company site. I only mentioned my company upfront so people get context (why we had to build a complex RAG pipeline, what kinds of documents we’re working with, and why the examples come from real production use cases).
dymk 2 days ago [-]
It stands out because the flow and tone was clearly AI generated. It’s fluff, and I don’t trust it was written by a human who wasn’t hallucinating the non-company related talking points.
tptacek 2 days ago [-]
There are typos in it, too. I don't think this kind of style critique is really on topic for HN.
> Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.
Those guidelines that you reference talk almost exclusively about annoyances on the webpage itself, not the content of the article.
I think it's fair to point out that many articles today are essentially a little bit of a human wrapper around a core of ChatGPT content.
Whether or not this was AI-generated, the tells of AI-written text are all throughout it. There are some people who have learned to write like the AI talks to them, which is really not much of an improvement over just using the AI as your word processor.
bigwheels 2 days ago [-]
Do you agree that bickering over AI-generated vs. not AI-generated makes for dull discussion? Sliding sewing needles deep into my fingernail bed sounds more appealing than nagging over such minutiae.
IgorPartola 2 days ago [-]
It’s also dull to brush my teeth, but I still do it because it is necessary.
The problem is that HN is one of the few places left where original thoughts are the main reason people are here. Letting LLMs write articles for us here is just not all that useful or fun.
Maybe quarantining AI related articles to their own thing a la Show HN would be a good move. I know it is the predominant topic here for the moment but like there is other interesting stuff too. And articles about AI written by AI so that Google’s AI can rank it higher and show it to more AI models to train on is just gross.
tom_ 2 days ago [-]
I'm not the person you're replying to, but for my part I do actually like to hear when people think it sounds like it's AI-generated.
serf 2 days ago [-]
minutiae to me is the effort of loading a page and reading half a paragraph in order to determine the AI tone for myself. The new AI literature frontier has actually added value to reading the comments first on HN in a surprising twist -- saves me the trouble.
davkan 2 days ago [-]
Almost as dull as being spoon-fed AI slop articles, yeah.
bigwheels 2 days ago [-]
There's an idea - create a website which can accurately assess "Slop-o-Meter" for any link, kind of like what FakeSpot of old did for Amazon products with fake reviews.
sebmellen 2 days ago [-]
I've tried doing this, but LLMs are shockingly bad at differentiating between their own slopware and true wetware thoughts.
EnPissant 2 days ago [-]
It's more akin to complaining about how Google search results have gotten worse.
threecheese 2 days ago [-]
It certainly makes a dull discussion, but frankly we need to have it. Post-AI HN is now a checkbox on a marketing plan - like a GitHub repository is - and I’m sick of being manipulated and sold to in one of the few forums that wasn’t gamed. It’s not minutiae, it’s an overarching theme that’s enshittifying the third places. Heck even having to discuss this is ruining it (yet here I am lol).
akerl_ 2 days ago [-]
I hate to ruin the magic for you, but HN has been part of marketing plans long before AI.
titanomachy 2 days ago [-]
"This wasn't written by a person" isn't a tangential style critique.
momojo 2 days ago [-]
I'm guessing the first draft was AI. I had to re-read that part a couple of times because the flow was off. That second paragraph was completely unnecessary too, since the previous paragraph already got the point across that the context window was small in 2022.
On the whole though, I still learned a lot.
nbstme 2 days ago [-]
Thanks! Sorry if the flow was off
sebmellen 2 days ago [-]
It truly is unfortunate. Thankfully most people seem to have an innate immune response to this kind of RLHF slop.
Retr0id 2 days ago [-]
Unfortunately this can't be true, otherwise it wouldn't be a product of RLHF.
sebmellen 2 days ago [-]
Go on an average college campus, and almost anyone can tell you when an essay was written with AI vs when it wasn't. Is this a skill issue? Are better prompters able to evade that innate immune response? Probably yes. But the revulsion is innate.
phainopepla2 2 days ago [-]
Crowds can have terrible taste, even if they're made up of people with good (or at least middling) taste
More importantly, it’s a lot easier to fine-tune a reranker on behavior data than an LLM that makes dozens of irrelevant queries.
Of course, the devil is in the details and there’s five dozen reasons why you might choose one approach over the other. But it is not clear that using a reranker is always slower.
Yeah I found this very confusing. Sad to see such a poor quality article being promoted to this extent.
It bugs me, because the acronym should encompass any form of retrieval - but in practice, people use RAG to refer specifically to embedding-vector lookups, which is why it makes sense to say it's "dying" now that other forms of retrieval are better.
I am definitely more aligned with needing what I would rather call 'Deep Semantic Search and Generation' - the ability to query text-chunk embeddings of... 100k PDFs, using the semantics to search for closeness of the 'ideas', feed those into the context of the LLM, and then have the LLM generate a response to the prompt citing the source PDF(s) the closest-matched vectors came from...
That is the killer app of a 'deep research' assistant IMO and you don't get that via just grepping words and feeding related files into the context window.
The downside is: how do you generate embeddings for massive amounts of mixed-media files and store them in a database anywhere near as quickly and cheaply as just grepping a few terms from said files? A CPU grep of text in files in RAM is something like five orders of magnitude faster than an embedding model on the GPU generating semantic embeddings of the chunked file and then storing them for later.
At that point, you are just doing Agentic RAG, or even just Query Review + RAG.
I mean, yeah, agentic RAG is the future. It's still RAG though.
RAG means any kind of data lookup which improves LLM generation results. I work in this area and speak to tons of companies doing RAG and almost all these days have realised that hybrid approaches are way better than pure vector searches.
Standard understanding of RAG now is simply adding any data to the context to improve the result.
A search is a search. The architecture doesn't care if it's doing a vector search or a text search or a keyword search or a regex search, it's all the same. Deploying a RAG app means trying different search methods, or using multiple methods simultaneously or sequentially, to get the best performance for your corpus and use case.
> The architecture doesn't care
The architecture does care because latency, recall shape, and failure modes differ.
I don't know of any serious RAG deployments that don't use vectors. I'm referring to large scale systems, not hobby projects or small sites.
However, RAG has been used as a stand-in for a specific design pattern where you retrieve data at the start of a conversation or request and then inject it into the request. This simple pattern has benefits compared to just sending a prompt by itself.
The point the author is trying to make is that this pattern kind of sucks compared to Agentic Search, where instead of shoving a bunch of extra context in at the start you give the model the ability to pull context in as needed. By switching from a "push" to a "pull" pattern, we allow the model to augment and clarify the queries it's making as it goes through a task which in turn gives the model better data to work with (and thus better results).
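Roughly, the push-versus-pull difference looks like this: a minimal sketch in Python, where `embed_search`, `search_tool`, and `call_llm` are hypothetical stand-ins rather than any particular vendor API.

    # Minimal sketch of "push" (classic RAG) vs. "pull" (agentic search).
    # embed_search, search_tool, and call_llm are hypothetical stand-ins.

    def classic_rag(question, call_llm, embed_search):
        # Push: retrieve once up front and inject everything into the prompt.
        chunks = embed_search(question, top_k=10)
        prompt = "Context:\n" + "\n---\n".join(chunks) + "\n\nQuestion: " + question
        return call_llm(prompt)

    def agentic_search(question, call_llm, search_tool, max_steps=5):
        # Pull: the model decides what to look up, sees results, and refines.
        notes = []
        for _ in range(max_steps):
            action = call_llm(
                "Question: " + question + "\nNotes so far: " + repr(notes) +
                "\nReply with SEARCH:<query> to look something up, or ANSWER:<text>."
            )
            if action.startswith("ANSWER:"):
                return action[len("ANSWER:"):].strip()
            notes.append(search_tool(action[len("SEARCH:"):].strip(), top_k=5))
        return call_llm("Question: " + question + "\nNotes: " + repr(notes) + "\nAnswer now.")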
Almost all tool calls would result in rag.
"RAG is dead" just means that rolling your own search and manually injecting results into context is dead (just use tools). It means the chunking techniques are dead.
If you want to know "how are tartans officially registered" you don't want to feed the entire 554kb wikipedia article on Tartan to your model, using 138,500 tokens, over 35% of gpt-5's context window, with significant monetary and latency cost. You want to feed it just the "Regulation>Registration" subsection and get an answer 1000x cheaper and faster.
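The back-of-envelope math holds up, under rough assumptions (about 4 bytes of English text per token, and a 400k-token window for the model in question):

    # Rough arithmetic only; 4 bytes/token and a 400k window are assumptions.
    article_bytes = 554 * 1024
    tokens = article_bytes / 4               # ~142k tokens, close to the 138,500 cited
    share = tokens / 400_000                 # ~0.35, i.e. roughly 35% of the window
    subsection_tokens = 150                  # a short "Regulation > Registration" subsection
    print(round(tokens), round(share, 2), round(tokens / subsection_tokens))  # ~142k, 0.35, ~945x cheaper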
If I make a semantic search over my organization's Policy As Code procedures or whatever and give it to Claude Code as an MCP, does Claude Code suddenly stop being agentic?
From there the model can handle 100–200 full docs and jot notes into a markdown file to stay within context. That’s a very different workflow than classic RAG.
But on top of this I would also use AI to create semantic maps, like a hierarchical structure of the content, put that table of contents in the context, and let the AI explore it. This helps with information spread across documents/chapters. It provides a directory for accessing anything without RAG, by simply following links in a tree. Deep Research agents build this kind of schema while they operate across sources.
To explore this I built a graph MCP memory system where the agent can search both by RAG and text matching, and when it finds the top-k nodes it can expand out along links. Writing a node implies having the relevant nodes loaded up first, and when generating the text, placing contextual links embedded [1] like this. So simply writing a node also connects it to the graph at all the right points. This structure fits better with the kind of iterative work LLMs do.
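As a rough illustration of that retrieval step (not the actual system; `similarity` stands in for whatever embedding or text-match scoring is used):

    # Toy sketch: rank nodes by similarity to the query, then expand outward along links.
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        id: str
        text: str
        links: list = field(default_factory=list)  # ids of related nodes

    def retrieve(graph, query, similarity, top_k=3, hops=1):
        ranked = sorted(graph.values(), key=lambda n: similarity(query, n.text), reverse=True)
        selected = {n.id: n for n in ranked[:top_k]}
        frontier = list(selected)
        for _ in range(hops):
            # pull in neighbours of everything selected so far
            frontier = [l for nid in frontier for l in graph[nid].links if l not in selected]
            selected.update({l: graph[l] for l in frontier})
        return list(selected.values())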
You could expand grep queries with synonyms, but now you're reimplementing query expansion, which is already part of modern RAG. And doing that intelligently means you're back to using embeddings anyway.
The workflow works great for codebases with consistent terminology. For enterprise knowledge bases with varied language and conceptual queries, grep alone can't get you to the right candidates.
> You could expand grep queries with synonyms, but now you're reimplementing query expansion, which is already part of modern RAG.
in this scenario "you" are not implementing anything - the agent will do this on its own
this is based on my experience using claude code in a codebase that definitely does not have consistent terminology
it doesn't always work but it seemed like you were thinking in terms of trying to get things right in a single grep when it's actually a series of greps that are informed by the results of previous ones
The chunk, embed, similarity search method was just a way to get a decent classical search pipeline up and running with not too much effort.
Obviously that's not the optimal approach for every use case, but there's a lot where IMO it was better. In particular I was hoping to spend more time exploring it in an enterprise context where you've got complicated sharing and permission models to take into consideration. If you have agents simply passing through the permissions of the user executing the search, whatever you get back is automatically constrained to only the things they had access to in that moment. As opposed to other approaches where you're storing a representation of the data in one place, then trying to work out the intersection of permissions from one or more other systems and sanitise the results on the way out. That always seemed messy and fraught with problems and the risk of leaking something you shouldn't.
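A sketch of what that pass-through can look like (hypothetical `backend.search` call; the point is only that the user's own credentials scope every query):

    # The agent's search tool forwards the calling user's token, so results come
    # back already filtered by what that user can see; nothing to re-sanitise later.
    def make_search_tool(backend, user_token):
        def search(query, top_k=10):
            # hypothetical backend call; authorization is applied at query time
            return backend.search(query, top_k=top_k, auth=user_token)
        return search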
Claude Code is better, but still frustrating.
I'm not a super smart AI person, but grepping through a codebase sounds exactly like what RAG is. Isn't tool use just (more sophisticated) RAG?
Only the most basic "hello world" type RAG systems rely exclusively on vector search. Everybody has been doing hybrid search or multiple simultaneous searches exposed through tools for quite some time now.
What was described as 'RAG' a year ago now is a 'knowledge search in vector db MCP', with the actual tool and mechanism of knowledge retrieval being the exact same.
Great point, but this grep-in-a-loop probably falls apart (i.e. becomes non-performant) at thousands of docs and tens of simultaneous users, never mind millions
And if you think those deals are bogus, like I do, you still need to explain surging electricity prices.
Generative AI is here to stay, but I have a feeling we will look back on this period of time in software engineering as a sort of dark age of the discipline. We've seemingly decided to abandon almost every hard won insight and practice about building robust and secure computational systems overnight. It's pathetic that this industry so easily sold itself to the illogical sway of marketers and capital.
What are you implying? Capital has always owned the industry, except for some really small co-ops and FOSS communities.
Still, that single tender can be on the order of a billion tokens. Even if the LLM supported that insane context window, it's roughly 4GB that need to be moved and with current LLM prices, inference would be thousands of dollars. I detailed this a bit more at https://www.tenderstrike.com/en/blog/billion-token-tender-ra...
And that's just one (though granted, a very large) tender.
For the corpus of a larger company, you'd probably be looking at trillions of tokens.
While I agree that delivering tiny, chopped up parts of context to the LLM might not be a good strategy anymore, sending thousands of ultimately irrelevant pages isn't either, and embeddings definitely give you a much superior search experience compared to (only) classic BM25 text search.
Embeddings had some context size limitations in our case - we were looking at large technical manuals. Gemini was the first to have a 1m context window, but for some reason its embedding window is tiny. I suspect the embeddings might start to break down when there's too much information.
It's pretty much the same process I would use in an unfamiliar code base. Just ctrl+f the file system till I find the right starting point.
(Well, I didn't overcome my laziness directly. I just switched from being lazy and not setting up vim and Emacs with the integrations, to trying out vscode where this was trivial or already built in.)
It depends: for some languages, 'jump to definition' tools ask the same compiler/interpreter that you use to build your code, so it's as accurate as it gets, not 'best effort'.
It also depends a bit on your project, some project are more prone to re-using names or symbols.
> If I was as quick at opening and reading files as claude code, I'd prefer grep with context around the searched term.
Well, Claude probably also doesn't want to have to 'learn' how to use all kinds of different tools for different languages and eco-systems.
I believe that was my experience with IDEs too?
I use both grep and JTD fairly frequently for different use cases.
I meant 'Jump to Definition' as one clear example, not as a definitive enumeration of everything that compiler integration can help you with.
Eg compiler integration is also really useful to show you the inferred types. Even dinosaurs like old-school Java and C have (limited) type inference: inside of expressions. But of course in a language like Haskell or Rust (or even Python) this becomes much more important.
No amount of find+grep+LLM is even remotely there yet.
What do you mean Turing complete? Obviously all 3 programs are running on a Turing complete machine. Xargs is a runner for other commands, obviously those commands can be Turing complete.
I haven't heard of anybody working on a _proof_ for the Turing completeness of xargs, and I think the only conference willing to publish it would be Sigbovik.
One can use any and all available search mechanisms, SQL, graph db, regex, keyword and so on, for the retrieval part.
If you have ever seen Claude Code/Codex use grep, you will find that it constructs complex queries encompassing a whole range of keywords that may not even be present in the original user query. So the 'semantic meaning' isn't actually lost.
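For example (purely illustrative; the pattern and path are made up), a question about "revenue growth drivers" might turn into something like:

    # Illustrative only: the kind of broadened pattern an agent might emit,
    # even though the user never typed most of these words.
    import subprocess

    pattern = r"revenue growth|increased sales|top-line|net revenue|sales drivers"
    result = subprocess.run(
        ["rg", "-i", "-n", pattern, "docs/"],   # assumes ripgrep is installed and docs/ exists
        capture_output=True, text=True,
    )
    print(result.stdout[:2000])  # the agent reads the hits, then refines its next search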
And nobody is putting an entire enterprise's knowledge base inside the context window. How many enterprise tasks are there that need to reference more than a dozen docs? And even those that do can be broken down into sub-tasks of manageable size.
Lastly, nobody here mentions how much of a pain it is to build, maintain and secure an enterprise vector database. People spend months cleaning the data, chunking and vectorizing it, only for newer versions of the same data to make it redundant overnight. And good luck recreating your entire permissioning and access-control stack on top of the vector database you just created.
The RAG obituary is a bit provocative, and maybe that's intentional. But it's surprising how negative/dismissive the reactions in this thread are.
Saying grep is also RAG is like saying ext4 + grep is a database.
grep + agentic LLM is not RAG.
In 10-Ks and 10-Qs there are often no table headers. This is particularly true for the consolidated notes to financial statements section. Standalone tables can be pretty much meaningless because you won't even know what they are reporting. For example, a table that simply mentions terms like beginning balance and ending balance could be reporting inventory, warranty, or short-term debt, but the table does not mention those metrics at all and there are no headers. So I am curious how Fintool uses standalone tables. Do you retain the text surrounding a table in the same chunk as the table?
These types of articles regularly come from people who don't actually build at-scale systems with LLMs, or from people who want to sell you on a new tech. And the frustrating thing is: they ain't even wrong.
Top-K RAG via vector search is not a sufficient solution. It never really was for most interesting use-cases.
Of course, take the easiest and most structured - in a sense, perfectly indexed - data (code repos) and claim that "RAG is dead". Again. Now try this with billions of unstructured tokens where the LLM really needs to do something with the entire context (like confirm that something is NOT in the documents), where even the best LLM loses context coherence after about 64k tokens for complex tasks. Good luck!
The truth is: Whether its Agentic RAG, Graph RAG, or a combination of these with ye olde top-k RAG - it's still RAG. You are going to Retrieve, and then you are going to use a system of LLM agents to generate stuff with it. You may now be able to do the first step smarter. It's still Rag tho.
The latest Anthropic whoopsy showed that they also haven't solved the context-rot issue. Yes, you can get a 1M-context version of Claude, but then the small-scale/detail performance is so garbage that misrouted customers lose their effin' minds.
"My LLM is just gonna ripgrep through millions of technical doc pdfs identified only via undecipherable number-based filenames and inconsistent folder structures"
lol, and also, lmao
I think this sums it up well. Working with LLMs is already confusing and unpredictable. Adding a convoluted RAG pipeline (unless it is truly necessary because of context size limitations) only makes things worse compared to simply emulating what we would normally do.
This makes it possible to quickly deploy on Coolify and build an agent that can use ripgrep on any of your uploaded files.
I’ve used LightRAG and looking to integrate it with OpenWebUI and possibly air weave which was a show HN earlier.
My data is highly structured and has references between documents, so I wanted to leverage that structure for better retrieval and reasoning.
For graph/tree document representations, it’s common in RAG to use summaries and aggregation. For example, the search yields a match on a chunk, but you want to include context from adjacent chunks — either laterally, in the same document section, or vertically, going up a level to include the title and summary of the parent node. How you integrate and aggregate the surrounding context is up to you. Different RAG systems handle it differently, each with its own trade offs. The point is that the system is static and hardcoded.
The agentic approach is: instead of trying to synthesize and rank/re-rank your search results into a single deliverable, why not leave that to the LLM, which can dynamically traverse your data. For a document tree, I would try exposing the tree structure to the LLM. Return the result with pointers to relevant neighbor nodes, each with a short description. Then the LLM can decide, based on what it finds, to run a new search or explore local nodes.
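Concretely, the search tool might return something like the following for each hit (a sketch; the field names and tree API are made up), so the model can decide whether to expand a neighbor or issue a new search:

    # Sketch of a tree-aware search result: the hit plus pointers the LLM can follow.
    def to_result(node, tree):
        parent = tree.parent_of(node)            # hypothetical tree API
        return {
            "id": node.id,
            "snippet": node.text[:400],
            "parent": {"id": parent.id, "title": parent.title} if parent else None,
            "siblings": [
                {"id": s.id, "summary": s.summary}   # short descriptions, not full text
                for s in tree.siblings_of(node)
            ],
        }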
The trick that has elevated RAG, at least for my use cases, has been having different representations of your documents, as well as sending multiple permutations of the input query. Do as much as you can in the VectorDB for speed. I'll sometimes have 10-11 different "batched" calls to our vectorDB that are lightning quick. Then also being smart about what payloads I'm actually pulling so that if I do use the LLM to re-rank in the end, I'm not blowing up the context.
TLDR: Yes, you actually do have to put in significant work to build an efficient RAG pipeline, but that's fine and probably should be expected. And I don't think we are in a world yet where we can just "assume" that large context windows will be viable for really precise work, or that costs will drop to 0 anytime soon for those context windows.
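A minimal sketch of that batched, multi-query step (hypothetical `rephrase` and `vector_db.batch_search` helpers, not any specific client library):

    # Several query permutations, one batched vector-DB round trip, dedupe,
    # and trimmed payloads so an optional LLM rerank stays cheap.
    def multi_query_retrieve(question, rephrase, vector_db, per_query_k=20, final_k=8):
        queries = [question] + rephrase(question, n=4)
        batches = vector_db.batch_search(queries, top_k=per_query_k)
        seen, merged = set(), []
        for hit in (h for batch in batches for h in batch):
            if hit.id not in seen:                 # dedupe across query permutations
                seen.add(hit.id)
                merged.append(hit)
        merged.sort(key=lambda h: h.score, reverse=True)
        return [{"id": h.id, "snippet": h.text[:500]} for h in merged[:final_k]]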
My first real AI use (beyond copy-paste ChatGPT) was Claude Code. I figured out in a few days to just write scripts and CLAUDE.md how to use them. For instance, one that prints comments and function names in a file is a few lines of python. MCP seemed like context bloat when a `tools/my-script -h` would put it in context on request.
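Roughly like this, for what it's worth (a sketch using Python's ast and tokenize modules, not the exact script):

    # Sketch: print function/class names (with first docstring line) and comments
    # from a Python file, so an agent can skim structure without reading everything.
    import ast, sys, tokenize

    def outline(path):
        tree = ast.parse(open(path).read())
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                doc = (ast.get_docstring(node) or "").splitlines()
                print(f"{node.lineno}: {type(node).__name__} {node.name}" + (f"  # {doc[0]}" if doc else ""))
        with tokenize.open(path) as f:
            for tok in tokenize.generate_tokens(f.readline):
                if tok.type == tokenize.COMMENT:
                    print(f"{tok.start[0]}: {tok.string}")

    if __name__ == "__main__":
        outline(sys.argv[1])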
Eventually stumbled on some more RAG a few weeks later, so decided to read up on it and... what? That's it? A 'prelude function' to dump 'probably related' things into the context?
It seems so obviously the wrong way to go from my perspective, so am I missing something here?
This is a weakness, not a strength of agentic search
Actually, let me be specific: everything from "The Rise of Retrieval-Augmented Generation" up to "The Fundamental Limitations of RAG for Complex Documents" is good and fine as given, then from "The Emergence of Agentic Search - A New Paradigm" to "The Claude Code Insight: Why Context Changes Everything" (okay, so the tone of these generated headings is cringey but not entirely beyond the pale) is also workable. Everything else should have been cut. The last four paragraphs are embarrassing and I really want to caution non-native English speakers: you may not intuitively pick up on the associations that your reader has built with this loudly LLM prose style, but they're closer to quotidian versions of the [NYT] delusion reporting than you likely mean to associate with your ideas.
[NYT]: https://www.nytimes.com/2025/08/08/technology/ai-chatbots-de...
These corpora have a high degree of semantic ambiguity among other tricky and difficult to alleviate issues.
Other types of text are far more amenable to RAG and some are large enough that RAG will probably be the best approach for a good while.
For example: maintenance manuals and regulation compendiums.
The only difference is that the advertising will be much more insidious and manipulative, the data collection far easier since people are already willingly giving it up, and the business much more profitable.
I can hardly wait.
LLMs have a similar issue with their context windows. Go back to GPT-2 and you wouldn't have been able to load a text file into its memory. Slowly the memory is increasing, same as it did for the early computers.
Frankly, reading through this makes me feel as though I am a business analyst or engineering manager being presented with a project proposal from someone very worried that a competing proposal will take away their chance to shine.
As it reaches the end, I feel like I'm reading the same thing, but presented to a Buzzfeed reader.
So if one were building, say, a memory system for an AI chatbot, how would you save all the data related to a user? Mother's name, favorite meals, allergies? If not a vector database like Pinecone, then what? Just a big .txt file per user?
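One low-tech baseline is exactly the last option: a small structured file per user that gets pasted into the prompt, with a vector DB only entering the picture once the memory outgrows the context budget. A hedged sketch (the file layout is made up, not a recommendation):

    # Toy per-user memory: a small JSON file of facts, loaded into the prompt.
    import json, pathlib

    MEMORY_DIR = pathlib.Path("user_memory")      # hypothetical location

    def remember(user_id, key, value):
        MEMORY_DIR.mkdir(exist_ok=True)
        path = MEMORY_DIR / f"{user_id}.json"
        memory = json.loads(path.read_text()) if path.exists() else {}
        memory[key] = value
        path.write_text(json.dumps(memory, indent=2))

    def recall(user_id):
        path = MEMORY_DIR / f"{user_id}.json"
        if not path.exists():
            return ""
        memory = json.loads(path.read_text())
        # Small enough to paste straight into the prompt, one fact per line.
        return "\n".join(f"{k}: {v}" for k, v in memory.items())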
I was using Qdrant, but I'm considering moving to OpenSearch since I want something more complete, with a dashboard I can muck around with
We don't want LLM-generated content on HN, but we also don't want a substantial portion of any thread being devoted to meta-discussion about whether a post is LLM-generated, and the merits of discussing whether a post is LLM-generated, etc. This all belongs in the generic tangent category that we're explicitly trying to avoid here.
If you suspect it, please use the established approaches for reacting to inappropriate content: if it's bad content for HN, flag it; if it's a bad comment, downvote it; and if there's evidence that it's LLM-generated, email us to point it out. We'll investigate it the same way we do when there are accusations of shilling etc, and we'll take the appropriate action. This way we can cut down on repetitive, generic tangents, and unfair accusations.
Having a company name pitched at you within the first two sentences is a pretty good giveaway.
Is anyone disagreeing with that?