The BM25-first routing bet is interesting. You mention 85% recall@20 on 500 artifacts, but the heuristic classifier routing "short lookups to BM25 and narrative queries to cited-answer" raises a practical question: what does the classifier key on to decide a query is narrative rather than a short lookup? Token count? Syntactic structure? The reason I ask is that in agent-generated queries the boundary is often blurry - an agent doing a dependency lookup might issue a surprisingly long, well-formed sentence. If the classifier routes those to the more expensive cited-answer loop, it could negate the latency advantage of putting BM25 first.
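To make the worry concrete, here's the kind of naive token-count heuristic I have in mind - this is purely my guess at the routing logic, not anything from your post, and the threshold is made up - plus the misroute I'd expect from agent-generated lookups:

```python
# Sketch of a token-count router (my assumption about the heuristic,
# not the actual implementation). Threshold and feature choice are invented.
def route(query: str, max_lookup_tokens: int = 8) -> str:
    """Send short queries to BM25, longer 'narrative-looking' ones to cited-answer."""
    if len(query.split()) <= max_lookup_tokens:
        return "bm25"
    return "cited_answer"

# An agent doing a plain dependency lookup, phrased as a full sentence:
q = "list the currently pinned versions of requests and urllib3 in the payments service"
print(route(q))  # -> "cited_answer", even though BM25 over the manifests would be faster
```

If the real features are richer than length (question words, entity counts, etc.), that would change whether this class of query is actually at risk, which is why I'm curious what the classifier looks at.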
https://x.com/karpathy/status/2039805659525644595
https://xcancel.com/karpathy/status/2039805659525644595
I'd also like to understand how markdown helps with durability - if I understand correctly, markdown has an edge over other formats for LLMs.
I'm also building something similar on markdown, versioned with git, but for a completely different use case: https://voiden.md/