NHacker Next
Show HN: We analyzed 1,573 Claude Code sessions to see how AI agents work (github.com)
dmix 23 minutes ago [-]
I've seen Claude ignore important parts of skills/agent files multiple times. I was running a clean up SKILL.md on a hundred markdown files, manually in small groups of 5, and about half the time it listened and ran the skill as written. The other half it would start trying to understand the codebase looking for markdown stuff for 2min, for no good reason, before reverting back to what the skill said.

LLMs are far from consistent.

cbg0 18 minutes ago [-]
Try this: keep your CLAUDE.md as simple as possible, disable skills, and ask Opus to start a subagent for each of the files, processing at most 10 at a time (so you don't get rate limited). Give it the instructions from the skill, for whatever processing you're doing to the markdown files, as a prompt, and see if that helps.
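The batch-of-10 idea above is easy to sketch outside the agent too; a minimal Python helper (names hypothetical, not part of any tool mentioned in the thread) that splits a file list into groups of at most 10, one group per subagent prompt:

```python
# Minimal sketch (names hypothetical): split markdown files into batches
# of at most 10, one batch per subagent prompt, to stay under rate limits.
def batch_files(paths, batch_size=10):
    """Yield lists of at most batch_size paths, in sorted order."""
    batch = []
    for p in sorted(paths):
        batch.append(p)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # trailing partial batch
        yield batch
```

For example, 23 files become batches of 10, 10, and 3; each batch's file list can then be interpolated into the subagent prompt along with the skill instructions.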
keks0r 21 minutes ago [-]
Yes, we had to tune the CLAUDE.md and the skill trigger quite a bit to get it much better. But to be honest, 4.6 also improved it quite a bit. Did you run into your issues on 4.5 or 4.6?
dmix 2 minutes ago [-]
I was using Sonnet 4.6, since it was a menial task.
sriramgonella 46 minutes ago [-]
This kind of dataset is really valuable because most conversations about AI coding tools are based on anecdotes rather than actual usage patterns. I’d be curious about a few things from the sessions:

1. How often developers accept vs. modify generated code
2. Which tasks AI consistently accelerates (tests, refactoring, boilerplate?)
3. Whether debugging sessions become longer or shorter with AI assistance

My experience so far is that AI is great for generating code but the real productivity boost comes when it helps navigate large codebases and reason about existing architecture.

keks0r 39 minutes ago [-]
1. Can only be partly answered, because we can only capture the "edits" that are prompted, vs. manual ones.
2. For us, actually all of them, since we do everything with AI and invest heavily and continuously in reducing the number of iterations we need.
3. That's a good one. We don't have anything specific for debugging yet, but it might be an interesting class for a type of session.
Aurornis 24 minutes ago [-]
> 26% of sessions are abandoned, most within the first 60 seconds

Starting new sessions frequently and using separate new sessions for small tasks is a good practice.

Keeping context clean and focused is a highly effective way to keep the agent on task. Having an up-to-date AGENTS.md should let new sessions get into simple tasks quickly, so you can use single-purpose sessions for small tasks without carrying the baggage of a long past context into them.

emehex 1 hours ago [-]
For those unaware, Claude Code comes with a built-in /insights command...
keks0r 1 hours ago [-]
Ohh, this is exciting, I kind of overlooked it. I assume there are still a lot of differences, especially across teams. But I ran it immediately when I saw your comment. It's actually still running.
loopmonster 1 hours ago [-]
insights is straight ego fluffing: it just tells you how brilliant you are, and the only actionable insights are the ones hardcoded into the skill that appear for everyone. Things like: be very specific with the success criteria ahead of time (more than any human could ever possibly be), tell the LLM exactly what steps to follow to the letter (instead of doing those steps yourself), use more skills (here's an example you can copy-paste that has 2 lines and just tells it to be careful), and a couple of actually neat ideas (like having it use Playwright to test changes visually after a UI change).
KaiserPister 57 minutes ago [-]
This is awesome! I’m working on the Open Prompt Initiative as a way for open source to share prompting knowledge.
keks0r 55 minutes ago [-]
Cool, what's the link? We have some learnings, especially in the "Skill guiding" part of our example.
alyxya 52 minutes ago [-]
Why does it need login and cloud upload? A local cli tool analyzing logs should be sufficient.
keks0r 46 minutes ago [-]
We used it across the team, and when you want to bring metrics together across multiple people, it's easier on a server than locally.
blef 1 hours ago [-]
mentalgear 36 minutes ago [-]
> A local-first desktop and web app for browsing, searching, and analyzing your past AI coding sessions. See what your agents actually did across every project.

Thx for the link - sounds great!

keks0r 1 hours ago [-]
Our focus is a bit more cross-team, and our internal version also has some continuous improvement monitoring, which we will probably release as well.
152334H 2 hours ago [-]
is there a reason, other than general faith in humanity, to assume those '1573 sessions' are real?

I do not see any link or source for the data. I assume it is to remain closed, if it exists.

keks0r 1 hours ago [-]
It's our own sessions, from our team, over the last 3 months. We used them to develop the product and learn about our usage. You are right, they will remain closed, but I am happy to share aggregated information if you have specific questions about the dataset.
marconardus 2 hours ago [-]
It might be worthwhile to include some of an example run in your readme.

I scrolled through and didn’t see enough to justify installing and running a thing

keks0r 1 hours ago [-]
Ah sorry, the readme is more about how to run the repo. The "product" information is on the website: https://rudel.ai
vova_hn2 1 hours ago [-]
It's sad that on top of black-box LLMs we also build all these tools that are pretty much black boxes as well.

It has become very hard to understand what exactly is sent to the LLM as input/context and how exactly the output is processed.

keks0r 1 hours ago [-]
The tool does have a quite detailed view for individual sessions, which lets you understand input and output much better. But obviously it's still mysterious how the output is generated from that input.
ekropotin 1 hours ago [-]
> That's it. Your Claude Code sessions will now be uploaded automatically.

No, thanks

keks0r 1 hours ago [-]
It will only be enabled for the repo where you called the `enable` command. Or use the CLI `upload` command for specific sessions.

Or you can run your own instance, but we still need to add docs on how to control the endpoint properly in the CLI.

tgtweak 59 minutes ago [-]
Big ask to expect people to upload their Claude Code sessions verbatim to a third party with nothing on the site about how it's stored, who has access to it, who they are, etc.
keks0r 37 minutes ago [-]
We don't expect anything; we put it out there, and we might be able to build trust over time. But maybe you don't trust us, and that's fair. You can still run it yourself. We are happy about everyone trying it out, hosted or not. We host it just to make it easier for people who want to try it, but you don't have to. You have a good point, though: we should probably put more about this on the website. Thanks.
anthonySs 46 minutes ago [-]
Is this observability for your Claude Code calls, or specifically for high-level insights like skill usage?

Would love to know your actual day-to-day use case for what you built.

keks0r 41 minutes ago [-]
The skill usage was one of those "I am wondering about...." things, and we just prompted it into the dashboard to understand it. We have some of these hunches where it's easier to analyze when you have sessions from everyone together, to understand similarities as well as differences, and we answered a few of those one-off questions this way. Ongoing, we also use our "learning" tracking a lot, which is not really usable right now because it integrates with a few of our other things, but we are planning to release it soon as well. The single session view also sometimes helps to debug a session and then better guide a "learning". So it's a mix of different things. Since we have multiple projects, we can even derive how much we are working on each project, and it maps better than our Linear points :)
mentalgear 37 minutes ago [-]
How diverse is your dataset?
keks0r 35 minutes ago [-]
Team of 4 engineers, 1 data & business person, 1 design engineer.

I would say a roughly equal number of sessions from each (very roughly).

Also, maybe 40% of sessions are coding in a large brownfield project, 50% greenfield, and the remaining 10% non-coding tasks.

cluckindan 2 hours ago [-]
Nice. Now, to vibe myself a locally hosted alternative.
vidarh 2 hours ago [-]
I was about to say they have a self-hosting guide, but I see they use third-party services that seem absolutely pointless for such a tiny dataset. For comparison, I have a project that happily analyzes 150 million tokens' worth of Claude session data with some basic caching in plain text files, on a $300 mini PC, in seconds... If/when I reach billions, I might throw SQLite into the stack. Maybe once I reach tens of billions, something bigger will be worthwhile.
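The plain-text, in-process approach really is only a few lines. A hedged sketch (the field names `message.usage.input_tokens` etc. are assumptions for illustration; the real Claude Code JSONL schema may differ) that tallies token counts across session logs:

```python
# Hedged sketch: aggregate token usage from JSONL session logs in-process,
# in the spirit of "plain text files are fast enough at this scale".
# Assumes each line is a JSON object whose message.usage carries
# input_tokens / output_tokens -- the real schema may differ.
import json
from pathlib import Path

def tally_tokens(log_dir):
    """Sum token usage across all .jsonl session files under log_dir."""
    totals = {"input_tokens": 0, "output_tokens": 0, "records": 0}
    for path in Path(log_dir).glob("**/*.jsonl"):
        for line in path.read_text().splitlines():
            if not line.strip():
                continue
            record = json.loads(line)
            usage = record.get("message", {}).get("usage", {})
            totals["input_tokens"] += usage.get("input_tokens", 0)
            totals["output_tokens"] += usage.get("output_tokens", 0)
            totals["records"] += 1
    return totals
```

At millions of records this stays well within what a single process handles comfortably, which is the commenter's point about skipping external services.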
keks0r 2 hours ago [-]
There is also a docker setup in there to run everything locally.
vidarh 1 hours ago [-]
That's great. It's still over-engineered given processing this data in-process is more than fast enough at a scale far greater than theirs.
keks0r 2 hours ago [-]
The docker-compose contains everything you should need: https://github.com/obsessiondb/rudel/blob/main/docker-compos...
lau_chan 2 hours ago [-]
Does it work for Codex?
keks0r 2 hours ago [-]
Yes, we added Codex support, but it's not yet extensively tested. Session upload works, but we kinda still have to QA all the analytics extraction.
mrothroc 1 hours ago [-]
[dead]
keks0r 1 hours ago [-]
This is great. How are you "identifying" these stages in the session? Or is it just different slash commands/skills per stage? If it's something generic enough, maybe we can build the analysis into it so it works for your use case. Otherwise, feel free to fork the repo and add your additional analysis. Let me know if you need help.
multidude 1 hours ago [-]
[flagged]
indiosmo 1 hours ago [-]
I usually instruct the agent to use the skills explicitly, e.g. "/writing-tests write the tests for @some-class.cpp"

So the skills are mostly a sort of on-demand AGENTS.md specific to the task.

Another example is I have a `plan-review` skill, so when planning something I add at the end of the prompt something like: "plan the task, .... then launch claude and codex /plan-review agents in parallel and take their findings into account before producing the final plan".

keks0r 1 hours ago [-]
The 4% usage was about our internal team, and we have skills set up. So it's not that they weren't built, but rather that they were not used when we expected them to be. So we adapted our CLAUDE.md to make Claude more eager to use them. Also, the 4% figure was on the 4.5 models; 4.6 got much better at invoking skills.
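For illustration, a hypothetical CLAUDE.md nudge of this sort (wording and the `markdown-cleanup` skill name invented here, not the authors' actual file) might look like:

```markdown
## Skills
- Before starting a task, check whether a skill in .claude/skills/ matches it.
- If a matching skill exists, invoke it rather than improvising your own approach.
- For markdown cleanup tasks, always use the markdown-cleanup skill.
```

The point is simply to state the invocation rule explicitly and early, rather than relying on the skill's own trigger description.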
mihir_kanzariya 32 minutes ago [-]
[flagged]
bspammer 27 minutes ago [-]
Heavy use of /rewind helps with this - it's much better to remove the bad information from the context entirely instead of trying to tell the model "actually, ignore the previous approach and try this instead"
ozgurozkan 50 minutes ago [-]
[flagged]
x187463 44 minutes ago [-]
> The 26% abandonment rate, the error cascade patterns in the first 2 minutes — these are behavioural signals, not just performance metrics.

> When Claude Code gets stuck in a loop, tries an unexpected tool chain, or produces inconsistent outputs under adversarial prompts — those aren't just UX failures, they're security surface area.

Twice in one paragraph, not even trying to blend in.

howdareme 47 minutes ago [-]
LLM comment spotted