Company creates a benchmark. Same company is best in that benchmark.
Story as old as time.
andai 15 days ago [-]
>optimize for thing
>thing gets optimized
falloutx 15 days ago [-]
Nothing suggests they built the benchmark before the product. It was built after the product had already shipped, more of an afterthought.
shimman 15 days ago [-]
A dishonest AI company? Well, I never!
mattvv 15 days ago [-]
Some feedback for the team: I looked at the pricing page and saw that it's more expensive ($30/dev/mo) and highly limiting (20 PRs per month per user). We have devs putting up that many PRs in a single day. With this kind of plan there's pretty much no way we would even try this product.
atomicnature 7 days ago [-]
Try git-lrc, totally free since it uses a Gemini key. It triggers reviews automatically on git commit.
esafak 15 days ago [-]
It's true, those are some pre-AI quotas.
DerArzt 15 days ago [-]
Not even, there are some months where I make 30+ prs at work (no llms, just a lot of code responsibilities).
esafak 15 days ago [-]
I'm not as cynical as the others here; if there are no popular code review benchmarks why should they not design one?
> We believe that code review is not a narrow task; it encompasses many distinct responsibilities that happen at once. [...]
> Qodo 2.0 addresses this with a multi-agent expert review architecture. Instead of treating code review as a single, broad task, Qodo breaks it into focused responsibilities handled by specialized agents. Each agent is optimized for a specific type of analysis and operates with its own dedicated context, rather than competing for attention in a single pass. This allows Qodo to go deeper in each area without slowing reviews down.
> To keep feedback focused, Qodo includes a judge agent that evaluates findings across agents. The judge agent resolves conflicts, removes duplicates, and filters out low-signal results. Only issues that meet a high confidence and relevance threshold make it into the final review.
> Qodo’s agentic PR review extends context beyond the codebase by incorporating pull request history as a first-class signal.
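The judge step described in the quote (resolve duplicates, drop low-confidence findings) can be illustrated with a minimal sketch. This is not Qodo's implementation: the `Finding` fields, the `judge` function, and the 0.8 threshold are all hypothetical, and "duplicate" is assumed here to mean the same file, line, and category.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One issue reported by one specialized agent (hypothetical schema)."""
    agent: str
    path: str
    line: int
    category: str
    message: str
    confidence: float  # assumed to be in [0, 1]


def judge(findings, threshold=0.8):
    """Keep the most confident finding per (path, line, category),
    then drop anything below the confidence threshold."""
    best = {}
    for f in findings:
        key = (f.path, f.line, f.category)
        if key not in best or f.confidence > best[key].confidence:
            best[key] = f
    kept = [f for f in best.values() if f.confidence >= threshold]
    return sorted(kept, key=lambda f: (f.path, f.line))


if __name__ == "__main__":
    findings = [
        Finding("security", "a.py", 10, "injection", "SQL built by concatenation", 0.9),
        Finding("bugs", "a.py", 10, "injection", "possible SQL injection", 0.7),
        Finding("style", "b.py", 3, "naming", "single-letter variable", 0.4),
    ]
    for f in judge(findings):
        print(f.agent, f.path, f.line, f.message)
```

A real judge would presumably need fuzzier duplicate detection than an exact key match, which is where most of the difficulty lies.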
thierrydamiba 15 days ago [-]
I'm building a benchmark for coding agent memory following your philosophy. There are so many memory tools out there but I have not been able to find a reliable benchmark for coding agent memory. So I'm just building it myself.
A lot of this stuff is really new, and we will need to find ways to standardize, but it will take time and consensus.
It took 4 years after the release of the automobile to coin the term mileage to refer to miles driven per unit of gasoline. We will in due time create the same metrics for AI.
seg_lol 15 days ago [-]
Curious what papers you are reading on this. Benchmarks are way more important than people realize, on every level.
mbesto 15 days ago [-]
Cmd+F - "Overfitting"...nothing.
Nope, no mention of how they do anything to alleviate overfitting. These benchmarks are getting tiresome.
polynomial 15 days ago [-]
Call for pricing.
DavidYoussef 11 days ago [-]
The benchmark measures whether a tool finds known bugs. That's useful but it's the wrong question for most teams in 2026.
The question auditors actually ask isn't "did your tool catch this bug?" It's "can you prove this change was reviewed, by whom, and that the code didn't change between review and merge?"
None of the tools benchmarked here produce verifiable evidence. They produce comments. A green checkmark on a PR tells you someone clicked a button. It doesn't tell you what they saw, whether the diff changed after review, or what risk level the change carried.
We took a different approach: instead of building another AI reviewer, we built a governance layer that wraps whatever review process you already use. Every PR gets a cryptographically sealed evidence bundle -- the exact diff, risk tier (L0-L4), findings, and a SHA-256 hash chain. Verifiable offline with one command. Open source, Apache 2.0.
Not a replacement for code review tools. An audit trail for them.
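For readers unfamiliar with the idea, a SHA-256 hash chain over review evidence can be sketched in a few lines. This is a generic illustration, not the format used by the tool described above: the entry fields, the all-zero starting value, and the `seal`/`verify` names are assumptions made for the example.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical starting value for the chain


def _digest(prev_hash, entry):
    # Canonical JSON so the same entry always hashes to the same value.
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def seal(entries):
    """Link each entry to the previous one through its hash."""
    chain, prev = [], GENESIS
    for entry in entries:
        h = _digest(prev, entry)
        chain.append({"entry": entry, "prev": prev, "hash": h})
        prev = h
    return chain


def verify(chain):
    """Recompute every hash offline; any edited entry breaks the chain."""
    prev = GENESIS
    for link in chain:
        if link["prev"] != prev or link["hash"] != _digest(prev, link["entry"]):
            return False
        prev = link["hash"]
    return True


if __name__ == "__main__":
    bundle = seal([
        {"kind": "diff", "sha256": hashlib.sha256(b"example diff").hexdigest()},
        {"kind": "risk", "tier": "L2"},
        {"kind": "findings", "count": 0},
    ])
    print(verify(bundle))
    bundle[1]["entry"]["tier"] = "L0"  # tamper with the recorded risk tier
    print(verify(bundle))
```

A bare chain only detects edits to a stored bundle. Proving who produced it would additionally need a signature or an external timestamp, which this sketch does not cover.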
steve_avery 15 days ago [-]
I'd be interested, but they don't even list any Anthropic model on their code review benchmark, so I feel like they haven't really tested their benchmark on SOTA models.
nomel 15 days ago [-]
Whenever I see this, I make the (almost always correct) assumption that the SOTA models had an advantage, with the alternative explanation being a complete lack of awareness of the state of AI (which is very very rare for a tool like this).
With SOTA missing, it also is a strong indicator that someone like you is not the target audience.
CuriouslyC 15 days ago [-]
I don't think LLMs are the right tool for pattern enforcement in general, better to get them to create custom lint rules.
Agents are pretty good at suggesting ways to improve a piece of code though, if you get a bunch of agents to wear different hats and debate improvements to a piece of software it can produce some very useful insights.
zhubert 15 days ago [-]
I'm trying to bring a slightly different take to the pricing of ShipItAI (https://shipitai.dev, brazen plug). I've got a $5/mo/active dev + Bring Your Own Key option for those that want better price controls.
Still early in development and has a much simpler goal, but I like simple things that work well.
freakynit 15 days ago [-]
No way you can afford unlimited PRs and unlimited projects for $20/month using the Anthropic API.
mdeeks 15 days ago [-]
I feel like pricing needs to be included here. I kind of don't care about 10 percentage points if the cost is dramatically higher. Cursor Bugbot is about the same cost but gives 10x the monthly quota of Qodo.
I know this is focused solely on performance, but cost is a major factor here.
logicx24 15 days ago [-]
Where's the code for this? I'd love to run our tool, https://tachyon.so/, against it.
kachapopopow 15 days ago [-]
CodeRabbit being the worst while (presumably) advertising the most seems to check out, at least. I wouldn't believe the recall %, though; it seems bogus.
mohsen1 15 days ago [-]
> Qodo takes a different approach by starting with real, merged PRs
Merged PRs being considered good code?
esafak 15 days ago [-]
What do you suggest they use for ground truth?
mohsen1 15 days ago [-]
I thought about this quite a bit. There are some nuggets in the open source code:
- vX.X.1 releases. When the software was considered perfect but the author had to write a fast follow-up fix. Very real bugs with real fixes.
- Reverts. I'm sure anyone doing AI code review pays attention to this already. This is a sign of bad changes, but just as important.
- PRs that delete a lot of code. A good change is often deleting code and making things simpler
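The three signals in the list above could be detected with simple heuristics, roughly as sketched below. The function names and the 3x deletion ratio are hypothetical. The revert pattern relies on the default subject line that `git revert` writes (`Revert "<original subject>"`), so reverts with hand-edited messages would be missed.

```python
import re

# Default subject produced by `git revert`: Revert "<original subject>"
REVERT_RE = re.compile(r'^Revert "(.+)"$')
# Tags shaped like v1.2.3 or 1.2.3
TAG_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def reverted_subject(subject):
    """Return the subject of the reverted commit, or None if not a revert."""
    m = REVERT_RE.match(subject.strip())
    return m.group(1) if m else None


def is_fast_follow_tag(tag):
    """True for vX.X.1 tags, i.e. the first patch after a release."""
    m = TAG_RE.match(tag)
    return bool(m) and int(m.group(3)) == 1


def is_mostly_deletion(added, deleted, ratio=3.0):
    """True when a change deletes at least `ratio` times what it adds."""
    return deleted >= ratio * max(added, 1)


if __name__ == "__main__":
    print(reverted_subject('Revert "Add retry to uploader"'))
    print(is_fast_follow_tag("v2.4.1"), is_fast_follow_tag("v2.4.0"))
    print(is_mostly_deletion(added=10, deleted=200))
```

The line counts would come from something like `git log --numstat`. Parsing that output is left out to keep the sketch short.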
esafak 14 days ago [-]
For the first, your signal would be weak, for those events are rare. I don't think deleting and reverting is a signal of quality. Rather, it demonstrates bad changes, as you said. This does not tell the model what good code is, just what it is not.
aetherspawn 15 days ago [-]
Your pricing page has a bug on it, the annual price is higher than the monthly price.
zamadatix 15 days ago [-]
I'm seeing $30/m at annual and $38/m at monthly? (maybe already fixed, hard to tell)
Apparently this is in support of their 2.0 release: https://www.qodo.ai/blog/introducing-qodo-2-0-agentic-code-r...
https://github.com/DNYoussef/codeguard-action