Fair, many tools trade off cost for performance/adherence. So far our experience is good with Semble (at least with Anthropic/OpenAI models), but if you have any feedback feel free to reach out. We're working on evaluations for this as well, though it will take some time as this is much harder than benchmarking retrieval quality.
There's a "semble savings" command you can run to see token savings. It varies heavily based on the repo: in general, the larger the project the larger the savings. Output code quality is something we're still trying to measure, but it's much harder than measuring retrieval quality.
Gotcha. Might be interesting to show some plots of average/median/max input token savings per query as a function of repo size if you're going to do some more comparison testing. The efficiency gains are compelling, but I'd want to see the magnitude of gains as well to get a full picture. Regardless, cool project and I'll check it out for myself.
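For what it's worth, the kind of summary suggested above could be sketched roughly like this. It assumes you already have per-query records of (repo size in files, baseline input tokens, tokens with the tool); the record layout, the `summarize_savings` name, and the bucket edges are all made up for illustration and are not part of Semble.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_savings(records, bucket_edges=(1_000, 10_000, 100_000)):
    """Group per-query token savings by repo-size bucket and summarize each.

    records: iterable of (repo_files, baseline_tokens, tool_tokens) tuples.
    This layout is hypothetical; adapt it to however the data is logged.
    """
    buckets = defaultdict(list)
    for repo_files, baseline_tokens, tool_tokens in records:
        # Bucket index = number of edges the repo size meets or exceeds.
        bucket = sum(repo_files >= edge for edge in bucket_edges)
        buckets[bucket].append(baseline_tokens - tool_tokens)
    return {
        b: {"mean": mean(s), "median": median(s), "max": max(s), "n": len(s)}
        for b, s in sorted(buckets.items())
    }


# Made-up numbers purely to exercise the function, not real measurements.
sample = [
    (500, 4_000, 1_000),
    (800, 6_000, 1_000),
    (20_000, 50_000, 2_000),
    (25_000, 30_000, 2_000),
    (30_000, 90_000, 2_000),
]
summary = summarize_savings(sample)
```

Plotting the mean, median, and max per bucket side by side would show both how large the absolute savings are and how skewed they are by a few big queries.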
Hey, this skepticism is fair and we share it. We haven't measured end-to-end agent improvements (yet), which is why we don't claim any. The benchmark we published measures retrieval quality and token count during search, not overall agent performance. We are working on agent-level evals, but those are unfortunately much harder to get right. That said, based on our own experience using Semble over the past few months of development, we do believe it makes agents better (or at the very least, cheaper).
> We are working on agent-level evals, but those are unfortunately much harder to get right.
It's unfortunately a nearly impossible task, as the models change regularly (without letting you know), so you have a moving (invisible) target that's 1) hard to test exhaustively, and 2) very expensive to test with any low margin of error.
This is why no one does it and just makes broad sweeping unverified claims instead.
If you figure out how to do it... You should probably just get a job at Anthropic or OpenAI and make $2M+ per year...
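To put a rough number on the "very expensive to test with any low margin of error" point: if each agent run is treated as an independent pass/fail trial, the usual normal-approximation sample size is n = z^2 * p * (1 - p) / e^2. The sketch below assumes exactly that, plus the worst case p = 0.5; the `runs_needed` name is made up for illustration, and a model that changes mid-experiment breaks the independence assumption, so treat the result as a lower bound.

```python
import math
from statistics import NormalDist


def runs_needed(margin, confidence=0.95, pass_rate=0.5):
    """Runs needed to estimate a pass rate to within +/- `margin`.

    Assumes independent pass/fail runs and the normal approximation;
    pass_rate=0.5 is the worst case (largest variance).
    """
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z * z * pass_rate * (1 - pass_rate) / margin**2)


for margin in (0.05, 0.02, 0.01):
    print(f"+/-{margin:.0%}: {runs_needed(margin)} runs per configuration")
```

Halving the margin roughly quadruples the number of runs, and that count applies to each tool/model/harness combination being compared.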
Hey, what's the issue with a disk search-lib in Python specifically? The library is extremely fast. Yes, we could probably squeeze out some more performance in Rust, but that's not our native programming language, so we opted for doing it correctly rather than using a language that we don't understand well enough.
In theory maybe, but in practice it hurts more than it helps I think. Irrelevant context makes the model more likely to reason from the wrong code (and it's slower and more expensive).
Hey, thanks for the detailed feedback. For the bug, would you mind opening an issue with your setup details? This is definitely something we want to investigate and fix. The multiple queries thing is really good feedback, thanks for that. We'll update the prompt/instructions to prevent this from happening, and we'll try to add some tests for it. I think the external connection errors during install are uv fetching deps from PyPI; those should not be the reason it's hanging.
The 98% is vs the grep+read loop, not grep output alone. When an agent hits an unfamiliar codebase it typically does "cat file" or reads the whole thing first, at least in my experience. If you're reliably getting agents to do "grep -C N" and stop there I'd genuinely be curious what your setup looks like, because I think the quality of the results is just too low to serve as useful context.
> When an agent hits an unfamiliar codebase it typically does "cat file" or reads the whole thing first, at least in my experience.
Depends on the size of the project and specific files. I have definitely seen agents make smart use of pi's "read" tool, which can take an offset and line limit (or defaults to a max 2000 lines/50KiB if the model doesn't specify). The bash tool also has the same max output, so if a model decides to cat instead of using the read tool it still won't blow out its context window with a single large file read.
But this sort of thing is going to vary with harness, model, project, and whatever the RNG delivers for the day.
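As a rough illustration of the capped read described above (not pi's actual implementation; `bounded_read` and its exact behavior are assumptions), a read with an offset, a line limit, and a 2000-line/50KiB ceiling might look like this. The sketch assumes the byte cap always applies and that the 2000-line default only kicks in when no limit is given.

```python
MAX_LINES = 2000        # default line cap when no limit is given
MAX_BYTES = 50 * 1024   # 50 KiB cap, assumed here to always apply


def bounded_read(text, offset=0, limit=None):
    """Return whole lines starting at line `offset`, stopping at either cap."""
    max_lines = MAX_LINES if limit is None else limit
    out, used = [], 0
    for line in text.splitlines(keepends=True)[offset:]:
        size = len(line.encode("utf-8"))
        # Stop before exceeding the line cap or the byte cap.
        if len(out) >= max_lines or used + size > MAX_BYTES:
            break
        out.append(line)
        used += size
    return "".join(out)
```

With a ceiling like this, even a plain cat of a huge file costs at most about 50KiB of context, which is why a single careless read is survivable.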