I am not the person you are responding to but I have tried both: using OpenRouter and also giving a Chinese company $5 on my credit card to buy tokens. If I know what model I want to experiment with, I much prefer to just pay $5 and have plenty of tokens to experiment. On a yearly basis, this is a very tiny expense for the benefits of getting plenty of tokens to experiment with.
What has worked better for me is splitting authority, not just prompts. One agent can touch app code, one can only write failing tests plus a short bug hypothesis, and one only reviews the diff and test output. Also make test files read only for the coding agent. That cuts out a surprising amount of self-grading behavior.
It’s not an agentic pattern, it’s an approach to test driven development.
You write a failing test for the new functionality that you’re going to add (which doesn’t exist yet, so the test is red). You then write the code until the test passes (that is, goes green).
reply