> It also can't decide when to run the program and how to interpret/make changes based on the results.
Not quite, but a bit of plumbing can get you closer. Not "human using a computer" close, but interestingly closer nonetheless.
I've been trying to accomplish something akin to this by having one program monitor and alter another program inside a virtual machine, using GPT-generated solutions to error traces to correct bugs in the sandboxed program.
It watches the program for errors; when one occurs, it feeds the error to GPT along with the pertinent code, then tries to splice the solution back in.
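The watch-and-feed step can be sketched roughly like this (a toy version, not the actual system: the buggy program, the traceback parsing, and the prompt shape are all made up for illustration, and the model call is left as a comment):

```python
import re
import subprocess
import sys
import tempfile

# A deliberately buggy "sandboxed" program to watch.
BUGGY = "def div(a, b):\n    return a / b\n\ndiv(1, 0)\n"

def run_and_capture(path):
    """Run the program; return its stderr if it crashed, else None."""
    proc = subprocess.run([sys.executable, path],
                          capture_output=True, text=True)
    return proc.stderr if proc.returncode != 0 else None

def pertinent_code(trace, source):
    """Pull the line numbers named in the traceback and return those source lines."""
    lines = source.splitlines()
    hits = {int(n) for n in re.findall(r"line (\d+)", trace)}
    return "\n".join(lines[n - 1] for n in sorted(hits) if n <= len(lines))

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(BUGGY)
    path = f.name

trace = run_and_capture(path)
if trace:
    prompt = (f"Fix this error:\n{trace}\n"
              f"Relevant code:\n{pertinent_code(trace, BUGGY)}")
    # ...here the prompt would go to the model (e.g. via OpenAI's API),
    # and the returned patch would get spliced back into the source.
```

The only real trick is pairing the trace with just the lines it points at, so the model isn't handed the whole file.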
It kind of works. I don't think we're going to see human levels of success from this in the immediate future, but I was able to write a simple event-based system which alters a program to resolve simple bugs. It even does it on a different git branch, and there is some stubbed-out code and prompts for generating tests. In my manual testing, this actually worked too. If the tests passed, I was going to have it push the change set and create a PR explaining the changes, tests, etc.
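The orchestration part is just a small state machine. Here's a hypothetical shape of it, with the git, patch, and model calls stubbed out so only the control flow is real (none of these names come from the actual system):

```python
def attempt_fix(apply_patch, run_tests, actions):
    """Patch on a branch; push and open a PR only if the generated tests pass."""
    actions.append("checkout_branch")   # work on a separate git branch
    apply_patch()                       # splice in the model's suggested fix
    actions.append("apply_patch")
    if run_tests():                     # run the generated tests against the patch
        actions.append("push")
        actions.append("open_pr")       # PR would explain the changes and tests
        return True
    actions.append("abandon_branch")    # failed tests: throw the branch away
    return False

# Happy path: the patch applies and the tests pass.
actions = []
ok = attempt_fix(apply_patch=lambda: None,
                 run_tests=lambda: True,
                 actions=actions)
```

The nice property of keeping it this dumb is that a failed attempt costs nothing but a dead branch.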
I doubt I'll continue now that Copilot is doing this already. My point, though, is that with the right configuration, the right data and prompts, and a system orchestrating the start/stop/test patterns based on the state of the sandboxed program, you can begin to achieve something akin to an inexperienced person solving bugs.
Sometimes it does a terrible job and other times it kind of falls over itself. But we're already leaps and bounds ahead of previous systems, and I just cobbled this together with what's possible via OpenAI's API.
The crazy part is that there are so many possible layers. Like say we get our initial solution and we verify that it works. Well, now we can have a system which optimizes the implementation. Like a PR buddy that observes the implementation and determines: should this test be appended to an existing suite of tests? Can the test case simply be added to an existing table-driven test? How can we streamline this patch to avoid an endless stream of additional files and tests to maintain? I think that's actually tractable already. While the success rate won't be 100% today, it'll clearly only improve.
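Even the "can this fold into an existing table-driven test" question has a tractable mechanical core. A toy version of that check, under the assumption that a table is a list of (args, expected) rows and a new case fits if it matches the arity of the existing rows:

```python
# Existing table of (args, expected) cases for some add-like function.
CASES = [((1, 2), 3), ((0, 0), 0)]

def can_fold(new_case, table):
    """A new case fits an existing table if its args match the table's arity."""
    sample_args, _ = table[0]
    args, _ = new_case
    return len(args) == len(sample_args)

new = ((5, -5), 0)
if can_fold(new, CASES):
    CASES.append(new)   # one more row, instead of one more test file
```

The model's judgment would sit on top of a structural check like this, not replace it; the check keeps the "streamlining" step from hallucinating a fold that doesn't type-check.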