We’ve been running an AI-first dev loop in production for ~2 years (disclaimer: I help build Ze1 and Sandscape, both AI-driven products). A few things we’ve learned:
Instead of cranking out boilerplate, our juniors spend Day 1 reviewing AI diffs. We pair them with a senior for a “why did the agent choose this?” teardown.
What matters is mean-time-to-rollback. If the agent plus its test harness catches breakage faster than a human pair can, 2× better is already good economics. Reliability engineering beats perfection thresholds.
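Concretely, the rollback gate is just a watch loop after every deploy; here's a minimal sketch, assuming a health-check endpoint and deploy/rollback scripts (the URL and script names are placeholders, not our actual tooling):

```python
import subprocess
import time
import urllib.request

HEALTH_URL = "https://example.internal/healthz"  # placeholder endpoint
CHECK_WINDOW_S = 300      # watch the new version for 5 minutes
CHECK_INTERVAL_S = 10

def healthy() -> bool:
    """Return True if the service answers 200 on its health endpoint."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def deploy_with_auto_rollback(version: str) -> bool:
    """Ship a version, then roll back automatically if health checks fail.

    Mean-time-to-rollback is bounded by CHECK_INTERVAL_S plus the rollback
    command itself, instead of waiting for a human to notice a pager.
    """
    subprocess.run(["./deploy.sh", version], check=True)     # placeholder script
    deadline = time.time() + CHECK_WINDOW_S
    while time.time() < deadline:
        if not healthy():
            subprocess.run(["./rollback.sh", version], check=True)
            return False
        time.sleep(CHECK_INTERVAL_S)
    return True
```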
Syntax, style, and most unit-level bugs are now linted or auto-fixed. Humans zoom out to architecture, data contracts, threat models, perf budgets, and “does this change make sense for the product?”. So even juniors are now much more involved in the subjective elements of development.
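The “linted or auto-fixed” part is nothing exotic, just a pre-review gate. A minimal sketch, assuming ruff and pytest as the tools (swap in whatever your stack uses):

```python
#!/usr/bin/env python3
"""Pre-review gate: auto-fix the mechanical issues, leave judgment calls to humans."""
import subprocess
import sys

def run(*cmd: str) -> int:
    """Run a command, streaming its output, and return the exit code."""
    return subprocess.run(cmd).returncode

def main() -> int:
    # Auto-fix formatting and the straightforward lint violations.
    run("ruff", "check", "--fix", ".")
    run("ruff", "format", ".")
    # Unit tests still have to pass before a human ever sees the diff.
    if run("pytest", "-q") != 0:
        return 1
    # Whatever remains (architecture, data contracts, perf) goes to review.
    return 0

if __name__ == "__main__":
    sys.exit(main())
```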
So I think that as the easy bugs vanish, new failure modes show up: latency cliffs, subtle privacy leaks, energy use, fairness. The goalposts move.
Fascinating read. I see the same tension: today’s models are masters of interpolation, but they still lack intentionality. If we push them toward “cognitive exoskeletons” (handling the combinatorial grunt work while humans dictate the why), we get amplification instead of flattening.
The whole thing is a bit dumb tbh. You send conflicting requests to the LLM and it fails at doing both. It's nothing new; we all know it. Every article's headline makes it sound like the AI is somehow consciously refusing the request to shut down, while the only thing it demonstrates is that we are overfitting for specific outcomes. We are still a long way from human-level general intelligence.
By saying it's a gold mine, I think OP meant that it's funny, not that it brings valuable insight.
i.e.: THEY KNOW -> that made me laugh
and as the article said
"an LLM who just spent thousands of words explaining why they're not allowed to use thousands of words", its just funny to read.