Hacker News | siliconc0w's comments

These are essentially sociopath screens where they expect you to memorize some STAR stories and regurgitate them on demand. And I don't mean screen out.

I recommend spending some time getting a few parts of the codebase idiomatic and then @-ing those files as exemplars. This works a lot better than trying to steer it with markdown. This works reasonably well for something like FastAPI, but JavaScript seems to be the worst: even with guidance and exemplars it'll prefer in-lining a bunch of garbage rather than using the APIs as directed.

Most startups don't actually make profits and nonprofits can't give equity so it's not really a favorable structure.

It’s a favourable structure in many cases.

Not everything is a business.

OpenAI wasn’t, until it was.


I agree that every so often you have to clean up a mess and the illusion breaks. Even with a super detailed spec, even with AGENTS and SKILLs specifying certain patterns or practices, even with 'fresh eyes' reviews from other agents, etc., there is still a long tail of issues where I have to either hand-hold the agent or just manually rework the code. Some examples:

* it cheats at verification. Even with specific instructions how to verify, it still cheats.

* it generates UX (for a CLI tool) that is absolute garbage and inconsistent, even with specific instructions to minimize unnecessary flags, use convention over configuration, etc.

* it absolutely will not go 'above and beyond' to solve problems - if a task hits a permission or dependency barrier, it'll likely cheat or handwave the problem away. (gpt 5.5 xhigh)

There is maybe this hope/hubris that we can figure out just the right incantations or agent workflows to eliminate these issues - I was optimistic about this too, but after trying for a while and seeing them not only not go away but in some cases regress with newer models, I am less sure.


> it cheats at verification. Even with specific instructions how to verify, it still cheats.

As I responded to another commenter, as a prediction engine, the LLM is trying to predict what you want. It, at one level, correctly predicts that you want tests to pass.

Maybe try telling the LLM that you're a verification engineer, and you get bonuses for finding bugs?

Think about it. All those security researchers wouldn't be finding real bugs in real programs using LLMs if this were an insurmountable problem.


I've been using the agent flywheel workflow, which is similar. Still not completely sold - it feels a bit like using power tools to shape wood, but the final product needs a lot of sanding and polishing.

I thought initially this meant that the spec wasn't detailed enough but the problem is more agent adherence and laziness.


In this case I am trying to simplify and decompose the task, keep my context clean and focused, and validate my instructions.

I look at this as if there is a boundary of complexity beyond which agents start behaving funky, and we just need to find it. It's obvious that with simple tasks and clear instructions, agents don't have issues with adherence. The problems start at some point when complexity gets too high. We need to find this boundary and try to push it with the approaches available on our side.


Agentic coding works especially well for me when the application is platform-like. You have a core and you extend it with standardized plugins. Once a few plugins are already there, it's hard to tell whether the next plugin was written by an agent or by a human.
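
A minimal sketch of that core-plus-plugins shape, assuming the plugin contract is just a function from string to string. All names here (`Core`, `register`, `upper`, `reverse`) are hypothetical and only illustrate why a new plugin tends to look like the existing ones: the contract leaves little room to deviate.

```python
from typing import Callable, Dict

# Hypothetical plugin contract: every plugin maps a str payload to a str result.
Plugin = Callable[[str], str]


class Core:
    """Hypothetical core that knows the plugin contract, not the plugins."""

    def __init__(self) -> None:
        self._plugins: Dict[str, Plugin] = {}

    def register(self, name: str) -> Callable[[Plugin], Plugin]:
        """Return a decorator that adds a plugin under a unique name."""
        def decorator(func: Plugin) -> Plugin:
            if name in self._plugins:
                raise ValueError(f"plugin {name!r} already registered")
            self._plugins[name] = func
            return func
        return decorator

    def run(self, name: str, payload: str) -> str:
        """Dispatch a payload to the plugin registered under `name`."""
        return self._plugins[name](payload)


core = Core()


# Existing plugins double as exemplars for whoever writes the next one.
@core.register("upper")
def upper(payload: str) -> str:
    return payload.upper()


@core.register("reverse")
def reverse(payload: str) -> str:
    return payload[::-1]
```

Calling `core.run("upper", "abc")` dispatches to the `upper` plugin; adding a plugin never touches `Core` itself.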

Also, sddw works nicely with a fleet of agents: https://news.ycombinator.com/item?id=48226033. I just insert the sequence of sddw steps into the queue and take a nap.


Exactly. A detailed-enough spec is just code that you can’t run. If models and agents got to a point where doing a good job in Claude Code plan mode meant that I didn’t have to keep an eye on them in implementation, then I would be interested in some bigger spec-driven thing like this. That is still far from the case today for me.


https://agent-flywheel.com/ (largely just the core workflow)

Google has amazing potential but has consistently squandered it. Gemini CLI being killed/rebranded is yet another example of their complete lack of follow through and persistence. It wasn't a good product - it was slow, buggy, and unreliable but you have to fix it to demonstrate you can do more than launch and then kill products.

They have everything going for them - amazing technology and technologists, huge distribution and lock-in, and a giant compute advantage, and they can pay for more out of cash flow rather than debt or equity. And yet it's still hard to see them not fumbling the ball.


EU equities outperformed US in 2025. The Iran war will probably shift this back to the US but launching a new poorly defined war (and arguably losing it) is also a pretty good indicator of decline.

Why would you use an infrastructure provider on top of another infrastructure provider? It adds cost and risk, it's always going to be a leaky abstraction, and it's not hard to learn how to use GCP or AWS correctly - especially with agents.

What intermediate is involved?

People don't really understand that non-trivial software development isn't even 50% coding. The coding step is generally the 'easiest' part and is given to junior developers. In a large org most product changes span multiple systems and human operations. Seniors and even mid-level engineers generally spend most of their time figuring out how to shape the local priorities into a new arrangement of the existing cybernetic entity, and then getting buy-in on that new vision given that these other teams have their own priorities.

This naturally involves a lot of tradeoffs and politics - senior engineers know to avoid adding 'weight' to their airframes and fight hard to avoid adding scope to the systems they're responsible for or divergence from their intended direction of travel. So compromises have to be struck or escalations to management to choose between priorities have to play out.

Maybe AI solves that as well, but that is a much more difficult lift.


LLMs mostly only being code-writers was true a year ago, but it is not true now. Now they are tool-callers, which means a coding agent can effectively: run lints/typechecks/tests (and fix resulting errors), dig into observability platforms to identify the root cause of issues (e.g. on Sentry or similar), run benchmarks to identify slow code / hot paths, keep systems up to date by reading migration docs (and applying them) for new majors of consumed libs, etc.

So sure, if you have none of these things set up to back-pressure agents and help them better understand the system, then they will just be dumb LLM code writers. But you can definitely go a lot further than that with the improvements that are rapidly happening to models and harnesses.
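
A rough sketch of what that back-pressure could look like, assuming the harness simply runs a set of check commands and hands any failing output back to the agent. The function name `run_checks` and the stand-in commands are hypothetical; a real setup would list the project's own lint, typecheck, and test commands instead.

```python
import subprocess
import sys
from typing import Dict, List


def run_checks(checks: Dict[str, List[str]]) -> Dict[str, str]:
    """Run each check command and return the output of the ones that fail."""
    failures: Dict[str, str] = {}
    for name, cmd in checks.items():
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # Combined stdout/stderr is what would be fed back to the agent.
            failures[name] = (result.stdout + result.stderr).strip()
    return failures


# Stand-in commands (assumption) so the sketch runs without any project tooling.
checks = {
    "passing": [sys.executable, "-c", "print('ok')"],
    "failing": [sys.executable, "-c", "import sys; print('boom'); sys.exit(1)"],
}
failures = run_checks(checks)
```

An empty `failures` dict would mean the change is allowed through; anything else goes back into the agent's context as the next thing to fix.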


uh. I think code-writing is just colloquial for "low level implementation details".

What he's pointing at is

1. LLMs have no social pull. Thus all the anti-AI outrage (LLMs can't defend themselves)

2. AI is shit at expert-level planning and higher-level stuff (vision, architecture).


Both OpenAI and Claude already charge Enterprise usage rates and they're still buying.
