Hacker News | edude03's comments

Essentially, the more turns you have, the more likely the agent is to fail, since the error compounds per turn. Agentic models are tuned for "long-horizon tasks", i.e. being able to go many, many turns on the same problem without failing.
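As a rough back-of-the-envelope sketch (the per-turn success rate below is made up, purely to illustrate the compounding):

```typescript
// If each turn succeeds independently with probability p, an n-turn task
// only completes cleanly with probability p^n. Illustrative numbers only.
const perTurnSuccess = 0.99;
for (const turns of [10, 50, 100]) {
  const cleanRun = Math.pow(perTurnSuccess, turns);
  console.log(`${turns} turns -> ~${(cleanRun * 100).toFixed(0)}% chance of no failures`);
}
// 10 turns -> ~90%, 50 turns -> ~61%, 100 turns -> ~37%
```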

Much appreciated, but I mean more around "what do the error bars in the figure represent" than what the turn scaling itself is.

For the tasks in SWE-Bench Pro they obtained a distribution of agent turns, summarized as the box plot. The box likely describes the inter-quartile range while the whiskers describe some other range. You'd have to read their report to be sure. https://en.wikipedia.org/wiki/Box_plot

That's a box plot, so those are not error bars but a visualization of the distribution of a metric (min, max, median, 25th percentile, 75th percentile).

The benchmark consists of a bunch of tasks. The chart shows the distribution of the number of turns taken over all those tasks.
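For what it's worth, here's a minimal sketch of the five numbers a box plot typically encodes, using invented turn counts (the quantile helper is a simple linear-interpolation version, not any particular library's):

```typescript
// Made-up turn counts per task, just to show what the box plot summarizes.
const turnsPerTask = [12, 15, 18, 22, 25, 27, 31, 34, 40, 58];

// Linear-interpolation quantile over a sorted array.
function quantile(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

const sorted = [...turnsPerTask].sort((a, b) => a - b);
console.log({
  min: sorted[0],                  // lower whisker (ignoring outlier rules)
  q1: quantile(sorted, 0.25),      // bottom of the box
  median: quantile(sorted, 0.5),   // line inside the box
  q3: quantile(sorted, 0.75),      // top of the box
  max: sorted[sorted.length - 1],  // upper whisker
});
```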


On iOS 26 (and I think before that as well), if I enable the sleep focus and disable it later, it usually stays in sleep focus despite showing that no focus is selected. So I don't get notifications and the wallpaper on the main screen is dimmed. It requires a reboot to fix, and it defeats the purpose of having a phone if no one can contact me until I notice.


I have the same experience despite using Claude every day. As a funny anecdote:

Someone I know wrote the code and the unit tests for a new feature with an agent. The code was subtly wrong; fine, it happens. But worse, the 30 or so tests they added put 10 minutes on the test run time, and they all essentially amounted to `expect(true).to.be(true)` because the LLM had worked around the code not working in the tests.


There was an article on HN last week (?) which described this exact behaviour in the newer models.

Older, less "capable", models would fail to accomplish a task. Newer models would cheat, and provide a worthless but apparently functional solution.

Hopefully someone with a larger context window than myself can recall the article in question.


I think that article was basically wrong. They asked the agent not to provide any commentary, then gave an unsolvable task, and wanted the agent to state that the task was impossible. So they were basically testing which instructions the agent would refuse to follow.

Purely anecdotally, I've found agents have gotten much better at asking clarifying questions, stating that two requirements are incompatible and asking which one to change, and so on.

https://spectrum.ieee.org/ai-coding-degrades


From my experience: TDD helps here - write (or have AI write) tests first, review them as the spec, then let it implement.
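A minimal sketch of what that can look like (Vitest-style assertions; `applyDiscount` and its rules are hypothetical placeholders, not anything from a real codebase):

```typescript
// The tests are the reviewed spec; applyDiscount() doesn't exist yet.
import { describe, expect, it } from "vitest";
import { applyDiscount } from "./pricing";

describe("applyDiscount", () => {
  it("takes 10% off orders over $100 with a valid code", () => {
    expect(applyDiscount(200, "SAVE10")).toBe(180);
  });

  it("throws on an expired code instead of silently charging full price", () => {
    expect(() => applyDiscount(200, "EXPIRED")).toThrow();
  });
});
```

Once these read like the behaviour you actually want, let the agent implement against them and treat any edits to the tests themselves as a red flag.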

But when I use Claude Code, I also supervise it somewhat closely. I don't let it go wild, and if it starts to make changes to existing tests, it better have a damn good reason or it gets the hose again.

The failure mode here is letting the AI manage both the implementation and the testing. May as well ask high schoolers to grade their own exams. Everyone got an A+, how surprising!


> TDD helps here - write (or have AI write) tests first, review them as the spec

I agree, although I think the problem usually comes in writing the spec in the first place. If you can write detailed enough specs, the agent will usually give you exactly what you asked for. If your spec is vague, it's hard to eyeball whether the tests, or even the implementation of the tests, match what you're looking for.


This happens to me every time I try to get Claude to write tests. I've given up on it. Instead, I will write the tests myself if I really care enough to have tests.


> they all essentially amounted to `expect(true).to.be(true)` because the LLM had worked around the code not working in the tests

A very human solution


I wonder if Volkswagen would've blamed AI had Dieselgate happened nowadays...

In PR-speak: "To improve quality and reduce costs, we used AI to program some test code. Unfortunately, the test code the AI generated fell below our standards, and it was missed during QA."

Then again, they got their supplier Bosch to program the "defeat device" and lied to them: "Oh, don't worry, it's just for testing, we won't deploy it to production." (The "device" (probably just an algorithm) detected whether the steering wheel was being moved as the throttle was pushed; if it wasn't, it assumed the car was undergoing emissions testing and ran the engine in the environmentally friendlier mode.)
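Purely as a speculative illustration of the heuristic described above (not the actual Bosch/VW code, obviously):

```typescript
// Speculative sketch only: throttle applied while the steering wheel never
// moves looks like a dyno/emissions test, so run the "clean" engine map.
interface SensorSnapshot {
  throttlePercent: number;
  steeringAngleDegrees: number;
}

function looksLikeEmissionsTest(recent: SensorSnapshot[]): boolean {
  const underLoad = recent.some(s => s.throttlePercent > 20);
  const steeringStill = recent.every(s => Math.abs(s.steeringAngleDegrees) < 1);
  return underLoad && steeringStill;
}
```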


Sounds like you're using LLMs to replace human connection.

For example, instead of:

Duolingo - I practice with my friends
Calorie tracking - I have planned meals from my dietitian
Workout tracking - I have WhatsApp with my PT, who adjusts my next workouts from our conversations
Reminders - A combo of Siri + Fantastical + My Wife

I'm sure my way is more expensive, but I don't know - there is also an intangible cost to not having friends/personal connections.


I may be missing your intent, but this feels like a misread of what I was describing.

I wasn’t swapping human connection for LLMs. These workflows already existed; I’ve simply used newer tools to make them better aligned to my needs and more cost-effective for me.


> There is no economic rule that says that riveting should pay more than taking care of the elderly or food delivery.

There kind of is - it's the same reason B2B SaaS tends to make more money than B2C: it's easier to sell someone something if they can make money from it.

If I can pay you Y to rivet some sheet metal together and sell the finished product for Y * 10, that's a much better outcome for me (economically) than paying someone to take care of my elderly parents. In fact, maybe I'm not mean, maybe _I_ don't make enough money to afford to pay someone to take care of my elderly parents.


Economic rules are all subject to externalities like the effects of taxes, regulations, etc. I mean there is no rule in the sense that some jobs that pay poverty wages in the US are not poverty-wage jobs in other countries, due to the impact of regulation.

It's a policy choice to allow Walmart to pay full-time employees so little that taxpayers have to subsidize their food. We are free to make different choices.


I've been thinking about building this with friends. In the short term, though, you could do this today with Garage: http://garagehq.deuxfleurs.fr


Kind of - AFAIK "micro" was never actually thoroughly defined. In my mind I think of it as mapping to one table (i.e., users = user service, balances = balances service), but that might still be a "full service" worth of code if you need anything more than basic CRUD.
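Something like this toy sketch is what I picture (entirely made-up names; each service owns exactly one table and is coupled to the others only through a narrow interface):

```typescript
// Hypothetical "one table = one service" boundaries.
interface UserService {
  getUser(id: string): Promise<{ id: string; email: string }>;
  createUser(email: string): Promise<string>;
}

interface BalanceService {
  getBalance(userId: string): Promise<number>;
  credit(userId: string, amount: number): Promise<void>;
}
// The balances service never touches the users table directly;
// it only calls UserService.
```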


The original sense was one business domain or business function (which would often include more than one table in a normalized relational DB). The broader context was the observation that software architecture tends to reflect the team structure of the development organization: development organizations should therefore parallel business organizations, and software serving different business functions should be loosely coupled. That way, business needs in any area can be addressed with software changes, with only the unavoidable friction coming from software serving connected business functions (friction directly tied to the business impact of the change on those functions), rather than having coupling between components that are unrelated in business function inhibit changes driven by business needs in a particular area.


It blows my mind that, with all the technology we have, we can't find a plane when we have a pretty decent rough idea of where it is.

It's a testament to how big and deep the ocean is


> I maintain an S3 client that has a test matrix for the commonly used S3 implementations.

Is it open to the public? I'd like to check it out


Why not? What better things does the CTO have to do?

