Most of the problem is that for voice chat, you usually get no reasoning at all and no tool use at all to research or ground assumptions.
For example, for voice, ChatGPT still uses a quantized GPT-4o non-reasoning model that hallucinates pretty frequently. It also doesn’t do much automatic searching for updated information or fact checking.
I usually don’t find I need high reasoning; DeepSeek v4 with medium reasoning is usually sufficient.
However, if it’s an important chat, like brainstorming on complex topics, I sometimes bump it up.
OpenAI has a new voice API that supports adjustable reasoning, but ChatGPT is not using it currently.
With a sufficiently sophisticated harness you can actually do quite a lot by just talking to your AI. For example, I have regularly built things by dictating to my phone while walking to lunch.
Following the original comment's reasoning: if every model requires a different prompting technique to maximize its output, how can a benchmark based on sending the same prompt to all models be accurate?
We could create different prompts for each model, but then how reliable and unbiased can the benchmark be?
How often do you really need to look at that info while doing normal work?
Because to me and to the vast majority of git users, it is totally irrelevant.
It is nice that the info is available, but the saner default would be to hide it behind a verbose flag, not the other way around.
Imagine typing cd folder/ and having the whole filesystem subtree displayed in the terminal. You are free to ignore it, but it is useless and inconvenient nonetheless.
Are you not a programmer? Do you not ever find yourself having to debug an issue? When you have to, are you not glad when there's sufficient information in the log files to do so, even though 99.9% of the time you never look at the logs?
> Imagine typing cd folder
It's not comparable. `cd` is a local command (technically a shell built-in) that completes instantly (unless you cd to a hung NFS mount...). So it honors the Unix philosophy of emitting nothing on success.
But cloning is a network operation. And it's normal for networking tools to output progress by default. See `wget` and `curl`.
The problem with hiding progress under `--verbose` is that by the time you need the information (why is this taking so long?), it's too late to add `--verbose`. You'd have to cancel the command and run it again, losing progress.
If you don't want it, then use `--quiet` and move on with your day.
(Sure, you could make it smarter by deferring any output until it realizes the operation is taking some time. Patches welcome.)
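For what it's worth, the "defer output until it's clearly slow" idea could look roughly like the sketch below. This is not how git does anything; `DeferredProgress`, its `delay` parameter, and the injectable `clock` are all made-up names for illustration. It buffers progress lines silently, and once the operation has been running longer than the threshold it replays what was buffered and prints everything after that immediately, so you don't have to restart the command to see what happened.

```python
import sys
import time
from typing import Callable, List, TextIO


class DeferredProgress:
    """Hold back progress lines until the operation has run for `delay` seconds.

    Hypothetical helper for illustration only; not part of git or any library.
    """

    def __init__(
        self,
        delay: float = 2.0,
        out: TextIO = sys.stderr,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.delay = delay
        self.out = out
        self.clock = clock
        self.start = clock()           # moment the operation began
        self.pending: List[str] = []   # lines held back while still "fast"
        self.visible = False           # flips to True once the delay has passed

    def update(self, message: str) -> None:
        # Once the operation has been running long enough, replay what was buffered.
        if not self.visible and self.clock() - self.start >= self.delay:
            self.visible = True
            for line in self.pending:
                self.out.write(line + "\n")
            self.pending.clear()

        if self.visible:
            self.out.write(message + "\n")
        else:
            self.pending.append(message)
```

A fast operation finishes without printing anything, which keeps the "silent on success" behavior, while a slow one starts talking after `delay` seconds without losing the earlier lines.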
I think in this case the human effort was put into the actual discovery. Honestly, I don't mind if AI helped him write the blog post, as long as the result is enjoyable and not sloppy.
> most of the bugs were in the JS code, so for most things, hot reloading is enough for E2E validation and to run just the right tests. No need to run a full build and test suite (which takes 10+ minutes); the CI can do this.
Have you tried adding this instruction to your AGENTS.md? Avoiding situations where the agent starts running a loop is the main use case of the file for me.
I'm guessing the main reason each provider creates its own agent harness is that (a) there is still no clear winner and (b) it makes it harder to switch to a competitor's model, as you also have to switch harnesses.
Or by using a proxy, yeah. Personally, I would still prefer a multi-provider harness over CC when using it with another provider, if only for the visible reasoning, model switcher, cost estimation and so on. So far I've only preferred CC when I needed to work with Jupyter Notebooks, because it has built-in tools for that.
Aside from LLM architecture, which is already a complex issue, another issue is that the training data is unstructured text.
An LLM able to structurally separate context from instructions should logically need training data that is separated the same way, and we don't have it.
Moreover, while an equally powerful LLM architecture that solves this may exist, there is no guarantee at all that we will be able to come up with it in a reasonable timeframe.
Without some signals moving in that direction, the most pragmatic and realistic way of looking at the problem is that it will not be solved in the near future.
I agree this doesn't mean we shouldn't try to address limitations with the current architecture. I just mean that I expect the root cause to be solved eventually if we ever really want to take steps towards AGI.
Attaching user identity is not a solution; it doesn't solve anything if you have to pull in external data that you can't control.
Like in the banking world: you can make everything super authenticated, but if you have an API that returns the latest wire transfer YOU received, with the message attached, you don't control the message content and it can be an attack vector.
Being authenticated/authorized is not the solution; the attack arrives in data that the user can legitimately access.
Aren't small local models worse efficiency-wise? It means that every person must have a powerful enough machine to power a small model, and we are very, very far away from that.
The best solution, from an efficiency point of view, is to run smaller models in datacenters, which requires far fewer machines.
There's an efficiency sweet spot where hardware that people have anyway gets a higher percentage of load.
MacBooks have a lot of memory and a lot of FLOPs. They mostly sit unused all day. Yes, the excess energy use will be higher than a GPU in a datacenter doing the same work, but you have to generate an absurd amount of tokens before the dollar-efficiency catches up with the MacBook.
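To make the break-even argument concrete, here is a back-of-the-envelope sketch. It assumes the laptop is a sunk cost (you own it anyway), so the only thing the datacenter has to pay back is its extra hardware cost, through lower energy use per token. The function name and every parameter are placeholders I made up; plug in your own figures, I'm not claiming any real ones here.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules


def break_even_tokens(
    extra_hardware_cost: float,
    laptop_joules_per_token: float,
    datacenter_joules_per_token: float,
    price_per_kwh: float,
) -> float:
    """Tokens needed before datacenter energy savings repay its hardware cost.

    Hypothetical model: the laptop is already paid for, so only the
    datacenter's hardware cost and the per-token energy difference matter.
    """
    extra_joules = laptop_joules_per_token - datacenter_joules_per_token
    if extra_joules <= 0:
        # The laptop is at least as energy-efficient, so there is no break-even.
        return float("inf")
    extra_cost_per_token = extra_joules / JOULES_PER_KWH * price_per_kwh
    return extra_hardware_cost / extra_cost_per_token
```

Because a joule costs so little, even a large per-token energy gap translates into a tiny dollar gap per token, which is why the break-even count ends up so large under this model.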