More

gbalduzzi · 2026-06-18T15:14:28 1781795668

What are the use cases of an LLM while walking or driving, that also require high reasoning?

WhitneyLand · 2026-06-18T17:20:49 1781803249

Most of the problem is that for voice chat, you usually get no reasoning at all and no tool use at all to research or ground assumptions.

For example for voice ChatGPT still uses a quantized gpt40 non-reasoning model that hallucinates pretty frequently. It also doesn’t do much automatic search for updated information and fact checking.

I usually don’t find I need high, usually DeepSeek v4 with medium reasoning is sufficient.

However if it’s important chat like brainstorming on complex topics I sometimes bump it up.

OpenAI has a new voice api that supports adjustable reasoning, but ChatGpt is not using it currently.

shostack · 2026-06-18T18:52:11 1781808731

With a sufficiently sophisticated harness you can actually do quite a lot by just talking to your AI. I have regularly dictated to build things on my phone while walking to lunch for example.

gbalduzzi · 2026-06-18T06:25:09 1781763909

Following the original comment concepts, if every model requires a different prompting technique to maximize its output, how can a benchmark based on sending the same prompt to all models be accurate? We should create different prompts for each model, but then how reliable and unbiased can the benchmark be?

It is a fundamentally hard problem to solve

gbalduzzi · 2026-06-17T21:29:13 1781731753

How often do you really need to look at that info while doing normal work?

Because to me and to the very vast majority of git users it is totally irrelevant.

It is nice that the info is available, but the more sane default would be to hide under a verbose flag not the other way around.

Imagine typing cd folder/ and have the whole filesystem subtree be displayed in the terminal. You are free to ignore it, but it is useless and inconvenient nonetheless

js2 · 2026-06-17T22:37:43 1781735863

Are you not a programmer? Do you not ever find yourself having to debug an issue? When you have to, are you not glad when there's sufficient information in the log files to do so, even though 99.9% of the time you never look at the logs?

> Imagine typing cd folder

It's not comparable. `cd` is a local command (technically a shell built-in) that completes instantly (unless you cd to a hung NFS mount...). So it honors the Unix philosophy of emitting nothing on success.

But cloning is a network operation. And it's normal for networking tools to output progress by default. See `wget` and `curl`.

The problem with hiding progress under `--verbose` is that by the time you need the information (why is this taking so long?), it's too late to add `--verbose`. You'd have to cancel the command and run it again losing progress.

If you don't want it, then use `--quiet` and move on with your day.

(Sure you could make it smarter by deferring outputting anything until it realizes the operation is taking some time. Patches welcomed.)

gbalduzzi · 2026-06-16T07:10:17 1781593817

you clearly have never read a 1000 word text written by me (/s, but only partially)

dgellow · 2026-06-16T07:23:30 1781594610

Honestly I would prefer to read a long text from a human that is badly written than a LLM version. It’s fine to not write well

gbalduzzi · 2026-06-16T07:08:57 1781593737

I think in this case the human effort was put into the actual discovery, honestly I don't mind if AI helped him write the blog post if the result is enjoyable and not sloppy

gbalduzzi · 2026-06-12T05:28:57 1781242137

> most of the bugs were in the JS code, so for most things, hot reloading is enough for E2E validation and to run just the right tests. No need to run a full build and test suite (which takes 10+ minutes); the CI can do this.

Have you tried adding this instruction to your agents.MD? Avoiding situations were the agent start running a loop is the main use case of the file for me

gbalduzzi · 2026-06-11T17:20:14 1781198414

I'm guessing the greatest reason behind each provider creating and agent harness is that (a) there is not a clear winner still and (b) it is harder to switch models with a competitor, as you also have to switch harness

jeremyjh · 2026-06-11T18:54:03 1781204043

But you really don't have to switch. MiMo Code has the same provider support as OpenCode.

Even Claude Code you can use with any provider that exposes an anthropic API endpoint, which they all do.

ghrl · 2026-06-11T19:35:43 1781206543

Or by using a proxy, yeah. Personally I would still prefer a multi provider harness over CC when using it with another provider, if alone for the visible reasoning, model switcher, cost estimation and so on. So far I've only preferred CC when I needed to work with Jupyter Notebooks because it has built-in tools for that.

gbalduzzi · 2026-06-10T20:58:29 1781125109

Aside from LLM architecture, that already is a complex issue, an issue is that training data is unstructured text.

An LLM able to structurally separate context and instructions, should logically need separated data to train, and we don't have it.

Moreover, while an equally powerful LLM architecture solving this may exists, there are no guarantees at all that we are able to come up with it in a reasonable timeframe.

Without some signals moving in that direction, the most pragmatic and realistic way of looking at the problem is that it will not be solved in the near future

airstrike · 2026-06-10T21:54:14 1781128454

Thanks, I appreciate the thoughtful reply.

I agree this doesn't mean we shouldn't try to address limitations with the current architecture. I just mean that I expect the root cause to be solved eventually if we ever really want to take steps towards AGI.

Regarding signals moving in that direction, here's a paper you might enjoy https://arxiv.org/abs/2503.21937

gbalduzzi · 2026-06-10T20:48:59 1781124539

User identity attached is not a solution, it doesn't solve anything if you have to pull in external data that you can't control.

Like in the banking world, you can make everything super authenticated, but if you have an API that receives the latest wire transfer YOU received with the message attached, you don't control the message content and it can be an attack vector.

Being authenticated/authorized is not the solution, it is data that the user can access.

gbalduzzi · 2026-06-09T07:31:35 1780990295

Aren't small local models worse efficiency-wise? It means that every person must have a powerful enough machine to power a small model, and we are very, very far away from that.

The best solution, from an efficiency point of view, is to use smaller models on datacenters, requiring much less of them.

pbmonster · 2026-06-09T08:12:28 1780992748

There's an efficiency sweet spot where hardware that people have anyway gets a higher percentage of load.

MacBooks have a lot of memory and a lot of FLOPs. They mostly sit unused all day. Yes, the excess energy use will be higher than a GPU in a datacenter doing the same work, but you have to generate an absurd amount of tokens before the dollar-efficiency catches up with the MacBook.

gbalduzzi · 2026-06-10T21:22:26 1781126546

You need to have a 3k dollar machine available though, I think you are overestimating how many people have access to it