Just think about it: a 1% error rate means roughly 1 in 100 customers gets some wrong information. And they go to the place trusting it, just to hear that the AI lied to them. Say you have 1,000 or 10,000 customers using the system; now you potentially have 10 or 100 one-star negative reviews... And this might be just answering simple queries like a restaurant menu or opening times.
No decently coded chatbot is going to respond with an incorrect restaurant menu or opening time. You'd call a function to return the menu from a database or the opening time from a database. At worst, the function fails, but it's not going to hallucinate dishes.
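To make that concrete, here's a rough sketch of what that lookup layer could look like (the table names, the get_menu/get_opening_hours helpers, and the sqlite setup are all made up for illustration). The bot only decides which function to call; the menu and hours come straight out of the database, so the worst case is an error message, not invented dishes:

```python
import sqlite3

# Hypothetical schema/table names (menu_items, opening_hours), just for illustration.
db = sqlite3.connect("restaurant.db")

def get_menu() -> list[str]:
    # Dish names come straight from the database; nothing is generated.
    return [row[0] for row in db.execute("SELECT name FROM menu_items")]

def get_opening_hours(day: str) -> str:
    row = db.execute(
        "SELECT opens, closes FROM opening_hours WHERE day = ?", (day,)
    ).fetchone()
    if row is None:
        raise LookupError(f"no hours stored for {day}")  # fail loudly rather than guess
    return f"{row[0]}-{row[1]}"

TOOLS = {"get_menu": get_menu, "get_opening_hours": get_opening_hours}

def run_tool(name: str, **kwargs) -> str:
    # The model only picks the tool name and arguments; the reply text is verbatim DB data.
    try:
        return str(TOOLS[name](**kwargs))
    except Exception as exc:
        return f"Sorry, I couldn't look that up ({exc}). Please contact the restaurant directly."
```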
Exactly this. Even for internal use. Our corp approved a small project where an NN does the analysis of our nightly test runs (our test suite takes a very long time to run). For now it classifies results into several existing broad categories. Product-type failures are usually the most important, and this should let us focus our efforts on them. But even a 1% false rate (it's actually in the double digits in real life) means that we, the QAs, need to verify all the results anyway (see the rough numbers below). So no time is saved, and this NN software is eh... useless.
There are other ideas for how to make it more useful, but my point is that a non-zero failure rate with unpredictable answers just doesn't work in many domains.
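For what it's worth, here's the back-of-the-envelope reasoning behind the "verify everything anyway" conclusion (the numbers in the snippet are made up; only the logic matters):

```python
# Made-up numbers, just to show why a "mostly right" classifier saves no review time.
nightly_results = 5000    # test results per nightly run (hypothetical)
product_failures = 50     # of those, real product-type failures (hypothetical)
false_rate = 0.01         # the claimed 1% misclassification rate

# Expected mistakes per night, assuming errors are spread uniformly:
missed = product_failures * false_rate                     # real failures filed under the wrong category
noise = (nightly_results - product_failures) * false_rate  # other results wrongly labeled as product failures

print(f"~{missed:.1f} real failures mislabeled and ~{noise:.0f} false alarms per night")
# You don't know *which* labels are the wrong ones, so QA still has to check everything.
```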
Yeah, a 1% error rate (IME in practice the error rate is _much_ higher than this if you care about detail, but whatever) just won't fly in most use cases. You're really talking about stuff which doesn't matter at all (people are rarely willing to pay very much for this) or where its output is guaranteed to be reviewed by a human expert (at which point, in many cases, well, why bother).
As programmers we complain about the ~1% of cases where copilot-type models produce terrible code. It's annoying, but you can live with it.
But for many other things, a 1% error rate with no bound on how terrible the hallucinated error can be is perhaps unworkable.
For example, Air Canada ended up liable for a discount which its AI chatbot made up.