Hacker News | ramity's comments

Let me start off by saying that I, and many others, have fallen into this pitfall. This is not an attack, but a good faith attempt to share painfully acquired knowledge. I'm actively using AI tooling, and this comment isn't a slight on the tooling but rather on how we're all seemingly putting the circle in the square hole and saying it fits.

Querying an LLM to output its confidence in its own output is a misguided pattern, despite being commonly applied. LLMs are not good at classification tasks, as the author states. They can "do" it, yes, perhaps better than random sampling can, but random sampling can "do" it as well. Don't get too tied to that example. The idea here is that if you are okay with something getting the answer wrong every so often, LLMs might be your solve, but this is a post about conforming non-deterministic AI to classical systems. Are you okay with your agent picking the red tool instead of the blue tool 1%, 10%, etc. of the time? Even if you are, you will never stop wrangling, and that's the reality often left unspoken when integrating these tools.

While tangential to this article, I believe it's worth stating that when interacting with an LLM in any capacity, remember your own cognitive biases. You often want the response to work, and while generated responses may look good and fit your mental model, it requires increasingly obscene levels of critical evaluation to see through the fluff.

For some, there will be inevitable dissonance reading this, but consider that these experiments are local examples. Their lack of robustness will become apparent with large-scale testing. The data spaces these models have been trained on are unfathomably large in both quantity and depth, but under/over-sampling bias (just to name one issue) will be ever present.

Consider the following thought experiment: you are an applicant for a job, submitting your resume with the knowledge that it will be fed into an LLM. Let's confine your goal to something very simple: make it say something. Let's oversimplify for the sake of the example and say complete words are tokens. Consider "collocations". [Bated] breath, [batten] down, [diametrically] opposed, [inclement] weather, [hermetically] sealed. Extend this to contexts. [Oligarchy] government, [Chromosome] biology, [Paradigm] technology, [Decimate] to kill. With this in mind, consider how each word of your resume "steers" the model's subsequent response, and consider how the data each model was trained on can subtly influence that response.

Now let's bring it home and tie the thought experiment back to confidence scoring in responses. Let's say it's reasonable to assume that the results of low-accuracy/low-confidence models are less commonly found on the internet than those of higher-performing ones. If that can be entertained, extend the argument to confidence responses: maybe the term "JSON", or any other term used in the model input, is associated with high confidences.

Alright, wrapping it up. The point here is that the confidence value provided in the model's output is not the likelihood that the answer in the response is correct, but rather the most likely value to follow the stream of tokens in the combined input and output. The real sampled confidence values exist closer to the code, and they are limited to individual tokens, not series of tokens.
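
To make that last point concrete, here is a minimal sketch of what I mean by "closer to the code", using Hugging Face transformers with gpt2 purely as a stand-in model (hosted APIs expose the same idea as per-token logprobs):

    # Minimal sketch (not from the article): reading per-token probabilities
    # straight from a local model's logits, with gpt2 as a stand-in.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer('The label is "spam" with confidence 95%', return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits           # (1, seq_len, vocab_size)

    # Probability the model assigned to each actual token, given its prefix.
    probs = torch.softmax(logits[0, :-1], dim=-1)
    for tok_id, dist in zip(inputs["input_ids"][0, 1:], probs):
        print(f"{tokenizer.decode(int(tok_id))!r}: p={dist[int(tok_id)].item():.4f}")

Note that the "95%" in that text is itself just a couple of tokens with their own probabilities; it isn't a measurement of anything.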


"when interacting with an LLM in any capacity, remember your own cognitive biases. You often want the response to work, and while generated responses may look good and fit your mental model, it requires increasingly obscene levels of critical evaluation to see through the fluff."

100% this.

Idk about the far-out takes where "AI is an alien lifeform arrived into our present", but the first thing we know about how humans relate to extraterrestrials is: "I want to believe".


Contrasting take: RTT measurements from a service providing black-box knowledge are not equivalent to knowledge of the backbone. Assuming traffic is always efficiently routed seems dubious at a global scale; the supporting infrastructure of telecom is likely shaped by the volume/size of traffic rather than by shortest paths. I'll confess my evaluation here might be overlooking some details. I'm curious about others' thoughts on this.


They don't have to assume that traffic is efficiently routed. On the contrary, if they can see a <1ms RTT from London to a server, the speed of light guarantees that that server is not in Mauritius EVEN if the traffic were routed with perfect efficiency.

It just can't be outside England: a single 0.4ms RTT, as seen here, is enough to be certain that the server is less than 60 km away from London (0.4ms at 300 km/ms is at most 120 km of round-trip path), or wherever their probe was (they don't actually say, just the UK).

RTT from a known vantage point gives an absolute maximum distance, and if that maximum distance is too short then that absolutely is enough to ascertain that a server is not in the country it claims to be.


We've got detailed global ping data here: https://wondernetwork.com/pings

One of our competitors was claiming a server in a Middle Eastern country we could not find any hosting in. So I figured out what that server's hostname was to do a little digging. It was >1ms away from my server in Germany.


I see I was mistaken, but I'm tempted to continue poking holes. Trying a different angle, though it may be a stretch: could a caching layer within the VPN provider cause these sorts of "too fast" RTTs?

Let's say you're a global VPN provider and you want to reduce as much traffic as possible. A user hits the entry point of your service to access a website that's blocked in their country. For the benefit of this thought experiment, let's say the content is static/easily cacheable, or that because the user is testing multiple times, the dynamic content becomes cached. Could this play into the results presented in this article? Again, I know I'm moving the goalposts here, but I'm just trying to be critical of how the author arrived at their conclusion.


This is about ping though, so presumably ICMP packets. There is no content to cache as the request is sent with random data that must be sent back in the reply.
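
As a rough illustration (my own sketch, not from the article), each echo request carries fresh random bytes that the far end has to echo back, so there is nothing a cache could answer with. With scapy, using a placeholder target address:

    # Rough sketch (requires root): the echo reply must carry back the exact
    # random payload we sent, so a cached response cannot satisfy it.
    import os
    import time
    from scapy.all import IP, ICMP, Raw, sr1

    payload = os.urandom(32)                       # fresh random bytes per probe
    start = time.perf_counter()
    reply = sr1(IP(dst="198.51.100.1") / ICMP() / Raw(load=payload),
                timeout=2, verbose=False)
    rtt_ms = (time.perf_counter() - start) * 1000  # rough; includes scapy overhead

    if reply is not None and Raw in reply and reply[Raw].load == payload:
        print(f"echo reply carried our payload back, RTT ~{rtt_ms:.1f} ms")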

It is very unlikely that VPN providers use convoluted caching systems just to make their ping replies appear to come from a different region than the one they claim to be in. It would be much more likely for them to add a little latency to their responses to make them more plausible, instead.


Assuming a secure connection, this isn't possible without terminating TLS and re-negotiating.


The speed of light puts a limit on distance for a given RTT. Taking the examples in the article, which are less than 0.5ms, and considering the speed of light (300 km/ms), the measured exit countries must be accurate.

The speed of light in fiber, which probably covers most of the distance, is even slower due to refraction (about 2/3 of c).
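
To put rough numbers on it (my own back-of-the-envelope sketch, not from the article):

    # Back-of-the-envelope upper bound on one-way distance from a measured RTT.
    C_VACUUM_KM_PER_MS = 300.0      # speed of light in vacuum, ~300 km/ms
    FIBER_FACTOR = 2.0 / 3.0        # light in fiber travels at roughly 2/3 c

    def max_one_way_km(rtt_ms: float, in_fiber: bool = True) -> float:
        """Hard upper bound on how far away the other end can possibly be."""
        speed = C_VACUUM_KM_PER_MS * (FIBER_FACTOR if in_fiber else 1.0)
        return speed * rtt_ms / 2   # the RTT covers the distance twice

    for rtt in (0.4, 0.5, 1.0):
        print(f"RTT {rtt} ms -> at most {max_one_way_km(rtt, in_fiber=False):.0f} km "
              f"in vacuum, {max_one_way_km(rtt):.0f} km over fiber")

London to Mauritius is roughly 9,700 km great-circle, which would need an RTT of about 97 ms over fiber even with perfectly straight routing.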


Thanks for your informative reply. I see now that I was approaching this incorrectly. I was considering drawing conclusions from a high RTT rather than from an RTT so small it would be impossible to have covered the distance.


We (I work for IPinfo) talk about latency because it is a thread you can pull on when exploring the full depth of our data.

We are the internet data company, and our ProbeNet represents only a fraction of our investment. Through our ProbeNet, we run ping, traceroute, and other active measurements. Through traceroute alone we learn a great deal about global network topology. There are dozens and dozens of data hints.

We are tapping into every aspect of internet data possible. We are modeling every piece of data that is out there, and through research we are coming up with new sources of data. IP geolocation is only one product for us. Our business is mapping internet network topology.

We are hoping to partner with national telecoms, ISPs, IXPs, and RIRs, guiding and advising them on data-driven internet infrastructure mapping.


> I'll confess my evaluation here might be overlooking some details.

Yeah like... physics. If you're getting sub-millisecond ping times from London you aren't talking to Mauritius.


Edit: Apple uses many predictive systems for typing. My sentiment in pointing at slide to type alone might be misguided, as it does not exist in a vacuum. I'd love to see these tests redone with slide to type disabled. I'm leaving the original comment below for reference.

Slide to type. This "issue" is at most 6 years old for iOS users.

Turn off slide to type if you do not use it. Slide to type does key resizing logic. This is the direct cause of this issue. Please upvote this comment for visibility.

Please reply if you think I'm wrong. I see this get posted frequently enough that I'm actually losing it.

Please refer to https://youtu.be/hksVvXONrIo?si=XD7AKa8gTl85_rJ6&t=72 (timestamp 1:12) to see that slide to type is enabled.


I have that feature off and I am making noticeably more typing errors since the glass update.


I'm on an iPhone 12 Mini and always thought this issue was because it's kind of old. But I've seen this issue for at least 3 major iOS generations now, and I'm currently on 26.X


13 mini here and it’s definitely just since the glass update for me.


I don't use the slide feature and typing quality has gone downhill ever since iOS 17 or thereabouts IMO.


Doesn’t.helpmme At.all


I'll give this a try. My typing is better when I use slide to type but I'm still super uncomfortable with it (I feel anxious trying to think of the letters "fast enough" even though I know it doesn't matter).

FWIW I've felt my phone typing accuracy has gotten worse every single year for, whatever, almost 20 years now. That's not the case on the computer.


I almost exclusively use slide to type and what I do is not think about the letters, but about the motions I would have done if I was typing with my hands on a regular keyboard, sort of letting muscle memory take over and create the correct “shape” of the word without thinking too hard about it.


Peak swipe-to-text was on my HTC Desire circa 2010 using the third-party keyboard Swype. Everything since then has been a downgrade.


I remember when Swiftkey first launched on Android, the swipe-to-text was extremely good and the built-in "learning by itself" dictionary worked well too. Of course, it seems like Microsoft at one point bought it, so I don't even have to try it again to understand the current state of it.


I still refer to doing it on iPhone as swyping. The portmanteau has permanently genericized in my brain. Those were the days!


I have this disabled and the problem clearly exists anyway.


Key resizing has been in the iPhone since day 1. It has nothing to do with slide to type, even if slide to type may affect key sizing.

But the video clearly shows this isn’t key sizing given that they show U is selected in the keyboard UI, but j is input into the text.


General -> Keyboard -> Slide to Type

I don't have an issue with typing on iPhone, but I just disabled it to see what happens.


> Slide to type does key resizing logic.

It might be different with slide-to-type enabled, but the iPhone always invisibly resizes key hitboxes using predictions about which key you want to press next. This can't be disabled, and has been part of the iPhone since the very first model. It's a really abysmal experience for something that's so crucial to a smartphone; Apple seems to be completely disconnected from how people use these.

Apple even used to advertise this on their own site. That video definitely exists somewhere on YouTube.


> the iPhone always invisibly resizes key hitboxes using predictions about which key you want to press next. This can't be disabled, and has been part of the iPhone since the very first model.

Yes. True.

> It's a really abysmal experience for something that's so crucial to a smartphone

Full disagreement here. I expect and enjoy the predictive hitboxes, and the issue I am experiencing is not about those. It is when I type, for example, the letter "T", I am certain I touched it correctly, and I am certain I _actually saw_ the letter "T" appear as pressed in the UI, yet when I look at the word I just typed, something else that was obviously not a "T" appeared.


I thought I had a neurological disorder. (My iPhone has auto-everything off. I'm not enabling slide to type for fun, but I don't exclude the possibility that iOS auto-enabled it when I changed the brightness or something, as it tends to do.)

About two years ago, my phone typing suddenly got extremely bad. Like, from an occasional error to about one typo every second sentence. No matter how carefully I type. The hardware didn't change, so it must be me, right?

Let me play with that setting, I hope you are right.


I feeeeeel like this helped me but didn’t solve the problem fully. Changed it like 2-3 weeks ago.


Thanks, I’ll try this :)


>Please upvote this comment for visibility.

Lol. Don't forget to hit that subscribe button!


If YouTube ever renames or even just moves that button, millions of videos will suddenly be “broken”.


This already happened when they got rid of the 5-star rating in favor of the like button. "Rate 5 stars and subscribe" became "Like and subscribe". People will adapt.


Still funny in old videos, or when they point to the right-hand side where the video info used to be.


No worries, they will also introduce an AI "rephrase" (no way to opt-out) which will "translate" these in real-time!


I didn't see any reference to a sender or to actively blasting RF from the same access point. I think the approach relies on other signal sources creating reflections that reach a passively monitoring access point, and on attempting to make sense of that.


5GHz WiFi has a wavelength of ~6cm and 2.4GHz ~12.5cm. Anything achieving finer resolution is a result of interferometry or a non-WiFi signal. Mentioning this might not add much substance to the conversation, but it felt worth adding.
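
For anyone who wants to sanity-check those numbers, it's just wavelength = c / f:

    # Sanity check: wavelength = speed of light / frequency.
    C = 299_792_458  # m/s

    for label, freq_hz in (("2.4 GHz", 2.4e9), ("5 GHz", 5.0e9)):
        print(f"{label}: ~{C / freq_hz * 100:.1f} cm")
    # 2.4 GHz -> ~12.5 cm, 5 GHz -> ~6.0 cm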


This resolution is probably enough, as they use human skeleton pose estimators and human movement pattern detectors too.


I'm interested but also incredibly dubious. Not because it seems impossible, but the opposite. On one hand, an open source repo like this that opens the approach up for hackable extension should be praised, but the "Why Built WiFi-3D-Fusion" section[0] gives me very, very bad vibes. Here are some excerpts I especially take issue with:

> "Why? Because there are places where cameras fail, dark rooms, burning buildings, collapsed tunnels, deep underground. And in those places, a system like this could mean the difference between life and death."

> "I refuse to accept 'impossible.'"

WiFi sensing is an established research domain that has long struggled with line-of-sight requirements, signal reflection, interference, etc. This repo has the guise of research, but it seems to omit the work of the field it resides in. It's one thing to detect motion or approximately track a connected device through space, but "burning buildings, collapsed tunnels, deep underground" are exactly the kinds of non-standardized environments where WiFi sensing performs especially poorly.

I hate to judge so quickly based on a readme, but I'm not personally interested in digging deeper or spinning up an environment. Consider this before aligning with my sentiment.

[0] https://github.com/MaliosDark/wifi-3d-fusion/blob/main/READM...


I really want to love Rust, and I understand the niches it fills. My temporary allegiance to it comes down to performance, but I'm drawn in by the crate ecosystem and the support provided by cargo.

What's so damning to me is how debilitatingly unopinionated it is in situations like error handling. I've used it enough to at least appreciate its advantages, but strongly hinting towards including a crate (though not required) to help with error processing seems to mirror the inconvenience of having to include an exception type in another language. I don't think it would be the end of the world if it came with some creature comforts here and there.


I'll provide a contrasting, pessimistic take.

> How do you write programs when a bug can kill their user?

You accept that you will have a hand in killing users, and you fight like hell to prove yourself wrong. Every code change, PR approval, process update, unit test, hell, even every meeting weighs heavier. You move slower, leaving no stone unturned. To touch on the pacemaker example: even buggy code that kills X% of users will keep Y% alive or improve their QoL. Does the good outweigh the bad? Even small amounts of complexity can bubble up and lead to unintended behavior. In a corrected vibrator example, what if the frequency becomes so large that it overflows and ends up burning the user? Youch.

The best insight I have to offer is that time is often overlooked and taken for granted. I'm talking Y2K-style data type issues, time drift, time skew, special relativity, precision, and more. Some of the most interesting and disturbing bugs I've come across all occurred because of time. "This program works perfectly fine, but after 24 hours it starts infinitely logging." If time is an input, do not underestimate it.

> How do we get to a point to `trust` it?

You traverse the entire input space to validate the output space. This is not always possible. In those cases, audit compliance can take the form of traversing a subset of the input space deemed "typical/expected" and moving forward with the knowledge that edge cases can exist. Even with fully audited software, oddities like a cosmic-ray bit flip can occur. What then? At some point, in this beautifully imperfect world, one must settle for good enough over perfection.
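
As a toy illustration of what "traverse the entire input space" can look like when the domain is small (the function, bounds, and units below are all made up for the example):

    # Toy sketch: exhaustively validating a small input space against a safety
    # envelope. pacing_interval_ms and its bounds are hypothetical.
    def pacing_interval_ms(sensor_reading: int) -> int:
        """Map an 8-bit sensor reading to a pacing interval (made-up logic)."""
        return 2000 - 5 * sensor_reading     # faster pacing at higher readings

    MIN_SAFE_MS = 300    # never pace faster than 200 bpm
    MAX_SAFE_MS = 2000   # never pace slower than 30 bpm

    violations = [(r, pacing_interval_ms(r))
                  for r in range(256)         # the whole 8-bit input space
                  if not MIN_SAFE_MS <= pacing_interval_ms(r) <= MAX_SAFE_MS]
    assert not violations, f"unsafe outputs: {violations}"

Once the input includes time, free text, or floating point, that exhaustive loop stops being an option, which is where the "typical/expected" subset compromise comes in.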

The astute reader of the above might be furiously pounding their keyboard, mentioning the halting problem: we can't even verifiably prove that a particular input will produce an output, much less an entire space.

> I am convinced that open code, specs and (processes) must be requirement going forward.

I completely agree, but I don't believe this will outright prevent user deaths. Having open code, specs, etc. aids accountability, transparency, and external verification. I must express that I feel there are pressures against this, as there is monumental power in being the only party able to ascertain the facts.


I know someone who developed medical devices, not as critical as pacemakers, and he kind of boasts that he probably killed (as in, caused the premature death of) some people, but also extended the lives of many, many more.


Tbh, it's the same kind of survival economics as invasive treatments like surgery anyway. No doctor can or should guarantee 100% survival.


That's actually the standard in automotive and industrial applications: likelihood of failure vs. consequences of failure, set the "acceptable" risk low, and show proof that you're not any higher than that level. Medical devices actually have a much stricter "contributes in any way to any patient harm" risk analysis.



elzbardico is pointing out that the author is having the confidence value generated in the output of the response, rather than it being the actual confidence of the output.


Is there research or solid knowledge on this?


This trick is being used by many apps (including GitHub Copilot reviews). The way I see it is that if the agent has an eager-to-please problem, then you give it a way out.


Thanks. I was talking about the confidence measure.


I too once fell into the trap of having an LLM generate a confidence value in a response. This is a very genuine concern to raise.

