ZihangZ's comments

ZihangZ · 2026-04-26T12:47:45

Small caution from using agents here: the useful chart is the one generated from code, tests, or traces, not the one the model draws from its own explanation.

I've had models produce very reasonable Mermaid diagrams that matched the intended design but not the actual program. It felt helpful until I realized I was reviewing the plan twice and the implementation zero times.

For PRs I'd rather render the diagram from the executable state machine itself — at least then drift in the chart means drift in behavior, and you can't review one without the other.
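A minimal Python sketch of that idea, assuming a hypothetical job state machine (the `State` values, event names, and the `TRANSITIONS` table are illustrative, not from any real project). The same table drives `step()` at runtime and `to_mermaid()` for the PR diagram, so the chart cannot show a transition the code doesn't have.

```python
from enum import Enum


class State(Enum):
    # Hypothetical states for illustration only.
    IDLE = "idle"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


# Single source of truth: (current state, event) -> next state.
TRANSITIONS = {
    (State.IDLE, "start"): State.RUNNING,
    (State.RUNNING, "finish"): State.DONE,
    (State.RUNNING, "error"): State.FAILED,
    (State.FAILED, "retry"): State.RUNNING,
}


def step(state: State, event: str) -> State:
    """Advance the machine; an unknown (state, event) pair raises KeyError."""
    return TRANSITIONS[(state, event)]


def to_mermaid(transitions: dict) -> str:
    """Render the transition table as a Mermaid stateDiagram-v2 block."""
    lines = ["stateDiagram-v2"]
    # Sort so the generated text is stable and diffs cleanly in review.
    for (src, event), dst in sorted(
        transitions.items(), key=lambda kv: (kv[0][0].name, kv[0][1])
    ):
        lines.append(f"    {src.name} --> {dst.name}: {event}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_mermaid(TRANSITIONS))
```

One option is to regenerate the diagram in CI and fail if it differs from the committed copy, which keeps the chart and the behavior in lockstep.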

ZihangZ · 2026-04-26T12:16:13

+1 to the CI/isolation point. That is the part that makes these setups work for me too: make the failure cheap to reproduce, make stderr visible, make the agent rerun the same command after the patch. A lot of bad agent behavior is really just "it never got a clean signal".

The part that still bites me is across sessions. A tight loop fixes this run, but next week the agent can walk into the same rake again: same wrong import path, same misuse of an internal API, same CI-only dependency issue. After patching the same class of failure a few times, I started writing those down outside the chat context so the next run sees the failure pattern before it guesses.
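One way to keep those notes outside the chat context is a small file of regex-to-advice pairs that a wrapper checks whenever the repro command fails, so the advice lands next to the visible stderr. This is a sketch: the JSON layout and the helper names `record_note` and `run_with_notes` are hypothetical.

```python
import json
import re
import subprocess
from pathlib import Path


def load_notes(path: Path) -> list:
    """Read recorded failure patterns; a missing file means no notes yet."""
    if not path.exists():
        return []
    return json.loads(path.read_text())


def record_note(path: Path, pattern: str, advice: str) -> None:
    """Append a regex for a failure signature plus the fix that worked."""
    notes = load_notes(path)
    notes.append({"pattern": pattern, "advice": advice})
    path.write_text(json.dumps(notes, indent=2))


def run_with_notes(cmd: list, notes_path: Path) -> tuple:
    """Run cmd, keep stderr, and attach advice for every matching pattern."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    advice = []
    if proc.returncode != 0:
        for note in load_notes(notes_path):
            if re.search(note["pattern"], proc.stderr):
                advice.append(note["advice"])
    return proc.returncode, proc.stderr, advice
```

The returned stderr and advice would then be fed to the agent together, so the known failure pattern is visible before it starts guessing.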

ZihangZ · 2026-04-25T05:58:43

Built a ROS 2 cycling helmet. IMU + GPS, but we didn't fuse them, on purpose.

Speed is just haversine between GPS fixes. IMU only does turn detection and crash/fall. No EKF. Under bridges or urban canyons I'd rather have speed go stale and drop to zero after a few seconds than have a filter keep extrapolating from IMU bias and tell the rider they're still doing 20 km/h.
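A sketch of that speed path in Python, assuming fixes arrive as (time, lat, lon) and reading "a few seconds" as a hypothetical `stale_after_s = 3.0`. The `Fix` and `GpsSpeed` names are illustrative, not taken from the helmet's code.

```python
import math
from dataclasses import dataclass
from typing import Optional

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


@dataclass
class Fix:
    t: float    # seconds
    lat: float  # degrees
    lon: float  # degrees


class GpsSpeed:
    """Speed from consecutive GPS fixes; no IMU, no extrapolation."""

    def __init__(self, stale_after_s: float = 3.0):  # hypothetical timeout
        self.stale_after_s = stale_after_s
        self.prev: Optional[Fix] = None
        self.last_speed = 0.0

    def on_fix(self, fix: Fix) -> None:
        if self.prev is not None and fix.t > self.prev.t:
            d = haversine_m(self.prev.lat, self.prev.lon, fix.lat, fix.lon)
            self.last_speed = d / (fix.t - self.prev.t)
        self.prev = fix

    def speed_mps(self, now: float) -> float:
        # Once fixes stop arriving, report 0 instead of guessing.
        if self.prev is None or now - self.prev.t > self.stale_after_s:
            return 0.0
        return self.last_speed
```

Returning 0.0 after the timeout, rather than holding the last value or integrating the IMU, is the "stale is better than confidently wrong" choice described above.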

Other thing: safety stays below ROS. Crash/fall runs on the MCU next to the IMU; ROS just subscribes to /safety/event. Pi reboots, helmet still alarms.

How does FusionCore handle long GPS outages: gate the output, or keep predicting?

kharwarm · 2026-04-26T01:32:29

It keeps predicting. During a GPS outage, FusionCore just dead-reckons off the IMU and wheel encoders, so the output stream stays continuous.

Covariance inflates over time as uncertainty builds... there’s no output gating. The Mahalanobis gate is only used on incoming measurements, so it’ll reject bad GPS fixes (like multipath spikes), but it doesn’t suppress the state estimate itself.

If the robot is stationary during an outage, ZUPT kicks in and drift stays close to zero. If it’s moving without GPS, then drift is entirely a function of IMU and encoder quality... which, for something like a helmet, is probably going to degrade pretty quickly after ~30 seconds.
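A one-dimensional caricature of the behavior described above: continuous prediction with growing variance, a Mahalanobis gate applied only to incoming measurements, and a ZUPT pseudo-measurement when stationary. This is not FusionCore's code; the filter structure, the process-noise value `q`, and the 99% chi-square threshold are assumptions chosen for illustration. In 1-D the squared Mahalanobis distance reduces to y²/S.

```python
from dataclasses import dataclass

CHI2_1DOF_99 = 6.635  # 99% chi-square quantile, 1 degree of freedom


@dataclass
class Scalar1DFilter:
    """Toy 1-D velocity filter: v is the estimate, p its variance."""
    v: float = 0.0
    p: float = 1.0
    q: float = 0.5  # process-noise growth per second (hypothetical value)

    def predict(self, accel: float, dt: float) -> None:
        # Dead-reckon from acceleration; variance only grows here.
        self.v += accel * dt
        self.p += self.q * dt

    def update(self, z: float, r: float, gate: bool = True) -> bool:
        """Fuse measurement z with variance r; return False if gated out."""
        y = z - self.v   # innovation
        s = self.p + r   # innovation variance
        if gate and y * y / s > CHI2_1DOF_99:
            return False  # reject the measurement, keep predicting
        k = self.p / s   # Kalman gain
        self.v += k * y
        self.p *= 1.0 - k
        return True

    def zupt(self, r: float = 1e-4) -> None:
        # Zero-velocity update: when a stillness detector fires, fuse v = 0.
        self.update(0.0, r, gate=False)
```

A downstream consumer that prefers the helmet's stance could read `p` and treat the estimate as unavailable above some threshold, while the filter itself keeps publishing.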

Your architecture is interesting to me. Letting speed go stale as an intentional safety signal (with the MCU handling crash logic below ROS) makes sense when “wrong but confident” is worse than no signal at all. FusionCore takes the opposite stance: never stop publishing, and let covariance communicate uncertainty to downstream consumers. For a cycling helmet... where false confidence could be dangerous... your approach is probably the safer call. For a robot that needs to keep navigating through something like a tunnel, FusionCore’s approach makes more sense.

Out of curiosity... what does your system do if GPS is lost for more than ~10 seconds while the device is moving? Does the MCU fall back to accelerometer-only crash detection, or does it just wait for GPS to come back?

ZihangZ · 2026-04-25T04:49:00

Yeah, this is pretty common once a device has any real DSP in it. There's usually some stripped-down Linux on an ARM SoC underneath, and the vendor BSP just happens to ship with sshd on.

Not necessarily malice, more like nobody on the audio side really owns the rootfs.

The big question is whether it's only listening on the USB-side network, or on the actual LAN. First one is annoying. Second one would actually bother me.
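A quick way to answer that from another machine is to try a TCP connection to port 22 on each of the device's addresses. A minimal Python sketch; the addresses in the `__main__` block are hypothetical placeholders, and a successful connect only shows that something is listening, not that it is sshd (a scanner such as nmap gives more detail).

```python
import socket


def tcp_port_open(host: str, port: int = 22, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


if __name__ == "__main__":
    # Hypothetical addresses: substitute the device's USB-side and LAN IPs.
    for label, addr in [("usb", "192.168.7.2"), ("lan", "192.168.1.50")]:
        print(label, addr, tcp_port_open(addr))
```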

hhh · 2026-04-25T05:39:48

It is listening on the LAN. It only connects over wifi when you use certain features, so I didn’t test whether that interface is listening as well.

ZihangZ · 2026-04-25T06:11:09

Yeah, LAN is the line for me. USB-side sshd is a weird dev leftover; LAN means it’s now in the home threat model.

surajrmal · 2026-04-25T15:34:02

Linux defaults are unfortunately not great for production devices of this nature. Android, by comparison, ships with three default build types: eng, userdebug, and user. Having these preconfigured defaults makes this sort of mistake easy to avoid.

ZihangZ · 2026-04-23T06:40:28

I don't think SSH vs OpenTofu is the core issue here.

For agents, declarative plans are still valuable because they are reviewable. The interesting question is whether exe.dev changes the primitive: resource pools for many isolated VM-like processes, or just nicer VPS provisioning.

poly2it · 2026-04-23T06:41:45

It doesn't do either at competitive rates by the looks of it.

ZihangZ · 2026-04-18T10:50:11

This matches my experience. When I write the boring glue code myself, I get a map of the project in my head.

When I let an agent write too much of the structure, the code may work, but a week later every small change starts with "where did it put that?"

ZihangZ · 2026-04-17T12:09:01

This matches what I've seen too — the hallucination gets much worse when the loop has no external verifier. "Does this board work?" has no ground truth inside the model, so it defaults to optimistic narration.

What OP is doing here is actually the mitigation: SPICE + scope readout is a verifier the model can't talk its way past. The netlist either simulates or it doesn't, the waveform either matches or it doesn't. That closes the feedback loop the same way tests close it for code.

The failure mode that remains, in my experience, is a layer down: when the verifier itself errors out (SPICE convergence failure, missing model card, wrong .include path), the agent burns turns "reasoning" about environment errors it has seen a hundred times.That's where most of the token budget actually goes, not the design work.
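One mitigation is to triage verifier failures before the agent sees them, so known environment errors get a canned first response instead of fresh "reasoning". A sketch; the category names, regexes, and hints are hypothetical and would need tuning against real simulator logs, since exact error wording varies by tool and version.

```python
import re
from typing import Optional, Tuple

# Illustrative signatures only; tune against real simulator output.
ENVIRONMENT_PATTERNS = {
    "convergence": re.compile(r"timestep too small|singular matrix|no convergence", re.I),
    "missing_model": re.compile(r"unknown model|can'?t find model", re.I),
    "bad_include": re.compile(r"\.include.*(not found|no such file)", re.I),
}

# Canned first responses, so the agent doesn't re-derive them every session.
HINTS = {
    "convergence": "check simulator options and initial conditions before editing the circuit",
    "missing_model": "confirm the device model file is installed and referenced",
    "bad_include": "check that the .include path exists where the simulator runs",
}


def triage(returncode: int, log: str) -> Tuple[str, Optional[str]]:
    """Classify a verifier run as 'environment', 'unclassified', or 'ok'."""
    for kind, pattern in ENVIRONMENT_PATTERNS.items():
        if pattern.search(log):
            return "environment", kind
    if returncode != 0:
        return "unclassified", None
    return "ok", None
```

Only "unclassified" failures and waveform mismatches would then go to the agent as open problems; `HINTS[kind]` can be prepended to its prompt for the rest.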

jddj · 2026-04-17T12:12:54

What throws me about this comment is the missing space between the period and the T in the last sentence.

Did the model itself do that? Was it a paste error?

svnt · 2026-04-17T15:33:27

I’ve also noticed Gemini and Claude occasionally mixing terms recently (eg revel vs reveal) and can’t decide whether it is due to cost optimization effects or some attempt to seem more human.

I can’t recall either of them using a wrong word in a long time before this month.

lambda · 2026-04-17T15:47:17

Or just because mistakes are part of the distribution that it's trained on? Usually the averaging effect of LLMs and top-k selection provides some pressure against this, but occasionally some mistake like this might rise up in probability just enough to make the cutoff and get hit by chance.
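A toy illustration of that mechanism, with made-up logits: under top-k sampling, a misspelled token that ranks inside the top k keeps a small but nonzero chance of being drawn, while anything below the cutoff never appears. The `top_k_sample` helper and the numbers are hypothetical, not any model's actual decoder.

```python
import math
import random


def top_k_sample(logits: dict, k: int, rng: random.Random, temperature: float = 1.0) -> str:
    """Sample one token from the k highest-scoring candidates (softmax over those k)."""
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    m = max(score for _, score in top)  # subtract the max for numerical stability
    weights = [math.exp((score - m) / temperature) for _, score in top]
    return rng.choices([tok for tok, _ in top], weights=weights, k=1)[0]


# Made-up scores for illustration only.
LOGITS = {"reveal": 5.0, "revel": 2.0, "rebel": -3.0}
```

With these numbers and k = 2, "revel" gets e²/(e⁵ + e²) = 1/(1 + e³), roughly 4.7% of draws, while "rebel" falls outside the cutoff and is never sampled.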

I wouldn't really ascribe it to any "attempt to seem more human" when "nondeterministic machine trained on lots of dirty data" is right there.

svnt · 2026-04-17T15:56:46

Sure, but if that were the case, why has it gotten worse recently? I would expect it to be a result of cost optimization or tradeoffs in the model. I suppose it could be an indicator of the exhaustion of high-quality training data or of a model architecture limitation. But this specific example, revel vs reveal, is almost like going back to GPT-2 Reddit errors.

I also don’t want to pretend there is no incentive for AI to seem more human by including the occasional easily recognized error.

lambda · 2026-04-17T16:31:09

Or just the models are getting bigger and better at representing the long tail of the distribution. Previously errors like this would get averaged away more often; now they are capable of modelling more variation, and so are picking up on more of these kinds of errors.

svnt · 2026-04-17T19:33:35

That makes sense, but what is the solution?

jddj · 2026-04-17T16:35:38

Looking at the account's other comment, there are subtle grammatical errors in that one too.

Would be good to see the prompt out of morbid curiosity