Hacker News | mrandish's comments

> <!-- EXACT setup from working simple-test.html -->

All LLM-kind would be vastly improved if the words "exact" and "brilliant" were nerfed to hell in their pre-training weights or even just removed from their training distributions entirely. Virtually nothing outside of mathematics is "exact", and virtually nothing outside of colors should be described as "brilliant".


Yeah, that was my guess too. Still a little disappointing to see fellow HNers reflexively fanboying a company that's the overwhelmingly dominant player in LLM coding.

I try to reserve my reflexive fanboy company/project votes for underdogs who need and deserve the help.


> But you know what my coworker asks? “Test Y theory.”

It still surprises me when I see people not prompting more specifically and clearly. It not only avoids problems, it's faster, costs less, and just works better.

I recently shared with a friend a multi-hour LLM chat session I'd done because it veered into a domain he's interested in. In the session I'd brainstormed and probed the feasibility of a novel concept for a new research direction. It traversed a half dozen domains diving into minute detail then zooming back out to survey an adjacent space, interspersed with intense skeptical probing of key assumptions, all while spewing tons of detailed citations, specific paragraph pulls, summarized data tables etc.

My friend is very experienced using LLMs for research so I was surprised when he called me shocked by the sheer velocity, precise targeting and signal/noise. I'd assumed everyone did it the same as I do. He attributed the different result solely to the way I crafted my prompts.


I used to write detailed prompts. Now I find the benefits of strategic ambiguity — rather than speaking imperatively, I emphasize my vision and then Claude can often figure out a method.

This doesn’t always work better. But often enough.


That's actually what I do too. What I was trying to say is that my prompts are precise in the sense that, whether they're deliberately ambiguous or hyper-detailed and highly directive, the choice is always intentional and meant to steer the response in the direction I want. The difference can have a significant impact, as shown in research on how LLMs naturally mirror users' prompts.

I noticed this last year and started experimenting, which led to several realizations about how my prompt's tone, style, length, format, word choices and even punctuation can have a very counter-intuitive impact on model responses. It's not that one strategy always gets "better" results; they're just different in specific ways, which can make one input style better for one context but worse for another. I first noticed this effect when modding my user prompt so major topic headings would always be numbered. It's surprisingly difficult to get the model to reliably use the same simple scheme due to various potential ambiguities. So, I spent a little time word-smithing, lawyering and tuning the prompt, but I found the closer I got to full compliance on heading numbering, the more unrelated things would drift. Like it would just stop using bullets, even though I never mentioned anything about bullets.

Then I changed the prompt to "Change nothing about your default formatting, except headings." But just mentioning anything related to formatting could suddenly cause unintended effects on seemingly unrelated things. Then I tried being explicitly directive about all formatting to just lock it down. And this completely failed because once the formatting was perfect, I started noticing the model's output would get less intelligent much earlier in sessions. So I cleared my user prompt entirely as it wasn't worth the cognitive cost on the model or my time. A few days later in a long session I noticed it was numbering everything perfectly with no prompt at all. When I scrolled back through, I saw it didn't start out numbering its responses. It started doing it because I was consistently numbering every major concept in my inputs, even though I never mentioned numbering or formatting.

So... yeah, subtle differences in prompts, which absolutely shouldn't matter, do impact model output in unexpected ways. And, as of now, these effects can only be fully suppressed with strong directive prompts for short periods, but doing so always impacts other unrelated things and has some cognitive impact on model performance. So, by paying a little attention, I've discovered ways to optimize a model's output in the direction I need by shifting not only my prompt's explicit directives but also the subliminal meta-elements like tone, style, length, structure, formatting, etc.


Yeah, I find the back and forth with Claude is often better than trying to front load everything in a massive and detailed prompt.

The counter-intuitive nature of LLMs is simultaneously so interesting and so frustrating. Overloading a single prompt definitely can create challenges remarkably similar to human short-term memory limits and attentional drift.

LLMs gain so much knowledge and capability from absorbing the symbolic relationships embedded in human language but in doing so, inevitably absorb many of the human foibles, sensitivities and weaknesses reflected in our languages.


> we can't read even a tiny fraction of what gets posted here

I'll bet it's exhausting, but your note did make me ponder: If a soul was condemned to the eternal torment of reading nothing but the user posts of one social media site, HN would be a pretty excellent choice. I shudder to think of the alternatives.


Re: AI OS integration: I recently retired so most of my LLM use is just implementing and fixing fairly mundane OS and networking things along with light scripting for OS automation (AHK) and Home Assistant. So far, I just use web chat and cut-paste to the OS, which is fine for little things but it starts to suck after the 15th round back and forth. For example, debugging intermittent Windows crash logs on my wife's laptop: copy multi-line PowerShell incantations from the browser chat window, paste into a PowerShell window. Cut multi-line error messages back to the browser. Rinse / Repeat.

I'm leery about just giving an LLM free run of my laptop, but with reasonable restrictions on which app(s) it can access and how many steps it can do before checking in, and maybe even a throttle on how fast it works, I'd be fine (I'm not in a hurry and I can learn by watching it work at double-speed). It doesn't have to be mil-spec locked down. It's not like I have production code accessible or millions in crypto keys; the biggest downside would be a few hours hosing out and restoring the laptop, which would be annoying but not the end of the world.
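
To make those three restrictions concrete, here's a rough Python sketch of the kind of gate I have in mind. It's purely illustrative: `AgentGuard`, its fields, and the "allow" / "checkin" / "deny" verdicts are names made up for this example, not the API of any real agent harness or Windows feature.

```python
import time
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class AgentGuard:
    """Hypothetical policy gate an agent harness could consult before each action."""

    allowed_apps: Set[str]                  # apps the agent may touch
    steps_per_checkin: int = 5              # actions allowed before a human check-in
    min_seconds_between_steps: float = 0.0  # throttle on how fast it works
    _steps_since_checkin: int = field(default=0, init=False)
    _last_step_time: Optional[float] = field(default=None, init=False)

    def authorize(self, app: str) -> str:
        """Return "deny", "checkin", or "allow" for one proposed action."""
        if app not in self.allowed_apps:
            return "deny"
        if self._steps_since_checkin >= self.steps_per_checkin:
            return "checkin"
        # Throttle: sleep until the minimum gap since the last allowed step has passed.
        if self._last_step_time is not None:
            elapsed = time.monotonic() - self._last_step_time
            remaining = self.min_seconds_between_steps - elapsed
            if remaining > 0:
                time.sleep(remaining)
        self._last_step_time = time.monotonic()
        self._steps_since_checkin += 1
        return "allow"

    def approve_checkin(self) -> None:
        """Called after the human has reviewed progress; resets the step budget."""
        self._steps_since_checkin = 0


if __name__ == "__main__":
    guard = AgentGuard(allowed_apps={"powershell"}, steps_per_checkin=2)
    for app in ["regedit", "powershell", "powershell", "powershell"]:
        print(app, "->", guard.authorize(app))
```

With a budget of two steps, the third PowerShell request comes back as "checkin" until `approve_checkin()` is called. A real harness would also need to enforce the verdicts rather than trust the agent to respect them, which is the hard part this sketch skips.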

I get those that say, "just spin up a VM and run it there", but I 'spin up a VM' rarely enough that the versions have changed and UXs drifted enough that it's exactly the kind of thing I'd actually want the LLM's help to do without me being a cut-paste bot. I'm mostly Windows at the moment and I don't understand why MSFT insists on spamming LLM features everywhere except the one place I'd not only use it, but pay for it. The usage model could be as simple and intuitive as a Zoom remote desktop share with a collaborator. That's already constrained and users have a mental model for the interaction pattern.

I asked Gemini earlier today to search recent user reviews of the latest 'drive my Windows desktop for me' tools and it reported that the capability is still slow, expensive, and prone to getting lost navigating the interface or interpreting window boundaries etc.

Anyone have any suggestions for my lightweight, casual use case?


Yeah unironically just let an agent harness rip with full admin access without monitoring anything it does or using a VM. It’ll be fine, probably. “How I Learned To Stop Worrying And Love the AgentDOS and Only Exfiltrate Secrets Occasionally”

I heard they actually changed to this wording from the original, which for a long time was "Don't be evil."

https://en.wikipedia.org/wiki/Don%27t_be_evil


You heard? It was a fairly big controversy when they did get around to removing it.

Yep, I remember seeing the headline. I clicked in, read the sub-head and first few sentences, hit the back button, and moved on, having duly noted the passing of one more milestone in Google's long descent. The reasons why it sucks, why they eventually did it and the vague, implausible PR justifications for doing it were all self-derivable from the headline and sub-head.

I doubt I missed anything of significant substance, but didn't want to assert factual knowledge, so I just linked the Wikipedia article (which I also didn't read into). Don't interpret my skipping the rumination step with excusing or dismissing Google's decline. I don't need to rubber-neck every step of a slow-mo, multi-year train wreck to lament that it happened and update my priors regarding Google.


Yes, it's always been published as a joke. You've explained why it was (and still is) funny meta-commentary on AI benchmarks.

As often happens with random oddball things which become traditions in web communities, the replies asking what it is or complaining about it begin to gain their own humor value.

It's evolved from a funny, unserious benchmark to a tradition. When a major new model is released, I now always check the HN thread for Simon's Pelican post. I'll be sad when I don't find it.

When it started, comparing the progress between models was mildly interesting but everyone (including Simon) acknowledges it certainly leaked into the training data long ago.


Hence it has become a meta-benchmark of relative progress in SVG image generation of a known target which has leaked into the training data and for which "every frontier AI team has/had a person at least partially dedicated to" at least checking if not optimizing.
