Hacker Newsnew | past | comments | ask | show | jobs | submit | ProofHouse's commentslogin

Is anyone a researcher here that has studied the proven ability to sneak malicious behavior into an LLM's weights (somewhat poisoning weights but I think the malicious behavior can go beyond that).

As I recall reading in 2025, it has been proven that an actor can inject a small number of carefully crafted, malicious examples into a training dataset. The model learns to associate a specific 'trigger' (e.g. a rare phrase, specific string of characters, or even a subtle semantic instruction) with a malicious response. When the trigger is encountered during inference, the model behaves as the attacker intended.You can also directly modify a small number of model parameters to efficiently implement backdoors while preserving overall performance and still make the backdoor more difficult to detect through standard analysis. Further, can do tokenizer manipulation and modify the tokenizer files to cause unexpected behavior, such as inflating API costs, degrading service, or weakening safety filters, without altering the model weights themselves. Not saying any of that is being done here, but seems like a good place to have that discussion.


> The model learns to associate a specific 'trigger' (e.g. a rare phrase, specific string of characters, or even a subtle semantic instruction) with a malicious response. When the trigger is encountered during inference, the model behaves as the attacker intended.

Reminiscent of the plot of 'The Manchurian Candidate' ("A political thriller about soldiers brainwashed through hypnosis to become assassins triggered by a specific key phrase"). Apropos given the context.


In that area, https://arxiv.org/html/2507.06850v3 was pretty interesting imo.

Scamthropic at it again

Yup. Avoid Google services especially map at all costs

That map pricing change stole literally months of my life having to rip out google from an app I spend months building it into. F Google

It tells you how bad their product management and engineering team is that they haven’t just decided to kill Siri and start from scratch. Siri is utterly awful and that’s an understatement, for at least half a decade.


Huge mistake. That’s what they specialize in though


AI but does it matter, no


Thanks for sharing your opinion


And it was a no brainer


Thank you! Shame all these big corps that do this forever. Meta #1, Apple # 2, psuedo fake journalists # 3


God bless you. Wishing you joy.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: