"We have demonstrated that it is impossible to describe all aspects of physical reality using a computational theory of quantum gravity," says Dr. Faizal. "Therefore, no physically complete and consistent theory of everything can be derived from computation alone. Rather, it requires a non-algorithmic understanding, which is more fundamental than the computational laws of quantum gravity and therefore more fundamental than spacetime itself."
Seems like quantum gravity theory might be missing something, no?
It's such a silly idea that whatever is simulating us would be in any way similar to, or care about, what's possible in our universe. It's like a Game of Life glider thinking it can't be simulated because someone would have to know what's beyond the neighbouring cells, and that's impossible! But the host universe just keeps chugging along, unimpressed by our proofs.
>If you're saying that people always try to game the system, whatever it is, then I agree however.
This isn't even true either. In the past there was a huge emphasis on, and effort made toward, character: going out of your way to do the right thing, being helpful, and NOT getting special treatment but choosing the difficult path.
Now everything is the opposite: it's about getting as much special treatment as possible and shirking as much responsibility as possible. And this isn't just individuals; it runs throughout the corporate and political system as well.
I'm aware. See, for instance, VC Arielle Zuckerberg's comment that when deciding which founders to fund she looks for "a little of the rizz and a little of the tis," with "rizz" referring to charisma and "tis" to autism.
One could argue that mythologizing a particular characteristic is itself a form of stigma.
"Which it be, ye scallywag, but nay—mark me well—’tis no tavern’s tall tale. Nay, ’tis truth carved in bone and blood, a story black as the Devil’s beard, yet truthful enough to make Death himself chuckle in his coffin."
(╯︵╰,)
/ \
₿
There has been plenty of evidence over the years. I don't have my bibliography handy right now, but you can find it by searching for sparse training or lottery ticket hypothesis papers.
The intuition is that ANNs make better predictions on high-dimensional data, that sparse-weight training can learn the sparsity pattern as you train the weights, that the effective part of dense models is actually sparse (cf. pruning/sparsification research), and that dense models grow too much in compute complexity to further increase model dimension sizes.
If you can share that bibliography I'd love to read it. I have the same intuition, and a few papers seem to support it, but more papers, and more explicit ones, would be much better.
What do you mean by "work better" here? If it's better accuracy, then no, they are not better at the same weight dimensions.
The big thing is that sparse models allow you to train models with significantly larger dimensionality, blowing up the dimensions by several orders of magnitude. That more dimensions lead to better results does not seem to be under a lot of contention; the open questions are more about quantifying that. It's simply not shown experimentally because the hardware is not there to train it.
> The big thing is that sparse models allow you to train models with significantly larger dimensionality, blowing up the dimensions by several orders of magnitude.
Do you have any evidence to support this statement? Or are you imagining some not yet invented algorithms running on some not yet invented hardware?
Sparse matrices can increase in dimension while keeping the same number of non-zeroes; that part is self-evident. Sparse-weight models can be trained: you are probably already aware of RigL and SRigL, and there is other related work on unstructured and structured sparse training. You could argue that those adapt their algorithms to be executable on GPUs and that none of them train at 100x or 1000x dimensions. Yes, that is the part that requires access to sparse compute hardware acceleration, which exists as prototypes [1] or is extremely expensive (Cerebras).
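For what it's worth, here's a minimal numpy/scipy sketch (mine, purely illustrative) of the first point: the dimension can grow by orders of magnitude while the number of non-zeroes, and hence the SpMV work, stays fixed.

```python
# Illustrative only: fixed non-zero budget, growing dimension.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
nnz = 100_000  # fixed budget of nonzero weights

for d in (10_000, 100_000, 1_000_000):
    rows = rng.integers(0, d, size=nnz)
    cols = rng.integers(0, d, size=nnz)
    vals = rng.standard_normal(nnz).astype(np.float32)
    W = sp.coo_matrix((vals, (rows, cols)), shape=(d, d)).tocsr()
    x = rng.standard_normal(d).astype(np.float32)
    y = W @ x  # SpMV work scales with nnz, not with d*d
    print(f"d={d:>9,}  stored nonzeros={W.nnz:,}  dense entries={d*d:,}")
```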
Unstructured sparsity cannot be implemented in hardware efficiently if you still want to do matrix multiplication. If you don’t want to do matrix multiplication you first need to come up with new algorithms, tested in software. This reminds me of what Numenta tried to do with their SDRs - note they didn’t quite succeed.
> Unstructured sparsity cannot be implemented in hardware efficiently if you still want to do matrix multiplication.
Hard disagree. It certainly is an order of magnitude harder to design hardware for sparse x sparse matrix multiplication, yes; it requires a paradigm shift to do sparse compute efficiently, but there are hardware architectures, both in research and commercially available, that do it efficiently. The same kind of architecture is needed to scale op-graph compute. You see solutions at the smaller scale in FPGAs and reconfigurable/dataflow accelerators, and at larger scale in Intel's PIUMA and Cerebras. I've been involved in co-design work of GraphBLAS on the software side and one of the aforementioned hardware platforms: the main issue with developing SpMSpM hardware lies more with the necessary capital and engineering investments being prioritized toward current frontier AI model accelerators, not with a lack of proven results.
All of the best open source LLMs right now use mixture-of-experts, which is a form of sparsity. They only use a small fraction of their parameters to process any given token.
Mixture of experts sparsity is very different from weight sparsity. In a mixture of experts, all weights are nonzero, but only a small fraction get used on each input. On the other hand, weight sparsity means only very few weights are nonzero, but every weight is used on every input. Of course, the two techniques can also be combined.
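To make the distinction concrete, here's a tiny numpy sketch (mine, not from any of the papers discussed): the first layer is weight-sparse, the second is a toy top-1 MoE.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 8, 4
x = rng.standard_normal(d)

# Weight sparsity: ~90% of W's entries are exactly zero, but every
# remaining nonzero weight participates in every forward pass.
W = rng.standard_normal((d, d)) * (rng.random((d, d)) < 0.1)
y_weight_sparse = W @ x

# MoE sparsity: every expert is dense (all weights nonzero), but only the
# routed expert is used for this particular input token.
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router = rng.standard_normal((n_experts, d))
chosen = int(np.argmax(router @ x))   # top-1 routing
y_moe = experts[chosen] @ x           # the other experts do no work here
```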
Correct. I was more focused on giving an example of sparsity being useful in general, because the comment I was replying to didn't specifically mention which kind of sparsity.
For weight sparsity, I know the BitNet 1.58 paper has some claims of improved performance by restricting weights to be either -1, 0, or 1, eliminating the need for multiplying by the weights, and allowing the weights with a value of 0 to be ignored entirely.
Another kind of sparsity, while we're on the topic, is activation sparsity. I think there was an Nvidia paper that used a modified ReLU activation function to set more of the model's activations to 0.
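Roughly, in code (my own sketch; not how either paper actually implements it), the two ideas look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
x = rng.standard_normal(d)

# Ternary (BitNet-style) weights in {-1, 0, +1}: the matvec needs only
# additions/subtractions, and zero weights can be skipped entirely.
W = rng.choice([-1, 0, 1], size=(d, d), p=[0.25, 0.5, 0.25])
y = W @ x  # conceptually: sum(x[j] where W[i,j]=+1) - sum(x[j] where W[i,j]=-1)

# Activation sparsity via a thresholded ReLU: small activations are
# forced to exactly zero so downstream layers can skip them.
def thresholded_relu(a, tau=0.5):
    return np.where(a > tau, a, 0.0)

h = thresholded_relu(y)
print(f"zero weights: {np.mean(W == 0):.0%}, zero activations: {np.mean(h == 0):.0%}")
```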
“Useful” does not mean “better”. It just means “we could not do dense”. All modern state of the art models use dense layers (both weight and inputs). Quantization is also used to make models smaller and faster, but never better in terms of quality.
Based on all examples I’ve seen so far in this thread it’s clear there’s no evidence that sparse models actually work better than dense models.
Yes, mixture of experts is basically structured activation sparsity. You could imagine concatenating the expert matrices into a huge block matrix and multiplying by an input vector where only the coefficients corresponding to activated experts are nonzero.
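A small numpy sketch of that block-matrix view (my own construction, assuming top-1 routing and identically shaped experts):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]

# Concatenate the experts into one big block matrix of shape (d, n_experts*d).
W_big = np.concatenate(experts, axis=1)

x = rng.standard_normal(d)
chosen = 2  # pretend the router picked expert 2 for this token

# Block-sparse input vector: only the chosen expert's block is nonzero.
x_big = np.zeros(n_experts * d)
x_big[chosen * d:(chosen + 1) * d] = x

# The huge "dense" product equals just running the chosen expert.
assert np.allclose(W_big @ x_big, experts[chosen] @ x)
```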
From that perspective, it's disappointing that the paper only enforces modest amounts of activation sparsity, since holding the maximum number of nonzero coefficients constant while growing the number of dimensions seems like a plausible avenue to increase representational capacity without correspondingly higher computation cost.
EDIT: don't have time to write it up, but here's gemini 3 with a short explanation:
To simulate the brain's efficiency using Transformer-like architectures, we would need to fundamentally alter three layers of the stack: the *mathematical representation* (moving to high dimensions), the *computational model* (moving to sparsity), and the *physical hardware* (moving to neuromorphic chips).
Here is how we could simulate a "Brain-Like Transformer" by combining High-Dimensional Computing (HDC) with Spiking Neural Networks (SNNs).
### 1. The Representation: Hyperdimensional Computing (HDC)
Current Transformers use "dense" embeddings—e.g., a vector of 4,096 floating-point numbers (like `[0.1, -0.5, 0.03, ...]`). Every number matters.
To mimic the brain, we would switch to *Hyperdimensional Vectors* (e.g., 10,000+ dimensions), but make them *binary and sparse*.
* **Holographic Representation:** In HDC, concepts (like "cat") are stored as massive randomized vectors of 1s and 0s. Information is distributed "holographically" across the entire vector. You can cut the vector in half, and it still retains the information (just noisier), similar to how brain lesions don't always destroy specific memories.
* **Math without Multiplication:** In this high-dimensional binary space, you don't need expensive floating-point matrix multiplication. You can use simple bitwise operations (a minimal sketch follows this list):
* **Binding (Association):** XOR operations (`A ⊕ B`).
* **Bundling (Superposition):** Majority rule (voting).
* **Permutation:** Bit shifting.
* **Simulation Benefit:** This allows a Transformer to manipulate massive "context windows" using extremely cheap binary logic gates instead of energy-hungry floating-point multipliers.
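A minimal sketch of those three operations on 10,000-bit binary hypervectors (illustrative only; real HDC/VSA systems vary in the details):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # hypervector dimensionality

def random_hv():
    return rng.integers(0, 2, size=D, dtype=np.uint8)

def bind(a, b):           # binding / association
    return a ^ b          # bitwise XOR

def bundle(*vs):          # bundling / superposition
    return (np.sum(vs, axis=0) * 2 > len(vs)).astype(np.uint8)  # majority vote

def permute(a, k=1):      # permutation, e.g. for sequence positions
    return np.roll(a, k)  # cyclic bit shift

def similarity(a, b):     # 1.0 = identical, ~0.5 = unrelated
    return float(np.mean(a == b))

animal, cat, dog = random_hv(), random_hv(), random_hv()
memory = bundle(bind(animal, cat), bind(animal, dog), permute(random_hv()))

# Unbinding with XOR recovers a noisy-but-recognizable "cat" from the bundle:
noisy_cat = bind(memory, animal)
print(similarity(noisy_cat, cat), similarity(noisy_cat, random_hv()))  # ~0.75 vs ~0.5
```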
### 2. The Architecture: "Spiking" Attention Mechanisms
Standard Attention is $O(N^2)$ because it forces every token to query every other token. A "Spiking Transformer" simulates the brain's "event-driven" nature.
* **Dynamic Sparsity:** Instead of a dense matrix multiplication, neurons would only "fire" (send a signal) if their activation crosses a threshold. If a token's relevance score is low, it sends *zero* spikes. The hardware performs *no* work for that connection.
* **The "Winner-Take-All" Circuit:** In the brain, inhibitory neurons suppress weak signals so only the strongest "win." A simulated Sparse Transformer would replace the Softmax function (which technically keeps all values non-zero) with a **k-Winner-Take-All** function (see the sketch after this list).
* *Result:* The attention matrix becomes 99% empty (sparse). The system only processes the top 1% of relevant connections, similar to how you ignore the feeling of your socks until you think about them.
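Here's a rough single-head numpy sketch of that idea (mine; the exact formulation in spiking-transformer papers differs): each query keeps only its top-k scores, and everything else is zeroed before normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 8, 16, 2   # sequence length, head dim, "winners" per query

Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))

scores = Q @ K.T / np.sqrt(d)             # (N, N) relevance scores

# k-Winner-Take-All: keep the top-k scores in each row, drop the rest.
kth_largest = np.sort(scores, axis=1)[:, [-k]]
sparse_scores = np.where(scores >= kth_largest, scores, -np.inf)

# Normalize only over the winners; every other weight is exactly zero.
weights = np.exp(sparse_scores - sparse_scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

out = weights @ V
print("nonzero attention weights per query:", (weights > 0).sum(axis=1))
```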
### 3. The Hardware: Neuromorphic Substrate
Even if you write sparse code, a standard GPU (NVIDIA H100) is bad at running it. GPUs like dense, predictable blocks of numbers. To simulate the brain efficiently, we need *Neuromorphic Hardware* (like Intel Loihi or IBM NorthPole).
* **Address Event Representation (AER):** Instead of a "clock" ticking every nanosecond forcing all neurons to update, the hardware is asynchronous. It sits idle (consuming nanowatts) until a "spike" packet arrives at a specific address (a toy sketch follows this list).
* **Processing-in-Memory (PIM):** To handle the high dimensionality (e.g., 100,000-dimensional vectors), the hardware moves the logic gates *inside* the RAM arrays. This eliminates the energy cost of moving those massive vectors back and forth.
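As a toy illustration of the event-driven idea (my own sketch, nothing like real Loihi/NorthPole firmware): work happens only when a spike event arrives at an address, not on every clock tick.

```python
from collections import defaultdict

# Sparse event stream: (timestamp, source neuron address)
events = [(0.1, 3), (0.4, 7), (0.9, 3), (1.3, 12)]

# Static fan-out: which target addresses each spiking neuron drives, with weights.
fanout = {3: [(5, 0.8), (9, -0.2)], 7: [(5, 0.5)], 12: [(1, 1.0)]}

potential = defaultdict(float)  # membrane potentials, updated lazily
THRESHOLD = 1.0

for t, src in events:           # between events: no work, (ideally) no energy
    for dst, w in fanout.get(src, []):
        potential[dst] += w
        if potential[dst] > THRESHOLD:
            print(f"t={t}: neuron {dst} fires")
            potential[dst] = 0.0
```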
### Summary: The Hypothetical "Spiking HD-Transformer"
I’m not sure why you’re talking about efficiency when the question is “do sparse models work better than dense models?” The answer is no, they don’t.
Even the old LTH paper you cited trains a dense model and then tries to prune it without too much quality loss. Pruning is a well-known method to compress models - to make them smaller and faster, not better.
Before we had proper GPUs everyone said the same thing about Neural Networks.
Current model architectures are optimized to get the most out of GPUs, which is why we have transformers dominating as they're mostly large dense matrix multiplies.
There's plenty of work showing transformers improve with inner dimension size, but it's not feasible to scale them up further because it blows up parameter and activation sizes (including KV caches), so people turn to low-rank ("sparse") decompositions like MLA.
The lottery ticket hypothesis shows that most of the weights in current models are redundant and that we could get away with much smaller sparse models, but currently there's no advantage to doing so because on GPUs you still end up doing dense multiplies.
Yes, we know that large dense layers work better than small dense layers (up to a point). We also know how to train large dense models and then prune them. But we don’t know how to train large sparse models to be better than large dense models. If someone figures it out then we can talk about building hardware for it.
It isn't directly what you are asking for, but there is a similar relationship at work with respect to L_1 versus L_2 regularization. The number of samples required to train a model is O(log(d)) for L_1 and O(d) for L_2, where d is the dimensionality [1]. This relates to the standard random-matrix results about how you can approximate high-dimensional vectors in a log(d)-dimensional space with (probably) small error.
At a very handwaving level, it seems reasonable that moving from L_1 to L_0 would have a similar relationship in learning complexity, but I don't think that has ever been addressed formally.
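Not a proof of anything, but a quick empirical toy (mine, not from [1]) showing the flavor of the gap: with a k-sparse ground truth and far fewer samples than dimensions, L_1 recovers the signal much better than L_2.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
d, k, n = 1000, 10, 200          # dimension, true sparsity, samples (n << d)

w_true = np.zeros(d)
w_true[rng.choice(d, k, replace=False)] = rng.standard_normal(k)

X = rng.standard_normal((n, d))
y = X @ w_true + 0.01 * rng.standard_normal(n)

for name, model in [("L1 (lasso)", Lasso(alpha=0.01, max_iter=10_000)),
                    ("L2 (ridge)", Ridge(alpha=1.0))]:
    w_hat = model.fit(X, y).coef_
    err = np.linalg.norm(w_hat - w_true) / np.linalg.norm(w_true)
    print(f"{name}: relative error {err:.2f}")
```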
If you consider it in absolute values, it makes sense. Bezos could give me a billion dollars, which would match my wealth with Pichai's, and he'd still have 199 billion dollars.
Yes, if you have a billion dollars then in terms of wealth Pichai is closer to you than to Bezos. But if you’re a typical HN reader (level 4 or 5), the difference between you and Pichai is pretty much infinite, while Pichai and Bezos are almost the same (relative to you): both are ultra rich.
If Google is not willing to scale it up, then why would anyone else?