Hacker News | bbcc90's comments

it does work; just download from HF and load in the app


true. See here for a map: https://sfplanninggis.s3.amazonaws.com/hub/BIGmap.pdf

I live in that space (between the tower and the city) and the local neighbourhood group (HANC) is ridiculously NIMBY. Rezoning is happening, but it's slow going...


(trying to move the critique beyond the title...)

When trying to deploy LLMs with larger context windows in constrained environments, two things start to hurt: a) the increased memory footprint of the longer KV cache, and b) slower decoding due to the longer context window. This paper addresses a) only, which is useful, but we are still left with b) (right?)
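
For a rough sense of a), here is a back-of-the-envelope KV-cache calculation; the model shape below is an illustrative assumption (roughly 7B-class, fp16), not taken from the paper:

    # KV cache grows linearly with context length; shapes are assumed, not from the paper
    n_layers, n_kv_heads, head_dim = 32, 32, 128
    seq_len, batch, bytes_per_val  = 32_768, 1, 2   # 32k context, fp16

    kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_val  # 2x for K and V
    print(kv_cache_bytes / 2**30, "GiB")            # ~16 GiB just for the cache at 32k tokens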


The more meaningful contribution may be (section 3.4):

> These variants illustrate TPA’s versatility in balancing memory cost, computational overhead, and representation power. By choosing which dimensions (heads or tokens) remain contextual and adjusting ranks (R_Q, R_K, R_V), TPA unifies multiple existing attention mechanisms—such as MHA, MQA, and GQA—under one framework, while potentially reducing the KV cache size by an order of magnitude during autoregressive inference.
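
To make the order-of-magnitude claim concrete, here is a schematic of the cache-size arithmetic when you cache small rank-R factors over the (head, head-dim) axes instead of the full per-token K/V; the shapes and ranks below are my own guesses, not the paper's configuration:

    # Per-token, per-layer cache entries; numbers are illustrative, not from the paper
    n_heads, head_dim = 32, 128
    rank_k, rank_v    = 2, 2                                  # assumed small ranks

    full_kv     = 2 * n_heads * head_dim                      # standard MHA: 8192 values/token
    factored_kv = (rank_k + rank_v) * (n_heads + head_dim)    # cache the factors instead: 640 values/token
    print(full_kv / factored_kv)                              # ~12.8x, i.e. roughly an order of magnitude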

re: the title, it might be the true one if their proofs hold up

---

I'm now curious whether the "Element-wise Attention Is All You Need" preprint can fit into this framework. Sadly my math is not currently up to the task. It appears to offer even better computational savings during both training and inference while maintaining accuracy, though it was only tested with a smaller model.

https://arxiv.org/abs/2501.05730


EA doesn't quite fit under the same umbrella. EA has a constant cache size (it's just another classical recurrent architecture inspired by approximating transformers), whereas this paper gives speedups to a variety of true attention mechanisms, which still require caches proportional to the sequence length.


very succinct and insightful, thank you!


Curious to know what mathematics you are comfortable with. If you are able to understand the papers you mentioned, you must be in the 99th percentile.


I was never good at proof writing. I found group theory and algebra interesting; topology and analysis eluded me. It's just been a while since I did any serious math thinking.


It addresses b) too, since decompositions are always smaller than the original tensor. It's usually the case that memory access is also slower than matrix multiplications so this will be faster. Burning flops to save memory movement.


> It's usually the case that memory access is also slower than matrix multiplications so this will be faster. Burning flops to save memory movement.

I haven't read this paper (yet), but isn't this the case mostly for training and not so much for inference? A good example would be flash-attention: it trades higher flops for better memory utilization, but that's mostly irrelevant in inference workloads.


They claim inference-time savings in the KV cache.


I skimmed through the paper quickly. There's no performance data on inference speedups in the paper, only benchmarks relevant to training.

They also, interestingly, don't compare against flash-attention. Flash-attention outperforms all of the other attention mechanisms mentioned in the paper: MHA, MQA, GQA, and MLA.


Flash attention is an implementation trick; you can implement MHA/GQA, for example, with flash attention.
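
For instance, with PyTorch's scaled_dot_product_attention (shapes below are just illustrative), the same fused kernel serves MHA or GQA; only the head layout changes:

    import torch
    import torch.nn.functional as F

    # Assumed shapes: 32 query heads, 8 KV heads (GQA), head_dim 128
    B, H_q, H_kv, T, D = 1, 32, 8, 1024, 128
    q = torch.randn(B, H_q,  T, D, device="cuda", dtype=torch.float16)
    k = torch.randn(B, H_kv, T, D, device="cuda", dtype=torch.float16)
    v = torch.randn(B, H_kv, T, D, device="cuda", dtype=torch.float16)

    # GQA: share each KV head across a group of query heads, then call the same kernel as MHA
    k = k.repeat_interleave(H_q // H_kv, dim=1)
    v = v.repeat_interleave(H_q // H_kv, dim=1)

    # On supported GPUs/dtypes PyTorch may dispatch this to a flash-attention kernel;
    # the attention variant (MHA vs GQA) is decided by the shapes, not by the kernel
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)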


They aren't claiming speedups; they are claiming up to an order of magnitude less space needed for the KV cache at runtime. This translates to a smaller GPU, or longer sequences on the same GPU.


Under what circumstances can you cut your loads and stores to and from main memory by an order of magnitude without observing major improvements in the runtime of a memory-bound algorithm?


AI models are compute-bound; that's why we use GPUs.


Incorrect. Self-attention is a highly parallel algorithm, which makes it a great candidate for being a memory-bound workload once you have enough compute.

Both datacenter-grade CPUs and GPUs have enough compute to carry out the self-attention computation, but only the latter has enough high-bandwidth memory to make the algorithm really perform. If that weren't the case, the theory behind flash-attention wouldn't pay off in practice, and it does, precisely because (main) memory is slow.

Deep FFWD networks OTOH are compute-bound.


Transformers are deep feedforward networks that happen to also have attention. Causal LMs are extremely memory-bound during inference due to KV caching, as all of those linear layers need to be loaded onto the chip to transform only a single token per step.
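
A rough roofline-style sanity check, with assumed (not measured) model and GPU numbers:

    # One decode step at batch size 1: ~2 flops per weight vs. reading every weight once
    params, bytes_per_param = 7e9, 2                         # assumed 7B-class model, fp16 weights
    intensity = (2 * params) / (params * bytes_per_param)    # ~1 flop/byte

    gpu_flops, gpu_bw = 312e12, 2e12     # roughly A100-class fp16 peak and HBM bandwidth
    ridge = gpu_flops / gpu_bw           # ~156 flops/byte needed to become compute-bound
    print(intensity, ridge)              # 1 << 156: single-stream decoding is memory-bound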


And I said something else?


Memory-bound only applies to low-batch-size scenarios, AFAIK.


This obviously depends on the hardware and the shape of the LLM itself but, generally speaking, it's quite the opposite. The idea of batching is to grow the compute bandwidth per request, and bigger batch sizes with much more compute will put more stress on the underlying memory subsystem (cache, RAM), no?

For N self-attention layers, there will be N compute (tensor) units doing the computation in parallel. To retire the computation, each compute unit needs to load from and store to chip memory. At batch size B, this only scales up further, i.e. B * (N loads/stores).


If you have a batch of size 1, then for every token you need to load the entire model from memory into cache as you go through it. If the batch is 32, you can produce 32 tokens while doing the same amount of loading from VRAM.
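
In numbers (illustrative 7B-class fp16 model, ignoring KV-cache traffic for the moment):

    params, bytes_per_param = 7e9, 2
    weight_bytes = params * bytes_per_param     # weights are streamed once per step, regardless of batch size
    for batch in (1, 32):
        print(batch, weight_bytes / batch / 1e9, "GB of weight traffic per generated token")
    # batch 1: 14.0 GB/token; batch 32: ~0.44 GB/token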


That's not how it works, because if what you're saying were true then self-attention's memory complexity would be O(1), i.e. independent of batch size. This obviously isn't the case, since each batch's computation needs its own load/store memory bandwidth. I suggest reading one of the transformer papers to really understand how it works.


This was a simplification. Of course you need some extra VRAM I/O based on your KV cache size.

But assuming your KV cache size is << model size, that simplification is pretty accurate.

See, e.g. https://www.databricks.com/blog/llm-inference-performance-en...

You can just scroll to the first chart they have that explains the idea.


> (trying to move the critique beyond the title...)

This is kind of a theme on HN now. The top comments are completely beside the point of the article/story/etc.


I know. It is sad. Naming can also be seen as a way of showing respect to a hugely impactful paper if you want to be positive about it.



Yahoo still exists as a company!!


any idea why this took them so long?


Because they have a near-monopoly with CUDA and get away with dragging their heels due to lack of market pressure (everyone loves to lock themselves into CUDA and then complain about GPU prices), despite having been on the OpenCL committee (just like Apple's conflict of interest with Metal).

Anyway, I'm very glad to see it and will be using it immediately.


Also because none of the competition were/are serious about OpenCL either.

AMD still doesn't have OpenCL 3.0 support, and their implementation of previous versions was far, far less stable than CUDA's.

I can't find a definitive source on this, but AFAIK none of the official OpenCL implementations have ever fully supported mixed CPU-GPU code the way CUDA does.


AMD's OpenCL support on GPU is overall excellent in my experience (two commercial apps and lots of hobby code), but I tend to mostly use low-level OpenCL 1.1 stuff, which I find sufficient.

Also, Intel GPUs are actually incredibly competent with OpenCL if you give them a wide enough NDRange, and somehow manage to look past the lack of any fp64 support at all :/


On top of that, even "do no evil" Google never supported OpenCL on Android, pushing their own dialect, RenderScript, instead.

Yes, there are some custom Android deployments that ship a libopencl.so kind of thing, but it is used by the OEMs themselves and never exposed as an official Android API.


This is why I am investing in AMD. They can only improve!


CUDA itself only got bumped from LLVM 5 to LLVM 7 in CUDA 11.2 (https://developer.nvidia.com/blog/boosting-productivity-and-...).

LLVM 7 opt-in for OpenCL happened some time later (available since r510). What changes now is that LLVM 7 is the new default.


If you are going to go vertical, then do it properly.

OpenAI could just build their own framework for internal use that works well on their silicon (see JAX + TPU).

Their starting point? Triton plus some Triton libs. JAX chipped away at TF like this, and there's no reason why Triton can't do the same to PyTorch.


Love this. I’m 10 years in SF myself. One thing I would add is the people.


Agreed, the dating in the Bay is amazing.


Most statements I've heard about the dating scene are pretty negative, for men at least. Mostly stuff like this: https://old.reddit.com/r/bayarea/comments/y0heqs/what_is_the...


My straight female friends flying in from LA and NYC feel like the Bay is a goldmine for women looking for ambitious partners; your "purchasing power" is a lot higher here.

At the same time, I don't think I've ever heard a single straight male praise the area for its love-life opportunities.


Go to bars in poorer areas. I find that the ladies there are very happy to find any stable man who can afford to live without roommates.


Man Jose is pretty tragic.


The peninsula and San Jose are terrible, but the city and the East Bay are good. Silicon Valley is full of guys who have inflated egos because of their paychecks, but not much else going for them.


You have to make a conscious effort to avoid douchebags.


I would also say that being an engineer at L8 requires a meaningfully different set of skills than L7. The whole ‘what got you here won’t get you there’ idea starts at L7 for either track…


Chiplets and standardized hardware interfaces are a great start.

Now we just need to solve the remaining 90%: firmware, software, security, virtualization, …, …

