Hacker News | bbcc90's comments

it does work; just download from HF and load in the app


true. See here for a map: https://sfplanninggis.s3.amazonaws.com/hub/BIGmap.pdf

I live in that space (between the tower and the city) and the local neighbourhood group (HANC) is ridiculously NIMBY. Rezoning is happening, but it's slow going...


(trying to move the critique beyond the title...)

When trying to deploy LLMs with larger context windows in constrained environments, two things start to hurt: a) the increased memory footprint of the longer KV cache, and b) slower decoding due to the longer context window. This paper addresses a) only, which is useful, but we are still left with b) (right?)
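
For a rough sense of a), here is a back-of-the-envelope KV-cache calculation; the model shape below is an illustrative assumption (roughly 7B-class, fp16), not taken from the paper:

    # KV cache grows linearly with context length; shapes are assumed, not from the paper
    n_layers, n_kv_heads, head_dim = 32, 32, 128
    seq_len, batch, bytes_per_val  = 32_768, 1, 2   # 32k context, fp16

    kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_val  # 2x for K and V
    print(kv_cache_bytes / 2**30, "GiB")            # ~16 GiB just for the cache at 32k tokens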


The more meaningful contribution may be (section 3.4):

> These variants illustrate TPA’s versatility in balancing memory cost, computational overhead, and representation power. By choosing which dimensions (heads or tokens) remain contextual and adjusting ranks (R_Q, R_K, R_V), TPA unifies multiple existing attention mechanisms—such as MHA, MQA, and GQA—under one framework, while potentially reducing the KV cache size by an order of magnitude during autoregressive inference.
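
To make the order-of-magnitude claim concrete, here is a schematic of the cache-size arithmetic when you cache small rank-R factors over the (head, head-dim) axes instead of the full per-token K/V; the shapes and ranks below are my own guesses, not the paper's configuration:

    # Per-token, per-layer cache entries; numbers are illustrative, not from the paper
    n_heads, head_dim = 32, 128
    rank_k, rank_v    = 2, 2                                  # assumed small ranks

    full_kv     = 2 * n_heads * head_dim                      # standard MHA: 8192 values/token
    factored_kv = (rank_k + rank_v) * (n_heads + head_dim)    # cache the factors instead: 640 values/token
    print(full_kv / factored_kv)                              # ~12.8x, i.e. roughly an order of magnitude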

re: the title, it might be the true one if their proofs hold up

---

I'm now curious whether the "Element-wise Attention Is All You Need" preprint can fit into this framework. Sadly my math is not currently up to the task. It appears to offer even better computational savings during both training and inference while maintaining accuracy, though it was only tested with a smaller model.

https://arxiv.org/abs/2501.05730


EA doesn't quite fit under the same umbrella. EA has a constant cache size (it's just another classical recurrent architecture inspired by approximating transformers), whereas this paper gives speedups to a variety of true attention mechanisms, which still require caches proportional to the sequence length.


very succinct and insightful, thank you!


Curious to know what mathematics you are comfortable with. If you are able to understand the papers you mentioned, you must be in the 99th percentile.


I was never good at proof writing. I found group theory and algebra interesting; topology and analysis eluded me. It's just been a while since I did any serious math thinking.


It addresses b) too, since decompositions are always smaller than the original tensor. It's usually the case that memory access is also slower than matrix multiplications so this will be faster. Burning flops to save memory movement.


> It's usually the case that memory access is also slower than matrix multiplications so this will be faster. Burning flops to save memory movement.

I haven't read this paper (yet), but isn't this the case mostly for training and not so much for inference? A good example would be flash-attention: it trades higher flops for better memory utilization, but that's mostly irrelevant in inference workloads.


They claim inference-time savings in the KV cache.


I skimmed through the paper quickly. There's no performance data on inference speedups in the paper, only benchmarks relevant to training.

They also, interestingly, don't compare against flash-attention. Flash-attention outperforms all of the other attention mechanisms mentioned in the paper: MHA, MQA, GQA, and MLA.


Flash attention is an implementation trick; you can implement MHA/GQA, for example, with flash attention.
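
For instance, with PyTorch's scaled_dot_product_attention (shapes below are just illustrative), the same fused kernel serves MHA or GQA; only the head layout changes:

    import torch
    import torch.nn.functional as F

    # Assumed shapes: 32 query heads, 8 KV heads (GQA), head_dim 128
    B, H_q, H_kv, T, D = 1, 32, 8, 1024, 128
    q = torch.randn(B, H_q,  T, D, device="cuda", dtype=torch.float16)
    k = torch.randn(B, H_kv, T, D, device="cuda", dtype=torch.float16)
    v = torch.randn(B, H_kv, T, D, device="cuda", dtype=torch.float16)

    # GQA: share each KV head across a group of query heads, then call the same kernel as MHA
    k = k.repeat_interleave(H_q // H_kv, dim=1)
    v = v.repeat_interleave(H_q // H_kv, dim=1)

    # On supported GPUs/dtypes PyTorch may dispatch this to a flash-attention kernel;
    # the attention variant (MHA vs GQA) is decided by the shapes, not by the kernel
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)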


They aren't claiming speedups; they are claiming up to an order of magnitude less space needed for the KV cache at runtime. This translates to a smaller GPU, or longer sequences on the same GPU.


Under what circumstances can you cut your loads and stores to and from main memory by an order of magnitude without observing major improvements in the runtime of a memory-bound algorithm?


AI models are compute-bound; that's why we use GPUs.


Incorrect. Self-attention is a highly parallel algorithm, which makes it a great candidate for being a memory-bound workload once you have enough compute.

Both datacenter-grade CPUs and GPUs have enough compute to carry out the self-attention computation, but only the latter has enough high-bandwidth memory to make the algorithm really perform. If that weren't the case, the theory behind flash-attention wouldn't pay off in practice, and it does, precisely because (main) memory is slow.

Deep FFWD networks OTOH are compute-bound.


Transformers are deep feedforward networks that happen to also have attention. Causal LMs are extremely memory-bound during inference due to KV caching, as all of those linear layers need to be loaded onto the chip to transform only a single token per step.
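
A rough roofline-style sanity check, with assumed (not measured) model and GPU numbers:

    # One decode step at batch size 1: ~2 flops per weight vs. reading every weight once
    params, bytes_per_param = 7e9, 2                         # assumed 7B-class model, fp16 weights
    intensity = (2 * params) / (params * bytes_per_param)    # ~1 flop/byte

    gpu_flops, gpu_bw = 312e12, 2e12     # roughly A100-class fp16 peak and HBM bandwidth
    ridge = gpu_flops / gpu_bw           # ~156 flops/byte needed to become compute-bound
    print(intensity, ridge)              # 1 << 156: single-stream decoding is memory-bound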


And I said something else?


Memory-bound only applies to low-batch-size scenarios, AFAIK.


This obviously depends on the hardware and the shape of the LLM itself but, generally speaking, it's quite the opposite. The idea of batching is to grow the compute bandwidth per request, and bigger batch sizes with much more compute will put more stress on the underlying memory subsystem (cache, RAM), no?

For N self-attention layers, there will be N compute (tensor) units doing the computation in parallel. To retire the computation, each compute unit needs to load from and store to chip memory. At batch size B, this only scales up further, i.e. B * (N loads/stores).


If you have a batch of size 1, then for every token you need to load the entire model from memory into cache as you go through it. If the batch is 32, you can produce 32 tokens while doing the same amount of loading from VRAM.
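
In numbers (illustrative 7B-class fp16 model, ignoring KV-cache traffic for the moment):

    params, bytes_per_param = 7e9, 2
    weight_bytes = params * bytes_per_param     # weights are streamed once per step, regardless of batch size
    for batch in (1, 32):
        print(batch, weight_bytes / batch / 1e9, "GB of weight traffic per generated token")
    # batch 1: 14.0 GB/token; batch 32: ~0.44 GB/token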


That's not how it works, because if what you're saying were true then self-attention's memory complexity would be O(1), i.e. independent of batch size. This obviously isn't the case, since each batch's computation needs its own load/store memory bandwidth. I suggest reading one of the transformer papers to really understand how it works.


This was a simplification. Of course you need some extra VRAM I/O based on your KV cache size.

But assuming your KV cache size is << model size, that simplification is pretty accurate.

See, e.g. https://www.databricks.com/blog/llm-inference-performance-en...

You can just scroll to the first chart they have that explains the idea.


> (trying to move the critique beyond the title...)

This is kind of a theme on HN now. The top comments are completely beside the point of the article/story/etc.


I know. It is sad. Naming can also be seen as a way of showing respect to a hugely impactful paper if you want to be positive about it.



Yahoo still exists as a company!!


any idea why this took them so long?


Because they have a near-monopoly with CUDA and get away with dragging their heels due to lack of market pressure (everyone loves to lock themselves into CUDA and then complain about GPU prices), despite having been on the OpenCL committee (just like Apple's conflict of interest with Metal).

Anyway, I'm very glad to see it and will be using it immediately.


Also because none of the competition were/are serious about OpenCL either.

AMD still doesn't have OpenCL 3.0 support, and their implementation of previous versions was far, far less stable than CUDA's.

I can't find a definitive source on this, but AFAIK none of the official OpenCL implementations have ever fully supported mixed CPU-GPU code the way CUDA does.


AMD's OpenCL support on GPU is overall excellent in my experience (two commercial apps and lots of hobby code), but I tend to mostly use low-level OpenCL 1.1 stuff, which I find sufficient.

Also, Intel GPUs are actually incredibly competent with OpenCL if you give them a wide enough NDRange, and somehow manage to look past the lack of any fp64 support at all :/


On top of that, even "do no evil" Google never supported OpenCL on Android, pushing their own dialect, RenderScript, instead.

Yes, there are some custom Android deployments that ship a libopencl.so kind of thing, but it is used by the OEMs themselves and never exposed as an official Android API.


This is why I am investing in AMD. They can only improve!


CUDA itself only got bumped from LLVM 5 to LLVM 7 in CUDA 11.2 (https://developer.nvidia.com/blog/boosting-productivity-and-...).

LLVM 7 opt-in for OpenCL happened some time later (available since r510). What changes now is that LLVM 7 is the new default.


If you are going to go vertical, then do it properly.

OpenAI could just build their own framework for internal use that works well on their silicon (see JAX + TPU).

Their starting point? Triton plus some Triton libs. JAX chipped away at TF like this, and there's no reason why Triton can't do the same to PyTorch.


Love this. I’m 10 years in SF myself. One thing I would add is the people.


Agreed, the dating in the Bay is amazing.


Most statements I've heard about the dating scene are pretty negative, for men at least. Mostly stuff like this: https://old.reddit.com/r/bayarea/comments/y0heqs/what_is_the...


My straight female friends flying in from LA and NYC feel like the Bay is a goldmine for women looking for ambitious partners; your "purchasing power" is a lot higher here.

At the same time, I don't think I've ever heard a single straight male praise the area for its love-life opportunities.


Go to bars in poorer areas. I find that the ladies there are very happy to find any stable man who can afford to live without roommates.


Man Jose is pretty tragic.


The peninsula and San Jose are terrible, but the city and the East Bay are good. Silicon Valley is full of guys who have inflated egos because of their paychecks, but not much else going for them.


You have to make a conscious effort to avoid douchebags.


I would also say that being an engineer at L8 requires a meaningfully different set of skills than L7. The whole ‘what got you here won’t get you there’ idea starts at L7 for either track…


Chiplets and standardized hardware interfaces are a great start.

Now we just need to solve the remaining 90%: firmware, software, security, virtualization, …, …

