
This is available for Flash


why


That's pretty disappointing: it has been out for a while, and we still get top comments like this one (https://news.ycombinator.com/item?id=43475043) where people clearly think native image generation is a new capability. Where do you usually get your updates from for this kind of thing?


Yes perhaps one day cities like Tokyo will catch up


Most of Tokyo (and Japan) doesn't have an excessive amount of litter from pedestrians. Public garbage cans are rare, though many convenience stores and some train stations have them.

Is there litter? Yes. Is there much less than, say, Vancouver, Bangkok or London? Absolutely.

It's due to a couple of things that tourists may not be used to:

-Be prepared to bring your garbage with you until you are home.

-Garbage is sorted differently in each municipality in Japan, and often the garbage bags cost money. Who is going to buy those bags and sort someone else's garbage?

-It's changing, but walking around consuming snacks, food, and drinks is not that common in Japan. People do that at specific locations, so they don't find themselves with empty food wrappers and drink cups while walking around, and thus don't see a need for public garbage cans.

-Crows make a quick mess of garbage here. Observing the above points means that (most of) Japan doesn't need stinky, sticky, flies-and-wasps-buzzing-around, crow-magnet garbage cans, which look almost as bad as litter everywhere.


Have you seen the litter that piles up late at night in many Japanese cities? People simply leave their trash everywhere, but the city cleans it up by morning. At midnight, Tokyo is one of the most littered cities I've seen, more so than any western city.


Are you referring to the Shibuya crossing area? Apart from there, I don't see that at all.


Litter is prevalent in lots of areas, not just Shibuya. Walk around any popular neighborhood at night and you'll see it.


Vancouver is pretty clean; have you ever been to Paris or Rome? (I've lived in all three.)


It’s not litter that causes the rat issue in NYC, it is the huge masses of household and business garbage.


From what I've seen, they do just pile up garbage bags on the curb on pickup day.


Never been to Japan, but AFAIK they used to have public trash cans and got rid of them after the sarin gas terrorist attack in 1995.


Tokyo is significantly cleaner than NYC, or any major US city. I think that is GP's point.


Rat populations mostly aren't sustained by the sorts of trash cans that passersby toss garbage into (or fail to). They're sustained by containers carrying large amounts of residential refuse. (Consider the proportion of garbage you, personally, throw into a public trash can compared to your own trash can, especially food waste.) And I'm not well-versed on the specifics of Japanese waste processing, but I'm fairly confident they have something analogous to dumpsters and residential trash collection, even if they don't have public bins.


Many businesses and residences typically bag their trash and leave it on the street curb.

The main exception is when a building has a managed trash facility, which is a room that people leave their bags in instead of on the curb.


I don't know, there's a decent amount of rats where I live, but no outdoor dumpsters and very few public trashcans. Trash is kept indoors and brought out at the daily collection time.

I always assumed sewer access and the occasional rat stronghold in poorly maintained buildings were the issue.


Ah! Finally a real-world use for egg drop (with eggs and floors both equal to the number of commits since init, but maybe fewer eggs for those less patient).


That was the first thing that came to my mind!

Though I cannot find the original, well-written article I read some years ago, this link covers the problem for those interested: https://www.geeksforgeeks.org/egg-dropping-puzzle-dp-11/


I see how the problem maps onto this. It seems like the main issue in the original problem is that you have a fixed number of eggs, whereas for git bisect-find there are unlimited eggs, but we want to use as few as possible.

It is also quite different because there is a bias: most targets are probably recent. If you have a 10-year-old repo, it is more likely that an obvious bug is in the last month than the first month. (Otherwise the optimal strategy would always just be to test the first commit and then, assuming it is good, bisect from there.)
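
For what it's worth, the usual answer under that "recent bug" bias is roughly the following sketch. `is_good(n)` is a hypothetical callback that checks out the commit n steps before HEAD and runs the test; this is just an illustration, not any particular git command.

```python
def find_first_bad(is_good, total_commits):
    # Assumes HEAD (distance 0) is bad and goodness is monotone in distance.
    # Phase 1: exponential back-off from HEAD to find some good commit.
    bad, step = 0, 1
    while step < total_commits and not is_good(step):
        bad, step = step, step * 2
    # Known-good distance (if even the oldest commit is bad, check it explicitly).
    good = min(step, total_commits)

    # Phase 2: ordinary bisection between the known-bad and known-good distances.
    while good - bad > 1:
        mid = (good + bad) // 2
        if is_good(mid):
            good = mid
        else:
            bad = mid
    return bad  # distance (commits before HEAD) of the first bad commit
```

That's about 2*log2(d) tests when the bug is d commits back, so a bug from last week stays cheap even in a 10-year-old repo.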


* I don't see how.


Really strong binary results. So strong they seem fishy. I hope someone can explain my confusion below.

> We compared the performance of the Llama2-7B model in three configurations: FP16 (full precision), HQQ (without fine-tuning), and HQQ+ (with adapter layers) using a group-size of 8.

Interesting, what is "group-size of 8"?

From their HQQ post (https://mobiusml.github.io/hqq_blog/), it's the block size at which they add scales (presumably 16-bit) and shifts (in that post, it's 8-bit).

So for every 8 binary weights we have a 16-bit scale and 8-bit shift?

> Fine-tuning with Low-Rank Adapters

They say they inline the shift into the LoRA but how can you do this, block-wise, without increasing your LoRA rank by num-blocks (they claim to only use 1 additional rank)?

Then, the reported 7B sizes, in GB:

> 13.5 (fp16) 1.76 (HQQ 1-bit) 1.85 (HQQ+ 1-bit) 2.72 (quip# 2-bit)

Those numbers would make sense if it were _actually_ 1 bit. But if you include the overhead of 16-bit scales (and why is the shift inlineable into the LoRA? still unexplained) it'd be more like 3-bit.

From their HF page:

> This version offloads the meta-data to the CPU, so only the binary weights and the low-rank adapters are stored in the GPU memory.

Interesting, so we have to go back to the CPU to rescale? Is this how they counted GB? This should have been clearly caveated in the table. I'm also amazed they got latency lower than Quip# if they ping-pong to the CPU.


Hello, I am the main author; I'd love to clarify a couple of things:

All the linear-quantization methods have meta-data, including the 1.58-bit paper. You can control the quality vs. memory usage trade-off with the group-size. However, the meta-data is not the same thing as the quantized weights, for many reasons:

> The meta-data size doesn't change the fact that you can do binary/ternary matmul, which is the most important thing in this story.

> The meta-data size doesn't increase the actual compute: these are point-wise operations, and even if you have 1 scalar you still need to multiply the same number of weights.

> Meta-data is offloaded to the CPU with pinned memory, which allows non-blocking transfers. Technically, you can trigger the copy in the layer before and synchronize, which makes it almost seamless. I did some experiments with CUDA streams that worked very well on an older machine, but then I tried a better machine and the transfer was much faster. Obviously, if you are trying it on Google Colab it's very slow for this reason.

> Smaller models like Llama2-7B are very hard to directly quantize at very low bits, so they need a lower group-size to function well. Larger models (like what we showed for Mixtral), can be quantized to 2-bit on the fly, without any data, and still work very well. So basically larger models are less sensitive to extreme quantization and you could use a much larger group-size. I still think that the meta-data size is really not a big deal for the reasons I have explained above.

> There are many other ways to increase the group-size or even get rid of it altogether; many ideas are available, but they need lots of experimentation.

> Binary/ternary CUDA matmul kernels don't exist yet. The current code implements the dequantization step in CUDA but then uses torch.matmul in fp16 (roughly the flow sketched below). I tried doing matmul at low bits with CUDA, but it is very difficult to even beat cuBLAS with fp16, especially for a novice CUDA coder like me :)
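
In plain PyTorch, that path looks roughly like this. Just a sketch with an assumed affine dequant and made-up shapes, not the actual HQQ kernel:

```python
import torch

def dequant_matmul(x, W_q, scale, zero, out_features, in_features):
    # x:     (batch, in_features) fp16 activations
    # W_q:   (num_groups, 8) low-bit weights already unpacked to uint8
    # scale: (num_groups, 1), zero: (num_groups, 1) per-group meta-data
    W = (W_q.to(torch.float16) - zero) * scale   # group-wise affine dequant (assumed form)
    W = W.reshape(out_features, in_features)     # restore the weight matrix shape
    return torch.matmul(x, W.t())                # ordinary fp16 GEMM (cuBLAS)
```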

Please note: this is early experimental work. Since it showed promising results, we wanted to share it with the community first as we progress. There are still a lot of things to be done and we are actively working on it, despite the very limited resources we have.

Happy to answer any questions here!


Thanks for the reply. I'm quite familiar with subchannel quant, but I still feel like my questions did not get addressed.

1. Could you post the full memory use of the methods? E.g. you include Quip# metadata in its GB figure but not HQQ metadata in its.

2. If you have to go to the CPU to shift and scale, how did you get latency lower than a purely on-device method? Was this batch size 1? No speculative decoding?

3. How can LoRA absorb the shifts while increasing the rank by only 1, if you have a shift per group?


Sure, I can give you detailed answers:

1- The answer is still ~1.7GB. You only need meta-data of a single nn.Linear at a time. There are 32x(4+3) = 224 layers quantized, so you need an additional (3GB - 1.7GB)/224 = 1.3GB/224 ~ 5.8MB, which is negligible.

2- As the table says, that's the forward pass (batch-size=1, context-size=1024). Forward pass means there's no caching and no decoding logic. The actual model generation speed should be much faster with caching + decoding logic like speculative decoding, and using vLLM instead of HF. And even with all of that, a much larger model like Mixtral with the same group-size of 8 offloaded to the CPU works quite well on a 4090.

You mean why it's faster than Quip#, which stays entirely on-device? Because dequantization with HQQ is a simple linear operation. It's not even using a fused kernel; only the dequantization part is done in CUDA.

3- LoRA absorbs the -zero x scale term, not the shift itself; the shift is still there, as in the BitNet/1.58 work. As the paragraph explains, the math ignores the reshaping to keep it simple and easy to read.

Let's say you have a 4096x4096 matrix with grouping done channel-wise (but no reshaping): the -zero x scale part is a rank-1 matrix (4096x1 .dot 1x4096) and the LoRA part is (4096xr .dot rx4096), so you can merge them exactly into (4096x(r+1) .dot (r+1)x4096).
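
To make that concrete, here's a tiny check (made-up tensors, just to illustrate the merge): the -zero x scale term is an outer product with a column of ones, so appending it as one extra column/row to the adapter factors reproduces it exactly.

```python
import torch

d, r = 4096, 8
scale = torch.randn(d, 1)                       # per-channel scale
zero  = torch.randn(d, 1)                       # per-channel zero-point
A, B  = torch.randn(d, r), torch.randn(r, d)    # low-rank adapter factors

rank1 = (-zero * scale) @ torch.ones(1, d)      # the -zero x scale term: (d,1) .dot (1,d)

A_m = torch.cat([A, -zero * scale], dim=1)      # (d, r+1)
B_m = torch.cat([B, torch.ones(1, d)], dim=0)   # (r+1, d)

assert torch.allclose(A @ B + rank1, A_m @ B_m, atol=1e-4)
```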

The point of that paragraph is to show two things:

-Compared to the BitNet formulation, the additional zero-point (which is necessary to get good quantization results on pre-trained models with minimal calibration) has a negligible overhead.

-More importantly, it explains how we even got the idea of adding low-rank adapters: it's not because LoRA is popular, it's because the zero-point alone results in a rank-1 error matrix, which is not enough to express the quantization error. As the rank tends to min(num_rows, num_cols), the error goes down. So if we increase the rank by r via low-rank adapters, we would expect better results.

Now, if we include the reshape with a group-size lower than num_rows, the -zero x scale part is a rank-n matrix (4096xn .dot nx4096). It's not possible to properly estimate the rank n because that depends heavily on the nature of the weight matrix, but in the end the LoRA part will be (4096x(n+r) .dot (n+r)x4096). We only use a LoRA rank of 8 for the MLPs, which are the larger matrices, so even if you double or even 4x that to, let's say, n+r=32, it's still just 1/128 = 0.78% of the original matrix.

Whether or not you merge -zero x scale with the low-rank adapters doesn't matter much; it depends heavily on which fused kernel implementation performs best.


I see, so we’re still fetching the metadata to gpu, and rescaling on gpu, just on-demand and discarding metadata when we’re done with that layer?

Why not do the same optimization for layer weights themselves?


Yes, correct, and that fetching operation is non-blocking; once we dequantize the weights, we discard the meta-data before moving to the next layer.

Technically, you can do it for the weights as well. But that wouldn't work in many situations. For example, when training with FSDP: the quantized weights stay on the device but you can still offload the meta-data (https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html)

I would like to reiterate that larger models, which would be more interesting to run at low bits, are much less sensitive to quantization compared to a 7B. So you could potentially use a larger group-size and just keep it on-device, like what is done with 4-bit and 3-bit now using a group-size of 64. We just started running some experiments with a 13B Llama2 and it looks very good so far (outperforming some full-precision Llama2-13B-based models). Let's see how far we can push it; ideally, getting rid of the reshaping altogether would be great.


If you're willing to pay the latency cost of per-layer CPU fetching/offloading, I don't see what extreme quant buys you.

You could just do a layer-by-layer fetching scheme with 4 bit weights.

For training too, just fetch each layer twice per step as needed for fwd/bwd.

And all for an HBM cost equal to one layer's worth (roughly the scheme sketched below).
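
Something like this rough inference-only sketch is what I have in mind. The flat layer structure and attribute names are hypothetical, not any library's API:

```python
import torch
import torch.nn as nn

def attach_streaming_hooks(model: nn.Module, device="cuda"):
    # Keep every layer's weights pinned in CPU memory; copy one layer to the GPU
    # right before its forward pass and drop the copy right after (inference only).
    for layer in model.children():  # assumes a flat stack of layers (hypothetical)
        layer._cpu_params = {
            name: p.data.cpu().pin_memory() for name, p in layer.named_parameters()
        }

        def pre_hook(mod, args):
            for name, p in mod.named_parameters():
                # non-blocking H2D copy works because the source tensor is pinned
                p.data = mod._cpu_params[name].to(device, non_blocking=True)

        def post_hook(mod, args, output):
            for name, p in mod.named_parameters():
                p.data = mod._cpu_params[name]  # drop the GPU copy, keep the pinned master
            return output

        layer.register_forward_pre_hook(pre_hook)
        layer.register_forward_hook(post_hook)
```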


The extreme quant buys you potentially 70x more efficient matmul via binary/ternary operations.

You still have a group-size of 64 in 4-bit, FYI. And even if you keep the meta-data on-device, provided that the quality is high (which is the case for 2-bit, outperforming fp16 on certain tasks), that is a much better option compared to 4-bit even if the VRAM usage is the same.

Again, and I keep repeating this but it seems to be ignored every time: this is experimental work and it's still in progress. This story of small group-sizes on large models should not be an issue.


> You still have a group-size of 64 in 4-bit fyi.

Results may vary :)

> Again, and I keep repeating this but it seems to be ignored every time: this is experimental work and it's still in progress. This story of small group-sizes on large models should not be an issue.

Apologies if something I said (or I guess did not say...) offended you! It's a hypothetical, and one that IME is not so easy to achieve, but maybe you have different results. So I didn't want to comment on this; maybe it's possible (but in my experience LLMs don't scale up as easily, in terms of quantization, as other networks like image classifiers).

> The extreme quant buys you potentially 70x more efficient matmul via binary/ternary operations.

To be clear, such hardware does not yet exist, and it's unclear if you really can have a more efficient binary/ternary matmul if you need high-precision accumulators and more frequent broadcast shifts. It's again a complicated hardware question whether the total latency of doing many high-precision accumulations and many scales/shifts will be smaller than a 4-bit baseline (or, chip-area-wise, even feasible to implement).


Thank you for your efforts on behalf of the GPU poor!

It's getting tougher to use older, cheaper GPUs (Pascal/Maxwell) with modern quantization schemes so anything you can do to keep kernels compatible with SM52 and SM61 would be greatly appreciated.


Thank you, very glad to hear that!


When one does quantization, it's done in blocks. Bitsandbytes uses a blocksize of 64, I think. A scale and zero_point (for W_q * scale + zero_point) are needed for each block, so you need 2 fp16 numbers for each group of 64 numbers. For BnB you get approximately 4.5 bits, since (64*4bit + 16bit + 16bit) / 64 = 288/64 = 4.5. So 4-bit is actually 4.5-bit.

For HQQ 1-bit, a group size of 8 needs 2 extra numbers (an fp16 scale and, as you mentioned, an 8-bit shift). So you need 8*1bit + 16bit + 8bit = 32 bits for each group of 8, or 4 bits per param.

I'm assuming the scale and zero_point are maybe both moved to 8-bit, so (8*1bit + 8bit + 8bit) = 24 bits / 8 = 3 bits per param?
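
The arithmetic above in one place (my own quick sketch; the all-8-bit case is just my guess, not confirmed):

```python
def effective_bits(weight_bits, group_size, scale_bits, zero_bits):
    # weight bits plus the amortized per-group meta-data
    return weight_bits + (scale_bits + zero_bits) / group_size

print(effective_bits(4, 64, 16, 16))  # BnB-style 4-bit, group 64 -> 4.5
print(effective_bits(1, 8, 16, 8))    # HQQ 1-bit, group 8, fp16 scale + 8-bit zero -> 4.0
print(effective_bits(1, 8, 8, 8))     # both scale and zero in 8-bit -> 3.0
```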

"This version offloads the meta-data to the CPU, so only the binary weights and the low-rank adapters are stored in the GPU memory.", so the 8+8 scale / zero_point moves to the CPU. So GPU memory 1bit, but CPU meta data is the rest - very smart!


> "This version offloads the meta-data to the CPU, so only the binary weights and the low-rank adapters are stored in the GPU memory.", so the 8+8 scale / zero_point moves to the CPU. So GPU memory 1bit, but CPU meta data is the rest - very smart!

Doesn't it need all the weight metadata for a layer to use that layer?

* If yes, then can't any algorithm offload x% of its data as a balancing act between speed and RAM?

* If no, then what's it for and when does it get used?


Oh yes, you need all the metadata, but because it's just 2 numbers per group (the scale and zero_point), I think the movement of individual scalars to the GPU registers is super fast - cannot confirm though.

It's like in cuBLAS where you do alpha*A*B + beta*C, and alpha and beta are both scalars which can be on the CPU and moved to the GPU in nanoseconds.

I'm unsure though since I haven't tested it


It still has to go through the entire memory system. It's hard for me to imagine that transferring a number from the CPU to the GPU is faster than transferring a byte, and if you have 2 CPU-resident numbers per GPU-resident byte that's a lot of transferring.


I don't disagree - fair point, there definitely is a transfer-latency overhead. I suspect one could prefetch it by calling `.to("cuda", non_blocking = True)`, say, 2 layers ahead, so you can in theory hide the movement (roughly as sketched below).
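
Something like this (a rough, untested sketch; the `scale_cpu`/`zero_cpu` attribute names and the layer call are hypothetical, and the CPU tensors are assumed to be pinned):

```python
import torch

copy_stream = torch.cuda.Stream()

def prefetch_meta(layer):
    # issue the H2D copies on a side stream; record an event to wait on later
    with torch.cuda.stream(copy_stream):
        layer.scale_gpu = layer.scale_cpu.to("cuda", non_blocking=True)
        layer.zero_gpu = layer.zero_cpu.to("cuda", non_blocking=True)
        layer.meta_ready = torch.cuda.Event()
        layer.meta_ready.record(copy_stream)

def forward(layers, x):
    prefetch_meta(layers[0])
    for i, layer in enumerate(layers):
        if i + 1 < len(layers):
            prefetch_meta(layers[i + 1])            # kick off the next layer's copy early
        torch.cuda.current_stream().wait_event(layer.meta_ready)
        x = layer(x)                                 # dequantize with scale_gpu/zero_gpu, then matmul
        layer.scale_gpu = layer.zero_gpu = None      # free the GPU copies
    return x
```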

I think the blog did mention somewhere that HQQ 1-bit is slower for now, maybe due to the transfer overhead, although I can't remember exactly where.


My point is more that if it's that many bytes flowing around on demand, you're basically swapping layers in and out of the GPU as you use them (or x% of each layer).

Which is fine, and it's a valid feature, but you don't need to split those bytes into "data" and "metadata" to make that happen.

Is there actually something they gain from this particular method of splitting?


I guess it's mainly to reduce VRAM usage. Assuming we don't do this, a 7B model at 1-bit with group size 8 will use around 3GB of VRAM, whilst 4-bit with group size 64 will use approximately 4GB.

Assume we have a 100B model - with 4-bit, VRAM is around 57GB or so. With 1-bit, 43GB of VRAM, but by moving the scales and zero points to RAM, VRAM use is around 15GB, at the cost of roughly 28GB of RAM usage.

I guess maybe a valid approach is to dynamically select which ones to move to RAM or VRAM, given your VRAM budget. If you have a 40GB GPU, clearly keep more of the scalars on the GPU. But if you have a weak GPU, then you need to use more RAM.


I still don't think I'm getting my point across.

What if you store 40% of the data and 40% of the metadata in CPU memory? Is that the same for performance?

Is there a reason we want particular parts of the layer to stay on the GPU? Or is it just number of bytes.


Oh, very good question - tbh I'm not sure. Another closely related technique is layer offloading - if your network can't fit and has layers 1, 2, ..., 32, we offload layers 16 to 32 to RAM, then load them into GPU memory on the fly.

I'm gonna guess the performance hit is similar - although I have not tried it myself to verify with benchmarks.


Err, you are just restating what I’m saying, without addressing the concerns.

1 - Is it fair to use RAM in two places and report only one of them, without any asterisk? (If you think this is fair - oh boy, wait till you hear about my 0GB-HBM inference algorithm.)

2 - I know how subchannel quantization works. Are they hitting those reported latency numbers with per-layer CPU ping-pong to rescale?


Oh sorry - didn't mean to just restate what you said, whoops - it was all train of thought!

You can see from https://huggingface.co/mobiuslabsgmbh/Llama-2-7b-chat-hf_1bi... that the model disk space is 3GB + 100MB of LoRA weights. I also uploaded a 4bit one to https://huggingface.co/unsloth/llama-2-7b-bnb-4bit/tree/main which uses 3.87GB.

So because of the offloading trick, the GPU VRAM is less, but in actuality 3GB is still needed.

Unsure on latency sadly


Hey Daniel! The VRAM is still the same as a pure n-bit model would take. Because we only need meta-data for a single nn.Linear at a time, you only need an additional (3GB-1.7GB)/224 = 5.8MB. If we compress the meta-data as well that would become much lower.


Hey :)) Oh I love the idea of async movement from CPU to GPU ahead of time - ingenious! Prefetching a small amount of metadata seems reasonable and very smart!


I'm not sure if you're being facetious, but this is literally available for early access, just not GA yet. https://simonwillison.net/2024/Feb/21/gemini-pro-video/


Isn't this literally the case in the US? You list dependents on your tax form.


What is your current level? Is it shown on your LinkedIn?


And yet, a Rust Option (or really any option) can just be viewed as a list of one or zero elements. https://rust-unofficial.github.io/patterns/idioms/option-ite...

In fact, in Haskell, operating on an option conditionally uses the exact same Functor operation as a list: `fmap`.

So what am I to do with an iterator? It's conflicting advice! An if is a for for an option.

