Memory compression is a generalization of swap, which only applies to dynamic (anonymous) memory; file-backed pages don't need it because they can simply be re-read from disk.
The problem is that GPUs don't support virtual memory paging, so they can't page in files, decompress, or swap anything unless you implement it yourself, which is a lot slower.
Also, ML models (probably) can't be compressed because they already are compressed; learning and compression are the same thing!
Wait. This comment just blew my mind. Does that imply you could measure the efficiency of a model by its compressibility? Note that I recognize efficient and accurate are not the same thing. One could imagine evaluating a model on a 2D performance-versus-compression map somehow.
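Not sure this is exactly what you mean, but here's a rough sketch of how one point on that map could be computed. The weights and accuracy number are placeholders, and zlib over the raw float bytes is just one crude compressibility proxy:

```python
import zlib
import numpy as np

def compressibility(weights: np.ndarray, level: int = 6) -> float:
    """Crude proxy: raw byte size over zlib-compressed size (higher = more redundancy left)."""
    raw = weights.astype(np.float32).tobytes()
    return len(raw) / len(zlib.compress(raw, level))

# Placeholder inputs: random weights and a made-up accuracy number.
model_weights = np.random.randn(1_000_000).astype(np.float32)
accuracy = 0.87  # whatever your eval harness reports
print(f"accuracy={accuracy:.2f}, compressibility={compressibility(model_weights):.2f}x")
```

You'd then plot (accuracy, compressibility) for a family of models and look at the frontier.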
> Also, ML models (probably) can't be compressed because they already are compressed; learning and compression are the same thing!
I feel like they're kind of two sides of the same coin: learning is about putting more information in the same data, while compression is about putting the same information in less data.
I'm wondering if some lossy floating-point compressor (such as zfp) would work.
> I'm wondering if some lossy floating-point compressor (such as zfp) would work.
Well apparently this can work; Stable Diffusion comes in 32-bit and 16-bit float versions. I'm kind of surprised they both work, but that's lossy compression for you.
Sure, but 16-bit float is pretty primitive compression, as it does not exploit any redundancy in the input. zfp groups numbers together in chunks, which means that correlated numbers can be represented more precisely. Its algorithm is described here: https://zfp.readthedocs.io/en/release1.0.0/algorithm.html#lo...
I would like to see whether zfp can be applied to something like Stable Diffusion (or other ML models) and gives better results than plain floats at the same size.
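Here's a minimal sketch of what that comparison might look like, assuming the zfpy Python bindings that ship with the zfp project; the weight tensor is a random stand-in, so whether zfp actually beats a plain float16 cast depends on how much correlation real model weights have:

```python
import numpy as np
import zfpy  # Python bindings from the zfp project

# Placeholder for real model weights.
weights = np.random.randn(1024, 1024).astype(np.float32)

# Baseline: plain float16 cast -- 16 bits/value, no redundancy exploited.
fp16 = weights.astype(np.float16).astype(np.float32)
fp16_err = np.max(np.abs(weights - fp16))

# zfp fixed-rate mode at the same 16 bits/value, encoding 4x4 blocks jointly.
compressed = zfpy.compress_numpy(weights, rate=16)
restored = zfpy.decompress_numpy(compressed)
zfp_err = np.max(np.abs(weights - restored))

print(f"float16   max abs error: {fp16_err:.3e}")
print(f"zfp@16bpv max abs error: {zfp_err:.3e}")
```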
Memory compression? I can't find any good resources to read about it; any hints? I'm having trouble imagining how it could possibly work without totally destroying performance.
It doesn't destroy performance for the simple reason that memory access is slower than pure compute these days. If you can spend compute to shrink the data that has to move through memory, your overall throughput can very well be higher than without compression.
There has been a lot of innovation in fast compression and decompression in recent years. Traditional compression tools like gzip or xz are geared towards high compression ratios, but memory compression favors speed. Check out fast algorithms like LZ4 and Zstandard.
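To get a feel for the numbers, here's a tiny userspace sketch using the python-lz4 package; real implementations like Linux's zram do this per 4 KiB page inside the kernel, which this only approximates:

```python
import time
import lz4.frame  # pip install lz4

# One 4 KiB "page" of moderately repetitive data, standing in for an inactive memory page.
page = (b"some moderately repetitive page contents " * 200)[:4096]

start = time.perf_counter()
for _ in range(100_000):
    blob = lz4.frame.compress(page)
    assert lz4.frame.decompress(blob) == page
elapsed = time.perf_counter() - start

print(f"compression ratio: {len(page) / len(blob):.1f}x")
print(f"compress+decompress round trips per second: {100_000 / elapsed:,.0f}")
```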
It is not compressed swap: the compressed data stays in RAM. The OS just compresses inactive memory, with a couple of criteria to define "inactive".
iOS uses memory compression but not swap. iOS devices actually have special CPU instructions to speed up compression in page-size increments, specifically to aid this model. [1]