
> as well explored as data compression

What's well explored is compression rate, where indeed it's difficult to improve, and true innovations, like arithmetic coding, are rare.

Compression speed, on the other hand, is not very interesting to academics; it's more of an engineering problem. And there is plenty of work to do here, starting with stuff as simple as multi-threading and SIMD.

ZLib and ZStandard are probably in the same complexity class, but with different constant factors, which academics don't care about but which have massive practical consequences.



> What's well explored is compression rate not performance

Exactly! And this seems like a shame to me with something burning so many cpu cycles.

> true innovations, like arithmetic coding, are rare.

Yeah, Yann tried to explain arithmetic coding to me, but I didn't get it.


I think arithmetic coding is much simpler than the way most resources describe it (the hard part is making it efficient).

Consider this example: you want to transmit two completely random variables, both of which can have 5 states. The obvious way is to concatenate two bit fields of size ceil(log2(5)), so 3+3 = 6 bits.

But alternatively, you can count the total number of states possible for both variables together, 5*5 = 25 and encode it as a single integer of size ceil(log2(25)) = 5, so both variables can be stored with just 5 bits.
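To make it concrete, here's a tiny Python sketch of the joint encoding (pack/unpack are just illustrative names, not from any library):

    import math

    # Two independent variables, each with 5 possible states (0..4).
    # Separate fields: ceil(log2(5)) = 3 bits each, 6 bits total.
    # Joint field:     5 * 5 = 25 combined states, ceil(log2(25)) = 5 bits total.

    def pack(a: int, b: int) -> int:
        """Encode two base-5 digits as one integer in [0, 25)."""
        return a * 5 + b

    def unpack(code: int) -> tuple[int, int]:
        """Recover both digits from the packed integer."""
        return divmod(code, 5)

    assert 2 * math.ceil(math.log2(5)) == 6   # separate bit fields
    assert math.ceil(math.log2(5 * 5)) == 5   # joint encoding saves a bit
    assert unpack(pack(3, 4)) == (3, 4)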

So we arrive at the idea that there can be a fractional number of bits, which we often round up to the nearest integer for simplicity (or, in practice, the nearest multiple of 8 since most protocols are based on bytes).

The other part is just assigning shorter sequences of bits to more common symbols, except that, unlike in Huffman coding, our symbols can take up a fractional number of bits. This allows them to match the actual symbol probabilities more closely. If your data is highly repetitive, you can fit dozens of (common) symbols per bit.
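Here's a minimal sketch of that in Python, assuming a fixed two-symbol model and exact fractions (real coders use fixed-point integers with renormalization; MODEL/encode/decode are my own names):

    import math
    from fractions import Fraction

    # Fixed model: symbol -> (cumulative_low, probability).
    MODEL = {"a": (Fraction(0), Fraction(3, 4)),     # common: costs ~0.415 bits
             "b": (Fraction(3, 4), Fraction(1, 4))}  # rare:   costs 2 bits

    def encode(message):
        low, width = Fraction(0), Fraction(1)
        for sym in message:
            c_low, p = MODEL[sym]
            low += width * c_low   # narrow the interval to this symbol's slice
            width *= p
        return low, width          # any number in [low, low + width) identifies the message

    def decode(value, length):
        out, low, width = [], Fraction(0), Fraction(1)
        for _ in range(length):
            target = (value - low) / width
            for sym, (c_low, p) in MODEL.items():
                if c_low <= target < c_low + p:
                    out.append(sym)
                    low += width * c_low
                    width *= p
                    break
        return "".join(out)

    msg = "aaabaa"
    low, width = encode(msg)
    assert decode(low, len(msg)) == msg
    print(math.log2(1 / width))   # ~4.07 bits for 6 symbols, vs 6 bits from a Huffman code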

The coolest part IMO is how easy it is to plug in custom models for symbol probabilities. Usually a simple counter for each symbol is enough, but you can go crazy and start predicting the next symbol based on previous ones.
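For example, a sketch of the counter-per-symbol idea (an adaptive order-0 model; the class and method names are just for illustration, using the same (cumulative_low, probability) shape as the coder sketch above):

    from fractions import Fraction

    class AdaptiveModel:
        def __init__(self, alphabet):
            self.counts = {s: 1 for s in alphabet}   # start at 1 so no symbol has probability 0

        def intervals(self):
            """Current (cumulative_low, probability) per symbol, derived from the counts."""
            total = sum(self.counts.values())
            low, out = Fraction(0), {}
            for sym, count in sorted(self.counts.items()):
                p = Fraction(count, total)
                out[sym] = (low, p)
                low += p
            return out

        def update(self, sym):
            self.counts[sym] += 1   # encoder and decoder must apply identical updates

    model = AdaptiveModel("ab")
    for sym in "aaabaa":
        print(sym, model.intervals()[sym])   # p('a') climbs as 'a' keeps appearing
        model.update(sym)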


> The coolest part IMO is how easy it is to plug in custom models for symbol probabilities. Usually a simple counter for each symbol is enough, but you can go crazy and start predicting the next symbol based on previous ones.

PPM-based [0] compression can work quite well for text, but on its own it's not enough to unseat the Lempel-Ziv family as the top general purpose compressor.

[0] https://en.wikipedia.org/wiki/Prediction_by_partial_matching


Yeah, arithmetic coding on its own is not enough. Typically you would have it as the last step, with dictionary and other compressors before it.
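A toy illustration of that pipeline shape, with run-length encoding standing in for the dictionary/match stage (real codecs use LZ77-style matching); the resulting symbols are what an entropy coder like the sketch above would consume:

    def rle_stage(data: str):
        """Collapse runs into (symbol, run_length) pairs."""
        out, i = [], 0
        while i < len(data):
            j = i
            while j < len(data) and data[j] == data[i]:
                j += 1
            out.append((data[i], j - i))
            i = j
        return out

    print(rle_stage("aaaabbbaaaaaa"))   # [('a', 4), ('b', 3), ('a', 6)] -> entropy-code these last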


That must be why you left out all the technical details ;) At least you piqued my interest; I'll just ask ChatGPT to explain the general concepts.



