
> as well explored as data compression

What's well explored is compression rate, where indeed it's difficult to improve, and true innovations, like arithmetic coding, are rare.

Compression speed, on the other hand, is not very interesting to academics; it's more of an engineering problem. And there is plenty of work to do here, starting with stuff as simple as multi-threading and SIMD.

ZLib and ZStandard are probably in the same complexity class, but with different constant factors, which academics don't care about but which have massive practical consequences.



> What's well explored is compression rate not performance

Exactly! And this seems like a shame to me with something burning so many cpu cycles.

> true innovations, like arithmetic coding, are rare.

Yeah, Yann tried to explain arithmetic coding to me, but I didn't get it.


I think arithmetic coding is much simpler than the way most resources describe it (the hard part is making it efficient).

Consider this example: you want to transmit two completely random variables, both of which can have 5 states. The obvious way is to concatenate two bit fields of size ceil(log2(5)), so 3+3 = 6 bits.

But alternatively, you can count the total number of states possible for both variables together, 5*5 = 25 and encode it as a single integer of size ceil(log2(25)) = 5, so both variables can be stored with just 5 bits.
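To make it concrete, here's a tiny Python sketch of the joint encoding (pack/unpack are just illustrative names, not from any library):

    import math

    # Two independent variables, each with 5 possible states (0..4).
    # Separate fields: ceil(log2(5)) = 3 bits each, 6 bits total.
    # Joint field:     5 * 5 = 25 combined states, ceil(log2(25)) = 5 bits total.

    def pack(a: int, b: int) -> int:
        """Encode two base-5 digits as one integer in [0, 25)."""
        return a * 5 + b

    def unpack(code: int) -> tuple[int, int]:
        """Recover both digits from the packed integer."""
        return divmod(code, 5)

    assert 2 * math.ceil(math.log2(5)) == 6   # separate bit fields
    assert math.ceil(math.log2(5 * 5)) == 5   # joint encoding saves a bit
    assert unpack(pack(3, 4)) == (3, 4)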

So we arrive at the idea that there can be a fractional number of bits, which we often round up to the nearest integer for simplicity (or, in practice, the nearest multiple of 8 since most protocols are based on bytes).

The other part is just assigning shorter sequences of bits to more common symbols, except that, unlike in Huffman coding, our symbols can take up a fractional number of bits. This allows them to match the actual symbol probabilities more closely. If your data is highly repetitive, you can fit dozens of (common) symbols per bit.
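Here's a minimal sketch of that in Python, assuming a fixed two-symbol model and exact fractions (real coders use fixed-point integers with renormalization; MODEL/encode/decode are my own names):

    import math
    from fractions import Fraction

    # Fixed model: symbol -> (cumulative_low, probability).
    MODEL = {"a": (Fraction(0), Fraction(3, 4)),     # common: costs ~0.415 bits
             "b": (Fraction(3, 4), Fraction(1, 4))}  # rare:   costs 2 bits

    def encode(message):
        low, width = Fraction(0), Fraction(1)
        for sym in message:
            c_low, p = MODEL[sym]
            low += width * c_low   # narrow the interval to this symbol's slice
            width *= p
        return low, width          # any number in [low, low + width) identifies the message

    def decode(value, length):
        out, low, width = [], Fraction(0), Fraction(1)
        for _ in range(length):
            target = (value - low) / width
            for sym, (c_low, p) in MODEL.items():
                if c_low <= target < c_low + p:
                    out.append(sym)
                    low += width * c_low
                    width *= p
                    break
        return "".join(out)

    msg = "aaabaa"
    low, width = encode(msg)
    assert decode(low, len(msg)) == msg
    print(math.log2(1 / width))   # ~4.07 bits for 6 symbols, vs 6 bits from a Huffman code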

The coolest part IMO is how easy it is to plug in custom models for symbol probabilities. Usually a simple counter for each symbol is enough, but you can go crazy and start predicting the next symbol based on previous ones.
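For example, a sketch of the counter-per-symbol idea (an adaptive order-0 model; the class and method names are just for illustration, using the same (cumulative_low, probability) shape as the coder sketch above):

    from fractions import Fraction

    class AdaptiveModel:
        def __init__(self, alphabet):
            self.counts = {s: 1 for s in alphabet}   # start at 1 so no symbol has probability 0

        def intervals(self):
            """Current (cumulative_low, probability) per symbol, derived from the counts."""
            total = sum(self.counts.values())
            low, out = Fraction(0), {}
            for sym, count in sorted(self.counts.items()):
                p = Fraction(count, total)
                out[sym] = (low, p)
                low += p
            return out

        def update(self, sym):
            self.counts[sym] += 1   # encoder and decoder must apply identical updates

    model = AdaptiveModel("ab")
    for sym in "aaabaa":
        print(sym, model.intervals()[sym])   # p('a') climbs as 'a' keeps appearing
        model.update(sym)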


> The coolest part IMO is how easy it is to plug in custom models for symbol probabilities. Usually a simple counter for each symbol is enough, but you can go crazy and start predicting the next symbol based on previous ones.

PPM-based [0] compression can work quite well for text, but on its own it's not enough to unseat the Lempel-Ziv family as the top general purpose compressor.

[0] https://en.wikipedia.org/wiki/Prediction_by_partial_matching


Yeah, arithmetic coding on its own is not enough. Typically you would have it as the last step, with dictionary and other compressors before it.
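A toy illustration of that pipeline shape, with run-length encoding standing in for the dictionary/match stage (real codecs use LZ77-style matching); the resulting symbols are what an entropy coder like the sketch above would consume:

    def rle_stage(data: str):
        """Collapse runs into (symbol, run_length) pairs."""
        out, i = [], 0
        while i < len(data):
            j = i
            while j < len(data) and data[j] == data[i]:
                j += 1
            out.append((data[i], j - i))
            i = j
        return out

    print(rle_stage("aaaabbbaaaaaa"))   # [('a', 4), ('b', 3), ('a', 6)] -> entropy-code these last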


That must be why you left out all the technical details ;) At least you piqued my interest; I'll just ask ChatGPT to explain the general concepts.



