What's well explored is compression ratio, which is indeed difficult to improve; true innovations like arithmetic coding are rare.
Compression speed, on the other hand, isn't very interesting to academics; it's more of an engineering problem. And there is plenty of work to do there, starting with things as simple as multi-threading and SIMD.
ZLib and ZStandard are probably in the same complexity class, but with different constant factors, which academics don't care about but which have massive practical consequences.
I think arithmetic coding is much simpler than the way most resources describe it (the hard part is making it efficient).
Consider this example: you want to transmit two independent, uniformly random variables, each of which can take 5 states. The obvious way is to concatenate two bit fields of size ceil(log2(5)), so 3+3 = 6 bits.
But alternatively, you can count the total number of states for both variables together, 5*5 = 25, and encode the pair as a single integer of size ceil(log2(25)) = 5, so both variables can be stored in just 5 bits.
So we arrive at the idea that there can be a fractional number of bits, which we often round up to the nearest integer for simplicity (or, in practice, the nearest multiple of 8 since most protocols are based on bytes).
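A minimal sketch of that packing trick (Python is just my choice for illustration; the values and names are made up for the example):

```python
import math

# Two independent values, each with 5 possible states (0..4).
a, b = 3, 4

# Naive encoding: one 3-bit field per value -> 6 bits total.
naive_bits = 2 * math.ceil(math.log2(5))      # 3 + 3 = 6

# Combined encoding: treat the pair as a single base-5 number.
packed = a * 5 + b                            # a value in 0..24
packed_bits = math.ceil(math.log2(5 * 5))     # ceil(log2(25)) = 5

# Decoding just reverses the arithmetic.
a_dec, b_dec = divmod(packed, 5)
assert (a_dec, b_dec) == (a, b)

print(naive_bits, packed_bits)                # 6 5
print(math.log2(5))                           # ~2.32 "fractional" bits per value
```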
The other part is just assigning shorter sequences of bits to more common symbols, except that, unlike in Huffman coding, our symbols can take a fractional number of bits. This allows the code lengths to match the actual symbol probabilities much more closely. If your data is highly repetitive, you can fit dozens of (common) symbols per bit.
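To put a number on that: the ideal code length for a symbol with probability p is -log2(p) bits, which is what an arithmetic coder approaches. A quick illustration with hypothetical probabilities:

```python
import math

# Hypothetical symbol probabilities for a highly repetitive stream.
probs = {"A": 0.97, "B": 0.02, "C": 0.01}

for sym, p in probs.items():
    bits = -math.log2(p)                     # ideal code length in bits
    print(f"{sym}: p={p}  ->  {bits:.4f} bits/symbol")

# The very common symbol "A" costs ~0.044 bits, i.e. roughly 23 of them
# fit in one bit -- which is where "dozens of symbols per bit" comes from.
```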
The coolest part IMO is how easy it is to plug in custom models for symbol probabilities. Usually a simple counter for each symbol is enough, but you can go crazy and start predicting the next symbol based on previous ones.
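For the "simple counter" case, a sketch of what such a model could look like (the class and method names are mine, not from any particular library): keep a count per symbol, derive probabilities from the counts, and update after every symbol so encoder and decoder stay in sync.

```python
from collections import Counter

class AdaptiveModel:
    """Order-0 adaptive model: probabilities come from symbol counts seen so far."""

    def __init__(self, alphabet):
        # Start every count at 1 so no symbol ever has probability zero.
        self.counts = Counter({s: 1 for s in alphabet})
        self.total = len(alphabet)

    def prob(self, symbol):
        return self.counts[symbol] / self.total

    def update(self, symbol):
        # Both encoder and decoder call this after each symbol,
        # so their probability estimates never drift apart.
        self.counts[symbol] += 1
        self.total += 1

# Feeding the model a repetitive stream quickly skews its estimates:
model = AdaptiveModel("ab")
for s in "aaaaaaaaab":
    model.update(s)
print(model.prob("a"), model.prob("b"))   # ~0.83 vs ~0.17
```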
> The coolest part IMO is how easy it is to plug in custom models for symbol probabilities. Usually a simple counter for each symbol is enough, but you can go crazy and start predicting the next symbol based on previous ones.
PPM-based [0] compression can work quite well for text, but on its own it's not enough to unseat the Lempel-Ziv family as the top general-purpose compressor.