I heard of pigz in the discussions following my interview of Yann Collet, creator of LZ4 and zstd.
If you'll excuse the plug, here is the LZ4 story:
Yann was bored and working as a project manager. So he started working on a game for his old HP 48 graphing calculator.
Eventually, this hobby led him to revolutionize the field of data compression, releasing LZ4, ZStandard, and Finite State Entropy coders.
His code ended up everywhere: in games, databases, file systems, and the Linux kernel, because Yann built the world's fastest compression algorithms. And he got started just making a fun game for a graphing calculator he'd had since high school.
https://corecursive.com/data-compression-yann-collet/
Side note: In the 1990s, everyone in my engineering school had an HP48 calculator. There was a healthy selection of pretty decent games available.
One fine day, I finished my physics exam an hour early and so opened up an enjoyable game on my calculator. 45 minutes went by, so I went up and handed in my paper. It was at that point that the professor asked, “Were you planning on leaving the second page blank?”
I had an HP48, and I had lots of fun with Bjorn Gahm's IR remote control, which could mimic the remote control signals of most common television brands, including the ones at my high school. The HP48 also had the ability to set an alarm to execute an arbitrary piece of code, e.g. turning on the TV to some channel and turning the volume up to max at some predefined time in the middle of class.
Teachers were completely stumped. Their initial suspicion was always that someone had brought a universal remote control to class, but they would painstakingly search everyone's desks and find nothing. And then, even after asking everyone to put their hands up in the air, the TV would still have a mind of its own.
Still have mine. For a while I used an emulator on my phone. Eventually I found PCalc, and it was customizable enough to recreate the parts of the HP 48G that I cared about on a day-to-day basis.
One wild thing is how many performance wins were still available compared to zlib. Pigz is parallel, but what if you just had a better way to compress and decompress than DEFLATE?
When zstd came out – and Brotli before it, to a certain extent – it was 3x faster than zlib with a slightly higher compression ratio. You'd think such performance jumps would be hard to come by in something as well explored as data compression, but we weren't as close to the efficiency frontier as it seemed.
Part of this was that zstd and Brotli were able to use compression history windows measured in megabytes rather than kilobytes, while DEFLATE maxes out at 32 KB. RAM was thousands of times more expensive in the early 90s, so a smaller history window made sense then.
That's not to minimize the clever ideas and amazing implementation work in new stuff. It's more that people were making smart decisions both then and now, more so than you might guess just from comparisons on today's hardware.
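To make the window-size point concrete, here's a hedged sketch using the zstd CLI (the file names are made up): its long-distance matching mode opts into history windows far beyond DEFLATE's fixed 32 KB.

    # opt into a 2^27-byte (~128 MiB) match window instead of DEFLATE's fixed 32 KB
    zstd --long=27 big-corpus.tar -o big-corpus.tar.zst
    # decompression accepts the same flag (required for windows above zstd's default limit)
    zstd -d --long=27 big-corpus.tar.zst -o restored.tar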
People take performance for granted. Even within gzip (and similarly .png), you can set compression level to 4 (default is 6) and get ~15-20% faster performance at the cost of ~5% larger file sizes.
No one ever tweaks that one setting even though they should; file sizes are a much smaller bottleneck than they were in the days of megabyte hard drives and dial-up modems.
If your justification for not serving up larger .png files is that not everyone has fast internet, then you should be detecting and handling that case separately, downscaling the images, or serving .jpeg instead.
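For the record, a quick sketch of that gzip tweak (the file name is a placeholder; levels run -1 through -9 and -6 is the default):

    gzip -6 -c assets.tar > assets.default.gz   # the default level
    gzip -4 -c assets.tar > assets.fast.gz      # noticeably faster, a few percent larger
    ls -l assets.default.gz assets.fast.gz      # compare the sizes on your own data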
One time I was using Topaz AI to upscale video; I spliced that tweak into their ffmpeg filter chain and took a whole day off a week-long encode. Low-hanging fruit.
In video games, long loading times for levels are a serious pain point, so game developers put a lot of effort into tuning compression algorithms to get the best wall-clock time, considering both the time to fetch content from storage and the time to decompress it.
If the target is a console, you may know exactly what hardware is there, so you can justify the effort in tuning. (It's more complex today because you have a choice of what kind of storage to use with your Xbox.) With a PC or phone, your results may vary a lot more.
Don't most games/game engines use TGA format for their textures? Those are all RLE-encoded if I'm not mistaken (which is very fast but very inefficient space-wise). Or perhaps that is just at game creation and those will get baked to some other image format for distribution?
The most important thing for modern texture compression is that the GPU supports it without ever having to decompress it, which saves VRAM and memory bandwidth. So it's usually specialized formats like ASTC.
Economies of scale come into play as well. Gzip decompression speed is also slightly better at higher levels. A one-time higher cost of compression can pay off pretty quickly when you are decompressing the data many times, or serving it to enough people.
I'm not so sure about this. Generally speaking, there will be more work done on the CPU to decompress at higher levels (e.g. 6 through 9). It is possible (although unlikely) that you will get higher decompression speed, but only if the bottleneck wasn't the CPU to begin with (e.g. network or disk).
My gut feeling is that if you are pulling down data faster than 40 megabits per second and have a CPU made within the past 7 years (possibly including mobile), you generally won't be bottlenecked by I/O.
Most compression algorithms don't take more work to decompress at higher levels, and actually perform better due to having less data to work through. Gzip consistently benchmarks better at decompression for higher levels.
It's not just about bottlenecks, but aggregate energy expenditure from millions of decompressions. On the whole, it can make a real, measurable difference. My point was only that it's not so cut and dried that a 5% file-size loss is a good trade for 20% faster compression. You'd have to benchmark and actually estimate the total number of decompressions to find the tipping point.
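If anyone wants to check this on their own data, a minimal benchmark sketch (sample.bin is a placeholder for whatever you actually serve; run it a few times to average out noise):

    for level in 1 4 6 9; do
      gzip -"$level" -c sample.bin > "sample.$level.gz"
      echo "decompressing level $level:"
      time gzip -dc "sample.$level.gz" > /dev/null
    done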
> No one ever tweaks that one setting even though they should
That entirely depends on the use case. Most people running FFmpeg do it as a one-off thing, and if those people are like me, when I rip a movie I want the highest quality and lowest size I can get, and I'm happy that the default sacrifices speed for quality and size. The processing can be slow because I'm only doing it once. If you're in the business of encoding video and do it all day every day, your calculus will be different and you won't be using the defaults regardless.
That depends on the conditions under which the energy is used. In a place where you need extra energy for cooling, yes. In my home, no. I live in a cold place, so I need heating for the better part of the year, and I heat with electricity (which might be stupid, but that's how the house was built 30 years ago). So whatever energy my computer wastes, I save on my heating bill. Computer energy is free, except during a few warm summer weeks.
What's well explored is compression ratio, where it is indeed difficult to improve and true innovations, like arithmetic coding, are rare.
Compression speed, on the other hand, is not very interesting to academics; it's more of an engineering problem. And there is plenty of work to do there, starting with things as simple as multi-threading and SIMD.
zlib and Zstandard are probably in the same complexity class, but with different constant factors, which academics don't care about but which have massive practical consequences.
I think arithmetic coding is much simpler than the way most resources describe it (the hard part is making it efficient).
Consider this example: you want to transmit two completely random variables, both of which can have 5 states. The obvious way is to concatenate two bit fields of size ceil(log2(5)), so 3+3 = 6 bits.
But alternatively, you can count the total number of states possible for both variables together, 5*5 = 25 and encode it as a single integer of size ceil(log2(25)) = 5, so both variables can be stored with just 5 bits.
So we arrive at the idea that there can be a fractional number of bits, which we often round up to the nearest integer for simplicity (or, in practice, the nearest multiple of 8 since most protocols are based on bytes).
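That packing trick is small enough to try right in the shell (a toy sketch; a and b are two values in 0..4):

    a=3; b=4
    packed=$(( a * 5 + b ))    # a single integer in 0..24, which fits in ceil(log2(25)) = 5 bits
    echo "packed=$packed"      # 19
    echo "a=$(( packed / 5 )) b=$(( packed % 5 ))"    # recovers a=3 b=4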
The other part is just assigning shorter sequences of bits to more common symbols, except that, unlike in Huffman coding, our symbols can occupy a fractional number of bits. This allows them to match the actual symbol probabilities more closely. If your data is highly repetitive, you can fit dozens of (common) symbols per bit.
The coolest part IMO is how easy it is to plug in custom models for symbol probabilities. Usually a simple counter for each symbol is enough, but you can go crazy and start predicting the next symbol based on previous ones.
> The coolest part IMO is how easy it is to plug in custom models for symbol probabilities. Usually a simple counter for each symbol is enough, but you can go crazy and start predicting the next symbol based on previous ones.
PPM-based [0] compression can work quite well for text but on its own it's not enough to unseat the Lempel-Ziv family as the top general purpose compressor.
zlib/gzip did not choose DEFLATE because it was the best algorithm. Rather it was, at the time, the only decent algorithm that could be implemented in a manner not covered by patents. (See the tragedy of LZW for why that was important.)
We're now more than two decades later, so all the important data compression patents should have expired.
Probably a lot of it is due to hardware changes since zlib was first written. Mainly the importance of cache and branch prediction, which were either less of a big deal or non-existent back then. IOW, zlib probably leaves a lot more on the table now than when it was written.
zlib is above all portable, and it runs in small memory footprints. Speed-versus-space trade-offs and platform-specific functionality all have costs associated with them.
Just listened to that episode, what a great story. The dry way he tells how he unexpectedly and almost accidentally transitioned from a project manager to a software engineer is really a treat. Thanks for your podcast!
Fastest open-source compression algorithms, that is. RAD Game Tools has proprietary ones that are faster and have better compression ratios, but since you have to pay for a license, they will never be widespread.
Interesting. Are there any benchmarks you can share? On their website they only compare decompression speed, and only against zlib and LZMA. It would be interesting to compare against LZ4's HC mode, which Unity uses.
Build or download TurboBench [1] executables for Linux and Windows from the releases [2] and run your own tests comparing Oodle, zstd, and other compressors.
I've been following RAD for a long time and I love Charles Bloom's blog. Their products are proprietary, but he also makes a lot of code public. For instance, he showed how RAD switched from arithmetic coders to Asymmetric Numeral Systems and shared code along the way.
This episode was fascinating. I had heard of LZ4 but not Zstd. It spurred me to make changes to our system at work that are reducing file sizes by as much as 25%. It’s great to have a podcast in which I learn practical stuff!
Probably the most underrated feature of zstd (likely because it's so unusual) is the ability to create a separate compression dictionary. This lets you build customized, highly efficient dictionaries that are specific to one type of data and then compress individual pieces of that data without including a full dictionary in every compressed output.
So, for example, take logfiles. You can train a dictionary on some sample log data. Then you can compress individual log rows, and all that actually gets stored is a diff against the compression dictionary (if any new entries were added) plus the compressed data. So you get very efficient compression of small pieces of data that belong to a very self-similar collection, with the option of decompressing any individual element at will. (Of course, you'd need to hold onto the original trained dictionary for both compression and decompression of any row you want to be able to decompress in the future. And you might want to retrain the dictionary every so often for slowly-changing types of data, to keep its efficiency from drifting downward over time.)
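A hedged sketch of that workflow with the zstd command line (directory and file names are made up):

    zstd --train samples/*.log -o logs.dict                   # train a dictionary from many small sample files
    zstd -D logs.dict row-000123.txt -o row-000123.zst        # compress one small element against it
    zstd -d -D logs.dict row-000123.zst -o row-000123.out     # the same dictionary is needed to decompress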
I believe Postgres already uses this under the hood for some columnar data. It wouldn't take much to index it before compressing it and just decompress it at will. Or maybe it just got added? https://devm.io/databases/postgresql-release
I do this to make extraordinarily small UDP packets for a low-latency system. I record the raw payloads, then build a dictionary from that data and share it on both sides. It reduces the per-packet overhead by keeping the dictionary out of the packets, and it does a much better job than other approaches.
I saw that zstd and Brotli both support creating custom dictionaries, but I couldn't find any tutorials showing how to do this. Perhaps you could share code?
Running `zstd --train` on a directory of sample files will output a dictionary file, and then the `-D <path/to/dictionary/file>` option, when used for either compression or decompression, will use that dictionary.
You can also check "man zstd" or google "zstd --train" for more details. The training directory must consist of many small files, each of which is an example artifact; if you want to split, say, a single log file into one file per line, you can use a bash script like this (note that I just created this with ChatGPT and eyeballed it; it looks correct but I haven't run it yet!): https://gist.github.com/pmarreck/91124e761e45d6860834eb046d6...
(Also, don't forget to set it as executable with `chmod +x split_file.bash` before you try to run it directly)
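If you'd rather not fetch the gist, a stand-in for that splitting step using the standard split tool (file and directory names are made up, and this is not the gist's contents):

    mkdir -p samples
    split -l 1 -a 6 app.log samples/line-    # writes one training file per log line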
Thank you so much. I was trying to create a dictionary last night and your comment was sent by God. You're doing the Lord's work frfr! I followed you on GitHub!
There are also parallel versions of bzip2 (pbzip2), lzip (plzip), and xz (pixz).
Depending upon the data, the non-threaded versions of these utilities can have higher performance when run with some kind of dispatcher on multiple files.
The GNU xargs utility is able to do this, and the relevant features are also in busybox.
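For example (a hedged sketch with a made-up file pattern), dispatching single-threaded gzip over many files, eight at a time:

    find . -name '*.log' -print0 | xargs -0 -P 8 -n 1 gzip -9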
GORDON BELL, ADAM! That was a great episode. The most amazing thing to me was how this guy was just messing around, a compression hobbyist, if you will - and then he is being courted by FAANG companies. He just walked into it, almost by accident.
I work in VMware Fusion on a Mac, with a Mint guest OS, and zipping these huge instances for backup would take forever with a single core. Pigz punishes all 12 cores on my Mac mini and saves me a ton of time.
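The kind of invocation I mean looks roughly like this (paths are made up; -p sets the thread count, which pigz otherwise auto-detects):

    tar -cf - ~/VMs/mint-guest | pigz -p 12 > mint-guest-backup.tar.gz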
Thanks! Part of the strength of the story is the humility of Yann. He could have presented himself as a genius who was destined for greatness but he's humble enough to look at himself as an everyman.
I loved this episode, it was very engaging to the very end, I wish there were more episodes! (I already listened to all of them so far)
Thank you for doing this podcast!
That was a great read! Very inspiring to hear about a near-middle-age person keeping the flame stoked on a little side hobby, and having it turn into something world-changing. So cool!