The interesting part about the article isn't that structured data is easier to compress and store; it's that there's a relatively new way to efficiently transform unstructured logs into structured data. For those shipping unstructured logs to an observability backend, this could be a way to save significant money.
1) Use Zstandard with a generated dictionary kept separate from the data, but moreover:
2) Organize the log data into tables with columns, and then compress by column (so, each column has its own dictionary). This lets the compression algorithm perform optimally, since now all the similar data is right next to itself. (This reminds me of the Burrows-Wheeler transform, except much more straightforward, thanks to how similar log lines are.)
3) Search is performed without decompression. Somehow they use the dictionary to index into the table, which is very clever. I would have just compressed the search term using the same dictionary and done a binary search for it, but I think that would only work for exact matches.
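As a rough illustration of points 1 and 2, here's a minimal sketch of per-column compression with a dictionary kept separate from the data. It uses the stdlib `zlib` preset-dictionary feature (`zdict`) in place of Zstandard's trained dictionaries, and a naive "dictionary" (a sample of each column's values) rather than anything trained; the row layout is made up for the example.

```python
import zlib

# Toy log rows (timestamp, ip, path, status), then pivoted to columns.
rows = [
    ("2024-01-01T00:00:%02d" % (i % 60), "10.0.0.%d" % (i % 5),
     "/index.html", "200")
    for i in range(1000)
]
columns = list(zip(*rows))  # one tuple of values per column

def compress_column(values, zdict):
    """Compress one column with its own preset dictionary."""
    c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress("\n".join(values).encode()) + c.flush()

def decompress_column(blob, zdict):
    d = zlib.decompressobj(zdict=zdict)
    return (d.decompress(blob) + d.flush()).decode().split("\n")

# Crude per-column "dictionary": a sample of each column's values.
# A trained Zstandard dictionary is far smarter; this only shows the
# idea of storing the dictionary separately from the compressed data.
dicts = ["\n".join(col[:20]).encode() for col in columns]
blobs = [compress_column(col, zd) for col, zd in zip(columns, dicts)]

restored = [decompress_column(b, zd) for b, zd in zip(blobs, dicts)]
assert list(zip(*restored)) == rows  # lossless round trip
```

Because each column holds only self-similar values, the per-column streams give the compressor exactly the locality the comment describes.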
Heh, my hacked version of this was used for Apache logging. I just ran MySQL on top of a ZFS volume with compression turned on and had a column each for timestamp, IP, referral IP, URL, action, and return code. I was amazed at how fast it was, how easy and fast it was to query, and how efficient the disk storage was. Overall I was quite impressed that it was nearly as useful as a standard web traffic analysis tool that took significant time to crunch the logs, yet it worked on live data.
I wonder what would happen if you stored the columns in separate tables (perhaps pairs of columns?) and queried them with a join off a shared ID (perhaps a view or materialized view?) in order to really take advantage of how well compression handles highly self-similar data stored together.
Also, I assume you used a smallish blocksize in ZFS because of the frequent small writes?
Well the storage was crazy efficient, I kept checking to make sure it was recording what I thought it was. No way 50M hits could fit in a file that small...
Timestamps, especially north of 1000 hits/sec, have many bits in common. URL, Referrer, and IP address were all just indexes. That worked really well because it was storage efficient, and it made various queries like "who hit this URL" and "who is our top referrer" very efficient. Things that used to require ingesting a month's worth of logs and spitting out a report would often be answered with a simple SQL query.
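"URL, Referrer, and IP address were all just indexes" is classic dictionary encoding: each distinct string is stored once in a side table, and the log rows hold only small integer IDs. A minimal sketch (the `dict_encode` helper and sample URLs are hypothetical, not from the original setup):

```python
def dict_encode(values):
    """Replace repeated strings with integer IDs plus a lookup table."""
    table, ids, index = [], [], {}
    for v in values:
        if v not in index:          # first time we see this value:
            index[v] = len(table)   # assign it the next free ID
            table.append(v)
        ids.append(index[v])
    return ids, table

urls = ["/index.html", "/about", "/index.html", "/index.html", "/about"]
ids, table = dict_encode(urls)
assert ids == [0, 1, 0, 0, 1]
assert [table[i] for i in ids] == urls  # fully reversible
```

Queries like "who hit this URL" then become a single integer comparison over the `ids` column instead of repeated string matching.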
Does the user have to specify the "schema" (each unique log message type) manually, or is it learned automatically (a la gzip and friends)? I wasn't able to discover this from a cursory read-through of the paper...
If you have exactly two logging operations in your entire program:
log(“Began {x} connected to {y}”)
log(“Ended {x} connected to {y}”)
We can label the first one logging operation 1 and the second one logging operation 2.
Then, if logging operation 1 occurs we can write out:
1, {x}, {y} instead of “Began {x} connected to {y}”, because we can reconstruct the message as long as we know which operation occurred (1) and the values of all the variables in the message. This general strategy can be extended to any number of logging operations by just giving them all a unique ID.
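The scheme described above can be sketched in a few lines. The registry, record shape, and variable names here are illustrative, not CLP's actual on-disk format:

```python
# Hypothetical registry: each unique logging call site gets an ID.
TEMPLATES = {
    1: "Began {x} connected to {y}",
    2: "Ended {x} connected to {y}",
}

def encode(op_id, **variables):
    """Store only the operation ID plus the variable values."""
    return (op_id, variables)

def decode(record):
    """Rebuild the full message from the ID and the variables."""
    op_id, variables = record
    return TEMPLATES[op_id].format(**variables)

rec = encode(1, x="client-42", y="db-1")
assert decode(rec) == "Began client-42 connected to db-1"
```

The static template text is stored once per operation rather than once per log line, which is where the bulk of the savings comes from.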
That is basically the source of their entire improvement. The only other thing that may cause a non-trivial improvement is that they delta-encode their timestamps instead of writing out what looks to be a 23-character timestamp string.
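Delta encoding of timestamps is simple to sketch: store the first value, then only the (usually tiny) differences between consecutive entries. The millisecond values below are made up for illustration:

```python
def delta_encode(timestamps):
    """Store the first timestamp plus successive differences."""
    deltas = [timestamps[0]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        deltas.append(cur - prev)
    return deltas

def delta_decode(deltas):
    """Reverse the encoding by accumulating the differences."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

ts = [1700000000123, 1700000000124, 1700000000130, 1700000000131]
enc = delta_encode(ts)
assert enc[1:] == [1, 6, 1]       # small deltas instead of full values
assert delta_decode(enc) == ts    # lossless
```

At high event rates the deltas fit in a byte or two, versus ~23 bytes for a formatted timestamp string on every line.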
The columnar storage of data and dictionary deduplication, what is called Phase 2 in the article, is still not fully implemented according to the article authors and is only expected to result in a 2x improvement. In contrast, the elements I mentioned previously, Phase 1, were responsible for a 169x(!) improvement in storage density.
So does it mean you need to specify rules for all the unique logging operations you have? I was under the impression that it could do this automatically, at runtime, based on receiving repetitive logs or something like that. If it doesn't, it's still a hell of a lot of work to compile and maintain the list of rules for your applications.
It seems very cool to get logmine-style log line templating built right in. I've found it's very helpful to run logs through it, and having a log system that can do this at ingest time for quicker querying seems like it'd have amazing log-digging workflow benefits.
Github project for CLP: https://github.com/y-scope/clp