Having only superficial knowledge of data storage/DBMS, perhaps I'm just spewing nonsense. But I always imagined that a compaction layer for long-tail/old data should be standard, where less-accessed data (sometimes queried once a month) is pushed to S3-like (but still queryable) storage.
So with traditional Parquet this is usually handled through “sane” partitioning.
Heavily simplified version — each partition is a separate file containing a bunch of table rows, and the partition splits are determined by the values in those rows.
If you’ve got data with a date column (sign-up date or order date or something), you would partition on a YYYY-MM field you create early on.
Each time you run a query filtering by YYYY-MM, your OLAP query tool no longer needs to read a bunch of files from disk or S3. If you only want to look at 2023-12, then you only need to read one file to run the query.
Edit — OLAP kinda stuff is all about getting the data “slices” nicely organised for queries people will run later.