Having only superficial knowledge of data storage/DBMS, perhaps I'm just spewing nonsense. But I always imagined that a compaction layer for long-tail/old data should be standard, where less-accessed data (sometimes queried once a month) is pushed to S3-like (but still queryable) storage.
So with traditional Parquet this is usually handled through “sane” partitioning.
Heavily simplified version — each partition is a separate file containing a bunch of table rows, and the partition splits are determined by the values in those rows.
If you’ve got data with a date column (sign-up date or order date or something), you would partition on a YYYY-MM field you create early on.
Each time you run a query filtering by YYYY-MM, your OLAP query tool no longer needs to read a bunch of files from disk or S3. If you only want to look at 2023-12, then you only need to read one file to run the query.
Edit — OLAP kinda stuff is all about getting the data “slices” nicely organised for queries people will run later.