BRIN indexes don't affect the storage layer, so you need to make sure your table is appropriately sorted on disk through other means. CLUSTER can do this as a one-off job (pg_repack if you need to keep the table available for writes), but won't sort new data as it's written.
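Roughly, that one-off workflow looks like this (a sketch with made-up table/index names; note that CLUSTER holds an exclusive lock on the table for the duration of the rewrite):

    -- one-off rewrite of the table in created_at order, driven by a btree index
    CREATE INDEX events_created_at_idx ON events (created_at);
    CLUSTER events USING events_created_at_idx;
    ANALYZE events;

    -- a BRIN index only pays off while rows stay roughly in created_at order;
    -- rows written after the CLUSTER land wherever free space happens to be
    CREATE INDEX events_created_at_brin ON events USING brin (created_at);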
For all of its greatness and (mostly well-deserved) praise, the lack of a reasonable clustered-index capability (as in data order at the storage layer) is Postgres' biggest limitation IMHO.
Unfortunately, the CLUSTER command not only blocks the table for WRITE ops, but more importantly it also blocks READ operations [0]. pg_repack helps, but is not always available when using a managed PG offering.
Not being able to control data ordering on disk is a potential deal breaker once the data reaches a certain size.
Starting with PostgreSQL 11, B-tree indexes can store additional "non-key" columns (the INCLUDE clause). These are kept in the index alongside the sorted key columns, so more queries can be answered straight from the index, avoiding expensive heap visits and sorts in certain query plans.
It's not perfect. An out-of-date visibility map (maintained by VACUUM) sometimes forces PostgreSQL to check table pages to confirm that rows are visible to the current transaction. Index-only scans don't play well with features like expression indexes either.
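As a rough sketch of the syntax (invented table/column names; whether the planner actually picks an index-only scan also depends on how fresh the visibility map is):

    -- covering index: customer_id is the sort key, total rides along as payload
    CREATE INDEX orders_customer_total_idx
        ON orders (customer_id) INCLUDE (total);

    -- may be answered by an Index Only Scan, with no heap access at all
    EXPLAIN
    SELECT customer_id, sum(total)
    FROM orders
    GROUP BY customer_id;

    -- keeping the visibility map fresh keeps the heap fetches low
    VACUUM (ANALYZE) orders;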
It depends on a lot of factors; my unscientific rule of thumb is that once the data no longer fits in the server's RAM, that's when you may start worrying about how the data will be fetched from disk when you query it (if performance matters).
The problem case is when the data needed to serve an average query is "scattered across the disk". That means you may have to fetch far more data from disk than the query actually needs, because data is read in whole blocks/pages, not byte by byte.
A worst-case example: say a query needs 5,000 rows of 100 bytes each (500KB of useful data). If they are "perfectly scattered" (i.e. each row on a different page), you actually have to read 5,000 pages of 8KB each (the default page size in Postgres), i.e. 40MB. By fetching 80 times more data than needed, you've essentially reduced your throughput by 80x.
Note that the example above assumes an index (e.g. a btree) can be leveraged. The index points directly to the pages in question, which is likely still much better than a full scan. But the index only solves part of the problem, not all of it.
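You can see the row-to-page mapping yourself through the hidden ctid system column, which reads as (page number, tuple index) for each row. A quick illustration against a hypothetical events table:

    -- if the page numbers jump around, the rows you want are scattered across
    -- many 8KB pages and each of those pages has to be read in full
    SELECT ctid, created_at
    FROM events
    ORDER BY created_at
    LIMIT 10;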
It may not be a big deal, since reading 40MB from disk is fast, but it will limit 1) the number of concurrent users you can serve, and 2) if you need 100,000 rows instead of 5,000, the query will take proportionally longer and the slowdown may become noticeable to your users.
If you can co-locate the data on disk (i.e. put the rows on the same or contiguous pages) deterministically, your throughput improves dramatically. A clustered index (as it's called in SQL Server) is the typical mechanism for ordering data on disk in an RDBMS. MySQL does this by default with the primary key, but Postgres does not.
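In Postgres you can get a feel for how close the physical order is to the logical one from the correlation statistic that ANALYZE gathers (hypothetical table/column names again):

    -- correlation near 1.0 or -1.0: rows are laid out roughly in column order;
    -- correlation near 0: rows for nearby values are scattered across the table
    SELECT tablename, attname, correlation
    FROM pg_stats
    WHERE tablename = 'events'
      AND attname = 'created_at';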
> A clustered index (as it's called in SQL Server) is the typical mechanism for ordering data on disk in an RDBMS. MySQL does this by default with the primary key, but Postgres does not.
Does that mean that MySQL moves the first or second half of the table if you insert a row in the middle? I can't imagine that.
I've recently considered clustering multiple terabytes of time-ordered data stored in >100 partitions in a Postgres 12 instance. After careful consideration I came to the conclusion it wasn't worth it.

Clustering would have sorted all the rows by date within the table blocks, but that doesn't guarantee anything about the block layer below the filesystem, so it is of dubious value from a performance standpoint.

The other advantage I was hoping for was being able to use a BRIN index. But since my database does see (rare) updates of those rows, a BRIN index loses its value very fast. Either I lower the fillfactor to leave space in every block for updates, which allows the BRIN index to stay accurate but costs a lot of space, or I would have to force BRIN index updates regularly because the ranges become lossy, and that is not acceptable in my application.

The whole database is stored on NVMe disks managed by ZFS, which won't benefit from ordering the data in some arbitrary abstraction in the middle.
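For reference, the knobs weighed in that trade-off look roughly like this (sketch only, names invented; fillfactor leaves free space in each heap page for updates, and brin_summarize_new_values() summarizes ranges added after the index was built):

    -- leave ~30% of every heap page free so updated rows can stay on their page
    -- (only affects newly written pages; existing ones need a rewrite to pick it up)
    ALTER TABLE measurements SET (fillfactor = 70);

    -- BRIN over the time column; pages_per_range trades precision for index size
    CREATE INDEX measurements_ts_brin
        ON measurements USING brin (measured_at) WITH (pages_per_range = 64);

    -- summarize block ranges that were filled since the last summarization
    SELECT brin_summarize_new_values('measurements_ts_brin');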
Very cool, thanks for sharing. I'm curious how far vanilla Postgres can be taken for analytics before more exotic columnar solutions like Redshift start to make sense.
[0] https://www.postgresql.org/docs/current/sql-cluster.html

[1] https://reorg.github.io/pg_repack/