
When running a batched migration it is important to batch using a strictly monotonic field so that new rows won't get inserted into an already-processed range.
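
A minimal sketch of that pattern, assuming psycopg2 and illustrative table/column names (users.email_normalized is hypothetical, not from the thread): each batch walks the primary key upward, so rows inserted after the migration started, which receive higher ids, never land inside a range that has already been processed.

    import psycopg2

    BATCH = 1_000
    conn = psycopg2.connect("dbname=app")
    last_id = 0

    while True:
        with conn.cursor() as cur:
            # Keyset pagination: only touch rows strictly above the last
            # processed id, in ascending order.
            cur.execute(
                """
                UPDATE users
                   SET email_normalized = lower(email)
                 WHERE id IN (SELECT id FROM users
                               WHERE id > %s
                               ORDER BY id
                               LIMIT %s)
                RETURNING id
                """,
                (last_id, BATCH),
            )
            ids = [r[0] for r in cur.fetchall()]
        conn.commit()
        if not ids:
            break
        last_id = max(ids)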


It doesn't even necessarily need to be strictly monotonic. That part does help, though, since you don't need to skip rows.

For me the bigger thing is the randomness. A UUID being random for a given row means the opposite is true: any given index entry points to a completely random spot in the heap.

When backfilling this leads to massive write amplification. Consider a table with rows taking up 40 bytes, so roughly 200 entries per page. If I backfill 1k rows sorted by the id, then under normal circumstances I'd expect to update 6-7 pages, which is ~50 KiB of heap writes.

Whereas if I do that sort of backfill with a UUID, I'd expect each row to land on a separate page. That means backfilling 1k rows turns into around 8 MB of writes to the heap.
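
A back-of-the-envelope check of those figures, assuming 8 KiB PostgreSQL heap pages and ~40-byte rows (the exact page counts depend on fill factor and row placement):

    PAGE_BYTES = 8 * 1024
    ROWS_PER_PAGE = 200
    BATCH = 1_000

    # Sorted by a sequential id, the batch spans only a handful of pages
    # (plus a page or two of slack at the boundaries -> the "6-7 pages" above).
    sequential_pages = BATCH // ROWS_PER_PAGE
    print(sequential_pages * PAGE_BYTES / 1024, "KiB")        # ~40-50 KiB of heap writes

    # With random UUIDs, each row is expected to land on its own page.
    random_pages = BATCH
    print(random_pages * PAGE_BYTES / (1024 * 1024), "MiB")   # ~8 MiB of heap writes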


Isn't that solved by UUIDv7, which can be ordered by time?


Yeah, pretty much, although sequential ids can still be a little better. The big problem for us is that we need the security of UUIDs not leaking information, so v7 isn't appropriate.

We do use a custom UUID generator that uses the timestamp as a prefix that rotates on a medium-term scale. That ensures we get some degree of clustering for records based on insertion time, but you can't go backwards to figure out the actual time. Backfilling is still a problem; this is more about helping with live reads.
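
A minimal sketch of that idea; the names, rotation period, and the keyed-hash choice are assumptions, not the actual generator. The first bytes of the UUID come from a keyed hash of a coarse time bucket, so inserts within the same window cluster together, but the prefix can't be reversed into a real timestamp:

    import hashlib
    import os
    import time
    import uuid

    ROTATION_SECONDS = 6 * 60 * 60        # assumed bucket size: inserts in the
    SECRET = b"replace-with-a-real-secret"  # same window share a prefix

    def clustered_uuid(now=None):
        bucket = int((now or time.time()) // ROTATION_SECONDS)
        # Keyed hash of the bucket: deterministic within a window, but not
        # reversible to the insertion time without the secret.
        prefix = hashlib.blake2b(bucket.to_bytes(8, "big"),
                                 key=SECRET, digest_size=4).digest()
        return uuid.UUID(bytes=prefix + os.urandom(12))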


Are page misses still a thing in the age of SSDs?


Strictly monotonic fields are quite expensive and the bigserial PK alone won't give you that.


PG bigserial is already strictly monotonic


No, they're not, even with a `cache` value of 1. Sequence values are issued at insert time rather than at commit. A transaction that commits later (commit being what makes its rows visible) can carry an earlier value than a previous transaction.

This is problematic if you try to depend on the ordering. Nothing stops a batch process that started an hour ago from committing a value 100k lower than where you thought the sequence was. That's an extreme example, but the same consideration applies on millisecond timescales.
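
A small illustration of that gap, sketched with psycopg2 against a hypothetical events table (table name and connection strings are assumptions): the session that draws the lower sequence value can be the one that commits last.

    import psycopg2

    # Hypothetical table: CREATE TABLE events (id bigserial PRIMARY KEY, payload text);
    conn_a = psycopg2.connect("dbname=test")
    conn_b = psycopg2.connect("dbname=test")
    cur_a, cur_b = conn_a.cursor(), conn_b.cursor()

    # Session A inserts first and draws the lower sequence value...
    cur_a.execute("INSERT INTO events (payload) VALUES ('a') RETURNING id")
    id_a = cur_a.fetchone()[0]

    # ...but session B inserts afterwards, draws a higher value, and commits first.
    cur_b.execute("INSERT INTO events (payload) VALUES ('b') RETURNING id")
    id_b = cur_b.fetchone()[0]
    conn_b.commit()

    # Right now a scan sees id_b as the max committed id while the smaller id_a
    # is still invisible; a batch job that records id_b as "processed up to here"
    # will skip id_a once session A finally commits.
    conn_a.commit()
    print(id_a, id_b)  # id_a < id_b even though A committed later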


Okay, but in a live DB you typically won't have only inserts while migrating, will you?


Yes, but updates are covered by updated app code


Would creation/lastmod timestamps cover this requirement?


Yes, although timestamps may have collisions depending on resolution and traffic, no? Bigserials (at least in PG) are strictly monotonic (with holes).
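
If timestamps are the batching key, one common way around collisions (a sketch with illustrative table and column names, not a prescription) is to page on a (created_at, id) pair so the id breaks ties and the ordering stays strict:

    import psycopg2
    from datetime import datetime

    conn = psycopg2.connect("dbname=app")
    last_created_at, last_id = datetime(1970, 1, 1), 0  # keyset cursor

    with conn.cursor() as cur:
        # Row-value comparison keeps the scan strictly ordered even when
        # many rows share the same created_at value.
        cur.execute(
            """
            SELECT created_at, id FROM users
             WHERE (created_at, id) > (%s, %s)
             ORDER BY created_at, id
             LIMIT 1000
            """,
            (last_created_at, last_id),
        )
        batch = cur.fetchall()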



