

> none of the reviews of the last few years mention immutable and/or bi-temporal databases.

We hosted the XTDB team to give a tech talk five weeks ago:

https://db.cs.cmu.edu/events/futuredata-reconstructing-histo...

> Which looks more like a blind spot to me honestly.

What do you want me to say about them? Just that they exist?


Nice work Andy. I'd love to hear about semantic layer developments in this space (e.g. Malloy etc.). Something to consider for the future. Thanks.

> I'd love to hear about semantic layer developments in this space (e.g. Malloy etc.)

We also hosted Lloyd to give a talk about Malloy in March 2025:

https://db.cs.cmu.edu/events/sql-death-malloy-a-modern-open-...


> Am I living in a bubble?

There are rumblings that the MySQL project is rudderless after Oracle fired the team working on the open-source project in September 2025. Oracle is putting all its energy into its closed-source MySQL Heatwave product. There is a new company that is looking to take over leadership of open-source MySQL, but I can't talk about them yet.

MariaDB Corporation's financial problems have also spooked companies, so more of them are looking to switch to Postgres.


> There are rumblings that the MySQL project is rudderless after Oracle fired the team working on the open-source project in September 2025.

Not just the open-source project; 80%+ (depending a bit on when you start counting) of the MySQL team as a whole was let go, and the SVP in charge of MySQL was, eh, “moving to another part of the org to spend more time with his family”. There was never really a separate “MySQL Community Edition team” that you could fire, although of course there were teams that worked mostly or entirely on projects that were not open-sourced.


Wouldn't Oracle need those 80%+ devs if they wanted to shift their efforts into Heatwave? That percentage sounds too huge to me, and if it's true, I don't believe they will be making any larger investments into Heatwave either. There are several core teams in MySQL, and if you let those people go ... I don't know what to make of it other than that Oracle is completely moving away from MySQL as a strategic component of its business.

> Wouldn't Oracle need those 80%+ devs if they wanted to shift their efforts into Heatwave?

They would, so Heatwave is also going to suffer because of this.


So, AI ate the cake ... I always thought that the investment Oracle needs to make in MySQL is peanuts compared to Oracle's total revenue and the revenue MySQL is generating. Perhaps the latter is not so true anymore.

Percona I suppose?

> Nothing about time series-oriented databases?

https://www.cs.cmu.edu/~pavlo/blog/2026/01/2025-databases-re...


> I can't believe that article has no mention of SQLite ??

https://www.cs.cmu.edu/~pavlo/blog/2026/01/2025-databases-re...


Thanks for catching this. Updated: https://www.cs.cmu.edu/~pavlo/blog/2026/01/2025-databases-re...

I need to figure out an automatic way to track these.


> If the org behind it ever decides to rugpull/elastic you

I love that you use "elastic" as a verb here.


> so... you take 10%-30% performance hit _right away_, and you perpetually give up any opportunities to improve the decoder in the future.

The WASM is meant as a backup. If you have the native decoder installed (e.g., as a crate), then the system will prefer to use that; otherwise, it falls back to WASM. A 10-30% performance hit is worth it over not being able to read a file at all.
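
Roughly, the dispatch looks like this; a minimal Rust sketch where the registry and trait names are made up, not the actual F3 API:

    use std::collections::HashMap;

    trait Decoder {
        fn decode(&self, data: &[u8]) -> Vec<u8>;
    }

    struct Registry {
        // Decoders that are linked into this binary natively.
        native: HashMap<String, Box<dyn Decoder>>,
    }

    impl Registry {
        fn decode(&self, encoding_id: &str, data: &[u8], wasm_blob: &[u8]) -> Vec<u8> {
            match self.native.get(encoding_id) {
                // Prefer the native implementation when one is installed...
                Some(decoder) => decoder.decode(data),
                // ...otherwise run the decoder embedded in the file itself:
                // 10-30% slower, but the file stays readable anywhere.
                None => run_wasm_decoder(wasm_blob, data),
            }
        }
    }

    // Stub for illustration; a real system would instantiate the WASM
    // module (e.g., with wasmtime) and call its exported decode function.
    fn run_wasm_decoder(_wasm_blob: &[u8], data: &[u8]) -> Vec<u8> {
        data.to_vec()
    }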


It even says so right in the abstract:

"Embedding the decoders in each file requires minimal storage (kilobytes) and ensures compatibility on any platform in case native decoders are unavailable."

The idea that software I write today can decode a data file written ten years from now using new encodings is quite appealing.

And likewise the idea that new software written to make use of the new encodings doesn't have to carry the burden of implementing the whole history of decoders for backwards compatibility.


Now you have code stored in your database, and you don't know what it will do when you execute it.

Sounds very much like the security pain from macros in Excel and Microsoft Word that could do anything.

This is why most PDF readers will ignore any JavaScript embedded inside PDF files.


It gets even better further down the paper!

"In case users prefer native decoding speed over Wasm, F3 plans to offer an option to associate a URL with each Wasm binary, pointing to source code or a precompiled library."


They are not suggesting that the code at the URL would be automatically downloaded. It would be up to you to get the code and build it into your application like any other library.


Is this relevant in practice? Say I go to a website to download some data, but a malicious actor has injected an evil decoder (that does what exactly?). They could just have injected the wasm into the website I am visiting to get the data!

In fact, WASM was explicitly designed to let me run unverified blobs from random sources safely on my computer.
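
For example, with the wasmtime crate you can instantiate a module with zero host imports, so the blob can only compute over its own linear memory: no files, no network, no syscalls. A minimal sketch (the "decode" export and its signature are assumptions, not any real format's ABI):

    // Requires the wasmtime and anyhow crates.
    use wasmtime::{Engine, Instance, Module, Store};

    fn run_untrusted(wasm_bytes: &[u8], ptr: i32, len: i32) -> anyhow::Result<i32> {
        let engine = Engine::default();
        let module = Module::new(&engine, wasm_bytes)?;
        let mut store = Store::new(&engine, ());
        // No imports supplied: the module gets no access to the host.
        let instance = Instance::new(&mut store, &module, &[])?;
        let decode = instance.get_typed_func::<(i32, i32), i32>(&mut store, "decode")?;
        Ok(decode.call(&mut store, (ptr, len))?)
    }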


Excel, Word and PDF readers weren’t properly sandboxed.


The backstory is complicated. The plan was to establish a consortium between CMU, Tsinghua, Meta, CWI, VoltronData, Nvidia, and SpiralDB to unify behind a single file format. But that fell through after CMU's lawyers freaked out over Meta's NDA requirements for getting access to a preview of Velox Nimble. IANAL, but Meta's NDA seemed reasonable to me. The consortium plan collapsed after about a year, and then everyone released their own format:

→ Meta's Nimble: https://github.com/facebookincubator/nimble

→ CWI's FastLanes: https://github.com/cwida/FastLanes

→ SpiralDB's Vortex: https://vortex.dev

→ CMU + Tsinghua F3: https://github.com/future-file-format/f3

On the research side, we (CMU + Tsinghua) weren't interested in developing new encoders and instead wanted to focus on the WASM embedding part. The original idea came as a suggestion from Hannes@DuckDB to Wes McKinney (a co-author with us). We just used Vortex's implementations since they were in Rust and with some tweaks we could get most of them to compile to WASM. Vortex is orthogonal to the F3 project and has the engineering energy necessary to support it. F3 is an academic prototype right now.
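
For flavor, a decoder built this way is just ordinary Rust compiled for the wasm32 target (cargo build --target wasm32-unknown-unknown). A toy sketch with an illustrative bytes-in/bytes-out ABI; the real F3/Vortex interface is more involved:

    // The export name and ABI here are made up for illustration.
    #[no_mangle]
    pub extern "C" fn decode(ptr: *const u8, len: usize, out: *mut u8) -> usize {
        let input = unsafe { core::slice::from_raw_parts(ptr, len) };
        // Trivial "decoding": copy input to output. A real decoder would
        // reverse an encoding such as RLE or dictionary compression here.
        unsafe { core::ptr::copy_nonoverlapping(input.as_ptr(), out, len) };
        len
    }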

I note that the Germans also released their own file format this year that also uses WASM. But they WASM-ify the entire file and not individual column groups:

→ Germans: https://github.com/AnyBlox


Andrew, it’s always great to read the background from the author on how (and even why!) this all played out. This comment is incredibly helpful for understanding the context of why all these multiple formats were born.


If I could ask you to speculate for a second, how do you think we will go from here to a clear successor to Parquet?

Will one of the new formats absorb the others' features? Will there be a format war à la Iceberg vs. Delta Lake vs. Hudi? Will there be a new consortium now that everyone's formats are out in the wild?


... Are you saying that there are 5 competing "universal" file format projects, each with a different, incompatible approach? Is this a laughing/crying thing, or a "lots of interesting paths to explore" thing?

Also, back on topic - is your file format encryptable via that WASM embedding?


Thank you for the explanation! But what a mess.

I would love to bring these benefits to the multidimensional array world via integration with the Zarr/Icechunk formats (which I work on). But this fragmentation of formats makes it very hard to know where to start.


> • Memory-Mapping (mmap): We treat the database file as if it’s already in memory, eliminating the distinction between disk and RAM.

Ugh, not another one...


Yep, another developer enthusiastically proposing mmap as an "easy win" for database design, when in reality it often causes hard-to-debug correctness and performance problems.


To be fair, I use it to share financial time series between multiple processes, and as long as there is a single writer it works well. It's been in production for several years.


Creating a shared memory buffer by mapping it as a file is not the same as mapping files on disk. The latter has weird and subtle problems, whereas the former just works.


To be clear, I am indeed doing mmap on the same file on disk, not using shared memory (shm). But there is only one thread in one process writing to it, and the readers are tolerant of millisecond delays.
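
For the curious, the pattern is roughly this; a minimal sketch using the memmap2 crate, with a made-up file name and record layout:

    use memmap2::MmapMut;
    use std::fs::OpenOptions;

    fn main() -> std::io::Result<()> {
        let file = OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .open("ticks.bin")?;
        file.set_len(4096)?; // fixed-size region; growing means remapping

        // Safety contract: exactly one process/thread ever writes this map.
        let mut map = unsafe { MmapMut::map_mut(&file)? };

        // Write one u64 "tick" at offset 0; readers mmap the same file
        // read-only and tolerate small propagation delays.
        map[0..8].copy_from_slice(&42u64.to_le_bytes());
        map.flush()?; // msync: ask the kernel to write back dirty pages
        Ok(())
    }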


> millisecond delays

I thought you said financial time series!

But yeah, this is a case where mmap works great - convenience, not super fast, single writer and not necessarily super durable.


> I thought you said financial time series!

Yeah it is just your average normal financial time series.


Why not, though? From what I can see in the docs, these databases are supposed to be static and read-only, at least when you use them on a device.


Page cache reclamation is mostly single-threaded. It's also much cruder than what you can build in user space: it has no notion of weights for specific pages, etc.

And crossing into the kernel flushes the branch predictor caches and the TLB, so it's not free at all.


No issue if you know what you are doing. Not sure about the author, but I have known very high-perf mmap systems to run for decades without corruption or issues (in HFT/finance/payments).


Ctrl-F'd you here the moment I saw that in the article.

