
I agree that linear reads and writes are not relevant these days in most settings.

I've worked on my own DB engine that uses a structure similar to an LSM (but it's not an LSM tree), where the target was the highest possible performance (millions of TPS) for random writes mixed with semi-sorted writes and random reads on current SSDs. There's no need for any data to be allocated sequentially on those drives, other than just enough aggregation to ensure a block size large enough to keep IOPS down during streaming and merging operations, when IOPS-bound. Indeed it's better to fragment to reuse already filesystem-allocated space where possible - that lowers overhead on the filesystem.
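To make that last point concrete, here's a minimal sketch of the reuse idea (my own illustration, not the engine above): hand previously allocated file regions back out for new blocks instead of always appending. The fixed block size and names are assumptions.

    // Sketch only: keep a free list of file regions freed by merges and
    // reuse them, rather than always growing the file with new extents.
    #include <cstdint>
    #include <vector>

    constexpr uint64_t kBlockBytes = 512 * 1024;  // assumed block size

    class BlockAllocator {
     public:
      // Prefer a previously allocated slot; only extend the file if none free.
      uint64_t Allocate() {
        if (!free_.empty()) {
          uint64_t off = free_.back();
          free_.pop_back();
          return off;               // no new filesystem extents needed
        }
        uint64_t off = end_;
        end_ += kBlockBytes;        // grows the file: new extent allocation
        return off;
      }

      void Free(uint64_t offset) { free_.push_back(offset); }

     private:
      uint64_t end_ = 0;            // current end-of-file offset
      std::vector<uint64_t> free_;  // offsets of reusable blocks
    };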

I also agree that a well tuned RocksDB can perform very well, and that the authors have done the work, and that it has methods to reduce avoidable write amplification.

However, the RocksDB applications I've seen haven't used the fancy APIs to get the most out of it. They just used it as a plain k-v store with little understanding of what makes a DB sing, and got mediocre performance as a result.
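Roughly, the difference looks like this. These are real RocksDB options, but the values are placeholders and this is only a sketch of the kind of surface area that gets ignored, not a tuning recommendation:

    // Sketch: "plain k-v" usage stops at default options and single Put/Get.
    // A tuned setup at least touches parallelism, compaction, filters, and
    // batched writes. Placeholder values throughout.
    #include <rocksdb/db.h>
    #include <rocksdb/filter_policy.h>
    #include <rocksdb/table.h>
    #include <rocksdb/write_batch.h>

    void Example() {
      rocksdb::Options options;
      options.create_if_missing = true;

      // A few of the knobs that actually make it sing:
      options.IncreaseParallelism();            // background threads
      options.OptimizeLevelStyleCompaction();   // sane level sizing
      rocksdb::BlockBasedTableOptions table_opts;
      table_opts.filter_policy.reset(
          rocksdb::NewBloomFilterPolicy(10));   // point-lookup filters
      options.table_factory.reset(
          rocksdb::NewBlockBasedTableFactory(table_opts));

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/example-db", &db);
      if (!s.ok()) return;

      // Batch writes instead of one Put() call per key.
      rocksdb::WriteBatch batch;
      batch.Put("key1", "value1");
      batch.Put("key2", "value2");
      db->Write(rocksdb::WriteOptions(), &batch);

      delete db;
    }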



> There's no need for any data to be allocated sequentially

There are two cases where it's critical. The first is write-ahead logs. Those are pre-allocated, opened, and ready in order to prevent latency spikes.

The second is to control write amplification on SSD writes. If you write 100K bytes, then issue a hardware flush, then write another 100K, that's two writes to the same SSD block, meaning you'll burn it out twice as fast. That's a huge no-no. Keeping the flash burn rate down is one of the things LSMs are good at.
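A minimal sketch of both points, assuming plain POSIX file APIs and a made-up 256 KiB flash-friendly flush size (this is the idea, not how RocksDB actually implements its WAL):

    #include <string>
    #include <fcntl.h>
    #include <unistd.h>

    // Point one: reserve the log's space up front so later appends never wait
    // on filesystem block allocation (no latency spikes on the write path).
    int OpenPreallocatedLog(const char* path, off_t size) {
      int fd = ::open(path, O_CREAT | O_WRONLY, 0644);
      if (fd < 0) return -1;
      if (::posix_fallocate(fd, 0, size) != 0) { ::close(fd); return -1; }
      return fd;
    }

    // Point two: group appends so each hardware flush persists a whole
    // flash-sized chunk, instead of asking the device to program the same
    // block twice for two back-to-back 100K writes. Error handling omitted.
    class GroupedLogWriter {
     public:
      explicit GroupedLogWriter(int fd) : fd_(fd) {}

      void Append(const std::string& rec) {
        pending_ += rec;
        if (pending_.size() >= kFlushBytes) Sync();
      }

      void Sync() {
        if (pending_.empty()) return;
        ::write(fd_, pending_.data(), pending_.size());
        ::fdatasync(fd_);
        pending_.clear();
      }

     private:
      static constexpr size_t kFlushBytes = 256 * 1024;  // assumed size
      int fd_;
      std::string pending_;
    };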

But again, things have changed since LSMs first became en vogue. Previously it was critically important to write these files out sequentially, because it meant that large parts of the file were laid out sequentially on HDD tracks. That did work, it was important, and it wasn't the filesystem: it was all about issuing huge linear writes that had to be spilled to disk. There were also times when the added speed of reading data sequentially was important (read-ahead caching, OLAP-style queries, etc.).

Now the linear reads and writes are structured around SSD block sizes, so 512K or so. Linear read speed is bottlenecked much more on CPU.

>Indeed it's better to fragment to reuse already filesystem-allocated space where possible - that lowers overhead on the filesystem.

I don't understand what filesystem overhead you're referring to here. A read comes in for a key. The SST files are already pre-opened so there's no FS interaction.

Then you walk the index blocks. Almost always those are in cache from just opening the files. (So I'll hand-wave a little here, but index blocks are just like data blocks with near-100% hit rates.)

From there we have walked the index and know that the key we're looking for could be in a specific block. We have the offset in the file to a block.

That block is located on a single logical hardware unit (HDD sector, SSD block, RAID stripe, etc.). So the read that finally goes to the FS layer translates one (index, offset) lookup into a read of a single hardware sector. The page cache is off and DirectIO is used; read-ahead is tuned at the application layer, not the FS layer. That results in a single read operation being sent to the NVMe/SSD.

Essentially the SST file is structured such that reads use extents only as a translation layer between two address spaces: the hardware's and the file's. That translation layer doesn't buy LSMs nearly as much as it buys regular file users, and we had to force everything to align so that each read of a key ends up as a single hardware-sector read.
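In code, that final hop looks something like this. A sketch only: the 4 KiB alignment and 512 KiB block size are assumptions, and real values depend on the device and table format.

    #include <cstdlib>
    #include <fcntl.h>
    #include <unistd.h>

    constexpr size_t kAlign = 4096;            // O_DIRECT alignment assumption
    constexpr size_t kBlockBytes = 512 * 1024; // assumed data block size

    // One aligned pread for the one block the index pointed at.
    ssize_t ReadDataBlock(int fd, off_t block_offset, void** out) {
      void* buf = nullptr;
      // O_DIRECT needs aligned buffer, offset, and length.
      if (posix_memalign(&buf, kAlign, kBlockBytes) != 0) return -1;
      ssize_t n = ::pread(fd, buf, kBlockBytes, block_offset);
      if (n < 0) { free(buf); return -1; }
      *out = buf;                              // caller frees
      return n;                                // one I/O, one hardware unit
    }

    // The file itself would be opened once, up front, bypassing the page cache:
    //   int fd = open("table.sst", O_RDONLY | O_DIRECT);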

Contrast that with using the FS more and relying on hardware structure less. I'll assume, as before, that walking the index is always in memory. So then we need to read some segment of the file. Issue a read; that read goes to the page cache. The page cache has lots of fun with locking and some very strange behavior. We're then left with some uncached data to read. The larger key/value sizes get, the more likely it is that you have extra to read. E.g. you only need 100K, but that 100K crosses an SSD block or RAID stripe boundary, so now you're waiting on two operations. Those need to fill the page cache, then return the data to the waiting buffers.
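To put a number on that boundary case, a tiny sketch (the 512K stripe size is just an example value):

    #include <cstdint>
    #include <cstdio>

    // How many device-sized units does a read of [offset, offset+len) touch?
    uint64_t UnitsTouched(uint64_t offset, uint64_t len, uint64_t unit) {
      uint64_t first = offset / unit;
      uint64_t last = (offset + len - 1) / unit;
      return last - first + 1;
    }

    int main() {
      const uint64_t stripe = 512 * 1024;  // assumed stripe/block size
      // A 100K read that straddles a stripe boundary:
      std::printf("%llu\n", (unsigned long long)UnitsTouched(
          stripe - 50 * 1024, 100 * 1024, stripe));
      // prints 2: two operations to wait on instead of one.
      return 0;
    }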


And then there's the SSD controller, which yet again tries to maintain the illusion of efficiently supporting random I/O. For software that tries to get the best possible performance out of the physical media, avoiding the OS and controller overhead, it's definitely important to keep data in sequential blocks. In fact, NVMe ZNS [0] doesn't even allow random writes.

[0] - https://zonedstorage.io/docs/introduction/zns#overview
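Conceptually (this is not the kernel or libzbd API, just an illustration of the constraint), ZNS means the application tracks a write pointer per zone, and every write has to land exactly there:

    #include <cstdint>
    #include <stdexcept>

    struct Zone {
      uint64_t start;          // zone start offset (bytes, for simplicity)
      uint64_t write_pointer;  // next writable offset within the zone
      uint64_t capacity;       // writable bytes in the zone
    };

    uint64_t AppendToZone(Zone& z, uint64_t len) {
      if (z.write_pointer + len > z.start + z.capacity)
        throw std::runtime_error("zone full: pick or reset another zone");
      uint64_t where = z.write_pointer;  // writes are only legal here
      z.write_pointer += len;            // strictly sequential within a zone
      return where;                      // offset to issue the write against
    }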


At this point we are three levels of logs deep: we have a database with a log-structured merge tree, we have a journaling filesystem like ext4, and we have an SSD controller that uses log structures internally for wear leveling.

I recommend the paper, “Don’t stack your Log on my Log”:

https://www.usenix.org/system/files/conference/inflow14/infl...

Basically, the beautiful thing about log structures is that they work well on “dumb” storage. Paradoxically, when you put logs on top of logs, you can make things worse. This is unintuitive, but it reminds me of the fact that if you tunnel a TCP connection over another TCP connection, the result is something which is much less reliable than an ordinary TCP connection operating over an unreliable network.


Plus cache layers upon cache layers



