Django SQLite Production Config

kgeist · on June 15, 2024

I want to warn that one huge issue with SQLite in WAL mode is that if your site gets a high load, the WAL file will grow unboundedly. I ran a stress test of 14k RPS (that's the maximum my PC can pull off in my Go application, but it probably can happen with a more modest RPS) and the WAL file quickly exploded to tens of gigabytes, which can render your machine inoperable (it almost broke my PC). The default checkpointer can't properly function if the DB is continuously written to, due to the internal limitations of SQLite.

I managed to solve it by running a separate goroutine (thread) which monitors the WAL size on disk every second: if it goes above the target size of 8 MB, my framework initiates the "slow down" mode where all reads and writes are slowed down by artificially calling sleep(), starting with 16ms and gradually increasing the sleep time according to a few heuristics. This allows the application to have small time gaps where no reads or writes happen and the checkpointer can actually proceed (the goroutine activates it manually). The slow down mode is deactivated when the WAL size is within the target size again.

I think SQLite in WAL mode is not really fit for production without this kind of hack.

tiffanyh · on June 15, 2024

Not an issue for WAL2 mode, which mitigates this issue.

—-

> In wal2 mode, wal files do not grow indefinitely

https://www.sqlite.org/cgi/src/doc/wal2/doc/wal2.md

baq · on June 15, 2024

I love this story because it shows you really should read some docs about your storage layer before you jump in head first and just assume it’ll work forever and/or gracefully handle any load including overload. Postgres is the same, it’ll work until it won’t, in some case it’ll break in a way you can’t recover from without stopping production for a long time. (You can guess how I know - you’re right, I didn’t read the relevant part of the manual.)

Read your database’s manual, people! Even just going through the table of contents will put you in the top 20%.

stavros · on June 15, 2024

Well, maybe the database should have more suitable defaults? Arguably, slowing down a bit once when the WAL grows beyond X MB to clean it up is better behavior than having it grow unboundedly, but maybe that's not possible.

kmac_ · on June 15, 2024

There is no one-size-fits-all solution in the SQL world, and using SQLite in such a scenario is not a good idea. SQLite is an embedded database for embedded use. Even PostgreSQL is not universal, as many people assume. For use cases like analytics, there are many better-suited database engines.

stavros · on June 15, 2024

Sure, but there are one-size-fits-95% solutions, and they're better than one-size-fits-90% solutions.

kmac_ · on June 15, 2024

Absolutely agree. As most projects are small or medium in size, PostgreSQL is a good default choice. Problems arise when the problem does not fit the solution or tool. For example, using SQLite for high-volume transactions and then complaining that it doesn't work.

belter · on June 15, 2024

> you can’t recover from without stopping production for a long time.

Should that not be addressed at the level of Architectural design of your System? Everything fails all the time...bla bla....and all that?

baq · on June 15, 2024

If the system is small enough for a single postgres (which let's be honest 99% of them are) and short maintenance downtimes are easily tolerated by customers it's easy to forget about some failure modes and especially if you haven't been on the receiving end of such an incident. An experienced DBA would also notice the problem if only because they've been resolving a few of them in their professional past (hence 'experienced').

anze3db · on June 15, 2024

I've heard this is a potential issue but I've never encountered it. Do you know about the experimental WAL2 branch[0] that splits the WAL file into two to circumvent this problem? You'd have to compile SQLite yourself from the WAL2 branch to try it out, but it might be worth it at your scale.

Python is much slower than Go, so I don't think we could get 14k RPS as easily with Django, but I do have to see if I can reproduce the problem in Django. Topic for a future blog post!

Thanks for sharing this, I really appreciate knowing where the limits of SQLite are!

[0] https://www.sqlite.org/cgi/src/doc/wal2/doc/wal2.md

sgbeal · on June 15, 2024

> I ran a stress test of 14k RPS

My built-in Devil's Advocate feels compelled to ask: do you have any reason to believe that your production site would get even a small fraction (single-digit percentage) of that traffic?

As a point of comparison, i have it on good authority that sqlite3's own sites typically get far less than 0.1% of that traffic. Two of those sites (sqlite.org/src and sqlite.org/forum) are database-driven applications.

kgeist · on June 15, 2024

Sure I don't expect my application to have 14k RPS all the time. However, some spikes can happen, and it can leave you with a large WAL file which doesn't shrink by itself by default. 14k RPS is the maximum my PC could pull off, I think it can happen with a more modest RPS. I ran the stress test with 12 parallel threads, so technically it can happen with just 12 RPS too, if your read or write transactions take a lot of time each (say, it has to read a large table, or you forgot to add indexes, or write transactions update many rows). In fact, a single open read transaction already can trigger checkpoint starvation. Also, it's not only about HTTP requests, my application is event-driven so a few event handlers processing events in the background can easily trigger it, too. A system can accidentally produce many events without having a lot of user requests (say, you forgot to batch events and generate N separate events for N items).

anze3db · on June 15, 2024

While this blog post was at the #2 spot on HN it was getting 0.138 rps, so yeah 14k RPS is A LOT!

That said it's still good to know when your tools will break! As far as I understand it's not really about RPS but more about constant write operations preventing the WAL file from being flushed. You can reproduce this issue with a lot fewer RPS.

kissgyorgy · on June 15, 2024

Isn't journal_size_limit is exactly for this? From the doc:

"Each time a transaction is committed or a WAL file resets, SQLite compares the size of the rollback journal file or WAL file left in the file-system to the size limit set by this pragma and if the journal or WAL file is larger it is truncated to the limit."

kgeist · on June 15, 2024

If there's checkpoint starvation, the WAL file cannot be reset, and it will still grow indefinitely even with this setting.

anze3db · on June 15, 2024

From what I understand, you still need the experimental WAL2 mode[0] to not have this problem.

[0] https://www.sqlite.org/cgi/src/doc/wal2/doc/wal2.md

stefanos82 · on June 15, 2024

I think there's a workaround if I understood the official documentation correctly: https://www.sqlite.org/wal.html#avoiding_excessively_large_w...

kgeist · on June 15, 2024

Yeah, when implementing this hack, I took the advice from this exact page:

>This scenario can be avoided by ensuring that there are "reader gaps": times when no processes are reading from the database and that checkpoints are attempted during those times

I create reader/writer gaps by calling sleep() (before opening transactions) and activate the checkpointer manually in the checkpointer goroutine.

kissgyorgy · on June 15, 2024

I would argue that if you have this problem, SQLite is not a good database for your use-case.

jfloresremar · on June 18, 2024

maybe this could help: https://turso.tech/blog/something-you-probably-want-to-know-...

brigadier132 · on June 15, 2024

From what I remember reading 1k rps is the recommended limit for sqlite right?

leetrout · on June 15, 2024

There's also a great explanation of SQLite capabilities on the server and the various settings and their effects which have some overlap with Anže's settings:

https://kerkour.com/sqlite-for-servers

Previous (brief) HN discussion on that post:

https://news.ycombinator.com/item?id=39383725

And, if this piques your interest, there was recently discussion on distributed SQLite from the same author:

https://news.ycombinator.com/item?id=39975596

anze3db · on June 15, 2024

Stephen has a whole series of blog posts on SQLite (in Rails) that I highly recommend. I've learned most of what I know about SQLite from him!

https://fractaledmind.github.io/2024/04/15/sqlite-on-rails-t...

Also, hello Lee! I miss being in a Slack with you!

leetrout · on June 17, 2024

Discord is the new slack ;)

anze3db · on June 19, 2024

I'm in a few Discord servers, but I have no idea how to find you there :)

anze3db · on June 15, 2024

Hey HN! I'm the author of the blog post. I didn't mention this in the post, but I'll try to merge some of these settings into Django by making them the new default or the default for new projects created with the start project command.

Any feedback on all of this is greatly appreciated!

stefanos82 · on June 15, 2024

When you say " I'll try to merge some of these settings into Django by making them the new default or the default for new projects created with the start project command", you mean you are a Django core developer and you plan to make them as part of the Django project?

If you are, YES PLEASE! :D I would love to have those settings as my default configuration, because the majority of my projects are tiny to small to between-small-and-medium size, therefore SQLite is more than enough for me.

anze3db · on June 15, 2024

I'm not a Django core dev, but I have managed to get my changes merged into Django already (the transaction_mode setting in 5.1 was my contribution).

Carlton does seem to be onboard with my idea[0], so I'm optimistic that we can make it happen. Comments like yours will help me make my case so thank you for that!

[0] https://fosstodon.org/@carlton/112605212812578926

nprateem · on June 15, 2024

Another +1 from me then. I've had to litter my code with retries to recover from DB locks.

anze3db · on June 15, 2024

That's rough. Is your code using transactions? If so, making sure you use `begin immediate` will help!

cqqxo4zV46cp · on June 15, 2024

IIRC Django has rightfully killed the whole “core dev” thing. And, in reality, new contributors are getting PRs merged constantly.

anze3db · on June 15, 2024

I can confirm. My experience contributing has been very positive!

belter · on June 15, 2024

Django is a critical project, and I am sure your contributions are of high quality. But making it too easy to get new contributions in, somehow raises other concerns around the project governance.

anze3db · on June 15, 2024

Don't get me wrong, the PR review is still rigorous and can take weeks. But everybody I interacted with during the process was very supportive and helpful, so the overall experience was great!

kakulukia · on June 22, 2024

Thx for this!

dajonker · on June 15, 2024

SQLite is a great database but it's not suited for applications with frequent writes or write-transactions that take longer than a couple of milliseconds.

I tried basically everything in this blog post with our Rails business application and none of it really works in practice. I expect the same to be true for Django.

The reason is that with Rails, you usually end up with a bunch of callbacks on models which cause "long" running transactions. For example, the application writes a record to the database, then uses the newly generated primary key to create a bunch of related records, etc. With a reasonable amount of business logic, this means that a transaction can easily run over 100 milliseconds. Not because the database is slow (it's not), but because the application may do all kinds of slow stuff in between the different statements that run in a transaction.

SQLite is single threaded when it comes to writes, so when an average write can take 100 ms, your throughput is already limited to about 10 transactions per second. IMMEDIATE transactions are indeed necessary, to wait dor the database to be ready before attwempting any transaction. Because at least it is easy to have a backoff/retry strategy before letting the framework run its transaction logic.

However, even with just a handful of active users, I needed to make sure that waiting transactions were retried a large number of times, to prevent users from getting a 500 error and requiring them to perform the same action again. However, users were complaining that the application was slow, and I had the metrics that told me the same.

Eventually I just switched to Postgres and all of the issues just disappeared. The users also immediately told me that the application became much more responsive.

I did notice in the metrics that a lot of the read operations actually became slower, as SQLite is really efficient at doing lots of small reads as compared to a client/server model database.

I do still think that SQLite is very suited for production applications, but only when it is read heavy or has very lightweight write transactions.

anze3db · on June 15, 2024

I agree 100% with everything you wrote. It was very surprising to me that every transaction and every write operation blocks the whole database and not only the table it's performed upon (it makes sense since it's all a single file, but still).

The only workaround for this is splitting your main database into multiple databases, but this bleeds into your application logic and gets messy quickly. If you are in this position, it's best to switch to Postgres as you did!

kissgyorgy · on June 15, 2024

I did the exact same thing yesterday, couple of notes:

If you subclass sqlite3.base.DatabaseWrapper, it will issue the required PRAGMAs defined by Django, where foreign_keys = ON and legacy_alter_table = OFF.

I don't think synchronous = NORMAL worth it, as there is a tiny-tiny chance you will lose data. The relevant section from SQLite doc: "A transaction committed in WAL mode with synchronous=NORMAL might roll back following a power loss or system crash."

IMMEDIATE mode might not be needed and your application might never get a database locked error, I would only use that when I see the first error.

You can read about mmap_size in depth here: https://oldmoe.blog/2024/02/03/turn-on-mmap-support-for-your...

hu3 · on June 15, 2024

Does anyone have experience with running SQLite with mounted Docker volumes in production?

I wonder if Docker provides all the i/o features that SQLite requires to function properly.

tzot · on June 15, 2024

Is there a point for the PRAGMA journal_size_limit when we set the database to WAL mode?

anze3db · on June 15, 2024

From what I know, the journal_size_limit PRAGMA still affects WAL mode, but it doesn't solve the issue of the WAL file potentially growing uncontrollably. Am I missing something?

tzot · on June 15, 2024

Indeed on rereading my question I see it was not phrased correctly.

Yes, the `journal_size_limit` affects the maximum journal/wal file that remains on disk if larger than that — and by the way ensures that these files are not deleted once created.

Your setting it to ~25MiB while the default `wal_autocheckpoint` PRAGMA is set to 1000 pages (with the typical page size of 4KiB that means after ~4MiB the WAL file contents will get moved to the main database file if no other transaction is active) is what confused me. 25MiB seems very specific for a file size to keep in the occasion that the WAL file keeps growing beyond 4MiB. Perhaps you also meant to tinker with the `wal_autocheckpoint` PRAGMA but didn't?

kgeist · on June 15, 2024

https://sqlite.org/forum/info/54e791a519a225de

>Journal size limit is measured in bytes and only applies to an empty journal file.

>The WAL file will grow without bounds until a checkpoint takes place that reaches the very end of the WAL file. Usually, a checkpoint is performed when a commit causes the WAL file to be longer than 1000 pages (PAGES, not bytes). There are conditions when running a checkpoint to completion is not possible, like disabled checkpointing, checkpoint starvation because of open read transactions and large write transactions.