Some of us just aren't smart enough for SQL. I'm perpetually running into the situation where I want to join one table with another that has multiple matching rows. Like a blog post with tags. Exactly like this: https://stackoverflow.com/questions/8201462/join-with-anothe...
For which the answer is: oh, just use GROUP_CONCAT, which isn't even standard SQL. And I've still got to fix the result up by running split() when I get it back. Nor does it work particularly well if you're joining against anything that isn't a single string column.
So I just bang the rocks together like a caveman and write a second query to go back and fetch all the tags, then loop around and assign them to the matching posts.
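For what it's worth, the "caveman" version is short and predictable. A minimal sqlite3 sketch (table and column names invented here) showing both the GROUP_CONCAT answer and the second-query approach:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tags (post_id INTEGER, name TEXT);
    INSERT INTO posts VALUES (1, 'Hello'), (2, 'World');
    INSERT INTO tags VALUES (1, 'sql'), (1, 'orm'), (2, 'go');
""")

# Option 1: GROUP_CONCAT. One query, but you get a string back,
# have to split() it, and it only works for a single text column.
rows = conn.execute("""
    SELECT p.id, p.title, GROUP_CONCAT(t.name) AS tag_csv
    FROM posts p LEFT JOIN tags t ON t.post_id = p.id
    GROUP BY p.id
""").fetchall()
concat = {pid: csv.split(",") if csv else [] for pid, _, csv in rows}

# Option 2: two queries, then stitch the tags onto the posts in code.
posts = {pid: {"title": title, "tags": []}
         for pid, title in conn.execute("SELECT id, title FROM posts")}
for post_id, name in conn.execute("SELECT post_id, name FROM tags"):
    posts[post_id]["tags"].append(name)
```

Option 2 transfers no duplicated post data and works no matter how many columns the tag rows have.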
This seems fundamentally unfixable with current ORMs. You have to pick between lazy loading per entity, or eager loading that Cartesian-products everything, which breaks badly if you're dealing with multiple one-to-many relations.
Our solution was to write our own ORM-like system that "remembers" which entities you've loaded within the context of the transaction, and then fetches the relation for all of them at once. So if you access a1.b, it will also fetch a2.b and cache it, provided you loaded a2 within that same transaction. The call to a2.b will then resolve instantly. So instead of N queries, with N being the number of entities loaded (as in lazy loading), you're doing M queries, with M being the number of relations touched.
The one bad failure case is if you update entities one by one while also accessing their relations in a loop, since you invalidate the cache for that entity type on every write.
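A toy version of that idea (all names here are invented, and a dict stands in for the database): keep an identity map of everything loaded in the transaction, and the first touch of a relation batch-fetches it for all of them:

```python
class Context:
    """Toy per-transaction context: remembers every entity loaded and
    batch-fetches the relation for all of them on first access."""
    def __init__(self, relation_rows):
        self.loaded = []       # identity map of entities in this transaction
        self.cache = None      # relation cache, filled on first access
        self.queries = 0       # how many relation queries actually ran
        self._relation_rows = relation_rows  # stands in for the b-table

    def load(self, entity_id):
        entity = A(entity_id, self)
        self.loaded.append(entity)
        return entity

    def fetch_b(self, entity_id):
        if self.cache is None:
            # ONE query covering ALL loaded entities, not just the one asked for.
            ids = [e.id for e in self.loaded]
            self.cache = {i: self._relation_rows.get(i, []) for i in ids}
            self.queries += 1
        return self.cache.get(entity_id, [])

class A:
    def __init__(self, entity_id, ctx):
        self.id, self.ctx = entity_id, ctx

    @property
    def b(self):
        return self.ctx.fetch_b(self.id)

ctx = Context({1: ["x"], 2: ["y", "z"]})
a1, a2 = ctx.load(1), ctx.load(2)
first = a1.b    # triggers the single batched fetch, for a1 AND a2
second = a2.b   # resolves instantly from the cache
```

The write-invalidation failure case would correspond to resetting `cache` to None on every update, forcing a re-fetch per loop iteration.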
Django (Python) does lazy loading by default, Cartesian product with "select_related()", and has a third option called "prefetch_related()" where it does only 2 queries and then does the join programmatically for you instead of in the database. It's kind of like your custom system, except you do have to specify ahead of time which fields to prefetch.
It's also had this for over a decade so I have to wonder how rare it actually is in ORMs in general...
Interesting. Does it do this separate query for all entities in scope, or per entity, like Hibernate's SELECT FetchMode? I find a separate SELECT is usually possible, but doesn't quite solve this in the general case.
Perhaps I missed Django, but as far as I could tell, Hibernate and jooq can’t do this, Ecto can’t do it, ActiveRecord can’t, and the entirety of ORMs for JS can’t (TypeORM, Prisma, Drizzle, Sequelize and a bunch more).
I'm not sure I understand the question. If I'm reading the Hibernate examples right, I already answered that in my first comment: lazy loading (a separate query for each row, the 1+N problem) is the default, and prefetch_related() adds one query per field (1+1 if you only need one relationship, 1+2 for two relationships, etc.).
Django also specifies foreign keys on the model, which are used by name when doing queries. From their examples in prefetch_related() [0], a query that would otherwise involve 2 levels of lazy loading becomes, with prefetch_related(), 3 total queries that Django joins together in code.
jOOQ doesn't get involved in any such prefetch/eager fetch shenanigans, which are often wrong. They work for some cases and do too little/too much for others. So, specifying the exact data to fetch in every query is a much better approach, ideally with nested collections:
https://blog.jooq.org/jooq-3-15s-new-multiset-operator-will-...
Not really true. You can have separate queries to avoid Cartesian explosions. EF Core implements it and it's standard practice to use, but it's not enabled by default, because if you don't order by something unique you'll have a bad time.
I think that's exactly correct. You either do split queries (with more latency) or you do a join (and risk Cartesian explosion). Most ORMs should do this for you.
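The explosion is easy to see with two one-to-many relations joined at once (schema invented for illustration): one post with 3 tags and 4 comments comes back as 3 × 4 = 12 rows, duplicating every tag and comment, while split queries transfer only 3 + 4 relation rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY);
    CREATE TABLE tags (post_id INTEGER, name TEXT);
    CREATE TABLE comments (post_id INTEGER, body TEXT);
    INSERT INTO posts VALUES (1);
    INSERT INTO tags VALUES (1,'a'), (1,'b'), (1,'c');
    INSERT INTO comments VALUES (1,'c1'), (1,'c2'), (1,'c3'), (1,'c4');
""")

# Eager loading both relations in one join: every tag is paired
# with every comment, so the row count multiplies.
joined = conn.execute("""
    SELECT p.id, t.name, c.body
    FROM posts p
    JOIN tags t ON t.post_id = p.id
    JOIN comments c ON c.post_id = p.id
""").fetchall()

# Split queries: extra round trips, but no multiplication.
tag_rows = conn.execute("SELECT post_id, name FROM tags").fetchall()
comment_rows = conn.execute("SELECT post_id, body FROM comments").fetchall()
```

With 10 tags and 10 comments the join would be 100 rows versus 20, which is why the tradeoff tips toward split queries as relations grow.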
SQL is easy to understand from the perspective of syntax, but depending on the data model it quickly becomes challenging to actually get what you want out of a query. I have spent many years working in SQL across different server products (PostgreSQL, MySQL, Microsoft SQL Server, and others), and I still sometimes find myself struggling to build the most efficient query for the task. Most of the time it's not SQL that's the problem, it's a broken data model that preceded me (often created by an ORM) that I'm forced to work with.
All that said, while it may be painful sometimes, learning it can at least let you poke your ORM more towards the right direction, or give you an opportunity to identify the hot spots where it's worth investing the effort to write a more efficient query directly, much like where sometimes it makes sense to write some subset of an application in assembly for performance, while most of the time you're better off just letting the compiler optimize for you.
For the problem you're talking about above, you're trying to avoid a Cartesian explosion, which is probably best handled through query splitting as you mentioned, but how best to approach that depends on the database engine. For all its warts, Microsoft SQL Server is pretty good about things like letting you use query splitting within the context of a single transaction to make it serializable and ensure data consistency (although that can come with its own performance challenges).
The example you provided is nearly exactly the case where I would expect an ORM to produce a bad query, whereas a directly built query would be significantly more performant, because you can incorporate a clear understanding of the data model and use limits and ordering to minimize the size of the result set you need to deal with.
I've heard this all my career. And I went through several phases with SQL including the banging rocks phase.
No, some of us just haven't had time to improve. I wrote my first sql in the early 2000s. I wasn't aware of window functions and lateral joins until 10 years later. Took me another couple of years to start using CTEs.
I don't even write SQL that much. I know enough SQL to make ORMs just do what I want.
Or else, if the post is talking about a "public-facing API resource", can someone tell me why the API wouldn't implement querying for multiple of the same record type at once? Forcing a choice between getting 1 owner and "get ALL owners" (as TFA puts it) seems like insisting on a law of the excluded middle.
It's tricky because in some cases you might be able to batch all the queries up front, but in others you will only know the IDs you need to fetch after you get one or more intermediate results back and apply business logic to arrive at a decision.
As of today there's no silver bullet beyond having a solid and principles-first understanding of your database and related infrastructure.
This might be true for some APIs, but if you're letting the caller control what they fetch, you don't really have that luxury. An extreme example is GraphQL-based APIs, but even REST-based APIs might allow some flexibility in deciding what to fetch.
That was my first thought as well. Devs will do anything to avoid learning SQL.
You’re spending 20x the code and probably 1000x the time coming up with this nonsense, and for what – so you can say there isn’t any raw SQL in your code base? It’s as ridiculous as denouncing someone for occasionally dropping into Assembly for a specific purpose, or writing a bit of C to call from Python, etc.
(I don't know him personally, but have been following his blog for years.)
What these "just write SQL" rants are missing is encapsulation--let's say you've got a piece of business logic, like "if the account is disabled, render the account name as 'Foo (disabled)'".
You want to write this logic in your preferred backend language, Go/TS/C/etc.
This works fine, in the /account/X endpoint (just load the account, and apply the logic).
But now what about the /accounts/client:Y endpoint? (Load all accounts, apply the logic to each of them.)
As time goes by, you end up having 10-20 endpoints that all "return some part of account" as their payload, and you want the same "Foo (disabled)" business logic.
Your options are:
1. Build a single, giant SQL statement that strings together every snippet of business logic applicable to this endpoint (bulk friendly b/c the db is doing all the cross-entity work w/joins)
2. Push the business logic down into the db layer (simple for this, just string concatenation with an if, but what about anything that is a loop? doable in SQL but tedious)
3. Use your language's abstractions, like functions, to organize the business logic (what Brandur is attempting to do).
Nearly everyone wants to do 3, because 1 doesn't scale as your business logic becomes more & more sophisticated (copy/pasting it around every endpoint's single-giant SQL statements, or pushing it down into the db as views/stored procedures).
> because 1 doesn't scale as your business logic becomes more & more sophisticated (copy/pasting it around every endpoint's single-giant SQL statements, or pushing it down into the db as views/stored procedures).
I think the idea that doing business logic inside the DB is an anti-pattern is cargo culting. You can quite easily store schema definitions, stored procedures, etc. inside version control. You can document it just as you would any other code. You don’t have to see what precisely the call is doing at all times, provided you understand its signature.
Letting the DB do more than just be a dumb data store is a great idea, IMO, and one that is too often eschewed in favor of overly-complicated schemes.
I know putting business logic in the DB has been used very successfully, but it also has some large downsides.
It's harder to express some things as SQL and your team will be less familiar with SQL than your preferred language. SQL language tooling (code autocomplete, testing libs, etc) is also far less mature than tooling for most languages.
It's harder to debug when things go wrong. Where do you put a breakpoint?
It's likely your team ends up splitting your business logic between your app and your DB, and then you have to figure out which part is writing incorrect data when there's a problem.
For my apps, I'm trying to set up the database to reject incorrect information (with types and constraints and sometimes triggers), but have the app code do all the business logic and writing data.
Honestly, it's too soon to tell whether my apps will be successful with this approach. They haven't yet gone through years of real world usage and requirements changing. But so far it's gone well!
This is true, but then, it's also harder to write Rust than Python. There are tradeoffs which are made for every choice.
> and your team will be less familiar with SQL than your preferred language.
Also true, but given how SQL is practically everywhere and is not likely to go away any time soon, I strongly feel that nearly every dev could benefit from learning it.
> SQL language tooling (code autocomplete, testing libs, etc) is also far less mature than tooling for most languages.
Again, true. Personally I dislike autocomplete functions (they break my flow, needing to parse the suggestion and accepting it, rather than just typing from muscle memory), but I get that others like them. Re: testing, I have an unpopular opinion: this matters very little compared to most other languages. Declarative languages like SQL or Terraform generally do not produce surprises. If you get an unexpected output, it's probably because you have incorrect logic. There are some testing frameworks that exist for various flavors of SQL should you wish, but IMO as long as you have some rigorous E2E tests, and your dev/staging environments are accurate and representative of prod, you'll be fine.
> For my apps, I'm trying to set up the database to reject incorrect information (with types and constraints and sometimes triggers), but have the app code do all the business logic and writing data.
Hey, you're doing better than most! I've never understood people's unwillingness to use things like length constraints. Sure, your app should ideally catch basic issues so they never make it to the DB, but I'd rather have the guarantee. The last thing you want is for someone to find a way to inject the entirety of Moby Dick into a `name` field.
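A length constraint really is a one-liner in the schema. A sqlite3 sketch (SQLite syntax; other engines spell it VARCHAR(n) or the same CHECK) where the database itself rejects the Moby Dick case:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL CHECK (length(name) <= 100)
    )
""")

# A normal name goes in fine.
conn.execute("INSERT INTO users (name) VALUES (?)", ("Ishmael",))

# A novel-sized name is rejected by the CHECK constraint,
# regardless of whether the app code validated it first.
try:
    conn.execute("INSERT INTO users (name) VALUES (?)",
                 ("Call me Ishmael. " * 1000,))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The guarantee holds for every write path, including the ad-hoc script someone runs against prod two years from now.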
for 3, you could write a stored proc to handle the various situations, and call that stored proc appropriately, letting the DB engine optimize appending "(disabled)".
However, I do wish Go had a map function a la Python's map(), as opposed to verbosely writing more for loops or a complicated thread-safe goroutine. It seems almost strange that Google, which popularized the mapreduce model, doesn't want it in Go.
Stored procedures are also screeched at as being anti-patterns, somehow. I don’t get it; you can store them in version control just like anything else, and add comments where necessary in the code base indicating what the DB is doing.
PostgREST for instance avoids the alleged problem by generating a single SQL statement. So does Hasura, PostGraphile, and probably Prisma for that matter.