
You seem to be taking some kind of extreme example of a particular use case of yours and trying to make it a general rule. But it's not.

And none of this is "says me"; it's standard practice, relational databases 101. And none of this is about NoSQL, it's about relational databases. NoSQL performance can be abysmal when you try to do the things relational databases are meant for.

And the overhead is not just network latency; it's everything involved in serializing, transferring, and deserializing the data, and then doing it all over again in the other direction.

Your comment seems to boil down to:

> If Foo has a Bar, and there are 10 million foo and 1000 Bar used throughout them, then it's faster, less network, and less data intense to load up bar separately from foo.

I assume you're not retrieving 10 million Foo for the user, god forbid -- you're retrieving 20 or 50 or something user-friendly. Then you should join to Bar. Loading Bar separately from Foo is slower and adds overhead; it's an anti-pattern.
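Roughly what I mean, sketched against a hypothetical schema (the foo/bar tables, the columns, and the :user_id parameter are all made up for illustration):

    SELECT f.id, f.created_at, b.name AS bar_name
    FROM foo f
    JOIN bar b ON b.id = f.bar_id
    WHERE f.owner_id = :user_id
    ORDER BY f.created_at DESC
    LIMIT 50;

With a suitable index on foo, the planner narrows Foo down to the page of 50 first and then looks up only the handful of Bar rows those 50 reference -- one round trip, no application-side stitching.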

If you are getting results to the contrary, your query may not be written correctly -- e.g. you are joining 10 million rows of Foo to Bar in a subquery without a WHERE clause, and then only applying the WHERE at a higher level (in which case one solution is to move the WHERE clause into the subquery). Or your tables may not be architected suitably for the queries you need to perform, and you need to revisit your normalization strategy.
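To make that failure mode concrete (same hypothetical schema), the bad shape looks like this:

    -- Anti-pattern: if the planner can't push the predicate down,
    -- the inner query joins all 10 million Foo rows to Bar before
    -- the outer WHERE throws most of them away
    SELECT *
    FROM (
        SELECT f.*, b.name AS bar_name
        FROM foo f
        JOIN bar b ON b.id = f.bar_id
    ) sub
    WHERE sub.owner_id = :user_id;

    -- Fix: move the predicate into the subquery (or drop the
    -- subquery entirely) so the join only ever sees matching rows
    SELECT f.*, b.name AS bar_name
    FROM foo f
    JOIN bar b ON b.id = f.bar_id
    WHERE f.owner_id = :user_id;

Many modern planners will push that predicate down on their own, but things like DISTINCT, window functions, or materialization fences in the subquery can block the rewrite, and then you pay for the full join.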

Again, there are super-super-complex queries where, yes, it becomes necessary to split them up. But that is not the "rule", it is not the starting point -- it is what you do only when you've exhausted every option for keeping it in the query. It is never a recommendation for how to use databases in a general sense, which is what you are suggesting.



> use case of yours and trying to make it a general rule

This is a fair critique. Our system is definitely a bit unusual in what it works with and in the amount of assorted data it needs to pull together.

> it's about all of the overhead involved in serializing, transferring, deserializing, and all the way back.

Serializing and deserializing are typically not a huge cost in DB communications. Most databases have binary wire protocols that minimize the effort needed, on either the server or the client side, to transform the data into native language datatypes. It's not going to be a JSON protocol.

Transfer can be a problem, though, if the dataset is large.

> I assume you're not retrieving 10 million Foo for the user, god forbid

In our most extreme cases, yeah, we actually are pulling 10 million Foo. Most of these big requests happen in our ETL backend, though, as the upstream data is being processed. That's primarily where I end up working, rather than on the frontend service.

And I'll agree with you: if you are talking about requests which return on the order of 10 to 100 items, then yes, it's faster to do it all within the database. It depends (which is what I've been trying to communicate throughout this thread).

> you are joining 10 million rows of Foo to Bar in a subquery without a WHERE clause

No, properly formed SQL. The issue is the sheer mass of data being transferred and, as I mentioned earlier, the temporary memory the DB holds while it waits to transfer everything to the application.

Splitting things into multiple smaller queries ends up being faster for us because the DB doesn't store as much temp data, nor does it serialize a bunch of `null` values, which otherwise make up a significant chunk of the transfer. The shape of it is sketched below.
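For what it's worth, the split looks roughly like this (hypothetical schema again); the application stitches the two result sets together by key:

    -- Query 1: the 10 million fact rows, carrying only the foreign key
    SELECT id, bar_id, payload
    FROM foo;

    -- Query 2: the ~1000 distinct Bar rows, fetched once
    SELECT id, name, attrs
    FROM bar;

    -- The app joins on foo.bar_id = bar.id in a hash map, so each
    -- Bar row crosses the wire once instead of once per Foo row.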

Also, now that you're talking about query structure, you should recognize that there is no universal answer for the best/fastest way to structure a query. What's good for PostgreSQL might be bad for MSSQL.



