Which is better? Performing calculations in sql or in your application? (stackoverflow.com)
53 points by hartleybrody on Sept 7, 2013 | 52 comments


This is something I've been through before.

The first version of an app I worked on was very SQL heavy. Almost every calculation was done in a stored proc and the app servers just formatted the results.

As the product got popular this became the bottleneck. It's far easier to get more app servers than DB servers.

So we restructured it to do straight index reads and aggregations in the DB, but more complex calculations in the app itself.

It all depends on the circumstances, but I'd still advocate pushing as much into the DB as you can without making convoluted SQL - your average RDBMS has amazing optimisations for aggregation, sorting and filtering.
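
For instance, a minimal sketch of the kind of work worth leaving to the DB (table and column names here are hypothetical):

    -- Filtering, aggregation and sorting pushed into the DB:
    -- the server returns 20 small rows instead of millions of raw ones.
    SELECT customer_id, COUNT(*) AS orders, SUM(total) AS revenue
    FROM orders
    WHERE placed_at >= '2013-01-01'
    GROUP BY customer_id
    ORDER BY revenue DESC
    LIMIT 20;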


But the problem with doing it in the app is that when you are joining huge row sets, all of that unfiltered data gets sent over a relatively slow 100-1000 Mbit port.


We push data over 10Gbit - 40Gbit ports, but the point still stands.


From a straight performance perspective:

The rule of thumb is that DBs are bound on IO, so any calculation that you can get with the same amount of IO as returning the records is likely to be "free", and any calculation that requires more IO is likely to be "expensive" (unless that allows you to avoid IO by restricting returned records, avoiding future queries, etc. - it gets complicated fast).

So things like column-column calculations, simple aggregates, etc. are likely to be good ideas on the DB; for anything else It Depends.
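
As a hedged illustration (hypothetical schema), both of these cost roughly the same IO as simply returning the rows:

    -- Column-column calculation: the same rows are read either way,
    -- so the arithmetic is effectively "free".
    SELECT order_id, quantity * unit_price AS line_total
    FROM order_lines;

    -- Simple aggregate over the same scan: same IO, far fewer bytes returned.
    SELECT COUNT(*), MAX(unit_price)
    FROM order_lines;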


Another fun complication: database stored procedures implemented with non-SQL code. I worked on a project where the client's dedicated SQL ninjas managed to bring the time to process daily extracts from the data warehouse down to 2min from 45min by switching over to CLR (.NET) implementations of the calculations being performed on SQL Server 2005.

IIRC the issue was simply that MSSQL 2005 is just plain slow at doing calculations compared to C# code (the calculations were complex financial models involving large amounts of data and inter-dependencies between that data in calculations, so shipping the raw data to an application was not a viable option).


The rule of thumb is that DBs are bound on IO

Isn't that just because of the way systems that make use of DBs are architected? I would think you could trivially make them CPU bound if you start tacking a bushel of FLOPs on to every request.


It's a rule of thumb, not a rule of fist.

But actually, it's not uncommon that you might have enough caching (or an in-memory DB) that you appear to be maxed out on CPU - but because of memory latency, overhead between the DB and the network, etc. you can still get a fair number of operations for "free".

Measure everything, but know where your starting points are and what to try first.


The big advantages for doing things "on the DB" are centralizing business logic and eliminating serialization/transport/deserialization. Having a view or stored proc with business logic lets it be shared across different components of the same app or even across separate apps that share a database. An extreme (yet common) example of calls best done on the DB itself is a SUM of price X quantity or any other grouping operation. Rather than moving N rows to the client, the server can aggregate it all and just send back the result.
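
Something like the following sketch (hypothetical names), where the DB sends back one row per order instead of every line item:

    SELECT order_id, SUM(price * quantity) AS order_total
    FROM order_items
    GROUP BY order_id;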

If CPU usage on your DB really is your bottleneck (and seriously it's probably not) then you should look into federating logic out to your app. Otherwise the centralization of app logic alone is worth it.


You should write isolated libraries of business logic, not bake it into your data layer.


Sometimes it makes sense to make a common library shared between apps that interacts with your database and sometimes it makes sense to put that code itself in the database (as a view or stored proc). The latter has the advantage of working in multiple languages, direct DB access (eg. your fav DB client), and external systems that also read from your database (eg. a reporting service that connects to your DB). A shared library would be useless in the second and third situation and you'd end up duplicating logic.
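
For example, a rough sketch of shared logic living in the database itself (hypothetical definition), usable from any language, DB client, or reporting tool that can run SQL:

    CREATE VIEW order_totals AS
    SELECT o.order_id,
           o.customer_id,
           SUM(i.price * i.quantity) AS order_total
    FROM orders o
    JOIN order_items i ON i.order_id = o.order_id
    GROUP BY o.order_id, o.customer_id;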


So when the time comes to change the business logic you have stored in the database, do you have to ensure that every possible client that may use that business logic is updated at the same time? You could version your business logic in the database, but then it seems like you've implemented a version controlled business logic library inside the database.

Incidentally, every time I see people attempting to put business logic in the database it's usually manually updated views or stored procedures. It reminds me of the days of yore of people shelling into production to edit php files. Other than Rails migrations, South, etc, are there usable tools out there for sanely writing software that runs inside the database?


> So when the time comes to change the business logic you have stored in the database, do you have to ensure that every possible client that may use that business logic is updated at the same time?

No. If you change any of the interfaces (e.g., the structure or semantics of a view) in a non-backward-compatible manner, rather than merely changing the implementation, you have to ensure that the consumers of the specific affected interfaces are updated. But if you've built a DB structure that isolates applications well (each application uses its own set of views, which reference the views implementing shared logic, which in turn reference the base tables), most changes to shared business logic should be completely transparent to most applications. In the normal case they don't impact even the application-specific views, and even when they do, they only require changes to the app-specific view definitions, not to the applications themselves.
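
A minimal sketch of that layering, with hypothetical names:

    -- Base table -> shared-logic view -> application-specific view.
    CREATE VIEW shared_order_totals AS          -- shared business logic
    SELECT order_id, SUM(price * quantity) AS order_total
    FROM order_items
    GROUP BY order_id;

    CREATE VIEW billing_app_orders AS           -- what one application actually reads
    SELECT order_id, order_total
    FROM shared_order_totals
    WHERE order_total > 0;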


> So when the time comes to change the business logic you have stored in the database, do you have to ensure that every possible client that may use that business logic is updated at the same time? You could version your business logic in the database, but then it seems like you've implemented a version controlled business logic library inside the database.

I don't see this as an issue at all. Consider the DB like any other software component and treat its data model as its API. Adding columns to existing structures or new stored procs should not affect any existing clients. Anyone who intends to use the new fields would explicitly use them.

Modifying an existing structure is a no-no (removing fields or dropping an existing view or proc "breaks" the DB module). Changing internals is fine though. If I change how a formula is calculated in a view or function but the API is stable, then there should be no issue for existing clients. Sure, you must test things in more places, but that has nothing to do with the code being centralized - that's just because you have more code! The alternative would be independently building and testing multiple implementations of the same logic and deploying them simultaneously.
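
A hedged sketch of what "changing internals" looks like (hypothetical view): the formula changes but the column list stays the same, so existing clients are unaffected:

    -- Before, the view summed price * quantity; now a discount is applied.
    -- Same view name, same columns: the interface is stable.
    CREATE OR REPLACE VIEW order_totals AS
    SELECT order_id,
           SUM((price - discount) * quantity) AS order_total
    FROM order_items
    GROUP BY order_id;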

> Other than Rails migrations, South, etc, are there usable tools out there for sanely writing software that runs inside the database?

I've never considered this a major problem. If you design software starting with the data model then you rarely have to change it. Sure, it does happen, but nowhere near as often as the rest of the app. For basic changes Rails migrations, Hibernate schema updates, etc. are fine. For anything more major, doing it manually isn't that much of a pain, as you don't do it very often, and when you do, it can usually be done in advance of your deployment (per my previous paragraph about non-breaking DB changes).

If you're doing destructive changes to your data model regularly then you really need to stop and rethink what the heck you're building!


I'm not saying this is the wrong approach, but you will be moving aspects of the business logic from the application tier to the data tier. If you have written your application correctly, then you should have reusable modular code that does the same thing.

Like I say though, what you are saying isn't wrong. In fact, in terms of latency it's probably for the best.


I've written some pretty awesome SQL queries that give back exactly what the application wants with the application sometimes not even needing to do anything else afterwards.

I've usually always regretted those "awesome" queries.

I've known some very smart developers who've gotten lost in SQL queries; show them the equivalent Ruby/Python/JavaScript/C code that parses through the results and they can understand it in less than a minute.


One approach I've seen that works well for untangling gnarly SQL is to refactor those mega-queries into smaller, component queries that are pieced together by application code. This keeps the processing on the DB server but avoids the pain that comes with programming in SQL. I rather like SQL, but it doesn't handle complexity well at all (e.g. maintaining a "variable" to be used in different parts of the query).

For example, I once worked on a Foursquare clone originally written by a hardcore PostgreSQL nut. This system had a query that, if memory serves, returned a list of places of a certain type within a geographic area along with user activity on those places (votes, comments, etc). This was around a 75-line SQL query that actually wasn't that fast (response times from the DB were roughly 1 second even with every join indexed). We rewrote that query into 4 smaller queries (place IDs within that area, place IDs within that category, hydrating those places from the filtered IDs and then getting the user info), and that cut our DB response by about 70% in addition to making the system easier to work with. This required roughly 10 lines of Java code and a variable - a list of IDs that we got first and passed into each other query. It also freed us up to do other things - for instance, if performance were still a problem, queries 2-4 could have been done asynchronously behind a latch. By lifting the "glue" out of SQL and into a better language, it freed us to do new things, and it freed the database from having to juggle unnecessary complexity while planning and executing its queries.
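
Roughly, the decomposition looked like the following sketch (table and parameter names here are made up); the application holds the list of place IDs and feeds it into each subsequent query:

    -- 1. Place IDs within the geographic area.
    SELECT place_id FROM places
    WHERE latitude BETWEEN :lat_min AND :lat_max
      AND longitude BETWEEN :lng_min AND :lng_max;

    -- 2. Place IDs within the requested category.
    SELECT place_id FROM place_categories WHERE category_id = :category;

    -- 3. Hydrate the places from the intersected ID list (kept in app code).
    SELECT * FROM places WHERE place_id IN (:filtered_ids);

    -- 4. User activity for just those places.
    SELECT place_id, COUNT(*) AS votes
    FROM votes
    WHERE place_id IN (:filtered_ids)
    GROUP BY place_id;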


I've written gigantic, 500+ line queries for MySQL, with probably 40 subqueries within. Obviously, what MySQL receives is a monstrosity that nobody in their right mind would try to understand.

But the query is assembled piece by piece, in separate functions, each subquery responsible for its own contribution to the final query string, with well-defined inputs and outputs. The entire file that generates the query reads quite logically.

And there's simply no alternative -- many pieces of the processing involve 100,000+ rows, so round-trips between db and app would be prohibitively slow. The whole thing uses data from around 10 different tables; it's extremely relational.

But because it's structured well and written correctly, the whole thing executes in a small fraction of a second. (Trying to do it in a "NoSQL" style would probably take ten minutes of back-and-forth network communications.)

I've known a lot of programmers who would shy away from such a thing -- but that's because a lot of programmers don't bother to actually understand SQL the way they understand Ruby or JavaScript or PHP. It can do amazing feats of data processing, which is the whole point of a relational database. My advice is, dig deep into SQL. It can work wonders, but it's true that its "best practices" can be difficult to learn, and there's a lot of bad advice out there.


I've never seen those kinds of huge queries for OLTP (online transaction processing) apps, but I've written ~1000 line SQL queries for offline batch jobs. Especially with the recent popularity of Hive, which allows UDFs written in Java, one can do wonders with SQL.

One thing SQL really helps with is that it forces you to think "data first". Instead of thinking in algorithms, step by step what you want to do, it makes you think along the lines of: what data do I have and what output do I want to get out of it - not unlike functional programming, but with more focus on data sets.


Agreed. The Postgres "with" statement is also a great way to make large queries readable:

http://www.postgresql.org/docs/9.2/static/queries-with.html
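
A minimal sketch of the idea (hypothetical tables): each CTE names an intermediate result, so the final query reads almost like prose:

    WITH recent_orders AS (
        SELECT * FROM orders
        WHERE placed_at >= now() - interval '30 days'
    ),
    totals AS (
        SELECT customer_id, SUM(total) AS spent
        FROM recent_orders
        GROUP BY customer_id
    )
    SELECT customer_id, spent
    FROM totals
    ORDER BY spent DESC;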

I first saw this demonstrated in Peter van Hardenberg's excellent Waza 2013 presentation, "Postgres: The Bits You Haven't Found":

http://vimeo.com/61044807


500+ lines is small beer. I have seen stored procedures thousands of lines long in some of BT's smaller mainframe systems. From memory it was COSMOS (the one that tracks every circuit in the country) and not CSS, which is an even bigger system.


Generally that sort of thing IME is a sign that you've got a hardcore SQL type who isn't as good as they think and/or a poor SQL dialect. I've known far, far larger things than that written in pure server-side SQL (often made more readable by breaking it into sub-queries, table variables and common table expressions) that were very, very fast.

SQL is much more powerful than many realise, but there's a great many developers who aren't as good at it as they think.


Another approach is to refactor mega-queries into smaller, component queries using SQL views that are then pieced together in another, simpler query. YRMV.


Not the worst approach, but they can be very hard to maintain. If the view has an error, then it can be hard to troubleshoot what is going on.


Odd though it might sound, the opposite is probably true for me (and I assume a bunch of other DB developers) - once you really 'grok' SQL, and as long as it's written in an understandable manner, it can be easier to read than a large block of procedural (and/or functional) code.

One of the big things is that there are no 'for' loops in good SQL code - i.e. it is written using set-based logic and not procedural logic. That can (sometimes) really cut down the amount of reading involved, but takes a while to get used to.

Also depends how well you understand the schema in question (but that compares equally to understanding the code framework/namespace hierarchy).
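
As a hedged example of what "set-based" means in practice (hypothetical table), one statement replaces what a procedural version would do row by row in a loop:

    -- Set-based: every inactive customer is flagged in one statement,
    -- instead of looping over customers and issuing an UPDATE per row.
    UPDATE customers
    SET status = 'inactive'
    WHERE last_order_date < '2012-01-01';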


Is that because those developers aren't as experienced in SQL (thinking in sets/relations) as they are in an imperative language?


Then they need to get better at databases and coding PL/SQL, T-SQL, or what have you.


Premature optimization is a bigger enemy than performance issues.

That being said, the DB server is what's optimized to do calculations.

The tradeoff is you have a second stack to maintain and performance tune now beyond being a datastore. A positive is you can independently write and run tests.

The question is, can you resist building the perfect empire on day 1? Move stored procs and functions into the DB as they are needed. Whatever you're working on (including who is working on it) isn't that important.


And to cache the results of those calculations for the next query. How does a farm of app servers do that?


Depending on the app server you are using, it should be able to cache results of a query no problem. I've never considered this to be an issue :)


Move things to the DB if the calculation aggregates data (the input is much larger than the output), to avoid materializing data and shipping it around. (Map-reduce is a subtype of this.) Otherwise don't.


If you are using numeric data (not floats) in the database, I would say do your calculations in the database. Different languages and tools have different arithmetic handling which can cause some very long and psychotic debugging sessions. A single point of calculation is a good thing.

[edit] I should explain a bit. If you are in a multi-language environment[1] and are doing financial or weight / volume calculations, be extremely careful if you decide not to do all the calculation in the database. Having results calculate differently in two different places will drive you mad. I have noticed some serious problems with number handling in different languages, and some mistakes in calculation will get you sued.

1) SQL counts as one of the languages
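
A small sketch of that idea (assuming PostgreSQL-style syntax and made-up names): keep the money math in exact NUMERIC types and do it in one place, so every client sees the same rounding:

    CREATE TABLE line_items (
        quantity   NUMERIC(12,3) NOT NULL,
        unit_price NUMERIC(12,4) NOT NULL
    );

    -- One authoritative place where the rounding rule lives.
    SELECT ROUND(SUM(quantity * unit_price), 2) AS invoice_total
    FROM line_items;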


Except if a database upgrade fixes a floating point issue. Or the engine may not be as capable of doing calculations with floating point!


I specifically said non-floating point. I have not used any database that would screw up a decimal type. It would lead to mass migration away in certain very well paying industries.


Sorry, missed that.


If you find yourself returning very large numbers of rows to the client, you might be doing something wrong - SQL databases are at their most efficient when you select only the columns and rows you need.

If you perform aggregation/calculations in the DB, you can potentially save on-the-wire data transfer time (and potentially CPU time on your clients.. though obviously that is shifting the CPU work to the database).

Similarly if you find yourself making multiple trips to the database, and then using loops to combine different data sets, you're probably too far on the 'client-side' and should look at using some joins and combination logic on the DB side to get what you need in a single (and likely more efficient) round-trip.
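
For example, a hedged sketch (hypothetical schema) of one round trip doing what client-side loops over separate result sets would otherwise do:

    -- One round trip instead of fetching customers and then looping
    -- to fetch each customer's orders separately.
    SELECT c.customer_id, c.name, o.order_id, o.total
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    WHERE c.region = 'EU';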


The top answer on the page says as much but the answer to pretty much any "Which is a faster way to do this, ABC or XYZ?" is "try it both ways, measure the results (adjust for load if necessary), and see". Doesn't matter if you're talking different pure code algorithms to achieve the same result, stored procs vs app code, network caching or not, etc. If doing it the absolute fastest way matters to you, try multiple approaches and measure the results. Even in situations where there are accepted "best practices" there may be variables specific to your own project/tech stack that cause your results to be surprising!


In practice, the answer is not always obvious.

Benchmarking multiple queries / approaches is generally worthwhile if performance is important.


... annnnd then the database technology changes, and your benchmark results of three years ago are now actively hosing you. Or you have to change databases but you can't because you embedded a lot of logic in your database layer that can't be easily ported (somewhere, a salesman for your current database product is crying tears of joy and shouting his cash-register shout, "Ka-ching! Customer lock-in! Ka-CHING baby!". Avoid these people).

Unless you have really compelling reasons to get snuggly with a particular vendor's technology, be conservative.

[This still applies if you're using a "free" database engine; you're just not paying MS or Oracle or whomever, and it's "just" your own time]


>Unless you have really compelling reasons to get snuggly with a particular vendor's technology

I'll never understand this mentality. You bought it, so why not use it? Even OSS databases have some very compelling features. Take advantage. Use the hell out of your tools.

This is tantamount to saying "I'm using Go, but I can't use goroutines, because some day I might want to use Python instead." Don't want to get too attached to the technology, right?

The only time I ever tried to go abstract is when I sold on-premises software that had to support multiple enterprise-size clients' vendor choices. You know what happened? It took forever, everyone was unhappy, and ultimately it turned into 2 vendor-specific versions and dropping the least popular 3rd. Life was better after that.


The top answer in that thread is excellent.

Basically, "it depends". having dealt with extremely DB-intensive applications, I have developed a personal motto of "be nice to the DB".

Let the DB be a secure storage of your data, not a calculating part of your application. But like the top answer says, sometimes it is not practical to do a calculation within the application. In my case, we had a few database servers set aside just for reporting, so we could slam them with difficult queries and not worry about affecting data.


I have seen SQL queries that were 30 pages long if printed out. The developer showing me this was looking for some other examples that he said were up to 100 pages long. These queries had multiple levels of correlated subqueries and tons of decode statements (Oracle syntax). There looked to be tons of duplication just in what I saw.

How are you going to write unit tests for that stuff? Refactor? etc. etc.

The examples always start out simple, like summing a bunch of rows that match a predicate, but once the logic grows too complex it is hard to rewrite it to use application code.

Also most databases are 20+ year old technologies and often have weird systems in place for storing the code in the db or something else just as odd. No more grep, no more static code analysis.

As far as I am concerned the db is a pile of facts or observations. I tell the db something and later it tells me what I told it. When I am thinking about what goes in the db I think about using the past perfect verb tense: on this day, such and such happened. That's it. Preferably that never changes; you might get new info in the future, so just record that new info along with everything else.

Ideally we should be getting to a point where resources are so cheap that CRUD can become CR - no more updates or deletes, just new facts.


The existence of terribly written queries proves nothing. Because there's also plenty of terribly written code that takes 100 pages to do something simple.

In both cases it comes down to people blindly grasping when they don't understand the fundamentals.

Relational logic can be extremely elegant, composable, and testable. Unfortunately SQL is a pretty awful interface to expose those ideas, and most attempts to wrap SQL in a better interface make the mistake of trying to pretend to be object-oriented, when they should really let their true relational nature shine through.


Crucially, "performance" isn't defined.

The one that matters to me the most is developer time.

Write a damn SQL query. If it's too slow or the DB becomes a bottleneck, then reconsider.


As usual "it depends" but yes, this is imho the the most important aspect (provided reasonable performance and sustainability is carefully taken into consideration).

The older I get, the more tired I get of developers writing in-house apps that are never going to have more than 10 concurrent users, but architecting them as if they are going to have 1000+ concurrent users, regardless of the additional cost or complexity... which is how relatively simple projects end up costing $100k+++ and becoming maintenance nightmares, and why simple change requests are often rejected because they would be "too complex".


I default to relying on SQL. I've found that helps ensure calculations have a single source of truth, which helps maintenance and reliable sharing throughout your app, or across multiple apps.

SQL also offers powerful aggregate functions to assist. Much simpler to use something like AVG() or SUM() in a SQL query than having to worry about deriving the same calcs in application code.


It comes down to performance and sometimes what your hardware constraints are, e.g. your DB is a faster machine than your web server and you have no say in the matter.

You may also benefit from precalculating stuff in the DB and storing it. I wrote this 6 years ago which illustrates the point:

http://markmaunder.com/2007/07/20/how-to-create-a-zip-code-d...
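
A rough sketch of that pattern (hypothetical names): compute once, store the result, and let later queries hit the small precalculated table:

    -- Precalculate expensive aggregates once...
    CREATE TABLE daily_sales AS
    SELECT sale_date, SUM(quantity * unit_price) AS total
    FROM sales
    GROUP BY sale_date;

    -- ...then later queries read the small summary instead of the raw rows.
    SELECT total FROM daily_sales WHERE sale_date = '2013-09-01';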


Performance wise? Maybe writing hand-tuned assembly code will be faster. Or using CUDA. Don't optimize performance where you don't have to. Are small calculations the long pole in your app that talks to a db? Almost certainly not. Pick a method that maximizes robustness, ease of understanding, servicing, openness to feature changes, and accountability. In my experience that typically means client side code, not queries or sprocs.


I would always lean on the database to perform simple sums and aggregations unless I was really concerned about turning the database into a performance bottleneck.

If you need to add simple logic above and beyond this, stored procedures aren't sexy but they can be a good compromise that avoids shipping data around and re-implementing SQL in your application server.


It is interesting that nobody has mentioned testability, as in unit tests. Some people (such as Thoughtworks) advocate removing all business logic from the db server into app code so that it can be unit tested.

I buy that argument to some degree, but in practice I am a SQL junkie and always implement calculations in SQL.


There are tools that allow for unit testing for SQL. One that pops to mind is RedGate's tools. They wrote a white paper on it here:

http://download.red-gate.com/HelpPDF/DatabaseUnitTestingWith...

P.S. I'm not affiliated with them in any way, I just love their products.


Isn't there a size past which it's always good to do it on the DB? Or to use buzzwords, "At some points, shouldn't most Big Data calculations be done on the database?"


The most important factor is whether the DBAs in your company are cooperative and how easy it is to do a SQL release to production.



