> What kind of query would you have to write to bring down a production db?
Scans and Sorts, seen in a query plan, are relatively expensive to run in a production row store. OLAP queries (GROUP BY with aggregate functions like COUNT, SUM, and AVG) do large^/full table scans by definition. They take seconds to run while your goal in a Cloud OLTP system is thousands of requests per second. An automatic sort issued per query in an OLTP system is pathological and represents a vector for a DoS attack.
> What makes a solution like hive much better - I guess its optimized for this?
Column stores use compressed bitmap indexes that are optimized for scans over a small number of columns. Hive is SQL over Hadoop, and is inherently slow but it does offload the processing from your Production OLTP server. Hive supports the RCFile format which is partially column oriented. The ORC file format is fully column oriented, replaces RCFile format, but requires Presto (or equivalent). Hive is brownfield for existing Hadoop clusters but it has no place in a discussion about greenfield architecture other than discussing historical systems.
If you have a need for GROUP BY style analytics, a true column store like Presto, Impala, or RedShift is a necessity.
> OLAP queries (GROUP BY with aggregate functions like COUNT, SUM, and AVG) do full table scans by definition
Isn't it only a full table scan if your query isn't otherwise filtered? Those functions have to read every row of "something", but that something might not always be a whole table.
Very true, I've edited my comment. Some GROUP BY queries are inexpensive and are fine in an OLTP system if the WHERE clause restriction limits the result set size, most however, are meant to scan a large number of rows.
Scans and Sorts, seen in a query plan, are relatively expensive to run in a production row store. OLAP queries (GROUP BY with aggregate functions like COUNT, SUM, and AVG) do large^/full table scans by definition. They take seconds to run while your goal in a Cloud OLTP system is thousands of requests per second. An automatic sort issued per query in an OLTP system is pathological and represents a vector for a DoS attack.
> What makes a solution like hive much better - I guess its optimized for this?
Column stores use compressed bitmap indexes that are optimized for scans over a small number of columns. Hive is SQL over Hadoop, and is inherently slow but it does offload the processing from your Production OLTP server. Hive supports the RCFile format which is partially column oriented. The ORC file format is fully column oriented, replaces RCFile format, but requires Presto (or equivalent). Hive is brownfield for existing Hadoop clusters but it has no place in a discussion about greenfield architecture other than discussing historical systems.
If you have a need for GROUP BY style analytics, a true column store like Presto, Impala, or RedShift is a necessity.
^EDIT: based on zbentley's comment