Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I dunno, personally when I teach SQL to engineers, it’s not the syntax that is the main difficulty (yes it can be improved) but the relational data mental model that trips people up. They all want sequential data processing and loops. Translating that to vectorized code is the closest thing I got to help non-data people grok.


I used to hate SQL when I was a backend engineer. I had difficulties understanding exactly this mental model and tried to avoid writing SQL by using ORMs. At some point I “accidentally” switched to data engineering and was forced to get to grips with SQL. I think that in the meantime I warmed up to a functional programming style which helped me to some extent.


The thing that trips everyone is that you cannot explain the SQL outputs. In sequential code, you can step through and understand why the logic error produced erroneous output.

With SQL, it is a trial and error. When your query passes your sniff tests, you sign looks good to me, and you ship it to the world. Only to silently break shortly afterwards without any warning.

In enterprise, I am convinced at this point that all of the complex ETL jobs are vending wrong outputs. Just nobody has the tools to diagnose and fix the problems.


For me, it's not the syntax, and not even the relational model. They are fairly easy to explain.

No, it's the morass of the interactions between GROUP BY/ORDER BY/HAVING. Like, why isn't there FIRST statement to select the first element in the group?


Not sure what getting the first row has to do with any issues you have with GROUP BY/ORDER BY/HAVING? Each of those clauses serves a distinct purpose, and I'm not sure I'd describe any interactions a "morass" since any connections between them are usually by design, and well documented. With the power of SQL (or any powerful and complex tool) comes great responsibility to understand how it works.


WHERE and HAVING (and QUALIFY) are redundant. The PRQL language unifies them into a single "filter" construct:

SQL:

    SELECT country, MAX(salary) AS max_salary
    FROM employees
    WHERE start_date > DATE '2021-01-01'
    GROUP BY country
    HAVING MAX(salary) > 100000
PRQL:

    from employees
    filter start_date > @2021-01-01
    group country (
      aggregate {max_salary = max salary}
    )
    filter max_salary > 100_000


A very simple problem, you have a table of house sales. For each house (identified by an ID), you want to find the sale date that had the lowest price.

You'd think that you should be able to sort by the house ID, then by the sale price, then group by the house ID, and select the sale _date_ from the first row of each group.

Something like: `select s.house_id, first(s.date) from sale s order by s.house_id, s.price asc group by s.house_id`. But this is impossible, because there's no `first` function in SQL.

But nope, you need DB-specific extensions for that.


What do you want to see if there's more than one sale date with the lowest price?

In any case, you can do this with a subquery and join it with the main table. I don't think there's anything non-standard about it:

  SELECT hs.*
  FROM house_sales hs
  JOIN (
      SELECT house_id, MIN(price) AS min_price
      FROM house_sales
      GROUP BY house_id
  ) min_prices
  ON hs.house_id = min_prices.house_id AND hs.price = min_prices.min_price
I'm sure you can do it with window functions as well.


> What do you want to see if there's more than one sale date with the lowest price?

Either one is fine.

PostgreSQL has an extension (DISTINCT ON) that allows to do what I want without subqueries. But standard SQL is simply deficient in this regard, it should be straightforward but it's not.


To have the first element, you have to describe a relation of order. It's not ordered by default.


True, but even going all the way back to E.F. Codd, there's "cheating" involved with all sorts of implicit sorting.

Even now, the default behavior of SQLite is still "undefined" if a Select column doesn't appear in Group By. Well, to be precise it's a single "arbitrary chosen row" used to sort the results. https://www.sqlite.org/lang_select.html#resultset

I'm not bagging on SQLite here (I love it!), but I am bagging on SQL 'cuz EVERY implementation must make far too many of these sorts of Faustian bargains.


That sqlite one is really annoying, many other dialects give you an error.


Actually, you don't have to. Any first element would do, even if it's a random one.

It's perfectly possible to do `SELECT * FROM blah LIMIT 1`, after all.


You can use ANY_VALUE to get an arbitrary value from a GROUP BY. There's also FIRST_VALUE for a window function (though that requires you specify an order).


This is a non-standard extension, and it also doesn't allow the selection of the _first_ element in a group. Even if the group is actually ordered.



Window function is your friend.

Row_number() over (partition by x order by) as rank

Then: where rank = 1.

Grouping is unordered, which is a clean definition.


> It’s not the syntax ... but the relational data mental model that trips people up.

It is also the syntax because it does not match the actual model. Even the simple example `SELECT 1` shows a mismatch.

And this cascade. Because the syntax is wrong, people have big trouble with `JOINs` (that is disconnected from the idea of making `?-to-?` specifications), `group by` (that doesn't exist in SQL), aggregation logic (that is badly implemented as window functions), and bigly, the lack of composability (that has a weird band-aid called CTEs) and so on.

A mismatch between the domain and the code is always reflected in syntax.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: