Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

On the subject of types, I’ve often wondered why relational databases don’t have sum types. Lots of real-world data models have these kinds of relationships, and certainly “modeling data” is well-within the purview of a database. I know that ORMs exist, but they tend to be very bad and there’s no compelling reason to have to break off part of the database responsibility, implement it terribly (it’s divorced from the query planner, so even the best ORMs are subpar) for each application language, and put it on the client side.

Yes, I understand that sum types affect the way the data is arranged, but the same approaches that are used by programming languages to model sum types in memory can be used by databases to model sum types on disk and in memory (it’s all just byte vectors at the end of the day).



I'm making a relational lang (https://tablam.org) and this was one of the question marks for me.

The #1 reason sum types was not presented is because "nobody" thought (seriously) about that for long time. See how much "null" have been with us and still you find defenders of it!

The second and more broad problem, is that the relational model was NOT implemented by the RDBMS makers, not in full. To make sum types work, you NEED to support embed relations inside relations:

    rel Invoice
    rel DetailInvoice

    Invoice[invoice_id:1 customer:123 lines: DetailInvoice[
     [invoice_id:1 line_id:1 price:10 qty: 1],
     [invoice_id:1 line_id:2 price:10 qty: 2] 
    ]
Then with this, do sum types comes from free:

    type Option
      Some ?
      None

    [customer_id:1  phone: [tag:Some 123-23456]]  
    [customer_id:2  phone: [tag:None]]  
This will make things like storing Logs, metrics and event sourcing so much nicer!

How store that is another matter, but I think is not that complicated, just none of the major rdbms have put effort on this.

This is one of my dreams: Making a rdbms with this and other stuff that push the DBs to the modern way of make apps...


Frankly, understanding of what the relational model actually is has been missing from the industry even among many people who have made data management their primary focus. IMHO a lot of the penetration of the NoSQL stuff in the last decade was based on misunderstandings of the capabilities and potentials of the relational model.

I think the early history of SQL was too influenced by ISAM-type stores and COBOL flatfiles and the like. Hence the vocabulary around "tables" and "columns" instead of the relations, relvars, tuples, etc. described by the actual relational algebraic model. To many of those with an OO fetish, "relational model" became synonymous with "tabular model", and the industry's early 2000's obsession with object polymorphism, OORMs middleware, etc continued the stall in relational database innovation.

I feel like there is now renewed interest in this stuff, though. Your project looks really neat, for example.


I think you analysis is correct. I also add that is related too to the languages, you point with Cobol, and I also add C, Pascal, Java, etc.

When they have nulls and limited to zero queries, relationships and data modeling capabilities is not hard to see that how they are is reflected in how the DBs are too. I honestly not know about sum types until a few years ago, neither have a clue how much useful they are, so surely even today many are in the same blind spots...


> To make sum types work, you NEED to support embed relations inside relations

If that is the case, sum types would be in violation of first normal form.


Even considering that normal forms are tools...

Sum types, how them are stored and how they are used is not the same ie: How are sum types represented in languages like Rust/oCalm? that is tangential as how we experience them.

So, support to embed relations is the step to provide the storage and partially, help in the query, but from the POV of the user them are "normal sum types".

But with that, you get the ability to store relations, that is usefull as-is :)


Oddly enough, 1nf is differently defined depending on which person or database you're asking/using.

Is it

  . just to be rectangular?
  . values are atomic? (and what does that mean)
  . no such thing as nulls? good luck with that
  . no duplicate rows (good luck with enforcing that in intermediate tables)


E.F.Codd defines first normal form as eliminating nested relations (in "A Relational Model of Data for Large Shared Data Banks" which introduces the relational model). "Atomic" in this context just means any value which is not a relation.

A relation by definition cannot have duplicate rows, since relations are sets. If "rectangular" means that all rows in a table have the same columns (attributes), this also follows from the definition of relations. So these constraints have to be fulfilled before we can even talk about normalization, since normalization operates on relations. (Nulls are its own can of worms, but dosn't really have anything to do with 1NF.)

Codd introduces 1NF because he realized supporting nested relations would complicate the model and query language for no benefit. It would need special operators for navigate in and out or nested relations, e.g. if you wanted to join tables nested at different levels in other tables. You would need something like XPath on top of SQL. But you can express exactly the same information using foreign keys between tables and keep the language simpler.

The sum-types suggestion shows the problem. If we have sum types as values, presumably they would be composed of other types, right? So sum types could contain other sum types, arbitrary deep. Since you surely would want to query these individual values, you would need operators in SQL to navigate down into sum types. You would need facilities to index and apply constraints, again arbitrary deep down the structure. So the database model and query language gets significantly more complex for dubious benefit.


> So the database model and query language gets significantly more complex for dubious benefit.

Sum types are definitively not a "dubious benefit". Neither nesting relations (that are not that different to join them and will provide better semantics that create a LOT of null columns to flat them).

Will complicate matters and requiere extensions? Sure. But is clear that, today in this century, exist a lot of use-cases where people are resorting to pseudo-nosql (json embedding, to name one of the biggest anti-patterns), and that create more inconsistent features, extensions, weird extra syntax, etc. that anyway are required for it (and could have been cleaner under this idea).

With this, we can get the same benefits and much more.

P.D. Not against have a very simple core and try to model stuff that way, but the major point of a model, like relational, is support how to MODEL.

And we need to model nested thing, and we need to model variants and get rids of nulls and all that. So, sometimes, you need to add complexity + 1 so you get complications - 10.


> To make sum types work, you NEED to support embed relations inside relations

You don't need to embed relations. You don't even need to embed tuples or other composite types, you can just model sum types using a pair of fields. The latter is what Opaleye and Rel8 (Postgres libraries for Haskell) do:

https://hackage.haskell.org/package/opaleye-0.7.1.0/docs/src...


To reply to both my respondents, the link goes exactly to the source for Opaleye's version of the Maybe data type definition. Neatened up, it is

    data MaybeFields fields =
      MaybeFields {
        mfPresent :: Field SqlBool
      , mfFields  :: fields
      }
That is, an Opaleye Maybe (MaybeFields) is a pair. The first component is an SQL bool indicating whether it's a Just or a Nothing. The second component is the payload of the Just. If it's a Nothing the fields of the payload will be set to NULL.


Ok, this is for a Maybe/Option, but can't see how can work for any algebraic type.

I have this for mine:

    struct Ag {
        tag :String, //like "Option.None"
        data:Vec<Value> //The data inside, if any
    }


Arbitrary sum types are just a generalisation of what I wrote. Opaleye doesn't have them (yet?) but rel8 does:

https://hackage.haskell.org/package/rel8-1.1.0.0/docs/src/Re...

You have a bool which specifies whether it's left or right, and then two payloads, one for left and one for right.


You linked to a file with pages of code. Care to summarize how they model sum types using a pair of fields?


Can be show how is done, with data instead of algorithms? (aka "...show me your tables and I won't usually need your flowcharts; they'll be obvious." ~ Fred Brooks")

Haskell is hard to decipher for me...


Weird that this pops up today as I was trying to find something like this yesterday. Good luck with it!


I am interested in opportunities to cooperate on this.


Great!, I have the project at https://tablam.org


You might want to check TypeDB, which is a database with a strongly-typed query language: https://github.com/vaticle/typedb.

It doesn't do union types yet, but it's got type inheritance, type inference, type-safe joins (with its "relation" feature). It's quite cool what you can do with it when you can express complex domain with the language.

Disclaimer: I work there.


> I’ve often wondered why relational databases don’t have sum types.

It seems unlikely that relational databases would make the effort to add sum types whilst mainstream imperative languages don't have sum types. In turn that is probably because of a mistaken belief that subclassing was a sufficient means to achieve the same end. All is not lost, however. It's easy to simulate sum types using a field for the tag and multiple fields for the payload.

The syntax for that in raw SQL doesn't look nice but in a higher level query DSL it can look lovely!

https://hackage.haskell.org/package/opaleye-0.7.1.0/docs/Opa...

https://hackage.haskell.org/package/rel8-1.0.0.0/docs/Rel8.h...

(Disclaimer: I'm the author of Opaleye)


Not a part of the SQL standard, but would PostgreSQL enumerated types be what you're thinking of?

https://www.postgresql.org/docs/current/datatype-enum.html

Although I guess the question is are you thinking sum types at the attribute level or sum types at the relation level? At the relation level I'm not familiar with an answer, but I can see where that would be a nice database-level solution for polymorphic associations.


Postgres enum types are like C enums--a bounded set of identifiers. Sum types requires a bounded set of arbitrarily complex types (not just identifiers). For example:

    type Event enum {
        KeyPress(rune)
        MouseClick{ x int64, y int64 }
    }


Probably because the same can already be expressed using relations and foreign keys.


I don't think it can. I don't believe SQL allows you to create a fkey column that points into many different tables. You can create your own pseudo fkey column that the database doesn't know is a pointer into other tables ("type erasure") but then you have to recreate referential integrity and so on yourself, and you still run into issues querying against the table that has that fkey column in a useful way.


You don't need a FK to point to multiple tables. You would have the reference in the other direction, from multiple different tables to one table.


How do you ensure that only one row of only one table references the row of that one table?


Most databases have weak support for cross-table constraints. I believe in SQL Server you can define an indexed view on the union of the tables and then apply a unique constraint on the foreign key.


Theoretically, this could also be achived by allowing FKs to Views. But I know of no system that allows this.


The relational model is a logical model; that it's backed up by a B-tree or something else should bear no consequence except in query/update performance.

And you kinda can simulate the sum types, you just can't express it in the first-order way: instead of

    CREATE TYPE Color (Red, Green, Blue)
    
    CREATE TABLE MyTable(x Color, ...)
you do

    CREATE TABLE MyTableRed(...)
    CREATE TABLE MyTableGreen(...)
    CREATE TABLE MyTableBlue(...)
There is just no syntax to group these three tables together in all relevant queries/updates, and such skolemization is really quite painful to maintain manually.


> The relational model is a logical model; that it's backed up by a B-tree or something else should bear no consequence except in query/update performance.

Agreed, but some people have incorrectly argued in the past that introducing sum types would harm performance due to the underlying data structures, so I was trying to head-off that line of argumentation.

> And you kinda can simulate the sum types,

I'm not sure I follow your example. Your `Color` type is a product type, not a sum type. I've explored every workaround I could think of for sum types (and asked many colleagues, etc), and they all failed in one important respect or another (typically because the type system doesn't understand your "sum type").


No, it's a sum type. You have to add the constraints

    CONSTRAINT MyTableRed INTERSECT MyTableGreen IS EMPTY
    CONSTRAINT MyTableGreen INTERSECT MyTableBlue IS EMPTY
    CONSTRAINT MyTableBlue INTERSECT MyTableRed IS EMPTY
but after that, I believe it works. It essentially replaces the original "... and the color is 'x'" predicate with three another predicates: "... and the color is red", "... and the color is green", "... and the color is blue" with the constraint that only one (at most) of those can be true for any particular value of "...". As you can see, you have to do that for every would-be COLOR property, and adding more variants blows up the number of tables and constraints even more drastically.

It kinda reminds me of 3SAT and other logical exercises in reducing everything down to the first-order logic: it's doable, but the terms grow exponentially large.


The problem IIRC is actually consuming that sum type in SQL. Unfortunately it's been several years since I played with that approach, and I've forgotten the specific problems I encountered.


The issue I see is that now you have mixed data and schema. And unfortunately altering schema is much more difficult in current systems. There is no homoiconic SQL DBMS. This lack of homoiconicity is aso reflected in the fact that only values can be parametrized. Identifiers can not.

Also you face issues if there is some other entity depending on MyTable. You can no longer have a FK because you no longer have a table to which to point the FK, you have 3 tables.

You could create a table containing just the key and have FKs pointing from the three tables (I think this pattern does have a name) but you run more issues because there is usually no way to force bidirectional mandatory optionality using FKs. (A FK can be 0..1 - 0..M, 1..1 - 0..M, 0..1 - 0..1 or 1..1 - 0..1. But it can never be 0..1 - 1..M, 1..1 - 1..M or 1..1 - 1..1. In this case 0..1 - 1..1 is not an issue because it can be reversed as 1..1 - 0..1. )

Now you have two options:

- Either create FKs in both directions, but then you run into the issue that most DBMS can not update two tables at the same time in the same statement. You have to use transactions and defer FK checking until the end of the transaction. And by this point you have created a DSL over SQL.

- Or you use extra flag columns in order to achive a three step process (insert key in key table, insert extra attributes in other table, update key table to mark as active) and enforce everything using constraints and triggers, but similarly, by this point you have created a DSL over SQL.


In what situation do you believe your solution would be better than to create a table Colors, insert the desired list of colors and have a FK from MyTable to Colors?

The way I see it a single atribute relation is a sum type. Implemented in current imperfect systems as a single column table (maybe two column table if you insist on using surogate keys for various reasons)


Probably in the situation where the sum types have actual payloads, instead of being simple enums. For example, if it was

    type SHAPE = NONE | SQUARE(w, h INTEGER) | CIRCLE(r INTEGER) | TRIANGLE(a, b, c INTEGER) | BLOB(e SVG_STRING)
In my solution it entails having

    TABLE WithShapeNone(...)
    TABLE WithShapeSquare(..., w, h INTEGER)
    TABLE WithShapeCircle(..., r INTEGER)
    TABLE WithShapeTriangle(..., a, b, c INTEGER)
    TABLE WithShapeBlob(..., e SVG_STRING)
It seems I can't easily generalize your solution, so I won't strawman it.

Then again, I think the proper solution would just be adding the sum types into the system. We could have "+" implemented as a 3-column table too, but why would we?

Your notice of "only values can be parametrized. Identifiers can not" in the sibling comment is exactly what I was trying to express in these threads: the relational model is first-order, not second-order, so while you can theoretically model everything with it (logician have done it), it's ugly and impractical without some extensions. Better types for "ground" values is one such extension.


For one point I do also believe that adding sum types would be great. Even if the only benefit is eliminating 3-value-logic, it is a huge benefit. I also agree that while it can be modeled it is very ugly, though I wouldn't necessarily blame the relational model. The relational model is a modelling tool. The physical implementation doesn't need to be 1 to 1 with the model. Same as you rarely find pure OO or pure FP languages.

Current DBMSs do also have plenty of solutions for the problem of non-homogenous data. Some allow XML and JSON columns on which you can also enforce schemas. SqlServer proposes sparse tables. PostgreSQL has arrays, enums and structs (and arrays of structs) as data types. But I do agree that sum types ammong other things would really be nice.

As for your example I still believe it is incomplete, but I would like to discuss that in a separate thread.


You make a fair point about actual payloads. My example was illustrating a sum type of unit types, an enum.

Do you care to conjure a more complete example of your solution? As @throwaway894345 said, I believe your solution is difficult to consume. I would like an example with something more concrete than MyTable.

A theoretical elegant but currently impossible solution to consume your solution would require FKs pointing to views. A view called Shape that is a Union All (we know there are no common elements) over the 5 WithShape... tables. Other tables would point FKs to the PK of this View. Unfortunately, AFAIK there is no DBMS that supports FKs to Views.

I do believe this is not really what you have in mind and I have a hunch about what you are trying to get at, but I believe the example is incomplete.


I suspect part of the problem is that true sum types are poorly represented in many of the languages that use these databases. In a sense because of the layer these sit in the stacks they have to appeal to a lowest common denominator. If they supported tables where the rows were true sum types and polymorphic as a result many languages would find it quite difficult to interact with that database.


I don't understand why that would be a problem--programming languages interact with databases at runtime via SQL. I regularly use Go which lacks sum types, but there's nothing technically difficult about generating SQL that leverages sum types--except that SQL doesn't support sum types :).


In Go returning a list of items that are one of a set of different types would be poorly handled. It's possible but very awkward. You would probably drop down to just a slice of interface{} and then have to use reflection to figure out what each item was.

Dynamic languages like Python or Ruby would probably be fine but there are more languages than those out there to worry about.


There are ways of modeling sum types in Go (a struct with a few fields and a tag, an interface that's only implemented by the relevant types, etc), but that's unrelated to the database/client interface.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: