I really wish Date had any understanding of modern type theory, since that's the...

blacktriangle · on Oct 4, 2021

Heck I wish he even had any understanding of the use of NULL in real-world databases. He spends books arguing, very compellingly I might add, against the use of NULL in SQL dbs, but when asked to come up with an altenrative to NULL, he's suggestions are sentinel values and splitting out every nullable attribute into its own table. Both of these solutions are trivially rejectable.

Joker_vD · on Oct 4, 2021

Splitting every nullable attribute is not "trivially rejectable", you absolutely can do this to represent sum types (and it doesn't mean that those should actually be stored separately: the relational model is logical model, it doesn't matter if a table is implemented with one or with two B-trees or somehow else entirely).

The main problem is that you have to manually type JOINs for all of such sum-typed properties to do any useful queries/updates — and there really is no other way than write it every time anew by hand because the relational theory is just the intuitionistic first-order predicate calculus. You'd really want some second-order tools, to group those tables together... but then you essentially end with tables with sum-typed columns, and since the main idea of the relational model was to never step out of the first-order predicate calculus, the reasonable solution is probably to just introduce sum-types as actual types.

blacktriangle · on Oct 4, 2021

Maybe trivial was the wrong word, but your second paragraph is exactly why I reject the approach to splitting out all nullable values into their own relation. Yes from a data modeling theoretical perspective it works, but from a pragmatic perspective having to join everything back together again is a worse situation than just thinking through the behavior of null in that situation.

But even then things are awful. What do I do when I want to include the value of the nullable table in my result, but I want all of my records, not just the ones with non-null. So even though you've purged null from your data model you're still bringing them back with an outer join.

j-pb · on Oct 4, 2021

Joining the thing back together again is not so bad if with recent multi-way join algorithms.

But then again, the resulting model is factually a graph database and far far away from resembling something like SQL or Relational Calculus.

(Which might not be a bad thing though.

"You either die NOT NULL or live long enough to see an instance where you really wish that column was nullable.")

Joker_vD · on Oct 4, 2021

It's the implicit semantics of the tables. We start with modelling the world itself: "user with id ID is AGE years old and lives at ADDRESS", and end with modelling out particular instance of knowledge of the world: "user with id ID exists", "user with id ID is known to be AGE years old", "user with id ID known to live at ADDRESS"; and with the "closed world principle" you can explore more of the limits of the knowledge: "which users we know to exist and know their addresses but don't know their ages?". But query "what is the id of the oldest user?" is simply unanswerable unless you know all of their ages; the best you generally do is answer "what is the id of the oldest of the users that we know ages of?".

It's when we start confusing the map and the territory, that's when lot of miscalculations starts to happen.

j-pb · on Oct 4, 2021

I think you meant to write "open world principle", as every non state fact is implicitly false under the closed world assumption.

But yeah I agree, that databases should always be thought of a a model of the world (crisp), and not as the world itself (fuzzy).

But it's also noteworthy that this is a leaky abstraction, and that any kind of database that has to operate in the real world (in contrast to say an entity component game engine) will face this leakiness.

The only way I see to resolve this is to turn the database into something that remains crisp in the face of fuzziness. E.g. by storing only observations or interpretations of the real world processes, since the observation is an "subjective" statement made from the database perspective, it holds true regardless of the actual "objective" condition.

It's just not easy to program and model in such a system.

blacktriangle · on Oct 4, 2021

Sounds like a restatement of the open world / closed world debate.

Joker_vD · on Oct 4, 2021

Well, what's the semantics of your result? Say you have

    TABLE UsersWithKnownAge(id ID, age INTEGER)
    TABLE UsersWithUnknownAge(id ID)

    CONSTRAINT UsersWithKnownAge JOIN UsersWithUnknownAge IS EMPTY

with obvious predicates: "User with id ID is known to be AGE years old" and "User with id ID is unknown". You join them and get the relation with the predicate "User with id ID..." How do you continue?

"We know AGE_KNOWLEDGE about the age of the user with id ID", where AGE_KNOWLEDGE is Option<Integer> that has semantics "number is unknown" or "number is known to be NUM". Okay. What information about the whole dataset can you extract from this join?

If your query needs to know the users' ages, you can't execute it unless you restrict yourself to querying about UsersWithKnownAge, in which case, just run it on UsersWithKnownAge.

If your query doesn't need to know users' ages, you can run it on "UsersWithKnownAge(id) UNION UsersWithUnknownAge(id)", not on an OUTER JOIN.

talaketu · on Oct 4, 2021

Nice toy example.

> What information about the whole dataset can you extract from this join?

You could count the number of users.

bob1029 · on Oct 4, 2021

> splitting out every nullable attribute into its own table

One property of 6th normal form is that null ceases to exist by way of optional joins. This is a good thing.

If you ignore the physical data model for a few seconds, you might see how this is can become a deterministic path for modeling success in any problem domain of arbitrary complexity. You aren't able to make the kinds of assumptions that would typically screw you over in the future if you follow 6NF properly. De-normalization for performance reasons should be undertaken only after you have perfectly modeled the domain in logical terms.

blacktriangle · on Oct 5, 2021

And at this point you may as well be running Datomic. I remember the first time looking at Datomic the whole thing felt like it was coming out of left field. A few years later of data modeling conversations like this and you start to see Datomic is really every fix to the relational model taken to the extreme.

It also explains how we got column databases. If the problem with 6th normal form is the underlying storage, then change the underlying storage to optimize for 6NF.

In a way, the very existence of the normal forms can be taken as a sign that something is seriously wrong with the relational algebra given how easy it is to create a data model that will produce rubbish for certain queries.

bob1029 · on Oct 5, 2021

> In a way, the very existence of the normal forms can be taken as a sign that something is seriously wrong with the relational algebra given how easy it is to create a data model that will produce rubbish for certain queries.

I am of a growing opinion that normal forms below 6th (i.e. 1-5) are simple mistakes taken as performance optimizations. The mathematical side of me says to keep going too - I suspect that there is a "normal form" beyond the 6th that says something along lines of how all keys must be synthetic and that there is no such thing as a business or domain key. I strongly believe the relational algebra/calculus as applied to domain modeling hasn't quite reached a fundamental conclusion yet.

> If the problem with 6th normal form is the underlying storage, then change the underlying storage to optimize for 6NF.

You're god damn right. There is zero reason we cannot build storage layers that can handle this kind of logical layout. My current favorite solution is to simply event source all the facts and keep an in-memory copy of the active business state. You can store pointers to strings and other BLOBs so you don't bloat RAM over something you could stream from NVMe on demand. Most of the pain in higher normal forms is with the join, so having a working set that can fit into RAM is one way to partially address this problem.

garethrowlands · on Oct 4, 2021

Can you give examples of things this misses?