Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

It's called NoSQL, which removes the need for schema migrations for things like adding or deleting columns.

This could be solved for relational databases if you implemented application-level abstractions that allowed you to store all your data using JSON storage, but create non-JSON views in order to query it in your application using traditional ORMs, etc.

So, store all data using these tables, which never have to be changed:

- data_type

- data (int type_id, int id, json data)

- foreign_key_type (...)

- foreign_keys (int type_id, int subject_id, int object_id)

(we'll ignore many-to-many for the moment)

And then at deploy time, gather the list of developer-facing tables and their columns from the developer-defined ORM subclasses, make a request to the application-level schema/view management abstraction to update the views to the latest version of the "schema", along the lines of https://github.com/mwhite/JSONAlchemy.

With the foreign key table, performance would suffer, but probably not enough to matter for most use cases.

For non-trivial migrations where you have to actually move data around, I can't see why these should ever be done at deploy time. You should write your application to be able to work with the both the old and new version of the schema, and have the application do the migration on demand as each piece of data is accessed. If you need to run the migration sooner, then run it all at once using a management application that's not connected to deploy -- with the migration for each row in a single transaction, eliminating downtime for migrating large tables.

I don't have that much experience with serious production database usage, so tell me if this there's something I'm missing, but I honestly think this could be really useful.



> With the foreign key table, performance would suffer, but probably not enough to matter for most use cases.

Citation needed :) That's going to really depend.

I'm not for or against NoSQL (or any platform). Use what's best for you and your app!

In our case, NoSQL makes for a bad database approach. We do many cross-sectional queries that cover many tables (or documents in that world). For example, a Post document doesn't make a ton of sense, we're looking at questions, answers, comments, users, and other bits across many questions all the time. The same is true of users, showing their activity for things would be very, very complicated. In our case, we're simply very relational, so an RDBMS fits the bill best.


Sorry for being unclear. I'm not proposing NoSQL. I'm saying that many NoSQL users really mainly want NoDDL, which can be implemented on top of Postgres JSON storage while retaining SQL.

- data (string type, int id, json fields)

- fk (string type, int subj_id, int obj_id)

  select
    data.id,
    data.fields,
    fk_1.obj_id as 'foo_id'
    fk_2.obj_id as 'bar_id'
  from data
  join fk as fk_1 on data.id = fk_1.subj_id
  join fk as fk_2 on data.id = fk_2.subj_id
  where 
    data.type = 'my_table'
    and fk_1.type = 'foo'
    and fk_2.type = 'bar'
What would the performance characteristics of that be versus if "foreign keys" are stored in the same table as the data, if fk has the optimal indexes?


In no specific order:

If your database doesn't enforce the schema you still have a schema, it's just ad-hoc and spread across all your different processes, and no one quite agrees what it is. In the real world as requirements change and your app/service increases in complexity this becomes a constant source of real bugs while simultaneously leading to garbage data. This is not theoretical, we have a lot of direct painful experience with this. Best case scenario your tests and tooling basically replicate a SQL database trying to enforce the schema you used NoSQL to avoid in the first place.

Indexes are fast but they aren't magic. A lot of what a traditional SQL database does is providing a query optimizer and indexes so you can find the data you need really fast. Cramming everything into a few tables means everything has to live in the same index namespace. Yes you can use views and sometimes even indexed views, but then you have a schema so why jump through hoops to use non-optimized storage when the database has actual optimized storage?

Separate database tables can be put on separate storage stacks. A single table can even be partitioned onto separate storage stacks by certain column values. Cramming everything into four tables makes that a lot more complicated. It can also introduce contention (depending on locking strategies) where there wouldn't normally be any.

IMHO most systems would be better served by sharding databases than by using NoSQL and pretending they don't have a schema. If application design prevents sharding then scaling single-master, multiple-read covers a huge number of cases as well. The multiple-master scenario NoSQL systems are supposed to enable is a rare situation and by the time you need that level of scale you'll have thrown out your entire codebase and rewritten it twice anyway.

The key to schema migrations is just to add columns and tables if needed, don't bother actually migrating. Almost all database engines can add columns for "free" because they don't go mutate existing rows. Some can drop columns for "free" too by marking the field as obsolete and only bothering to remove it if the rows are touched.


Postgres (and at least one other RDBMS) has partial indexes, which pretty much solves the index namespace problem you mention: http://www.postgresql.org/docs/8.0/static/indexes-partial.ht... Partial indexes are integrated into the proof-of-concept repo I linked.

Storing a data type field in the generic storage table enables the same partitioning ability as a standard schema.

99% of NoSQL database users just don't want to deal with migrations, even if they're "free" (another big issue is synchronizing application code state and DB migration state of production, testing, and developer machines), so what they really need is NoDDL, YesSQL.

> Almost all database engines can add columns for "free" because they don't go mutate existing rows. Some can drop columns for "free" too by marking the field as obsolete and only bothering to remove it if the rows are touched.

Didn't know that, thanks.

> It can also introduce contention (depending on locking strategies) where there wouldn't normally be any.

Didn't think of that. I'm aiming this at 99% of NoSQL users in which doing things you could do with SQL requires much more effort, so allowing them to do it with SQL can accept a modest performance degradation, but if you have any good links relevant to how this storage design would affect lock contention, please share.


A .NET abstraction for PostgreSQL: https://github.com/JasperFx/marten




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: