Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

The number of times I’ve seen serious data corruption because “foreign keys are bad and we can just enforce it in code” is amazing. There is zero excuse to not use FK’s

Any database that doesn’t use FK’s is almost guaranteed to have crap in it that didn’t get cleaned up, resulting in data corruption (and yes, dangling stuff in tables count as data corruption).

Developers aren’t perfect. Shit will slip through even with the most rigorous process. The database should always provide tools to keep its self from getting corrupted. This is why good database systems have constraints like FK’s. If your database software makes using those tools painful (cough MySQL), get a better database system.

Not using FK’s is just plain old ignorance.



Seriously.

"As a C developer, I never check exit codes of child processes. We can just enforce it by ensuring child processes don't have bugs"


`malloc` won't fail, right?


the javascript will enforce the user input. no need to have the backend check it again!


This should be a joke, but considering how often it's actually the case it's unfortunately not that funny.


Not as much as developers who can't find their own coding errors might have you think!


It won't on Linux! But hey, random processes will get OOM-killed.


You can turn off overcommit, in which case you'll get a null return instead of OOM killing.

But also, you can still get a malloc failure without actually be running out of memory if the allocator can't find a big enough contiguous chunk of address space.

This is highly unlikely on a 64 bit system, but if you try to malloc gigabytes on a 32 bit machine you might see it.


And also, you never know when this theoretical case actually happens but it did happen to me: at one point someone may use your code on an Arduino and unexpectedly, it works! But memory allocation could fail more than you expect!


On Unix it won't. You can overcommit memory and only get into trouble when you actually try to page it. But not at malloc time.


Iirc that's configurable.


I think this is a bit hyperbolic of an expression, but I do want to reinforce that all applications need to confirm data integrity, you either confirm it when saving the data, when extracting the data, or have to very carefully balance various consistency concerns and either enforce consistency before saving data or enforce consistency after saving (but before extracting) data.

Things like RDBMS's provide some really nice utilities for data consistency enforcement in the form of FKs, these are not the most performant approaches in all cases but they're optimized to be good for nearly all use cases.

If your business need /Requires/ moving off of FKs onto some other data integrity guarantee, then just understand that you're going to be reinventing the wheel. It might turn out to be a better wheel for your car, but it's going to be expensive. And... unless you specifically hire specialized people who comprehend data guarantees and ACID properties, you'll probably do it wrong. This sort of an undertaking is for mature companies only.


> I do want to reinforce that all applications need to confirm data integrity, you either confirm it when saving the data, when extracting the data, or have to very carefully balance various consistency concerns and either enforce consistency before saving data or enforce consistency after saving (but before extracting) data.

There are two kinds of error handling. The “happy error” and the “fuck you asshole” error handing. Good systems have both.

Yes, the application should check it isn’t about to violate a FK constraint. Just like it usually checks that it isn’t about to violate a unique constraint. That way the application can fail in happy, controlled way. But if the application doesnt do the right thing than the DB server still should puke all over the transaction with a “fuck you, don’t feed me bullshit” error.

Just like data entry over the web. The javascript can return all kinds of nice happy messages to the user about missing fields and stuff. But the backend should always validate and enforce correctness even if it’s mode of failure is some ugly “piss off, don’t feed me crap” message.

Just like you should never trust user provided data coming from an HTTP post, a database should never trust the INSERT or UPDATE is valid and won’t corrupt the database. Both backend systems can return mean old ugly errors when shit is bad and let the front end do pre-validation that can do happy nice errors. The backend always has to enforce its own data integrity. Period.


Oh I absolutely agree and even when a DB is properly configured with references all cleared defined and constrained it's absolutely a good UX thing to pre-check as much as possible.

But, beyond that, it is quite possible to remove FK checks and still have strong guarantees about data integrity. It is stupidly expensive and unless you have a few billion in the bank there is absolutely no reason to even consider it, but if you're dealing with data volumes like GitHub then it's conceivable that all the other salves for enforcing data integrity fall short. In that case there are ways to approach removing FKs, but, when you do so, you're not (logically speaking) giving up the data-integrity from FKs, you are replacing FKs as a tool for data-integrity with another tool for data-integrity (one that will probably be very similar to FKs) - under this guise DB FKs can stop making sense (though also having any sort of RDBMS engine likely also stops making sense as you're essentially adopting the functional responsibility for being an RDBMS into the primary application).


Being pedantic in this case doesn’t help the cause. Too many developers don’t understand database theory at all and will read this

> But, beyond that, it is quite possible to remove FK checks and still have strong guarantees about data integrity

And not the rest of your post. Yes technically you are right but it is stupidly expensive and nobody should do it.

The problem with being technically correct is, again, people will stop at the sentence I quoted and go build yet another FK free system. Said system will undoubtedly fill up with corrupt bullshit data that eventually leads to exciting mystery bugs in production that has everybody scratching their heads. I’ve seen it time and time again....


I feel like reading HN should come with a warning on the tin that "If you're reading a long technical comment, taking away just part of it is dangerous" - were I speaking to someone in the business side of a corp that asked "Hey do we need these FK things, some developers have been saying they're slow" I'd say 'Yes, we absolutely do need FKs' then go on to talk to the developers, double check I wasn't at one of the about dozen of companies with data at a scale that FKs as implemented in RDBMSes (especially postgres, mysql tends to drop off in performance much easier without heavy tweaking) is insufficient, and then tell them that FKs do work and they probably really just need to read up a lot on indexes and stop throwing around table locks like it's christmas.


Hey there, you and I still need work. Let them speak!


Unique constraint is a good example, because it reminds us about race conditions.

The app can check that it won't violate a unique constraint before doing an insert/update, but in between that check and actually doing the insert/update, some other process may have changed data, such that unique constraint can be violated.

So when the rdbms catches this, it'snot just a "fuck you for giving me bad data" condition if the implication there is that it was a bug in app code, and it's a failsafe. It isn't necessarily a bug at all -- unless you intended it to be the app's responsibility to use db-level locks and/or transactions to guarantee this can't happen without the uniqueness constraint -- but then why not just use a uniqueness constraint, the tool the db is actually giving you for this?

Mature rdbms's sometimes don't get the recognition they deserve for being amazing at enforcing data consistency and integrity under concurrency, with pretty reasonable performance for a wide variety of usage patterns.

Foreign key constraint can be similar; you can do all the checking you want, but some other process can delete the object with the pk right before you set the fk to it.

If you have app usage patterns such that you really can't afford database data integrity enforcement (what level of use that is of course depends on your rdbms)... you are forced to give up a lot of things the rdbms is really good at, and reinvent them (in a way that is more performant than the db??) or figure out how to make sure all code written never actually assumes consistent data.


The disdain some developers tend to have for their data storage system is pretty unreal. It’s like they see it as a necessary evil instead of their friend.

They often react to FK’s as if they cramp their style, where their style is to just insert whatever into the database without regards for what And where it is.


I had a job where the database was intentionally lacking FKs so that users could input data out of order, for example create a shipping route Foo that goes to a warehouse Bar, before the warehouse existed. Let's just say it was a suboptimal design.


this just sounds like eventual consistency which can form part of a perfectly optimal design


Or "eventual" never comes and you end up with a bunch of inconsistent data.


The engineer https://en.wikipedia.org/wiki/Niki_Lauda from Rush movie fame, if he's working on a race-car-level-system where every little bit of performance you can tweak out counts, like wringing a towel dry, every last bit... and in that mindset, FKs are removed NOT from ignorance!


Not using FKs essentially means eventual consistency, and it's on you to ensure it. Obviously, eventual consistency is something a lot of systems have to settle for, often for reasons that have nothing to do with architecture, but just because of circumstances where you have to synchronize disparate systems and there's no way to eliminate all but one.

I'll note that there are very large systems that do provide ACID-like consistency (e.g., Spanner, Lustre), so it's not at all that "large -> eventual consistency".

I won't pass judgment on projects where eventual consistency was a given from the start, even if I might think they could have done better, but too, I wouldn't make either approach a hard and fast rule: circumstances vary. What bothers me is the extent to which I see "triggers bad", "FKs bad", "business logic in the DB bad" blanket statements out there.


Eh. The issue arises when you have a very poorly designed schema and missing application-level operations to perform cleanup and deletion.

You don’t want to be caught in that situation because then you are handcuffed and cannot clean the database properly. For example, from the perspective of business operations you might have something that creates 1 single thing where under the hood 5+ DB objects are all created in a transactional manner and a very specific order. So how do you unroll that? You gotta carefully construct the rollback commands or just get rid of the bowling lane bumpers and slash and burn.

So you gotta commit and go all-in. If you make the choice to do it all within the app later, at least you keep that flexibility.

It all depends on your processes and your team. If you don’t design things properly and institute the correct procedures for designing and operating the DB, having or not having constraints at the DB level is inconsequential to the other issues you will face.


Bullshit. Even with the most perfect developers and perfect system, letting application code enforce constraints will lead to data corruption period. Some application will crash or something and leave dangling garbage behind and boom you are fucked.

Would you ever trust that form data is valid because the JavaScript front end “validated” it? No! Why the hell are you going to trust that everything sent to the database is valid? No constraints like FK’s is exactly like not validating input because “the JavaScript layer got it”.

It is out of ignorance that people argue otherwise, sorry. Too many people don’t understand relational database...


Well, what's true is that if you leave it to the application, you now have two problems: you have an RDBMS you use half-way, and an app that needs to implement the RDBMS features that you're not using from the RDBMS. Since most app devs are not RDBMS devs and don't want to be, they're bound to get a few things wrong.

The most likely thing to go out the window in an ORM-type app is concurrency. Take a big lock around DB ops and you're good, right? But then, too, there goes performance.

Or, if it's not concurrency that goes out the window, you have to implement eventual consistency...

The worst thing I've seen too much of is application-level JOINs and such. Next are recursive queries: done in the app. All that absolutely destroys performance. Then you have custom query languages that are just never good enough.

So I am mostly in agreement with you, but "b.s." is too strong for me. You can get all of this right in the app, though at a terrible cost, and there are times when maybe it's actually for the best.

Let me give you an example of how I might build something that's somewhere in the middle. Think of Uber or Lyft. I might have a single widely-replicated RDBMS for users, and maybe another for drivers, and maybe per-city in-memory-only ride DB for tracking requests and current rides, and one more, possibly sharded DB for billing. The ride DBs would send billing updates via some pub/sub scheme (e.g., like PG's NOTIFY, or something else). Recovering from reboots of the ride DB is easy: rely on the apps to recover the state. Most operations only have to hit one DB at a time. The whole thing is eventually consistent for user/driver ratings and billing, which seems rather fine to me given that the operations to synchronize are essentially just aggregation updates. In such an app there's not a lot of room for application logic in the DB -- some yes, and I'd put as much as possible in the DB.

There are plenty of apps where some degree of eventual consistency makes sense. What doesn't make sense is to end up hand-coding JOINs, GROUP BYs, and so on. And my preference is to use FKs and triggers as much as possible, but within limits (see all the above).


Yikes. Someone got up on the wrong side of bed today!


Sorry. I have just seen way to many systems fail in way too many mysterious ways because "foreign keys are bad" and "the application code can do the validation". In fact, I wager that 100% of all databases that don't have FK's have some kind of data corruption that is causing at least some kind of user-visible issues.

It is frustrating to me because even the most green behind the error developer would cringe at somebody saying "we don't need the backend to validate user input because the front-end javascript does it for us". This is the same thing.

Never trust your inputs. Your application code calling the database server is exactly like some javascript client calling your backend system. Your database has tools to keep it from being fed bullshit input. For some extremely frustrating reason, people think it is perfectly okay to have the database trust its user input.

The result is in almost every single instance where those tools aren't used, the inevitable corrupt data will cause some kind of hard to reproduce user-impacting issue. I've seen it so many times it makes my head hurt.


I should have clarified that I don't think FK's are bad ... just that if you are going to use them you need to plan for that.

Like you, I have been bitten by this in many different environments on both sides of the fence. In certain environments the absence of FK's has caused major headache and likewise in other environments the presense of them and the lack of a robust db design has meant the FK's cause more harm than good.


Thanks. As a mere occasional user of databases I was confused and trying to figure out how you could not use FKs without recreating essentially the same thing in code.


At the scale of GitHub, FKs may very well being more problems than benefits. For everyone else, FKs are a benefit.


> and yes, dangling stuff in tables count as data corruption

No. There are exceptions like financial systems, but most of the time it doesn't matter if your DB has dead links to some removed entities.

> Not using FK’s is just plain old ignorance.

Nice level of argumentation, but I prefer the way how author from github can prove his opinion.


Every time I have seen a database use foreign keys there has been data corruption, because everyone thought foreign keys were declarative and not procedural. Just because you have a foreign key doesn't mean it was always there or that it applied on every transaction. You can turn them off at the connection level and you in fact must turn them off for almost any kind of bulk data load.

You should read shlomi's post he gives good reasons. Specifically this is a github issue on his tool gh-ost which is for online schema migration, and FK's pose lots of problems for online schema migrations.


Maybe shitty toy database systems like MySQL let you disable constraints on a per session basis. A real database system might let you defer them until the end of a transaction—which is the only correct way to operate. Once you commit that transaction, what you put into the system better fucking make sense and it is the job of the database system (and only the database system) to enforce that.

Like I said before if your database system makes using constraints hard or lets you shoot yourself in the foot (lol at disabling constraints per session), don’t just walk but run away from that system.


> MySQL

What other database has been used at such a giant scale at so many companies, than MySQL? I'm sure Oracle and SQL Server are used at big companies, but nowhere near Facebook scale.


You aren't Facebook. Facebook has put massive work into adding layers on top of MySQL that your startup has not.

Also, you seem to be forgetting PostgreSQL


> Maybe shitty toy database systems like MySQL

While appropiate to define in few words some of MySQL's colossal mistakes, this isn't the kind of language that will sway heads that have been comfortably using MySQL because those defects are just "what DB's do".


Any database that would let you disable constraints on a session basis is a toy database. Such an operation doesn’t even make sense because at some point the relational integrity has to be enforced for the entire table. You can’t just have parts of a table be relationally correct. That is like saying 1 + 1 = 3. It is a completely illogical statement.

However I would not at all be surprised to learn MySQL supports such a thing. Which supports my assertion it is a toy used (or at least installed by) people who have no understanding of relational database architecture.


So are you asserting that the following products are all built on top of a "toy" database, and their engineers have no idea what they're doing:

Facebook, YouTube, Wikipedia, Pinterest, Slack, GitHub, Etsy, Yelp, LinkedIn, Shopify, Dropbox, Wordpress, Wix, Tumblr, Square, Uber, Booking.com, Box, Venmo, SendGrid, Okta, SurveyMonkey, WePay, Alibaba, SoundCloud, among countless others...

An alternative view is that your statements are incorrect. Do you have much direct experience with high-volume OLTP database workloads, or are you basing your views of MySQL on something else?


Once you are stuck with MySQL it is very, very, very hard to get an organization to switch--not only from a technical standpoint but a political one.

I bet you any competent engineer who knows their shit about DB in those companies regrets using MySQL. I bet their code is full of hacks, crappy schemas, and all kinds of work arounds because they chose mysql. I've seen it in every company that uses MySQL. The lengths people go to avoid schema changes is astonishing.

It is much, much better to start with a real database like PostgreSQL because whatever you pick is going to be what your entire org uses from now until eternity.


Cool, so I'm going to assume that means your answer to my question of "Do you have much direct experience with high-volume OLTP database workloads?" is "no". Given your "bet" as well as comments about schema change difficulty, I'm also going to assume you did not click through to my profile...


And to clarify, I'm not saying that to toot my own horn. My point was, I primarily work on open source schema management and related consulting. I talk to large companies about MySQL schema changes literally every single day. The comment about "the lengths people go to avoid schema changes is astonishing" simply does not gel with reality among large-scale MySQL users.

As for the random blind accusations about bad code, engineer incompetence, etc that's just rude, mean-spirited, and misinformed. I personally know a lot of exceptional database engineers who work on MySQL-related infrastructure at the companies I listed above. Why crap on other people's work that you haven't seen and know nothing about?


Agreed. I think MySQL is still dragging the bad reputation it got in the early 2000s, which is unfair considering how much it improved. And I write that I someone that used to hate MySQL for all its shortcuts. I have one app in production based on MySQL. I have been thinking of switching to PostgreSQL for years. But the truth is that, the more MySQL improves, the less the switch is justified :) The engineering effort put by Google, Facebook, Oracle, etc. in MySQL and InnoDB during the last ten years is impressive.


Qualifying MySQL as a "toy" database in 2019 is obviously wrong.

But I think most companies you mentioned don't use MySQL in the usual way, as they would use a "standard" enterprise database like Oracle, SQL Server or PostgreSQL.

These companies don't use MySQL directly. They use it indirectly as the storage component of a larger architecture. For example, YouTube uses Vitess "over" MySQL.

Companies like Instagram are known to do something similar with PostgreSQL.

My point is that maybe you and the parent comment are not thinking about the same use case.


> These companies don't use MySQL directly. They use it indirectly as the storage component of a larger architecture.

Yes and no. Often it's both. I say this first-hand, having performed significant work on the database tier for two of the companies I listed, and consulted for several others.

For example, while Facebook's largest db tier goes through a dao / writethru cache, there's plenty of other use-cases that are direct MySQL usage.

And in any case, why does it matter if there's another layer involved? It's still MySQL powering mission-critical global-scale use-cases. And for example with YouTube, literally the primary benefit of Vitess is that your application can treat it as a single normal unsharded MySQL installation, so those interactions are still very MySQLy.


> For example, while Facebook's largest db tier goes through a dao / writethru cache, there's plenty of other use-cases that are direct MySQL usage.

I didn't know about that. That's interesting!

> And in any case, why does it matter if there's another layer involved?

I was writing that in the context of the parent comment about "disabling constraints". I can see why disabling constraints makes sense in a sharded environment, with an intermediate layer like Vitess. But the benefit of disabling constraints is less clear when using MySQL directly in a non-sharded environment.

Since you're here, I'd like to ask why you would use MySQL over PostgreSQL in a new project nowadays?

Regarding MySQL, the two main advantages I can think of are that tables are organized as clustered index (instead of a heap in PostgreSQL, which can be an advantage or a drawback depending on the workload) and the replication tooling.

On the other hand, PostgreSQL has a lot of useful features that I miss in MySQL: table elimination/join removal (exists in MariaDB but not in MySQL), indexes bitmap scan (to combine indexes efficiently), partial indexes, transactional DDL, LISTEN/NOTIFY, materialized views, row-level security, table functions like generate_series.


To explain more re: FB and having another layer on top of MySQL, there are a bunch of separate sharded MySQL tiers there. It's split by workload -- for example, the access pattern, schema, and sharding key differs completely between the main social graph, Messenger data, ad market, financial transaction data, etc. And then there's also the internal MySQL database-as-a-service, which allows any engineer to provision one or many databases for any other purpose. Overall, some of these things have services on top that use MySQL more as low-level storage, and others use MySQL in a more traditional fashion.

re: "disabling constraints", that's kind of orthogonal. The large MySQL users simply don't create foreign key constraints in the first place; there's nothing to disable :) Whereas MySQL's ability to disable constraints for a single session is a feature intended to make things like logical dump/restore easier and faster, schema management easier, etc. Without that feature, these tools would need to construct a dependency graph and create tables in a specific order (and/or defer FK creation until after the tables), which is needlessly complex if the tables are new/empty, and very slow if restoring a logical dump which is already known to be referentially consistent.

As for MySQL vs Postgres, IMO both databases are close enough in major functionality that for many use-cases it's best to just go with what you already know, can hire for, and can operate. There are special cases where one is better than the other, for example personally I'd go with MySQL for social networking / UGC / very high volume OLTP, and go with Postgres for use-cases where solid geospatial, OLAP, or fulltext are core requirements and/or there's a desire to minimize the number of different data stores.

Ideally with MySQL you're just using it for OLTP, and deferring to separate systems for OLAP, fulltext search, etc. In a way that's "more UNIXy" but it's also potentially an operational headache.

In terms of specific feature comparison, you already have a great list there. A couple other things on the MySQL side I'd mention are InnoDB's buffer pool (smarter caching than relying on the OS cache as pg does) as well as the existence of MyRocks storage engine (LSM-based system offering better compression than pretty much anything else of comparable performance level for OLTP).

That all said -- Postgres is an awesome database, and I'd say that Postgres is more closely aligned with the textbook definition of a proper relational database. But then again I'd also say something similar about FreeBSD (vs Linux) for server operating systems, yet for practical purposes I always go with Linux anyway :)


Thanks for sharing your experience here!

> Without that feature, these tools would need to construct a dependency graph and create tables in a specific order (and/or defer FK creation until after the tables)

This problem alone justifies the ability to temporarily disable constraints. I notice PostgreSQL, which is a toy too ^__^, offers something similar.

> Go with Postgres for use-cases where solid geospatial, OLAP, or fulltext are core requirements and/or there's a desire to minimize the number of different data stores

I agree that PostgreSQL is a really good match for these use cases.

Great point about Linux and FreeBSD ;)


PostgreSQL is not a toy database either, and yet you can do:

    ALTER TABLE <tablename> DISABLE TRIGGER ALL
Or:

    SET session_replication_role = 'replica'


This really makes it seem like you have never used a database system at scale. There are reasons why systems like MySQL let you turn them off, and they are some of the same reasons why pretty much everyone who uses a database at scale has settled on MySQL. Also its why as is mentioned in the issue once you actually scale your database foreign keys become a nightmare. If all you have is a toy project you can feel free to use any database you want, but if you actually need to serve traffic you might want to rethink your priorities.


> everyone who uses a database at scale has settled on MySQL

hmm?

I've only every seen people using MySQL at scale if they started with MySQL in prototype and never had the energy to migrate.


Well here is one case of a large scale user migrating

https://eng.uber.com/mysql-migration/

And there are tons more. In fact here is a tool to help you do it

https://github.com/pivotal-cf/pg2mysql


Uber also has a giant team dedicated to re-inventing slack. I'd take their engineering prowess with a grain of salt.

Lots of people do dumb things for dumb reasons.

Most developers I interact with, even the really good ones, are profoundly stupid when it comes to databases.


I am a bit doubtful of this, especially within PostgreSQL you need to specifically go out of your way to create a NOT VALID constraint. I know that MySQL of old would default to an engine that didn't actually enforce key relationships (which was terrible but at least well documented) but in the modern world DBs will tend toward enforcement unless you specifically work against it.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: