
There are programs with which you can add any desired amount of redundancy to your backup archives, so that they can survive any corruption that affects no more data than the added redundancy.

For instance, on Linux there is par2cmdline. For all my backups, I create pax archives, which are then compressed, then encrypted, then expanded with par2create, and finally aggregated again into a single pax file. (The legacy tar file formats are not good for faithfully storing all the metadata of modern file systems, and each kind of tar program may have its own proprietary, non-portable extensions to handle this, which is why I use only the pax file format.)
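To make the pipeline concrete, here is a minimal sketch of what such a script can look like; the file names, the xz/gpg tool choices and the 10% redundancy figure are my own illustrative assumptions, not necessarily what the parent uses:

  # Archive the directory as pax, then compress and encrypt it
  # (tool choices here are illustrative)
  bsdtar --create --format=pax --file=backup.pax "${DIRECTORY}" || exit
  xz --keep backup.pax                                # -> backup.pax.xz
  gpg --symmetric --output backup.pax.xz.gpg backup.pax.xz

  # Add ~10% redundancy with par2; the resulting *.par2 volumes plus the
  # encrypted archive are then wrapped into one outer pax file (second bsdtar pass)
  par2create -r10 backup.pax.xz.gpg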

Besides that, important data should be replicated and stored on 2 or even 3 SSDs/HDDs/tapes, which should preferably be stored themselves in different locations.



Unfortunately, some SSD controllers flatly refuse to return data they consider corrupted. Even if you have extra parity that could potentially restore the corrupted data, the entire drive might refuse to read.


Huh?

The issue being discussed is random blocks, yes?

If your entire drive is bricked, that is an entirely different issue.


Here’s the thing. That SSD controller is the interface between you and those blocks.

If it decides, by some arbitrary measurement defined by logic inside its black-box firmware, that it should stop returning blocks altogether, then it will do so, and you have almost no recourse.

This is a very common failure mode of SSDs. As a consequence of some failed blocks (likely after exceeding some threshold of failed blocks, or perhaps because the controller’s own storage failed), drives will commonly brick themselves.

Perhaps you haven’t seen it happen, or your SSD doesn’t do this, or perhaps certain models or firmwares don’t, but some certainly do, both in my own experience and in countless accounts I’ve read elsewhere, so this is more common than you might realise.


This is correct: you still have to go through the firmware to gain access to the block/page on “disk”, and if the firmware decides the block is invalid, then it fails.

You can sidestep this by bypassing the controller on a test bench though. Pinning wires to the chips. At that point it’s no longer an SSD.


The mechanism is usually that the SSD controller requires that some work be done before your read - for example rewriting some access tables to record 'hot' data.

That work can't be done because there are no free blocks. However, no space can be freed up, because every spare writable block is bad or is in some other unusable state.

The drive is therefore dead - it will enumerate, but neither read nor write anything.


I don't think this is correct; it could read the flash block containing the [part of the] table in question, update it in memory, erase that block, then rewrite it into the same block.


I really wish this responsibility was something hoisted up into the FS and not a responsibility of the drive itself.

It's ridiculous (IMO) that SSD firmware is doing so much transparent work just to maintain the illusion that the drive is actually spinning metal with similar sector write performance.


Linux supports raw flash, called an MTD device (memory technology device). It's often used in embedded systems, and it has MTD-native filesystems such as ubifs. But it's only really used in embedded systems because... PC SSDs don't expose that kind of interface. (Nor would you necessarily want them to; a faulty driver would quietly brick your hardware in a matter of minutes to hours.)
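For reference, the embedded workflow for raw flash looks roughly like this; the MTD partition number, volume name and size are illustrative, and the tools come from mtd-utils:

  # Attach MTD partition 0 to the UBI layer, create a volume, mount it as ubifs
  ubiattach /dev/ubi_ctrl -m 0
  ubimkvol /dev/ubi0 -N data -s 64MiB
  mount -t ubifs ubi0:data /mnt/flash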


A buggy firmware will brick an SSD and block every option for recovering at least part of the data.


Seems like the approach Apple is taking by soldering storage directly on the mainboard or using proprietary modules like in the Mac mini.


When only some 4 kB blocks cannot be read, the archive file can still be repaired, as long as the amount of affected data is less than the amount of added redundancy.

For instance, if you have a 40 GB backup archive with 10% redundancy, 4 GB of data, i.e. one million 4 kB data blocks, can be unreadable, and you can still repair the archive and recover the complete content.
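For anyone unfamiliar with par2cmdline, checking and repairing such an archive is a single command each; a sketch with an illustrative file name:

  # With 10% recovery data, up to ~4 GB of a 40 GB archive can be unreadable
  par2verify backup.pax.xz.gpg.par2    # reports whether repair is possible
  par2repair backup.pax.xz.gpg.par2    # reconstructs the damaged data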

It is true that the entire SSD or HDD can become bricked. The solution for this, as I have already written in my previous comment, is to duplicate any SSD/HDD used for archival purposes, which I always do.


Yes, and? HDD controllers dying and head crashes are a thing too.

At least in the ‘bricked’ case it’s a trivial RMA - corrupt blocks tend to be a harder fight. And since ‘bricked’ is such a trivial RMA, manufacturers have more of an incentive to fix it or go broke, or avoid it in the first place.

This is why backups are important now; and always have been.


We're not talking about the SSD controller dying. The SSD controller in the hypothetical situation that's being described is working as intended.


Not as far as I can tell, where intended is ‘as any user would reasonably expect’. Bricking the drive (can’t even read) because of too many errors is not what most users would ever want.

Some would (enterprise maybe), but even then they’d want deterministic data deletes too, which don’t sound like they’re happening.


You can argue that controllers shouldn't behave that way. But they do, it's not a bug, and it's not a dead controller. It's a perfectly functional controller's response to dead blocks.


Cite? It appears not to meet the definition of ‘functional’.


The definition of functional, in the context of this discussion, is that it works in the way the manufacturer explicitly designed it to work, in line with standard industry practice, not as an unforeseen bug or malfunction.

Not some abstract notion.


So not enumerating as a drive, and not allowing you to read even valid blocks is ‘working’?


Yes, same as a facility self-destructing, if it was programmed to do so, is working as per its spec.


And what spec requires that? I have yet to see one.

The manufacturers'.

Cite? I have yet to see that actually documented anywhere, and you keep avoiding actually referring to one.

RE "....This is why backups are important now; and always have been..."

Still a big problem if backup is to the "..same technology..."


That’s why 3-2-1 is not just a good idea.


Thank you for this.

I had no knowledge of pax, or that par was an open standard, and I care about what they help with. Going to switch over to using both in my backups.


For handling pax archives, I recommend the "libarchive" package, which is available in many Linux distributions, even if it originally comes from FreeBSD.

Among other utilities, it installs the "bsdtar" program, which you can use in your scripts like this:

  bsdtar --create --verbose --format=pax --file="${DIRECTORY}".pax "${DIRECTORY}" || exit

And for extraction:

  bsdtar --extract --preserve-permissions --verbose --file="${DIRECTORY}".pax

The bsdtar program has options for compressing and/or encrypting the archives, for the case when you do not want to use other external programs directly.
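For example, compression can be requested in the same invocation; the --xz filter below is just one of several that bsdtar supports:

  bsdtar --create --verbose --format=pax --xz --file="${DIRECTORY}".pax.xz "${DIRECTORY}" || exit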

"par2create" creates multiple files from the (normally compressed and encrypted) archive file, for storing the added redundancy. I make a directory where I move those files, then I use a second time bsdtar (obviously without any compression or encryption) to aggregate those files in a single archive with redundancy.

The libarchive package can also be taken directly from:

https://github.com/libarchive/libarchive

"libarchive" handles correctly all kinds of file metadata, e.g. extended file attributes and high-resolution file timestamps, which not all archiving utilities do. Many Linux utilities, with the default command-line options or when they have not been compiled from their source with adequate compilation options, which happens in some Linux distributions, may silently lose some of the file metadata, when copying, moving or archiving.


There's no reason that you have to create multiple files for par2 if you are storing the recovery data with the protected data. It was only split into files of varying size because of its origins in protecting Usenet-posted binaries, so that users did not have to download the entire recovery data when they only needed a portion.
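With par2cmdline, the -n option controls how many recovery files are produced, so local use can keep everything in one volume; a sketch with an illustrative file name:

  # All recovery blocks go into a single volume file (plus the small index .par2)
  par2create -r10 -n1 backup.pax.xz.gpg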


This is fine, but I'd prefer an option to transparently add parity bits to the drive, even if it means losing some capacity.

Personally, I keep backups of critical data on a platter-disk NAS, so I'm not concerned about losing critical data off an SSD. However, I did recently have to reinstall Windows on a computer because of a randomly corrupted system file, which is something this feature would have prevented.



