What is the correct tool to properly merge a large set of tar.gz files which may have enormous overlap of similar files, and some files that have been altered only slightly?
Git plus some parsing seems close in that space: analyzing the files to build a dendrogram-like tree of potential alterations to files over time, by Levenshtein distance, may be useful to approximate a commit history.
However, this doesn't seem to exist or be popular as a tool.
There's vimdiff or meld, but they are so manual and tedious as to be pointless for something like a large history of Takeout tar.gz's.
Throwing in the towel completely, borgfs can help reduce the space they take via block-level de-duplication, but that's a poor solution since it doesn't really track file changes in any reasonable way. It is useful for extracting the files into a directory without the tar or gz, but that raises its own issue of how to appropriately organize the directory structure across the history.
Any thoughts or projects that do a better job of this?
> What is the correct tool to properly merge a large set of tar.gz files which may have enormous overlap of similar files, and some files that have been altered only slightly?
Can you elaborate on this? My understanding is that they should all extract into the same target folder without issues because each archive's set of files is distinct. But maybe I'm just assuming wrongly?
What exactly is your goal, too? It sounds like you are trying to find and de-duplicate visually similar images? Like what do you mean by "enormous overlap" or "altered just slightly"?
The problem isn't one takeout overlapping with itself (multiple zips from one date); it's many takeouts over the years (the full history).
So for example in 2001, you make a takeout with 30 zips, and then delete half of your photos off of Google. Then in 2007, you have another 20 zips, and delete 25% of your emails and photos to make more room, 2008 again, on up to now.
So now you have a big folder with many zips, and maybe some extracted folders, because things happen over the years, etc.
What's the best tool to merge all of this into one directory?
Git can help for the notes from Google Keep that may have had things appended or removed; photos may overlap a bit, so really a set union is all that's required for many files, but some, like the Google Keep notes, will be slightly different.
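For the plain set-union part, here's a minimal sketch of what I mean, assuming everything has already been extracted into directories (the function name and collision handling are just my own illustration, not an existing tool):

```python
import hashlib
import shutil
from pathlib import Path

def merge_by_content(sources, dest):
    """Copy every file from the source trees into dest, skipping
    exact byte-for-byte duplicates (a set union on content hashes)."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    seen = set()
    for src in map(Path, sources):
        for path in src.rglob("*"):
            if not path.is_file():
                continue
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            if digest in seen:
                continue  # identical content already merged; skip
            seen.add(digest)
            target = dest / path.relative_to(src)
            if target.exists():
                # same relative path but different bytes: keep both,
                # disambiguating with a short content-hash suffix
                target = target.with_name(
                    target.stem + "." + digest[:8] + target.suffix)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
    return len(seen)
```

That covers the exact-duplicate photos; it deliberately punts on the "slightly different" files, which is the interesting part.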
My best thought is to make a git repo and add things in, but to compute a Levenshtein distance on the bytes of each file to check whether there is overlap in content, and to estimate the 'lineage' of a file if there is significant overlap with another. Effectively you reconstruct the commit tree from the set of all files over all histories, then build the git repo history from all of the files.
This would likely just be a local git repo since it would likely be several terabytes of info, but that would be the general idea I guess.
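A rough sketch of that lineage idea, using the stdlib difflib ratio as a cheap stand-in for normalized Levenshtein distance (the threshold, names, and "most similar earlier file is the parent" heuristic are all my assumptions, not an existing tool):

```python
import difflib

def similarity(a: bytes, b: bytes) -> float:
    """Stand-in for normalized Levenshtein similarity: difflib's
    ratio over raw bytes (1.0 = identical, 0.0 = nothing shared)."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def infer_lineage(snapshots, threshold=0.6):
    """snapshots: iterable of (timestamp, name, content) tuples.
    Returns {name: parent_name_or_None}, linking each file to the
    most similar file seen at an earlier timestamp -- a crude
    approximation of a commit history."""
    parents = {}
    seen = []  # everything processed so far, oldest first
    for ts, name, content in sorted(snapshots):
        best, best_score = None, threshold
        for _, old_name, old_content in seen:
            score = similarity(content, old_content)
            if score > best_score:
                best, best_score = old_name, score
        parents[name] = best  # None means "no plausible ancestor"
        seen.append((ts, name, content))
    return parents
```

Note this is O(n^2) pairwise comparisons, which is exactly why it'd want some pre-bucketing (size, type, coarse hash) before being run on terabytes.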
I just haven't found a good tool to actually do this easily, unfortunately, but it seems like it would be a very basic, or commonly used, scenario (especially for those 'should-be-a-git-repo' directories that everyone made before knowing about git. You know the ones: 'myfile.v1.doc', 'myfile.v2.doc', 'myfile.final.doc', 'myfile.reallyfinal.doc', 'myfile.finalfinal.doc').
_> So now you have a big folder with many zips, and maybe some extracted folders, because things happen over the years, etc._
Oh, right.
Timelinize can do that. Takeout all your data, then import it into Timelinize. Then delete your Takeout (after Timelinize is finished and stable, of course, heh). Then next time you Takeout, just import it all into Timelinize again. (It de-dupes!) Then delete the Takeout, etc. (Maybe Timelinize can do the cleanup for you someday.)
The de-duping depends on the item being recognizable. Best if the data source provides an ID. Otherwise, things like certain metadata and content can be used to determine duplicates.
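That fallback chain could look something like this; to be clear, this is a hypothetical illustration of the general idea (stable ID first, metadata-plus-content hash otherwise), not Timelinize's actual code:

```python
import hashlib

def dedupe_key(item: dict) -> str:
    """Hypothetical dedup key for an imported item: prefer a stable
    ID from the data source; otherwise fall back to a hash of
    identifying metadata plus the raw content."""
    if item.get("id"):
        return "id:" + str(item["id"])
    h = hashlib.sha256()
    for field in ("timestamp", "filename"):
        h.update(str(item.get(field, "")).encode())
    h.update(item.get("content", b""))
    return "hash:" + h.hexdigest()
```

Two imports of the same item then collide on the same key, whether or not the source handed you an ID.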
I'm not chaxor, but as far as I remember I think you're right:
If I had unzipped all the takeout directories into one giant folder, there'd be no conflicts.
Since I didn't do that, I had to do weird multi-pass parsing, since an album could be split across multiple ZIPs. I get a bit neurotic around backups like this, so I'd have loved some sort of virtualized filesystem that non-destructively represented all of those ZIPs "merged together." But in retrospect, I should have just merged the directories into one folder, which would have made parsing easier :)
I don't recall substantial problems with duplicates. Just weird renames and EXIF data mismatches. And since I was trying to archive my data, I definitely didn't want similar photos to be deduplicated.
My problems are probably different than chaxor's, though.