Hacker News
Ask HN: List deduplicator w/fuzzy matching?
2 points by jellicle on July 26, 2011 | 6 comments
Hi. I have a spreadsheet with semi-duplicate entries that I want to merge (like merging Google Contacts, for example).

Input:

-- Robert Smith, 123 Main St., New York, NY, $foo

-- Bob Smith, 123 Main Street, NY, NY, $bar

Output:

-- Robert Smith, 123 Main Street, New York, NY, $foo, $bar

Googling, I see all sorts of Windows-only list-management software that you can buy, or companies that will become my list cleanup provider for a couple thousand dollars, etc. This is for a one-shot merge. Is there any free/open-source software I can use? Or a web service that I can pay $9.95 to, upload my list, and download a cleaned/merged version, something like that? I don't even mind making the final decisions about what is and isn't a duplicate entry. The software doesn't have to be brilliant, just vaguely smart.
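To make the desired output concrete, here is a rough Python sketch of the merge step only, applied to the two example rows above. It assumes the rows have already been identified as duplicates and that the first four fields line up. The `merge_rows` helper and the "keep the longer value" rule are illustrative assumptions, not something any particular tool does.

```python
def merge_rows(a, b, n_shared=4):
    """Merge two duplicate rows (hypothetical helper).

    For each of the first n_shared fields keep the longer value, on the
    assumption that the longer spelling is the less abbreviated one.
    Any remaining fields from both rows are appended without repeats.
    """
    shared = [x if len(x) >= len(y) else y
              for x, y in zip(a[:n_shared], b[:n_shared])]
    extras = []
    for value in a[n_shared:] + b[n_shared:]:
        if value not in extras:
            extras.append(value)
    return shared + extras

merged = merge_rows(
    ["Robert Smith", "123 Main St.", "New York", "NY", "$foo"],
    ["Bob Smith", "123 Main Street", "NY", "NY", "$bar"],
)
print(", ".join(merged))
# prints: Robert Smith, 123 Main Street, New York, NY, $foo, $bar
```

The length rule is crude; for a one-shot cleanup it may be enough to print both candidates and pick by hand.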



Did you look at Google Refine already?

http://code.google.com/p/google-refine/

Another option is to download a copy of SQL Server Developer Edition and use the fuzzy matching utilities in SSIS. They are pretty easy to use.


No, I hadn't heard of Google Refine. Looks like a great tool, might well solve my problem. Thanks so much!


Do the lists contain the same fields (i.e., first name, last name, etc.)?

If you have Excel you can use conditional formatting to highlight the duplicates (Conditional Formatting -> New Rule -> Format only unique or duplicate values -> Select a formatting style). If you format it as a table and then sort by one of the columns then you should see all of the duplicates listed together and you can remove the duplicate rows manually.

Edit: The Robert/Bob thing does cause an issue with this method, but I think it's still a viable option.
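One way to soften the Robert/Bob problem is to normalize each field before looking for duplicates, so that "Bob" and "Robert" or "St." and "Street" end up identical. Below is a minimal Python sketch of that idea. The lookup tables and the `normalize` and `group_rows` names are made up for illustration, and the tables would need to be extended for real data.

```python
import re
from collections import defaultdict

# Hypothetical lookup tables; extend them for your own data.
# Note that "st" can also mean "saint", so review the results.
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}
ABBREVIATIONS = {"st": "street", "ave": "avenue", "ny": "new york"}

def normalize(text, table):
    """Lowercase, strip punctuation, and expand known short forms."""
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    return " ".join(table.get(w, w) for w in words)

def group_rows(rows):
    """Group (name, street, city, state, extra) rows by a normalized key."""
    groups = defaultdict(list)
    for name, street, city, state, extra in rows:
        key = (normalize(name, NICKNAMES),
               normalize(street, ABBREVIATIONS),
               normalize(city, ABBREVIATIONS))
        groups[key].append((name, street, city, state, extra))
    return dict(groups)

rows = [
    ("Robert Smith", "123 Main St.", "New York", "NY", "$foo"),
    ("Bob Smith", "123 Main Street", "NY", "NY", "$bar"),
]
groups = group_rows(rows)
for key, members in groups.items():
    print(key, len(members))
```

With these tables both example rows land under the same key, so they show up as one group of two. In Excel the same effect could be had by adding a helper column with the normalized value and highlighting duplicates on that column instead.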


Well, I can make them have the same fields, and I'm bright enough to merge exact matches myself. It's the inexact matches that are going to cause me some difficulties.



I've looked at Febrl before (http://sourceforge.net/projects/febrl/), but in the end found it easiest to write my own.
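For anyone going the write-your-own route, a rough starting point is to score every pair of rows with a string similarity measure and review the high-scoring pairs by hand. This sketch uses `difflib.SequenceMatcher` from the Python standard library. The 0.6 threshold and the `similarity` and `candidate_pairs` names are arbitrary assumptions, and this is not how Febrl works internally.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_pairs(records, threshold=0.6):
    """Yield (score, i, j) for each pair of records at or above the threshold."""
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = similarity(a, b)
        if score >= threshold:
            yield round(score, 2), i, j

records = [
    "Robert Smith, 123 Main St., New York, NY",
    "Bob Smith, 123 Main Street, NY, NY",
    "Alice Jones, 9 Elm Road, Boston, MA",
]

# Highest-scoring pairs first; a human makes the final call on each one.
pairs = sorted(candidate_pairs(records), reverse=True)
for score, i, j in pairs:
    print(score, "|", records[i], "|", records[j])
```

Comparing every pair is quadratic in the number of rows, which is fine for a few thousand entries but slow for large lists. The threshold will need tuning against your own data.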



