What I need is something that takes all the extended characters (think Spanish o...

cataphract · on Nov 19, 2014

You can use ICU transliterators. Example for the PHP ICU bindings: http://php.net/manual/en/transliterator.transliterate.php#11...

cgranier · on Nov 19, 2014

Thanks. This looks very promising. I'll dig into it and hopefully come out with a clean database ;-)

driax · on Nov 19, 2014

When you say safe alternatives, you mean ASCII, right? You should think about looking into something that also understands the characters a bit better. For example, å, æ, ø can mostly be turned into aa, ae, oe for Danish and Norwegian. Just turning them into a, ?, o would change the meaning.
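To make that concrete, here is a minimal sketch using only Python's standard `unicodedata` module. The `NORDIC` table and the `to_ascii` helper are hypothetical names invented for this illustration, and the table covers only the three letters mentioned above; a real mapping would need review per language.

```python
import unicodedata

# Hypothetical, deliberately tiny table for Danish/Norwegian.
# Assumption: one table per language, applied before any generic fallback.
NORDIC = {
    "\u00e5": "aa", "\u00c5": "Aa",  # å, Å
    "\u00e6": "ae", "\u00c6": "Ae",  # æ, Æ
    "\u00f8": "oe", "\u00d8": "Oe",  # ø, Ø
}


def to_ascii(text, table=NORDIC):
    """Fold text to ASCII, preferring language-aware replacements."""
    # 1. Apply the language-specific replacements first.
    replaced = "".join(table.get(ch, ch) for ch in text)
    # 2. Decompose remaining accented letters into base letter + marks.
    decomposed = unicodedata.normalize("NFKD", replaced)
    # 3. Drop the combining marks, keeping the base letters.
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # 4. Anything still outside ASCII becomes "?" (lossy on purpose).
    return stripped.encode("ascii", "replace").decode("ascii")
```

With the table, `to_ascii("Ærøskøbing")` gives `"Aeroeskoebing"`. With an empty table, å falls back to a plain `a` and æ falls back to `?`, which is the kind of meaning-changing result described above.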

cgranier · on Nov 19, 2014

Exactly. I need to turn them into something meaningful.

Pxtl · on Nov 19, 2014

Well, there are two separate problems.

First is phonetic similarity. This is mostly so that users can understand each other, and to help automatically catch alternate latinizations so you find out "Hey, he already registered under a latinized spelling of that name".

The second is glyph similarity. This is the security concern, where two glyphs are graphically similar but phonetically completely different and can easily be mistaken for each other. These glyphs are used to trick and confuse users. The first kind of check won't catch these, but they're the reason Unicode in domain names is so heavily restricted.

Probably a correct system would have a very liberal interpretation of glyph similarity and would treat strings as matched when they contain similar glyphs.
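A rough sketch of that idea in Python, standard library only: reduce each string to a "skeleton" in which look-alike glyphs collapse to one representative, then compare skeletons. The `LOOKALIKES` table, `skeleton`, and `collides` are hypothetical names for this illustration, and the table is a tiny hand-picked sample, not the Unicode confusables data.

```python
import unicodedata

# Illustrative subset only (assumption): a few Cyrillic and Greek letters
# that render much like Latin ones. A real system needs a far larger table.
LOOKALIKES = {
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u0445": "x",  # Cyrillic small ha
    "\u03bf": "o",  # Greek small omicron
}


def skeleton(name):
    """Collapse compatibility forms, case, and known look-alike glyphs."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return "".join(LOOKALIKES.get(ch, ch) for ch in folded)


def collides(a, b):
    """True when two different strings share the same skeleton."""
    return a != b and skeleton(a) == skeleton(b)
```

For example, `collides("paypal", "p\u0430yp\u0430l")` is true because the second string uses Cyrillic а in place of Latin a, while `collides("paypal", "paypa1")` is false here only because this sample table has no entry for the digit 1. Being "very liberal" would mean growing the table until such cases match too.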

berdario · on Nov 19, 2014

Have a look at unidecode

https://github.com/iki/unidecode

Originally a Perl module; there are ports for Python, Node, Ruby, .NET, etc.

Obviously it's imperfect and lossy, but it might be what you want

notatoad · on Nov 19, 2014

I use a Python library called unidecode to do this on my site.