
What I need is something that takes all the extended characters (think Spanish or Swedish) and turns them into alternative safe versions.

For instance, á into a, ñ into n, å into a, etc.

Had my hopes up when I saw the title.

Does anyone have any ideas or links to working scripts that I can turn into something useful? I need to "sanitize" a database of foreign documentaries before uploading to YouTube (their metadata input system chokes on extended chars). Thanks!
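For the simple cases listed above (á, ñ, å), a minimal sketch using only Python's standard library might look like the following. It is not something proposed in this thread: it decomposes each character and drops the combining marks. The function name `strip_accents` is made up for illustration, and letters with no decomposition (ø, æ, ß) pass through unchanged.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Remove diacritics by decomposing characters and dropping combining marks."""
    # NFKD splits "á" into "a" followed by a combining acute accent (U+0301).
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep base characters, discard the combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

if __name__ == "__main__":
    print(strip_accents("áñå"))
```

This only handles letters that Unicode defines as a base letter plus an accent, so it is a starting point rather than a full sanitizer.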



You can use ICU transliterators. Example for the PHP ICU bindings: http://php.net/manual/en/transliterator.transliterate.php#11...


Thanks. This looks very promising. I'll dig into it and hopefully come out with a clean database ;-)


When you say safe alternatives, you mean ASCII, right? You should look into something that also understands the characters a bit better. For example, å, æ, ø can mostly be turned into aa, ae, oe for Danish and Norwegian. Just turning them into a, ?, o would change the meaning.
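A rough sketch of that idea in Python, assuming the text is known to be Danish or Norwegian: apply a language-specific table first, then fall back to generic accent stripping. The table `DANISH_NORWEGIAN` and the function `to_ascii_da_no` are hypothetical names, and the table covers only the three letters mentioned here.

```python
import unicodedata

# Hypothetical per-language table; only the letters discussed above are covered.
DANISH_NORWEGIAN = str.maketrans({
    "å": "aa", "æ": "ae", "ø": "oe",
    "Å": "Aa", "Æ": "Ae", "Ø": "Oe",
})

def to_ascii_da_no(text: str) -> str:
    """Transliterate Danish/Norwegian text to ASCII, keeping å/æ/ø meaningful."""
    # Compose first so "a" + combining ring is seen as the single letter "å".
    text = unicodedata.normalize("NFC", text).translate(DANISH_NORWEGIAN)
    # Generic fallback: decompose and drop any remaining combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII becomes "?" so the loss is visible.
    return stripped.encode("ascii", "replace").decode("ascii")

if __name__ == "__main__":
    print(to_ascii_da_no("Blåbærgrød"))
```

Other languages would need their own tables (Swedish, for instance, usually maps å to plain a), which is why the mapping is kept separate from the fallback.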


Exactly. I need to turn them into something meaningful.


Well, there are two separate problems.

First is phonetic similarity. This mostly lets users understand each other and helps automatically catch alternate latinizations, so you find out "Hey, he already registered under a latinized-spelling name".

The second is glyph similarity. This is the security concern: two glyphs that are phonetically completely different but look so alike that they are easily mistaken for each other. Such glyphs are used to trick and confuse users. The first kind of check won't catch these, but they're the reason Unicode in domain names is handled so restrictively.

Probably a correct system would have a very liberal interpretation of glyph similarity and would treat strings as matched when they contain similar glyphs.
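One way to sketch such a liberal glyph match in Python is to reduce each string to a "skeleton" in which look-alike glyphs collapse to one representative, then compare skeletons. The `CONFUSABLES` table below is a tiny hand-picked illustration, not a real data set (Unicode publishes a much larger confusables list as part of UTS #39), and `skeleton` and `looks_alike` are hypothetical names.

```python
import unicodedata

# Tiny illustrative table: look-alike glyph -> representative ASCII glyph.
# A real system would load a full confusables data set instead.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u03bf": "o",  # Greek small omicron
    "0": "o",       # liberal: digit zero vs letter o
    "1": "l",       # liberal: digit one vs letter l
}

def skeleton(name: str) -> str:
    """Collapse compatibility forms, case, and known look-alikes."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return "".join(CONFUSABLES.get(ch, ch) for ch in folded)

def looks_alike(a: str, b: str) -> bool:
    """Treat two strings as matched when their skeletons are identical."""
    return skeleton(a) == skeleton(b)

if __name__ == "__main__":
    # Second argument uses Cyrillic "а" in place of Latin "a".
    print(looks_alike("paypal", "p\u0430yp\u0430l"))
```

Being this liberal produces false positives (for example, names that differ only by 0 versus o are treated as the same), which is the trade-off the comment above accepts.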


Have a look at unidecode

https://github.com/iki/unidecode

Originally Perl, there are ports for Python, Node, Ruby, .NET, etc.

Obviously it's imperfect and lossy, but it might be what you want


I use a Python library called unidecode to do this on my site.



