What I need is something that takes all the extended characters (think Spanish or Swedish) and turns them into alternative safe versions.
For instance, á into a, ñ into n, å into a, etc.
Had my hopes up when I saw the title.
Does anyone have any ideas or links to working scripts that I can turn into something useful? I need to "sanitize" a database of foreign documentaries before uploading to YouTube (their metadata input system chokes on extended chars). Thanks!
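For the simple cases named above (á, ñ, å), one possible starting point is Unicode decomposition from Python's standard `unicodedata` module. This is only a sketch: `strip_accents` is a made-up helper name, and letters with no decomposition (ø, æ, ß, and so on) pass through unchanged, so they would still need separate handling.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Split each character into base letter + combining marks (NFKD),
    then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("\u00e1 \u00f1 \u00e5"))  # prints: a n a
```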
When you say safe alternatives, you mean ASCII, right? You should look into something that also understands the characters a bit better. For example, å, æ, ø can mostly be turned into aa, ae, oe for Danish and Norwegian. Just turning them into a, ?, o would change the meaning.
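A rough sketch of that language-aware idea in Python: apply a per-language override table first, then fall back to generic accent stripping for everything else. The table `DANISH_NORWEGIAN` and the function `to_ascii` are hypothetical names for illustration, and the table only covers the three letters mentioned here.

```python
import unicodedata

# Hypothetical override table for Danish/Norwegian; extend per language.
DANISH_NORWEGIAN = {
    "\u00e5": "aa", "\u00c5": "Aa",  # å, Å
    "\u00e6": "ae", "\u00c6": "Ae",  # æ, Æ
    "\u00f8": "oe", "\u00d8": "Oe",  # ø, Ø
}

def to_ascii(text: str, overrides: dict) -> str:
    # Compose first so "a" + combining ring is seen as the single letter å.
    text = unicodedata.normalize("NFC", text)
    # Language-specific replacements take priority.
    text = text.translate(str.maketrans(overrides))
    # Generic fallback: decompose and drop combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII becomes "?" so it is easy to spot.
    return stripped.encode("ascii", "replace").decode("ascii")

print(to_ascii("Bl\u00e5b\u00e6rgr\u00f8d", DANISH_NORWEGIAN))  # prints: Blaabaergroed
```

With an empty override table, `to_ascii("\u00f8", {})` comes back as `?`, which is a useful way to find letters that still need a mapping.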
There are two kinds of similarity to check for. The first is phonetic similarity. This is mostly to let users understand each other and to help automatically catch alternate latinizations, so you find out "Hey, he already registered under a latinized-spelling name".
The second is glyph similarity. This is the security concern: two glyphs that are phonetically completely different but graphically similar, so they can easily be mistaken for each other. These glyphs are used to trick and confuse users. The first kind of check won't catch these, but they're the reason we don't have Unicode in domain names.
A correct system would probably take a very liberal view of glyph similarity and treat two strings as matching when they differ only in similar-looking glyphs.
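One way such a liberal glyph check might look, as a minimal sketch: map every glyph to a representative look-alike and compare the results. The `CONFUSABLES` table below is a tiny hand-picked illustration, not a real data set; a serious implementation would more likely start from the confusables data published with Unicode's security mechanisms report (UTS #39). The names `skeleton` and `looks_alike` are assumptions for this example.

```python
# Tiny illustrative table of look-alike glyphs (not exhaustive).
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u03bf": "o",  # Greek small omicron
    "0": "o",       # digit zero
    "1": "l",       # digit one
}
_TABLE = str.maketrans(CONFUSABLES)

def skeleton(name: str) -> str:
    """Fold case, then replace each glyph with its look-alike representative."""
    return name.casefold().translate(_TABLE)

def looks_alike(a: str, b: str) -> bool:
    """Treat two strings as matching if their skeletons are identical."""
    return skeleton(a) == skeleton(b)

print(looks_alike("paypal", "p\u0430yp\u0430l"))  # prints: True
print(looks_alike("alice", "bob"))                # prints: False
```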