
Consider the code point U+1F4A9 ("PILE OF POO").

This encodes to the byte sequence F0 9F 92 A9 in UTF-8. Notice that every one of these bytes has a value > 0x7F, which means they're all outside the ASCII range.
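If you want to check that yourself, here's a quick sketch in Python (the variable names are just mine, nothing official):

```python
# Encode U+1F4A9 to UTF-8 and inspect the resulting bytes.
poo = "\U0001F4A9"
encoded = poo.encode("utf-8")

# Space-separated uppercase hex, e.g. for eyeballing against the comment above.
hex_bytes = encoded.hex(" ").upper()

# True only if every byte lies outside the ASCII range (0x00-0x7F).
all_high = all(b > 0x7F for b in encoded)

print(hex_bytes, all_high)
```

This should print the four bytes followed by True, since each byte has its high bit set.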

That's one of the useful properties of UTF-8: you know that a code point requiring multi-byte encoding will never contain any bytes that could be confused for ASCII, because every byte of a multi-byte sequence has its high bit set (lead bytes are 0xC2 through 0xF4, continuation bytes are 0x80 through 0xBF), so all of them are > 0x7F.

Which in turn means that if you use any processing mechanism that only alters bytes which are in the ASCII range, and passes all other bytes through unmodified, you are guaranteed not to modify or corrupt any multi-byte UTF-8 sequences.
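To make that concrete, here's a rough sketch in Python of such a mechanism. The function `ascii_upper_bytes` is a made-up helper for illustration: it works on raw bytes, touches only ASCII lowercase letters, and knows nothing about UTF-8.

```python
def ascii_upper_bytes(data: bytes) -> bytes:
    """Uppercase ASCII a-z byte by byte; pass every other byte through unchanged."""
    out = bytearray()
    for b in data:
        if 0x61 <= b <= 0x7A:      # ASCII 'a'..'z'
            out.append(b - 0x20)   # shift to 'A'..'Z'
        else:
            out.append(b)          # includes every byte > 0x7F
    return bytes(out)


original = "hello \U0001F4A9 world".encode("utf-8")
transformed = ascii_upper_bytes(original)

# The output is still valid UTF-8, and the multi-byte sequence is intact.
decoded = transformed.decode("utf-8")
print(decoded)
```

The ASCII letters get uppercased while the F0 9F 92 A9 sequence comes out byte-for-byte identical, even though the function never decodes anything.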



Oh, that’s interesting. I didn’t realize UTF-8 had that nice property.



