
Linux uses UTF-8 for display of filenames, but the paths themselves allow non-UTF-8 byte sequences.


It's an oversimplification to say Linux uses UTF-8 for display. Linux just stores bags of bytes and leaves interpretation to userspace. You could store paths in ISO-8859-1 if you wanted. The only special bytes are '\0' and '/'.
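To make the "bags of bytes" rule concrete, here is a minimal Python sketch of a check for a single path component. The function names are made up for illustration, and the 255-byte `NAME_MAX` is an assumption (it is the usual value but depends on the filesystem). The `.` and `..` entries are rejected too, since they are reserved.

```python
NAME_MAX = 255  # assumed per-component limit; the real value is filesystem-specific


def is_acceptable_component(name: bytes) -> bool:
    """Return True if `name` could be one path component: no NUL, no slash."""
    if not name or len(name) > NAME_MAX:
        return False
    if name in (b".", b".."):
        return False
    return b"\0" not in name and b"/" not in name


def is_valid_utf8(name: bytes) -> bool:
    """Return True if the bytes happen to decode as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# An ISO-8859-1 name: acceptable as a filename, but not valid UTF-8.
latin1_name = "café.txt".encode("iso-8859-1")
print(is_acceptable_component(latin1_name), is_valid_utf8(latin1_name))
```

The two checks are independent on purpose: nothing in the first one looks at the encoding at all.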


Not only could you, this actually happens in practice. Not necessarily ISO-8859-1, but specifically SHIFT-JIS, a Japanese encoding that you will run into if you run old Japanese software. To make things even worse, SHIFT-JIS is almost entirely incompatible with any form of UTF-based encoding, and depending on the attempted normalisation you can quickly end up with paths that have been messed up multiple times in a row.

I forget which Japanese emulator I was trying to run when I found all of this out, but suffice it to say I didn't enjoy the experience.
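A rough sketch of how a path gets "messed up multiple times in a row": each pass below pretends a tool wrongly assumed the bytes were Latin-1 and "normalised" them to UTF-8. The helper name and the sample filename are hypothetical, and Latin-1 is just one plausible wrong guess, not a claim about what any particular tool does.

```python
def wrong_normalise(data: bytes, assumed: str = "latin-1") -> bytes:
    """Simulate a tool that misreads `data` as `assumed` and re-encodes it as UTF-8."""
    return data.decode(assumed).encode("utf-8")


original = "音楽.txt"  # hypothetical filename from an old Japanese program
sjis_bytes = original.encode("shift_jis")

once = wrong_normalise(sjis_bytes)   # first bad conversion
twice = wrong_normalise(once)        # second bad conversion on top of the first

# Every pass makes the name longer and less recognisable.
print(len(sjis_bytes), len(once), len(twice))

# Undoing it requires knowing the exact chain of wrong guesses, in reverse.
step_back = twice.decode("utf-8").encode("latin-1")
recovered = step_back.decode("utf-8").encode("latin-1").decode("shift_jis")
print(recovered == original)
```

Because Latin-1 maps every byte to a character, this particular chain is reversible; a wrong guess that drops or replaces bytes would not be.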


I buy digital Japanese Doujin music on sites like booth.pm, and their provided zip files extract "beautifully" on Linux if you simply `unzip` them.


Lots of Japanese products are switching or have switched to UTF-8, so I have no doubt that modern ZIP files will extract without a problem.


Don't you mean `unzip -O shift-jis` them?
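For anyone whose `unzip` build lacks `-O` (as far as I know it comes from a patch some distros carry), roughly the same fix can be sketched in Python. This relies on the standard `zipfile` module decoding names as cp437 when the archive does not set the UTF-8 flag, so the raw bytes can be recovered by encoding back to cp437. `recover_name` and `list_names` are hypothetical helper names, and Shift-JIS as the real encoding is an assumption you have to supply yourself.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: entry name is stored as UTF-8


def recover_name(name_as_cp437: str, real_encoding: str = "shift_jis") -> str:
    """Undo zipfile's cp437 decoding and decode with the encoding we believe was used."""
    return name_as_cp437.encode("cp437").decode(real_encoding)


def list_names(zip_path: str, real_encoding: str = "shift_jis") -> list[str]:
    """List entry names, re-decoding only entries that are not flagged as UTF-8."""
    names = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.flag_bits & UTF8_FLAG:
                names.append(info.filename)
            else:
                names.append(recover_name(info.filename, real_encoding))
    return names
```

Usage would be something like `list_names("album.zip")`, where `album.zip` is a placeholder path. Newer Python versions (3.11 and later) also accept a `metadata_encoding` argument to `zipfile.ZipFile` for reading, which avoids the manual round trip.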


Except that Linux does support several filesystems that do claim to store the filenames in a specific encoding and therefore the kernel must do conversion. Mostly Windows FSes, but nowadays case-insensitive ext4 also applies.


These are exceptions, not the norm. The VFS layer does not care.


Linux doesn't display them; the shell (or terminal emulator) does. Linux just sends the bytes back to userland and lets the shell interpret them into something human-readable. And even then, tons of distros default the global LANG to C for some reason, so UTF-8 display isn't even working by default.
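A small sketch of what "userland interprets the bytes" means in practice. The same raw name is rendered once with a UTF-8 charset and once with plain ASCII, which is roughly what a C locale implies. `render_name` is a hypothetical helper, and the backslash escaping is just one way a program might show bytes it cannot decode.

```python
def render_name(raw: bytes, charset: str) -> str:
    """Decode a raw filename for display, escaping bytes the charset cannot decode."""
    return raw.decode(charset, errors="backslashreplace")


raw = "café.txt".encode("utf-8")

print(render_name(raw, "utf-8"))  # café.txt
print(render_name(raw, "ascii"))  # caf\xc3\xa9.txt
```

The bytes on disk are identical in both cases; only the program's idea of the charset changed.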


Even better: Each user can have his own locale and charset, and may even change that per program/shell/session. One may save filenames as UTF-8, one as ASCII, one as ISO-8859-13, one as EBCDIC.

However, the common denominator nowadays is UTF-8, which has been a blessing overall getting rid of most of the aforementioned mess for international multi-user systems. And there is the C.UTF-8 locale which is slowly gaining traction.
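To illustrate the pre-UTF-8 mess on a multi-user system, here is a sketch of how one filename saved by a UTF-8 user would look to users with other charsets. The helper name and the sample filename are made up, and `cp037` is used as one representative EBCDIC code page.

```python
def views_of(raw: bytes, charsets: list[str]) -> dict[str, str]:
    """Decode the same raw filename under several charsets, one per imagined user."""
    return {cs: raw.decode(cs, errors="replace") for cs in charsets}


raw = "naïve.txt".encode("utf-8")  # saved by the UTF-8 user
views = views_of(raw, ["utf-8", "iso-8859-1", "iso-8859-13", "cp037"])

for charset, text in views.items():
    print(f"{charset:12} {text!r}")
```

Only the `utf-8` view shows the intended name; the ISO-8859-1 view shows the familiar `naÃ¯ve.txt` mojibake, and the EBCDIC view is unreadable even for the ASCII part.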



