"Unix filenames are bytes" Are they? Is the separator 0x2f or / ? Or are you tal...

acdha · on Oct 6, 2018

The main point was that they’re not validated as Unicode so you can store values which cannot be decoded. It’s easy to find cases where someone could find a way to create a file which couldn’t easily be renamed or deleted using the OS’s built-in tools because even they forgot about this.
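A minimal Python sketch of what "cannot be decoded" looks like in practice. The filename `report-\xff.txt` is a made-up example; 0xff is a byte that never appears in valid UTF-8. The `surrogateescape` handler is the one Python applies to filenames on POSIX systems so that such names still round-trip.

```python
# Hypothetical filename: 0xff can never occur in well-formed UTF-8.
raw_name = b"report-\xff.txt"

# Strict decoding fails, which is what trips up tools that assume Unicode.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate (here U+DCFF),
# so the original bytes can be recovered when encoding again.
as_text = raw_name.decode("utf-8", errors="surrogateescape")
round_trip = as_text.encode("utf-8", errors="surrogateescape")

print(strict_ok)               # False
print(round_trip == raw_name)  # True
```

A tool that only ever holds the name as strictly decoded text has no way to pass those bytes back to the kernel, which is roughly how you end up with files that are awkward to rename or delete.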

Similarly, Unicode normalization means that you can have multiple visually indistinguishable values which may or may not point to the same file depending on the OS and filesystem. I’ve had to help very confused users who were told that a file they could see wasn’t present because their tools didn’t handle that, too.
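To make the normalization point concrete, here is a small Python sketch. The name `café.txt` is just an illustrative choice; the two forms render identically but are different byte sequences, so a filesystem that compares raw bytes treats them as two different files.

```python
import unicodedata

# "é" as a single precomposed code point (NFC form).
composed = "caf\u00e9.txt"
# The same text as "e" followed by a combining acute accent (NFD form).
decomposed = unicodedata.normalize("NFD", composed)

composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

print(composed == decomposed)  # False
print(composed_bytes)          # b'caf\xc3\xa9.txt'
print(decomposed_bytes)        # b'cafe\xcc\x81.txt'

# Normalizing both sides to the same form before comparing makes them match.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```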

gvx · on Oct 6, 2018

The former. The separator is a byte with the value 0x2F, which is `/` in ASCII.

treve · on Oct 6, 2018

0x2f is / in ASCII and UTF-8, so this question doesn't make that much sense.

masklinn · on Oct 6, 2018

> Are they? Is the separator 0x2f or / ? Or are you talking about the filename only, not paths?

It's 0x2f. Encode a path as UTF-16 and watch the entire thing burn: the FS sees the path end right before or right after the first /, because / encodes as either 0x002f or 0x2f00 depending on BE versus LE, and the 0x00 byte is read as the terminator.
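A rough Python illustration, assuming a NUL-terminated API that stops reading at the first 0x00 byte. The path `/tmp/x` and the helper `as_c_string` are hypothetical, made up for this sketch; the separator `/` is the byte 0x2f.

```python
path = "/tmp/x"

utf8 = path.encode("utf-8")
utf16_be = path.encode("utf-16-be")  # each ASCII char becomes 0x00 then the char
utf16_le = path.encode("utf-16-le")  # each ASCII char becomes the char then 0x00


def as_c_string(data: bytes) -> bytes:
    """Return what a NUL-terminated API would see: the bytes before the first 0x00."""
    return data.split(b"\x00", 1)[0]


print(as_c_string(utf8))      # b'/tmp/x'
print(as_c_string(utf16_be))  # b''   (path ends before the first /)
print(as_c_string(utf16_le))  # b'/'  (path ends right after the first /)
```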

loeg · on Oct 6, 2018

Unix paths are nul-byte-terminated, so UTF-16 generally doesn't make any sense in this context. Valid Unicode paths are encoded as UTF-8 on unix systems. UTF-16 and UTF-32 are invalid ways to encode Unicode paths. (That's not to say no one has tried to do it, just that it doesn't make any sense.)

(As other commenters have pointed out, Unix paths do not require a specific encoding, so robust applications cannot rely on any assumptions about encoding of existing files. But when creating new files, they must not try to encode paths as UTF-16.)

masklinn · on Oct 7, 2018

> Unix paths are nul-byte-terminated

That's the point. The separator is not "/", because "/" would be a character to encode. The separator is a specific byte, and so is the terminator.

> Valid Unicode paths are encoded as UTF-8 on unix systems.

There is no such thing as "unicode paths" on UNIX systems, valid or invalid.

daeken · on Oct 6, 2018

Endianness affects order of bytes (0xbeef vs 0xefbe), not nibbles (0xfeeb).
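A quick way to check this in Python, using `struct` to lay out the 16-bit value 0xbeef in both byte orders (a sketch only, nothing here is specific to filesystems):

```python
import struct

value = 0xBEEF

big = struct.pack(">H", value)     # most significant byte first
little = struct.pack("<H", value)  # least significant byte first

print(big.hex())     # beef
print(little.hex())  # efbe
```

Whole bytes swap places; the two nibbles inside each byte stay in the same order.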

BeeOnRope · on Oct 6, 2018

The GP's example shows it affecting bytes, not nibbles.