That’s just restating my point: Unix filenames are bytes (on most filesystems, anyway). The fact that many people were able to conflate them with text strings was a convenient fiction. Python no longer allows you to maintain that pretense, but it’s easy to deal with: treat them as opaque blobs, attempt to decode them and handle the errors, or perform manipulations as bytes.
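A minimal sketch of the "decode and handle errors" option, using `os.fsdecode`/`os.fsencode` (this is the surrogateescape-based round-tripping Python itself uses for filenames; the sketch assumes a POSIX system with a UTF-8 locale):

```python
import os

# A filename that is not valid UTF-8: Latin-1-encoded 'é' as a lone 0xE9 byte.
raw = b"CV - Ren\xe9.docx"

# os.fsdecode uses the surrogateescape error handler (on POSIX), so any byte
# sequence round-trips losslessly: undecodable bytes become lone surrogates.
name = os.fsdecode(raw)

# The decoded string is safe to pass back to any os/open call...
assert os.fsencode(name) == raw  # ...because it encodes back to the same bytes

# Alternatively, skip decoding entirely: os.listdir(b"."), open(raw, "rb"),
# etc. all accept bytes paths and treat the name as an opaque blob.
```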
> That’s just restating my point: Unix filenames are bytes (on most filesystems, anyway). The fact that many people were able to conflate them with text strings was a convenient fiction.
Python tools for backups are my worst terror because of that - they kept destroying our clients' data because the clients dared (gasp!) to name files with characters from their own language, or to do unthinkable things like create documents titled "CV - <name with non-ASCII characters>.docx".
The fact that Python3 at least tries to make programmers not destroy data as soon as you type in a foreign name (which happens even in USA) is a good thing.
> Python tools for backups are my worst terror because of that
You can have badly written tools in any language. There are even functions to get file paths as bytes (e.g. [os.getcwdb](https://docs.python.org/3/library/os.html#os.getcwdb)); it's just that most people don't use them, because horribly broken filenames are rare-ish to see and the bytes APIs are less convenient.
Do other languages get this right 100% of the time on all platforms? I don't think so; it's just that you've never noticed.
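For illustration, the bytes-level APIs look like this (a sketch assuming a Linux filesystem, where names may be any bytes other than `/` and NUL; the made-up name below would be rejected on e.g. APFS):

```python
import os
import tempfile

d = tempfile.mkdtemp()

# Create a file whose name is not valid UTF-8 (the lone 0xFF byte).
raw = os.path.join(os.fsencode(d), b"bad-\xff-name.txt")
with open(raw, "wb") as f:
    f.write(b"data")

# bytes in, bytes out: os.listdir(b"...") returns the raw names undecoded.
entries = os.listdir(os.fsencode(d))
assert b"bad-\xff-name.txt" in entries
```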
* C: has no concept of unicode strings per se, may or may not work depending on the implementation and how you choose to display them (CLI probably "works", GUI probably not)
* Go: gets this right, but probably breaks on Windows? "string is the set of all strings of 8-bit bytes, conventionally but not necessarily representing UTF-8-encoded text" (https://golang.org/pkg/builtin/#string)
In short, either I don't understand what point you're making, or it isn't unique to Python.
> This means that a plain string is defined as an array of 8-bit Unicode code units. All array operations can be used on strings, but they will work on a code unit level, and not a character level. At the same time, standard library algorithms will interpret strings as sequences of code points, and there is also an option to treat them as sequence of graphemes by explicit usage of std.uni.byGrapheme.
> According to the spec, it is an error to store non-Unicode data in the D string types; expect your program to fail in different ways if your string is encoded improperly.
I should note that what I really like about this approach is the total lack of ambiguity. There is no question about what belongs in a string, and if it's not UTF then you had better be using a byte or ubyte array or you are doing it wrong by definition.
Rather it would be an error to grab a Unix filename, figure your job was done, and store it directly into a string. So you'd... handle things correctly. Somehow. I admit I've never had the bad luck of encountering a non-UTF-8-encoded filename under Linux before, and can't claim with any confidence that my code would handle it gracefully. In any language, assuming you're using the standard library facilities it provides, things will hopefully mostly be taken care of behind the scenes anyway.
What I like about the D approach isn't that declaring it an error actually solves anything directly (obviously it doesn't) but that it removes any ambiguity about how things are expected to work. If the encoding of strings isn't well defined, then if you're writing a library which encodings are the users going to expect you to accept? Or worse, which encodings does that minimally documented library function you're about to call accept?
Would you care to elaborate? I'm not claiming to know Rust, but the link I provided clearly says "[t]hese do inexpensive conversions from and to UTF-8 byte slices".
They’re conversions because they’re not UTF-8 in the first place, that is, they’re not String/str. The conversions are as cheap as we can make them. That language is meant to talk about converting from OsString to String, not from the OS to OsString.
Why do people say "bag" instead of string/sequence/vector/array/list/etc.? Bags are multisets... they're by definition unordered. It's already a niche technical term so it's really weird to see it used differently in a technical context...
I think it feels really evocative. Like, a bag of dice or something. You can’t see what’s inside, you have no idea what’s on them. It reinforces the “no encoding” thing well.
I think it is more eloquently stated that "you shouldn't make assumptions about what's inside." Saying "you can't see what's inside" ignores the biggest cause of the conflation. Userspace tools allow you to trivially interpret the bag of bytes as a text string for the purpose of naming it for other people.
One thing that amuses me given the number of complaints about the Python 3 string transition is how vastly better Python 3 is for working with bytes. The infrastructure available is light-years ahead of what Python 2 offered, precisely because it gave up on trying to also make bags of bytes be the default string type.
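A few examples of the bytes conveniences Python 3 grew (all of these are standard `bytes` methods/operators as of 3.5):

```python
data = b"\x00\xffhello"

# Hex round-tripping became built in.
assert data.hex() == "00ff68656c6c6f"
assert bytes.fromhex("00ff68656c6c6f") == data

# printf-style formatting for bytes came back in 3.5 (PEP 461).
assert b"%s=%d" % (b"x", 3) == b"x=3"

# Indexing a bytes object yields an int, so byte-level logic is direct.
assert data[0] == 0 and data[1] == 0xFF
```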
Thank you for saying that. Working with strings and bytes in Python 3 is nothing short of a joy compared to the dodgy stuff Python 2 did. People who complain about the change are delusional.
The only problem I have with Python3 strings/bytes handling is the fact that there are standard library functions which accept bytestrings in Py2 (regular "" strings), and Unicode strings in Py3 (again, regular "" strings in Py3).
This has led to developers attempting to conflate the two distinctly different concepts and make APIs support both while behaving differently.
A simple solution is there in plain sight: just use exclusively b"" and u"" strings for any code you wish to work in both Py2 and Py3, and forget about "". All and any libraries should be using those exclusively if they support both. Python3-only code should be using b"" and "" instead.
One could consider this a design oversight in Python 3: the fact that the syntax is so similar elsewhere makes people want to run the same code in both, yet a core type is basically incompatible.
u"" is a syntax error in Python 3 (or at least it was for a while; apparently it's not anymore, but that said...). The correct cross-platform solution is to do
from __future__ import unicode_literals
which makes python2 string literals unicode unless declared bytes. Then "" strings are always unicode and b"" strings are always bytes, no matter the language version.
This has not been the case since 2012. The last release of Python 3 for which this was the case reached end of life in February 2016. Please stop misinforming people.
While u"" is accepted in current Python 3, for some reason they ignored the raw unicode string ur"" which still is a syntax error in Python 3. So, unicode_literals is definitely preferable.
It's absolutely not correct, because there are many APIs which take "native strings" across versions, i.e. they take a `str` (byte string) in Python 2 and a `str` (unicode string) in Python 3. unicode_literals causes significantly more problems than it solves.
The correct cross-platform solution (and the very reason why the `u` prefix was reintroduced after having initially been removed from Python 3) is in fact to use b"" for known byte strings, u"" for known text, and "" for "native strings" to interact with APIs which specifically need that.
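Sketched in code (runnable under Python 3, where the `u` prefix is redundant but legal; on Python 2 the same three literals would give two distinct types):

```python
# Known binary data: always a bytes literal.
payload = b"\x89PNG\r\n"

# Known text: the u prefix is a no-op on Python 3, but in 2/3-compatible
# code it pinned the literal to the unicode type on Python 2 as well.
title = u"CV - Ren\u00e9.docx"

# "Native string": bytes on Python 2, text on Python 3 -- the form some
# APIs (e.g. WSGI environ keys) historically required.
native = "Content-Type"

assert isinstance(payload, bytes)
assert isinstance(title, str) and isinstance(native, str)
assert title == "CV - René.docx"  # u"" and "" are the same type on Python 3
```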
I haven't noticed much pain from the ASCII -> UNICODE migration from 2->3, but the one thing that still bothers me is that they did not update the CSV module to be transparent. In particular, the need to explicitly set the newline argument to `csv.reader`[0]. For me, dealing with a lot of data translation/extraction/loading work (ETL), this has been a big annoyance in migrating from 2 to 3.
I've not really had many other issues porting from 2 -> 3 from my own code; issues usually arise from 3rd-party libs that are relied upon (especially if they utilize C/C++ extensions). DB libs have sometimes been problematic. IIRC, pysybase requires you to set the encoding of strings now (which wasn't required before). I use pysybase to talk to both Sybase & MS SQL (it talks the TDS protocol).
> I haven't noticed much pain from the ASCII -> UNICODE migration from 2->3, but the one thing that still bothers me is that they did not update the CSV module to be transparent. In particular, the need to explicitly set the newline argument to `csv.reader`[0]. For me, dealing with a lot of data translation/extraction/loading work (ETL), this has been a big annoyance in migrating from 2 to 3.
???
The Python 3 CSV module works on text only (unicode), and you seem to be misreading the note: the module does in fact do newline transparency: https://docs.python.org/3/library/csv.html#id3
`newline=''` is to be specified on the file so it doesn't perform newline translation, because that would change the semantics of newlines in quoted strings: by default, `open` will universally translate newlines to `\n`; with `newline=''` it still recognizes universal newlines (for the purpose of iteration) but doesn't translate them, returning newline sequences unchanged.
The Python 2 CSV module only worked on byte strings, and magic bytes.
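A small sketch of why that matters (hypothetical file, standard `csv` module): a quoted field containing a literal `\r\n` survives only if translation is off.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.csv")

# Write a row whose second field contains an embedded \r\n.
with open(path, "w", newline="") as f:
    csv.writer(f).writerow(["a", "line1\r\nline2"])

# newline='' turns off translation but keeps universal-newline iteration,
# so the reader sees the embedded \r\n inside the quotes unchanged.
with open(path, newline="") as f:
    rows = list(csv.reader(f))

assert rows == [["a", "line1\r\nline2"]]
```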
Edit: I haven't tried to handle files with Python in a long time. All of the following is stuff that can change as an ecosystem matures, and may not hold anymore.
It's better in some ways. But then, I gave up long ago on manipulating files with Python, because Python 3 simply decided that anything on my filesystem is utf-8 by default.
Want to get a stream out of a file? Better have utf-8 content. Otherwise all the documentation will be buried 15 meters deep, outdated on some aspects and dependent on some not-yet-released features.
Want to name a file? Better have a utf-8 name. I gave up on Python before I was able to open a file with a name that doesn't conform to utf-8.
Want to read a line at a time? Sorry, we only do that for text. Go get a utf-8 file.
Want to match with a regular expression or parse it somehow? Sorry, we only do that for text.
Perhaps your information is out of date? I think none of what you said is true. Maybe some of it was true in the past.
> Want to get a stream out of a file? Better have utf-8 content. Otherwise all the documentation will be buried 15 meters deep, outdated on some aspects and dependent on some not-yet-released features.
f = open("myfile.txt", "r", encoding=ENCODING)
> Want to name a file? Better have a utf-8 name. I gave up on Python before I was able to open a file with a name that doesn't conform to utf-8.
This isn't really about Python, though. There is so much crappy software out there, in all languages, that makes incorrect assumptions about things like text and filenames. To this day, I'm amazed when anything works even remotely correctly when you throw something other than ISO-8859-1 or UTF-8 at it.
I lean towards "programmer problem" rather than "language problem".
The main point was that they’re not validated as Unicode so you can store values which cannot be decoded. It’s easy to find cases where someone could find a way to create a file which couldn’t easily be renamed or deleted using the OS’s built-in tools because even they forgot about this.
Similarly, Unicode normalization means that you can have multiple visually indistinguishable values which may or may not point to the same file depending on the OS and filesystem. I’ve had to help very confused users who were told that a file they could see wasn’t present because their tools didn’t handle that, too.
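The normalization trap is easy to reproduce with the standard `unicodedata` module (the two names below render identically in any listing):

```python
import unicodedata

# Precomposed é (NFC) versus e + combining acute accent (NFD).
nfc = "caf\u00e9.txt"
nfd = "cafe\u0301.txt"

assert nfc != nfd  # different code point sequences, so different byte
                   # strings and, on many filesystems, two distinct files
                   # that look exactly the same to the user
assert unicodedata.normalize("NFC", nfd) == nfc
```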
> Are they? Is the separator 0x2f or / ? Or are you talking about the filename only, not paths?
It's 0x2f. Encode a path to UTF-16 and watch the entire thing burn, with the FS seeing the path you provided end right before or after the first / (because it encodes as either 0x002f or 0x2f00 depending on BE versus LE).
Unix paths are nul-byte-terminated, so UTF-16 generally doesn't make any sense in this context. Valid Unicode paths are encoded as UTF-8 on unix systems. UTF-16 and UTF-32 are invalid ways to encode Unicode paths. (That's not to say no one has tried to do it, just that it doesn't make any sense.)
(As other commenters have pointed out, Unix paths do not require a specific encoding, so robust applications cannot rely on any assumptions about encoding of existing files. But when creating new files, they must not try to encode paths as UTF-16.)
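Concretely (a quick sketch just to show the bytes involved):

```python
# '/' is U+002F; in UTF-16-LE every code unit is two bytes.
p = "a/b".encode("utf-16-le")
assert p == b"a\x00/\x00b\x00"

# A byte-oriented Unix kernel would still see 0x2F as the separator and
# would stop at the first NUL byte -- the path is mangled either way.
assert b"/" in p and b"\x00" in p

# UTF-8, by contrast, never emits 0x2F or 0x00 except for '/' and NUL
# themselves, which is why it is the workable Unicode encoding for Unix paths.
assert b"\x00" not in "a/b".encode("utf-8")
```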