That’s just restating my point: Unix filenames are bytes (on most filesystems, anyway). The fact that many people were able to conflate them with text strings was a convenient fiction. Python no longer allows you to maintain that pretense, but it’s easy to deal with: treat them as opaque blobs, attempt to decode them and handle the errors, or perform manipulations as bytes.
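A minimal sketch of the "decode and handle errors" option, using `os.fsdecode`/`os.fsencode` (this is the surrogateescape-based round-tripping Python itself uses for filenames; the sketch assumes a POSIX system with a UTF-8 locale):

```python
import os

# A filename that is not valid UTF-8: Latin-1-encoded 'é' as a lone 0xE9 byte.
raw = b"CV - Ren\xe9.docx"

# os.fsdecode uses the surrogateescape error handler (on POSIX), so any byte
# sequence round-trips losslessly: undecodable bytes become lone surrogates.
name = os.fsdecode(raw)

# The decoded string is safe to pass back to any os/open call...
assert os.fsencode(name) == raw  # ...because it encodes back to the same bytes

# Alternatively, skip decoding entirely: os.listdir(b"."), open(raw, "rb"),
# etc. all accept bytes paths and treat the name as an opaque blob.
```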
> That’s just restating my point: Unix filenames are bytes (on most filesystems, anyway). The fact that many people were able to conflate them with text strings was a convenient fiction.
Python tools for backups are my worst terror because of that - they kept destroying our clients' data because the clients dared (gasp!) to name files with characters from their own language, or to do unthinkable things like create documents titled "CV - <name with non-ASCII characters>.docx".
The fact that Python3 at least tries to make programmers not destroy data as soon as you type in a foreign name (which happens even in USA) is a good thing.
> Python tools for backups are my worst terror because of that
You can have badly written tools in any language. There are even functions to get file paths as bytes (e.g. [os.getcwdb](https://docs.python.org/3/library/os.html#os.getcwdb)); it's just that most people don't use them, because horribly broken filenames are rare-ish to see and the bytes APIs are less convenient.
Do other languages get this right 100% of the time on all platforms? I don't think so; it's just that you've never noticed.
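For illustration, the bytes-level APIs look like this (a sketch assuming a Linux filesystem, where names may be any bytes other than `/` and NUL; the made-up name below would be rejected on e.g. APFS):

```python
import os
import tempfile

d = tempfile.mkdtemp()

# Create a file whose name is not valid UTF-8 (the lone 0xFF byte).
raw = os.path.join(os.fsencode(d), b"bad-\xff-name.txt")
with open(raw, "wb") as f:
    f.write(b"data")

# bytes in, bytes out: os.listdir(b"...") returns the raw names undecoded.
entries = os.listdir(os.fsencode(d))
assert b"bad-\xff-name.txt" in entries
```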
* C: has no concept of unicode strings per se, may or may not work depending on the implementation and how you choose to display them (CLI probably "works", GUI probably not)
* Go: gets this right, but probably breaks on Windows? "string is the set of all strings of 8-bit bytes, conventionally but not necessarily representing UTF-8-encoded text" (https://golang.org/pkg/builtin/#string)
In short, either I don't understand what point you're making, or it isn't unique to Python.
> This means that a plain string is defined as an array of 8-bit Unicode code units. All array operations can be used on strings, but they will work on a code unit level, and not a character level. At the same time, standard library algorithms will interpret strings as sequences of code points, and there is also an option to treat them as sequence of graphemes by explicit usage of std.uni.byGrapheme.
> According to the spec, it is an error to store non-Unicode data in the D string types; expect your program to fail in different ways if your string is encoded improperly.
I should note that what I really like about this approach is the total lack of ambiguity. There is no question about what belongs in a string, and if it's not UTF then you had better be using a byte or ubyte array or you are doing it wrong by definition.
Rather it would be an error to grab a Unix filename, figure your job was done, and store it directly into a string. So you'd... handle things correctly. Somehow. I admit I've never had the bad luck of encountering a non-UTF-8-encoded filename under Linux before, and can't claim with any confidence that my code would handle it gracefully. In any language, assuming you're using the standard library facilities it provides, things will hopefully mostly be taken care of behind the scenes anyway.
What I like about the D approach isn't that declaring it an error actually solves anything directly (obviously it doesn't) but that it removes any ambiguity about how things are expected to work. If the encoding of strings isn't well defined, then if you're writing a library which encodings are the users going to expect you to accept? Or worse, which encodings does that minimally documented library function you're about to call accept?
Would you care to elaborate? I'm not claiming to know Rust, but the link I provided clearly says "[t]hese do inexpensive conversions from and to UTF-8 byte slices".
They’re conversions because they’re not UTF-8 in the first place, that is, they’re not String/str. The conversions are as cheap as we can make them. That language is meant to talk about converting from OsString to String, not from the OS to OsString.
Why do people say "bag" instead of string/sequence/vector/array/list/etc.? Bags are multisets... they're by definition unordered. It's already a niche technical term so it's really weird to see it used differently in a technical context...
I think it feels really evocative. Like, a bag of dice or something. You can’t see what’s inside, you have no idea what’s on them. It reinforces the “no encoding” thing well.
I think it is more eloquently stated that "you shouldn't make assumptions about what's inside." Saying "you can't see what's inside" ignores the biggest cause of the conflation. Userspace tools allow you to trivially interpret the bag of bytes as a text string for the purpose of naming it for other people.
One thing that amuses me given the number of complaints about the Python 3 string transition is how vastly better Python 3 is for working with bytes. The infrastructure available is light-years ahead of what Python 2 offered, precisely because it gave up on trying to also make bags of bytes be the default string type.
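A few examples of the bytes conveniences Python 3 grew (all of these are standard `bytes` methods/operators as of 3.5):

```python
data = b"\x00\xffhello"

# Hex round-tripping became built in.
assert data.hex() == "00ff68656c6c6f"
assert bytes.fromhex("00ff68656c6c6f") == data

# printf-style formatting for bytes came back in 3.5 (PEP 461).
assert b"%s=%d" % (b"x", 3) == b"x=3"

# Indexing a bytes object yields an int, so byte-level logic is direct.
assert data[0] == 0 and data[1] == 0xFF
```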
Thank you for saying that. Working with strings and bytes in Python 3 is nothing short of a joy compared to the dodgy stuff Python 2 did. People who complain about the change are delusional.
The only problem I have with Python3 strings/bytes handling is the fact that there are standard library functions which accept bytestrings in Py2 (regular "" strings), and Unicode strings in Py3 (again, regular "" strings in Py3).
This has led to developers attempting to conflate the two distinctly different concepts and make APIs support both while behaving differently.
A simple solution is there in plain sight: just use exclusively b"" and u"" strings for any code you wish to work in both Py2 and Py3, and forget about "". All and any libraries should be using those exclusively if they support both. Python3-only code should be using b"" and "" instead.
One could consider this a design oversight in Python 3: the fact that the syntax is so similar elsewhere makes people want to run the same code in both, yet a core type is basically incompatible.
u"" is a syntax error in Python 3 (or at least it was for a while; apparently it's not anymore, but that said...). The correct cross-platform solution is to do
from __future__ import unicode_literals
which makes python2 string literals unicode unless declared bytes. Then "" strings are always unicode and b"" strings are always bytes, no matter the language version.
This has not been the case since 2012. The last release of Python 3 for which this was the case reached end of life in February 2016. Please stop misinforming people.
While u"" is accepted in current Python 3, for some reason they ignored the raw unicode string ur"" which still is a syntax error in Python 3. So, unicode_literals is definitely preferable.
It's absolutely not correct, because there are many APIs which take "native strings" across versions, i.e. they take a `str` (byte string) in Python 2 and a `str` (unicode string) in Python 3. unicode_literals causes significantly more problems than it solves.
The correct cross-platform solution (and the very reason why the `u` prefix was reintroduced after having initially been removed from Python 3) is in fact to use b"" for known byte strings, u"" for known text, and "" for "native strings" to interact with APIs which specifically need that.
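Sketched in code (runnable under Python 3, where the `u` prefix is redundant but legal; on Python 2 the same three literals would give two distinct types):

```python
# Known binary data: always a bytes literal.
payload = b"\x89PNG\r\n"

# Known text: the u prefix is a no-op on Python 3, but in 2/3-compatible
# code it pinned the literal to the unicode type on Python 2 as well.
title = u"CV - Ren\u00e9.docx"

# "Native string": bytes on Python 2, text on Python 3 -- the form some
# APIs (e.g. WSGI environ keys) historically required.
native = "Content-Type"

assert isinstance(payload, bytes)
assert isinstance(title, str) and isinstance(native, str)
assert title == "CV - René.docx"  # u"" and "" are the same type on Python 3
```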
I haven't noticed much pain from the ASCII -> UNICODE migration from 2->3, but the one thing that still bothers me is that they did not update the CSV module to be transparent. In particular, the need to explicitly set the newline argument to `csv.reader`[0]. For me, dealing with a lot of data translation/extraction/loading work (ETL), this has been a big annoyance in migrating from 2 to 3.
I've not really had many other issues porting from 2 -> 3 from my own code; issues usually arise from 3rd-party libs that are relied upon (especially if they utilize C/C++ extensions). DB libs have sometimes been problematic. IIRC, pysybase requires you to set the encoding of strings now (which wasn't required before). I use pysybase to talk to both Sybase & MS SQL (it talks the TDS protocol).
> I haven't noticed much pain from the ASCII -> UNICODE migration from 2->3, but the one thing that still bothers me is that they did not update the CSV module to be transparent. In particular, the need to explicitly set the newline argument to `csv.reader`[0]. For me, dealing with a lot of data translation/extraction/loading work (ETL), this has been a big annoyance in migrating from 2 to 3.
???
The Python 3 CSV module works on text only (unicode), and you seem to be misreading the note: the module does in fact do newline transparency: https://docs.python.org/3/library/csv.html#id3
`newline=''` is to be specified on the file so it doesn't perform newline translation, because that would change the semantics of newlines in quoted strings: by default, `open` will universally translate newlines to `\n`; with `newline=''` it still recognizes universal newlines (for the purpose of iteration) but doesn't translate them, returning newline sequences unchanged.
The Python 2 CSV module only worked on byte strings, and magic bytes.
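A small sketch of why that matters (hypothetical file, standard `csv` module): a quoted field containing a literal `\r\n` survives only if translation is off.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.csv")

# Write a row whose second field contains an embedded \r\n.
with open(path, "w", newline="") as f:
    csv.writer(f).writerow(["a", "line1\r\nline2"])

# newline='' turns off translation but keeps universal-newline iteration,
# so the reader sees the embedded \r\n inside the quotes unchanged.
with open(path, newline="") as f:
    rows = list(csv.reader(f))

assert rows == [["a", "line1\r\nline2"]]
```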
Edit: I haven't tried to handle files with Python in a long time. All of the following is stuff that can change as an ecosystem matures, and may not hold anymore.
It's better in some ways. But then, I gave up long ago on manipulating files with Python, because Python 3 simply decided that anything on my filesystem is utf-8 by default.
Want to get a stream out of a file? Better have utf-8 content. Otherwise all the documentation will be buried 15 meters deep, outdated on some aspects and dependent on some not-yet-released features.
Want to name a file? Better have a utf-8 name. I gave up on Python before I was able to open a file with a name that doesn't conform to utf-8.
Want to read a line at a time? Sorry, we only do that for text. Go get a utf-8 file.
Want to match with a regular expression or parse it somehow? Sorry, we only do that for text.
Perhaps your information is out of date? I think none of what you said is true. Maybe some of it was true in the past.
> Want to get a stream out of a file? Better have utf-8 content. Otherwise all the documentation will be buried 15 meters deep, outdated on some aspects and dependent on some not-yet-released features.
f = open("myfile.txt", "r", encoding=ENCODING)
> Want to name a file? Better have a utf-8 name. I gave up on Python before I was able to open a file with a name that doesn't conform to utf-8.
This isn't really about Python, though. There is so much crappy software out there, in all languages, that makes incorrect assumptions about things like text and filenames. To this day, I'm amazed when anything works even remotely correctly when you throw something other than ISO-8859-1 or UTF-8 at it.
I lean towards "programmer problem" rather than "language problem".
The main point was that they’re not validated as Unicode so you can store values which cannot be decoded. It’s easy to find cases where someone could find a way to create a file which couldn’t easily be renamed or deleted using the OS’s built-in tools because even they forgot about this.
Similarly, Unicode normalization means that you can have multiple visually indistinguishable values which may or may not point to the same file depending on the OS and filesystem. I’ve had to help very confused users who were told that a file they could see wasn’t present because their tools didn’t handle that, too.
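The normalization trap is easy to reproduce with the standard `unicodedata` module (the two names below render identically in any listing):

```python
import unicodedata

# Precomposed é (NFC) versus e + combining acute accent (NFD).
nfc = "caf\u00e9.txt"
nfd = "cafe\u0301.txt"

assert nfc != nfd  # different code point sequences, so different byte
                   # strings and, on many filesystems, two distinct files
                   # that look exactly the same to the user
assert unicodedata.normalize("NFC", nfd) == nfc
```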
> Are they? Is the separator 0x2f or / ? Or are you talking about the filename only, not paths?
It's 0x2f. Encode a path to UTF-16 and watch the entire thing burn, with the FS seeing the path you provided end right before or after the first / (because it encodes as either 0x002f or 0x2f00 depending on BE versus LE).
Unix paths are nul-byte-terminated, so UTF-16 generally doesn't make any sense in this context. Valid Unicode paths are encoded as UTF-8 on unix systems. UTF-16 and UTF-32 are invalid ways to encode Unicode paths. (That's not to say no one has tried to do it, just that it doesn't make any sense.)
(As other commenters have pointed out, Unix paths do not require a specific encoding, so robust applications cannot rely on any assumptions about encoding of existing files. But when creating new files, they must not try to encode paths as UTF-16.)
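Concretely (a quick sketch just to show the bytes involved):

```python
# '/' is U+002F; in UTF-16-LE every code unit is two bytes.
p = "a/b".encode("utf-16-le")
assert p == b"a\x00/\x00b\x00"

# A byte-oriented Unix kernel would still see 0x2F as the separator and
# would stop at the first NUL byte -- the path is mangled either way.
assert b"/" in p and b"\x00" in p

# UTF-8, by contrast, never emits 0x2F or 0x00 except for '/' and NUL
# themselves, which is why it is the workable Unicode encoding for Unix paths.
assert b"\x00" not in "a/b".encode("utf-8")
```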