Hacker News

IMHO the whole python3 string mess could have been prevented if they had chosen UTF-8 as the only string encoding instead of adding a strict string type with a lot of under-the-hood magic. That way strings and byte streams could have remained the same underlying data, just as in python2. The main problem I have with byte streams vs. strings in python3 is that it adds strict type checking at runtime which isn't checked at 'authoring time'. Some APIs even make upfront type checking impossible, even when type hints are provided (e.g. reading file content returns either a byte stream or a string, depending on the content of the mode string passed to the file open function).
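
To make the open() complaint concrete, here is a minimal sketch (the temporary file and the variable names are illustrative assumptions, not anything from a real code base). The same function returns bytes or str depending only on the mode string, and mixing the two fails at runtime rather than at authoring time:

```python
import os
import tempfile

# Write a small UTF-8 file so both read modes have something to return.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write("Grüße\n".encode("utf-8"))
    path = tmp.name

# Binary mode: read() returns bytes.
with open(path, "rb") as f:
    as_bytes = f.read()

# Text mode: read() returns str, decoded with the given encoding.
with open(path, "r", encoding="utf-8") as f:
    as_text = f.read()

# Mixing the two is only rejected when the line actually runs.
try:
    as_text + as_bytes
    mixed_ok = True
except TypeError:
    mixed_ok = False

os.remove(path)
```

Only the "rb" vs. "r" literal separates the two return types here; if the mode arrives in a variable, nothing in the call itself says which type comes back.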

Recommended reading: http://utf8everywhere.org/



> IMHO the whole python3 string mess could have been prevented if they had chosen UTF-8 as the only string encoding instead of adding a strict string type with a lot of under-the-hood magic.

That is basically what Python2 does, and it is completely wrong.


Can you give any reasons why this is completely wrong? The web seems to work just fine with UTF-8. The advantage is that you can pass string data around as generic byte streams without even knowing about the encoding. You'll only have to care about the encoding at the end points.
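
A rough sketch of the "only care at the end points" idea, assuming the data really is UTF-8 (or at least ASCII-compatible); split_fields and join_fields are hypothetical helpers made up for illustration. Because ASCII bytes never occur inside a multi-byte UTF-8 sequence, splitting on an ASCII delimiter is safe without decoding:

```python
def split_fields(raw: bytes) -> list:
    # Split on an ASCII delimiter; no decoding needed for UTF-8 input.
    return raw.rstrip(b"\n").split(b";")


def join_fields(fields: list) -> bytes:
    # Reassemble the record, still as plain bytes.
    return b";".join(fields) + b"\n"


line = "Müller;Köln;Straße 1\n".encode("utf-8")

fields = split_fields(line)      # intermediate code never decodes
forwarded = join_fields(fields)  # bytes pass through unchanged

# Only the end point that shows the text to a human has to decode it.
city = fields[1].decode("utf-8")
```

This holds only as long as every producer really sends UTF-8; the moment one input is in a different encoding, the bytes pass through just as silently.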


You are joking, right? Have you ever seen non-English webpages? More often than not, a multitude of ??? and Chinese characters pop up at some point or another.
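
For reference, a small sketch of how those symptoms typically arise (the sample strings are arbitrary): the bytes are fine, they are just decoded with the wrong encoding.

```python
original = "Grüße"

# UTF-8 bytes read as Windows-1252: every non-ASCII character
# turns into two unrelated characters.
mojibake = original.encode("utf-8").decode("cp1252")

# Latin-1 bytes read as UTF-8: the invalid bytes become U+FFFD,
# the replacement character that is often rendered as "?".
replaced = original.encode("latin-1").decode("utf-8", errors="replace")

# ASCII bytes read as UTF-16: pairs of Latin letters collapse into
# single code points that land in the CJK range.
cjk = b"text".decode("utf-16-le")

print(mojibake, replaced, cjk)
```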


I'm from Germany so I've seen a few non-English webpages. I can't remember having seen any text rendering problems since the late 90's or so.


this is because browsers have very sophisticated algorithms to detect the encoding because this was such a frequent issue. (and yes, UTF-8 adoption/support has been growing, which also helps)

being German and working in a multi-national company, I can confirm it is still very much an issue with software that doesn't handle this. Excel is one of the worst offenders; document corruption is rife, especially when going between Excel on Windows and Excel for Mac. this is because Excel doesn't use UTF-8 as the default for legacy reasons (I think), and also either doesn't have encoding detection or has very bad encoding detection.


As am I. The encoding detection used for a standardized(!) feed file format I had to write had a cyclomatic complexity of 16 and only supported 4 encodings(X). On the other hand, it was almost always correct. How you would do that on a global scale is beyond me.

(X) I hear you ask, 'Why would you even do that!?' Try telling tiny companies without an IT department what an encoding is. It's faster to just figure it out on the receiving side.
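
Not the detector described above, just a minimal sketch of the general approach under my own assumptions: the candidate list (UTF-8 with or without BOM, UTF-16 with BOM, Windows-1252, Latin-1) and the name guess_and_decode are made up for illustration. It checks byte order marks first, then tries strict UTF-8, then falls back to legacy single-byte encodings:

```python
import codecs

# Assumed candidates; a real feed format would need its own list.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_and_decode(raw: bytes):
    """Return (encoding_name, text) for the first candidate that fits."""
    # 1. A byte order mark is the most reliable hint.
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name, raw.decode(name)
    # 2. Strict UTF-8 rarely accepts text written in another encoding.
    try:
        return "utf-8", raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # 3. Windows-1252 leaves a few byte values undefined.
    try:
        return "cp1252", raw.decode("cp1252")
    except UnicodeDecodeError:
        # 4. Latin-1 maps every byte, so it never fails (and may be wrong).
        return "latin-1", raw.decode("latin-1")


name, text = guess_and_decode("Grüße".encode("cp1252"))
```

The last step is the weak spot: a single-byte fallback accepts anything, so the result is a guess, not a detection.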


I don't really understand your point here. Why would you change the internal representation of the text storage type? This doesn't change anything.

Or if I read this incorrectly and you want to merge 'bytes' and 'string' but enforce 'utf-8', how would you ensure that the conversion to 'utf-8' is enforced when communicating with strangely encoded content (files encoded in utf-16-be for example, or worse)? I think you can't. It's the programmer's job to ensure everything is converted to the correct encoding, which is exactly the purpose of 'string' being distinct from and incompatible with 'bytes'.
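
A minimal sketch of that boundary discipline, with hypothetical helper names (to_text, to_wire) and an arbitrary sample string: decode with the known source encoding on the way in, encode to UTF-8 on the way out, and let the str/bytes split catch a forgotten conversion.

```python
def to_text(raw: bytes, declared_encoding: str) -> str:
    # Inbound boundary: bytes become text using the encoding we were told.
    return raw.decode(declared_encoding)


def to_wire(text: str) -> bytes:
    # Outbound boundary: text always leaves as UTF-8.
    return text.encode("utf-8")


incoming = "Grüße".encode("utf-16-be")  # stand-in for oddly encoded input

text = to_text(incoming, "utf-16-be")
outgoing = to_wire(text)

# Forgetting the decode step is rejected instead of producing
# mixed-encoding data.
try:
    "prefix: " + incoming
    conversion_enforced = False
except TypeError:
    conversion_enforced = True
```

With a merged type, the concatenation would succeed and the UTF-16 bytes would end up inside supposedly UTF-8 data.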



