
I'm not so sure other languages do that any better (Node.js doesn't support non-Unicode filenames at all, for instance). Modern Python does a pretty good job of supporting Unicode, very far from being a "mess"; that's just not true at all. People always like to hate on Python, but other languages supposedly designed by actually capable people mess up other things all the time. Look at how the great Haskell represents strings, for instance, and what a clusterfuck[1] that is.

[1] https://mmhaskell.com/blog/2017/5/15/untangling-haskells-str...


Rust is probably one of the languages which does this crap best, and that's thanks to static typing and deciding to not decide:

1. it has proper, validated unicode strings (though the stdlib is not grapheme-aware so manipulating these strings is not ideal)

2. it has proper bytes, entirely separate from strings

3. it has "the OS layer is a giant pile of shit" OsString, because file paths might be random bag of bytes (UNIX) or random bags of 16-bit values (and possibly some other hare-brained scheme on other platforms but I don't believe rust supports other osstrings currently)

4. and it has nul-terminated bag o'bytes CString

For the latter two, conversion to a "proper" language string is explicitly known to be lossy, and the developer has to decide what to do in that case for their application.
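
To make the separation concrete, here's a minimal sketch of those four types and the explicitly lossy-or-fallible conversions (the byte values are just illustrative):

  use std::ffi::{CString, OsString};

  fn main() {
      // 1. Validated UTF-8: construction from bytes can fail.
      let s = String::from_utf8(vec![0xE2, 0x82, 0xAC]).unwrap(); // "€"

      // 2. Plain bytes, a separate type with no encoding assumptions.
      let b: Vec<u8> = vec![0xFF, 0xFE, 0x00];

      // 3. OS strings: whatever the platform hands you for paths/args/env.
      let os = OsString::from("café.txt");
      let lossy = os.to_string_lossy(); // Cow<str>, may insert U+FFFD
      let strict = os.to_str();         // Option<&str>, None if not UTF-8

      // 4. NUL-terminated bytes for FFI; fails on interior NULs.
      let c = CString::new("hello").unwrap();

      println!("{s} {b:?} {lossy} {strict:?} {c:?}");
  }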


> 1. it has proper, validated unicode strings (though the stdlib is not grapheme-aware so manipulating these strings is not ideal)

Grapheme clusters are overrated in their importance for processing. The list of times you want to iterate over grapheme clusters:

1. You want to figure out where to position the cursor when you hit left or right.

2. You want to reverse a string. (When was the last time you wanted to do that?)

The list of times when you want to iterate over Unicode codepoints:

1. When you're implementing collation, grapheme cluster searching, case modification, normalization, line breaking, word breaking, or any other Unicode algorithm.

2. When you're trying to break text into separate RFC 2047 encoded-words.

3. When you're trying to display the fonts for a Unicode string.

4. When you're trying to convert between charsets.

Cases where neither is appropriate:

1. When you want to break text to separate lines on the screen.

2. When you want to implement basic hashing/equality checks.

(I'm not sure where "cut the string down to 5 characters because we're out of display room" falls in this list. I suspect the actual answer is "wrong question, think about the problem differently").

Grapheme clusters are relatively expensive to compute, and their utility is very circumscribed. Iterating over Unicode codepoints is much more useful and foundational, and yet still very cheap.
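
For concreteness, a tiny sketch of the two kinds of iteration in Rust (code points come from std; grapheme clusters need the external unicode-segmentation crate):

  // external crate: unicode-segmentation = "1"
  use unicode_segmentation::UnicodeSegmentation;

  fn main() {
      let s = "e\u{301}!"; // 'e' + combining acute accent, then '!'

      // Cheap and foundational: iterate Unicode code points.
      let codepoints: Vec<char> = s.chars().collect();

      // Pricier and more specialised: iterate grapheme clusters.
      let graphemes: Vec<&str> = s.graphemes(true).collect();

      assert_eq!(codepoints.len(), 3); // 'e', U+0301, '!'
      assert_eq!(graphemes.len(), 2);  // "é", "!"
  }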


> Grapheme clusters are overrated in their importance for processing. The list of times you want to iterate over grapheme clusters:

> 1. You want to figure out where to position the cursor when you hit left or right.

> 2. You want to reverse a string. (When was the last time you wanted to do that?)

You missed the big one:

3. You want to determine the logical (and often visual) length of a string.

Sure, there are some languages where logical-length is less meaningful as a concept, but there are many, many languages in which it's a useful concept, and can only be easily derived by iterating grapheme clusters.


Visual length of a string is measured in pixels and millimetres, not characters. In a font/graphics library, not in a text processing one.


Sorry, visual length as in visual number of "character-equivalent for purposes of word length" things. Those things are close to, but not exactly the same as, grapheme clusters, so the latter can often be used as an imperfect (but much more useful than unicode points or bytes) proxy for the former.

There's no perfect representation of number-of-character-equivalents that doesn't require understanding of the language being handled (and it's meaningless in some languages as I said), but there are many written languages in which knowing the length in those terms is both extremely useful and extremely hard to do without grapheme cluster identification.


> character-equivalent for purposes of word length

Serious question: why would you want to do this?

I know it's fashionable to limit usernames to X characters... but why? The main reason I've seen has been to limit the rendered length so there are some mostly-reliable UI patterns that don't need to worry about overflows or multiple lines. At least until someone names themselves:

W W W W W W W W W W W W W W W W W W W W

Which is 20 characters, no spaces, and will break loads of things.

(I'm intentionally ignoring "db column size" because that depends on your encoding, so it's unrelated to graphemes)


> Serious question: why would you want to do this?

Have you never, in your entire life, encountered a string data type with a length rule? All sorts of ID values (to take an obvious example) either have fixed length, or a set of fixed lengths such that every valid value is one of those lengths, and many are alphanumeric, meaning you cannot get round length checks by trying to treat them as integers. Validating/understanding these values also often requires identifying what code point, not what grapheme, is at a specific index.

Plus there are things like parsing algorithms for standard formats. To take another example: you know how people sometimes repost the Stack Overflow question asking why "chucknorris" turns into a reddish color when used as a CSS color value? HTML5 provides an algorithm for parsing a (string) color declaration and turning it into a 24-bit RGB color value. That algorithm requires, at times, checking the length in code points of the string, and identifying the values of code points at specific indices. A language which forbids those operations cannot implement the HTML5 color parsing algorithm (through string handling; you'd instead have to do something like turn the string into a sequence of ints corresponding to the code points, and then manually manage everything, and why do that to yourself?).
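
As a minimal sketch of the first kind of rule (the nine-code-point format below is invented for illustration, not taken from any real ID scheme), this is the sort of check that wants code points rather than bytes or graphemes:

  fn looks_like_id(s: &str) -> bool {
      s.chars().count() == 9                                             // length in code points
          && s.chars().next().map_or(false, |c| c.is_ascii_alphabetic()) // code point at index 0
          && s.chars().skip(1).all(|c| c.is_ascii_alphanumeric())
  }

  fn main() {
      assert!(looks_like_id("X00000001"));
      assert!(!looks_like_id("X0000001")); // only 8 code points
  }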


Yes. All the instances I've seen have been due to byte-size restrictions (so they depend on encoding) or visual reasons (based on fundamentally flawed assumptions), with exceptions for dubious science around word lengths between languages or as a difficulty/intelligence proxy, or just having fun identifying patterns (interesting, absolutely, but of questionable utility and bias).

But every example you've given has been about visuals, byte sizes, or code points (which are unambiguously useful, yes). Nothing about graphemes.


So?

Rust's stdlib provides iteration on code units and code points. The use cases where these are useful are covered.

It does not provide iteration on grapheme clusters, so the use cases where that is useful are not covered (and require an external dependency).

At no point am I asking for codepoint-wise iteration to be replaced by cluster-wise iteration.


I think a more accurate characterization is that neither code points nor grapheme clusters are usually what you want, but when you're naively processing text it's usually better to go with grapheme clusters so you don't mess up _as_ badly :)

There are definitely some operations that make sense on code points: but if you go through your list, (1), (2), and (4) are things you'll rarely implement yourself (you just need a library), and (3) is ... kinda rare? The most common valid use case for dealing with code points is parsing, where the grammar is defined in ASCII or in terms of code points (which is relatively common).

Treating strings as either code points or graphemes basically enshrines the assumptions that segmentation operations make sense on strings at all -- they only do in specific contexts.

Most string operations you can think of come from incorrect assumptions about text. Like you said, the answer to most questions of the form "how do I X a string" is "wrong question" (reversing a string is my favorite example of this).

The only string operation that universally makes sense is concatenation (when dealing with "valid" strings, i.e. strings that actually make sense and don't do silly things like starting with a stray modifier character). Replacement makes some sense but you have to define "replacement" better based on your context. Taking substrings makes sense but typically only if you already have some metric of validity for the substring -- either that the substring was an ingredient of a prior concatenation, or that you have a defined text format like HTML that lets you parse out substrings. (This is why I actually kinda agree with Rust's decision to use bytes rather than code points for indexing strings -- if you're doing it right you should have obtained the offset from an operation on the string anyway, so it doesn't matter how you index it, so pick the fast one)

Most string operations go downhill from here where there's usually a right thing to do for that operation but it's highly context dependent.

Even hashing and equality are context-dependent, sometimes comparing bytes is enough, but other times you want to NFC or something and it gets messy quickly :)
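
For instance, a quick sketch of NFC-aware equality using the external unicode-normalization crate; plain comparison and normalised comparison disagree here:

  // external crate: unicode-normalization = "0.1"
  use unicode_normalization::UnicodeNormalization;

  fn nfc_eq(a: &str, b: &str) -> bool {
      a.nfc().eq(b.nfc())
  }

  fn main() {
      let precomposed = "\u{e9}";   // "é" as U+00E9
      let decomposed = "e\u{301}";  // "é" as 'e' + U+0301
      assert_ne!(precomposed, decomposed);      // plain str/byte comparison
      assert!(nfc_eq(precomposed, decomposed)); // NFC-aware comparison
  }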

In the midst of all this, grapheme clusters + NFC (what Swift does) are abstractions that let you naively deal with strings and mess up less. Your algorithm will still be wrong, but its incorrectness will cause fewer problems.

But yeah, you're absolutely right that grapheme clusters are pretty niche for when they're the correct tool to reach for. I'd just like to add that they're often the less blatantly incorrect tool to reach for :)

> (I'm not sure where "cut the string down to 5 characters because we're out of display room" falls in this list. I suspect the actual answer is "wrong question, think about the problem differently").

This is true, and not thinking about the problem differently is what caused the iOS Arabic text crash last year.

For many if not most scripts fewer code points is not a guarantee of shorter size -- you can even get this in Latin if you have a font with some wild kerning -- it's just that this is much easier to trigger in Arabic since you have some letters that have tiny medial forms but big final forms.


There's a very sound argument to be made for the opposite conclusion, that if we care about a problem we should make it necessary to solve the problem correctly or else stuff very obviously breaks, not have broken systems seem like they kinda work until they're used in anger.

Outside of MySQL (which unaccountably had a weird MySQL-only character encoding that only covered the BMP, named it "utf8", and then silently truncated actual UTF-8 strings you tried to shove into it, because YOLO MySQL), UTF-8 implementations tended to handle the other planes much better than UTF-16 implementations, many of which were in practice UCS-2 plus some thin excuses. Why? Because if you didn't handle multiple code units in UTF-8, nothing worked; you couldn't even write some English words like café properly. For years, pretending your UCS-2 code was UTF-16 would only be noticed by people using obscure writing systems, or by academics.

I am also reminded of approaches to i18n for software primarily developed and tested by monolingual English speakers. Obviously these users won't know if a localised variant they're examining is correctly translated, but they can be given a fake "locale" in which translated text is visibly different in some consistent way, e.g. it has been "flipped" upside down by abusing symbols that look kind of like the Latin alphabet upside down, or Pig Latin is used ("Openway Ocumentday"). The idea here again is that problems are obvious rather than corner cases: if the translations are broken or missing, it'll say "Open Document" in the test locale, which is "wrong", and you don't need to wait for a specialist German-speaking tester to point that out.


> There's a very sound argument to be made for the opposite conclusion, that if we care about a problem we should make it necessary to solve the problem correctly or else stuff very obviously breaks, not have broken systems seem like they kinda work until they're used in anger.

Oh, definitely :)

I'm rationalizing the focus on grapheme clusters, if I had my way "what is a string" would be a mandatory unit of programming language education and reasoning about this would be more strongly enforced by programming languages.


> 1. it has proper, validated unicode strings (though the stdlib is not grapheme-aware so manipulating these strings is not ideal)

Sigh, I hoped newer languages would avoid D's mistake. Auto-decoding is slow, unnecessary in most cases, and still only gives you partial correctness, depending on what you're trying to do. It also means that even the simplest string operations may fail, which has consequences on the complexity of the API.


I have no idea what you're talking about. Rust rarely does auto-anything, and certainly does not decode (or transcode) strings without an explicit request by the developer: Rust strings are not inherently iterable. As the developer, you specifically request an iterator over code units or code points (or grapheme clusters or words through https://unicode-rs.github.io/unicode-segmentation).
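
A small illustration of that: nothing here is implicit, each view of the string has to be requested by name.

  fn main() {
      let s = String::from("héllo");
      // for _ in s {}  // does not compile: String has no implicit iteration
      let units: Vec<u8> = s.bytes().collect();    // explicitly ask for UTF-8 code units
      let points: Vec<char> = s.chars().collect(); // explicitly ask for code points
      println!("{} code units, {} code points", units.len(), points.len());
  }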


I see, thanks for the clarification - looks like I mis-extrapolated from your comment.


That sounds more like an implementation issue than a design issue. If you are using UTF-8, actual decoding into Unicode code points is not necessary for most operations, and Rust will not do that.

It also does not imply that string operations may fail. String construction from raw bytes may fail, but otherwise the use of UTF-8 strings should not introduce additional failure conditions.
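
A tiny sketch of that boundary (the byte values are made up to be invalid UTF-8):

  fn main() {
      let bad = vec![0x66, 0x6F, 0xFF];                  // not valid UTF-8
      assert!(String::from_utf8(bad.clone()).is_err());  // construction can fail...
      let s = String::from_utf8_lossy(&bad);             // ...or be explicitly lossy
      assert_eq!(s.as_ref(), "fo\u{FFFD}");
      // Once a String exists, validity was established at the boundary;
      // ordinary operations on it don't add new failure modes.
  }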


> I'm not so sure other languages do that any better

I can only speak of D since I'm familiar with it.

In D, strings are arrays of chars. The standard library assumes that they contain valid UTF-8 code units and together form valid UTF-8, but it's ultimately your responsibility to ensure that. This assumption allows the standard library to present strings as ranges of Unicode code points (i.e. whole characters spanning multiple bytes).

To enforce this assumption, when raw data is interpreted as D strings it is usually checked if it's valid UTF-8. For example, readText() takes a filename, reads its contents, and checks that it is valid UTF-8 before returning it. assumeUTF() will take an array of bytes and return it as-is, but will throw in a check when the program is built in debug mode. Finally, string.representation (nothing more than a cast under the hood) gives you the raw bytes, and .byChar etc. allow iteration over code units rather than code points, if you really want to avoid auto-decoding and process a string byte-wise.

There are also types for UTF-16 and UTF-32 strings and code units, which work as the above. For other encodings, there's std.encoding which provides conversion to and from some common ones.

My only gripe with how D deals with Unicode is that its standard library insists on decoding UTF-8 into code points when processing strings as ranges (and many string processing functions in other languages are written as generic range algorithms in D). Often enough, it's unnecessary, slow, and makes processing non-UTF text a chore, but it's not too hard to avoid. Other than this, I think D's approach to Unicode is better than the other languages I've seen.


Some other languages handle Unicode just fine. Rust and Julia are fine. IMO you need to build Unicode understanding into string handling from the start for it to happen. Every function and detail needs to make UTF-8 sense, not just some Unicode handling library.


Haskell gets this much better than Python.

For example, if you use `readFile` from Data.Text, you'll use utf-8 names and read the content as decoded text. If you use `readFile` from Data.ByteString, you'll use utf-8 names and read the content as raw bytes. You just import whichever one you want into the local namespace and use it.

If you define a conversion procedure, you can make code that accepts either text or bytes, and return either one too, automatically. The tools for working with text have equivalents for working with bytes too, whenever that makes sense. Combining text with bytes is a type error, but there are so many shortcuts for going from one to the other that one barely notices them (unless you are concerned about performance, then you will notice).

That small article is basically all that you have to read to be productive dealing with text and bytes differences.


Pretty much everything you've described is possible in Python. The open function can return bytes or str, depending on what you ask for, and converting between the two is simply a call to .decode or .encode.

Combining text with bytes is a type error.

There, you're ready for text in Python 3.


This is one of the things that makes it hard for me to let go of Python, even though my programming style is evolving more towards functional languages: I've become very spoiled and take for granted a lot of the peripheral sanity and general well-behavedness that Python has, certainly in the unix environment.

Every other language or language implementation that I encounter seems to end up having an "oh god no" surprise or three hiding not-very-deep beneath the surface. Let's not even talk about ruby.


Java is pretty good at character processing and has been since the inception of the language. Adopting Unicode from the start helped enormously, along with clearly separating String from byte[] in the type system. Finally the fact you have static typing makes it a lot easier to avoid 'what the heck do I have here' problems with byte vs. str that still pop up even in Python3.

That said Python3 is vastly better than Python2. Basic operations like reading/writing to files and serialization to JSON for the most part just work without having to worry about encodings or manipulate anything other than str objects. I'm sure there are lots of cases where that's not true but for my work at least string handling is no longer a major issue in writing correct programs. The defaults largely seem to work.


Java's string handling is also broken by default in a few ways, due to it historically using UCS-2 internally and hence still allowing surrogate pairs to get split up, giving broken unicode strings.


I have not personally encountered this problem but it's definitely there. The other problem historically is that Java didn't require clients to specify encodings explicitly when moving between strings and bytes. That's been cleaned up quite a bit in recent releases of the JDK.

All things considered Java character handling was an enormous improvement over the languages that preceded it and still better than implementations in many other languages. (I wish the same could be said of date handling.)


What's the deal with Haskell strings? It's not a mess; it's basically enforcing the same "unicode sandwich" approach Python recommends, by using the type checker. Of course, to do that you need one type for each of the different possible layers of the sandwich.

There are added types for lazy vs. non-lazy, but that's for performance optimization, and don't get me started on how Python "gets messy" when you want to do performance optimization, because it usually kicks you out of the language.


> What's the deal with Haskell strings?

I think the linked article laid out the case fairly well. Basically, Haskell has a bunch of string types you need to understand and fret about, and the one named "String" is the one you almost never want, but it's also the only one with decent ergonomics unless you know to enable a compiler extension.

I think it's a fair criticism. The "Lots of different string types" thing isn't (IMO) such a big deal coming from a language of Haskell's vintage. Given what Python's "decade-plus spent with a giant breaking change right in the middle of the platform hanging over our heads" wild ride has been like, I can't blame anyone for not wanting to replicate the adventure.

But, for newcomers, the whole thing where you need to know to

  {-# LANGUAGE Support, Twenty, First, Century #-}
is a pretty big stumbling block.


I think that "deal" is just complains coming from someone that doesn't yet understand the engineering tradeoffs between strong static and weak type systems. That the nice functions for String come in the Prologue and that you need to implement or use other libraries for the functions for byte sequences or other stuff is not an excuse or a problem of the language as a tool.


"Operation Popeye (Project Controlled Weather Popeye / Motorpool / Intermediary-Compatriot) was a highly classified weather modification program in Southeast Asia during 1967–1972. The cloud seeding operation during the Vietnam War ran from March 20, 1967 until July 5, 1972 in an attempt to extend the monsoon season." [1]

"After World War II, the U.S. military bombed dams in North Korea and North Vietnam to destroy the communist governments’ electricity and irrigation infrastructure. This was, until the Iran-Iraq War, the final occurrence of such soggy tactics. In 1977 the Geneva Conventions specifically outlawed the targeting of water infrastructure in wartime." [2]

[1] https://en.wikipedia.org/wiki/Operation_Popeye [2] https://medium.com/war-is-boring/dam-warfare-3da6ee24518a https://en.wikipedia.org/wiki/Bombing_of_Vietnam%27s_dikes


Meanwhile, there is not a single tech startup from Germany worth talking about. But yeah, sure, not everyone has to do it the SV way of doing things, like actually being successful.


> like actually being successful

Let's be honest, there is a huge number of SV startups that bleed VC money like there is no tomorrow, not to mention the ones that failed.


QML is actually pretty awesome; I got into it through KDE/Plasma widget development. I think it should be possible to get the newest JS working using Babel in your build pipeline. It makes a LOT of sense in cases where you're developing a UI and can't afford to embed WebKit/Chrome.


It's not like Zoho is known for their high availability anyway; their domain not being reachable is just par for the course.

Also, since it said "suspended for abuse complaint", I would almost immediately assume that Zoho just didn't properly handle abuse claims and it's their fault.

Needless to say, I have an incredibly low opinion of their "service", based on having used their mail product for almost a year (I switched to Google afterward).


Is there anything worse than having to develop with people who can only use Git via a UI? It pretty much by definition means they have zero grasp of the fundamental concepts of Git.


The insanity of international copyright law has led a lot of people to basically just ignore the laws, which is 100% justified, reasonable, and morally sound in my opinion. It is not the consumer's fault that politicians are too inept to understand the century we live in, and it is not the consumer's fault that politicians are too corrupt to bring copyright law into this century.

One of the creative forms of digital self-defense is https://unogs.com/ - it lists all content on Netflix and the regions in which it is available. This way you can use a VPN service and actually use Netflix fully, as the service it was intended to be and that you've properly paid for.

And for that matter, it is not the corporations' fault for exploiting all laws to the greatest extent possible either. Don't anthropomorphize corporations; they are not moral agents, they are soulless, thoughtless profit-maximization machines. It is the fault of the politicians, and of the population who voted for them, for not regulating them properly.


> And for that matter, it is not the corporations' fault for exploiting all laws to the greatest extent possible either.

It's not like the laws just sprang up out of nowhere. Those same corporations wrote many of the laws via lobbyists and got them passed by legislators who were either too ignorant or disinterested to think through the ramifications.


I am a big believer in paying for content I'd like to see more of. I can both pay for content and pirate the shit out of it.


Considering most Hollywood movies are interchangeable (the scripts are all written from the same manual anyway), my solution is to not actually purchase anything. If it's on Netflix or HBO, fine; otherwise I don't bother. Same for music: if it's streamed somewhere, fine, otherwise no. I get Blu-rays for really good movies, but they are so rare it ends up very cheap.

When they come up with a guarantee that I can access my digital purchases forever from any corner of the world or solar system, then I'll consider buying digitally from the movie/music industry.


> the scripts are all written from the same manual anyway

I bet you mean "Save the Cat!: The Last Book on Screenwriting You'll Ever Need"; it's such a cancerous guide.


Looks like people are taking offense to me saying that all hollywood movies are the same. If you can watch the same superhero movie 30 times (with different skins and superhero names), power to you. I can't any more.


Maybe you just don't like superhero movies. There are, in fact, other movies that have been made.


Books are someone's fanfic that someone else liked enough to publish. Scripts are similar.

Most stories are written around the same small groups:

- Single protagonist

- Duo or buddies

- 3 person team (heavy hitter, smart/tech/engineer, leader)

- 5 person team (usually a 3 person team core with 2 additional members)

It's like "madlibbing" a story - using characters named "heavy", "tech", "leader", any IP can slot in their characters. For these, Teal'c/Carter/O'Neill or Raphael/Donatello/Leonardo or Hulk/Stark/Rogers.

Some plots will naturally be more widely adaptable to multiple IPs than others. Any hero can be used to tell a sufficiently generic story, but not all heroes work in all stories. Imagine doing Man of Steel in the MCU, or Infinity War in the DCEU.


You are complaining that movies are all similar, insofar as they tend to focus on 1, 2, 3 or 5 people?


you'd think they could make a fantastic 4 movie that was good then... if only for the novelty...


That explains why they never made the "Two and a Half Men" movie!


Please do not give these bad ideas to Hollywood executives, or we might suffer such a production.


This is supposed to be a bland analysis. Please, try to be more objective.

These roles are well-defined in modern literature, and it logically follows that a market must exist, to buy and sell stories and IPs. This allows a video production company to license plots and characters, combine them (insert hero A into slot Tech), and sell edited videos of the production.


> Increase your productivity by never leaving the home row.

I can't possibly be the only one who read "Increase your productivity by never leaving home." But awww, I'm so much more productive in the office though =)


That's so depressing, that we have to have serious discussions about technical countermeasures against our oppressive EU regime, just because of some old, ignorant, evil assholes. Time to leave the EU, I suppose.


In my experience, it's really bad for your mental health to read the source code of the software you use; there is a troubling amount of horrific code in widespread use.

