
"Julia was not designed by language geeks — it came from math, science, and engineering MIT students"

This statement is built on a false dichotomy. And it is not really true of Julia anyway: take the type system, for example, which is sophisticated AND unintrusive.



Jeff and I were slightly miffed at being called "not language nerds" ;-)


Language nerds (or geeks), definitely! Maybe he meant something like "language dweebs" or "language snobs."

Julia's great strength, I think, is that it was designed by folks with very good grounding in language design, but who prioritized practicality.


I cringed when I read that in the blog. I came across the benchmarks on the home page and I was thinking there was no way that it was possible to write a language that looks that good and performs that well without being a "language nerd."


> "Julia was not designed by language geeks — it came from math, science, and engineering MIT students"

This makes me a bit cautious about the language. Scientific computing people are often very smart, but they are not programmers or computer scientists, and may do funny things that a computer scientist would not, like one-based indexing of arrays in Julia. This is not a big deal, but I'm a bit wary that there may be some nasty surprises for a language-geek computer scientist like me :)

Another example is the byte addressing of UTF-8 strings, which may give an error if you try to index strings in the middle of a UTF-8 sequence [1]. s = "\u2200 x \u2203 y"; s[2] is an error, instead of returning the second character of the string. I find this a little awkward.

There's a flip side to this too: if you're doing scientific computing, there seems to be a wide variety of scientific computing libraries available in Julia [2].

Overall I find this language very interesting and it is on my shortlist of new languages to take a look at when time permits.

[1] http://docs.julialang.org/en/latest/manual/strings/#unicode-... [2] http://docs.julialang.org/en/release-0.2/packages/packagelis...


> Another example is the byte addressing of UTF-8 strings, which may give an error if you try to index strings in the middle of a UTF-8 sequence [1]. s = "\u2200 x \u2203 y"; s[2] is an error, instead of returning the second character of the string. I find this a little awkward.

Yes, it's a little awkward, but to understand why this tradeoff was made, think about how you'd get the nth character in a UTF-8 string. There is a tradeoff between intuitive O(n) string indexing by characters and O(1) string indexing by bytes.
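To make the O(n) side of the tradeoff concrete, here's a minimal sketch (in Python, just for illustration) of finding the nth character of a UTF-8 byte string: you have to scan the bytes and count character-start bytes, because characters vary in width.

```python
def nth_char_utf8(data: bytes, n: int) -> str:
    """Return the n-th character (0-based) of UTF-8 bytes by linear scan."""
    count = -1
    for i, b in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # starts a new character.
        if b & 0xC0 != 0x80:
            count += 1
        if count == n:
            # Extend to the end of this character (next non-continuation byte).
            j = i + 1
            while j < len(data) and data[j] & 0xC0 == 0x80:
                j += 1
            return data[i:j].decode("utf-8")
    raise IndexError(n)

s = "\u2200 x \u2203 y".encode("utf-8")
print(nth_char_utf8(s, 0))  # '∀' — one character, three bytes
print(nth_char_utf8(s, 2))  # 'x'
```

Byte indexing into the same data is a single array access, which is exactly the O(1)-vs-O(n) choice described above.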

The way out that some programming languages have chosen is to store your strings as UTF-16, and use O(1) indexing by two-byte sequence. That's not a great solution, because 1) it takes twice as much memory to store an ASCII string and 2) if someone gives you a string that contains a Unicode character that can't be expressed in UCS-2, like 🐣, your code will either be unable to handle it at all or do the wrong thing, and you are unlikely to know that until it happens.
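The surrogate-pair problem is easy to demonstrate (Python here, for illustration): a character outside the BMP occupies two UTF-16 code units, so "O(1) indexing by two-byte sequence" no longer lands on character boundaries.

```python
s = "\U0001F423"            # the hatching chick, outside the BMP
u16 = s.encode("utf-16-le")
print(len(s))               # 1 character
print(len(u16) // 2)        # 2 UTF-16 code units: a surrogate pair
```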

The other way out is to store all of your strings as UTF-32/UCS-4. I'm not sure any programming language does this, because using 4x as much memory for ASCII strings and making string manipulation significantly slower as a result (particularly for medium-sized strings that would have fit in L1 cache as UTF-8 but can't as UCS-4) is not really a great design decision.
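The 4x memory cost for ASCII is easy to check (Python, for illustration):

```python
s = "hello"
print(len(s.encode("utf-8")))     # 5 bytes in UTF-8
print(len(s.encode("utf-32-le"))) # 20 bytes in UTF-32: 4x for pure ASCII
```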

Instead of O(n) string indexing by characters, Julia has fast string indexing by bytes with chr2ind and nextind functions to get byte indexes by character index, and iterating over strings gives 4-byte characters. Is this the appropriate tradeoff? That depends on your taste. But I don't think that additional computer science knowledge would have made this problem any easier.
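The iteration pattern this enables can be sketched in Python (the helper name `nextind` mirrors Julia's, but this is an illustrative reimplementation, not Julia's actual code): given a byte index, advancing to the next character boundary is cheap, so a whole-string loop stays O(n) overall.

```python
def nextind(data: bytes, i: int) -> int:
    """Byte index of the next character boundary after i."""
    i += 1
    while i < len(data) and data[i] & 0xC0 == 0x80:  # skip continuation bytes
        i += 1
    return i

data = "\u2200 x".encode("utf-8")
i, chars = 0, []
while i < len(data):
    j = nextind(data, i)
    chars.append(data[i:j].decode("utf-8"))
    i = j
print(chars)  # ['∀', ' ', 'x']
```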


It's also essentially the same approach that has been taken by Go and Rust, so we're in pretty decent company. Rob Pike and Ken Thompson might know a little bit about UTF-8 ;-)


The problem I have with these design choices is that I predict lots of subtle off-by-one bugs and crashes on non-ASCII inputs in the future of Julia. I hope that I am wrong :)

> Yes, it's a little awkward, but to understand why this tradeoff was made, think about how you'd get the nth character in a UTF-8 string. There is a tradeoff between intuitive O(n) string indexing by characters and O(1) string indexing by bytes.

I understand the problem of UTF-8 character vs. byte addressing and O(n) vs. O(1) and I have thought about the problem long and hard. And I don't claim to have a "correct" solution, this is a tricky tradeoff one way or the other.

I think that Julia "does the right thing" but perhaps exposes it to the programmer in a somewhat funny manner that is prone to runtime errors.

> The way out that some programming languages have chosen is to store your strings as UTF-16, and use O(1) indexing by two-byte sequence.

Using UTF-16 is a horrible idea in many ways: it doesn't solve the variable-width encoding problem of UTF-8, but still consumes twice the memory.

> The other way out is to store all of your strings as UTF-32/UCS-4. I'm not sure any programming language does this, because using 4x as much memory for ASCII strings and making string manipulation significantly slower as a result (particularly for medium-sized strings that would have fit in L1 cache as UTF-8 but can't as UCS-4) is not really a great design decision.

This solves the variable width encoding issue at the cost of 4x memory use. Your concern about performance and cache performance is a valid one.

However, I would like to see a comparison of how this performs in some real-world use case. There will be a performance hit, that is for sure, but how big is it in practice?

In my opinion, the string type in a language should be targeted at short strings (where "long" means some hundreds of characters, and typical strings are around 32 or so) and have practical operations for that. For long strings (kilobytes to megabytes) of text, another method (some kind of bytestring or "text" type) should be used. For a short string, 4x memory use doesn't sound that bad, but your point about caches is still valid.

> Instead of O(n) string indexing by characters, Julia has fast string indexing by bytes with chr2ind and nextind functions to get byte indexes by character index, and iterating over strings gives 4-byte characters. Is this the appropriate tradeoff? That depends on your taste.

This is obviously the right thing to do when you store strings in UTF-8.

My biggest concern is that there will be programs that crash when given non-ASCII inputs. The biggest change I would have made is that str[n] should not throw a runtime error as long as n is within bounds.

Some options I can think of are:

1) str[n] returns the n'th byte

2) str[n] returns the character at the n'th byte, or some not-a-character value

3) Get rid of str[n] altogether and replace it with str.bytes()[n] (O(1)) and str.characters()[n] (where characters() returns some kind of lazy sequence if possible, O(n))
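Option 3 could be sketched like this (in Python; the `Str`, `bytes_view`, and `chars` names are hypothetical, just to show the shape of such an API):

```python
import itertools

class Str:
    """Toy string wrapper separating byte access from character access."""
    def __init__(self, s: str):
        self._bytes = s.encode("utf-8")

    def bytes_view(self) -> bytes:
        return self._bytes  # O(1) indexing by byte, never errors in-bounds

    def chars(self):
        # Lazy character sequence: reaching the n-th character
        # requires decoding up to it, so it's O(n).
        return iter(self._bytes.decode("utf-8"))

s = Str("\u2200 x \u2203 y")
print(s.bytes_view()[1])                            # a raw byte (0x88)
print(next(itertools.islice(s.chars(), 2, None)))   # 'x', the 3rd character
```

Neither accessor can throw on an in-bounds index, which is the property argued for above.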

You're right, this boils down to a matter of taste. And my opinion is that crashing at runtime should always be avoided if it is possible by changing the design.

> But I don't think that additional computer science knowledge would have made this problem any easier.

There is a certain difference in "get things done" vs. "do it right" mentality between people who use computers for science and computer scientists. The right way to go is not in either extreme but some kind of delicate balance between the two.


I think it's more like Julia is what happens when "language geeks"/experienced programmers write a language that's for technical computing, with a deep understanding of their problem domain and empathy for their users.

Strings in Julia are meant to be addressed in for loops; they index by byte, not character, because it's slow to index by character once you include Unicode. Julia tries, in general, to give you control over low-level things rather than hiding them with magic.

I like Julia because it's homoiconic, because of its type system, because multiple dispatch is fun and new, and because it's just plain fun to write. I do static analysis, not math/science.


>This makes me a bit cautious about the language. Scientific computing people are often very smart but they are not programmers or computer scientists and may do funny things that a computer scientist would not.

Most languages, from C and C++ to Python and Java were not created by "computer scientists".

Usually it's either programmers that studied math or came from some other profession (physicists, linguists like Larry Wall, even philosophers).

>Another example is the byte addressing of UTF-8 strings, which may give an error if you try to index strings in the middle of a UTF-8 sequence [1]. s = "\u2200 x \u2203 y"; s[2] is an error, instead of returning the second character of the string. I find this a little awkward.

That makes perfect sense if Julia cannot yet handle indexing strings on graphemes.

In essence, there is NO "second character" that you're getting when "byte indexing" a string. You might get one (if it's ASCII all the way), or more probably you'll just get an invalid part of a character as a byte.

In other languages with similar limitations (like PHP) you get a broken result with no warning at all.
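That silent-breakage failure mode is easy to reproduce with raw byte slicing (Python here, for illustration):

```python
s = "\u2200 x"                  # '∀ x': the first character is 3 bytes in UTF-8
raw = s.encode("utf-8")
piece = raw[1:2]                # slice into the middle of '∀'
print(piece)                    # b'\x88', a bare continuation byte
print(piece.decode("utf-8", errors="replace"))  # U+FFFD: silently broken
```

An immediate error at the bad index, as in Julia, at least surfaces the bug instead of propagating mojibake.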



