0-based indexing with closed intervals is better for slicing. This shouldn't be controversial. It's because you can represent a zero interval cleanly: [3,3) is an empty interval after slot 2, representing a single cell is [3,4).
This has two nice properties. One is that two slices are adjacent if the end of one matches the beginning of the other, and the other, far more important, is that the length of the slice is end - start.
That's the one that really gets us something. It means you can do relatively complex offset math, without having to think about when you need to add or subtract an additional 1 to get your result.
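To make that concrete, here's a minimal Python sketch (Python slices are exactly this 0-based half-open scheme):

    xs = ['a', 'b', 'c', 'd', 'e']
    xs[3:3]                       # [] -- the empty interval [3, 3)
    xs[3:4]                       # ['d'] -- a single cell, [3, 4)
    xs[1:3] + xs[3:5] == xs[1:5]  # True: matching end/start means adjacent
    len(xs[1:4]) == 4 - 1         # True: length is end - start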
I use Lua every day, and work with abstract syntax trees. I mess this up all. the. time.
Of course you can use closed intervals and stick with 1-based indexing. But for why you shouldn't, I'm going to Appeal To Authority: read Dijkstra, and follow up with these:

https://wiki.c2.com/?WhyNumberingShouldStartAtZero
https://wiki.c2.com/?WhyNumberingShouldStartAtOne
https://wiki.c2.com/?ZeroAndOneBasedIndexes
> 0-based indexing with closed intervals is better for slicing. This shouldn't be controversial.
base is irrelevant to this, and you want (and show!) half-open, not closed, intervals.
> This has two nice properties. One is that two slices are adjacent if the end of one matches the beginning of the other, and the other, far more important, is that the length of the slice is end - start.
Yeah, those are all properties of half-open intervals, irrespective of indexing base. It would be as true of π-based indexing as it is of 0-based.
It's related to 0-based indexing in that if you want to take/iterate over the first `N` elements, `0:N` works with 0-based indexing + closed-open intervals, but if you had 1-based and closed-open, you'd need the awkward `1:N+1`.
This is why 1-based index languages normally use closed-closed intervals, so that they can use `1:N`.
I'm a die-hard Julian (Julia is 1-based), but I do a lot of pointer arithmetic in my packages internally. I've come to prefer 0-based indexing, as it really is more natural there. 0-based indexing plus closed-open intervals is also nicer for partitioning an iteration space/tiling loops, thanks to the property the parent commenter pointed out: the end of one iteration is the start of the next. This is a nice pattern for partitioning `N` into roughly block_size-sized blocks:
    iters, rem = divrem(N, block_size)
    start = 0
    for i in 0:iters-1    # the half-open interval [0, iters)
        # the first `rem` blocks absorb one extra element of the remainder
        stop = start + block_size + (i < rem)   # `end` is reserved in Julia
        # operate on [start, stop)
        start = stop
    end
But that's only slightly nicer. To translate this into 1-based indexing and closed-closed intervals, you'd just substitute the `# operate` line with

    # operate on [start+1, stop]

the `0:iters-1` with `1:iters`, and `i < rem` with `i <= rem`.
1- vs 0-based indexing is bike-shedding. A simple question we can all have opinions on that's easy to argue about, when it really doesn't matter much.
Julia uses 1-based indexing, but its pointer arithmetic is (obviously) 0-based, because pointer arithmetic != indexing. Adding 0 still adds 0, and adding 1 and `unsafe_load`ing will give me a different value than if I didn't add anything at all. (This is just reemphasizing the final point made by the blog post.)
I have on occasion, and when working on custom array types I've added support for being offset, with optionally compile-time-known offsets.
As StrideArrays.jl matures and I write more libraries making use of it, I may use 0-based indices more often. The idea of dynamic offsets when they aren't needed bothers me, even though I've benchmarked that as being pretty meaningless.
The bigger problem is just that OffsetArrays are a wrapper, and there's a Julia bug where TBAA information on wrappers often gets lost, so that the compiler reloads the pointer you're loading from on every iteration of a loop. Aside from that being slow itself, it also causes the autovectorizer to fail. This causes a severe regression when it occurs. Performance should be fine in other cases.
LoopVectorization or `@simd ivdep` should also avoid the problem.
For the most part, my preference on 1 vs 0 is weak, and for code other people are likely to look at I do want to make it easy to understand / not full of surprises.
The natural representation of intervals for doing arithmetic on is 0-based half-open.
Half-open because of the slicing properties, as noted in your posting and the grandparent posting.
0-based because of the simplification for converting between relative coordinate systems. Suppose you have one interval A represented as offsets within a larger interval B, and you'd like to know what A's coordinates are in the global coordinate system that B uses.
This is much easier to compute when everything uses 0-based coordinates.
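As a sketch in Python (the function is illustrative, not from any library):

    def to_global(b_start, a_start, a_end):
        # with 0-based half-open coordinates this is pure addition
        return (b_start + a_start, b_start + a_end)

    to_global(100, 10, 20)   # (110, 120): A = [10, 20) within B = [100, 200)

    # with 1-based closed intervals the same lift needs a -1 correction:
    # (b_start + a_start - 1, b_start + a_end - 1)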
I think this is a good write-up, but I think the notation is still carrying some mental baggage. It's not necessary to have these open and closed brackets/parentheses. They don't add anything, and if anything, they confuse the matter. An interval is just (1, 2) (or [1, 2] if preferred aesthetically). Since a base cannot be "on" either 1 or 2, it's not meaningful to have this open/closed interval notion. In other words, (1, 2) == [1, 2) == (1, 2] == [1, 2].
Open/closed intervals only come into play in continuous dimensions. DNA sequences, arrays in memory, et al are discrete.
On the first point, yep, completely misspoke, what you don't want is open intervals.
To the second point, as I said: Dijkstra's argument for using 0 instead of 1 with half-open intervals is, to my taste, perfect. As I have nothing to add to it, I will simply defer.
For </<= intervals and 1-based indexing you have to write `for (i = 1; i <= N; i++)`. So you lose the nice property of having `upper - lower` iterations.
For half-open intervals, you have upper-lower iterations regardless of base. In C for loop terms, it's:
    for (i = lower; i < upper; i++) {}
If you are using 0-based indexing, lower is zero and upper is the number of elements for a full iteration over all indexes; with 1-based indexing lower is 1 and upper is one greater than the number of elements.
Now, despite the utility of half-open intervals, most people’s intuition is around ordinal index ranges, so 0-based indexing is counterintuitive with half-open intervals because slices start at 0, and 1-based indexing is counterintuitive with them because they end at N+1.
This is because, useful or not, half-open intervals are counterintuitive, but they are worth developing the intuition for because they combine nicely.
> with 1-based indexing lower is 1 and upper is one greater than the number of elements.
Do you see the +1 tweak? Also consider that a for loop can't be expressed as a </* interval and is always expressed as a <=/* interval -- the loop starts at i = lower, inclusive. So 1-based indexing has to be either 1 <= i <= N (closed interval: bad, upper - lower = N - 1 != number of iterations) or 1 <= i < N + 1 (half-open interval: good, but with the +1 tweak).
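In Python terms, `range` is this half-open convention, and the tweak is visible directly:

    list(range(0, 5))   # [0, 1, 2, 3, 4] -- 0-based: upper == element count
    list(range(1, 6))   # [1, 2, 3, 4, 5] -- 1-based: upper must be N + 1
    # either way there are upper - lower iterations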
> I use Lua every day, and work with abstract syntax trees. I mess this up all. the. time.
I think this is the clincher in your argument (rather than Dijkstra). I only use languages with 0-based indexing, and it seems natural to me, but I could almost be convinced that if only I used 1-based indexing regularly then I'd be fine with it too. But here you are with actual experience of using 1-based indexing regularly and you don't feel that way, which blows that idea out of the water.
A programmer “strengthens their ankle and leans forward by holding their arm out, preparing to relax the shoulder” into a bar. The bartender bursts out laughing for a few sprints.
I mess it up as well, both in Lua and C. The key to success is to not calculate values by adding or subtracting them, but to use meaningful names instead. E.g. dist(a, b), holen(a, b), endidx(start, len) and so on. Forget about ever writing “1” in your code, point your finger and call operations out loud. It doesn’t eradicate errors, but at least elevates them to the problem domain level. Off by one is not exclusive to 1-based or 0-based, it is exclusive to thinking you’re smart enough to correctly shorten math expressions in your head every damn time.
(But sometimes I also feel too clever, and after some thinking I'm just writing (i+l2+b+2), because it's “obvious” if b already comes in negative and means “from behind” in “0-based, -1 is last” terms.)
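A sketch of that naming discipline in Python (the helper names come from the comment above; the bodies are one plausible reading):

    def dist(a, b):
        # number of cells between positions a and b, half-open
        return b - a

    def endidx(start, len_):
        # one-past-the-end index of a run of len_ cells starting at start
        return start + len_

    # callers never write a bare +1/-1; every off-by-one decision
    # lives in exactly one named place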
True, but there are examples where 1-based indexing is easier, like returning the last element based on length. I think array[array.length] is easier to understand than array[array.length - 1].
Or the predecessor of the last element: array[array.length - 2] makes you think, whereas array[array.length - 1] is more obvious.
I think the most mathematically natural way to interpret negative indices is to treat them as numbers modulo the list length. Then 0 = length, and -1 = length - 1.
Negative indices have an implied ”len(list)” in front of them. They can be seen as an application of the idea of half-open intervals (0 <= i < len(list)), so it makes sense that they’re not symmetrical.
I'm not sure on what principles you'd make a principled justification. My most common use of negative indices is for slicing the end of the list, in which context the interpretation similar to len(list) makes sense. E.g., list[:-2] does what you'd expect (the same as list[2:], except from the other end).
> implicitly taking indices mod len(list)?
Isn't this doing that (plus some bounds checking)? -1 mod 4 evaluates to 3.
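A quick Python check:

    xs = ['a', 'b', 'c', 'd']     # len(xs) == 4
    -1 % 4                        # 3 -- Python's % result is non-negative here
    xs[-1] == xs[len(xs) - 1]     # True: the implied len(xs) in front
    xs[:-2], xs[2:]               # (['a', 'b'], ['c', 'd'])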
Which Lua has also, for strings, and it's quite convenient.
Doesn't work for tables, though, because you can actually put something at foo[-1], which can be useful.
I find that foo[#foo+1] is more common than foo[#foo] in my code, that is to say I add to an array more often than I access the last slot. Although I typically use the idiomatic table.insert(foo, bar) instead— but that's mostly because it's slightly awkward to add the +1 all the time.
Sure. Let's take array elements in reverse, from m to m-n. Suppose n erroneously becomes larger than m. That should produce an error, not silently take several trailing elements.
Imagine you used list[index] where index is calculated. If you miscalculated something and it went negative, you'd get a big exception/error; but when negative indices are viable, the code will "work" fine.
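That silent-failure mode, sketched in Python:

    xs = [10, 20, 30, 40]
    m, n = 1, 3
    idx = m - n    # -2: a bug, but...
    xs[idx]        # 30 -- silently reads from the tail instead of raising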
The fact of the matter is, for a type that can hold N distinct values, there are N + 1 different possible interval lengths. You cannot represent the length of an interval over a fixed-width type in that same type.
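Concretely, assuming an 8-bit index type:

    N = 2 ** 8             # an 8-bit type holds 256 distinct values (0..255)
    len(range(0, N + 1))   # 257 possible lengths: a slice of a 256-element
                           # buffer can have length 0 through 256 inclusive,
                           # so the length of [0, 256) doesn't fit in 8 bits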
Did you mean open range? I've never encountered a circumstance where closed ranges are useful, though I presume they exist.
And yes, I think any typed language with a less expressive range syntax than Ada has some work to do. That still leaves open the question of the default, and I maintain that 0 <= .. < n is the correct one.
That... makes no sense? Whether you start counting a list at 0 or 1, if your language of choice supports [x,x) notation, that syntax will effect an empty interval, and [x,x+1) will effect a single cell, no matter whether your list indexes the first element as 0, or 1. The only difference is which cell [x,x+1) refers to, which is the part that keeps sparking the controversy.
Why on Earth would I want a "zero interval" slice?
It has none of the properties I want in a slice, and that a slice of literally any other length will have. In fact, it means every slice I use needs error-handling, because I might get... something that's functionally not a slice, but is still called one. If that doesn't scream "type error" at you, I don't know what would. In fact, that's precisely why 0 is a comparatively recent invention. Most list-based problems were solved just fine before then.
All these arguments are poor rationalisation for the field's inability to move on from the 8-bit era, where the loss of 1/256 of your address space mattered and compilers had bigger problems to solve than translating indices to offsets.
You want a zero-slice because it’s a simpler base case in almost any even mildly complex algorithm. Without empty lists, you need many additional branches throughout a code base to handle the “no element” case, causing more potential bugs. There’s nothing “8-bit” about cleaner code, quite the opposite.
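For example, in Python, where the empty slice is a first-class value:

    def range_sum(xs, start, stop):
        # half-open [start, stop); the slice may be empty, and that's fine
        total = 0
        for x in xs[start:stop]:
            total += x
        return total

    range_sum([1, 2, 3], 1, 3)   # 5
    range_sum([1, 2, 3], 2, 2)   # 0 -- the empty [2, 2) needs no special branch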
Wikipedia says that Egyptians had a 0 as early as 1770 BC - almost 4000 years ago. If that's a "relatively recent" invention, what about computers?
0 comes up if you're doing any arithmetic at all. It's a natural and useful concept for almost anything. As long as you can always take something away where there is something, you'll need 0. It wouldn't be very useful to make something such as a non-empty list type and then only be able to take away elements until the last one, which must remain.
In more mathematical terms, 0 is called a neutral element (relative to the addition operation), and almost anything you might want to do (for example subtraction) requires a neutral element as a consequence.
Well, you might want a function which returns String, not Maybe(String), so that you can just concatenate, rather than handle Some(String) or None all the time.
So that's why you might want an empty String. If you have an optional element in a syntax, it can be convenient to match it with a rule using a Kleene star, and that gives you an empty Node, which would return an empty String if you request its value. And so on.
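A sketch of that convenience in Python (the `node`/`.text` shape here is hypothetical, not any particular parser's API):

    def text_of(node):
        # an absent optional node yields "", so callers can always concatenate
        return node.text if node is not None else ""

    # full = text_of(prefix) + text_of(body) + text_of(suffix)
    # -- no Some/None unwrapping at every use site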
With open intervals, you have to represent that as, say, (3, 2). Which sucks.
Just a friendly correction :-). You have repeatedly confused the meaning of open and closed. "Closed" means including the end element given in the range, and "open" means excluding it.
It may make more sense to think of intervals in the real numbers. "[1,2)" doesn't have a maximum to the right - it contains 1.9999... but not 2. Thus it's "open" to the right. Whereas "[1,2]" includes its maximum (2), so it's sealed.
The terms open and closed apply to any set of numbers (not just intervals), where "closed" means that the set includes the limit of any convergent sequence of numbers in the set, and "open" means that the set's complement is closed.
Much appreciated! The terminology and notation seemed backwards to me, and I basically never use it except when discussing Dijkstra on HN ;-)
Visually, it looks like ) has more "room" than ], if that makes sense, and that "closed" would mean that the element defines the 'lid' of the range, while "open" means it defines the final extent.
But you can't argue with nomenclature, and extending the definition to the reals helped make it click for me, so I'm unlikely to screw it up again. Thanks!
You could imagine the curved parentheses as "cutting the corner" of the square brackets - or imagine the curve as just gently kissing the endpoint at a single point, while the flat square solidly includes it. "Open" makes sense because it satisfies the criterion that for any number in the interval, you can find a larger (or smaller) number that is also in it - there's "no end".