Hacker News

> But when you end up using one function, but you compile hundreds, some alarm bell should go off.

About a year ago I ran a project to update 3rd party dependencies.

One of the dependencies was a rich math library, full of all kinds of mathematical functions.

I did a little bit of digging, and we were only using one single method, to find the median of a list.

I pointed the engineer to the Wikipedia page and told him to eliminate the dependency and write a single method to perform the mathematical operation.

---

But, IMO, the real issue isn't using 3rd party dependencies: It's that we need a concept of pulling in a narrow slice of a library. If I just need a small part of a giant library, why do I have to pull in the whole thing? I think I've heard someone propose "microframeworks" as a way to do this.



In Rust, a library can define "features" that can be conditionally enabled or disabled by its dependents, which gives a built-in way to customize how much of the library is actually included. Tokio is a great example of this; people might be surprised to learn that tokio has only two required direct dependencies[1]; everything else is optional!

Unfortunately, it doesn't seem like people are super diligent about looking into the default feature set used by their dependencies and proactively trimming it down. It doesn't help that the syntax for pulling in extra features is less verbose than removing optional-but-default ones (which requires both setting `default-features = false` and then manually adding back every default feature you do still want).

It _really_ doesn't help that the only way for a library to expose the ability to prune unneeded features from its own dependencies to the users who inherit them is to manually define a feature that maps to each feature of every single one of those dependencies. For example, if you're writing a library with five dependencies, and every one of them has one required dependency and four optional ones, giving the users of your library full control over what transitive features they pull in would mean defining 20 features in your own library, one per transitive feature. And that's not even counting the features you'd want to define for your own code in order to be a good citizen and not force downstream users to include all of it.
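To make the asymmetry concrete, here's a Cargo.toml sketch (the `somedep` crate and its `extras` feature are hypothetical; the tokio feature names are real):

```toml
[dependencies]
# Adding an extra feature is one line; opting out of defaults means
# re-listing every default feature you still want.
tokio = { version = "1", default-features = false, features = ["rt", "net"] }
somedep = { version = "0.1", default-features = false }

[features]
# To let downstream users toggle somedep's optional "extras" feature,
# this library has to forward it manually, one feature per knob.
somedep-extras = ["somedep/extras"]
```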

More and more I'm coming to the opinion that the poor ergonomics of trimming feature bloat are the real catalyst for a lot of the compile-time issues in Rust. It doesn't seem to be widely discussed when things like this come up though, so maybe it's time I write a blog post with my strong feelings on this; at least then I'll have something to point to if the status quo continues indefinitely.

[1]: https://github.com/tokio-rs/tokio/blob/ee19b0ed7371b069112b9...


That's how we get things like left-pad in the JS ecosystem.

On one side, I think if we had a good system of trust, that's not a problem.

And part of me likes the idea of something like Shadcn - you like a component? Copy it into your library. However, if there ends up being a vulnerability, you have no idea if you are affected.

For some code, that's not a problem. For other code, we truly depend on having as many eyes as possible on it.


Wow. For those who don’t know, here’s a pseudocode implementation:

    Median(list) {
      let len = length(list)
      if len % 2 == 0 {
        let x = len / 2                     // integer division, 0-indexed
        return (list[x - 1] + list[x]) / 2  // average of the two middle elements
      }
      return list[len / 2]                  // integer division rounds down
    }
Note: assumes the list is already sorted.

Managed to resist calling is_odd there!


In most cases I'd probably use nearly this. I note that it contains a bug due to integer overflow if naively translated to most languages.
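A sketch of an overflow-safe translation in Rust, widening to i64 before the addition (this assumes a sorted slice of i32s; names are my own):

```rust
/// Median of an already-sorted slice. Widening to i64 avoids the
/// integer overflow that `a + b` can hit near i32::MAX.
fn median(sorted: &[i32]) -> f64 {
    let len = sorted.len();
    if len % 2 == 0 {
        let (a, b) = (sorted[len / 2 - 1] as i64, sorted[len / 2] as i64);
        (a + b) as f64 / 2.0
    } else {
        sorted[len / 2] as f64
    }
}

fn main() {
    assert_eq!(median(&[1, 2, 3, 4]), 2.5);
    assert_eq!(median(&[1, 3, 5]), 3.0);
    // The naive `(a + b) / 2` would overflow here in i32 arithmetic.
    assert_eq!(median(&[i32::MAX, i32::MAX]), i32::MAX as f64);
    println!("ok");
}
```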

But if I have a big enough list that I care about space usage (I don't want to make a copy that I sort and then throw away), or speed (I care about O(n) vs O(n log(n))) I'd be looking for a library before implementing my own.

Here are the relevant algorithms if you really want to implement your own fast median code though: https://cs.stackexchange.com/questions/1914/find-median-of-u...


I use that math textbook algorithm in production to produce a median from a list which has a bounded size and is already sorted by the db, though that bound could technically grow to INT_MAX if someone managed to make that many requests in five minutes. Not very likely. :-)


> and is already sorted by the db,

Right, if it's already sorted just taking the midpoint is the obviously correct algorithm (and O(1) time/space). It's only in the unsorted cases where with giant lists you should start thinking about alternatives.

If I'm working with gigabytes of photon counts (each element representing the number of photons detected in a time interval), I don't want to sort my gigabyte-long list before getting the median; sorting would destroy the very important structure of the data, so I'd just have to throw away the copy afterwards. This is referencing some code I worked on a long time ago. I'm not sure I had to calculate a median specifically, but similar enough statistics. It's a simple function, but not a one-size-fits-all algorithm.
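For the unsorted case, Rust's standard library already exposes an average-O(n) selection routine, `slice::select_nth_unstable`, so you may not need a third-party crate at all. A minimal sketch (the sample data is made up; note that selection still reorders the slice, so you'd work on a copy if the original order matters):

```rust
fn main() {
    // Hypothetical detector readings; unsorted.
    let readings: Vec<u32> = vec![9, 1, 7, 3, 5, 2, 8];

    // select_nth_unstable partitions in O(n) average time, but it
    // reorders the slice, so clone first to preserve the original.
    let mut scratch = readings.clone();
    let mid = scratch.len() / 2;
    let (_, median, _) = scratch.select_nth_unstable(mid);

    assert_eq!(*median, 5); // middle value of 1,2,3,5,7,8,9
    println!("{}", median);
}
```

For an even-length list you'd select twice (indices `len/2 - 1` and `len/2`) and average, as in the pseudocode above.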



