It's great that the Moore Foundation provided funding for open source data science tools in Python. Good for them!
That being said, I do wonder if numpy is the most appropriate recipient. In my experience with data science, the tool that would benefit the most is not numpy, but pandas. While data scientists rarely use numpy directly, every data scientist I know who uses pandas says they are constantly having to google how to do things due to a somewhat confusing and inconsistent API. I use pandas at work every day and I'm always looking stuff up, particularly when it comes to confusing multi-indexes. In contrast, I rarely use R's dplyr at work, but the API is so natural that I hardly ever need to look things up. I would love if pandas could make a full-throated commitment to a more dplyr-like API.
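To make the multi-index complaint concrete, here's a toy sketch (the data and variable names are made up for illustration, not from any real workload) of the kind of thing I keep having to look up:

```python
import pandas as pd

# Toy data: sales by region and quarter
df = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales": [10, 20, 30, 40],
})

# A groupby on two keys returns a Series with a MultiIndex...
totals = df.groupby(["region", "quarter"])["sales"].sum()

# ...so selecting a single group means remembering tuple-based .loc,
# or the less obvious .xs cross-section method:
east_q1 = totals.loc[("east", "Q1")]
west_only = totals.xs("west", level="region")  # drops one index level

# Many users just flatten the index back out immediately:
flat = totals.reset_index()
```

None of these steps is hard individually; the trouble is that `.loc` with tuples, `.xs`, and `reset_index` are three different idioms for poking at the same structure, and which one you need depends on details that are easy to forget.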
Nothing against pandas -- I know the devs are selflessly working very hard. It's just that it seems there is more bang for the buck there.
If you look at the design documents for pandas 2 there is a good illustration of how a lot of pain points in pandas 1 spring from numpy (
https://pandas-dev.github.io/pandas2/internal-architecture.h...). I think any significant development effort on numpy would probably greatly benefit both libraries.
Will have to check out dplyr :) I'd love to see how they master the magic that is multi-indexes.
In many cases, the use of multi-indexes in Pandas is (I think) a result of culture/style or expectation that the cells of a dataframe should have scalar values. If that would change and it became common to have nested dataframes, the use of multi-indexes would diminish.
The tooling to support nested dataframes (and maybe even lists) is simple to create; it could even live in a third-party library. I find that while multi-indices may be an accurate conceptual model for certain data, in practice they tend to be more inconvenient than nesting the dataframes. In every case I have encountered, only a single level of nesting was required.
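A minimal sketch of what I mean by a single level of nesting (this is an illustrative pattern, not an existing library): keep one row per group, with each group's own rows stored as a plain DataFrame in an object column.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["east", "east", "west"],
    "sales": [10, 20, 30],
})

# Instead of building a MultiIndex, keep a single-level frame whose
# "data" column holds sub-DataFrames -- one level of nesting:
nested = pd.DataFrame(
    [(name, group.reset_index(drop=True))
     for name, group in df.groupby("region")],
    columns=["region", "data"],
)

# Selecting a group is ordinary single-index lookup, no tuples needed:
east = nested.loc[nested["region"] == "east", "data"].iloc[0]
```

The lookup is more verbose than `totals.loc[("east", ...)]`, but there's only one idiom to remember, and the inner frames are ordinary DataFrames that work with every existing pandas function.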
If you're excited about non-scalar values in DataFrames, you should take a look at xarray (http://xarray.pydata.org), which implements a very similar idea in its Dataset class.
The funding source used by NumPy here is equally available to pandas developers. If someone with the experience to deliver wrote a good proposal I think there's a decent chance that it would be funded.
But... pandas uses numpy under the hood. If numpy gets better and can take over some of pandas's core functionality, that will also benefit pandas, right?
I could be wrong, but I'm pretty sure that these would be solved by pandas API design improvements, not with numpy improvements under the hood. (NB: As always, a big thanks to the developers for all their work.)
I had a similar issue. I had to read a certain piece of R code that used a lot of dplyr; after reading the dplyr documentation, I immediately felt more comfortable manipulating data in R than in Python. Later on I created https://github.com/has2k1/plydata, a dplyr imitation.
A lot of people already mentioned that pandas is built on top of numpy. Also, pandas and numpy are housed under the same non-profit: https://www.numfocus.org
But isn't the issue here that completely redoing the API would break a lot of code? I don't see how throwing money at the problem would fix this. I don't use any of these libraries, so maybe I'm totally off base, but it sounds like it's more of a tech debt/design issue than an issue that requires the kind of programming hours that only money can buy.
On the other hand if lots of libraries use numpy, making it more efficient and/or capable would seem to give quite a lot of bang for the buck. And it sounds like that's the kind of problem that money can actually solve.
There have been a few independent attempts to add dplyr-like functionality to pandas in a backwards-compatible way (e.g. dplython). I'd be very happy if the core pandas team went down this path.
That being said, I don't have a good understanding of how strong the distinction is between "design issues" and "issues where money helps". There must be some overlap.
I'll have to speak in generalities as I don't know enough about NumPy in particular to comment.
> That being said, I don't have a good understanding of how strong the distinction is between "design issues" and "issues where money helps". There must be some overlap.
That's true, but many lavishly funded projects have turned out worse than less expensive but better-designed ones. See: design by committee. The design of an API obviously requires careful thought, which I suppose is work that could be paid for. But getting everyone to agree on a design isn't a problem money can solve, and then you need to make some hard decisions about backward incompatibility. Perhaps you'd fund a fork of the project, splitting it into a legacy version and a new, fancy one with a new API, but then you're committed to maintaining two projects, which is its own headache.
These are the kinds of things I mean by design issues: problems that are hard not because they require many people working many billable hours, but because finding acceptable compromises is a very human issue, quite irrespective of the programming effort involved.
Many a software project has recognized that serious, backwards-incompatible changes would improve the project, and often there is even a working implementation, but these human and legacy support issues prevent widespread adoption and then the new implementation dies a quiet death because nobody is using it, so nobody finds it worth their time to work on it.
Perhaps what you really want is a new library, rather than trying to contort a different project into the shape you want. Which is of course something money helps with, but then when the money dries up the question of adoption is going to determine whether it lives or dies as an open source project.
Again, those were some general thoughts, I don't know much about this particular project, so maybe I'm way off base. Just offering an alternative POV regarding what exactly constitutes "getting your money's worth" with respect to choosing which OS projects to fund.
pandas is often used for one-off reports, where backwards compatibility is not as important.
Production software relying on the API could always depend on previous versions if a new version brings a significantly improved API.
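For instance, pinning in a requirements file is enough to keep production code on the old API (the version bound below is purely illustrative -- pick whatever release predates the breaking change):

```text
# requirements.txt: stay on the last release before the API redesign
pandas<2.0
```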
I'm a regular user of pandas, would definitely say it's my favorite Python library by far... but it is very hard to do certain operations with it (as the OP said, anything involving multiple indexes, and things like plotting multiple plots after a groupby, etc.)
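For the "multiple plots after a groupby" case, the usual workaround I've seen (sketched below with made-up toy data) is to loop over the groupby yourself and hand each group its own axes, since there's no single built-in call for it:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for this sketch
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "x": [1, 2, 1, 2],
    "y": [10, 20, 15, 5],
})

# One subplot per group: iterate the groupby manually and pass ax= through.
fig, axes = plt.subplots(1, df["group"].nunique())
for ax, (name, grp) in zip(axes, df.groupby("group")):
    grp.plot(x="x", y="y", ax=ax, title=name, legend=False)
```

It works, but you have to rediscover the `zip(axes, groupby)` pattern every time, which is exactly the kind of friction being described.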
> every data scientist I know who uses pandas says they are constantly having to google how to do things due to a somewhat confusing and inconsistent API.
That's a design error, not necessarily something that money will fix for you. This is why you need to think really long and hard before deploying a public API; it is very hard to change one later.
Well, one could at least hope that some additional funding would improve the chances that these design errors get addressed, although I agree that it is no panacea.
Just because the API is bad doesn't mean we should throw money at it. I agree that NumPy might not be the best recipient either. It's hard to tell, really.
Personally, I believe the biggest blocker for me is having good visualization tools. That's ultimately what gets me paid: showing other people my work and getting them to give me money to continue it.
In the core science stack, IMO, there's numpy, scipy, sympy, matplotlib, pandas, and xarray. I probably use sympy next to least, but I really think it's the one that could benefit the most from some funding.