It's great that the Moore Foundation provided funding for open source data science tools in Python. Good for them!
That being said, I do wonder if numpy is the most appropriate recipient. In my experience with data science, the tool that would benefit the most is not numpy, but pandas. While data scientists rarely use numpy directly, every data scientist I know who uses pandas says they are constantly having to google how to do things due to a somewhat confusing and inconsistent API. I use pandas at work every day and I'm always looking stuff up, particularly when it comes to confusing multi-indexes. In contrast, I rarely use R's dplyr at work, but the API is so natural that I hardly ever need to look things up. I would love if pandas could make a full-throated commitment to a more dplyr-like API.
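To make the multi-index complaint concrete, here's a toy sketch (the data and variable names are made up for illustration, not from any real workload) of the kind of thing I keep having to look up:

```python
import pandas as pd

# Toy data: sales by region and quarter
df = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales": [10, 20, 30, 40],
})

# A groupby on two keys returns a Series with a MultiIndex...
totals = df.groupby(["region", "quarter"])["sales"].sum()

# ...so selecting a single group means remembering tuple-based .loc,
# or the less obvious .xs cross-section method:
east_q1 = totals.loc[("east", "Q1")]
west_only = totals.xs("west", level="region")  # drops one index level

# Many users just flatten the index back out immediately:
flat = totals.reset_index()
```

None of these steps is hard individually; the trouble is that `.loc` with tuples, `.xs`, and `reset_index` are three different idioms for poking at the same structure, and which one you need depends on details that are easy to forget.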
Nothing against pandas -- I know the devs are selflessly working very hard. It's just that it seems there is more bang for the buck there.
If you look at the design documents for pandas 2 there is a good illustration of how a lot of pain points in pandas 1 spring from numpy (
https://pandas-dev.github.io/pandas2/internal-architecture.h...). I think any significant development effort on numpy would probably greatly benefit both libraries.
Will have to check out dplyr :) I'd love to see how they master the magic that is multi-indexes.
In many cases, the use of multi-indexes in Pandas is (I think) a result of culture/style or expectation that the cells of a dataframe should have scalar values. If that would change and it became common to have nested dataframes, the use of multi-indexes would diminish.
The tooling to support nested dataframes (and maybe even lists) is simple to create; it could even live in a third-party library. I find that while multi-indices may be an accurate conceptual model for certain data, in practice they tend to be more inconvenient than nesting the dataframes. In every case I have encountered, only a single level of nesting was required.
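A minimal sketch of what I mean by a single level of nesting (this is an illustrative pattern, not an existing library): keep one row per group, with each group's own rows stored as a plain DataFrame in an object column.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["east", "east", "west"],
    "sales": [10, 20, 30],
})

# Instead of building a MultiIndex, keep a single-level frame whose
# "data" column holds sub-DataFrames -- one level of nesting:
nested = pd.DataFrame(
    [(name, group.reset_index(drop=True))
     for name, group in df.groupby("region")],
    columns=["region", "data"],
)

# Selecting a group is ordinary single-index lookup, no tuples needed:
east = nested.loc[nested["region"] == "east", "data"].iloc[0]
```

The lookup is more verbose than `totals.loc[("east", ...)]`, but there's only one idiom to remember, and the inner frames are ordinary DataFrames that work with every existing pandas function.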
If you're excited about non-scalar values in DataFrames, you should take a look at xarray (http://xarray.pydata.org), which implements a very similar idea in its Dataset class.
The funding source used by NumPy here is equally available to pandas developers. If someone with the experience to deliver wrote a good proposal I think there's a decent chance that it would be funded.
But... pandas uses numpy under the hood. If numpy gets better and can take over some of pandas's core functionality, that will also benefit pandas, right?
I could be wrong, but I'm pretty sure that these would be solved by pandas API design improvements, not with numpy improvements under the hood. (NB: As always, a big thanks to the developers for all their work.)
I had a similar issue. I had to read a certain piece of R code that used a lot of dplyr; after reading the dplyr documentation, I immediately felt more comfortable manipulating data in R than in Python. Later on I created https://github.com/has2k1/plydata, a dplyr imitation.
A lot of people already mentioned that pandas is built on top of numpy. Also, pandas and numpy are housed under the same non-profit: https://www.numfocus.org
But isn't the issue here that completely redoing the API would break a lot of code? I don't see how throwing money at the problem would fix this. I don't use any of these libraries, so maybe I'm totally off base, but it sounds like it's more of a tech debt/design issue than an issue that requires the kind of programming hours that only money can buy.
On the other hand if lots of libraries use numpy, making it more efficient and/or capable would seem to give quite a lot of bang for the buck. And it sounds like that's the kind of problem that money can actually solve.
There have been a few independent attempts to add dplyr-like functionality to pandas in a backwards-compatible way (e.g. dplython). I'd be very happy if the core pandas team went down this path.
That being said, I don't have a good understanding of how strong the distinction is between "design issues" and "issues where money helps". There must be some overlap.
I'll have to speak in generalities as I don't know enough about NumPy in particular to comment.
> That being said, I don't have a good understanding of how strong the distinction is between "design issues" and "issues where money helps". There must be some overlap.
That's true, but many lavishly funded projects have turned out worse than less expensive but better-designed ones. See: design by committee. The design of an API obviously requires careful thought, which I suppose is work that could be paid for. But getting everyone to agree on a design isn't a problem money can solve, and then you need to make some hard decisions about backward incompatibility. Perhaps you'd fund a fork of the project, splitting it into a legacy version and a new, fancy one with a new API, but then you're committed to maintaining two projects, which is its own headache.
These are the kinds of things I mean by design issues: problems that are hard not because they require many people working many billable hours, but because finding acceptable compromises is a very human issue, quite irrespective of the programming effort involved.
Many a software project has recognized that serious, backwards-incompatible changes would improve the project, and often there is even a working implementation, but these human and legacy support issues prevent widespread adoption and then the new implementation dies a quiet death because nobody is using it, so nobody finds it worth their time to work on it.
Perhaps what you really want is a new library, rather than trying to contort a different project into the shape you want. Which is of course something money helps with, but then when the money dries up the question of adoption is going to determine whether it lives or dies as an open source project.
Again, those were some general thoughts, I don't know much about this particular project, so maybe I'm way off base. Just offering an alternative POV regarding what exactly constitutes "getting your money's worth" with respect to choosing which OS projects to fund.
pandas is often used for one-off reports, where backwards compatibility is not as important.
Production software relying on the API could always depend on previous versions if a new version brings a significantly improved API.
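For instance, pinning in a requirements file is enough to keep production code on the old API (the version bound below is purely illustrative -- pick whatever release predates the breaking change):

```text
# requirements.txt: stay on the last release before the API redesign
pandas<2.0
```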
I'm a regular user of pandas, would definitely say it's my favorite Python library by far... but it is very hard to do certain operations with it (as the OP said, anything involving multiple indexes, and things like plotting multiple plots after a groupby, etc.)
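For the "multiple plots after a groupby" case, the usual workaround I've seen (sketched below with made-up toy data) is to loop over the groupby yourself and hand each group its own axes, since there's no single built-in call for it:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for this sketch
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "x": [1, 2, 1, 2],
    "y": [10, 20, 15, 5],
})

# One subplot per group: iterate the groupby manually and pass ax= through.
fig, axes = plt.subplots(1, df["group"].nunique())
for ax, (name, grp) in zip(axes, df.groupby("group")):
    grp.plot(x="x", y="y", ax=ax, title=name, legend=False)
```

It works, but you have to rediscover the `zip(axes, groupby)` pattern every time, which is exactly the kind of friction being described.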
> every data scientist I know who uses pandas says they are constantly having to google how to do things due to a somewhat confusing and inconsistent API.
That's a design error, not necessarily something that money will fix for you. This is why you need to think really long and hard before deploying a public API; it is very hard to change one later.
Well, one could at least hope that some additional funding would improve the chances that these design errors get addressed, although I agree that it is no panacea.
Just because the API is bad doesn't mean we should throw money at it. I agree that NumPy might not be the best recipient either. It's hard to tell, really.
Personally, I believe the biggest blocker for me is having good visualization tools. That's ultimately what gets me paid: showing other people my work and getting them to give me money to continue it.
In the core science stack, IMO, there's numpy, scipy, sympy, matplotlib, pandas, and xarray. I probably use sympy next to least, but I really think it's the one that could benefit the most from some funding.