Even for medium tasks, the slow step is rarely the computation; it's the squishy human entering it.
When I work on a decently sized 50GB+ dataset, I have to be doing something fairly naive before I get frustrated by computation time.
Edit: pandas is now Boring Technology (not a criticism); it is a solid default choice. In contrast, we are still in a Cambrian explosion of NeoPandas wannabes. I have no idea who will win, but there is a lot of fragmentation, which makes it difficult for me to jump head first into one of these alternatives. Most of them offer more speed or less RAM pressure, which is really low on my list of problems.
> YAGNI - For lots of small data tasks, pandas is perfectly fine.
There’s a real possibility the growth of single node RAM capacity will perpetually outpace the growth of a business’s data. AWS has machines with 32 TB RAM now.
It’s a real question whether big data systems will become more accessible before single-node machines become large enough to hold almost every data set.
In my experience, Polars streaming runs out of memory at much smaller scales than either DuckDB or DataFusion, and when it doesn't outright segfault it tends to use much more memory for the same workload.
Polars is faster than those two below a few GB, but beyond that you're better off with DuckDB or DataFusion.
I would love for this to improve in Polars, and I'm sure it will!
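For concreteness, here's roughly what I mean by "the same workload" across the three engines. This is only a sketch, not a benchmark: the file path, column names, and aggregation are placeholders, and the Polars call assumes a version where `collect(streaming=True)` opts into the streaming engine.

```python
import duckdb
import polars as pl
from datafusion import SessionContext

PATH = "events.parquet"  # placeholder for a larger-than-RAM dataset

# Polars: lazy scan plus a streaming collect so the plan runs in batches.
polars_out = (
    pl.scan_parquet(PATH)
    .group_by("user_id")
    .agg(pl.col("amount").sum().alias("total"))
    .collect(streaming=True)
)

# DuckDB: run the same aggregation as SQL directly against the Parquet file.
duckdb_out = duckdb.sql(
    f"SELECT user_id, SUM(amount) AS total FROM '{PATH}' GROUP BY user_id"
).pl()  # .df() would return a pandas DataFrame instead

# DataFusion: register the file as a table, then run the same SQL.
ctx = SessionContext()
ctx.register_parquet("events", PATH)
datafusion_out = ctx.sql(
    "SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id"
).to_pandas()
```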
Do you mean segfault or OOM? I am not aware of Polars segfaulting on high memory pressure.
If it does segfault, would you mind opening an issue?
Some context: Polars is building a new streaming engine that will eventually be able to run the whole Polars API (also the hard stuff) in a streaming fashion. We expect the initial release at the end of this year or early next year.
Our in-memory engine isn't designed for out-of-core processing, so if you benchmark it with restricted RAM it will perform poorly as data gets swapped, or you'll go OOM. If you have a machine with enough RAM, Polars is very competitive in performance, and in our experience it is tough to beat on time-series/window functions.
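As a rough illustration of the distinction (a sketch only: the file, column names, and window size are made up, and the streaming flag assumes a version that accepts `collect(streaming=True)`):

```python
import polars as pl

lf = pl.scan_parquet("sensor_readings.parquet")  # placeholder file

# Default collect() runs on the in-memory engine: fast when the data fits in
# RAM, and well suited to time-series/window queries like this hourly rollup.
hourly = (
    lf.sort("ts")
    .group_by_dynamic("ts", every="1h")
    .agg(pl.col("value").mean().alias("hourly_mean"))
    .collect()
)

# Opting into the streaming engine processes the query in batches instead of
# materializing everything, which matters once the data no longer fits in RAM.
totals = (
    lf.group_by("sensor_id")
    .agg(pl.col("value").sum())
    .collect(streaming=True)
)
```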
Segmentation violations are often the result of different underlying problems, one of which can be running out of memory.
We (the Ibis team) have opened related issues, and the usual response is either "don't use streaming until it's ready" or a fix, when the problem can be fixed.
Not sure what else there is to do, seems like things are working as expected/intended for the moment!
We'll definitely be the first to try out any improvements to the streaming engine.
Folks then ask: why not jump from pandas to [insert favorite tool]?
- Existing codebases. Lots of legacy pandas floating about.
- Third-party integration. Everyone supports pandas. Lots of libraries work with tools like Polars, but everything works with pandas (see the sketch after this list).
- YAGNI - For lots of small data tasks, pandas is perfectly fine.
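That said, mixing is cheap when you do want one of the newer engines for a hot path. A minimal sketch with placeholder data, assuming pyarrow is installed for the pandas/Polars conversions:

```python
import pandas as pd
import polars as pl

# A legacy pipeline hands us a pandas DataFrame (placeholder data).
legacy_df = pd.DataFrame({"user_id": [1, 1, 2], "amount": [10.0, 5.0, 3.0]})

# Do the heavy aggregation in Polars...
totals = (
    pl.from_pandas(legacy_df)
    .group_by("user_id")
    .agg(pl.col("amount").sum().alias("total"))
)

# ...then hand a pandas DataFrame back to the pandas-only libraries downstream.
totals_pd = totals.to_pandas()
```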