Even for medium tasks, the slow step is rarely the computation; it's the squishy human entering it.
When I work on a decently sized 50GB+ dataset, I have to be doing something fairly naive before I get frustrated by computation time.
Edit: pandas is now Boring Technology (not a criticism); it is a solid default choice. In contrast, we are still in a Cambrian explosion of NeoPandas wannabes. I have no idea who will win, but there is a lot of fragmentation, which makes it difficult for me to jump head first into one of these alternatives. Most of them offer more speed or less RAM pressure, which is really low on my list of problems.
> YAGNI - For lots of small data tasks, pandas is perfectly fine.
There’s a real possibility the growth of single node RAM capacity will perpetually outpace the growth of a business’s data. AWS has machines with 32 TB RAM now.
It’s a real question whether big data systems will become more accessible before single-node machines become large enough to hold almost every data set.
In my experience, Polars streaming runs out of memory at much smaller scales than either DuckDB or DataFusion, and when it doesn't outright segfault it tends to use much more memory for the same workload.
Polars is faster than those two below a few GB, but beyond that you're better off with DuckDB or DataFusion.
I would love for this to improve in Polars, and I'm sure it will!
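For concreteness, here's roughly what I mean by "the same workload" across the three engines. This is only a sketch, not a benchmark: the file path, column names, and aggregation are placeholders, and the Polars call assumes a version where `collect(streaming=True)` opts into the streaming engine.

```python
import duckdb
import polars as pl
from datafusion import SessionContext

PATH = "events.parquet"  # placeholder for a larger-than-RAM dataset

# Polars: lazy scan plus a streaming collect so the plan runs in batches.
polars_out = (
    pl.scan_parquet(PATH)
    .group_by("user_id")
    .agg(pl.col("amount").sum().alias("total"))
    .collect(streaming=True)
)

# DuckDB: run the same aggregation as SQL directly against the Parquet file.
duckdb_out = duckdb.sql(
    f"SELECT user_id, SUM(amount) AS total FROM '{PATH}' GROUP BY user_id"
).pl()  # .df() would return a pandas DataFrame instead

# DataFusion: register the file as a table, then run the same SQL.
ctx = SessionContext()
ctx.register_parquet("events", PATH)
datafusion_out = ctx.sql(
    "SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id"
).to_pandas()
```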
Do you mean segfault or OOM? I am not aware of Polars segfaulting on high memory pressure.
If it does segfault, would you mind opening an issue?
Some context: Polars is building a new streaming engine that will eventually be able to run the whole Polars API (also the hard stuff) in a streaming fashion. We expect the initial release at the end of this year or early next year.
Our in-memory engine isn't designed for out-of-core processing, so if you benchmark it with restricted RAM it will perform poorly as data gets swapped, or you'll go OOM. If you have a machine with enough RAM, Polars is very competitive in performance, and in our experience it is tough to beat on time-series/window functions.
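As a rough illustration of the distinction (a sketch only: the file, column names, and window size are made up, and the streaming flag assumes a version that accepts `collect(streaming=True)`):

```python
import polars as pl

lf = pl.scan_parquet("sensor_readings.parquet")  # placeholder file

# Default collect() runs on the in-memory engine: fast when the data fits in
# RAM, and well suited to time-series/window queries like this hourly rollup.
hourly = (
    lf.sort("ts")
    .group_by_dynamic("ts", every="1h")
    .agg(pl.col("value").mean().alias("hourly_mean"))
    .collect()
)

# Opting into the streaming engine processes the query in batches instead of
# materializing everything, which matters once the data no longer fits in RAM.
totals = (
    lf.group_by("sensor_id")
    .agg(pl.col("value").sum())
    .collect(streaming=True)
)
```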
Segmentation violations are often the result of different underlying problems, one of which can be running out of memory.
We (the Ibis team) have opened related issues, and the usual response is either "don't use streaming until it's ready" or a fix, when the problem can be fixed.
Not sure what else there is to do, seems like things are working as expected/intended for the moment!
We'll definitely be the first to try out any improvements to the streaming engine.
Folks then ask: why not jump from pandas to [insert favorite tool]?
- Existing codebases. Lots of legacy pandas floating about.
- Third-party integration. Everyone supports pandas. Lots of libraries work with tools like Polars, but everything works with pandas (see the sketch after this list).
- YAGNI - For lots of small data tasks, pandas is perfectly fine.
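That said, mixing is cheap when you do want one of the newer engines for a hot path. A minimal sketch with placeholder data, assuming pyarrow is installed for the pandas/Polars conversions:

```python
import pandas as pd
import polars as pl

# A legacy pipeline hands us a pandas DataFrame (placeholder data).
legacy_df = pd.DataFrame({"user_id": [1, 1, 2], "amount": [10.0, 5.0, 3.0]})

# Do the heavy aggregation in Polars...
totals = (
    pl.from_pandas(legacy_df)
    .group_by("user_id")
    .agg(pl.col("amount").sum().alias("total"))
)

# ...then hand a pandas DataFrame back to the pandas-only libraries downstream.
totals_pd = totals.to_pandas()
```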