
The R community has been hard at work on small data. I still highly prefer working on in-memory data in R; dplyr and data.table are elegant and fast.

The CRAN packages are all high quality: if a maintainer stops responding to emails for two months, their package is automatically removed. Most packages come from university professors who have been doing this their whole careers.



A really big part of an in-memory, dataframe-centric workflow is how easy it is to do one step at a time and inspect the result.

With a database it is difficult to run a query, look at the result, and then run a query on that result. To me, that is what is missing in replacing pandas/dplyr/polars with DuckDB.


I'm not sure I really follow: you can create new tables for each step if you want to do it entirely within the DB, but you can also just run DuckDB against your dataframes in memory.


In R, data sources, intermediate results, and final results are all dataframes (slight simplification). With DuckDB, to have the same consistency you need every layer and step to be a database table, not a data frame, which is awkward for the standard R user and use case.


You can also use duckplyr as a drop-in replacement for dplyr. It automatically falls back to dplyr for unsupported behavior, and for most operations it is notably faster.

data.table is competitive with DuckDB in many cases, though as a DuckDB enthusiast I hate to admit this. :)


You can, but then every step starts with a drop table if exists; insert into …
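A sketch of that boilerplate, with a hypothetical intermediate table named `step1`:

```sql
DROP TABLE IF EXISTS step1;
CREATE TABLE step1 AS
    SELECT * FROM raw WHERE x > 1;
```

DuckDB does support `CREATE OR REPLACE TABLE step1 AS ...`, which at least collapses the pair into one statement.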


Or you nest your queries:

    select second from (select 42 as first, (select 69) as second);
Intermediate steps won't be stored, but until queries take a while to execute it's a nice way to do step-wise extension of an analysis.

Edit: It's a rather neat and underestimated property of query results that you can query them in the next scope.
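Common table expressions give the same step-wise property without the nesting, reading top to bottom like a pipeline; a sketch assuming a source table `raw` with columns `x` and `y`:

```sql
-- Each CTE names an intermediate result the next step can query.
WITH filtered AS (
    SELECT * FROM raw WHERE x > 1
),
summarised AS (
    SELECT SUM(y) AS total FROM filtered
)
SELECT * FROM summarised;
```

To extend the analysis, append another CTE and point the final SELECT at it.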


We all have different definitions of what is difficult. Maybe "annoying" or "bothersome" would have been better words, but the below beats nesting things:

    df |> select(..) |>
        filter(...) |>
        mutate(...) |>
        ...
And every time I've learned something about the intermediate result, I can add another line, or save the result in a new variable and branch my exploration. And I can easily highlight and run any number of steps from step 1 onwards.

Even old-school

    df2 <- df[...]
    df2 <- df2[...]
gives me the same benefit.


Yeah, sure, I do a lot of such things in RAM in Elixir, some Lisp, PHP or, if I must, Python.

But sometimes I happen to have just imported a data set into a SQL client, or I'm hooked into a remote database where I don't have anything but the SQL client. When developing an involved analysis query, nesting also comes in handy sometimes, e.g. to mock away a part of the full query.



Absolutely, if the engine has them and they're not wonky somehow.



