
The R community has been hard at work on small data. I still highly prefer working on in-memory data in R; dplyr and data.table are elegant and fast.

The CRAN packages are all high quality: if a maintainer stops responding to emails for two months, their package is automatically removed. Most packages come from university professors who have been doing this their whole careers.



A really big part of an in-memory, dataframe-centric workflow is how easy it is to do one step at a time and inspect the result.

With a database it is difficult to run a query, look at the result, and then run a query on that result. To me, that is what is missing in replacing pandas/dplyr/polars with DuckDB.


I'm not sure I really follow: you can create new tables for each step if you want to do it entirely within the DB, but you can also just run DuckDB against your dataframes in memory.


In R, data sources, intermediate results, and final results are all dataframes (slight simplification). With DuckDB, to have the same consistency you need every layer and step to be a database table, not a data frame, which is awkward for the standard R user and use case.


You can also use duckplyr as a drop-in replacement for dplyr. It automatically falls back to dplyr for unsupported behavior, and for most operations it is notably faster.

data.table is competitive with DuckDB in many cases, though as a DuckDB enthusiast I hate to admit this. :)


You can, but then every step starts with a drop table if exists; insert into …
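A sketch of that boilerplate, with a hypothetical intermediate table named `step1`:

```sql
DROP TABLE IF EXISTS step1;
CREATE TABLE step1 AS
    SELECT * FROM raw WHERE x > 1;
```

DuckDB does support `CREATE OR REPLACE TABLE step1 AS ...`, which at least collapses the pair into one statement.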


Or you nest your queries:

    select second from (select 42 as first, (select 69) as second);
Intermediate steps won't be stored, but until queries take a while to execute it's a nice way to do step-wise extension of an analysis.

Edit: It's a rather neat and underestimated property of query results that you can query them in the next scope.
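Common table expressions give the same step-wise property without the nesting, reading top to bottom like a pipeline; a sketch assuming a source table `raw` with columns `x` and `y`:

```sql
-- Each CTE names an intermediate result the next step can query.
WITH filtered AS (
    SELECT * FROM raw WHERE x > 1
),
summarised AS (
    SELECT SUM(y) AS total FROM filtered
)
SELECT * FROM summarised;
```

To extend the analysis, append another CTE and point the final SELECT at it.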


We all have different definitions of what is difficult. Maybe "annoying" or "bothersome" would have been better words, but the below beats nesting things:

    df |> select(..) |>
        filter(...) |>
        mutate(...) |>
        ...
And every time I've learned something about the intermediate result, I can add another line, or save the result in a new variable and branch my exploration. And I can easily highlight and run any number of steps from step 1 onwards.

Even old-school

    df2 <- df[...]
    df2 <- df2[...]
gives me the same benefit.


Yeah, sure, I do a lot of such things in RAM in Elixir, some Lisp, PHP or, if I must, Python.

But sometimes I happen to have just imported a data set into a SQL client, or I'm hooked into a remote database where I don't have anything but the SQL client. When developing an involved analysis query, nesting also comes in handy sometimes, e.g. to mock away a part of the full query.



Absolutely, if the engine has them and they're not wonky somehow.



