So there is more core work happening on support OLAP but I do think it will take some time.
In the meantime, I think we have all the pieces (storage, query engine, table format) to set up a true OLAP. For instance, I created https://github.com/viggy28/streambed to pressure test this idea.
Currently, Strembed expects REPLICA IDENTITY FULL for getting the before and after value of TOAST column. Since we have the data in object storage, we could populate it without the need for REPLICA IDENTITY FULL. Created an issue https://github.com/viggy28/streambed/issues/25 to track this feature.
Aside from the cost, my major motivation is to keep the infrastructure simple. The data is already there in Postgres, so I didn't want to add another data warehouse. I have also shared my thoughts on where this is heading https://viggy28.dev/article/postgres-gateway-drug/
Hello, I checked ingestr repo, and it is in my bookmark. Small world.
Agree, CDC is like Death by a thousand cuts. I believe Debezium has a Java library.
My initial need was Postgres compatibilty. Wanted to give an endpoint that BI and dashboard teams can use to query as if they are querying a Postgres replica. Added more context here https://news.ycombinator.com/item?id=48350820
Author here. For context, I was the tech lead for the Postgres team at Cloudflare, and this came directly out of a challenge I kept hitting there: BI and dashboard teams needed to run long-running analytical queries, and the answer was always to spin up another bespoke read replica or stand up an ETL dump into an analytical database and query that.
So the question I started with was: what's the fewest components I could get away with? That led to the architecture here — Streambed connects to Postgres as a logical replication subscriber (same mechanism as a read replica) and streams WAL changes straight into Apache Iceberg on S3, queryable from psql via an embedded DuckDB. There are a lot of edge cases to handle, and it's very much early days.
To me being able to query over psql is secondary. I’m fine with any SQL. What is very important is being able to transform the data to better suite analytical queries. That is, define custom transformations, define how data sectioned and what indices available.
Short answer: yes, column-level schema changes sync to Iceberg automatically[0].
Logical replication (pgoutput in v1) doesn't actually stream DDL statements. Instead, Postgres emits a fresh Relation message describing the table's current column layout right before the next change to that table. So we diff that against the last layout we knew and infer what changed.
From there we evolve the Iceberg schema in place: flush any buffered rows under the old schema first, then write a new metadata version with the change. What's handled today:
- ADD COLUMN — new field ID allocated; the column's Postgres DEFAULT is carried into Iceberg's initial-default/write-default, so existing rows read back correctly
- DROP COLUMN — removed from the current schema, existing data files untouched
- Type widening — int4→int8, float4→float8 (the changes Iceberg considers compatible)
- REPLICA IDENTITY changes
Hey vira28, thanks a lot for your work. This is a very promising project because other alternative like supabase/etl, Kuvasz-streamer, Sequin all have some subtle issues.
2) For a planetscale cluster are the replication slots on primary or the follower nodes?
I'm asking because isn't setting up slots on primary riskier than setting them on replicas/followers? Because If you have them primary In case of WAL buildup your primary will go down?
From what I understand Ducklake needs a dedicated metadata database and it also ties to DuckDB land wherease with Iceberg many engines can query directly.
You are already using Postgres in your setup so why not just use Postgres itself as metadata database? It is a much better setup than using Iceberg [1].
> and it also ties to DuckDB land wherease with Iceberg many engines can query directly.
Ducklake can also be queried by many engines [2]. Though not as exhaustive as Iceberg.
noob question here from someone who ony played a bit with iceberg and trino: what's the reason to do the analytics stil inside the postgres -- is it so that you don't eat up the IOPS/bandwidth of the main postgresql disks?
reply