vira28 · 2026-06-15T05:09:34 1781500174

Building https://github.com/viggy28/streambed - a full Postgres running on S3 using DuckDB as the query engine.

vira28 · 2026-06-14T20:02:12 1781467332

Certainly interested.

vira28 · 2026-06-11T07:04:18 1781161458

Agree. Appreciate the reminder.

vira28 · 2026-06-11T06:20:17 1781158817

Tomas Vondra, a major Postgres contributor, recently revived a thread on using Bloom filters - https://www.postgresql.org/message-id/flat/5cd8c20c-14b5-4b0...

So there is more core work happening on supporting OLAP, but I do think it will take some time.

In the meantime, I think we have all the pieces (storage, query engine, table format) to set up a true OLAP stack. For instance, I created https://github.com/viggy28/streambed to pressure-test this idea.

vira28 · 2026-06-03T02:47:06 1780454826

The challenge with any CDC is making it reliable. Curious, how are you exporting to S3: Debezium, some AWS service, or a homegrown tool?

vira28 · 2026-06-02T05:47:32 1780379252

Currently, Streambed expects REPLICA IDENTITY FULL to get the before and after values of TOASTed columns. Since we already have the data in object storage, we could populate those values without needing REPLICA IDENTITY FULL. Created an issue, https://github.com/viggy28/streambed/issues/25, to track this feature.
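To illustrate the idea, here is a minimal sketch, not Streambed's actual code. pgoutput marks a TOASTed column that an UPDATE did not touch as "unchanged" instead of sending its value, so the missing value can in principle be copied from the row version already stored. The names `UNCHANGED_TOAST`, `fill_unchanged_toast`, and `stored_row` are hypothetical, and how the stored row would be looked up in object storage is left out.

```python
# Hypothetical sentinel standing in for pgoutput's "unchanged TOAST" column marker.
UNCHANGED_TOAST = object()


def fill_unchanged_toast(new_row, stored_row):
    """Return new_row with unchanged-TOAST placeholders filled from stored_row.

    new_row:    column name -> value decoded from the UPDATE message
    stored_row: column name -> value of the row version already stored
                (assumed to have been fetched by primary key beforehand)
    """
    merged = {}
    for col, value in new_row.items():
        if value is UNCHANGED_TOAST:
            if col not in stored_row:
                # Without a prior value there is nothing to backfill from.
                raise KeyError(f"no stored value for unchanged TOAST column {col!r}")
            merged[col] = stored_row[col]
        else:
            merged[col] = value
    return merged


stored = {"id": 1, "title": "old", "body": "very long text"}
update = {"id": 1, "title": "new", "body": UNCHANGED_TOAST}
merged = fill_unchanged_toast(update, stored)
```

Here `merged` keeps the new `title` and reuses the stored `body`.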

vira28 · 2026-06-01T07:27:21 1780298841

Aside from the cost, my major motivation is to keep the infrastructure simple. The data is already there in Postgres, so I didn't want to add another data warehouse. I have also shared my thoughts on where this is heading https://viggy28.dev/article/postgres-gateway-drug/

vira28 · 2026-06-01T02:40:40 1780281640

Both projects are relevant. Curious, what kind of pushdown capabilities were you looking for?

vira28 · 2026-05-31T23:51:33 1780271493

Hello, I checked out the ingestr repo, and it is in my bookmarks. Small world.

Agree, CDC is like death by a thousand cuts. I believe Debezium has a Java library.

My initial need was Postgres compatibility. I wanted to give BI and dashboard teams an endpoint they can query as if they were querying a Postgres replica. Added more context here: https://news.ycombinator.com/item?id=48350820

vira28 · 2026-05-31T23:23:51 1780269831

Author here. For context, I was the tech lead for the Postgres team at Cloudflare, and this came directly out of a challenge I kept hitting there: BI and dashboard teams needed to run long-running analytical queries, and the answer was always to spin up another bespoke read replica or stand up an ETL dump into an analytical database and query that.

So the question I started with was: what's the fewest components I could get away with? That led to the architecture here — Streambed connects to Postgres as a logical replication subscriber (same mechanism as a read replica) and streams WAL changes straight into Apache Iceberg on S3, queryable from psql via an embedded DuckDB. There are a lot of edge cases to handle, and it's very much early days.

Welcome any feedback.

kikimora · 2026-06-01T01:08:48 1780276128

To me, being able to query over psql is secondary. I'm fine with any SQL. What is very important is being able to transform the data to better suit analytical queries. That is, define custom transformations, define how the data is sectioned, and define what indices are available.

erikcw · 2026-06-01T05:02:54 1780290174

Thanks for releasing this! How do you handle DDL queries? Are table changes synchronized to the Iceberg table automatically?

Also, I recently started looking into olake[0] to serve the same purpose. What would you say differentiates Streambed?

[0] https://github.com/datazip-inc/olake

vira28 · 2026-06-01T07:22:03 1780298523

Thanks for the kind words!

Short answer: yes, column-level schema changes sync to Iceberg automatically[0].

Logical replication (pgoutput in v1) doesn't actually stream DDL statements. Instead, Postgres emits a fresh Relation message describing the table's current column layout right before the next change to that table. So we diff that against the last layout we knew and infer what changed.

From there we evolve the Iceberg schema in place: flush any buffered rows under the old schema first, then write a new metadata version with the change. What's handled today:

  - ADD COLUMN — new field ID allocated; the column's Postgres DEFAULT is carried into Iceberg's initial-default/write-default, so existing rows read back correctly
  - DROP COLUMN — removed from the current schema, existing data files untouched
  - Type widening — int4→int8, float4→float8 (the changes Iceberg considers compatible)
  - REPLICA IDENTITY changes

[0] https://github.com/viggy28/streambed/pull/21
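The diffing step described above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: `Column`, `diff_layout`, and `SAFE_WIDENINGS` are hypothetical names, columns are matched by name only, and REPLICA IDENTITY changes are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    type_name: str  # Postgres type name, e.g. "int4" or "text"


# The widenings listed in the comment above; anything else is flagged.
SAFE_WIDENINGS = {("int4", "int8"), ("float4", "float8")}


def diff_layout(old, new):
    """Compare two column layouts and return a list of inferred changes."""
    old_by_name = {c.name: c for c in old}
    new_by_name = {c.name: c for c in new}
    changes = []

    # Columns present only in the new layout were added.
    for col in new:
        if col.name not in old_by_name:
            changes.append(("add", col.name, col.type_name))

    # Columns present only in the old layout were dropped.
    for col in old:
        if col.name not in new_by_name:
            changes.append(("drop", col.name, col.type_name))

    # Columns present in both layouts may have changed type.
    for col in new:
        prev = old_by_name.get(col.name)
        if prev is not None and prev.type_name != col.type_name:
            pair = (prev.type_name, col.type_name)
            kind = "widen" if pair in SAFE_WIDENINGS else "unsupported"
            changes.append((kind, col.name, f"{pair[0]}->{pair[1]}"))

    return changes


old_layout = [Column("id", "int4"), Column("note", "text")]
new_layout = [Column("id", "int8"), Column("created", "timestamptz")]
changes = diff_layout(old_layout, new_layout)
```

One limitation of matching by name: a renamed column looks like a drop plus an add. How Streambed treats renames is not stated here.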

saxenaabhi · 2026-06-01T12:16:08 1780316168

Hey vira28, thanks a lot for your work. This is a very promising project because other alternatives like supabase/etl, Kuvasz-streamer, and Sequin all have some subtle issues.

A few questions: 1) For a Supabase project, can we set up the replication slot on a replica instead of the primary? https://sequinstream.com/docs/reference/databases#using-sequ...

2) For a PlanetScale cluster, are the replication slots on the primary or the follower nodes?

I'm asking because isn't setting up slots on the primary riskier than setting them up on replicas/followers? If they are on the primary, won't a WAL buildup take your primary down?

vira28 · 2026-06-02T04:51:14 1780375874

Welcome. To keep the primary from running out of disk space, you can configure max_slot_wal_keep_size: https://www.postgresql.org/docs/17/runtime-config-replicatio...
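The trade-off with that setting is that once a slot falls further behind than the limit, Postgres invalidates it and the consumer has to resync. A rough sketch of how a monitor might react to the `wal_status` column of `pg_replication_slots` (PostgreSQL 13+) is below. The function name and the suggested actions are assumptions for illustration, and the query is shown only as a string because running it needs a database driver.

```python
# Query a monitor could run; executing it requires a Postgres driver (not shown).
SLOT_QUERY = "SELECT slot_name, wal_status, safe_wal_size FROM pg_replication_slots"

# wal_status values documented for pg_replication_slots, mapped to
# hypothetical actions chosen for this sketch.
ACTIONS = {
    "reserved": "ok",        # retained WAL is within max_wal_size
    "extended": "watch",     # beyond max_wal_size, but WAL is still retained
    "unreserved": "urgent",  # required WAL may be removed at the next checkpoint
    "lost": "resync",        # required WAL is gone; the slot is unusable
}


def slot_action(wal_status):
    """Map a slot's wal_status to a suggested action."""
    try:
        return ACTIONS[wal_status]
    except KeyError:
        raise ValueError(f"unexpected wal_status: {wal_status!r}")


# Example rows as (slot_name, wal_status); the slot names are made up.
rows = [("streambed_slot", "extended"), ("old_slot", "lost")]
report = {name: slot_action(status) for name, status in rows}
```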

Since Supabase is vanilla Postgres, Streambed should work with a replica as the source.

Regarding PlanetScale, I haven't looked at their offerings yet.

Where do you host your DB currently? Happy to try out with that provider as the source.

ashtuchkin · 2026-06-01T00:18:05 1780273085

Just wanted to say thank you! Very relevant to our use cases. I'll report if I find any issues.

vira28 · 2026-06-01T07:08:01 1780297681

Welcome. Would love to hear your experience. Feel free to share here or in the repo. Fully open source.

kshri24 · 2026-06-01T07:36:45 1780299405

> streams WAL changes straight into Apache Iceberg on S3, queryable from psql via an embedded DuckDB

Why not use Ducklake instead of Apache Iceberg? Wouldn't that simplify the architecture substantially?

vira28 · 2026-06-04T05:30:03 1780551003

From what I understand, Ducklake needs a dedicated metadata database, and it also ties to DuckDB land, whereas with Iceberg many engines can query directly.

kshri24 · 2026-06-04T06:41:06 1780555266

You are already using Postgres in your setup, so why not just use Postgres itself as the metadata database? It is a much better setup than using Iceberg [1].

> and it also ties to DuckDB land, whereas with Iceberg many engines can query directly.

Ducklake can also be queried by many engines [2], though the list is not as exhaustive as Iceberg's.

[1]: https://www.youtube.com/watch?v=-PYLFx3FRfQ

[2]: https://ducklake.select/docs/stable/#list-of-ducklake-client...

raducu · 2026-06-01T05:18:16 1780291096

> queryable from psql via an embedded DuckDB.

Noob question here from someone who has only played a bit with Iceberg and Trino: what's the reason to do the analytics still inside Postgres? Is it so that you don't eat up the IOPS/bandwidth of the main PostgreSQL disks?

alex_hirner · 2026-06-01T07:48:43 1780300123

How does it compare to https://github.com/supabase/etl ?

vira28 · 2026-06-04T05:32:50 1780551170

The idea is pretty similar. As per their README, Iceberg support is deprecated.

iamcreasy · 2026-06-01T05:21:01 1780291261

Very cool! What would a 10,000-foot solution look like for MySQL to Iceberg on S3?

vira28 · 2026-06-01T07:07:08 1780297628

Should be fairly doable using a binlog-based producer like https://github.com/go-mysql-org/go-mysql.

BodyCulture · 2026-06-01T08:53:58 1780304038

Why are your queries slow?