lmsp's comments

This is what Apache Pulsar (https://pulsar.incubator.apache.org/) already provides: infinite streaming storage with a simple, flexible messaging and streaming API, plus Kafka compatibility.


I do think it is technically possible, based on my understanding of Pulsar:

- Pulsar provides a failover subscription mode, which seems to be the equivalent of partition rebalancing within a consumer group in Kafka. https://pulsar.incubator.apache.org/docs/latest/getting-star...

- It has partitioned topics as well.

- It supports idempotent producing and has effectively-once delivery semantics.

It seems to have all the primitives that Kafka Streams needs.
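
To make the failover-subscription point concrete, here is a toy sketch of the idea: each partition has exactly one active consumer, and a standby takes over when the active one goes away. The class name `FailoverSubscription` and the assignment rule (sorted consumer names, partition index modulo consumer count) are my own assumptions for illustration, not a description of how the Pulsar broker actually picks the active consumer.

```python
class FailoverSubscription:
    """Toy model: one active consumer per partition, the rest on standby.

    The selection rule below is an assumption made for illustration only.
    """

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.consumers = []  # names of connected consumers, kept sorted

    def connect(self, name):
        if name not in self.consumers:
            self.consumers.append(name)
            self.consumers.sort()

    def disconnect(self, name):
        self.consumers.remove(name)

    def active_consumer(self, partition):
        # No consumers connected means nobody is active for this partition.
        if not self.consumers:
            return None
        return self.consumers[partition % len(self.consumers)]

    def assignment(self):
        return {p: self.active_consumer(p) for p in range(self.num_partitions)}


sub = FailoverSubscription(num_partitions=4)
sub.connect("c1")
sub.connect("c2")
before = sub.assignment()

# "c1" fails: its partitions fail over to the remaining consumer.
sub.disconnect("c1")
after = sub.assignment()
```

Here `before` spreads the four partitions across `c1` and `c2`, and `after` has `c2` active on all of them, which is roughly the behavior a consumer-group rebalance gives you in Kafka.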


Kafka Streams is great, but it is a processing library, so comparing it to a messaging system like Pulsar isn't apples-to-apples. First, there are already many mature stream processing engines and libraries (Spark, Flink, Heron, Storm). They have been in production for years, and it makes no sense to write a new one without a good reason. Second, Kafka Streams could have done a better job of not being tightly coupled to Kafka. Technically, nothing in the Kafka Streams design is really specific to Kafka. If Confluent had done a better job on abstraction, I would expect it to be very easy to plug in different messaging systems or log stores to run "Kafka Streams". Although I am not sure Confluent wants to see that happen.

“Kafka has exactly-once delivery but messaging system x/y/z doesn’t provide it” is also confusing and misleading. Exactly-once is technically effectively-once: at-least-once delivery, plus making the processing of messages idempotent or de-duplicated. This has been done in the industry for decades and is available in mature stream processing engines like Heron and Flink. It isn’t really a new thing. And many messaging systems like Pulsar already provide those primitives (e.g. idempotent producing, at-least-once delivery), so processing jobs can achieve effectively-once very easily. The Streamlio folks did a great job of explaining exactly-once and effectively-once. It is worth checking out this blog post: https://streaml.io/blog/exactly-once/
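
As a rough illustration of "at-least-once plus de-duplication", here is a minimal sketch. Everything in it is hypothetical (the `DedupingProcessor` class, the `at_least_once` generator, the message shape `(source, seq, amount)`), and it assumes messages from each source arrive in sequence order. A real system would also have to persist the last-seen sequence id atomically with the processing result.

```python
class DedupingProcessor:
    """Applies each (source, seq) at most once, even if it is delivered twice."""

    def __init__(self):
        self.last_seq = {}  # source -> highest sequence id already applied
        self.total = 0      # the "effect" of processing: a running sum

    def process(self, source, seq, amount):
        # Assumes per-source in-order delivery, so anything at or below
        # the last applied sequence id is a redelivery and can be dropped.
        if seq <= self.last_seq.get(source, -1):
            return False
        self.total += amount
        self.last_seq[source] = seq
        return True


def at_least_once(messages, duplicate_every=2):
    """Yield every message, redelivering some of them (simulated lost acks)."""
    for i, msg in enumerate(messages):
        yield msg
        if i % duplicate_every == 0:
            yield msg


messages = [("producer-a", seq, 10) for seq in range(5)]
deliveries = list(at_least_once(messages))

processor = DedupingProcessor()
applied = [processor.process(*msg) for msg in deliveries]
```

The transport delivers more messages than were sent, but the running total only reflects each message once. That is all "effectively-once" means here.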

I think Pulsar itself, as a distributed messaging system, does provide all three delivery semantics: at-most-once, at-least-once and effectively-once. It is very easy for people to use and integrate. I don’t think it is technically difficult to make Kafka Streams run on Pulsar. The question is more: is there value in doing that, do the Kafka folks want to do that, and can the collaboration happen in the ASF?

That's just my two cents.


I think it goes beyond partition rebalancing. What people often don't realize is the benefit of making the message broker `stateless`. It is actually much better at reacting to failures or shifting traffic, which is critical when running a messaging bus for online services, because it doesn't have to wait for a whole partition's data to be copied when an error occurs.


Agreed, the architecture page and other blog posts do a more thorough job of explaining the details. Having a stateless broker layer on top of a focused data layer makes all the operations much easier. I expect BookKeeper to further integrate with the various cloud storage APIs so it can also start to become a stateless cache.


Another point on 'rebalancing': when Kafka rebalances partitions, it has to copy all the data for the partitions that are moved. That might not be a big problem when retention is short. However, it gets much worse when the retention period is longer: rebalancing can exhaust all the bandwidth (both network and I/O) in the cluster. People don't realize this until they want to grow the cluster (adding more brokers) to support increased traffic.
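
A back-of-the-envelope sketch of why retention matters here. All the numbers are made up for illustration (100 MB/s write rate, a quarter of the data moving, 50 MB/s of spare bandwidth for copying), and the model ignores replication factor and compression.

```python
def retained_mb(write_rate_mb_s, retention_hours):
    """Data kept on disk at steady state, in MB."""
    return write_rate_mb_s * retention_hours * 3600


def copy_time_hours(write_rate_mb_s, retention_hours,
                    moved_fraction, spare_bandwidth_mb_s):
    """Hours needed to copy the moved partitions at the given spare bandwidth."""
    to_copy = retained_mb(write_rate_mb_s, retention_hours) * moved_fraction
    return to_copy / spare_bandwidth_mb_s / 3600


# Hypothetical cluster: 100 MB/s of writes, 1/4 of the data moves to a
# newly added broker, and 50 MB/s can be spared for the copy.
short_retention = copy_time_hours(100, 6, 0.25, 50)        # 6 hours retained
long_retention = copy_time_hours(100, 7 * 24, 0.25, 50)    # 7 days retained
```

Under these assumptions the copy takes about 3 hours with 6-hour retention and about 84 hours with 7-day retention. The cost scales linearly with how much history each partition holds, which is exactly the problem when you are adding brokers because the cluster is already short on bandwidth.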


Actually, Pulsar just added support for the Kafka 0.10 API in the recent Apache release. It might be worth checking out as a drop-in replacement. https://github.com/apache/incubator-pulsar/releases/tag/v1.2...


Another side note: Pulsar has supported the Kafka API since 1.20.0 - https://github.com/apache/incubator-pulsar/releases/tag/v1.2...


You might want to check out https://pulsar.apache.org/, a durable, low-latency pub/sub system. It also has a Kafka API client.

