You’re basically pitching this as a more complicated version of Airflow that does much the same thing, but slightly differently, and scales better?
… but your core dependencies are a Kafka cluster and an Elasticsearch cluster, which are both a pain in the ass to scale; so really, could you run this seriously without a really expensive hosted cloud instance of both of those?
This kind of wording:
> Since the application is a Kafka Stream, the application can be scale infinitely
Is a major turn off to me.
Kafka cannot scale infinitely. Nothing can. In fact, Kafka can be a pain in the ass to scale.
It makes me question some of the other commentary on the project.
As long as we're airing pet peeves, mine is about over-literal misunderstandings:
> Kafka cannot scale infinitely. Nothing can.
It is very common that when a phrase can't be literally true, it signals a metaphorical meaning. E.g., if a teen tells you their new teacher is a million years old, it's not a literal statement of age. Similarly, nobody expects "scale infinitely" to mean, as in Universal Paperclips, that we'll be converting whole galaxies into Kestra clusters. It means that any bottlenecks are external to the system.
> Similarly, nobody expects "scale infinitely" to mean, as in Universal Paperclips, that we'll be converting whole galaxies into Kestra clusters.
I disagree. I've worked with plenty of people that would probably take this statement at face value and assume you could scale to a completely arbitrary amount of load with no marginal effort.
Plus, there's a difference in context here: we're talking about a technical product. It doesn't hurt to be precise and technical in your description of it, does it? This is the most likely setting in which someone might interpret something literally.
Aren't those the same people who would assume that of many technologies whether or not the word "infinite" was used?
I do agree there's a difference in context, but for me it goes the other way. I'd expect pretty much anybody in a technical audience to know technical basics. For me that's a big part of the fun in writing on HN, in that it's not obligatory to dumb my points down just to coddle the clueless.
I'm not pitching this as a more complicated version of Airflow; rather, I think it's simpler than Airflow on the UX side: we use declarative flows defined in YAML rather than Python code.
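For illustration, a minimal declarative flow might look roughly like this (the task type and property names follow Kestra's YAML style but are illustrative; exact identifiers vary by version):

```yaml
# Illustrative Kestra-style flow definition — a flow is an id, a namespace,
# and an ordered list of typed tasks, with no Python code involved.
id: hello-world
namespace: dev.example
tasks:
  - id: say-hello
    type: io.kestra.core.tasks.log.Log
    message: Hello from a declarative flow
```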
I agree with you that Kafka & Elasticsearch can be a pain to scale if you need horizontal and vertical scaling.
On the other hand, on a single machine it's really easy to set up. With that, you have the same scaling story as Airflow, for example, since Airflow depends on a non-scalable database (MySQL or PostgreSQL). But the advantage you have with Kestra is that you can scale your backend to multiple nodes (Kestra allows scaling all of its services). With a standard database, when you hit its limit, you're stuck.
And yes, clearly "infinite scale" is not a literal statement; nothing can scale infinitely. But since the architecture is really robust (and scalable), the issues will be in areas other than Kestra itself (cloud limits, database overload, ...).
A final and more important point: the backends in Kestra are all pluggable, since Kestra is really designed as a set of modules. Look at the directories here: https://github.com/kestra-io/kestra :
- runner-kafka & runner-memory are two runner implementations of Kestra; you can add a new one that uses Redis, Pulsar, ...
- repository-elasticsearch & repository-memory follow the same pattern; you can implement another one. I started a JDBC implementation that I haven't had time to finish for now: https://github.com/kestra-io/kestra/pull/368
But using a proper programming language to define dependencies is one of airflow's main advantages! I'd even go so far as to say you're not using it to its full potential if you're not writing code to infer complicated dependencies programmatically.
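As a toy sketch of that point (plain Python, not the actual Airflow API), defining the graph in code means dependencies can be inferred from data instead of written out by hand; the table names and spec here are made up for illustration:

```python
# Toy sketch: inferring task dependencies programmatically, the kind of
# thing a Python-defined DAG makes easy and a static YAML file makes awkward.

# Hypothetical spec: each "table" task lists its upstream tables.
tables = {
    "users": [],                   # no upstream dependencies
    "orders": ["users"],           # orders joins against users
    "report": ["users", "orders"]  # report reads both
}

def build_dag(deps):
    """Return {task: set(upstream_tasks)} from a declarative-ish spec."""
    return {name: set(upstream) for name, upstream in deps.items()}

def topological_order(dag):
    """Order tasks so every task runs after all of its upstreams."""
    order, done = [], set()
    while len(done) < len(dag):
        ready = [t for t, ups in dag.items() if t not in done and ups <= done]
        if not ready:
            raise ValueError("cycle detected")
        for t in sorted(ready):
            order.append(t)
            done.add(t)
    return order

dag = build_dag(tables)
print(topological_order(dag))  # ['users', 'orders', 'report']
```

In real Airflow, the same loop would emit operators and wire them with `>>` edges instead of building dict entries.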
That hasn't been our experience. Our Elasticsearch cluster has been a pain in the ass since day one, with the main fix always being "just double the size of the server," to the point where our ES cluster ended up costing more than our entire AWS bill pre-ES (though that might be our limited experience). By contrast, something like Postgres has required nearly zero maintenance apart from adding the occasional index, and even that has been due to tuning, not the DB falling over.
Both are AWS hosted products (RDS, AWS Elasticsearch).
Easiest database to scale is a pretty low bar. Databases are typically really hard to scale and Elasticsearch is no exception. Aside from the issue of ease, one thing that has been universally true for me is that Elasticsearch is incredibly expensive to scale in terms of compute costs.
Elasticsearch has built in horizontal scaling abilities, unlike Postgres/other SQL databases. It also has integrations with cloud providers for peer discovery, or can use DNS. Once a new data node is detected and reachable, the masters will start sending it shards of data, distributing the load. This all happens without any user intervention. I can't really speak to cost, it is somewhat easy to blow up the memory usage in Elastic for sure, but I can't say its been more expensive than similarly sized Postgres clusters.
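As a sketch of the discovery piece described above: DNS/host-based peer discovery is plain configuration in `elasticsearch.yml` (the hostnames below are placeholders):

```yaml
# elasticsearch.yml — illustrative; hostnames are placeholders.
cluster.name: my-cluster
# Seed hosts each node contacts to discover the rest of the cluster.
discovery.seed_hosts:
  - es-node-1.internal
  - es-node-2.internal
# Only needed once, when bootstrapping a brand-new cluster.
cluster.initial_master_nodes:
  - es-node-1
```

A new data node started with the same seed hosts joins the cluster, and the masters begin rebalancing shards onto it without further intervention.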
Right, GB for GB ES is much easier to scale than Postgres (or any other DB) but probably also more expensive since ES is much more memory and compute hungry. But I can't say I have an apples-to-apples comparison since the use case for ES is usually "dump massive amounts of raw data in and index everything" which you wouldn't typically do with a Postgres instance. But in places where we have run large ES clusters my experience has not really been that it works without any user intervention (at least once you reach a certain scale) and that it involved a lot of operational support. Not that any other solution with comparable features would have been easier necessarily but still not easy in any absolute sense.
I go hilariously out of my way to eliminate Elasticsearch at any org I join, usually because it's only being used for logs, and modern tools like Loki are immeasurably easier to scale and cheaper to run. But I also find many, many developers using it don't know about time-series databases, or anything at all about which data structures go in which kind of database, and just dump everything into a horrifically organized search database. It's at least one order of magnitude worse to scale and operate than a Mongo-type NoSQL database being used incorrectly by a developer who doesn't know any better, and two orders of magnitude worse than a SQL database being used incorrectly by a developer who doesn't know any better.
Loki's fine if you are very cost sensitive and are comfortable with Prometheus, but it's not really a replacement for a text-search database like Elasticsearch. It also scales about the same, both being horizontally scalable (I'm not sure what Loki's sharding strategy is). Our ELK stack runs on three 2 CPU / 8 GB RAM nodes totaling about $160 a month and can handle 50+ million records or so (I haven't run it to its absolute limit). This is a comfortable price-to-performance ratio for us, and I imagine for many other companies.
I think people who have issues scaling any modern distributed data stack have them because they a) don't have experts or b) have bad practices / are stretching the use case. I worked on a project once where the ES cluster's performance was degrading because they kept increasing the number of fields. At some point, they had more than 5k fields for a single document schema, even though the ES docs mention that going over the limit (1k) is not a good idea. I mean, if big tech companies can manage clusters of hundreds of nodes for any of these data stacks, I'm sure your scaling issues aren't because of the tool.
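For reference, that ceiling is an explicit index setting, `index.mapping.total_fields.limit`, which defaults to 1000. It can be raised in the settings body sent when creating an index, though needing to is usually a modeling smell:

```json
{
  "settings": {
    "index.mapping.total_fields.limit": 2000
  }
}
```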
Easy/hard depends on the experience of the user. Someone with a lot of experience with Elasticsearch will have an easy time scaling Elasticsearch and a hard time scaling Kafka, and vice versa.
Better to compare how complex they are to scale in terms of actions required.
Agreed that both are expensive to scale across multiple nodes. But keep in mind, you can use Kestra with a single node (like others do with a database such as MySQL).
Just don't go multi-node if the project doesn't need it. But when you do need to, Kestra lets you add nodes and scale.