Faktory, a new background job system (mikeperham.com)
194 points by mperham on Oct 24, 2017 | hide | past | favorite | 53 comments


It seems nuts to me to introduce a stateful queue system like this where the persistence story is “data’s out there on a single node in some kind of disk format, hope it doesn’t go missing!”

I guess this style significantly reduces setup friction, but it's almost an irresponsible design in a universe where the average cloud provider is telling you, "your compute instances may vanish at any time." If this used Kafka, Redis, MySQL, or another well-known stand-alone data store, I know my data is replicated, and I can recover my enqueued jobs if Something Bad happens.

I like the nice UI and the simple API, but there’s no way this will replace Resque or any Kafka-based-job-whatever at my place of work.


It supports backup and restore today but RocksDB doesn't have a replication story at the moment.

Side note: Resque will lose jobs if it crashes, it doesn't use RPOPLPUSH. I hope you'll take another look at Faktory in a few months, maybe we'll have addressed your issues.


Using RPOPLPUSH there's no guarantee you won't lose the job if the Redis server crashes prior to publishing it. Even with persistence enabled (AOF) there's still a small window of time between receiving the message and the next AOF fsync. The sender would think the message is queued as it got an OK response back but the message will be lost.

There are three common solutions to this:

First is to pretend it doesn't exist (quite common!).

Second is to understand that losing messages is a possibility and only use it for message types that the loss of a message wouldn't be critical (which covers quite a bit).

The third solution, which actually addresses the problem, separates the persistence of message details into a transactional store from the event notification of a new message. A "sweeper" type task can then check the persisted message list for messages that haven't been processed and re-publish them.
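A minimal sketch of that third pattern in Python, using sqlite3 as a stand-in for the transactional store and a plain list as a stand-in for the (lossy) notification channel; all names here are illustrative, not from any real library:

```python
import sqlite3
import time
import uuid

# Transactional store: the durable source of truth for messages.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE messages (
    id TEXT PRIMARY KEY,
    body TEXT NOT NULL,
    enqueued_at REAL NOT NULL,
    processed_at REAL
)""")

notifications = []  # stand-in for the best-effort notify channel (e.g. pub/sub)

def publish(body):
    # Persist first (durable), then notify (best-effort).
    msg_id = str(uuid.uuid4())
    with db:
        db.execute(
            "INSERT INTO messages (id, body, enqueued_at) VALUES (?, ?, ?)",
            (msg_id, body, time.time()))
    notifications.append(msg_id)  # this step is the part that may be lost
    return msg_id

def sweep(older_than_secs=60):
    # Re-publish messages that were persisted but never processed -- this
    # covers the case where the notification was lost before a worker saw it.
    cutoff = time.time() - older_than_secs
    rows = db.execute(
        "SELECT id FROM messages WHERE processed_at IS NULL AND enqueued_at < ?",
        (cutoff,)).fetchall()
    for (msg_id,) in rows:
        notifications.append(msg_id)
    return [r[0] for r in rows]

# Simulate a lost notification: persist a message, then drop the notify.
lost_id = publish("send_welcome_email")
notifications.clear()
swept = sweep(older_than_secs=-1)  # negative window just for the demo
print(swept)  # the lost message is re-published
```

The key ordering is persist-then-notify: a crash between the two loses only the notification, which the sweeper recovers, rather than the message itself.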

(Disclaimer: I'm a huge fan of Redis and think it's the bee's knees of data structure servers.)


You mean RDB? I thought AOF can be configured to fsync at every write.


The default is to fsync once per second so the failure window is small but it's not zero.

It's possible to fsync on every write[1] but it may be too slow as the single threaded nature of Redis means you've serialized every write operation. Plus that'd apply to all usage of that Redis server, not just the queue.

[1]: https://redis.io/topics/persistence#how-durable-is-the-appen...
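For reference, the trade-off lives in a couple of redis.conf directives:

```
appendonly yes

# Default: fsync the AOF at most once per second. Fast, but a crash can
# lose up to roughly the last second of acknowledged writes.
appendfsync everysec

# Durable on every write, at a significant throughput cost for the whole
# server, not just the queue:
# appendfsync always
```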


You can run a Redis cluster.


Fair, and I will :-) Thanks for your response. Are you married to RocksDB/instance storage, or are you considering adapters or something to a pluggable persistence layer?


Do you have plans for multi-node persistence / geographic redundancy? We'd love something like this, but right now it would probably be easier to roll something custom on top of either Kafka or CockroachDB…


You can just use Redis, Kafka, or better yet use Google's Cloud Pub/Sub or Azure Service Bus for an extremely cheap and highly-reliable system that already has all of these features included.

What exactly is the use-case for an entirely new and separate system that looks like it's single-node only?


I had a similar question, given that I'm happily processing jobs in Elixir that are enqueued by a Ruby app via Sidekiq. After taking a look at the READMEs, it looks like what this gets you is dumb clients, with the logic around retries, etc. built into the server, rather than forcing the client to handle these details. That's great for polyglot systems—the dumber the client the better, in my book. As another commenter posted, I'll be interested to see more about what tradeoffs it makes in distribution.

If your current system is already something like SNS/SQS or an actual message queue with acks, then this probably isn't aimed at you.


Why run something extra when you can use a single database table which is already universally accessible by any dumb client?

Push, fetch, ack = insert, select, delete. A few lines of SQL or a stored procedure gets it done. Postgres basically made SKIP LOCKED for queues.


It's a valid way to go, but does SKIP LOCKED handle retries for you? There's more to a queue than just the message.

Also worth noting, Postgres is also another piece of infrastructure to run, and not all applications use it.


The whole point is that Faktory is another piece of infrastructure, with arguably little value when you already have all of the features within Redis or a SQL database which you're probably running anyway. Sidekiq itself requires Redis.

For Postgres, SKIP LOCKED is the easiest way to queue because the row only gets deleted if the transaction commits, which is when the job finishes. Otherwise, if there's an error, you don't commit (or the worker just crashes), the row remains in the queue, and another worker process tries it again.
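A minimal sketch of that pattern (table and column names are illustrative, not from any particular library):

```sql
-- Push: one row per job.
INSERT INTO jobs (id, payload) VALUES ('job-123', '{"args": [1, 2]}');

-- Fetch + ack: the DELETE only takes effect at COMMIT. SKIP LOCKED lets
-- other workers pass over rows a concurrent transaction has already claimed.
BEGIN;
DELETE FROM jobs
WHERE id = (
    SELECT id FROM jobs
    ORDER BY id
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
RETURNING id, payload;
-- ... run the job while the transaction is open ...
COMMIT;   -- success: the row is gone. ROLLBACK or a crash re-queues it.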


With redis you're back to the original problem of needing to code things like retries back into your client. Postgres may have some limited abilities to use it as a message queue but you're still going to be missing some of the other features that will end up in faktory like periodic jobs.

If pg works for your situation, that's great, but I don't know why you'd assume that none of these objections have come up before and been responded to.


If you want the client-side retry logic too then that requires language-specific libraries, and it seems that's what Faktory requires anyway so this whole thing becomes unnecessary.

They could've just ported the Sidekiq library to several languages, or used a single C library with language wrappers, and gotten the same result with less work.


When you peel back a layer or two from the background job onion you land smack dab in the land of message queues. I’ll be curious to see what trade offs this library makes around classic message queue problems such as delivery semantics, visibility windows, sharding or replication for multi node, etc.

There’s definitely a lot of message queues out there these days and things like Kafka which turn into message queues. That said, a lot of them are a pain to operate so there’s room for one that is easy to deploy and operate.


> It uses Facebook's high-performance RocksDB embedded datastore internally to persist all job data, queues, error state, etc

This breaks the devops story for me. The difference in feature set between RocksDB and Redis is not that big. However, Redis is hugely supported on the cloud and in high-availability, fully-managed mode.

It's so convenient to use Sidekiq on Heroku or AWS. I really hope you build this on Redis rather than a new persistence server.


Correct me if I'm wrong, but RocksDB is embedded within Faktory, so this would actually be one less service to manage than using an external Redis server. Isn't that a good thing?

From my experience, when running high throughput, quickly executing sidekiq workers on heroku, the expense often doesn't come from dynos, but the redis instance as the limiting scale factor usually comes down to connections. That won't be a problem with Faktory and an embedded datastore.


This is not the state of mind right now, and it's the same thing that blockchain devs seem to think. Everybody uses the cloud for anything reasonably important. And if it's not important, the difficulty delta between an embedded datastore and "./redis" is minuscule.

Which means that leveldb/rocksdb gives zero path to scalability at the cost of saving a maximum of a few minutes of effort.

At worst, this could have been done in redis with embedded lua packaged together. That would have solved both your external redis problem as well as a reasonable path to scale.


It depends on how comfortable you are developing a monitoring and failover system for something stateful. Right now AWS handles failover of our redis instances. With Faktory, I’m not sure what that looks like yet.


Indeed. It will be interesting to see how it handles fail over in the case of an outage.


Can you provide a bit more detail about the limits you're encountering? Connections? As in the sheer quantity? Or speed to connect? Are you hitting Linux open file limits? I have a tough time understanding how Redis would be the bottleneck among a bunch of Ruby workers.


Specifically on Heroku, the different tiers of Redis set limits on the number of connections that can be made. This means that if you have 5x100 low-memory, low-CPU workers (arbitrary numbers), you'll probably be paying $750/month just to have a Redis instance that supports 5,000 connections. It has less to do with technical limitations and more to do with the pricing of other services, I think.

https://elements.heroku.com/addons/heroku-redis


Holy hell. You need to get off Heroku right away. This is insane.


When Mike released Sidekiq for Crystal-lang, I thought this might be in the works. From the readme:

> The Ruby and Crystal versions of Sidekiq must remain data compatible in Redis. Both versions should be able to create and process jobs from each other. Their APIs are not and should not be identical but rather idiomatic to their respective languages.

It makes sense to unify the protocol for background job processing, making them language agnostic. Some languages tackle different problems better than others, so this will be a really useful tool.

Great work, Mike. Keep the hits rolling.


Mike makes some amazing software in this space. I have some mostly pragmatic concerns about building out a framework that requires some specialized server to handle state for workers, though.

Maybe for something like RabbitMQ or SQS, this would be a satisfactory replacement, since these seem to be relatively monotasked persistence servers. So for your traditional Celery+RabbitMQ deployment, for instance, this could be a good replacement.

But consider cases like Redis, Memcached, or Kafka, where the persistence store we're using is often also serving as a cache or linear log in other parts of the same product. This makes Faktory troublesome, because it introduces additional maintenance costs compared to a service we already need. Furthermore, if I can use an off-the-shelf, hosted storage solution for enqueued job definitions, like we do with ElastiCache, I reduce my operational costs even more.

So it's not a question of whether or not Faktory works, but whether it's worth the cost of building, deploying, and maintaining a specialized monotasker server instance on top of the worker pool I already need to build and maintain. I'd be interested in understanding where the long-term value add would be in most large-scale practical SOAs, and how folks excited about this project anticipate implementing it might go.


For those of you pointing out flaws, remember you are picking apart a pre-1.0, just launched project. We're all hackers and startup people right? Think MVP, no one can ship a perfect system completely finished.


Just remember, grumpy HN comments are .001% of people viewing the homepage and clicking the link to your site.

Your potential customers know better than to get their opinions from reading these comments :)


This is a strong defensive stance, especially coming from the author. Picking things apart with you is better than a "great work, on to the next post" comment.


Aren't the flaws flaws in the concept, rather than the implementation, though? Are those likely to be improved on?


Thanks Mike for creating Sidekiq and this. pre-1.0 is alright with me for showing it on HN.


The only flaw here seems to be a lack of use-case... why not just add an HTTP API to Sidekiq instead, if it doesn't already have it? That way the value of the queue logic and semantics can be offered to any external client.


Sidekiq is not a server, it's a Ruby worker process. Redis is the server and I can't build a fast, easy to use embedded queue system on top of Redis.


Can't Sidekiq just run as a standalone process? It could take a connection string to use an existing Redis/RDBMS, with an option for an embedded server maybe. You can just package it with Redis in a container too.

Although to be honest, the basics of a work queue are well covered now with the evolution of cloud services and other databases and message systems.


Sidekiq is Ruby only. That won't support one of my key goals with Faktory: polyglot. Background jobs can benefit applications written in many languages.

There's many different tools and many different users. No one choice is appropriate for all. I hope some people find Faktory useful.


I’ve used Sidekiq for years. Great project. Looking forward to giving this a swing.

I really respect what Mike’s done being able to monetize his work on cool open source projects.


This looks nice, but I'm also confused by why anyone would build a single-node data store in 2017. I can't find any information about whether replication and HA/failover is planned. Writing a queue on top of RocksDB is arguably trivial; it's the other stuff that is difficult.

The other thing about job systems that a lot of people seem to ignore is client-side scaling. We run our apps on Kubernetes, where you'd naturally want to tune worker scheduling dynamically to accommodate queue size. Feeding custom queue metrics into the horizontal pod autoscaler is one way to do this.
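As a sketch of that last idea, using the autoscaling/v2 HorizontalPodAutoscaler API: the `queue_depth` metric name here is hypothetical, and has to be exported through a custom/external metrics adapter by whatever monitors your queue.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: External
    external:
      metric:
        name: queue_depth    # hypothetical metric exposed via a metrics adapter
      target:
        type: AverageValue
        averageValue: "100"  # aim for ~100 queued jobs per worker pod
```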


Many web applications are so low-traffic that a single node is enough to handle everything: application server, database, queue system, static content, etc. It's up to the business to decide whether they want to spend the money to buy high availability. In my experience almost nobody does, and they've lived for years with only occasional downtime that didn't harm their business.


Yes, but downtime is different from data loss.


On a related matter, how does a job queue differ from a message queue?

What can a job queue do that Kafka or RabbitMQ cannot do?


Background jobs are specialized messages. You can build a decent job system on top of a message queue but you'll lose the specifics. For instance, Rabbit and Kafka won't give you the built in retry system with error tracking and UI. Faktory enforces a specialized message format (the job payload) and can do many things with that data; message queues that treat messages as a simple byte array can't do that.

https://github.com/contribsys/faktory/wiki/The-Job-Payload
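Per that wiki page, a job is a JSON document with a handful of well-known fields the server understands; roughly like this (the values here are made up):

```json
{
  "jid": "8a2b5e4c3f1d9e7a",
  "jobtype": "SendWelcomeEmail",
  "args": [12345, "en"],
  "queue": "default",
  "retry": 25
}
```

Because the server can parse `retry`, `queue`, and so on, it can implement retries, error tracking, and the UI itself, which an opaque byte-array message queue can't.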


There is no difference. A "job" is just a message to a worker process describing what to do and maybe some data or arguments to be used in doing that work.

There are some useful features like priorities, scheduling in the future, claiming and then acknowledging a job as done or returning to pool, etc... but these can all be found in or coded on top of existing systems.
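Priorities and future scheduling, for instance, fall out of a single ordered structure. A toy in-process sketch in Python (illustrative only; real systems add persistence, acks, and visibility timeouts on top):

```python
import heapq
import itertools

_counter = itertools.count()  # tie-breaker so equal keys stay FIFO

def push(heap, job, priority=0, run_at=0.0):
    # Lower tuples sort first: earliest run_at, then highest priority.
    heapq.heappush(heap, (run_at, -priority, next(_counter), job))

def pop_ready(heap, now):
    # Return the next runnable job, or None if nothing is due yet.
    if heap and heap[0][0] <= now:
        return heapq.heappop(heap)[3]
    return None

heap = []
push(heap, "low-priority", priority=1)
push(heap, "high-priority", priority=9)
push(heap, "tomorrow", priority=9, run_at=86400.0)

print(pop_ready(heap, now=0.0))  # high-priority
print(pop_ready(heap, now=0.0))  # low-priority
print(pop_ready(heap, now=0.0))  # None -- "tomorrow" isn't due yet
```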


One of the key features I love beanstalkd for is support for assigning different priorities to jobs in the same queue, so jobs with a higher priority are processed first. Looks like this feature is missing from Faktory.



How does this compare to https://github.com/antirez/disque?


Disque was never officially released and is considered deprecated now. Redis already does well as a queue, v4.0 came with modules which add even more functionality, and future releases will include a new streams datatype, similar to Kafka.


Thanks for sharing this - I've obviously missed these developments.


Looks very similar to beanstalkd, the only difference I can see is that Faktory comes with a bundled GUI. Is there anything else I'm missing?


From the FAQ:

> Faktory aims to be more feature-rich and better supported. Many of Faktory's OSS competitors are "dead" and no longer supported. I am fortunate enough to have both expertise in background jobs and a business model to support Faktory long-term.


Background job systems rarely need extensive support (or, for that matter, a constant stream of features). I've used Gearman for years, to great success.

Not saying that Gearman is a better solution, just saying that "better supported" for such a relatively simple tool is not always necessary.


I (unfortunately) use beanstalkd, and it seems to be pretty dead, at least on the Java client side.

My question is why we even need a 'background job' system; isn't this just a message broker / queue? RabbitMQ (and friends) can do much of this, no? Maybe I'm missing some of the future features they aim to implement.


That it’s written by the same guy behind Sidekiq, a very successful Ruby framework for background jobs.


This would be great as a layer on top of SQS and/or GCE PubSub. And I'd host Faktory in a Lambda or CloudFunction.



