
How do you educate people on stream processing? For pipeline-like systems, stream processing is essential IMO - backpressure, circuit breakers, etc. are critical for resilient systems. Yet I have a hard time building an engineering team that can utilize stream processing, instead of just falling back on synchronous procedures that are easier to understand (but nearly always slower and more error-prone).


It's important to consider whether it's even worth it.

I worked on stream processing; it was fun, but I also believe it was over-engineered and brittle. The customers didn't want real-time data either: they looked at the calculated values once a week, then made decisions based on that.

Then I joined another company that somehow had the money to pay 50-100 people, and they were using CSVs, shell scripts, batch processing, and all that. It solved the clients' needs, and they didn't need to maintain a complicated architecture or code that would otherwise have been difficult to reason about.

The first company, the one with the stream processing, was bought by a competitor at a fire-sale price after I left. Some of the tech was relevant to them, but the stream processing stuff was immediately shut down. The acquiring company had just simple batch processing, and they were printing money in comparison.

If you think it's still worth going with stream processing, give your reasoning to the team; most reasonable developers will learn it if they really believe it's a significantly better solution for the given problem.

Not to over-simplify, but if you can't convince 5 out of 10 people to learn something that would make their job better, then either the people are not up to the task, or you are wrong that stream processing would make a difference.


I agree. Unless the downstream data is going to feed a system making automated decisions (e.g. HFT or ad buying), real-time analytics is rarely worth the cost. It's almost always easier and more robust to accept high tail latencies for data that humans consume, and as computers get faster and faster, that tail latency keeps shrinking anyway.

Systems that needed complex streaming architectures in 2015 could probably be handled today with a fast disk and a large Postgres instance (or BigQuery).


Many successful ads feedback loops run at 15-minute granularity as well!


Yeah that reminds me of a startup I worked at that did real-time analytics for digital marketing campaigns. We went to all kinds of trouble to update dashboards with 5-minute latency, and real-time updates made for impressive sales demos, but I don't think we had a single customer that actually needed to make business decisions within 24 hours of looking at the data.


We were doing TV ads analytics by detecting ads on TV channels and checking web impact (among other things). The only thing is, most of these ads are deals made weeks or months in advance, so customers checked analytics about once before a renewal… so not sure it needed to be near real time…



Personally I think streaming is quite a bit simpler. But as you point out, no one cares!


Batch processing is just stream processing with a really big window ;-). More seriously, I find streaming windows are often the disconnect. Surprisingly often, users don't want windowed results. They want aggregation, filtering, uniqueness, ordering, and reporting over some batch. Or they want to flexibly specify their window / partitioning / grouping for each reporting query. Modern OLAP systems are plenty fast enough to do that on the fly for most use cases - so even older patterns, like stream processing for real-time stats in parallel with batch loads into an OLAP system, aren't worth the complexity. Just query the DB and cache...
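To make the "really big window" quip concrete, here is a minimal Scala sketch (toy data and a made-up Event type, nothing framework-specific) of the same aggregation run per fixed window and then over the whole batch:

  object WindowedVsBatch extends App {
    final case class Event(tsSeconds: Long, value: Double)

    val events = Seq(Event(1, 2.0), Event(42, 3.0), Event(61, 5.0), Event(130, 1.0))

    // Streaming-style: bucket into fixed 60-second windows, aggregate each bucket.
    val perWindow: Map[Long, Double] =
      events.groupBy(_.tsSeconds / 60).view.mapValues(_.map(_.value).sum).toMap

    // Batch-style: one window covering everything.
    val wholeBatch: Double = events.map(_.value).sum

    println(s"per-window sums: $perWindow")
    println(s"whole-batch sum: $wholeBatch")
  }

Same fold, different window size; the hard parts in real systems are state, late data, and reprocessing, not the aggregation itself.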


There are both technical and organizational challenges created by stream processing. I like stream processing and have done a lot of work on high-performance stream engines but I am not blind to the practical issues.

Companies are organized around an operational tempo that reflects what their systems are capable of. Even if you replace one of their systems with a real-time or quasi-real-time stream processing architecture, nothing else in the organization operates with that low of a latency, including the people. It is a very heavy lift to even ask them to reorganize the way they do things.

A related issue is that stream processing systems still work poorly for some data models and often don’t scale well. Most implementations place narrow constraints on the properties of the data models and their statefulness. If you have a system sitting in the middle of your operational data model that requires logic which does not fit within those limitations then the whole exercise starts to break down. Despite its many downsides, batching generalizes much better and more easily than stream processing. This could be ameliorated with better stream processing tech (as in, core data structures, algorithms, and architecture) but there hasn’t been much progress on that front.


Fundamentally I think the question is: what kind of streams are you processing?

My concept of stream processing is processing gigabits to gigabytes a second and turning it into something much, much smaller, so that it's manageable to store in a database and analyze. To my mind, for 'stream processing' even calling malloc is sometimes too expensive, let alone using any of the technologies called out in this tech stack.

I understand backpressure and circuit breakers, but (for my general work) they have to happen at the OS / process level -- a metric that auto-scales a microservice worker after going through Prometheus + an HPA, or something like that, ends up with too many inefficiencies to be practical. A few threads on a single machine just work, whereas a 'cloud native' solution takes ages to engineer.
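To illustrate the "few threads just work" point, a minimal Scala sketch (a hypothetical pipeline, no framework) where a bounded queue between a producer and a consumer thread provides the backpressure: put() blocks when the queue is full, so the producer naturally slows to the consumer's pace.

  import java.util.concurrent.ArrayBlockingQueue

  object BoundedPipeline extends App {
    // Bounded capacity is the backpressure mechanism: put() blocks when full.
    val queue = new ArrayBlockingQueue[Int](1024)

    val producer = new Thread(() => {
      (1 to 100000).foreach(i => queue.put(i)) // blocks whenever the consumer falls behind
      queue.put(-1)                            // poison pill to stop the consumer
    })

    val consumer = new Thread(() => {
      Iterator.continually(queue.take()).takeWhile(_ != -1).foreach { i =>
        if (i % 10000 == 0) println(s"processed $i") // stand-in for real per-item work
      }
    })

    producer.start(); consumer.start()
    producer.join(); consumer.join()
  }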

Once I'm down to a job a second or less (and that job takes more than a few seconds to run, to hide the framework's overhead), things like Airflow start to work rather than just fall flat. But at that point, are these expensive frameworks worth it? I'm only producing 1-1000 jobs a second.

Stream processing with frameworks like Faust, Airflow, Kafka Streams, etc. all just seems like brittle overkill once you start trying to actually deploy and use them. How do I tune the PostgreSQL database for Airflow? How do I manage my S3 lifecycles to minimize cost?

A task queue + an HPA really feels like the right kind of thing to me at that scale, versus caring too much about backpressure, etc. when the data rate is 'low'. But I've generally been told by colleagues to reach for more complicated stream processors that perform worse, are (IMO) harder to orchestrate, and (IMO) harder to manage and deploy.


I appreciate the spirit. Not treason, but definitely anti-making-shit-better.


A betrayal of your duty, your country and your people, but not treason per se.


I use Jira a lot and... there are keyboard shortcuts. There are keyboard shortcuts for Confluence.

I did not know about the Jira shortcuts until recently. Not really a huge deal for me. Seems like Linear's are definitely better, but shrug. Keyboard shortcuts are not the source of my issues (ha!) with issue trackers.

Confluence shortcuts tho. I'm pretty sure I'm the only one in my company that knows them haha. Which says something about Confluence for sure.


Jira used to aggressively interpret keyboard input as a stream of hotkeys, which would introduce entropy into your board. The latest redesign fixed that, but at the cost of being criminally slow.


I've had good success with Akka at multiple companies. At Protenus the core ETL is based on Akka Streams, which has worked great. No notable issues, though we don't run in clustered mode.

At glngn we didn't use Akka Streams (outside of Akka HTTP), but did use event-sourced persistent entities powered by Akka Typed in clustered mode, deployed to k8s. That definitely took effort to set up nicely and enable smooth deployments. Our clusters were not huge, so many problems (split brain) did not show up.

I haven't used other systems that provide an equivalent to Akka Typed persistent entities, so I can't compare Akka in that sense. However, the event-sourced persistent entity model is really, really effective. Definitely a different paradigm: some stuff that would otherwise be a trial becomes trivial.
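For anyone who hasn't seen the model, here is a minimal sketch of an event-sourced entity using Akka Persistence Typed. The Counter entity and its messages are made up for illustration, and actually running it also needs a journal plugin configured, which is omitted here.

  import akka.actor.typed.{ActorRef, Behavior}
  import akka.persistence.typed.PersistenceId
  import akka.persistence.typed.scaladsl.{Effect, EventSourcedBehavior}

  object Counter {
    sealed trait Command
    case object Increment extends Command
    final case class GetValue(replyTo: ActorRef[Int]) extends Command

    sealed trait Event
    case object Incremented extends Event

    final case class State(value: Int)

    def apply(id: String): Behavior[Command] =
      EventSourcedBehavior[Command, Event, State](
        persistenceId = PersistenceId.ofUniqueId(id),
        emptyState = State(0),
        // Commands decide which events to persist (or reply without persisting).
        commandHandler = (state, cmd) => cmd match {
          case Increment         => Effect.persist(Incremented)
          case GetValue(replyTo) => Effect.reply(replyTo)(state.value)
        },
        // Events deterministically evolve the state; they are replayed on recovery.
        eventHandler = (state, evt) => evt match {
          case Incremented => State(state.value + 1)
        }
      )
  }

The appeal is that recovery, snapshotting, and (in clustered mode) sharding of these entities come from the runtime rather than hand-rolled code.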

I suspect many of the benefits could be achieved by using something like Kafka feeding non-clustered nodes, but I never tried it.


I used to work on the Akka team, mostly on Streams, but I've been doing different stuff elsewhere in recent years, not touching Akka much. To be honest, I'm quite emotionally distanced from Akka nowadays...

That said, it still feels good reading success stories from people who used it. Thanks, you made my day!


You're welcome! I think Akka has been making great strides in polishing the rough edges (along with the rest of the Scala ecosystem, TBH). Unfortunately that doesn't seem to generate the hype I'd argue Akka deserves. Shrug.


"[Arrived] in a $1-million leather bodysuit with an anatomically-correct gold breastplate and a 15-carat-diamond nipple. 'You were just surrounded by the most interesting and intelligent people that you could find anywhere in the world,'"

Interesting juxtaposition haha.


IIRC you're referring to the wife of Michael Cowpland, who founded Corel. They were another hot tech firm at the time in Ottawa, but not related to Nortel.


I was going to quote this. Totally insane. Even with a gold breastplate and the diamond, I cannot fathom how this managed to cost $1 million.


The reason is the concern that users will confuse issues in the project itself with issues in how the project is packaged in Nixpkgs. He doesn't want to have to support nixpkgs.

There are likely multiple root causes. The whole space of issue management around libraries and the applications using those libraries is a horrible and abusive mess. In my experience, given an application and a library it uses, users will file issues against whichever is easiest, not whichever is most applicable. Plus, most open-source user support is a fucking chore (you'd need to pay me for that these days) and it's unsustainable.

Another cause is likely the cost of nixpkgs contributions themselves. Personally, I no longer contribute to nixpkgs because even for tiny changes the process is ridiculously expensive. That's not including the cost of getting up to speed with nix/nixpkgs and the often highly opinionated packaging.

nixpkgs needs to be broken up into multiple independently distributable packages.


For fun I've been analyzing the contracts posted to r/CryptoMoonShots. Out of 20 posts, 16 used the same contract, modulo names. This contract blocks everyone but the owner from removing funds.

How? Is it some complex chunk of code that requires a delicate hack?

No, not at all. There is literally a function with code, more or less, like: "if owner, then OK, here's all the funds". Anybody can check this in the contract. Yet people are dumping funds into these contracts. Even tho each contract tends to attract only a few thousand dollars, it costs next to nothing to create and spam them.

A more detailed analysis of a similar contract to the one I've seen: https://cryptot3ddybear.gitlab.io/blog/posts/scam-explained-...


Typically the small amount of volume is from the contract owner attempting to pick up attention from momentum trading bots.

This type of contract made a killing a few months ago. Basically, miners trade by sandwiching orders in the mempool. You can search for the 'salmonella' contract for more info.


https://github.com/Defi-Cartel/salmonella

Link for the lazy, super interesting read.


Then it moves the security breach incentive to compromising the owner's keys, which is also usually pretty straightforward.


After a lot of scrolling through marketing hyperbole... What is it?


It's an Apple TV/Roku/Fire TV but from Google?


"please don't" doesn't pay the bills. You want those people to avoid transferring the reigns for cash? Find some sustainable way to pay them. No amount of feel good platitudes will do that.


Crime isn't supposed to pay the bills either. You have no right to give away a gift and then burglarize anyone who accepts it. If you don't want to give your work away, put a protective license on it.


While the above case is (arguably) a crime, in general crime is unrelated to this issue.

As for licensing: totally agreed. I would like to see more projects with protective licensing.

Too many projects are basically donations to Amazon.


I wonder about the complexity and AWS's motivations.

What does AWS gain by improving IAM? There are barely any competitors, so they won't be losing customers over it. They offer their own AWS professional services, happy to charge you to make it "understandable". And their service agreements largely absolve them of client mistakes, which usually result in larger bills from AWS.


That's pretty cynical.

AWS is a ball of complexity because it grew organically that way, and they don't have a culture of explaining or of keeping things simple.

Both of those things would require strong strategic guidance, and a real effort to do.

Only if Bezos issued an edict like "Our APIs must remain simple even as they scale, and we must document in a manner that keeps the 80% common path easy to use, while keeping the remaining 20% of arcane functionality available ..." would it happen.

But it won't.

It's reasonably well-curated arbitrary complexity; it is what it is.

This is not an issue anyone handles well.


No cynicism meant. My mistake, "motivations" was the wrong word. I was trying to ask how the business of AWS manifests such a thing, which I think you've described. Thanks!

