Storm: a Realtime Computation System Similar to Hadoop

terhechte · on Oct 22, 2012

Previous Discussion: http://news.ycombinator.com/item?id=3014039

Xorlev · on Oct 22, 2012

It's barely similar beyond being a fault-tolerant system for scaling computation. Storm provides real-time streaming computation. Your spouts provide unbounded streams of tuples, which are small objects holding serialized values of other types. Each processing step (a bolt, in Storm's terms) takes a tuple and emits zero or more new tuples from it.

You could liken it to a streaming MapReduce whose steps you can arrange into a directed graph of data flows, called a topology.
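
To make the spout/tuple/topology idea concrete, here's a rough sketch in plain Python generators. This is not Storm's API; the names `sentence_spout`, `split_bolt`, `count_bolt` and `run_topology` are made up for illustration, and a real topology would run distributed across workers rather than in one process.

```python
from collections import Counter
from itertools import islice


def sentence_spout():
    """Hypothetical spout: an endless stream of one-field tuples."""
    sentences = ["the cow jumped", "over the moon"]
    while True:
        for sentence in sentences:
            yield (sentence,)


def split_bolt(stream):
    """Emit zero or more (word,) tuples for each incoming tuple."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Keep a running count and emit (word, count) after every update."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(limit):
    """Wire spout -> split -> count and take the first `limit` outputs."""
    pipeline = count_bolt(split_bolt(sentence_spout()))
    return list(islice(pipeline, limit))


if __name__ == "__main__":
    for word, count in run_topology(6):
        print(word, count)
```

The stream never ends on its own, so the sketch cuts it off with `islice`; the point is only that each stage reacts to one tuple at a time instead of waiting for a complete input.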

Re: Spark, it's a totally different paradigm: it's like MapReduce, but it takes advantage of memory locality where Hadoop takes advantage of disk locality. Hive on Spark is a pretty beastly system.

_jmar777 · on Oct 22, 2012

Storm is actually not similar to Hadoop at all. I think this title resulted from a misreading of the README, which states: "Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation."

/nitpick

dkhenry · on Oct 22, 2012

So which is better, Storm[1] or Spark[2]?

1. http://storm-project.net/
2. http://www.spark-project.org/

cf · on Oct 22, 2012

They are for different use cases. Storm is more for realtime data, like analyzing Twitter or bidding on ads. Spark is very much an in-memory map-reduce designed for batch computations. Spark makes sense when you just have a lot of data you want to analyze or get statistics on.

EzGraphs · on Oct 22, 2012

JRuby DSL and Integration for Storm here: https://github.com/colinsurprenant/redstorm

t-crayford · on Oct 22, 2012

How is this better than Hadoop?

lmirosevic · on Oct 22, 2012

Apples and oranges. Hadoop is a framework for parallel batch processing of existing data, like server logs and sales statistics. Storm is a framework for processing streaming data in realtime, like Twitter's live stream of tweets.
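
A toy way to see the difference, in plain Python rather than either framework's API (the function names here are invented for the example): the batch version needs the whole dataset before it answers once, while the streaming version produces an updated answer as each event arrives.

```python
def batch_total(records):
    """Batch style: the full dataset already exists; compute one answer."""
    return sum(records)


def streaming_totals(events):
    """Streaming style: emit an updated total after every incoming event."""
    total = 0
    for value in events:
        total += value
        yield total


if __name__ == "__main__":
    sales = [3, 1, 4]
    print(batch_total(sales))             # one result at the end
    print(list(streaming_totals(sales)))  # one result per event
```

With a genuinely unbounded input, `batch_total` would never return, which is roughly why the two kinds of system aren't interchangeable.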