Storm : a Realtime Computation System Similar to Hadoop (github.com/nathanmarz)
30 points by EzGraphs on Oct 22, 2012 | 8 comments



It's barely similar; the only resemblance is that both are fault-tolerant systems for scaling computation. Storm provides real-time streaming computation. Your spouts provide unbounded streams of tuples, small objects that hold other serialized values. Your bolts then consume each tuple and emit 0 or more new tuples from it.

You could liken it to a streaming MapReduce whose steps you can arrange into a directed graph of data flows, called a topology.
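To make that concrete, here is a minimal sketch in plain Python. It is not the Storm API: `number_spout`, `even_filter_bolt`, `square_and_cube_bolt`, `apply_bolt`, and `run_chain` are all hypothetical names made up for illustration, and the "topology" is just a linear chain, the simplest possible directed graph. It only shows the shape of the idea: an unbounded source of tuples, and processing steps (Storm calls them bolts) that each emit 0 or more tuples per input tuple.

```python
import itertools
from itertools import islice


def number_spout():
    """Hypothetical spout: an unbounded stream of 1-tuples holding an int."""
    for n in itertools.count(1):
        yield (n,)


def even_filter_bolt(tup):
    """Hypothetical bolt: emits zero tuples for odd input, one for even."""
    (n,) = tup
    if n % 2 == 0:
        yield (n,)


def square_and_cube_bolt(tup):
    """Hypothetical bolt: emits two tuples for every input tuple."""
    (n,) = tup
    yield ("square", n * n)
    yield ("cube", n ** 3)


def apply_bolt(bolt, stream):
    """Lazily feed every tuple of a stream through one bolt."""
    for tup in stream:
        yield from bolt(tup)


def run_chain(source, bolts):
    """Wire a linear chain: source -> bolts[0] -> bolts[1] -> ..."""
    stream = source
    for bolt in bolts:
        stream = apply_bolt(bolt, stream)
    return stream


# The source never ends, so take only the first few results.
topology = run_chain(number_spout(), [even_filter_bolt, square_and_cube_bolt])
first_four = list(islice(topology, 4))
```

Everything is lazy, so nothing runs until a consumer pulls results. A real topology can also branch and merge, and Storm distributes the work across machines, none of which this sketch attempts.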

Re: Spark, it's a totally different paradigm: it's like a MapReduce that takes advantage of memory locality, where Hadoop takes advantage of disk locality. Hive on Spark is a pretty beastly system.


Storm is actually not similar to Hadoop at all. I think this title resulted from a misreading of the README, which states: "Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation."

/nitpick


So which is better, Storm[1] or Spark[2]?

1. http://storm-project.net/
2. http://www.spark-project.org/


They are for different use cases. Storm is more for realtime data, like analyzing Twitter or bidding on ads. Spark is very much an in-memory MapReduce, designed for batch computations. Spark makes sense when you just have a lot of data you want to analyze or get statistics on.


JRuby DSL and Integration for Storm here: https://github.com/colinsurprenant/redstorm


How is this better than Hadoop?


Apples and oranges. Hadoop is a framework for parallel batch processing of existing data, like server logs and sales statistics. Storm is a framework for processing streaming data in realtime, like Twitter's live stream of tweets.
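A rough way to see the difference, as a plain Python sketch that uses neither Hadoop nor Storm (the function names `batch_word_count` and `streaming_word_count` are made up for illustration): the batch version needs the whole data set up front and returns one answer at the end, while the streaming version reports an updated answer each time a new line arrives and would keep going for as long as the input does.

```python
from collections import Counter


def batch_word_count(lines):
    """Batch style: all input already exists; return one final result."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def streaming_word_count(line_stream):
    """Streaming style: yield a snapshot of the running counts per line."""
    counts = Counter()
    for line in line_stream:
        counts.update(line.split())
        yield dict(counts)  # copy, so later updates don't alter this snapshot


logs = ["storm is fast", "storm is realtime"]

final_counts = batch_word_count(logs)
running_counts = list(streaming_word_count(iter(logs)))
```

Here `final_counts` is a single result, while `running_counts` holds one snapshot per input line. With a live feed instead of a two-item list, the streaming loop would simply never finish.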



