Introduction to Apache Storm


What is Apache Storm?

Apache Storm is a distributed, real-time big data processing system. It is designed to process large volumes of data in a fault-tolerant and horizontally scalable manner, and it is a streaming data framework capable of very high ingestion rates. Although Storm is stateless, it manages the distributed environment and cluster state through Apache ZooKeeper. It is simple to use, and you can run many kinds of operations on live data in parallel.

Apache Storm remains a popular choice for real-time data analysis. It is easy to set up and operate, and it guarantees that every message will be processed through the topology at least once.
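
The "at least once" guarantee can be illustrated with a small sketch. This is plain Python, not the Storm API: the names process_at_least_once and make_flaky_handler are hypothetical, and the retry queue is only a toy stand-in for the replay of a message that was not acknowledged. Messages are assumed to be hashable values such as strings.

```python
from collections import deque


def process_at_least_once(messages, handler, max_attempts=5):
    """Replay each message until the handler succeeds (toy model only)."""
    pending = deque((msg, 1) for msg in messages)
    attempts = {}
    while pending:
        msg, attempt = pending.popleft()
        attempts[msg] = attempt
        try:
            handler(msg)  # returning normally plays the role of an acknowledgement
        except Exception:
            if attempt >= max_attempts:
                raise  # give up instead of looping forever
            pending.append((msg, attempt + 1))  # re-queue the message for replay
    return attempts


def make_flaky_handler(fail_once):
    """Build a handler that fails the first time it sees the given messages."""
    failed = set()
    processed = []

    def handler(msg):
        processed.append(msg)  # the side effect happens before the failure
        if msg in fail_once and msg not in failed:
            failed.add(msg)
            raise RuntimeError(f"simulated failure for {msg!r}")

    return handler, processed


handler, processed = make_flaky_handler({"b"})
attempts = process_at_least_once(["a", "b", "c"], handler)
print(processed)  # "b" appears twice because it was replayed
```

Note that the replayed message shows up twice in processed. At-least-once delivery does not mean exactly once, so downstream logic generally needs to tolerate duplicates.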

Apache Storm vs Hadoop

Both Hadoop and Storm are frameworks for analyzing big data. They complement each other and differ in several respects. Apache Storm does everything except persistence, while Hadoop is good at everything but lags in real-time computation. The following table compares the attributes of Storm and Hadoop.

Storm | Hadoop
Real-time stream processing | Batch processing
Stateless | Stateful
Master/slave architecture with ZooKeeper-based coordination. The master node is called nimbus and the slave nodes are supervisors. | Master/slave architecture with or without ZooKeeper-based coordination. The master node is the JobTracker and the slave nodes are TaskTrackers.
A Storm streaming process can handle tens of thousands of messages per second on a cluster. | Hadoop uses the MapReduce framework over the Hadoop Distributed File System (HDFS) to process large amounts of data, which takes minutes or hours.
A Storm topology runs until the user shuts it down or an unexpected, unrecoverable failure occurs. | MapReduce jobs are executed sequentially and eventually complete.
Distributed and fault-tolerant | Distributed and fault-tolerant
If nimbus or a supervisor dies, restarting it makes it continue from where it stopped, so nothing is affected. | If the JobTracker dies, all running jobs are lost.
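
The difference between batch and stream processing in the table can be sketched in a few lines. This is a plain-Python illustration, not Storm or Hadoop code. The function names batch_count and streaming_count are hypothetical, and the running counter is kept in a local variable purely for illustration.

```python
from collections import Counter


def batch_count(records):
    """Batch style: needs the complete, finite data set and answers once at the end."""
    return Counter(records)


def streaming_count(stream):
    """Streaming style: update per record and emit a result immediately.

    The input may be unbounded, so the loop only ends if the stream does.
    """
    counts = Counter()
    for record in stream:
        counts[record] += 1
        yield record, counts[record]


events = ["click", "view", "click"]
print(batch_count(events))
for event, running_total in streaming_count(events):
    print(event, running_total)
```

The batch function returns nothing until all input has been read, whereas the streaming generator produces an updated count after every record, which is the behavior a long-running topology is meant to provide.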

Examples of using Apache Storm

Apache Storm is well known for real-time big data stream processing, and many companies use it as an integral part of their systems. Some notable examples follow:

Twitter - Twitter uses Apache Storm for its "Publisher Analytics Product", which processes every tweet and click on the Twitter platform. Apache Storm is deeply integrated with Twitter's infrastructure.

NaviSite - NaviSite uses Storm for its event log monitoring and auditing system. Every log generated in the system goes through Storm. Storm checks each message against a configured set of regular expressions, and if there is a match, that message is saved to the database.
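
The matching step described for NaviSite can be sketched roughly as follows. This is a plain-Python illustration rather than NaviSite's actual code or the Storm API. The patterns, the function names, and the list used in place of a database are all assumptions made for the example.

```python
import re

# Hypothetical patterns; a real deployment would load its own configured set.
PATTERNS = [
    re.compile(r"ERROR"),
    re.compile(r"login failed for user \w+"),
]


def matching_messages(log_lines, patterns=PATTERNS):
    """Yield only the log lines that match at least one configured pattern."""
    for line in log_lines:
        if any(pattern.search(line) for pattern in patterns):
            yield line


def audit(log_lines, store):
    """Save matching lines; the list `store` stands in for a database table."""
    for line in matching_messages(log_lines):
        store.append(line)
    return store


logs = [
    "INFO service started",
    "ERROR disk full",
    "WARN login failed for user alice",
    "DEBUG heartbeat",
]
print(audit(logs, []))
```

Because matching_messages is a generator, it handles one line at a time and never needs the whole log in memory, which mirrors how a stream of log events would be filtered.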

Wego - Wego is a travel metasearch engine based in Singapore. Travel-related data comes from many sources around the world and at different times. Storm helps Wego search real-time data, resolve concurrency issues, and find the best matches for end users.

Apache Storm Advantages

Here is a list of benefits provided by Apache Storm:

  • Storm is open source, powerful, and user-friendly. It can be used by both small and large companies.

  • Storm is fault-tolerant, flexible, reliable, and supports any programming language.

  • Storm allows real-time stream processing.

  • Storm is very fast: a cluster can process tens of thousands of messages per second.

  • Storm can maintain performance as load increases by adding resources linearly. It is highly scalable.

  • Storm performs data refreshes and delivers responses end-to-end in seconds or minutes depending on the problem. It has very low latency.

  • Storm supports operational intelligence, that is, analyzing live data as it arrives to inform decisions.

  • Storm provides guaranteed data processing even if any connected node in the cluster dies or messages are lost.