Apache Storm core concepts


Apache Storm consumes a raw stream of real-time data at one end, passes it through a series of small processing units, and emits the processed, useful information at the other end.

The following figure describes the core concepts of Apache Storm.

core_concept.jpg

Now let’s take a closer look at the components of Apache Storm:

| Component | Description |
| --- | --- |
| Tuple | The main data structure in Storm: an ordered list of elements. By default, a tuple can hold values of any data type. It is typically modeled as a set of comma-separated values and passed around the Storm cluster. |
| Stream | An unbounded sequence of tuples. |
| Spouts | The sources of streams. Usually Storm accepts input from raw data sources such as the Twitter Streaming API, an Apache Kafka queue, a Kestrel queue, etc.; otherwise you can write a spout that reads data from your own data source. "ISpout" is the core interface for implementing spouts; some specific variants are IRichSpout, BaseRichSpout, KafkaSpout, etc. |
| Bolts | The logical processing units. Spouts pass data to bolts, which process it and produce new output streams. Bolts can perform operations such as filtering, aggregation, joins, and interaction with data sources and databases. A bolt receives data and can emit to one or more other bolts. "IBolt" is the core interface for implementing bolts; some common variants are IRichBolt, IBasicBolt, etc. |
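
To make these components concrete, here is a minimal spout sketch. It is not an official example: the class name SentenceSpout and its canned data are made up, and it assumes the Storm 2.x org.apache.storm API.

```java
import java.util.Map;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

// Hypothetical spout: turns a canned list of sentences into a stream of tuples.
public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final String[] sentences = {"hello storm", "hello world"};
    private int index = 0;

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;   // keep the collector for use in nextTuple()
    }

    @Override
    public void nextTuple() {
        // A tuple is an ordered list of values; Values is a convenience subclass of ArrayList.
        collector.emit(new Values(sentences[index]));
        index = (index + 1) % sentences.length;
        Utils.sleep(100);             // throttle the demo stream a little
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Field names give the positional values of each tuple a schema.
        declarer.declare(new Fields("sentence"));
    }
}
```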

Let’s look at a real-time example of “Twitter Analytics” to see how it is modeled in Apache Storm. The diagram below depicts the structure.

twitter_analysis.jpg

The input to "Twitter Analytics" comes from the Twitter Streaming API. Spout will use the Twitter Streaming API to read the user's tweets and output them as a stream of tuples. A single tuple from the spout will have the twitter username and a single tweet as comma separated values. The steam of this tuple will then be forwarded to Bolt, and Bolt splits the tweet into individual words, counts the word count, and saves the information to the configured data source. Now we can easily get the results by querying the data source.
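
The splitting step could look roughly like the following bolt. The class name WordSplitBolt and the field names "user" and "tweet" are illustrative, again assuming the Storm 2.x API; the counting and persistence steps are left out.

```java
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical bolt: consumes (user, tweet) tuples and emits one tuple per word.
public class WordSplitBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> topoConf, TopologyContext context,
                        OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        String tweet = input.getStringByField("tweet");
        for (String word : tweet.split("\\s+")) {
            collector.emit(new Values(word));   // a new output stream of single words
        }
        collector.ack(input);                   // acknowledge so Storm can track completion
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```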

Topology

Spouts and Bolts are connected together to form a topology. Real-time application logic is specified in the Storm topology. Simply put, a topology is a directed graph where vertices are computations and edges are data flows.

A simple topology starts with a spout. The spout emits data to one or more bolts. A bolt represents a node in the topology with the smallest processing logic, and the output of one bolt can be fed into another bolt as input.

Storm keeps the topology running until you terminate it. The main job of Apache Storm is to run topologies, and it can run any number of topologies at a given time.
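
Assembling such a topology might look like the sketch below. TweetSpout and WordCountBolt are hypothetical stand-ins for the Twitter spout and counting bolt described earlier, WordSplitBolt is the bolt sketched above, and the component IDs and topology name are made up (assumes Storm 2.x, where LocalCluster is AutoCloseable).

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class TwitterAnalyticsTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Vertices of the directed graph: one spout and two bolts.
        builder.setSpout("tweets", new TweetSpout());          // hypothetical spout
        // Edges of the graph: the grouping declarations between components.
        builder.setBolt("split", new WordSplitBolt())
               .shuffleGrouping("tweets");
        builder.setBolt("count", new WordCountBolt())          // hypothetical bolt
               .fieldsGrouping("split", new Fields("word"));

        // Run in-process for testing; a production job would use StormSubmitter.
        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("twitter-analytics", new Config(),
                                   builder.createTopology());
            Thread.sleep(10_000);   // let the topology run briefly, then shut down
        }
    }
}
```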

Tasks

Now you have a basic idea of spouts and bolts. They are the smallest logical units of a topology, and a topology is built from a single spout and an array of bolts. They must execute in the proper order for the topology to run successfully. Each spout or bolt execution scheduled by Storm is called a "task"; simply put, a task is the execution of a spout or a bolt. Each spout and bolt can have multiple instances running in multiple separate threads at any given time.
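
In code, this parallelism is declared per component when the topology is built. A hedged sketch, continuing the hypothetical builder from above: the parallelism hint sets the number of executor threads, and setNumTasks sets the number of task instances spread across them.

```java
// Two executor threads for the spout; by default, one task per executor.
builder.setSpout("tweets", new TweetSpout(), 2);

// Four executor threads running eight task instances of the bolt (two per thread).
builder.setBolt("split", new WordSplitBolt(), 4)
       .setNumTasks(8)
       .shuffleGrouping("tweets");
```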

Workers

A topology runs in a distributed fashion across multiple worker nodes. Storm spreads the tasks evenly across all the worker nodes. A worker node's role is to listen for jobs and start or stop worker processes as new jobs arrive.
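
The number of worker processes a topology occupies is part of its configuration rather than its graph. A sketch, continuing the hypothetical example (Config and StormSubmitter come from org.apache.storm):

```java
Config conf = new Config();
conf.setNumWorkers(3);   // spread the topology's executors over three worker processes

// Submit to a real cluster; the topology name is made up.
StormSubmitter.submitTopology("twitter-analytics", conf, builder.createTopology());
```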

Stream grouping

Data flows from spouts to bolts, or from one bolt to another. Stream grouping controls how tuples are routed through the topology and helps us understand the flow of tuples in it. Four built-in groupings are described below.

Shuffle Grouping

In shuffle grouping, tuples are distributed randomly across the tasks executing the bolt, so that each task receives an equal share of the tuples. The diagram below depicts the structure.

shuffle_grouping.jpg
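
In the topology-building code, shuffle grouping is declared on the consuming bolt. Continuing the hypothetical builder from earlier:

```java
// Each tuple from "tweets" goes to a randomly chosen task of the "split" bolt,
// so its four tasks receive roughly equal shares of the stream.
builder.setBolt("split", new WordSplitBolt(), 4)
       .shuffleGrouping("tweets");
```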

Fields Grouping

In fields grouping, the stream is partitioned by the values of one or more tuple fields: tuples that share the same value for the grouping fields are always routed to the same task executing the bolt, while tuples with other values go to other tasks. For example, if the stream is grouped by the field "word", every tuple carrying the string "Hello" will be sent to the same worker. The following diagram shows how fields grouping works.

field_grouping.jpg
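
Declared in code, again continuing the hypothetical example (the field name "word" must match a field declared by the upstream component; Fields is org.apache.storm.tuple.Fields):

```java
// Tuples with the same "word" value always reach the same "count" task,
// so each task can keep a consistent per-word counter.
builder.setBolt("count", new WordCountBolt(), 4)
       .fieldsGrouping("split", new Fields("word"));
```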

Global Grouping

The entire stream can be grouped and forwarded to a single bolt task. Global grouping sends the tuples generated by all instances of the source component to one target instance (specifically, the task with the lowest ID).

global_grouping.jpg
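
A sketch of the declaration; ReportBolt is a hypothetical single-instance consumer that aggregates the final results.

```java
// The entire "count" stream is funneled into the single task of "report"
// (Storm picks the task with the lowest ID).
builder.setBolt("report", new ReportBolt())
       .globalGrouping("count");
```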

All Grouping

All grouping sends a single copy of each tuple to every instance of the receiving bolt. This kind of grouping is used to send signals to bolts, and it is useful for join operations.

all_grouping.jpg
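
Declared in code, with a hypothetical "signals" spout used to broadcast control messages:

```java
// Every tuple from "signals" is copied to all four "split" tasks,
// e.g. to tell every instance to refresh its configuration at once.
builder.setBolt("split", new WordSplitBolt(), 4)
       .allGrouping("signals");
```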