Components involved in Storm

Let's see what's involved in building a stream processing application using Storm.

A Storm application, which takes in data and processes it, is known as a Storm topology. Every Storm application contains some basic components.

Spout

The spout is the root of every Storm topology; data processing starts at the spout. The spout collects data from the input source.

The spout acts as the connection between the external data source and the Storm application; it pushes the data to the remaining components of the Storm application.

The spout also keeps track of each piece of data until its processing is done successfully. If there is a failure, the spout resends the data to the Storm components again.
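This track-and-resend behaviour can be sketched in plain Python. Note this is an illustrative simulation, not the real Storm spout API; the class and method names below are made up for the example:

```python
import collections

class ReplaySpout:
    """Toy spout that remembers emitted items and replays them on failure.
    (A plain-Python sketch of the idea, not Storm's actual spout API.)"""

    def __init__(self, source):
        self.source = collections.deque(source)
        self.pending = {}      # tuple id -> data awaiting acknowledgement
        self.next_id = 0

    def next_tuple(self):
        """Emit the next item and remember it until it is acked."""
        if not self.source:
            return None
        data = self.source.popleft()
        self.next_id += 1
        self.pending[self.next_id] = data
        return self.next_id, data

    def ack(self, tuple_id):
        """Downstream processing succeeded: forget the item."""
        self.pending.pop(tuple_id, None)

    def fail(self, tuple_id):
        """Downstream processing failed: queue the item to be re-emitted."""
        self.source.appendleft(self.pending.pop(tuple_id))

spout = ReplaySpout(["tweet-1", "tweet-2"])
tid, data = spout.next_tuple()
spout.fail(tid)                   # simulate a processing failure
tid2, data2 = spout.next_tuple()  # the same data is emitted again
```

Nothing is forgotten until downstream processing acknowledges it, which is the essence of Storm's at-least-once processing guarantee.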

The remaining components in a Storm application are called bolts.

Bolts

Bolts take care of all the operations that happen to the input data. There will be one or more bolts present in a Storm application.

Different bolts perform different operations on the given data. The operations could be anything: for example, extracting some text from a tweet, or counting and summarizing words, and so on.

All the bolts are arranged such that the output of one bolt goes as input to another bolt. Multiple bolts are used to share the workload, as shown in the apache-storm-topology diagram.

Twitter Trending Example for Storm Topology

Let's consider Twitter trends. Twitter receives an enormous number of tweets, and it performs the following tasks to calculate the trending tags:

  • Collect the tweets
  • Split each tweet to find the words
  • Count the words: not the number of words, but the occurrences of each word
  • Sort the words by occurrence in descending order
The topology for the above scenario is shown in the real-time-apache-topology diagram.
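The steps above can be sketched as three chained functions, one per bolt. This is plain Python for illustration only, not Storm's actual API:

```python
from collections import Counter

def split_bolt(tweet):
    """Split a tweet into lowercase words (the Split Bolt's job)."""
    return tweet.lower().split()

def count_bolt(words):
    """Count how often each word occurs (the Count Bolt's job)."""
    return Counter(words)

def sort_bolt(counts):
    """Sort words by occurrence, descending (the Sort Bolt's job)."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

tweets = ["storm is fast", "storm is scalable"]
words = [w for t in tweets for w in split_bolt(t)]  # the spout feeds the split bolt
trending = sort_bolt(count_bolt(words))             # output of one bolt is input to the next
# trending now lists words from most to least frequent
```

Each function's output becomes the next function's input, mirroring how the output stream of one bolt feeds the next bolt in the topology.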

Parallel Stream Processing in Apache Storm

In the above scenario, we have split the task among the bolts, and each bolt is assigned a different task. If we look closely, we can see that one tweet/data item is not dependent on the others.

When there is no dependency in the data, we can bring parallel processing into the bolts.
  • There is no dependency between the tweets, so we can bring parallel processing to the Split Bolt
  • Once tweets are split into words, we use the Count Bolt. Counting a particular word is not dependent on any other word, so we can bring parallel processing to the Count Bolt
  • Once all the word occurrences are counted, we need to sort them using the Sort Bolt. The Sort Bolt has a dependency between the word counts, because when we sort an item we have to compare it with other items. So we cannot bring parallel processing to the Sort Bolt, because of this dependency
This is shown in the parallel-processing-in-storm-apache diagram.
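A rough Python sketch of this idea, using threads as stand-ins for parallel Count Bolt instances and assuming two counters (illustrative only, not Storm code):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

words = ["storm", "is", "fast", "storm", "is", "storm"]

# Partition the words so that the same word always lands on the same
# counter instance (similar in spirit to Storm's fields grouping).
N_COUNTERS = 2
partitions = [[] for _ in range(N_COUNTERS)]
for w in words:
    partitions[hash(w) % N_COUNTERS].append(w)

# Each Count Bolt instance counts its own partition in parallel,
# since no word's count depends on any other word.
with ThreadPoolExecutor(max_workers=N_COUNTERS) as pool:
    partial_counts = list(pool.map(Counter, partitions))

# The Sort Bolt needs every count before it can compare anything,
# so it runs as a single, final step.
totals = Counter()
for c in partial_counts:
    totals.update(c)
sorted_words = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Counting parallelizes cleanly because the partitions are independent, while sorting stays a single step because every comparison needs the full set of counts.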

Storm Cluster

We've seen how to set up a Storm topology, which is an application that can process a stream of events.

These Storm topologies are actually run by something called a Storm cluster. To run a topology, you take the jar containing all the code that specifies the topology and submit it to a Storm cluster.


A Storm cluster is a group of processes that are centrally coordinated by one process, just like a master-and-slaves system. Each process might represent a spout or a bolt.

One process centrally coordinates all the different spouts and bolts of all the different topologies that are running on that cluster.

You can have all the different Java processes corresponding to the topologies running on a single machine; this is called Local Mode.

Local Mode

Local Mode is basically just for development purposes, so that you can test the code that you have written for your topology.

Remote Mode

When you want to run a Storm topology in real time, in production, you take the help of Remote Mode.

Remote Mode gives you the power of a large number of connected machines. In Remote Mode, you run the different processes involved in the Storm cluster on different systems.

But who will monitor the processes in the cluster, i.e., what every machine is doing?

Nimbus

Nimbus is the central coordinating process of the Storm cluster. Nimbus is a single process which monitors all the other processes running in the cluster.

When you want to run a storm application,

  • You create the topology (defining the work of the spouts and bolts)
  • Package the topology as a jar
  • Submit the jar to Nimbus

Once the jar is submitted to Nimbus, Nimbus figures out the spouts and bolts required for the topology and starts processes accordingly.

Each process corresponds to a spout or a bolt. All these processes started by Nimbus are known as Executors.
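A toy sketch of this idea in plain Python, using threads as stand-ins for executor processes and a made-up topology definition (the real Nimbus does far more than this):

```python
import threading

# Hypothetical topology definition: component name -> number of executors.
topology = {"tweet-spout": 1, "split-bolt": 2, "count-bolt": 2, "sort-bolt": 1}

started = []               # record of executors our toy "Nimbus" launched
lock = threading.Lock()

def executor(name):
    """Stand-in for an executor running one spout or bolt instance."""
    with lock:
        started.append(name)

# "Nimbus" reads the topology and starts one executor per requested instance.
threads = []
for component, parallelism in topology.items():
    for i in range(parallelism):
        t = threading.Thread(target=executor, args=(f"{component}-{i}",))
        t.start()
        threads.append(t)
for t in threads:
    t.join()
```

The coordinator only knows the topology definition; each executor independently does the work of the spout or bolt it represents.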

  • In Local Mode, Nimbus and all of your Executors run on one machine.
  • In Remote Mode, Nimbus and your Executors run on different machines.

Nimbus takes the help of a tool called ZooKeeper to coordinate all the Executors and their states.
