Let's see what's involved in building a stream processing application using Apache Storm.
A Storm application, which takes in data and processes it, is known as a Storm topology. Every Storm topology contains some basic components.
The spout is the root of every Storm topology; data processing starts at the spout. The spout collects the data from the input source.
The spout acts as the connection between the external data source and the Storm application: it pushes the data to the remaining components of the application.
The spout also keeps track of each piece of data until its processing is done and successful; if there is a failure, the spout resends that data to the downstream components.
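To make the spout's role concrete, here is a minimal sketch in plain Java. This is an illustration only, not the real org.apache.storm API: the class name `ToySpout` and its methods are made up, but the emit/ack/replay bookkeeping mirrors what a real spout does.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Simplified model of a spout (illustration only, not the Storm API):
// it pulls data from a source, emits it downstream, and keeps each
// tuple "pending" until it is acked. A failed tuple goes back to the
// source so it will be emitted again.
class ToySpout {
    private final Queue<String> source = new ArrayDeque<>();
    private final Map<Integer, String> pending = new HashMap<>(); // id -> tuple
    private int nextId = 0;

    // New data arriving from the external source.
    void offer(String data) { source.add(data); }

    // Emit the next tuple, remembering it until ack/fail arrives.
    Map.Entry<Integer, String> nextTuple() {
        String data = source.poll();
        if (data == null) return null;
        int id = nextId++;
        pending.put(id, data);
        return Map.entry(id, data);
    }

    // Downstream processing succeeded: forget the tuple.
    void ack(int id) { pending.remove(id); }

    // Downstream processing failed: re-queue the tuple for replay.
    void fail(int id) { source.add(pending.remove(id)); }

    int pendingCount() { return pending.size(); }
}
```

In a real topology the spout extends Storm's `BaseRichSpout`, and the framework calls `ack` and `fail` for you; the tracking idea is the same.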
The remaining components in a Storm application are called bolts.
Bolts take care of all the operations that happen to the input data. There will be one or more bolts present in a Storm application.
Different bolts perform different operations on the given data. The operations could be anything: for example, extracting text from a tweet, or counting and summarizing hashtags, and so on.
The bolts are arranged so that the output of one bolt goes as the input to the next bolt. Using multiple bolts reduces the workload on each individual bolt.
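As a sketch of how bolts chain together, here are two toy bolts in plain Java (names invented for illustration, not the Storm API): the first extracts hashtags from a tweet, and its output becomes the input of the second, which counts them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Two chained "bolts" as plain methods (illustration only, not the
// Storm API). The output of the first feeds the input of the second,
// just as bolts are wired together in a topology.
class ToyBolts {
    // Bolt 1: split a tweet into hashtag tokens.
    static List<String> extractHashtags(String tweet) {
        List<String> tags = new ArrayList<>();
        for (String word : tweet.split("\\s+")) {
            if (word.startsWith("#")) tags.add(word.toLowerCase());
        }
        return tags;
    }

    // Bolt 2: accumulate counts per hashtag across batches.
    static Map<String, Integer> countTags(List<List<String>> batches) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> batch : batches)
            for (String tag : batch)
                counts.merge(tag, 1, Integer::sum);
        return counts;
    }
}
```

Splitting extraction and counting into two components is exactly the kind of division of work the text describes: each bolt does one job, and the stream flows from one to the next.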
Let's consider Twitter trends as an example. Twitter receives an enormous number of tweets. To calculate the trending hashtag, the application performs a series of tasks: reading each tweet, extracting its hashtags, counting each hashtag, and ranking the counts.
In the above scenario, we have split the work among the bolts, and each bolt is assigned a different task. But if we look closely, we can see that one tweet is not dependent on any other.
When there is no dependency in the data, we can bring parallel processing into the bolts.
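Since one tweet doesn't depend on another, several copies of the same bolt can work on different tweets at the same time. In the sketch below (plain Java, illustrative only), a parallel stream stands in for multiple executor threads all running the same bolt logic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Because each tweet is independent, hashtag counting can be spread
// across parallel workers. A parallel stream plays the role of several
// instances of the same bolt (illustration only, not the Storm API).
class ParallelBolt {
    static Map<String, Long> countHashtags(List<String> tweets) {
        return tweets.parallelStream()                       // independent tuples: safe to parallelize
                .flatMap(t -> Arrays.stream(t.split("\\s+")))
                .filter(w -> w.startsWith("#"))
                .map(String::toLowerCase)
                .collect(Collectors.groupingBy(Function.identity(),
                                               Collectors.counting()));
    }
}
```

In real Storm you get the same effect declaratively, by setting a bolt's parallelism hint so the cluster runs multiple instances of it.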
We've seen how to set up a Storm topology: an application that can process a stream of events using spouts and bolts.
These topologies are actually run by something called a Storm cluster. To run a topology, you take the jar containing all the code that defines the topology and submit it to a Storm cluster.
A Storm cluster is a group of processes centrally coordinated by one process, much like a master-and-slaves system. Each worker process might run a spout or a bolt.
One process centrally coordinates all the different spouts and bolts of all the different topologies that are running on that cluster.
You can have all the Java processes corresponding to the topologies running on a single machine; this is called Local Mode.
Local Mode is basically for development purposes, so that you can test the code you have written for the topology.
When you want to run the Storm topology for real, in production, you use Remote Mode.
Remote Mode harnesses the power of a large number of connected machines: the different processes involved in the Storm cluster run on different systems.
But who monitors the processes in the cluster, keeping track of what every machine is doing?
Nimbus is the central coordinating process of the Storm cluster. Nimbus is a single process that monitors all the other processes running in the cluster.
When you want to run a Storm application, you submit the topology jar to Nimbus.
Once the jar is submitted, Nimbus figures out which spouts and bolts the topology requires and starts processes accordingly.
Each process corresponds to a spout or a bolt. All these processes started by Nimbus are known as executors.
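A rough sketch of that idea in plain Java (names invented for illustration, not the Storm API): given each component's name and its parallelism, Nimbus-like logic starts that many executors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified model of what Nimbus does with a submitted topology
// (illustration only): for each component it launches as many
// executors as that component's parallelism asks for.
class ToyNimbus {
    static List<String> launchExecutors(Map<String, Integer> topology) {
        List<String> executors = new ArrayList<>();
        for (Map.Entry<String, Integer> component : topology.entrySet())
            for (int i = 0; i < component.getValue(); i++)
                executors.add(component.getKey() + "-executor-" + i);
        return executors;
    }
}
```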
Nimbus takes the help of a tool called ZooKeeper to coordinate all the executors and keep track of their states.