Introduction to Apache Storm

Apache Storm is real-time data processing software. It can sift through streams of data to find a particular trend, such as frequently repeated words in search queries.

Storm allows developers to build powerful, highly responsive applications that can, for example, detect trending topics on Twitter or monitor spikes in payment failures.

Apache Storm is a free and open-source distributed real-time computation system.

What is streaming data?

Data that flows into your system continuously is called streaming data.

For example, say every Uber cab out on the street sends its location back to Uber's servers. This location information from each car is then used to serve riders' requests for the nearest cab.

Here, the location information continuously flowing to the central servers forms a continuous stream of data records. This is streaming data.

User clicks on a web page, continuously collected at the servers, are also streaming data.
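As a rough sketch in plain Python (not Storm code), streaming data can be pictured as an unbounded sequence of records arriving over time. The record fields below are hypothetical, chosen only to mirror the cab example:

```python
import itertools
import random

def location_stream():
    """Simulate an endless stream of cab location records."""
    for i in itertools.count():
        # Each yielded record is one data point flowing into the system.
        yield {
            "cab_id": i % 100,               # hypothetical cab identifier
            "lat": 40.0 + random.random(),   # hypothetical coordinates
            "lon": -74.0 + random.random(),
        }

# The stream is conceptually infinite; a consumer only ever sees a prefix of it.
sample = list(itertools.islice(location_stream(), 3))
print(sample)
```

The key property is that the stream has no defined end: consumers take records as they arrive rather than waiting for a complete data set.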

Real-time streaming data processing

Storm processes a stream of data as it arrives. As soon as a record arrives, the required processing is done on it and it is marked as done. Apache Storm is built to serve exactly these real-time stream processing requirements.
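The record-at-a-time model can be sketched in plain Python (a simulation of the idea, not Storm's actual API; the log lines are invented for illustration):

```python
def process(record):
    """Hypothetical per-record processing: flag error log lines."""
    return "ERROR" in record

# Stands in for records arriving one by one from a live stream.
log_stream = ["INFO start", "ERROR disk full", "INFO done"]

results = []
for record in log_stream:
    # Each record is handled immediately on arrival, then marked as done;
    # nothing waits for the rest of the stream to be collected first.
    results.append((record, process(record), "done"))

print(results)
```

In Storm itself, this per-record logic lives in a topology of spouts (sources) and bolts (processing steps), but the arrival-driven processing model is the same.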

Apache Storm Benefits

Apache Storm continues to be a leader in real-time data analytics. Storm is easy to set up and operate, and it guarantees that every message will be processed through the topology at least once.

  • Storm is open source, robust, and user friendly. It can be used in small companies as well as large corporations.
  • Storm is fault tolerant, flexible, reliable, and supports any programming language.
  • Allows real-time stream processing.
  • Using Storm, you can build applications that are highly responsive to the latest data and react within seconds or minutes.
  • Storm is highly scalable: it keeps up performance even under increasing load by adding resources linearly.

Batch Processing

Batch processing is where the processing happens for blocks of data that have already been stored over a period of time.

For example, consider Twitter during the Oscars. During the ceremony, #Oscars will be the leading hashtag on Twitter, but after some time another tag, such as #Moonlight or #12YearsASlave, takes the lead.

In batch processing, the database or data storage system collects all the tweets for 10 minutes (a fixed period). Once the data is collected, it initiates the processing of the tweets.

Let's say the processing takes 2 minutes. Twitter can then change the leading tag only once every 12 minutes (10 minutes of collecting + 2 minutes of processing).

The above process is known as batch processing because the tweets/data are collected for a given period of time and only then processed. The processing always happens on a batch of data at fixed intervals.

Batch processing thus provides only periodic updates on the trend.
Twitter does not use this technique, because it is inefficient.
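A minimal sketch of this batch model in plain Python (a simulation for illustration, not any real Twitter or Storm API):

```python
from collections import Counter

def leading_tag(batch):
    """Count hashtags in one collected batch and return the leader."""
    counts = Counter(
        tag for tweet in batch for tag in tweet.split() if tag.startswith("#")
    )
    return counts.most_common(1)[0][0]

# Pretend these tweets were collected over one 10-minute window.
batch = [
    "so excited #Oscars",
    "best picture! #Moonlight #Oscars",
    "watching the show #Oscars",
]

# The trend is recomputed only once per batch, after the whole window is collected.
print(leading_tag(batch))
```

Note that nothing is reported until the entire window has been gathered, which is exactly why the trend can only update at fixed intervals.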

Stream Processing

Consider the same Oscars example. In stream processing, processing is initiated on each piece of data as soon as it is received. For an initial period the system may need to accumulate some data, but from that point onwards it updates the trend with every tweet.

In stream processing, we still wait a short period, say 10 minutes (until we have enough data), to calculate the first set of trends; once we have enough data, we process it.

From that moment onwards we have enough data to process the tweets as they arrive, so there is no need to wait for periodic updates.

The streaming data provides continuous updates on the trend.
Stream processing is useful for tasks like fraud detection. If you stream-process transaction data, you can detect anomalies that signal fraud in real time, then stop fraudulent transactions before they are completed.
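In contrast to the batch sketch, a stream processor keeps a running count and refreshes the leader on every arriving tweet. A minimal Python simulation (again, not Storm's actual API; the tweets are invented):

```python
from collections import Counter

counts = Counter()
leaders = []

# Stands in for tweets arriving one at a time from a live stream.
tweet_stream = [
    "#Oscars wow",
    "#Oscars live",
    "#Moonlight wins",
    "#Moonlight again",
    "#Moonlight omg",
]

for tweet in tweet_stream:
    # Update counts incrementally as each tweet arrives...
    for tag in tweet.split():
        if tag.startswith("#"):
            counts[tag] += 1
    # ...so the current leading tag is available at any moment.
    leaders.append(counts.most_common(1)[0][0])

print(leaders)
```

Here the trend can flip from #Oscars to #Moonlight mid-stream, without waiting for a window to close: that is the continuous-update behavior described above.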

Who is using Stream Processing?

In general, stream processing is useful in use cases where we can detect a problem and we have a reasonable response to improve the outcome. Also, it plays a key role in a data-driven organization.

  • Algorithmic trading and stock market surveillance
  • Supply chain optimizations
  • Smart Patient Care
  • Intrusion, Surveillance and Fraud Detection
  • Computer system and network monitoring
  • Geospatial data processing
