
Streaming Big Data with Apache Spark

by Azhar Uddin, Technical Architect


How can we detect faults in a stream of sensor data transmitted from an embedded device that measures temperature variations in an energy plant? Videos are becoming more popular than plain text posts on social media platforms such as Twitter; how can we perform market analysis on such streamed data? It is apparent that the amount of streaming data we work with is increasing, and Apache Spark Streaming, an extension of the Spark core, is a promising utility for processing large-scale streaming data. In this article, we will look at the basics involved and at an example of a log analyser built with Spark.

Analysis of Stream Data

The pipeline of stream data processing includes three main steps (a minimal sketch illustrating them follows the list):

  • Ingesting data: streamed data must be received and buffered before processing. The data could come from sources such as Twitter, Kafka, or TCP sockets.
  • Processing data: the captured data is cleaned, the necessary information is extracted, and it is transformed into results. Algorithms supported by Spark can be used effectively in this step, even for complex tasks such as machine learning and graph processing.
  • Storing data: the generated results should be stored for consumption. The storage system could be a database, a filesystem (such as HDFS), or even a dashboard.
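
As a minimal sketch of these three steps (the host, port, and output path below are assumptions chosen for illustration), a Spark Streaming job could ingest plain-text log lines from a TCP socket, extract the error records, and store per-batch counts on the filesystem:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingPipelineSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingPipelineSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // 1. Ingesting data: receive raw log lines from a TCP socket
    val lines = ssc.socketTextStream("localhost", 9999)

    // 2. Processing data: keep only the error lines and count them per batch
    val errorCounts = lines.filter(_.contains("ERROR")).count()

    // 3. Storing data: write each batch's count to the filesystem
    errorCounts.saveAsTextFiles("output/error-counts")

    ssc.start()
    ssc.awaitTermination()
  }
}

The storing step here writes text files, but as noted above it could just as well be a database write or a dashboard update.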

Development and Deployment Considerations

Before moving into the Spark Streaming API details, let us look at its dependency linking and at the StreamingContext API, which is the entry point to any Spark Streaming application.

Linking Spark Streaming

Both Spark and Spark Streaming can be imported from the Maven repository. Before writing any Spark Streaming application, the dependency should be configured in the Maven project as below.
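
A Maven dependency entry mirroring the coordinates of the sbt line shown next should look roughly as follows (adjust the Scala suffix and the version to match your environment):

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.12</artifactId>
    <version>1.3.1</version>
</dependency>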

Or, for sbt, as below:

libraryDependencies += "org.apache.spark" % "spark-streaming_2.12" % "1.3.1"

If the application will read input from an external source such as Kafka or Twitter, the relevant libraries that handle data receipt and buffering should be added accordingly.
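
For example, a Kafka source is handled by a separate connector artifact; the exact artifact name and version depend on the Kafka connector and on the Spark and Scala versions in use, so the line below is only illustrative:

libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.12" % "<your-spark-version>"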

Streaming Context Object

The first step in any Spark Streaming application is to initialize a StreamingContext object from the SparkConf object.

Once initialized, the normal flow of the application will be:

1. Create input DStreams to define the sources of input.
2. Apply transformations and output operations to the DStreams to define the necessary computations.
3. Receive and process the data by calling the start() method on the initialized StreamingContext object.
4. Wait until processing ends, either due to an error or because stop() was called on the StreamingContext object. This waiting is done by calling awaitTermination() on the same object.

A StreamingContext object is initialized as below:

import org.apache.spark._
import org.apache.spark.streaming._

// Configure the application name and the master URL of the cluster
val configurations = new SparkConf().setAppName(applicationName).setMaster(masterURL)

// Create the streaming context with a batch interval of 2 seconds
val streamingContextObject = new StreamingContext(configurations, Seconds(2))

The second argument to the StreamingContext constructor is the batch interval (we will look at its details in the next section).

This internally creates a SparkContext (accessible as streamingContextObject.sparkContext), which initializes all Spark functionality.
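
Continuing the example, the internally created SparkContext can be retrieved from the streaming context, and steps 3 and 4 of the flow above complete the application's lifecycle. This is only a sketch: it assumes an input DStream with at least one output operation has already been defined on streamingContextObject, since Spark Streaming requires at least one output operation before start() is called.

// The SparkContext created internally by the StreamingContext
val sparkContextObject = streamingContextObject.sparkContext

// Step 3: start receiving and processing data
streamingContextObject.start()

// Step 4: block until processing stops due to an error or an explicit stop()
streamingContextObject.awaitTermination()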

