In this tutorial, we will write a WordCount program that counts the occurrences of each word in a stream of data received from a Data server. We will use Netcat to simulate the Data server, and the WordCount program will use Structured Streaming to count each word.

First, we create a WordCount object and a Spark session as follows:
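A minimal sketch of this step, assuming the program is run locally from the IDE (the app name "WordCount" and the local[*] master are illustrative choices; the remaining steps continue inside this main method, and the full program is given at the end):

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {

        // Create a Spark session (local[*] is assumed for a local run)
        val spark = SparkSession
          .builder
          .appName("WordCount")
          .master("local[*]")
          .getOrCreate()

        // Needed later to convert the streaming DataFrame to a Dataset[String]
        import spark.implicits._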

Next, we create a streaming DataFrame to receive data from the Data server (listening on port 9999 of localhost) using spark.readStream. The type of data source is specified with spark.readStream.format(). Currently, Spark supports four different streaming data sources: File source, Socket source, Kafka source and Rate source [1]. File source reads files placed in a directory as a stream of data (Spark supports five file formats: text, csv, json, orc and parquet). Socket source reads text data from a socket connection, Kafka source reads data from Kafka, and Rate source is normally used for testing, as it generates data at a specified number of rows per second. Each source has different options, which are listed in the Structured Streaming programming guide [1].
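A sketch of this step, continuing the main method above (‘lines’ is the name used in the rest of the tutorial):

    // Create a streaming DataFrame representing lines of text
    // received from the Data server at localhost:9999
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()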

The above ‘lines’ is a DataFrame in the form of a table with a single column named “value”. This is an unbounded table, and each row of the table holds one line of the streamed text.

After receiving streaming data from the Data server, we need to split the text into words. This can be done with the flatMap function. However, as ‘lines’ is a DataFrame, its element type is the untyped Row rather than String. Therefore, we need to convert ‘lines’ from a DataFrame to a Dataset of String before splitting the words, as follows (_.split(" ") is shorthand for line => line.split(" ")):
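A sketch of this step; as[String] performs the DataFrame-to-Dataset conversion:

    // Convert to a Dataset[String], then split each line into words;
    // flatMap flattens the per-line arrays into a single Dataset of words
    val words = lines.as[String].flatMap(_.split(" "))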

After splitting the text into words, we count the occurrences of each word by grouping the words and calling count() as follows:
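A sketch of this step; since the words sit in the column named “value”, we group on that column before calling count():

    // Group identical words together and count each group
    val wordCounts = words.groupBy("value").count()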

Finally, we display the result on the console to check whether the program functions properly. We can choose one of the three output modes supported by Spark: Complete Mode, Append Mode or Update Mode [2]. Complete Mode writes the entire updated result table after every trigger, Append Mode only shows the new rows added since the last trigger, and Update Mode only shows the rows that have changed since the last trigger. After starting the query, we call awaitTermination() to keep the program from exiting while the query is still active.
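A sketch of this final step, using Complete Mode and the console sink:

    // Start the query, printing the complete result table after every trigger
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    // Block until the query is stopped, so the program keeps running
    query.awaitTermination()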

To check the operation of the program, we open a terminal (Ubuntu) and use Netcat with the command nc -lk 9999 to simulate the Data server. We then run the WordCount program (press Ctrl+Shift+F10).

Enter some text in the Netcat terminal and check whether the WordCount program counts the words correctly.
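For example, entering the line "apache spark apache" should make the program print a result table similar to the following (the exact batch number and row order may differ):

    -------------------------------------------
    Batch: 1
    -------------------------------------------
    +------+-----+
    | value|count|
    +------+-----+
    |apache|    2|
    | spark|    1|
    +------+-----+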

So, we have finished writing a WordCount program that uses Structured Streaming to count the occurrences of each word in a data stream. The full code of the program is provided below.
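The listing below assembles the steps above into one program; treat it as a sketch rather than the exact original code (the app name and the local[*] master remain assumptions for a local run):

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Create a Spark session
        val spark = SparkSession
          .builder
          .appName("WordCount")
          .master("local[*]")
          .getOrCreate()

        import spark.implicits._

        // Streaming DataFrame of lines received from the Data server
        val lines = spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .load()

        // Convert to a Dataset[String] and split each line into words
        val words = lines.as[String].flatMap(_.split(" "))

        // Count the occurrences of each word
        val wordCounts = words.groupBy("value").count()

        // Print the complete result table to the console after every trigger
        val query = wordCounts.writeStream
          .outputMode("complete")
          .format("console")
          .start()

        query.awaitTermination()
      }
    }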


[1] https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources

[2] https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

November 20, 2018
ITechSeeker