In the previous tutorial (Integrating Kafka with Spark using DStream), we learned how to integrate Kafka with Spark using the older Spark Streaming (DStream) API. In this tutorial, we will use Spark's newer API, Structured Streaming (see the Spark Structured Streaming tutorials), for this integration.

First, we add the following dependency to the pom.xml file.
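The original dependency block is not shown here; a plausible version is the Kafka connector for Spark SQL. The exact version numbers are assumptions and should match your Spark and Scala versions:

```xml
<!-- Kafka source for Structured Streaming; versions are assumptions -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
    <version>2.4.0</version>
</dependency>
```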

Next, we create a Java class named KafkaSparkUsingStructuredStreaming and create a Spark session as follows:
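The session setup might look like the sketch below; the `local[*]` master URL is an assumption for running locally and should be adjusted for a cluster:

```java
import org.apache.spark.sql.SparkSession;

public class KafkaSparkUsingStructuredStreaming {
    public static void main(String[] args) {
        // Create a Spark session; local[*] uses all available cores on this machine
        SparkSession spark = SparkSession.builder()
                .appName("KafkaSparkUsingStructuredStreaming")
                .master("local[*]")
                .getOrCreate();
    }
}
```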

We then configure Spark to read data from Kafka's test topic as follows (more details on the configuration options can be found here). Note: unlike the Scala API, in the Java API we have to explicitly write Dataset&lt;Row&gt; to define a DataFrame.
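A sketch of this step, assuming a local broker at localhost:9092 and a topic named "test", and continuing from the `spark` session created above:

```java
// Subscribe to the Kafka topic "test"; broker address is an assumption
Dataset<Row> df = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "test")
        .load();
```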

To display the received data, we use the following code (see the tutorial WordCount program with Structured Streaming):
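Printing the raw stream to the console could be done as sketched below, continuing from the `df` DataFrame above:

```java
// Write each micro-batch to the console; awaitTermination() blocks until the query stops
StreamingQuery query = df.writeStream()
        .outputMode("append")
        .format("console")
        .start();
query.awaitTermination();
```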

Running the code, we get the result below:

As we can see, the returned result is a DataFrame with 7 columns (key, value, topic, partition, offset, timestamp, timestampType), where the 'value' column holds the message payload in binary format.

To convert the binary value to a String, we can use the following code (the newly created DataFrame has only one column, the value column):
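The conversion is typically a cast in a SQL expression; a sketch, continuing from the `df` DataFrame above:

```java
// Cast the binary 'value' column to a string; the result has a single column named "value"
Dataset<Row> lines = df.selectExpr("CAST(value AS STRING)");
```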

We then count the occurrences of each word and display the result using the same steps as described in the tutorial WordCount program with Structured Streaming:
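These steps might look like the sketch below, continuing from the `lines` DataFrame above. `Encoders.STRING()` is needed for the typed `flatMap` in the Java API:

```java
// Split each message into words
Dataset<String> words = lines
        .as(Encoders.STRING())
        .flatMap((FlatMapFunction<String, String>) line ->
                Arrays.asList(line.split(" ")).iterator(), Encoders.STRING());

// Group by word and keep a running count
Dataset<Row> wordCounts = words.groupBy("value").count();

// "complete" mode prints the full updated counts table on every trigger
StreamingQuery query = wordCounts.writeStream()
        .outputMode("complete")
        .format("console")
        .start();
query.awaitTermination();
```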

After running the program above, we open a producer terminal and send some text messages.
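For example, using the console producer that ships with Kafka (the installation path and broker address are assumptions):

```shell
# Start a console producer against the "test" topic, then type messages
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
```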

We should be able to see the output of the program as follows:

So, we have finished writing a program that uses Structured Streaming to integrate Kafka with Spark. The use of Structured Streaming with DataFrame and Dataset makes our code cleaner and easier to read and understand while improving the overall performance of the system. The full code of this tutorial is provided below.
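The original listing is not shown here; assembling the steps above, the full program might look like this sketch (broker address, topic name, and master URL are assumptions):

```java
import java.util.Arrays;

import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class KafkaSparkUsingStructuredStreaming {
    public static void main(String[] args) throws Exception {
        // Create the Spark session (local mode; adjust the master URL for a cluster)
        SparkSession spark = SparkSession.builder()
                .appName("KafkaSparkUsingStructuredStreaming")
                .master("local[*]")
                .getOrCreate();

        // Subscribe to the Kafka "test" topic
        Dataset<Row> df = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "test")
                .load();

        // Cast the binary value column to a string
        Dataset<String> lines = df
                .selectExpr("CAST(value AS STRING)")
                .as(Encoders.STRING());

        // Split lines into words and count occurrences of each word
        Dataset<Row> wordCounts = lines
                .flatMap((FlatMapFunction<String, String>) line ->
                        Arrays.asList(line.split(" ")).iterator(), Encoders.STRING())
                .groupBy("value")
                .count();

        // Print the running counts to the console until the query is stopped
        StreamingQuery query = wordCounts.writeStream()
                .outputMode("complete")
                .format("console")
                .start();
        query.awaitTermination();
    }
}
```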

December 12, 2018