In this tutorial, we will use the Twitter Streaming API to collect Twitter’s tweets in real time and publish them to a Kafka topic. We then use Spark as a consumer to read the Twitter data from Kafka and analyze it. For more background, please refer to the two previous tutorials: Publishing Twitter’s data to a Kafka topic and Integrating Kafka with Spark using Structured Streaming.

In this tutorial, we will use both Java and Scala to write the following two programs:

– KafkaTwitterStreaming.java: a Java program that collects Twitter’s data in real time (Twitter streaming data) and publishes it to a Kafka topic

– KafkaSparkSS.scala: a Scala program that uses Spark to read the data from the Kafka broker and analyze it.

Similar to the tutorial Publishing Twitter’s data to a Kafka topic, we will use the twitter4j library to collect Twitter’s data. However, for real-time data we need to use twitter4j-stream instead of twitter4j-core, so we add the following dependency to the pom.xml file:
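Below is a sketch of the dependency declaration; the version shown is an assumption and should be kept in sync with the twitter4j-core version you already use:

    <dependency>
        <groupId>org.twitter4j</groupId>
        <artifactId>twitter4j-stream</artifactId>
        <!-- version is an assumption; match your twitter4j-core version -->
        <version>4.0.7</version>
    </dependency>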

We configure the Kafka properties and connect the program to our Twitter account (more details in the tutorial Publishing Twitter’s data to a Kafka topic) with the following two methods (we add .setJSONStoreEnabled(true) so that the data can be published to Kafka in JSON format):
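A minimal sketch of the two methods, written as members of the KafkaTwitterStreaming class and assuming a local Kafka broker; the method names and the placeholder credentials are illustrative, not the tutorial’s exact code:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;
    import twitter4j.TwitterStream;
    import twitter4j.TwitterStreamFactory;
    import twitter4j.conf.ConfigurationBuilder;
    import java.util.Properties;

    // Configure a Kafka producer that sends String keys and String (JSON) values
    public static KafkaProducer<String, String> createKafkaProducer() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        return new KafkaProducer<>(props);
    }

    // Connect to Twitter; setJSONStoreEnabled(true) keeps the raw JSON of each tweet
    public static TwitterStream createTwitterStream() {
        ConfigurationBuilder cb = new ConfigurationBuilder();
        cb.setDebugEnabled(true)
          .setOAuthConsumerKey("YOUR_CONSUMER_KEY")              // replace with your own keys
          .setOAuthConsumerSecret("YOUR_CONSUMER_SECRET")
          .setOAuthAccessToken("YOUR_ACCESS_TOKEN")
          .setOAuthAccessTokenSecret("YOUR_ACCESS_TOKEN_SECRET")
          .setJSONStoreEnabled(true);
        return new TwitterStreamFactory(cb.build()).getInstance();
    }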

Next, we write the getStreamTweets() method to collect Twitter data in real time as follows:
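A sketch of what getStreamTweets() can look like: a StatusAdapter listener grabs the raw JSON of each incoming tweet and sends it to Kafka (the parameter names are assumptions):

    import org.apache.kafka.clients.producer.ProducerRecord;
    import twitter4j.Status;
    import twitter4j.StatusAdapter;
    import twitter4j.TwitterObjectFactory;

    public static void getStreamTweets(TwitterStream twitterStream,
                                       KafkaProducer<String, String> producer,
                                       String topic) {
        twitterStream.addListener(new StatusAdapter() {
            @Override
            public void onStatus(Status status) {
                // getRawJSON only works because setJSONStoreEnabled(true) was set
                String json = TwitterObjectFactory.getRawJSON(status);
                producer.send(new ProducerRecord<>(topic, Long.toString(status.getId()), json));
            }
        });
        twitterStream.sample();   // start receiving a sample of all public tweets
    }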

The main method of our KafkaTwitterStreaming class is as follows (the Kafka topic is “BigData”; note that Kafka topic names cannot contain spaces):
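Assuming the helper methods sketched above, the main method simply wires them together:

    public static void main(String[] args) {
        KafkaProducer<String, String> producer = createKafkaProducer();
        TwitterStream twitterStream = createTwitterStream();
        getStreamTweets(twitterStream, producer, "BigData");
    }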

Running KafkaTwitterStreaming in IntelliJ, we see the collected tweets printed as they arrive. The returned result is in JSON format, containing information about each tweet such as its ID, content (the text field), posted date (created_at), …

Opening a terminal and starting a Kafka consumer on the same topic gives the same result as above (for example, the fourth tweet, with the ID 1081476078546370560).
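The consumer can be started with the console consumer shipped with Kafka, assuming the broker runs on localhost:9092:

    bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic BigData --from-beginning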

So, we have finished writing a Java program (KafkaTwitterStreaming.java) that collects Twitter’s tweets in real time and publishes them to Kafka (the number of tweets increases over time as the program collects data continuously). The next step is to write a Scala program (KafkaSparkSS.scala) that reads the data from the Kafka broker and performs some analysis on it. To do so, we use Spark and connect it to Kafka as described in the tutorial Integrating Kafka with Spark using Structured Streaming.

As mentioned above, the returned result is in JSON format, but Spark Structured Streaming cannot extract the schema of Kafka data automatically. Therefore, we have to pre-define the schema of the received data. For data with a simple schema, we can define the schema manually as described in the tutorial Statistical analysis with Spark DataFrame. However, the schema of Twitter’s data is so complex that we can hardly define it by hand. Consequently, we first use Spark to extract the schema of a given data sample automatically and then apply this schema to the real-time data. To do so, we copy a sample record received from running KafkaTwitterStreaming.java and save it as twitter.json. We then use the following code to extract the schema:
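A sketch of the schema extraction, assuming twitter.json sits in the project’s working directory:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("KafkaSparkSS")
      .master("local[*]")
      .getOrCreate()

    // Infer the schema once from the saved sample tweet
    val twitterSchema = spark.read.json("twitter.json").schema
    println(twitterSchema.treeString)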

Next, we use the above schema to read the Twitter data stream as follows:
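A sketch of the streaming read, assuming the broker and topic used earlier; from_json() parses the raw Kafka value with the inferred schema, and the selected fields (id, text, created_at) are illustrative:

    import org.apache.spark.sql.functions.{col, from_json}

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "BigData")
      .load()

    // Kafka delivers each tweet as bytes; cast it to a string, then parse it
    val tweets = df.selectExpr("CAST(value AS STRING) AS json")
      .select(from_json(col("json"), twitterSchema).as("tweet"))
      .select("tweet.id", "tweet.text", "tweet.created_at")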

We display Spark’s result on the console with the code below:
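A minimal console sink; outputMode("append") prints each new tweet once as it arrives:

    val query = tweets.writeStream
      .outputMode("append")
      .format("console")
      .option("truncate", "false")
      .start()

    query.awaitTermination()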

Running KafkaSparkSS.scala, we see the selected tweet fields printed to the console in micro-batches as new tweets arrive.

So, we have finished writing a program that makes use of the Twitter Streaming API, Kafka and Spark to collect and analyze Twitter’s data in real time. You can download the full code of this tutorial on the Github of ITechSeeker.

January 5, 2019
ITechSeeker