In this project, we will use the SMACK stack to build a real-time Twitter analysis system based on the Lambda Architecture. To keep the example focused, we only analyze the hashtags of tweets, which is enough to show how the Lambda Architecture can be implemented with the SMACK stack.

The overview of this project can be described using the following diagram:

The project works as follows. First, we use the Twitter4J library (see the tutorial Real-time Twitter Analysis for details) to collect Twitter’s streaming data and publish it to Kafka. This data is then saved in the Cassandra database (see Connecting Kafka to Cassandra Sink) and is also sent to Spark Streaming for real-time processing (see Integrating Kafka with Spark using Structured Streaming). Next, we use Spark batch processing to analyze the data stored in Cassandra; the Akka Scheduler sets the batch interval, so that a batch job runs every 30 minutes. The results of both Spark Streaming (real time) and Spark batch processing are stored in Cassandra. Finally, we write a simple client UI and use Akka HTTP to create a REST API that lets users access the results of the hashtag analysis.
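The 30-minute batch interval mentioned above can be sketched with the Akka scheduler. This is only a minimal illustration, not the project's actual code: it assumes classic Akka actors in version 2.6 or later (older versions use `schedule` instead of `scheduleWithFixedDelay`), and the names `RunBatch`, `BatchActor` and `BatchScheduler` are hypothetical.

```scala
import akka.actor.{Actor, ActorSystem, Props}
import scala.concurrent.duration._

// Hypothetical message that tells the actor to start one batch run.
case object RunBatch

// Hypothetical actor; the real project would launch its Spark batch job here.
class BatchActor extends Actor {
  def receive: Receive = {
    case RunBatch =>
      println("Starting batch processing over the data stored in Cassandra")
  }
}

object BatchScheduler {
  // Interval taken from the text: one batch run every 30 minutes.
  val interval: FiniteDuration = 30.minutes

  def main(args: Array[String]): Unit = {
    val system = ActorSystem("lambda-architecture")
    import system.dispatcher // execution context used by the scheduler

    val batchActor = system.actorOf(Props(new BatchActor), "batch-actor")

    // Send RunBatch immediately, then keep sending it with a fixed
    // 30-minute delay between messages.
    system.scheduler.scheduleWithFixedDelay(Duration.Zero, interval, batchActor, RunBatch)
  }
}
```

Wrapping the job in an actor keeps the scheduling separate from the processing itself: the scheduler only sends a message, and the actor decides what a batch run does.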

This project is written mainly in Scala (except the ClientUI, which is written in Java) and is divided into 5 packages: main_package, data_collector, and three packages that correspond to the three layers of the Lambda Architecture described in the tutorial Introduction of Lambda Architecture. The packages are as follows:

– main_package: this package contains the main code to run the Lambda Architecture (server side) and the user interface (client side). Since the aim of this project is to implement the Lambda Architecture, the client side is only a simple UI that helps users access the results of the Twitter data processing.

– data_collector: this package collects Twitter stream data and sends it to the Kafka broker. The data is then saved in the Cassandra database using the Kafka Cassandra Sink (see Connecting Kafka to Cassandra Sink). This package uses three components of the SMACK stack: C (Cassandra), A (Akka) and K (Kafka).

– batch_layer: this package implements the Batch Layer of the Lambda Architecture. It uses the Spark Cassandra Connector to read data from the Cassandra database and Spark DataFrames to analyze that data. The results are then saved in the batch_view table in Cassandra. Apart from S (Spark) and C (Cassandra), this package also uses A (Akka) to create the actors for batch processing.

– speed_layer: this package implements the Speed Layer of the Lambda Architecture. It uses Spark Structured Streaming to read data directly from the Kafka broker and Spark DataFrames to analyze that data. The results are then saved in the realtime_view table in Cassandra. This package uses all four remaining components of the SMACK stack: S (Spark), A (Akka), C (Cassandra) and K (Kafka). As in batch processing, A (Akka) is used to create the actors for real-time processing.

– serving_layer: this package implements the Serving Layer of the Lambda Architecture. It uses Akka HTTP to create a REST API that lets users access the data from the client-side UI. When a user sends a GET request, the REST API reads data from the Cassandra database and returns the result, which is the combination of both batch_view and realtime_view.
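The way the three layers fit together can be sketched in plain Scala, without Spark, Cassandra or Akka HTTP. The sketch below is an illustration under stated assumptions, not the project's code: it treats a hashtag as `#` followed by word characters, counts hashtags case-insensitively, and assumes that "combining" batch_view and realtime_view means adding the real-time counts (for tweets not yet covered by a batch run) to the batch counts. The names `LambdaSketch`, `countHashtags` and `mergeViews` are hypothetical.

```scala
object LambdaSketch {
  // Assumption: a hashtag is '#' followed by one or more word characters.
  private val Hashtag = "#\\w+".r

  /** Count hashtag occurrences (case-insensitive) across a collection of tweet texts. */
  def countHashtags(tweets: Seq[String]): Map[String, Long] =
    tweets
      .flatMap(text => Hashtag.findAllIn(text).map(_.toLowerCase).toList)
      .groupBy(identity)
      .map { case (tag, occurrences) => tag -> occurrences.size.toLong }

  /** Serving-layer merge: add the real-time counts on top of the batch counts. */
  def mergeViews(batch: Map[String, Long], realtime: Map[String, Long]): Map[String, Long] =
    (batch.keySet ++ realtime.keySet).map { tag =>
      tag -> (batch.getOrElse(tag, 0L) + realtime.getOrElse(tag, 0L))
    }.toMap

  def main(args: Array[String]): Unit = {
    // Stand-in for the batch_view table: tweets already stored and processed in batch.
    val batchView = countHashtags(Seq("Learning #Scala and #Spark", "#spark on #Kafka"))
    // Stand-in for the realtime_view table: tweets that arrived after the last batch run.
    val realtimeView = countHashtags(Seq("New #Spark release"))
    println(mergeViews(batchView, realtimeView))
  }
}
```

In the real project the two maps would correspond to rows read from the batch_view and realtime_view tables, and the merged result would be what the REST API returns for a GET request.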

The full code of this project can be downloaded from ITechSeeker’s GitHub here.

March 11, 2019
ITechSeeker