In the previous post, we implemented the first part of our system (Data collection and storage). In this post, we will implement the second part: the Batch Layer of the Lambda Architecture. The batch processing of our system is carried out by BatchProcessingActor with the following code (you can find the details of how to use Spark to process Cassandra data in the tutorial Connecting Spark to Cassandra)
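The core computation inside BatchProcessingActor is an aggregation: count how often each hashtag appears in the stored tweets. As a rough sketch, that aggregation can be shown in plain Scala on an in-memory collection (in the real project this runs as a Spark job over a Cassandra table; Tweet and countHashtags are illustrative names, not the project's actual code):

```scala
// Illustrative stand-in for the Spark aggregation inside BatchProcessingActor:
// group the stored tweets by hashtag and count the occurrences of each one.
// (The real project does this with a Spark DataFrame read from Cassandra.)
case class Tweet(hashtag: String, createdAtMillis: Long)

def countHashtags(tweets: Seq[Tweet]): Map[String, Long] =
  tweets.groupBy(_.hashtag).map { case (tag, ts) => tag -> ts.size.toLong }
```

The same groupBy-and-count shape carries over to the DataFrame API almost one-to-one.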

In this part, we again use AppConfiguration, as in Data collection and storage, to get the values of tweetDuration and batchInterval specified in the /resources/application.conf file:

– tweetDuration: this parameter limits which Twitter data will be analyzed. In this project, we set tweetDuration = 3 hours, which means that we only analyze the hashtags of tweets posted within the past 3 hours (using dataFrame.filter() to filter these tweets)

– batchInterval: this parameter defines the time between two batch-processing runs. We set batchInterval = 30 minutes, which means that Batch Processing runs every 30 minutes. Here, we use the Akka Scheduler to set this interval (more details on Akka can be found in the tutorial series on Apache Akka)
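Both parameters can be sketched in plain Scala with scala.concurrent.duration from the standard library. The predicate below stands in for the dataFrame.filter() call, and the Akka scheduler call is shown only as a comment because it needs a running ActorSystem; all names here (withinWindow, RunBatch) are illustrative, not the project's actual code:

```scala
import scala.concurrent.duration._

// Values as specified in /resources/application.conf (sketch only; the real
// project loads them through AppConfiguration).
val tweetDuration: FiniteDuration = 3.hours
val batchInterval: FiniteDuration = 30.minutes

// Stand-in for the dataFrame.filter() predicate: keep a tweet only if it was
// posted within the last `tweetDuration` before `nowMillis`.
def withinWindow(createdAtMillis: Long, nowMillis: Long): Boolean =
  createdAtMillis >= nowMillis - tweetDuration.toMillis

// Triggering a batch run every `batchInterval` with the Akka scheduler
// (comment only -- requires akka-actor and a running ActorSystem):
// context.system.scheduler.scheduleWithFixedDelay(
//   Duration.Zero, batchInterval, self, RunBatch)(context.dispatcher)
```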

Running the Runner object, we get the following result:


– In the above code, we use println() to display the results on the screen to help readers better understand how BatchProcessingActor works. However, in the final code of the project, we will remove these println() calls to avoid unnecessary operations. In addition, we will also move the Runner object to the main_package and add RealtimeProcessing to BatchProcessingActor (we will see more in the next posts)

– In the scope of this project, we store the result of Batch Processing in the hashtag_batchview table, deleting the result of the previous Batch Processing first. However, in practice, we should keep the result of each run for later analysis.
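The difference between the two storage strategies can be sketched with in-memory maps standing in for the hashtag_batchview table (hypothetical helpers for illustration; the real project writes to Cassandra):

```scala
// Overwrite strategy (what this project does): only the latest batch view
// survives, the previous run's result is discarded.
def overwrite(store: Map[String, Long], latest: Map[String, Long]): Map[String, Long] =
  latest

// Append strategy (better for later analysis): keep every run's result,
// keyed by the timestamp of the run.
def append(history: Map[Long, Map[String, Long]], runAt: Long,
           latest: Map[String, Long]): Map[Long, Map[String, Long]] =
  history + (runAt -> latest)
```

In Cassandra terms, the append strategy would simply add the run timestamp to the table's primary key instead of deleting old rows.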

March 16, 2019