Apache Spark is a unified analytics engine for big data analysis. It was originally developed in 2009 by the AMPLab at the University of California, Berkeley, to improve the processing speed of Hadoop systems. It was open-sourced in 2010 and donated to the Apache Software Foundation in 2013, where it has since become one of the foundation's major projects. Version 1.0 of Spark was released in May 2014 and version 2.0 followed in July 2016 [1] [2]. At the time of this post (11/2018), the latest version of Spark is 2.3.2.

Spark achieves its high processing speed by building on the Hadoop MapReduce model of cluster computing and adding in-memory cluster computing (processing data in memory is commonly cited as 10-100 times faster than working from disk or over the network). Spark is designed for many different kinds of workloads, such as batch applications, iterative algorithms, interactive queries and streaming.
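To make the in-memory idea concrete, here is a toy, single-machine sketch in plain Python. It is not Spark code and uses no Spark API; the function names and the "move an estimate toward the mean" algorithm are made up for illustration. It only contrasts an iterative job that re-reads its input from disk on every pass (roughly how chained MapReduce jobs behave) with one that loads the data once and reuses the in-memory copy.

```python
import os
import tempfile


def write_dataset(path, n):
    """Write the integers 0..n-1 to a text file, one per line."""
    with open(path, "w") as f:
        for i in range(n):
            f.write(f"{i}\n")


def read_dataset(path):
    """Read the dataset back from disk and parse each line as a float."""
    with open(path) as f:
        return [float(line) for line in f]


def step(data, estimate, rate=0.5):
    """One iteration: move the estimate part of the way toward the mean."""
    mean = sum(data) / len(data)
    return estimate + rate * (mean - estimate)


def iterate_from_disk(path, iterations):
    """Disk-bound style: every iteration reloads its input from disk."""
    estimate = 0.0
    for _ in range(iterations):
        data = read_dataset(path)  # disk read and parse on every pass
        estimate = step(data, estimate)
    return estimate


def iterate_in_memory(path, iterations):
    """In-memory style: load once, then reuse the cached copy."""
    data = read_dataset(path)  # single disk read
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(data, estimate)
    return estimate


def run_demo(n=1000, iterations=20):
    """Run both variants on the same temporary file and return both results."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.txt")
        write_dataset(path, n)
        from_disk = iterate_from_disk(path, iterations)
        in_memory = iterate_in_memory(path, iterations)
    return from_disk, in_memory
```

Both variants compute the same answer; the only difference is how often the input is read and parsed. Wrapping each call with time.perf_counter() should show the in-memory variant finishing sooner, although the size of the gap on one machine depends on file size and operating system caching and says nothing about the 10-100 times figure quoted for clusters.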

[1] “What is Apache Spark? – Definition from WhatIs.com,” SearchDataManagement. [Online]. Available: https://searchdatamanagement.techtarget.com/definition/Apache-Spark. [Accessed: 05-Nov-2018].

[2] tutorialspoint.com, “Apache Spark Introduction,” www.tutorialspoint.com. [Online]. Available: https://www.tutorialspoint.com/apache_spark/apache_spark_introduction.htm. [Accessed: 05-Nov-2018].

November 8, 2018