In this tutorial series, we will learn how to use Spark SQL, Datasets, and DataFrames. Spark SQL is one of the main components of Spark, developed for structured data processing. We can interact with Spark SQL via SQL, the DataFrame API, or the Dataset API. To understand Spark DataFrames and Datasets, we first need to know about RDDs (Resilient Distributed Datasets).
The Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. Resilient refers to the ability to recover data when it is lost or a failure occurs. Distributed means that an RDD's elements are partitioned across different nodes in a cluster; RDDs are also immutable. This partitioning allows Spark to execute tasks in parallel, thereby improving the speed and performance of the system.
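As a minimal sketch of the RDD idea (assuming Spark is on the classpath and run in local mode; the app name and sample data are arbitrary), the following creates an RDD partitioned across workers and applies a transformation in parallel:

```scala
import org.apache.spark.sql.SparkSession

object RddExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddExample")
      .master("local[*]")   // run locally, using all available cores
      .getOrCreate()

    // parallelize() distributes the collection's elements into partitions
    val numbers = spark.sparkContext.parallelize(1 to 10)

    // map() is a transformation, executed in parallel on each partition
    val squares = numbers.map(n => n * n)

    // collect() gathers the results back to the driver
    println(squares.collect().mkString(", "))

    spark.stop()
  }
}
```

If a node holding a partition fails, Spark recomputes that partition from the recorded lineage of transformations, which is what "resilient" refers to above.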
DataFrame is a higher-level API built on top of RDDs, introduced in Apache Spark 1.3 to help users perform data processing tasks easily and to significantly improve the system's performance. Like an RDD, a DataFrame stores data in a distributed and immutable manner, but it organizes the data into named columns, similar to a table in a relational database.
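To make the column-oriented layout concrete, here is a small sketch (assuming a local SparkSession; the column names and rows are invented for illustration) that builds a DataFrame and queries it by column:

```scala
import org.apache.spark.sql.SparkSession

object DataFrameExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._   // enables toDF() and the $"col" syntax

    // A DataFrame with named columns, like a relational table
    val people = Seq(("Alice", 29), ("Bob", 35)).toDF("name", "age")

    people.printSchema()               // shows name: string, age: int
    people.filter($"age" > 30).show()  // column-based filtering, like SQL WHERE

    spark.stop()
  }
}
```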
However, the biggest disadvantage of DataFrame is that it does not support compile-time type safety, which makes some data manipulation error-prone. For example, if we use a DataFrame and access people("age"), the returned result is not an Int but a Column object, so we cannot operate on it as we would on an Int. The lack of type safety also prevents users from taking advantage of the type systems of programming languages such as Scala and Java, and it turns errors that should have been detected at compile time into run-time errors.
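A short sketch of the problem described above (assuming a `people` DataFrame with columns "name" and "age", as in the earlier snippet):

```scala
import org.apache.spark.sql.{Column, SparkSession}

object TypeSafetyExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TypeSafetyExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(("Alice", 29), ("Bob", 35)).toDF("name", "age")

    // people("age") is a Column expression, not an Int
    val ageCol: Column = people("age")

    // val n: Int = ageCol + 1          // would not compile: Column is not an Int
    val plusOne = people.select(people("age") + 1)  // must use column expressions

    // A misspelled column name is only detected at run time, not compile time:
    // people.select(people("agee"))    // throws AnalysisException when executed

    plusOne.show()
    spark.stop()
  }
}
```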
To address this disadvantage of DataFrame, Spark introduced the Dataset API in Apache Spark 1.6, combining the advantages of both RDDs and DataFrames. Unlike RDDs, which use Java serialization, Datasets use Encoders to serialize objects. (Serialization is the process of converting an object into a byte sequence for storage or for transmission over a network; deserialization is the reverse process, converting a byte sequence back into an object.) The Encoder mechanism enables Spark to perform operations such as filtering, sorting, and hashing directly on the serialized bytes, without the deserialization step that conventional serialization mechanisms require. Since Spark 2.0, DataFrame and Dataset have been merged into a single API, in which DataFrame is a type alias for Dataset[Row] (where Row is an untyped object) and Dataset[T] is a collection of strongly typed objects of type T.
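The typed side of the merged API can be sketched as follows (a minimal example assuming a local SparkSession; the `Person` case class and its fields are invented for illustration):

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

// A case class gives the Dataset its compile-time schema
case class Person(name: String, age: Int)

object DatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._   // provides Encoders for case classes

    // Dataset[Person]: strongly typed, checked by the Scala compiler
    val people: Dataset[Person] =
      Seq(Person("Alice", 29), Person("Bob", 35)).toDS()

    // Inside the lambda, p.age really is an Int, unlike the Column
    // returned by a DataFrame lookup
    val adults = people.filter(p => p.age >= 30)

    // DataFrame is just a type alias for Dataset[Row]
    val untyped: Dataset[Row] = people.toDF()

    adults.show()
    spark.stop()
  }
}
```

Referring to `p.age` inside the filter is checked at compile time, so the misspelled-column mistake shown earlier becomes a compile error rather than a run-time AnalysisException.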
The following table lists the typed (Dataset) and untyped (DataFrame) APIs of Spark in each programming language. Note that in the Scala API, DataFrame is a type alias for Dataset[Row], while in the Java API we must explicitly write Dataset<Row> to define a DataFrame. Python and R do not support compile-time type safety, so they offer only the untyped API (DataFrame) and no typed API (Dataset).

    Language   Main abstraction
    Scala      Dataset[T] and DataFrame (an alias for Dataset[Row])
    Java       Dataset<T>
    Python     DataFrame
    R          DataFrame