If you have used Python or participated in a Kaggle competition, you should be familiar with the Pandas library. Pandas provides many functions that help users easily perform analysis on a given dataset. Spark DataFrames support similar functions but process data in a distributed manner, thereby improving the speed and performance of the analysis. Although the functions of Spark DataFrame are not yet comparable with those of Pandas, Spark is making steady progress in improving this API (note: we can convert between a Spark DataFrame and a Pandas DataFrame using Apache Arrow).

In this tutorial, we will use Spark DataFrame to perform statistical analysis with the same steps as in Pandas. We will use the dataset of the Kaggle competition Porto Seguro's Safe Driver Prediction and follow the steps as in Data Preparation & Exploration. The only difference here is that we will use Spark DataFrame instead of Pandas.

We first create a Scala object named DFProcessing and create a Spark session as follows:
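A minimal sketch of the object and session setup (the application name and the `local[*]` master are assumptions for running locally; drop `.master(...)` when submitting to a cluster):

```scala
import org.apache.spark.sql.SparkSession

object DFProcessing {
  def main(args: Array[String]): Unit = {
    // Create (or reuse) a Spark session, the entry point for the DataFrame API
    val spark = SparkSession.builder()
      .appName("DFProcessing")
      .master("local[*]") // assumption: local mode for this tutorial
      .getOrCreate()
  }
}
```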

After downloading and extracting the dataset of Porto Seguro's Safe Driver Prediction, we use DataFrame to load data from the train.csv file as described in the tutorial Loading data from various sources (with Pandas, we just need to call pd.read_csv(), but in Spark, we need to configure the reading options to get the desired format):
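A sketch of the load, assuming the `spark` session from the previous step and that train.csv sits in the working directory (adjust the path as needed):

```scala
// header = true: the first row contains the column names
// inferSchema = true: let Spark infer Integer/Double types
// instead of reading every column as String
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("train.csv")
```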

We can use .head(5) or .show(5) to display the first five rows of the DataFrame as follows (currently, Spark does not support a tail() function):
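For example, assuming the `df` loaded above:

```scala
df.show(5)          // prints the first five rows in tabular form
val rows = df.head(5) // returns the first five rows as Array[Row] instead of printing
```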

We use the code below to count the number of rows, count the number of columns, and check for duplicate data (similar to .shape and .drop_duplicates() in Pandas). The number of rows before and after using .dropDuplicates() is unchanged, indicating that there is no duplicate data in the given dataset.
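A sketch of these checks, assuming the `df` from the loading step:

```scala
println(s"Number of rows: ${df.count()}")
println(s"Number of columns: ${df.columns.length}")
// If the row count is unchanged after dropDuplicates(),
// the dataset contains no duplicate rows
println(s"Rows after dropDuplicates: ${df.dropDuplicates().count()}")
```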

The .printSchema() function of Spark returns the same kind of information as the .info() function of Pandas:
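Assuming the same `df`:

```scala
// Prints each column's name, inferred type and nullability,
// similar to Pandas .info()
df.printSchema()
```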

To get metadata for each variable, we use the following code with the same logic as described in the original article, converted from Python to Spark: classify variables by their role (target, id, and input); classify by their level (binary: either 0 or 1; nominal (categorical): natural numbers representing a specific category; interval: real numbers; ordinal: natural numbers representing a specific order); and classify by data type (Integer, Double, ...).
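One way to sketch this in Spark, assuming the column-naming convention of the Porto Seguro dataset (suffix `_bin` for binary, `_cat` for nominal, floating-point columns as interval, remaining integers as ordinal) and the `spark` session and `df` from earlier:

```scala
import spark.implicits._

// Build a small metadata DataFrame: one row per variable
// with its role, level and data type
val meta = df.schema.fields.map { f =>
  val role =
    if (f.name == "target") "target"
    else if (f.name == "id") "id"
    else "input"
  val level =
    if (f.name.endsWith("_bin") || f.name == "target") "binary"
    else if (f.name.endsWith("_cat") || f.name == "id") "nominal"
    else if (f.dataType.typeName == "double") "interval"
    else "ordinal"
  (f.name, role, level, f.dataType.typeName)
}.toSeq.toDF("varname", "role", "level", "dtype")
```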

Next, we list all nominal variables using the .filter() function of DataFrame as follows (note: we need to use "===" instead of "=="):
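Assuming the metadata DataFrame from the previous step is named `meta`:

```scala
import org.apache.spark.sql.functions.col

// Spark's Column equality operator is ===, not ==
meta.filter(col("level") === "nominal").show()
```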

To list all variables by their role and level, we use .groupBy() and .count() as below:
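Again assuming the `meta` DataFrame built above:

```scala
// Count how many variables fall into each (role, level) combination
meta.groupBy("role", "level").count()
  .orderBy("role", "level")
  .show()
```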

In the Descriptive Statistics section, we use .describe() to get the count, mean, std, min and max of each variable. With Pandas, we only need two lines of code to get this information, but in Spark, we need a few more steps. For convenience, we will write a function named getVarDF() for this task. First, we use .filter() to select the variables of a particular category (binary, interval, nominal, ...); then we perform the following steps: convert the values of all rows of a column to a String list, convert this String list to a Column list, and then use .select() to keep only the necessary variables (here we use the col() function to convert a String to the Column type; for more information about Spark functions, please visit Spark SQL functions).
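A sketch of such a helper under the assumptions above (the name getVarDF() comes from the text; `df` and `meta` are the DataFrames from the earlier steps):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Return the sub-DataFrame containing only the variables of the given level
def getVarDF(df: DataFrame, meta: DataFrame, level: String): DataFrame = {
  // collect the matching variable names into a String list
  val varNames = meta.filter(col("level") === level)
    .select("varname")
    .collect()
    .map(_.getString(0))
  // convert each name to a Column and select only those columns
  df.select(varNames.map(col): _*)
}

// descriptive statistics (count, mean, stddev, min, max)
// for the interval variables
getVarDF(df, meta, "interval").describe().show()
```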

Using the getVarDF() function above to get information on the interval, ordinal and binary variables, we obtain the following result:

To check the missing values of each variable, we use .filter() and .count() as follows (note: the returned result differs from the result in Data Preparation & Exploration because we do not perform the under-sampling step):
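A sketch of this check, assuming the `df` from the loading step and the dataset's convention that missing values are encoded as -1:

```scala
import org.apache.spark.sql.functions.col

// In the Porto Seguro dataset, missing values are encoded as -1
df.columns.foreach { c =>
  val nMissing = df.filter(col(c) === -1).count()
  if (nMissing > 0)
    println(s"$c has $nMissing missing value(s)")
}
```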

To list all distinct values of each categorical variable, we use .distinct().count() as follows:
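For example, assuming the nominal columns follow the `_cat` naming convention of this dataset:

```scala
// Number of distinct values per categorical variable
df.columns.filter(_.endsWith("_cat")).foreach { c =>
  val nDistinct = df.select(c).distinct().count()
  println(s"$c has $nDistinct distinct value(s)")
}
```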

So, we have finished writing a Spark program that performs statistical analysis on a given dataset with the same steps as in Data Preparation & Exploration (excluding Data Visualization). The full code of this tutorial is provided below.

Note: Spark also provides some functions dedicated to statistical analysis. Please visit Spark Basic Statistics for more details.

December 18, 2018