In the tutorial A basic Spark application using DataFrame, we used a DataFrame to read and extract information from a JSON file (people.json). In this tutorial, we will learn how to use Spark to read data from different types of data sources.

Spark SQL allows us to perform data manipulation on various data sources: from a DataFrame, we can create a temporary SQL view and use the .sql() function to run SQL queries, as described in the tutorial A basic Spark application using DataFrame. We can also create a Dataset from a DataFrame by using the .as[class] function, which will be described in the next tutorial.

We first create a Scala object named DataLoader and create a Spark session as follows:
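A minimal sketch of the object and session setup (the application name and the local master setting are assumptions; adjust them to your environment):

```scala
import org.apache.spark.sql.SparkSession

object DataLoader {

  def main(args: Array[String]): Unit = {
    // Create a Spark session; "local[*]" is an assumption for running
    // inside the IDE -- omit .master() when submitting to a cluster.
    val spark = SparkSession.builder()
      .appName("DataLoader")
      .master("local[*]")
      .getOrCreate()
  }
}
```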

Next, we use spark.read to load data in different file formats. For convenience, we store the data in a directory named data_source (create this directory by right-clicking on src/main/resources and selecting New -> Directory). We then copy three files (employees.json, people.csv and users.parquet) from spark\examples\src\main\resources (where spark is the folder obtained by unzipping the Spark distribution downloaded from Spark’s homepage) to the data_source directory.

Since spark.read requires a file path, we write a getFilePath() function to get the path of a particular file as follows:
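A possible implementation, placed inside the DataLoader object and assuming the files sit under src/main/resources/data_source and are therefore visible on the classpath:

```scala
// Resolve the absolute path of a file stored in resources/data_source.
// Loading via the classpath is an assumption; a hard-coded base path
// would work just as well.
def getFilePath(fileName: String): String =
  getClass.getResource("/data_source/" + fileName).getPath
```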

Spark’s default file format is parquet (the default can be changed via the spark.sql.sources.default configuration). To read a file in this default format, we just need to use spark.read.load():
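A minimal sketch, assuming the session and the getFilePath() helper from above:

```scala
// No format is specified, so Spark falls back to the default
// (parquet, per spark.sql.sources.default).
val usersDF = spark.read.load(getFilePath("users.parquet"))
usersDF.show()
```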

Running the above code prints the contents of users.parquet; in Spark’s bundled example file, these are the columns name, favorite_color and favorite_numbers.

To read files in other formats, we can use .format() to specify the file format. For common built-in formats such as json, parquet, jdbc, orc, libsvm, csv and text, we can pass the short name to .format(); for most of these, Spark also provides a dedicated short-name method (e.g. spark.read.json()):
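Both styles are sketched below; the variable names are illustrative:

```scala
// Explicit format: works for any built-in or third-party source.
val peopleDF1 = spark.read.format("json").load(getFilePath("employees.json"))

// Short-name method: equivalent for common built-in formats.
val peopleDF2 = spark.read.json(getFilePath("employees.json"))

peopleDF1.show()
peopleDF2.show()
```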

Spark also allows us to configure additional options when reading data, and each type of data source has its own set of options. In this example, we use three options (‘sep’, ‘inferSchema’ and ‘header’) to read a CSV file (more details on the options for each data source can be found in the Spark documentation).
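A sketch of reading people.csv with these three options (Spark’s bundled example file uses a semicolon as separator):

```scala
// 'sep' sets the field delimiter, 'inferSchema' asks Spark to detect
// column types, and 'header' treats the first line as column names.
val csvDF = spark.read
  .option("sep", ";")
  .option("inferSchema", "true")
  .option("header", "true")
  .csv(getFilePath("people.csv"))
csvDF.show()
```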

As can be seen from the result above, Spark returns the raw content by default. However, if we configure the reading options, the result is returned in a cleaner format (similar to the result of reading the JSON file).

To save a DataFrame in a desired format, we can use the .write function (with either .format() or a short-name method, as described for .read). We can also use write.mode() to specify how to save the data (Save Mode). Spark supports the four modes below (see the sketch after this list):

– SaveMode.ErrorIfExists: throws an exception if the data already exists (this is the default mode)

– SaveMode.Append: appends the data if the target already exists

– SaveMode.Overwrite: overwrites the target if it already exists (existing data is deleted first, then the new data is written)

– SaveMode.Ignore: does not save or change existing data if the target already exists.
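A sketch of writing the csvDF from the CSV example above, using the people_csvDF1 folder mentioned below (the choice of Overwrite mode and the relative output path are assumptions):

```scala
import org.apache.spark.sql.SaveMode

// Write the DataFrame as CSV; Overwrite deletes any existing output
// in people_csvDF1 before writing the new data.
csvDF.write
  .mode(SaveMode.Overwrite)
  .csv("src/main/resources/data_source/people_csvDF1")
```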

After running the above code, a folder named people_csvDF1 is created in the data_source directory (the data is saved as partitioned files, since Spark uses the Hadoop file format).

Note: if an “IllegalArgumentException: Illegal pattern component: XXX” error occurs, it might be because the version of commons-lang3 is not up to date. In that case, we need to add the following commons-lang3 dependency to the pom file:
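The dependency would look like the following (version 3.5 is an assumption; any release from 3.5 onward handles the XXX pattern component):

```xml
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.5</version>
</dependency>
```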

So, we have finished writing a Spark application that reads data from various data sources. The full code of this tutorial is provided below:
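Since the original listing is not reproduced here, below is a consolidated sketch assembling the snippets above (the paths, variable names and master setting remain assumptions):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object DataLoader {

  // Resolve a file stored under resources/data_source via the classpath.
  def getFilePath(fileName: String): String =
    getClass.getResource("/data_source/" + fileName).getPath

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataLoader")
      .master("local[*]")
      .getOrCreate()

    // Read the default (parquet) format.
    val usersDF = spark.read.load(getFilePath("users.parquet"))
    usersDF.show()

    // Explicit format vs. short-name method.
    val peopleDF1 = spark.read.format("json").load(getFilePath("employees.json"))
    val peopleDF2 = spark.read.json(getFilePath("employees.json"))
    peopleDF1.show()
    peopleDF2.show()

    // CSV with reading options.
    val csvDF = spark.read
      .option("sep", ";")
      .option("inferSchema", "true")
      .option("header", "true")
      .csv(getFilePath("people.csv"))
    csvDF.show()

    // Save with an explicit SaveMode.
    csvDF.write
      .mode(SaveMode.Overwrite)
      .csv("src/main/resources/data_source/people_csvDF1")

    spark.stop()
  }
}
```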
