PySpark SQL

Apache Spark is one of the most successful projects of the Apache Software Foundation and is designed for fast computing. Many industries use Apache Spark to build their solutions. PySpark SQL is a module in Spark that integrates relational processing with Spark's functional programming API. We can extract data using the SQL query language, writing queries in the same way as standard SQL. If you have a basic understanding of RDBMS, PySpark SQL will be easy to use, and it lets you go beyond the limitations of traditional relational data processing. Spark also supports the Hive Query Language, but the Hive database has limitations, and Spark SQL was developed to remove those drawbacks. Let's have a look at the following drawbacks of Hive:

Drawbacks of Hive
These drawbacks motivated the development of Spark SQL.

PySpark SQL Brief Introduction

PySpark supports integrated relational processing with Spark's functional programming. It provides support for various data sources and makes it possible to weave SQL queries with code transformations, resulting in a very powerful tool. PySpark SQL establishes the connection between the RDD and the relational table. It provides much closer integration between relational and procedural processing through the declarative DataFrame API, which is integrated with Spark code. Using SQL, it becomes easily accessible to more users and improves optimization for the current ones. It also supports a wide range of data sources and algorithms in Big Data.

Features of PySpark SQL

The features of PySpark SQL are given below:

1) Consistent Data Access

It provides consistent data access, meaning SQL supports a shared way to access a variety of data sources like Hive, Avro, Parquet, JSON, and JDBC. It plays a significant role in accommodating all existing users into Spark SQL.

2) Incorporation with Spark

PySpark SQL queries are integrated with Spark programs, so we can use the queries inside Spark programs. One of its biggest advantages is that developers do not have to manually manage state failure or keep the application in sync with batch jobs.

3) Standard Connectivity

It provides a connection through JDBC or ODBC, the two industry standards for connectivity for business intelligence tools.

4) User-Defined Functions

PySpark SQL has language-integrated User-Defined Functions (UDFs). A UDF is used to define a new column-based function that extends the vocabulary of Spark SQL's DSL for transforming DataFrames.

5) Hive Compatibility

PySpark SQL runs unmodified Hive queries on current data, allowing full compatibility with current Hive data.

PySpark SQL Module

Some important classes of Spark SQL and DataFrames are the following:
Consider the following example of PySpark SQL.

Output:

+-----+
|hello|
+-----+
|spark|
+-----+

Code Explanation:

In the above code, we have imported the findspark module and called findspark.init(); then we imported the SparkSession module to create a Spark session.

from pyspark.sql import SparkSession

A SparkSession can be used to create the Dataset and DataFrame API. A SparkSession can also be used to create a DataFrame, register a DataFrame as a table, execute SQL over tables, cache a table, and read a parquet file.

class builder

It is a builder of SparkSession.

getOrCreate()

It is used to get an existing SparkSession or, if there is no existing one, create a new one based on the options set in the builder.

A Few Other Methods

A few methods of PySpark SQL are the following:

1. appName(name)

It is used to set the name of the application, which will be displayed in the Spark web UI. The parameter name accepts the name of the application.

2. config(key=None, value=None, conf=None)

It is used to set a config option. Options set using this method are automatically propagated to both SparkConf and SparkSession's configuration.

Parameters:
3. master(master)

It sets the Spark master URL to connect to, such as "local" to run locally or "local[4]" to run locally with 4 cores.

Parameters:
4. SparkSession.catalog

It is an interface through which the user may create, drop, alter, or query the underlying databases, tables, functions, etc.

5. SparkSession.conf

It is the runtime configuration interface for Spark. Through this interface, the user can get and set all Spark and Hadoop configurations that are relevant to Spark SQL.

class pyspark.sql.DataFrame

It is a distributed collection of data grouped into named columns. A DataFrame is similar to a relational table in Spark SQL and can be created using various functions in SQLContext. After creating a DataFrame, we can manipulate it using several domain-specific languages (DSL), which are pre-defined functions of DataFrame. Consider the following example:

Querying Using Spark SQL

In the following code, first we create a DataFrame and then execute SQL queries to retrieve the data. Consider the following code:

Output:

+----------------+
|           Genre|
+----------------+
|    canadian pop|
|  reggaeton flow|
|       dance pop|
|             pop|
|         dfw rap|
|             pop|
|      trap music|
|             pop|
|     country rap|
|      electropop|
|       reggaeton|
|       dance pop|
|             pop|
|  panamanian pop|
|canadian hip hop|
|       dance pop|
|           latin|
|         dfw rap|
|canadian hip hop|
|     escape room|
+----------------+
only showing top 20 rows

+---+--------------------+-------------+-----+----------------+------+------------+--------------+--------+--------+-------+--------------+------------+----------+
|_c0|          Track.Name|  Artist.Name|Genre|Beats.Per.Minute|Energy|Danceability|Loudness..dB..|Liveness|Valence.|Length.|Acousticness..|Speechiness.|Popularity|
+---+--------------------+-------------+-----+----------------+------+------------+--------------+--------+--------+-------+--------------+------------+----------+
|  4|Beautiful People ...|   Ed Sheeran|  pop|              93|    65|          64|            -8|       8|      55|    198|            12|          19|        86|
|  6|I Don't Care (wit...|   Ed Sheeran|  pop|             102|    68|          80|            -5|       9|      84|    220|             9|           4|        84|
|  8|   How Do You Sleep?|    Sam Smith|  pop|             111|    68|          48|            -5|       8|      35|    202|            15|           9|        90|
| 13|   Someone You Loved|Lewis Capaldi|  pop|             110|    41|          50|            -6|      11|      45|    182|            75|           3|        88|
| 38|Antisocial (with ...|   Ed Sheeran|  pop|             152|    82|          72|            -5|      36|      91|    162|            13|           5|        87|
| 44|                Talk|       Khalid|  pop|             136|    40|          90|            -9|       6|      35|    198|             5|          13|        84|
| 50|Cross Me (feat. C...|   Ed Sheeran|  pop|              95|    79|          75|            -6|       7|      61|    206|            21|          12|        82|
+---+--------------------+-------------+-----+----------------+------+------------+--------------+--------+--------+-------+--------------+------------+----------+

Using groupBy() function

The groupBy() function collects rows belonging to the same category together.

Output:

+----------------+-----+
|           Genre|count|
+----------------+-----+
|        boy band|    1|
|      electropop|    2|
|             pop|    7|
|         brostep|    2|
|        big room|    1|
|       pop house|    1|
|  australian pop|    1|
|             edm|    3|
|  r&b en espanol|    1|
|       dance pop|    8|
|       reggaeton|    2|
|    canadian pop|    2|
|      trap music|    1|
|     escape room|    1|
|  reggaeton flow|    2|
|  panamanian pop|    2|
|     atl hip hop|    1|
|     country rap|    2|
|canadian hip hop|    3|
|         dfw rap|    2|
+----------------+-----+

repartition(numPartitions, *cols)

The repartition() function returns a new DataFrame partitioned by the given partitioning expressions. It accepts two parameters, numPartitions and *cols. The numPartitions parameter specifies the target number of partitions.

Output:

+---+--------------------+-----------+-------+----------------+------+------------+--------------+--------+--------+-------+--------------+------------+----------+
|_c0|          Track.Name|Artist.Name|  Genre|Beats.Per.Minute|Energy|Danceability|Loudness..dB..|Liveness|Valence.|Length.|Acousticness..|Speechiness.|Popularity|
+---+--------------------+-----------+-------+----------------+------+------------+--------------+--------+--------+-------+--------------+------------+----------+
|  4|Beautiful People ...| Ed Sheeran|    pop|              93|    65|          64|            -8|       8|      55|    198|            12|          19|        86|
|  5|Goodbyes (Feat. Y...|Post Malone|dfw rap|             150|    65|          58|            -4|      11|      18|    175|            45|           7|        94|
| 17|          LA CANCIÓN|   J Balvin|  latin|             176|    65|          75|            -6|      11|      43|    243|            15|          32|        90|
|  4|Beautiful People ...| Ed Sheeran|    pop|              93|    65|          64|            -8|       8|      55|    198|            12|          19|        86|
|  5|Goodbyes (Feat. Y...|Post Malone|dfw rap|             150|    65|          58|            -4|      11|      18|    175|            45|           7|        94|
+---+--------------------+-----------+-------+----------------+------+------------+--------------+--------+--------+-------+--------------+------------+----------+
only showing top 5 rows