PySpark SQL

Apache Spark is one of the most successful projects of the Apache Software Foundation and is designed for fast computing. Many industries use Apache Spark to build their solutions. PySpark SQL is a module in Spark that integrates relational processing with Spark's functional programming API. We can extract data using the SQL query language, writing queries in the same way as standard SQL. If you have a basic understanding of RDBMS, PySpark SQL will be easy to use, and it lets you go beyond the limitations of traditional relational data processing. Spark also supports the Hive Query Language, but the Hive database has limitations, and Spark SQL was developed to remove those drawbacks. Let's have a look at the following drawbacks of Hive:

Drawbacks of Hive
These drawbacks motivated the development of Spark SQL.

PySpark SQL Brief Introduction

PySpark supports integrated relational processing with Spark's functional programming. It provides support for various data sources and makes it possible to weave SQL queries with code transformations, resulting in a very powerful tool. PySpark SQL establishes the connection between the RDD and the relational table. It provides much closer integration between relational and procedural processing through the declarative DataFrame API, which is integrated with Spark code. Using SQL, it becomes easily accessible to more users and improves optimization for the current ones. It also supports a wide range of data sources and algorithms in Big Data.

Features of PySpark SQL

The features of PySpark SQL are given below:

1) Consistent Data Access

It provides consistent data access, meaning SQL supports a shared way to access a variety of data sources like Hive, Avro, Parquet, JSON, and JDBC. It plays a significant role in accommodating all existing users into Spark SQL.

2) Incorporation with Spark

PySpark SQL queries are integrated with Spark programs, so we can use the queries inside Spark programs. One of its biggest advantages is that developers do not have to manually manage state failure or keep the application in sync with batch jobs.

3) Standard Connectivity

It provides a connection through JDBC or ODBC, the two industry standards for connectivity for business intelligence tools.

4) User-Defined Functions

PySpark SQL has language-integrated User-Defined Functions (UDFs). A UDF is used to define a new column-based function that extends the vocabulary of Spark SQL's DSL for transforming DataFrames.

5) Hive Compatibility

PySpark SQL runs unmodified Hive queries on current data, allowing full compatibility with current Hive data.

PySpark SQL Module

Some important classes of Spark SQL and DataFrames are the following:
Consider the following example of PySpark SQL.

Output:

+-----+
|hello|
+-----+
|spark|
+-----+

Code Explanation:

In the above code, we have imported the findspark module and called findspark.init(); then we imported the SparkSession module to create a Spark session.

from pyspark.sql import SparkSession

A SparkSession can be used to create the Dataset and DataFrame API. A SparkSession can also be used to create a DataFrame, register a DataFrame as a table, execute SQL over tables, cache a table, and read a parquet file.

class builder

It is a builder of SparkSession.

getOrCreate()

It is used to get an existing SparkSession or, if there is no existing one, create a new one based on the options set in the builder.

A Few Other Methods

A few methods of PySpark SQL are the following:

1. appName(name)

It is used to set the name of the application, which will be displayed in the Spark web UI. The parameter name accepts the name of the application.

2. config(key=None, value=None, conf=None)

It is used to set a config option. Options set using this method are automatically propagated to both SparkConf and SparkSession's configuration.

Parameters:
3. master(master)

It sets the Spark master URL to connect to, such as "local" to run locally or "local[4]" to run locally with 4 cores.

Parameters:
4. SparkSession.catalog

It is an interface through which the user may create, drop, alter, or query the underlying databases, tables, functions, etc.

5. SparkSession.conf

It is the runtime configuration interface for Spark. Through this interface, the user can get and set all Spark and Hadoop configurations that are relevant to Spark SQL.

class pyspark.sql.DataFrame

It is a distributed collection of data grouped into named columns. A DataFrame is similar to a relational table in Spark SQL and can be created using various functions in SQLContext. After creating a DataFrame, we can manipulate it using several domain-specific languages (DSL), which are pre-defined functions of DataFrame. Consider the following example:

Querying Using Spark SQL

In the following code, first we create a DataFrame and then execute SQL queries to retrieve the data. Consider the following code:

Output:

+----------------+
|           Genre|
+----------------+
|    canadian pop|
|  reggaeton flow|
|       dance pop|
|             pop|
|         dfw rap|
|             pop|
|      trap music|
|             pop|
|     country rap|
|      electropop|
|       reggaeton|
|       dance pop|
|             pop|
|  panamanian pop|
|canadian hip hop|
|       dance pop|
|           latin|
|         dfw rap|
|canadian hip hop|
|     escape room|
+----------------+
only showing top 20 rows

+---+--------------------+-------------+-----+----------------+------+------------+--------------+--------+--------+-------+--------------+------------+----------+
|_c0|          Track.Name|  Artist.Name|Genre|Beats.Per.Minute|Energy|Danceability|Loudness..dB..|Liveness|Valence.|Length.|Acousticness..|Speechiness.|Popularity|
+---+--------------------+-------------+-----+----------------+------+------------+--------------+--------+--------+-------+--------------+------------+----------+
|  4|Beautiful People ...|   Ed Sheeran|  pop|              93|    65|          64|            -8|       8|      55|    198|            12|          19|        86|
|  6|I Don't Care (wit...|   Ed Sheeran|  pop|             102|    68|          80|            -5|       9|      84|    220|             9|           4|        84|
|  8|   How Do You Sleep?|    Sam Smith|  pop|             111|    68|          48|            -5|       8|      35|    202|            15|           9|        90|
| 13|   Someone You Loved|Lewis Capaldi|  pop|             110|    41|          50|            -6|      11|      45|    182|            75|           3|        88|
| 38|Antisocial (with ...|   Ed Sheeran|  pop|             152|    82|          72|            -5|      36|      91|    162|            13|           5|        87|
| 44|                Talk|       Khalid|  pop|             136|    40|          90|            -9|       6|      35|    198|             5|          13|        84|
| 50|Cross Me (feat. C...|   Ed Sheeran|  pop|              95|    79|          75|            -6|       7|      61|    206|            21|          12|        82|
+---+--------------------+-------------+-----+----------------+------+------------+--------------+--------+--------+-------+--------------+------------+----------+

Using groupBy() function

The groupBy() function collects rows belonging to the same category together.

Output:

+----------------+-----+
|           Genre|count|
+----------------+-----+
|        boy band|    1|
|      electropop|    2|
|             pop|    7|
|         brostep|    2|
|        big room|    1|
|       pop house|    1|
|  australian pop|    1|
|             edm|    3|
|  r&b en espanol|    1|
|       dance pop|    8|
|       reggaeton|    2|
|    canadian pop|    2|
|      trap music|    1|
|     escape room|    1|
|  reggaeton flow|    2|
|  panamanian pop|    2|
|     atl hip hop|    1|
|     country rap|    2|
|canadian hip hop|    3|
|         dfw rap|    2|
+----------------+-----+

repartition(numPartitions, *cols)

The repartition() function returns a new DataFrame partitioned by the given partitioning expressions. It accepts two parameters, numPartitions and *cols. The numPartitions parameter specifies the target number of partitions.

Output:

+---+--------------------+-----------+-------+----------------+------+------------+--------------+--------+--------+-------+--------------+------------+----------+
|_c0|          Track.Name|Artist.Name|  Genre|Beats.Per.Minute|Energy|Danceability|Loudness..dB..|Liveness|Valence.|Length.|Acousticness..|Speechiness.|Popularity|
+---+--------------------+-----------+-------+----------------+------+------------+--------------+--------+--------+-------+--------------+------------+----------+
|  4|Beautiful People ...| Ed Sheeran|    pop|              93|    65|          64|            -8|       8|      55|    198|            12|          19|        86|
|  5|Goodbyes (Feat. Y...|Post Malone|dfw rap|             150|    65|          58|            -4|      11|      18|    175|            45|           7|        94|
| 17|          LA CANCIÓN|   J Balvin|  latin|             176|    65|          75|            -6|      11|      43|    243|            15|          32|        90|
|  4|Beautiful People ...| Ed Sheeran|    pop|              93|    65|          64|            -8|       8|      55|    198|            12|          19|        86|
|  5|Goodbyes (Feat. Y...|Post Malone|dfw rap|             150|    65|          58|            -4|      11|      18|    175|            45|           7|        94|
+---+--------------------+-----------+-------+----------------+------+------------+--------------+--------+--------+-------+--------------+------------+----------+
only showing top 5 rows