PySpark UDF

Spark SQL provides the PySpark UDF (User Defined Function), which is used to define a new column-based function. It extends the vocabulary of Spark SQL's DSL for transforming Datasets.

Register a function as a UDF

We can optionally set the return type of the UDF; the default return type is StringType. Consider the following example:
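A minimal sketch of registering a Python function as a UDF with udf(); the function names convert_case and length_udf, and the session setup, are only illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("pyspark_udf").getOrCreate()

# A plain Python function
def convert_case(text):
    return text.upper()

# Registered without a type argument, the UDF returns StringType by default
upper_udf = udf(convert_case)

# The return type can also be set explicitly
length_udf = udf(lambda s: len(s), IntegerType())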

Call the UDF function

A PySpark UDF's functionality is similar to the pandas map() and apply() functions, which operate on pandas Series and DataFrames. In the below example, we will create a PySpark dataframe.
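A sketch of building such a dataframe; creating it from a pandas DataFrame is one convenient way to obtain long, double and array<long> columns, but the exact construction is an assumption here:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark_udf").getOrCreate()

# Build the Spark DataFrame from a pandas DataFrame so that the columns
# are inferred as long, double and array<long>
pdf = pd.DataFrame({
    "integers": [1, 2, 3],
    "floats": [-1.0, 0.5, 2.7],
    "integer_arrays": [[1, 2], [3, 4, 5], [6, 7, 8, 9]],
})
df = spark.createDataFrame(pdf)

df.printSchema()
df.show()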

The code prints the schema of the dataframe and then the dataframe itself.

Output

root
 |-- integers: long (nullable = true)
 |-- floats: double (nullable = true)
 |-- integer_arrays: array (nullable = true)
 |    |-- element: long (containsNull = true)

+--------+------+--------------+
|integers|floats|integer_arrays|
+--------+------+--------------+
|       1|  -1.0|        [1, 2]|
|       2|   0.5|     [3, 4, 5]|
|       3|   2.7|  [6, 7, 8, 9]|
+--------+------+--------------+

Evaluation Order and null checking

PySpark SQL does not guarantee that the order of evaluation of subexpressions stays the same. The Python inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order. For example, logical AND and OR expressions do not have left-to-right "short-circuiting" semantics.

Therefore, it is unsafe to depend on the order of evaluation of a Boolean expression or on the order of WHERE and HAVING clauses, since such expressions and clauses can be reordered during query optimization and planning. In particular, if a UDF relies on short-circuiting semantics in SQL for null checking, there is no guarantee that the null check will happen before the UDF is invoked.
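A small sketch of the safer pattern: perform the null check inside a conditional expression (or inside the UDF itself) rather than relying on a preceding filter. The column and UDF names here are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col, when
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
df_null = spark.createDataFrame([("abc",), (None,)], ["s"])

strlen = udf(lambda s: len(s), IntegerType())

# Unsafe: the filter and the UDF call can be reordered by the optimizer,
# so strlen() may still be invoked on a NULL value.
# df_null.filter(col("s").isNotNull()).filter(strlen(col("s")) > 1).show()

# Safer: do the null check in a conditional expression (CASE WHEN),
# or make the UDF itself null-aware.
df_null.select(when(col("s").isNotNull(), strlen(col("s"))).alias("len")).show()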

Primitive type Outputs

Let's consider a function square() that squares a number, and register this function as Spark UDF.
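For example:

# A plain Python function that squares its argument;
# it works for both integers and floats.
def square(x):
    return x ** 2

square(2)    # 4
square(2.1)  # 4.41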

Now we convert it into a UDF. While registering, we have to specify the return data type using pyspark.sql.types. The problem with the Spark UDF is that it does not convert an integer to a float, whereas the Python function works for both integer and float values. A PySpark UDF returns a column of NULLs whenever the actual return value does not match the declared output data type. Let's consider the following program:
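A sketch of such a program, assuming the dataframe df and the function square() defined above; the UDF name square_udf_int is illustrative:

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# Register square() with an integer (LongType) return type
square_udf_int = udf(lambda x: square(x), LongType())

df.select(
    "integers",
    "floats",
    square_udf_int("integers").alias("int_squared"),
    square_udf_int("floats").alias("float_squared"),
).show()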

Output:

+--------+------+-----------+-------------+
|integers|floats|int_squared|float_squared|
+--------+------+-----------+-------------+
|       1|  -1.0|          1|         null|
|       2|   0.5|          4|         null|
|       3|   2.7|          9|         null|
+--------+------+-----------+-------------+

As we can see in the above output, the UDF returns null for the float inputs. Now let's have a look at another example.

Register the UDF with float type output
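A sketch of the same registration, this time with FloatType; again the UDF name is illustrative:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# The same function registered with a float return type;
# now it is the integer column that comes back as null.
square_udf_float = udf(lambda x: square(x), FloatType())

df.select(
    "integers",
    "floats",
    square_udf_float("integers").alias("int_squared"),
    square_udf_float("floats").alias("float_squared"),
).show()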

Output:

+--------+------+-----------+-------------+
|integers|floats|int_squared|float_squared|
+--------+------+-----------+-------------+
|       1|  -1.0|       null|          1.0|
|       2|   0.5|       null|         0.25|
|       3|   2.7|       null|         7.29|
+--------+------+-----------+-------------+

Specifying float type output using the Python function

Here we force the output to be a float for the integer inputs as well.
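A sketch, assuming a helper square_float() that casts the result before returning it:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# Cast inside the Python function so the result is always a float,
# whether the input is an integer or a float.
def square_float(x):
    return float(x ** 2)

square_udf_float2 = udf(lambda x: square_float(x), FloatType())

df.select(
    "integers",
    "floats",
    square_udf_float2("integers").alias("int_squared"),
    square_udf_float2("floats").alias("float_squared"),
).show()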

Output:

+--------+------+-----------+-------------+
|integers|floats|int_squared|float_squared|
+--------+------+-----------+-------------+
|       1|  -1.0|        1.0|          1.0|
|       2|   0.5|        4.0|         0.25|
|       3|   2.7|        9.0|         7.29|
+--------+------+-----------+-------------+

Composite Type Output

If the output of the Python function is a list, then the return type must be declared as ArrayType(), with the element type, when registering the UDF. Consider the following code:
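A sketch of a list-returning UDF registered with ArrayType(FloatType()); the name square_list is illustrative:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType

# Square every element of the list and return a new list of floats
def square_list(x):
    return [float(v) ** 2 for v in x]

square_list_udf = udf(lambda x: square_list(x), ArrayType(FloatType()))

df.select("integer_arrays", square_list_udf("integer_arrays")).show()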

Output:

+--------------+------------------------+
|integer_arrays|        (integer_arrays)|
+--------------+------------------------+
|        [1, 2]|              [1.0, 4.0]|
|     [3, 4, 5]|       [9.0, 16.0, 25.0]|
|  [6, 7, 8, 9]|    [36.0, 49.0, 64.0...|
+--------------+------------------------+

Some Common UDF Problems

  • Py4JJavaError

It is the most common exception while working with UDFs. It comes from a mismatch between the data types used by Python and by Spark. If the Python function returns a data type from a third-party module such as numpy.ndarray, the UDF throws an exception.
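For example, the following sketch builds a small dataframe with an array column; the variable name df_np is illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A small DataFrame with a single array column
df_np = spark.createDataFrame([([1, 2, 3],), ([4, 5, 6],)], ["int_arrays"])
df_np.show()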

Output:

+----------+
|int_arrays|
+----------+
| [1, 2, 3]|
| [4, 5, 6]|
+----------+

In the below example, we create a function that returns a numpy.ndarray. Its values are also NumPy objects (numpy.int32) instead of Python primitives.
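A sketch of such a function; the name square_array_wrong is illustrative:

import numpy as np

# Returns a numpy.ndarray whose elements are numpy.int32, not Python ints
def square_array_wrong(x):
    return np.square(np.array(x, dtype=np.int32))

square_array_wrong([1, 2, 3])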

Output:

array([1, 4, 9], dtype=int32)

If we execute the below code, it will throw a Py4JJavaError exception.
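A sketch of the failing call, using the function and dataframe defined above:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

# Spark cannot serialize the numpy values as ArrayType(IntegerType()),
# so evaluating this UDF fails with Py4JJavaError.
spark_square_array_wrong = udf(square_array_wrong, ArrayType(IntegerType()))
df_np.withColumn("squared", spark_square_array_wrong("int_arrays")).show()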

Output:

[Screenshot: Py4JJavaError traceback]

The solution to this type of exception is to convert the result back to a list whose values are Python primitives.
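A sketch of the fix, using tolist() to return Python primitives; the name square_array_right is illustrative:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

# tolist() converts the numpy array back into a list of Python ints
def square_array_right(x):
    return np.square(np.array(x, dtype=np.int32)).tolist()

spark_square_array_right = udf(square_array_right, ArrayType(IntegerType()))
df_np.withColumn("squared", spark_square_array_right("int_arrays")).show()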

Output:

+----------+------------+
|int_arrays|     squared|
+----------+------------+
| [1, 2, 3]|   [1, 4, 9]|
| [4, 5, 6]|[16, 25, 36]|
+----------+------------+

In the above code, we described the solution to the exception. Now try it on your own and observe the difference between the two programs.

  • Slowness

PySpark UDFs have another drawback: they can take a lot of time to run compared to their plain-Python counterparts. A small data size (in terms of file size) is one of the reasons for the slowness, because Spark may send the whole dataframe to one and only one executor and leave the other executors waiting. The solution is to repartition the dataframe. For example:
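A sketch, reusing the dataframe and UDF from the earlier examples; the partition count is an arbitrary illustration:

# Spread the rows over several partitions so that more than one executor
# can run the UDF at the same time.
df = df.repartition(8)
df.select("integers", square_udf_int("integers").alias("int_squared")).show()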

When we repartition the data, each executor processes one partition at a time, which reduces the execution time.

