PySpark Serializer

PySpark Serialization is used to perform tuning on Apache Spark. PySpark supports custom serializers for transferring data. It helps to enhance performance. When we perform a function on an RDD (Spark's Resilient Distributed Dataset), it needs to be serialized so that it can be sent to each working node to execute on its segment of data.

The map() function accepts another function as an argument that needs to be serialized. The objects that are made within this function; those objects will also need to be serialized.

All data which is transferred over the network to write to disk or that remains in memory must be serialized. Serialization plays a vital role in performing a complex operation.

The PySpark supports the following two types of serialization.

MarshelSerializer
PickleSerializer

Let's understand types of PySpark Serializers in detail.

1. MarshalSerializer

MarshalSerializer is used to serialize objects by using PySpark. It is faster than the PickleSerializer, but it supports few datatypes only. The serializer must be chosen at the time of creating SparkContext.

class MarshalSerializer(FramedSerializer):
     """
     http://docs.python.org/2/library/marshal.html
     """
    dumps = marshal.dumps
    loads = marshal.loads

Instance Methods

The instance methods of the marshal serializer are given below:

Inherited from FramedSerializer: __init__, dump_stream, load_stream
Inherited from Serializer: __eq__, __ne__
Inherited from object: __delattr__, __format__, __getattribute__, __hash__,
__new__, __reduce__, __reduce_ex__, __repr__, 
__setattr__, __sizeof__, __str__, __subclasshook__

Consider the following example:

from pyspark.context import SparkContext
from pyspark.serializers import MarshalSerializer
sc = SparkContext('local', 'test', serializer=MarshalSerializer())
sc.parallelize(list(range(1000))).map(lambda x:  3 * x).take(10)
sc.stop()

Output:

[0, 6, 9, 12, 15, 18, 21, 24, 27, 30]

PickleSerializer

PickelSerializer is used to serialize objects. It supports nearly any Python Object, although it is not fast as specialized serializers. Consider the following code:

class PickleSerializer(FramedSerializer):
     """
         http://docs.python.org/2/library/pickle.html
     """
    def dumps(self, obj):
         return cPickle.dumps(obj, 2)
     loads = cPickle.loads

In the above code, class PickleSerializer has inherited the FramedSerializer and defined the dump method inside the class.

Instance Methods

The instance methods of the pickle serializer are given below:

Inherited from FramedSerializer: __init__, 
dump_stream, 
load_stream
Inherited from Serializer: __eq__,
 __ne__
Inherited from object: __delattr__,__format__,__getattribute__,__hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__,__str__, __subclasshook__

Next Topic#

← prev next →