
PySpark Serializer

Serialization is used for performance tuning on Apache Spark. PySpark supports custom serializers for transferring data, which helps to enhance performance. When we perform a function on an RDD (Spark's Resilient Distributed Dataset), it needs to be serialized so that it can be sent to each worker node to execute on its segment of data.

The map() function accepts another function as an argument, and that function needs to be serialized. Any objects created within that function must be serialized as well.

All data that is transferred over the network, written to disk, or persisted in memory must be serialized. Serialization therefore plays a vital role in the performance of complex operations.
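For instance, in the minimal sketch below (variable names are illustrative), both the lambda passed to map() and the factor variable it references must be serialized and shipped to every worker node:

from pyspark import SparkContext

sc = SparkContext("local", "serialization demo")

factor = 3  # referenced by the lambda below, so it is serialized along with it
rdd = sc.parallelize(range(10))

# The lambda (and everything it references) is serialized and sent
# to each worker node before it runs on its partition of the data
print(rdd.map(lambda x: x * factor).collect())
sc.stop()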

PySpark supports the following two types of serializers:

  • MarshalSerializer
  • PickleSerializer

Let's understand both types of PySpark serializers in detail.

1. MarshalSerializer

MarshalSerializer serializes objects using Python's marshal module. It is faster than PickleSerializer, but it supports only a limited set of data types. The serializer must be chosen when the SparkContext is created.

  • Instance Methods

The instance methods of the marshal serializer are given below:

  • dumps(obj) - serializes the given object into a byte string using Python's marshal module
  • loads(obj) - deserializes a byte string back into the original object
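As a quick illustration outside of Spark (a minimal sketch), these methods can be called directly on a MarshalSerializer instance:

from pyspark.serializers import MarshalSerializer

ser = MarshalSerializer()
data = [1, 2, 3]

blob = ser.dumps(data)    # serialize the list to a byte string via the marshal module
print(ser.loads(blob))    # deserialize back to the original list: [1, 2, 3]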

Consider the following example, a minimal sketch in which MarshalSerializer is chosen when the SparkContext is created (the application name is illustrative):
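from pyspark import SparkContext
from pyspark.serializers import MarshalSerializer

# The serializer is chosen here, when the SparkContext is created
sc = SparkContext("local", "Marshal serialization app", serializer=MarshalSerializer())

# Triple each number in the RDD and take the first ten results
print(sc.parallelize(list(range(1000))).map(lambda x: 3 * x).take(10))
sc.stop()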

Output:

[0, 3, 6, 9, 12, 15, 18, 21, 24, 27]

2. PickleSerializer

PickleSerializer serializes objects using Python's pickle module. It supports nearly any Python object, although it is not as fast as more specialized serializers. Consider the following code, a simplified sketch of how PickleSerializer is defined in pyspark.serializers (the real source also handles pickle protocol versions and related details):
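import pickle
from pyspark.serializers import FramedSerializer

class PickleSerializer(FramedSerializer):

    # Serialize the object into a pickled byte string
    def dumps(self, obj):
        return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)

    # Rebuild the original object from a pickled byte string
    def loads(self, obj, encoding="bytes"):
        return pickle.loads(obj, encoding=encoding)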

In the above code, the PickleSerializer class inherits from FramedSerializer and defines the dumps() and loads() methods inside the class.

  • Instance Methods

The instance methods of the pickle serializer are given below:

  • dumps(obj) - serializes the given object into a pickled byte string
  • loads(obj) - deserializes a pickled byte string back into the original object
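Because pickle handles nearly any Python object, PickleSerializer can ship tuples, strings, and other structures without extra work. The following minimal sketch (the application name is illustrative) chooses it explicitly when the SparkContext is created:

from pyspark import SparkContext
from pyspark.serializers import PickleSerializer

sc = SparkContext("local", "Pickle serialization app", serializer=PickleSerializer())

# Tuples of mixed types serialize without any special handling
print(sc.parallelize(range(5)).map(lambda x: (x, str(x))).collect())
sc.stop()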


