Watch Now The Real Python team has produced a video course that corresponds to this lesson. Watch it together with the written lesson to enhance your comprehension: How to Use a PDF File in Python
The Portable Document Format, or PDF, is a file format that can be used to present and exchange documents reliably across operating systems. Although Adobe originally invented the PDF, it is now an open standard maintained by the International Organization for Standardization (ISO). You can work with a preexisting PDF in Python by using the PyPDF2 package.
PyPDF2 is a pure Python module that may be used for a variety of PDF operations.
At the conclusion of this article, you will be able to:
- Extract document information from a PDF in Python
- Rotate pages
- Merge PDFs
- Split PDFs
- Add watermarks
- Encrypt a PDF
Let’s get started!
The evolution of pyPdf, PyPDF2, and PyPDF4
The original pyPdf package was released in 2005. The last official release of pyPdf was in 2010. After a lapse of about a year, a company called Phasit sponsored a fork of pyPdf called PyPDF2. The code was written to be backwards compatible with the original and worked quite well for several years, up to a release in 2016.
There was then a brief series of releases of a package called PyPDF3, after which the project was renamed PyPDF4. The biggest difference between pyPdf and PyPDF2+ is that the later versions added Python 3 support. There is also a separate Python 3 fork of the original pyPdf, but it has not been maintained for years.
Although PyPDF2 was abandoned in 2016, it was revived in 2022 and is now actively maintained. The newer PyPDF4 is not fully backwards compatible with PyPDF2. A few of the examples in this article do not work with PyPDF4, which is why PyPDF4 is not featured more prominently here. Feel free to replace the PyPDF2 imports with PyPDF4 imports and see how the examples work for you.
pdfrw: An Alternative
Patrick Maupin created a package called pdfrw that can do many of the same things as PyPDF2. You can use pdfrw for all of the tasks covered in this article for PyPDF2, with the notable exception of encryption.
The biggest difference with pdfrw is that it integrates with the ReportLab package, so you can take a preexisting PDF and build a new one with ReportLab using some or all of the original.
Installation
PyPDF2 may be installed using pip or conda if you are using Anaconda instead of standard Python.
This is how to install PyPDF2 using pip:
$ pip install pypdf2
PyPDF2 does not have any dependencies, so the installation is quick. You will likely spend as much time downloading the package as you will installing it.
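If you want to confirm that the installation worked before running the examples, a small check like the following may help. This is only a sketch: is_installed() is a hypothetical helper name, not part of PyPDF2, and it relies solely on the standard library.

```python
# check_pypdf2.py
# Sketch only: is_installed() is a hypothetical helper, not a PyPDF2 function.
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return find_spec(module_name) is not None


if __name__ == '__main__':
    if is_installed('PyPDF2'):
        print('PyPDF2 is available')
    else:
        print('PyPDF2 was not found; try: pip install pypdf2')
```

Note that the module is imported as PyPDF2, with that exact capitalization, even though pip accepts the lowercase name.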
Let’s now move on to learning how to extract information from a PDF.
How to Extract Document Information From a PDF in Python
You can use PyPDF2 to extract metadata and some text from a PDF. This can be useful when you are automating work on your preexisting PDF files.
These are the types of data that can currently be extracted:
- Author
- Creator
- Producer
- Subject
- Title
- Number of pages
You must locate a PDF file to utilise for this example. You may use any PDF file on your device. To facilitate this activity, I visited Leanpub and downloaded a sample of one of my books. The filename of the sample you want to download is reportlab-sample.pdf.
Let’s write some code using that PDF and see how to access these attributes:
# extract_doc_info.py

from PyPDF2 import PdfFileReader

def extract_information(pdf_path):
    with open(pdf_path, 'rb') as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()

    txt = f"""
    Information about {pdf_path}:

    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """

    print(txt)
    return information

if __name__ == '__main__':
    path = 'reportlab-sample.pdf'
    extract_information(path)
Here, you import PdfFileReader from the PyPDF2 package. PdfFileReader is a class with several methods for working with PDF files. In this example, you call .getDocumentInfo(), which returns a DocumentInformation object. This object holds most of the information that you are interested in. You also call .getNumPages() on the reader object, which returns the number of pages in the document.
This code block uses Python 3’s f-strings for string formatting. For more information, check out Python 3’s f-Strings: An Improved String Formatting Syntax (Guide).
The information variable contains numerous instance properties that you may utilise to get the remaining document details. You print off and return the information for possible future use.
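Not every PDF fills in every metadata field, so a report like the one above can end up printing None for some entries. The sketch below shows one way to substitute a placeholder instead. It assumes that a missing field comes back as None, and format_information() and FIELDS are hypothetical names introduced here for illustration. A SimpleNamespace stands in for the object returned by .getDocumentInfo() so that the example runs without a PDF file.

```python
# doc_info_report.py
# Sketch only: format_information() and FIELDS are illustrative names,
# and the sample values below are made up.
from types import SimpleNamespace

FIELDS = ('author', 'creator', 'producer', 'subject', 'title')


def format_information(information, number_of_pages, placeholder='(not set)'):
    """Build a report, replacing missing or empty fields with a placeholder."""
    lines = []
    for field in FIELDS:
        # getattr() with a default also covers objects lacking the attribute
        value = getattr(information, field, None)
        lines.append(f'{field.capitalize()}: {value or placeholder}')
    lines.append(f'Number of pages: {number_of_pages}')
    return '\n'.join(lines)


if __name__ == '__main__':
    # Stand-in for the object returned by .getDocumentInfo()
    fake_info = SimpleNamespace(author='Jane Doe', title='Sample', subject=None)
    print(format_information(fake_info, 12))
```

In extract_information(), you could pass the information and number_of_pages variables to this helper instead of building the f-string by hand.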
While PyPDF2 has .extractText(), which can be used on its page objects (not shown in this example), it does not work very well. Some PDFs will return text and some will return an empty string. When you want to extract text from a PDF, you should check out the PDFMiner project instead. PDFMiner is much more robust and was designed specifically for extracting text from PDFs.
Now you are ready to learn about rotating PDF pages.
Guide to Rotating Pages
You may sometimes get PDFs with pages that are in landscape mode rather than portrait mode. Or maybe they are even inverted. This may occur when a document is scanned to PDF or emailed. You could print the document and read it on paper, or you could utilise Python’s capabilities to rotate the troublesome pages.
For this example, you may go to Real Python and print an article to PDF.
Let’s learn how to rotate a few pages of that article with PyPDF2:
# rotate_pages.py

from PyPDF2 import PdfFileReader, PdfFileWriter

def rotate_pages(pdf_path):
    pdf_writer = PdfFileWriter()
    pdf_reader = PdfFileReader(pdf_path)
    # Rotate page 90 degrees to the right
    page_1 = pdf_reader.getPage(0).rotateClockwise(90)
    pdf_writer.addPage(page_1)
    # Rotate page 90 degrees to the left
    page_2 = pdf_reader.getPage(1).rotateCounterClockwise(90)
    pdf_writer.addPage(page_2)
    # Add a page in normal orientation
    pdf_writer.addPage(pdf_reader.getPage(2))

    with open('rotate_pages.pdf', 'wb') as fh:
        pdf_writer.write(fh)

if __name__ == '__main__':
    path = 'Jupyter_Notebook_An_Introduction.pdf'
    rotate_pages(path)
In this example, you need to import PdfFileWriter in addition to PdfFileReader because you will be writing out a new PDF. rotate_pages() takes the path of the PDF that you want to modify. Inside the function, you create a writer object named pdf_writer and a reader object named pdf_reader.
Next, you use .getPage() to get the desired page. Here, you grab page zero, which is the first page. Then you call the page object’s .rotateClockwise() method and pass in 90 degrees. For the second page, you call .rotateCounterClockwise() and pass it 90 degrees as well.
Notice that the PyPDF2 module restricts page rotation to 90-degree increments. Otherwise, you will obtain an AssertionError.
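If the rotation angle comes from user input, you may prefer to validate it yourself and fail with a clearer message than a bare AssertionError. The following is a minimal sketch of that idea. normalize_rotation() is a hypothetical helper introduced here, not a PyPDF2 function.

```python
# rotation_check.py
# Sketch only: normalize_rotation() is a hypothetical helper, not part of PyPDF2.

def normalize_rotation(angle):
    """Reduce a clockwise angle to 0, 90, 180, or 270, or raise ValueError."""
    if angle % 90 != 0:
        raise ValueError(
            f'rotation must be a multiple of 90 degrees, got {angle}'
        )
    # Python's modulo maps negative angles to their clockwise equivalent,
    # so -90 (a quarter turn to the left) becomes 270
    return angle % 360
```

The returned value can then be passed to .rotateClockwise(), since a negative angle is converted into the equivalent clockwise rotation.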
After each call to the rotation methods, you call .addPage(). This adds the rotated version of the page to the writer object. The last page that you add to the writer object is page 3, without any rotation.
Finally, you write out the new PDF using .write(). It takes a file-like object as its argument. This new PDF will contain three pages. The first two will be rotated in opposite directions and be in landscape, while the third will be a normal page.
Let’s now examine how to combine many PDFs into one.
How to Join PDF Files
There are several instances in which you may wish to combine two or more PDFs into a single PDF. For instance, a common cover page may be required for several sorts of reports. You can do this type of task using Python.
For this example, it is possible to open a PDF and print a page as a separate PDF. Then repeat the process with a different page. This will provide you with many inputs for example purposes.
Let’s proceed and build some code that may be used to combine PDFs:
# pdf_merging.py

from PyPDF2 import PdfFileReader, PdfFileWriter

def merge_pdfs(paths, output):
    pdf_writer = PdfFileWriter()

    for path in paths:
        pdf_reader = PdfFileReader(path)
        for page in range(pdf_reader.getNumPages()):
            # Add each page to the writer object
            pdf_writer.addPage(pdf_reader.getPage(page))

    # Write out the merged PDF
    with open(output, 'wb') as out:
        pdf_writer.write(out)

if __name__ == '__main__':
    paths = ['document1.pdf', 'document2.pdf']
    merge_pdfs(paths, output='merged.pdf')
You can use merge_pdfs() when you have a list of PDFs that you want to merge. The function takes a list of input paths and an output path, since it needs to know where to save the result.
Then you loop over the inputs and create a PDF reader object for each one. Next, you iterate over all of the pages in that PDF file and use .addPage() to add each page to the writer object.
After you have completed iterating over each page of each PDF in your list, you will output the final result.
If you did not want to combine all of the pages of each PDF, you could improve this script by specifying a range of pages to be included. You could also develop a command line interface for this function using Python’s argparse module if you want a challenge.
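Both enhancements suggested above can be sketched with the standard library alone. The example below defines a hypothetical command line parser and a helper that turns a 1-based page range into the zero-based indexes that .getPage() expects. The names build_parser() and page_indexes(), and the --first and --last options, are assumptions made for illustration and are not part of PyPDF2.

```python
# merge_cli.py
# Sketch only: the option names and helper functions are illustrative choices.
import argparse


def build_parser():
    """Create a command line parser for a PDF merging script."""
    parser = argparse.ArgumentParser(description='Merge PDF files into one.')
    parser.add_argument('paths', nargs='+', help='input PDF files, in order')
    parser.add_argument('-o', '--output', default='merged.pdf',
                        help='path of the merged PDF')
    parser.add_argument('--first', type=int, default=1,
                        help='first page to take from each input (1-based)')
    parser.add_argument('--last', type=int, default=None,
                        help='last page to take from each input (inclusive)')
    return parser


def page_indexes(number_of_pages, first=1, last=None):
    """Translate a 1-based inclusive page range into zero-based indexes."""
    if last is None or last > number_of_pages:
        # Clamp the end of the range to the length of the document
        last = number_of_pages
    return range(max(first, 1) - 1, last)
```

Inside merge_pdfs(), the inner loop could then iterate over page_indexes(pdf_reader.getNumPages(), first, last) instead of over every page, with first and last taken from the parsed arguments.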
Now let’s learn how to do the opposite of merging!
How to Divide a PDF
There are situations when it may be necessary to divide a PDF into numerous PDFs. This is particularly true for PDFs that include a great deal of scanned-in material, although there are a multitude of reasons to divide a PDF.
Here’s how to divide your PDF into numerous files using PyPDF2:
# pdf_splitting.py

from PyPDF2 import PdfFileReader, PdfFileWriter

def split(path, name_of_split):
    pdf = PdfFileReader(path)
    for page in range(pdf.getNumPages()):
        pdf_writer = PdfFileWriter()
        pdf_writer.addPage(pdf.getPage(page))

        output = f'{name_of_split}{page}.pdf'
        with open(output, 'wb') as output_pdf:
            pdf_writer.write(output_pdf)

if __name__ == '__main__':
    path = 'Jupyter_Notebook_An_Introduction.pdf'
    split(path, 'jupyter_page')
In this example, you build a PDF reader object and iterate through its pages once again. For each PDF page, a new PDF writer instance will be created and a single page will be added. Afterwards, you will save this page to a file with a distinct name. When the script has completed executing, each page of the original PDF should be divided into distinct PDFs.
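The script above names its outputs jupyter_page0.pdf, jupyter_page1.pdf, and so on. Once a document has more than ten pages, those names no longer sort in page order in a plain alphabetical file listing. One possible refinement, sketched below, is to zero-pad the page number and start counting from 1. split_output_names() is a hypothetical helper introduced for illustration.

```python
# split_names.py
# Sketch only: split_output_names() is a hypothetical helper, not part of PyPDF2.

def split_output_names(name_of_split, number_of_pages):
    """Return one zero-padded output filename per page, numbered from 1."""
    # Pad to the number of digits in the highest page number
    width = len(str(number_of_pages))
    return [
        f'{name_of_split}_{page:0{width}d}.pdf'
        for page in range(1, number_of_pages + 1)
    ]
```

In split(), you could build this list once before the loop and pick the output name for each page by its index.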
Now, let’s learn how to apply a watermark to your PDF file.
How to Include Watermarks
Watermarks are identifying images or patterns on printed and digital documents. Some watermarks can be seen only under special lighting conditions. Watermarking matters because it helps you protect your intellectual property, such as your images or PDFs. Another term for watermark is overlay.
Python and PyPDF2 may be used to watermark documents. You need a PDF containing simply your watermark picture or text.
Let’s now discover how to apply a watermark:
# pdf_watermarker.py

from PyPDF2 import PdfFileWriter, PdfFileReader

def create_watermark(input_pdf, output, watermark):
    watermark_obj = PdfFileReader(watermark)
    watermark_page = watermark_obj.getPage(0)

    pdf_reader = PdfFileReader(input_pdf)
    pdf_writer = PdfFileWriter()

    # Watermark all the pages
    for page in range(pdf_reader.getNumPages()):
        page = pdf_reader.getPage(page)
        page.mergePage(watermark_page)
        pdf_writer.addPage(page)

    with open(output, 'wb') as out:
        pdf_writer.write(out)

if __name__ == '__main__':
    create_watermark(
        input_pdf='Jupyter_Notebook_An_Introduction.pdf',
        output='watermarked_notebook.pdf',
        watermark='watermark.pdf')
create_watermark() takes three arguments:
- input_pdf: the path of the PDF file to be watermarked
- output: the path where you want to save the watermarked version of the PDF
- watermark: a PDF that contains your watermark image or text
In the code, you open the watermark PDF and grab just the first page, since that is where the watermark should be. Next, you create a PDF reader object from input_pdf and a generic pdf_writer object for writing out the watermarked PDF.
The next step is to iterate over the pages in input_pdf. This is where the magic happens. You call .mergePage() and pass it watermark_page. When you do that, the watermark page is overlaid on top of the current page. Then you add the newly merged page to your pdf_writer object.
Finally, you write the newly watermarked PDF to disk.
The last item covered will be how PyPDF2 handles encryption.
How to Protect a PDF File
Currently, PyPDF2 only allows adding user and owner passwords to an existing PDF. In the PDF universe, an owner password grants you administrator access over the PDF and enables you to define permissions for the document. On the other hand, the user password just permits document access.
As far as I can tell, PyPDF2 does not let you set any permissions on the document, even though it does let you set the owner password.
Nonetheless, here’s how to add a password, which will encrypt the PDF by default:
# pdf_encrypt.py

from PyPDF2 import PdfFileWriter, PdfFileReader

def add_encryption(input_pdf, output_pdf, password):
    pdf_writer = PdfFileWriter()
    pdf_reader = PdfFileReader(input_pdf)

    for page in range(pdf_reader.getNumPages()):
        pdf_writer.addPage(pdf_reader.getPage(page))

    pdf_writer.encrypt(user_pwd=password, owner_pwd=None,
                       use_128bit=True)

    with open(output_pdf, 'wb') as fh:
        pdf_writer.write(fh)

if __name__ == '__main__':
    add_encryption(input_pdf='reportlab-sample.pdf',
                   output_pdf='reportlab-encrypted.pdf',
                   password='twofish')
add_encryption() takes the input and output PDF paths as well as the password that you want to add to the PDF. As before, it then creates a PDF writer and a PDF reader object. Since you want to encrypt the entire input PDF, you need to loop over all of its pages and add them to the writer.
The final step is to call .encrypt(), which takes the user password, the owner password, and a flag that controls whether 128-bit encryption is used. The default is for 128-bit encryption to be turned on. If you set it to False, then 40-bit encryption is applied instead.
Note: According to pdflib.com, PDF encryption employs either RC4 or AES (Advanced Encryption Standard) to encrypt the PDF.
Even though you have encrypted your PDF, it is not necessarily secure. There are tools that remove passwords from PDFs. If you would like to learn more, Carnegie Mellon University has an interesting paper on the topic.
Conclusion
The PyPDF2 package is quite useful and is usually pretty fast. You can use PyPDF2 to automate large jobs and take advantage of its capabilities to get your work done more efficiently!
In this tutorial, you learned how to do the following:
- Extract metadata from a PDF
- Rotate pages
- Merge and split PDFs
- Add watermarks
- Add encryption
Keep a watch on the more recent PyPDF4 package, since it will likely replace PyPDF2 in the near future. You may also like to investigate pdfrw, which can do many of the same functions as PyPDF2.
Further Reading
If you want to learn more about working with PDFs in Python, you can consult the following resources.