Python Tutorial

As in today's world, we all are familiar with PDF files because they are one of the most widely used digital formats of documents. The full form of pdf is "Portable Document Format," which uses the ".pdf" extension to save the document files. This is independent of software-hardware or operating systems, and it can be used for presenting or exchanging documents reliably.

PDF was invented by Adobe, and this is now an open standard maintained by the international organization for standardization. The PDF file can also contain links or buttons form fields, audio-video, or other business logic for better interaction with the users or the viewers.

In this tutorial, we will discuss how we can perform various operations:

How to extract text from PDF
How to rotate pages of the PDF
How to merge two PDF together
How to split a PDF file
How to add watermarks to the PDF pages

We can perform all these operations by using a simple Python script.

Installation

For interacting with PDF files, we will be using a 3rd party module, that is, PyPDF2. The PyPDF2 is an inbuilt library of Python, which is used as a PDF toolkit. This module is capable of:

It can extract the information of the documents such as title, author name, and many more.
It can split the pages of the document file.
It can crop the pages of the PDF document file.
It can you merge the multiple pages into a single page inside the PDF document file.
It can encrypt and decrypt PDF files.

For installing PyPDF2 we can use the following command from the command line:

The name of this module is case-sensitive, so we have to make sure that the "y" is in lowercase and everything in the name of the module is in uppercase.

Operations on PDF File using PyPDF2 Module

In this section, we will discuss various operations that we can perform on PDF files by using the PyPDF2 module in Python.

1. How to Extract Text from PDF Document File.

We can extract the text from the PDF file by using the PyPDF2 module in Python by using the following approach.

Approach:

For extracting the text from the PDF file using Python, we will follow the following steps:

Step 1: We will open the PDF file named 'exp.pdf' in binary mode and save the file object as "pdf_File_Object".

Step 2: We will create an object "pdf_Reader" for the "PDFFileReader" class of the "PyPDF2" module, and then we will pass the PDF file object and get the object for reading the PDF.

Step 3: For getting the number of pages in the PDF document file, we will use the numPages

Step 4: We will create an object "page_Object" for PageObject class of the "PyPDF2" The PDF reader object has the function "getPage()" which takes the page number as an argument and returns the object of the page.

Step 5: We will use extract text which is a function of page object for extracting text from the PDF page.

Step 6: At last, we will close the PDF document file object.

Code:

import PyPDF2 as PDF
 
# Here we will create a pdf file object
pdf_File_Object = open('exp.pdf', 'rb')
 
# Here, we will creat a pdf reader object
pdf_Reader = PDF.PdfFileReader(pdf_File_Object)
 
# Now we will print number of pages in pdf file
print("No. of pages in the given PDF file: ", pdf_Reader.numPages)
 
# Here, create a page object
page_Object = pdf_Reader.getPage(0)
 
# Now, we will extract text from page
print(page_Object.extractText())
 
# At last, close the pdf file object
pdf_File_Object.close()

Output:

No. of pages in the given PDF file:  10
 
GUIDELINES
*
 
 
FOR 
 
RE
-
OPENING OF CAMPUS 
 
IN VIEW OF COVID
-
19 PANDEMIC
 
(FOR 
STUDENTS
)
 
2021
-
22

This has printed the text of the first page of the PDF file in output.

2. How to Rotate PDF File Pages

We can rotate the pages of PDF file using PyPDF2 module in Python.

Approach:

For rotating the pages of the given pdf file, we will be using the following steps:

Step 1: We will create a PDF reader object for the original PDF.

Step 2: We will write the rotated pages to the new PDF file. For writing Into the PDF file, we will use the object of the pdfFileWriter class of the PyPDF2

Step 3: We will iterate each page of the original PDF document file. We will get page object getPage() function of the PDF reader class. then we will rotate the page by using the rotateClockwise() function of the page object class.

for page in range(pdf_Reader.numPages):
page_Object = pdf_Reader.getPage(page)
page_Object.rotateClockwise(rotation_1)
pdf_Writer.addPage(page_Object)

Step 4: We will add pages PDF writer object using the addPage() function of the PDF writer class by passing the rotated page object.

Step 5: Then, we will write the PDF pages to the newly created PDF file. We can do this by opening the new file object and writing PDF pages by using the write() function off the PDF writer object.

new_File = open(new_File_Name, 'wb')
pdf_Writer.write(new_File)

Step 6: We will close the original PDF file object end the newly created new file object.

pdf_File_Object.close()
new_File.close()

Code:

# Frst, we will import the modules
import PyPDF2 as PDF
 
def PDF_rotate(original_File_Name, new_File_Name, rotation_1):
 
    # Then, we will create a pdf File object of original pdf
    pdf_File_Object = open(original_File_Name, 'rb')
    
    # Then, we will create a pdf Reader object
    pdf_Reader = PDF.PdfFileReader(pdf_File_Object)
 
    # Then we will create a pdf writer object for new pdf
    pdf_Writer = PDF.PdfFileWriter()
     
    # Now, we will rotate each page of the PDF document
    for page in range(pdf_Reader.numPages):
 
        # Then, we will create rotated page object
        page_Object = pdf_Reader.getPage(page)
        page_Object.rotateClockwise(rotation_1)
 
        # We will add the rotated page object to pdf writer
        pdf_Writer.addPage(page_Object)
 
    # Now we will open a new pdf file object
    new_File = open(new_File_Name, 'wb')
     
    # We will write the rotated pages to new file
    pdf_Writer.write(new_File)
 
    # At last, we will close the original pdf file object
    pdf_File_Object.close()
     
    # And now, we will close the new pdf file object
    new_File.close()
     
 
def main():
 
    # original pdf file name
    original_File_Name = 'exp.pdf'
    
    # new pdf file name
    new_File_Name = 'rotated_exp.pdf'
     
    # rotation angle
    rotation_1 = 270
   
    # calling the PDF_rotate function
    PDF_rotate(original_File_Name, new_File_Name, rotation_1)
     
if __name__ == "__main__":
    # calling the main function
    main()

Output:

Original File:

Rotated File:

3. How to Merge two PDF Files.

We can merge two PDF files by using the PyPDF2 module in Python.

Approach:

For merging two PDF files in Python, we will be using the following steps:

Step 1: For merging two PDf files, we will be using a pre-built class, pdfFileMerger of the PyPDF2

We will create an object called pdf_Merger of PDF merger class:
pdf_Merger = PDF.PdfFileMerger()

Step 2: Then, we will append the file object of each PDF to the PDF merger object using the append()

for pdf in pdf:
pdf_Merger.append(pdf)

Step 3: At last, we will write the pdf pages to the output pdf file by using the write method of the PDF merger object.

with open(output_1, 'wb') as K:
pdf_Merger.write(K)

Code:

# First, we will import the modules
import PyPDF2 as PDF
 
 
def PDF_merge(pdf, output_1):
    # Here, we will create pdf file merger object
    pdf_Merger = PDF.PdfFileMerger()
 
    # now, we will append pdfs one by one
    for pdf in pdf:
        pdf_Merger.append(pdf)
 
    # then, we will write combined pdf to output pdf file
    with open(output_1, 'wb') as K:
        pdf_Merger.write(K)
 
 
def main():
    # here, we will select the pdf files to merge
    pdf = ['exp.pdf', 'rotated_exp.pdf']
 
    # Here, we will create output pdf file name
    output_1 = 'combined_exp.pdf'
 
    # Now, we will call pdf merge function
    PDF_merge(pdf = pdf, output_1 = output_1)
 
 
if __name__ == "__main__":
    # At last we will call the main function
    main()

Output:

The output of this code will be in the form of a combined PDF named combined_exp.pdf, which is obtained by merging exp.pdf and rotate_exp.pdf file.

4. How to Split PDF File

We can split the PDF document file in Python using the PyPDF2 module according to our requirements.

In this code, we will not use a new function or class, and we will be using simple logic and iterations. The splits of the pdf will be created according to the list of splits_1 we would be passing.

Code:

# First, we will import the modules
import PyPDF2 as PDF
 
def PDF_split(pdf_1, splits_1):
    # here, we will create an input pdf file object
    pdf_File_Object = open(pdf_1, 'rb')
     
    # here, we will create pdf reader object
    pdf_Reader = PDF.PdfFileReader(pdf_File_Object)
     
    # Now we will start indexing of first slice
    start = 0
     
    # then we will start indexing of last slice
    end = splits_1[0]
     
     
    for g in range(len(splits_1) + 1):
        # we will create pdf writer object for (g + 1)th split
        pdf_Writer = PDF.PdfFileWriter()
         
        # output pdf file name
        output_pdf = pdf_1.split('.pdf')[0] + str(g) + '.pdf'
         
        # Now, we will add pages to pdf writer object
        for page_1 in range(start, end):
            pdf_Writer.addPage(pdf_Reader.getPage(page_1))
         
        # Here, we will write split pdf pages to pdf file
        with open(output_pdf, "wb") as K:
            pdf_Writer.write(K)
 
        # Now, we will interchange page split start position for next split
        start = end
        try:
            # then, we will set split end position for next split
            end = splits_1[g + 1]
        except IndexError:
            # then, we will set split end position for last split
            end = pdf_Reader.numPages
         
    # Now, we will close the input pdf file object
    pdf_File_Object.close()
             
def main():
    # pdf file to split
    pdf_1 = 'exp.pdf'
     
    # split page positions
    splits_1 = [2,4]
     
    # we will call PDF_split function to split pdf
    PDF_split(pdf_1, splits_1)
 
if __name__ == "__main__":
    # at last, we will call the main function
    main()

Output:

The output of this code will generate 3 new pdf files, which are the split files of the main pdf. We can check in the PDF folder. It contains 3 new pdf files.

5. How to Add Watermark to PDF Pages.

We can add watermark to the pages of PDF document files using the PyPDF2 module in Python.

Approach:

In this, we will follow every step same as the page rotation example, the only difference is:

The page object will be converted into the watermark page object by using the add_watermark() function.

For understanding what the add_watermark() function do, we can see the following example:

wm_File_Object = open(wm_File, 'rb')
pdf_Reader = PDF.PdfFileReader(wm_File_Object) 
page_Object.mergePage(pdf_Reader.getPage(0))
wm_File_Object.close()
return page_Object

In this, first, we created a pdf reader object of the water_mark.pdf file. For the passed page object, we have used the mergepage() function, which has passed the page object of the first page of the water_mark pdf reader object. This will cause an overlay of water_mark pdf over the passed page object.

Code:

# First, we will import the modules
import PyPDF2 as PDF
 
def add_watermark_1(wm_File, page_Object):
    # here, we will open watermark pdf file
    wm_File_Object = open(wm_File, 'rb')
    
    # Now, we will create pdf reader object of watermark pdf file
    pdf_Reader = PDF.PdfFileReader(wm_File_Object)
     
    # then, we will merge watermark pdf's first page with passed page object.
    page_Object.mergePage(pdf_Reader.getPage(3))
     
    # Here, we will close the watermark pdf file object
    wm_File_Object.close()
     
    # we will return watermarked page object
    return page_Object
 
def main():
    # Now, we will create watermark pdf file name
    user_watermark = 'water_mark.pdf'
    
    # original pdf file name
    original_File_Name = 'exp.pdf'
     
    # new pdf file name
    new_File_Name = 'watermarked_exp.pdf'
     
    # now, we will create pdf File object of original pdf
    pdf_File_Object = open(original_File_Name, 'rb')
     
    # here, we will create a pdf Reader object
    pdf_Reader = PDF.PdfFileReader(pdf_File_Object)
 
    # create a pdf writer object for new pdf
    pdf_Writer = PDF.PdfFileWriter()
     
    # add watermark to each page
    for page_1 in range(pdf_Reader.numPages):
        # Now, we will create watermarked page object
        wm_page_Object = add_watermark(user_watermark, pdf_Reader.getPage(page_1))
         
        # then, we will add watermarked page object to pdf writer
        pdf_Writer.addPage(wm_page_Object)
 
    # new pdf file object
    new_File = open(new_File_Name, 'wb')
     
    # we will then write watermarked pages to new file
    pdf_Writer.write(new_File)
 
    # close the original pdf file object
    pdf_File_Object.close()
    # close the new pdf file object
    new_File.close()
 
if __name__ == "__main__":
    # call the main function
    main()

Output:

water_mark.pdf: