Python Libraries for PDF Extraction

In this tutorial, we will learn about the Python libraries used to extract data from PDF files for further analysis. We will go through the essential libraries one by one.

PDF (Portable Document Format) is widely used to store and share documents in a fixed layout. PDF resumes are created in various ways. For example, some job seekers write a resume in Word and save it as a PDF, while others build it with an online CV template. Our task is to parse PDF resumes and extract all of the text without losing information.

Below are the essential Python libraries used to extract text from PDF files.

PyPDF2
Tika
Textract
PyMuPDF
PDFtotext
PDFMiner
Tabula

We will introduce these libraries along with example Python code.

PyPDF2

PyPDF2 is a complete Python package that can be used to perform many types of PDF operations. We can use this module to perform the following tasks.

We can extract information from a PDF in Python.
We can rotate pages.
We can merge two or more PDFs.
We can split the PDFs.
We can add watermarks.
We can encrypt a PDF.

To use this module, we need to install it on our local machine using pip, for example with pip install PyPDF2.

Now let's understand the following code to extract data from the PDF.

Example -

import PyPDF2

path = r"C:\Users\DEVANSH SHARMA\Desktop\Devansh Resume.pdf"

# open the PDF in binary mode; the with block closes the file for us
with open(path, 'rb') as pdf_file:
    # create a PDF reader object
    # (PdfReader replaces PdfFileReader, which was removed in PyPDF2 3.0)
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    # print the number of pages in the PDF file
    print(len(pdf_reader.pages))
    # extract the text page by page
    pypdf2_text = ''
    for page in pdf_reader.pages:
        pypdf2_text += page.extract_text()

In the above code, we printed the number of pages in the PDF and then collected the text of every page into the pypdf2_text string.
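
Splitting, one of the tasks listed earlier, can be sketched in a similar way. The following is a minimal sketch, assuming PyPDF2 2.x or later (the PdfReader and PdfWriter classes); the function names chunk_ranges and split_pdf and the output file naming scheme are made up for illustration.

```python
def chunk_ranges(num_pages, pages_per_file):
    """Return (start, stop) page-index pairs that cover num_pages in chunks."""
    if pages_per_file < 1:
        raise ValueError("pages_per_file must be at least 1")
    return [
        (start, min(start + pages_per_file, num_pages))
        for start in range(0, num_pages, pages_per_file)
    ]


def split_pdf(src_path, out_prefix, pages_per_file=1):
    """Write src_path out as several smaller PDFs and return the new file names."""
    # Imported here so chunk_ranges stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    ranges = chunk_ranges(len(reader.pages), pages_per_file)
    written = []
    for part, (start, stop) in enumerate(ranges, start=1):
        writer = PdfWriter()
        # copy the pages of this chunk into a fresh writer
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        out_path = f"{out_prefix}_{part}.pdf"  # hypothetical naming scheme
        with open(out_path, "wb") as out_file:
            writer.write(out_file)
        written.append(out_path)
    return written
```

Calling split_pdf(path, "resume_part", pages_per_file=2) would then write files such as resume_part_1.pdf, each holding at most two pages.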

Disadvantages of Using PyPDF2

Following are the disadvantages of the PyPDF2 package.

This library can extract text, but it cannot preserve the structure the text had in the original PDF.
It doesn't preserve the table structure.
It also includes unnecessary spaces and newlines in the extracted text.
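
The extra spaces and newlines can be cleaned up after extraction. Below is a minimal sketch that uses only the standard re module; the function name clean_extracted_text and the sample string are made up for illustration, and the exact rules (collapse runs of spaces, keep at most one blank line) are an assumption about what "clean" should mean for a resume.

```python
import re


def clean_extracted_text(text):
    """Normalize the whitespace of text extracted from a PDF."""
    # collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # strip leading and trailing spaces on every line
    text = "\n".join(line.strip() for line in text.splitlines())
    # squeeze three or more newlines down to a single blank line
    return re.sub(r"\n{3,}", "\n\n", text).strip()


# made-up sample that imitates noisy extractor output
noisy = "Name :   John \n\n\n\n Skills:\tPython "
print(clean_extracted_text(noisy))
```

The same function could be applied to the pypdf2_text string from the example above before any further analysis.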

Textract

Several packages exist for extracting content from individual file formats. The Textract library is slightly different from the others; it provides a single interface for extracting content from any supported file type without any irrelevant markup.

Textract is also used to extract information from PDF files and other formats, including CSV, doc, eml, epub, JSON, jpg, mp3, msg, xls, etc.

The most important thing to remember is that it returns the extracted text as bytes. To convert the bytes into a string, we need to decode them, for example with Python's codecs module.

Let's look at the following code, which extracts text from a PDF using Textract.
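
The bytes-to-string step can be tried on its own, without any PDF. The sketch below uses a hand-made bytes value as a stand-in for the extractor output and assumes the text is UTF-8 encoded; the variable names are illustrative.

```python
import codecs

# stand-in for the bytes an extractor would return (assumed to be UTF-8)
raw = "Résumé: Python developer".encode("utf-8")

# two equivalent ways to turn the bytes into a string
text_a = codecs.decode(raw, "utf-8")
text_b = raw.decode("utf-8")

# errors="replace" swaps undecodable bytes for U+FFFD instead of raising
lenient = b"caf\xe9".decode("utf-8", errors="replace")

print(text_a)
print(repr(lenient))
```

Passing errors="replace" can be useful when the encoding of the source document is uncertain, at the cost of losing the characters that could not be decoded.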

Example -

import textract
import codecs
path = r"C:\Users\DEVANSH SHARMA\Desktop\Devansh Resume.pdf"

#extract text in byte format
textract_text = textract.process(path)
#convert bytes to string
textract_str_text = codecs.decode(textract_text)

This package can extract the information without data loss. It maintains the structure of the original document; however, the table structure is not preserved.

This is a recommended library for text extraction, not only for PDFs but also for other types of files.

PyMuPDF

PyMuPDF is a Python binding for MuPDF, a lightweight PDF viewer. It is not written entirely in Python, and it is known for its high performance and rendering quality.

With PyMuPDF, we can access files with extensions like *.pdf, *.xps, *.oxps, *.epub, *.cbz or *.fb2 from our Python scripts. Several popular image formats are supported as well, including multipage TIFF images.

We can extract text from multipage documents using PyMuPDF. It also allows us to get the text of a particular page by providing the page number. Following is the code to extract text from the PDF using PyMuPDF.

Example -

import fitz
path = r"C:\Users\DEVANSH SHARMA\Desktop\Devansh Resume.pdf"

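# the with block closes the document when we are done
with fitz.open(path) as doc:
    pymupdf_text = ""
    # iterate over the pages and collect their text
    for page in doc:
        pymupdf_text += page.get_text()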
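
To read a single page by its number rather than the whole document, something like the following sketch could be used. It assumes a recent PyMuPDF release in which Document.load_page, Document.page_count, and Page.get_text are available; page_index and extract_page_text are illustrative names, and treating page numbers as 1-based is a choice made here, not something PyMuPDF requires.

```python
def page_index(page_number, page_count):
    """Convert a 1-based page number into the 0-based index PyMuPDF uses."""
    if not 1 <= page_number <= page_count:
        raise ValueError(f"page {page_number} is outside 1..{page_count}")
    return page_number - 1


def extract_page_text(path, page_number):
    """Return the text of one page of the document at path."""
    # Imported here so page_index stays usable without PyMuPDF installed.
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        page = doc.load_page(page_index(page_number, doc.page_count))
        return page.get_text()
```

For example, extract_page_text(path, 1) would return the text of the first page only.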

This library removes unnecessary spaces from the text, so this part of the text-cleaning step of pre-processing is done automatically by the package.

PyMuPDF is capable of maintaining the structure of the document. However, it does not preserve tables in their original format, so it is not recommended for extracting tabular data. We will have to use some other packages to preserve the information in tables. This library gives good results with the textual data of a PDF.

PDFtotext

PDFtotext is another Python package used to extract text from PDF files. It can only read PDF files; other formats are not supported. The data is extracted in the form of an object, and the structure of the PDF is preserved.

Following is the code to fetch data from the PDF.

Example -

import pdftotext

path = r"C:\Users\DEVANSH SHARMA\Desktop\User.pdf"

# Load your PDF
with open(path, "rb") as f:
    pdf = pdftotext.PDF(f)
# Read all the text into one string
pdftotext_text = "\n\n".join(pdf)

The main advantage of using this library is that it can preserve the table structure of the PDF along with its text. If you want to extract table data, this library is more appropriate than previous libraries.
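
When the layout is preserved, a table row typically arrives as one line of text with its columns separated by runs of spaces. The sketch below turns such lines into lists of cells using only the standard re module. It assumes columns are separated by at least two spaces and that no cell is empty; the function name split_table_row and the sample rows are made up for illustration.

```python
import re


def split_table_row(line):
    """Split one layout-preserved text line into its table cells."""
    # assumption: two or more whitespace characters mark a column boundary
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


# made-up sample imitating two rows of layout-preserved output
sample = (
    "Name          Role              Experience\n"
    "Asha Rao      Data Analyst      4 years"
)
rows = [split_table_row(line) for line in sample.splitlines()]
print(rows)
```

Single spaces inside a cell, such as the one in "Data Analyst", are kept because only longer runs of whitespace are treated as separators.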

PDFMiner

PDFMiner is a Python-based package that extracts text only from PDF files. It can also convert PDF files into other formats such as HTML or XML. There are various versions of PDFMiner, and the latest version is compatible with Python 3.6 and above.

This library produces its output through a series of API calls, which is why it is slightly slower than the other Python-based packages.

Let's understand the following example -

Example -

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

path = r"C:\Users\DEVANSH SHARMA\Desktop\User.pdf"

def pdf_to_text(path):
    # shared resource store (fonts, images) used while interpreting pages
    res_manager = PDFResourceManager()
    # in-memory buffer that collects the extracted text
    ret_str = StringIO()
    # converter device that writes the text of each page into the buffer
    device = TextConverter(res_manager, ret_str, codec='utf-8', laparams=LAParams())
    interpreter = PDFPageInterpreter(res_manager, device)
    with open(path, 'rb') as fp:
        # an empty pagenos set and maxpages=0 mean "process every page"
        for page in PDFPage.get_pages(fp, set(), maxpages=0, password="",
                                      caching=True, check_extractable=True):
            interpreter.process_page(page)
    text = ret_str.getvalue()
    device.close()
    ret_str.close()
    return text

pdf_miner_text = pdf_to_text(path)

Tabula

Tabula is a Java-based tool mainly used to read table data from a PDF. The tabula package for Python (tabula-py) is a simple wrapper for tabula-java; it extracts the tables and stores them in pandas DataFrames. We can convert a DataFrame into CSV, TSV, Excel, or JSON file format.

In the following code, we extract the tables from a PDF file into DataFrames using the Tabula package.

Example -

import tabula
path = r"C:\Users\DEVANSH SHARMA\Desktop\User.pdf"

# read_pdf returns a list of DataFrames, one per table found
dfs = tabula.read_pdf(path, pages='all')

This library is most useful for extracting table information. Using Tabula together with one of the other packages mentioned above can be useful for extracting the full content of a PDF.
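
The conversion of the extracted tables to CSV or JSON files can be sketched as follows. This is a minimal sketch, assuming that tabula.read_pdf returns a list of pandas DataFrames and that a Java runtime is installed; save_tables, export_pdf_tables, and the output file naming scheme are made up for illustration.

```python
def save_tables(tables, out_prefix):
    """Write each DataFrame to a CSV and a JSON file; return the CSV names."""
    csv_names = []
    for number, table in enumerate(tables, start=1):
        csv_name = f"{out_prefix}_{number}.csv"  # hypothetical naming scheme
        # index=False leaves the DataFrame row index out of the CSV
        table.to_csv(csv_name, index=False)
        # orient="records" writes one JSON object per table row
        table.to_json(f"{out_prefix}_{number}.json", orient="records")
        csv_names.append(csv_name)
    return csv_names


def export_pdf_tables(path, out_prefix):
    """Extract every table from the PDF at path and save it to disk."""
    # Imported here because tabula-py needs a Java runtime to be installed.
    import tabula

    return save_tables(tabula.read_pdf(path, pages="all"), out_prefix)
```

Keeping the saving step in its own function means it can also be reused for DataFrames that come from somewhere other than Tabula.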

Conclusion

This tutorial covered some important Python libraries for extracting text from PDFs. Each library has its own strengths; some are suitable for extracting text, and some are better at extracting data from tables. We can choose according to our requirements. We have also included code examples. Here is a summary of the libraries discussed -

PyPDF2 - It is a less recommended library because it doesn't preserve the format.
Tika - To use Tika, we need to install Java and be familiar with Java installations, which adds an otherwise unnecessary Java dependency. It is good at extracting content and critical metadata.
Textract - It returns a bytes object, and we need to convert it into a string.
PyMuPDF - It extracts text from PDF files, removes unnecessary spaces from the text, and preserves the document's original structure.
PDFMiner - It preserves the structure of the PDF file's text but not the table structure.
PDFtotext - It is the most recommended library, as it preserves both the table structure and the original structure.