PDFBox Reading Text

One of the main features of the PDFBox library is its ability to quickly and accurately extract text from an existing PDF document. In this section, we will learn how to read text from an existing document with PDFBox in a Java program. A PDF document may contain images and other content in addition to text; text extraction returns only the textual content. We can extract the text from an existing PDF document by using the getText() method of the PDFTextStripper class.

Follow the steps below to read text from an existing PDF document:

Load PDF Document

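We can load an existing PDF document by using the static load() method of the PDDocument class. This method accepts a File object as a parameter. Because the method is static, we invoke it on the class name, PDDocument, rather than on an instance. (This is the PDFBox 2.x API; in PDFBox 3.x, documents are loaded with Loader.loadPDF() instead.)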

File file = new File("Path of Document"); 
PDDocument doc = PDDocument.load(file); 

Instantiate PDFTextStripper class

The PDFTextStripper class is used to retrieve text from a PDF document. We can instantiate this class with its no-argument constructor, new PDFTextStripper().
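The instantiation step can be sketched as follows. This is a minimal sketch, assuming the PDFBox 2.x API used in this section; the class name StripperSetup and the helper newStripper() are hypothetical names chosen for illustration, and the page range and position sorting settings are optional extras, not required for basic extraction.

```java
import java.io.IOException;

import org.apache.pdfbox.text.PDFTextStripper;

class StripperSetup {

    // Hypothetical helper: creates a stripper restricted to a page range.
    static PDFTextStripper newStripper(int startPage, int endPage) throws IOException {
        // The no-argument constructor is all that basic extraction needs.
        PDFTextStripper stripper = new PDFTextStripper();

        // Optional: order text by its position on the page
        // instead of the order in which it appears in the content stream.
        stripper.setSortByPosition(true);

        // Optional: limit extraction to a range of pages (1-based, inclusive).
        stripper.setStartPage(startPage);
        stripper.setEndPage(endPage);
        return stripper;
    }
}
```

If none of the optional settings are needed, a plain new PDFTextStripper() extracts every page of the document.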

Retrieve Text

The getText() method reads the text content of the PDF document. We pass the PDDocument object as a parameter, and the method returns the text as a String, for example String text = pdfStripper.getText(doc).
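Because getText() returns one String for whatever page range the stripper is set to, it can also be called once per page. The following is a sketch under that assumption, using the PDFBox 2.x API; the class PageTextReader and the method textByPage() are hypothetical names, not part of PDFBox.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

class PageTextReader {

    // Hypothetical helper: returns one string per page of an already loaded document.
    // The caller remains responsible for closing the document.
    static List<String> textByPage(PDDocument doc) throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        List<String> pages = new ArrayList<>();

        // Page numbers used by the stripper are 1-based.
        for (int page = 1; page <= doc.getNumberOfPages(); page++) {
            stripper.setStartPage(page);
            stripper.setEndPage(page);
            pages.add(stripper.getText(doc));
        }
        return pages;
    }
}
```

Keeping the text of each page separate can be useful when only part of a long document needs to be searched or printed.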

Close Document

After completing the task, we need to close the PDDocument object by using its close() method, for example doc.close().
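An explicit close() call is skipped if an exception is thrown earlier in the program. One common alternative is a try-with-resources statement, which works because PDDocument implements Closeable. The sketch below assumes the PDFBox 2.x API; the class SafeExtract and the method readText() are hypothetical names used only for illustration.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

class SafeExtract {

    // Hypothetical helper: loads the file, extracts its text, and always closes the document.
    static String readText(File file) throws IOException {
        // try-with-resources calls doc.close() automatically,
        // even if getText() throws an exception.
        try (PDDocument doc = PDDocument.load(file)) {
            return new PDFTextStripper().getText(doc);
        }
    }
}
```

A caller could then write String text = SafeExtract.readText(new File("Path of Document")); without handling the close step itself.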

Example:

In this example, we extract the text content of an existing PDF document by using the PDFBox library in a Java program.

Java Program:

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractText {

    public static void main(String[] args) throws IOException {

        // Loading an existing document
        File file = new File("/eclipse-workspace/blank.pdf");
        PDDocument doc = PDDocument.load(file);

        // Instantiating the PDFTextStripper class
        PDFTextStripper pdfStripper = new PDFTextStripper();

        // Retrieving text from the PDF document
        String text = pdfStripper.getText(doc);
        System.out.println("Text in PDF\n---------------------------------");
        System.out.println(text);

        // Closing the document
        doc.close();
    }
}

Output:

After successful execution, the above program prints the "Text in PDF" header line followed by the text retrieved from the PDF document.