PDFBox Extracting Phone Numbers

The PDFBox library has a variety of features. One of them is extracting the text of an existing PDF document, which we can then search for phone numbers. In this section, we will learn how to read phone numbers from an existing document with the PDFBox library in a Java program. The PDF document may also contain text, animation, images, etc. as its contents.

Follow the steps below to extract phone numbers from an existing PDF document:

Load PDF Document

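We can load the existing PDF document by using the static load() method of the PDDocument class. This method accepts a File object as a parameter. Because the method is static, we invoke it on the class name PDDocument rather than on an instance. This applies to PDFBox 2.x; in PDFBox 3.x, loading moved to Loader.loadPDF().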

File file = new File("Path of Document"); 
PDDocument doc = PDDocument.load(file); 

Instantiate StringBuilder and PDFTextStripper class

The StringBuilder and PDFTextStripper classes are used to retrieve text from a PDF document: PDFTextStripper extracts the text, and StringBuilder holds it. We can instantiate these classes as follows:

StringBuilder sb = new StringBuilder();			
PDFTextStripper stripper = new PDFTextStripper();

Set Patterns for Phone Number

The pattern describes the format of the phone number we are looking for. In our example, we are looking for numbers with exactly 10 digits and a whitespace character on each side. The pattern can be set as follows:

Pattern p = Pattern.compile("\\s\\d\\d\\d\\d\\d\\d\\d\\d\\d\\d\\s");
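As a minimal sketch of how such a pattern behaves, the standalone program below applies an equivalent pattern, written with the {10} quantifier, to a sample string. The class name PhonePatternSketch, the helper findAll, and the sample text are illustrative assumptions and are not part of the PDFBox example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhonePatternSketch {

    // Equivalent to the tutorial's pattern, written with a quantifier:
    // one whitespace character, exactly ten digits, one whitespace character.
    static final Pattern PHONE = Pattern.compile("\\s\\d{10}\\s");

    // Returns every match, including the surrounding whitespace characters.
    static List<String> findAll(CharSequence text) {
        List<String> matches = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical sample text standing in for text extracted from a PDF.
        String sample = "Call 9876543210 or fax 12345 today.";
        for (String match : findAll(sample)) {
            // Brackets make the matched whitespace visible.
            System.out.println("[" + match + "]");
        }
    }
}
```

The five-digit number is skipped because the pattern requires exactly ten digits between the two whitespace characters.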

Retrieve Phone number

We can retrieve the phone numbers by using a Matcher, which scans the actual text for the pattern. Each successful call to find() locates the next match, and the group() method returns the text of that match, which we print.

Matcher m = p.matcher(sb);
while (m.find()) {
    System.out.println(m.group());
}
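Because the whitespace on each side is part of the match, a number at the very start or end of the text is missed, and of two numbers separated by a single space only one can match. The sketch below shows one possible alternative, assuming a phone number is any run of exactly ten digits that is not part of a longer digit run. The class name PhoneMatcherSketch, the helper extract, and the sample text are hypothetical, and this pattern is looser than the one used in this section because it also accepts numbers next to letters or punctuation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhoneMatcherSketch {

    // Assumption: a phone number is exactly ten digits with no digit directly
    // before or after it. The lookarounds check the neighbouring characters
    // without consuming them, so they are not part of the match.
    static final Pattern PHONE = Pattern.compile("(?<!\\d)\\d{10}(?!\\d)");

    // Collects the text of every match found by successive find() calls.
    static List<String> extract(CharSequence text) {
        List<String> numbers = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            // group() returns the text matched by the most recent find().
            numbers.add(m.group());
        }
        return numbers;
    }

    public static void main(String[] args) {
        // Hypothetical sample text standing in for text extracted from a PDF.
        String sample = "9876543210 9123456780\nOrder id 12345678901";
        System.out.println(extract(sample)); // prints [9876543210, 9123456780]
    }
}
```

The 11-digit order id is rejected by the lookarounds, while both 10-digit numbers are returned without surrounding whitespace.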

Close Document

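After completing the task, we need to close the PDDocument object by using the close() method. Because PDDocument implements Closeable, we can also open it in a try-with-resources statement, which calls close() automatically.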

Example:

Suppose we have a PDF document that contains both text and phone numbers, and we want to extract only the phone numbers. Here, we assume that the phone numbers are 10 digits long. We can do this with the PDFBox library in a Java program.

Java Program

import java.io.File;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractPhone {

    public static void main(String[] args) throws IOException {

        // PDF file from which the phone numbers are extracted
        File fileName = new File("/eclipse-workspace/phone.pdf");

        // try-with-resources calls doc.close() automatically, even if extraction fails
        try (PDDocument doc = PDDocument.load(fileName)) {

            // StringBuilder to store the extracted text
            StringBuilder sb = new StringBuilder();
            PDFTextStripper stripper = new PDFTextStripper();

            // Add text to the StringBuilder from the PDF
            sb.append(stripper.getText(doc));

            // Regex: the pattern describes the format we are looking for. In our example,
            // we are looking for 10 digits with a whitespace character on each side.
            Pattern p = Pattern.compile("\\s\\d\\d\\d\\d\\d\\d\\d\\d\\d\\d\\s");

            // Matcher scans the extracted text for the pattern
            Matcher m = p.matcher(sb);
            while (m.find()) {
                // group() returns the text matched by the most recent find(),
                // including the surrounding whitespace characters
                System.out.println(m.group());
            }
        }

        System.out.println("\nPhone Number is extracted");
    }
}

Output:

After successful execution of the above program, we can see the following output.