PDFBox Extracting Phone Numbers
The PDFBox library has a variety of features. It has an ability to quickly and accurately extract phone contacts from an existing PDF document. In this section, we will learn how to read phone numbers from an existing document in the PDFBox library by using a Java Program. The PDF document may also contain text, animation, and imagesetc. as its contents.
Follow the steps below to extract Phone Numbers from the existing PDF document-
Load PDF Document
We can load the existing PDF document by using the static load() method. This method accepts a file object as a parameter. We can also invoke it using the class name PDDocument of the PDFBox.
Instantiate StringBuilder and PDFTextStripper class
StringBuilder and PDFTextStripper class are used to retrieve text from a PDF document. We can instantiate these classess as following:
Set Patterns for Phone Number
The Pattern refers to the format of Phone Number we are looking for. In our example, we are looking for numbers with 10 digits and atleast one surrounding whitespaces on both ends with Phone Numbers. Patterns can be set from the following:
Retrieve Phone number
We can retrieve the Phone Numbers by using Matcher which refers the actual text where the pattern will be found. If the Phone Number will be found, print the Phone Numbers using group() method which refers to the next number that follows the pattern we have specified.
After completing the task, we need to close the PDDocument class object by using the close() method.
This is a PDF document which contains Text and Phone Numbers both. From this PDF, we want to extract only Phone Numbers. Here, we assume that the Phone numbers are 10 digits long. We can do this by using PDFBox library of Java Program.
After successful execution of the above program, we can see the following output.