Javatpoint Logo

PDFBox Introduction

What does PDF mean?

PDF stands for Portable Document Format. It is a file format that presents a document in digital form exactly as it would appear in print, independent of the software, hardware, or operating system used to create, view, or print it.

It was developed and specified by Adobe® Systems as a universally compatible file format based on the PostScript format.

The main goal of the PDF format is to let users exchange and view electronic documents easily and reliably. Each PDF file has a fixed layout that can combine text, fonts, graphics, audio, video, animation, and hyperlinks, and it can be secured with passwords or encryption.

Libraries to create and manipulate PDF documents:

  1. iText - An open-source Java library for creating and converting PDF documents.
  2. JasperReports - An open-source Java reporting tool that can export reports as PDF documents.
  3. Adobe PDF Library - A library based on the technology behind Adobe Acrobat. It provides an environment for generating, manipulating, rendering, and printing PDF documents.

What is PDFBox?

PDFBox is an open-source library written in Java for creating and converting PDF documents. It is distributed as a JAR file. With it we can create new PDF documents, manipulate existing ones, add bookmarks, and extract content from PDF documents. We can also use it to digitally sign and print documents and to validate files against the PDF/A-1b standard.
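
As a rough illustration of the creation and extraction features mentioned above, here is a minimal sketch. It assumes the PDFBox 2.0.x JAR is on the classpath; in PDFBox 3.x, documents are loaded through `Loader.loadPDF` and the standard fonts are constructed differently, so the code would need adjusting there. The class name `PdfBoxQuickStart` and the file name `hello.pdf` are made up for this example.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper class; assumes the PDFBox 2.0.x API.
class PdfBoxQuickStart {

    // Creates a one-page PDF containing a single line of text.
    static void createPdf(File target, String message) throws IOException {
        try (PDDocument document = new PDDocument()) {
            PDPage page = new PDPage(); // default page size is US Letter
            document.addPage(page);

            try (PDPageContentStream content = new PDPageContentStream(document, page)) {
                content.beginText();
                // Standard Helvetica can only encode WinAnsi characters.
                content.setFont(PDType1Font.HELVETICA, 12);
                // Coordinates are in points (1/72 inch) from the bottom-left corner.
                content.newLineAtOffset(72, 700);
                content.showText(message);
                content.endText();
            }
            document.save(target);
        }
    }

    // Loads an existing PDF and returns its text content.
    static String extractText(File source) throws IOException {
        try (PDDocument document = PDDocument.load(source)) {
            return new PDFTextStripper().getText(document);
        }
    }

    public static void main(String[] args) throws IOException {
        File file = new File("hello.pdf");
        createPdf(file, "Hello, PDFBox");
        System.out.println(extractText(file));
    }
}
```

Both `PDDocument` and `PDPageContentStream` are opened in try-with-resources blocks so that they are closed even if writing fails; the content stream has to be closed before the document is saved.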

The PDFBox library was originally developed in 2002 by Ben Litchfield. It entered the Apache Incubator in 2008 and became an Apache top-level project in 2009. It offers Unicode support for PDF creation and supports working with interactive forms.

PDFBox comes with a set of command-line utilities for performing various operations on PDF documents. These utilities include encrypting and decrypting PDFs, overlaying, merging, debugging, converting text to PDF, and converting PDF to an image.

Components of PDFBox

PDFBox has the following components:

  1. PDFBox - The main part of the library. It contains the classes and interfaces for extracting content from PDF files and manipulating them.
  2. FontBox - Contains the classes and interfaces for handling font information.
  3. XmpBox - Contains the classes and interfaces for handling XMP metadata.
  4. Preflight - Used to verify PDF files against the PDF/A-1b standard.

Applications of PDFBox

The following projects use PDFBox:

  1. Apache Nutch - A highly extensible and scalable open-source web search engine. It builds on Apache Lucene, adding a web crawler, a link-graph database, and parsers for HTML and other file formats. It can run on Apache Hadoop for distributed crawling.
  2. Apache Tika - A toolkit used mainly for detecting document types and extracting content from various file formats using existing parser libraries.
