Tika Extracting PDF File

To extract content from a PDF file, Tika uses PDFParser, a class that extracts both content and metadata from PDF files. This class is located in the org.apache.tika.parser.pdf package.

Its constructor and methods are listed in the tables below.

Tika PDFParser Constructor

Constructor	Description
public PDFParser()	It creates an instance of this class.

Tika PDFParser Methods

Method	Description
public Set<MediaType> getSupportedTypes(ParseContext context)	It returns the set of media types supported by this parser when used with the given parse context.
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException	It parses a document stream into a sequence of XHTML SAX events.
public PDFParserConfig getPDFParserConfig()	It returns the PDFParserConfig currently used by this parser.
public void setPDFParserConfig(PDFParserConfig config)	It sets the PDFParserConfig to be used by this parser.
public void setEnableAutoSpace(boolean v)	If true, the parser estimates where spaces should be inserted between words.
public boolean getExtractAnnotationText()	It returns true if text in annotations will be extracted.
public void setExtractAnnotationText(boolean v)	If true (the default), text in annotations will be extracted.
public void setSuppressDuplicateOverlappingText(boolean v)	If true, the parser should try to remove duplicated text over the same region.
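
The setters in the table above can be combined before parsing. The following is a minimal sketch, assuming the Tika version on the classpath exposes these methods on PDFParser exactly as listed (availability may differ between Tika releases). The class name PdfParserOptionsExample and the method createParser are made up for illustration.

```java
import org.apache.tika.parser.pdf.PDFParser;

public class PdfParserOptionsExample {

    // Hypothetical helper: builds a PDFParser with the three options
    // from the table set explicitly instead of relying on defaults.
    public static PDFParser createParser() {
        PDFParser parser = new PDFParser();
        parser.setEnableAutoSpace(true);                  // estimate spaces between words
        parser.setExtractAnnotationText(false);           // skip text inside annotations
        parser.setSuppressDuplicateOverlappingText(true); // drop duplicated overlapping text
        return parser;
    }

    public static void main(String[] args) {
        PDFParser parser = createParser();
        // Reads back the annotation flag that was set above
        System.out.println("Extract annotation text: " + parser.getExtractAnnotationText());
    }
}
```

The parser returned by createParser can then be used with parse(...) in the same way as a default PDFParser.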

Tika Extracting PDF File Example

In the following example, we extract the content and metadata of a PDF file named javatpoint.pdf.

package tikaexample;

import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class PdfParserExample {
    public static void main(String[] args) {
        BodyContentHandler handler = new BodyContentHandler();
        PDFParser parser = new PDFParser();
        Metadata metadata = new Metadata();
        ParseContext pcontext = new ParseContext();

        // javatpoint.pdf is looked up on the classpath, relative to this class's package
        try (InputStream stream = PdfParserExample.class.getResourceAsStream("javatpoint.pdf")) {
            if (stream == null) {
                System.out.println("Exception message: javatpoint.pdf not found on the classpath");
                return;
            }
            // Extracted text goes to the handler; document properties go to the metadata object
            parser.parse(stream, handler, metadata, pcontext);
            System.out.println("Document Content:" + handler.toString());
            System.out.println("Document Metadata:");
            for (String name : metadata.names()) {
                System.out.println(name + ":   " + metadata.get(name));
            }
        } catch (IOException | SAXException | TikaException e) {
            System.out.println("Exception message: " + e.getMessage());
        }
    }
}

Output:

Document Content:
Welcome to the Javatpoint. 
 
Javatpoint is a Technical portal that contains latest computer science topics. 



Document Metadata:
pdf:PDFVersion:   1.4
xmp:CreatorTool:   Online2PDF.com
access_permission:modify_annotations:   true
access_permission:can_print_degraded:   true
meta:creation-date:   2018-05-05T11:25:40Z
created:   Sat May 05 16:55:40 IST 2018
access_permission:extract_for_accessibility:   true
access_permission:assemble_document:   true
xmpTPg:NPages:   1
Creation-Date:   2018-05-05T11:25:40Z
dcterms:created:   2018-05-05T11:25:40Z
dc:format:   application/pdf; version=1.4
access_permission:extract_content:   true
access_permission:can_print:   true
pdf:docinfo:creator_tool:   Online2PDF.com
access_permission:fill_in_form:   true
pdf:encrypted:   false
producer:   Online2PDF.com
access_permission:can_modify:   true
pdf:docinfo:producer:   Online2PDF.com
pdf:docinfo:created:   2018-05-05T11:25:40Z
Content-Type:   application/pdf
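
The metadata names in the output above are not in alphabetical order, which can make a long listing hard to scan. Below is a minimal sketch of sorting the names before printing. To keep it runnable without Tika, a plain Map stands in for the Metadata object; the class name SortedMetadataPrinter and the method format are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class SortedMetadataPrinter {

    // Formats name/value pairs as "name:   value" lines, sorted by name.
    // The Map is a stand-in for Tika's Metadata object (assumption for illustration).
    public static String format(Map<String, String> metadata) {
        String[] names = metadata.keySet().toArray(new String[0]);
        Arrays.sort(names); // natural String order: uppercase letters sort before lowercase
        StringBuilder out = new StringBuilder();
        for (String name : names) {
            out.append(name).append(":   ").append(metadata.get(name)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A few entries taken from the output above
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("pdf:PDFVersion", "1.4");
        metadata.put("Content-Type", "application/pdf");
        metadata.put("xmpTPg:NPages", "1");
        System.out.print(format(metadata));
    }
}
```

With a real Metadata object, the same idea applies to the array returned by metadata.names(): sort it with Arrays.sort and then look up each value with metadata.get(name).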
