Tika Extracting PDF File

To extract content from a PDF file, Tika uses PDFParser, a class that extracts both the content and the metadata of a PDF file. This class is located in the org.apache.tika.parser.pdf package.

Its constructor and methods are listed in the tables below.

Tika PDFParser Constructor

Constructor | Description
public PDFParser() | Creates an instance of this class.

Tika PDFParser Methods

Method | Description
public Set<MediaType> getSupportedTypes(ParseContext context) | Returns the set of media types supported by this parser when used with the given parse context.
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException | Parses a document stream into a sequence of XHTML SAX events.
public PDFParserConfig getPDFParserConfig() | Returns the PDFParserConfig used by this parser.
public void setPDFParserConfig(PDFParserConfig config) | Sets the PDFParserConfig used by this parser.
public void setEnableAutoSpace(boolean v) | If true, the parser estimates where spaces should be inserted between words.
public boolean getExtractAnnotationText() | Returns whether text in annotations is extracted.
public void setExtractAnnotationText(boolean v) | If true (the default), text in annotations is extracted.
public void setSuppressDuplicateOverlappingText(boolean v) | If true, the parser tries to remove duplicated text drawn over the same region.
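
As a rough sketch of how the configuration methods above fit together, the snippet below builds a PDFParserConfig, passes it to the parser with setPDFParserConfig, and prints the supported media types. The class name PdfParserConfigSketch and the chosen flag values are illustrative assumptions, not part of the original tutorial. It assumes the Tika core and PDF parser jars are on the classpath, and the exact set of setters may differ between Tika versions.

```java
import java.util.Set;

import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.parser.pdf.PDFParserConfig;

// Hypothetical class name, used only for this illustration.
public class PdfParserConfigSketch {

    // Builds a PDFParser with explicit settings (the values are illustrative).
    static PDFParser configuredParser() {
        PDFParserConfig config = new PDFParserConfig();
        config.setEnableAutoSpace(true);                  // estimate spaces between words
        config.setExtractAnnotationText(true);            // include text found in annotations
        config.setSuppressDuplicateOverlappingText(true); // drop text drawn twice in one region

        PDFParser parser = new PDFParser();
        parser.setPDFParserConfig(config);
        return parser;
    }

    public static void main(String[] args) {
        PDFParser parser = configuredParser();
        // Lists the media types this parser reports for the given context.
        Set<MediaType> types = parser.getSupportedTypes(new ParseContext());
        for (MediaType type : types) {
            System.out.println(type);
        }
    }
}
```

Grouping the options in a PDFParserConfig keeps all settings in one object, which can then be reused across parser instances.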

Tika Extracting PDF File Example

In this example, the content and metadata of a PDF file are extracted. A sample run produces the output shown below.
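
A minimal sketch of such an extraction, assuming the Tika core and PDF parser jars are on the classpath, could look like the following. The class name PdfExtractSketch, the helper formatMetadata, and the command-line argument for the file path are hypothetical choices made for this illustration. The PDF behind the sample output is not specified, so any small PDF can stand in for it.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical class name, used only for this illustration.
public class PdfExtractSketch {

    // Renders each metadata entry as "name:   value", one entry per line.
    static String formatMetadata(Metadata metadata) {
        StringBuilder out = new StringBuilder();
        for (String name : metadata.names()) {
            out.append(name).append(":   ").append(metadata.get(name)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException, SAXException, TikaException {
        if (args.length < 1) {
            System.err.println("usage: PdfExtractSketch <file.pdf>");
            return;
        }

        // Collects the body text of the document as plain text.
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();

        // Parse the PDF; the stream is closed automatically afterwards.
        try (InputStream stream = new FileInputStream(args[0])) {
            new PDFParser().parse(stream, handler, metadata, context);
        }

        System.out.println("Document Content:");
        System.out.println(handler.toString());
        System.out.println("Document Metadata:");
        System.out.print(formatMetadata(metadata));
    }
}
```

The set and order of metadata keys depend on the PDF and on the Tika version, so a run of this sketch will not necessarily reproduce the listing line for line.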


Document Content:
Welcome to the Javatpoint. 
Javatpoint is a Technical portal that contains latest computer science topics. 

Document Metadata:
pdf:PDFVersion:   1.4
access_permission:modify_annotations:   true
access_permission:can_print_degraded:   true
meta:creation-date:   2018-05-05T11:25:40Z
created:   Sat May 05 16:55:40 IST 2018
access_permission:extract_for_accessibility:   true
access_permission:assemble_document:   true
xmpTPg:NPages:   1
Creation-Date:   2018-05-05T11:25:40Z
dcterms:created:   2018-05-05T11:25:40Z
dc:format:   application/pdf; version=1.4
access_permission:extract_content:   true
access_permission:can_print:   true
access_permission:fill_in_form:   true
pdf:encrypted:   false
access_permission:can_modify:   true
pdf:docinfo:created:   2018-05-05T11:25:40Z
Content-Type:   application/pdf
