Tika Extracting PDF File

To extract content from a PDF file, Tika uses PDFParser, a class that extracts both content and metadata from a PDF file. The class is located in the org.apache.tika.parser.pdf package. Its constructor and methods are listed below.

Tika PDFParser Constructor
Tika PDFParser Methods
Tika Extracting PDF File Example

In the following example, we are extracting content and metadata from a PDF file.

Output:

Document Content: Welcome to the Javatpoint. Javatpoint is a Technical portal that contains latest computer science topics.

Document Metadata:
pdf:PDFVersion: 1.4
xmp:CreatorTool: Online2PDF.com
access_permission:modify_annotations: true
access_permission:can_print_degraded: true
meta:creation-date: 2018-05-05T11:25:40Z
created: Sat May 05 16:55:40 IST 2018
access_permission:extract_for_accessibility: true
access_permission:assemble_document: true
xmpTPg:NPages: 1
Creation-Date: 2018-05-05T11:25:40Z
dcterms:created: 2018-05-05T11:25:40Z
dc:format: application/pdf; version=1.4
access_permission:extract_content: true
access_permission:can_print: true
pdf:docinfo:creator_tool: Online2PDF.com
access_permission:fill_in_form: true
pdf:encrypted: false
producer: Online2PDF.com
access_permission:can_modify: true
pdf:docinfo:producer: Online2PDF.com
pdf:docinfo:created: 2018-05-05T11:25:40Z
Content-Type: application/pdf
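The example's source listing is not reproduced here, so the following is only a minimal sketch of a program that could produce output of this shape. It assumes a Tika release with the PDF parser on the classpath and an input file named sample.pdf; the class name PdfExtractExample, the helper formatMetadata and the file name are illustrative, not taken from the original. Unlike the output shown, this sketch sorts the metadata names so the listing is stable between runs.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class PdfExtractExample {

    // Hypothetical helper: renders metadata as "name: value" lines,
    // sorted by name so the output order does not vary between runs.
    static String formatMetadata(Metadata metadata) {
        String[] names = metadata.names();
        Arrays.sort(names);
        StringBuilder sb = new StringBuilder();
        for (String name : names) {
            sb.append(name).append(": ").append(metadata.get(name)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args)
            throws IOException, SAXException, TikaException {
        // Assumed input path; pass a different file as the first argument.
        String path = args.length > 0 ? args[0] : "sample.pdf";

        // -1 disables the handler's default write limit on extracted text.
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();

        try (InputStream stream = Files.newInputStream(Paths.get(path))) {
            // PDFParser fills the handler with text and the Metadata object
            // with the document properties.
            new PDFParser().parse(stream, handler, metadata, new ParseContext());
        }

        System.out.println("Document Content: " + handler.toString());
        System.out.println("Document Metadata:");
        System.out.print(formatMetadata(metadata));
    }
}
```

The set of metadata keys printed depends on the PDF and on the Tika version in use, so the names will not necessarily match the listing above.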