Tika Parser API

Parser is the Tika interface used to extract content and metadata from a document, whatever its format. It is a key component of Tika and lives in the org.apache.tika.parser package. Its central method is parse(), which has the following signature:

void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException

It takes four arguments: InputStream, ContentHandler, Metadata and ParseContext objects. The purpose of each of the four arguments is described below.



The arguments are described in the following table.

| Argument | Description |
| --- | --- |
| InputStream stream | The document is read from this input stream. |
| ContentHandler handler | A SAX ContentHandler that receives the content of the document as XHTML SAX events. |
| Metadata metadata | A multi-valued metadata container. |
| ParseContext context | Used to pass context information to Tika parsers. |
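
To make the handler argument more concrete: a Tika parser reports the extracted content to the ContentHandler as a stream of XHTML SAX events. The following is a minimal sketch of what such a handler sees. It deliberately uses only the JDK's own SAX parser as a stand-in for a Tika parser, so it runs without Tika on the classpath; TextCollectingHandlerDemo, TextCollectingHandler and collect() are hypothetical names chosen for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class TextCollectingHandlerDemo {

    /** Hypothetical handler: collects character data and counts start tags. */
    static class TextCollectingHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        int elementCount = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elementCount++; // one event per opening tag
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // text content arrives in chunks
        }
    }

    /** Feeds a small XHTML document to the handler using the JDK SAX parser. */
    static TextCollectingHandler collect(String xhtml) throws Exception {
        TextCollectingHandler handler = new TextCollectingHandler();
        try (InputStream stream =
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8))) {
            SAXParserFactory.newInstance().newSAXParser().parse(stream, handler);
        }
        return handler;
    }

    public static void main(String[] args) throws Exception {
        TextCollectingHandler handler =
                collect("<html><body><p>Hello Welcome to Javatpoint</p></body></html>");
        System.out.println(handler.text.toString().trim());
        System.out.println(handler.elementCount);
    }
}
```

Run as is, this prints the paragraph text followed by 3, the number of opening tags (html, body, p). With Tika, a handler such as BodyContentHandler from org.apache.tika.sax plays the same role and is passed as the second argument of parse().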

Tika also provides the AutoDetectParser class, which automatically detects what kind of content a file has and then delegates to the appropriate parser.

Apart from this, Tika includes many format-specific parser classes, each of which parses documents of one particular type. Some of them are listed in the following table.

| Parser | Package | Description |
| --- | --- | --- |
| AppleSingleFileParser | org.apache.tika.parser.apple | Parses AppleSingle files. |
| ClassParser | org.apache.tika.parser.asm | Parses Java class files. |
| AudioParser | org.apache.tika.parser.audio | Parses audio files. |
| MidiParser | org.apache.tika.parser.audio | Parses MIDI files. |
| Pkcs7Parser | org.apache.tika.parser.crypto | Parses PKCS #7 data. |
| TSDParser | org.apache.tika.parser.crypto | Parses TSD files. |
| DWGParser | org.apache.tika.parser.dwg | Parses DWG files. |
| EnviHeaderParser | org.apache.tika.parser.envi | Parses ENVI header files. |
| EpubParser | org.apache.tika.parser.epub | Parses EPUB files. |
| ExecutableParser | org.apache.tika.parser.executable | Parses executable files. |
| HtmlParser | org.apache.tika.parser.html | Parses HTML files. |
| ImageParser | org.apache.tika.parser.image | Parses image files. |
| WebPParser | org.apache.tika.parser.image | Parses WebP images. |
| IptcAnpaParser | org.apache.tika.parser.iptc | Parses IPTC ANPA files. |
| JpegParser | org.apache.tika.parser.jpeg | Parses JPEG images. |
| DBFParser | org.apache.tika.parser.dbf | Parses DBF files. |
| Mp3Parser | org.apache.tika.parser.mp3 | Parses MP3 files. |
| MP4Parser | org.apache.tika.parser.mp4 | Parses MP4 files. |
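
When the document type is already known, a parser from the table can be called directly through the same parse() method. The following is a minimal sketch using HtmlParser. It assumes that Tika's core and HTML parser classes are on the classpath; HtmlParserSketch and bodyText() are illustrative names, not part of Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

public class HtmlParserSketch {

    /** Hypothetical helper: returns the body text of an HTML string. */
    static String bodyText(String html) throws Exception {
        BodyContentHandler handler = new BodyContentHandler(); // collects body text
        Metadata metadata = new Metadata();                    // filled in by the parser
        ParseContext context = new ParseContext();             // no extra context needed here
        try (InputStream stream =
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8))) {
            // Call the format-specific parser directly; no type detection happens.
            new HtmlParser().parse(stream, handler, metadata, context);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><head><title>Demo</title></head>"
                + "<body><p>Hello Welcome to Javatpoint</p></body></html>";
        System.out.println(bodyText(html).trim());
    }
}
```

Calling a specific parser skips detection, so the caller is responsible for passing a document of the matching type.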

Tika Parser Example

In this example, we use AutoDetectParser, which detects the document type automatically and then parses the content and metadata.
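
Such an example can be sketched along the following lines. This is a minimal sketch, not a definitive listing: it assumes that tika-core and the standard Tika parsers are on the classpath and that a text file named hello.txt exists in the working directory. The class name TikaParserExample and the helper extractText() are illustrative names.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaParserExample {

    /** Hypothetical helper: parses the stream and returns the extracted text. */
    static String extractText(InputStream stream, Metadata metadata) throws Exception {
        Parser parser = new AutoDetectParser();                // picks a parser by detected type
        BodyContentHandler handler = new BodyContentHandler(); // collects the body text
        ParseContext context = new ParseContext();             // empty context is sufficient here
        parser.parse(stream, handler, metadata, context);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        // Assumption: hello.txt is in the current working directory.
        try (InputStream stream = new FileInputStream("hello.txt")) {
            System.out.println(extractText(stream, metadata).trim());
        }
        // After parsing, metadata holds what the parser found;
        // metadata.names() lists the available keys.
    }
}
```

The metadata object passed to parse() is filled in during parsing, so it can be inspected afterwards with metadata.names() and metadata.get(name).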

Output:

The following is the content of the hello.txt file after extraction.

Hello Welcome to Javatpoint




