Apache Tika Supported Formats

Apache Tika supports more than a thousand document types. This section lists some of the most common ones. It is only an introduction; Tika can detect a much wider range of formats than those listed below.

Apache Tika can detect the following document types and extract their content and metadata.

HyperText Markup Language
XML and derived formats
Microsoft Office document formats
OpenDocument Format
iWork document formats
WordPerfect document formats
Portable Document Format
Electronic Publication Format
Rich Text Format
Compression and packaging formats
Text formats
Feed and Syndication formats
And many more
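
To give a feel for what "detecting" a document type involves, here is a minimal sketch that guesses a media type from the first bytes of a file. It is only an illustration under stated assumptions: the class name MagicSniffer and its four-entry signature table are hypothetical, and Tika's real detection is far more thorough (it also considers file names and container contents).

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: a tiny signature table, not Tika's detector.
class MagicSniffer {

    /** Guesses a media type from the leading bytes, with a generic fallback. */
    static String sniff(byte[] head) {
        if (startsWith(head, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(head, "{\\rtf".getBytes(StandardCharsets.US_ASCII))) {
            return "application/rtf";
        }
        if (startsWith(head, new byte[] {'P', 'K', 0x03, 0x04})) {
            // Zip is also the outer container of OOXML, ODF and EPUB files,
            // so a real detector has to look inside the archive as well.
            return "application/zip";
        }
        if (startsWith(head, new byte[] {0x1F, (byte) 0x8B})) {
            return "application/gzip";
        }
        return "application/octet-stream"; // unknown binary content
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = "%PDF-1.7".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(head)); // prints "application/pdf"
    }
}
```

In an application that has Tika on the classpath, the same job would normally be left to Tika itself, for example through the detect methods of the org.apache.tika.Tika facade class.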

HyperText Markup Language

To parse an HTML document and extract its content and metadata, Tika uses the HtmlParser class.

XML and derived formats

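The Extensible Markup Language (XML) is a generic format that can be used for all kinds of content. The DcXMLParser class extracts the text content of a document and ignores the XML structure. The one exception is Dublin Core metadata elements, which are used for the document metadata.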
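
The idea of keeping the text and ignoring the XML structure can be sketched with the SAX parser that ships with the JDK. This is not Tika's code: the class name XmlTextSketch is hypothetical, and the sketch does not handle Dublin Core metadata or any other special cases.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch only: collects character data and drops all markup.
class XmlTextSketch {

    static String extractText(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        StringBuilder text = new StringBuilder();

        // Only character data is recorded; element and attribute events
        // fall through to the empty defaults of DefaultHandler.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<note><to>Ann</to><body>Hello <b>world</b></body></note>";
        System.out.println(extractText(xml)); // prints "AnnHello world"
    }
}
```

The output shows the cost of ignoring structure: text from neighbouring elements runs together unless the source document contains whitespace between them.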

Microsoft Office Document Formats

Microsoft Office produces documents in the generic OLE 2 Compound Document and Office Open XML (OOXML) formats. The OfficeParser and OOXMLParser classes use Apache POI libraries to support text and metadata extraction from both OLE2 and OOXML documents.

OpenDocument Format

The OpenDocument Format (ODF) is best known as the default format of the OpenOffice.org office suite. The OpenDocumentParser class supports this format.

iWork document formats

The iWork document formats (Numbers, Pages, Keynote) are supported by the IWorkPackageParser class, which extracts text and metadata.

Portable Document Format

The PDFParser class is used to parse Portable Document Format (PDF) documents using the Apache PDFBox library.

Electronic Publication Format

The EpubParser class supports the Electronic Publication (EPUB) format, which is used for many digital books. The XML-based FictionBook format is supported by the FictionBookParser class.

Rich Text Format

The RTFParser class extracts text content from Rich Text Format (RTF) documents. Early Tika releases relied on the standard javax.swing.text.rtf classes for this; later releases use Tika's own RTF text extractor.

Compression and packaging formats

Tika uses the Apache Commons Compress library to support various compression and packaging formats. The CompressorParser class handles the top-level compression formats, and the PackageParser class and its subclasses handle the packaging formats. The unpacked document streams are then passed to a second parsing stage, which uses the parser instance specified in the parse context. Supported formats include Tar, AR, ARJ, CPIO, Dump, Zip, 7Zip, Gzip, BZip2, XZ, LZMA, Z and Pack200.
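
The layering described above (strip the compression first, then walk the package and hand each entry on) can be sketched with nothing but the JDK. The sketch makes several assumptions: it uses java.util.zip instead of Commons Compress, a gzip-compressed Zip archive stands in for something like a .tar.gz file, the class name TwoStageUnpackSketch is hypothetical, and "second-stage parsing" is reduced to reading each entry as UTF-8 text. It needs Java 9 or later for readAllBytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Illustrative sketch only: JDK classes standing in for Tika's parsers.
class TwoStageUnpackSketch {

    /** Returns entry name -> entry text for a gzip-compressed Zip stream. */
    static Map<String, String> unpack(InputStream in) throws IOException {
        Map<String, String> contents = new LinkedHashMap<>();
        // Outer layer: compression (gzip). Inner layer: packaging (Zip).
        try (ZipInputStream zip = new ZipInputStream(new GZIPInputStream(in))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Stand-in for passing the entry stream to a second parser.
                byte[] body = zip.readAllBytes(); // reads the current entry only
                contents.put(entry.getName(), new String(body, StandardCharsets.UTF_8));
            }
        }
        return contents;
    }

    /** Builds a small in-memory sample: two text files, zipped, then gzipped. */
    static byte[] sample() throws IOException {
        ByteArrayOutputStream zipBytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(zipBytes)) {
            zip.putNextEntry(new ZipEntry("a.txt"));
            zip.write("alpha".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("b.txt"));
            zip.write("beta".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        ByteArrayOutputStream gzBytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(gzBytes)) {
            gz.write(zipBytes.toByteArray());
        }
        return gzBytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        unpack(new ByteArrayInputStream(sample()))
            .forEach((name, text) -> System.out.println(name + " -> " + text));
    }
}
```

Keeping the two layers separate means any compression format can be combined with any packaging format without writing a parser for each pair.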

Text formats

Extracting text content from plain text files seems like a simple task until we start thinking of all the possible character encodings. The TXTParser class uses encoding detection code from the ICU project to automatically detect the character encoding of a text document.
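
To show why the encoding has to be guessed at all, here is a deliberately simple sketch. It is not the ICU detection code, which uses statistical analysis: this version only checks for a UTF-16 byte order mark, then tests whether the bytes are valid UTF-8, and otherwise falls back to ISO-8859-1. The class name CharsetGuessSketch and the choice of fallback are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: a crude heuristic, not ICU's charset detector.
class CharsetGuessSketch {

    static Charset guess(byte[] data) {
        // 1. A UTF-16 byte order mark is an unambiguous hint.
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return StandardCharsets.UTF_16BE;
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return StandardCharsets.UTF_16LE;
            }
        }
        // 2. Strict UTF-8 decoding fails on most text in legacy 8-bit encodings.
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // 3. Assumed fallback: every byte sequence is valid ISO-8859-1.
            return StandardCharsets.ISO_8859_1;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guess(utf8));   // prints "UTF-8"
        System.out.println(guess(latin1)); // prints "ISO-8859-1"
    }
}
```

The fallback step is where such a simple approach breaks down: the bytes alone cannot tell ISO-8859-1 from, say, windows-1251 or Shift_JIS, which is why a detector based on language statistics is needed in practice.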