Apache Tika Supported Formats
Apache Tika supports more than a thousand document types. This section lists some of the most common ones as an introduction; Tika can detect a much wider range of formats than those listed below.
Apache Tika can detect the following document types and extract their content and metadata.
Hyper Text Markup Language
To parse HTML documents and extract their content and metadata, Tika uses the HtmlParser class.
Extensible Markup Language
XML is an extensible markup language that is used for all kinds of content. The DcXMLParser class extracts the text content of an XML document and ignores the XML structure.
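The idea of keeping the text and ignoring the XML structure can be sketched with the SAX parser that ships with the JDK. This is only an illustration of the technique, not Tika's DcXMLParser: the class XmlTextSketch and its method extractText are hypothetical names, and the whitespace collapsing at the end is an assumption made to keep the output readable.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: keeps character data, drops markup.
final class XmlTextSketch {

    static String extractText(InputStream xml) {
        StringBuilder text = new StringBuilder();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Ask the JDK parser to apply its secure-processing limits.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            SAXParser parser = factory.newSAXParser();
            parser.parse(xml, new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    // Element and attribute events are ignored; only text is kept.
                    text.append(ch, start, length);
                }
            });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML input", e);
        }
        // Assumption: collapse indentation and line breaks into single spaces.
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String xml = "<note>\n  <to>Alice</to>\n  <body>Hello, Tika</body>\n</note>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(extractText(in));
    }
}
```

The handler overrides only the characters callback, so tags and attributes never reach the output.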
Microsoft Office Document Formats
Microsoft Office produces documents in the generic OLE2 Compound Document and Office Open XML (OOXML) formats. The OfficeParser and OOXMLParser classes use the Apache POI libraries to support text and metadata extraction from both OLE2 and OOXML documents.
OpenDocument Format
The OpenDocument format (ODF) is best known as the default format of the OpenOffice.org office suite. The OpenDocumentParser class supports this format.
iWork document formats
The iWork document formats (Numbers, Pages, Keynote) are supported by the IWorkPackageParser class, which extracts text and metadata.
Portable Document Format
The PDFParser class is used to parse Portable Document Format (PDF) documents using the Apache PDFBox library.
Electronic Publication Format
The Electronic Publication (EPUB) format, which is used for many digital books, is supported by the EpubParser class. The XML-based FictionBook format is supported by the FictionBookParser class.
Rich Text Format
The RTFParser class uses the standard javax.swing.text.rtf package to extract text content from Rich Text Format (RTF) documents.
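To show what the javax.swing.text.rtf package offers on its own, here is a minimal sketch that reads RTF with the JDK's RTFEditorKit and returns the plain text. It does not call Tika, and the class name RtfTextSketch and the sample RTF string are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper, not part of Tika: plain JDK RTF text extraction.
final class RtfTextSketch {

    static String extractText(InputStream rtf) {
        RTFEditorKit kit = new RTFEditorKit();
        // The kit needs a styled document; its default document is one.
        Document doc = kit.createDefaultDocument();
        try {
            kit.read(rtf, doc, 0);
            // Styling is kept in the document model; only the text is returned.
            return doc.getText(0, doc.getLength());
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("Could not read RTF input", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample input: one short sentence with a bold word.
        String rtf = "{\\rtf1\\ansi Hello, {\\b Tika}!}";
        InputStream in = new ByteArrayInputStream(rtf.getBytes(StandardCharsets.US_ASCII));
        System.out.println(extractText(in));
    }
}
```

Control words such as \b change only the style attributes in the document model, so they do not appear in the returned text.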
Compression and packaging formats
Tika uses the Commons Compress library to support various compression and packaging formats. The CompressorParser class handles the top-level compression formats. The PackageParser class and its subclasses then parse the packaging formats and pass each unpacked document stream to a second parsing stage, which uses the parser instance specified in the parse context. Supported formats include Tar, AR, ARJ, CPIO, Dump, Zip, 7Zip, Gzip, BZip2, XZ, LZMA, Z and Pack200.
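The layered flow described above (decompress, unpack, then hand each entry to a second parser) can be sketched with the JDK's java.util.zip classes. This is a simplified stand-in that handles only a gzip-compressed Zip archive; it uses neither Commons Compress nor Tika's parser classes, and the names TwoStageUnpackSketch, EntryParser, readAsUtf8 and sampleArchive are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch of the compress -> package -> second-stage flow.
final class TwoStageUnpackSketch {

    /** Second parsing stage: receives one unpacked entry stream (must not close it). */
    interface EntryParser {
        String parse(String name, InputStream content) throws IOException;
    }

    static Map<String, String> unpack(InputStream compressed, EntryParser secondStage) {
        Map<String, String> results = new LinkedHashMap<>();
        // Layer 1 strips the gzip compression; layer 2 walks the Zip package.
        try (ZipInputStream zip = new ZipInputStream(new GZIPInputStream(compressed))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // The zip stream reports end-of-stream at the end of each entry.
                results.put(entry.getName(), secondStage.parse(entry.getName(), zip));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return results;
    }

    /** Simplest possible second stage: treat every entry as UTF-8 text (Java 9+). */
    static String readAsUtf8(String name, InputStream content) throws IOException {
        return new String(content.readAllBytes(), StandardCharsets.UTF_8);
    }

    /** Builds a small gzip-compressed Zip archive in memory for the demo. */
    static byte[] sampleArchive() {
        try {
            ByteArrayOutputStream zipBytes = new ByteArrayOutputStream();
            try (ZipOutputStream zip = new ZipOutputStream(zipBytes)) {
                zip.putNextEntry(new ZipEntry("notes.txt"));
                zip.write("first document".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
                zip.putNextEntry(new ZipEntry("readme.txt"));
                zip.write("second document".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
            ByteArrayOutputStream gzBytes = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(gzBytes)) {
                gz.write(zipBytes.toByteArray());
            }
            return gzBytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(sampleArchive());
        System.out.println(unpack(in, TwoStageUnpackSketch::readAsUtf8));
    }
}
```

Passing the second stage in as a parameter mirrors the role of the parser supplied through the parse context: the unpacking code does not need to know how each entry will be parsed.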
Text formats
Extracting text content from plain text files seems like a simple task until you start thinking about all the possible character encodings. The TXTParser class uses encoding detection code from the ICU project to automatically detect the character encoding of a text document.
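The following sketch shows why the encoding matters and what a very naive guess could look like. It is not the ICU detection code that TXTParser relies on: it checks only for a byte order mark, then tries a strict UTF-8 decode, and falls back to ISO-8859-1 as an assumed default. The class EncodingGuessSketch and the method guessCharset are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: a deliberately naive stand-in for real encoding detection.
final class EncodingGuessSketch {

    static Charset guessCharset(byte[] data) {
        // Step 1: a byte order mark, when present, identifies the encoding.
        if (data.length >= 3 && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB && (data[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        // Step 2: accept UTF-8 only if every byte sequence is well formed.
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Step 3 (assumption): fall back to a single-byte Latin encoding.
            return StandardCharsets.ISO_8859_1;
        }
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the wrong charset silently corrupts the text.
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(original.equals(wrong));

        // Decoding with the guessed charset recovers the original in both cases.
        System.out.println(original.equals(new String(utf8, guessCharset(utf8))));
        System.out.println(original.equals(new String(latin1, guessCharset(latin1))));
    }
}
```

A statistical detector can also tell apart encodings that this fallback cannot, for example different single-byte code pages.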
Feed and Syndication formats
The RSS and Atom feed syndication formats are supported by the FeedParser class.