Tika Document Type Detection
Document type detection is the process of identifying the type of a document. Types are written as media type strings: for example, text/plain represents a plain text file and image/jpeg represents a JPEG image file.
Tika detects the document type so that it can call the appropriate parser to extract content and metadata.
Tika supports the document types defined under MIME (Multipurpose Internet Mail Extensions).
Currently, eight official top-level types and thousands of subtypes are registered with the Internet Assigned Numbers Authority (IANA).
Following are the top-level media types.
Media Types in Tika
Media types identify the kind of content a file holds, and they tell the computer which applications to associate with which files.
Detecting media types accurately is a major task, and one that Tika handles well.
Tika provides a Java API and class-level support for interacting with the Tika MIME database.
Tika has its own media type registry that stores IANA-registered types and other known types that are being used in practice.
Tika uses the MediaType class to represent media types. Instances of this class are immutable and contain only the media type's type/subtype pair and optional name=value parameters.
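The type/subtype pair and the optional parameters described above can be inspected with a few calls on MediaType. This is a minimal sketch, assuming tika-core is on the classpath; the class name MediaTypeExample, the helper baseTypeOf, and the sample string are illustrative and do not come from the original text.

```java
import java.util.Map;

import org.apache.tika.mime.MediaType;

public class MediaTypeExample {

    // Hypothetical helper: returns the "type/subtype" pair with any
    // parameters stripped, or null if the string cannot be parsed.
    static String baseTypeOf(String raw) {
        MediaType parsed = MediaType.parse(raw);
        return parsed == null ? null : parsed.getBaseType().toString();
    }

    public static void main(String[] args) {
        // A media type string carrying one optional name=value parameter.
        MediaType type = MediaType.parse("text/plain; charset=UTF-8");

        // The top-level type and the subtype are exposed separately.
        System.out.println("type    : " + type.getType());
        System.out.println("subtype : " + type.getSubtype());

        // Parameters are exposed as a map of names to values.
        Map<String, String> params = type.getParameters();
        System.out.println("charset : " + params.get("charset"));

        // The same media type without its parameters.
        System.out.println("base    : " + baseTypeOf("text/plain; charset=UTF-8"));
    }
}
```

Because instances are immutable, a parsed MediaType can be shared between threads or used as a map key without defensive copying.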
The following are some commonly used file extensions. See the table.
The Tika class provides a detect() method that returns the type of a document. See an example.
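A minimal sketch of such a call, assuming tika-core is on the classpath. The class name DetectExample, the helper detectSampleFile, and the temporary sample file are illustrative and do not come from the original text. The program writes a small text file and asks Tika for its type.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.apache.tika.Tika;

public class DetectExample {

    // Hypothetical helper: writes a small plain text file to a temporary
    // location and returns the media type string that Tika detects for it.
    static String detectSampleFile() throws IOException {
        File sample = File.createTempFile("sample", ".txt");
        sample.deleteOnExit();
        Files.write(sample.toPath(), "Hello, Tika!".getBytes(StandardCharsets.UTF_8));

        // The Tika facade uses its default detector configuration here.
        Tika tika = new Tika();
        return tika.detect(sample);
    }

    public static void main(String[] args) throws IOException {
        System.out.println("File type : " + detectSampleFile());
    }
}
```

For the plain text sample file created in this sketch, the program prints the following line.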
File type : text/plain