Javatpoint Logo

Tika Document Type Detection

Document type detection is the process of identifying the type of a document. Types are expressed as media types: for example, text/plain denotes a plain text file and image/jpeg denotes a JPEG image file.

Tika detects the document type so that it can call the appropriate parser to extract content and metadata.

Tika identifies document types using the media types defined by MIME (Multipurpose Internet Mail Extensions).

The Internet Assigned Numbers Authority (IANA) maintains the official registry of media types, which includes the eight top-level types listed below and thousands of subtypes.

The top-level media types are as follows.

| Top-level type | Description |
|---|---|
| text/* | Text-based documents such as HTML, CSS, CSV, and plain text. |
| image/* | Image formats such as JPEG, Portable Network Graphics (PNG), and GIF. |
| audio/* | Music and other audio formats such as MP3 and Ogg audio. |
| video/* | Video formats such as QuickTime and MP4. |
| model/* | File formats for expressing physical or behavioral models in various domains, for example the VRML format used to express 3D models. |
| application/* | Application-specific document formats that do not necessarily fit any of the other top-level categories, for example PDF and Microsoft Word (application/msword) documents. |
| message/* | Email and other message types sent over the internet and other networks. |
| multipart/* | Container formats for related component documents. Like message/* types, multipart/* documents are messages transmitted over the network. |

Media Types in Tika

A media type identifies the format of a file and tells the computer which application to associate with it.

Detecting media types accurately is one of the core tasks that Tika handles.

Tika provides a Java API, with class-level support, for interacting with its MIME database.

Tika has its own media type registry that stores IANA-registered types and other known types that are being used in practice.

Tika uses the MediaType class to represent media types. Instances of this class are immutable and contain only the media type's type/subtype pair and optional name=value parameters.
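The following is a minimal sketch of working with MediaType, assuming the tika-core library is on the classpath. The class name MediaTypeExample and the sample strings are illustrative choices, not part of Tika.

```java
import org.apache.tika.mime.MediaType;

public class MediaTypeExample {
    public static void main(String[] args) {
        // Parse a media type string that carries one name=value parameter.
        MediaType html = MediaType.parse("text/html; charset=UTF-8");

        System.out.println(html.getType());                      // top-level type
        System.out.println(html.getSubtype());                   // subtype
        System.out.println(html.getParameters().get("charset")); // parameter value
        System.out.println(html.getBaseType());                  // type/subtype without parameters

        // Build a media type directly from its type and subtype,
        // then compare it with a parsed instance.
        MediaType jpeg = new MediaType("image", "jpeg");
        System.out.println(jpeg.equals(MediaType.parse("image/jpeg")));
    }
}
```

Because instances are immutable, they can be shared freely and compared with equals().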

The following table lists some commonly used file extensions and their media types.

| Extension | File Format | Media Type |
|---|---|---|
| .txt | Text document | text/plain |
| .html | HTML page | text/html |
| .xls | Microsoft Excel spreadsheet | application/vnd.ms-excel |
| .jpg | JPEG image | image/jpeg |
| .mp3 | MP3 audio | audio/mpeg |
| .zip | Zip archive | application/zip |
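The mappings in the table can be checked with name-based detection. This is a minimal sketch, assuming tika-core is on the classpath; the class NameDetectionExample, the helper detectByName, and the sample file names are hypothetical and exist only for illustration.

```java
import org.apache.tika.Tika;

public class NameDetectionExample {
    // Tika is the facade class from tika-core.
    private static final Tika TIKA = new Tika();

    // detect(String) inspects only the file name; no file is opened.
    static String detectByName(String fileName) {
        return TIKA.detect(fileName);
    }

    public static void main(String[] args) {
        // Hypothetical file names, one per row of the table.
        String[] names = {
            "notes.txt", "index.html", "budget.xls",
            "photo.jpg", "song.mp3", "backup.zip"
        };
        for (String name : names) {
            System.out.println(name + " -> " + detectByName(name));
        }
    }
}
```

Name-based detection is fast but relies on the extension being correct; passing the file itself lets Tika take the content into account as well.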

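Tika exposes document type detection through the detect() method of its Tika facade class, which returns the media type as a string. For a plain text file, for example, it returns text/plain.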
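A minimal sketch of calling detect() on a file, assuming tika-core is on the classpath. The class name TikaDetectExample, the helper detectType, and the temporary sample file are illustrative choices rather than part of Tika.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.Tika;

public class TikaDetectExample {
    // Returns the detected media type of the given file as a string.
    static String detectType(File file) throws IOException {
        Tika tika = new Tika();
        return tika.detect(file);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a temporary text file stands in for a real document
        // so the example runs without external data.
        Path sample = Files.createTempFile("sample", ".txt");
        Files.write(sample, "Hello, Tika!".getBytes(StandardCharsets.UTF_8));
        try {
            System.out.println("File type : " + detectType(sample.toFile()));
        } finally {
            Files.deleteIfExists(sample);
        }
    }
}
```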

Output:

File type : text/plain




