Tika Document Type Detection

Document type detection is the process of identifying what kind of document a file contains. Document types are expressed as media types: for example, text/plain represents a plain text file and image/jpeg represents a JPEG image.

Tika detects the document type so that it can call the appropriate parser to extract content and metadata.

Tika identifies document types using MIME (Multipurpose Internet Mail Extensions) media types.

The Internet Assigned Numbers Authority (IANA) maintains the official registry of top-level media types and their thousands of subtypes.

The following table describes eight of the top-level media types.

Top-level type	Description
text/*	Text-based documents such as HTML, CSS, CSV and plain text.
image/*	Image formats such as JPEG, Portable Network Graphics (PNG) and GIF.
audio/*	Music and other audio formats such as MP3 and Ogg audio.
video/*	Video formats such as QuickTime and MP4.
model/*	File formats for expressing physical or behavioral models in various domains. For example, the VRML format is used to express 3D models.
application/*	Application-specific document formats that don't necessarily fit any of the other top-level categories. Examples include PDF (application/pdf) and Microsoft Word (application/msword) documents.
message/*	Email and other message types sent over the internet and other networks.
multipart/*	Container formats for related component documents. Like message/* types, multipart/* documents are messages transmitted over the network.
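
To make the type/subtype structure concrete, here is a minimal sketch in plain Java (no Tika required) that splits a media type string into its top-level type and subtype and checks the top-level type against the eight entries in the table above. The class name MediaTypeParts and its helper methods are hypothetical names chosen for this illustration; they are not part of Tika.

```java
import java.util.Locale;
import java.util.Set;

public class MediaTypeParts {
    // The eight top-level types listed in the table above.
    static final Set<String> TOP_LEVEL = Set.of(
            "text", "image", "audio", "video",
            "model", "application", "message", "multipart");

    // Returns the part before the slash, lowercased, e.g. "image" for "image/jpeg".
    static String topLevel(String mediaType) {
        int slash = mediaType.indexOf('/');
        if (slash <= 0) {
            throw new IllegalArgumentException("Not a type/subtype pair: " + mediaType);
        }
        return mediaType.substring(0, slash).trim().toLowerCase(Locale.ROOT);
    }

    // Returns the part after the slash, lowercased, with any ";name=value" parameters removed.
    static String subtype(String mediaType) {
        int slash = mediaType.indexOf('/');
        if (slash <= 0 || slash == mediaType.length() - 1) {
            throw new IllegalArgumentException("Not a type/subtype pair: " + mediaType);
        }
        int semicolon = mediaType.indexOf(';', slash);
        String sub = semicolon < 0
                ? mediaType.substring(slash + 1)
                : mediaType.substring(slash + 1, semicolon);
        return sub.trim().toLowerCase(Locale.ROOT);
    }

    // True if the top-level type is one of the eight listed in the table.
    static boolean isListedTopLevel(String mediaType) {
        return TOP_LEVEL.contains(topLevel(mediaType));
    }

    public static void main(String[] args) {
        String[] samples = {"text/plain", "image/jpeg", "application/msword"};
        for (String s : samples) {
            System.out.println(s + " -> type=" + topLevel(s) + ", subtype=" + subtype(s));
        }
    }
}
```

In practice Tika's own classes do this parsing; the sketch only shows how a media type string is put together.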

Media Types in Tika

A media type identifies the format of a file and tells the computer which applications to associate with it.

Accurate media type detection is one of Tika's core tasks.

Tika provides a Java API, with dedicated classes, for interacting with its MIME database.

Tika has its own media type registry that stores IANA-registered types and other known types that are being used in practice.

Tika uses the MediaType class to represent media types. Instances of this class are immutable and contain only the media type's type/subtype pair and optional name=value parameters.
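
As a rough illustration of the MediaType class, the sketch below parses a media type string and reads back its parts. It assumes tika-core is on the classpath; the class name MediaTypeExample and the sample string are chosen only for this example.

```java
import java.util.Map;
import org.apache.tika.mime.MediaType;

public class MediaTypeExample {
    public static void main(String[] args) {
        // parse() returns null if the string is not a valid media type.
        MediaType html = MediaType.parse("text/html; charset=UTF-8");

        // The type/subtype pair.
        System.out.println("type       : " + html.getType());
        System.out.println("subtype    : " + html.getSubtype());

        // The optional name=value parameters.
        Map<String, String> params = html.getParameters();
        System.out.println("parameters : " + params);

        // getBaseType() returns the same type/subtype without parameters.
        System.out.println("base type  : " + html.getBaseType());

        // Instances can also come from constants or factory methods.
        MediaType plain = MediaType.TEXT_PLAIN;
        MediaType pdf = MediaType.application("pdf");
        System.out.println(plain + ", " + pdf);
    }
}
```

Because instances are immutable, there are no setters; a different parameter set means a new MediaType object.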

The following table lists some commonly used file extensions and their media types.

Extension	File Format	Media Type
.txt	Text document	text/plain
.html	HTML page	text/html
.xls	Microsoft Excel spreadsheet	application/vnd.ms-excel
.jpg	JPEG image	image/jpeg
.mp3	MP3 audio	audio/mpeg
.zip	Zip archive	application/zip
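
The mappings in the table can be checked with a small sketch like the one below, which asks Tika for the media type of a file name alone. It assumes tika-core is on the classpath and uses the Tika facade's detect(String) overload, which looks only at the name, so the files do not need to exist. The class name ExtensionDetect and the sample file names are made up for this example.

```java
import org.apache.tika.Tika;

public class ExtensionDetect {
    // Returns the media type Tika associates with the given file name.
    static String typeOf(String fileName) {
        return new Tika().detect(fileName);
    }

    public static void main(String[] args) {
        // Hypothetical file names, one per extension in the table.
        String[] names = {
            "notes.txt", "index.html", "budget.xls",
            "photo.jpg", "song.mp3", "backup.zip"
        };
        Tika tika = new Tika();
        for (String name : names) {
            System.out.println(name + " -> " + tika.detect(name));
        }
    }
}
```

Name-based detection is fast but trusts the extension; a renamed file will be reported under its new extension.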

The Tika facade class provides a detect() method that returns a document's media type as a string. The following example detects the type of a file.

package tikaexample;

import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;

public class SimpleTypeDetect {
	public static void main(String[] args) throws IOException {
		// The Tika facade gives simple access to type detection.
		Tika tika = new Tika();
		// detect(File) reads the file, so it throws IOException if the file cannot be read.
		String type = tika.detect(new File("javatpoint.txt"));
		System.out.println("file type : " + type);
	}
}

Output: