Tika Language Detection

Tika can identify the language of any document or piece of text. This is useful when extracting text from document formats that do not include language information in their metadata.

Tika uses the LanguageProfile and LanguageIdentifier classes to map text to its ISO 639 language code.
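To give a feel for how profile-based detection works in general, here is a minimal sketch in plain Java. It is not Tika's implementation: the class TrigramSketch, the two tiny reference sentences, and the use of cosine similarity are all assumptions made for illustration. The idea is to count the three-character sequences (trigrams) in the input and pick the reference language whose trigram counts look most similar.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class, not part of Apache Tika.
public class TrigramSketch {

    // Tiny made-up reference texts; a real detector uses large trained profiles.
    public static final Map<String, String> SAMPLES = new HashMap<>();
    static {
        SAMPLES.put("en", "the quick brown fox jumps over the lazy dog and this is the house");
        SAMPLES.put("de", "der schnelle braune fuchs springt ueber den faulen hund und das ist das haus");
    }

    // Count every overlapping 3-character sequence in the lower-cased text.
    public static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity of two profiles: 0.0 when no trigram is shared,
    // close to 1.0 when the counts are proportional.
    public static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    // Return the code of the most similar reference text, or "unknown"
    // when the input shares no trigram with any of them.
    public static String detect(String text) {
        Map<String, Integer> input = profile(text);
        String best = "unknown";
        double bestScore = 0.0;
        for (Map.Entry<String, String> e : SAMPLES.entrySet()) {
            double score = similarity(input, profile(e.getValue()));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(detect("this is the dog"));
        System.out.println(detect("das ist der hund"));
    }
}
```

With reference texts this small the sketch only separates inputs that clearly resemble one of the two sentences. Longer input and larger profiles give more reliable results, which is also why very short strings are hard for any detector.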

Tika can detect 18 of the 184 currently registered ISO 639-1 languages.

ISO 639 is a set of standards, defined by the International Organization for Standardization (ISO), that assigns short codes to languages.

Tika can detect languages including English, German, and Italian. The following table lists the supported languages and their codes.

Code	Language
da	Danish
de	German
et	Estonian
el	Greek
en	English
es	Spanish
fi	Finnish
fr	French
hu	Hungarian
is	Icelandic
it	Italian
nl	Dutch
no	Norwegian
pl	Polish
pt	Portuguese
ru	Russian
sv	Swedish
th	Thai
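Since detection returns only the two-letter code, it can be handy to turn that code back into a readable name. The sketch below does this with a simple lookup built from the table above. The class LanguageNames and its nameFor method are hypothetical helpers written for this tutorial; they are not part of Tika.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Apache Tika.
public class LanguageNames {

    // Codes and names copied from the table above, in the same order.
    private static final Map<String, String> NAMES = new LinkedHashMap<>();
    static {
        NAMES.put("da", "Danish");
        NAMES.put("de", "German");
        NAMES.put("et", "Estonian");
        NAMES.put("el", "Greek");
        NAMES.put("en", "English");
        NAMES.put("es", "Spanish");
        NAMES.put("fi", "Finnish");
        NAMES.put("fr", "French");
        NAMES.put("hu", "Hungarian");
        NAMES.put("is", "Icelandic");
        NAMES.put("it", "Italian");
        NAMES.put("nl", "Dutch");
        NAMES.put("no", "Norwegian");
        NAMES.put("pl", "Polish");
        NAMES.put("pt", "Portuguese");
        NAMES.put("ru", "Russian");
        NAMES.put("sv", "Swedish");
        NAMES.put("th", "Thai");
    }

    // Return the language name for a code, or "unknown" if the code
    // is not one of the 18 entries in the table.
    public static String nameFor(String code) {
        return NAMES.getOrDefault(code, "unknown");
    }

    public static void main(String[] args) {
        System.out.println(nameFor("de"));
        System.out.println(nameFor("xx"));
    }
}
```

Returning "unknown" for codes outside the table keeps the caller from having to handle null.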

Language Detection in Tika

The following image shows the key components of the language detection process.

The org.apache.tika.language package contains all the classes required to detect the language of a document or piece of text. Let's look at an example.

Tika Language Detection Example

package tikaexample;

import org.apache.tika.language.LanguageIdentifier;

public class LanguageDetectionExample {
	public static void main(String[] args) {
		// Build an identifier from the text whose language should be detected
		LanguageIdentifier identifier = new LanguageIdentifier("Hello, this is javatpoint.");
		// getLanguage() returns the ISO 639 code of the best-matching language
		String language = identifier.getLanguage();
		System.out.println("Language code is : " + language);
	}
}

Output: