Tika Language Detection
Tika can identify the language of any document or piece of text. This is useful when extracting text from document formats that do not include language information in their metadata.
Tika uses the LanguageProfile and LanguageIdentifier classes to match text against an ISO 639 language code.
Tika can detect 18 of the 184 currently registered ISO 639-1 languages.
ISO 639 is a set of standards defined by the International Organization for Standardization (ISO).
Tika is able to detect various languages, including English, German, and Italian. See the following table.
Language Detection in Tika
The following image shows the key components of the language detection process.
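To make the idea of matching text against per-language profiles more concrete, here is a minimal sketch in plain Java. It is not Tika's code and does not use Tika: the class name TrigramGuessSketch, the three tiny sample texts, and the choice of cosine similarity over character trigram counts are all assumptions made for illustration. Tika ships its own trained profiles and its own comparison logic.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Toy illustration only: hypothetical class, not part of Tika.
final class TrigramGuessSketch {

    // Counts character trigrams in the text (letters only, lower-cased, space-padded).
    static Map<String, Integer> profile(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim();
        String padded = " " + cleaned + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= padded.length(); i++) {
            counts.merge(padded.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two trigram count maps; 0.0 if either is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            double v = e.getValue();
            normA += v * v;
            dot += v * b.getOrDefault(e.getKey(), 0);
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    // Tiny made-up training samples keyed by ISO 639-1 code (assumption, far too small for real use).
    static Map<String, String> sampleTexts() {
        Map<String, String> samples = new LinkedHashMap<>();
        samples.put("en", "the quick brown fox jumps over the lazy dog and then it runs "
                + "through the forest to the river where the other animals are waiting");
        samples.put("de", "der schnelle braune fuchs springt vor dem faulen hund weg und dann "
                + "rennt er durch den wald zum fluss wo die anderen tiere warten");
        samples.put("it", "la volpe veloce salta sopra il cane pigro e poi corre attraverso "
                + "il bosco fino al fiume dove gli altri animali aspettano");
        return samples;
    }

    // Returns the code of the most similar sample, or null if nothing overlaps at all.
    static String guess(String text, Map<String, String> samplesByCode) {
        Map<String, Integer> target = profile(text);
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, String> e : samplesByCode.entrySet()) {
            double score = cosine(target, profile(e.getValue()));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String text = "the dog runs to the river and the fox waits";
        System.out.println("Guessed code: " + guess(text, sampleTexts()));
    }
}
```

With samples this small the guess is only dependable for text that resembles the samples; a real detector needs profiles trained on far more text per language.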
The org.apache.tika.language package contains all the classes required to detect the language of a document or piece of text. Let's see an example.
Tika Language Detection Example
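A minimal sketch of such an example, assuming a Tika 1.x tika-core jar on the classpath, where LanguageIdentifier lives in org.apache.tika.language. The class name TikaLanguageDetectionSketch, the detect helper, and the sample sentence are illustrative choices, not taken from the original listing. Later Tika releases are understood to have moved language detection to a separate detector API, so this may not compile against them.

```java
import org.apache.tika.language.LanguageIdentifier;

// Hypothetical wrapper class for illustration; only LanguageIdentifier comes from Tika.
final class TikaLanguageDetectionSketch {

    // Builds a profile from the text and returns the best-matching ISO 639 code.
    static String detect(String text) {
        LanguageIdentifier identifier = new LanguageIdentifier(text);
        return identifier.getLanguage();
    }

    public static void main(String[] args) {
        // Sample input (assumption): any reasonably long English sentence will do.
        String text = "This is a short piece of English text whose language we want Tika to identify.";
        System.out.println("Language code is : " + detect(text));
    }
}
```

Very short inputs give the profile matcher little to work with, so it can help to check identifier.isReasonablyCertain() before trusting the returned code.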
Language code is : en