Tika Introduction

Tika is a content analysis toolkit developed by the Apache Software Foundation. It is written in Java and is used to detect file types and to extract content and metadata from files.

It supports over a thousand file types, including XML, XLS, and PDF.

It is cross-platform, and its source repository is publicly available on GitHub.


In 2007, Apache started a project to develop a tool that could extract content from files of any type. Its primary purpose was to work well with content management systems (CMS) and web crawlers. The first official release, version 1.0, followed in 2011.

At the time of writing, the current stable version of Tika is 1.17, released on December 13, 2017.


Tika is used worldwide, and several major organizations rely on it for information retrieval. Well-known users of Tika include:

  • FICO (Fair Isaac Corporation)
  • Goldman Sachs
  • NASA
  • Drupal (software)
  • Alfresco (software)

Forbes magazine published a report on the key role Tika played when 400 journalists used it to extract information from 11.5 million documents.

