Tika Parsing Document to Plain Text

Tika can return extracted content in several formats, such as plain text, HTML, or XHTML. The ContentHandler implementation passed to the parser determines the form in which the content is returned. To get the content of the document's body as plain text, we can use BodyContentHandler.

Let's see an example that produces plain text output from an HTML file.

Tika Parsing to Plain Text Example

package tikaexample;

import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class AutoDetectParseExample {
	public static void main(String[] args) throws IOException, SAXException, TikaException {
		// Collects only the text found inside the document body
		BodyContentHandler handler = new BodyContentHandler();
		// Detects the document type and delegates to a matching parser
		AutoDetectParser parser = new AutoDetectParser();
		Metadata metadata = new Metadata();
		// index.html is looked up on the classpath, relative to this class's package
		try (InputStream stream = AutoDetectParseExample.class.getResourceAsStream("index.html")) {
			if (stream == null) {
				throw new IOException("index.html not found on the classpath");
			}
			parser.parse(stream, handler, metadata);
			System.out.println(handler.toString());
		}
	}
}

Output:

The following is our HTML file.

// index.html

<html>
<head>
<title>Index Page</title>
</head>
<body>
<h2>Hello, Welcome to Javatpoint. </h2>
</body>
</html>

After extraction, the program prints the content of the body as plain text.

Hello, Welcome to Javatpoint.
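
Note that the output contains the heading text from the body but not the page title from the head. To illustrate the idea behind a body-only handler, here is a minimal sketch that uses only the SAX classes shipped with the JDK. It is not Tika's implementation: the class names BodyTextSketch and BodyTextHandler and the method extractBodyText are hypothetical names chosen for this illustration, and the sketch assumes the input is well-formed XHTML, whereas Tika's HTML parser is built to handle real-world HTML.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of Tika.
class BodyTextSketch {

    // Collects character data that appears inside the <body> element only.
    static class BodyTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int bodyDepth = 0; // greater than 0 while the parser is inside <body>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Start counting at <body>, then track every nested element
            if (bodyDepth > 0 || "body".equalsIgnoreCase(qName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (bodyDepth > 0) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text outside <body>, such as the <title>, is ignored
            if (bodyDepth > 0) {
                text.append(ch, start, length);
            }
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    // Parses well-formed XHTML and returns the trimmed text of its body.
    static String extractBodyText(String xhtml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            BodyTextHandler handler = new BodyTextHandler();
            parser.parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.toString().trim();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input as XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Index Page</title></head>"
                + "<body><h2>Hello, Welcome to Javatpoint. </h2></body></html>";
        System.out.println(extractBodyText(xhtml));
    }
}
```

The handler receives the same kind of SAX events (startElement, characters, endElement) that a Tika parser sends to the ContentHandler it is given, which is why swapping the handler changes the output format without changing the parsing code.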
