Tika Parsing Document to XHTML

Tika provides the ToXMLContentHandler class to produce output in XHTML format. The handler serializes the SAX events emitted by a parser, and its toString() method returns the XHTML content of the whole document as a string.

This class contains the following constructors and methods.

Tika ToXMLContentHandler Constructors

The following are the constructors of the ToXMLContentHandler class.

Constructor	Description
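public ToXMLContentHandler()	Creates a handler that collects the XHTML output in an internal string buffer. No XML declaration is written.
public ToXMLContentHandler(String encoding)	Creates a handler that collects the output in an internal string buffer and names the given encoding in the XML declaration.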

Tika ToXMLContentHandler Methods

The following are the methods of the ToXMLContentHandler class.

Method	Description
public void characters(char[] ch, int start, int length) throws SAXException	It writes the given characters to the output, escaping XML special characters such as <, > and &.
protected void write(char ch) throws SAXException	It writes the given character as-is.
protected void write(String string) throws SAXException	It writes the given string of characters as-is.
public void startDocument() throws SAXException	It writes the XML declaration (the <?xml ...?> prefix) if an encoding was passed to the constructor.
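
To make the roles of these methods concrete, the following is a minimal sketch of a SAX handler that serializes events into a string. It is a simplified stand-in written only with JDK classes, not Tika's actual ToXMLContentHandler: the class name MiniXmlHandler and the render helper are hypothetical, attributes and namespaces are ignored, and the behavior of the real class is only assumed to follow the same pattern.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Simplified stand-in for an XML-serializing handler; NOT Tika's ToXMLContentHandler.
class MiniXmlHandler extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private final String encoding; // null means "do not write an XML declaration"

    MiniXmlHandler(String encoding) {
        this.encoding = encoding;
    }

    // Appends text unchanged, in the spirit of write(String) in the table above.
    protected void write(String s) {
        out.append(s);
    }

    @Override
    public void startDocument() {
        // The XML prefix is only written when an encoding was supplied.
        if (encoding != null) {
            write("<?xml version=\"1.0\" encoding=\"" + encoding + "\"?>\n");
        }
    }

    @Override
    public void endDocument() {
        // Nothing to flush: all output already sits in the string buffer.
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        write("<" + qName + ">"); // attributes are ignored in this sketch
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        write("</" + qName + ">");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text content must be escaped so that it cannot be mistaken for markup.
        for (int i = start; i < start + length; i++) {
            switch (ch[i]) {
                case '<': write("&lt;"); break;
                case '>': write("&gt;"); break;
                case '&': write("&amp;"); break;
                default: write(String.valueOf(ch[i]));
            }
        }
    }

    @Override
    public String toString() {
        return out.toString();
    }

    // Hypothetical helper: feeds one <p> element through the handler by hand.
    static String render(String encoding, String text) {
        MiniXmlHandler handler = new MiniXmlHandler(encoding);
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        handler.characters(text.toCharArray(), 0, text.length());
        handler.endElement("", "p", "p");
        handler.endDocument();
        return handler.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(null, "1 < 2 & 3 > 2"));
        System.out.println(render("UTF-8", "Hello"));
    }
}
```

In this sketch, the first call prints the paragraph with the special characters escaped and no XML declaration, while the second call prints the declaration line before the paragraph because an encoding was supplied.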

Tika Parsing Document to XHTML Example

This example produces output in XHTML format from a plain-text input file.

package tikaexample;

import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
public class XhtmlParseExample {
	public static void main(String[] args) throws IOException, SAXException, TikaException {
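	    // Collects the SAX events from the parser and serializes them as XHTML
	    ContentHandler handler = new ToXMLContentHandler();
	    // Detects the document type and delegates to the matching parser
	    AutoDetectParser parser = new AutoDetectParser();
	    Metadata metadata = new Metadata();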
	    try (InputStream stream = XhtmlParseExample.class.getResourceAsStream("Hello.txt")) {
	        parser.parse(stream, handler, metadata);
	        System.out.println(handler.toString());
	    }
	}
}

Output:

The Hello.txt input file contains the following text.

Hello Welcome to Javatpoint

After parsing, the program prints the content in XHTML format, as shown below.

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.txt.TXTParser" />
<meta name="Content-Encoding" content="ISO-8859-1" />
<meta name="Content-Type" content="text/plain; charset=ISO-8859-1" />
<title></title>
</head>
<body><p>Hello Welcome to Javatpoint</p>
</body></html>
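
Because the result is well-formed XHTML, it can be read back with any XML parser. The following is a minimal sketch that uses only the JDK's DOM API to pull the paragraph text and one meta value out of an abbreviated copy of the output above. The class name XhtmlOutputReader and its helper methods are hypothetical names chosen for this illustration; in a real program the string would come from handler.toString().

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class; it uses only JDK classes and no Tika API.
class XhtmlOutputReader {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Abbreviated copy of the XHTML output shown above.
    static final String SAMPLE =
          "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n"
        + "<head>\n"
        + "<meta name=\"Content-Type\" content=\"text/plain; charset=ISO-8859-1\" />\n"
        + "<title></title>\n"
        + "</head>\n"
        + "<body><p>Hello Welcome to Javatpoint</p>\n"
        + "</body></html>";

    // Parses an XHTML string into a namespace-aware DOM document.
    static Document parse(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    // Returns the text of the first <p> element, or null if there is none.
    static String firstParagraph(Document doc) {
        NodeList paragraphs = doc.getElementsByTagNameNS(XHTML_NS, "p");
        return paragraphs.getLength() == 0 ? null : paragraphs.item(0).getTextContent();
    }

    // Returns the content attribute of the first <meta> with the given name, or null.
    static String metaContent(Document doc, String name) {
        NodeList metas = doc.getElementsByTagNameNS(XHTML_NS, "meta");
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            if (name.equals(meta.getAttribute("name"))) {
                return meta.getAttribute("content");
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println(firstParagraph(doc));
        System.out.println(metaContent(doc, "Content-Type"));
    }
}
```

The lookups are namespace-aware because the html element in the output declares the XHTML namespace; a lookup by plain tag name would also work here, but the namespace-aware form stays correct if the output is embedded in a larger XML document.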
