Apache Tika has a wonderful feature: it can transform a source document (PDF, MS Office, OpenOffice, etc.) into HTML during content extraction. This can be used, for example, to show a document preview directly on a web page without involving any third-party components. It sounds pretty simple, but I dug through a lot of Google search results and couldn't find a simple working example anywhere.
So here is a working snippet I extracted from tika-app:
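import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.sax.ExpandedTitleContentHandler;

// Inside a method: tikaParser is an existing Tika parser (for example an AutoDetectParser), file is the document as a byte[].
ByteArrayOutputStream out = new ByteArrayOutputStream();
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
// Serialize the incoming SAX events as indented UTF-8 HTML.
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
handler.setResult(new StreamResult(out));
// ExpandedTitleContentHandler makes an empty title render as <title></title> rather than <title/>.
ExpandedTitleContentHandler handler1 = new ExpandedTitleContentHandler(handler);
tikaParser.parse(new ByteArrayInputStream(file), handler1, new Metadata());
return new String(out.toByteArray(), "UTF-8");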
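To see what the TransformerHandler part does on its own, here is a minimal sketch that uses only the JDK and no Tika at all. It feeds a few hand-written SAX events into the same kind of handler, standing in for the events a parser would normally produce. The class name HtmlSerializerSketch and the render helper are made up for this illustration, and the tiny html/head/body structure is just an assumption about what a simple document might look like, not what Tika actually emits for any particular file.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

class HtmlSerializerSketch {

    // Hypothetical helper: plays the role of the parser by emitting SAX events by hand.
    static String render(String title, String paragraph) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        handler.setResult(new StreamResult(out));

        AttributesImpl none = new AttributesImpl(); // no attributes on any element
        handler.startDocument();
        handler.startElement("", "html", "html", none);
        handler.startElement("", "head", "head", none);
        element(handler, "title", title, none);
        handler.endElement("", "head", "head");
        handler.startElement("", "body", "body", none);
        element(handler, "p", paragraph, none);
        handler.endElement("", "body", "body");
        handler.endElement("", "html", "html");
        handler.endDocument();

        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Emits <name>text</name> as three SAX events; the serializer escapes the text.
    private static void element(TransformerHandler handler, String name, String text,
                                AttributesImpl attrs) throws Exception {
        handler.startElement("", name, name, attrs);
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement("", name, name);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render("Hello", "Fish & chips"));
    }
}
```

The handler never sees a file, only SAX events, which is why the same setup works for any document type Tika can parse: Tika does the format-specific work and the handler only writes the markup out.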
The tika-app snippet works pretty nicely. Here is an example of an original MS Office document:
And here is how it looks in my web app as an HTML preview:
This is a copy from my old Blogspot blog.