Converting a .docx to html using Apache POI and getting no text


I currently have some code that converts a .doc document to HTML, but the code I am using to convert a .docx file unfortunately doesn't pick up the text and convert it. Below is my code.
private void convertWordDocXtoHTML(File file) throws ParserConfigurationException, TransformerConfigurationException, TransformerException, IOException {
    XWPFDocument wordDocument = null;
    try {
        wordDocument = new XWPFDocument(new FileInputStream(file));
    } catch (IOException ex) {
        Exceptions.printStackTrace(ex);
    }
    WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
    org.w3c.dom.Document htmlDocument = wordToHtmlConverter.getDocument();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DOMSource domSource = new DOMSource(htmlDocument);
    StreamResult streamResult = new StreamResult(out);
    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer serializer = tf.newTransformer();
    serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    serializer.setOutputProperty(OutputKeys.INDENT, "yes");
    serializer.setOutputProperty(OutputKeys.METHOD, "html");
    serializer.transform(domSource, streamResult);
    out.close();
    String result = new String(out.toByteArray());
    acDocTextArea.setText(newDocText);
    String htmlText = result;
}
Any ideas as to why this isn't working would be much appreciated. The ByteArrayOutputStream should contain the entire HTML, but it is empty and has no text.

Mark, you're using the HWPF package, which supports only the .doc format; see this description. WordToHtmlConverter belongs to HWPF, and the posted code never passes the XWPFDocument to it, so the output DOM stays empty. The description also mentions efforts to provide an interface for .docx files through the XWPF package. However, the project seems to be short of contributors, and users are encouraged to submit extensions. Limited functionality should be available, though, and text extraction should be part of it.
You should also see this question: How to Extract docx (word 2007 above) using apache POI.

I too was stuck at this point.
I have since found a third-party API that converts .docx to HTML, and it works fine:
https://code.google.com/p/xdocreport/wiki/XWPFConverterXHTML

Related

Display XML with stylesheet in JEditorPane

I have an XML file, which uses an XSS and XSL stored in the folder to display the XML in a proper format.
When I use the following code:
JEditorPane editor = new JEditorPane();
editor.setBounds(114, 65, 262, 186);
frame.getContentPane().add(editor);
editor.setContentType( "html" );
File file=new File("c:/r/testResult.xml");
editor.setPage(file.toURI().toURL());
All I can see is the text part of the XML, without any styling. What should I do to make it display with the style sheet?

The JEditorPane does not automatically process XSLT style-sheets. You must perform the transformation yourself:
try (InputStream xslt = getClass().getResourceAsStream("StyleSheet.xslt");
     InputStream xml = getClass().getResourceAsStream("Document.xml")) {
    DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document doc = db.parse(xml);
    StringWriter output = new StringWriter();
    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer transformer = tf.newTransformer(new StreamSource(xslt));
    transformer.transform(new DOMSource(doc), new StreamResult(output));
    String html = output.toString();
    // JEditorPane doesn't like the META tag...
    html = html.replace("<META http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">", "");
    editor.setContentType("text/html; charset=UTF-8");
    editor.setText(html);
} catch (IOException | ParserConfigurationException | SAXException | TransformerException e) {
    editor.setText("Unable to format document due to:\n\t" + e);
}
editor.setCaretPosition(0);
Use an appropriate InputStream or StreamSource for your particular xslt and xml documents.
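As a rough illustration of the last point, here is a minimal sketch that feeds the transformer from in-memory strings through StreamSource instead of classpath resources. The class name XsltToHtml, the sample XML, and the stylesheet are all made up for this example; they are not taken from the question.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltToHtml {
    // Hypothetical stylesheet: renders each <test> element as one HTML paragraph.
    public static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"html\"/>"
        + "<xsl:template match=\"/\"><html><body>"
        + "<xsl:for-each select=\"result/test\">"
        + "<p><xsl:value-of select=\"@name\"/>: <xsl:value-of select=\"@status\"/></p>"
        + "</xsl:for-each>"
        + "</body></html></xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the given XSLT to the given XML and returns the result as a String.
    public static String transform(String xml, String xslt) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter output = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(output));
        return output.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // Made-up sample input standing in for testResult.xml.
        String xml = "<result><test name=\"login\" status=\"pass\"/></result>";
        System.out.println(transform(xml, STYLESHEET));
    }
}
```

The returned string can then be handed to editor.setText(html) as in the answer above. For files on disk, new StreamSource(new File(...)) should work the same way.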

How to perform mail merge functionality in java using dot/dotx and doc/docx format document template

I want to perform mail merge functionality in Java using dot/dotx and doc/docx format documents. I tried using docx4j, but it removes much of the rich text indentation from the documents.
I also tried extracting some of the HTML content from the Word document, but I was not able to paste it back into a Word document.
public static void readDocxFile1(String fileName) {
    // this.file = file;
    try {
        File file = new File(fileName);
        FileInputStream finStream = new FileInputStream(file.getAbsolutePath());
        HWPFDocument doc = new HWPFDocument(finStream);
        WordExtractor wordExtract = new WordExtractor(doc);
        Document newDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDocument);
        wordToHtmlConverter.processDocument(doc);
        StringWriter stringWriter = new StringWriter();
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.transform(new DOMSource(wordToHtmlConverter.getDocument()), new StreamResult(stringWriter));
        String html = stringWriter.toString();
        System.out.println("html>>>>>>" + html);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
My requirement is to (1) read a dot/dotx or doc/docx template, then, looping over a number of people, (2) replace the keywords and (3) paste the result into a new document.
Please suggest how I can implement this feature.
Also, please let me know whether the Aspose.Words API for Java will do this for me.

Yes, you can meet these requirements using the Aspose.Words for Java API. I suggest referring to the following sections of the documentation:
Loading, Saving and Converting
Mail Merge and Reporting
Find and Replace Overview
I work with Aspose as a Developer Evangelist.

How to reduce the file size of a xml file that is created using java?

I have to convert a text file with coordinates into an XML file. But the point of converting the text file into an XML file is to make the file smaller. How can I reduce the size of my file?
public void writeXML() throws Exception
{
    ArrayList<Frame> frameList = new ArrayList<Frame>();
    frameList = readFile();
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder;
    try
    {
        dBuilder = dbFactory.newDocumentBuilder();
        Document doc = dBuilder.newDocument();
        // append stuff
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();
        //transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        DOMSource source = new DOMSource(doc);
        StreamResult console = new StreamResult(System.out);
        StreamResult file = new StreamResult(new File("file.xml"));
        //transformer.transform(source, console);
        transformer.transform(source, file);
        System.out.println("DONE");
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}

"But the point of converting the text file into a xml file is so that the file size to be smaller."
That is probably not achievable. XML is less dense than a typical text representation because properly designed XML adds a significant amount of "markup" to the file.
A file consisting of coordinates in an appropriately designed text form (e.g. CSV) will take less space than the same coordinates expressed in XML.
If you want "denser" files than a custom text format:
consider compressing the text file, or
consider using a binary representation instead of text.
If you are fixed on the idea of using XML, then the best way to reduce file size will be to compress it. Given that XML has a lot of redundancy in it (e.g. the markup), you should get significant compression.
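To make the size argument concrete, here is a small sketch that builds the same coordinates as CSV and as XML, then gzips the XML with java.util.zip.GZIPOutputStream. The element and attribute names (frame, p, x, y) and the class name XmlSizeDemo are invented for this illustration; the question does not show its actual layout.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class XmlSizeDemo {
    // Plain text form: one "x,y" pair per line.
    public static String buildCsv(int points) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < points; i++) {
            sb.append(i).append(',').append(i * 2).append('\n');
        }
        return sb.toString();
    }

    // Hypothetical XML form: one <p> element per coordinate pair.
    public static String buildXml(int points) {
        StringBuilder sb = new StringBuilder("<frame>");
        for (int i = 0; i < points; i++) {
            sb.append("<p x=\"").append(i).append("\" y=\"").append(i * 2).append("\"/>");
        }
        return sb.append("</frame>").toString();
    }

    // Compresses a byte array with gzip and returns the compressed bytes.
    public static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(data);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] csv = buildCsv(1000).getBytes(StandardCharsets.UTF_8);
        byte[] xml = buildXml(1000).getBytes(StandardCharsets.UTF_8);
        System.out.println("CSV bytes:         " + csv.length);
        System.out.println("XML bytes:         " + xml.length);
        System.out.println("Gzipped XML bytes: " + gzip(xml).length);
    }
}
```

To write a compressed file directly, the same GZIPOutputStream can wrap a FileOutputStream and be passed to StreamResult in place of the File.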

How to use Tika's XWPFWordExtractorDecorator class?

Someone told me that Tika's XWPFWordExtractorDecorator class is used to convert docx into HTML, but I am not sure how to use this class to get the HTML from a docx file. Suggestions for any other library that does the same job are also appreciated.

You shouldn't use it directly.
Instead, call Tika in the usual way, and it will call the appropriate code for you.
If you want XHTML from parsing a file, the code looks something like this:
// Either of these will work, the latter is recommended
//InputStream input = new FileInputStream("test.docx");
InputStream input = TikaInputStream.get(new File("test.docx"));
// AutoDetect is normally best, unless you know the best parser for the type
Parser parser = new AutoDetectParser();
// Handler for indented XHTML
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory)
        SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
handler.setResult(new StreamResult(sw));
// Call the Tika Parser
try {
    Metadata metadata = new Metadata();
    parser.parse(input, handler, metadata, new ParseContext());
    String xml = sw.toString();
} finally {
    input.close();
}

How to save parsed and changed DOM document in xml file?

I have an XML file. I need to read it, make some changes, and write the changed version to a new destination.
I managed to read, parse and patch this file (with DocumentBuilderFactory, DocumentBuilder, Document and so on).
But I cannot find a way to save that file. Is there a way to get its plain text view (as a String), or is there a better way?

Something like this works:
Transformer transformer = TransformerFactory.newInstance().newTransformer();
Result output = new StreamResult(new File("output.xml"));
Source input = new DOMSource(myDocument);
transformer.transform(input, output);
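Since the question also asks for the document as a String, the same identity Transformer can write into a StringWriter instead of a File. The following is a minimal sketch; the class name DomToString and the sample config element are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomToString {
    // Serializes a DOM document to a String using the identity transformer.
    public static String toXmlString(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        // Build a tiny document in memory to stand in for the parsed and patched one.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("config");
        root.setAttribute("version", "2");
        doc.appendChild(root);
        System.out.println(toXmlString(doc));
    }
}
```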

This will work, provided you're using Xerces-J:
// Returns the serialized document as a String (the original declared void but returned a value).
public String serialise(org.w3c.dom.Document document) throws java.io.IOException {
    java.io.ByteArrayOutputStream data = new java.io.ByteArrayOutputStream();
    java.io.PrintStream ps = new java.io.PrintStream(data);
    org.apache.xml.serialize.OutputFormat of =
            new org.apache.xml.serialize.OutputFormat("XML", "ISO-8859-1", true);
    of.setIndent(1);
    of.setIndenting(true);
    org.apache.xml.serialize.XMLSerializer serializer =
            new org.apache.xml.serialize.XMLSerializer(ps, of);
    // As a DOM Serializer
    serializer.asDOMSerializer();
    serializer.serialize(document);
    // Decode with the same encoding that was passed to OutputFormat
    return data.toString("ISO-8859-1");
}

This lets you define the XML output format:
new XMLWriter(new FileOutputStream(fileName),
        new OutputFormat() {{
            setEncoding("UTF-8");
            setIndent(" ");
            setTrimText(false);
            setNewlines(true);
            setPadText(true);
        }}).write(document);
