How can I convert doc or docx files into HTML in Java? Using Apache POI, I was able to convert doc to HTML, but I am unable to convert docx. Could you show me sample code? The code below works with doc but not with docx.
HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(stream);
WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
wordToHtmlConverter.processDocument(wordDocument);
Document htmlDocument = wordToHtmlConverter.getDocument();
ByteArrayOutputStream out = new ByteArrayOutputStream();
DOMSource domSource = new DOMSource(htmlDocument);
StreamResult streamResult = new StreamResult(out);
TransformerFactory tf = TransformerFactory.newInstance();
Transformer serializer = tf.newTransformer();
serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
serializer.setOutputProperty(OutputKeys.INDENT, "yes");
serializer.setOutputProperty(OutputKeys.METHOD, "html");
serializer.transform(domSource, streamResult);
out.close();
// Decode with the same charset the serializer was told to write (UTF-8)
String result = new String(out.toByteArray(), StandardCharsets.UTF_8);
There is no reason why this shouldn't / can't work.
Please review the following:
How to extract plain text from a DOCX file using the new OOXML support in Apache POI 3.5?
https://stackoverflow.com/a/5507019/751158
In short, make sure you're using an up-to-date version of POI, and have all of the required libraries.
(If you need additional assistance, please explain what isn't working. Are you getting compile-time errors? Run-time errors? Unexpected output?)
I have an XML file held in Java as an org.w3c.dom.Document doc, and I want to convert it into a File file. How can I convert the type Document to File?
Thanks.
I want to add metadata elements to an existing XML file (standard DITA) of type File.
I know a way to add elements to the file, but for that I have to convert the file to an org.w3c.dom.Document. I did that with the method loadXML:
private Document loadXML(File f) throws Exception {
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    return builder.parse(f);
}
After that I change the org.w3c.dom.Document. Then I want to continue with the flow of the program, so I have to convert the Document doc back to a File file.
What is an efficient way to do that? Or is there a better way to add some elements to an XML File without converting it?
You can use a Transformer to write the entire XML content to a File, as shown below:
Document doc =...
// write the content into xml file
DOMSource source = new DOMSource(doc);
FileWriter writer = new FileWriter(new File("/tmp/output.xml"));
StreamResult result = new StreamResult(writer);
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
transformer.transform(source, result);
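For reference, here is a self-contained sketch of the same idea. The class name DomToFile, the helper write and the sample element are hypothetical, made up for illustration. It hands the transformer an OutputStream rather than a FileWriter, on the assumption that you want the bytes on disk to match the declared encoding; a FileWriter encodes with the JVM's default charset, which is not necessarily UTF-8.

```java
import java.io.File;
import java.io.OutputStream;
import java.nio.file.Files;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, not part of any library.
final class DomToFile {

    // Serializes the DOM tree into the target file as UTF-8.
    static void write(Document doc, File target) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        // The stream is closed only after transform() has returned.
        try (OutputStream out = Files.newOutputStream(target.toPath())) {
            transformer.transform(new DOMSource(doc), new StreamResult(out));
        }
    }

    public static void main(String[] args) throws Exception {
        // Build a tiny document in memory (illustrative content only).
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("topic");
        root.setAttribute("id", "demo");
        doc.appendChild(root);

        File target = File.createTempFile("output", ".xml");
        write(doc, target);
        System.out.println("Wrote " + target.getAbsolutePath());
    }
}
```

With the loadXML method from the question, the flow would be: load the File into a Document, change the Document, then call write(doc, file) to get back to a File.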
With JDK 1.8.0, a short way is to use the built-in XMLSerializer (which was introduced with JDK 1.4 as a fork of Apache Xerces):
import com.sun.org.apache.xml.internal.serialize.XMLSerializer;
Document doc = //use your method loadXML(File f)
//change Document
java.io.Writer writer = new java.io.FileWriter("MyOutput.xml");
XMLSerializer xml = new XMLSerializer(writer, null);
xml.serialize(doc);
Use an object of type OutputFormat to configure output, for example like this:
OutputFormat format = new OutputFormat(Method.XML, StandardCharsets.UTF_8.toString(), true);
format.setIndent(4);
format.setLineWidth(80);
format.setPreserveEmptyAttributes(true);
format.setPreserveSpace(true);
XMLSerializer xml = new XMLSerializer(writer, format);
Note that these classes come from an internal com.sun.* package, which is undocumented and therefore generally not seen as the preferred way of doing things. However, javax.xml.transform.OutputKeys does not let you specify the amount of indentation or the line width, for example. So, if that is important, this solution should help.
I am trying to read an XML file using the FileInputStreamReader class, but some problems occur when I try to read large XML files. Which Java class is more suitable for reading large XML files, and which parsers are the most efficient for parsing XML files?
I think that DOM parsing is a good way to parse XML files:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = factory.newDocumentBuilder();
Document doc = docBuilder.parse(yourFile);
This parses your XML document; you can then modify the nodes and change whatever you want to change.
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
DOMSource source = new DOMSource(yourDoc);
StreamResult result = new StreamResult(yourFile);
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
transformer.transform(source, result);
This second part saves all your changes. The setOutputProperty calls are not mandatory; they are only used to give the XML file a nice indentation.
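One caveat for the large files mentioned in the question: DOM keeps the whole document in memory. A streaming parser is a commonly used alternative when you only need to read through the file. The sketch below uses StAX (javax.xml.stream, shipped with the JDK), which is a different technique from the DOM answer above. The class name StaxCount, the method countElements and the element names are hypothetical, chosen just for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of any library.
final class StaxCount {

    // Counts start tags with the given local name while reading the stream
    // event by event, without building a tree of the whole document.
    static int countElements(InputStream in, String localName) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        int count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && localName.equals(reader.getLocalName())) {
                    count++;
                }
            }
        } finally {
            // Frees parser resources; the caller still owns the InputStream.
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Small in-memory sample; for a real file, pass a FileInputStream instead.
        String xml = "<items><item/><item/><other/></items>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(countElements(in, "item"));
    }
}
```

StAX is read-forward only, so if you need to modify nodes and save the result, as in the answer above, DOM may still be the more convenient choice for files that fit in memory.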
I am facing an issue while trying to convert XML to HTML using XSLT.
The following is the dummy code I am using for the transformation:
TransformerFactory tFactory = TransformerFactory.newInstance();
Source xslDoc = new StreamSource( xsltPath );
Source xmlDoc = new StreamSource( xmlPath );
oFileOutputStream=new FileOutputStream( htmlOutputPath );
htmlFile = oFileOutputStream;
Transformer transformer = tFactory.newTransformer( xslDoc );
transformer.transform( xmlDoc, new StreamResult( htmlFile ) );
I am getting the following error:
ERROR: 'XML document structures must start and end within the same entity.'
ERROR: 'com.sun.org.apache.xml.internal.utils.WrappedRuntimeException: XML document structures must start and end within the same entity.'
Any ideas?
It seems that you have closed a stream while the transformer is still doing its work. Please check that you have not closed any stream resources before transform() returns.
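One way to rule that out is to open every stream in a try-with-resources block and call transform() inside it, so nothing is closed before the transformer has finished. This is only a sketch under that assumption. The class name XsltToHtml, the method transform and the sample stylesheet are hypothetical, and whether it fixes your error depends on code the question does not show.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, not part of any library.
final class XsltToHtml {

    // Applies the stylesheet to the XML file and writes the result to htmlPath.
    static void transform(Path xmlPath, Path xsltPath, Path htmlPath) throws Exception {
        try (InputStream xsl = Files.newInputStream(xsltPath);
             InputStream xml = Files.newInputStream(xmlPath);
             OutputStream html = Files.newOutputStream(htmlPath)) {
            Transformer transformer =
                    TransformerFactory.newInstance().newTransformer(new StreamSource(xsl));
            transformer.transform(new StreamSource(xml), new StreamResult(html));
        } // all three streams are closed here, after transform() has returned
    }

    public static void main(String[] args) throws Exception {
        // Illustrative input files written to a temp location.
        String xslt = "<xsl:stylesheet version=\"1.0\""
                + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
                + "<xsl:output method=\"html\"/>"
                + "<xsl:template match=\"/greeting\">"
                + "<html><body><p><xsl:value-of select=\".\"/></p></body></html>"
                + "</xsl:template></xsl:stylesheet>";
        Path xsltPath = Files.createTempFile("demo", ".xsl");
        Path xmlPath = Files.createTempFile("demo", ".xml");
        Path htmlPath = Files.createTempFile("demo", ".html");
        Files.write(xsltPath, xslt.getBytes(StandardCharsets.UTF_8));
        Files.write(xmlPath, "<greeting>Hello</greeting>".getBytes(StandardCharsets.UTF_8));

        transform(xmlPath, xsltPath, htmlPath);
        System.out.println(new String(Files.readAllBytes(htmlPath), StandardCharsets.UTF_8));
    }
}
```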
I convert an org.w3c.dom.Element to a String in this way:
StringWriter writer = new StringWriter();
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(new DOMSource(node), new StreamResult(writer));
String result = writer.toString();
But when I use it later I get an exception: io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence, which points to a wrong encoding.
In fact, it's completely unnecessary. I needed it to serialize nodes and later export them to different formats (xml, html, test). I found out that it's better to share org.w3c.dom Documents: from a Document you can get any information you need.
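If you do need the String form, one possible cause of that exception (an assumption, since the code that reads the string back is not shown) is that the string gets turned into bytes with a charset other than UTF-8 while the XML declaration still says UTF-8. A sketch that avoids the byte step by parsing from a Reader follows. The class name NodeText and both method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
final class NodeText {

    // Serializes a node to a String, as in the question.
    static String toXml(Node node) throws Exception {
        StringWriter writer = new StringWriter();
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.transform(new DOMSource(node), new StreamResult(writer));
        return writer.toString();
    }

    // Parses the String back from characters, so no byte encoding is involved.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element note = doc.createElement("note");
        note.setTextContent("caf\u00e9"); // non-ASCII text to exercise the round trip
        doc.appendChild(note);

        Document back = parse(toXml(note));
        System.out.println(back.getDocumentElement().getTextContent());
    }
}
```

If the consumer insists on bytes, converting with xml.getBytes(StandardCharsets.UTF_8) keeps them consistent with the declared encoding.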
I need to transform a DOMSource into a StreamSource, because a third-party library only accepts stream sources for SOAP.
Performance is not so much of an issue in this case, so I came up with this horribly verbose set of commands:
DOMSource src = new DOMSource(document);
TransformerFactory factory = TransformerFactory.newInstance();
Transformer transformer = factory.newTransformer();
StreamResult result = new StreamResult();
ByteArrayOutputStream out = new ByteArrayOutputStream();
result.setOutputStream(out);
transformer.transform(src, result);
ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray());
StreamSource streamSource = new StreamSource(in);
Isn't there a simpler way to do this?
This is as good a way as any. Because your third party library only accepts XML in lexical form, you have no alternative but to serialize the DOM so that the external library can re-parse it. Stupid design - tell them so.
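If the verbosity is the main annoyance, the same serialize-then-reparse steps can at least be tucked into a small helper. This is just the code from the question wrapped in a method, not a different technique. The class name Sources and the method toStreamSource are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

// Hypothetical helper class, not part of any library.
final class Sources {

    // Serializes the DOM into memory and exposes the bytes as a StreamSource.
    static StreamSource toStreamSource(Document document) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(document), new StreamResult(out));
        return new StreamSource(new ByteArrayInputStream(out.toByteArray()));
    }

    public static void main(String[] args) throws Exception {
        // Illustrative document; a real SOAP message would be built elsewhere.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        doc.appendChild(doc.createElement("Envelope"));

        StreamSource source = toStreamSource(doc);
        System.out.println(source.getInputStream().available() + " bytes ready");
    }
}
```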