How can I limit the pages output with WordToHtmlConverter and HWPFDocument?

How can I limit the pages output with WordToHtmlConverter and HWPFDocument? - java

I'm converting a Word/.doc file to HTML and I'd like to be able to get a subset of pages. Is it possible to limit the range of output? I'm open to creating a new HWPFDocument from the original with only the subset of pages or after converting limit the length there.
File localFile = ...
FileInputStream fis = new FileInputStream(localFile);
HWPFDocument wordDoc = new HWPFDocument(fis);
Document newDoc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDoc);
wordToHtmlConverter.processDocument(wordDoc);
StringWriter stringWriter = new StringWriter();
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
transformer.setOutputProperty(OutputKeys.METHOD, "html");
transformer.transform(
new DOMSource(wordToHtmlConverter.getDocument()),
new StreamResult(stringWriter));
String htmlString = stringWriter.toString();
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream(htmlFile), "UTF-8"));
out.write(htmlString);
out.close();

Not with POI. There is no notion of a page in the HWPF format. Pages are an artifact as the consumer. There are no pages until the consumer renders them, and each client can render pages slightly differently, even between different versions of Word.

Related

Remove SOAP envelope

I have an InputStream containing a SOAP message, including the envelope. I don't know the contents of the body beforehand and therefore cannot create a Jaxb annotated class for it.
I've tried many ways, inlcuding a custom SOAPWrapper JaxB Class with XmlAnyElement and other ways. Currently I have this:
private InputStream removeSoapEnvelope(final InputStream inputStream) throws IOException, TransformerException
{
final SoapBody body = messageFactory.createWebServiceMessage(inputStream)
.getSoapBody();
final Transformer transformer = TransformerFactory.newInstance()
.newTransformer();
final DOMResult domResult = new DOMResult();
transformer.transform(body.getPayloadSource(), domResult);
final StringWriter writer = new StringWriter();
transformer.transform(new DOMSource(domResult.getNode()), new StreamResult(writer));
byte[] barray = writer.toString()
.getBytes(StandardCharsets.UTF_8);
return new ByteArrayInputStream(barray);
}
It seems to work but is horribly inefficient. Is there no short and concise way of achieving this with standard libraries and without regex?
Thanks

Here's a solution using XPath to get the element (pure JaxB? not sure). Takes the document as a regular XML document so it should work for any I guess
FileInputStream fileIS;
fileIS = new FileInputStream(System.getProperty("user.home") + "/tmp/soap.xml");
DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = builderFactory.newDocumentBuilder();
Document xmlDocument;
xmlDocument = builder.parse(fileIS);
XPath xPath = XPathFactory.newInstance().newXPath();
String expression01 = "//*[local-name()='Body']";
Node currentNode = (Node) xPath.compile(expression01).evaluate(xmlDocument, XPathConstants.NODE);
StringWriter buf = new StringWriter();
Transformer xform = TransformerFactory.newInstance().newTransformer();
xform.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
xform.setOutputProperty(OutputKeys.INDENT, "yes");
xform.transform(new DOMSource(currentNode), new StreamResult(buf));
System.out.println(buf.toString());
Result:
<soap:Body>
<incident xmlns="http://example.com">
<Company type="String">Test</Company>
</incident>
</soap:Body>

I ended up doing it with regex. All other options are too slow:
private InputStream removeSoapEnvelope(final InputStream inputStream) throws IOException
{
final String text = new String(inputStream.readAllBytes(), UTF_8);
final String replace = text.replaceAll("\\s*<\\/?(?:SOAP-ENV|soap):(?:.|\\s)*?>", "");
File file = File.createTempFile("temp", XML_NS_PREFIX);
Files.writeString(file.toPath(), replace);
return new FileInputStream(file);
}

creating space in end of xml tag in java

My xml tag is :
<Description/>
I want with space :
<Description />
How can I do this in Java?
I am signing xml document , in original file space has been used but when I used following code and print it, it printing without space.
String thisLine = "";
String xmlString = "";
BufferedReader br = new BufferedReader(new FileReader(originalXmlFilePath));
while ((thisLine = br.readLine()) != null) {
xmlString = xmlString + thisLine.trim();
}
br.close();
ByteArrayInputStream xmlStream = new ByteArrayInputStream(xmlString.getBytes());
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setIgnoringElementContentWhitespace(true);
dbf.setValidating(false);
Document doc = dbf.newDocumentBuilder().parse
(xmlStream );
doc.setXmlStandalone(true);
DOMSignContext dsc = new DOMSignContext
(keyEntry.getPrivateKey(), doc.getDocumentElement());
javax.xml.crypto.dsig.XMLSignature signature = fac.newXMLSignature(si, ki);
signature.sign(dsc);
// Output the resulting document.
// OutputStream os = new FileOutputStream(new File(destnSignedXmlFilePath));
TransformerFactory tf = TransformerFactory.newInstance();
Transformer trans = tf.newTransformer();
trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
trans.setOutputProperty(OutputKeys.INDENT, "yes");
StringWriter writer = new StringWriter();
trans.transform(new DOMSource(doc), new StreamResult(writer));
String output = writer.getBuffer().toString();//.replaceAll("\n|\r", "");
System.out.println("output== "+output);

What you are doing wrong is signing an arbitrary unprocessed text instead of submitting a canonical version of your document (without spaces in tags, but also with sorted attributes, with quotes of the same type, etc.) to the digital signature computation.
The Canonical XML and Exclusive Canonical XML W3C recommendations specify a standard and comprehensive way to eliminate arbitrary differences.

apache-poi WordToHtmlConverter not converting images properly

I am using apache.poi to transform a word file into html. My document has text and images - the text can be converted to html just fine, but images are not converted. Is there any way to convert images as well?
HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(new FileInputStream("my_document_path"));
WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
wordToHtmlConverter.processDocument(wordDocument);
org.w3c.dom.Document htmlDocument = wordToHtmlConverter.getDocument();
ByteArrayOutputStream out = new ByteArrayOutputStream();
DOMSource domSource = new DOMSource(htmlDocument);
StreamResult streamResult = new StreamResult(out);
TransformerFactory tf = TransformerFactory.newInstance();
Transformer serializer = tf.newTransformer();
serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
serializer.setOutputProperty(OutputKeys.MEDIA_TYPE,"text/image" );
serializer.setOutputProperty(OutputKeys.INDENT, "yes");
serializer.setOutputProperty(OutputKeys.METHOD, "html");
serializer.transform(domSource, streamResult);
out.close();
String result = new String(out.toByteArray());
File filename = new File("stored_html_path");
FileWriter fw = new FileWriter(filename); //the true will append the new data
fw.write(result);//appends the string to the file
fw.close();

Transformer not reading Special Character from Document Object

I am trying to read xml data from Document Object, and then using transformer to render the data inside the document object to pdf,using XSL,
My code is :
Document doc = toXML(arg1,arg2);
doc contains data like :
İlkyönetmeliği
with in tags
InputStream inputStream = new FileInputStream(xslFilePath);
transformer = factory.newTransformer(new StreamSource(inputStream));
transformer.setParameter("encoding", "UTF-8");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(new DOMSource(doc.getDocumentElement()), res);
Special characters present in xml are not getting rendered accordingly and displaying like
#lk yard#m.
I have also set encoding to UTF-8 ,but still it is displaying like above.

It is not clear what causes your encoding problem because I cannot see how your document is read/constructed and how your transformation result res is set up. Try the following standalone example code which handles encoding with XSLT. Maybe you can even modify it gradually to use your actual data in order to see what goes wrong.
public static void main(String[] args) {
try {
String inputEncoding = "UTF-16";
String xsltEncoding = "ASCII";
String outputEncoding = "UTF-8";
ByteArrayOutputStream bos = new ByteArrayOutputStream();
OutputStreamWriter osw = new OutputStreamWriter(bos, inputEncoding);
osw.write("<?xml version='1.0' encoding='" + inputEncoding + "'?>");
osw.write("<root>İlkyönetmeliği</root>"); osw.close();
byte[] inputBytes = bos.toByteArray();
bos.reset();
osw = new OutputStreamWriter(bos, xsltEncoding);
osw.write("<?xml version='1.0' encoding='" + xsltEncoding + "'?>");
osw.write("<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>");
osw.write("<xsl:template match='#*|node()'><xsl:copy><xsl:apply-templates select='#*|node()'/></xsl:copy></xsl:template>");
osw.write("</xsl:stylesheet>"); osw.close();
byte[] xsltBytes = bos.toByteArray();
bos.reset();
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document d = db.parse(new InputSource(new InputStreamReader(new ByteArrayInputStream(inputBytes), inputEncoding)));
// if encoding declaration correct, use: Document d = db.parse(new InputSource(new ByteArrayInputStream(inputBytes)));
System.out.println(XPathFactory.newInstance().newXPath().evaluate("/root[1]", d));
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer(new StreamSource(new InputStreamReader(new ByteArrayInputStream(xsltBytes), xsltEncoding)));
// if encoding declaration correct, use: Transformer t = tf.newTransformer(new StreamSource(new ByteArrayInputStream(xsltBytes)));
StreamResult sr = new StreamResult(new OutputStreamWriter(bos, outputEncoding));
t.setOutputProperty(OutputKeys.ENCODING, outputEncoding);
t.transform(new DOMSource(d.getDocumentElement()), sr);
byte[] outputBytes = bos.toByteArray();
Scanner s = new Scanner(new InputStreamReader(new ByteArrayInputStream(outputBytes), outputEncoding));
String output = s.useDelimiter("</>").next(); // read all
s.close();
System.out.println(output);
} catch (Exception ex) {
ex.printStackTrace(System.err);
}
The example code applies the XSLT identity template to a minimal input containing the non-ASCII characters.
I output the string to check if it has been parsed correctly in the document using XPath. You may want to check your (intermediate) document if you know how to locate it with XPath.
Note that, if present, the parser tries to pick up the encoding declared in the XML processing instruction (PI) by default when reading an XML file. It assumes that actual and declared encoding are the same. If they differ or the PI is missing, then you can enforce the actual encoding e.g. by using an InputStreamReader as in the code above.

How to force javax xslt transformer to encode national characters using utf-8 and not html entities?

I'm working on filter that should transform an output with some stylesheet. Important sections of code looks like this:
PrintWriter out = response.getWriter();
...
StringReader sr = new StringReader(content);
Source xmlSource = new StreamSource(sr, requestSystemId);
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setParameter("encoding", "UTF-8");
//same result when using ByteArrayOutputStream xo = new java.io.ByteArrayOutputStream();
StringWriter xo = new StringWriter();
StreamResult result = new StreamResult(xo);
transformer.transform(xmlSource, result);
out.write(xo.toString());
The problem is that national characters are encoded as html entities and not by using UTF. Is there any way to force transformer to use UTF-8 instead of entities?

You need to set the output method to text instead of (default) xml.
transformer.setOutputProperty(OutputKeys.METHOD, "text");
You should however also set the response encoding beforehand:
response.setCharacterEncoding("UTF-8");
And instruct the webbrowser to use the same encoding:
response.setContentType("text/html;charset=UTF-8");

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How can I limit the pages output with WordToHtmlConverter and HWPFDocument? - java

Not with POI. There is no notion of a page in the HWPF format. Pages are an artifact as the consumer. There are no pages until the consumer renders them, and each client can render pages slightly differently, even between different versions of Word.

Related

Remove SOAP envelope

creating space in end of xml tag in java

apache-poi WordToHtmlConverter not converting images properly

Transformer not reading Special Character from Document Object

How to force javax xslt transformer to encode national characters using utf-8 and not html entities?

Categories

Resources