When we parse the node content and convert it to a string, the generated string looks like this:
&lt;rootNode&gt;
......
&lt;/rootNode&gt;
When we add that string to an XML document again using a JDOM Element, we expect it to be appended as shown below. Instead, we get the escaped form shown above, with the entities left unconverted.
<rootNode>
.....
</rootNode>
We have tried StringUtils and XMLEscapeUtils, but we are not getting the expected result. Can someone point me in the right direction?
Edit
Adding code from OP's comment:
String inputStr = "<rootnode></rootnode>";
org.jdom.Element e = new org.jdom.Element("parentnode");
e.addContent(inputStr);
Problem: JDOM's Element.addContent(String) is adding your inputStr as an unparsed string.
Solution: Instead, you need to parse the string into an element, then add it to e. You'll have to read it into its own document, then detach and move it over. Here's a sketch:
import org.jdom.Element;
import org.jdom.input.SAXBuilder;
import org.jdom.Document;
import java.io.StringReader;
....
String inputStr = "<rootnode></rootnode>";
Element e = new Element("parentnode");
StringReader stringReader = new StringReader(inputStr);
SAXBuilder builder = new SAXBuilder();
Document doc = builder.build(stringReader);
Element rootE = doc.getRootElement();
e.addContent(rootE.detach()); // <== Add an Element rather than a String
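For readers without JDOM on the classpath, the same distinction (markup added as text versus markup parsed into an element first) can be reproduced with the JDK's built-in DOM API. This is only a sketch of the idea, not JDOM code; the class and method names are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, JDK only.
class TextVersusElementDemo {

    // Serializes a node without the XML declaration.
    static String serialize(Node node) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    // Adds the markup as a text node: the serializer has to escape it.
    static String addAsText(String markup) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = db.newDocument();
        Element parent = doc.createElement("parentnode");
        doc.appendChild(parent);
        parent.appendChild(doc.createTextNode(markup));
        return serialize(parent);
    }

    // Parses the markup first, then imports the resulting element.
    static String addAsElement(String markup) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = db.newDocument();
        Element parent = doc.createElement("parentnode");
        doc.appendChild(parent);
        Document parsed = db.parse(new InputSource(new StringReader(markup)));
        // importNode copies the element into the target document.
        parent.appendChild(doc.importNode(parsed.getDocumentElement(), true));
        return serialize(parent);
    }

    public static void main(String[] args) throws Exception {
        String inputStr = "<rootnode></rootnode>";
        System.out.println(addAsText(inputStr));    // escaped markup inside parentnode
        System.out.println(addAsElement(inputStr)); // a real child element
    }
}
```

The first call should print the escaped form from the question, the second a nested rootnode element.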
Related
We are writing Java code that reads a Word document (.docx) into our program using Apache POI.
We are stuck when we encounter formulas and chemical equations inside the document.
We have managed to read the formulas, but we have no idea how to locate their position in the surrounding string.
INPUT (format is *.docx)
text before formulae **CHEMICAL EQUATION** text after
OUTPUT (format shall be HTML) we designed
text before formulae text after **CHEMICAL EQUATION**
We are unable to fetch the string and reconstruct it in its original form.
Question
Is there any way to locate the position of the image and formulae within the stripped line, so that they can be restored to their original place when the string is reconstructed, rather than being appended at the end of the string?
If the needed format is HTML, then Word text content together with Office MathML equations can be read the following way.
In "Reading equations & formula from Word (Docx) to html and save database using java" I have provided an example which gets all Office MathML equations out of a Word document into HTML. It uses paragraph.getCTP().getOMathList() and paragraph.getCTP().getOMathParaList() to get the OMath elements from the paragraph. This takes the OMath elements out of the text context.
If one wants to get those OMath elements in context with the other elements in the paragraph, then an org.apache.xmlbeans.XmlCursor is needed to loop over all the different XML elements in the paragraph. The following example uses the XmlCursor to get text runs together with OMath elements from the paragraph.
The transformation from Office MathML into MathML is done using the same XSLT approach as in "Reading equations & formula from Word (Docx) to html and save database using java". That answer also describes where the OMML2MML.XSL comes from.
The file Formula.docx looks like:
Code:
import java.io.*;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMathPara;
import org.apache.xmlbeans.XmlCursor;
import org.w3c.dom.Node;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;
import java.awt.Desktop;
import java.util.List;
import java.util.ArrayList;
/*
needs the full ooxml-schemas-1.4.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025
*/
public class WordReadTextWithFormulasAsHTML {
static File stylesheet = new File("OMML2MML.XSL");
static TransformerFactory tFactory = TransformerFactory.newInstance();
static StreamSource stylesource = new StreamSource(stylesheet);
//method for getting MathML from oMath
static String getMathML(CTOMath ctomath) throws Exception {
Transformer transformer = tFactory.newTransformer(stylesource);
Node node = ctomath.getDomNode();
DOMSource source = new DOMSource(node);
StringWriter stringwriter = new StringWriter();
StreamResult result = new StreamResult(stringwriter);
transformer.setOutputProperty("omit-xml-declaration", "yes");
transformer.transform(source, result);
String mathML = stringwriter.toString();
stringwriter.close();
//The native OMML2MML.XSL transforms OMML into MathML as XML with special namespaces.
//We don't need these since we want to use the MathML in HTML, not in XML.
//So ideally we should change the OMML2MML.XSL so that it does not emit them.
//But to keep this example as simple as possible, we use replace to get rid of the XML specifics.
mathML = mathML.replaceAll("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "");
mathML = mathML.replaceAll("xmlns:mml", "xmlns");
mathML = mathML.replaceAll("mml:", "");
return mathML;
}
//method for getting HTML including MathML from XWPFParagraph
static String getTextAndFormulas(XWPFParagraph paragraph) throws Exception {
StringBuffer textWithFormulas = new StringBuffer();
//using a cursor to go through the paragraph from top to down
XmlCursor xmlcursor = paragraph.getCTP().newCursor();
while (xmlcursor.hasNextToken()) {
XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
if (tokentype.isStart()) {
if (xmlcursor.getName().getPrefix().equalsIgnoreCase("w") && xmlcursor.getName().getLocalPart().equalsIgnoreCase("r")) {
//elements w:r are text runs within the paragraph
//simply append the text data
textWithFormulas.append(xmlcursor.getTextValue());
} else if (xmlcursor.getName().getLocalPart().equalsIgnoreCase("oMath")) {
//we have oMath
//append the oMath as MathML
textWithFormulas.append(getMathML((CTOMath)xmlcursor.getObject()));
}
} else if (tokentype.isEnd()) {
//we have to check whether we are at the end of the paragraph
xmlcursor.push();
xmlcursor.toParent();
if (xmlcursor.getName().getLocalPart().equalsIgnoreCase("p")) {
break;
}
xmlcursor.pop();
}
}
return textWithFormulas.toString();
}
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument(new FileInputStream("Formula.docx"));
//using a StringBuffer for appending all the content as HTML
StringBuffer allHTML = new StringBuffer();
//loop over all IBodyElements - should be self explained
for (IBodyElement ibodyelement : document.getBodyElements()) {
if (ibodyelement.getElementType().equals(BodyElementType.PARAGRAPH)) {
XWPFParagraph paragraph = (XWPFParagraph)ibodyelement;
allHTML.append("<p>");
allHTML.append(getTextAndFormulas(paragraph));
allHTML.append("</p>");
} else if (ibodyelement.getElementType().equals(BodyElementType.TABLE)) {
XWPFTable table = (XWPFTable)ibodyelement;
allHTML.append("<table border=1>");
for (XWPFTableRow row : table.getRows()) {
allHTML.append("<tr>");
for (XWPFTableCell cell : row.getTableCells()) {
allHTML.append("<td>");
for (XWPFParagraph paragraph : cell.getParagraphs()) {
allHTML.append("<p>");
allHTML.append(getTextAndFormulas(paragraph));
allHTML.append("</p>");
}
allHTML.append("</td>");
}
allHTML.append("</tr>");
}
allHTML.append("</table>");
}
}
document.close();
//creating a sample HTML file
String encoding = "UTF-8";
FileOutputStream fos = new FileOutputStream("result.html");
OutputStreamWriter writer = new OutputStreamWriter(fos, encoding);
writer.write("<!DOCTYPE html>\n");
writer.write("<html lang=\"en\">");
writer.write("<head>");
writer.write("<meta charset=\"utf-8\"/>");
//using MathJax for helping all browsers to interpret MathML
writer.write("<script type=\"text/javascript\"");
writer.write(" async src=\"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=MML_CHTML\"");
writer.write(">");
writer.write("</script>");
writer.write("</head>");
writer.write("<body>");
writer.write(allHTML.toString());
writer.write("</body>");
writer.write("</html>");
writer.close();
Desktop.getDesktop().browse(new File("result.html").toURI());
}
}
Result:
Just tested this code using Apache POI 5.0.0 and it works. You need poi-ooxml-full-5.0.0.jar for Apache POI 5.0.0. Please read https://poi.apache.org/help/faq.html#faq-N10025 to see which ooxml libraries are needed for which Apache POI version.
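The key idea in getTextAndFormulas, walking the paragraph's children in document order so that each formula stays where it occurs, can be seen in isolation with the JDK's DOM API. The sketch below is not Apache POI code: the input is a simplified, namespace-free stand-in for a Word paragraph, and the class name, method name and the [FORMULA:...] marker are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, JDK only.
class InContextTraversalDemo {

    // Visits the paragraph's child elements in document order and
    // emits run text and formula markers in the order they occur.
    static String textWithFormulaMarkers(String paragraphXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(paragraphXml)));
        StringBuilder out = new StringBuilder();
        for (Node child = doc.getDocumentElement().getFirstChild();
                child != null; child = child.getNextSibling()) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            String name = child.getNodeName();
            if (name.equals("r")) {
                // text run: append its text
                out.append(child.getTextContent());
            } else if (name.equals("oMath")) {
                // formula: append a marker at the current position
                out.append("[FORMULA:").append(child.getTextContent()).append("]");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String p = "<p><r><t>text before </t></r><oMath>H2O</oMath><r><t> text after</t></r></p>";
        System.out.println(textWithFormulaMarkers(p)); // text before [FORMULA:H2O] text after
    }
}
```

Because the formula is handled at the point where the traversal reaches it, no separate index has to be stored and restored afterwards.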
XWPFParagraph paragraph;
for (CTOMath ctomath : paragraph.getCTP().getOMathList()) {
formulas=formulas + getMathML(ctomath);
}
With the above code I am able to extract the math formulas from a given paragraph of a docx file.
To display a formula in an HTML page, I convert it to MathML and render it with MathJax on the page. This works.
The problem is: is it possible to get the position of the formula within the given paragraph, so that I can display the formula at its exact location in the paragraph when rendering it as an HTML page?
Let's say I have some XML:
<document>blabla<bold>test<list><item>hello<italics>dfh</italics></item></list></bold>sdfsd</document>
and I now need to get the content of the document element as a string, so I would have
blabla<bold>test<list><item>hello<italics>dfh</italics></item></list></bold>sdfsd
I have been turning this over in my head for a while now, and I haven't been able to figure it out.
I hope to get some pointers on what I have to do.
EDIT:
Just to be clear, let's say I have the XML like this:
SAXBuilder sb = new SAXBuilder();
Document doc = sb.build(new StringReader("<document>blabla<bold>test<list><item>hello<italics>dfh</italics></item></list></bold>sdfsd</document>"));
and I now need to get the content of the document element.
It is very unusual to need to get an inconsistent subset of an XML document like you want. It's much more common to get just the text content: blabla test hello dfh sdfsd
Note that you can get a subset of the content as the "contentlist" of the root element, and then output just that list as a string:
XMLOutputter xout = new XMLOutputter();
String txt = xout.outputString(doc.getRootElement().getContent());
System.out.println(txt);
For me, I wrote the code:
public static void main(String[] args) throws JDOMException, IOException {
SAXBuilder sb = new SAXBuilder();
Document doc = sb.build(new StringReader("<document>blabla<bold>test<list><item>hello<italics>dfh</italics></item></list></bold>sdfsd</document>"));
XMLOutputter xout = new XMLOutputter();
String txt = xout.outputString(doc.getRootElement().getContent());
System.out.println(txt);
}
and it output:
blabla<bold>test<list><item>hello<italics>dfh</italics></item></list></bold>sdfsd
Assume I have
<Sports>
<Soccer>
<Players>
<Player_1> Messi Leonel </Player_1>
</Players>
</Soccer>
</Sports>
How to get Player_1 node text in one line without iteration using Dom4J?
Return value should be: Messi Leonel
Thanks
Got it. For anyone looking for something like this:
File file = new File("/path/to/file.xml");
SAXReader reader = new SAXReader();
Document document = reader.read(file);
String name = document.selectSingleNode("//Sports/Soccer/Players/Player_1").getText();
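Note that the element text in the sample is padded with spaces, so the value may need trimming. For comparison, here is a sketch of the same one-line lookup using only the JDK's javax.xml.xpath API instead of Dom4J, with the XPath normalize-space() function doing the trimming. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class, JDK only.
class PlayerLookupDemo {

    // Evaluates the XPath directly against the XML source and returns
    // the string value with leading/trailing whitespace removed.
    static String playerName(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(
                "normalize-space(/Sports/Soccer/Players/Player_1)",
                new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Sports><Soccer><Players>"
                + "<Player_1> Messi Leonel </Player_1>"
                + "</Players></Soccer></Sports>";
        System.out.println(playerName(xml)); // Messi Leonel
    }
}
```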
How do I get a node's value together with its child nodes? For example, I have the following node parsed into a DOM Document instance:
<root>
<ch1>That is a text with <value name="val1">value contents</value></ch1>
</root>
I select the ch1 node using XPath. Now I need to get its contents, that is, everything contained between <ch1> and </ch1>, e.g. That is a text with <value name="val1">value contents</value>.
How can I do it?
I have found the following code snippet, which uses a transformation and gives almost exactly what I want. The result can be tuned by changing the output method.
public static String serializeDoc(Node doc) {
StringWriter outText = new StringWriter();
StreamResult sr = new StreamResult(outText);
Properties oprops = new Properties();
oprops.put(OutputKeys.METHOD, "xml");
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = null;
try {
t = tf.newTransformer();
t.setOutputProperties(oprops);
t.transform(new DOMSource(doc), sr);
} catch (Exception e) {
System.out.println(e);
}
return outText.toString();
}
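To illustrate what tuning the output method can look like, here is a small sketch that serializes the same node once with the "xml" method and once with the "text" method, and also omits the XML declaration. It uses only the JDK's javax.xml.transform API; the class and helper names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, JDK only.
class OutputMethodDemo {

    // Serializes the node with the given output method ("xml", "text", ...).
    static String serialize(Node node, String method) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, method);
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    // Parses the XML and returns the first element with the given tag name.
    static Node firstElement(String xml, String tagName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getElementsByTagName(tagName).item(0);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><ch1>That is a text with "
                + "<value name=\"val1\">value contents</value></ch1></root>";
        Node ch1 = firstElement(xml, "ch1");
        System.out.println(serialize(ch1, "xml"));  // markup, including the ch1 tags
        System.out.println(serialize(ch1, "text")); // text content only
    }
}
```

With the "xml" method the enclosing ch1 tags are still part of the output, which is presumably why the result is only "almost" what was asked for.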
If this is server side java (ie you do not need to worry about it running on other jvm's) and you are using the Sun/Oracle JDK, you can do the following:
import com.sun.org.apache.xml.internal.serialize.OutputFormat;
import com.sun.org.apache.xml.internal.serialize.XMLSerializer;
...
Node n = ...;
OutputFormat outputFormat = new OutputFormat();
outputFormat.setOmitXMLDeclaration(true);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
XMLSerializer ser = new XMLSerializer(baos, outputFormat);
ser.serialize(n);
System.out.println(new String(baos.toByteArray()));
Remember that the final conversion from bytes to a String may need an explicit encoding parameter: if the serialized XML uses a different encoding than your platform's default one, you'll get garbage for any unusual characters.
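A minimal sketch of that point, swapping in the standard javax.xml.transform API rather than the internal XMLSerializer shown above: the node is serialized to bytes in an explicitly chosen charset, and the bytes are decoded with that same charset instead of the platform default. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, JDK only.
class EncodingRoundTripDemo {

    // Serializes the node to bytes using the given charset.
    static byte[] toBytes(Node node, Charset charset) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, charset.name());
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        t.transform(new DOMSource(node), new StreamResult(baos));
        return baos.toByteArray();
    }

    // Decodes with the same charset that was used for serialization.
    static String serializeToString(Node node, Charset charset) throws Exception {
        return new String(toBytes(node, charset), charset);
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader("<n>caf\u00e9</n>")));
        System.out.println(
                serializeToString(doc.getDocumentElement(), StandardCharsets.UTF_8));
    }
}
```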
You could use jOOX to wrap your DOM objects and get many utility functions from it, such as the one you need. In your case, this will produce the result you need (using CSS-style selectors to find <ch1/>):
String xml = $(document).find("ch1").content();
Or with XPath as you did:
String xml = $(document).xpath("//ch1").content();
Internally, jOOX will use a transformer to generate that output, as others have mentioned.
As far as I know, there is no equivalent of innerHTML in Document. DOM is meant to hide the details of the markup from you.
You can probably get the effect you want by going through the children of that node. Suppose for example that you want to copy out the text, but replace each "value" tag with a programmatically supplied value:
HashMap<String, String> values = ...;
StringBuilder str = new StringBuilder();
for (Node child = ch1.getFirstChild(); child != null; child = child.getNextSibling()) {
    if (child.getNodeType() == Node.TEXT_NODE) {
        // plain text is copied as is
        str.append(child.getTextContent());
    } else if (child.getNodeName().equals("value")) {
        // each value element is replaced by the value looked up under its name attribute
        str.append(values.get(child.getAttributes().getNamedItem("name").getTextContent()));
    }
}
String output = str.toString();
Say I have a Java String which has xml data like so:
String content = "<abc> Hello <mark> World </mark> </abc>";
Now, I want to render this String as text on a web page and highlight/mark the word "World". The tag "abc" could change dynamically, so is there a way I can rename the outermost XML tag in a String using Java?
I would like to convert the above String to the format shown below:
String content = "<i> Hello <mark> World </mark> </i>";
Now, I could use the new String to set html content and display the text in italics and highlight the word World.
Thanks,
Sony
PS: I am using xquery over files in BaseX xml database. The String content is essentially a result of an xquery which uses ft:extract(), a function to extract full text search results.
XML "parsing" with regexes can be cumbersome. If there is a possibility that your XML string can be more complicated than the one used in your example, you should consider processing it as a real XML node.
String newName = "i";
// parse String as DOM
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new InputSource(new StringReader(content)));
// modify DOM
doc.renameNode(doc.getDocumentElement(), null, newName);
This code assumes that the element that needs to be renamed is always the outermost element, that is, the root element.
Now the document is a DOM tree. It can be converted back to String object with a transformer.
// output DOM as String
Transformer transformer = TransformerFactory.newInstance().newTransformer();
StringWriter sw = new StringWriter();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.transform(new DOMSource(doc), new StreamResult(sw));
String italicsContent = sw.toString();
Perhaps a simple regex?
String content = "<abc> Sample text <mark> content </mark> </abc>";
Pattern outerTags = Pattern.compile("^<(\\w+)>(.*)</\\1>$");
Matcher m = outerTags.matcher(content);
if (m.matches()) {
content = "<i>" + m.group(2) + "</i>";
System.out.println(content);
}
Alternatively, use a DOM parser, find the children of the outer tag and print them, preceded and followed by your desired tag as strings.
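A rough sketch of that DOM alternative, using only JDK classes: parse the string, serialize each child of the outer element with a Transformer, and wrap the result in the desired tag. The class and method names are hypothetical, and it assumes the new tag name is a plain element name without attributes.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, JDK only.
class RewrapOuterTagDemo {

    // Replaces the outermost element of xml by newTag, keeping its children.
    static String rewrap(String xml, String newTag) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        out.write("<" + newTag + ">");
        // Serialize every child (text and elements) of the outer element in order.
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            t.transform(new DOMSource(children.item(i)), new StreamResult(out));
        }
        out.write("</" + newTag + ">");
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String content = "<abc> Hello <mark> World </mark> </abc>";
        System.out.println(rewrap(content, "i"));
    }
}
```

Unlike the regex, this does not depend on the outer tag having no attributes or on the content being on a single line, but it does require the input to be well-formed XML.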