get node raw text

How do I get a node's value together with its child nodes? For example, I have the following node parsed into a DOM Document instance:
<root>
<ch1>That is a text with <value name="val1">value contents</value></ch1>
</root>
I select the ch1 node using XPath. Now I need to get its contents, that is, everything contained between <ch1> and </ch1>: That is a text with <value name="val1">value contents</value>.
How can I do it?

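I found the following code snippet, which uses a transformation and gives almost exactly what I want. The result can be tuned by changing the output method.
import java.io.StringWriter;
import java.util.Properties;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Node;
public static String serializeDoc(Node doc) {
    StringWriter outText = new StringWriter();
    StreamResult sr = new StreamResult(outText);
    Properties oprops = new Properties();
    oprops.put(OutputKeys.METHOD, "xml");
    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer t = null;
    try {
        // An identity transformer copies the DOM node to the output unchanged.
        t = tf.newTransformer();
        t.setOutputProperties(oprops);
        t.transform(new DOMSource(doc), sr);
    } catch (Exception e) {
        System.out.println(e);
    }
    return outText.toString();
}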
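The serializer above writes the node itself (and, by default, an XML declaration), which is why the result is only "almost" what was asked for. A minimal sketch of one way to get just the contents, assuming the default JDK transformer: serialize each child of the node in turn and concatenate the results. The names InnerXml, innerXml and parse are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class InnerXml {
    // Hypothetical helper: returns the markup between the node's start and end tags.
    static String innerXml(Node node) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            // Without this, every child would be prefixed with an XML declaration.
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
                t.transform(new DOMSource(child), new StreamResult(out));
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize node", e);
        }
    }

    // Hypothetical helper: parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Calling InnerXml.innerXml(ch1) on the node from the question should give the text followed by the value element, without the surrounding ch1 tags.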

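If this is server-side Java (i.e. you do not need to worry about it running on other JVMs) and you are using the Sun/Oracle JDK, you can do the following. Note that these are internal JDK classes, so they may be missing or inaccessible on other JVMs and on newer JDK versions: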
import com.sun.org.apache.xml.internal.serialize.OutputFormat;
import com.sun.org.apache.xml.internal.serialize.XMLSerializer;
...
Node n = ...;
OutputFormat outputFormat = new OutputFormat();
outputFormat.setOmitXMLDeclaration(true);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
XMLSerializer ser = new XMLSerializer(baos, outputFormat);
ser.serialize(n);
System.out.println(new String(baos.toByteArray()));
Remember that the final conversion to a string may need an encoding parameter: if the serialized XML uses an encoding different from your platform's default one, non-ASCII characters will come out as garbage.
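To make the encoding point concrete, here is a small sketch that swaps the internal XMLSerializer for the standard Transformer API and names the charset explicitly on both sides, when writing bytes and when decoding them. The class and method names (EncodingAwareSerializer, toBytes, serialize, parse) are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class EncodingAwareSerializer {
    // Serializes the node to bytes in the requested charset.
    static byte[] toBytes(Node node, Charset charset) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.setOutputProperty(OutputKeys.ENCODING, charset.name());
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            t.transform(new DOMSource(node), new StreamResult(baos));
            return baos.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize node", e);
        }
    }

    // Decodes with the same charset that was used for writing,
    // instead of relying on the platform default.
    static String serialize(Node node) {
        Charset charset = StandardCharsets.UTF_8;
        return new String(toBytes(node, charset), charset);
    }

    // Hypothetical helper: parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

If a String is all that is needed, writing to a StringWriter avoids the byte round trip altogether.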

You could use jOOX to wrap your DOM objects and get many utility functions from it, such as the one you need. In your case, this will produce the result you need (using CSS-style selectors to find <ch1/>):
String xml = $(document).find("ch1").content();
Or with XPath as you did:
String xml = $(document).xpath("//ch1").content();
Internally, jOOX will use a transformer to generate that output, as others have mentioned.

As far as I know, there is no equivalent of innerHTML in Document. DOM is meant to hide the details of the markup from you.
You can probably get the effect you want by going through the children of that node. Suppose, for example, that you want to copy out the text but replace each "value" tag with a programmatically supplied value:
HashMap<String, String> values = ...;
StringBuilder str = new StringBuilder();
// Walk the direct children of ch1; they are Nodes (text or element), not only Elements.
for (Node child = ch1.getFirstChild(); child != null; child = child.getNextSibling()) {
    if (child.getNodeType() == Node.TEXT_NODE) {
        str.append(child.getTextContent());
    } else if (child.getNodeName().equals("value")) {
        // Look up the replacement by the element's "name" attribute.
        str.append(values.get(child.getAttributes().getNamedItem("name").getTextContent()));
    }
}
String output = str.toString();

Related

How to make minor edits to an XML file in Java

I am trying to change a single value in a large (5 MB) XML file. I know the value will always be in the first 10 lines, so I do not need to read 99% of the file. Yet doing a partial XML read in Java seems quite tricky.
In this picture you can see the single value I need to access.
I have read a lot about XML in Java and the best practices for handling it. However, in this case I am unsure what the best approach would be. DOM, StAX and SAX parsers all seem to have different best-use scenarios, and I am not sure which would best suit this problem, since all I need to do is edit one value.
Perhaps I shouldn't even use an XML parser and just go with a regex, but it seems like a pretty bad idea to use regexes on XML.
Hoping someone could point me in the right direction,
Thanks!

I would choose DOM over SAX or StAX simply for the (relative) simplicity of the API. Yes, there is some boilerplate code to get the DOM populated, but once you get past that it is fairly straight-forward.
Having said that, if your XML source is 100s or 1000s of megabytes, one of the streaming APIs would be better suited. As it is, 5MB is not what I would consider a large dataset, so go ahead and use DOM and call it a day:
import java.io.File;
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.stream.*;
import javax.xml.xpath.*;
import org.w3c.dom.*;
public class ChangeVersion
{
    public static void main(String[] args)
        throws Exception
    {
        if (args.length < 3) {
            System.err.println("Usage: ChangeVersion <input> <output> <new version>");
            System.exit(1);
        }
        File inputFile = new File(args[0]);
        File outputFile = new File(args[1]);
        int updatedVersion = Integer.parseInt(args[2], 10);
        // Parse the whole file into a DOM tree.
        DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = domFactory.newDocumentBuilder();
        Document doc = docBuilder.parse(inputFile);
        // Select the Version attribute of every Project element under the root.
        XPathFactory xpathFactory = XPathFactory.newInstance();
        XPath xpath = xpathFactory.newXPath();
        XPathExpression expr = xpath.compile("/PremiereData/Project/@Version");
        NodeList versionAttrNodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
        for (int i = 0; i < versionAttrNodes.getLength(); i++) {
            Attr versionAttr = (Attr) versionAttrNodes.item(i);
            versionAttr.setNodeValue(String.valueOf(updatedVersion));
        }
        // Write the modified tree to the output file.
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.transform(new DOMSource(doc), new StreamResult(outputFile));
    }
}

You can use the StAX parser to write the XML as you read it, replacing content as it is parsed. A StAX parser keeps only part of the XML in memory at any given time.
import java.io.*;
import java.util.*;
import javax.xml.namespace.QName;
import javax.xml.stream.*;
import javax.xml.stream.events.*;
public static void main(String[] args) throws Exception {
    final String newProjectId = "888";
    File inputFile = new File("in.xml");
    File outputFile = new File("out.xml");
    System.out.println("Reading " + inputFile);
    System.out.println("Writing " + outputFile);
    XMLInputFactory inFactory = XMLInputFactory.newInstance();
    XMLEventReader eventReader = inFactory.createXMLEventReader(new FileInputStream(inputFile));
    XMLOutputFactory factory = XMLOutputFactory.newInstance();
    // Write to a stream rather than a FileWriter so the platform default charset is not applied silently.
    XMLEventWriter writer = factory.createXMLEventWriter(new FileOutputStream(outputFile));
    XMLEventFactory eventFactory = XMLEventFactory.newInstance();
    boolean useExistingEvent; // specifies if we should use the event right from the reader
    while (eventReader.hasNext()) {
        XMLEvent event = eventReader.nextEvent();
        useExistingEvent = true;
        // look for our Project element
        if (event.getEventType() == XMLEvent.START_ELEMENT) {
            StartElement elemEvent = event.asStartElement();
            Attribute attr = elemEvent.getAttributeByName(QName.valueOf("ObjectID"));
            // check to see if this is the project we want
            // TODO: put what logic you want here
            if ("Project".equals(elemEvent.getName().getLocalPart()) && attr != null && attr.getValue().equals("1")) {
                Attribute versionAttr = elemEvent.getAttributeByName(QName.valueOf("Version"));
                // build a new attribute list for this element that doesn't include the Version attribute
                List<Attribute> newAttrs = new ArrayList<>();
                Iterator<Attribute> existingAttrs = elemEvent.getAttributes();
                while (existingAttrs.hasNext()) {
                    Attribute existing = existingAttrs.next();
                    // copy over everything but the Version attribute
                    if (!existing.getName().getLocalPart().equals("Version")) {
                        newAttrs.add(existing);
                    }
                }
                // add the Version attribute with its new value
                newAttrs.add(eventFactory.createAttribute(versionAttr.getName(), newProjectId));
                // we're using our own event instead of the existing one
                useExistingEvent = false;
                writer.add(eventFactory.createStartElement(elemEvent.getName(), newAttrs.iterator(), elemEvent.getNamespaces()));
            }
        }
        // persist the existing event
        if (useExistingEvent) {
            writer.add(event);
        }
    }
    writer.close();
}
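If only reading the value is needed, a cursor-style StAX reader can stop as soon as the attribute is found, so the rest of the file is never parsed. A rough sketch, assuming the element and attribute names used in the answers above (Project, Version); the class FirstVersion and its method are hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class FirstVersion {
    // Returns the Version attribute of the first Project element that has one, or null.
    static String readVersion(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "Project".equals(reader.getLocalName())) {
                        // A null namespace URI means the namespace is not checked.
                        String version = reader.getAttributeValue(null, "Version");
                        if (version != null) {
                            return version; // stop here; the remaining input is not read
                        }
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read XML", e);
        }
    }
}
```

For a real file, the same loop would work with a reader created from a FileInputStream; writing the change back still requires copying the whole document, as in the answer above.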

Java transformer w3c.dom.document to inputstream

My scenario is this:
I have an HTML file that I loaded into a w3c.dom.Document. After loading it, I walked through its nodes and changed a few of their values, but now I need to transform this document into a String, or preferably directly into an InputStream.
I managed to do so. However, for my purposes the HTML must keep some properties of the initial file; in particular (and this is the one thing I am struggling to solve), all tags must be closed.
Say I have a link tag in the header, <link .... />. I NEED the slash (/) at the end. However, after the transformer writes my doc to an OutputStream (which I then turn into an InputStream), every '/' before the > disappears. All my tags that ended in /> are changed into a simple >.
The reason I need this structure is that one of the libraries I am using (and I am afraid I cannot go looking for another one, especially not at this point) requires all tags to be closed; otherwise it throws exceptions everywhere and my program crashes.
Does anyone have any good ideas or solutions for me? This is my first contact with the Transformer class, so I might be missing something that could help me.
Thank you all so very much,
Warm regards
Some of the code, to explain the scenario a little:
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
org.w3c.dom.Document doc = docBuilder.parse(his); // his = the HTML inputStream
XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "//*[@id='pessoaNome']";
org.w3c.dom.Element pessoaNome = null;
try
{
    pessoaNome = (org.w3c.dom.Element) (Node) xPath.compile(expression).evaluate(doc, XPathConstants.NODE);
}
catch (Exception e)
{
    e.printStackTrace();
}
pessoaNome.setTextContent("The new values for the node");
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
Source xmlSource = new DOMSource(doc);
Result outputTarget = new StreamResult(outputStream);
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, "HTML");
transformer.transform(xmlSource, outputTarget);
// At this point outputStream is already all messed up, not just the '/', but that is the only thing causing me problems
InputStream is = new ByteArrayInputStream(outputStream.toByteArray());
As @Lee pointed out, I changed it to use jsoup. The code got a lot cleaner; I just had to set up the OutputSettings for it to work like a charm. Code below:
org.jsoup.nodes.Document doc = Jsoup.parse(new File(HTML), "UTF-8");
org.jsoup.nodes.Element pessoaNome = doc.getElementById("pessoaNome");
pessoaNome.html("My new html in here");
OutputSettings oSettings = new OutputSettings();
oSettings.syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml);
doc.outputSettings(oSettings);
InputStream is = new ByteArrayInputStream(doc.outerHtml().getBytes());

Have a look at JTidy, which cleans HTML. There is also jsoup, which is newer and supposedly does the same things, only better.
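One likely reason the slashes disappear with the Transformer: when no output method is set and the root element is named html, XSLT processors default to the HTML output method, which writes empty elements such as <link> without a closing slash. The thread does not confirm that this is the cause here, so treat the following as a sketch under that assumption: it forces the XML output method so that empty elements stay self-closed. The class and method names are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class XmlMethodOutput {
    // Serializes the document with the XML output method, regardless of the root element's name.
    static String toXmlSyntax(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    // Hypothetical helper: parses a well-formed (X)HTML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }
}
```

This only helps when the input is well-formed enough for an XML parser in the first place; for real-world HTML, the jsoup route above is probably the more forgiving option.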

How to rename an xml node to a html tag

Say I have a Java String which has xml data like so:
String content = "<abc> Hello <mark> World </mark> </abc>";
Now, I want to render this String as text on a web page and highlight/mark the word "World". The tag "abc" could change dynamically, so is there a way I can rename the outermost XML tag in a String using Java?
I would like to convert the above String to the format shown below:
String content = "<i> Hello <mark> World </mark> </i>";
Now, I could use the new String to set html content and display the text in italics and highlight the word World.
Thanks,
Sony
PS: I am using xquery over files in BaseX xml database. The String content is essentially a result of an xquery which uses ft:extract(), a function to extract full text search results.

XML "parsing" with regexes can be cumbersome. If there is a possibility that your XML string can be more complicated than the one used in your example, you should consider processing it as a real XML node.
String newName = "i";
// parse String as DOM
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new InputSource(new StringReader(content)));
// modify DOM
doc.renameNode(doc.getDocumentElement(), null, newName);
This code assumes that the element that needs to be renamed is always the outermost element, that is, the root element.
Now the document is a DOM tree. It can be converted back to String object with a transformer.
// output DOM as String
Transformer transformer = TransformerFactory.newInstance().newTransformer();
StringWriter sw = new StringWriter();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.transform(new DOMSource(doc), new StreamResult(sw));
String italicsContent = sw.toString();

Perhaps a simple regex?
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String content = "<abc> Sample text <mark> content </mark> </abc>";
// Group 1 is the outer tag name, group 2 is everything between the outer tags.
Pattern outerTags = Pattern.compile("^<(\\w+)>(.*)</\\1>$");
Matcher m = outerTags.matcher(content);
if (m.matches()) {
    content = "<i>" + m.group(2) + "</i>";
    System.out.println(content);
}
Alternatively, use a DOM parser, find the children of the outer tag and print them, preceded and followed by your desired tag as strings.

How to save parsed and changed DOM document in xml file?

I have an XML file. I need to read it, make some changes and write the changed version to a new destination.
I managed to read, parse and patch this file (with DocumentBuilderFactory, DocumentBuilder, Document and so on).
But I cannot find a way to save that file. Is there a way to get its plain-text form (as a String), or is there a better way?

Something like this works:
Transformer transformer = TransformerFactory.newInstance().newTransformer();
Result output = new StreamResult(new File("output.xml"));
Source input = new DOMSource(myDocument);
transformer.transform(input, output);
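A slightly fuller sketch of the same idea, assuming you want explicit control over encoding and indentation and want the output stream closed reliably. DomSaver and its methods are hypothetical names; parse and load are only there so that the round trip can be tried out.

```java
import java.io.OutputStream;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class DomSaver {
    // Writes the document to the target path as indented UTF-8 XML.
    static void save(Document doc, Path target) {
        try (OutputStream out = Files.newOutputStream(target)) {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            t.setOutputProperty(OutputKeys.INDENT, "yes");
            t.transform(new DOMSource(doc), new StreamResult(out));
        } catch (Exception e) {
            throw new IllegalStateException("Could not save document to " + target, e);
        }
    }

    // Reads a document back from disk.
    static Document load(Path source) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source.toFile());
        } catch (Exception e) {
            throw new IllegalStateException("Could not load " + source, e);
        }
    }

    // Parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

To get a String instead of a file, the same transformer can write into a StreamResult that wraps a StringWriter.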

This will work, provided you're using Xerces-J:
public String serialise(org.w3c.dom.Document document) throws java.io.IOException {
    java.io.ByteArrayOutputStream data = new java.io.ByteArrayOutputStream();
    java.io.PrintStream ps = new java.io.PrintStream(data);
    org.apache.xml.serialize.OutputFormat of =
        new org.apache.xml.serialize.OutputFormat("XML", "ISO-8859-1", true);
    of.setIndent(1);
    of.setIndenting(true);
    org.apache.xml.serialize.XMLSerializer serializer =
        new org.apache.xml.serialize.XMLSerializer(ps, of);
    // As a DOM Serializer
    serializer.asDOMSerializer();
    serializer.serialize(document);
    ps.flush();
    // Decode with the same encoding the serializer was told to write.
    return data.toString("ISO-8859-1");
}

This lets you define the XML output format (the XMLWriter and OutputFormat classes here appear to be dom4j's):
new XMLWriter(new FileOutputStream(fileName),
    new OutputFormat() {{
        setEncoding("UTF-8");
        setIndent(" ");
        setTrimText(false);
        setNewlines(true);
        setPadText(true);
    }}).write(document);

Print empty XML element as opening tag, closing tag

Is there any way to print empty elements like <tag></tag> rather than <tag /> using org.w3c.dom? I'm modifying XML files that need to be diff'ed against old versions of themselves for review.
If it helps, the code that writes the XML to the file:
TransformerFactory t = TransformerFactory.newInstance();
Transformer transformer = t.newTransformer();
DOMSource source = new DOMSource(doc);
StringWriter xml = new StringWriter();
StreamResult result = new StreamResult(xml);
transformer.transform(source, result);
File f = new File("output.xml");
FileWriter writer = new FileWriter(f);
BufferedWriter out = new BufferedWriter(writer);
out.write(xml.toString());
out.close();
Thanks.

You may want to consider converting both the old and the new XML file to Canonical XML - http://en.wikipedia.org/wiki/Canonical_XML - before comparing them with e.g. diff.
James Clark has a small Java program to do so on http://www.jclark.com/xml/

I'm assuming the empty elements are actually ELEMENT_NODEs with no children within the document. Try adding an empty text node to them instead. That may trick the writer into believing there is a text node there, so it will write it out as if there was one. But the text node won't output anything because it is an empty string.
Calling this method with the document as both parameters should do the trick:
private static void fillEmptyElementsWithEmptyTextNodes(
    final Document doc, final Node root)
{
    final NodeList children = root.getChildNodes();
    // An element with no children would be written as <tag/>; give it an empty text child.
    if (root.getNodeType() == Node.ELEMENT_NODE &&
        children.getLength() == 0)
    {
        root.appendChild(doc.createTextNode(""));
    }
    // Recurse to children.
    for (int i = 0; i < children.getLength(); ++i)
    {
        final Node child = children.item(i);
        fillEmptyElementsWithEmptyTextNodes(doc, child);
    }
}
