Java transformer w3c.dom.document to inputstream

My scenario is this:
I have an HTML file that I loaded into a w3c.dom.Document. After loading it, I walked through its nodes and changed a few of their values, and now I need to transform this document into a String, or preferably directly into an InputStream.
I managed to do so. However, for my purposes the HTML must keep some properties of the original file; for instance (and this is the one thing I'm struggling to solve), all tags must be closed.
Say I have a link tag in the header, <link .... />: I NEED the slash (/) at the end. However, after the transformer writes my doc to an OutputStream (which I then feed to an InputStream), every '/' before the > disappears. All my tags that ended in /> are changed to a plain >.
The reason I need this structure is that one of the libraries I'm using (and I'm afraid I can't go looking for another one, especially not at this point) requires all tags to be closed; otherwise it throws exceptions everywhere and my program crashes.
Does anyone have any good ideas or solutions for me? This is my first contact with the Transformer class, so I might be missing something that could help me.
Thank you all so very much,
Warm regards
Here is some of the code, to explain the scenario a little:
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
org.w3c.dom.Document doc = docBuilder.parse(his); // his = the HTML inputStream
XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "//*[@id='pessoaNome']";
org.w3c.dom.Element pessoaNome = null;
try
{
pessoaNome = (org.w3c.dom.Element) (Node) xPath.compile(expression).evaluate(doc, XPathConstants.NODE);
}
catch (Exception e)
{
e.printStackTrace();
}
pessoaNome.setTextContent("The new values for the node");
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
Source xmlSource = new DOMSource(doc);
Result outputTarget = new StreamResult(outputStream);
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, "HTML");
transformer.transform(xmlSource, outputTarget);
InputStream is = new ByteArrayInputStream(outputStream.toByteArray()); // At this point outputStream is already messed up in several ways, not just the '/', but the '/' is the only thing causing me problems
As @Lee pointed out, I changed it to use Jsoup. The code got a lot cleaner; I just had to set up the OutputSettings for it to work like a charm. Code below:
org.jsoup.nodes.Document doc = Jsoup.parse(new File(HTML), "UTF-8");
org.jsoup.nodes.Element pessoaNome = doc.getElementById("pessoaNome");
pessoaNome.html("My new html in here");
OutputSettings oSettings = new OutputSettings();
oSettings.syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml);
doc.outputSettings(oSettings);
InputStream is = new ByteArrayInputStream(doc.outerHtml().getBytes());

Have a look at jTidy, which cleans HTML. There is also jsoup, which is newer and supposedly does the same things, only better.
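
If staying with the JDK Transformer is preferable, one thing that may be worth trying is forcing the XML output method. Per the XSLT rules, a serializer may default to the HTML output method when the root element is html, and that method writes empty elements as <link ...> instead of <link .../>. The following is only a sketch under that assumption; the class name, method name and sample markup are made up for illustration, and the input is assumed to be well-formed already (as it must be for DocumentBuilder to parse it).

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class XmlOutputSketch {

    // Serializes the DOM with the "xml" output method, so that empty
    // elements keep their closing slash.
    static byte[] toXmlBytes(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input standing in for the real HTML file.
        String html = "<html><head><link href=\"a.css\"/></head><body/></html>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(html)));
        System.out.println(new String(toXmlBytes(doc), StandardCharsets.UTF_8));
    }
}
```

Wrapping the returned array in new ByteArrayInputStream(...) then gives the InputStream the question asks for.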

Related

Trying to get the value of a tag in an xml string java

I have an xml string stored in a StringBuilder.
My xml looks like this
couldn't write it in code so here's a screenshot
inside the report tag, it looks like
what it looks like
I would like to access the value of any tag I want inside the record tag. What I have so far is:
StringBuilder informationString = new StringBuilder();
Scanner scanner = new Scanner(url.openStream());
while (scanner.hasNext()) {
informationString.append(scanner.nextLine());
}
//Close the scanner
scanner.close();
System.out.println(informationString);
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new InputSource(new StringReader(String.valueOf(informationString))));
Element rootElement = document.getDocumentElement();
But I do not know what to do with this and am very lost.
Thanks in advance for helping.

In general, you can use the routine below:
Element documentElement=....
NodeList elmList=documentElement.getElementsByTagName("elementName");
Element e=(Element)elmList.item(x); // x is an index; putting this in a loop would do
You can keep applying the above to the returned elements to walk down the tree.
A better approach, though, would be to use XPath (Saxon has a decent XPath implementation, and there are many more libraries to choose from).
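
As a rough illustration of the XPath route, here is a minimal sketch that uses only the javax.xml.xpath classes bundled with the JDK. The class name, the helper name and the name tag are invented for the example, since the real structure was only shown in a screenshot; only report and record come from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class XPathValueSketch {

    // Parses the XML text and returns the string value of the first node
    // matched by the expression (an empty string if nothing matches).
    static String valueOf(String xml, String expression) throws Exception {
        Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        return xPath.evaluate(expression, document);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical document; the "name" tag is an assumption.
        String xml = "<report><record><name>Alice</name></record></report>";
        System.out.println(valueOf(xml, "/report/record/name"));
    }
}
```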

Javax DocumentBuilder produces “double-UTF-8’ed” charset encoding

I’ve got a Java DOM Document which MyFilter has rewritten. From logging output I know that the contents of the Document are still correct. I am using the following lines to convert theDocument to a List<String> to pass it back through an interface:
Transformer transformer = TransformerFactory.newInstance().newTransformer();
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
transformer.transform(new DOMSource(theDocument), new StreamResult(buffer));
return Arrays.asList(new String(buffer.toByteArray()).split("\r?\n"));
The filter is called from this file copying method using org.apache.commons.io.FileUtils:
List<String> lines = FileUtils.readLines(source, "UTF-8");
if (filters != null) {
for (final MyFilter filter : filters) {
lines = filter.filter(lines);
}
}
FileUtils.writeLines(destination, "UTF-8", lines);
This works perfectly fine on my machine (where I could debug it), but on other machines just running the code, reproducibly any non-ASCII characters get double-UTF-8’ed (e.g., Größe becomes GrÃ¶ÃŸe). The code is executed within a web app running in Tomcat. I am sure they are differently configured, but what I want is that I get the non-corrupt result on any configuration.
Any ideas what I could be missing?

Once you have the Document object, you have to read its content.
You can do that with the LSSerializer interface, which the DOM standard provides for this purpose, and then write the result to a file.
By default, the LSSerializer produces an XML document without spaces or line breaks. As a result, the output looks less pretty, but it is actually more suitable for parsing by another program because it is free from unnecessary white space.
If you want white space, you use yet another magic incantation after creating the serializer:
ser.getDomConfig().setParameter("format-pretty-print", true);
The code looks like this:
private String getContentFromDocument(Document doc) {
String content;
DOMImplementation impl = doc.getImplementation();
DOMImplementationLS implLS = (DOMImplementationLS) impl.getFeature("LS", "3.0");
LSSerializer ser = implLS.createLSSerializer();
ser.getDomConfig().setParameter("format-pretty-print", true);
content = ser.writeToString(doc);
return content;
}
Once you have the content as a string, you can write it to a file like this:
public void writeToXmlFile(String xmlContent) {
File theDir = new File("./output");
if (!theDir.exists())
theDir.mkdir();
String fileName = "./output/" + this.getClass().getSimpleName() + "_"
+ Calendar.getInstance().getTimeInMillis() + ".xml";
try (OutputStream stream = new FileOutputStream(new File(fileName))) {
try (OutputStreamWriter out = new OutputStreamWriter(stream, StandardCharsets.UTF_8)) {
out.write(xmlContent);
out.write("\n");
}
} catch (IOException ex) {
System.err.println("Cannot write to file!" + ex.getMessage());
}
}
BTW:
Have you tried obtaining the Document object in a slightly simpler way, like this:
DocumentBuilderFactory documentFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = documentFactory.newDocumentBuilder();
Document doc = builder.parse(new File(fileName));
You can try this as well. It should be enough for parsing an XML file.

I finally found it: The problem was within the String(byte[]) constructor which interprets byte[] relative to the platform’s default charset. This should at least have been tagged deprecated. The transformer obviously produces UTF-8 output independent of the platform. Changing the method like below passes the same charset to both:
final String ENCODING = "UTF-8";
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, ENCODING);
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
transformer.transform(new DOMSource(theDocument), new StreamResult(buffer));
return Arrays.asList(new String(buffer.toByteArray(), ENCODING).split("\r?\n"));
To get it working, it does not really matter which encoding is used, as long as both sides use the same one. However, it is good to choose a Unicode charset, as otherwise unmappable characters may get lost. Note also that the charset will be reflected in the XML declaration, so when the List<String> gets saved later, it is important to save it accordingly.
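
Another option, sketched here only as an untested-in-your-setup alternative, is to avoid the byte[] round trip entirely by letting the transformer write characters into a StringWriter, so no charset is involved when building the String. The class and method names are made up for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class DocumentLinesSketch {

    // Serializes the document to characters (not bytes) and splits it into
    // lines, so the platform default charset never comes into play.
    static List<String> toLines(Document theDocument) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Only affects the encoding named in the XML declaration here.
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(theDocument), new StreamResult(writer));
        return Arrays.asList(writer.toString().split("\r?\n"));
    }

    public static void main(String[] args) throws Exception {
        // Unicode escapes keep this source ASCII-only; the text is the
        // German word used in the question.
        String xml = "<size>Gr\u00f6\u00dfe</size>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        for (String line : toLines(doc)) {
            System.out.println(line);
        }
    }
}
```

The caveat about the XML declaration still applies: whoever writes the lines to disk should use the encoding that the declaration names.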

Adding element to XML using javax parser without document modification

I am trying to add elements to an XML document. The elements are added successfully, but the problem is that the parser modifies the original XML file in other places, e.g. it swaps the namespace and Id attributes or deletes duplicate namespace definitions. I need to get precisely the same document (same syntax, preserved whitespace), only with the specific elements added. I would greatly appreciate any suggestions. Here is my code:
public void appendTimestamp(String timestamp, String signedXMLFile, String timestampedXMLFile){
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
try{
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(new File(signedXMLFile));
XPath xPath = XPathFactory.newInstance().newXPath();
NodeList list = (NodeList)xPath.evaluate("//*[local-name()='Signature']/*[local-name()='Object']/*[local-name()='QualifyingProperties']", doc, XPathConstants.NODESET);
if(list.getLength() != 1){
throw new Exception();
}
Node node = list.item(0);
Node unsignedProps = doc.createElement("xades:UnsignedProperties");
Node unsignedSignatureProps = doc.createElement("xzep:UnsignedSignatureProperties");
Node timestampNode = doc.createElement("xzep:SignatureTimeStamp");
timestampNode.appendChild(doc.createTextNode(timestamp));
unsignedSignatureProps.appendChild(timestampNode);
unsignedProps.appendChild(unsignedSignatureProps);
node.appendChild(unsignedProps);
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "no");
transformer.setOutputProperty(OutputKeys.METHOD, "xml");
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
DOMSource source = new DOMSource(doc);
StringWriter writer = new StringWriter();
StreamResult stringWriter = new StreamResult(writer);
transformer.transform(source, stringWriter);
writer.flush();
System.out.println(writer.toString());
}catch(Exception e){
e.printStackTrace();
}
}
The original xml file:
...
<ds:Object Id="objectIdVerificationObject" xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
...
Modified xml file:
...
<ds:Object xmlns:ds="http://www.w3.org/2000/09/xmldsig#" Id="objectIdVerificationObject">
...

If you use the DOM model, the whole XML file is read, represented in memory as a node tree, and then saved back to XML in a way determined by the writer. So it is almost impossible to preserve the original XML format, because you have no control over it; whitespace inside tags, for example, is not represented in the node tree at all.
You need to read the original XML piece by piece and output its content to the new file exactly as it was read, then add your new content at the "right" place, and then continue simply copying the original.
For example, you could use XMLStreamWriter and XMLStreamReader to achieve that, as they offer the "low-level" operations.
But it would probably be much easier to just copy the XML as text, line by line, until you recognize the insertion point, then create the new XML portion, append it as text, and continue copying.
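
A minimal sketch of the plain-text approach could look like the following. It assumes the whole file fits in a String and that the insertion point can be recognized by a literal closing tag; the class name, the method name and the tag names in main are placeholders for illustration, not taken from the real file.

```java
final class TextInsertSketch {

    // Inserts the fragment immediately before the last occurrence of the
    // given closing tag; every other character of the text is left as is.
    static String insertBefore(String xml, String closingTag, String fragment) {
        int at = xml.lastIndexOf(closingTag);
        if (at < 0) {
            throw new IllegalArgumentException("Closing tag not found: " + closingTag);
        }
        return xml.substring(0, at) + fragment + xml.substring(at);
    }

    public static void main(String[] args) {
        // Hypothetical, shortened input; attribute order and spacing survive.
        String original = "<a  Id=\"x\" xmlns:p=\"urn:p\">\n  <p:props>old</p:props>\n</a>";
        String result = insertBefore(original, "</p:props>", "<p:stamp>123</p:stamp>");
        System.out.println(result);
    }
}
```

Note that a literal search like this would also match the tag text inside comments or CDATA sections, so it only fits documents where that cannot happen.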

get node raw text

How do I get a node's value together with its child nodes? For example, I have the following node parsed into a DOM Document instance:
<root>
<ch1>That is a text with <value name="val1">value contents</value></ch1>
</root>
I select the ch1 node using XPath. Now I need to get its contents, i.e. everything contained between <ch1> and </ch1>, e.g. That is a text with <value name="val1">value contents</value>.
How can I do it?

I have found the following code snippet, which uses a transformation and gives almost exactly what I want. The result can be tuned by changing the output method.
public static String serializeDoc(Node doc) {
StringWriter outText = new StringWriter();
StreamResult sr = new StreamResult(outText);
Properties oprops = new Properties();
oprops.put(OutputKeys.METHOD, "xml");
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = null;
try {
t = tf.newTransformer();
t.setOutputProperties(oprops);
t.transform(new DOMSource(doc), sr);
} catch (Exception e) {
System.out.println(e);
}
return outText.toString();
}
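
The snippet above serializes the node itself, including its own start and end tags. To get only what lies between the tags, one possibility is to serialize each child in turn and concatenate the results. This is a sketch assuming the JDK's built-in transformer, which accepts text nodes as a DOMSource; the class and method names are made up.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class InnerXmlSketch {

    // Serializes every child of the given node, one after another, into a
    // single string, leaving out the node's own tags.
    static String innerXml(Node parent) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, "xml");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            t.transform(new DOMSource(child), new StreamResult(out));
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><ch1>That is a text with "
                + "<value name=\"val1\">value contents</value></ch1></root>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Node ch1 = doc.getElementsByTagName("ch1").item(0);
        System.out.println(innerXml(ch1));
    }
}
```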

If this is server-side Java (i.e. you do not need to worry about it running on other JVMs) and you are using the Sun/Oracle JDK, you can do the following:
import com.sun.org.apache.xml.internal.serialize.OutputFormat;
import com.sun.org.apache.xml.internal.serialize.XMLSerializer;
...
Node n = ...;
OutputFormat outputFormat = new OutputFormat();
outputFormat.setOmitXMLDeclaration(true);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
XMLSerializer ser = new XMLSerializer(baos, outputFormat);
ser.serialize(n);
System.out.println(new String(baos.toByteArray()));
Remember that the final conversion to a String may need an explicit encoding parameter: if the serialized bytes are in a different encoding than your platform's default one, you'll get garbage for the unusual characters.

You could use jOOX to wrap your DOM objects and get many utility functions from it, such as the one you need. In your case, this will produce the result you need (using CSS-style selectors to find <ch1/>):
String xml = $(document).find("ch1").content();
Or with XPath as you did:
String xml = $(document).xpath("//ch1").content();
Internally, jOOX will use a transformer to generate that output, as others have mentioned.

As far as I know, there is no equivalent of innerHTML in Document. DOM is meant to hide the details of the markup from you.
You can probably get the effect you want by going through the children of that node. Suppose for example that you want to copy out the text, but replace each "value" tag with a programmatically supplied value:
HashMap<String, String> values = ...;
StringBuilder str = new StringBuilder();
for(Node child = ch1.getFirstChild(); child != null; child = child.getNextSibling()) {
if(child.getNodeType() == Node.TEXT_NODE) {
str.append(child.getTextContent());
} else if(child.getNodeName().equals("value")) {
str.append(values.get(child.getAttributes().getNamedItem("name").getTextContent()));
}
}
String output = str.toString();

How to deal with unknown entity references?

I'm parsing (a lot of) XML files that contain entity references which I don't know in advance (I can't change that fact).
For example:
xml = "<tag>I'm content with &funny; &entity; &references;.</tag>"
When I try to parse this using the following code:
final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
final DocumentBuilder db = dbf.newDocumentBuilder();
final InputSource is = new InputSource(new StringReader(xml));
final Document d = db.parse(is);
I get the following exception:
org.xml.sax.SAXParseException: The entity "funny" was referenced, but not declared.
What I want to achieve is that the parser replaces every entity that is not declared (unknown to the parser) with an empty String ''.
Or even better, is there a way to pass a map to the parser, like:
Map<String,String> entityMapping = ...
entityMapping.put("funny","very");
entityMapping.put("entity","important");
entityMapping.put("references","stuff");
so that I could do the following:
final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
final DocumentBuilder db = dbf.newDocumentBuilder();
final InputSource is = new InputSource(new StringReader(xml));
db.setEntityResolver(entityMapping);
final Document d = db.parse(is);
If I then obtained the text from the document using this example code, I should receive:
I'm content with very important stuff.
Any suggestions? Of course, I would already be happy to just replace the unknown entities with empty strings.
Thanks,

The StAX API has support for this. Have a look at XMLInputFactory, it has a runtime property which dictates whether or not internal entities are expanded, or left in place. If set to false, then the StAX event stream will contain instances of EntityReference to represent the unexpanded entities.
If you still want a DOM as the end result, you can chain it together like this:
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
inputFactory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
Transformer transformer = TransformerFactory.newInstance().newTransformer();
String xml = "my xml";
StringReader xmlReader = new StringReader(xml);
XMLEventReader eventReader = inputFactory.createXMLEventReader(xmlReader);
StAXSource source = new StAXSource(eventReader);
DOMResult result = new DOMResult();
transformer.transform(source, result);
Node document = result.getNode();
In this case, the resulting DOM will contain nodes of org.w3c.dom.EntityReference mixed in with the text nodes. You can then process these as you see fit.

Since your XML input seems to be available as a String, could you not do a simple pre-processing with regular expression replacement?
xml = "...";
/* replace entities before parsing */
for (Map.Entry<String,String> entry : entityMapping.entrySet()) {
xml = xml.replaceAll("&" + entry.getKey() + ";", entry.getValue());
}
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
...
It's quite hacky, and you may want to spend some extra effort to ensure that the regexps only match where they really should (think <entity name="&don't-match-me;"/>), but at least it's something...
Of course, there are more efficient ways to achieve the same effect than calling replaceAll() a lot of times.
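
One such way, sketched here with invented class and method names, is a single regex pass that looks each entity name up in the map. In this sketch the five predefined XML entities are left alone, numeric character references do not match the pattern at all, and names missing from the map become empty strings, which is what the question asked for as a fallback.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntityPreprocessorSketch {

    // Matches named references such as &funny; but not &#123; or &#x7B;.
    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z_][A-Za-z0-9._-]*);");

    // Entities every XML parser already knows; these are kept untouched.
    private static final Set<String> PREDEFINED =
            new HashSet<>(Arrays.asList("amp", "lt", "gt", "quot", "apos"));

    // Replaces all named entity references in one pass over the text.
    static String replaceEntities(String xml, Map<String, String> entityMapping) {
        Matcher m = ENTITY.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            String replacement = PREDEFINED.contains(name)
                    ? m.group()
                    : entityMapping.getOrDefault(name, "");
            // quoteReplacement keeps '$' and '\' in values from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> entityMapping = new HashMap<>();
        entityMapping.put("funny", "very");
        entityMapping.put("entity", "important");
        entityMapping.put("references", "stuff");
        String xml = "<tag>I'm content with &funny; &entity; &references;.</tag>";
        System.out.println(replaceEntities(xml, entityMapping));
    }
}
```

The replacement values are inserted verbatim, so a value containing & or < would have to be escaped first to keep the result well-formed.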

You could add the entities at the beginning of the file. Look here for more info.
You could also take a look at this thread, where someone seems to have implemented the EntityResolver interface (you could also implement EntityResolver2!) so that you can process the entities on the fly (e.g. with your proposed Map).
WARNING: there is a bug in JDK 6, but you could try it with JDK 5.
