Android: Parsing XML DOM parser. Converting childnodes to string - java

Again a question. This time I'm parsing XML messages I receive from a server.
Someone thought it would be smart to place HTML pages inside an XML message. Now I'm facing problems because I want to extract that HTML page as a string from the XML message.
This is the XML message I'm parsing:
<AmigoRequest>
<From></From>
<To></To>
<MessageType>showMessage</MessageType>
<Param0>general message</Param0>
<Param1><html><head>test</head><body>Testhtml</body></html></Param1>
</AmigoRequest>
As you can see, Param1 contains an HTML page. I've tried to extract it the following way:
public String getParam1(Document d) {
if (d.getDocumentElement().getTagName().equals("AmigoRequest")) {
NodeList results = d.getElementsByTagName("Param1");
// Messagetype depends on what message we are reading.
if (results.getLength() > 0 && results != null) {
return results.item(0).getFirstChild().getNodeValue();
}
}
return "";
}
Here d is the XML message in Document form.
This always returns null, because getNodeValue() returns null.
When I try results.item(0).getFirstChild().hasChildNodes(), it returns true, because the parser sees that there is a tag in the message.
How can I extract the HTML <html><head>test</head><body>Testhtml</body></html> from Param1 as a string?
I'm using Android SDK 1.5 (well, almost Java) and a DOM parser.
Thanks for your time and replies.
Antek

You could take the content of param1, like this:
public String getParam1(Document d) {
if (d.getDocumentElement().getTagName().equals("AmigoRequest")) {
NodeList results = d.getElementsByTagName("Param1");
// Messagetype depends on what message we are reading.
if (results != null && results.getLength() > 0) {
// String extractHTMLTags(String s) is a function that you have
// to implement so that it strips all the HTML tags from a string.
return extractHTMLTags(results.item(0).getTextContent());
}
}
return "";
}
All you have to do is to implement a function:
String extractHTMLTags(String s)
that will remove all HTML tag occurrences from a string.
For that you can take a look at this post: Remove HTML tags from a String

After a lot of checking and head-scratching, I found a simple fix: raise your API level to 8.

EDIT: I just saw your comment above about getTextContent() not being supported on Android. I'm going to leave this answer up in case it's useful to someone who's on a different platform.
If your DOM API supports it, you can call getTextContent(), as follows:
public String getParam1(Document d) {
if (d.getDocumentElement().getTagName().equals("AmigoRequest")) {
NodeList results = d.getElementsByTagName("Param1");
// Messagetype depends on what message we are reading.
if (results != null && results.getLength() > 0) {
// getTextContent() is defined on Node, not on NodeList
return results.item(0).getTextContent();
}
}
return "";
}
However, getTextContent() is a DOM Level 3 API call; not all parsers are guaranteed to support it. Xerces-J does.
By the way, in your original example, your check for null is in the wrong place; it should be:
if (results != null && results.getLength() > 0) {
Otherwise, you'd get a NPE if results really does come back as null.

Since getTextContent() isn't available to you, another option would be to write it -- it isn't hard. In fact, if you're writing this solely for your own use -- or your employer doesn't have overly strict rules about open source -- you could look at Apache's implementation as a starting point; lines 610-646 seem to contain most of what you need. (Please be respectful of Apache's copyright and license.)
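Otherwise, a rough implementation of the method would be:
// Requires: import org.w3c.dom.Node;
// Returns the concatenated text of all text nodes below the given node.
static String getTextContent(Node node) {
    if (!node.hasChildNodes()) {
        return "";
    }
    return getTextContent(node, new StringBuilder()).toString();
}
// Appends the text of every descendant text node to sb, in document order.
static StringBuilder getTextContent(Node node, StringBuilder sb) {
    for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
        short type = child.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            sb.append(child.getNodeValue());
        } else {
            getTextContent(child, sb);
        }
    }
    return sb;
}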

Well i was almost there with the code...
public String getParam1(Document d) {
if (d.getDocumentElement().getTagName().equals("AmigoRequest")) {
NodeList results = d.getElementsByTagName("Param1");
// Messagetype depends on what message we are reading.
if (results.getLength() > 0 && results != null) {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db;
Element node = (Element) results.item(0); // get the value of Param1
Document doc2 = null;
try {
db = dbf.newDocumentBuilder();
doc2 = db.newDocument(); //create new document
doc2.appendChild(doc2.importNode(node, true)); //import the <html>...</html> result in doc2
} catch (ParserConfigurationException e) {
// TODO Auto-generated catch block
Log.d(TAG, " Exception ", e);
} catch (DOMException e) {
// TODO: handle exception
Log.d(TAG, " Exception ", e);
} catch (Exception e) {
// TODO: handle exception
e.printStackTrace(); }
return doc2. .....// All I'm missing is something to convert a Document to a string.
}
}
return "";
}
As explained in the comment in my code, all I am missing is a way to make a String out of a Document. You can't use the Transformer class on Android, and doc2.toString() only gives you a string representation of the object, not the XML markup.
My next step is to write my own parser if this doesn't work out ;)
Not the best code, but a temporary solution:
public String getParam1(String b) {
return b
.substring(b.indexOf("<Param1>") + "<Param1>".length(), b.indexOf("</Param1>"));
}
Where String b is the XML document string.
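For the missing Document-to-String step above, here is a minimal hand-rolled sketch that serializes the children of a DOM node back to markup using only org.w3c.dom calls. The class name InnerXml and its helper methods are made up for this illustration; comments, processing instructions, and namespace declarations are not handled specially, and the parse helper uses desktop-Java classes just to make the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

class InnerXml {
    // Hypothetical helper: parses an XML string into a DOM Document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Serializes the children of parent (not parent itself) as markup.
    static String innerXml(Node parent) {
        StringBuilder sb = new StringBuilder();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            append(c, sb);
        }
        return sb.toString();
    }

    // Writes one node and, recursively, its subtree.
    private static void append(Node n, StringBuilder sb) {
        switch (n.getNodeType()) {
            case Node.ELEMENT_NODE: {
                sb.append('<').append(n.getNodeName());
                NamedNodeMap attrs = n.getAttributes();
                for (int i = 0; i < attrs.getLength(); i++) {
                    Node a = attrs.item(i);
                    sb.append(' ').append(a.getNodeName()).append("=\"")
                      .append(escape(a.getNodeValue())).append('"');
                }
                sb.append('>');
                for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
                    append(c, sb);
                }
                sb.append("</").append(n.getNodeName()).append('>');
                break;
            }
            case Node.TEXT_NODE:
            case Node.CDATA_SECTION_NODE:
                sb.append(escape(n.getNodeValue()));
                break;
            default:
                // Comments, processing instructions, etc. are skipped in this sketch.
                break;
        }
    }

    // Re-escapes the characters that are special in XML text and attribute values.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) throws Exception {
        Document d = parse("<AmigoRequest><Param1><html><head>test</head>"
                + "<body>Testhtml</body></html></Param1></AmigoRequest>");
        System.out.println(innerXml(d.getElementsByTagName("Param1").item(0)));
    }
}
```

Unlike the substring approach, this works on the parsed tree, so it does not depend on how the Param1 tags are written in the raw message.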

Related

Java XML Read with WSIL file

At the moment I am trying to write a program that can read a link from an XML file. I use Jsoup; my current code is the following:
public static String XmlReader() {
    InputStream is = RestService.getInstance().getWsilFile();
    try {
        Document doc = Jsoup.parse(is, null, "", Parser.xmlParser());
        // TODO: extract the URL from doc and return it
    } catch (Exception e) {
        e.printStackTrace();
    }
    return null;
}
I would like to read the following part from a XML file:
<wsil:service>
<wsil:abstract>Read the full documentation on: https://host/sap/bc/mdrs/cdo?type=psm_isi_r&objname=II_QUERY_PROJECT_IN&saml2=disabled</wsil:abstract>
<wsil:name>Query Projects</wsil:name>
<wsil:description location="host/sap/bc/srt/wsdl/srvc_00163E5E1FED1EE897C188AB4A5723EF/wsdl11/allinone/ws_policy/document?sap-vhost=host&saml2=disabled" referencedNamespace="http://schemas.xmlsoap.org/wsdl/"/>
</wsil:service>
I want to return the following URL as String
host/sap/bc/srt/wsdl/srvc_00163E5E1FED1EE897C188AB4A5723EF/wsdl11/allinone/ws_policy/document?sap-vhost=host&saml2=disabled
How can I do that ?
Thank you
If there is only one tag wsil:description then you can use this code:
doc.outputSettings().escapeMode(EscapeMode.xhtml);
String val = doc.select("wsil|description").attr("location");
Escape mode should be changed, since you are not working on regular html, but xml.
If you have more than one tag with given name you can search for distinct neighbour element, and find required tag with respect to it:
String val = doc.select("wsil|name:contains(Query Projects)").first().parent().select("wsil|description").attr("location");
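If you would rather not depend on Jsoup, roughly the same lookup can be sketched with the JDK's own DOM parser. This is an alternative technique, not what the answer above uses. The class and method names are made up for illustration, and the namespace URI in the usage example is a placeholder. Note that a strict XML parser requires the ampersands in the location attribute to be written as &amp; in the file.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class WsilLocation {
    // Returns the location attribute of the first <description> element
    // in any namespace, or null if there is no such element.
    static String firstLocation(InputStream in) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed so the wsil: prefix is resolved
        Document doc = factory.newDocumentBuilder().parse(in);
        NodeList nodes = doc.getElementsByTagNameNS("*", "description");
        if (nodes.getLength() == 0) {
            return null;
        }
        return ((Element) nodes.item(0)).getAttribute("location");
    }

    public static void main(String[] args) throws Exception {
        // "urn:example:wsil" is a placeholder namespace URI for this sketch.
        String xml = "<wsil:inspection xmlns:wsil=\"urn:example:wsil\"><wsil:service>"
                + "<wsil:name>Query Projects</wsil:name>"
                + "<wsil:description location=\"host/doc?sap-vhost=host&amp;saml2=disabled\"/>"
                + "</wsil:service></wsil:inspection>";
        System.out.println(firstLocation(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
    }
}
```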

Parse XML result from Web service request Java

I have this XML result from a web service request. The tags that are inside the box are the ones that I need from the xml result.
Here's what I have so far:
private Node getMessageNode(QueryResponseQueryResult paramQueryResponseQueryResult, String[] paramArrayOfString)
{
MessageElement[] arrayOfMessageElement = paramQueryResponseQueryResult.get_any();
Document localDocument = null;
String res;
try
{
localDocument = arrayOfMessageElement[0].getAsDocument(); //result from the webservice
}
catch (Exception localException) {}
if (localDocument == null) {
return null;
}
Object localObject = localDocument.getDocumentElement();
localObject = Nodes.findChildByTags((Node)localObject, paramArrayOfString);
return localDocument; //This returns the XML above
}
How do I parse the result to return only those tags on the box and still return it as XML type?
Thanks in advance.
You can use XPath or XQuery to perform this task.
First get the document; then you can get the table child nodes using
getElementsByTagName("table"), or run an XPath expression on it.
Any good XPath tutorial will cover the expression syntax.
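As a rough sketch of the XPath route, the snippet below selects every table element from an already parsed Document using the JDK's javax.xml.xpath API. The element name table comes from the answer above; the sample XML, class name, and helper methods are invented for illustration. The selected nodes remain DOM nodes, so they can still be returned as XML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class XPathSelectDemo {
    // Hypothetical helper: parses an XML string into a DOM Document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Evaluates an XPath expression against the document and returns the matching nodes.
    static NodeList select(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<result><meta>x</meta>"
                + "<table><row>1</row></table><table><row>2</row></table></result>");
        NodeList tables = select(doc, "//table");
        System.out.println(tables.getLength()); // prints 2
    }
}
```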

Deserializing an XML element containing other xml markup as a single string using SimpleXML

I have been using SimpleXML for a while now to serialize my java objects, but
I am still learning and run into trouble sometimes. I have the following XML that
I want to deserialize:
<messages>
<message>
<text>
A communications error has occurred. Please try again, or contact administrator. Alternatively, please register.
</text>
</message>
I would like to process it such that the contents of the <text> element are treated as a single string and the anchor tags are ignored. I have no control over how this XML is generated; it is, as you can see, an error message from some server. How do I achieve this?
Many thanks in advance.
You might want to try escaping the text by importing:
import static org.apache.commons.lang.StringEscapeUtils.escapeHtml;
And using it as:
a.setWordCloudStringToDisplay(escapeHtml(wordcloud));
Simple XML does not directly support reading an element that mixes text and child elements; you have to use a Converter. See https://stackoverflow.com/questions/17462970/simpleframwork-xml-element-with-inner-text-and-child-elements, which answers much the same problem, except that it reads only one text segment.
Here is a solution that collects multiple text segments and hrefs into a single string.
First, I create an A class for the 'a' tag, with a toString method that prints the tag as it appears in the XML:
@Root(name = "a")
public class A {
@Attribute(required = false)
private String href;
@Text
private String value;
@Override
public String toString(){
return "<a href=\"" + href + "\">" + value + "</a>";
}
}
Then the Text class to read the 'text' element, where the converter is necessary:
@Root(name = "Text")
@Convert(Text.Parsing.class)
public class Text {
@Element
public String value;
private static class Parsing implements Converter<Text> {
// to read <a href...>
private final Serializer ser = new Persister();
@Override
public Text read(InputNode node) throws Exception {
Text t = new Text();
String s;
InputNode aref;
// read the beginning of the text (until the first xml tag)
s = node.getValue();
if (s != null) { t.value = s; }
// read first tag (return null if no more tag in the Text)
aref = node.getNext();
while (aref != null) {
// add to the value using toString() of A class
t.value = t.value + ser.read(A.class, aref);
// read the next part of text (after the xml tag, until the next tag)
s = node.getValue();
// add to the value
if (s != null) { t.value = t.value + s; }
// read the next tag and loop
aref = node.getNext();
}
return t;
}
@Override
public void write(OutputNode node, Text value) throws Exception {
throw new UnsupportedOperationException("Not supported yet.");
}
}
}
Note that I read the 'a' tag with a standard serializer, and add a toString method to the A class to get it back as an XML string. I have not found a way to read the 'a' tag directly as text.
And the main class (don't forget the AnnotationStrategy, which maps the @Convert annotation to the deserialization of the text element):
public class parseText {
public static void main(String[] args) throws Exception {
Serializer serializer = new Persister(new AnnotationStrategy());
InputStream in = ClassLoader.getSystemResourceAsStream("file.xml");
Text t = serializer.read(Text.class, in, false);
System.out.println("Texte : " + t.value);
}
}
When I use it with the following xml file :
<text>
A communications error has occurred. Please try again, or contact administrator.
Alternatively, please register.
</text>
It gives the following result:
Texte :
A communications error has occurred. Please try again, or contact administrator.
Alternatively, please register.
I hope this will help you to solve your problem.

What's wrong with this Java XML-Parsing code?

I'm trying to parse an XML file and be able to insert a path and get the value of the field.
It looks as follows:
import java.io.IOException;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.SAXException;
public class XMLConfigManager {
private Element config = null;
public XMLConfigManager(String file) {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
try {
Document domTree;
DocumentBuilder db = dbf.newDocumentBuilder();
domTree = db.parse(file);
config = domTree.getDocumentElement();
}
catch (IllegalArgumentException iae) {
iae.printStackTrace();
}
catch (ParserConfigurationException pce) {
pce.printStackTrace();
}
catch (SAXException se) {
se.printStackTrace();
}
catch (IOException ioe) {
ioe.printStackTrace();
}
}
public String getStringValue(String path) {
String[] pathArray = path.split("\\|");
Element tempElement = config;
NodeList tempNodeList = null;
for (int i = 0; i < pathArray.length; i++) {
if (i == 0) {
if (tempElement.getNodeName().equals(pathArray[0])) {
System.out.println("First element is correct, do nothing here (just in next step)");
}
else {
return "**This node does not exist**";
}
}
else {
tempNodeList = tempElement.getChildNodes();
tempElement = getChildElement(pathArray[i],tempNodeList);
}
}
return tempElement.getNodeValue();
}
private Element getChildElement(String identifier, NodeList nl) {
String tempNodeName = null;
for (int i = 0; i < nl.getLength(); i++) {
tempNodeName = nl.item(i).getNodeName();
if (tempNodeName.equals(identifier)) {
Element returner = (Element)nl.item(i).getChildNodes();
return returner;
}
}
return null;
}
}
The XML looks like this (for test purposes):
<?xml version="1.0" encoding="UTF-8"?>
<amc>
<controller>
<someOtherTest>bla</someOtherTest>
<general>
<spam>This is test return String</spam>
<interval>1000</interval>
</general>
</controller>
<agent>
<name>test</name>
<ifc>ifcTest</ifc>
</agent>
</amc>
Now I can call the class like this
XMLConfigManager xmlcm = new XMLConfigManager("myConfig.xml");
System.out.println(xmlcm.getStringValue("amc|controller|general|spam"));
Here, I'm expecting the value of the tag spam, so this would be "This is test return String". But I'm getting null.
I've tried to fix this for days now and I just can't get it. The iteration works so it gets to the tag spam, but then, just as I said, it returns null instead of the text.
Is this a bug, or am I just doing it wrong? Why? :(
Thank you very much for help!
Regards, Flo
You're calling Node.getNodeValue() - which is documented to return null when you call it on an element. You should call getTextContent() instead - or use a higher level API, of course.
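To make the difference concrete, here is a small sketch that parses a one-element document and compares the calls. The class name and the parseRoot helper are made up for this example; the element content is taken from the question.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class NodeValueDemo {
    // Hypothetical helper: parses an XML string and returns its root element.
    static Element parseRoot(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element spam = parseRoot("<spam>This is test return String</spam>");
        // An element node has no value of its own, so this prints "null".
        System.out.println(spam.getNodeValue());
        // The text lives in a child text node.
        System.out.println(spam.getFirstChild().getNodeValue());
        // getTextContent() concatenates the text of all descendants.
        System.out.println(spam.getTextContent());
    }
}
```

The second call only works when the first child happens to be the text node you want, which is why getTextContent() is usually the safer choice.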
As others mentioned before me, you seem to be reinventing the concept of XPath. You can replace your code with the following:
javax.xml.xpath.XPath xpath = javax.xml.xpath.XPathFactory.newInstance().newXPath();
String expression = "/amc/controller/general/spam";
org.xml.sax.InputSource inputSource = new org.xml.sax.InputSource("myConfig.xml");
String result = xpath.evaluate(expression, inputSource);
See also: XML Validation and XPath Evaluation in J2SE 5.0
EDIT:
An example of extracting a collection with XPath:
NodeList result = (NodeList) xpath.evaluate(expression, inputSource, XPathConstants.NODESET);
for (int i = 0; i < result.getLength(); i++) {
System.out.println(result.item(i).getTextContent());
}
The javax.xml.xpath.XPath interface is documented here, and there are a few more examples in the aforementioned article.
In addition, there are third-party libraries for XML manipulation, which you may find more convenient, such as dom4j (suggested by duffymo) or JDOM. Regardless of which library you use, you can leverage the quite powerful XPath language.
Because you're using getNodeValue() rather than getTextContent().
Doing this by hand is an accident waiting to happen; either use the built-in XPath solutions, or a third-party library as suggested by @duffymo. This is not a situation where re-invention adds value, IMO.
I'd wonder why you're not using a library like dom4j and built-in XPath. You're doing a lot of work with a very low-level API (W3C DOM).
Step through with a debugger and see what children that <spam> node has. You should quickly figure out why it's null. It'll be faster than asking here.

Ideal Java library for cleaning html, and escaping malformed fragments

I've got some HTML files that need to be parsed and cleaned, and they occasionally have content with special characters like <, >, ", etc. which have not been properly escaped.
I have tried running the files through jTidy, but the best I can get it to do is just omit the content it sees as malformed html. Is there a different library that will just escape the malformed fragments instead of omitting them? If not, any recommendations on what library would be easiest to modify?
Clarification:
Sample input: <p> blah blah <M+1> blah </p>
Desired output: <p> blah blah &lt;M+1&gt; blah </p>
You can also try TagSoup. TagSoup emits regular old SAX events so in the end you get what looks like a well-formed XML document.
I have had very good luck with TagSoup and I'm always surprised at how well it handles poorly constructed HTML files.
Ultimately I solved this by running a regular expression first and an unmodified TagSoup second.
Here is my regular expression code to escape unknown tags like <M+1>
private static String escapeUnknownTags(String input) {
Scanner scan = new Scanner(input);
StringBuilder builder = new StringBuilder();
while (scan.hasNext()) {
String s = scan.findWithinHorizon("[^<]*</?[^<>]*>?", 1000000);
if (s == null) {
builder.append(escape(scan.next(".*")));
} else {
processMatch(s, builder);
}
}
return builder.toString();
}
private static void processMatch(String s, StringBuilder builder) {
if (!isKnown(s)) {
String escaped = escape(s);
builder.append(escaped);
}
else {
builder.append(s);
}
}
private static String escape(String s) {
s = s.replaceAll("<", "&lt;");
s = s.replaceAll(">", "&gt;");
return s;
}
private static boolean isKnown(String s) {
Scanner scan = new Scanner(s);
if (scan.findWithinHorizon("[^<]*</?([^<> ]*)[^<>]*>?", 10000) == null) {
return false;
}
MatchResult mr = scan.match();
try {
String tag = mr.group(1).toLowerCase();
if (HTML.getTag(tag) != null) {
return true;
}
}
catch (Exception e) {
// Should never happen
e.printStackTrace();
}
return false;
}
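The same idea can be sketched more compactly with java.util.regex, as below. This is a simplified variant, not the code above: it uses a small hard-coded allow-list (the KNOWN set is a placeholder for this illustration) instead of javax.swing.text.html.HTML, and it does not handle a stray "<" that is never closed by ">".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnknownTagEscaper {
    // Placeholder allow-list; a real one would cover every tag you expect.
    private static final Set<String> KNOWN =
            new HashSet<>(Arrays.asList("p", "b", "i", "a", "div", "span", "br"));

    // Matches anything shaped like a tag; group 1 is the leading alphanumeric name.
    private static final Pattern TAG = Pattern.compile("</?([A-Za-z0-9]*)[^<>]*>");

    // Leaves known tags untouched and escapes the angle brackets of all others.
    static String escapeUnknownTags(String input) {
        Matcher m = TAG.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String tag = m.group();
            String name = m.group(1).toLowerCase(Locale.ROOT);
            String replacement = KNOWN.contains(name)
                    ? tag
                    : tag.replace("<", "&lt;").replace(">", "&gt;");
            // quoteReplacement keeps '$' and '\' in the tag from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeUnknownTags("<p> blah blah <M+1> blah </p>"));
    }
}
```

With the sample input from the question, the <p> and </p> tags pass through and only <M+1> is escaped.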
HTML cleaner
HtmlCleaner is open-source HTML parser written in Java. HTML found on
Web is usually dirty, ill-formed and unsuitable for further
processing. For any serious consumption of such documents, it is
necessary to first clean up the mess and bring the order to tags,
attributes and ordinary text. For the given HTML document, HtmlCleaner
reorders individual elements and produces well-formed XML. By default,
it follows similar rules that the most of web browsers use in order to
create Document Object Model. However, user may provide custom tag and
rule set for tag filtering and balancing.
Ok, I suspect it is this. Use the following code, it will help.
javax.swing.text.html.HTML
