Java XML Read with WSIL file - java

at the moment I am trying to program a program which is able to render a link of an xml-file. I use Jsoup, my current code is the following
public static String XmlReader() {
InputStream is = RestService.getInstance().getWsilFile();
try {
Document doc = Jsoup.parse(fis, null, "", Parser.xmlParser());
} catch (Exception e) {
e.printStackTrace();
return null;
}
}
}
I would like to read the following part from a XML file:
<wsil:service>
<wsil:abstract>Read the full documentation on: https://host/sap/bc/mdrs/cdo?type=psm_isi_r&objname=II_QUERY_PROJECT_IN&saml2=disabled</wsil:abstract>
<wsil:name>Query Projects</wsil:name>
<wsil:description location="host/sap/bc/srt/wsdl/srvc_00163E5E1FED1EE897C188AB4A5723EF/wsdl11/allinone/ws_policy/document?sap-vhost=host&saml2=disabled" referencedNamespace="http://schemas.xmlsoap.org/wsdl/"/>
</wsil:service>
I want to return the following URL as String
host/sap/bc/srt/wsdl/srvc_00163E5E1FED1EE897C188AB4A5723EF/wsdl11/allinone/ws_policy/document?sap-vhost=host&saml2=disabled
How can I do that ?
Thank you

If there is only one tag wsil:description then you can use this code:
doc.outputSettings().escapeMode(EscapeMode.xhtml);
String val = doc.select("wsil|description").attr("location");
Escape mode should be changed, since you are not working on regular html, but xml.
If you have more than one tag with given name you can search for distinct neighbour element, and find required tag with respect to it:
String val = doc.select("wsil|name:contains(Query Projects)").first().parent().select("wsil|description").attr("location");

Related

Parse XML result from Web service request Java

I have this XML result from a web service request. The tags that are inside the box are the ones that I need from the xml result.
Here's what I have so far:
private Node getMessageNode(QueryResponseQueryResult paramQueryResponseQueryResult, String[] paramArrayOfString)
{
MessageElement[] arrayOfMessageElement = paramQueryResponseQueryResult.get_any();
Document localDocument = null;
String res;
try
{
localDocument = arrayOfMessageElement[0].getAsDocument(); //result from the webservice
}
catch (Exception localException) {}
if (localDocument == null) {
return null;
}
Object localObject = localDocument.getDocumentElement();
localObject = Nodes.findChildByTags((Node)localObject, paramArrayOfString);
return localDocument; //This returns the XML above
}
How do I parse the result to return only those tags on the box and still return it as XML type?
Thanks in advance.
You can use Xpath of XQuery to perform this task.
You should get the document, and then you can get the child node of table using
getElementByTagName("table") or run XPath on it:
See here a good xpath tutorial.

how to read string which is written outside the html <> tags.?

I am having HTML code of 1000 lines and I wanted to extract the data which is written outside the HTML <> Tags.
for example..
<>Java Programm<>
It should read only "Java Programm" and escape whatever written inside the "<>" tags
I tried following code but it is reading whole data including <> but I do not need "<>" in my output.
public static void main(String[] args) throws Exception {
try {
FileInputStream fin = new FileInputStream("C:\\Users\\File.txt");
int i;
while ((i=fin.read())!=-1) {
System.out.print((char)i);
}
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
You would need an HTML parser. For JSoup it's
File input = new File("C:\\Users\\File.txt");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Element body = doc.body(); //Get the body of the html
System.out.println(body.text()) ; //Get the all the text inside the body tag
This is one way to do it. Simple enough :), there of course are other ways to do it. This ofcourse will leave the text outside of body tag. You can explore JSoup a here and find a solution for it.

JSoup core web text extraction

I am new to JSoup, Sorry if my question is too trivial.
I am trying to extract article text from http://www.nytimes.com/ but on printing the parse document
I am not able to see any articles in the parsed output
public class App
{
public static void main( String[] args )
{
String url = "http://www.nytimes.com/";
Document document;
try {
document = Jsoup.connect(url).get();
System.out.println(document.html()); // Articles not getting printed
//System.out.println(document.toString()); // Same here
String title = document.title();
System.out.println("title : " + title); // Title is fine
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
ok I have tried to parse "http://en.wikipedia.org/wiki/Big_data" to retrieve the wiki data, same issue here as well I am not getting the wiki data in the out put.
Any help or hint will be much appreciated.
Thanks.
Here's how to get all <p class="summary> text:
final String url = "http://www.nytimes.com/";
Document doc = Jsoup.connect(url).get();
for( Element element : doc.select("p.summary") )
{
if( element.hasText() ) // Skip those tags without text
{
System.out.println(element.text());
}
}
If you need all <p> tags, without any filtering, you can use doc.select("p") instead. But in most cases it's better to select only those you need (see here for Jsoup Selector documentation).

HTML Parser fetch link text

I'm using HTML Parser to fetch links from a web page. I need to store the URL, link text and the URL to the parent page containing the link. I have managed to get the link URL as well as the parent URL.
I still ned to get the link text.
link text
Unfortunately I'm having a hard time figuring it out, any help would be greatly appreciated.
public static List<LinkContainer> findUrls(String resource) {
String[] tagNames = {"A", "AREA"};
List<LinkContainer> urls = new ArrayList<LinkContainer>();
Tag tag;
String url;
String sourceUrl;
try {
for (String tagName : tagNames) {
Parser parser = new Parser(resource);
NodeList nodes = parser.parse(new TagNameFilter(tagName));
NodeIterator i = nodes.elements();
while (i.hasMoreNodes()) {
tag = (Tag) i.nextNode();
url = tag.getAttribute("href");
sourceUrl = tag.getPage().getUrl();
if (RegexUtil.verifyUrl(url)) {
urls.add(new LinkContainer(url, null, sourceUrl));
}
}
}
} catch (ParserException pe) {
pe.printStackTrace();
}
return urls;
}
Have you tried ((LinkTag) tag).getLinkText() ? Personally I prefer n html parser which produces XML according to a well used standard, e.g., xerces or similar. This is what you get from using e.g., http://nekohtml.sourceforge.net/.
You would need to check the children of each A Tag. If you assume that your A tags only have a single child (the text itself), you can use the getFirstChild() method. This should be an instance of TextNode, and you can call getText() on this to get the link text.

Android: Parsing XML DOM parser. Converting childnodes to string

Again a question. This time I'm parsing XML messages I receive from a server.
Someone thought to be smart and decided to place HTML pages in a XML message. Now I'm kind of facing problems because I want to extract that HTML page as a string from this XML message.
Ok this is the XML message I'm parsing:
<AmigoRequest>
<From></From>
<To></To>
<MessageType>showMessage</MessageType>
<Param0>general message</Param0>
<Param1><html><head>test</head><body>Testhtml</body></html></Param1>
</AmigoRequest>
You see that in Param1 a HTML page is specified. I've tried to extract the message the following way:
public String getParam1(Document d) {
if (d.getDocumentElement().getTagName().equals("AmigoRequest")) {
NodeList results = d.getElementsByTagName("Param1");
// Messagetype depends on what message we are reading.
if (results.getLength() > 0 && results != null) {
return results.item(0).getFirstChild().getNodeValue();
}
}
return "";
}
Where d is the XML message in document form.
It always returns me a null value, because getNodeValue() returns null.
When i try results.item(0).getFirstChild().hasChildNodes() it will return true because he sees there is a tag in the message.
How can i extract the html message <html><head>test</head><body>Testhtml</body></html> from Param0 in a string?
I'm using Android sdk 1.5 (well almost java) and a DOM Parser.
Thanks for your time and replies.
Antek
You could take the content of param1, like this:
public String getParam1(Document d) {
if (d.getDocumentElement().getTagName().equals("AmigoRequest")) {
NodeList results = d.getElementsByTagName("Param1");
// Messagetype depends on what message we are reading.
if (results.getLength() > 0 && results != null) {
// String extractHTMLTags(String s) is a function that you have
// to implement in a way that will extract all the HTML tags inside a string.
return extractHTMLTags(results.item(0).getTextContent());
}
}
return "";
}
All you have to do is to implement a function:
String extractHTMLTags(String s)
that will remove all HTML tag occurrences from a string.
For that you can take a look at this post: Remove HTML tags from a String
after checking a lot and scratching my head thousands of times I came up with simple alteration that it needs to change your API level to 8
EDIT: I just saw your comment above about getTextContent() not being supported on Android. I'm going to leave this answer up in case it's useful to someone who's on a different platform.
If your DOM API supports it, you can call getTextContent(), as follows:
public String getParam1(Document d) {
if (d.getDocumentElement().getTagName().equals("AmigoRequest")) {
NodeList results = d.getElementsByTagName("Param1");
// Messagetype depends on what message we are reading.
if (results != null) {
return results.getTextContent();
}
}
return "";
}
However, getTextContent() is a DOM Level 3 API call; not all parsers are guaranteed to support it. Xerces-J does.
By the way, in your original example, your check for null is in the wrong place; it should be:
if (results != null && results.getLength() > 0) {
Otherwise, you'd get a NPE if results really does come back as null.
Since getTextContent() isn't available to you, another option would be to write it -- it isn't hard. In fact, if you're writing this solely for your own use -- or your employer doesn't have overly strict rules about open source -- you could look at Apache's implementation as a starting point; lines 610-646 seem to contain most of what you need. (Please be respectful of Apache's copyright and license.)
Otherwise, some rough pseudocode for the method would be:
String getTextContent(Node node) {
if (node has no children)
return "";
if (node has 1 child)
return getTextContent(node.getFirstChild());
return getTextContent(new StringBuffer()).toString();
}
StringBuffer getTextContent(Node node, StringBuffer sb) {
for each child of node {
if (child is a text node) sb.append(child's text)
else getTextContent(child, sb);
}
return sb;
}
Well i was almost there with the code...
public String getParam1(Document d) {
if (d.getDocumentElement().getTagName().equals("AmigoRequest")) {
NodeList results = d.getElementsByTagName("Param1");
// Messagetype depends on what message we are reading.
if (results.getLength() > 0 && results != null) {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db;
Element node = (Element) results.item(0); // get the value of Param1
Document doc2 = null;
try {
db = dbf.newDocumentBuilder();
doc2 = db.newDocument(); //create new document
doc2.appendChild(doc2.importNode(node, true)); //import the <html>...</html> result in doc2
} catch (ParserConfigurationException e) {
// TODO Auto-generated catch block
Log.d(TAG, " Exception ", e);
} catch (DOMException e) {
// TODO: handle exception
Log.d(TAG, " Exception ", e);
} catch (Exception e) {
// TODO: handle exception
e.printStackTrace(); }
return doc2. .....// All I'm missing is something to convert a Document to a string.
}
}
return "";
}
Like explained in the comment of my code. All I am missing is to make a String out of a Document. You can't use the Transform class in Android... doc2.toString() will give you a serialization of the object..
But my next step is write my own parser if this doesnt work out ;)
Not the best code but a temponary solution.
public String getParam1(String b) {
return b
.substring(b.indexOf("<Param1>") + "<Param1>".length(), b.indexOf("</Param1>"));
}
Where String b is the XML document string.

Categories

Resources