I have store in a String variable(link) the url that I get the xml response, I use a dom to parse the xml data.
In order to be sure that I extract the data correctly I store the xml in the local drive, build my parser and I took the data:
document = builder.parse(new File(filepath));
So when I try to get it from url I used:
document = builder.parse(new URL(link).openStream());
And it didn't work. What am I missing?
The data of the xml are stored in a list which then are shown in a jsf datatable.
Well the above works just fine, the problem was the index of elements of the nodelist. For some reason when i was reading from file
obj.setattribute1(cDetails.item(1).getTextContent());
obj.setattribute2(cDetails.item(3).getTextContent());
see that the item are increased by 2 each time
now that i read a URL the increment is 1 every time
Now i am sure that there is a reason for this which i don't understand probably cause of my limited yet knowledge but the above work and the index of the item increases 1 for the next item in the nodelist.
Related
I'm calling a soap webservice from my java application.
I get response and I want to parse it and get data.
The problem is that field <tranData>, contains structure with >< instead of <>. How can I parse this document to get data from field <tranData>?
This is response structure:
<response>
<Portfolio>
<ID>1</ID>
<holder>2</holder>
</Portfolio>
<tranData> <responseOne><header><code>1</code></header></responseOne></tranData>
Please remember that, this is only a example of response, and the amount of data will be much bigger, so the solution should be fast.
What you show us is the actual document as it is received over the wire, right? So <tranData> contains an XML string that has been escaped to not interfere with the markup of the rest of the containing document.
When you read the content of the <tranData> element, the XML processor will 'unescape' the string and give you the 'original' value:
<responseOne><header><code>1</code></header></responseOne>
What you do with that value is a different story. You can parse it as yet another XML document and retrieve the value of the <code> element, or just pass the string along to some other processing step.
such as:
Document doc = Jsoup.parse(file,"UTF-8");
Elements eles = doc.getElementsByTag("style");
How can I get the lineNumber of eles[0] in the file?
There is no way for you to do it with Jsoup API. I have checked on their source code: org.jsoup.parser.Parser maintains no position information of the element in the original input.
Please, refer to sources on Grep Code
Provided that Jsoup is build for extracting and manipulating data I don't believe that they will have such feature in future as it is ambigous what element position is after manipulation and costly to maintain actual references.
There is no direct way. But there is an indirect way.
Once you find the point of interest like an attribute, simply add a token as html before the element, and write the file to another temporary file. The next step is do a search for the token, using text editing tools.
code is as follows.
Step-1:
// get an element
for (Element element : doc.getAllElements()) {
... some code to get attributes of element ...
String myAttr = attribute.getKey();
if (myAttr.equals("some-attribute-name-of-interest") {
System.out.println(attribute.getKey() + "::" + attribute.getValue());
element.before("<!-- My Special Token : ABCDEFG -->");
}
Step-2:
// write the doc back to a temporary file
// see: How to save a jsoup document as text file
Step-3:
The last step is search for "My Special Token : ABCDEFG" in the output file using a text editing tool.
jsoup is a nice library. I thought this would help others.
I am getting this output when trying to use Jsoup to extract text from Wikipedia:
I dont have enough rep to post pictures as I am new to this site but its basically like this:
[]{k[]q[]f[]d[]d etc..
Here is part of my code:
public static void scrapeTopic(String url)
{
String html = getUrl("http://www.wikipedia.org/" + url);
Document doc = Jsoup.parse(html);
String contentText = doc.select("*").first().text();
System.out.println(contentText);
}
It appears to get all the information but in the wrong format!
I appreciate any help given
Thanks in advance
Here are some suggestion for you. While fetching general webpage, which doesn't require HTTP header's field to be set like cookie, user-agent just call:
Document doc = Jsoup.connect("givenURL").get();
This function read the webpage using a GET request. When you are selecting element using *, it returns any element, that is all the element of the document. Hence, calling doc.select("*").first() is returning the #root element. Try printing it to see:
System.out.println(doc.select("*").first().tagName()); // #root
System.out.println(doc.select("*").first()); // will print the whole document,
System.out.println(doc); //print the whole document, the above action is pointless
System.out.println(doc.select("*").first()==doc);
// check whither they are equal, and it will print TRUE
I am assuming that you are just playing around to learn about this API, although selector is much powerful, but a good start should be trying general document manipulation function e.g., doc.getElementsByTag().
However, in my local machine, i was successful to fetch the Document and parsing it using your getURL() function !!
So I've gotten help from here already so I figured why not try it out again!? Any suggestions would be greatly appreciated.
I'm using HTTP client and making a POST request; the response is an XML body that looks like the following:
<?xml version="1.0" encoding="UTF-8"?>
<CartLink
xmlns="http://api.gsicommerce.com/schema/ews/1.0">
<Name>vSisFfYlAPwAAAE_CPBZ3qYh</Name>
<Uri>carts/vSisFfYlAPwAAAE_CPBZ3qYh</Uri>
</CartLink>
Now...
I have an HttpEntity which is
[HttpResponse].getEntity().
Then I get a String representation of the response (which is XML in this case) by saying
String content = EntityUtils.toString(HttpEntity)
I tried following some of the suggestions on this post: How to create a XML object from String in Java? but it did not seem to work for me. When I built up the document it still appeared to be null.
MY END GOAL here is just to get the NAME field.. i.e. the "vSisFfYlAPwAAAE_CPBZ3qYh" part. So do I want to build up a document and then extract it...? Or is there a simpler way? I've been trying different things and I can't seem to get it to work.
Thanks for all of the help guys, it is most appreciated!!
Instead of trying to extract the value with string manipulation, try to use Java's inbuilt ability to parse XML. That's a much better approach. Http Components returns responses in an XML format - there's a reason for that. :)
Here's probably one way to solve your problem:
// Parse the response using DocumentBuilder so you can get at elements easily
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
Document doc = docBuilder.parse(response);
Element root = doc.getDocumentElement();
// Now let's say you have not one, but 'n' nodes that contain the value
// you're looking for. Use NodeList to get a list of all those nodes and just
// pull out the tag/attribute's value you want.
NodeList nameNodesList = doc.getElementsByTagName("Name");
ArrayList<String> nameValues = null;
// Now iterate through the Nodelist to get the values you want.
for (int i=0; i<nameNodesList.getLength(); i++){
nameValues.add(nameNodesList.item(i).getTextContent());
}
The ArrayList "nameValues" will now hold every single value contained within "Name" tags. You could also create a HashMap to store a key value pair of Nodes and their respective text contents.
Hope this helps.
I want to validate an xml file against it schema. Once the validation is completed I want to remove any invalid data and save this invalid data into a new file. I can perfom the validation, just stuck on the removing and saving invalid data into new file.
I take back everything I just wrote. ... :) You can get the node you need using the Current Element Node property at Exception time, it seems.
Element curElement = (Element)validator.getProperty("http://apache.org/xml/properties/dom/current-element-node");
Because the Schema is defined via Xerces, I think this will work. See http://xerces.apache.org/xerces2-j/properties.html#dom.current-element-node .
There is more explanation in the answer at How can I get more information on an invalid DOM element through the Validator? .