Grabbing data from an XML stream using jsoup - java

Im trying to get the 'Done' from between these tags in an XML stream
<scan_run_status>Done</scan_run_status>
This is the code I'm using but it returns nothing.
I am picking up id and status in the same way no problem.
input1 contains the xml string
status = Jsoup.parse(input1).getAllElements().attr("scan_run_status");
System.out.println(status);
Thanks

Try this:
Document doc = Jsoup.parse(input1);
/* Grab all elements named "scan_run_sttus" */
Elements els = doc.select("scan_run_status");
/* If you need the first only ...here it is..*/
String status = els.first().text();
/* Otherwise you can loop
for (Element el: els) {
//.Do something
}
*/
and don't forget to do all the checks if els is empty etc. etc.
Here you can found all the tips about JSoup selector.

Related

Use JSoup to get all textual links

I'm using JSoup to grab content from web pages.
I want to get all the links on a page that have some contained text (it doesn't matter what the text is) just needs to be non-empty/image etc.
Example of links I want:
Link to Some Page
Since it contains the text "Link to Some Page"
Links I don't want:
<img src="someimage.jpg"/>
My code looks like this. How can I modify it to only get the first type of link?
Document document = // I get my document object
Elements linksOnPage = document.select("a[href]")
for (Element page : linksOnPage) {
String link = page.attr("abs:href");
// I do stuff with the link
}
You could do something like this.
It does it's job though it's probably not the fanciest solution out there.
Note: the function text() gets you a clean text so if there are any HTML code fragements inside it, it won't return them.
Document doc = // get the doc
Elements linksOnPage = document.select("a");
for (Element pageElem : linksOnPage){
String link = "";
if(pageElem.text().trim().equals(""))
continue;
// do smth with it
}
I am using this and it's working fine:
Document document = // I get my document object
Elements linksOnPage = document.select("a:matches(([^\\s]+))");
for (Element page : linksOnPage) {
String link = page.attr("abs:href");
// I do stuff with the link
}

Android - how to parse html by jsoup and fill into the arraylist?

I want to read the date from this HTML link:
http://jadvalbaz.blog.ir/post/%D8%B1%D8%A7%D9%87%D9%86%D9%85%D8%A7%DB%8C-%D8%AD%D9%84-%D8%AC%D8%AF%D9%88%D9%84-%D8%AD%D8%B1%D9%81-%D8%B0
if you look at the view-source
ذات اریه (پنومونی- سینه پهلو)ذر (مورچه ریز)ذرع (مقیاس طول)ذره ای بنیادی از رده هیبرونها که بار الکتریکی ندارد (لاندا)ذره منفی اتم (الکترون)ذریه (نسل)ذل (خواری)ذم (نکوهش)ذهاب (رفتن)ذی (صاحب)
my words are separated by <.br>, I want to read each word to ArrayList, I means how to omit the <.br> and read the words.
here is my code:
Document document = Jsoup.connect(url).get();
for (Element span : document.select("?").select("?")) {
title = span.toString();
name.add(title);
}
How to read them, what to put instead of question mark.
any suggestion?
edit the css of your template and define a class for your words then Use the Element.select(String selector) and Elements.select(String selector) method.
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Element masthead = doc.select("p.words").first(); // p with class=words
follow below link for more information about extracting data with this methods:
Use selector-syntax to find elements

adding text before and after a link jSoup

I've just stared learning Jsoup and the cookbook on their website but I'm just a bit stuck with addling text to an element I've parsed.
try{
Document doc = Jsoup.connect(url).get();
Element add = doc.prependText("a href") ;
Elements links = add.select("a[href]");
for (Element link : links) {
PrintStream sb = System.out.format("%n %s",link.attr("abs:href"));
System.out.print("<br>");
}
}
catch(Exception e){
System.out.print("error --> " + e);
}
Example run with google.com I get
http://www.google.ie/imghp?hl=en&tab=wi<br>
http://maps.google.ie/maps?hl=en&tab=wl<br>
https://play.google.com/?hl=en&tab=w8<br>
But I really want
<a href> http://www.google.ie/imghp?hl=en&tab=wi<br></a>
<a href> http://maps.google.ie/maps?hl=en&tab=wl<br></a>
<a href> https://play.google.com/?hl=en&tab=w8<br></a>
With this code I've gotten all the links off the page but I want to also get the and tags so I can them create my on webpage. I've tried adding a string and prepend text but just can't seem to get it right.
Thanks
with link.attr(...) you get the attribute value.
But you need the whole tag:
Document doc = Jsoup.connect(...).get();
for( Element e : doc.select("a[href]") ) // Select all 'a'-Tags with 'href' attribute
{
String wholeTag = e.toString(); // Get a string as the element is
/* No you you can use the html - in this example for a simple output */
System.out.println(wholeTag);
}

Jsoup Selector "not"

This code here:
Document doc = Jsoup.connect("http://wikitravel.org/en/San_Francisco").get();
System.out.println(doc.select("h2:contains(Get around) ~ *:not(h2:contains(See) ~ *)"));
outputs http://pastebin.com/gkcCfr1F. Is there a selector to make the "not" selector inclusive? Right now it is removing everything after the "see", when I'd like to remove the last h2 tag with the id="see" along with the everything else because I'm attempting to parse individual sections of a wiki.
The final output that I would like to obtain is: http://pastebin.com/ntpVrgui
I would do something like this :
Get the content div :
StringBuilder sb = new StringBuilder();
boolean start = false;
Document doc = Jsoup.connect("http://wikitravel.org/en/San_Francisco").get();
Elements content = doc.select("#content");
for (Element element : content) {
/*Pseudo code
if element is h3 and it contains span with id Navigating and if start is
false append it to stringbuilder, set start to true, else append everything in between until you reach h2 with span id See
*/
}

Using Jsoup to extract data

I am using jsoup to extract data from a table in a website.http://www.moneycontrol.com/stocks/marketstats/gainerloser.php?optex=BSE&opttopic=topgainers&index=-1 using Jsoup. I have referred to Using JSoup To Extract HTML Table Contents and other similar questions but it does not print the data. Could someone please provide me with the code required to achieve this?
public class TestClass
{
public static void main(String args[]) throws IOException
{
Document doc = Jsoup.connect("http://www.moneycontrol.com/stocks/marketstats/gainerloser.php?optex=BSE&opttopic=topgainers&index=-1").get();
for (Element table : doc.select("table.tablehead")) {
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
if (tds.size() > 6) {
System.out.println(tds.get(0).text() + ":" + tds.get(1).text());
}
}
}
If you want to get the content of table(not head), you need change the selector of table:
for (Element table : doc.select("table.tbldata14"))
instead of
for (Element table : doc.select("table.tablehead"))
One important thing is to check what are you getting in Doc when you parse the HTML because there might be few problems with it like:
1. The Site might be using iframes to display content
2. Display content via Javascript
3. few sites have scripts which does not allow jsoup parsing, hence the doc element will contain random data

Categories

Resources