Elements returns empty string - java

I am trying to scrape prices from a website with jSoup, but I only get an empty string.
I've tested my code with jSoup Online and I expect <meta itemprop="price" content="6,99"> to be printed when I use the following code:
Document doc = Jsoup.connect(URL).get();
Elements meta = doc.select("meta[itemprop=price]");
System.out.println("meta: " + meta.text());
price = meta.attr("content");
However, I just get an empty string and no error. What am I doing wrong here?
For the ones interested I am trying to scrape the price of this page

Try this:
Document doc = Jsoup.connect(URL).get();
Element meta = doc.select("meta[itemprop=price]").first();
System.out.println("meta: " + meta.text());
String price = meta.attr("content");
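As a minimal offline sketch of the difference between text() and attr(): a meta element has no text, so text() is empty even when the element is found, while attr("content") carries the price. This assumes jsoup 1.11 or later on the classpath (for selectFirst); the class name MetaPriceSketch and the inline HTML are made up for illustration. It also guards against a missing element, since the lookup returns null when nothing matches.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, for illustration only.
final class MetaPriceSketch {

    // Returns the content attribute of the first price meta tag,
    // or null if the page contains no such tag.
    static String readPrice(String html) {
        Document doc = Jsoup.parse(html);
        Element meta = doc.selectFirst("meta[itemprop=price]");
        return meta == null ? null : meta.attr("content");
    }

    public static void main(String[] args) {
        // Assumed sample markup, modeled on the tag quoted in the question.
        String html = "<html><head><meta itemprop=\"price\" content=\"6,99\">"
                + "</head><body></body></html>";

        Document doc = Jsoup.parse(html);
        // A meta element has no text node, so this prints an empty string.
        System.out.println("text: '" + doc.select("meta[itemprop=price]").text() + "'");
        // The price lives in the content attribute.
        System.out.println("price: " + readPrice(html));
        // No matching element: the helper returns null instead of throwing.
        System.out.println("missing: " + readPrice("<p>no price here</p>"));
    }
}
```

If the element is found offline but not on the live page, the server is probably returning different HTML, which is where the user agent comes in.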

The webserver you are trying to access needs another user agent string to respond with the info you want. Try this:
Document doc = Jsoup.connect(URL).userAgent("Mozilla/5.0").get();

Related

Use JSoup to get all textual links

I'm using JSoup to grab content from web pages.
I want to get all the links on a page that contain some text. It doesn't matter what the text is; it just needs to be non-empty (not only an image, etc.).
Example of links I want:
Link to Some Page
Since it contains the text "Link to Some Page"
Links I don't want:
<img src="someimage.jpg"/>
My code looks like this. How can I modify it to only get the first type of link?
Document document = // I get my document object
Elements linksOnPage = document.select("a[href]");
for (Element page : linksOnPage) {
    String link = page.attr("abs:href");
    // I do stuff with the link
}
You could do something like this.
It does its job, though it's probably not the fanciest solution out there.
Note: the function text() gets you clean text, so if there are any HTML fragments inside the link, it won't return them.
Document document = // get the document
Elements linksOnPage = document.select("a");
for (Element pageElem : linksOnPage) {
    // skip links without any text, e.g. image-only links
    if (pageElem.text().trim().isEmpty())
        continue;
    String link = pageElem.attr("abs:href");
    // do something with the link
}
I am using this and it's working fine:
Document document = // I get my document object
Elements linksOnPage = document.select("a:matches(([^\\s]+))");
for (Element page : linksOnPage) {
    String link = page.attr("abs:href");
    // I do stuff with the link
}
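Another way to sketch the same filter, assuming jsoup is on the classpath, is Element.hasText(), which reports whether an element contains any non-blank text. The class name TextLinkSketch, the sample HTML and the base URI below are made up for illustration; this is an alternative to the two approaches above, not a replacement for them.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, for illustration only.
final class TextLinkSketch {

    // Collects the absolute URLs of all links that contain non-blank text.
    static List<String> textLinks(String html, String baseUri) {
        // The base URI lets jsoup resolve relative hrefs via "abs:href".
        Document doc = Jsoup.parse(html, baseUri);
        List<String> result = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            if (link.hasText()) { // false for image-only links
                result.add(link.attr("abs:href"));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample markup: one textual link, one image-only link.
        String html = "<a href=\"/page\">Link to Some Page</a>"
                + "<a href=\"/img\"><img src=\"someimage.jpg\"/></a>";
        System.out.println(textLinks(html, "http://example.com/"));
    }
}
```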

Can't parse XML (from web) using JSoup

I am trying to work with small XML files sent from the web and parse a few attributes from them. How would I approach this in JSoup? I know it's an HTML parser rather than an XML parser, but it supports XML too, and I don't have to build any handlers, builder factories and such, as I would have to with DOM, SAX etc.
Here is an example XML: LINK. I can't paste it here because it exits the code tag after every line; if someone can fix that, I would be grateful.
And here is my piece of code:
String xml = "http://www.omdbapi.com/?t=Private%20Ryan&y=&plot=short&r=xml";
Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
// want to select first occurrence of genre tag though there is only one it
// doesn't work without .first() - but it doesn't parse it
Element genreFromXml = doc.select("genre").first();
String genre = genreFromXml.text();
System.out.println(genre);
It results in NPE at:
String genre = genreFromXml.text();
There are 2 issues in your code:
1. You provide a String representation of a URL where XML content is expected. Use the method parse(InputStream in, String charsetName, String baseUri, Parser parser) instead to parse your XML as an input stream.
2. There is no element genre in your XML; genre is an attribute of the element movie.
Here is what your code should look like:
String url = "http://www.omdbapi.com/?t=Private%20Ryan&y=&plot=short&r=xml";
// Parse the doc using an XML parser
Document doc = Jsoup.parse(new URL(url).openStream(), "UTF-8", "", Parser.xmlParser());
// Select the first element "movie"
Element movieFromXml = doc.select("movie").first();
// Get its attribute "genre"
String genre = movieFromXml.attr("genre");
// Print the result
System.out.println(genre);
Output:
Drama, War
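To see the element/attribute distinction without a network call, here is a small offline sketch. It assumes jsoup 1.11 or later (for selectFirst); the class name XmlAttributeSketch and the inline XML are made up and only modeled on the shape described above (a movie element with a genre attribute), not copied from the real response.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

// Hypothetical helper class, for illustration only.
final class XmlAttributeSketch {

    // Reads the genre attribute of the first movie element, or null if there is none.
    static String readGenre(String xml) {
        // Here the String really is XML content, so parse(String, ...) is the right overload.
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element movie = doc.selectFirst("movie");
        return movie == null ? null : movie.attr("genre");
    }

    public static void main(String[] args) {
        // Assumed sample XML, shaped like the answer describes.
        String xml = "<root response=\"True\">"
                + "<movie title=\"Saving Private Ryan\" genre=\"Drama, War\"/>"
                + "</root>";

        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        // There is no <genre> element, so this lookup yields null (the source of the NPE).
        System.out.println("genre element: " + doc.selectFirst("genre"));
        // The value is an attribute of <movie>.
        System.out.println("genre attribute: " + readGenre(xml));
    }
}
```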

Grabbing information from an html file

OK, I am trying to grab the data-title and href and assign them to variables in Java.
<tr class="pl-video yt-uix-tile " data-video-id="MBBWVgE0ewk" data-set-video-id="" data-title="Windows Command Line Tutorial - 1 - Introduction to the Command Prompt"><td class="pl-video-handle "></td><td class="pl-video-index"></td><td class="pl-video-thumbnail"><span class="pl-video-thumb ux-thumb-wrap contains-addto"><a href="/watch?v=MBBWVgE0ewk&index=1&list=PL6gx4Cwl9DGDV6SnbINlVUd0o2xT4JbMu"
If you don't mind including a dependency, there is a good library for this kind of thing called jsoup.
String html = ...
Document doc = Jsoup.parse(html);
Element tr = doc.select("tr").first();
Element link = tr.select("a").first();
String dataTitle = tr.attr("data-title");
String href = link.attr("href");
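A self-contained sketch of the same idea, assuming jsoup 1.11 or later (for selectFirst). The class name RowDataSketch is made up, and the sample markup is a shortened, hypothetical version of the row in the question. The sample wraps the row in a table element on purpose: as far as I know, an HTML parser drops a stray tr that appears outside a table, so a bare fragment may not match "tr" at all.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, for illustration only.
final class RowDataSketch {

    // Returns { data-title, href } for the first row carrying a data-title,
    // or null if the row or its link is missing.
    static String[] titleAndHref(String html) {
        Document doc = Jsoup.parse(html);
        Element row = doc.selectFirst("tr[data-title]");
        if (row == null) {
            return null;
        }
        Element link = row.selectFirst("a[href]");
        if (link == null) {
            return null;
        }
        return new String[] { row.attr("data-title"), link.attr("href") };
    }

    public static void main(String[] args) {
        // Assumed, shortened sample markup based on the question's snippet.
        String html = "<table><tr class=\"pl-video\" data-title=\"Windows Command Line Tutorial\">"
                + "<td><a href=\"/watch?v=MBBWVgE0ewk\">video</a></td></tr></table>";
        String[] result = titleAndHref(html);
        System.out.println("title: " + result[0]);
        System.out.println("href: " + result[1]);
    }
}
```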

adding text before and after a link jSoup

I've just started learning Jsoup and working through the cookbook on their website, but I'm a bit stuck with adding text to an element I've parsed.
try {
    Document doc = Jsoup.connect(url).get();
    Element add = doc.prependText("a href");
    Elements links = add.select("a[href]");
    for (Element link : links) {
        PrintStream sb = System.out.format("%n %s", link.attr("abs:href"));
        System.out.print("<br>");
    }
}
catch (Exception e) {
    System.out.print("error --> " + e);
}
On an example run with google.com I get
http://www.google.ie/imghp?hl=en&tab=wi<br>
http://maps.google.ie/maps?hl=en&tab=wl<br>
https://play.google.com/?hl=en&tab=w8<br>
But I really want
<a href> http://www.google.ie/imghp?hl=en&tab=wi<br></a>
<a href> http://maps.google.ie/maps?hl=en&tab=wl<br></a>
<a href> https://play.google.com/?hl=en&tab=w8<br></a>
With this code I've gotten all the links off the page, but I also want to get the <a href> and </a> tags so I can then create my own webpage. I've tried adding a string and prepending text but just can't seem to get it right.
Thanks
With link.attr(...) you get only the attribute value.
But you need the whole tag:
Document doc = Jsoup.connect(...).get();
for (Element e : doc.select("a[href]")) { // select all 'a' tags with an 'href' attribute
    String wholeTag = e.toString(); // get the element as an HTML string
    /* Now you can use the HTML; in this example, for simple output */
    System.out.println(wholeTag);
}
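An offline sketch of the same approach, assuming jsoup is on the classpath; the class name WholeTagSketch and the sample HTML are made up for illustration. It uses outerHtml(), which for an element should give the same string as toString(), and collects the tags into a list so they can be written into a new page later.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, for illustration only.
final class WholeTagSketch {

    // Returns the full markup (opening tag, content, closing tag) of every link.
    static List<String> wholeTags(String html) {
        Document doc = Jsoup.parse(html);
        List<String> tags = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            tags.add(link.outerHtml());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Assumed sample markup with two links.
        String html = "<p><a href=\"http://www.google.ie/imghp\">Images</a> "
                + "<a href=\"http://maps.google.ie/maps\">Maps</a></p>";
        for (String tag : wholeTags(html)) {
            // Each entry is a complete <a ...>...</a> element, followed here by a line break tag.
            System.out.println(tag + "<br>");
        }
    }
}
```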

How to extract Dynamic text from a webpage

I want to get some text from a web page that changes frequently. What technologies can I use for this? As an example, I want to extract a currency rate that changes every day from a web page and save it in a DB. Please let me know if anyone knows about this.
Thanks
You can use JSoup to parse the HTML.
Example:
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"
String linkOuterH = link.outerHtml(); // "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"
You can look for a particular div or other tag in the same way.
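For the currency-rate case, a rough sketch could look like the following. It assumes jsoup 1.11 or later (for selectFirst); the class name RateSketch, the element id usd-rate and the sample markup are all hypothetical, since the real page structure is not known. On a live page you would fetch the document with Jsoup.connect(url).get() instead of parsing a string, and then store the value in your DB.

```java
import java.math.BigDecimal;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, for illustration only.
final class RateSketch {

    // Reads the text of the first element matching cssQuery and parses it as a decimal.
    // Returns null if no element matches.
    static BigDecimal readRate(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element cell = doc.selectFirst(cssQuery);
        if (cell == null) {
            return null;
        }
        // BigDecimal avoids floating-point rounding when the value is stored later.
        return new BigDecimal(cell.text().trim());
    }

    public static void main(String[] args) {
        // Assumed sample markup; a real page will use different ids and classes.
        String html = "<div class=\"rates\"><div id=\"usd-rate\"> 1.0842 </div></div>";
        System.out.println("rate: " + readRate(html, "div#usd-rate"));
    }
}
```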
