I have a scenario where I need to pull the title from an img tag like the one below.
<img alt="Bear" border="0" src="/images/teddy/5433.gif" title="Bear"/>
I was able to get the image URL, but how do I get the title from the img tag?
In the example above, title = "Bear". That is the value I want to extract.
Use Element#attr() to extract arbitrary element attributes.
Element img = selectItSomehow();
String title = img.attr("title");
// ...
See also:
Jsoup Cookbook - Extract attributes, text, and HTML from elements
String html = "<img alt='Bear' border='0' src='/images/teddy/5433.gif' title='Bear'/>";
Document doc = Jsoup.parse(html);
Element e = doc.select("img[title]").first();
String title = e.attr("title");
System.out.println(title);
Related
I am trying to scrape prices from a website with jsoup, but I only get an empty string.
I've tested my code with jSoup Online and I expect <meta itemprop="price" content="6,99"> to be printed when I use the following code:
Document doc = Jsoup.connect(URL).get();
Elements meta = doc.select("meta[itemprop=price]");
System.out.println("meta: " + meta.text());
price = meta.attr("content");
However, I just get an empty string and no error. What am I doing wrong here?
For the ones interested I am trying to scrape the price of this page
Try this:
Document doc = Jsoup.connect(URL).get();
Element meta = doc.select("meta[itemprop=price]").first();
System.out.println("meta: " + meta.text());
String price = meta.attr("content");
The webserver you are trying to access needs another user agent string to respond with the info you want. Try this:
Document doc = Jsoup.connect(URL).userAgent("Mozilla/5.0").get();
OK, I am trying to grab the data-title and href and assign them to variables in Java.
<tr class="pl-video yt-uix-tile " data-video-id="MBBWVgE0ewk" data-set-video-id="" data-title="Windows Command Line Tutorial - 1 - Introduction to the Command Prompt"><td class="pl-video-handle "></td><td class="pl-video-index"></td><td class="pl-video-thumbnail"><span class="pl-video-thumb ux-thumb-wrap contains-addto"><a href="/watch?v=MBBWVgE0ewk&index=1&list=PL6gx4Cwl9DGDV6SnbINlVUd0o2xT4JbMu"
If you don't mind including a dependency, there is a good library for this kind of thing called jsoup.
String html = ...
Document doc = Jsoup.parse(html);
Element tr = doc.select("tr").first();
Element link = tr.select("a").first();
String dataTitle = tr.attr("data-title");
String href = link.attr("href");
I want to get the URL of the first image in an HTML String and then replace it with an empty String.
The images can be in these two forms in my String:
<img src="http://www.mywebsite.de/wp-content/uploads/2014/11/picture.jpg" alt="MyImage" width="635" height="311" class="aligncenter size-full wp-image-32729" />
<img class="aligncenter size-full wp-image-38590" src="http://www.mywebsite.de/wp-content/uploads/2014/11/picture2.jpg" alt="MyImage2" width="635" height="303" />
I want to extract the URL as String http://www.mywebsite.de/wp-content/uploads/2014/11/picture.jpg and replace it with an empty string.
At the moment I use this code to get the URL:
/**
* Method to get the URL of the first image
*/
public String getFirstImageURL(String description){
Document doc = Jsoup.parse(description);
Element imageElement = doc.select("img").first();
String absoluteUrl = imageElement.absUrl("src"); //absolute URL on src
//String srcValue = imageElement.attr("src"); // exact content value of the attribute.
return absoluteUrl;
}
This way I can retrieve the correct URL, but I cannot replace the complete HTML tag with an empty String. If I use
// Get description string
String imageURL = getFirstImageURL(HTMLString);
HTMLString = HTMLString.replaceAll(imageURL, "");
I still have <img src="" alt="MyImage" width="635" height="317" class="aligncenter size-full wp-image-13794" /> in the HTMLString.
Does anyone have an idea how I can completely remove the HTML tag?
SOLUTION
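/**
* Removes the first image from the description and returns the remaining HTML.
* The image URL is stored in imageURL, which is assumed to be a field declared elsewhere.
*/
public String getFirstImageURL(String description){
Document doc = Jsoup.parse(description);
Element imageElement = doc.select("img").first();
imageURL = imageElement.absUrl("src"); //absolute URL on src
imageElement.remove(); // removes the whole <img> tag from the document
description = doc.toString(); // full document, including the html/head/body wrapper jsoup adds
return description;
}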
From what I understand, you want to remove the entire <img> tag:
Element imageElement = doc.select("img").first();
imageElement.remove();
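To see why the string replacement in the question leaves the tag behind, here is a minimal sketch that uses only the standard library. The class name ReplaceUrlDemo and the method removeUrlOnly are made up for illustration, and the HTML is a shortened version of the question's markup. It uses String.replace rather than replaceAll, because replaceAll interprets its first argument as a regular expression.

```java
public class ReplaceUrlDemo {

    // Hypothetical helper: removes only the literal URL text from the HTML string.
    // String.replace matches literally, so dots in the URL are not regex wildcards.
    static String removeUrlOnly(String html, String url) {
        return html.replace(url, "");
    }

    public static void main(String[] args) {
        String url = "http://www.mywebsite.de/wp-content/uploads/2014/11/picture.jpg";
        String html = "<p>Intro</p><img src=\"" + url + "\" alt=\"MyImage\" />";

        // Only the attribute value disappears; the <img> tag itself is untouched.
        System.out.println(removeUrlOnly(html, url));
        // prints <p>Intro</p><img src="" alt="MyImage" />
    }
}
```

Replacing the URL can only ever empty the src attribute, which is why removing the element through the parsed document, as shown above, is the simpler route.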
I am trying to parse the HTML but I am getting a NullPointerException. I want to extract the image URI from the HTML below.
String html = "<div class=\"thumb-box thumb-160\"><a class=\"mimg\" data-id=\"1394085169856_6744\" href=\"#\"><img class=\"thumb\" src=\"http://i.ytimg.com/vi/u7deClndzQw/hqdefault.jpg\" style=\"top: -15px;\"><span class=\"btn\"></span></a></div>";
Document document = Jsoup.parse(html);
Element element = document.select("div.thumb-box thumb-160").first();
System.out.println(element.select("img").attr("src"));
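The problem is the selector in this line:
Element element = document.select("div.thumb-box thumb-160").first();
You have to put a . (dot) in front of every class name, with no space in between. With the space, thumb-160 is read as a descendant tag name, nothing matches, and first() returns null.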
Element element = document.select("div.thumb-box.thumb-160").first();
Besides, it is rather straightforward to select the anchor directly like this:
Element element = document.select("div.thumb-box.thumb-160:eq(0) a").first();
This would get you the anchor element directly.
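Putting the corrected selector together with a null check, a minimal sketch could look like the following. It assumes jsoup is on the classpath. The class name ThumbSrc and the method firstThumbSrc are hypothetical, and returning an empty string when nothing matches is just one possible choice.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ThumbSrc {

    // Hypothetical helper: returns the src of the first <img> inside a div that
    // carries both classes, or an empty string if no such image exists.
    static String firstThumbSrc(String html) {
        Document document = Jsoup.parse(html);
        // Both class names are chained with dots; the space then selects a descendant <img>.
        Element img = document.select("div.thumb-box.thumb-160 img").first();
        // first() returns null when the selection is empty, so guard before using it.
        return img == null ? "" : img.attr("src");
    }

    public static void main(String[] args) {
        String html = "<div class=\"thumb-box thumb-160\"><a class=\"mimg\" href=\"#\">"
                + "<img class=\"thumb\" src=\"http://i.ytimg.com/vi/u7deClndzQw/hqdefault.jpg\">"
                + "<span class=\"btn\"></span></a></div>";
        System.out.println(firstThumbSrc(html));
    }
}
```

The null check matters because a selector that matches nothing does not throw by itself; the exception only appears when the null result is used.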
I want to get some text from web pages that change frequently. What technologies can I use for this? As an example, I want to extract a currency rate that changes every day from a web page and save it in a DB. Please let me know if anyone knows how to do this.
Thanks
You can use JSoup to parse the HTML.
Example :
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"
String linkOuterH = link.outerHtml();
// "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"
You can look for a particular div or any other tag the same way; check the example above.
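For the currency rate case, a rough sketch of selecting a value inside a particular div might look like this. It assumes jsoup is on the classpath. The markup, the id rates, the class rate, and the names RateExtractor and extractRate are all invented for illustration; a real page will need its own selector, and a live page would be fetched with Jsoup.connect(url).get() instead of parsed from a string.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RateExtractor {

    // Hypothetical helper: returns the text of the first span.rate inside div#rates,
    // or null if the page does not contain such an element.
    static String extractRate(String html) {
        Document doc = Jsoup.parse(html);
        Element rate = doc.select("div#rates span.rate").first();
        return rate == null ? null : rate.text();
    }

    public static void main(String[] args) {
        // Made-up markup standing in for the real page.
        String html = "<div id='rates'><span class='currency'>USD/EUR</span> "
                + "<span class='rate'>0.92</span></div>";
        System.out.println(extractRate(html)); // prints 0.92
    }
}
```

The returned string can then be parsed and written to the database with whatever persistence layer is already in use.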