Extract Data from HTML using JSoup

I am writing a script to extract data from an HTML document. Here is part of the document:
<div class="info">
<div id="info_box" class="inf_clear">
<div id="restaurant_info_box_left">
<table id="rest_logo">
<tr>
<td>
<a itemprop="url" title="XYZ" href="XYZ.com">
<img src="/files/logo/26721.jpg" alt="XYZ" title="XYZ" width="100" />
</a>
</td>
</tr>
</table>
<h1 id="Name"><a class="fn org url" rel="Order Online" href="XYZ.com" title="XYZ" itemprop="name">XYZ</a></h1>
<div class="rest_data" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="telephone">(305) 535-1379</span> | <b>
<span itemprop="streetAddress">1755 Alton Rd</span>,
<span itemprop="addressLocality">Miami Beach</span>,
<span itemprop="addressRegion">FL</span>
<span itemprop="postalCode">33139</span></b>
</div>
<div class="geo">
<span class="latitude" title="25.792588"></span>
<span class="longitude" title="-80.141214"></span>
</div>
<div class="rest_data">Estimated delivery time: <b>45-60 min</b></div>
</div>
</div>
I am using Jsoup and am not quite sure how to achieve this.
There are many div tags in the document, and I try to match them by their unique attributes.
Say, for the div tag whose class attribute value is "info":
Elements divs = doc.select("div");
for (Element div : divs) {
    String divClass = div.attr("class");
    if (divClass.equalsIgnoreCase("info")) {
        // matched: look inside this div only
    }
}
If it matches, I have to get the table with id "rest_logo" inside that div tag.
When doc.select("table") is used, it looks like the parser searches the entire document.
What I need is this: once the div tag's attribute matches, fetch the elements and attributes inside the matched div tag only.
Expected Output:
Name : XYZ
telephone:(305) 535-1379
streetAddress:1755 Alton Rd
addressLocality:Miami Beach
addressRegion:FL
postalCode:33139
latitude:25.792588
longitude:-80.141214
Estimated delivery time:45-60 min
Any ideas?

for (Element e : doc.select("div.info")) {
System.out.println("Name: " + e.select("a.fn").text());
System.out.println("telephone: " + e.select("span[itemprop=telephone]").text());
System.out.println("streetAddress: " + e.select("span[itemprop=streetAddress]").text());
// .....
}
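
A fuller sketch of the same idea, covering the geo and delivery-time fields that the snippet above leaves out. It assumes the markup from the question; the class name RestaurantInfo, the helper extract, and the shortened sample HTML are made up for illustration. Each select() is called on the matched div.info element, so only its descendants are searched.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RestaurantInfo {

    // Collects the fields from the first div.info. Every select() below is
    // called on that element, so nothing outside it can match.
    static Map<String, String> extract(String html) {
        Map<String, String> out = new LinkedHashMap<>();
        Document doc = Jsoup.parse(html);
        Element info = doc.select("div.info").first();
        if (info == null) {
            return out; // no matching container in this document
        }
        out.put("Name", info.select("h1 a").text());
        // telephone, streetAddress, postalCode, ... are spans carrying itemprop
        for (Element span : info.select("span[itemprop]")) {
            out.put(span.attr("itemprop"), span.text());
        }
        // latitude and longitude keep their values in the title attribute
        for (Element span : info.select("div.geo span")) {
            out.put(span.className(), span.attr("title"));
        }
        // the delivery time is the bold text of the last div.rest_data
        Element lastRestData = info.select("div.rest_data").last();
        if (lastRestData != null) {
            out.put("Estimated delivery time", lastRestData.select("b").text());
        }
        return out;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical version of the question's markup.
        String html = "<div class=\"info\">"
                + "<h1 id=\"Name\"><a class=\"fn org url\" href=\"XYZ.com\">XYZ</a></h1>"
                + "<div class=\"rest_data\"><span itemprop=\"telephone\">(305) 535-1379</span>"
                + "<span itemprop=\"postalCode\">33139</span></div>"
                + "<div class=\"geo\"><span class=\"latitude\" title=\"25.792588\"></span></div>"
                + "<div class=\"rest_data\">Estimated delivery time: <b>45-60 min</b></div>"
                + "</div>"
                + "<span itemprop=\"telephone\">outside div.info, ignored</span>";
        extract(html).forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```

Returning a map rather than printing directly keeps the scoping logic separate from the output format, which may make it easier to reuse if the page holds several div.info blocks.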

Here's how I would do it:
Document doc = Jsoup.parse(myHtml);
// Collect every element of interest inside div.info in a single query.
Elements elements = doc.select("div.info")
        .select("a[itemprop=url], span[itemprop=telephone], span[itemprop=streetAddress], span[itemprop=addressLocality], span[itemprop=addressRegion], span[itemprop=postalCode], span.latitude, span.longitude");
// The last div.rest_data holds the delivery time. It is a descendant of
// div.info, not a direct child, so the child combinator (>) would not match it.
elements.add(doc.select("div.info div.rest_data").last());
for (Element e : elements) {
    if (e.attr("itemprop").equals("url")) {
        // The link's title attribute carries the restaurant name.
        System.out.println("name: " + e.attr("title"));
    } else if (e.hasAttr("itemprop")) {
        System.out.println(e.attr("itemprop") + ": " + e.text());
    } else if (e.hasClass("latitude") || e.hasClass("longitude")) {
        System.out.println(e.className() + ": " + e.attr("title"));
    } else if (e.hasClass("rest_data")) {
        System.out.println(e.text());
    }
}
(Note: I wrote this on my phone, so it is untested and may contain typos, but it should work.)
A bit of explanation: First get all the desired elements via doc.select(...), and then extract the desired data from each one.
Let me know if it works.

Probably the main thing to realise is that an element with an id can be selected directly - no need to loop through a collection of elements searching for it.
I've not used JSoup and my Java is very rusty but here goes ...
// 1. Select elements from the document
Element container = doc.getElementById("restaurant_info_box_left"); // element with id="restaurant_info_box_left"
Element h1 = container.select("h1").first(); // first h1 element in container
Elements restData = container.select(".rest_data"); // all elements in container with class="rest_data"
Element restData_0 = restData.get(0); // first rest_data div (address block)
Element restData_1 = restData.get(1); // second rest_data div (delivery time)
Elements restData_0_spans = restData_0.select("span"); // spans of the first rest_data div
Element geo = container.select(".geo").first(); // first .geo div in container
Elements geo_spans = geo.select("span"); // spans of the .geo div
// 2. Compose output
// h1 text
System.out.println("Name: " + h1.text());
// restData_0_spans text
for (Element span : restData_0_spans) {
    System.out.println(span.attr("itemprop") + ": " + span.text());
}
// geo data
for (Element span : geo_spans) {
    System.out.println(span.attr("class") + ": " + span.attr("title"));
}
// restData_1 text
System.out.println(restData_1.text());
For someone used to JavaScript/jQuery, this all seems very laboured. With luck it may simplify somewhat.

Related

JSoup, count elements after a specific tag (h3)

I need to count all the p elements that follow the h3[id=hm_2] tag.
Is there a way to accomplish that?
My code below is not working; the result should be 3.
Many thanks in advance.
Here is my code:
for (Element tag : doc.select("div.archive-style-wrapper")) {
Elements headCat1 = tag.getElementsByTag("h3");
for (Element headline : headCat1) {
Elements importantLinks = headline.getElementsByTag("p");
Log.w("result", String.valueOf(importantLinks.size()));
}
}
HTML code piece involved:
<h3 class="a-header--3" id="hm_2">Some text</h3>
<p class="a-paragraph"><img src="data:image/gif;base64,R0lGODlhAQ"</img></p>
<p class="a-paragraph">This new event quest, brought to us by</p>
<p class="a-paragraph">This new event quest, brought to us by</p>
<div class="imagelink"... </div>

You can use the general sibling selector E ~ F, which matches an F element preceded by a sibling E, for example h1 ~ p. The Jsoup Selector documentation (Selector.html) describes the full selector syntax.
String html = "<h3 class=\"a-header--3\" id=\"hm_2\">Some text</h3>\n"
+ " <p class=\"a-paragraph\"><img src=\"data:image/gif;base64,R0lGODlhAQ\"</img></p>\n"
+ " <p class=\"a-paragraph\">This new event quest, brought to us by</p>\n"
+ " <p class=\"a-paragraph\">This new event quest, brought to us by</p>\n"
+ " <div class=\"imagelink\"... </div>"
+ " <p class=\"a-paragraph\">This shall not be included</p>\n";
Document doc = Jsoup.parseBodyFragment(html);
Elements paragraphs = doc.select("h3[id=hm_2] ~ p");
System.out.println(paragraphs.size());
paragraphs.forEach(System.out::println);
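
Note that ~ matches every later p sibling, not only the run directly after the heading. If only the paragraphs that immediately follow the h3 are wanted, one option is to walk the siblings by hand. A minimal sketch, assuming well-formed HTML; the class name SiblingCount, the helper countFollowingParagraphs, and the sample markup are made up for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SiblingCount {

    // Counts the <p> elements that directly follow the first element matched
    // by headingSelector, stopping at the first sibling that is not a <p>.
    static int countFollowingParagraphs(Document doc, String headingSelector) {
        Element heading = doc.select(headingSelector).first();
        if (heading == null) {
            return 0; // no such heading in the document
        }
        int count = 0;
        Element next = heading.nextElementSibling();
        while (next != null && next.tagName().equals("p")) {
            count++;
            next = next.nextElementSibling();
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<h3 id=\"hm_2\">Some text</h3>"
                + "<p>one</p><p>two</p><p>three</p>"
                + "<div class=\"imagelink\"></div>"
                + "<p>not counted</p>";
        Document doc = Jsoup.parseBodyFragment(html);
        System.out.println(countFollowingParagraphs(doc, "h3#hm_2")); // 3
        System.out.println(doc.select("h3#hm_2 ~ p").size());         // 4
    }
}
```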

Exclude Certain Child Nodes from XPath Result with Single Call to element.getText()?

I want to extract content matching the XPath .//*[contains(@class, 'post-content')]. However, I wish to exclude these child nodes:
1) Containing text: P3 or AP
2) Div containing id = bottom
3) Form containing label with text: Get email updates
I have the following HTML:
<div class="td-post-content">
<p>P1</p>
<p>P2</p>
<p>P3</p>
<p>P4</p>
<p>P5</p>
<p>AP</p>
<div id="td-a-rec bottom"> </div>
<form action="https://example.com/subscribe method=" post " id="subscribe-form " name="subscribe-form " class="validate " target="_blank " novalidate=" ">
<div id="signup_scroll ">
<label for="mce-EMAIL ">Get email updates from..</label>
<input type="email " value=" " name="EMAIL " class="email " id="EMAIL " placeholder="email address " required=" ">
<div style="position: absolute; left: -5000px; " aria-hidden="true "><input type="text " name="b_11 " tabindex="-1 " value=" "></div>
<div class="clear "><input type="submit " value="Subscribe " name="subscribe " id="-subscribe " class="button "></div>
</div>
</form>
</div>
I am able to achieve this with XPath predicates such as [not(contains(@id,'bottom'))], [not(contains(text(),'P3'))], [not(contains(text(),'AP'))], etc. However, the main issue is that instead of matching all desired child elements as a single element, the expression now matches each element separately and returns a list of WebElements.
Right now the only way to extract the desired text is to iterate through the WebElement list and concatenate the results into a single String.
Is it possible to directly scrape all desired content in one shot (with a single call to element.getText()) without iterating through the element list?
Thanks

From your description, it looks like all you want is the text from the P tags with a couple of exclusions. The CSS selector div.td-post-content > p will get you all the P tags including the ones you want to exclude. You can gather those into a list and then remove the text you want to exclude to give you the final list.
List<WebElement> ps = driver.findElements(By.cssSelector("div.td-post-content > p"));
List<String> text = ps.stream().map(e -> e.getText()).collect(Collectors.toList());
text.remove("AP");
text.remove("P3");
System.out.println(text);
Running this prints
[P1, P2, P4, P5]
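
If the goal is a single string rather than a list, the filtered texts can be joined once the exclusions are applied. A minimal sketch in plain Java, with a hard-coded list standing in for the texts returned by getText(); the class name JoinParagraphs and the helper joinExcluding are made up for illustration, and the exclusion is an exact match, as with text.remove above:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class JoinParagraphs {

    // Drops every text that exactly equals one of the excluded values and
    // joins the remaining texts with newlines.
    static String joinExcluding(List<String> texts, Set<String> excluded) {
        return texts.stream()
                .filter(t -> !excluded.contains(t))
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // Stand-in for ps.stream().map(e -> e.getText()) from the snippet above.
        List<String> texts = Arrays.asList("P1", "P2", "P3", "P4", "P5", "AP");
        Set<String> excluded = new HashSet<>(Arrays.asList("P3", "AP"));
        System.out.println(joinExcluding(texts, excluded));
    }
}
```

This still iterates over the element list internally; it only moves the loop into a stream pipeline so the caller ends up with one String.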

How to fetch data from ul li in android studio with jsoup

I am trying to get Product 1 and Product 2, but I can't. Please help.
I am using jsoup and Volley.
<ul id="searched-products">
<li>
<div class="gd-col navUnitContainer1 gu4">
<div class="product_name">
<a>Prodict 1</a>
</div>
</div>
</li>
<li>
<div class="gd-col navUnitContainer1 gu4">
<div class="product_name">
<a>Prodict 2</a>
</div>
</div>
</li>
</ul>
I have tried this:
Elements itemElements = doc.select("ul#searched-products li");
but it's not selecting the li elements. I have also tried this:
Elements itemElements = doc.select("ul#searched-products"); // this line works
Element e1 = itemElements.get(0);
e1.select("li"); // or e1.getElementsByTag("li");
Still no good.
There are hundreds of li elements in the document, so I can't do this:
doc.select("li");
Kindly suggest something.

Like this:
public class JsoupList {
public static void main(String[] brawwwr){
String html = "<ul id=\"searched-products\">" +
"<li>" +
"<div class=\"gd-col navUnitContainer1 gu4\">" +
"<div class=\"product_name\">" +
"<a>Prodict 1</a>" +
"</div>" +
"</div>" +
"</li>" +
"<li>" +
"<div class=\"gd-col navUnitContainer1 gu4\">" +
"<div class=\"product_name\">"+
"<a>Prodict 2</a>" +
"</div>" +
"</div>" +
"</li>" +
"</ul>";
Document doc = Jsoup.parse(html);
Elements itemElements = doc.select("ul#searched-products li");
for(Element elem : itemElements){
System.out.println(elem.select("div div a").text());
}
}
}
This prints:
Prodict 1
Prodict 2
You can think of each repeated block (here, each li) as a little page of its own and run selectors against it.
Regards

Try this code.
Elements itemElements = doc.select("ul#searched-products");
itemElements = itemElements.select("li");
for(Element ele : itemElements){
String text = ele.text();
System.out.println(text); // prints Prodict 1 and Prodict 2
}
// Or select the <a> inside each <li>
for(Element ele : itemElements){
String text = ele.select("a").first().text();
System.out.println(text); // also prints Prodict 1 and Prodict 2
}

To exclude <li> or <a> tags outside the list, you need to restrict the selector so that it matches only inside the list. The best way is to use the ID (#searched-products). Then select <li> or <a> tags from the selected <ul> element rather than from the whole doc.
You can get your text with any of the following selectors (not a complete list):
#searched-products li a
#searched-products a
#searched-products .product_name a
#searched-products .product_name
Even the last one is okay, since you need only the text, and div.product_name contains only the <a> tag.
for(Element e: doc.select("#searched-products .product_name")) {
String t = e.text(); // Prodict N
}
By the way, your original approach of selecting <li> tags inside ul#searched-products should have worked. If it doesn't return anything, the list might be generated dynamically on that page. You can test this easily by printing out the HTML that Jsoup has (doc.html() or doc.select("#searched-products").html()).
If that really is the case, Jsoup is not the right tool for you. I suggest using Selenium, possibly with a headless browser (HtmlUnit or PhantomJS). They can return and even interact with dynamically created elements, so other parts of your crawl process might be simplified too.

Select specific div class in Jsoup

I get problems when I parse a specific div's class.
<div class="box_3 box_3a">
<div class="title_new_2"></div>
<div class="list_indeks_2"></div>
</div>
I have tried to select <div class="list_indeks_2"></div> with jsoup as follows:
Elements links = doc.select(".list_indeks_2");
However, this code didn't work, and I suspect the reason is that the div's class contains underscores (_). How does one handle an underscore (_) in the jsoup select method?

Try to access the element based on the attribute.
The snippet was tested with JSoup version 1.8.1.
Document doc = Jsoup.parse(
"<div class=\"box_3 box_3a\">\n"
+ " <div class=\"title_new_2\">some title</div>\n"
+ " <div class=\"list_indeks_2\">some index</div>\n"
+ "</div>");
Elements rows = doc.getElementsByAttributeValue("class", "list_indeks_2");
System.out.println("rows.size() = " + rows.size());
String index = "";
for (Element span : rows) {
index = span.text();
}
System.out.println("index = " + index);
This produces the following output:
rows.size() = 1
index = some index
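
Underscores are legal in CSS class names, and jsoup's class selector accepts them, so an empty result from select(".list_indeks_2") probably has another cause (for instance, the fetched page not containing that element). A minimal sketch using the markup from the question; the class name UnderscoreClass and the helper countBySelector are made up for illustration. It also shows one difference between the two approaches: the class selector matches any one of several classes, while getElementsByAttributeValue compares the whole attribute value.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class UnderscoreClass {

    // Parses the HTML and returns how many elements the CSS selector matches.
    static int countBySelector(String html, String selector) {
        return Jsoup.parse(html).select(selector).size();
    }

    public static void main(String[] args) {
        String html = "<div class=\"box_3 box_3a\">"
                + "<div class=\"title_new_2\">some title</div>"
                + "<div class=\"list_indeks_2\">some index</div>"
                + "</div>";
        // The class selector works with underscores.
        System.out.println(countBySelector(html, ".list_indeks_2")); // 1
        // It also matches one class out of several on the same element.
        System.out.println(countBySelector(html, "div.box_3"));      // 1
        // The attribute lookup compares the full value "box_3 box_3a".
        Document doc = Jsoup.parse(html);
        System.out.println(doc.getElementsByAttributeValue("class", "box_3").size()); // 0
    }
}
```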

Jsoup parse HTML including span tags

I have a HTML with the following format
<article class="cik" id="100">
<a class="ci" href="/abc/1001/STUFF">
<img alt="Micky Mouse" src="/images/1001.jpg" />
<span class="mick vtEnabled"></span>
</a>
<div>
Micky Mouse
<span class="FP">$88.00</span> <span class="SP">$49.90</span>
</div>
</article>
In the above code, the a tag inside the article has a span with class="mick vtEnabled" and no label. I want to check whether a span tag with this class name is present within the article tag. How do I do that? I tried select("> a[href] > span.mick vtEnabled") and checked the size, but it remains 0 for all the article tags, regardless of whether the span is present. Any input?

Starting from individual article tags would be good:
final String test = "<article class=\"cik\" id=\"100\"><a class=\"ci\" href=\"/abc/1001/STUFF\"><img alt=\"Micky Mouse\" src=\"/images/1001.jpg\" /></a><div>Micky Mouse<span class=\"FP\">$88.00</span> <span class=\"SP\">$49.90</span></div></article>";
final Elements articles = Jsoup.parse(test).select("article");
for (final Element article : articles) {
final Elements articleImages = article.select("> a[href] > img[src]");
for (final Element image : articleImages) {
System.out.println(image.attr("src"));
}
final Elements articleLinks = article.select("> div > a[href]");
for (final Element link : articleLinks) {
System.out.println(link.attr("href"));
System.out.println(link.text());
}
final Elements articleFPSpans = article.select("> div > span.FP");
for (final Element span : articleFPSpans) {
System.out.println(span.text());
}
final Elements articleSPSpans = article.select("> div > span.SP");
for (final Element span : articleSPSpans) {
System.out.println(span.text());
}
}
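This prints the following (the "> div > a[href]" loop prints nothing here, because the only link in the test string is a direct child of the article, not of the div):
/images/1001.jpg
$88.00
$49.90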
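
As for the check in the question: the selector span.mick vtEnabled contains a space, so it reads as "a vtEnabled element inside span.mick" and matches nothing. To require both classes on the same span, chain them as span.mick.vtEnabled. A minimal sketch, with a made-up class name SpanCheck, a made-up helper hasMarkerSpan, and a shortened sample with two articles:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class SpanCheck {

    // True if the article contains a link whose child span carries both
    // the "mick" and the "vtEnabled" class.
    static boolean hasMarkerSpan(Element article) {
        return !article.select("a > span.mick.vtEnabled").isEmpty();
    }

    public static void main(String[] args) {
        String html = "<article id=\"100\"><a href=\"/abc/1001/STUFF\">"
                + "<span class=\"mick vtEnabled\"></span></a></article>"
                + "<article id=\"101\"><a href=\"/abc/1002/STUFF\">"
                + "<span class=\"mick\"></span></a></article>";
        for (Element article : Jsoup.parse(html).select("article")) {
            System.out.println(article.id() + ": " + hasMarkerSpan(article));
        }
    }
}
```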
