Parsing data using Jsoup - java

I am trying to parse job information out of an HTML page using the Jsoup parser. I want to extract all the job posting details, but I can't get the query right. I tried Tryjsoup.com to get an idea of the query structure, but I can't figure out how to select these tuples. I would also appreciate pointers on how to get a grip on their inner structure.
Html Code:
<div itemscope itemtype="http://schema.org/JobPosting" type="tuple" id="131015000050" class="row ">
<a count=1 href="some link">
<span itemprop=title><font class=hlite>Developer</font></span>
<span itemprop=hiringOrganization>Vm World</span>
</a>
</div>
<div class= "other details"><span itemprop=baseSalary><em></em>3000</span></div>
Expected Output:
String Post = Developer
String Company = Vm World
String Salary = 3000

I think you just need to use Element.select("span") for the block of HTML code.
Document doc = Jsoup.parse("<HTML code>");
Elements spans = doc.select("span");
for(Element span: spans) {
System.out.println(span.text());
}
The result of the above code:
Developer
Vm World
3000
Code for segregation:
Element title = doc.select("span[itemprop=title]").first();
Element company = doc.select("span[itemprop=hiringOrganization]").first();
Element salary = doc.select("span[itemprop=baseSalary]").first();
System.out.println(title.text());
System.out.println(company.text());
System.out.println(salary.text());

Related

How to fetch span class value using Selenium with Java

I am using selenium with java and want to fetch the value "3". I guess I have to use xpath but I am not sure what would be the syntax for that? the html code is given below:
<div class="p-panel halign-center">
<div>
<span class="p-text p-f-sz-xl p-t-secondary50 p-f-w-b p-t-wr-fw" data-tag="Transcript-Summary-No-Due-Date-Count">3</span>
</div>
</div>
You can fetch the value 3 with the following code:
String value = driver.findElement(By.xpath("//span[@data-tag='Transcript-Summary-No-Due-Date-Count']")).getText();
OR
String value = driver.findElement(By.xpath("//span[@class='p-text p-f-sz-xl p-t-secondary50 p-f-w-b p-t-wr-fw']")).getText();
Also we can improve the xpath if more HTML code is available.
Happy coding~
This is for getting by class:
.//span[@class='p-text p-f-sz-xl p-t-secondary50 p-f-w-b p-t-wr-fw']
This is for getting by text:
.//span[text()='3']
And this uses both:
.//span[@class='p-text p-f-sz-xl p-t-secondary50 p-f-w-b p-t-wr-fw' and text()='3']
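The attribute syntax in these expressions (@class, @data-tag) can be checked without a browser. The following is a minimal sketch that evaluates them with the JDK's built-in javax.xml.xpath API instead of Selenium; it assumes the snippet from the question is well-formed enough for an XML parser, and the class name SpanXPathDemo is made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class SpanXPathDemo {
    // The markup from the question, kept as a single well-formed fragment.
    static final String HTML =
        "<div class=\"p-panel halign-center\"><div>"
        + "<span class=\"p-text p-f-sz-xl p-t-secondary50 p-f-w-b p-t-wr-fw\" "
        + "data-tag=\"Transcript-Summary-No-Due-Date-Count\">3</span>"
        + "</div></div>";

    // Returns the string value of the first node matched by the expression.
    static String evaluate(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(HTML)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Match on the data-tag attribute.
        System.out.println(evaluate("//span[@data-tag='Transcript-Summary-No-Due-Date-Count']")); // 3
        // Match on part of the class attribute combined with the text.
        System.out.println(evaluate("//span[contains(@class, 'p-text') and text()='3']")); // 3
    }
}
```

Matching on data-tag is usually the sturdier choice here, since the long class list is more likely to change with styling than a dedicated data attribute.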

Get price from webpage using Jsoup

I'm trying to get the price from a product on a webpage.
Specifically from within the following html. I don't know how to use CSS but these are my attempts so far.
<div class="pd-price grid-100">
<!-- Selling Price -->
<div class="met-product-price v-spacing-small" data-met-type="regular">
<span class="primary-font jumbo strong art-pd-price">
<sup class="dollar-symbol" itemprop="PriceCurrency" content="USD">$</sup>
399.00</span>
<span itemprop="price" content="399.00"></span>
</div>
</div>
> $399.00
This obviously resides further within a webpage but here is the java code i've attempted to run this.
String url ="https://www.lowes.com/pd/GE-700-sq-ft-Window-Air-Conditioner-115-Volt-14000-BTU-ENERGY-STAR/1000380463";
Document document = Jsoup.connect(url).timeout(0).get();
String price = document.select("div.pd-price").text();
String title = document.title(); //Get title
System.out.println(" Title: " + title); //Print title.
System.out.println(price);
First, you should familiarize yourself with CSS selectors; W3Schools has some resources to get you started.
In this case, the thing you need resides inside div with pd-price class
so div.pd-price is already correct.
You need to get the element first.
Element outerDiv = document.selectFirst("div.pd-price");
And then get the child div with another selector
Element innerDiv = outerDiv.selectFirst("div.met-product-price");
And then get the span element inside it
Element spanElement = innerDiv.selectFirst("span.art-pd-price");
At this point you could get the <sup> element but in this case, you can just call text() method to get the text
System.out.println(spanElement.text());
This will print
$ 399.00
Edit:
After seeing comments in other answer
You can get cookie from your browser and send it from Jsoup to bypass the zipcode requirement
Document document = Jsoup.connect("https://www.lowes.com/pd/GE-700-sq-ft-Window-Air-Conditioner-115-Volt-14000-BTU-ENERGY-STAR/1000380463")
.header("Cookie", "<Your Cookie here>")
.get();
Element priceDiv = document.select("div.pd-price").first();
String price = priceDiv.select("span").last().attr("content");
If you need the currency symbol too:
String currencySymbol = priceDiv.select("sup").text();
I haven't run these, but they should work.
For more detail, see the Jsoup API reference.

I cannot get text With jsoup : element.text()

element.text() doesn't show me anything. Can someone help me, please?
org.jsoup.nodes.Document d = Jsoup.connect("https://translate.google.com/#en/ar/scraping").get();
org.jsoup.nodes.Element element = d.getElementById("result_box");
out.print(element.text());
When you view the static page source here: https://translate.google.com/#en/ar/scraping you'll see that it contains this:
<span id="result_box" class="short_text"></span>
But on loading the page in your browser you'll see that element is changed to:
<span id="result_box" class="short_text" lang="ar">
<span class="">...</span>
</span>
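So the content of the result_box span is populated dynamically by JavaScript after the page loads.
Jsoup only parses the HTML returned by the server and does not execute JavaScript, so it never sees that content.
To read dynamic content you'll need a tool that drives a real browser, such as Selenium WebDriver.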

Xpath changing after each run

Could someone help, please?
I need to grab a number that is generated on each run. Because the number changes every run, I need to capture it and write it to an Excel sheet. I'm using XPath for this field, but I'm not sure how to make the expression generic enough that it always picks up the most recently generated number.
The XPath I tried is shown below; the part that changes between runs is div[2] becoming div[3]:
driver.findElement(By.xpath("/html/body/div/div[3]/div/div/div[3]/div/section/section/article/div[2]/div/div/div/div[2]/div[2]/div[1]/div/div[2]/p")).getText();
driver.findElement(By.xpath("/html/body/div/div[3]/div/div/div[3]/div/section/section/article/div[2]/div/div/div/div[2]/div[3]/div[1]/div/div[2]/p")).getText();
My HTML SOURCE code is :
<div class="card-details row"> <div class="pane base4"> <div class="card"> <div class="card-name"> <div class="card-number"> <h4>Card number</h4> <p>633597015500042861</p> </div> <a class="submit-btn uniform-button button-smaller button-orange" href="/my-account/replacement-card?cardId=1ce25b86-27e6-4ce8-8ef3-6576f9a0ae84"> </div>
Thanks and Regards,
Az.
Please try the XPath below to retrieve the card number.
String cardNum = driver.findElement(By.xpath(".//div[@class='card-number']/p")).getText();
This reads the randomly generated card number from the <p> inside the <div class="card-number"> tag.
Hope this helps.
//h4[contains(text(),'Card number')]/following-sibling::p
or
div[h4[contains(text(),'Card number')]]/p
If the same XPath matches more than once and you want the last match, wrap the expression in parentheses and add [last()]:
(//div[@class='card-number']/p)[last()]
This selects the last occurrence in the document.
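To see how picking the last match behaves when the card block repeats, here is a minimal sketch using the JDK's javax.xml.xpath API rather than Selenium, so it runs without a browser. The two-card markup and the numbers 1111 and 2222 are invented for illustration, as is the class name LastCardNumberDemo.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class LastCardNumberDemo {
    // Hypothetical page fragment with two card blocks; the second is the newest.
    static final String HTML =
        "<div class=\"card-details row\">"
        + "<div class=\"card-number\"><h4>Card number</h4><p>1111</p></div>"
        + "<div class=\"card-number\"><h4>Card number</h4><p>2222</p></div>"
        + "</div>";

    // Returns the string value of the first node in the result of the expression.
    static String evaluate(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(HTML)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Without [last()], the first matching <p> in document order wins.
        System.out.println(evaluate("//h4[contains(text(),'Card number')]/following-sibling::p")); // 1111
        // The parentheses make [last()] apply to the whole result set.
        System.out.println(evaluate("(//div[@class='card-number']/p)[last()]")); // 2222
    }
}
```

The parentheses matter: //div[@class='card-number']/p[last()] without them would select the last <p> inside each div, not the last match overall.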
Please do it like below:
// as per your given xpath
String FirstXpath = "/html/body/div/div[3]/div/div/div[3]/div/section/section/article/div[2]/div/div/div/div[2]/div[";
String SecondXpath = "]/div[1]/div/div[2]/p";
// make sure here i value is as per your application
for(int i=2;i<4;i++){
driver.findElement(By.xpath(FirstXpath + i + SecondXpath)).getText();
}
Just make sure the range of i matches the div index values that change in your application.
Update
As per the HTML provided, the <p> is a sibling of the <h4>, not a child of it, so use:
driver.findElement(By.xpath("//*[@class='card-number']/p")).getText();

How I can replace "text" in the each tag using Jsoup

I have the following html:
<html>
<head>
</head>
<body>
<div id="content" >
<p>text <strong>text</strong> text <em>text</em> text </p>
</div>
</body>
</html>
How can I replace "text" with "word" in each tag using the Jsoup library?
I want to see:
<html>
<head>
</head>
<body>
<div id="content" >
<p>word <strong>word</strong> word <em>word</em> word </p>
</div>
</body>
</html>
Thank you for any suggestions!
UPD:
Thanks for answers, but I found the versatile way:
Element entry = doc.select("div").first();
Elements tags = entry.getAllElements();
for (Element tag : tags) {
for (Node child : tag.childNodes()) {
if (child instanceof TextNode && !((TextNode) child).isBlank()) {
System.out.println(child); //text
((TextNode) child).text("word"); //replace to word
}
}
}
Document doc = Jsoup.connect(url).get();
String str = doc.toString();
str = str.replace("text", "word");
try it..
A quick search turned up this code:
Elements strongs = doc.select("strong");
Element f = strongs.first();
Element l = strongs.last();
etc
First, you want to understand how the library works and what features it contains; then you can figure out how to use it to do what you need. The code above seems to let you select a strong element, at which point you could update its inner text, but I'm sure there are a number of ways to accomplish the same thing.
In general, most libraries which parse xml are able to select any given element in the document object model, or any list of elements, and either manipulate the elements themselves, or their inner text, attributes and the like.
Once you gain more experience working with different libraries, your starting point is to look for the documentation of the library to see what that library does. If you see a method that says it does something, that's what it does, and you can expect to use it to accomplish that goal. Then, instead of writing a question on Stack Overflow, you just need to parse the functionality of the library you're using, and figure out how to use it to do what you want.
String html = "<html> ...";
Document doc = Jsoup.parse(html);
Elements p = doc.select("div#content > p");
p.html(p.html().replaceAll("text", "word"));
System.out.println(doc.toString());
div#content > p selects the <p> elements that are direct children of the <div> whose id is content.
If you want to replace the text only in <strong>text</strong>:
Elements p = doc.select("div#content > p > strong");
p.html(p.html().replaceAll("text", "word"));
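Replacing on the serialized HTML string also touches tag names and attribute values that happen to contain the search term. A variant of the text-node walk from the update above avoids that by rewriting only text nodes. The sketch below uses the JDK's org.w3c.dom API rather than Jsoup so that it runs without extra dependencies; the title attribute on <p> is added here only to show that attributes are left alone, and the class and method names are made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TextNodeReplaceDemo {
    // Recursively replaces target with replacement inside text nodes only.
    static void replaceInTextNodes(Node node, String target, String replacement) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            node.setNodeValue(node.getNodeValue().replace(target, replacement));
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            replaceInTextNodes(child, target, replacement);
        }
    }

    // Parses a well-formed fragment, rewrites its text nodes, and returns the root element.
    static Element parseAndReplace(String xml, String target, String replacement) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            replaceInTextNodes(doc.getDocumentElement(), target, replacement);
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        Element p = parseAndReplace(
            "<p title=\"text\">text <strong>text</strong> text <em>text</em> text </p>",
            "text", "word");
        System.out.println(p.getTextContent());      // word word word word word
        System.out.println(p.getAttribute("title")); // text
    }
}
```

With Jsoup the same walk would go over TextNode children, as in the update above, calling text(...) with the replaced string instead of a fixed "word".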
