I have a piece of HTML from a web page (LibraryThing) like this:
<div class="qelcontent" id="4ed0e0ba4f1b16.47984984" style="display:block;">
<div class="description"><h4 class="first"><b>Amazon.com Product Description</b>
(ISBN 0860783227, Hardcover)</h4>
I want to get the absolute URL from an href attribute. I tried:
selector = document.select(".first .a[href]");
But it returned null. How can I get the value?
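This solves this specific problem, though I'm not sure it will work with your entire dataset. Note that .a in your selector matches elements with the class "a"; to match the tag, use a plain a.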
String html = "<div class=\"qelcontent\" id=\"4ed0e0ba4f1b16.47984984\" style=\"display:block;\">" +
"<div class=\"description\"><h4 class=\"first\"><b>Amazon.com Product Description</b>" +
"(ISBN 0860783227, Hardcover)</h4>";
Document doc = Jsoup.parse(html);
System.out.println(doc.select(".first").select("a").attr("href"));
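Since the question asks for an absolute URL: jsoup exposes this as element.absUrl("href") (or the attribute key abs:href), which only returns a non-empty value when the href is already absolute or the document was parsed with a base URI. The resolution step itself can be sketched with the JDK alone. The class name, method name, and example.com URLs below are made up for illustration.

```java
import java.net.URI;

class AbsoluteHref {
    // Resolves an href value against the URL of the page it was found on.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "https://example.com/work/123"; // hypothetical page URL
        System.out.println(resolve(page, "/isbn/0860783227"));        // root-relative href
        System.out.println(resolve(page, "details?tab=1"));           // path-relative href
        System.out.println(resolve(page, "https://other.example/x")); // already absolute
    }
}
```

URI.create throws IllegalArgumentException for hrefs that are not valid URIs (for example, ones containing raw spaces), so real-world input may need cleaning first.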
I'm trying to get the price from a product on a webpage.
Specifically from within the following html. I don't know how to use CSS but these are my attempts so far.
<div class="pd-price grid-100">
<!-- Selling Price -->
<div class="met-product-price v-spacing-small" data-met-type="regular">
<span class="primary-font jumbo strong art-pd-price">
<sup class="dollar-symbol" itemprop="PriceCurrency" content="USD">$</sup>
399.00</span>
<span itemprop="price" content="399.00"></span>
</div>
</div>
> $399.00
This obviously sits deeper within the web page, but here is the Java code I've tried so far:
String url ="https://www.lowes.com/pd/GE-700-sq-ft-Window-Air-Conditioner-115-Volt-14000-BTU-ENERGY-STAR/1000380463";
Document document = Jsoup.connect(url).timeout(0).get();
String price = document.select("div.pd-price").text();
String title = document.title(); //Get title
System.out.println(" Title: " + title); //Print title.
System.out.println(price);
First, you should familiarize yourself with CSS selectors; W3Schools has some resources to get you started.
In this case, the value you need sits inside the div with the pd-price class, so div.pd-price is already correct.
You need to get the element first.
Element outerDiv = document.selectFirst("div.pd-price");
And then get the child div with another selector
Element innerDiv = outerDiv.selectFirst("div.met-product-price");
And then get the span element inside it
Element spanElement = innerDiv.selectFirst("span.art-pd-price");
At this point you could select the <sup> element separately, but in this case you can just call the text() method on the span to get its text:
System.out.println(spanElement.text());
This will print
$ 399.00
Edit: after seeing the comments on the other answer, you can copy the cookie from your browser and send it from jsoup to bypass the zip code requirement:
Document document = Jsoup.connect("https://www.lowes.com/pd/GE-700-sq-ft-Window-Air-Conditioner-115-Volt-14000-BTU-ENERGY-STAR/1000380463")
.header("Cookie", "<Your Cookie here>")
.get();
Element priceDiv = document.select("div.pd-price").first();
String price = priceDiv.select("span").last().attr("content");
If you need currency too:
String priceWithCurrency = priceDiv.select("sup").text();
I haven't run these, but they should work.
For more detail, see the jsoup API reference.
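Both answers leave the price as a string ("399.00" from the content attribute, or "$ 399.00" from text()). If a numeric value is needed, a minimal JDK-only sketch could look like the following. The class and method names are hypothetical, and it assumes US-style formatting where the comma is a thousands separator and the dot is the decimal point.

```java
import java.math.BigDecimal;

class PriceText {
    // Drops everything except digits and the decimal point, then parses the rest.
    // Assumes US-style formatting; throws NumberFormatException if no number remains.
    static BigDecimal parse(String raw) {
        String cleaned = raw.replaceAll("[^0-9.]", "");
        return new BigDecimal(cleaned);
    }

    public static void main(String[] args) {
        System.out.println(parse("399.00"));   // value of the content attribute
        System.out.println(parse("$ 399.00")); // text of the visible span
    }
}
```

BigDecimal is used instead of double so that the cents are kept exactly as written on the page.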
I cannot get the text with jsoup using element.text().
It doesn't print anything. Can someone help me, please?
org.jsoup.nodes.Document d = Jsoup.connect("https://translate.google.com/#en/ar/scraping").get();
org.jsoup.nodes.Element element = d.getElementById("result_box");
out.print(element.text());
When you view the static page source here: https://translate.google.com/#en/ar/scraping you'll see that it contains this:
<span id="result_box" class="short_text"></span>
But on loading the page in your browser you'll see that element is changed to:
<span id="result_box" class="short_text" lang="ar">
<span class="">...</span>
</span>
So, the content of the result_box span is populated dynamically.
This means that it cannot be scraped by jsoup, which parses the HTML as the server delivers it and does not execute JavaScript.
To read dynamic content you'll need a browser automation tool such as Selenium WebDriver.
This is on Android, and I need to fix up the HTML before it is loaded into the WebView.
Normally this could be done with
(<a[^>]+>)(.+?)(<\/a>)
to get group $1 and then replace the text.
What if there are other, unknown children inside the <a> tag?
The example below has <a><p>... text</p></a>, but the <p> could be something else that isn't known in advance.
What I really want is to replace only the text content of any child inside the <a> element.
<a href="http://news.newsletter.com/" target="_blank">
<p><img alt="Socialbook" border="0" height="50"
src="http://news.newsletter.com/images/socialbook.gif" width="62">
THIS IS THE TEXT NEEDED TO REPLACE<p>
</a>
Can this be done in Java, or does it have to be done in the WebView's JavaScript?
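You can use any Java HTML parser, e.g. jsoup. Note that link.text(...) replaces all children of the link, so to keep nested elements such as <img>, rewrite only the text nodes:
String html = "<a href=\"http://news.newsletter.com/\" target=\"_blank\">"
        + "<p><img alt=\"Socialbook\" src=\"http://news.newsletter.com/images/socialbook.gif\">"
        + "THIS IS THE TEXT NEEDED TO REPLACE</p></a>";
Document doc = Jsoup.parse(html);
// Visit each link and every element nested inside it
for (Element el : doc.select("a, a *"))
    for (TextNode node : el.textNodes())   // direct text children only
        if (!node.isBlank())               // skip whitespace between tags
            node.text("~" + node.text() + "~");
See the Element and TextNode API docs.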
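The regex from the question can be tried in plain Java to see where it falls short. This is only a sketch with made-up class and method names: it swaps everything between the opening and closing tags, so child elements such as <img> are lost along with the old text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorRegexDemo {
    // Pattern from the question; DOTALL lets '.' match line breaks inside the link.
    private static final Pattern ANCHOR =
            Pattern.compile("(<a[^>]+>)(.+?)(</a>)", Pattern.DOTALL);

    // Keeps the opening tag ($1) and closing tag ($3), replaces everything in between.
    static String replaceAnchorBody(String html, String newText) {
        return ANCHOR.matcher(html)
                .replaceAll("$1" + Matcher.quoteReplacement(newText) + "$3");
    }

    public static void main(String[] args) {
        // Plain link: works as intended.
        System.out.println(replaceAnchorBody("<a href=\"x\">old</a>", "new"));
        // Link with nested elements: the <p> and <img> are removed too.
        System.out.println(replaceAnchorBody(
                "<a href=\"x\"><p><img src=\"i.gif\">old</p></a>", "new"));
    }
}
```

This is why a DOM-based approach is usually the safer choice when the children of the link are not known in advance.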
I'm trying to find all elements inside this kind of html:
<body>
My text without tag
<br>Some title</br>
<img class="image" src="url">
My second text without tag
<p>Some Text</p>
<p class="MsoNormal">Some text</p>
<ul>
<li>1</li>
<li>2</li>
</ul>
</body>
I need to get all elements, including the parts without any tag. How can I get them?
P.S.: I need to get an array of "Element" objects, one for each element.
I'm not quite sure whether you are asking how to retrieve all the text within the HTML. If so, you can simply do the following:
String html; // your html code
Document doc = Jsoup.parse(html); //parse the string
System.out.println(doc.text()); // get all the text from tags.
OUTPUT:
My text without tag Some title My second text without tag Some Text
Some text 1 2
In case you are working with an HTML file, you can use the code below and retrieve each tag that you need. The library is jsoup; you can find more examples at http://jsoup.org/
File input = new File(htmlFilePath);
// Reads and parses the file in one step; throws IOException if it cannot be read
Document htmlDoc = Jsoup.parse(input, "UTF-8");
Elements pElements = htmlDoc.select("p");   // all <p> elements
Element pElement1 = pElements.get(0);       // first <p>; throws if there are none
I'm trying to extract specific text from a web page.
This is the part of the webpage which contains the specific text:
<div class="module">
<div class="body">
<dl class="per_info">
<dt>F.Name:</dt>
<dd><a class="nm" href="http://">a Variable Name1</a></dd>
<dt>L.Name:</dt>
<dd><a class="nm" href="http://">a Variable Name2</a></dd>
</dl>
</div>
</div>
How can I extract the contents of Variable Name1 and Variable Name2?
Is there an HTML parser that could do this extraction?
You can try Selenium: it loads the HTML page in a DOM-aware fashion, so that afterwards your Java code can pick out the content of HTML elements by id, XPath, etc.
http://seleniumhq.org/
TagSoup is a SAX-compliant parser that is able to parse HTML found in the "wild", so there's no need for well-formed XML.
jsoup is a Java library that can parse HTML and extract element data. To use jsoup, you first create a jsoup Document by parsing it from a file, URL, whole document string, or HTML fragment string. An HTML fragment example looks something like this:
String html = "<div class='module'>" +
"<div class='body'>" +
"<dl class='per_info'>" +
"<dt>F.Name:</dt>" +
"<dd><a class='nm' href='http://'>a Variable Name1</a></dd>" +
"<dt>L.Name:</dt>" +
"<dd><a class='nm' href='http://'>a Variable Name2</a></dd>" +
"</dl>" +
"</div>" +
"</div>";
Document doc = Jsoup.parseBodyFragment(html);
With the document, you can use jsoup's selectors to locate specific elements:
// select all <a/> elements from the document
Elements anchors = doc.select("a");
With the element collection, you can iterate over the elements and extract their contents:
for (Element anchor : anchors) {
String contents = anchor.text();
System.out.println(contents);
}