HTML Parsing using Jsoup library

<div class="serieSelector serieSelected" data-serie="36" data-title="Steps">
<div class="value fontGreyBold">2620</div>
<div id="stepsPulse" class="fontGreyLight">Steps</div>
</div>
I am currently working on an Android project that needs to parse some data from a website and display it in TextViews. As seen above, I need to display the highlighted value, which is "2620". I'm using Jsoup, and the markup above is the element data obtained from the website. I don't know exactly which selector to use.
try {
    Document document = Jsoup.connect(url).get();
    Elements stepstaken = document
            .select("div[class=measureValue fontGreyBold]span[class]");
    stta = stepstaken.attr("class");
} catch (IOException e) {
    e.printStackTrace();
}
The above code doesn't work so any possible replies are appreciated. Thanks!

I always just use PHP Simple DOM Parser whenever I need to parse anything. Then you'd just create a simple REST API that returns the parsed results. Works like a charm. :)

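Your selector looks for a class measureValue and a span, neither of which exists in the markup you posted, and attr("class") returns the class name rather than the element's text. Try this selector and read the value with text():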
document.select("div.value.fontGreyBold");
Example
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class JsoupParser {
    public static void main(String[] args) {
        String html = "<div class=\"serieSelector serieSelected\" data-serie=\"36\" data-title=\"Steps\">"
                + "<div class=\"value fontGreyBold\">2620</div>"
                + "<div id=\"stepsPulse\" class=\"fontGreyLight\">Steps</div>"
                + "</div>";
        Document document = Jsoup.parse(html);
        Elements stepstaken = document.select("div.value.fontGreyBold");
        System.out.println(stepstaken.text());
    }
}

Related

JSoup - exclude links

I'm trying to exclude a list of links which I do not wish to crawl.
I could not find anything useful in the documentation that skips user requested urls.
I was, however, able to do it like this:
if (!(link.attr("href").startsWith("https://blog.olark.com") ||
        link.attr("href").startsWith("http://www.olark.com") ||
        link.attr("href").startsWith("https://www.olark.com") ||
        link.attr("href").startsWith("https://olark.com") ||
        link.attr("href").startsWith("http://olark.com"))) {
    this.links.add(link.absUrl("href")); // get the absolute url and add it to links list.
}
Of course this isn't a proper way to do it, so I wrapped the links in a List and tried to loop through it - however, it did not exclude a single link (code below):
List<String> exclude = Arrays.asList("https://blog.olark.com", "http://www.olark.com", "https://www.olark.com", "https://olark.com", "http://olark.com");
for (String string : exclude) {
    if (!link.attr("href").startsWith(string)) {
        this.links.add(link.absUrl("href")); // get the absolute url and add it to links list.
    }
}
So my question is: How do I make it avoid a list of urls? I'm thinking something similar to the second code block I've written, but I'm open for ideas or fixes.

You can start with selecting and removing all the unwanted links. Then you can process your document without any checks.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class JsoupQuestion51072084 {
    public static void main(final String[] args) throws IOException {
        Document doc = Jsoup.parse("" +
                "" +
                "" +
                "" +
                "" +
                "");
        System.out.println("Document before modifications:\n" + doc);
        // select links having "olark.com" in href.
        Elements links = doc.select("a[href*=olark.com]");
        System.out.println();
        System.out.println("Links to remove: " + links);
        System.out.println();
        // remove them from the document
        for (Element link : links) {
            link.remove();
        }
        System.out.println("Document without unwanted links:\n" + doc);
    }
}
and the output is:
Document before modifications:
<html>
<head></head>
<body>
</body>
</html>
Links to remove:
Document without unwanted links:
<html>
<head></head>
<body>
</body>
</html>
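As for the loop in the question: it excludes nothing because it tests one prefix at a time. A link that starts with one excluded prefix still does not start with the other four, so it gets added anyway (several times, in fact). If you would rather keep the prefix list than remove elements from the document, the check has to cover the whole list before the link is added. A minimal sketch using only the standard library; the class name LinkFilter and the method isAllowed are made up for illustration, and the string passed in stands for link.attr("href"):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of jsoup: decides whether a URL may be crawled.
final class LinkFilter {

    // Prefixes taken from the question.
    static final List<String> EXCLUDED = Arrays.asList(
            "https://blog.olark.com",
            "http://www.olark.com",
            "https://www.olark.com",
            "https://olark.com",
            "http://olark.com");

    // True only when the href starts with none of the excluded prefixes.
    static boolean isAllowed(String href) {
        return EXCLUDED.stream().noneMatch(href::startsWith);
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("https://www.olark.com/pricing"));
        System.out.println(isAllowed("https://example.com/about"));
    }
}
```

In the crawler this would read something like if (LinkFilter.isAllowed(link.attr("href"))) { this.links.add(link.absUrl("href")); }. If the page uses relative hrefs, testing link.absUrl("href") instead is probably the safer choice.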

Select Html Tags with Jsoup

I want to select all the HTML tags of a document with Jsoup:
<html>
<head></head>
<body>
.....
</body>
</html>
I tried that:
Document dc = Jsoup.parse(fichier, "utf-8");
String tags = dc.outerHtml();

Your question is not entirely clear, but it seems that you simply want to get all the tag node names. To do so, parse the HTML, call getAllElements(), and iterate over the resulting list, reading nodeName() from each element. Using Java 8's forEach, the code could look something like this:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class JSoup {
    public static void main(String[] args) {
        String fichier = "<html>" +
                "<head></head>" +
                "<body></body>" +
                "</html>";
        // fichier is a String here, so use parse(String); the second argument of
        // parse(String, String) is a base URI, not a charset
        Document dc = Jsoup.parse(fichier);
        Elements elements = dc.getAllElements();
        elements.forEach(element -> System.out.println(element.nodeName()));
    }
}
This code prints all the tag node names:
#document
html
head
body

Trying to fill out a website form using java, but form tag is embedded in iframe tag

My goal is to access this url http://eaacorp.com/find-a-dealer and fill out a form using java. To do this, I attempted to find all form tags:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class HttpUrlConnectionExample {
    public static void main(String[] args) throws IOException {
        Document document = Jsoup.connect("http://eaacorp.com/find-a-dealer").get();
        String page = document.toString(); // this is the whole page's html
        Elements formEl = document.getElementsByTag("form");
    }
}
However, formEl comes back empty because the form tag is not part of this page: it lives in http://www.eaacorp.com/dealer/searchdealer.php, which is embedded through an iframe tag (snippet of the page's source):
<iframe id="blockrandom" name="iframe" src="http://www.eaacorp.com/dealer/searchdealer.php" width="100%" height="500" scrolling="auto" frameborder="1" class="wrapper"></iframe>
Hence, is there a way to access the form tag within the iframe tag? Something like:
if (formEl.isEmpty()) {
    // find iframe
    Elements iframeEl = document.getElementsByTag("iframe");
    System.out.println(iframeEl);
    String embedURL = iframeEl.getSrc(); // DOES NOT COMPILE, getSrc() is not a method
    Document embedDoc = Jsoup.connect(embedURL).get();
}

There is no need for your own getSrcString method, especially since the substring approach will break for minimal changes in the tag.
Use .attr("abs:src") on an element with the src attribute instead (compare: https://jsoup.org/cookbook/extracting-data/working-with-urls)
Example Code
Document document = Jsoup.connect("http://eaacorp.com/find-a-dealer").get();
Element iframeEl = document.select("iframe").first();
String embedURL = iframeEl.attr("abs:src");
Document embedDoc = Jsoup.connect(embedURL).get();
System.out.println(embedDoc.select("form").first());
Truncated Output
<form action="findit.php" method="post" name="dlrsrchfrm" target="_blank">
<div style="padding: 15px;">
[...]
</form>
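The abs: prefix asks jsoup to resolve the attribute against the document's base URI, which matters as soon as an iframe src is relative. Roughly the same resolution can be sketched with java.net.URI alone. The class name ResolveSrc is made up, and the relative path in main is an invented example (the real page uses an absolute src):

```java
import java.net.URI;

// Hypothetical helper illustrating what resolving a src against a base URI means.
final class ResolveSrc {

    // Resolves src against baseUri; an already absolute src is returned unchanged.
    static String resolve(String baseUri, String src) {
        return URI.create(baseUri).resolve(src).toString();
    }

    public static void main(String[] args) {
        String base = "http://eaacorp.com/find-a-dealer";
        // invented relative src, for illustration only
        System.out.println(resolve(base, "/dealer/searchdealer.php"));
        // the absolute src from the question
        System.out.println(resolve(base, "http://www.eaacorp.com/dealer/searchdealer.php"));
    }
}
```

This is only meant to show why attr("abs:src") is more robust than reading the raw attribute; in real code the jsoup call above is all you need.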

I found that you could actually make your own method that can get the src url using substrings and then just use that String to get a document connection:
public static String getSrcString(String html) {
    String construct = "";
    for (int i = 0; i < html.length() - 5; i++) {
        if (html.substring(i, i + 5).equals("src=\"")) {
            i += 5;
            while (!html.substring(i, i + 1).equals("\"")) {
                construct += html.substring(i, i + 1);
                i++;
            }
        }
    }
    return construct;
}
and then in the main:
String embedURL = getSrcString(iframeEl.toString());
Document embedDoc = Jsoup.connect(embedURL).get();

Can't scrape the data that i'm looking for?

I am trying to scrape the prices and the dates in the table in the attached picture from the URL:
http://www.airfrance.fr/vols/paris+tunis
I managed to scrape some information, but not the data I'm looking for (date + price). I used these lines of code:
import java.io.IOException;
import javax.lang.model.element.Element;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class Test {
    public static void main(String[] args) {
        Document doc;
        try {
            doc = Jsoup.connect("http://www.airfrance.fr/vols/paris+tunis").get();
            Elements links = doc.select("div");
            for (org.jsoup.nodes.Element e : links) {
                System.out.println(e.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Running this code gives me just some of the prices and only a few dates, not the whole table as shown in the picture.
Could you please help me resolve this problem for my study project? Thanks.

The problem is that the calendar you are parsing is not in the original source code (right click > view source) as delivered by the server. That table is generated by JavaScript when the page is rendered by the browser (right click > inspect).
Jsoup does not execute JavaScript; it can only parse the source it is given. So you need to load the page first with something like HtmlUnit, then pass the rendered page to Jsoup.
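// HtmlUnit 2.x package names are assumed here; HtmlUnit 3.x moved them to org.htmlunit
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class RenderedPageScraper {
    public static void main(String[] args) throws Exception {
        // load page using HtmlUnit and fire scripts
        WebClient webClient = new WebClient();
        HtmlPage myPage = webClient.getPage("http://www.airfrance.fr/vols/paris+tunis");
        // convert page to generated HTML and convert to document
        Document doc = Jsoup.parse(myPage.asXml());
        // find all of the date/price cells
        for (Element cell : doc.select("td.available.daySelection")) {
            String cellDate = cell.select(".cellDate").text();
            String cellPrice = cell.select(".cellPrice > .day_price").text();
            System.out.println(
                    String.format(
                            "cellDate=%s cellPrice=%s",
                            cellDate,
                            cellPrice));
        }
        // clean up resources
        webClient.close();
    }
}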
Console
cellDate=1 septembre cellPrice=302 €
cellDate=2 septembre cellPrice=270 €
cellDate=3 septembre cellPrice=270 €
cellDate=4 septembre cellPrice=270 €
cellDate=5 septembre cellPrice=270 €
....
Source: Parsing JavaScript Generated Pages

how to substring with word boundary for a long html content for preview (preserving format) in web using java?

I am trying to display a quick summary of a long html message sent by user. I would like to do this in java rather than javascript. How can I achieve this? I have looked at jsoup and htmlunit but can not find the method that does it!

With jsoup you can parse the document, select the inner element where the text content is too long and replace its text content with an excerpt.
1. Parse a document
2. Find an element
3. Extract the text content
4. Compute a replacement string
5. Set the new text content
All of these steps are covered in the jsoup documentation.
Put together, this results in:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class A {
    public static void main(String[] args) {
        String html = "<html><head><title>First parse</title></head><body><p>Parsed HTML into a doc.</p></body></html>";
        Document doc = Jsoup.parse(html);
        Element pTag = doc.select("body > p").first(); // the p tag
        String pContent = pTag.text();
        pContent = pContent.substring(0, 7) + "... (too long)";
        pTag.text(pContent);
        System.out.println(doc);
    }
}
Prints:
<html>
<head>
<title>First parse</title>
</head>
<body>
<p>Parsed ... (too long)</p>
</body>
</html>
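Note that substring(0, 7) cuts at a fixed offset, which only happens to fall on a word boundary for this sample text. A minimal sketch of a word-boundary cut, using only the standard library; the class name Preview, the helper name excerpt and the maxLength parameter are assumptions for illustration, and the result would be passed to pTag.text(...) in place of the fixed substring:

```java
// Hypothetical helper, not part of jsoup.
final class Preview {

    // Shortens text to roughly maxLength characters plus an ellipsis,
    // cutting at the last space before the limit when there is one.
    // Assumes words are separated by single spaces, as in Element.text() output.
    static String excerpt(String text, int maxLength) {
        if (text.length() <= maxLength) {
            return text; // short enough, nothing to cut
        }
        String head = text.substring(0, maxLength);
        // If the character right after the cut is whitespace, head already ends on a whole word.
        if (!Character.isWhitespace(text.charAt(maxLength))) {
            int lastSpace = head.lastIndexOf(' ');
            if (lastSpace > 0) {
                head = head.substring(0, lastSpace); // drop the partial word
            }
        }
        return head.trim() + "...";
    }

    public static void main(String[] args) {
        System.out.println(excerpt("Parsed HTML into a doc.", 10));
    }
}
```

A single word longer than maxLength is still cut mid-word, since there is no earlier boundary to fall back to. This sketch also only shortens the text of one element; it does not try to preserve markup nested inside it.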
