Can't scrape the data that I'm looking for? - java

I am trying to scrape the prices and dates from the table shown in the attached picture, from this URL:
http://www.airfrance.fr/vols/paris+tunis
I managed to scrape some information, but not what I am looking for (date + price). I used this code:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Test {
    public static void main(String[] args) {
        Document doc;
        try {
            doc = Jsoup.connect("http://www.airfrance.fr/vols/paris+tunis").get();
            // select every div on the page and print its text
            Elements links = doc.select("div");
            for (Element e : links) {
                System.out.println(e.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Running this code gives me only some of the prices and a few of the dates, not the whole table as shown in the picture below.
Can you please help me solve this problem? It is for my study project. Thanks.

The problem is the calendar you are parsing is not in the original source code (right click > view source) as delivered from the server. That table is generated using JavaScript when the page is rendered by the browser (right click > inspect).
Jsoup can only parse the source it is given; it does not execute JavaScript. So you need to load the page first with something like HtmlUnit, then pass the rendered page to Jsoup.
// HtmlUnit 2.x package names; HtmlUnit 3.x uses org.htmlunit instead
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
// load page using HtmlUnit and fire scripts
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage("http://www.airfrance.fr/vols/paris+tunis");
// convert page to generated HTML and convert to document
Document doc = Jsoup.parse(myPage.asXml());
// find all of the date/price cells
for (Element cell : doc.select("td.available.daySelection")) {
    String cellDate = cell.select(".cellDate").text();
    String cellPrice = cell.select(".cellPrice > .day_price").text();
    System.out.println(
        String.format(
            "cellDate=%s cellPrice=%s",
            cellDate,
            cellPrice));
}
// clean up resources
webClient.close();
Console
cellDate=1 septembre cellPrice=302 €
cellDate=2 septembre cellPrice=270 €
cellDate=3 septembre cellPrice=270 €
cellDate=4 septembre cellPrice=270 €
cellDate=5 septembre cellPrice=270 €
....
Source: Parsing JavaScript Generated Pages

Related

How to extract data if the div class comes after an id?

I am trying to get some data from a div that comes after an element with an id and type=hidden. I cannot reach the class to get the links listed in it.
I am using Jsoup with Elements and .select() or .getElementById(), and I tried to combine them to reach the class, without success. The site is https://www.ariva.de/aktien/suche. If you hit the search button "Suche starten", the result table pops up. The links in this table are what I want to reach.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class DatenImportUnternehmen {
    public static void main (String[] args) {
        String url = "https://www.ariva.de/aktien/suche";
        try {
            Document document = Jsoup.connect(url).get();
            for (Element row : document.select("div.aktiensuche_result_table")) {
                if(row.select("input[type=hidden]").text().equals("")) {
                    continue;
                }
                else {
                    String raw = row.select("[type=hidden]").text();
                    System.out.println(raw);
                }
            }
        }
        catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
I don't get any result. Eclipse just states terminated.
If I understand correctly you want to get to the links in the table generated when you hit the search button on https://www.ariva.de/aktien/suche.
The first problem you are having is that the search results aren't available directly from this URL. Instead, when you click the search button, a POST request is made to https://www.ariva.de/aktiensuche/_result_table.m.
The result of this request actually contains the table with the links that I believe you are interested in. Specifically the response contains HTML which is then dynamically added to the page as the results table.
The second problem looks to be in the jsoup query. I can't see any hidden input fields in the result table, but it is easy enough to grab the links using document.select("a[href]").
So for me this code:
String searchUrl = "https://www.ariva.de/aktiensuche/_result_table.m";
String searchBody = "page=0&page_size=25&sort=ariva_name&sort_d=asc&ariva_performance_1_year=_&ariva_performance_3_years=&ariva_performance_5_years=&index=0&founding_year=&land=0&industrial_sector=0&sector=0&currency=0&type_of_share=0&year=_all_years&sales=_&profit_loss=&sum_assets=&sum_liabilities=&number_of_shares=&earnings_per_share=&dividend_per_share=&turnover_per_share=&book_value_per_share=&cashflow_per_share=&balance_sheet_total_per_share=&number_of_employees=&turnover_per_employee=_&profit_per_employee=&kgv=_&kuv=_&kbv=_&dividend_yield=_&return_on_sales=_";
// post request to search URL
Document document = Jsoup.connect(searchUrl).requestBody(searchBody).post();
// find links in returned HTML
for (Element link : document.select("a[href]")) {
    System.out.println(link);
}
produces the following output (only the link text is shown here):
1&1 Drillisch
11 88 0 Solutions
1st Red
21ST. CENT. FOX B NEW
21st Century Fox
2G Energy
3I Group
3I INFRASTRUCTURE
3M Company
3U Holding
3W Power
4imprint Group
4 SC
6,625% Statkraft AS 09/19 auf Festzins
7C Solarparken
888 Holdings
A.A.A. aktiengesellschaft allgemeine anlageverwaltung
A.G. BARR LS-,04167
A.H.T. Syngas Technology
A.S. Creation Tapeten
A+J Mucklow Group
A+JMUCKLOW GRP PREF. LS 1
A2A
AAC Technologies Holding
Aalberts
Which I hope is more or less what you are after. To set search parameters you will need to examine the search form and modify the form data in the searchBody string (or use the .data method instead of .requestBody to build the query).
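One way to make the form data easier to modify is to build the body from a map instead of editing one long string. The sketch below uses only the JDK. The class name FormBody and the method encode are hypothetical names chosen for illustration, the three parameters are just a subset of the searchBody above, and it assumes Java 10 or later for the URLEncoder.encode(String, Charset) overload.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper (not part of jsoup): turns key/value pairs into an
// application/x-www-form-urlencoded request body.
class FormBody {

    static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // URL-encode both the name and the value of every field
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "="
                    + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("page", "0");
        fields.put("page_size", "25");
        fields.put("sort", "ariva_name");
        // prints page=0&page_size=25&sort=ariva_name
        System.out.println(encode(fields));
    }
}
```

The resulting string could be passed to .requestBody(...) in place of the hand-written searchBody. Whether the server accepts a request with only a subset of the fields is not something this sketch checks.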

Parsing Information from URL Using Jsoup

I need help with my Java project using Jsoup (if you think there is a more efficient way to achieve the purpose, please let me know). The purpose of my program is to parse certain useful information from different URLs and put it in a text file. I am not an expert in HTML or JavaScript, therefore, it has been difficult for me to code in Java exactly what I want to parse.
In the website that you see in the code below as one of the examples, the information that interests me to parse with Jsoup is everything you can see in the table under “Routing”(Route, Location, Vessel/Voyage, Container Arrival Date, Container Departure Date; = Origin, Seattle SSA Terminal T18, 26 Jun 15 A, 26 Jun 15 A… and so on).
So far, with Jsoup we are only able to parse the title of the website, yet we have been unsuccessful in getting any of the body.
Here is the code that I used, which I got from an online source:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Jsouptest71115 {
    public static void main(String[] args) throws Exception {
        // no space between the path and the query string
        String url = "http://google.com/gentrack/trackingMain.do"
                + "?trackInput01=999061985";
        Document document = Jsoup.connect(url).get();
        String title = document.title();
        System.out.println("title : " + title);
        String body = document.select("body").text();
        System.out.println("Body: " + body);
    }
}
Working code:
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.util.ArrayList;
public class Sample {
    public static void main(String[] args) {
        String url = "http://homeport8.apl.com/gentrack/blRoutingPopup.do";
        try {
            Connection.Response response = Jsoup.connect(url)
                    .data("blNbr", "999061985") // tracking number
                    .method(Connection.Method.POST)
                    .execute();
            Element tableElement = response.parse().getElementsByTag("table")
                    .get(2).getElementsByTag("table")
                    .get(2);
            Elements trElements = tableElement.getElementsByTag("tr");
            ArrayList<ArrayList<String>> tableArrayList = new ArrayList<>();
            for (Element trElement : trElements) {
                ArrayList<String> columnList = new ArrayList<>();
                for (int i = 0; i < 5; i++) {
                    columnList.add(i, trElement.children().get(i).text());
                }
                tableArrayList.add(columnList);
            }
            System.out.println("Origin/Location: "
                    + tableArrayList.get(1).get(1)); // row and column number
            System.out.println("Discharge Port/Container Arrival Date: "
                    + tableArrayList.get(5).get(3));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Output:
Origin/Location: SEATTLE SSA TERMINAL (T18), WA  
Discharge Port/Container Arrival Date: 23 Jul 15  E
You need to use the select method, as in document.select("body"), whose argument is a CSS selector. To learn more about CSS selectors, search for them online. With CSS selectors you can easily pick out parts of the page body.
In your particular case you have a different problem, though: the table you are after is inside an iframe. If you look at the HTML of the page you are visiting, the iframe's URL is "http://homeport8.apl.com/gentrack/blRoutingFrame.do". If you visit this URL directly you get an exception, which is perhaps a restriction imposed by the server. To get the content properly you need to visit two URLs via jsoup: 1. http://homeport8.apl.com/gentrack/trackingMain.do?trackInput01=999061985 and 2. http://homeport8.apl.com/gentrack/blRoutingFrame.do?trackInput01=999061985
The first URL gives you nothing useful, but the second gives you the tables you are interested in. Then try document.select("table"), which returns a list of tables; iterate over this list to find the table you want. Once you have the table, use Element.select("tr") to get its rows, and for each "tr" use Element.select("td") to get the cell data.
The page you are visiting doesn't use CSS class or id attributes, which would have made reading it with jsoup a lot easier, so I am afraid iterating over document.select("table") is your best and easiest option.
Good Luck.

Java Jsoup can't select table

I have recently started working on a mini project to learn the basics of Jsoup, but I am having difficulty selecting a table on a particular website. I'm trying to fetch the table with Jsoup, but with no success (see picture): http://imgur.com/RC21UBk
I know that the table I'm trying to get has class="meddelande" and is inside a form element which has the same class="meddelande".
HTML code of the website: http://pastebin.com/ufRDhLSy
I'm trying to fetch the red marked area, any idea on how to do it?
Thanks in advance! :)
My code:
public void startMessage(String cookie1) {
    try {
        doc1 = Jsoup.connect("https://nya.boplats.se/minsida/meddelande")
                .timeout(0).cookie("Boplats-Session", cookie1)
                .get();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    Elements tables = doc1.select("form.meddelande");
    Elements table = tables.select("table.meddelande");
    System.out.println(table);
}
In your code
Elements tables = doc1.select("form.meddelande");
Elements table = tables.select("table.meddelande");
you are trying to select a form with the class attribute meddelande, but in your linked HTML source meddelande is the form's id, not its class. So instead of
form.meddelande
you should use
form#meddelande
^--# means id, dot means class
So try
Elements tables = doc1.select("form#meddelande");
Elements table = tables.select("table.meddelande");
or, maybe simpler,
Elements table = doc1.select("form#meddelande table.meddelande");
If this does not work, the HTML for the table is probably generated by JavaScript. In that case you will not be able to get it with Jsoup alone; you will need something like Selenium WebDriver or HtmlUnit.
In your situation it is better to select the elements with the classes unread and read.
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class JsoupSO {
    public static void main(String[] args) throws IOException {
        Document doc;
        Elements elements;
        doc = Jsoup.parse(new File("path_to_file or use connect for URL"), "UTF-8");
        // print every element with class "unread"
        elements = doc.getElementsByClass("unread");
        for (Element element : elements) {
            System.out.println(element);
        }
        // print every element with class "read"
        elements = doc.getElementsByClass("read");
        for (Element element : elements) {
            System.out.println(element);
        }
    }
}
Output: http://pastebin.com/CwG1cL5T
And yes, read their cookbook: http://jsoup.org/
Here is an attempt
Document doc = Jsoup.connect("http://pastebin.com/raw.php?i=ufRDhLSy").get();
System.out.println(doc.select("table[class=meddelande]"));
or use the shorter syntax when selecting the node with a particular class only
System.out.println(doc.select("table.meddelande"));
Jsoup supports CSS selector syntax, so you can use it to select DOM nodes with particular attributes, in this case the class attribute.
For more complex options in the selector syntax, see http://jsoup.org/cookbook/extracting-data/selector-syntax

HTML Parsing using Jsoup library

<div class="serieSelector serieSelected" data-serie="36" data-title="Steps">
<div class="value fontGreyBold">2620</div>
<div id="stepsPulse" class="fontGreyLight">Steps</div>
</div>
I am currently working on an Android project that needs to parse some data from a website and display it in TextViews. As seen above, I need to display the highlighted value, which is "2620". I'm using Jsoup, and the snippet above is the element data obtained from the website. I don't know which selector to use.
try {
    Document document = Jsoup.connect(url).get();
    Elements stepstaken = document
            .select("div[class=measureValue fontGreyBold]span[class]");
    stta = stepstaken.attr("class");
} catch (IOException e) {
    e.printStackTrace();
}
The above code doesn't work so any possible replies are appreciated. Thanks!
I always just use PHP Simple DOM Parser whenever I need to parse anything. Then you'd just create a simple REST API that returns the parsed results. Works like a charm. :)
Try this selector
document.select("div.value.fontGreyBold");
Example
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class JsoupParser {
    public static void main(String[] args) {
        String html = "<div class=\"serieSelector serieSelected\" data-serie=\"36\" data-title=\"Steps\">"
                + "<div class=\"value fontGreyBold\">2620</div>"
                + "<div id=\"stepsPulse\" class=\"fontGreyLight\">Steps</div>"
                + "</div>";
        Document document = Jsoup.parse(html);
        // match the div that carries both classes
        Elements stepstaken = document.select("div.value.fontGreyBold");
        System.out.println(stepstaken.text());
    }
}

How to substring at a word boundary for a preview of long HTML content (preserving format) on the web using Java?

I am trying to display a quick summary of a long HTML message sent by a user. I would like to do this in Java rather than JavaScript. How can I achieve this? I have looked at jsoup and HtmlUnit but cannot find a method that does it.
With jsoup you can parse the document, select the inner element where the text content is too long and replace its text content with an excerpt.
Parse a document
Find an element
Extract the text content
Compute a replacement string
Set the new text content
It is all in their documentation.
Put together, this gives:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class A {
    public static void main(String[] args) {
        String html = "<html><head><title>First parse</title></head><body><p>Parsed HTML into a doc.</p></body></html>";
        Document doc = Jsoup.parse(html);
        Element pTag = doc.select("body > p").first(); // the p tag
        String pContent = pTag.text();
        // keep the first 7 characters and mark the cut
        pContent = pContent.substring(0, 7) + "... (too long)";
        pTag.text(pContent);
        System.out.println(doc);
    }
}
Prints:
<html>
<head>
<title>First parse</title>
</head>
<body>
<p>Parsed ... (too long)</p>
</body>
</html>
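The substring call above cuts at a fixed character position, which can split a word. A minimal sketch of a cut that backs up to the previous space, using only the JDK, could look like this. The class name Preview and the method truncateAtWord are hypothetical names for illustration, and only a plain space is treated as a word boundary.

```java
// Hypothetical helper (not part of jsoup): cuts text to at most maxLength
// characters, backing up to the last space so that no word is split.
class Preview {

    static String truncateAtWord(String text, int maxLength) {
        if (text.length() <= maxLength) {
            return text; // short enough, nothing to cut
        }
        // look for the last space at or before the cut position
        int cut = text.lastIndexOf(' ', maxLength);
        if (cut <= 0) {
            cut = maxLength; // no earlier space: fall back to a hard cut
        }
        return text.substring(0, cut) + "...";
    }

    public static void main(String[] args) {
        // prints Parsed HTML...
        System.out.println(truncateAtWord("Parsed HTML into a doc.", 13));
    }
}
```

The result could be passed to pTag.text(...) in place of the substring expression. Note that setting an element's text replaces any inline markup inside that element, so only the surrounding document structure is preserved.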
