Java Jsoup can't select table - java

I have recently started to work on a mini project so I can learn the basis of Jsoup, however I have some difficulty to select a table on a particular website. I'm trying to fetch the table with Jsoup but with no sucess (see picture) http://imgur.com/RC21UBk
I know that the table that i'm trying to get have the class="meddelande" and is also inside a form element which have the same class="meddelande".
HTML code of the website: http://pastebin.com/ufRDhLSy
I'm trying to fetch the red marked area, any idea on how to do it?
Thanks in advance! :)
My code:
public void startMessage(String cookie1) {
try {
doc1 = Jsoup.connect("https://nya.boplats.se/minsida/meddelande")
.timeout(0).cookie("Boplats-Session", cookie1)
.get();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Elements tables = doc1.select("form.meddelande");
Elements table = tables.select("table.meddelande");
System.out.println(table);
}

In your code
Elements tables = doc1.select("form.meddelande");
Elements table = tables.select("table.meddelande");
you are trying to access form with class attribute meddelande but from your linked HTML source meddelande is id, not class, so instead of
form.meddelande
you should use
form#meddelande
^--# means id, dot represents class
So try with
Elements tables = doc.select("form#meddelande");
Elements table = doc.select("table.meddelande");
or maybe simpler
Elements table = doc.select("form#meddelande table.meddelande");
If this will not work then HTML code responsible for table is probably generated by JavaScript. In that case you will not be able to get it with Jsoup, but you will need something like Selenium web driver, or HtmlUtil

In your situation better select class unread and read.
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class JsoupSO {
public static void main(String args0[]) throws IOException {
Document doc;
Elements elements;
doc = Jsoup.parse(new File("path_to_file or use connect for URL"), "UTF-8");
elements = doc.getElementsByClass("unread");
for (Element element : elements) {
System.out.println(element);
}
elements = doc.getElementsByClass("read");
for (Element element : elements) {
System.out.println(element);
}
}
}
Output: http://pastebin.com/CwG1cL5T
And yes read their cookbook http://jsoup.org/

Here is an attempt
Document doc = Jsoup.connect("http://pastebin.com/raw.php?i=ufRDhLSy").get();
System.out.println(doc.select("table[class=meddelande]"));
or use the shorter syntax when selecting the node with a particular class only
System.out.println(doc.select("table.meddelande"));
JSoup supports the selector syntax. So you could use that to select DOM nodes with particular attributes - in this case the class attribute.
For more complex options in the selector syntax, look here http://jsoup.org/cookbook/extracting-data/selector-syntax

Related

How to retrieve data from data table from Sports Reference using JSoup?

I'm attempting to use JSoup to retrieve the amount of wins for a team from a Sports Reference table.
Specifically, I am trying to receive the following data point highlighted below, with the html code provided
Below is what I have tried already, but I get a null pointer exception when trying to access the text of this element, telling me that my code is likely not parsing the HTML code correctly.
Element wins = document.selectFirst("td[data-stat=\"wins\"]");
What I want is for the text of this element to be 34 (or some number depending on the number of wins for the team).
Check what your Document was able to read from page and print it. If it contains HTML content which can be dynamically added by JavaScript by browser, you need to use as tool Selenium not Jsoup.
For reading HTML source, you can write similar to:
import java.io.IOException;
import org.jsoup.Jsoup;
public class JSoupHTMLSourceEx {
public static void main(String[] args) throws IOException {
String webPage = "https://www.basketball-reference.com/teams/CHI/2020.html#all_team_misc";
String html = Jsoup.connect(webPage).get().html();
System.out.println(html);
}
}
Since Jsoup supports cssSelector, you can try to get an element like:
public static void main(String[] args) {
String webPage = "https://www.basketball-reference.com/teams/CHI/2020.html#all_team_misc";
String html = Jsoup.connect(webPage).get().html();
Document document = Jsoup.parse(html);
Elements tds = document.select("#team_misc > tbody > tr:nth-child(1) > td:nth-child(2)");
for (Element e : tds) {
System.out.println(e.text());
}
}
But better solution is to use Selenium - a portable framework for testing web applications (more details about Selenium tool):
public static void main(String[] args) {
String baseUrl = "https://www.basketball-reference.com/teams/CHI/2020.html#all_team_misc";
WebDriver driver = new FirefoxDriver();
driver.get(baseUrl);
String innerText = driver.findElement(
By.xpath("//*[#id="team_misc"]/tbody/tr[1]/td[1]")).getText();
System.out.println(innerText);
driver.quit();
}
}
Also you can try instead of:
driver.findElement(By.xpath("//*[#id="team_misc"]/tbody/tr[1]/td[1]")).getText();
in this form:
driver.findElement(By.xpath("//[#id="team_misc"]/tbody/tr[1]/td[1]")).getAttribute("innerHTML");
P.S. In the future it would be useful to add source links from where you want to get information or at least snippet of the DOM structure instead of image.

How to extract data if the div class comes after an id?

I try to get some data from div which is embedded after an ID and type=hidden. I cannot reach the class to get the links listed in that class.
I am using Jsoup with Elements and .select() or .getElementsbyId() and tried to combine them to reach the class. Without success. The site is https://www.ariva.de/aktien/suche. If you hit the search "Suche starten" button the result table pops up. In this table the links are what I want to reach.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class DatenImportUnternehmen {
public static void main (String[] args) {
String url = "https://www.ariva.de/aktien/suche";
try {
Document document = Jsoup.connect(url).get();
for (Element row : document.select("div.aktiensuche_result_table")) {
if(row.select("input[type=hidden]").text().equals("")) {
continue;
}
else {
String raw = row.select("[type=hidden]").text();
System.out.println(raw);
}
}
}
catch (Exception ex) {
ex.printStackTrace();
}
}
I don't get any result. Eclipse just states terminated.
If I understand correctly you want to get to the links in the table generated when you hit the search button on https://www.ariva.de/aktien/suche.
The first problem you are having is that the search results aren't available directly from this URL. Instead when you click the search button a POST request is made to https://www.ariva.de/aktiensuche/_result_table.m
The result of this request actually contains the table with the links that I believe you are interested in. Specifically the response contains HTML which is then dynamically added to the page as the results table.
The second problem looks to be in the jsoup query. I can't see any hidden input fields in the result table, but it is easy enough to grab the links using document.select("a[href]").
So for me this code:
String searchUrl = "https://www.ariva.de/aktiensuche/_result_table.m";
String searchBody = "page=0&page_size=25&sort=ariva_name&sort_d=asc&ariva_performance_1_year=_&ariva_performance_3_years=&ariva_performance_5_years=&index=0&founding_year=&land=0&industrial_sector=0&sector=0&currency=0&type_of_share=0&year=_all_years&sales=_&profit_loss=&sum_assets=&sum_liabilities=&number_of_shares=&earnings_per_share=&dividend_per_share=&turnover_per_share=&book_value_per_share=&cashflow_per_share=&balance_sheet_total_per_share=&number_of_employees=&turnover_per_employee=_&profit_per_employee=&kgv=_&kuv=_&kbv=_&dividend_yield=_&return_on_sales=_";
// post request to search URL
Document document = Jsoup.connect(searchUrl).requestBody(searchBody).post();
// find links in returned HTML
for(Element link:document.select("a[href]")) {
System.out.println(link);
}
produces the output:
1&1 Drillisch
11 88 0 Solutions
1st Red
21ST. CENT. FOX B NEW
21st Century Fox
2G Energy
3I Group
3I INFRASTRUCTURE
3M Company
3U Holding
3W Power
4imprint Group
4 SC
6,625% Statkraft AS 09/19 auf Festzins
7C Solarparken
888 Holdings
A.A.A. aktiengesellschaft allgemeine anlageverwaltung
A.G. BARR LS-,04167
A.H.T. Syngas Technology
A.S. Creation Tapeten
A+J Mucklow Group
A+JMUCKLOW GRP PREF. LS 1
A2A
AAC Technologies Holding
Aalberts
Which I hope is more or less what you are after. To set search parameters you will need to examine the search form and modify the form data in the searchBody string (or use the .data method instead of .requestBody to build the query).

Parsing Information from URL Using Jsoup

I need help with my Java project using Jsoup (if you think there is a more efficient way to achieve the purpose, please let me know). The purpose of my program is to parse certain useful information from different URLs and put it in a text file. I am not an expert in HTML or JavaScript, therefore, it has been difficult for me to code in Java exactly what I want to parse.
In the website that you see in the code below as one of the examples, the information that interests me to parse with Jsoup is everything you can see in the table under “Routing”(Route, Location, Vessel/Voyage, Container Arrival Date, Container Departure Date; = Origin, Seattle SSA Terminal T18, 26 Jun 15 A, 26 Jun 15 A… and so on).
So far, with Jsoup we are only able to parse the title of the website, yet we have been unsuccessful in getting any of the body.
Here is the code that I used, which I got from an online source:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Jsouptest71115 {
public static void main(String[] args) throws Exception {
String url = "http://google.com/gentrack/trackingMain.do "
+ "?trackInput01=999061985";
Document document = Jsoup.connect(url).get();
String title = document.title();
System.out.println("title : " + title);
String body = document.select("body").text();
System.out.println("Body: " + body);
}
}
Working code:
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.util.ArrayList;
public class Sample {
public static void main(String[] args) {
String url = "http://homeport8.apl.com/gentrack/blRoutingPopup.do";
try {
Connection.Response response = Jsoup.connect(url)
.data("blNbr", "999061985") // tracking number
.method(Connection.Method.POST)
.execute();
Element tableElement = response.parse().getElementsByTag("table")
.get(2).getElementsByTag("table")
.get(2);
Elements trElements = tableElement.getElementsByTag("tr");
ArrayList<ArrayList<String>> tableArrayList = new ArrayList<>();
for (Element trElement : trElements) {
ArrayList<String> columnList = new ArrayList<>();
for (int i = 0; i < 5; i++) {
columnList.add(i, trElement.children().get(i).text());
}
tableArrayList.add(columnList);
}
System.out.println("Origin/Location: "
+tableArrayList.get(1).get(1));// row and column number
System.out.println("Discharge Port/Container Arrival Date: "
+tableArrayList.get(5).get(3));
} catch (IOException e) {
e.printStackTrace();
}
}
}
Output:
Origin/Location: SEATTLE SSA TERMINAL (T18), WA  
Discharge Port/Container Arrival Date: 23 Jul 15  E
You need to utilize document.select("body") select method input to which is CSS selector. To know more about CSS selectors just google it, or Read this. Using CSS selectors you can identify parts of web page body easily.
In your particular case you will have a different problem though, for instance the table you are after is inside an IFrame and if you look at the html of web page you are visiting its(iframe's) url is "http://homeport8.apl.com/gentrack/blRoutingFrame.do", so if you visit this URL directly so that you can access its content you will get an exception which is perhaps some restriction from Server. To get content properly you need to visit two URLs via JSoup, 1. http://homeport8.apl.com/gentrack/trackingMain.do?trackInput01=999061985 and 2. http://homeport8.apl.com/gentrack/blRoutingFrame.do?trackInput01=999061985
For first URL you'll get nothing useful, but for second URL you'll get tables of your interest. The try using document.select("table") which will give you List of tables iterator over this list and find table of your interest. Once you have the table use Element.select("tr") to get a table row and then for each "tr" use Element.select("td") to get table cell data.
The webpage you are visiting didn't use CSS class and id selectors which would have made reading it with jsoup a lot easier so I am afraid iterating over document.select("table") is your best and easy option.
Good Luck.

Extracting Values From an XML File Either using XPath, SAX or DOM for this Specific Scenario

I am currently working on an academic project, developing in Java and XML. Actual task is to parse XML, passing required values preferably in HashMap for further processing. Here is the short snippet of actual XML.
<root>
<BugReport ID = "1">
<Title>"(495584) Firefox - search suggestions passes wrong previous result to form history"</Title>
<Turn>
<Date>'2009-06-14 18:55:25'</Date>
<From>'Justin Dolske'</From>
<Text>
<Sentence ID = "3.1"> Created an attachment (id=383211) [details] Patch v.2</Sentence>
<Sentence ID = "3.2"> Ah. So, there's a ._formHistoryResult in the....</Sentence>
<Sentence ID = "3.3"> The simple fix it to just discard the service's form history result.</Sentence>
<Sentence ID = "3.4"> Otherwise it's trying to use a old form history result that no longer applies for the search string.</Sentence>
</Text>
</Turn>
<Turn>
<Date>'2009-06-19 12:07:34'</Date>
<From>'Gavin Sharp'</From>
<Text>
<Sentence ID = "4.1"> (From update of attachment 383211 [details])</Sentence>
<Sentence ID = "4.2"> Perhaps we should rename one of them to _fhResult just to reduce confusion?</Sentence>
</Text>
</Turn>
<Turn>
<Date>'2009-06-19 13:17:56'</Date>
<From>'Justin Dolske'</From>
<Text>
<Sentence ID = "5.1"> (In reply to comment #3)</Sentence>
<Sentence ID = "5.2"> &gt; (From update of attachment 383211 [details] [details])</Sentence>
<Sentence ID = "5.3"> &gt; Perhaps we should rename one of them to _fhResult just to reduce confusion?</Sentence>
<Sentence ID = "5.4"> Good point.</Sentence>
<Sentence ID = "5.5"> I renamed the one in the wrapper to _formHistResult. </Sentence>
<Sentence ID = "5.6"> fhResult seemed maybe a bit too short.</Sentence>
</Text>
</Turn>
.....
and so on
</BugReport>
There are many commenter like 'Justin Dolske' who have commented on this report and what I actually looking for is the list of commenter and all sentences they have written in a whole XML file. Something like if(from == justin dolske) getHisAllSentences(). Similarly for other commenters (for all). I have tried many different ways to get the sentences only for 'Justin dolske' or other commenters, even in a generic form for all using XPath, SAX and DOM but failed. I am quite new to these technologies including JAVA and any don't know how to achieve it.
Can anyone guide me specifically how could I get it with any of above technologies or is there any other better strategy to do it?
(Note: Later I want to put it in a hashmap such as like this HashMap (key, value) where key = name of commenter (justin dolske) and value is (all sentences))
Urgent help will be highly appreciated.
There're several ways using which you can achieve your requirement.
One way would be use JAXB. There're several tutorials available on this on the web, so feel free to refer to them.
You can also think of creating a DOM and then extracting data from it and then put it into your HashMap.
One reference implementation would be something like this:
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
public class XMLReader {
private HashMap<String,ArrayList<String>> namesSentencesMap;
public XMLReader() {
namesSentencesMap = new HashMap<String, ArrayList<String>>();
}
private Document getDocument(String fileName){
Document document = null;
try{
document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File(fileName));
}catch(Exception exe){
//handle exception
}
return document;
}
private void buildNamesSentencesMap(Document document){
if(document == null){
return;
}
//Get each Turn block
NodeList turnList = document.getElementsByTagName("Turn");
String fromName = null;
NodeList sentenceNodeList = null;
for(int turnIndex = 0; turnIndex < turnList.getLength(); turnIndex++){
Element turnElement = (Element)turnList.item(turnIndex);
//Assumption: <From> element
Element fromElement = (Element) turnElement.getElementsByTagName("From").item(0);
fromName = fromElement.getTextContent();
//Extracting sentences - First check whether the map contains
//an ArrayList corresponding to the name. If yes, then use that,
//else create a new one
ArrayList<String> sentenceList = namesSentencesMap.get(fromName);
if(sentenceList == null){
sentenceList = new ArrayList<String>();
}
//Extract sentences from the Turn node
try{
sentenceNodeList = turnElement.getElementsByTagName("Sentence");
for(int sentenceIndex = 0; sentenceIndex < sentenceNodeList.getLength(); sentenceIndex++){
sentenceList.add(((Element)sentenceNodeList.item(sentenceIndex)).getTextContent());
}
}finally{
sentenceNodeList = null;
}
//Put the list back in the map
namesSentencesMap.put(fromName, sentenceList);
}
}
public static void main(String[] args) {
XMLReader reader = new XMLReader();
reader.buildNamesSentencesMap(reader.getDocument("<your_xml_file>"));
for(String names: reader.namesSentencesMap.keySet()){
System.out.println("Name: "+names+"\tTotal Sentences: "+reader.namesSentencesMap.get(names).size());
}
}
}
Note: This is just a demonstration and you would need to modify it to suit your need. I've created it based on your XML to show one way of doing it.
I suggest to use JAXB to creates a Data Model reflecting your XML structure.
One done, you can load the XML into Java instances.
Put each 'Turn' into a Map< String, List< Turn >>, using Turn.From as key.
Once done, you'll can write:
List< Turn > justinsTurn = allTurns.get( "'Justin Dolske'" );

Using Jsoup to extract data

I am using jsoup to extract data from a table in a website.http://www.moneycontrol.com/stocks/marketstats/gainerloser.php?optex=BSE&opttopic=topgainers&index=-1 using Jsoup. I have referred to Using JSoup To Extract HTML Table Contents and other similar questions but it does not print the data. Could someone please provide me with the code required to achieve this?
public class TestClass
{
public static void main(String args[]) throws IOException
{
Document doc = Jsoup.connect("http://www.moneycontrol.com/stocks/marketstats/gainerloser.php?optex=BSE&opttopic=topgainers&index=-1").get();
for (Element table : doc.select("table.tablehead")) {
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
if (tds.size() > 6) {
System.out.println(tds.get(0).text() + ":" + tds.get(1).text());
}
}
}
If you want to get the content of table(not head), you need change the selector of table:
for (Element table : doc.select("table.tbldata14"))
instead of
for (Element table : doc.select("table.tablehead"))
One important thing is to check what are you getting in Doc when you parse the HTML because there might be few problems with it like:
1. The Site might be using iframes to display content
2. Display content via Javascript
3. few sites have scripts which does not allow jsoup parsing, hence the doc element will contain random data

Categories

Resources