Using Jsoup to extract data

Using Jsoup to extract data - java

I am using jsoup to extract data from a table in a website.http://www.moneycontrol.com/stocks/marketstats/gainerloser.php?optex=BSE&opttopic=topgainers&index=-1 using Jsoup. I have referred to Using JSoup To Extract HTML Table Contents and other similar questions but it does not print the data. Could someone please provide me with the code required to achieve this?
public class TestClass
{
public static void main(String args[]) throws IOException
{
Document doc = Jsoup.connect("http://www.moneycontrol.com/stocks/marketstats/gainerloser.php?optex=BSE&opttopic=topgainers&index=-1").get();
for (Element table : doc.select("table.tablehead")) {
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
if (tds.size() > 6) {
System.out.println(tds.get(0).text() + ":" + tds.get(1).text());
}
}
}

If you want to get the content of table(not head), you need change the selector of table:
for (Element table : doc.select("table.tbldata14"))
instead of
for (Element table : doc.select("table.tablehead"))

One important thing is to check what are you getting in Doc when you parse the HTML because there might be few problems with it like:
1. The Site might be using iframes to display content
2. Display content via Javascript
3. few sites have scripts which does not allow jsoup parsing, hence the doc element will contain random data

Related

How to retrieve data from data table from Sports Reference using JSoup?

I'm attempting to use JSoup to retrieve the amount of wins for a team from a Sports Reference table.
Specifically, I am trying to receive the following data point highlighted below, with the html code provided
Below is what I have tried already, but I get a null pointer exception when trying to access the text of this element, telling me that my code is likely not parsing the HTML code correctly.
Element wins = document.selectFirst("td[data-stat=\"wins\"]");
What I want is for the text of this element to be 34 (or some number depending on the number of wins for the team).

Check what your Document was able to read from page and print it. If it contains HTML content which can be dynamically added by JavaScript by browser, you need to use as tool Selenium not Jsoup.
For reading HTML source, you can write similar to:
import java.io.IOException;
import org.jsoup.Jsoup;
public class JSoupHTMLSourceEx {
public static void main(String[] args) throws IOException {
String webPage = "https://www.basketball-reference.com/teams/CHI/2020.html#all_team_misc";
String html = Jsoup.connect(webPage).get().html();
System.out.println(html);
}
}
Since Jsoup supports cssSelector, you can try to get an element like:
public static void main(String[] args) {
String webPage = "https://www.basketball-reference.com/teams/CHI/2020.html#all_team_misc";
String html = Jsoup.connect(webPage).get().html();
Document document = Jsoup.parse(html);
Elements tds = document.select("#team_misc > tbody > tr:nth-child(1) > td:nth-child(2)");
for (Element e : tds) {
System.out.println(e.text());
}
}
But better solution is to use Selenium - a portable framework for testing web applications (more details about Selenium tool):
public static void main(String[] args) {
String baseUrl = "https://www.basketball-reference.com/teams/CHI/2020.html#all_team_misc";
WebDriver driver = new FirefoxDriver();
driver.get(baseUrl);
String innerText = driver.findElement(
By.xpath("//*[#id="team_misc"]/tbody/tr[1]/td[1]")).getText();
System.out.println(innerText);
driver.quit();
}
}
Also you can try instead of:
driver.findElement(By.xpath("//*[#id="team_misc"]/tbody/tr[1]/td[1]")).getText();
in this form:
driver.findElement(By.xpath("//[#id="team_misc"]/tbody/tr[1]/td[1]")).getAttribute("innerHTML");
P.S. In the future it would be useful to add source links from where you want to get information or at least snippet of the DOM structure instead of image.

Use JSoup to get all textual links

I'm using JSoup to grab content from web pages.
I want to get all the links on a page that have some contained text (it doesn't matter what the text is) just needs to be non-empty/image etc.
Example of links I want:
Link to Some Page
Since it contains the text "Link to Some Page"
Links I don't want:
<img src="someimage.jpg"/>
My code looks like this. How can I modify it to only get the first type of link?
Document document = // I get my document object
Elements linksOnPage = document.select("a[href]")
for (Element page : linksOnPage) {
String link = page.attr("abs:href");
// I do stuff with the link
}

You could do something like this.
It does it's job though it's probably not the fanciest solution out there.
Note: the function text() gets you a clean text so if there are any HTML code fragements inside it, it won't return them.
Document doc = // get the doc
Elements linksOnPage = document.select("a");
for (Element pageElem : linksOnPage){
String link = "";
if(pageElem.text().trim().equals(""))
continue;
// do smth with it
}

I am using this and it's working fine:
Document document = // I get my document object
Elements linksOnPage = document.select("a:matches(([^\\s]+))");
for (Element page : linksOnPage) {
String link = page.attr("abs:href");
// I do stuff with the link
}

Real time web crawling using Jsoup

I have this web page https://rrtp.comed.com/pricing-table-today/ and from that I need to get the information about Time (Hour Ending) and Day-Ahead Hourly Price column alone. I tried with the following code,
Document doc = Jsoup.connect("https://rrtp.comed.com/pricing-table-today/").get();
for (Element table : doc.select("table.prices three-col")) {
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
if (tds.size() > 2) {
System.out.println(tds.get(0).text() + ":" + tds.get(1).text());
}
}
}
but unfortunately I am unable to get the data I need.
Is there something wrong in the code..? or This page can't be crawled...?
Need some help

As I said in comment:
You should hit https://rrtp.comed.com/rrtp/ServletFeed?type=pricingtabledual&date=20150717 because it's source from which data is loaded on the page you have pointed to.
Data under this link is not a valid html document (and this is why it's not working for you), but you can easily make it "quite" right.
All you have to do is first get the response and add <table>..</table> tags around it, then it's enough to parse it as html document.
Connection.Response response = Jsoup.connect("https://rrtp.comed.com/rrtp/ServletFeed?type=pricingtabledual&date=20150717").execute();
Document doc = Jsoup.parse("<table>" + response.body() + "</table>");
for (Element element : doc.select("tr")) {
System.out.println(element.html());
}

Java Jsoup can't select table

I have recently started to work on a mini project so I can learn the basis of Jsoup, however I have some difficulty to select a table on a particular website. I'm trying to fetch the table with Jsoup but with no sucess (see picture) http://imgur.com/RC21UBk
I know that the table that i'm trying to get have the class="meddelande" and is also inside a form element which have the same class="meddelande".
HTML code of the website: http://pastebin.com/ufRDhLSy
I'm trying to fetch the red marked area, any idea on how to do it?
Thanks in advance! :)
My code:
public void startMessage(String cookie1) {
try {
doc1 = Jsoup.connect("https://nya.boplats.se/minsida/meddelande")
.timeout(0).cookie("Boplats-Session", cookie1)
.get();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Elements tables = doc1.select("form.meddelande");
Elements table = tables.select("table.meddelande");
System.out.println(table);
}

In your code
Elements tables = doc1.select("form.meddelande");
Elements table = tables.select("table.meddelande");
you are trying to access form with class attribute meddelande but from your linked HTML source meddelande is id, not class, so instead of
form.meddelande
you should use
form#meddelande
^--# means id, dot represents class
So try with
Elements tables = doc.select("form#meddelande");
Elements table = doc.select("table.meddelande");
or maybe simpler
Elements table = doc.select("form#meddelande table.meddelande");
If this will not work then HTML code responsible for table is probably generated by JavaScript. In that case you will not be able to get it with Jsoup, but you will need something like Selenium web driver, or HtmlUtil

In your situation better select class unread and read.
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class JsoupSO {
public static void main(String args0[]) throws IOException {
Document doc;
Elements elements;
doc = Jsoup.parse(new File("path_to_file or use connect for URL"), "UTF-8");
elements = doc.getElementsByClass("unread");
for (Element element : elements) {
System.out.println(element);
}
elements = doc.getElementsByClass("read");
for (Element element : elements) {
System.out.println(element);
}
}
}
Output: http://pastebin.com/CwG1cL5T
And yes read their cookbook http://jsoup.org/

Here is an attempt
Document doc = Jsoup.connect("http://pastebin.com/raw.php?i=ufRDhLSy").get();
System.out.println(doc.select("table[class=meddelande]"));
or use the shorter syntax when selecting the node with a particular class only
System.out.println(doc.select("table.meddelande"));
JSoup supports the selector syntax. So you could use that to select DOM nodes with particular attributes - in this case the class attribute.
For more complex options in the selector syntax, look here http://jsoup.org/cookbook/extracting-data/selector-syntax

Grabbing data from an XML stream using jsoup

Im trying to get the 'Done' from between these tags in an XML stream
<scan_run_status>Done</scan_run_status>
This is the code I'm using but it returns nothing.
I am picking up id and status in the same way no problem.
input1 contains the xml string
status = Jsoup.parse(input1).getAllElements().attr("scan_run_status");
System.out.println(status);
Thanks

Try this:
Document doc = Jsoup.parse(input1);
/* Grab all elements named "scan_run_sttus" */
Elements els = doc.select("scan_run_status");
/* If you need the first only ...here it is..*/
String status = els.first().text();
/* Otherwise you can loop
for (Element el: els) {
//.Do something
}
*/
and don't forget to do all the checks if els is empty etc. etc.
Here you can found all the tips about JSoup selector.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Using Jsoup to extract data - java

If you want to get the content of table(not head), you need change the selector of table: for (Element table : doc.select("table.tbldata14")) instead of for (Element table : doc.select("table.tablehead"))

Related

How to retrieve data from data table from Sports Reference using JSoup?

Use JSoup to get all textual links

Real time web crawling using Jsoup

Java Jsoup can't select table

Grabbing data from an XML stream using jsoup

Categories

Resources