HTML Can't Find Table ID - java

http://games.espn.go.com/ffl/freeagency?leagueId=1566286&teamId=4&seasonId=2015#&seasonId=2015&view=projections&context=freeagency&avail=-1
I am trying to use JSoup to rip the table from this link. However I am very new to HTML and I can not find the right "table id" to use. My code is below, and I have gotten it to work for tables from other pages, so the code is not the issue. I just don't know how to find the right table id. Thank you!
This is the html code I see: http://pastebin.com/d5h5QBb6
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class readURL {
public static void main(String[] args) {
//extractTableUsingJsoup("http://mobilereviews.net/details-for-Motorola%20L7.htm","phone_details");
extractTableUsingJsoup("http://games.espn.go.com/ffl/freeagency?leagueId=1566286&teamId=4&seasonId=2015#&seasonId=2015&view=projections&context=freeagency&avail=-1","INSERT TABLE ID HERE");
}
public static void extractTableUsingJsoup(String url, String tableId){
Document doc;
try {
// need http protocol
doc = Jsoup.connect(url).get();
//Set id of any table from any website and the below code will print the contents of the table.
//Set the extracted data in appropriate data structures and use them for further processing
Element table = doc.getElementById(tableId);
Elements tds = table.getElementsByTag("td");
//You can check for nesting of tds if such structure exists
for (Element td : tds) {
System.out.println("\n"+td.text());
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
As output, which is not what I am looking for. I want the players and their projections.
this is the table i want to get

Related

How can I figure out what elements to use as parameters for cssQuery

I would really want to understand how to actually extract the data I want from a website. I have done it with an IMDb top chart that I got from a tutorial on YouTube but it just confuses me how to know what syntax to insert for the row.select parameters.
I have tried doing it with other websites such as Best Buy, getting the price and name of specific laptops and I failed because I am pretty sure I put the wrong parameters(cssQuery).
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import javax.swing.*;
import java.io.IOException;
public class Scraper {
static String title;
static final String url = "https://www.imdb.com/chart/top";
public static void main(String args[])throws IOException {
final Document document = Jsoup.connect(url).get();
for(Element row: document.select("table.chart.full-width tr")){
final String title = row.select(".titleColumn a").text();
final String rating = row.select(".imdbRating").text();
System.out.println(title);
System.out.println(rating);
}
}
}
for what i have undersnd from our question is that you dont know which css class t put in your code. for that you could inspect website by right-clicking on the website and click inspect element and from there you can check the div class by pressing ctrl+shift+c and hover over any element on the website like i have shown in below image

Why html code in chrome devtools and html code parsed by jsoup are different?

I'm trying to extract information about created date of issues from HADOOP Jira issue site(https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues)
As you can see in this Screenshot, created date is the text between the time tag whose class is live stamp(e.g. <time class=livestamp ...> 'this text' </time>)
So, I tried parse it with code as below.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class CreatedDateExtractor {
public static void main(String[] args) {
String url = "https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues";
Document doc = null;
try {
doc = Jsoup.connect(url).get();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Elements elements = doc.select("time.livestamp"); //This line finds elements that matches time tags with livestamp class
System.out.println("# of elements : "+ elements.size());
for(Element e: elements) {
System.out.println(e.text());
}
}
}
I expect that created date is extracted, but the actual output is
# of elements : 0.
I found this is something wrong. So, I tried to parse whole html code from that side with below code.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class CreatedDateExtractor {
public static void main(String[] args) {
String url = "https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues";
Document doc = null;
try {
doc = Jsoup.connect(url).get();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Elements elements = doc.select("*"); //This line finds whole elements in html document.
System.out.println("# of elements : "+ elements.size());
for(Element e: elements) {
System.out.println(e);
}
}
}
I compared both the html code in chrome devtools and the html code that I parsed one by one. Then I found those are different.
Can you explain why this happens and give me some advices how to extract created date?
I advice you to get elements with "time" tag, and use select to get time tags which have "livestamp" class. Here is the example:
Elements timeTags = doc.select("time");
Element timeLivestamp = null;
for(Element tag:timeTags){
Element livestamp = tag.selectFirst(".livestamp");
if(livestamp != null){
 timeLivestamp = livestamp;
break;
}
}
I don't know why but when I want to use .select() method of Jsoup with more than 1 selector (as you used like time.livestamp), I get interesting outputs like this.

How to access data from Google Finance table with JSoup?

I am trying to scrape the data from the table where it states 'range', '52 week', 'open', etc... on this site: https://finance.google.com/finance?q=aapl&ei=czANWqmhNoPYswHV9YnwBg
The code below scrapes the write data but not in the format I want. It outputs the contents of the whole table as one string, whereas I would like each part of the table to be outputted separately on its own line.
Any help would be greatly appreciated.
Thank you!
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class JSoup {
public static void main(String[] args) throws Exception {
String ticker = "AAPL";
String url = "https://finance.google.com/finance?q="+ticker+"&ei=czANWqmhNoPYswHV9YnwBg";
Document document = Jsoup.connect(url).get();
for (Element row : document.select("table.snap-data")) {
final String key = row.select(".range").text();
final String val = row.select(".val").text();
System.out.println(key);
System.out.println(val);
}
}
}

Extracting data from HTML table with Jsoup

I am trying to extract the data from the table on the following website. I.e Club, venue, start time. http://www.national-autograss.co.uk/february.htm
I have got many examples on here working that use a css class table but this website doesn't. I have made an attempt with the code below but it doesn't seem to provide any output. Any help would be very much appreciated.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Main {
public static void main(String[] args) {
Document doc = null;
try {
doc = Jsoup.connect("http://www.national-autograss.co.uk/february.htm").get();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Elements elements = doc.select("table#table1");
String name;
for( Element element : elements )
{
name = element.text();
System.out.println(name);
}
}
}
An id should be unique, so you should use directly doc.select("#table1") and so on

HTML to PDF using iText : How can produce a checkbox

I have a simple HTML page, iText is able to produce a PDF from it. It's fine but the checkbox is ignored. What can I do about it ?
import java.io.FileOutputStream;
import java.io.StringReader;
import com.itextpdf.text.Document;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.html.simpleparser.HTMLWorker;
import com.itextpdf.text.pdf.PdfWriter;
public class HtmlToPDF {
public static void main(String ... args ) {
try {
Document document = new Document(PageSize.LETTER);
PdfWriter pdfWriter = PdfWriter.getInstance(document, new FileOutputStream("c://temp//testpdf.pdf"));
document.open();
String str = "<HTML><HEAD></HEAD><BODY><H1>Testing</H1><FORM>" +
"check : <INPUT TYPE='checkbox' CHECKED/><br/>" +
"</FORM></BODY></HTML>";
htmlWorker.parse(new StringReader(str));
document.close();
System.out.println("Done.");
}
catch (Exception e) {
e.printStackTrace();
}
}
}
I got it working with YAHP ( http://www.allcolor.org/YaHPConverter/ ).
import java.io.File;
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
// http://www.allcolor.org/YaHPConverter/
import org.allcolor.yahp.converter.CYaHPConverter;
import org.allcolor.yahp.converter.IHtmlToPdfTransformer;
public class HtmlToPdf_yahp {
public static void main(String ... args ) throws Exception {
htmlToPdfFile();
}
public static void htmlToPdfFile() throws Exception {
CYaHPConverter converter = new CYaHPConverter();
File fout = new File("c:/temp/x.pdf");
FileOutputStream out = new FileOutputStream(fout);
Map properties = new HashMap();
List headerFooterList = new ArrayList();
String str = "<HTML><HEAD></HEAD><BODY><H1>Testing</H1><FORM>" +
"check : <INPUT TYPE='checkbox' checked=checked/><br/>" +
"</FORM></BODY></HTML>";
properties.put(IHtmlToPdfTransformer.PDF_RENDERER_CLASS,
IHtmlToPdfTransformer.FLYINGSAUCER_PDF_RENDERER);
//properties.put(IHtmlToPdfTransformer.FOP_TTF_FONT_PATH, fontPath);
converter.convertToPdf(str,
IHtmlToPdfTransformer.A4P, headerFooterList, "file://c:/temp/", out,
properties);
out.flush();
out.close();
}
}
Are you generating the HTML?
If so, then instead of using an HTML checkbox you could using the Unicode 'ballot box' character, which is ☐ or ☐. It's just a box, you can't electronically tick it or untick it; but if the PDF is intended for printing then of course people can tick it using a pen or pencil.
For example:
String str = "<HTML><HEAD></HEAD><BODY><H1>Testing</H1><FORM>" +
"check : ☐<br/>" +
"</FORM></BODY></HTML>";
Note that this will only work if you're using a Unicode font in your PDF; I think that iText won't use a Unicode font unless you tell it to.
You may be out of luck here.
The "htmlWorker" which is used to parse the html tags, doesn't seem to support the "input" tag.
public static final String tagsSupportedString = "ol ul li a pre font span br p div body table td th tr i b u sub sup em strong s strike h1 h2 h3 h4 h5 h6 img";
You can access the source code for "HtmlWorker" from here.
http://www.java2s.com/Open-Source/Java-Document/PDF/pdf-itext/com/lowagie/text/html/simpleparser/HTMLWorker.java.htm
It is from this source that I figured that out.
public void startElement(String tag, HashMap h) {
if (!tagsSupported.containsKey(tag))
return; //return if tag not supported
// ...
}
creating pdfs with iText from html is a bit troubled.
i advise to use the flying saucer library for this.
it is also using iText in the background.
The only alternative I'm aware of at that point is to hack iText. The new XMLWorker should be considerably more extensible than The Old Way (HTMLWorker), but it'll still be Non Trivial.
There might be some magic style tag you can pass in that will show up in a "generic tag" for a PdfPageEventHandler... lets see here...
Reading the code, it looks like a style or attribute "generictag" will be propagated to the ...text.Chunk object via setGenericTag().
So what you need to do is XSLT your unsupported tags into div/p/whatever with a "generictag" attribute that is a string which encodes the information you need to recreate the original element.
In your PdfPageEventHandler's OnGenericTag function, you have to parse that tag and recreate whatever it is you're trying to recreate.
That's just crazy enough to work!

Categories

Resources