I am trying to extract the data from the table on the following website, i.e. club, venue and start time: http://www.national-autograss.co.uk/february.htm
Many of the examples I have found on here rely on a CSS class on the table, but this website doesn't have one. I made an attempt with the code below, but it doesn't seem to produce any output. Any help would be very much appreciated.
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {

    public static void main(String[] args) {
        Document doc = null;
        try {
            doc = Jsoup.connect("http://www.national-autograss.co.uk/february.htm").get();
        } catch (IOException e) {
            e.printStackTrace();
        }

        Elements elements = doc.select("table#table1");
        String name;
        for (Element element : elements) {
            name = element.text();
            System.out.println(name);
        }
    }
}
An id should be unique, so you can select the table directly with doc.select("#table1") and then drill into its rows and cells.
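For example, a minimal sketch that walks the table rows and prints the first three cells (this assumes #table1 exists on the page and that each row holds club, venue and start time in that order; adjust the selector and indices to the real layout):

Document doc = Jsoup.connect("http://www.national-autograss.co.uk/february.htm").get();
for (Element row : doc.select("#table1 tr")) {
    Elements cells = row.select("td");
    if (cells.size() >= 3) {
        // Assumed column order: club | venue | start time
        System.out.println(cells.get(0).text() + " | " + cells.get(1).text() + " | " + cells.get(2).text());
    }
}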
I need to calculate the number and percentage of polar/non-polar and aliphatic/aromatic/heterocyclic amino acids in a protein sequence that I got from UniProt, using BioJava.
I found how to read FASTA files in the BioJava tutorial and implemented that part in the code below, but I have no idea how to solve the counting problem itself. If you have any ideas, or know sources where I can check this, please help me. This is the code:
package biojava.biojava_project;

import java.net.URL;

import org.biojava.nbio.core.sequence.ProteinSequence;
import org.biojava.nbio.core.sequence.io.FastaReaderHelper;

public class BioSeq {

    // Fetches the sequence from UniProt
    public static ProteinSequence getSequenceForId(String uniProtId) throws Exception {
        URL uniprotFasta = new URL(String.format("https://rest.uniprot.org/uniprotkb/%s.fasta", uniProtId));
        ProteinSequence seq = FastaReaderHelper.readFastaProteinSequence(uniprotFasta.openStream()).get(uniProtId);
        System.out.printf("id : %s %s%s%s", uniProtId, seq, System.getProperty("line.separator"), seq.getOriginalHeader());
        System.out.println();
        return seq;
    }

    public static void main(String[] args) {
        try {
            System.out.println(getSequenceForId("P31574"));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
I don't know whether BioJava stores these properties anywhere, but it's easy to list the amino acids of each class manually, then iterate over the sequence and count the residues that satisfy the property. Here's an example for polarity:
import java.io.InputStream;
import java.net.URL;
import java.util.Set;

import org.biojava.nbio.core.sequence.ProteinSequence;
import org.biojava.nbio.core.sequence.compound.AminoAcidCompound;
import org.biojava.nbio.core.sequence.io.FastaReaderHelper;

public class BioSeq {

    public static void main(String[] args) throws Exception {
        ProteinSequence seq = loadFromUniprot("P31574");
        int polarCount = numberOfOccurrences(seq, /* polar AAs: */ Set.of("Y", "S", "T", "N", "Q", "C"));
        System.out.println("% of polar AAs: " + ((double) polarCount) / seq.getLength());
    }

    public static ProteinSequence loadFromUniprot(String uniProtId) throws Exception {
        URL uniprotFasta = new URL(String.format("https://rest.uniprot.org/uniprotkb/%s.fasta", uniProtId));
        try (InputStream is = uniprotFasta.openStream()) {
            return FastaReaderHelper.readFastaProteinSequence(is).get(uniProtId);
        }
    }

    private static int numberOfOccurrences(ProteinSequence seq, Set<String> bases) {
        int count = 0;
        for (AminoAcidCompound aminoAcid : seq)
            if (bases.contains(aminoAcid.getBase()))
                count++;
        return count;
    }
}
PS: don't forget to close IO streams after you've used them. In the example above I used the try-with-resources syntax, which closes the InputStream automatically.
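The same helper extends directly to the other groupings the question asks about. Here is a sketch that reuses loadFromUniprot and numberOfOccurrences from the class above; note that the class memberships below are my assumption of one common textbook classification, so double-check them against your course material before relying on the numbers:

ProteinSequence seq = loadFromUniprot("P31574");
double n = seq.getLength();

Set<String> polar = Set.of("Y", "S", "T", "N", "Q", "C");
Set<String> aliphatic = Set.of("G", "A", "V", "L", "I");  // assumed grouping
Set<String> aromatic = Set.of("F", "W", "Y");             // assumed grouping
Set<String> heterocyclic = Set.of("H", "P", "W");         // assumed grouping

System.out.println("% polar:        " + numberOfOccurrences(seq, polar) / n);
System.out.println("% non-polar:    " + (seq.getLength() - numberOfOccurrences(seq, polar)) / n);
System.out.println("% aliphatic:    " + numberOfOccurrences(seq, aliphatic) / n);
System.out.println("% aromatic:     " + numberOfOccurrences(seq, aromatic) / n);
System.out.println("% heterocyclic: " + numberOfOccurrences(seq, heterocyclic) / n);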
I'm trying to extract the created date of issues from the HADOOP Jira issue site (https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues).
As you can see in this screenshot, the created date is the text inside the time tag whose class is livestamp (e.g. <time class="livestamp" ...>this text</time>).
So I tried to parse it with the code below.
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CreatedDateExtractor {

    public static void main(String[] args) {
        String url = "https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues";
        Document doc = null;
        try {
            doc = Jsoup.connect(url).get();
        } catch (IOException e) {
            e.printStackTrace();
        }

        Elements elements = doc.select("time.livestamp"); // finds time tags with the livestamp class
        System.out.println("# of elements : " + elements.size());
        for (Element e : elements) {
            System.out.println(e.text());
        }
    }
}
I expected the created date to be extracted, but the actual output is
# of elements : 0
I figured something was wrong, so I tried dumping the whole HTML of that page with the code below.
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CreatedDateExtractor {

    public static void main(String[] args) {
        String url = "https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues";
        Document doc = null;
        try {
            doc = Jsoup.connect(url).get();
        } catch (IOException e) {
            e.printStackTrace();
        }

        Elements elements = doc.select("*"); // finds every element in the document
        System.out.println("# of elements : " + elements.size());
        for (Element e : elements) {
            System.out.println(e);
        }
    }
}
I compared the HTML in Chrome DevTools with the HTML I parsed, element by element, and found that they are different.
Can you explain why this happens and give me some advice on how to extract the created date?
I advise you to get the elements with the time tag first, and then select the ones that have the livestamp class. Here is an example:
Elements timeTags = doc.select("time");
Element timeLivestamp = null;
for (Element tag : timeTags) {
    Element livestamp = tag.selectFirst(".livestamp");
    if (livestamp != null) {
        timeLivestamp = livestamp;
        break;
    }
}
I don't know why, but when I use Jsoup's .select() method with more than one selector (as you did with time.livestamp), I get odd outputs like this.
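For completeness, a quick usage sketch for the element found above (the null check and printout are my addition, not part of the original code):

if (timeLivestamp != null) {
    // The created date is the element's text content
    System.out.println("Created: " + timeLivestamp.text());
} else {
    System.out.println("No time element with class livestamp was found");
}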
Consider a URL, www.example.com. It may have plenty of links; some may be internal and others external. I want to get a list of all the sub-links, not the sub-sub-links, only the direct sub-links.
E.g. if there are four links as follows:
1) www.example.com/images/main
2) www.example.com/data
3) www.example.com/users
4) www.example.com/admin/data
then out of the four, only 2 and 3 are of use, as they are sub-links rather than sub-sub-links and so on. Is there a way to achieve this through Jsoup? If this cannot be achieved through Jsoup, can someone introduce me to some other Java API?
Also note that each result should be a link under the parent URL that was initially sent (i.e. www.example.com).
If I understand correctly that a sub-link contains only one slash, you can try counting the number of slashes, for example:
List<String> list = new ArrayList<>();
list.add("www.example.com/images/main");
list.add("www.example.com/data");
list.add("www.example.com/users");
list.add("www.example.com/admin/data");

for (String link : list) {
    if ((link.length() - link.replaceAll("[/]", "").length()) == 1) {
        System.out.println(link);
    }
}
link.length() counts all the characters in the link.
link.replaceAll("[/]", "").length() is the length with the slashes removed, so the difference between the two is the number of slashes.
If the difference equals one, it is a direct sub-link; otherwise it is not.
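If you prefer not to count characters by hand, a slightly more robust variant (my sketch, not part of the original answer) is to parse the link and count path segments; this assumes the links have no scheme, as in the example list:

import java.net.URI;

// A direct sub-link has a path with exactly one segment, e.g. /data but not /admin/data.
static boolean isDirectSubLink(String link) {
    String path = URI.create("http://" + link).getPath(); // prepend a scheme so URI parses the host
    return path.split("/").length == 2;                   // "/data".split("/") -> ["", "data"]
}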
EDIT
How will I scan the whole website for sub-links?
One answer is the robots.txt file (the Robots Exclusion Standard), which lists many of a website's sub-links; see for example https://stackoverflow.com/robots.txt. The idea is to read this file and extract the sub-links from it. Here is a piece of code that can help you:
public static void main(String[] args) throws Exception {
    // Your web site
    String website = "http://stackoverflow.com";
    // We will read the URL https://stackoverflow.com/robots.txt
    URL url = new URL(website + "/robots.txt");
    // List of your sub-links
    List<String> list;

    // Read the file with BufferedReader
    try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
        String subLink;
        list = new ArrayList<>();
        // Loop through the file
        while ((subLink = in.readLine()) != null) {
            // If the line matches this regex, add the sub-link to the list
            if (subLink.matches("Disallow: \\/\\w+\\/")) {
                list.add(website + "/" + subLink.replace("Disallow: /", ""));
            } else {
                System.out.println("not match");
            }
        }
    }
    // Print the result
    System.out.println(list);
}
This will show you:
[https://stackoverflow.com/posts/, https://stackoverflow.com/posts?,
https://stackoverflow.com/search/, https://stackoverflow.com/search?,
https://stackoverflow.com/feeds/, https://stackoverflow.com/feeds?,
https://stackoverflow.com/unanswered/,
https://stackoverflow.com/unanswered?, https://stackoverflow.com/u/,
https://stackoverflow.com/messages/, https://stackoverflow.com/ajax/,
https://stackoverflow.com/plugins/]
Here is a demo of the regex that I used.
Hope this helps.
To scan the links on a web page you can use the Jsoup library.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class ReadData {

    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("**your_url**").get();
            Elements links = doc.select("a");
            List<String> list = new ArrayList<>();
            for (Element link : links) {
                list.add(link.attr("abs:href"));
            }
        } catch (IOException ex) {
            // ignored here; consider logging in real code
        }
    }
}
The list can then be used as suggested in the previous answer.
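For instance, to narrow the collected links down to direct sub-links of the site you started from, you can combine this with the slash-counting idea from the answer above (a sketch; the site value is a placeholder):

// Keep only direct sub-links of the site we started from.
String site = "www.example.com"; // placeholder: the host you crawled
for (String link : list) {
    String trimmed = link.replaceFirst("^https?://", ""); // drop the scheme
    boolean sameSite = trimmed.startsWith(site);
    boolean oneSlash = trimmed.length() - trimmed.replace("/", "").length() == 1;
    if (sameSite && oneSlash) {
        System.out.println(link);
    }
}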
The code for reading all the links on a website is given below. I have used http://stackoverflow.com/ for illustration. I would recommend that you go through a company's terms of use before scraping its website.
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class ReadAllLinks {

    public static Set<String> uniqueURL = new HashSet<String>();
    public static String my_site;

    public static void main(String[] args) {
        ReadAllLinks obj = new ReadAllLinks();
        my_site = "stackoverflow.com";
        obj.get_links("http://stackoverflow.com/");
    }

    private void get_links(String url) {
        try {
            Document doc = Jsoup.connect(url).get();
            Elements links = doc.select("a");
            links.stream().map((link) -> link.attr("abs:href")).forEachOrdered((this_url) -> {
                boolean add = uniqueURL.add(this_url);
                if (add && this_url.contains(my_site)) {
                    System.out.println(this_url);
                    get_links(this_url);
                }
            });
        } catch (IOException ex) {
            // skip pages that fail to load
        }
    }
}
You will get the list of all the links in the uniqueURL field.
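Note that the recursion has no depth limit, so on a large site it can run for a very long time or overflow the stack. A variant with a depth cap (my tweak, not part of the original answer) bounds the recursion:

// Hypothetical variant of get_links with a depth cap.
private void get_links(String url, int depth) {
    if (depth > 2) {
        return; // stop after a few levels; tune as needed
    }
    try {
        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a");
        links.stream().map((link) -> link.attr("abs:href")).forEachOrdered((this_url) -> {
            if (uniqueURL.add(this_url) && this_url.contains(my_site)) {
                System.out.println(this_url);
                get_links(this_url, depth + 1);
            }
        });
    } catch (IOException ex) {
        // skip pages that fail to load
    }
}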
http://games.espn.go.com/ffl/freeagency?leagueId=1566286&teamId=4&seasonId=2015#&seasonId=2015&view=projections&context=freeagency&avail=-1
I am trying to use Jsoup to rip the table from this link. However, I am very new to HTML and I cannot find the right table id to use. My code is below; I have gotten it to work for tables from other pages, so the code itself is not the issue. I just don't know how to find the right table id. Thank you!
This is the HTML code I see: http://pastebin.com/d5h5QBb6
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ReadURL {

    public static void main(String[] args) {
        //extractTableUsingJsoup("http://mobilereviews.net/details-for-Motorola%20L7.htm","phone_details");
        extractTableUsingJsoup("http://games.espn.go.com/ffl/freeagency?leagueId=1566286&teamId=4&seasonId=2015#&seasonId=2015&view=projections&context=freeagency&avail=-1", "INSERT TABLE ID HERE");
    }

    public static void extractTableUsingJsoup(String url, String tableId) {
        Document doc;
        try {
            // need http protocol
            doc = Jsoup.connect(url).get();
            // Set the id of any table from any website and the code below will print its contents.
            // Put the extracted data into appropriate data structures for further processing.
            Element table = doc.getElementById(tableId);
            Elements tds = table.getElementsByTag("td");
            // You can check for nesting of tds if such a structure exists
            for (Element td : tds) {
                System.out.println("\n" + td.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The output I get is not what I am looking for; I want the players and their projections.
This is the table I want to get (see screenshot).
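As a first diagnostic step, it may help to list every table id that the page actually serves to Jsoup, since the markup the server sends can differ from what the browser shows. A quick sketch (url is the same URL as in the code above):

// Print the id and class of every table in the served HTML.
Document doc = Jsoup.connect(url).get();
for (Element table : doc.select("table")) {
    System.out.println("id='" + table.id() + "' class='" + table.className() + "'");
}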
I am new to Scala. I am trying to import contacts from Gmail into my application. I created a sample Java application in Eclipse by following https://developers.google.com/google-apps/contacts/v2/developers_guide_java?csw=1#retrieving_without_query, and I can import the contacts in my Java application; it works fine. My Java code is:
import com.google.gdata.client.contacts.ContactsService;
import com.google.gdata.data.contacts.ContactEntry;
import com.google.gdata.data.contacts.ContactFeed;
import com.google.gdata.util.AuthenticationException;
import com.google.gdata.util.ServiceException;

import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.List;

/**
 * This is a test template
 */
public class Contacts {

    public static void main(String[] args) {
        try {
            // Create a new Contacts service
            System.out.println("hiiii" + args[0]);
            ContactsService myService = new ContactsService("My Application");
            myService.setUserCredentials(args[0], args[1]);

            // Get a list of all entries
            URL metafeedUrl = new URL("http://www.google.com/m8/feeds/contacts/" + args[0] + "#gmail.com/base");
            System.out.println("Getting Contacts entries...\n");
            ContactFeed resultFeed = myService.getFeed(metafeedUrl, ContactFeed.class);
            List<ContactEntry> entries = resultFeed.getEntries();
            for (int i = 0; i < entries.size(); i++) {
                ContactEntry entry = entries.get(i);
                System.out.println("\t" + entry.getTitle().getPlainText());
                System.out.println("\t" + entry.getEmailAddresses());
                for (com.google.gdata.data.extensions.Email emi : entry.getEmailAddresses())
                    System.out.println(emi.getAddress());
            }
            System.out.println("\nTotal Entries: " + entries.size());
        } catch (AuthenticationException e) {
            e.printStackTrace();
            System.out.println("Authentication failed");
        } catch (MalformedURLException e) {
            e.printStackTrace();
            System.out.println("url");
        } catch (ServiceException e) {
            e.printStackTrace();
            System.out.println("Service exc");
        } catch (IOException e) {
            e.printStackTrace();
            System.out.println("IO exception");
        }
    }
}
I tried to use the same library in my Scala code, but it doesn't work. My Scala code is:
import com.google.gdata.client.contacts.ContactsService
import com.google.gdata.data.contacts.ContactEntry
import com.google.gdata.data.contacts.ContactFeed
import com.google.gdata.util.ServiceException
import com.google.gdata.util.AuthenticationException

import java.io.IOException
import java.net.URL
import java.net.MalformedURLException

object Contacts {
  class Test {
    def main(args: Array[String]) {
      println("hiii")
      try {
        // Create a new Contacts service
        // ContactsService myService = new ContactsService("My Application");
        // myService.setUserCredentials(args[0], args[1]);
        val myService = new ContactsService("My App")
        myService.setUserCredentials("MyemailId", "password")
        val metafeedUrl = new URL("http://www.google.com/m8/feeds/contacts/" + "MyemailId" + "#gmail.com/base")
        val resultFeed = myService.getFeed(metafeedUrl, classOf[ContactFeed])
        // List<ContactEntry> entries = resultFeed.getEntries();
        val entries = resultFeed.getEntries()
        for (i <- 0 to entries.size()) {
          var entry = entries.get(i)
          println(entry.getTitle().getPlainText())
        }
      } catch {
        case e: AuthenticationException => e.printStackTrace()
        case e: MalformedURLException   => e.printStackTrace()
        case e: ServiceException        => e.printStackTrace()
        case e: IOException             => e.printStackTrace()
      }
    }
  }
}
But it does not work. Can I use a Java library in Scala?
The problem that's causing your error, is that the object Contacts does not have a main method. Instead, it contains an inner class called Test which has a main method. I don't believe that is what you want (in Scala, object methods are the equivalent of Java static methods), so the main method should be moved out into Contacts, and the inner class deleted.
Also, for(i <-0 to entries.size()) is probably a mistake. This is roughly equivalent to for(int i=0; i<=entries.size(); i++) (notice the <=). You probably want for(i <-0 until entries.size()).
While you're there, you can kill the try/catch blocks if you like, as Scala doesn't use checked exceptions. If you import scala.collection.JavaConversions._, you can then use for (entry <- entries), which may be less error-prone.
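Putting those changes together, here is a minimal sketch of the corrected structure (the gdata calls are copied from the question; the credentials are placeholders):

import scala.collection.JavaConversions._

object Contacts {
  // main lives directly on the object, so it acts like a Java static method
  def main(args: Array[String]): Unit = {
    val myService = new com.google.gdata.client.contacts.ContactsService("My App")
    myService.setUserCredentials("MyemailId", "password")
    val metafeedUrl = new java.net.URL("http://www.google.com/m8/feeds/contacts/" + "MyemailId" + "#gmail.com/base")
    val resultFeed = myService.getFeed(metafeedUrl, classOf[com.google.gdata.data.contacts.ContactFeed])
    for (entry <- resultFeed.getEntries) // JavaConversions lets us iterate the Java list
      println(entry.getTitle.getPlainText)
  }
}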
If it still doesn't work (or when posting future questions), provide as much info as you can (error messages, warnings, etc.), as it makes it far more likely that someone will be able to help.