I am trying to scrape the data from the table where it states 'range', '52 week', 'open', etc. on this site: https://finance.google.com/finance?q=aapl&ei=czANWqmhNoPYswHV9YnwBg
The code below scrapes the right data, but not in the format I want: it outputs the contents of the whole table as one string, whereas I would like each part of the table to be output separately on its own line.
Any help would be greatly appreciated.
Thank you!
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JSoup {
    public static void main(String[] args) throws Exception {
        String ticker = "AAPL";
        String url = "https://finance.google.com/finance?q=" + ticker + "&ei=czANWqmhNoPYswHV9YnwBg";
        Document document = Jsoup.connect(url).get();
        // select("table.snap-data") matches the <table> element itself,
        // so text() flattens every cell into one string
        for (Element row : document.select("table.snap-data")) {
            final String key = row.select(".range").text();
            final String val = row.select(".val").text();
            System.out.println(key);
            System.out.println(val);
        }
    }
}
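A minimal sketch of the fix: iterate over the table's rows (tr) instead of the table element itself, so each label/value pair prints on its own line. The td.key/td.val class names are assumptions about the page's markup at the time; confirm them with your browser's inspector.

for (Element row : document.select("table.snap-data tr")) {
    // each row is assumed to hold one label cell and one value cell
    String key = row.select("td.key").text();
    String val = row.select("td.val").text();
    System.out.println(key + " " + val);
}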
My code returns all the links on a webpage, but I would like to get only the first link when I Google search something, for example "android". How do I do that? Here is my code:
Document doc = Jsoup.connect(sharedURL).get();
String title = doc.title();
Elements links = doc.select("a[href]");
stringBuilder.append(title).append("\n");
for (Element link : links) {
    stringBuilder.append("\n").append(" ").append(link.text()).append(" ").append(link.attr("href")).append("\n");
}
Use Elements#first and Node#absUrl:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.select.Elements;

public class Main {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Wikipedia").get();
        Elements links = doc.select("a[href]");
        // first() returns the first matched element
        Node node = links.first();
        // absUrl resolves the relative href against the document's base URI
        System.out.println(node.absUrl("href"));
    }
}
Output:
https://en.wikipedia.org/wiki/Wikipedia:Protection_policy#semi
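As a shorthand, Jsoup also resolves relative URLs when the attribute name is prefixed with abs:, so this one-liner is equivalent:

// Equivalent: the "abs:" prefix makes attr() resolve the href
// against the document's base URI, just like absUrl("href").
System.out.println(links.first().attr("abs:href"));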
I would really like to understand how to extract the data I want from a website. I managed it with the IMDb top chart by following a tutorial on YouTube, but I am still confused about how to know what syntax to pass as the row.select parameter.
I have tried other websites such as Best Buy, trying to get the price and name of specific laptops, and I failed, most likely because I used the wrong parameters (cssQuery).
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.IOException;

public class Scraper {
    static final String url = "https://www.imdb.com/chart/top";

    public static void main(String[] args) throws IOException {
        final Document document = Jsoup.connect(url).get();
        // each <tr> of the chart table is one movie entry
        for (Element row : document.select("table.chart.full-width tr")) {
            final String title = row.select(".titleColumn a").text();
            final String rating = row.select(".imdbRating").text();
            System.out.println(title);
            System.out.println(rating);
        }
    }
}
From what I understand of your question, you don't know which CSS class to put in your code. To find it, inspect the website: right-click the page and choose Inspect, or press Ctrl+Shift+C, then hover over any element to see its tag and classes.
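As a sketch of that workflow (the selectors below are placeholders, not any site's real markup; substitute whatever classes the inspector actually shows you):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ProductScraper {
    public static void main(String[] args) throws Exception {
        // "div.product", "span.name" and "span.price" are hypothetical
        // selectors -- replace them with the classes DevTools shows you.
        Document doc = Jsoup.connect("https://www.example.com/laptops").get();
        for (Element product : doc.select("div.product")) {
            String name = product.select("span.name").text();
            String price = product.select("span.price").text();
            System.out.println(name + " -> " + price);
        }
    }
}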
Consider a URL www.example.com. It may have plenty of links; some may be internal and others external. I want to get a list of all the sub-links, not the sub-sub-links, only the direct sub-links.
E.g. if there are four links as follows:
1) www.example.com/images/main
2) www.example.com/data
3) www.example.com/users
4) www.example.com/admin/data
then out of the four, only 2) and 3) are of use, as they are sub-links, not sub-sub-links and so on. Is there a way to achieve this through Jsoup? If it cannot be done with Jsoup, can someone introduce me to some other Java API?
Also note that each link should be a child of the parent URL that is initially passed in (i.e. www.example.com).
If I understand correctly, a sub-link contains exactly one slash, so you can try counting the number of slashes, for example:
List<String> list = new ArrayList<>();
list.add("www.example.com/images/main");
list.add("www.example.com/data");
list.add("www.example.com/users");
list.add("www.example.com/admin/data");

for (String link : list) {
    if ((link.length() - link.replaceAll("[/]", "").length()) == 1) {
        System.out.println(link);
    }
}
link.length() counts the number of characters; link.replaceAll("[/]", "").length() is the length with the slashes removed, so the difference between the two is the number of slashes.
If the difference equals one, the link is a direct sub-link; otherwise it is not.
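A hedged sketch tying this to Jsoup: collect the page's anchors, keep only those on the parent host, and apply the same one-slash test to the URL path. The host string and start URL here are the question's example, not real endpoints.

import java.net.URI;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SubLinks {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://www.example.com").get();
        for (Element a : doc.select("a[href]")) {
            // absUrl resolves relative hrefs against the page's base URI;
            // note URI.create rejects malformed URLs, so a real crawler
            // would wrap this in a try/catch.
            URI uri = URI.create(a.absUrl("href"));
            String path = uri.getPath() == null ? "" : uri.getPath();
            long slashes = path.chars().filter(c -> c == '/').count();
            // same host as the parent URL, and exactly one slash in the path
            if ("www.example.com".equals(uri.getHost()) && slashes == 1 && path.length() > 1) {
                System.out.println(uri);
            }
        }
    }
}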
EDIT
How will I scan the whole website for sub-links?
One answer is the robots.txt file (the Robots Exclusion Standard): it often lists a site's top-level paths, for example https://stackoverflow.com/robots.txt. The idea is to read this file and extract the sub-links from it; keep in mind that Disallow entries are paths the site asks crawlers not to visit, so treat this as a discovery heuristic only. Here is a piece of code that can help you:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// wrapper class and imports added so the snippet compiles standalone
public class RobotsLinks {
    public static void main(String[] args) throws Exception {
        // Your web site
        String website = "http://stackoverflow.com";
        // We will read the URL https://stackoverflow.com/robots.txt
        URL url = new URL(website + "/robots.txt");
        // List of your sub-links
        List<String> list;
        // Read the file with BufferedReader
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String subLink;
            list = new ArrayList<>();
            // Loop through the file
            while ((subLink = in.readLine()) != null) {
                // Check if the line matches this regex; if yes, add it to your list
                if (subLink.matches("Disallow: \\/\\w+\\/")) {
                    list.add(website + "/" + subLink.replace("Disallow: /", ""));
                } else {
                    System.out.println("not match");
                }
            }
        }
        // Print your result
        System.out.println(list);
    }
}
This will show you:
[https://stackoverflow.com/posts/, https://stackoverflow.com/posts?,
https://stackoverflow.com/search/, https://stackoverflow.com/search?,
https://stackoverflow.com/feeds/, https://stackoverflow.com/feeds?,
https://stackoverflow.com/unanswered/,
https://stackoverflow.com/unanswered?, https://stackoverflow.com/u/,
https://stackoverflow.com/messages/, https://stackoverflow.com/ajax/,
https://stackoverflow.com/plugins/]
Hope this helps.
To scan the links on a web page you can use the Jsoup library.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class read_data {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("**your_url**").get();
            Elements links = doc.select("a");
            List<String> list = new ArrayList<>();
            for (Element link : links) {
                // "abs:href" resolves each href to an absolute URL
                list.add(link.attr("abs:href"));
            }
        } catch (IOException ex) {
            ex.printStackTrace(); // don't swallow the exception silently
        }
    }
}
The list can then be used as suggested in the previous answer.
The code for reading all the links on a website is given below. I have used http://stackoverflow.com/ for illustration. I would recommend going through the company's terms of use before scraping its website.
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class readAllLinks {
    public static Set<String> uniqueURL = new HashSet<String>();
    public static String my_site;

    public static void main(String[] args) {
        readAllLinks obj = new readAllLinks();
        my_site = "stackoverflow.com";
        obj.get_links("http://stackoverflow.com/");
    }

    private void get_links(String url) {
        try {
            Document doc = Jsoup.connect(url).get();
            Elements links = doc.select("a");
            links.stream().map((link) -> link.attr("abs:href")).forEachOrdered((this_url) -> {
                // add() returns false if the URL was already seen,
                // which stops us from revisiting the same page
                boolean add = uniqueURL.add(this_url);
                if (add && this_url.contains(my_site)) {
                    System.out.println(this_url);
                    get_links(this_url);
                }
            });
        } catch (IOException ex) {
            ex.printStackTrace(); // log failed fetches instead of ignoring them
        }
    }
}
You will get the list of all the links in the uniqueURL field.
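Note that the recursion above has no depth or page limit, so on a large site it can run for a very long time or overflow the stack. A minimal sketch of a guard, as a drop-in variant of get_links (MAX_PAGES is an arbitrary illustrative limit; this version also needs import org.jsoup.nodes.Element):

private static final int MAX_PAGES = 100;

private void get_links(String url) {
    if (uniqueURL.size() >= MAX_PAGES) {
        return; // stop once the crawl budget is spent
    }
    try {
        Document doc = Jsoup.connect(url).get();
        for (Element link : doc.select("a")) {
            String thisUrl = link.attr("abs:href");
            if (uniqueURL.add(thisUrl) && thisUrl.contains(my_site)) {
                System.out.println(thisUrl);
                get_links(thisUrl);
            }
        }
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}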
http://games.espn.go.com/ffl/freeagency?leagueId=1566286&teamId=4&seasonId=2015#&seasonId=2015&view=projections&context=freeagency&avail=-1
I am trying to use Jsoup to rip the table from this link. However, I am very new to HTML and I cannot find the right "table id" to use. My code is below; I have gotten it to work for tables from other pages, so the code itself is not the issue. I just don't know how to find the right table id. Thank you!
This is the html code I see: http://pastebin.com/d5h5QBb6
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class readURL {
    public static void main(String[] args) {
        //extractTableUsingJsoup("http://mobilereviews.net/details-for-Motorola%20L7.htm","phone_details");
        extractTableUsingJsoup("http://games.espn.go.com/ffl/freeagency?leagueId=1566286&teamId=4&seasonId=2015#&seasonId=2015&view=projections&context=freeagency&avail=-1", "INSERT TABLE ID HERE");
    }

    public static void extractTableUsingJsoup(String url, String tableId) {
        Document doc;
        try {
            // need http protocol
            doc = Jsoup.connect(url).get();
            // Set the id of any table from any website and the code below will print its contents.
            // Put the extracted data in appropriate data structures and use it for further processing.
            Element table = doc.getElementById(tableId);
            Elements tds = table.getElementsByTag("td");
            // You can check for nesting of tds if such a structure exists
            for (Element td : tds) {
                System.out.println("\n" + td.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The output I get is not what I am looking for; I want the players and their projections.
(Screenshot in the original post: the table I want to get.)
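Since the page's markup is hard to read, a small discovery sketch can help: list every table's id and class, then pass the right one to extractTableUsingJsoup. No specific ESPN ids are assumed here; the code only enumerates what the page serves.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableFinder {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://games.espn.go.com/ffl/freeagency?leagueId=1566286&teamId=4&seasonId=2015").get();
        // Print every table's id and class so you can pick the player table.
        for (Element t : doc.select("table")) {
            System.out.println("id=" + t.id() + " class=" + t.className());
        }
    }
}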
I am trying to extract the data from the table on the following website, i.e. club, venue, start time: http://www.national-autograss.co.uk/february.htm
I have gotten many examples on here working that use a CSS table class, but this website doesn't have one. I have made an attempt with the code below, but it doesn't seem to produce any output. Any help would be very much appreciated.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {
    public static void main(String[] args) {
        Document doc = null;
        try {
            doc = Jsoup.connect("http://www.national-autograss.co.uk/february.htm").get();
        } catch (IOException e) {
            e.printStackTrace();
        }
        // note: doc stays null here if the connection failed above
        Elements elements = doc.select("table#table1");
        String name;
        for (Element element : elements) {
            name = element.text();
            System.out.println(name);
        }
    }
}
An id should be unique, so you can use doc.select("#table1") directly, and so on for the page's other tables.
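A minimal sketch along those lines (assuming the page's tables really do carry ids like table1, as the question's selector implies): select the table by id and print each row on its own line.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class AutograssScraper {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://www.national-autograss.co.uk/february.htm").get();
        // "#table1 tr" walks the rows of the table with id "table1"
        for (Element row : doc.select("#table1 tr")) {
            Elements cells = row.select("td");
            if (!cells.isEmpty()) {
                System.out.println(cells.text()); // club, venue, start time, ...
            }
        }
    }
}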