Get Google Search Results with Java using Jsoup

First of all, I searched for this problem in the Stack Overflow database and on Google. Unfortunately, I couldn't find a solution.
I am trying to get Google search results for a keyword. Here's my code:
public static void main(String[] args) throws Exception {
    Document doc;
    try {
        doc = Jsoup.connect("https://www.google.com/search?as_q=&as_epq=%22Yorkshire+Capital%22+&as_oq=fraud+OR+allegations+OR+scam&as_eq=&as_nlo=&as_nhi=&lr=lang_en&cr=countryCA&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=&as_filetype=&as_rights=")
                 .userAgent("Mozilla").ignoreHttpErrors(true).timeout(0).get();
        Elements links = (Elements) doc.select("li[class=g]");
        for (Element link : links) {
            Elements titles = link.select("h3[class=r]");
            String title = titles.text();
            Elements bodies = link.select("span[class=st]");
            String body = bodies.text();
            System.out.println("Title: " + title);
            System.out.println("Body: " + body + "\n");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
And here are the errors: https://prnt.sc/ro4ooi
It says: "Can only iterate over an array or an instance of java.lang.Iterable" (at links).
When I delete the (Elements) cast: https://prnt.sc/ro4pa9
Thank you.
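For reference, a hedged sketch of the usual fix, since no answer is shown here: the message "Can only iterate over an array or an instance of java.lang.Iterable" typically means the compiler could not resolve the Jsoup types, i.e. the imports for Document, Element, and Elements are missing. The (Elements) cast is also redundant, because doc.select already returns Elements. With the imports in place the loop compiles; note that Google's result markup changes often, so the li.g / h3.r / span.st selectors may need updating (the URL below is the original one with its empty parameters dropped):
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class GoogleScraper {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://www.google.com/search?as_epq=%22Yorkshire+Capital%22&as_oq=fraud+OR+allegations+OR+scam&lr=lang_en&cr=countryCA&safe=images")
                    .userAgent("Mozilla")
                    .ignoreHttpErrors(true)
                    .timeout(0)
                    .get();
            // doc.select already returns Elements, so no cast is needed
            Elements links = doc.select("li.g");
            for (Element link : links) {
                String title = link.select("h3.r").text();
                String body = link.select("span.st").text();
                System.out.println("Title: " + title);
                System.out.println("Body: " + body + "\n");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}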

Related

Java parse data from html table with jsoup

I want to get the data from the table at this link:
https://www.nasdaq.com/symbol/aapl/financials?query=balance-sheet
I've tried my code, but it doesn't work:
public static void main(String[] args) {
    try {
        Document doc = Jsoup.connect("https://www.nasdaq.com/symbol/aapl/financials?query=balance-sheet").get();
        Elements trs = doc.select("td_genTable");
        for (Element tr : trs) {
            Elements tds = tr.getElementsByTag("td");
            Element td = tds.first();
            System.out.println(td.text());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Can anybody help me get it to work? I'm not getting any output from the table; nothing happens.
After testing your code, I got a read timeout problem. Searching on Google, I found a post that suggests adding a user agent to fix it, and it worked for me. So you can try this:
public static void main(String[] args) {
    try {
        // add a user agent
        Document doc = Jsoup.connect("https://www.nasdaq.com/symbol/aapl/financials?query=balance-sheet")
                .userAgent("Mozilla/5.0").get();
        Elements trs = doc.select("tr");
        for (Element tr : trs) {
            Elements tds = tr.select(".td_genTable");
            // skip header rows, which have no .td_genTable cells and would cause a NullPointerException
            if (tds.isEmpty()) continue;
            // look at the siblings (see the HTML structure of the page)
            Element td = tds.first().siblingElements().first();
            System.out.println(td.text());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
I have added the user agent option and fixed some query errors. This should be useful to get your work started ;)
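One of the query errors is worth calling out: the original doc.select("td_genTable") looks for a tag named td_genTable, while selecting by class needs a leading dot, as in .td_genTable. If you just want every cell of that class rather than the sibling of the first one per row, a shorter variant (a sketch, assuming the page still uses that class) is:
Document doc = Jsoup.connect("https://www.nasdaq.com/symbol/aapl/financials?query=balance-sheet")
        .userAgent("Mozilla/5.0").get();
// "td.td_genTable" matches <td> elements with class "td_genTable";
// "td_genTable" alone would be treated as a tag name and match nothing
for (Element td : doc.select("td.td_genTable")) {
    System.out.println(td.text());
}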

parse html from a web page which uses infinite scroll

I would like to parse HTML from a web page that uses infinite scroll, such as pinterest.com, so as to get all the items.
public List<String> popularTagsPinterest(String tag) throws Exception {
    List<String> results = new ArrayList<>();
    try {
        Document doc = Jsoup.connect(
                urlPinterest + tag + "&eq=%23" + tag + "&etslf=6622&term_meta[]=%23" + tag + "%7Cautocomplete%7C0")
                .timeout(90000).get();
        Elements img1 = doc.select("a.pinImageWrapper img.pinImg");
        for (Element e : img1) {
            results.add(e.attr("src"));
            System.out.println(e.attr("src"));
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return results;
}
Jsoup only sees the HTML delivered on the first request; the extra items are loaded later by JavaScript. So find the base URL and the AJAX call that loads each additional part, and request those URLs directly.
Check this page; it's a good example:
https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016
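A minimal sketch of that idea, with a hypothetical paginated endpoint and selector (real infinite-scroll sites use their own URLs and parameters, which you can find in the browser's network tab):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class InfiniteScrollSketch {
    public static void main(String[] args) throws Exception {
        // Each "scroll" is just another HTTP request with a bumped page/offset
        for (int page = 1; page <= 5; page++) {
            Document doc = Jsoup.connect("https://example.com/items?page=" + page)
                    .userAgent("Mozilla/5.0")
                    .timeout(90000)
                    .get();
            for (Element img : doc.select("img.item")) {
                System.out.println(img.attr("abs:src"));
            }
        }
    }
}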

Unable to correctly return links for google search result

Hi, my code is unable to return the links listed in Google's search results; I can't seem to find the solution. It seems it only sees the Google homepage, even though I already supplied a search-results URL. See my code below:
public class jsoupScraper {
    public static void main(String[] args) {
        Document doc;
        try {
            doc = Jsoup.connect("https://www.google.com.ph/?gfe_rd=cr&ei=yaXVV4CAGION2AT9q5OICQ#q=Silence+of+the+lambs")
                    .userAgent("Mozilla").ignoreHttpErrors(true).timeout(0).get();
            Elements links = doc.select("a");
            int count = 0;
            for (Element link : links) {
                if (count < 10) {
                    System.out.println(link.attr("href"));
                    count++;
                } else {
                    break;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Actually, I just copied this from this link: Using JSoup to scrape Google Results. I only made a minor modification, but it still doesn't output correctly. When I tried to run the original code, it just terminated with no result.
Here is a sample of the output:
https://www.google.com/webhp?tab=ww
https://www.google.com/imghp?hl=en&tab=wi
https://maps.google.com/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?gl=PH&tab=w1
https://news.google.com/nwshp?hl=en&tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.com/intl/en/options/
https://www.google.com/calendar?tab=wc
Thanks in advance for the help. It seems to be the same case with HtmlUnit.
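A likely explanation, not from the original thread: everything after the # in that URL is a fragment. Browsers evaluate fragments client-side with JavaScript and never send them to the server, so Jsoup really is fetching only the Google homepage, which is why the output is the homepage navigation links. Requesting the /search endpoint sends the query itself. A minimal sketch under that assumption (Google's result markup changes often, so the selector may need adjusting, and automated scraping may be rate-limited or blocked):
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class GoogleSearchLinks {
    public static void main(String[] args) {
        try {
            // Use /search?q=... so the query is part of the request itself
            Document doc = Jsoup.connect("https://www.google.com/search?q=Silence+of+the+lambs")
                    .userAgent("Mozilla/5.0")
                    .get();
            // Result links wrap an <h3> heading; this selector is an assumption
            for (Element link : doc.select("a:has(h3)")) {
                System.out.println(link.attr("abs:href"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}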

How to get link from ArrayList filling by Jsoup

I'm trying to parse a website. After all the links are collected into an ArrayList, I want to parse them again, but I'm having trouble initializing them.
This is my ArrayList:
public ArrayList<String> linkList = new ArrayList<String>();
This is how I collect the links in doInBackground:
try {
    Document doc = Jsoup.connect("http://forurl.com/archive/").get();
    Element links = doc.select("a[href]");
    for (Element link : links) {
        linkList.add(link.attr("abs:href"));
    }
}
In "onPostExecute" showing what I get:
lk.setText("Collected: " +linkList.size()); // showing how much is collected
lj.setText("First link: " +linkList.get(0)); // showing first link
Then I try to parse the child links:
public class imgTread extends AsyncTask<Void, Void, Void> {
    Bitmap bitmap;
    String[] url = {"http://forurl.com/link1/",
                    "http://forurl.com/link2/"}; // this way works well

    protected Void doInBackground(Void... params) {
        try {
            for (int i = 0; i < url.length; i++) {
                Document doc1 = Jsoup.connect(url[0]).get(); // connect to link 1, for example
                Elements img = doc1.select("#strip");
                String imgSrc = img.attr("src");
                InputStream input = new java.net.URL(imgSrc).openStream();
                bitmap = BitmapFactory.decodeStream(input);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
I'm trying to make a String[] from the ArrayList, but it doesn't work:
String[] url = linkList.toArray(new String[linkList.size()]);
The output this way is [Ljava.lang.String;@45ds364.
The idea is: 1) collect all the links from the URL; 2) connect to them one by one and get the information I need.
The first point works, and so does the second, but how do I tie them together?
Thanks for any advice.
Working code:
Document doc = Jsoup.connect(url).get();           // connect to the site
Elements links = doc.select("a[href]");            // get all links
String link_addr = links.get(3).attr("abs:href");  // choose link 3
Document link_doc = Jsoup.connect(link_addr).get(); // connect to it
Elements img = link_doc.select("#strip");          // select the element with id "strip"
String imgSrc = img.attr("src");                   // get the image URL
InputStream input = new java.net.URL(imgSrc).openStream();
bitmap = BitmapFactory.decodeStream(input);
I hope this helps someone.
You are doing many unnecessary steps. You have a perfectly fine collection of Element objects in your Elements links object. Why do you have to add them to an ArrayList?
If I have understood your question correctly, your thought process should be something like this.
Get all the links from a URL
Establish a new connection to each link
Download all the images on that page where the element id = "strip".
Get all the links:
Document doc = Jsoup.connect(url).get();
Elements links = doc.select("a[href]");
doInBackground(links);
Call the doInBackground method with the links as a parameter:
public static void doInBackground(Elements links) throws IOException {
    try {
        for (Element element : links) {
            // Connect to each link in turn
            Document doc = Jsoup.connect(element.attr("abs:href")).get();
            // Select all the elements with ID 'strip'
            Elements img = doc.select("#strip");
            // Get the source of the image
            String imgSrc = img.attr("abs:src");
            // Open an InputStream
            InputStream input = new java.net.URL(imgSrc).openStream();
            // Get the image
            bitmap = BitmapFactory.decodeStream(input);
            ...
            // Perhaps save the image somewhere?
            // Close the InputStream
            input.close();
        }
    } catch (IllegalArgumentException e) {
        System.out.println(e.getMessage());
    } catch (MalformedURLException ex) {
        System.out.println(ex.getMessage());
    }
}
Of course, you will have to use AsyncTask properly and call these methods from the appropriate places, but this is the overall idea of how you can use Jsoup to do the job you want it to.
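A minimal sketch, not from the original answer, of how the pieces could be wired into an AsyncTask (assumptions: Android, and the "#strip" selector used above). The network work runs in doInBackground, off the UI thread; the UI is updated in onPostExecute:
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import android.graphics.Bitmap;
import android.graphics.BitmapFactory;
import android.os.AsyncTask;

public class ScrapeTask extends AsyncTask<String, Void, List<Bitmap>> {

    @Override
    protected List<Bitmap> doInBackground(String... urls) {
        List<Bitmap> bitmaps = new ArrayList<>();
        try {
            // Step 1: collect the links from the archive page
            Document doc = Jsoup.connect(urls[0]).get();
            // Step 2: visit each link and decode its "#strip" image
            for (Element link : doc.select("a[href]")) {
                Document page = Jsoup.connect(link.attr("abs:href")).get();
                String imgSrc = page.select("#strip").attr("abs:src");
                if (imgSrc.isEmpty()) continue; // page without the image
                try (InputStream input = new java.net.URL(imgSrc).openStream()) {
                    bitmaps.add(BitmapFactory.decodeStream(input));
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return bitmaps;
    }

    @Override
    protected void onPostExecute(List<Bitmap> bitmaps) {
        // Update the UI here, e.g. show the images or the count
    }
}
It could then be started with something like new ScrapeTask().execute("http://forurl.com/archive/");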
If you want to create an array instead of a list, you can do:
try {
    Document doc = Jsoup.connect("http://forurl.com/archive/").get();
    Elements links = doc.select("a[href]");
    String[] array = new String[links.size()];
    for (int i = 0; i < links.size(); i++) {
        array[i] = links.get(i).attr("abs:href");
    }
} catch (IOException e) {
    e.printStackTrace();
}
First of all, what the hell is that?
Why is the links variable declared as a single Element when doc.select returns a whole Elements collection? Isn't that confusing?
Second, keep to the Java naming convention and start variable names with a lowercase letter, so change LinkList to linkList. Even the syntax highlighter went crazy thanks to you.
Third, you say you are "trying to make String[] from ArrayList, but it doesn't work". Where are you trying to do that? I don't see it anywhere in the code.
Fourth, to create an array out of a List you have to do something like this:
String[] links = linkList.toArray(new String[linkList.size()]);
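As a side note on the "[Ljava.lang.String;@45ds364" output mentioned in the question: that is just the default toString of a Java array, not a sign that toArray failed. Print the contents instead, for example:
String[] links = linkList.toArray(new String[linkList.size()]);
System.out.println(java.util.Arrays.toString(links)); // prints the actual links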
Fifth, change the title to something more appropriate, as the present one is very misleading (you have no trouble with Jsoup here).

JSoup core web text extraction

I am new to Jsoup; sorry if my question is too trivial.
I am trying to extract article text from http://www.nytimes.com/, but on printing the parsed document I am not able to see any articles in the output:
public class App
{
    public static void main(String[] args)
    {
        String url = "http://www.nytimes.com/";
        Document document;
        try {
            document = Jsoup.connect(url).get();
            System.out.println(document.html()); // articles not getting printed
            //System.out.println(document.toString()); // same here
            String title = document.title();
            System.out.println("title : " + title); // title is fine
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
OK, I have also tried to parse http://en.wikipedia.org/wiki/Big_data to retrieve the wiki data. Same issue here as well: I am not getting the wiki data in the output.
Any help or hint will be much appreciated.
Thanks.
Here's how to get the text of all <p class="summary"> tags:
final String url = "http://www.nytimes.com/";
Document doc = Jsoup.connect(url).get();
for( Element element : doc.select("p.summary") )
{
if( element.hasText() ) // Skip those tags without text
{
System.out.println(element.text());
}
}
If you need all <p> tags, without any filtering, you can use doc.select("p") instead. But in most cases it's better to select only those you need (see here for Jsoup Selector documentation).
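One aside that bears on the original question, not part of the answer above: Jsoup does not execute JavaScript, so any article content that nytimes.com injects client-side never shows up in the parsed document; only server-rendered HTML does. Wikipedia pages are server-rendered, so the article text is directly selectable. A small sketch (the #mw-content-text container is an assumption about Wikipedia's markup):
// Wikipedia serves plain server-rendered HTML, so the paragraphs are present
Document wiki = Jsoup.connect("http://en.wikipedia.org/wiki/Big_data").get();
for (Element p : wiki.select("#mw-content-text p")) {
    System.out.println(p.text());
}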
