I would like to parse HTML from a web page that uses infinite scroll, such as pinterest.com, so as to get all items.
public List<String> popularTagsPinterest(String tag) throws Exception {
    List<String> results = new ArrayList<>();
    try {
        Document doc = Jsoup.connect(
                urlPinterest + tag + "&eq=%23" + tag + "&etslf=6622&term_meta[]=%23" + tag + "%7Cautocomplete%7C0")
                .timeout(90000).get();
        Elements img1 = doc.select("a.pinImageWrapper img.pinImg");
        for (Element e : img1) {
            results.add(e.attr("src"));
            System.out.println(e.attr("src"));
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return results;
}
Getting the base URL plus the AJAX call that loads each additional chunk will do. Check this page, it's a good example:
https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016
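In Jsoup terms, a minimal sketch of that approach could look like the code below. The endpoint and its page parameter are hypothetical placeholders; the real AJAX URL has to be found in the browser's network tab:

import org.jsoup.Jsoup;

public class InfiniteScrollFetcher {
    public static void main(String[] args) throws Exception {
        // hypothetical paged endpoint; replace with the real AJAX URL
        String baseUrl = "https://example.com/api/items?page=";
        for (int page = 1; page <= 5; page++) {
            // ignoreContentType(true) lets Jsoup fetch non-HTML responses such as JSON
            String json = Jsoup.connect(baseUrl + page)
                    .ignoreContentType(true)
                    .userAgent("Mozilla/5.0")
                    .execute()
                    .body();
            // parse the JSON with any JSON library and collect the items here
            System.out.println("page " + page + ": " + json.length() + " chars");
        }
    }
}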
First of all, I searched for this problem in the Stack Overflow database and on Google. Unfortunately I couldn't find a solution.
I am trying to get Google search results for a keyword. Here's my code:
public static void main(String[] args) throws Exception {
    Document doc;
    try {
        doc = Jsoup.connect("https://www.google.com/search?as_q=&as_epq=%22Yorkshire+Capital%22+&as_oq=fraud+OR+allegations+OR+scam&as_eq=&as_nlo=&as_nhi=&lr=lang_en&cr=countryCA&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=&as_filetype=&as_rights=")
                .userAgent("Mozilla").ignoreHttpErrors(true).timeout(0).get();
        Elements links = (Elements) doc.select("li[class=g]");
        for (Element link : links) {
            Elements titles = link.select("h3[class=r]");
            String title = titles.text();
            Elements bodies = link.select("span[class=st]");
            String body = bodies.text();
            System.out.println("Title: " + title);
            System.out.println("Body: " + body + "\n");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
And here are the errors: https://prnt.sc/ro4ooi
It says: "can only iterate over an array or an instance of java.lang.Iterable" (at links).
When I delete the (Elements) cast: https://prnt.sc/ro4pa9
Thank you.
I want to get the data from the table at this link:
https://www.nasdaq.com/symbol/aapl/financials?query=balance-sheet
I've tried the code below, but it doesn't work:
public static void main(String[] args) {
    try {
        Document doc = Jsoup.connect("https://www.nasdaq.com/symbol/aapl/financials?query=balance-sheet").get();
        Elements trs = doc.select("td_genTable");
        for (Element tr : trs) {
            Elements tds = tr.getElementsByTag("td");
            Element td = tds.first();
            System.out.println(td.text());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Can anybody help me get it to work?
I'm not getting any output from the table; nothing happens.
After testing your code I got a read timeout. Searching on Google I found this post, which suggests adding a user agent to fix it, and that worked for me. So you can try this:
public static void main(String[] args) {
    try {
        // add a user agent
        Document doc = Jsoup.connect("https://www.nasdaq.com/symbol/aapl/financials?query=balance-sheet")
                .userAgent("Mozilla/5.0").get();
        Elements trs = doc.select("tr");
        for (Element tr : trs) {
            Elements tds = tr.select(".td_genTable");
            // skip header rows, which would otherwise produce a NullPointerException
            if (tds.size() == 0) continue;
            // look at the siblings (see the HTML structure of the page)
            Element td = tds.first().siblingElements().first();
            System.out.println(td.text());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
I have added the user agent option and fixed some query errors. This should be useful to start your work ;)
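A note on the siblingElements() call, based on how that page's table appears to be structured: the first .td_genTable cell in each row seems to hold the row label, and the sibling cells hold the figures, so tds.first().siblingElements().first() prints the first data column of each row.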
I am attempting to parse the HTML of Google's search results to grab the title of each result. This is done on Android in a private nested class, shown below:
private class WebScraper extends AsyncTask<String, Void, String> {
    public WebScraper() {}

    @Override
    protected String doInBackground(String... urls) {
        Document doc;
        try {
            doc = Jsoup.connect(urls[0]).get();
        } catch (IOException e) {
            System.out.println("Failed to open document");
            return "";
        }
        Elements results = doc.getElementsByClass("rc");
        int count = 0;
        for (Element lmnt : results) {
            System.out.println(count++);
            System.out.println(lmnt.text());
        }
        System.out.println("Count is : " + count);
        String key = "test";
        //noinspection Since15
        SearchActivity.this.songs.put(key, SearchActivity.this.songs.getOrDefault(key, 0) + 1);
        // return requested
        return "";
    }
}
An example URL I am trying to parse: http://www.google.com/#q=i+might+site:genius.com
For some reason, when I run the above code, my count is printed as 0, so no elements are being stored in results. Any help is much appreciated! P.S. doc is definitely initialized and the HTML page is loading properly.
This code will search for a word like "Apple" on Google, fetch all links from the results, and display their title and URL. It can search up to 500 words a day; after that, Google detects it and stops returning results.
search="Apple"; //your word to be search on google
String userAgent = "ExampleBot 1.0 (+http://example.com/bot)";
Elements links=null;
try {
links = Jsoup.connect(google +
URLEncoder.encode(search,charset)).
userAgent(userAgent).get().select(".g>.r>a");
} catch (UnsupportedEncodingException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
} catch (IOException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
for (Element link : links) {
String title = link.text();
String url = link.absUrl("href"); // Google returns URLs in
format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
try {
url = URLDecoder.decode(url.substring(url.indexOf('=') +
1, url.indexOf('&')), "UTF-8");
} catch (UnsupportedEncodingException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
if (!url.startsWith("http")) {
continue; // Ads/news/etc.
}
System.out.println("Title: " + title);
System.out.println("URL: " + url);
}
If you check the source code of Google's page, you will notice that it does not contain the text that is normally shown in the browser; there is only a bunch of JavaScript code. That means Google renders all the search results dynamically.
Jsoup fetches that JavaScript code and does not find any HTML with the "rc" class, which is why you get a zero count in your code sample.
Consider using Google's public search API instead of directly parsing its HTML pages: https://developers.google.com/custom-search/.
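If it helps, here is a rough sketch of calling that Custom Search JSON API over plain HTTP. YOUR_API_KEY and YOUR_ENGINE_ID are placeholders you would obtain from the Google developer console, and the response comes back as JSON rather than HTML:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class CustomSearchSketch {
    public static void main(String[] args) throws Exception {
        String key = "YOUR_API_KEY";  // placeholder
        String cx = "YOUR_ENGINE_ID"; // placeholder
        String query = URLEncoder.encode("jsoup tutorial", "UTF-8");
        URL url = new URL("https://www.googleapis.com/customsearch/v1?key=" + key
                + "&cx=" + cx + "&q=" + query);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw JSON; parse it with any JSON library
            }
        }
    }
}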
I completely agree with Matvey Sidorenko, but to use the Google public search API you need a Google API key. The problem is that Google limits each API key to 100 searches; exceeding that, it stops working and gets reset after 24 hours.
Recently I was working on a project where we needed to get the Google search result links for different queries provided by the user, so to overcome this API limit I made my own API that searches directly on google/ncr and gives you the result links.
Free Google Search API:
http://freegoogleapi.azurewebsites.net/ OR http://google.bittque.com
I used the HtmlUnit library for making this API.
You can use my API, or you can use the HtmlUnit library directly to achieve what you need.
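For completeness, a minimal HtmlUnit fetch looks roughly like the snippet below; the query URL and the options set here are illustrative, not my API's internals:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitSketch {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // Google builds its result page with JavaScript, which plain Jsoup cannot run
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = webClient.getPage("https://www.google.com/ncr");
            System.out.println(page.getTitleText());
            // page.asXml() returns the DOM after JavaScript has run;
            // it can be handed to Jsoup.parse(...) for selector-based extraction
        }
    }
}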
I want to read a JSP page and write it to an HTML page. I have 3 methods in my parse class: first readHTMLBody(), second WriteNewHTML(), third ZipToEpub().
When I call these methods from the parse class, they all work. But when called from a JSP or web service, UTF-8 characters look like "?" in readHTMLBody(). How can I fix it?
public String readHTMLBody() {
    try {
        String url = "http://localhost:8080/Library/part.jsp";
        Document doc = Jsoup.parse((new URL(url)).openStream(), "utf-8", url);
        String body = doc.html();
        Elements title = doc.select("xxx");
        linkURI = title.toString();
        linkURI = linkURI.replaceAll("<xxx>", "");
        linkURI = linkURI.replaceAll("</xxx>", "");
        linkURI = linkURI.replaceAll("\\s", "");
        resultBody = body;
        resultBody = resultBody.replaceAll("part/" + linkURI + "/assets/", "assets/");
    } catch (IOException e) {
    }
    return resultBody;
}
I am new to Jsoup; sorry if my question is too trivial.
I am trying to extract article text from http://www.nytimes.com/, but when printing the parsed document
I am not able to see any articles in the output:
public class App {
    public static void main(String[] args) {
        String url = "http://www.nytimes.com/";
        Document document;
        try {
            document = Jsoup.connect(url).get();
            System.out.println(document.html()); // articles not getting printed
            //System.out.println(document.toString()); // same here
            String title = document.title();
            System.out.println("title : " + title); // title is fine
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
OK, I have also tried to parse "http://en.wikipedia.org/wiki/Big_data" to retrieve the wiki data; same issue there, I am not getting the wiki data in the output either.
Any help or hint will be much appreciated.
Thanks.
Here's how to get all <p class="summary"> text:
final String url = "http://www.nytimes.com/";
Document doc = Jsoup.connect(url).get();
for( Element element : doc.select("p.summary") )
{
    if( element.hasText() ) // Skip those tags without text
    {
        System.out.println(element.text());
    }
}
If you need all <p> tags, without any filtering, you can use doc.select("p") instead. But in most cases it's better to select only those you need (see here for Jsoup Selector documentation).
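For instance, here are a few common selector forms; the class and tag names other than p.summary are just illustrative examples:

Elements paragraphs = doc.select("p");           // every <p> tag
Elements summaries = doc.select("p.summary");    // only <p class="summary">
Elements links = doc.select("a[href]");          // anchors that have an href attribute
Elements nested = doc.select("div.story p");     // <p> tags inside <div class="story">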