Catching Google's search "Next Page" - Jsoup Webcrawler - java

I'm building a Java web crawler and I need to capture the "Next Page" link from the Google search results I request. I've been trying to find a pattern or approach for this, but so far I haven't found any clues.
Check out this picture:
You can verify for yourself that the "Next Page" link is the same whichever page number you hover over. The only thing that changes in the link is the "start=(number)" parameter near the end. Each results page adds 10 to start, since that is the number of result links per page.
But the weird part is that this "default" link doesn't appear in the page's source code when you ask the browser to show it. Maybe this has something to do with how Google builds the page, but I'm not sure, since I'm not an expert programmer yet, especially in web programming.
So, does anyone have any idea how I should solve this?

I would suggest you use the jsoup.org library.
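Since each results page just advances the start parameter by 10, you don't need to scrape the "Next" link at all: you can build each page's URL yourself and hand it to jsoup. A minimal sketch of the paging idea in plain Java; the jsoup fetch itself is left as a comment, and the query and User-Agent values are only placeholders:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class GooglePager {

    // Build the URL of the n-th results page (0-based) by setting the
    // "start" parameter; Google shows 10 results per page.
    static String pageUrl(String query, int page) {
        return "https://www.google.com/search?q="
                + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + "&start=" + (page * 10);
    }

    public static void main(String[] args) {
        for (int page = 0; page < 3; page++) {
            String url = pageUrl("jsoup web crawler", page);
            System.out.println(url);
            // With jsoup you would then fetch and parse each page, e.g.:
            // Document doc = Jsoup.connect(url).userAgent("Mozilla/5.0").get();
        }
    }
}
```

This also explains why the link never shows up in the page source: it is assembled by JavaScript in the browser, so building it yourself is the simplest workaround.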

Related

Get TEXT-only tweets using Twitter4j? Tweets that do not contain any media

There are a lot of forum posts asking how to get media from Twitter, but how do I do the opposite? I want to get only the tweets that do not contain any image/video/URL. If a tweet contains media, I want to skip it and search for the next one, because I want to display the full text without the "http://t..." part at the end. I put this in:
cb.setIncludeEntitiesEnabled(false);
but I was not sure I did it right. Also, I write this code with the Processing library, but in Eclipse, so if you can show me the way in plain Java I will be fine, but a complete example, please. I am very new to Java.
However, I have seen some people mention "filter=image" in the tweet method, but I could not figure out where to put it. I have tried and failed.
Any suggestions? Thanks in advance.
Stack Overflow isn't really designed for general "how do I do this" type questions. It's for specific "I tried X, expected Y, but got Z instead" type questions. Please try to be more specific. Saying "I could not figure it out where to put this in" doesn't really tell us anything. Where exactly did you try to put it? What exactly happened when you tried that? Can you please post a MCVE?
That being said, the best thing you can do is Google something like "twitter4j filter out images" or "twitter4j exclude images". More info:
Get Tweets with Pictures using twitter search api: Notice the second reply mentions the approach of getting all tweets and then manually checking whether it has an image. You could do the same thing to check that a tweet does not contain an image.
How to exclude retweets from my search query results: this mentions an exclude filter and a -filter that might be helpful.
Please try something and post a MCVE in a new question post if you get stuck. Good luck.
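For the manual-checking approach mentioned in the first link: in twitter4j itself the usual check is on the tweet's entities, e.g. `status.getMediaEntities().length == 0 && status.getURLEntities().length == 0`. As a dependency-free sketch of the same filtering idea, here is the text-based heuristic the question hints at, skipping any tweet whose text contains a t.co link:

```java
import java.util.List;
import java.util.stream.Collectors;

public class TextOnlyTweets {

    // Heuristic: Twitter wraps media and links as t.co URLs, so a tweet whose
    // text contains none is (probably) plain text. With twitter4j you would
    // instead inspect status.getMediaEntities() / status.getURLEntities().
    static boolean isTextOnly(String tweetText) {
        return !tweetText.contains("http://t.co/")
            && !tweetText.contains("https://t.co/");
    }

    static List<String> keepTextOnly(List<String> tweets) {
        return tweets.stream()
                     .filter(TextOnlyTweets::isTextOnly)
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tweets = List.of(
                "just plain text, no media",
                "look at this pic https://t.co/abc123");
        System.out.println(keepTextOnly(tweets)); // keeps only the first tweet
    }
}
```

Note that `setIncludeEntitiesEnabled(false)` only stops twitter4j from *returning* entity metadata; it does not filter the tweets themselves, which is why it didn't seem to do anything.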

How do I submit a Pastebin or Paste.ee paste from an Android app and get the URL back?

OK, so here's what I want to achieve:
I want to submit a (big) string to Pastebin or Paste.ee with a custom title and receive the URL to it back as a string (this needs to be done as a guest).
Before I continue: I have searched, and there are lots of different APIs, some in Java. I have tried them all and none have worked for me. There are a fair few posts, but none with a definitive answer.
OK, so let's go with what I know (or think I know).
I need at least (according to http://pastebin.com/api):
api_dev_key (I have my API key)
api_option=paste
api_paste_code (the code we want to paste)
The title is optional, but is:
api_paste_name=
The URL for submission is:
*pastebin web page /api/api_post.php
So, in theory,
*pastebin web page /api/api_post.php/api_dev_key=myprivateapikey&api_option=paste&api_paste_name=testpaste&api_paste_code=hello%20world
should create a paste titled testpaste with the content hello world.
Instead it creates "This paste has been removed!"
So that's the first hurdle (yes, I have double-checked my API key).
Then, I'm not really sure how to get the address back after the paste is submitted.
All in all, I'm totally confused (it doesn't help that I have been reading about a thousand and one Java APIs and guides, none of which seem to work).
The code I had cobbled together at one point is:
http://pastebin.com/4PVFH8tR
The alternative is Paste.ee, but its API is very, very undocumented.
*It counted them as links, so I had to make them non-URLs in order to ask the question.
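Two likely problems with the URL in the question: the Pastebin API expects the parameters in the body of a POST request, not appended to the path with a `/`, and on success the response body is simply the URL of the new paste (on failure it starts with "Bad API request"). A sketch using only the JDK; the dev key is a placeholder, and the field names are the ones listed above:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PastebinPost {

    // Form-encode the parameters listed in the question.
    static String formBody(String devKey, String title, String code) {
        return "api_dev_key=" + URLEncoder.encode(devKey, StandardCharsets.UTF_8)
             + "&api_option=paste"
             + "&api_paste_name=" + URLEncoder.encode(title, StandardCharsets.UTF_8)
             + "&api_paste_code=" + URLEncoder.encode(code, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection conn = (HttpURLConnection)
                new URL("https://pastebin.com/api/api_post.php").openConnection();
        conn.setRequestMethod("POST"); // the endpoint expects POST, not GET
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        conn.getOutputStream().write(
                formBody("YOUR_DEV_KEY", "testpaste", "hello world")
                        .getBytes(StandardCharsets.UTF_8));
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            // On success the response body is the URL of the new paste.
            System.out.println(in.readLine());
        }
    }
}
```

On Android, remember that network calls like this must run off the main thread.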

Want to create a form filler - are Java, JSP, and HTML enough?

Summary - I want to make a simple website form filler. The website is NOT mine and I cannot edit its source code. I don't know what tools/languages are needed. Would Java, JSP, and HTML be enough?
Request - Please reconsider your decision to close or downvote. I only need to know whether Java is enough or not.
There is a form on a website, say for reserving a visit to a single dentist. You fill in your details and the date and time you want to visit. Then it tells you, somewhere in the web page, whether an appointment can be made.
This web page is NOT protected by a CAPTCHA. I don't want to enter my details every time I look for a reservation. I want to write code to do it for me.
I want to make code which will -
1 - Fill the details into the form and "press" submit.
2 - Then, read the resulting page and find out whether a reservation is available. If yes, do something like pop up a GUI message, send an e-mail, or whatever.
3 - Repeat the above steps every 5 hours or so.
What languages and tools would I need for this job? Would I need more than Java, JSP, and HTML (that's all I know now) to write such code?
Thanks.
I suggest you try cURL. That will make your solution simpler, in my opinion.
You can execute HTTP GET/POST requests with cURL, which is enough to solve your problem. Give it a try, and if you get blocked you can ask a more specific question about cURL or HTTP.
Hope it helps.
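The same GET/POST round trip can also be done in plain Java, so yes, Java alone is enough (no JSP needed, since you are not serving pages yourself). A sketch with the JDK's HttpClient; the form URL, the field names (name, date, time), and the success phrase are all hypothetical, you would read the real ones from the form's HTML:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FormFiller {

    // Hypothetical success check: match whatever text the real result
    // page shows when an appointment can be made.
    static boolean reservationAvailable(String html) {
        return html.contains("appointment available");
    }

    public static void main(String[] args) throws Exception {
        // Field names and URL are placeholders; inspect the real form's HTML
        // (the <form action="..."> and each <input name="...">).
        String form = "name=John+Doe&date=2024-06-01&time=10%3A00";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://dentist.example.com/reserve"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        if (reservationAvailable(response.body())) {
            System.out.println("Reservation available!"); // or send an e-mail, etc.
        }
    }
}
```

For step 3 (repeating every 5 hours), a `java.util.concurrent.ScheduledExecutorService` or a plain cron job around this program would do.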
IMO, if you really just want to fill in some forms to check a reservation, there is no need to code anything. Why not just install a plugin such as Selenium IDE, record your actions there, and run it at specified times: http://docs.seleniumhq.org/
Sure.
You need a web server and a database on the back end.
Since you feel comfortable with Java, JSP/HTML would probably be an ideal solution.
IMHO...

SEO, google crawl

I have implemented pagination like below:
http://myhost.com/product-2/213-1
which means there are 213 products in total and this is the first page.
When I check which pages Google crawled on my website, I see results like:
http://myhost.com/product-2/213-1-2/144-0/144-1/144-14/125-1/125-12/125-1/151-15/108-10/131-1/134-13/140-14/140-1/118-11/126-1/126-12/110-1/270-27/270-1/270-27
This means Google is somehow appending all the page numbers to the end of the URL and crawling the result. Could someone give me a solution to stop this? For this particular case I want Google to crawl only one page holding all the product information.
Use canonical URLs to tell Google which page is the one page you want to show in the search results.
That is weird. It looks like you are using relative links in your pagination, and your URL router is somehow accepting them without throwing a 404; instead it shows content by interpreting only part of the URL rather than the whole. So search engines can crawl these URLs.
Example:
You are linking to
next-side/
instead of
/path/to/next-side/
If you post a link, the community can try it out!
By the way, I wouldn't recommend changing the URL based on the number of items. Using fixed URLs is much better, and the number of items is of no interest. Better to use something like /shop/category/subcategory/product.
I will add to the great answers already given that you can use the rel="next"/rel="prev" pagination elements to let Google know that the next link is the next page in your list.
You can find more information on the Google Webmaster Central blog; there is a post and a video tutorial that explain how to implement and use the pagination tags.
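Put together, the suggestions above (a canonical URL plus rel="prev"/rel="next") would look something like this in the `<head>` of page 2 of the listing; the URLs here are only illustrative, adapted to the scheme in the question:

```html
<!-- Page 2 of the paginated product list -->
<link rel="canonical" href="http://myhost.com/product-2/213-2">
<link rel="prev" href="http://myhost.com/product-2/213-1">
<link rel="next" href="http://myhost.com/product-2/213-3">
```

Using absolute URLs here (and in the pagination links themselves) also avoids the relative-link appending problem described in the other answer.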

Java Bing Image Search

I have a small application in Java which searches for images using Bing image search. The problem I am facing is that it gets only the first 20 images. Maybe this is because when we search on bing.com it populates the first 20 images first and then uses an infinite-scrolling feature.
Is there any way to search more than 20 images using bing?
Cheers :)
I'm guessing this is because the site uses Ajax to populate the "infinite" scrolling list, as you call it.
You probably send an HTTP request and get the initial page (by the way, in my browser I got 6 images across by 4 down, i.e. 24, not 20; thinking about it, maybe my client also got only 20 at first and fetched the last 4 with Ajax), and you'd need to do the paging by way of Ajax requests.
At a glance, the XHTML and associated JavaScript of the page are very dense and somewhat obfuscated; it would take a while to get oriented. An alternative to analyzing the page is to use a packet sniffer (such as Wireshark) and capture the requests that take place when you scroll down.
Essentially, this will likely expose some form of Ajax request, which you can then easily emulate in Java. Typically the Ajax response is easy to parse whatever its nature (XML, JSON, gzip...).
A possible snag in this well-laid-out plan is if the returned data in the Ajax response is encrypted, for example if the extra images are bundled in some sort of envelope whose format you would then need to discover.
Depending on the actual task at hand, you might try alternatives such as automation with Greasemonkey (on Firefox) or similar tools.
What about the Bing API?
Note that all the approaches above are akin to screen scraping and hence quite sensitive to even minute changes in the Bing application; depending on effective usage and context, this could put the project in a legal grey area. A better approach may be to register for a proper application ID with MS/Bing and use the Bing API.
Are you simulating a browser? Doesn't the Bing engine have an entry point for programs instead, a web service or so, which would make your task much easier?
EDIT: SDK appears to be here: http://msdn.microsoft.com/en-us/library/cc980922.aspx
Just wanted to post a direct answer to the question:
Bing uses Ajax (of course) for the infinite scroll. Each "tick" is based on a simple Ajax GET request, which acquires new images.
For instance, this URL returns 30 results (121-150) in a "htmlraw" format based on the query "max payne":
http://www.bing.com/images/async?q=max+payne&format=htmlraw&first=121
Edit:
It works with the original URL too; just add &first=NUMBER to the query string. Example:
www.bing.com/images/search?q=payne&go=&form=QBLH&scope=images&filt=all&first=10
I am building my own bulk image collector (as a "learning project" for myself) and I found out that it is paginated like this.
FYI, Google and Bing are easy; Yahoo and AltaVista (redundant, since its results come from Yahoo) are far more problematic: they don't post the direct link to the original image.
Have fun! :)
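The paging scheme described above can be sketched in Java by advancing first in steps of 30 (first=1, 31, 61, ...). Whether the endpoint keeps accepting these undocumented parameters is up to Bing, so treat this as screen scraping, as the earlier answer warns:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class BingPager {

    // URL of the n-th batch (0-based) of async image results,
    // 30 images per batch, per the answer above (batch 4 starts at 121).
    static String batchUrl(String query, int batch) {
        return "http://www.bing.com/images/async?q="
                + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + "&format=htmlraw&first=" + (1 + batch * 30);
    }

    public static void main(String[] args) {
        for (int batch = 0; batch < 4; batch++) {
            // Fetch each URL (HttpURLConnection, jsoup, ...) and parse
            // the <img> tags out of the returned HTML fragment.
            System.out.println(batchUrl("max payne", batch));
        }
    }
}
```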
This can be done by using the count parameter. For example, I tried a GET request to "https://api.cognitive.microsoft.com/bing/v7.0/images/search?q=shoes&mkt=en-us&count=30" and it returned 30 images.
