How to mimic a web browser from Java using jSoup - java

I am using a query like this in jSoup:
Document doc = Jsoup.connect(urlString).timeout(1000).post();
It works for some sites, however:
it doesn't work for Google search queries (e.g. urlString = "http://www.google.com/search?q=text") - I don't know why or what makes that URL special
result documents contain messages like "JavaScript should be turned on in your browser" which I would rather avoid
there are probably more quirks, but I haven't tested it fully yet...
My question: could these problems be avoided if we could mimic a web browser more closely? What is the best way to do it?
What are the other differences that can be encountered between getting pages via web browser and via Java (URLConnection or jSoup)?

I would like to answer your question.
When you search on Google, the parameters are passed in the URL, so it is a GET request. In this case, you should use the .get() method.
On many other websites, however, parameters are passed using a POST request. Take a simple login page as an example: the username and password are sent via POST, and in addition there are often hidden fields inside the page that also need to be passed along.
If we miss those parameters, the request will result in an error. A minimal sketch of both cases follows below.
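As a rough illustration of the two cases (a sketch only; the login URL and form field names are assumptions for the example, and a real page's hidden fields have to be read from its HTML):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class GetVsPost {
    public static void main(String[] args) throws Exception {
        // GET: the search parameters travel in the URL's query string
        Document searchPage = Jsoup.connect("http://www.google.com/search?q=text")
                .userAgent("Mozilla/5.0")   // many sites reject requests with no browser-like User-Agent
                .get();

        // POST: form parameters (including any hidden fields) travel in the request body.
        // "username", "password" and "csrf_token" are hypothetical field names; inspect the
        // real login form to find the actual input names and hidden values.
        Document afterLogin = Jsoup.connect("https://example.com/login")
                .data("username", "alice")
                .data("password", "secret")
                .data("csrf_token", "value-copied-from-the-hidden-input")
                .post();
    }
}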

I realized that the problem with some sites not responding was actually that I was using post() instead of get(). With get() it works fine now!
It also probably helps to add userAgent to the query, for example:
.userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
In the meantime, I've also tested HtmlUnit for the same task, and it worked, but it seems like overkill when the purpose is simply to get an HTML file (for some kind of processing). It basically runs a whole invisible web browser in the background to do the task.
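Putting the two fixes together, the working call looks roughly like this (timeout value taken from the question):

Document doc = Jsoup.connect(urlString)
        .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
        .timeout(1000)
        .get();   // get(), not post(), for an ordinary page fetch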

Related

JSOUP / HTTP error fetching URL. Status=503

I am using Jsoup to scrape web pages as follows:
public static final String GOOGLE_SEARCH_URL = "https://www.google.com/search";
String searchURL = GOOGLE_SEARCH_URL + "?q="+searchTerm+"&num="+num +
"&start=" + start;
Document doc = Jsoup.connect(searchURL)
.userAgent("Mozilla/5.0 Chrome/26.0.1410.64 Safari/537.31")
// .ignoreHttpErrors(true)
.maxBodySize(1024*1024*3)
.followRedirects(true)
.timeout(100000)
.ignoreContentType(true)
.get();
Elements results = doc.select("h3.r > a");
for (Element result : results) {
String linkHref = result.attr("href");
}
But my problem is that the code works well at the start; after a while it stops and always gives me an "HTTP error fetching URL. Status=503" error.
When I add .ignoreHttpErrors(true) it runs without any error, but it does not scrape the web page.
*searchTerm is any keyword I want to search for, and num is the number of pages that I need to retrieve.
Could anyone help, please?
Does this mean that Google has blocked my IP from scraping? If yes, is there any solution, or how can I scrape the Google search results?
I need help.
Thank you,
A 503 error usually means that the website you are trying to scrape is blocking you, because it does not want non-human users navigating its pages. That is especially true of Google.
There are some things you can do, though, such as:
Using a proxy rotator
Using chromedriver
Adding some delay to your application after each page
Basically you need to look as human as possible to prevent sites from blocking you.
EDIT:
I need to warn you that scraping Google search results is against their ToS and might be illegal depending on where you are.
What you can do:
You can use a proxy rotating service to mask your requests, so Google sees them as requests coming from multiple regions. Search for a proxy rotator service if you are interested; it might be expensive depending on what you do with the data.
Then write a module that changes the User-Agent on every request, to make Google less suspicious of your traffic.
Add a random delay after scraping each page; I suggest around 1-5 seconds. A randomized delay makes your requests look more human to Google (see the sketch after this list).
Finally, if everything fails, you might want to look into the Google search API and use that instead of scraping the site.
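A minimal sketch of the User-Agent rotation and random-delay ideas (the user-agent strings and the 1-5 second range are arbitrary examples, not values Google is known to accept):

import java.util.List;
import java.util.Random;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PoliteScraper {
    // A small pool of browser-like User-Agent strings to rotate through
    private static final List<String> USER_AGENTS = List.of(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0");

    private static final Random RANDOM = new Random();

    static Document fetch(String url) throws Exception {
        Document doc = Jsoup.connect(url)
                .userAgent(USER_AGENTS.get(RANDOM.nextInt(USER_AGENTS.size())))
                .timeout(100_000)
                .get();
        // Sleep 1-5 seconds between pages so the request pattern looks less automated
        Thread.sleep(1000 + RANDOM.nextInt(4000));
        return doc;
    }
}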

How to write a Java code to read fields from a website that requires login and uses POST request?

Need some help with fetching some data from a website.
Previously, we had the following code in our application and it used to fetch the required data. We simply read the required fields by forming a URL with the username, password and search parameter (DEA number). The same URL (with parameters) could also be hit from a browser directly to see the results. It was a simple GET request:
URL url = new URL(
        "http://www.deanumber.com/Websvc/deaWebsvc.asmx/GetQuery?UserName=" + getUsername()
        + "&Password=" + getPassword()
        + "&DEA=" + deaNumber
        + "&BAC=&BASC=&ExpirationDate=&Company=&Zip=&State=&PI=&MaxRows=");
Document document = parser.parse(url.toExternalForm());
// Ask the document for a list of all <DEA> tags it contains
NodeList sections = document.getElementsByTagName("DEA");
// Followed by a loop to read each element using sections.item(index).getFirstChild(), etc.
Now, the website URL has changed to the following:
https://www.deanumber.com/RelId/33637/ISvars/default/Home.htm
I am able to log in to the URL with credentials, go to the search page, enter the DEA number and search. The login page comes up as a pop-up once I click the 'Login' link on the home page, and the final result also comes up as a pop-up. This is a POST request, so I am unable to form the complete URL that I could use in my code.
I am not an expert in web services, but I think I need a web service URL like the one mentioned in the code above. I am not sure how to get that! Even if I get the URL, I am not sure how to perform the login through Java code and search for the DEA number.
Also, it would be great if I could validate the URL manually before using it in Java. Let me know if there is any way.
Or, in case there is any alternate approach in Java, kindly suggest one.
Thanks in advance.
First of all, the previous approach provided by the website was completely wrong and insecure, because it passed the username and password as querystring parameters in plain text. I think they realized this and changed their way of authentication.
It also looks like they have restricted direct URL-based requests from client applications like yours. For such requests they have published web services. Check this link. They have also listed the rates they charge per volume of web service requests.
So you may need to open a formal communication channel with them to get authentication and other details for accessing their web services. Depending on what they use for web service client authentication, you can then code your client to access those services.
I hope this helps.
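If the site ends up exposing an ordinary form-based login rather than a formal web service, a session-based flow with Jsoup usually looks something like the sketch below (all URLs and field names here are hypothetical; the real ones have to be read from the site's login form):

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LoginThenFetch {
    public static void main(String[] args) throws Exception {
        // 1. Load the login page to pick up the session cookies and any hidden form fields
        Connection.Response loginForm = Jsoup.connect("https://example.com/login")
                .method(Connection.Method.GET)
                .execute();

        // 2. Submit the login form with the credentials plus the cookies from step 1.
        //    "username" and "password" are hypothetical field names.
        Connection.Response login = Jsoup.connect("https://example.com/login")
                .data("username", "user")
                .data("password", "pass")
                .cookies(loginForm.cookies())
                .method(Connection.Method.POST)
                .execute();

        // 3. Request the protected search page, reusing the authenticated session cookies
        Document results = Jsoup.connect("https://example.com/search?dea=XX1234567")
                .cookies(login.cookies())
                .get();
        System.out.println(results.title());
    }
}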

Web crawler breaks when website is changed

I have created a web crawler according to this example.
This is working OK, but if I replace
processPage("http://www.mit.edu");
Document doc = Jsoup.connect("http://www.mit.edu/").get();
with
processPage("http://www.stackoverflow.com");
Document doc = Jsoup.connect("http://www.stackoverflow.com/").get();
or the equivalent lines for other sites, then it returns only the text "conn built".
Why is this not working for other sites?
I did not try the code, but my guess is that accessing "http://www.stackoverflow.com" returns HTTP response code 301 or 302, which means it redirects to a different page. My guess is that the library you are using does not handle 301/302 responses well.
So try this URL instead: https://stackoverflow.com/questions. It should work if my assumption is correct.
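One way to check whether a redirect really is the issue (a small diagnostic sketch; note that Jsoup itself follows redirects by default):

import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class RedirectCheck {
    public static void main(String[] args) throws Exception {
        Connection.Response res = Jsoup.connect("http://www.stackoverflow.com/")
                .followRedirects(false)      // keep the raw 301/302 instead of following it
                .ignoreHttpErrors(true)      // don't throw on non-200 status codes
                .execute();
        System.out.println(res.statusCode());        // e.g. 301
        System.out.println(res.header("Location"));  // where the site wants to send you
    }
}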

web page source downloaded through Jsoup is not equal to the actual web page source

I have a serious concern here. I have searched all through Stack Overflow and many other sites. Everywhere they give the same solution, and I have tried them all, but I am not able to resolve this issue.
I have the following code:
Document doc = Jsoup.connect(url).timeout(30000).get();
Here I am using the Jsoup library, and the result that I get is not equal to the actual page source that we can see by right-clicking on the page -> view page source. Many parts are missing from the result returned by the above line of code.
After searching some sites on Google, I saw this method:
URL url = new URL(webPage);
URLConnection urlConnection = url.openConnection();
urlConnection.setConnectTimeout(10000);
urlConnection.setReadTimeout(10000);
InputStream is = urlConnection.getInputStream();
InputStreamReader isr = new InputStreamReader(is);
int numCharsRead;
char[] charArray = new char[1024];
StringBuffer sb = new StringBuffer();
while ((numCharsRead = isr.read(charArray)) > 0) {
sb.append(charArray, 0, numCharsRead);
}
String result = sb.toString();
System.out.println(result);
But no luck.
While I was searching the internet for this problem, I saw many sites which said I had to set the proper charset and encoding type of the webpage while downloading its page source. But how will I get to know these things dynamically from my code? Are there any classes in Java for that? I went through crawler4j as well, but it did not do much for me. Please help, guys. I have been stuck with this problem for over a month now and have tried everything I can, so my final hope is the gods of Stack Overflow, who have always helped!
I had this recently. I'd run into some sort of robot protection. Change your original line to:
Document doc = Jsoup.connect(url)
.userAgent("Mozilla/5.0")
.timeout(30000)
.get();
The problem might be that your web page is rendered by JavaScript that runs in a browser. JSoup alone can't help you with this, so you may try a browser emulator such as HtmlUnit, or Selenium driving a real browser: using Jsoup to sign in and crawl data.
UPDATE
There are several reasons why the HTML is different. The most probable is that this web page contains <script> elements which hold dynamic page logic. This could be an application inside your web page which sends requests to the server and adds or removes content depending on the responses.
JSoup will never render such pages, because that is a job for a browser like Chrome, Firefox or IE. JSoup is a lightweight parser for the plain HTML you get from the server.
So what you could do is use a web driver which emulates a web browser and renders the page in memory, so it contains the same content as is shown to the user. You can even perform mouse clicks with such a driver.
The proposed implementation of the web driver in the linked answer is HtmlUnit. It is the most lightweight solution; however, it might give you unexpected results: Selenium vs HtmlUnit?.
If you want the most real page rendering, you might want to consider Selenium WebDriver.
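A rough sketch of that approach, rendering the page with HtmlUnit and then handing the result to Jsoup for selecting (the URL is a placeholder and the waiting time is an arbitrary example):

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RenderedFetch {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient(BrowserVersion.CHROME);
        webClient.getOptions().setJavaScriptEnabled(true);    // let the page's scripts run
        webClient.getOptions().setThrowExceptionOnScriptError(false);

        HtmlPage page = webClient.getPage("https://example.com/");
        webClient.waitForBackgroundJavaScript(5_000);          // give AJAX calls time to finish

        // Parse the rendered DOM with Jsoup for convenient CSS selecting
        Document doc = Jsoup.parse(page.asXml());
        System.out.println(doc.title());
    }
}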
Why do you want to parse a web page this way? If there is a consumable service available from the website, it might have a REST API.
To answer your question: a webpage viewed in a web browser may not be the same as the same webpage downloaded via URLConnection.
The following are a few of the reasons for these differences:
Request headers: when the client (Java application/browser) makes a request for a URL, it sets various headers as part of the request, and the web server may change the content of the response accordingly.
JavaScript: once the response is received, any script elements present in it are executed by the browser's JavaScript engine, which may change the contents of the DOM.
Browser plugins, such as IE Browser Helper Objects, Firefox extensions or Chrome extensions, may also change the contents of the DOM.
In simple terms, when you request a URL using URLConnection you receive raw data, whereas when you request the same URL from a browser's address bar you get a webpage processed by JavaScript and browser plugins.
URLConnection/JSoup will allow you to set request headers as required (see the sketch below), but you may still get a different response due to points 2 and 3. Selenium allows you to remote-control a browser and has an API to access the rendered page. Selenium is used for automated testing of web applications.
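As an illustration of point 1, Jsoup lets you mimic a browser's request headers fairly closely (the header values below are typical browser defaults, not values this particular site is known to require):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BrowserLikeRequest {
    public static void main(String[] args) throws Exception {
        // Send browser-like headers so the server serves the same variant a browser would get
        Document doc = Jsoup.connect("https://example.com/")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
                .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
                .header("Accept-Language", "en-US,en;q=0.5")
                .referrer("https://www.google.com/")
                .timeout(30_000)
                .get();
        System.out.println(doc.title());
    }
}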

HtmlUnit: Request website from server in a specific language

I am looking for a clean/simple way in HtmlUnit to request a webpage from a server in a specific language.
To do this I have been trying to request the "bankofamerica.com" homepage in Spanish instead of English.
This is what I have done so far:
I tried to set the "Accept-Language" header to "es" in the HTTP request. I did this using:
myWebClient.addRequestHeader("Accept-Language" , "es");
It did not work. I then created a web request with the following code:
URL myUrl = new URL("https://www.bankofamerica.com/");
WebRequest myRequest = new WebRequest(myUrl);
myRequest.setAdditionalHeader("Accept-Language", "es");
HtmlPage aPage = myWebClient.getPage(myRequest);
Since this failed too, I printed out the request object for this URL to check whether these headers are being set.
[<url="https://www.bankofamerica.com/", GET, EncodingType[name=application/x-www-form-urlencoded], [], {Accept-Language=es, Accept-Encoding=gzip, deflate, Accept=*/*}, null>]
So the server is being asked for a Spanish page, but in response it sends the homepage in English (the response header has Content-Language set to en-US).
I did find a hack to retrieve the BOA page in Spanish. I visited this page and used the Chrome developer tools to get the cookie value from the request header. I used this value to do the following:
myRequest.setAdditionalHeader("Cookie", "TLTSID= ........._LOCALE_COOKIE=es-US; CONTEXT=es_US; INTL_LANG=es_US; LANG_COOKIE=es_US; hp_pf_anon=anon=((ct=+||st=+||fn=+||zc=+||lang=es_US));..........1870903; throttle_value=43");
I am guessing the answer lies somewhere here.
Here lies my next question: if I am writing a script to retrieve 100 different websites in Spanish (assuming they all have Spanish versions of their pages), is there a clean way in HtmlUnit to accomplish this?
(If cookies are indeed the solution, then to create them in HtmlUnit you need to specify the domain name. One would then have to create cookies for each of the 100 sites. As far as I know, there is no way in HtmlUnit to do something like:
Cookie langCookie = new Cookie("All Domains","LANG_COOKIE","es_US");
myWebClient.getCookieManager().addCookie(langCookie);)
NOTE: I am using HtmlUnit 2.12 and setting BrowserVersion.CHROME in the WebClient.
Thanks.
Regarding your first concern, the clear/simple(/only?) way of requesting a webpage in a particular language is, as you said, to set the HTTP Accept-Language request header to the locale(s) you want. That is it.
Now the fact that you request a page in a particular language doesn't mean that you will actually get a page in that language. The server has to be set up to process that HTTP header and respond accordingly. Even if a site has a whole section in spanish it doesn't mean that the site is responding to the HTTP header.
A clear example of this is the page you provided. I performed a quick test on it and found that it clearly does not respond to the Accept-Language header I set (which was es). Hitting the home page with es still produced results in English. However, the page has a link that says En Español (meaning In Spanish); through it the page does switch to Spanish and you get redirected to https://www.bankofamerica.com?request_locale=es_US.
So you might be tempted to think that the page handles the locale via a request parameter. However, that is not (only) the case, because if you then open the home page again (without the locale parameter) you will see the Spanish version again. That is clear proof that the locale is being stored somewhere else, most likely in the session, which is most likely handled by cookies.
That can easily be confirmed by opening a private session or clearing the cookies and confirming this behaviour (I've just done that).
I think that explains the mystery of the webpage existing in Spanish but being fetched in English. (Note how most bank webpages do not conform to basic standards such as responding to simple HTTP requests... and they are handling our money!)
Regarding your second question, it is like asking What is the recipe for never getting ill? It just doesn't depend only on you. Also note that your first concern used the word request while your second concern used the word retrieve. I think it should be clear by now that you can only be 100% sure of what you request, but not of what you retrieve.
Regarding setting a value in a cookie manually, that is technically possible. However, it is just like adding another parameter to a GET request: http://domain.com?login=yes. The parameter will only be processed by the server if it is expecting it; otherwise, it will be ignored. That is what will happen to the value in your cookie.
Summary: There are standards to follow. You can try to use them but if the one in the other side doesn't then you won't get the results you expect. Your best choice: do your best and follow the standards.
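For completeness, if a particular site does keep its locale in a cookie, setting it per domain in HtmlUnit looks roughly like this (the cookie name and value come from the bankofamerica.com observation in the question; other sites will use different names, which you would have to discover case by case):

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.util.Cookie;

public class SpanishPreference {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient(BrowserVersion.CHROME);

        // Ask for Spanish on every request; servers that honor the header need nothing more
        webClient.addRequestHeader("Accept-Language", "es");

        // Cookies are scoped to a domain, so a site-specific locale cookie must be added per site
        Cookie langCookie = new Cookie("www.bankofamerica.com", "LANG_COOKIE", "es_US");
        webClient.getCookieManager().addCookie(langCookie);

        webClient.getPage("https://www.bankofamerica.com/");
    }
}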
