Web crawler breaks when website is changed


I have created a web crawler following this example.
It works fine, but if I replace
processPage("http://www.mit.edu");
Document doc = Jsoup.connect("http://www.mit.edu/").get();
with
processPage("http://www.stackoverflow.com");
Document doc = Jsoup.connect("http://www.stackoverflow.com/").get();
or make the same change for any other site, the program prints only the text "conn built".
Why does this not work for other sites?

I did not try the code, but my guess is that accessing "http://www.stackoverflow.com" returns HTTP response code 301 or 302, which means it redirects to a different page. The library you are using may not handle 301/302 responses well.
So try this URL instead: https://stackoverflow.com/questions. It should work if my assumption is correct.
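One way to test the redirect guess before touching the crawler is to request the page once with redirect-following switched off and look at the status code and the Location header. The sketch below uses only the JDK's HttpURLConnection; the class and method names are made up for illustration, and what it prints depends on how the live site answers at the time.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class RedirectProbe {

    /** True for the 3xx codes that normally carry a Location header to follow. */
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    /** Requests the address once, without following redirects, and describes the answer. */
    static String probe(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setInstanceFollowRedirects(false); // we want to see the 301/302 itself
        try {
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            return isRedirect(status)
                    ? status + " -> " + location
                    : status + " (no redirect)";
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Address taken from the question; the result depends on the live site.
        System.out.println(probe("http://www.stackoverflow.com/"));
    }
}
```

If this reports a 301 or 302, passing the Location value to the crawler instead of the original address is a quick way to confirm the diagnosis.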

Related

How to write a Java code to read fields from a website that requires login and uses POST request?

I need some help with fetching data from a website.
Previously, we had the following code in our application, and it fetched the required data. We read the required fields by forming a URL from the username, password and search parameter (DEA number). The same URL (with parameters) could also be opened directly in a browser to see the results. It was a simple GET request:
URL url = new URL(
        "http://www.deanumber.com/Websvc/deaWebsvc.asmx/GetQuery?UserName=" + getUsername()
        + "&Password=" + getPassword() + "&DEA=" + deaNumber
        + "&BAC=&BASC=&ExpirationDate=&Company=&Zip=&State=&PI=&MaxRows=");
Document document = parser.parse(url.toExternalForm());
// Ask the document for a list of all <DEA> tags it contains
NodeList sections = document.getElementsByTagName("DEA");
// Followed by a loop that reads each element via sections.item(index).getFirstChild() etc.
Now the website URL has changed to the following:
https://www.deanumber.com/RelId/33637/ISvars/default/Home.htm
I am able to log in at that URL with my credentials, go to the search page, enter the DEA number and search. The login page appears as a pop-up once I click the 'Login' link on the home page. The final result also appears as a pop-up. This is a POST request, so I am unable to form a complete URL that I could use in my code.
I am not an expert in web services, but I think I need a web service URL like the one in the code above, and I am not sure how to get it. Even if I get the URL, I am not sure how to perform the login and search for the DEA number from Java code.
It would also be great if I could validate the URL manually before using it in Java. Let me know if there is any way to do that.
Or, if there is an alternative approach in Java, kindly suggest it.
Thanks in advance.

First of all, the previous approach provided by the website was insecure, because it passed the username and password as query string parameters in plain text. I think they realized this and changed their method of authentication.
It also looks like they have restricted direct URL-based requests from client applications like yours. For such requests from clients, they have published web services. Check this link. They also mention the rates for web service request counts.
So you may need to open a formal communication channel to get the authentication and other details needed to access their web services. Depending on what they use for web service client authentication, you can then code your client to access the web services.
I hope this helps.
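If the provider does hand out an endpoint that accepts a plain form POST, the Java side is fairly small. The sketch below is only an illustration under that assumption: the field names are borrowed from the old GET URL, and nothing here is known about the real service. A browser's developer tools (network tab) will show the actual request a login form sends, which is one way to check the URL and fields manually before coding against them.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class FormBody {

    /** Encodes fields as application/x-www-form-urlencoded, the usual HTML form format. */
    static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    /** Sends the fields as a POST body and returns the HTTP status code. */
    static int post(String address, Map<String, String> fields) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true); // required before writing a request body
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(encode(fields).getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("UserName", "demo user"); // hypothetical field names and values
        fields.put("Password", "p&ss");
        fields.put("DEA", "AB1234567");
        // prints UserName=demo+user&Password=p%26ss&DEA=AB1234567
        System.out.println(encode(fields));
    }
}
```

Unlike the old GET approach, the credentials travel in the request body rather than in the URL, so they do not end up in server logs or browser history; they are still only protected in transit if the endpoint is HTTPS.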

Cookies with Jsoup

I'm having issues with sending POST data to this site:
https://www.amazon.com/ap/signin?openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&openid.assoc_handle=amzn_mturk_worker&openid.ns.pape=http%3A%2F%2Fspecs.openid.net%2Fextensions%2Fpape%2F1.0&_encoding=UTF8&openid.mode=checkid_setup&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.pape.max_auth_age=43200&marketplaceId=A384XSLT9ODACQ&clientContext=703ea210dfe6fd07defd5ab30ac8d9&openid.return_to=https%3A%2F%2Fwww.mturk.com%2Fmturk%2Fendsignin
I'm using Jsoup. I'm trying to reuse the "session-id" cookie from the login response for the follow-up GET request, but I'm still not logged in. This is my code:
Connection.Response res = Jsoup.connect("https://www.amazon.com/ap/signin?openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&openid.assoc_handle=amzn_mturk_worker&openid.ns.pape=http%3A%2F%2Fspecs.openid.net%2Fextensions%2Fpape%2F1.0&_encoding=UTF8&openid.mode=checkid_setup&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.pape.max_auth_age=43200&marketplaceId=A384XSLT9ODACQ&clientContext=703ea210dfe6fd07defd5ab30ac8d9&openid.return_to=https%3A%2F%2Fwww.mturk.com%2Fmturk%2Fendsignin").data("email", "blah#gmail.com", "password", "blah").method(Connection.Method.POST).execute();
Document doc2 = res.parse();
sessionId = res.cookie("session-id");
Document doc = Jsoup.connect("https://www.mturk.com/mturk/searchbar?selectedSearchType=hitgroups&minReward=0.00&sortType=LastUpdatedTime%3A1&pageSize=50").cookie("SESSIONID", sessionId).get();
Where e-mail and password are real information instead of "blah". I don't know if my issue is how I parse the cookie or send the POST data originally.
Edit: So the site uses OpenID. Not sure if I should make a whole new question, but how would I go about it now? I basically need to login and pull information off the site after login. Here is my post info:
appActionToken:pj2FxGfwLZT6nheliE7BMxwZrTUKEj3D
appAction:SIGNIN
clientContext:ape:NzAzZWEyMTBkZmU2ZmQwN2RlZmQ1YWIzMGFjOGQ5
openid.pape.max_auth_age:ape:NDMyMDA=
openid.return_to:ape:aHR0cHM6Ly93d3cubXR1cmsuY29tL210dXJrL2VuZHNpZ25pbg==
prevRID:ape:S1kyUFNDUkhLVFZSSjRGMjBYUUo=
openid.identity:ape:aHR0cDovL3NwZWNzLm9wZW5pZC5uZXQvYXV0aC8yLjAvaWRlbnRpZmllcl9zZWxlY3Q=
openid.assoc_handle:ape:YW16bl9tdHVya193b3JrZXI=
openid.mode:ape:Y2hlY2tpZF9zZXR1cA==
openid.ns.pape:ape:aHR0cDovL3NwZWNzLm9wZW5pZC5uZXQvZXh0ZW5zaW9ucy9wYXBlLzEuMA==
openid.claimed_id:ape:aHR0cDovL3NwZWNzLm9wZW5pZC5uZXQvYXV0aC8yLjAvaWRlbnRpZmllcl9zZWxlY3Q=
pageId:ape:YW16bl9tdHVya193b3JrZXI=
openid.ns:ape:aHR0cDovL3NwZWNzLm9wZW5pZC5uZXQvYXV0aC8yLjA=
email: -Deleted-
create:0
password: -Deleted-
metadata1:+gLgZV5Fc5cBh44WnOrKTq5ofl6IhGvSbZGHfX7T5PuwmIl0Ep4bclt77iRlLPO1thRNg/9TylDw5H/9UPZnuOcF1OAHaECaWmK9H8pkW0elpz5QgEukM4aP6dPwSliw9Ggy+1/vQCk0MLm2TvkyS8uLslyh2aEw4H7hDmcF6lTgctZVE8B2KENH97L7hp4rcR2NHKMm4tEFdwpmVqv+pmLX5rUBo4p2QNUe3g0dNAifuK3RPXCVSQyQHpUzlBuFZTFK9xspwA2dgcdSZcgQzgzQKik/WEDrn0eP4sAVnO1ZGFUWKFAY55Lzgf6yd6WxCZ15yGTWENf0Km9wnXce+Ev5SMarXPJNQtfqY6tdp5snwFxpB8m/x72AvRgWJACoi5qcyqwO6dxroebIyB9uruApIkUk07AD8bJvhcf92+flN9TY4iXCkIoeSUN5aKp8rJbyhspySgsmQ9guu4964qeQRK0J092/sx1De6VmfGQ3nMrr0+McnC4/wZo2jUhGOr62ow==`

The site you're trying to log in to makes use of JavaScript. Since Jsoup doesn't support JavaScript (Jsoup 1.8.3 as of this writing), I advise you to use a tool that does: ui4j or Selenium. The choice is yours.
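Separately from the JavaScript issue, the code in the question reads the cookie named "session-id" but sends it back under the different name "SESSIONID", and it forwards only that one cookie. A login usually sets several cookies, and they need to go back under their original names. The sketch below shows the idea with plain strings (the class name and the sample cookie values are invented); with Jsoup the same effect comes from passing res.cookies() to .cookies(...) on the next connection. This alone may still not be enough if the login depends on script-generated fields such as metadata1.

```java
import java.util.ArrayList;
import java.util.List;

final class CookieHeader {

    /** Builds a Cookie request header from the Set-Cookie values of a response. */
    static String fromSetCookie(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        for (String value : setCookieValues) {
            // Only "name=value" before the first ';' is sent back;
            // everything after it (Path, Domain, Secure, ...) is an attribute.
            String pair = value.split(";", 2)[0].trim();
            if (pair.contains("=")) {
                pairs.add(pair);
            }
        }
        return String.join("; ", pairs);
    }

    public static void main(String[] args) {
        // Invented sample values in the shape of typical Set-Cookie headers.
        List<String> received = List.of(
                "session-id=123-456; Domain=.amazon.com; Path=/",
                "ubid-main=abc; Secure");
        // prints session-id=123-456; ubid-main=abc
        System.out.println(fromSetCookie(received));
    }
}
```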

web page source downloaded through Jsoup is not equal to the actual web page source

I have a serious problem here. I have searched all through Stack Overflow and many other sites. Everywhere they give the same solution, and I have tried all of them, but I am not able to resolve this issue.
I have the following code:
Document doc = Jsoup.connect(url).timeout(30000).get();
Here I am using the Jsoup library, and the result I get is not equal to the actual page source that we can see by right-clicking the page -> page source. Many parts are missing from the result I get with the above line of code.
After searching some sites on Google, I saw this method:
URL url = new URL(webPage);
URLConnection urlConnection = url.openConnection();
urlConnection.setConnectTimeout(10000);
urlConnection.setReadTimeout(10000);
InputStream is = urlConnection.getInputStream();
InputStreamReader isr = new InputStreamReader(is);
int numCharsRead;
char[] charArray = new char[1024];
StringBuffer sb = new StringBuffer();
while ((numCharsRead = isr.read(charArray)) > 0) {
sb.append(charArray, 0, numCharsRead);
}
String result = sb.toString();
System.out.println(result);
But no luck.
While searching the internet for this problem, I saw many sites saying that I have to set the proper charset and encoding of the web page when downloading its source. But how can I find these out dynamically from my code? Are there any classes in Java for that? I also looked at crawler4j a bit, but it did not do much for me. Please help. I have been stuck on this problem for over a month and have tried everything I can, so my final hope is the gods of Stack Overflow, who have always helped!

I had this recently. I'd run into some sort of robot protection. Change your original line to:
Document doc = Jsoup.connect(url)
        .userAgent("Mozilla/5.0")
        .timeout(30000)
        .get();

The problem might be that your web page is rendered by JavaScript that runs in the browser. Jsoup alone can't help you with this, so you may try HtmlUnit, a headless browser that can also be driven through Selenium, as described in the answer to "using Jsoup to sign in and crawl data".
UPDATE
There are several reasons why the HTML can differ. The most probable is that the web page contains <script> elements with dynamic page logic. This could be an application inside the web page that sends requests to the server and adds or removes content depending on the responses.
Jsoup will never render such pages, because that is a job for a browser like Chrome, Firefox or IE. Jsoup is a lightweight parser for the plain-text HTML you get from the server.
So what you could do is use a web driver that emulates a web browser and renders the page in memory, so that it has the same content as shown to the user. You can even perform mouse clicks with such a driver.
The web driver implementation proposed in the linked answer is HtmlUnit. It is the most lightweight solution; however, it might give you unexpected results (see "Selenium vs HtmlUnit?").
If you want the most realistic page rendering, you might want to consider Selenium WebDriver.

Why do you want to parse a web page this way? If the website offers a consumable service, it might have a REST API.
To answer your question: a web page viewed in a web browser may not be the same as the same page downloaded using a URLConnection.
The following are a few of the reasons for these differences:
1. Request headers: when the client (Java application or browser) requests a URL, it sets various headers as part of the request, and the web server may change the content of the response accordingly.
2. JavaScript: once the response is received, any JavaScript in it is executed by the browser's JavaScript engine, which may change the contents of the DOM.
3. Browser plugins, such as IE Browser Helper Objects, Firefox extensions or Chrome extensions, may change the contents of the DOM.
In simple terms, when you request a URL using a URLConnection you receive raw data, whereas when you request the same URL from a browser's address bar you get a web page that has been processed by JavaScript and browser plugins.
URLConnection and Jsoup let you set request headers as required, but you may still get a different response because of points 2 and 3. Selenium lets you remote-control a browser and has an API to access the rendered page. Selenium is used for automated testing of web applications.
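On the charset part of the question: when a server declares an encoding, it normally does so in the Content-Type response header (for example "text/html; charset=ISO-8859-1"), which URLConnection exposes through getContentType(). The helper below is a minimal sketch for pulling the charset out of that header; the class and method names are invented, and it does not look at <meta> tags inside the page, which some sites use instead.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class ContentTypeCharset {

    /** Returns the charset named in a Content-Type header value, or the fallback. */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Malformed or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // prints ISO-8859-1
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
    }
}
```

In the URLConnection code from the question, this would be used as new InputStreamReader(is, ContentTypeCharset.charsetOf(urlConnection.getContentType(), StandardCharsets.UTF_8)), so the reader no longer relies on the platform default encoding. It will not bring back content that is added by JavaScript.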

In Java, how do I download a page that was redirected?

I am making a web crawler, and some pages redirect to others. How do I get the page that the original page redirected to?
For some sites, like xtema.com.br, I can get the redirect URL using the HttpURLConnection class with the getHeaderField("Location") method, but for others, like visa.com.br, the redirection is done with JavaScript or some other way and this method returns null.
Is there a way to always get the page and the URL resulting from the redirection? The original page without the redirection is not important.
Thanks, and sorry for my bad English.
EDIT: Using httpConn.setInstanceFollowRedirects(true) to follow the redirections and reading the URL with httpConn.getURL() worked, but I have two issues.
1: httpConn.getURL() only returns the actual URL of the redirected page if I call httpConn.getDate() first. If I don't, it returns the original URL from before the redirections.
2: Some sites, like visa.com.br, answer 200, but if I open them in a web browser, I see another page.
E.g.: my program - visa.com.br - answer 200 (no redirections)
web browser - visa.com.br/go/principal.aspx - HTML different from the version I get in my program

Use HttpURLConnection; it follows redirects by default.
If you want to see the redirected URL yourself, you'll have to do:
/* url is assumed to hold the starting address */
HttpURLConnection httpConn = (HttpURLConnection) new URL(url).openConnection();
httpConn.setInstanceFollowRedirects(false);
int responseCode = httpConn.getResponseCode();
while ((responseCode / 100) == 3) { /* codes 3XX are redirections */
    String newLocationHeader = httpConn.getHeaderField("Location");
    /* open a new connection for the URL in newLocationHeader */
    httpConn = (HttpURLConnection) new URL(newLocationHeader).openConnection();
    httpConn.setInstanceFollowRedirects(false);
    responseCode = httpConn.getResponseCode();
    /* repeat until you get a code that is not a redirection */
}
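The loop above leaves a few practical details open: the Location header may be relative to the page that sent it, a 3xx answer can arrive without a Location at all, and two pages can redirect to each other forever. A slightly fuller sketch that handles those cases might look like this. The class name and the hop limit of 10 are arbitrary choices for illustration, not part of any API.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class RedirectFollower {

    static final int MAX_HOPS = 10; // arbitrary guard against redirect loops

    /** Resolves a Location value, which may be relative, against the URL that returned it. */
    static String resolve(String currentUrl, String location) {
        return URI.create(currentUrl).resolve(location).toString();
    }

    /** Follows HTTP 3xx redirects by hand and returns the last URL reached. */
    static String finalUrl(String startUrl) throws IOException {
        String current = startUrl;
        for (int hop = 0; hop < MAX_HOPS; hop++) {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(current).toURL().openConnection();
            conn.setInstanceFollowRedirects(false);
            int code = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (code / 100 != 3 || location == null) {
                return current; // not a redirect, or nothing to follow
            }
            current = resolve(current, location);
        }
        throw new IOException("Too many redirects starting at " + startUrl);
    }

    public static void main(String[] args) {
        // prints http://example.com/go/principal.aspx
        System.out.println(resolve("http://example.com/index.html", "/go/principal.aspx"));
    }
}
```

Following redirects by hand also covers a case the built-in behaviour skips: HttpURLConnection does not automatically follow a redirect that switches between http and https. Redirects done in JavaScript or via a meta tag still come back as a plain 200 here.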

You can't easily detect a JavaScript redirection, and HTTP redirection is handled by default by HttpURLConnection. What you can do is search the page contents for several keywords:
the meta refresh tag
document.location=, window.location= and both with .href=
But this does not guarantee anything. People might be calling JavaScript functions from external js files, in which case you would pretty much need to fetch those resources and parse the JavaScript, which I guess you aren't willing to do.
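The keyword search described above can be sketched with two regular expressions. This is a heuristic only, with the limits already mentioned: it sees redirects written literally in the page, not ones computed in script or hidden in external js files. The class name and patterns are illustrative.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ClientRedirectFinder {

    // Matches e.g. <meta http-equiv="refresh" content="0; url=/somewhere">
    private static final Pattern META_REFRESH = Pattern.compile(
            "<meta[^>]*http-equiv\\s*=\\s*[\"']?refresh[\"']?[^>]*content\\s*=\\s*[\"']"
                    + "[^\"']*?url\\s*=\\s*([^\"'>\\s]+)",
            Pattern.CASE_INSENSITIVE);

    // Matches e.g. window.location = "..." or document.location.href = '...'
    private static final Pattern JS_LOCATION = Pattern.compile(
            "(?:document|window)\\.location(?:\\.href)?\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Returns the first redirect target found in the HTML text, if any. */
    static Optional<String> find(String html) {
        Matcher meta = META_REFRESH.matcher(html);
        if (meta.find()) {
            return Optional.of(meta.group(1));
        }
        Matcher js = JS_LOCATION.matcher(html);
        if (js.find()) {
            return Optional.of(js.group(1));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String page = "<script>window.location = \"http://www.google.com/\"</script>";
        // prints http://www.google.com/
        System.out.println(find(page).orElse("no redirect found"));
    }
}
```

The target may be relative, so it should be resolved against the URL of the page it was found in before being fetched.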

I ended up using Apache's HTTP client. Just another option.

retrieving 'nulls' from website using Java URL input stream

I'm trying to read the text from a website using the Java URL input stream:
URL u = new URL(str);
br3 = new BufferedReader(new InputStreamReader(u.openStream()));
while(true)
System.out.println(br3.readLine());
This seems to work fine for most websites, but for some URL shortening services like LinkBee (e.g. linkbee.com/FUAKF), the reader draws a blank. I can view the source code in a browser, but I repeatedly get nulls when I use the above code.

It's because those sites are just redirection services. How are you handling redirects? (a redirect has a Location: header, but no body)

Use an HTTP library such as Commons HttpClient. It follows redirects automatically, and the getResponseBodyAsStream method gives you the body of the final response.

Barry is correct.
I just wanted to add that for certain websites there also could be javascript that could redirect you to a different page. Something like this:
<script type="text/javascript">
<!--
window.location = "http://www.google.com/"
//-->
</script>
But in your situation it would be the headers redirecting you based on the fact you are getting nulls back. Just thought you might want to watch out for the javascript thing too.

It's true that it is a redirection service. However, I do not need to actually follow the redirection; I only need to extract the URL that it redirects to, which can be found in the source code of the redirection page itself (in the given case, at line 81):
input type='hidden' id='urlholder' value='http://www.megaupload.com/?d=02EBRUTT'
Regardless, I don't think the stream should give me a complete blank, unless it reads only the body and not the headers?
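One thing worth noting about the loop in the question: BufferedReader.readLine() returns null once the end of the stream is reached, so while(true) prints null forever after the last line, and prints nothing but nulls if the response has no body. The sketch below stops at the first null and then pulls the urlholder value out of the text. The class name and regex are illustrative, and the pattern assumes the id attribute comes before value, as in the line quoted above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlHolderExtractor {

    // Assumes id='urlholder' appears before value='...' inside the same tag.
    private static final Pattern URL_HOLDER = Pattern.compile(
            "id\\s*=\\s*['\"]urlholder['\"][^>]*?value\\s*=\\s*['\"]([^'\"]+)['\"]",
            Pattern.CASE_INSENSITIVE);

    /** Reads everything from the source; readLine() returning null ends the loop. */
    static String readAll(Reader source) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    /** Returns the value of the hidden urlholder input, if the page contains one. */
    static Optional<String> extract(String html) {
        Matcher m = URL_HOLDER.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        // Sample line taken from the post above; a real run would wrap u.openStream().
        String sample = "<input type='hidden' id='urlholder' "
                + "value='http://www.megaupload.com/?d=02EBRUTT'>";
        // prints http://www.megaupload.com/?d=02EBRUTT
        System.out.println(extract(readAll(new StringReader(sample))).orElse("not found"));
    }
}
```

If readAll returns an empty string for the shortener URL, the server sent no body for that request, which fits the redirect explanation given in the answers.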

