retrieving 'nulls' from website using Java URL input stream

I'm trying to read the text from a website using the Java URL input stream:
URL u = new URL(str);
BufferedReader br3 = new BufferedReader(new InputStreamReader(u.openStream()));
while (true)
    System.out.println(br3.readLine());
This seems to work fine for most websites, but for some URL shortening services like LinkBee the object draws a blank, e.g. linkbee.com/FUAKF. I can view the source in a browser, but I repeatedly get nulls when I use the above code.

It's because those sites are just redirection services. How are you handling redirects? (A redirect has a Location: header but no body.)

Use an HTTP library like Commons HttpClient; its getResponseBodyAsStream method follows redirects automatically.
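A minimal sketch of that approach, using the legacy Commons HttpClient 3.x API named above and the shortener URL from the question (assuming the library is on the classpath):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

public class FetchShortenedUrl {
    public static void main(String[] args) throws Exception {
        HttpClient client = new HttpClient();
        GetMethod get = new GetMethod("http://linkbee.com/FUAKF");
        get.setFollowRedirects(true); // follow 3xx responses automatically
        client.executeMethod(get);
        BufferedReader in = new BufferedReader(
                new InputStreamReader(get.getResponseBodyAsStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line); // body of the final, post-redirect page
        }
        get.releaseConnection();
    }
}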

Barry is correct.
I just wanted to add that on certain websites there could also be JavaScript that redirects you to a different page, something like this:
<script type="text/javascript">
<!--
window.location = "http://www.google.com/"
//-->
</script>
But in your situation it would be the headers redirecting you, given that you are getting nulls back. Just thought you might want to watch out for the JavaScript case too.

It's true that it is a redirection service; however, I do not need to actually follow the redirect. I merely need to extract the URL it redirects to, which can be found within the source of the redirection page itself (in the given case, at line 81):
<input type='hidden' id='urlholder' value='http://www.megaupload.com/?d=02EBRUTT'/>
Regardless, I don't think the stream should be giving me a complete blank, unless it reads only the body and not the headers?
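A sketch of that idea with plain HttpURLConnection, assuming the target really is embedded in a hidden urlholder input as quoted above (if the service instead answers with a bare 3xx and an empty body, the Location header is the only place the target appears, which would explain the nulls):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ExtractTarget {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://linkbee.com/FUAKF").openConnection();
        conn.setInstanceFollowRedirects(false);
        conn.setRequestProperty("User-Agent", "Mozilla/5.0"); // some services blank Java's default agent
        if (conn.getResponseCode() / 100 == 3) {
            // pure HTTP redirect: no body, the target lives in the header
            System.out.println(conn.getHeaderField("Location"));
        } else {
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains("id='urlholder'")) { // the hidden input quoted above
                    System.out.println(line);
                }
            }
            in.close();
        }
    }
}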

Related

web page source downloaded through Jsoup is not equal to the actual web page source

I have a serious concern here. I have searched all through Stack Overflow and many other sites; everywhere they give the same solution, and I have tried all of those, but I am not able to resolve the issue.
I have the following code:
Document doc = Jsoup.connect(url).timeout(30000).get();
Here I am using the Jsoup library, and the result I get is not equal to the actual page source that you can see by right-clicking the page -> View Page Source. Many parts are missing from the result I get with the above line of code.
After searching some sites on Google, I saw this method:
URL url = new URL(webPage);
URLConnection urlConnection = url.openConnection();
urlConnection.setConnectTimeout(10000);
urlConnection.setReadTimeout(10000);
InputStream is = urlConnection.getInputStream();
InputStreamReader isr = new InputStreamReader(is);
int numCharsRead;
char[] charArray = new char[1024];
StringBuffer sb = new StringBuffer();
while ((numCharsRead = isr.read(charArray)) != -1) {
    sb.append(charArray, 0, numCharsRead);
}
String result = sb.toString();
System.out.println(result);
But no luck.
While I was searching the internet for this problem, many sites said I had to set the proper charset and encoding of the web page while downloading its source. But how do I learn those things dynamically from my code? Is there a class in Java for that? I went through crawler4j as well, but it did not do much for me. Please help, guys; I have been stuck on this problem for over a month now and have tried everything I can think of. So my final hope rests on the gods of Stack Overflow, who have always helped!
I ran into this recently; it was some sort of robot protection. Change your original line to:
Document doc = Jsoup.connect(url)
.userAgent("Mozilla/5.0")
.timeout(30000)
.get();
The problem might be that your web page is rendered by JavaScript running in a browser. Jsoup alone can't help you with that, so you may try HtmlUnit, which emulates a browser (Selenium can also drive it through its HtmlUnitDriver): using Jsoup to sign in and crawl data.
UPDATE
There are several reasons why the HTML is different. The most probable is that the web page contains <script> elements carrying dynamic page logic: an application inside the page that sends requests to the server and adds or removes content depending on the responses.
Jsoup will never render such pages, because that is a job for a browser like Chrome, Firefox, or IE. Jsoup is a lightweight parser for the plain-text HTML you get from the server.
So what you could do is use a web driver, which emulates a web browser and renders the page in memory, so it has the same content as what is shown to the user. You can even perform mouse clicks with such a driver.
The implementation proposed for the web driver in the linked answer is HtmlUnit. It is the most lightweight solution; however, it might give you unexpected results: Selenium vs HtmlUnit?.
If you want the most realistic page rendering, you might want to consider Selenium WebDriver; a minimal sketch follows.
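For illustration, a small WebDriver sketch; the Firefox driver is just one possible choice, and the URL is a placeholder:

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class RenderedSource {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver(); // a real browser: scripts actually run
        driver.get("http://example.com");       // placeholder URL
        String renderedHtml = driver.getPageSource(); // the DOM after JavaScript has executed
        driver.quit();
        System.out.println(renderedHtml);
    }
}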
Why do you want to parse a web page this way? If there is a consumable service available, the website might offer a REST API.
To answer your question: a web page viewed in a browser may not be the same as the page downloaded through a URLConnection.
The following are a few of the reasons for these differences:
1. Request headers: when the client (Java application/browser) makes a request for a URL, it sets various headers as part of the request, and the web server may change the content of the response accordingly.
2. JavaScript: once the response is received, any script elements present in it are executed by the browser's JavaScript engine, which may change the contents of the DOM.
3. Browser plugins, such as IE Browser Helper Objects, Firefox extensions, or Chrome extensions, may change the contents of the DOM.
In simple terms: when you request a URL using a URLConnection you receive raw data, whereas when you request the same URL through a browser's address bar you get the processed (by JavaScript/browser plugins) web page.
URLConnection/Jsoup will let you set request headers as required, but you may still get a different response due to points 2 and 3. Selenium lets you remote-control a browser and has an API to access the rendered page; it is used for automated testing of web applications.
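On the charset part of the original question: one way to pick the encoding up dynamically is to parse the Content-Type response header before wrapping the stream. A sketch, with UTF-8 as an assumed fallback when the header is silent (note a page can also declare its charset in a meta tag, which this does not cover):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class CharsetAwareFetch {
    public static void main(String[] args) throws Exception {
        URLConnection conn = new URL("http://example.com").openConnection();
        String contentType = conn.getContentType(); // e.g. "text/html; charset=UTF-8"
        String charset = "UTF-8";                   // fallback assumption
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                param = param.trim();
                if (param.toLowerCase().startsWith("charset=")) {
                    charset = param.substring("charset=".length());
                }
            }
        }
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), charset));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line);
        }
        in.close();
    }
}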

Authorize.net DPM -- perform server side processing in servlet rather than jsp

I'm currently working with a test account on Authorize.net and am using their Direct Post Method form to submit transactions directly to their gateway without additional server-side processing on my end. My application is a basic JSP web app running on Apache Tomcat 7.
Per the instructions provided on their Java Quick Start Guide I have set up 3 files to: 1) take in user input, 2) relay the response, and 3) process and display output.
Truth be told, I don't really need to display an output to the user. Instead, I would like to thoroughly process the response that Authorize.net sends me. The sample code they provide explicitly accounts for this in the relay_response.jsp file:
String receiptPageUrl = "http://MERCHANT_HOST/order_receipt.jsp";
...
net.authorize.sim.Result result = net.authorize.sim.Result.createResult(apiLoginId,
MD5HashKey, request.getParameterMap());
// perform Java server side processing...
// ...
// build receipt url buffer
StringBuffer receiptUrlBuffer = new StringBuffer(receiptPageUrl);
...
...
document.location = "<%=receiptUrlBuffer.toString()%>";
However, it looks like they want me to perform the processing in the JSP, whereas I would rather do that work on the back end in a Java servlet. I've tried two approaches, neither of which works quite as I want.
Attempt 1) I replaced the 'order_receipt.jsp' URL with a URL to another JSP, which subsequently submits a form to a servlet, passing along all request parameters.
String receiptPageUrl = "http://<my_server's_ip_address>/another.jsp";
The problem with this approach is that in the initial forward from relay_response.jsp all of the parameters are passed via GET and appear in the URL, which I can't allow.
Attempt 2) Rather than forwarding the results to another jsp, I created a form right inside relay_response.jsp and tried to submit the form with the results passed as a request parameter.
<form id='myform' method='post' action="servlet_action" accept-charset='UTF-8'>
<input id='params' type='hidden' name='params' value='<%= paramsMap %>'/>
</form>
<script type="text/javascript">
document.getElementById("myform").submit();
</script>
The problem here is that although the browser displays my relay_response.jsp file, the value of document.location.hostname is test.authorize.net, so it doesn't recognize my action since that resides on my server rather than on authorize.net's server.
Alternatively, I have tried setting the action on the form to be the full url of my server and servlet action:
<form id='myform' method='post' action="http://<my_server's_ip_address>/webapp/servlet_action" accept-charset='UTF-8'>
But I get a warning (at least in Firefox) saying that the data is not being transmitted over a secure connection: "Although this page is encrypted, the information you have entered is to be sent over an unencrypted connection and could easily be read by a third party."
How can I pass the results of the transaction from relay_response.jsp to my Java servlet without exposing the parameters being passed to the user? Should I be using https? And why is document.location.host pointing to authorize.net rather than my relay_response.jsp?
Thanks!
A friend suggested 2 solutions for the initial question I posted, one of which I have verified.
Solution 1:
Simply point the initial form at the servlet rather than at relay_response.jsp. The servlet can then redirect to another JSP as appropriate. I have verified that this works with Authorize.net DPM.
Solution 2:
Inside the scriptlet in relay_response.jsp, make a call to a Java class that actually handles the logic. You don't have to expose or write any Java code inside the scriptlet, but rather just invoke the class and call a few methods. You can pass the response parameter map as the argument to the method. I suppose the class you invoke could even be a proper servlet, though mixing these up might not be good form.
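A rough sketch of Solution 1. The servlet name, credential fields, and receipt page are hypothetical; only the Result.createResult call comes from the Authorize.net sample above:

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical relay servlet: point the DPM relay-response URL here
// instead of at relay_response.jsp.
public class RelayResponseServlet extends HttpServlet {
    private final String apiLoginId = "...";  // your gateway credentials
    private final String md5HashKey = "...";

    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        net.authorize.sim.Result result = net.authorize.sim.Result.createResult(
                apiLoginId, md5HashKey, request.getParameterMap());
        // ... thorough server-side processing of the transaction result ...
        response.sendRedirect("order_receipt.jsp"); // then send the user on to a receipt page
    }
}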

HtmlUnit: Request website from server in a specific language

I am looking for a clean/simple way in HtmlUnit to request a webpage from a server in a specific language.
To do this, I have been trying to request the bankofamerica.com homepage in Spanish instead of English.
This is what I have done so far:
I tried setting the "Accept-Language" header to "es" on the HTTP request. I did this using:
myWebClient.addRequestHeader("Accept-Language" , "es");
That did not work. I then created a web request with the following code:
URL myUrl = new URL("https://www.bankofamerica.com/");
WebRequest myRequest = new WebRequest(myUrl);
myRequest.setAdditionalHeader("Accept-Language", "es");
HtmlPage aPage = myWebClient.getPage(myRequest);
Since this failed too, I printed out the request object for this URL to check whether the headers were being set:
[<url="https://www.bankofamerica.com/", GET, EncodingType[name=application/x-www-form-urlencoded], [], {Accept-Language=es, Accept-Encoding=gzip, deflate, Accept=*/*}, null>]
So the server is being asked for a Spanish page, but in response it sends the homepage in English (the response header has Content-Language set to en-US).
I did find a hack to retrieve the BOA page in Spanish: I visited the page, used the Chrome developer tools to grab the cookie value from the request header, and used that value to do the following:
myRequest.setAdditionalHeader("Cookie", "TLTSID= ........._LOCALE_COOKIE=es-US; CONTEXT=es_US; INTL_LANG=es_US; LANG_COOKIE=es_US; hp_pf_anon=anon=((ct=+||st=+||fn=+||zc=+||lang=es_US));..........1870903; throttle_value=43");
I am guessing the answer lies somewhere here.
Which leads to my next question: if I am writing a script to retrieve 100 different websites in Spanish (assuming they all have Spanish versions of their pages), is there a clean way in HtmlUnit to accomplish this?
(If cookies are indeed the solution, then to create them in HtmlUnit you need to specify the domain name, so one would have to create a cookie for each of the 100 sites. As far as I know, there is no way in HtmlUnit to do something like:
Cookie langCookie = new Cookie("All Domains","LANG_COOKIE","es_US");
myWebClient.getCookieManager().addCookie(langCookie);)
NOTE: I am using HtmlUnit 2.12 and setting BrowserVersion.CHROME in the WebClient.
Thanks.
Regarding your first concern: the clear/simple (/only?) way of requesting a web page in a particular language is, as you said, to set the HTTP Accept-Language request header to the locale(s) you want. That is it.
Now, the fact that you request a page in a particular language doesn't mean you will actually get a page in that language. The server has to be set up to process that HTTP header and respond accordingly. Even if a site has a whole section in Spanish, that doesn't mean the site responds to the header.
A clear example of this is the page you provided. I performed a quick test on it and found that it is clearly not responding to the Accept-Language I set (which was es): hitting the home page with es still returned English. However, the page has a link labeled En Español (In Spanish), and clicking it does switch the page to Spanish, redirecting you to https://www.bankofamerica.com?request_locale=es_US.
So you might be tempted to think that the page handles the locale via a request parameter. That is not (only) the case, because if you then open the home page again (without the locale parameter) you will still see the Spanish version. That is clear proof that the choice is stored somewhere else, most likely in the session, which will most likely be handled by cookies.
That can easily be confirmed by opening a private session or clearing the cookies and observing the behaviour (I've just done that).
I think that explains the mystery of the web page existing in Spanish but being fetched in English. (Note how most bank web pages do not conform to basic standards such as responding to simple HTTP requests... and they are handling our money!)
Regarding your second question, it is like asking: what is the recipe for never getting ill? It just doesn't depend only on you. Also note that your first concern used the word request while your second used the word retrieve. It should be clear by now that you can only be 100% sure of what you request, not of what you retrieve.
Regarding setting a cookie value manually: that is technically possible. However, it is just like adding another parameter to a GET request, e.g. http://domain.com?login=yes; the parameter is only processed by the server if the server expects it, and is otherwise ignored. That is what will happen to the value in your cookie.
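That said, if a particular site is known to key its language off a cookie (as bankofamerica.com appears to, per the question), you could seed one cookie per domain. A sketch; the cookie name and value are guesses that would have to be discovered for each site, exactly as the LANG_COOKIE value was above:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.util.Cookie;

public class LanguageCookies {
    // Seed a per-domain language cookie; only works where the server expects it.
    static void preferSpanish(WebClient client, String... domains) {
        for (String domain : domains) {
            client.getCookieManager().addCookie(new Cookie(domain, "LANG_COOKIE", "es_US"));
        }
    }
}

Usage would be something like preferSpanish(myWebClient, "www.bankofamerica.com") before fetching the page.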
Summary: there are standards to follow. You can try to use them, but if the party on the other side doesn't, you won't get the results you expect. Your best choice: do your best and follow the standards.

In Java, how do I download a page that was redirected?

I am making a web crawler, and some pages redirect to others. How do I get the page that the original page redirected to?
On some sites, like xtema.com.br, I can get the redirect URL using the HttpURLConnection class with the getHeaderField("Location") method, but on others, like visa.com.br, the redirect is done with JavaScript or some other mechanism, and that method returns null.
Is there some way to always get the resulting page and URL after redirection? The original page before the redirect is not important.
Thanks, and sorry for the bad English.
EDIT: Using httpConn.setInstanceFollowRedirects(true) to follow the redirects and returning the URL with httpConn.getURL() worked, but I have two issues:
1: httpConn.getURL() only returns the actual URL of the redirected page if I call httpConn.getDate() first. If I don't, it returns the original URL from before the redirections.
2: Some sites, like visa.com.br, answer 200, but if I open them in a web browser I see a different page.
E.g.: my program: visa.com.br -> answer 200 (no redirections)
web browser: visa.com.br/go/principal.aspx -> HTML different from the version my program gets
Use HttpURLConnection; it follows redirects by default.
In case you want to see the redirected URL yourself, you'll have to do:
HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();
httpConn.setInstanceFollowRedirects(false);
httpConn.connect();
int responseCode = httpConn.getResponseCode();
while (responseCode / 100 == 3) { // 3xx codes are redirections
    String newLocationHeader = httpConn.getHeaderField("Location");
    // open a new connection for the URL in the Location header
    // (resolve it against the current URL in case it is relative)
    URL newUrl = new URL(httpConn.getURL(), newLocationHeader);
    httpConn = (HttpURLConnection) newUrl.openConnection();
    httpConn.setInstanceFollowRedirects(false);
    responseCode = httpConn.getResponseCode();
    // loop until we get a code that is not a redirection
}
You can't easily get JavaScript redirects. HTTP redirects are handled by default by HttpURLConnection. What you can do is search the page contents for several patterns:
the meta refresh tag
document.location=, window.location= and both with .href=
But this guarantees nothing: people might call JavaScript functions from external .js files, and then you would pretty much need to fetch those resources and parse the JavaScript, which I guess you aren't willing to do. A rough heuristic is sketched below.
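A rough heuristic along those lines; the regexes are illustrative and deliberately loose, and they will miss redirects assembled in external scripts:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClientRedirectSniffer {
    // meta refresh: <meta http-equiv="refresh" content="0; url=http://...">
    private static final Pattern META_REFRESH = Pattern.compile(
            "<meta[^>]*http-equiv=[\"']?refresh[^>]*url=([^\"'>]+)",
            Pattern.CASE_INSENSITIVE);
    // document.location / window.location, with or without .href
    private static final Pattern JS_LOCATION = Pattern.compile(
            "(?:document|window)\\.location(?:\\.href)?\\s*=\\s*[\"']([^\"']+)[\"']");

    static String findClientRedirect(String html) {
        for (Pattern p : new Pattern[] { META_REFRESH, JS_LOCATION }) {
            Matcher m = p.matcher(html);
            if (m.find()) {
                return m.group(1); // candidate target URL, unverified
            }
        }
        return null;
    }
}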
I ended up using Apache's HTTP client. Just another option.

Redirect to servlet fails

I have a servlet named EditPhotos which, believe it or not, is used for editing the photos associated with a certain item on a website I am developing. The URL path to edit a photo is [[SITEROOT]]/EditPhotos/[[ITEMNAME]].
When you go to this path (GET), the page loads fine. You can then click on a 'delete' link that POSTs to the same page, telling it to delete the photo. The servlet receives this delete command properly and successfully deletes the photo. It then sends a redirect back to the first page (GET).
For some reason, this redirect fails. I don't know how or why, but using the HttpFox plugin for Firefox I can see that the POST request receives 0 bytes in response and ends with the code NS_BINDING_ABORTED.
The code I am using to send the redirect is the same code I have used throughout the website:
response.sendRedirect(Constants.SITE_ROOT + "EditPhotos/" + itemURL);
I have checked the final URL that the redirect sends, and it is definitely correct, but the browser never receives the redirect. Why?
Read the server logs. Do you see IllegalStateException: response already committed with sendRedirect() in the trace?
If so, then the redirect failed because the response headers had already been sent. Ensure that you aren't touching the HttpServletResponse at all before calling sendRedirect(). A redirect basically consists of a Location response header with the new URL as its value.
If not, then you're probably firing the request using JavaScript, which in turn failed to handle the new location.
If neither is the case, or you still cannot figure it out, then we'd be interested in the smallest possible copy'n'pasteable code snippet that reproduces exactly this problem. Update your question to include it.
Update: as per the comments, the culprit is indeed in the JavaScript. A redirect on an XMLHttpRequest POST isn't going to work. Are you using homegrown XMLHttpRequest functions or a library around them such as jQuery? If jQuery, please read this question carefully. It boils down to this: you need to return a specific response and then let JS/jQuery set the new window.location itself.
Turns out that it was the JavaScript I was using to send the POST that was the problem.
I originally had this:
Delete
And everything got fixed when I changed it to this:
Delete
The deletePhoto function is:
function deletePhoto(photoID) {
doPost(document.URL, {'action':'delete', 'id':photoID});
}
function doPost(path, params) {
var form = document.createElement("form");
form.setAttribute("method", "POST");
form.setAttribute("action", path);
for(var key in params) {
var hiddenField = document.createElement("input");
hiddenField.setAttribute("type", "hidden");
hiddenField.setAttribute("name", key);
hiddenField.setAttribute("value", params[key]);
form.appendChild(hiddenField);
}
document.body.appendChild(form);
form.submit();
}
