I'm using Java and trying to get the content of a website so that I can analyze the text on the page. However, every time I "GET" a response from the server, it is the login page rather than the page I am looking at.
I am logged into the website in all my browsers, but my application cannot see the page as if it were me.
As a workaround I also tried the Yandex API (http://api.yandex.com/rca/), but when I call the page through Yandex (which would fetch its content for me), I again only see information based on the returned login page.
Can anyone give me a direction to investigate? I would like to be able to get one item from a page on a website that I work for, but it doesn't seem possible.
String m_strSeedUrlPath = "http://myUrl.com/mypage.html"; // not https
// call the Yandex RCA endpoint, passing the target page as the url parameter
URLConnection connection = new URL("http://rca.yandex.com/?key={MyActualKeyNotThisText}&url=" + m_strSeedUrlPath).openConnection();
connection.setRequestProperty("Accept-Charset", "UTF-8");
InputStream response = connection.getInputStream();
// IOUtils comes from Apache Commons IO
StringWriter writer = new StringWriter();
IOUtils.copy(response, writer, "UTF-8");
String strString = writer.toString();
System.out.println(strString);
The URLConnection object will connect to the page, but in a different session. You would have to programmatically log in from your Java code.
Create a URLConnection object to the login page, POST the user name and password, read the response from the URLConnection's InputStream, and finally create a new connection to the page you wish to analyze. You'd also have to carry the cookies over in order to view the second page.
Hope this helps!
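A minimal sketch of that flow, assuming the site uses a plain HTML login form (the login URL and the field names "username" and "password" are placeholders; inspect the site's actual form for the real ones). Installing a default CookieManager makes HttpURLConnection carry the session cookie to the second request:
import java.io.InputStream;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

CookieHandler.setDefault(new CookieManager()); // store cookies across requests

String user = "me", pass = "secret"; // hypothetical credentials
HttpURLConnection login = (HttpURLConnection) new URL("http://myUrl.com/login").openConnection();
login.setRequestMethod("POST");
login.setDoOutput(true); // we are going to write a request body
String form = "username=" + URLEncoder.encode(user, "UTF-8")
        + "&password=" + URLEncoder.encode(pass, "UTF-8");
login.getOutputStream().write(form.getBytes("UTF-8"));
login.getInputStream().close(); // consume the response so the session cookie is recorded

// the cookie set during login is now sent automatically
InputStream page = new URL("http://myUrl.com/mypage.html").openConnection().getInputStream();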
The URL you are trying to access is restricted behind a login. Even if you are logged in via your browser, you won't be able to access the page from your Java application, because the browser holds an authenticated session with the target website. That session is not visible to your Java application.
You would have to research how to log in to the website programmatically and then get the page content.
Related
I am working on a crawling project. When I make a simple URLConnection to the website, as shown below:
URLConnection conn = new URL(url).openConnection();
BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
The method returns the HTML body correctly. However, the website makes inner requests for some fields. For example, it fetches the total number of users from a different web service. In a web browser the total number of users appears after some time, but the URLConnection method does not wait for it, and the returned HTML does not contain that field.
In Java, is there any way to wait a while so that all the data is fetched from the website when using URLConnection?
From your "inner requests" comment it sounds like the website is using JavaScript (via a framework or just using native browser APIs) to fetch data and render these results into the DOM. This is very common nowadays with SPAs etc.
If that's the case, no amount of waiting will change the outcome from using a simple HTTP library like URLConnection - but you can check this by saving the HTML locally and viewing it in your browser - what happens? When you examine it, is there JavaScript on that page?
To do this properly in code, you'll need something capable of behaving more like a browser, and executing that JS referenced by the HTML in a DOM-like environment. Try Selenium with PhantomJS or headless Chrome / Firefox, or maybe GhostDriver.
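For example, a minimal sketch with Selenium and headless Chrome (assumes the selenium-java dependency and a matching chromedriver binary on your PATH; the URL is a placeholder):
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

ChromeOptions options = new ChromeOptions();
options.addArguments("--headless");
WebDriver driver = new ChromeDriver(options);
try {
    driver.get("http://example.com");
    // unlike URLConnection, this returns the DOM after scripts have run;
    // for slow pages, add an explicit wait for the element you need
    System.out.println(driver.getPageSource());
} finally {
    driver.quit();
}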
Normally, if you are getting the HTML body of the page, all calls made on the server side of the website must have been completed.
If the website does not rely on JavaScript, then use the Jsoup library (https://jsoup.org) for Java. It fetches the page and parses the HTML into a document you can query; note that it does not execute JavaScript, so it only sees content present in the server's response.
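A minimal Jsoup sketch for that static-HTML case (the URL is a placeholder):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = Jsoup.connect("http://example.com").get(); // fetch and parse in one step
String title = doc.title();
String bodyText = doc.body().text(); // plain text of the page body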
I am tasked with checking whether some URLs are working correctly. I'm using Java to make an HTTP GET request and read the response code.
So what I did was this.
URL u = new URL("some URL");
HttpURLConnection huc = (HttpURLConnection) u.openConnection();
huc.setRequestMethod("GET");
huc.connect();
int code = huc.getResponseCode();
System.out.println(code + " " + huc.getURL());
The problem: some sites require you to log in to access the page, but the page doesn't return a 401 code; it returns 200. Note that the web page doesn't show up until a username and password are provided; it asks for authentication in a pop-up window.
So how do I catch these kinds of links?
Also, how can I identify whether a webpage shows a login page, like http://www.example.com/login/? Is it sufficient to just check the URL for the word “login”?
There's no universal way to deal with this. You have to know how the site you're checking does authentication: a 401? A separate login page? Multi-factor auth (e.g. an RSA token)? Checking for the substring "login" in the URL can handle some sites, but it is not enough for a general solution.
For example, a 401 will only happen with basic authentication (or when trying to access protected resources directly). There are a lot of other ways to do auth.
John sums up the issue quite well in his comment:
If you have to deal with pages that roll their own custom authentication, then it follows that you probably have to write your own custom code to accommodate them. Depending on how the relevant sites work, you might be able to bypass authentication by sending an appropriate cookie in your request, as if you had already authenticated, or by some similar means.
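A rough sketch of such a heuristic check, combining the 401/WWW-Authenticate test for HTTP-level auth with a scan for a password field in the body (both tests are assumptions about how the target sites behave, not a general solution):
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Scanner;

static boolean looksLoginProtected(String urlString) throws Exception {
    HttpURLConnection huc = (HttpURLConnection) new URL(urlString).openConnection();
    huc.setRequestMethod("GET");
    // a 401 or a WWW-Authenticate header signals HTTP-level authentication
    if (huc.getResponseCode() == 401 || huc.getHeaderField("WWW-Authenticate") != null) {
        return true;
    }
    // a 200 whose body contains a password input is likely a custom login page
    try (Scanner s = new Scanner(huc.getInputStream(), "UTF-8").useDelimiter("\\A")) {
        String body = s.hasNext() ? s.next() : "";
        return body.toLowerCase().contains("type=\"password\"");
    }
}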
I have a serious concern here. I have searched all through Stack Overflow and many other sites; everywhere they give the same solution, and I have tried all of those, but I am not able to resolve this issue.
I have the following code:
Document doc = Jsoup.connect(url).timeout(30000).get();
Here I am using the Jsoup library, and the result I am getting is not equal to the actual page source that you can see by right-clicking the page and choosing "View page source". Many parts are missing in the result I get with the above line of code.
After searching some sites on Google, I saw this method:
URL url = new URL(webPage);
URLConnection urlConnection = url.openConnection();
urlConnection.setConnectTimeout(10000);
urlConnection.setReadTimeout(10000);
InputStream is = urlConnection.getInputStream();
// note: without an explicit charset this uses the platform default encoding
InputStreamReader isr = new InputStreamReader(is);
int numCharsRead;
char[] charArray = new char[1024];
StringBuilder sb = new StringBuilder();
while ((numCharsRead = isr.read(charArray)) > 0) {
    sb.append(charArray, 0, numCharsRead);
}
String result = sb.toString();
System.out.println(result);
But no luck.
While searching the internet for this problem, I saw many sites that said I had to set the proper charset and encoding types of the webpage while downloading its page source. But how will I get to know these things dynamically from my code? Are there any classes in Java for that? I also went through crawler4j a bit, but it did not do much for me. Please help, guys. I've been stuck with this problem for over a month now and have tried everything I can. So my final hope is the gods of Stack Overflow, who have always helped!
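On the charset sub-question: the server usually advertises the encoding in the Content-Type response header. A minimal sketch that reads it dynamically, falling back to UTF-8 (pages can also declare a charset in a <meta> tag, which this does not handle):
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;

URLConnection conn = new URL(webPage).openConnection();
String contentType = conn.getContentType(); // e.g. "text/html; charset=UTF-8"
Charset charset = Charset.forName("UTF-8"); // fallback when none is declared
if (contentType != null) {
    for (String param : contentType.split(";")) {
        param = param.trim();
        if (param.toLowerCase().startsWith("charset=")) {
            charset = Charset.forName(param.substring("charset=".length()));
        }
    }
}
InputStreamReader isr = new InputStreamReader(conn.getInputStream(), charset);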
I had this recently. I'd run into some sort of robot protection. Change your original line to:
Document doc = Jsoup.connect(url)
.userAgent("Mozilla/5.0")
.timeout(30000)
.get();
The problem might be that your web page is rendered by JavaScript, which runs in a browser. Jsoup alone can't help you with this, so you may try HtmlUnit, which emulates a browser (and can be driven through Selenium's HtmlUnitDriver): using Jsoup to sign in and crawl data.
UPDATE
There are several reasons why the HTML differs. The most probable is that the page contains <script> elements with dynamic page logic. This could be an application inside your web page which sends requests to the server and adds or removes content depending on the responses.
Jsoup would never render such pages, because that is a job for a browser like Chrome, Firefox or IE. Jsoup is a lightweight parser for the plain HTML you get from the server.
So what you could do is use a web driver, which emulates a web browser and renders the page in memory, so it has the same content as shown to the user. You can even perform mouse clicks with this driver.
And the web driver implementation proposed in the linked answer is HtmlUnit. It's the most lightweight solution; however, it might give you unexpected results: Selenium vs HtmlUnit?.
If you want the most real page rendering, you might want to consider Selenium WebDriver.
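A minimal HtmlUnit sketch of that approach (assumes a recent 2.x htmlunit dependency; the URL and the wait time are placeholders):
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

try (WebClient webClient = new WebClient()) {
    webClient.getOptions().setJavaScriptEnabled(true);
    HtmlPage page = webClient.getPage("http://example.com");
    webClient.waitForBackgroundJavaScript(5000); // give async requests time to finish
    System.out.println(page.asXml()); // the DOM after scripts have run
}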
Why do you want to parse a web page this way? If there is a consumable service available from the website, it might expose a REST API.
To answer your question: a webpage viewed in a web browser may not be the same as the same webpage downloaded through a URLConnection.
The following are a few of the reasons that cause these differences:
Request headers: when the client (Java application/browser) makes a request for a URL, it sets various headers as part of the request, and the web server may change the content of the response accordingly (a sketch follows below).
JavaScript: once the response is received, any script elements present in it are executed by the browser's JavaScript engine, which may change the contents of the DOM.
Browser plugins, such as IE Browser Helper Objects, Firefox extensions or Chrome extensions, may change the contents of the DOM.
In simple terms, when you request a URL using a URLConnection you receive raw data, whereas when you request the same URL from a browser's address bar you get a processed (by JavaScript/browser plugins) webpage.
URLConnection/Jsoup will let you set request headers as required, but you may still get a different response due to points 2 and 3. Selenium allows you to remote-control a browser and has an API to access the rendered page; it is typically used for automated testing of web applications.
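For point 1, a minimal sketch of setting request headers on a URLConnection (the header values here are only illustrative):
import java.net.URL;
import java.net.URLConnection;

URLConnection conn = new URL(url).openConnection();
conn.setRequestProperty("User-Agent", "Mozilla/5.0"); // present yourself as a browser
conn.setRequestProperty("Accept-Language", "en-US,en;q=0.5");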
I created an application that parses content from secured areas of a webpage after account information is entered. It used the GET method for logging in, so it was quite simple: I just used the URL to log in.
Now the site has changed to the POST method, and I wonder how to log in. The login form uses two input tags of type text, named 'login' and 'pass'.
Are you using something like Apache Commons HttpClient? You can use its POST method.
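A minimal sketch with the Commons HttpClient 3.x API (class and method names assume that version; the login URL is a placeholder):
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.PostMethod;

HttpClient client = new HttpClient();
PostMethod post = new PostMethod("http://example.com/login");
post.addParameter("login", user); // field names from the site's form
post.addParameter("pass", pass);
int status = client.executeMethod(post);
String body = post.getResponseBodyAsString();
post.releaseConnection();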
Have you tried setting the request method?
HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // the constructor is protected, so cast openConnection()
conn.setRequestMethod("POST");
conn.setDoOutput(true); // required before writing a request body
OutputStream os = conn.getOutputStream();
os.write(...);
Object content = conn.getContent();
I have a simple web page with an embedded Java applet. The applet makes HTTP calls to different Axis cameras, which all share the same authentication (e.g. username, password).
I am passing the user name and password to the Java code upon launch of the applet - no problem.
When I run from within NetBeans with the applet viewer, I get full access to the cameras and see streaming video - exactly as advertised.
The problem begins when I open the HTML page in a web browser (Firefox).
Even though my code handles authentication:
URL u = new URL(useMJPGStream ? mjpgURL : jpgURL);
huc = (HttpURLConnection) u.openConnection();
String base64authorization =
    securityMan.getAlias(this.securityAlias).getBase64authorization();
// if authorization is required, set up the connection with the encoded
// authorization information
if (base64authorization != null) {
    huc.setDoInput(true);
    huc.setRequestProperty("Authorization", base64authorization);
    huc.connect();
}
InputStream is = huc.getInputStream();
connected = true;
BufferedInputStream bis = new BufferedInputStream(is);
dis = new DataInputStream(bis);
The browser still brings up an authentication pop-up and requests the username and password for each camera separately!
To make things worse, the images displayed from the camera are frozen and old (from last night).
How can I bypass the browser's authentication?
Fixed
I added the following lines:
huc.setDoOutput(true);
huc.setUseCaches(false);
after the
huc.setDoInput(true);
line.
When running in the browser, base64authorization is not null, correct?
I'm not really sure what getBase64authorization is supposed to return, but I'm fairly certain that when you call huc.setRequestProperty("Authorization", **authorization value**) it's looking for an HTTP Basic authentication value. Meaning **authorization value** needs to be in the format Basic **base64 encoding of username:password**, as described here.
Maybe you just need to prepend the Basic (note the trailing space) string to your property value.
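A minimal sketch of building that header value (assumes Java 8+ for java.util.Base64; username and password are placeholders):
import java.nio.charset.StandardCharsets;
import java.util.Base64;

String credentials = username + ":" + password; // e.g. "admin:secret"
String base64authorization = "Basic "
        + Base64.getEncoder().encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
huc.setRequestProperty("Authorization", base64authorization);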