I am working on a crawling project. When I make a simple URLConnection request to the website, as shown below:
URLConnection conn = new URL(url).openConnection();
BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
The method returns the HTML body correctly. However, the website makes inner requests for some fields; for example, it fetches the total number of users from a different web service. In a web browser, the total number of users appears after some time, but the URLConnection method does not wait for it, and the returned HTML does not contain that field.
In Java, is there any way to wait for a while so that URLConnection fetches all the data from the website?
From your "inner requests" comment it sounds like the website is using JavaScript (via a framework or just using native browser APIs) to fetch data and render these results into the DOM. This is very common nowadays with SPAs etc.
If that's the case, no amount of waiting will change the outcome when using a simple HTTP library like URLConnection. You can check this by saving the HTML locally and viewing it in your browser: what happens? When you examine it, is there JavaScript on the page?
To do this properly in code, you'll need something capable of behaving more like a browser, executing the JavaScript referenced by the HTML in a DOM-like environment. Try Selenium with headless Chrome or Firefox, or with PhantomJS via GhostDriver.
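For illustration, here is a minimal sketch of the headless-Chrome route with Selenium WebDriver. It assumes a Selenium dependency and a matching chromedriver on the PATH, and the URL is just a placeholder:
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class RenderedPageFetcher {
    public static void main(String[] args) {
        // "--headless" runs Chrome without a visible window
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://example.com"); // placeholder URL
            // getPageSource() returns the DOM after the page's JavaScript has run,
            // so fields populated by inner requests should be present
            String renderedHtml = driver.getPageSource();
            System.out.println(renderedHtml);
        } finally {
            driver.quit();
        }
    }
}
If the field you need arrives late, an explicit WebDriverWait on that specific element is more reliable than a fixed sleep.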
Normally, if you are getting the HTML body of the page, all calls made on the server side of this website must have been completed.
If the website does not rely on JavaScript for its content, then use the Jsoup (https://jsoup.org) library for Java: it fetches the page and parses the HTML into a document you can query.
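For example, a minimal Jsoup fetch-and-parse sketch; the URL and the ".total-users" CSS selector are placeholders, not from the original question:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupFetch {
    public static void main(String[] args) throws Exception {
        // Fetch the page and parse it into a queryable DOM
        Document doc = Jsoup.connect("https://example.com").get();
        System.out.println(doc.title());
        // select() takes CSS selectors; ".total-users" is a hypothetical class name
        System.out.println(doc.select(".total-users").text());
    }
}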
I've been researching networking in Android and have become familiar with the basic concepts. Recently, I've been working on an app intended for use by my school as a gradebook/assignment display of sorts. My school uses the following online gradebook: https://parents.mtsd.k12.nj.us/. Currently I'm trying to let students enter their login credentials into my app, which will then retrieve their respective information and display it in a GridView. However, I have never attempted to log in to a website from an Android app before parsing data (I have gone through several similar questions, but I am still a little confused).
Any help is appreciated!
[edit] After logging in I will use import.io to display the gradebook data; the website does not have an API.
If I understand you correctly, you need to:
1. Connect to the site using HttpURLConnection with the user's credentials (do a POST request to the login page).
2. Obtain the cookies from the response (my favorite way is to parse and copy the appropriate header field).
3. Use the obtained cookies with HttpURLConnection.setRequestProperty("Cookie", obtainedCookies) and make another request (to the user data page).
4. Get the input stream.
5. Parse the stream (to obtain an HTML Document you can use Jsoup, e.g. Document doc = Jsoup.parse(stream, null, "");).
6. Show the data (a sketch of these steps follows below).
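Here is a rough sketch of those steps with HttpURLConnection and Jsoup. It is only an outline under assumptions: the URLs and the 'login'/'pass' field names are placeholders you would replace after inspecting the real login form, and on Android it must run off the UI thread (see the caution below).
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LoginScraper {
    public static void main(String[] args) throws Exception {
        // Step 1: POST the credentials to the login page (URLs and field names are placeholders)
        String loginUrl = "https://example.com/login";
        String dataUrl = "https://example.com/grades";
        String body = "login=" + URLEncoder.encode("user", "UTF-8")
                + "&pass=" + URLEncoder.encode("secret", "UTF-8");
        HttpURLConnection login = (HttpURLConnection) new URL(loginUrl).openConnection();
        login.setRequestMethod("POST");
        login.setDoOutput(true);
        login.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream os = login.getOutputStream()) {
            os.write(body.getBytes("UTF-8"));
        }
        // Step 2: copy the session cookie from the response header
        String cookie = login.getHeaderField("Set-Cookie");
        // Step 3: replay the cookie on the request for the protected page
        HttpURLConnection data = (HttpURLConnection) new URL(dataUrl).openConnection();
        data.setRequestProperty("Cookie", cookie);
        // Steps 4-5: read the stream and parse it with Jsoup (null charset = detect it)
        try (InputStream in = data.getInputStream()) {
            Document doc = Jsoup.parse(in, null, dataUrl);
            // Step 6: show the data, here just the page title
            System.out.println(doc.title());
        }
    }
}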
Guidelines:
- cookie how-to
- most useful example
Caution:
- any HTTP request needs to be made outside the main (UI) thread; you can use an AsyncTask or an IntentService for this.
EDIT, on your questions:
You ask whether to use Jsoup to handle the connection?
My answer is NO!
Jsoup is a PARSER. By using HttpURLConnection you gain full control over the HTTP request, so you can, for example, handle specific exceptions or set request properties that Jsoup is not capable of. From my experience: after a while, start disassembling the libraries, learn the basics from them, and reuse parts of them in your own code!
You could make a WebView with a custom HTML file that has login and password fields, and then have the HTML file run the same script as the website to log the user in.
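A rough sketch of that WebView idea; the asset file name login.html is hypothetical:
import android.app.Activity;
import android.os.Bundle;
import android.webkit.WebView;

public class LoginActivity extends Activity {
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        WebView webView = new WebView(this);
        // The login script needs JavaScript enabled to run
        webView.getSettings().setJavaScriptEnabled(true);
        // login.html is a hypothetical asset containing your custom form and script
        webView.loadUrl("file:///android_asset/login.html");
        setContentView(webView);
    }
}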
I'm using Java and trying to get the content of a website so that I can analyze the text on the page. However, every time I "GET" a response from the server, it is the login page rather than the page I am looking at.
I am logged into the website in all of my browsers, but my application is not able to see the page as if it were me.
I also tried to use an API called "Yandex" (http://api.yandex.com/rca/) as a workaround, but when I call the page through Yandex (which would fetch its content) I only see information based on the returned login page.
Can anyone give me a direction to investigate? I would like to be able to get one item on a page of a website that I work for, but it doesn't seem possible.
String m_strSeedUrlPath = "http://myUrl.com/mypage.html"; // not https
// the key in the query string is a placeholder for the real API key
URLConnection connection = new URL("http://rca.yandex.com/?key={MyActualKeyNotThisText}&url=" + m_strSeedUrlPath).openConnection();
connection.setRequestProperty("Accept-Charset", "UTF-8");
InputStream response = connection.getInputStream();
StringWriter writer = new StringWriter();
IOUtils.copy(response, writer, "UTF-8"); // Apache Commons IO
String strString = writer.toString();
System.out.println(strString);
The URLConnection object will connect to the page, but in a different session. You would have to programmatically log in from your Java code.
Create a URLConnection object for the login page, POST the user name and password, receive the content by getting the InputStream from the URLConnection object, and finally create a new connection to the page you wish to analyze. You'd also have to handle cookies in order to view the second page.
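As one possible shape for this, here is a hedged sketch that lets java.net carry the cookies automatically via CookieManager instead of copying headers by hand; all URLs and field names are placeholders:
import java.io.OutputStream;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class SessionExample {
    public static void main(String[] args) throws Exception {
        // Install a cookie store once; every HttpURLConnection created afterwards
        // sends and receives cookies through it, preserving the login session
        CookieHandler.setDefault(new CookieManager());

        // POST the login form (URL and parameter names are placeholders)
        HttpURLConnection login = (HttpURLConnection)
                new URL("https://example.com/login").openConnection();
        login.setRequestMethod("POST");
        login.setDoOutput(true);
        String form = "username=" + URLEncoder.encode("me", "UTF-8")
                + "&password=" + URLEncoder.encode("secret", "UTF-8");
        try (OutputStream os = login.getOutputStream()) {
            os.write(form.getBytes("UTF-8"));
        }
        login.getResponseCode(); // force the request so the session cookie is stored

        // The second request is now made inside the authenticated session
        HttpURLConnection page = (HttpURLConnection)
                new URL("https://example.com/protected").openConnection();
        System.out.println(page.getResponseCode());
    }
}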
Hope this helps!
The URL that you are trying to access is restricted behind a login. Even if you are logged in via your browser, you won't be able to access the page from your Java application, because the browser has an authenticated session with the target website; that session is not visible to your Java application.
You would have to research how to log in to the website programmatically and then get the page content.
I have a severe concern here. I have searched all through Stack Overflow and many other sites. Everywhere they give the same solution, and I have tried all of them, but I am not able to resolve this issue.
I have the following code:
Document doc = Jsoup.connect(url).timeout(30000).get();
Here I am using the Jsoup library, and the result I get is not equal to the actual page source that you can see by right-clicking on the page and choosing "View Page Source". Many parts are missing from the result returned by the above line of code.
After searching some sites on Google, I saw this method:
URL url = new URL(webPage); // webPage holds the URL string
URLConnection urlConnection = url.openConnection();
urlConnection.setConnectTimeout(10000);
urlConnection.setReadTimeout(10000);
InputStream is = urlConnection.getInputStream();
InputStreamReader isr = new InputStreamReader(is); // note: uses the platform default charset
int numCharsRead;
char[] charArray = new char[1024];
StringBuilder sb = new StringBuilder();
while ((numCharsRead = isr.read(charArray)) != -1) {
    sb.append(charArray, 0, numCharsRead);
}
String result = sb.toString();
System.out.println(result);
But no luck.
While I was searching the internet for this problem, many sites said I had to set the proper charset and encoding type of the webpage while downloading its page source. But how can I determine these things dynamically from my code? Are there any classes in Java for that? I also went through crawler4j, but it did not do much for me. Please help, guys; I have been stuck with this problem for over a month now and have tried everything I can. So my final hope is the gods of Stack Overflow, who have always helped!
I had this recently. I'd run into some sort of robot protection. Change your original line to:
Document doc = Jsoup.connect(url)
.userAgent("Mozilla/5.0")
.timeout(30000)
.get();
The problem might be that your web page is rendered by JavaScript running in a browser. Jsoup alone can't help you with this, so you may try HtmlUnit, a headless browser (which Selenium can also drive), to emulate the browser: using Jsoup to sign in and crawl data.
UPDATE
There are several reasons why the HTML is different. The most probable is that this web page contains <script> elements holding dynamic page logic: effectively an application inside your web page which sends requests to the server and adds or removes content depending on the responses.
JSoup would never render such pages, because that's a job for a browser like Chrome, Firefox or IE. JSoup is a lightweight parser for the plain HTML text you get from the server.
So what you could do is use a web driver which emulates a web browser and renders the page in memory, so that it has the same content as shown to the user. You can even perform mouse clicks with such a driver.
The implementation proposed for the web driver in the linked answer is HtmlUnit. It's the most lightweight solution; however, it might give you unexpected results: Selenium vs HtmlUnit?.
If you want the most real page rendering, you might want to consider Selenium WebDriver.
Why do you want to parse a web page this way? If there is a consumable service available from the website, it might have a REST API.
To answer your question: a webpage viewed in a web browser may not be the same as the same webpage downloaded using a URLConnection.
The following are a few of the reasons for these differences:
1. Request headers: when the client (Java application/browser) makes a request for a URL, it sets various headers as part of the request, and the web server may change the content of the response accordingly.
2. JavaScript: once the response is received, any JavaScript elements present in it are executed by the browser's JavaScript engine, which may change the contents of the DOM.
3. Browser plugins, such as IE Browser Helper Objects, Firefox extensions or Chrome extensions, may change the contents of the DOM.
In simple terms: when you request a URL using a URLConnection you receive the raw data, whereas when you request the same URL from a browser's address bar you get the webpage as processed by JavaScript and browser plugins.
URLConnection/Jsoup will let you set request headers as required, but you may still get a different response due to points 2 and 3. Selenium allows you to remote-control a browser and has an API to access the rendered page; it is widely used for automated testing of web applications.
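To make point 1 concrete, and to show one way to discover the charset dynamically (as the asker wondered), here is a small hedged java.net sketch; the URL is a placeholder and the Content-Type parsing is deliberately simplistic:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class CharsetAwareFetch {
    public static void main(String[] args) throws Exception {
        URLConnection conn = new URL("https://example.com").openConnection(); // placeholder
        // Point 1: the server may vary its response on headers such as User-Agent
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");

        // The charset is often declared in the Content-Type response header,
        // e.g. "text/html; charset=UTF-8"; fall back to UTF-8 when absent
        String contentType = conn.getContentType();
        String charset = "UTF-8";
        if (contentType != null && contentType.toLowerCase().contains("charset=")) {
            charset = contentType.substring(
                    contentType.toLowerCase().indexOf("charset=") + 8).trim();
        }

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}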
I am trying to read a website using the java.net package classes. The site has content, and I can see it manually with the HTML-source utilities in a browser. When I get its response code and try to view the site using Java, it connects successfully but interprets the site as one without content (a 204 code). What is going on, and is it possible to get around this to view the HTML automatically?
Thanks for your responses.
Do you need the URL?
Here is the code:
URL hef = new URL(theWebsite); // theWebsite holds the URL string
BufferedReader kj = null;
// Use a single connection for the status code, message and body;
// each openConnection() call would otherwise issue a separate request
HttpURLConnection conn = (HttpURLConnection) hef.openConnection();
int kjkj = conn.getResponseCode();
System.out.println(kjkj);
String j = conn.getResponseMessage();
System.out.println(j);
try {
    kj = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    // Read each line exactly once; calling readLine() both in the loop
    // condition and in the body would silently drop every other line
    String y;
    while ((y = kj.readLine()) != null) {
        System.out.println(y);
    }
} finally {
    if (kj != null) {
        kj.close();
    }
}
Suggestions:
- Check that when manually accessing the site (with a web browser client) you are effectively getting a 200 return code.
- Make sure that the HTTP request issued by the automated (Java-based) logic is similar/identical to the one sent by an interactive web browser client. In particular, make sure the User-Agent is identical (some sites purposely alter their responses depending on the agent).
- You can use an HTTP debugging proxy such as Fiddler2 to see exactly what is being sent to and received from the server.
- I'm not sure whether the java.net package is robot-aware, but that could be a factor as well (check whether the underlying site has a robots.txt file).
Edit:
Assuming you are using the java.net package's HttpURLConnection class, the "robot" hypothesis doesn't apply.
On the other hand, you'll probably want to use the connection's setRequestProperty() method to prepare the desired HTTP headers for the request (so they match those from the web browser client).
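For instance, a short hedged sketch of that header matching; the header values are examples copied from a typical browser, not required values:
import java.net.HttpURLConnection;
import java.net.URL;

public class BrowserLikeRequest {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://example.com").openConnection(); // placeholder URL
        // Mirror the headers an interactive browser sends
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
        conn.setRequestProperty("Accept", "text/html,application/xhtml+xml");
        conn.setRequestProperty("Accept-Language", "en-US,en;q=0.9");
        System.out.println(conn.getResponseCode()); // compare with what the browser gets
    }
}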
Maybe you can post the relevant portions of your code.
I created an application that parses content from the secured areas of a webpage after account information is entered. It used the GET method for logging in, so it was quite simple; I just used the URL to log in.
Now the site has been changed to the POST method, and I wonder how to log in to it. The login form uses two input tags of type text with the names 'login' and 'pass'.
Are you using something like Apache Commons HttpClient? You can use its POST method.
Have you tried setting the request method?
// HttpURLConnection is abstract; obtain an instance from URL.openConnection()
HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
conn.setRequestMethod("POST");
conn.setDoOutput(true); // required before writing a request body
OutputStream os = conn.getOutputStream();
os.write(...); // write the URL-encoded 'login' and 'pass' fields here
Object content = conn.getContent();