Accessing a dynamic website with HtmlUnit - java

I want to access Instagram pages without using the API. I need to find the number of followers, so it is not simply a source download, since the page is built dynamically.
I found HtmlUnit as a library to simulate the browser, so that the JS gets rendered, and I get back the content I want.
HtmlPage myPage = ((HtmlPage) webClient.getPage("http://www.instagram.com/instagram"));
This call however results in the following exception:
Exception in thread "main" com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException: 403 Forbidden for http://d36xtkk24g8jdx.cloudfront.net/bluebar/3a30db9/scripts/webfont.js
So it can't access that script, but if I'm interpreting this correctly, it's just for font loading, which I don't need. I googled how to tell it to ignore parts of the page, and found this thread.
webClient.setWebConnection(new WebConnectionWrapper(webClient) {
    @Override
    public WebResponse getResponse(final WebRequest request) throws IOException {
        if (request.getUrl().toString().contains("webfont")) {
            System.out.println(request.getUrl().toString());
            return super.getResponse(request);
        } else {
            System.out.println("returning response...");
            return new StringWebResponse("", request.getUrl());
        }
    }
});
With that code, the exception goes away, but the source (or page title, or anything else I've tried) seems to be empty. "returning response..." is printed once.
I'm open to different approaches as well. Ultimately, entire page source in a single string would be good enough for me, but I need the JS to execute.
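For reference, a minimal sketch of another approach (assuming a default WebClient): HtmlUnit can be told not to throw on failing status codes or script errors at all, so a single 403 on a font script doesn't abort the whole page load.
// Sketch: relax HtmlUnit's error handling so one failing resource
// (like the webfont script) doesn't abort the page load.
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getOptions().setThrowExceptionOnScriptError(false);
HtmlPage myPage = (HtmlPage) webClient.getPage("http://www.instagram.com/instagram");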

HtmlUnit with JS is not a good solution, because its JavaScript engine (Mozilla Rhino) does not work for many JS pages and has a lot of problems.
You can use PhantomJS as a WebDriver:
PhantomJs
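A minimal sketch of driving PhantomJS from Java via Selenium (the binary path and the GhostDriver/Selenium dependency setup are assumptions; adjust for your versions):
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
// Sketch: render the page in PhantomJS and read back the JS-built DOM.
System.setProperty("phantomjs.binary.path", "/path/to/phantomjs"); // adjust
WebDriver driver = new PhantomJSDriver();
driver.get("http://www.instagram.com/instagram");
String renderedSource = driver.getPageSource(); // page source after JS has run
System.out.println(renderedSource);
driver.quit();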

Related

Java - Detecting changes to webpage ajax component loaded with htmlUnit

I am using HtmlUnit to load a webpage containing a dynamically updated AJAX component using the following:
WebClient webClient = new WebClient(BrowserVersion.CHROME);
URL url = new URL("https://live.xxx.com/en/ajax/getDetailedQuote/" + instrument);
WebRequest requestSettings = new WebRequest(url, HttpMethod.POST);
HtmlPage redirectPage = webClient.getPage(requestSettings);
This works, and I get the contents of the page at the time of the request.
However, I want to be able to monitor and respond to changes on the page.
I tried the following:
webClient.addWebWindowListener(new WebWindowListener() {
    public void webWindowOpened(WebWindowEvent event) { }
    public void webWindowContentChanged(WebWindowEvent event) {
        System.out.println("Content changed");
    }
    public void webWindowClosed(WebWindowEvent event) { }
});
But I only get "Content changed" when the page is first loaded, and not when it updates.
I fear there is not really a solution with HtmlUnit (at least not out of the box). The webWindowContentChanged hook is only called if the (whole) page inside the window is replaced.
You can try to implement a DomChangeListener and attach it to the page (or maybe the body of the page), as sketched below.
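A minimal sketch of that idea, reusing the page object from the question (which events actually fire depends on how the AJAX code mutates the DOM):
// Sketch: get notified when nodes are added to or removed from the page.
redirectPage.addDomChangeListener(new DomChangeListener() {
    public void nodeAdded(DomChangeEvent event) {
        System.out.println("Node added: " + event.getChangedNode());
    }
    public void nodeDeleted(DomChangeEvent event) {
        System.out.println("Node deleted: " + event.getChangedNode());
    }
});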
If you like to track the ajax requests more on the http level you have the option to intercept the requests (see https://htmlunit.sourceforge.io/faq.html#HowToModifyRequestOrResponse for more details).
If you need more, please open an issue and we can discuss.

How to get final redirect of a URL with HtmlUnit

I have the URL https://www.facebook.com/ads/library/?id=286238429359299 which gets redirected to https://www.facebook.com/ads/library/?active_status=all&ad_type=political_and_issue_ads&country=US&impression_search_field=has_impressions_lifetime&id=286238429359299&view_all_page_id=575939395898200 in the browser.
I'm using the following code:
@Test
public void createWebClient() throws IOException {
    getLogger("com.gargoylesoftware").setLevel(OFF);
    WebClient webClient = new WebClient(CHROME);
    WebClientOptions options = webClient.getOptions();
    options.setJavaScriptEnabled(true);
    options.setRedirectEnabled(true);
    webClient.waitForBackgroundJavaScriptStartingBefore(10000);
    // IMPORTANT: Without the country/language selection cookie the redirection does not work!
    URL s = webClient.getPage("https://www.facebook.com/ads/library/?id=286238429359299").getUrl();
}
The above code doesn't take the redirection into account. Is there something I am missing? I need to get the final URL that the original URL resolves to.
Actually, the URL https://www.facebook.com/ads/library/?id=286238429359299 returns a page with JavaScript. That JavaScript detects the environment of the web browser; for example, it checks whether the current browser is headless and whether the web driver is legitimate. So I think the solution is to analyze the JavaScript, and then you will get the final URL.
I think it never actually resolves to the final URL due to being headless.
Please load the same page in a browser, view the source code and search for "page_uri", and you will see exactly the URI you are looking for.
If you check the HtmlUnit output or print the page
System.out.println(page.asXml());
you will see that "page_uri" contains the originally entered URL.
I suggest using Selenium WebDriver (not headless).
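A minimal sketch of that suggestion (assuming chromedriver is installed; the fixed sleep is illustrative, an explicit wait would be more robust):
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
// Sketch: let a real browser run the JS, then read the final URL.
WebDriver driver = new ChromeDriver();
driver.get("https://www.facebook.com/ads/library/?id=286238429359299");
Thread.sleep(5000); // crude wait for the client-side redirect to settle
String finalUrl = driver.getCurrentUrl();
System.out.println(finalUrl);
driver.quit();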

Web page source downloaded through Jsoup is not equal to the actual web page source

I have a severe concern here. I have searched all through Stack Overflow and many other sites. Everywhere they give the same solution, and I have tried all of those, but I am not able to resolve this issue.
I have the following code:
Document doc = Jsoup.connect(url).timeout(30000).get();
Here I am using the Jsoup library, and the result that I am getting is not equal to the actual page source that we can see by right-clicking on the page -> View page source. Many parts are missing from the result that I get with the above line of code.
After searching some sites on Google, I saw this method:
URL url = new URL(webPage);
URLConnection urlConnection = url.openConnection();
urlConnection.setConnectTimeout(10000);
urlConnection.setReadTimeout(10000);
InputStream is = urlConnection.getInputStream();
InputStreamReader isr = new InputStreamReader(is);
int numCharsRead;
char[] charArray = new char[1024];
StringBuffer sb = new StringBuffer();
while ((numCharsRead = isr.read(charArray)) > 0) {
    sb.append(charArray, 0, numCharsRead);
}
String result = sb.toString();
System.out.println(result);
But no luck.
While I was searching the internet for this problem, I saw many sites that said I had to set the proper charset and encoding type of the webpage while downloading its page source. But how can I find out these things dynamically from my code? Are there any classes in Java for that? I also went through crawler4j, but it did not do much for me. Please help, guys. I have been stuck with this problem for over a month now and have tried everything I can, so my final hope is the gods of Stack Overflow, who have always helped!
I had this recently. I'd run into some sort of robot protection. Change your original line to:
Document doc = Jsoup.connect(url)
        .userAgent("Mozilla/5.0")
        .timeout(30000)
        .get();
The problem might be that your web page is rendered by JavaScript which is run in a browser. JSoup alone can't help you with this, so you may try using HtmlUnit, which emulates the browser (and can also be driven through Selenium): using Jsoup to sign in and crawl data.
UPDATE
There are several reasons why the HTML is different. The most probable is that this web page contains <script> elements which contain dynamic page logic. This could be an application inside your web page which sends requests to the server and adds or removes content depending on the responses.
JSoup would never render such pages because that's a job for a browser like Chrome, Firefox or IE. JSoup is a lightweight parser for the plain-text HTML you get from the server.
So what you could do is use a web driver which emulates a web browser and renders the page in memory, so that it has the same content as shown to the user. You can even do mouse clicks with this driver.
And the proposed implementation for the web driver in the linked answer is HtmlUnit. It's the most lightweight solution; however, it might give you unexpected results: Selenium vs HtmlUnit?.
If you want the most real page rendering, you might want to consider Selenium WebDriver.
Why do you want to parse a web page this way? If there is a consumable service available from the website, it might have a REST API.
To answer your question: a webpage viewed in a web browser may not be the same as the same webpage downloaded through a URLConnection.
The following could be a few of the reasons for these differences:
Request headers: when the client (Java application/browser) makes a request for a URL, it sets various headers as part of the request, and the web server may change the content of the response accordingly (see the sketch below).
JavaScript: once the response is received, any script elements present in it are executed by the browser's JavaScript engine, which may change the contents of the DOM.
Browser plugins, such as IE Browser Helper Objects, Firefox extensions or Chrome extensions, may change the contents of the DOM.
In simple terms, when you request a URL using a URLConnection you receive raw data, whereas when you request the same URL through a browser's address bar you get the webpage as processed (by JavaScript/browser plugins).
URLConnection/JSoup will allow you to set request headers as required, but you may still get a different response due to points 2 and 3. Selenium allows you to remote-control a browser and has an API to access the rendered page. Selenium is used for automated testing of web applications.
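For point 1, a minimal sketch of sending more browser-like request headers with Jsoup (the header values are illustrative):
// Sketch: mimic a browser's request headers so the server serves the
// same HTML it would give a real browser.
Document doc = Jsoup.connect(url)
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
        .header("Accept", "text/html,application/xhtml+xml")
        .header("Accept-Language", "en-US,en;q=0.9")
        .timeout(30000)
        .get();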

Django and Java Applet authorization failed

I'm building a Django app with allauth.
I have a page, with authentication required, where I put a Java applet. This applet does GET requests to other pages (of the same Django project) which return JSON objects.
The applet gets the CSRF token from the parent web page using JSObject.
The problem is that I want to put authentication control on ALL the pages, but I cannot get the sessionid cookie from the applet's parent web page, so the applet cannot do GET (nor POST) requests to obtain (or save) data.
Maybe there is a simple way to achieve this, but I'm a newbie and I haven't found anything.
Ask freely if you need something.
Thank you.
EDIT:
As I wrote below, I found out that the sessionid cookie is marked as HttpOnly, so the problem now is: what is the safest way to allow the applet to do POST and GET requests?
For example, is it possible to create a JS method in the page which GETs the data and passes it down to the applet?
Maybe I can do the POST the same way?
EDIT:
I successfully got the data using a jQuery call from the page. The problem now is that the code throws an InvocationTargetException. I found the location of the problem, but I don't know how to solve it.
Here is the jQuery code:
function getFloor() {
    $.get(
        "{% url ... %}",
        function(data) {
            var output = JSON.stringify(data);
            document.mapGenerator.setFloor(output);
        }
    );
}
And here are the two functions of the applet. The marked line is the origin of the problem.
public void setFloor(String input) {
    Floor[] f = Floor.parse(input);
}

public static Floor[] parse(String input) {
    Gson gson = new Gson(); // <-- the origin of the problem
    Floor[] floors = gson.fromJson(input, Floor[].class);
    return floors;
}
And HERE is the log that comes out on my server, where you can see that the applet tries to load the Gson library from the server (instead of from the applet):
"GET /buildings/generate/com/google/gson/Gson.class HTTP/1.1" 404 4126
Can somebody help me?
You can do something like this in your applet:
String cookies = JSObject.getWindow(this).eval("document.cookie").toString();
This will give you all the cookies for that page, delimited by semicolons.
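A minimal sketch of picking a single cookie out of that string (note: Django's sessionid will not appear here at all when it is flagged HttpOnly, as discovered above; the csrftoken normally does):
// Sketch: split "name1=value1; name2=value2" and look up one cookie.
String cookies = JSObject.getWindow(this).eval("document.cookie").toString();
String csrfToken = null;
for (String pair : cookies.split(";")) {
    String[] kv = pair.trim().split("=", 2);
    if (kv.length == 2 && kv[0].equals("csrftoken")) {
        csrfToken = kv[1];
    }
}
// csrfToken can now be sent as an X-CSRFToken header on the applet's requests.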

Redirect to servlet fails

I have a servlet named EditPhotos which, believe it or not, is used for editing the photos associated with a certain item on a website I am developing. The URL path to edit a photo is [[SITEROOT]]/EditPhotos/[[ITEMNAME]].
When you go to this path (GET), the page loads fine. You can then click on a 'delete' link that POSTs to the same page, telling it to delete the photo. The servlet receives this delete command properly and successfully deletes the photo. It then sends a redirect back to the first page (GET).
For some reason, this redirect fails. I don't know how or why, but using the HTTPFox plugin for Firefox, I see that the POST request receives 0 bytes in response and has the code NS_BINDING_ABORTED.
The code I am using to send the redirect is the same code I have used throughout the website to send redirects:
response.sendRedirect(Constants.SITE_ROOT + "EditPhotos/" + itemURL);
I have checked the final URL that the redirect sends, and it is definitely correct, but the browser never receives the redirect. Why?
Read the server logs. Do you see IllegalStateException: response already committed with the sendRedirect() call in the trace?
If so, then that means that the redirect failed because the response headers have already been sent. Ensure that you aren't touching the HttpServletResponse at all before calling sendRedirect(). A redirect basically consists of a Location response header with the new URL as its value.
If not, then you're probably handling the request using JavaScript which in turn failed to handle the new location.
If neither is the case, or you still cannot figure it out, then we'd be interested in the smallest possible copy'n'pasteable code snippet which reproduces exactly this problem; update your question to include it.
Update: as per the comments, the culprit is indeed in JavaScript. A redirect on an XMLHttpRequest POST isn't going to work. Are you using homegrown XMLHttpRequest functions or a library around it such as jQuery? If jQuery, please read this question carefully. It boils down to this: you need to return a specific response and then let JS/jQuery set the new window.location itself (see the sketch below).
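A minimal sketch of that pattern (the helper and parameter names are illustrative, not the asker's actual code): on an XMLHttpRequest POST, skip sendRedirect() and write the target URL into the response body, then let the client-side JS navigate.
// Servlet side (sketch): return the target URL instead of redirecting.
protected void doPost(HttpServletRequest request, HttpServletResponse response)
        throws ServletException, IOException {
    deletePhoto(request.getParameter("id")); // hypothetical helper
    response.setContentType("text/plain;charset=UTF-8");
    response.getWriter().write(Constants.SITE_ROOT + "EditPhotos/" + request.getParameter("item"));
}
// Client side (sketch, jQuery):
// $.post(document.URL, params, function(url) { window.location = url; });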
Turns out that it was the JavaScript I was using to send the POST that was the problem.
I originally had this:
Delete
And everything got fixed when I changed it to this:
Delete
The deletePhoto function is:
function deletePhoto(photoID) {
    doPost(document.URL, {'action': 'delete', 'id': photoID});
}

function doPost(path, params) {
    var form = document.createElement("form");
    form.setAttribute("method", "POST");
    form.setAttribute("action", path);
    for (var key in params) {
        var hiddenField = document.createElement("input");
        hiddenField.setAttribute("type", "hidden");
        hiddenField.setAttribute("name", key);
        hiddenField.setAttribute("value", params[key]);
        form.appendChild(hiddenField);
    }
    document.body.appendChild(form);
    form.submit();
}
