What is a java.net.URLConnection doing? - java

When I connect() to a URL, is Java fetching the web page as a browser would, just without displaying it?
I'm trying to understand, for example, what happens if I connect to a YouTube video URL. Even though I could not see the page, is the URLConnection loading the page and playing the video as a typical browser would (just without the UI or visual representation of the page)?

It is fetching the raw HTML of the web page, similar to what you get if you open a page in your browser and choose right click -> View Source.
If you connect to a YouTube page, you will get the raw HTML, and within that code there will be a reference (an href), most likely a tag that points to the source of the video.
--Edit
A browser then interprets that HTML into what you see on your screen.
A modern browser also automatically fetches all the resources referenced by that HTML page, as if loading multiple pages simultaneously, and puts them together.

No, the URLConnection represents only that: a connection to a URL. Calling URL.openConnection() gives you the connection object, and connect() opens the connection to the page, but it still needs to be told what to do. It will give you the contents of the page only if you explicitly read them from its input stream; it never renders or executes anything on its own. Connecting to and reading information from a page is a multistep process.
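A minimal sketch of that multistep process (the URL is just a placeholder): open the connection, connect, and then explicitly read the response stream to get the raw HTML.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class FetchPage {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://www.example.com/"); // placeholder URL
        URLConnection conn = url.openConnection();     // step 1: create the connection object
        conn.connect();                                // step 2: actually open the connection

        // Step 3: nothing reaches your program until you consume the stream.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // the raw HTML, as in "View Source"
            }
        }
    }
}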
Please see the Oracle Java documentation about URLConnection. It provides a lot more information about how this class works and how to use it.
http://docs.oracle.com/javase/7/docs/api/java/net/URLConnection.html

Related

Read full content of a web page in Java

I want to crawl the whole content of the following link with a Java program. The first page is no problem, but when I try to crawl the data of the next pages, I get the same source code as for page one. Therefore a simple HTTP GET does not help at all.
This is the link for the page I need to crawl.
The web site has active content that needs to be interpreted and executed by an HTML/CSS/JavaScript rendering engine. I have a simple solution using PhantomJS, but it is cumbersome to run PhantomJS code from Java.
Is there any easier way to read the whole content of the page with Java code? I already searched for a solution, but could not find anything suitable.
Appreciate your help,
kind regards.
Using the Chrome network log (or a similar tool in any other browser) you can identify the XHR request that loads the actual data displayed on the page. I have removed some of the query parameters, but essentially the request looks like this:
GET https://www.blablacar.de/search_xhr?fn=frankfurt&fcc=DE&tn=muenchen&tcc=DE&sort=trip_date&order=asc&limit=10&page=1&user_bridge=0&_=1461181945520
Helpfully, the query parameters look quite easy to understand. The order=asc&limit=10&page=1 part looks like it would be easy to adjust to return your desired results. You could adjust the page parameter to crawl successive pages of data.
The response is JSON, for which there are a ton of libraries available.
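As a rough sketch, and assuming the endpoint still accepts the parameters shown above, paging through the results in plain Java could look like this (the page limit of 3 is arbitrary, and the JSON is left unparsed; Jackson or Gson could take it from there):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class XhrPager {
    public static void main(String[] args) throws Exception {
        // Parameters copied from the observed XHR request; the exact set
        // the site requires may change over time.
        String base = "https://www.blablacar.de/search_xhr"
                + "?fn=frankfurt&fcc=DE&tn=muenchen&tcc=DE"
                + "&sort=trip_date&order=asc&limit=10";

        for (int page = 1; page <= 3; page++) { // crawl successive pages
            URL url = new URL(base + "&page=" + page);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Accept", "application/json");

            StringBuilder json = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    json.append(line);
                }
            }
            // Hand json.toString() to any JSON library for parsing.
            System.out.println("page " + page + ": " + json.length() + " chars");
        }
    }
}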

Retrieving contents of URL after they have been changed by javascript

I am facing a problem retrieving the contents of an HTML page using Java. I have described the problem below.
I am loading a URL in Java which returns an HTML page.
This page uses JavaScript, so when I load the URL in the browser, a JavaScript function call occurs AFTER the page has been loaded (onBodyLoad of the HTML page) and modifies some content (the innerHTML of one of the divs) on the web page. This change is obviously visible to me in the browser.
Now, when I try to do the same thing using Java, I only get the HTML content of the page BEFORE the JavaScript call has occurred.
What I want to do is fetch the contents of the HTML page after the JavaScript function call has occurred, and all of this has to be done in Java.
How can I do this? What should my approach be?
You need to use a server-side browser library that will also execute the JavaScript, so you can get the DOM contents as updated by the JavaScript. The default URL-fetching mechanism doesn't do this, which is why you don't get the expected result.
You should try Cobra: Java HTML Parser, which will execute your JavaScript. The project page has the download and the documentation on how to use it.
Cobra:
It is Javascript-aware. DOM modifications that occur during parsing will be reflected in the resulting DOM. However, Javascript can be disabled.
For anyone reading this answer, Scott's answer above was a starting point for me. The Cobra project is long dead and cannot handle pages which use complex JavaScript.
However, there is something called HtmlUnit which does exactly what I want.
Here is a small description:
HtmlUnit is a "GUI-Less browser for Java programs". It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc... just like you do in your "normal" browser.
It has fairly good JavaScript support (which is constantly improving) and is able to work even with quite complex AJAX libraries, simulating either Firefox or Internet Explorer depending on the configuration you want to use.
It is typically used for testing purposes or to retrieve information from web sites.
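A minimal HtmlUnit sketch along those lines (the URL is a placeholder, and the com.gargoylesoftware package name is from the classic HtmlUnit releases; newer versions moved to org.htmlunit):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class JsRenderedFetch {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            HtmlPage page = webClient.getPage("https://www.example.com/"); // placeholder

            // Give scripts fired on page load time to finish before reading the DOM.
            webClient.waitForBackgroundJavaScript(5_000);

            // The serialized DOM now includes the JavaScript modifications.
            System.out.println(page.asXml());
        }
    }
}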

scrape website multiple pages using Web Client java

I am trying to scrape a website using WebClient. I am able to get the data on the first page and parse it, but I do not know how to read the data on the second page; the website calls JavaScript to navigate to the second page. Can anyone suggest how I can get the data from the next pages?
Thanks in advance
The problem you're going to have is that while you (a person) can read the JavaScript in the first page and see that it navigates to another page, having the computer do this is going to be hard.
If you could identify the block of code performing the navigation, you would then need to execute it in a way that allowed your program to extract the URL. This again is going to be very specific to the structure of the JavaScript and would require a person to identify it.
In short, I think you're dead in the water with this one, though it serves as a good example of why the Unobtrusive JavaScript concept is so important.
This framework integrates HtmlUnit as its headless, JavaScript-enabled browser to fully support scraping multiple pages in the same WebClient session: https://github.com/subes/invesdwin-webproxy
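For completeness, a small sketch of the HtmlUnit route: clicking an anchor runs whatever JavaScript is attached to it and returns the page the browser would land on. The URL and the link text "Next" are placeholders for whatever the real pagination link looks like.

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class NextPageScraper {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            HtmlPage first = webClient.getPage("https://www.example.com/list"); // placeholder

            // click() executes the anchor's JavaScript navigation and
            // returns the resulting page.
            HtmlAnchor next = first.getAnchorByText("Next");
            HtmlPage second = next.click();

            System.out.println(second.asXml()); // parse page two from here
        }
    }
}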

Interact with Web Page from java

So I'm making a program for Android that tries to download something from www.wupload.com. What I want isn't a browser, but a way to interact with the web page without actually showing it, like how HtmlUnit is supposed to work.
I'm using Apache HttpClient for the HTTP requests, and what I've done so far is send a POST that simulates clicking on "slow download" on the web page. Then I read the response so I can get some variables needed for the next POST, and execute that next POST. In theory, the web page should now be showing the captcha, because the response I get says to please enter the captcha, but there is no image URL.
The next step would be to enter the captcha and finally download the file. The problem I'm having is that I don't know how to show the captcha image to the user. Do I have to capture it somehow? I know how to make the POST to send what the user would type, but the image URL of the captcha isn't in the source code.
I thought of inspecting the web page so I could get the URL from the DOM tree, like what Inspect Element in Google Chrome does, but I have no idea if that's even possible. Any ideas would be great.
thx
The captcha is probably generated using JavaScript. Therefore, when you get the source of the website, the captcha hasn't yet been generated and you won't see the image in the source HTML. You would need to run the JavaScript somehow. You could try using a WebView, because it has built-in support for JavaScript, or get a JavaScript library for Java and use it somehow. I think it would be a lot of work.
Edit:
Actually, if they are using a third-party captcha library, I'm sure it uses some sort of HTTP request system, so you might be able to inspect it with a network-inspection plugin for Firefox.
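As a rough sketch of the WebView route (the activity setup and URL are placeholders; the real wupload download page may differ), enabling JavaScript lets the captcha script run and render the image for the user:

import android.app.Activity;
import android.os.Bundle;
import android.webkit.WebView;

public class CaptchaActivity extends Activity {
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);

        WebView webView = new WebView(this);
        webView.getSettings().setJavaScriptEnabled(true); // the captcha script needs JS
        setContentView(webView);

        // Placeholder URL: the page whose script generates the captcha image.
        webView.loadUrl("https://www.wupload.com/download-page");
    }
}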

How do i get Contents of an ASPX file through java?

In Java, is there any way to get the content of a web page which is an .ASPX file?
I know how to read/write anything from a normal HTML page, but ASPX pages seem to have one URL for multiple pages, so it's not really possible to reach the desired page by URL alone.
I understand you can't/won't give me complete instructions right here, but could you maybe send me in the right direction ?
Thanks in advance.
There is nothing special about ASPX pages compared to any other type of page; "plain" html pages could have been dynamically generated as well.
Just don't forget that the query string is also part of the URL. Many ASPX, PHP, etc. pages might not even respond correctly when requested without some query string value at all. And other sites don't have file extensions at all... like this site itself. You just have to be sure to get the entire URL for each unique 'page'.
I'm not an expert on .asp, so I might be wrong. However, my impression is that a .asp page should ultimately return HTML (similarly to what a .jsp page does), so you can fetch the content in the same way as you would do for an HTML page.
However, you write that
asp pages seem to have one URL for multiple pages
this makes me think that perhaps your .asp page is using AJAX and so the page content may change while the URL doesn't. Is this your case?
I understand that you are trying to read the ASPX from a client PC, not from the server.
If that's right, accessing an HTTP resource is independent of the technology used by the server; all you need to do is open an HTTP request and retrieve the results.
If you see multiple pages from one URL, then one of the following is happening:
1) POST data is sent to the ASPX, and it renders different HTML depending on these parameters
2) You are not really looking at the inner page but at a page that provides the frames for the HTML being rendered
3) The page uses Ajax heavily in order to be rendered. The "contents" of the page are not downloaded through the initial request but later by JavaScript.
Generally, it is probably the first reason.
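A sketch of the first case, posting form data to a single URL and getting different HTML back depending on the parameters (the URL and field names here are made up; real ASPX pages usually also expect hidden form fields such as __VIEWSTATE copied from the page's form):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class AspxPost {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://www.example.com/page.aspx"); // placeholder
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");

        // Different POST data, different HTML back, same URL.
        String form = "page=2&category=books"; // hypothetical field names
        try (OutputStream out = conn.getOutputStream()) {
            out.write(form.getBytes(StandardCharsets.UTF_8));
        }

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}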
