Is it possible to process JavaScript in a Java application? Possibly utilizing WebKit libraries, or whatever browser libraries use to process JavaScript? A use case would be - how, in Java, can I determine the possible links this web page would go to?
<script>
function goToLink(){
if(1==1){
window.location='www.somesite.com'
} else {
window.location='www.nevergetthere.com'
}
}
</script>
<html>
<a href onClick='javascript:goToLink()'>CLICK HERE!!</a>
</html>
Typically, you would just search all of the code with a link regular expression, but that would also turn up 'www.nevergetthere.com', which you will never actually have the chance of going to.
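To make the problem concrete, here is a minimal sketch of that regular-expression approach using only the JDK. The class name `LocationScan` and the pattern are illustrative assumptions, not part of any library. Because the scan is purely textual and never evaluates the `if`/`else`, it reports the unreachable target along with the reachable one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a purely textual scan for window.location assignments.
// It does not execute any JavaScript, so it cannot tell which branch is reachable.
final class LocationScan {

    // Matches window.location = '...' or window.location.href = "..."
    private static final Pattern ASSIGNMENT = Pattern.compile(
            "window\\.location(?:\\.href)?\\s*=\\s*(['\"])(.*?)\\1");

    static List<String> findTargets(String source) {
        List<String> targets = new ArrayList<>();
        Matcher matcher = ASSIGNMENT.matcher(source);
        while (matcher.find()) {
            targets.add(matcher.group(2)); // group 2 is the text between the quotes
        }
        return targets;
    }

    public static void main(String[] args) {
        String script = String.join("\n",
                "function goToLink(){",
                "  if(1==1){",
                "    window.location='www.somesite.com'",
                "  } else {",
                "    window.location='www.nevergetthere.com'",
                "  }",
                "}");
        // Both targets are listed, including the one that can never be reached.
        System.out.println(findTargets(script));
    }
}
```

Telling the two targets apart requires actually running the script, which a regular expression cannot do.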
I've had some luck tracking down JavaScript-based page links with HtmlUnit. It basically acts like a browser that you have access to inside a Java program, so you can simulate a click on a link and then figure out where it goes.
You might be looking for Rhino, Mozilla's JavaScript engine written in Java.
If the objective is to look at a website without knowing beforehand what the JavaScript or HTML will look like, and to figure out where you would end up by clicking on various anchor tags, you could use something like WebDriver to actually load the page in a browser (either real or virtual), click on various DOM elements, and see where you end up.
Web scraping is tricky business, though. There are a hundred little things that could make your code not read the page correctly. A hundred little expectations you may have that the website in question might not abide by.
Related
Hello everyone, here is my problem.
I want to extract one of two words from a website: "won" or "loss". If I can find those words on the website, I will be able to write the program I am working on. The problems I have are:
When I write a Java program to get the HTML from the site, it only gives me the HTML that does not change, i.e. it doesn't give me the dynamic PHP parts.
When I "inspect elements" on the website, it gives me exactly what I want: it shows either "won" or "loss" inside the HTML tags. However, if I simply view the source, it doesn't show the dynamic content that you would see when inspecting elements.
Is there a way for me to write code that looks at what "inspect elements" shows for the website and keeps track of the part of the HTML that changes between "won" and "loss"?
I've had trouble with something like this before, and since your question lacks details, I will give you the best answer I can.
More information that would be helpful, if you edit your question to include it:
Your code (show me what you've got)
The HTML code
APIs or frameworks used in your application
So the issue seems to be that when you request the site, the information is not there. Normally this doesn't happen, since most web pages include their information at load time.
These days we do a lot with JavaScript, so that is probably the part you are having problems with. JavaScript can load information onto the page dynamically at any time. It need not be at load time, and even if it looks to the eye as though the content is there when the page loads, it may not be, since the update is too fast to notice.
Look into the JavaScript code and see if you can find a GET, POST, or PUT request, and follow it to where it loads the data into the page. Then mimic that request in your program.
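As a rough sketch of what mimicking the request could look like with the JDK's built-in `java.net.http` client (Java 11+): the endpoint URL, the `X-Requested-With` header, and the class name `ResultProbe` are all assumptions for illustration. The real values have to come from whatever the page's script actually sends, which you can see in the network tab of the browser's developer tools.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: rebuilds the request the page's script makes and
// looks for "won" or "loss" in whatever comes back.
final class ResultProbe {

    private static final Pattern OUTCOME =
            Pattern.compile("\\b(won|loss)\\b", Pattern.CASE_INSENSITIVE);

    // Builds a GET request shaped like a typical Ajax call (headers are assumptions).
    static HttpRequest buildRequest(String endpoint) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Accept", "application/json, text/html")
                .header("X-Requested-With", "XMLHttpRequest")
                .GET()
                .build();
    }

    // Returns "won" or "loss" (lower-cased) if either appears as a whole word.
    static Optional<String> extractOutcome(String body) {
        Matcher matcher = OUTCOME.matcher(body);
        return matcher.find()
                ? Optional.of(matcher.group(1).toLowerCase(Locale.ROOT))
                : Optional.empty();
    }

    // Sends the request; not called from main, so the sketch runs offline.
    static String fetch(String endpoint) throws Exception {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(endpoint), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        // Placeholder endpoint; replace it with the URL the script really calls.
        HttpRequest request = buildRequest("https://example.com/game/result");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(extractOutcome("{\"status\":\"You WON this round\"}"));
    }
}
```

If the site requires cookies or a session token, those would need to be copied into the request as well.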
The program I am writing is in Java.
I am writing a little program that will download the HTML of web pages and save it. It works easily for basic pages that don't use JavaScript. But how can I download a page as it looks after a script has updated it? The page I am dealing with is actually updated by Ajax, which might be one step harder.
I understand that this is probably a difficult problem that involves setting up a JavaScript runtime environment of some kind. I am prepared for a solution of any level of difficulty; I just don't know exactly how to approach it or where to get started.
You can't do that with plain Java alone. Because the page you want to download is rendered with JavaScript, you must be able to execute that JavaScript to get the fully rendered page.
For this you need a headless browser: a web browser that can load web pages but has no GUI to display them. Its purpose is to provide fully rendered page content to programs or scripts.
You can start with the best-known options, which are Selenium, HtmlUnit, and PhantomJS.
I am facing a problem retrieving the contents of an HTML page using Java. I have described the problem below.
I am loading a URL in Java which returns an HTML page.
This page uses JavaScript. When I load the URL in the browser, a JavaScript function call occurs AFTER the page has been loaded (in the body's onload handler) and modifies some content on the page (the innerHTML of one of the divs). This change is obviously visible to me in the browser.
Now, when I try to do the same thing using Java, I only get the HTML content of the page BEFORE the JavaScript call has occurred.
What I want to do is fetch the contents of the HTML page after the JavaScript function call has occurred, and all of this has to be done using Java.
How can I do this? What should my approach be?
You need to use a server-side browser library that also executes the JavaScript, so you can get the DOM contents as updated by the script. Java's default URL-loading mechanism only downloads the markup and doesn't run any script, which is why you don't get the expected result.
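The effect can be reproduced with only the JDK. In this sketch, which is an illustration and not a fix, a throwaway local server (the JDK's built-in `com.sun.net.httpserver`) serves a made-up page whose onload handler would rewrite a div in a real browser. The class name `StaticFetchDemo` and the page content are assumptions. Fetching the page with `java.net.http.HttpClient` returns the markup exactly as served, with the div still reading "placeholder".

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

final class StaticFetchDemo {

    // Hypothetical page: in a browser, the onload handler would change the div text.
    static final String PAGE =
            "<html><body onload=\"document.getElementById('status').innerHTML='updated'\">"
            + "<div id=\"status\">placeholder</div></body></html>";

    // Serves PAGE from a temporary local server and fetches it without running any script.
    static String fetchWithoutJavaScript() {
        try {
            // Port 0 lets the operating system pick a free port.
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                byte[] body = PAGE.getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
            try {
                URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/");
                HttpResponse<String> response = HttpClient.newHttpClient()
                        .send(HttpRequest.newBuilder(uri).build(),
                              HttpResponse.BodyHandlers.ofString());
                return response.body();
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String html = fetchWithoutJavaScript();
        // The onload handler never ran, so the div still holds its original text.
        System.out.println(html.contains("<div id=\"status\">placeholder</div>"));
    }
}
```

Getting the updated text instead requires something that builds a DOM and runs the handler, which is what the libraries discussed in these answers are for.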
You should try Cobra: Java HTML Parser, which will execute your JavaScript. See here for the download and for the documentation on how to use it.
Cobra:
It is Javascript-aware. DOM modifications that occur during parsing will be reflected in the resulting DOM. However, Javascript can be disabled.
For anyone reading this answer, Scott's answer above was a starting point for me. The Cobra project is long dead and cannot handle pages which use complex JavaScript.
However, there is something called HtmlUnit, which does exactly what I want.
Here is a small description:
HtmlUnit is a "GUI-Less browser for Java programs". It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc... just like you do in your "normal" browser.
It has fairly good JavaScript support (which is constantly improving) and is able to work even with quite complex AJAX libraries, simulating either Firefox or Internet Explorer depending on the configuration you want to use.
It is typically used for testing purposes or to retrieve information from web sites.
I am trying to scrape a website using WebClient. I am able to get the data on the first page and parse it, but I do not know how to read the data on the second page, because the website calls JavaScript to navigate to it. Can anyone suggest how I can get the data from the following pages?
Thanks in advance
The problem you're going to have is that while you (a person) can read the JavaScript in the first page and see that it navigates to another page, having the computer do this is going to be hard.
If you could identify the block of code performing the navigation, you would then need to execute it in such a way that allowed your program to extract the URL. This again is going to be very specific to the structure of the JavaScript and would require a person to identify this.
In short, I think you're dead in the water with this one, though it serves as a good example of why the Unobtrusive JavaScript concept is so important.
This framework integrates HtmlUnit, with its headless, JavaScript-enabled browser, to fully support scraping multiple pages in the same WebClient session: https://github.com/subes/invesdwin-webproxy
I am working on a project, written in Java, that would translate the HTML of a web page into code for a specific JS library, so that the div blocks can have different dynamic behaviors.
To translate an HTML div into a JS object, I have to know its coordinates as well as its width and height.
I looked into several Java HTML parser libraries: http://java-source.net/open-source/html-parsers
But none of them has this functionality except Cobra http://lobobrowser.org/cobra/java-html-parser.jsp . It has a rendering engine which can provide the coordinates and dimensions of a div. But this library turns out to be really buggy. I cannot even run the tests that come with the library.
Does anyone know how to handle this problem? I would really appreciate it if you could help!
Thanks in advance!
Phil
You could try some component of HtmlUnit, which emulates a browser. Honestly, though, I think you need to think about your question more carefully. jQuery can do the 'different dynamic behaviours' thing you talk about by modifying the HTML DOM (Document Object Model) with JavaScript, and if you need anything from the HTML document, inspecting the DOM via JavaScript should be your first port of call. Java should not be required anywhere (unless you're using it server-side for page and input processing with JSP or some similar tech). Any responses to client input can be triggered server-side and sent to JavaScript on the client side, which triggers jQuery actions that modify the DOM.