Parse JavaScript from html string in C#/Java - java

I am trying to do some WebScraping of a site and the data is in dynamically loaded containers. It appears that the loading is done via JavaScript. Therefore I would need to execute it, in order to get it. I have written 2 versions of the program, one in Java and one in C#, so helping me with one would be nice enough.
I am currently using WebResponse/WebRequest in C# and HttpURLConnection in Java. I have to login into the site first, which already works like a charm. Now I need to parse the content, so the data gets filled in and the containers loaded. Is there an easy way to run the html through a browser control or an already included library?

In C# you could just use a hidden WebControl. With that you could execute everything you need.

Related

Run and Save output of JavaScript from Java

I'm trying to figure out how to run some JavaScript from my Java class. I know that javax.script can be used to do this, but here's the kicker:
The JavaScript I want to run returns and displays a PDF in the browser. I'd like to store the PDF that's generated in a byte array or something like that so my class can do something with it later such as save it to disk.
So I guess the best approach would be to take a String that I have that contains the HTML and JavaScript. Then emulate a browser page, run it, then store the source? I'm not sure what library I would use to do something like this. How might I achieve this?
Focusing on your problem with javascript and java you can you this lib called Direct Web Remoting:
DWR is a RPC library which makes it easy to call Java functions from
JavaScript and to call JavaScript functions from Java (a.k.a Reverse
Ajax).
http://directwebremoting.org/

When JEditorPane will not do. JRex, JRex wherefore are thou JRex?

IDEA: Implement a recent web browser into a java application (for saved offline, non server content).
The question is this: can I have a java application implement a webbrowser with jquery / html / css support within a java program?
So I am asking anyone who has played with JRex for advice: I want to know how complicated will it be to integrate an open source webbrowser into java. I am not all that keen on the idea of compiling Mozilla from source build. Is there a ready made compiled version?
Is there a simplified method to have latest compiled version (most current in terms of support for HTML css & javascript), and integrate that into an application?
Also: I appreciate the amount of work required to support for HTML4 nevermind 5, and CSS2 compliance. How close is JRex to that?
Application: My intention with the webbrowser is to render a webpage from offline content. It will not need to be online content, and will simply be for file based displays = e.g. file:///C:...
Does the webbrowser have to be wrapped into a server to function, e.g. to pass files to the browser to render is how complicated? I am not keen to have to implement Jetty or another server type application just for this.
If JRex is not the solution... what then? Is it possible to start a browser implementation within Java and can Java interact with the information and traverse the Dom?
Or alternatively is there .hta equivalent in recent browsers like firefox?
If you need to have the embedded browser interact with your application code, you could try the SWT Browser control, it's actually maintained as opposed to JRex. Browser uses either WebKit or Gecko or embedded IE as appropriate, or lets you choose which one you want, so it should run jQuery and familiar Javascript. And since SWT is a JNI library to begin with they probably already have guidance on how to deploy an app that uses JNI.
You can feed HTML into the control from a string (example) or a java Url - which can point to local files or resource files in your JAR, which I assume will let you split your app into different files.
To call Java code, you need to expose it as Javascript functions. example
To manipulate the HTML from Java code, you need to call Javascript functions from Java. example
To make the previous two tasks easier, you might want to look into a JSON library to simplify passing around complex data.
Does it have to be implemented within a Java program? Could you let the user use the default browser on their machine (ie does it matter what browser)?
If not would use the Java Desktop API.
if (desktop.isSupported(Desktop.Action.BROWSE)) {
txtBrowserURI.setEnabled(true);
btnLaunchBrowser.setEnabled(true);
}
If you are using Java 1.5 try http://javadesktop.org/articles/jdic/

How to fill out form data on a website

I am looking to develop an app that will take login details from the user, go to a website, login, return values on the web page and then display them to the user on the phone.
Does java have this functionallity? Will I need to use javascript instead maybe? do these answers depend on the website that I am trying to access?
In my head I figure that I could just read in the paramaters as strings or chars, parse the webpage for the appropriate form and "paste" the appropriate value into the form "box". However, I have never attempted anything like this with coding so I am completely new to the idea and dont really know where to start. I tried googling around but any information that I found was either irrelevant or conflicting.
I'm not looking for the code to do it because I will not really learn anythig from that but a finger in the right direction would be great. I really do want to try get better at programming so that's why I've started to give myself these little side projects
Any help that can be offered would be great
Ian,
You can try using http-client (http://hc.apache.org/httpclient-3.x/) lib from apache. It lets to pro grammatically access a website (from a Java code). You will need to do the following things
Use the http-client lib to POST the data to the web site.
Receive the html response.
Use some html parser or xpath to retrieve the values from the response html.
You would need a script which accesses the webpage and enters the data, but in my opinion this is illegal. Because you are accessing a secured area and are able to look into sensitive data. Also accessing the page via a script is "botting" - most pages have safety precautions to prevent the execution of scripts, because most of them are harmful.
In my opinion there is no legal and easy solution to this.

Sending HTML Form Data to Java

I have a Java program that I'm trying to interact with over the web.
I need to gather form data from the user on a Drupal site that I don't have any control over, send it to the Java program, and send the output back to the user. The Java program needs to load a lot of libraries every time it's run, so it needs to be up waiting for data from the user.
It'd be best for me to just have an HTML form for the input. What's the simplest way to deal with HTML form data using Java?
Also, I'm trying to call the Java program from a shell script. I want the program running in the background though so the libraries are loaded in advance. So ideally, I could use the server I set up for both applications.
Thanks for any help.
It sounds like you really just want to write a servlet (or use a higher level web framework, but a servlet would work fine). That makes it very easy to get web form data - you just ask for values by name, basically.
You could then "script" the application using curl, wget or something similar to make requests to the servlet.
Apologies if this doesn't answer your question - I'm finding it slightly tricky to understand exactly what you're trying to do, particularly as there are multiple layers of web UI involved, as far as I can see.
The easiest way to make POST requests with java is to use the Apache HttpClient or the more recent HttpComponents libraries.

autogenerate HTTP screen scraping Java code

I need to screen scrape some data from a website, because it isn't available via their web service. When I've needed to do this previously, I've written the Java code myself using Apache's HTTP client library to make the relevant HTTP calls to download the data. I figured out the relevant calls I needed to make by clicking through the relevant screens in a browser while using the Charles web proxy to log the corresponding HTTP calls.
As you can imagine this is a fairly tedious process, and I'm wodering if there's a tool that can actually generate the Java code that corresponds to a browser session. I expect the generated code wouldn't be as pretty as code written manually, but I could always tidy it up afterwards. Does anyone know if such a tool exists? Selenium is one possibility I'm aware of, though I'm not sure if it supports this exact use case.
Thanks,
Don
I would also add +1 for HtmlUnit since its functionality is very powerful: if you are needing behaviour 'as though a real browser was scraping and using the page' that's definitely the best option available. HtmlUnit executes (if you want it to) the Javascript in the page.
It currently has full featured support for all the main Javascript libraries and will execute JS code using them. Corresponding with that you can get handles to the Javascript objects in page programmatically within your test.
If however the scope of what you are trying to do is less, more along the lines of reading some of the HTML elements and where you dont much care about Javascript, then using NekoHTML should suffice. Its similar to JDom giving programmatic - rather than XPath - access to the tree. You would probably need to use Apache's HttpClient to retrieve pages.
The manageability.org blog has an entry which lists a whole bunch of web page scraping tools for Java. However, I do not seem to be able to reach it right now, but I did find a text only representation in Google's cache here.
You should take a look at HtmlUnit - it was designed for testing websites but works great for screen scraping and navigating through multiple pages. It takes care of cookies and other session-related stuff.
I would say I personally like to use HtmlUnit and Selenium as my 2 favorite tools for Screen Scraping.
A tool called The Grinder allows you to script a session to a site by going through its proxy. The output is Python (runnable in Jython).

Categories

Resources