Replacement for Jaxer for parsing/crawling websites - java

I have an old tool an (ex-)colleague wrote a few years back with Jaxer that I'd like to replace/rewrite.
Jaxer is an (abandoned) server-side framework based on a headless Mozilla/Gecko-Browser allowing you to use JavaScript and the DOM server-side.
Since Jaxer is abandoned and because I have big problems installing and running Aptana Studio 1.5 with Jaxer on a new computer, I'm looking for a library/framework/something on which I can base a new version.
This tool is only run locally inside Aptana Studio (the IDE for Jaxer) and was never intended to be an actual web app. It crawls our customers' websites by loading them page by page into the server-side Mozilla. In order to do that it uses jQuery and predefined CSS selectors to find the links in the menus and parse other information out of the pages. The final result is basically a glorified sitemap.
I'd like to keep this modus operandi if possible and continue using jQuery/JavaScript/the DOM to load and parse/access the pages, but it can be wrapped in a framework based on another language such as Java. I considered writing something based on Gecko myself, but that seems a bit over the top, so I'm open to other suggestions.

As far as HTML crawling/parsing goes:
TagSoup: http://ccil.org/~cowan/XML/tagsoup/
or
jsoup: http://jsoup.org/
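To illustrate how close jsoup's selector syntax is to the jQuery-based workflow described above, here is a minimal sketch; the URL and the "nav a" menu selector are placeholder assumptions:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MenuCrawler {
    public static void main(String[] args) throws Exception {
        // Fetch a page and query it with jQuery-style CSS selectors
        Document doc = Jsoup.connect("http://example.com/").get(); // placeholder URL
        for (Element link : doc.select("nav a")) { // hypothetical menu selector
            System.out.println(link.text() + " -> " + link.attr("abs:href"));
        }
    }
}

Note that jsoup parses static HTML only; it does not execute JavaScript, so it works best when the menus are present in the page source rather than generated client-side.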

Related

Alternatives to Java Applet running with NPAPI in Chrome

We have an Applet we used to zip files on the client machine and stream the content back to our servers. Our clients who have updated to newer versions of Chrome can no longer use our Applet because Chrome no longer supports NPAPI plugins.
I think I have a couple of options:
To somehow make the existing Applet work with Chrome (perhaps using JNLP?) or some other method
To find an alternative technology altogether
The solution has to be able to receive a list of folders, sub-folders and file names. It then has to be able to compress these files, if possible, then upload them to the server. I am open to any suggestions.
You can:
Read the file(s) with the File API, potentially letting the user add them to your interface via drag and drop (a more convenient selection mechanism than a boring <input type="file"> :-) ).
Zip them up in JavaScript using a library like JSZip (though if your server has gzip enabled, I'm not sure you gain much by doing that; I haven't looked into it deeply).
Send them to the server via HTTP POST (possibly multiple posts), XMLHttpRequest2, or web sockets.
Of course, your other alternative is to continue to use Java and have the users use Firefox instead of Chrome. Just beware that Mozilla is also looking to make a move away from NPAPI and away from supporting Java. About 20 months ago they weren't:
there are no plans of dropping support for java or other npapi plugins in firefox other than setting them to "ask to activate": https://blog.mozilla.org/security/2014/02/28/update-on-plugin-activation/
....but now:
Mozilla intends to remove support for most NPAPI plugins in Firefox by the end of 2016. Firefox began this process several years ago...
(which puts the lie to "no plans" in the first quote)
...Websites and publishers which currently use plugins such as Silverlight or Java should accelerate their transition to Web technologies.

Rendering HTML (w/ javascript) and Converting to an Image

I have an HTML page that has Javascript code. It needs to be rendered first before it can be converted into an image.
I am aware of projects like wkhtmltoimage, PhantomJS, khtmltopng, webkit2png, PrinceXML and html2image. I have implemented a few of those but I am trying to find a pure Java solution that does not have to use Process to execute a command. Any help would be great, thanks!
Edit: I looked into Cobra; however, it seems that the JS support is still in development and it does not parse my HTML file properly.
Or if there are any other ways of doing this, please let me know. I am just trying to find the best solution possible.
There is no pure Java solution - no one has written a browser in Java that supports HTML 5.
I'd try either of these approaches:
Use env.js + Rhino to simulate a browser in which you can run the JavaScript. That should give you a DOM which you can render using FlyingSaucer, for example (see the sketch after this list).
Add SWT to your classpath (plus the binary for your platform). It contains a Browser component that uses your system's browser to render URLs or an HTML string.
You probably need SWTBot to run the browser in headless mode.
If that doesn't work and you're on Linux, you can start an in-memory X server such as Xvfb in which to open your browser, or use vncserver to start a desktop on your server.
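Here is a rough sketch of the FlyingSaucer step in the first approach; Java2DRenderer is FlyingSaucer's off-screen renderer, though constructor signatures vary between versions, and the input has to be well-formed XHTML (hence the env.js/Rhino step first):

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;
import org.xhtmlrenderer.swing.Java2DRenderer;

public class HtmlToImage {
    public static void main(String[] args) throws Exception {
        // Render a well-formed XHTML file into an off-screen image at 1024x768
        Java2DRenderer renderer = new Java2DRenderer(new File("page.xhtml"), 1024, 768);
        BufferedImage image = renderer.getImage();
        ImageIO.write(image, "png", new File("page.png"));
    }
}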
[EDIT] The project phantomjs might do what you want:
PhantomJS (www.phantomjs.org) is a headless WebKit scriptable with JavaScript or CoffeeScript.
[...]
Use cases: Headless web testing, Site scraping, Page rendering
Multiplatform, available on major operating systems: Windows, Mac OS X, Linux, other Unices
Fast and native implementation of web standards: DOM, CSS, JavaScript, Canvas, SVG. No emulation!
Pure headless (X11) on Linux, ideal for continuous integration systems. Also runs on Amazon EC2.
The quickstart page explains how to load a web page and render it to an image.
I have found a solution using WebRenderer. WebRenderer is a paid solution and has Swing, Server, and Desktop editions. The Swing edition is the only one that supports HTML5 as of 7/9/2012. However, the Swing edition can be used on a server to convert the image by instantiating the browser without creating a JFrame. See this question.

When writing an internet browser in Java

I came across the problem of not having an editor kit that can handle some parts of a webpage, such as JavaScript and CSS. Does anyone know where I can find an editor kit that is suitable for that?
Also, I'm curious which programming languages browsers like Google Chrome and IE are written in.
Try Aptana Studio
There are already miscellaneous browser projects using Java. The Java Scripting API, which supports JavaScript (ECMAScript) among other languages, is worth looking into. While reading the HTML, you should construct a DOM tree and interpret <script> blocks, which may operate on the DOM or write HTML that you then have to read as well.
The EditorKit in Swing builds an overly simple, non-tree StyledDocument, which you would have to bridge.
The way to proceed would be to skip Swing output at first and instead generate a DOM or the HTML directly.
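For the Scripting API mentioned above, a minimal sketch (javax.script has shipped with the JDK since Java 6; the "JavaScript" engine name resolves to Rhino on Java 6/7 and Nashorn on Java 8):

import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class ScriptBlockDemo {
    public static void main(String[] args) throws Exception {
        ScriptEngineManager manager = new ScriptEngineManager();
        ScriptEngine engine = manager.getEngineByName("JavaScript");
        engine.put("title", "Hello");                     // expose a Java value to the script
        Object result = engine.eval("title + ', world'"); // run a <script>-style snippet
        System.out.println(result);                       // prints: Hello, world
    }
}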

When JEditorPane will not do. JRex, JRex, wherefore art thou JRex?

IDEA: Embed a recent web browser in a Java application (for saved offline, non-server content).
The question is this: can I have a Java application embed a web browser with jQuery/HTML/CSS support?
So I am asking anyone who has played with JRex for advice: I want to know how complicated it would be to integrate an open-source web browser into Java. I am not all that keen on the idea of compiling Mozilla from source. Is there a ready-made compiled version?
Is there a simplified method to get the latest compiled version (most current in terms of support for HTML, CSS and JavaScript) and integrate it into an application?
Also: I appreciate the amount of work required to support HTML4, never mind HTML5, and CSS2 compliance. How close is JRex to that?
Application: my intention for the web browser is to render webpages from offline content. It will not need online content and will simply be used for file-based displays, e.g. file:///C:...
Does the web browser have to be wrapped in a server to function? How complicated is it to pass files to the browser to render? I am not keen to implement Jetty or another server-type application just for this.
If JRex is not the solution, what then? Is it possible to start a browser implementation within Java, and can Java interact with the information and traverse the DOM?
Or alternatively, is there an .hta equivalent in recent browsers like Firefox?
If you need the embedded browser to interact with your application code, you could try the SWT Browser control; it's actively maintained, unlike JRex. Browser uses WebKit, Gecko, or embedded IE as appropriate, or lets you choose which one you want, so it should run jQuery and familiar JavaScript. And since SWT is a JNI library to begin with, they probably already have guidance on how to deploy an app that uses JNI.
You can feed HTML into the control from a string (example) or a Java URL, which can point to local files or resource files in your JAR; I assume this will let you split your app into different files.
To call Java code from the page, you need to expose it as JavaScript functions (example).
To manipulate the HTML from Java code, you need to call JavaScript functions from Java (example).
To make the previous two tasks easier, you might want to look into a JSON library to simplify passing around complex data.
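A minimal sketch of the SWT Browser control loading offline, file-based content along the lines discussed above; the file path is a placeholder, and you need the SWT jar plus the native library for your platform:

import org.eclipse.swt.SWT;
import org.eclipse.swt.browser.Browser;
import org.eclipse.swt.layout.FillLayout;
import org.eclipse.swt.widgets.Display;
import org.eclipse.swt.widgets.Shell;

public class OfflineBrowser {
    public static void main(String[] args) {
        Display display = new Display();
        Shell shell = new Shell(display);
        shell.setLayout(new FillLayout());

        Browser browser = new Browser(shell, SWT.NONE); // uses the platform's native engine
        browser.setUrl("file:///C:/pages/index.html");  // placeholder local file
        // browser.setText(htmlString) works for in-memory HTML;
        // browser.evaluate("return document.title;") runs JavaScript once the page has loaded

        shell.setSize(800, 600);
        shell.open();
        while (!shell.isDisposed()) {
            if (!display.readAndDispatch()) {
                display.sleep();
            }
        }
        display.dispose();
    }
}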
Does it have to be implemented within a Java program? Could you let the user use the default browser on their machine (i.e. does it matter which browser)?
If not, I would use the Java Desktop API.
import java.awt.Desktop;

// Enable the browser-launching controls only if the platform supports it
Desktop desktop = Desktop.getDesktop();
if (desktop.isSupported(Desktop.Action.BROWSE)) {
    txtBrowserURI.setEnabled(true);    // Swing controls from the surrounding (not shown) UI
    btnLaunchBrowser.setEnabled(true); // its handler would call desktop.browse(new URI(...))
}
If you are using Java 1.5, try http://javadesktop.org/articles/jdic/

autogenerate HTTP screen scraping Java code

I need to screen scrape some data from a website, because it isn't available via their web service. When I've needed to do this previously, I've written the Java code myself using Apache's HTTP client library to make the relevant HTTP calls to download the data. I figured out the relevant calls I needed to make by clicking through the relevant screens in a browser while using the Charles web proxy to log the corresponding HTTP calls.
As you can imagine this is a fairly tedious process, and I'm wondering if there's a tool that can actually generate the Java code that corresponds to a browser session. I expect the generated code wouldn't be as pretty as code written manually, but I could always tidy it up afterwards. Does anyone know if such a tool exists? Selenium is one possibility I'm aware of, though I'm not sure if it supports this exact use case.
Thanks,
Don
I would also add +1 for HtmlUnit, since its functionality is very powerful: if you need behaviour "as though a real browser were scraping and using the page", that's definitely the best option available. HtmlUnit executes (if you want it to) the JavaScript in the page.
It currently has full-featured support for all the main JavaScript libraries and will execute JS code that uses them. Correspondingly, you can get handles to the JavaScript objects in the page programmatically within your test.
If, however, the scope of what you are trying to do is smaller, more along the lines of reading some of the HTML elements without caring much about JavaScript, then NekoHTML should suffice. It's similar to JDOM, giving programmatic (rather than XPath) access to the tree. You would probably need to use Apache's HttpClient to retrieve pages.
The manageability.org blog has an entry which lists a whole bunch of web page scraping tools for Java. I do not seem to be able to reach it right now, but I did find a text-only representation in Google's cache here.
You should take a look at HtmlUnit - it was designed for testing websites but works great for screen scraping and navigating through multiple pages. It takes care of cookies and other session-related stuff.
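For illustration, a minimal HtmlUnit sketch, assuming a recent 2.x release (where WebClient is AutoCloseable); the URL is a placeholder:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class ScrapeDemo {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) { // manages cookies and session state
            HtmlPage page = client.getPage("http://example.com/"); // placeholder URL
            System.out.println(page.getTitleText());
            for (HtmlAnchor link : page.getAnchors()) {
                System.out.println(link.getHrefAttribute());
            }
        }
    }
}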
Personally, I like to use HtmlUnit and Selenium as my two favorite tools for screen scraping.
A tool called The Grinder allows you to script a session to a site by going through its proxy. The output is Python (runnable in Jython).
