What are the best Java libraries to fully download any web page, execute its embedded JavaScript, and then access the rendered page (that is, the DOM tree) programmatically and get the DOM tree back as HTML source?
(Something similar to what Firebug does: it renders the page and I get access to the fully rendered DOM tree, as the page looks in the browser. In contrast, if I click "show source" I only get the JavaScript source code. That is not what I want; I need access to the rendered page.)
(By "rendering" I mean only building the DOM tree, not visual rendering.)
This does not have to be one single library, it's ok to have several libraries that can accomplish this together (one will download, one render...), but due to the dynamic nature of JavaScript most likely the JavaScript library will also have to have some kind of downloader to fully render any asynchronous JS...
Background:
In the "good old days", HttpClient (the Apache library) was everything required to build your own very simple crawler. (A lot of crawlers such as Nutch or Heritrix are still built around this core principle, mainly focusing on standard HTML parsing, so I can't learn from them.)
My problem is that I need to crawl some websites that rely heavily on JavaScript and that I can't parse with HttpClient, as I definitely need to execute the JavaScript first.
You can use the JavaFX 2 WebEngine. Download the JavaFX SDK (you may already have it if you installed JDK7u2 or later) and try the code below.
It will print the HTML after the JavaScript has been processed.
You can uncomment lines in the middle to see rendering as well.
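import java.io.IOException;

import javafx.application.Application;
import javafx.beans.value.ChangeListener;
import javafx.beans.value.ObservableValue;
import javafx.scene.web.WebEngine;
import javafx.scene.web.WebView;
import javafx.stage.Stage;

// Assumption: the JDK-internal copy of the Xerces serializer. With Xerces on the
// classpath, import the same classes from org.apache.xml.serialize instead.
import com.sun.org.apache.xml.internal.serialize.OutputFormat;
import com.sun.org.apache.xml.internal.serialize.XMLSerializer;

public class WebLauncher extends Application {

    @Override
    public void start(Stage stage) {
        final WebView webView = new WebView();
        final WebEngine webEngine = webView.getEngine();
        webEngine.load("http://stackoverflow.com");

        // Uncomment to show the rendered page as well (needs import javafx.scene.Scene).
        //stage.setScene(new Scene(webView));
        //stage.show();

        // Dump the DOM once the load worker reports 100% progress.
        webEngine.getLoadWorker().workDoneProperty().addListener(new ChangeListener<Number>() {
            @Override
            public void changed(ObservableValue<? extends Number> observable, Number oldValue, Number newValue) {
                if (newValue.intValue() == 100 /*percents*/) {
                    try {
                        org.w3c.dom.Document doc = webEngine.getDocument();
                        new XMLSerializer(System.out, new OutputFormat(doc, "UTF-8", true)).serialize(doc);
                    } catch (IOException ex) {
                        ex.printStackTrace();
                    }
                }
            }
        });
    }

    public static void main(String[] args) {
        launch(args);
    }
}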
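The serialization step in the code above (turning an org.w3c.dom.Document into text) can also be done with the standard javax.xml.transform API, which avoids depending on a particular XMLSerializer implementation. The following is only a sketch under that assumption: the class name DomToString and the tiny hand-built sample document are made up for illustration, and in the WebEngine case you would pass webEngine.getDocument() instead.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of JavaFX: serializes any W3C DOM document to a string.
final class DomToString {

    static String serialize(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize the DOM", e);
        }
    }

    // Stand-in for the document a browser engine would hand back after running scripts.
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            body.setTextContent("rendered by script");
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build the sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(sampleDocument()));
    }
}
```

Returning a String instead of writing to System.out makes it easier to hand the rendered markup to whatever parser or indexer the crawler uses next.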
This is a bit outside of the box, but if you are planning on running your code in a server where you have complete control over your environment, it might work...
Install Firefox (or XulRunner, if you want to keep things lightweight) on your machine.
Using the Firefox plugin system, write a small plugin which loads a given URL, waits a few seconds, then copies the page's DOM into a String.
From this plugin, use the Java LiveConnect API (see http://jdk6.java.net/plugin2/liveconnect/ and https://developer.mozilla.org/en/LiveConnect ) to push that string across to a public static function in some embedded Java code, which can either do the required processing itself or farm it out to some more complicated code.
Benefits: You are using a browser that most application developers target, so the observed behavior should be comparable. You can also upgrade the browser along the normal upgrade path, so your library won't become out-of-date as HTML standards change.
Disadvantages: You will need to have permission to start a non-headless application on your server. You'll also have the complexity of inter-process communication to worry about.
I have used the plugin API to call Java before, and it's quite achievable. If you'd like some sample code, you should take a look at the XQuery plugin - it loads XQuery code from the DOM, passes it across to the Java Saxon library for processing, then pushes the result back into the browser. There are some details about it here:
https://developer.mozilla.org/en/XQuery
The Selenium library is normally used for testing, but it does give you remote control of most standard browsers (IE, Firefox, etc.) as well as a headless, browser-free mode (using HtmlUnit). Because it is intended for UI verification by page scraping, it may well serve your purposes.
In my experience it can sometimes struggle with very slow JavaScript, but with careful use of "wait" commands you can get quite reliable results.
It also has the benefit that you can actually drive the page, not just scrape it. That means that if you perform some actions on the page before you get to the data you want (click the search button, click next, now scrape) then you can code that into the process.
I don't know if you'll be able to get the full DOM in a navigable form from Selenium, but it does provide XPath retrieval for the various parts of the page, which is what you'd normally need for a scraping application.
You can use Java or Groovy, with or without Grails, together with WebDriver, Selenium, Spock and Geb. These are meant for testing, but the libraries are useful for your case.
You can implement a crawler that does not open a new browser window but only runs an instance of one of these browsers in the background.
Selenium : http://code.google.com/p/selenium/
Webdriver : http://seleniumhq.org/projects/webdriver/
Spock : http://code.google.com/p/spock/
Geb : http://www.gebish.org/manual/current/testing.html
MozSwing could help http://confluence.concord.org/display/MZSW/Home.
You can try JExplorer.
For more information see http://www.teamdev.com/downloads/jexplorer/docs/JExplorer-PGuide.html
You can also try Cobra, see http://lobobrowser.org/cobra.jsp
I haven't tried this project, but I have seen several implementations for node.js that include javascript dom manipulation.
https://github.com/tmpvar/jsdom
Related
I am using JSOUP to fetch the documents from a website.
Below is my code
String webPageUrl = "https://mwcc.ms.gov/#/electronicDataInterchange";
Document doc = Jsoup.connect(webPageUrl).get();
Elements links = doc.getElementsByAttribute("a[href]");
The line below is not working. It is supposed to return the matching elements but doesn't:
doc.getElementsByAttribute("a[href]")
Can someone please point out the mistake in my code?
That page seems to be an Angular application, which means it loads some (probably all or most) of its content via JavaScript scripts.
The fact that the URL contains the fragment separator # is already a strong indicator of that, because when you make an HTTP request, everything from that separator onwards is cut off (i.e. not sent to the server), so the actual request will just be for https://mwcc.ms.gov/.
As far as I know JSoup does not support running JavaScript, so you might need to look into a more involved scraping tool (possibly running a full browser engine).
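To see the fragment point concretely, here is a small sketch using only java.net.URI that splits the URL from the question into the part an HTTP client would request and the fragment that stays on the client. The class and method names (FragmentDemo, withoutFragment, fragmentOf) are made up for this illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration only.
final class FragmentDemo {

    // Rebuilds the URL without its fragment, i.e. what is actually sent to the server.
    static String withoutFragment(String url) {
        URI uri = URI.create(url);
        try {
            return new URI(uri.getScheme(), uri.getAuthority(), uri.getPath(), uri.getQuery(), null)
                    .toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Returns the part after '#', which only client-side JavaScript ever sees.
    static String fragmentOf(String url) {
        return URI.create(url).getFragment();
    }

    public static void main(String[] args) {
        String url = "https://mwcc.ms.gov/#/electronicDataInterchange";
        System.out.println(withoutFragment(url)); // https://mwcc.ms.gov/
        System.out.println(fragmentOf(url));      // /electronicDataInterchange
    }
}
```

So a plain HTTP fetch of that URL receives only the application shell served at the root path; the route named in the fragment is resolved later by the page's own scripts.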
We can successfully convert an SVG into an image with Batik, however, I need to convert a whole HTML div, with SVG implemented within, along with its CSS presentation code, into an image.
Is there any modules / support within Batik or some other Java API for achieving this?
The Selenium library for Java may help you. It can run a browser (IE, Chrome, Firefox, etc.) in the background, and you can load an HTML page and take a snapshot of the content.
Although it's designed for testing and automation, it's the only approach I can suggest.
Hope it helps.
http://www.seleniumhq.org/
We had the same problem, and we solved it by spawning a PhantomJS process.
PhantomJS takes a JavaScript file that instructs its headless browser what to do.
You can wait until the page is fully loaded and then you can print the output into the console as a data URI.
Below is a very simple example from my PhantomJS scripts:
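var page = require( "webpage" ).create();
// The first command-line argument is a JSON string with "url" and optional "payload".
var options = JSON.parse( phantom.args[ 0 ] );
page.open( options.url, "POST", decodeURIComponent( options.payload || "" ), function( status ) {
    if ( status === "fail" ) {
        phantom.exit( 1 );
        return; // phantom.exit() does not stop the callback immediately
    }
    // Render the loaded page to a Base64-encoded PNG and write it to stderr.
    var contents = page.renderBase64( "png" );
    require( "system" ).stderr.write( contents );
    phantom.exit( 0 ); // without this the PhantomJS process never terminates
});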
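On the Java side, spawning that process and collecting the image could look roughly like the sketch below. It assumes a phantomjs binary on the PATH, a script file called render.js containing a script like the one above, and that the script exits when it is done (otherwise reading stderr blocks forever). The class name PhantomRunner and its methods are invented for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

// Hypothetical wrapper around a PhantomJS child process.
final class PhantomRunner {

    // Command line: <phantomjs binary> <script> <JSON options read by the script>.
    static List<String> buildCommand(String phantomBinary, String scriptPath, String jsonOptions) {
        return Arrays.asList(phantomBinary, scriptPath, jsonOptions);
    }

    // The script writes the PNG as Base64 text; turn it back into raw bytes.
    static byte[] decodePng(String base64) {
        return Base64.getDecoder().decode(base64.trim());
    }

    static byte[] render(String phantomBinary, String scriptPath, String jsonOptions)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(phantomBinary, scriptPath, jsonOptions))
                .redirectOutput(ProcessBuilder.Redirect.INHERIT) // stdout is not needed here
                .start();
        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        try (InputStream stderr = process.getErrorStream()) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = stderr.read(buffer)) != -1) {
                captured.write(buffer, 0, read);
            }
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("PhantomJS exited with code " + exitCode);
        }
        return decodePng(new String(captured.toByteArray(), StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        // Only prints the command; call render(...) with the same arguments to run it.
        String options = "{\"url\":\"http://example.com/\",\"payload\":\"\"}";
        System.out.println(buildCommand("phantomjs", "render.js", options));
    }
}
```

The returned bytes can be written straight to a .png file or sent back in an HTTP response.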
This is not an easy task, as what you are asking for is the process called "HTML rendering", which is basically what browsers have been trying to implement correctly for over two decades.
If the CSS you need rendered is fairly simple (no CSS3, no fancy stuff, etc.), then there is a high chance that one of the open-source renderers would be able to handle that (PhantomJS as an example). See @gustavohenke's answer for more details.
If the CSS is moderately complex and if you are able to modify it if needed - then there are some fast but non-free renderers, like PrinceXML and DocRaptor.
If the CSS could be very complex and you are not able to make it simpler - then the only option would be to render it in a real browser. You can use Selenium for that, as it has a way of running the browser, rendering your HTML in it and "screenshotting" the result, all in automated fashion. See @Jorgeblom's answer for more details.
I am using Vaadin to create a web app. I want to import legacy-styles.css into my styles.css.
My styles.css is as follows:
@import "../reindeer/legacy-styles.css";
.v-app {
background: yellow;
}
Then I use Modernizr to target IE8:
Element head = response.getDocument().head();
Element meta = head.appendElement("meta");
meta.attr("name", "viewport");
meta.attr("content", "width=device-width, initial-scale=1, maximum-scale=1");
// Meta tag to force IE8 to standard mode
String ie8Meta = "<meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge,chrome=1\">";
head.prepend(ie8Meta);
// some other stuffs ...
// Adding modernizr library to target ie8
// http://modernizr.com/docs/
String modernizr = "<script type=\"text/javascript\" src=\"//cdnjs.cloudflare.com/ajax/libs/modernizr/2.6.2/modernizr.min.js\"></script>";
head.prepend(modernizr);
Strangely, legacy-styles.css loads properly in IE10, but it doesn't load in IE8.
The error reported is
com.vaadin.client.VConsole
SEVERE: CSS files may have not loaded properly.
I have tried to rearrange modernizr.js (using append instead of prepend), but that didn't work.
Before we dig into the problem at hand, using @import is a super bad idea. While it is a convenient thing to do for you, it means that every single user that comes to your site will have to download a file (the document itself), in order to download another file (the one containing the @import rule) just to download yet another file. In addition to DNS resolution and TCP handshakes, you are looking at possibly up to a dozen round trips between your server and their computer just to get this one single document.
It would be much better to use a CSS precompiler, such as Sass or Less, and concatenate the older file with the new one into a single, compressed file.
Next, Modernizr shouldn't be used to target IE8, or any browser for that matter. That is actually completely against the purpose of Modernizr - which specializes in feature detection, rather than browser detection. That means that rather than saying you want to "target IE8", you would think more along the lines of wanting to "target browsers lacking feature X" (SVG, geolocation, rounded corners, etc.).
That being said, there are a number of reasons to specifically target versions of Internet Explorer, but Modernizr is not the way to do it. The more correct way to do this would be using IE's conditional comments:
<!--[if IE 8]>
//IE-specific styles here
<![endif]-->
Nothing you have shown has anything to do with IE-specific code, and since Modernizr doesn't add any features itself, it isn't really clear why IE10 would do anything different at all.
I created a basic GWT (Google Web Toolkit) Ajax application, and now I'm trying to create snapshots so that crawlers can read the page.
I created a servlet that responds to the crawlers, using HtmlUnit.
My application runs perfectly in a browser. But in HtmlUnit, it throws a lot of errors about the special characters I have in the HTML. These characters are content, and I would rather not replace them with character entities just because of HtmlUnit, since the page currently works. (At the very least I should first check whether I'm using HtmlUnit correctly.)
I think HtmlUnit should read the charset information of the page and render it as a browser would, since that is the objective of the project.
I haven't found good information about this problem. Is this an HtmlUnit limitation? Do I need to change all the content of my website to use this Java library to take snapshots?
Here's my code:
if ((queryString != null) && (queryString.contains("_escaped_fragment_"))) {
    // ok its the crawler
    // rewrite the URL back to the original #! version
    // remember to unescape any %XX characters
    url = URLDecoder.decode(url, "UTF-8");
    String ajaxURL = url.replace("?_escaped_fragment_=", "#!");
    final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24);
    HtmlPage page = webClient.getPage(ajaxURL);
    // important! Give the headless browser enough time to execute JavaScript
    // The exact time to wait may depend on your application.
    webClient.waitForBackgroundJavaScript(3000);
    // return the snapshot
    response.getWriter().write(page.asXml());
}
The problem was the XML conflicting with the HTML. @ColinAlworth's comments helped me.
I followed Google's example, and it was not working.
To make it work, you need to remove the XML declaration and send back just the HTML, changing the line:
// return the snapshot
response.getWriter().write(page.asXml());
to
response.getWriter().write(page.asXml().replaceFirst("<\\?.*>",""));
Now it's rendering.
But although it is being rendered, the CSS is not working and the DOM is not updated (GWT updates the page title when the page opens). HtmlUnit threw a lot of errors about CSS, even though I'm using Twitter Bootstrap without any changes. Apparently HtmlUnit has a lot of bugs: good for small tests, but not for parsing complex (or even simple) HTML.
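A note on the replaceFirst call above: the pattern "<\\?.*>" is greedy, so if the XML declaration and the markup ever end up on the same line it removes that whole line, not just the declaration. Below is a sketch of a narrower pattern that only matches a leading <?xml ... ?> declaration. The class name XmlDeclarationStripper is made up, and the sample strings are hand-written stand-ins for what page.asXml() returns.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes only a leading XML declaration from serialized markup.
final class XmlDeclarationStripper {

    // \A anchors at the very start; [^>]* cannot run past the end of the declaration.
    private static final Pattern XML_DECLARATION =
            Pattern.compile("\\A\\s*<\\?xml[^>]*\\?>\\s*");

    static String strip(String markup) {
        return XML_DECLARATION.matcher(markup).replaceFirst("");
    }

    public static void main(String[] args) {
        String oneLine =
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?><html><body>snapshot</body></html>";
        // Narrow pattern keeps the markup.
        System.out.println(strip(oneLine));
        // Greedy pattern from the answer swallows the whole line.
        System.out.println("[" + oneLine.replaceFirst("<\\?.*>", "") + "]");
    }
}
```

In the servlet this would be used as response.getWriter().write(XmlDeclarationStripper.strip(page.asXml())).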
What's the best way to externalize large quantities of HTML in a GWT app? We have a rather complicated GWT app of about 30 "pages"; each page has a sort of guide at the bottom that is several paragraphs of HTML markup. I'd like to externalize the HTML so that it can remain as "unescaped" as possible.
I know and understand how to use property files in GWT; that's certainly better than embedding the content in Java classes, but still kind of ugly for HTML (you need to backslashify everything, as well as escape quotes, etc.)
Normally this is the kind of thing you would put in a JSP, but I don't see any equivalent to that in GWT. I'm considering just writing a widget that will simply fetch the content from html files on the server and then add the text to an HTML widget. But it seems there ought to be a simpler way.
I've used ClientBundle in a similar setting. I've created a package my.resources and put my HTML document and the following class there:
package my.resources;
import com.google.gwt.core.client.GWT;
import com.google.gwt.resources.client.ClientBundle;
import com.google.gwt.resources.client.TextResource;
public interface MyHtmlResources extends ClientBundle {
public static final MyHtmlResources INSTANCE = GWT.create(MyHtmlResources.class);
@Source("intro.html")
public TextResource getIntroHtml();
}
Then I get the content of that file by calling the following from my GWT client code:
HTML htmlPanel = new HTML();
String html = MyHtmlResources.INSTANCE.getIntroHtml().getText();
htmlPanel.setHTML(html);
See http://code.google.com/webtoolkit/doc/latest/DevGuideClientBundle.html for further information.
You can use some templating mechanism. Try FreeMarker or Velocity templates. Your HTML will live in files that are retrieved by the templating library. These files can be named with proper extensions, e.g. .html, .css, .js, and viewed on their own.
I'd say you load the external html through a Frame.
Frame frame = new Frame();
frame.setUrl(GWT.getModuleBase() + getCurrentPageHelp());
add(frame);
You can arrange some convention or lookup for the getCurrentPageHelp() to return the appropriate path (eg: /manuals/myPage/help.html)
Here's an example of frame in action.
In GWT 2.0, you can do this using the UiBinder.
<ui:UiBinder xmlns:ui='urn:ui:com.google.gwt.uibinder'>
<div>
Hello, <span ui:field='nameSpan'/>, this is just good ol' HTML.
</div>
</ui:UiBinder>
These files are kept separate from your Java code and can be edited as HTML. They also provide integration with GWT widgets, so that you can easily access elements within the HTML from your GWT code.
GWT 2.0, when released, should have a ClientBundle, which probably tackles this need.
You could try implementing a Generator to load external HTML from a file at compile time and build a class that emits it. There doesn't seem to be too much help online for creating generators but here's a post to the GWT group that might get you started: GWT group on groups.google.com.
I was doing similar research and, so far, the best way to approach this problem seems to be DeclarativeUI or UiBinder. Unfortunately it is still in the incubator, so we need to work around the problem.
I solve it in couple of different ways:
Active overlay, i.e. you create your standard HTML/CSS and inject the GWT code via a <script> tag. Everywhere you need to access an element from GWT code, you write something like this:
RootPanel.get("element-name").setVisible(false);
You write your code 100% GWT and then, if a big HTML chunk is needed, you bring it to the client either via IFRAME or via AJAX and then inject it via HTML panel like this:
String html = "<div id='one' "
+ "style='border:3px dotted blue;'>"
+ "</div><div id='two' "
+ "style='border:3px dotted green;'"
+ "></div>";
HTMLPanel panel = new HTMLPanel(html);
panel.setSize("200px", "120px");
panel.addStyleName("demo-panel");
panel.add(new Button("Do Nothing"), "one");
panel.add(new TextBox(), "two");
RootPanel.get("demo").add(panel);
Why not use a good old IFRAME? Just create an iframe where you wish to put a hint and change its location when the GWT 'page' changes.
Advantages:
Hints are stored in separate maintainable HTML files of any structure
AJAX-style loading with no coding at all on server side
If needed, application could still interact with loaded info
Disadvantages:
Each hint file should have link to shared CSS for common look-and-feel
Hard to internationalize
To make this approach a bit better, you might handle loading errors and redirect to default language/topic on 404 errors. So, search priority will be like that:
Current topic for current language
Current topic for default language
Default topic for current language
Default error page
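The search priority above can be sketched as a small helper that returns the candidate hint files in order; the caller would try each one until it loads without a 404. The directory layout (hints/<language>/<topic>.html), the error page path and all names here are assumptions made for illustration, not something GWT prescribes.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper that orders hint files from most to least specific.
final class HintLocator {

    static final String ERROR_PAGE = "hints/error.html"; // assumed location

    static List<String> candidates(String topic, String language,
                                   String defaultTopic, String defaultLanguage) {
        // LinkedHashSet keeps insertion order and drops duplicates, e.g. when the
        // current language already is the default language.
        Set<String> ordered = new LinkedHashSet<>();
        ordered.add(path(language, topic));            // current topic, current language
        ordered.add(path(defaultLanguage, topic));     // current topic, default language
        ordered.add(path(language, defaultTopic));     // default topic, current language
        ordered.add(ERROR_PAGE);                       // default error page
        return new ArrayList<>(ordered);
    }

    private static String path(String language, String topic) {
        return "hints/" + language + "/" + topic + ".html";
    }

    public static void main(String[] args) {
        System.out.println(candidates("search", "de", "index", "en"));
    }
}
```

Keeping the ordering in one place means the iframe-handling code only has to walk the list and set the frame location to the first entry that exists.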
I think it's quite easy to create a GWT component that encapsulates these iframe interactions.
The GWT Portlets framework (http://code.google.com/p/gwtportlets/) includes a WebAppContentPortlet. This serves up any content from your web app (static HTML, JSPs etc.). You can put it on a page with additional functionality in other Portlets and everything is fetched with a single async call when the page loads.
Have a look at the source for WebAppContentPortlet and WebAppContentDataProvider to see how it is done or try using the framework itself. Here are the relevant bits of source:
WebAppContentPortlet (client side)
((HasHTML)getWidget()).setHTML(html == null ? "<i>Web App Content</i>" : html);
WebAppContentDataProvider (server side):
HttpServletRequest servletRequest = req.getServletRequest();
String path = f.path.startsWith("/") ? f.path : "/" + f.path;
RequestDispatcher rd = servletRequest.getRequestDispatcher(path);
BufferedResponse res = new BufferedResponse(req.getServletResponse());
try {
rd.include(servletRequest, res);
res.getWriter().flush();
f.html = new String(res.toByteArray(), res.getCharacterEncoding());
} catch (Exception e) {
log.error("Error including '" + path + "': " + e, e);
f.html = "Error including '" + path +
"'<br>(see server log for details)";
}
You can use servlets with jsps for the html parts of the page and still include the javascript needed to run the gwt app on the page.
I'm not sure I understand your question, but I'm going to assume you've factored out this common summary into its own widget. If so, the problem is that you don't like the ugly way of embedding HTML into the Java code.
GWT 2.0 has UiBinder, which allows you to define the GUI in raw HTMLish template, and you can inject values into the template from the Java world. Read through the dev guide and it gives a pretty good outline.
Take a look at
http://code.google.com/intl/es-ES/webtoolkit/doc/latest/DevGuideClientBundle.html
You can try a GWT app with HTML templates generated and bound at run time rather than at compile time.
I don't know GWT, but can't you define an anchor div tag in your app's HTML, then perform a GET against the HTML files that you need, and append the result to the div? How different would this be from a micro-template?
UPDATE:
I just found this nice jQuery plugin in an answer to another StackOverflow question.