I am currently working on getting the source code of a specific web page in a file using Java.
The web page is: http://www.studenti.ict.uniba.it/esse3/ListaAppelliOfferta.do
I wrote some code to do that:
try {
    URL url = new URL("http://www.studenti.ict.uniba.it/esse3/ListaAppelliOfferta.do");
    URLConnection urlConn = url.openConnection();
    // Read from the connection that was just opened instead of opening a second stream.
    BufferedReader dis = new BufferedReader(new InputStreamReader(urlConn.getInputStream()));
    String s;
    while ((s = dis.readLine()) != null) {
        System.out.println(s);
    }
    dis.close();
} catch (MalformedURLException mue) {
    mue.printStackTrace();
} catch (IOException ioe) {
    ioe.printStackTrace();
}
This works fine.
The problem is I want to "simulate" a user selecting "[1020] Dipartimento di Informatica" in Facoltà and "[1102] Informatica e Tecnologie per la produzione del Software" in Corso di Studio and then the user clicking on "Avvia Ricerca" which starts a search and shows a table with the results.
The goal is to obtain the source code of the web page that also contains the information in the table I need.
I noticed that if I manually do those selections and then click "Avvia Ricerca" to start the search, the web page is loaded again showing the data in the table I need, but the URL does not change.
So even though the page is now showing the data I need, with my code I can only get the source code of the page as it is BEFORE making the selections and running the search.
I've done similar things with HtmlUnit (http://htmlunit.sourceforge.net) before; it works quite well for simulating user interaction with websites and for scraping.
I would suggest opening the page in the browser's developer tools (Ctrl-Shift-I), checking which URLs are fetched when you make your selections, and then programming those fetches in your Java app.
The downside of this approach is if the page implementation changes your code will break.
Another alternative is to run the page Javascript in a browser sandbox. That is also error-prone and can even be unsafe.
Normally, you could just send this information via GET/POST (for example with url?department=xy), but in your case it's more complicated, as the site uses JSF and generates a session ID in which the chosen department is tracked (for example "http://www.studenti.ict.uniba.it/esse3/ListaAppelliOfferta.do;jsessionid=365EB9843B2872E73067693A6095BA35").
Depending on what you want to do, you could use Selenium (http://docs.seleniumhq.org/). It drives the browser, so you can locate your elements (for example the department dropdown by its name, fac_id) and set their values (for example with selectByValue after wrapping the element in a Select, documented here: http://selenium.googlecode.com/git/docs/api/java/org/openqa/selenium/support/ui/Select.html).
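A rough sketch of that Selenium approach: the element name fac_id comes from the page, while the name of the course dropdown and the locator for the "Avvia Ricerca" button are assumptions you would need to verify against the actual page source.

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.support.ui.Select;

public class Esse3Search {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        driver.get("http://www.studenti.ict.uniba.it/esse3/ListaAppelliOfferta.do");

        // Select the department and the degree course by their option values.
        new Select(driver.findElement(By.name("fac_id"))).selectByValue("1020");
        new Select(driver.findElement(By.name("cds_id"))).selectByValue("1102");  // "cds_id" is assumed

        // Click "Avvia Ricerca" (assuming it is an <input> whose value is the button label).
        driver.findElement(By.xpath("//input[@value='Avvia Ricerca']")).click();

        // The rendered page source should now include the results table.
        System.out.println(driver.getPageSource());
        driver.quit();
    }
}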
If you need to do it without Selenium (for example because it has to run from the command line without a browser), you can try deactivating cookies; the parameters should then be sent as GET or POST parameters, which you can inspect e.g. with Firebug. That is the harder solution, though; Selenium would be much easier to use.
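If you do go down that road, the request could be replayed with a plain HttpURLConnection, roughly as sketched below. The parameter names (fac_id, cds_id) are assumptions based on the discussion above and must be copied from the real form; a JSF page will typically also require the JSESSIONID cookie and the javax.faces.ViewState value obtained from an initial GET.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class FormPost {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.studenti.ict.uniba.it/esse3/ListaAppelliOfferta.do");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");

        // Hypothetical parameter names -- copy the real ones from the form's <select> elements.
        // A JSF form will usually also need the javax.faces.ViewState field and the session cookie.
        String body = "fac_id=" + URLEncoder.encode("1020", "UTF-8")
                + "&cds_id=" + URLEncoder.encode("1102", "UTF-8");
        try (OutputStreamWriter out = new OutputStreamWriter(conn.getOutputStream(), "UTF-8")) {
            out.write(body);
        }

        // Print the response: if the request matches what the browser sends,
        // this is the page source including the results table.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}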
I am using JSOUP to fetch the documents from a website.
Below is my code
String webPageUrl = "https://mwcc.ms.gov/#/electronicDataInterchange";
Document doc = Jsoup.connect(webPageUrl).get();
Elements links = doc.getElementsByAttribute("a[href]");
The line of code below is not working. It is supposed to return elements but doesn't:
doc.getElementsByAttribute("a[href]")
Can someone please point out the mistake in my code?
That page seems to be an Angular application, which means it loads some (probably all or most) of its content via JavaScript.
The fact that the URL contains the fragment separator # is already a strong indicator of that, because everything after that separator is cut off from an HTTP request (i.e. not sent to the server), so the actual request will just be for https://mwcc.ms.gov/.
As far as I know JSoup does not support running JavaScript, so you might need to look into a more involved scraping tool (possibly running a full browser engine).
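As a hedged sketch of that "more involved" route, you could let a headless browser execute the JavaScript first (HtmlUnit is used that way elsewhere on this page, though whether it copes with this particular Angular app is not guaranteed) and then hand the rendered markup to Jsoup. Note also that Jsoup's getElementsByAttribute() expects a bare attribute name; a CSS query such as a[href] belongs in select().

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class RenderedScrape {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient(BrowserVersion.CHROME);
        webClient.getOptions().setThrowExceptionOnScriptError(false); // Angular apps tend to trip strict JS errors
        HtmlPage page = webClient.getPage("https://mwcc.ms.gov/#/electronicDataInterchange");
        webClient.waitForBackgroundJavaScript(5000); // give the client-side rendering time to finish

        // Parse the rendered DOM with Jsoup and use a CSS selector, not getElementsByAttribute().
        Document doc = Jsoup.parse(page.asXml());
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }
        webClient.close();
    }
}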
The Java API for CICS is here. Does anyone know if there is any method to add a couple of radio buttons to a web form using this API?
Here's my code to create the radio buttons:
HttpRequest req = HttpRequest.getHttpRequestInstance();
String msg = "ZEUSBANK ANTI-FRAUD CHECK BY SHE0008.<br> "
+ "When investigation is complete. Tick the check box and submit.<br>";
String template = "<form><input type=\"radio\"> YES<br><input type=\"radio\"> NO<br></form>";
HttpResponse resp = new HttpResponse();
Document doc = new Document();
doc.createText(msg);
doc.appendFromTemplate(template);
resp.setMediaType("text/plain");
resp.sendDocument(doc, (short)200, "OK", ASCII);
But when I open it in a browser, it prints plain text and doesn't render the HTML tags.
Fixed it. I just changed the media type from text/plain to text/html and it works.
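For reference, the only change to the code above is the media type passed before sending the document:

// Let the browser render the markup instead of displaying it literally.
resp.setMediaType("text/html");
resp.sendDocument(doc, (short)200, "OK", ASCII);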
As you've already discovered, you needed to send the request with the text/html content type.
If you're planning to do more Java web-based work through CICS Java, you might want to investigate the embedded WebSphere Liberty. It adds support for Java EE features, including JSF, JSP and Servlets, which can make web development in Java a lot easier.
Tri,
I haven't used CICS for 15 years, so I doubt I'm an expert anymore. But looking quickly at the API, it seems like all the presentation logic would be in your regular Java code. You would then format appropriate messages and invoke the CICS API to update the server & get a response.
There doesn't seem to be any 'BMS-related' methods at all (which is a good thing).
The only 'field' method I see is com.ibm.cics.server.FormField but that only has get() methods, not set().
Are you just starting with Java CICS, or are you just stuck on this particular issue? If you have some sample code of what you are trying, post it so we can see if anyone has any ideas.
HTH, Jim
In GWT we need to use # in a URL to navigate from one page to another, i.e. for creating history, e.g. www.abc.com/#questions/10245857. Because of this I am facing a problem when sharing the URL: Google scrapers read the URL only up to the #, i.e. www.abc.com.
Now I want to remove the # from my URL and keep it plain, as www.abc.com/question/10245857.
I am unable to do so. How can I do this?
When the user navigates the app I use hash URLs and the History object (so as not to reload the pages). However, sometimes it's nice/needed to have a pretty URL (e.g. for sharing, showing in public, etc.), so I would like to know how to provide the pretty URL of the same page.
Note:
We have to do this to make our web pages' URLs crawlable and to link the website with the outside world.
There are 3 issues here, and each can be solved:
The URL should appear prettier to the user
Going directly to the pretty URL should work.
WebCrawlers should be able to get the content
These may all seem like the same issue, but they are quite distinct in this context.
Display Pretty URLs
This can be done with a small JavaScript file which uses the HTML5 history state methods. You can see a simple demo here, with source here. This makes all changes to "#" appear without the "#" (on modern browsers).
Relevant code from the fiddle:
var stateObj = {locationHash: hash};
history.replaceState(stateObj, "Page Title", baseURL + hash.substring(1));
Respond to Pretty URLs
This is relatively simple, as long as you have a listener in GWT to load based on the "#" at page load already. You can just throw up a simple re-direct servlet which reinserts the "#" mark where it belongs when requests come in.
For a servlet, listening for the pretty URL:
if(request.getPathInfo()!=null && request.getPathInfo().length()>1){
response.sendRedirect("#" + request.getPathInfo());
return;
}
Alternatively, you can serve up your GWT app directly from this servlet, and initialize it with parameters from the URL, but there's a bit of relative-path bookkeeping to be aware of.
WebCrawlers
This is the trickiest one. Basically you can't get around having static(ish) pages here. That's not too hard if there are a finite set of simple states that you're indexing. One simple scheme is to have a separate servlet which returns the raw content you normally fetch with GWT, in minimal formatted HTML. This servlet can have a different URL pattern like "/indexing/". These wouldn't be meant for humans, just for the webcrawlers. You can attach a simple javascript in the <head> to redirect users to the pretty url once the page loads.
Here's an example for the doGet method of such a servlet:
response.setContentType("text/html;charset=UTF-8");
response.setStatus(200);
PrintWriter pw = response.getWriter();
pw.println("<html>");
pw.println("<head><script>");
pw.println("window.location.href='http://www.example.com/#"
+ request.getPathInfo() + "';");
pw.println("</script></head>");
pw.println("<body>");
pw.println(getRawPageContent(request.getPathInfo()));
pw.println("</body>");
pw.println("</html>");
pw.flush();
pw.close();
return;
You should then just have some links to these indexing pages hidden somewhere on your main app URL (or behind a link on your main app URL).
I created a basic GWT (Google Web Toolkit) Ajax application, and now I'm trying to create snapshots so that crawlers can read the page.
I created a Servlet to respond to the crawlers, using HtmlUnit.
My application runs perfectly in a browser. But in HtmlUnit it throws a lot of errors about the special characters I have in the HTML. These characters are content, and I would rather not replace them with HTML entities just because of HtmlUnit, since everything currently works. (At least I should first check whether I'm using HtmlUnit correctly.)
I think HtmlUnit should read the charset information of the page and render it like a browser does, since that is the objective of the project.
I haven't found good information about this problem. Is this an HtmlUnit limitation? Do I need to change all the content of my website just so this Java library can take snapshots?
Here's my code:
if ((queryString != null) && (queryString.contains("_escaped_fragment_"))) {
// ok its the crawler
// rewrite the URL back to the original #! version
// remember to unescape any %XX characters
url = URLDecoder.decode(url, "UTF-8");
String ajaxURL = url.replace("?_escaped_fragment_=", "#!");
final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24);
HtmlPage page = webClient.getPage(ajaxURL);
// important! Give the headless browser enough time to execute JavaScript
// The exact time to wait may depend on your application.
webClient.waitForBackgroundJavaScript(3000);
// return the snapshot
response.getWriter().write(page.asXml());
The problem was the XML declaration conflicting with the HTML. @ColinAlworth's comments helped me.
I followed Google's example, and it was not working.
To make it work, you need to strip the XML declaration and return just the HTML, changing the line:
// return the snapshot
response.getWriter().write(page.asXml());
to
response.getWriter().write(page.asXml().replaceFirst("<\\?.*>",""));
Now it's rendering.
But although it is being rendered, the CSS is not working, and the DOM is not updated (GWT updates the page title when the page opens). HtmlUnit threw a lot of errors about CSS, and I'm using Twitter Bootstrap without any changes. Apparently the HtmlUnit project has a lot of bugs; it's good for small tests, but not for parsing complex (or even simple) HTML.
What are the best Java libraries to "fully download any webpage, render the built-in JavaScript(s), and then access the rendered webpage (that is, the DOM tree!) programmatically" and get the DOM tree as HTML source?
(Something similarly what firebug does in the end, it renders the page and I get access to the fully rendered DOM Tree, as the page looks like in the browser! In contrast, if I click "show source" I only get the JavaScript source code. This is not what I want. I need to have access to the rendered page...)
(With rendering I mean only rendering the DOM Tree not a visual rendering...)
This does not have to be one single library, it's ok to have several libraries that can accomplish this together (one will download, one render...), but due to the dynamic nature of JavaScript most likely the JavaScript library will also have to have some kind of downloader to fully render any asynchronous JS...
Background:
In the "good old days" HttpClient (Apache Library) was everything required to build your own very simple crawler. (A lot of cralwers like Nutch or Heretrix are still built around this core princible, mainly focussing on Standard HTML parsing, so I can't learn from them)
My problem is that I need to crawl some websites that rely heavily on JavaScript and that I can't parse with HttpClient as I defenitely need to execute the JavaScripts before...
You can use the JavaFX 2 WebEngine. Download the JavaFX SDK (you may already have it if you installed JDK7u2 or later) and try the code below.
It will print the HTML with the JavaScript already processed.
You can uncomment lines in the middle to see rendering as well.
import java.io.IOException;

import javafx.application.Application;
import javafx.beans.value.ChangeListener;
import javafx.beans.value.ObservableValue;
import javafx.scene.Scene;
import javafx.scene.web.WebEngine;
import javafx.scene.web.WebView;
import javafx.stage.Stage;

// Xerces serializer classes as bundled inside JDK 7/8; with a standalone
// Xerces jar use org.apache.xml.serialize.* instead.
import com.sun.org.apache.xml.internal.serialize.OutputFormat;
import com.sun.org.apache.xml.internal.serialize.XMLSerializer;

public class WebLauncher extends Application {

    @Override
    public void start(Stage stage) {
        final WebView webView = new WebView();
        final WebEngine webEngine = webView.getEngine();
        webEngine.load("http://stackoverflow.com");
        //stage.setScene(new Scene(webView));
        //stage.show();
        webEngine.getLoadWorker().workDoneProperty().addListener(new ChangeListener<Number>() {
            @Override
            public void changed(ObservableValue<? extends Number> observable, Number oldValue, Number newValue) {
                if (newValue.intValue() == 100 /*percent*/) {
                    try {
                        // Serialize the rendered DOM (JavaScript already executed) to stdout.
                        org.w3c.dom.Document doc = webEngine.getDocument();
                        new XMLSerializer(System.out, new OutputFormat(doc, "UTF-8", true)).serialize(doc);
                    } catch (IOException ex) {
                        ex.printStackTrace();
                    }
                }
            }
        });
    }

    public static void main(String[] args) {
        launch();
    }
}
This is a bit outside of the box, but if you are planning on running your code in a server where you have complete control over your environment, it might work...
Install Firefox (or XulRunner, if you want to keep things lightweight) on your machine.
Using the Firefox plugins system, write a small plugin which loads a given URL, waits a few seconds, then copies the page's DOM into a String.
From this plugin, use the Java LiveConnect API (see http://jdk6.java.net/plugin2/liveconnect/ and https://developer.mozilla.org/en/LiveConnect ) to push that string across to a public static function in some embedded Java code, which can either do the required processing itself or farm it out to some more complicated code.
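A minimal sketch of what that receiving Java side could look like; the class and method names are hypothetical, and the LiveConnect wiring from the plugin's JavaScript is omitted.

package com.example.scraper;

// Hypothetical receiver class: the browser plugin hands over the serialized DOM via LiveConnect.
public class DomReceiver {

    // Called from the plugin's JavaScript with the rendered page's DOM as a string.
    public static void receiveDom(String url, String renderedHtml) {
        // Either do the required processing here or hand it off to more complicated code.
        System.out.println("Received " + renderedHtml.length() + " characters of rendered HTML from " + url);
    }
}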
Benefits: You are using a browser that most application developers target, so the observed behavior should be comparable. You can also upgrade the browser along the normal upgrade path, so your library won't become out-of-date as HTML standards change.
Disadvantages: You will need to have permission to start a non-headless application on your server. You'll also have the complexity of inter-process communication to worry about.
I have used the plugin API to call Java before, and it's quite achievable. If you'd like some sample code, you should take a look at the XQuery plugin - it loads XQuery code from the DOM, passes it across to the Java Saxon library for processing, then pushes the result back into the browser. There are some details about it here:
https://developer.mozilla.org/en/XQuery
The Selenium library is normally used for testing, but it does give you remote control of most standard browsers (IE, Firefox, etc.) as well as a headless, browser-free mode (using HtmlUnit). Because it is intended for UI verification by page scraping, it may well serve your purposes.
In my experience it can sometimes struggle with very slow JavaScript, but with careful use of "wait" commands you can get quite reliable results.
It also has the benefit that you can actually drive the page, not just scrape it. That means that if you perform some actions on the page before you get to the data you want (click the search button, click next, now scrape) then you can code that into the process.
I don't know if you'll be able to get the full DOM in a navigable form from Selenium, but it does provide XPath retrieval for the various parts of the page, which is what you'd normally need for a scraping application.
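A small sketch of that kind of Selenium usage, with a placeholder URL and XPath: drive the page headlessly, pull a fragment out by XPath, or grab the whole rendered page source.

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class SeleniumScrape {
    public static void main(String[] args) {
        WebDriver driver = new HtmlUnitDriver(true); // true = enable JavaScript, headless
        driver.get("http://www.example.com/");       // placeholder URL

        // For slow JavaScript, an explicit WebDriverWait before querying helps.
        // Retrieve one part of the page by XPath, as you typically would when scraping.
        WebElement heading = driver.findElement(By.xpath("//h1")); // placeholder XPath
        System.out.println(heading.getText());

        // Or take the entire rendered page source.
        System.out.println(driver.getPageSource());
        driver.quit();
    }
}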
You can use Java, or Groovy with or without Grails. Then use WebDriver, Selenium, Spock and Geb; these are intended for testing, but the libraries are useful for your case.
You can implement a crawler that won't open a new browser window but just runs one of these browser runtimes.
Selenium : http://code.google.com/p/selenium/
Webdriver : http://seleniumhq.org/projects/webdriver/
Spock : http://code.google.com/p/spock/
Geb : http://www.gebish.org/manual/current/testing.html
MozSwing could help http://confluence.concord.org/display/MZSW/Home.
You can try JExplorer.
For more information see http://www.teamdev.com/downloads/jexplorer/docs/JExplorer-PGuide.html
You can also try Cobra, see http://lobobrowser.org/cobra.jsp
I haven't tried this project, but I have seen several implementations for node.js that include javascript dom manipulation.
https://github.com/tmpvar/jsdom