I need to build a small Java tool that gets keyword suggestions and traffic estimates from the Google Keyword Tool at https://adwords.google.com/select/KeywordToolExternal .
The page is rendered with JavaScript, so simple scraping isn't possible. I have tried HtmlUnit, but it doesn't work (I tried different browser versions, still no luck).
One option could be to embed a web browser in Java, but I haven't had any success with that either.
Any suggestions or alternatives?
Have a look at the AdWords API and its client libraries (Java, Python, etc.).
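If you go the API route, a rough sketch using the old SOAP AdWords Java client library's TargetingIdeaService (the keyword-ideas service of that era) might look like the following. The API version in the package names, the sample seed query, and the properties-file credential setup are all illustrative assumptions; in today's Google Ads API the equivalent is KeywordPlanIdeaService.

import com.google.api.ads.adwords.axis.factory.AdWordsServices;
import com.google.api.ads.adwords.axis.v201809.cm.Paging;
import com.google.api.ads.adwords.axis.v201809.o.AttributeType;
import com.google.api.ads.adwords.axis.v201809.o.IdeaType;
import com.google.api.ads.adwords.axis.v201809.o.RelatedToQuerySearchParameter;
import com.google.api.ads.adwords.axis.v201809.o.RequestType;
import com.google.api.ads.adwords.axis.v201809.o.SearchParameter;
import com.google.api.ads.adwords.axis.v201809.o.TargetingIdea;
import com.google.api.ads.adwords.axis.v201809.o.TargetingIdeaPage;
import com.google.api.ads.adwords.axis.v201809.o.TargetingIdeaSelector;
import com.google.api.ads.adwords.axis.v201809.o.TargetingIdeaServiceInterface;
import com.google.api.ads.adwords.lib.client.AdWordsSession;
import com.google.api.ads.common.lib.auth.OfflineCredentials;
import com.google.api.client.auth.oauth2.Credential;

public class KeywordIdeas {
    public static void main(String[] args) throws Exception {
        // Assumes credentials and developer token are configured in ads.properties.
        Credential credential = new OfflineCredentials.Builder()
            .forApi(OfflineCredentials.Api.ADWORDS)
            .fromFile()
            .build()
            .generateCredential();
        AdWordsSession session = new AdWordsSession.Builder()
            .fromFile()
            .withOAuth2Credential(credential)
            .build();

        TargetingIdeaServiceInterface service = AdWordsServices.getInstance()
            .get(session, TargetingIdeaServiceInterface.class);

        // Ask for keyword ideas related to a seed query, including search volume.
        TargetingIdeaSelector selector = new TargetingIdeaSelector();
        selector.setRequestType(RequestType.IDEAS);
        selector.setIdeaType(IdeaType.KEYWORD);
        selector.setRequestedAttributeTypes(new AttributeType[] {
            AttributeType.KEYWORD_TEXT, AttributeType.SEARCH_VOLUME});

        RelatedToQuerySearchParameter query = new RelatedToQuerySearchParameter();
        query.setQueries(new String[] {"java web scraping"});  // sample seed query
        selector.setSearchParameters(new SearchParameter[] {query});

        Paging paging = new Paging();
        paging.setStartIndex(0);
        paging.setNumberResults(10);
        selector.setPaging(paging);

        TargetingIdeaPage page = service.get(selector);
        for (TargetingIdea idea : page.getEntries()) {
            System.out.println(idea);  // each entry carries the requested attributes
        }
    }
}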
I'm using PHP to scrape some information off webpages, but I've discovered that the information I'm trying to scrape is loaded through some manner of AJAX/JavaScript. I thought I remembered that cURL could step through the JavaScript, but I've found that that's not the case.
I seem to remember some sort of backend "web browser" library/function that could trace through the JavaScript and AJAX to arrive at the final page that a full-featured browser would produce.
Is there a library or function that can do this? Any ideas on how to go about this, other than manually tracing through the scripts/redirects myself? It doesn't have to be pretty; I'm just looking to scrape the resulting text.
Maybe not in PHP, but in other languages there are: Watir/WatiN, Selenium, watir-webdriver/selenium-webdriver, capybara-webkit, Celerity, and PhantomJS; Node.js can also run JavaScript directly. There are also iMacros and similar commercial options.
But I usually find that I can get the data I want without any of these by just looking at the requests the page makes and recreating them, then parsing the response.
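For example, if the page populates itself from a JSON endpoint, you can often skip the browser entirely and call that endpoint yourself. A minimal Java (11+) sketch; the endpoint URL and the header are placeholders you would copy from the network tab of your browser's developer tools:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ReplayXhr {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint: copy the real one from the network tab
        // of your browser's developer tools while the page loads.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://example.com/api/data?id=123"))
            // Some endpoints check this header to distinguish AJAX calls.
            .header("X-Requested-With", "XMLHttpRequest")
            .GET()
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());

        // The body is usually JSON or an HTML fragment; parse it with
        // your JSON library or HTML parser of choice.
        System.out.println(response.body());
    }
}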
I don't think such a library exists. If you're really desperate and you have lots of time on your hands, you can, of course, download the source code of Firefox, for example, and build yourself something useful. However, I don't think this is going to be the best use of your or anybody else's resources.
Note that even Google's indexing bot does not process AJAX. Here is what Google has to say about it. It's quite possible that the site you're dealing with does support this, in which case you can try Google's technique, but on the whole, unfortunately, you're out of luck.
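Concretely, under that (since-retired) AJAX crawling scheme, a participating site maps its "#!" URLs onto an _escaped_fragment_ query parameter and serves a pre-rendered HTML snapshot there. A tiny sketch of the rewrite, with a made-up URL:

public class EscapedFragment {
    public static void main(String[] args) {
        // Google's AJAX crawling scheme: "#!" is rewritten to
        // "?_escaped_fragment_=" and the site serves an HTML snapshot there.
        String ajaxUrl = "http://example.com/page#!state=products";
        String snapshotUrl = ajaxUrl.replace("#!", "?_escaped_fragment_=");
        System.out.println(snapshotUrl);
        // -> http://example.com/page?_escaped_fragment_=state=products
        // Fetch snapshotUrl with a plain HTTP client and parse the static HTML.
    }
}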
I'm new here and I hope I'm not asking something that has already been answered. I have searched everywhere but have yet to find an adequate answer.
My objective is fairly simple: I want to create a program that will stream the live gold and silver rates from this website.
How would I be able to pinpoint the values that I want to download? Currently I have managed to implement this using Microsoft Excel's web query feature, which lets me select a table from the webpage. However, I want to make it a standalone application.
By the way, I need to retrieve the rates to perform a calculation which is then displayed to the user.
I would greatly appreciate any ideas on how this can be achieved.
I think you need to scrape or parse the data from the website. For that, take a look at HTML parsers such as HtmlCleaner or jsoup.
You can use XPath with HtmlCleaner to pinpoint the data. Here is a nice example: XPath Example.
You can use Firefox's Firebug extension to get the XPath of an element, but your XPath is going to be very long, by the look of the website you mentioned.
Then you have to run that code at an interval of your choice.
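A minimal sketch of those two steps, assuming jsoup and a ScheduledExecutorService; the URL and the CSS selector are placeholders you would replace after inspecting the real page (with HtmlCleaner you would use an XPath expression instead):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RatePoller {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Re-fetch the page every 60 seconds.
        scheduler.scheduleAtFixedRate(() -> {
            try {
                // Placeholder URL and selector: inspect the real page to
                // find a stable path to the cell holding the rate.
                Document doc = Jsoup.connect("https://example.com/rates").get();
                String gold = doc.select("table.rates td.gold").text();
                System.out.println("Gold: " + gold);  // feed this into your calculation
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 0, 60, TimeUnit.SECONDS);
    }
}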
And if there is JavaScript in play, then you have to execute the JavaScript running behind the page from your Java code.
Try Rhino from Mozilla with its integration libraries, or the JDK 1.6 ScriptEngine facility.
For a ScriptEngine example, take a look here: http://metoojava.wordpress.com/2010/06/20/execute-javascript-from-java/
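For instance, the JDK's built-in engine can evaluate a snippet of JavaScript directly (Rhino on JDK 6/7, Nashorn on JDK 8; the script below is just a stand-in for whatever the page actually runs):

import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class EvalPageScript {
    public static void main(String[] args) throws Exception {
        // Rhino on JDK 6/7, Nashorn on JDK 8-14; removed from later JDKs.
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
        // Stand-in for a script extracted from the page source.
        Object result = engine.eval("var ounce = 31.1035; 1650.50 / ounce;");
        System.out.println("Rate per gram: " + result);
    }
}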
In short, take a look at HTML parsers to parse the content of your page.
A native, ancient solution in ColdFusion that used to work with HTML 3.x...
<!--- Fetch the remote page; resolveurl rewrites its relative links to absolute ones --->
<cfhttp url="#targetUrl#" resolveurl="yes">
<!--- Re-render the fetched HTML as a PDF held in the variable pdfVar --->
<cfdocument format="pdf" name="pdfVar">
#cfhttp.filecontent#
</cfdocument>
<!--- Thumbnail page 1 of that PDF, then stream the image back to the browser --->
<cfpdf action="thumbnail" source="#pdfVar#" pages="1" destination="image">
<cfimage action="writeToBrowser" source="#image#">
Super slow, even with caching, and many CSS styles come through missing or broken.
Is there any good server-side solution to capture a rendered webpage as a thumbnail, like the service provided by http://www.shrinktheweb.com/ ?
Any ColdFusion, Java, or command-line utility solution?
This website has a script that does what I think you're looking for, though I haven't tried using it in a server-side project:
http://khtml2png.sourceforge.net/
It doesn't make thumbnails, though, but you could resize the image it creates with cfimage.
If you have ColdFusion version 8 or better, you can simply use CFDOCUMENT together with CFPDF to create a thumbnail.
From Ray's post:
<cfdocument src="http://www.coldfusionjedi.com" name="pdfdata" format="pdf" />
<cfpdf source="pdfdata" pages="1" action="thumbnail" destination="." format="jpg" overwrite="true" resolution="high" scale="25">
We ended up using SiteShoter, which uses IE as the rendering engine. http://www.nirsoft.net/utils/web_site_screenshot.html
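Since the question also allows a Java solution, here is a hedged sketch of another route: let Selenium WebDriver render the page in a real browser, screenshot it, and scale the result down. It assumes a local Firefox/geckodriver setup, and the 200px thumbnail width is arbitrary:

import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;
import org.openqa.selenium.OutputType;
import org.openqa.selenium.TakesScreenshot;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class PageThumbnail {
    public static void main(String[] args) throws Exception {
        WebDriver driver = new FirefoxDriver();  // a real browser does the rendering
        try {
            driver.get("http://www.example.com/");
            File shot = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);

            // Scale the full screenshot down to a 200px-wide thumbnail.
            BufferedImage full = ImageIO.read(shot);
            int w = 200, h = full.getHeight() * w / full.getWidth();
            BufferedImage thumb = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
            Graphics2D g = thumb.createGraphics();
            g.drawImage(full, 0, 0, w, h, null);  // synchronous scaled draw
            g.dispose();
            ImageIO.write(thumb, "jpg", new File("thumbnail.jpg"));
        } finally {
            driver.quit();
        }
    }
}

The same TakesScreenshot call works with the other WebDriver implementations, so you can swap in IE or Chrome if that is the rendering engine you want.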
I want to know how to convert an HTML file to an image. How do I do this?
You can check out the source code of the popular Browsershots service:
http://browsershots.org/
If you're running Windows and have the GD library installed, you can use imagegrabwindow. I've never used it myself, but as always, the PHP site has lots of documentation and examples.
Use wkhtmltopdf.
It also has PHP bindings, or you can run it yourself from the command line.
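For image output specifically, the companion wkhtmltoimage binary (shipped alongside wkhtmltopdf) renders a page straight to PNG. A small sketch invoking it from Java, assuming the binary is installed and on the PATH:

import java.io.File;

public class HtmlToImage {
    public static void main(String[] args) throws Exception {
        // Assumes the wkhtmltoimage binary is installed and on the PATH.
        Process p = new ProcessBuilder(
                "wkhtmltoimage", "http://www.example.com/", "page.png")
            .inheritIO()  // pass the tool's console output through
            .start();
        int exit = p.waitFor();
        System.out.println(exit == 0 && new File("page.png").exists()
            ? "Rendered page.png" : "Rendering failed");
    }
}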
The problem is that you would need to implement all the functionality of a browser and an HTTP stack (and this still does not deal with the case where the content is modified by JavaScript).
As John McCollum says, if you've got the website open in a browser on your PC, then you can use imagegrabwindow or SnapsIE (MSIE only).
If you want to be able to get a snapshot using code only, then you might want to look at one of the off-the-shelf solutions. AFAIK there are several programs (at least two of which are called html2pdf) that will generate a PDF from static HTML, and it's relatively easy, using standard tools, to trim this to window size and convert it to an image file.
e.g. https://metacpan.org/pod/distribution/PDF-FromHTML/script/html2pdf.pl
I need to screen-scrape some data from a website, because it isn't available via their web service. When I've needed to do this previously, I've written the Java code myself using Apache's HttpClient library to make the relevant HTTP calls to download the data. I figured out which calls I needed to make by clicking through the relevant screens in a browser while using the Charles web proxy to log the corresponding HTTP calls.
As you can imagine, this is a fairly tedious process, and I'm wondering if there's a tool that can actually generate the Java code corresponding to a browser session. I expect the generated code wouldn't be as pretty as code written manually, but I could always tidy it up afterwards. Does anyone know if such a tool exists? Selenium is one possibility I'm aware of, though I'm not sure whether it supports this exact use case.
Thanks,
Don
I would also add a +1 for HtmlUnit, since its functionality is very powerful: if you need behaviour 'as though a real browser were scraping and using the page', it's definitely the best option available. HtmlUnit executes the JavaScript in the page (if you want it to).
It currently has full-featured support for all the main JavaScript libraries and will execute JS code that uses them. Along with that, you can get handles to the JavaScript objects in the page programmatically within your test.
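A minimal HtmlUnit sketch (assuming a recent 2.x version; older releases lack getOptions() and try-with-resources support, and the target URL is a placeholder):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitScrape {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            // Let the page's JavaScript run, but don't abort on script errors.
            client.getOptions().setJavaScriptEnabled(true);
            client.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = client.getPage("http://www.example.com/");
            System.out.println(page.getTitleText());
            // Dump the rendered DOM for further parsing; XPath also works,
            // e.g. page.getByXPath("//table//td").
            System.out.println(page.asXml());
        }
    }
}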
If, however, the scope of what you are trying to do is smaller, more along the lines of reading some HTML elements where you don't much care about JavaScript, then NekoHTML should suffice. It's similar to JDOM, giving programmatic (rather than XPath) access to the tree. You would probably need to use Apache's HttpClient to retrieve the pages.
The manageability.org blog has an entry that lists a whole bunch of web page scraping tools for Java. However, I don't seem to be able to reach it right now, but I did find a text-only version in Google's cache here.
You should take a look at HtmlUnit - it was designed for testing websites but works great for screen scraping and navigating through multiple pages. It takes care of cookies and other session-related stuff.
Personally, HtmlUnit and Selenium are my two favorite tools for screen scraping.
A tool called The Grinder lets you record a session against a site by going through its proxy. The output is Python (runnable in Jython).