I am developing a Java project with a sub-module that needs to extract content (text, images, colors) from a webpage and compare it with another webpage. I am planning to use WinHTTrack to download the webpage locally, but the problem is that it doesn't seem to save it as HTML. How can I download a webpage with an .html extension using software such as WinHTTrack (or is just saving the webpage with Ctrl+S enough)? I am also planning to use an HTML parser to extract the three content types (text, images, colors) after downloading the webpage locally. Which parser should I go with?
Well, I use HTTrack and it fetches HTML files as well. You are probably taking the WinHTTrack project file to be the only output file, but if you check inside the project directory there are HTML files (together with images, etc.). I would suggest using http://htmlparser.sourceforge.net/. It is a Java library, and since your project is a Java project it should be fairly easy to use. You can also save the whole website locally using org.htmlparser.parserapplications.SiteCapturer (and specify whether resources such as images should be captured as well). Hope it helps.
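To make the extraction step a bit more concrete, here is a rough sketch of pulling the three content types out of an HTML string once the page is saved locally. It deliberately uses only the standard library and regular expressions, so it is not the htmlparser API and it will miss cases a real parser handles (named colors, rgb() values, unquoted attributes; it can also mistake anchors such as href="#abc" for colors). The class name PageContentExtractor and its methods are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of HTTrack or htmlparser.
class PageContentExtractor {
    // src attribute of <img> tags (quoted values only).
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);
    // Hex colors such as #fff or #FF0000; named and rgb() colors are ignored.
    private static final Pattern HEX_COLOR = Pattern.compile(
            "#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\\b");
    // Whole <script>...</script> and <style>...</style> blocks.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Returns the src value of every <img> tag, in document order.
    static List<String> imageSources(String html) {
        List<String> sources = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            sources.add(m.group(1));
        }
        return sources;
    }

    // Returns the distinct hex colors found anywhere in the markup, lowercased.
    static Set<String> hexColors(String html) {
        Set<String> colors = new LinkedHashSet<>();
        Matcher m = HEX_COLOR.matcher(html);
        while (m.find()) {
            colors.add(m.group().toLowerCase());
        }
        return colors;
    }

    // Drops script/style blocks and all tags, then collapses whitespace.
    static String visibleText(String html) {
        String withoutCode = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        String withoutTags = TAG.matcher(withoutCode).replaceAll(" ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }
}
```

To compare two pages you could run the three methods on each saved HTML file and compare the results with equals(). For anything beyond a quick check, a proper HTML parser is the safer choice.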
I'm trying to download thousands of Excel files from a website. I'd normally use urllib2 for this, but unfortunately the actual downloading takes place through a Java applet and the URL doesn't change correspondingly. E.g., filling out a query and hitting download doesn't change the URL until the file is actually downloading, and when it does change, the URL is always the same regardless of the query. So, in sum, I'm trying to use Python to download a bunch of files which are normally queried through a Java applet. Thanks in advance!
For a given page, say "http://www.yahoo.com", how can I calculate the total size of the downloaded files, for example image files, JavaScript files, and CSS files?
I know the htmlparser jar, but it does not support an element for CSS files.
As Graeme mentioned, both the Firebug add-on for Firefox (a great tool for web developers btw) and the developer tools in Chrome will give you the info you want.
However, if you don't want to download anything, you can use this online service:
http://www.websiteoptimization.com/services/analyze/
It will tell you how many bytes are downloaded for a webpage, including images, style sheets, scripts and everything else.
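If you would rather compute the number yourself in Java, one possible approach is to collect the resource URLs of the page and add up the sizes the server reports. The sketch below is only an outline under a few assumptions: the class PageWeight and its methods are invented names, the size comes from the Content-Length header of a HEAD request (servers may omit it, in which case the resource counts as 0), and collecting the URL list from the HTML is left to whichever parser you use.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.List;
import java.util.function.ToLongFunction;

// Hypothetical helper for illustration only.
class PageWeight {
    // Asks the server for the resource size with a HEAD request.
    // Returns 0 when the size is unknown or the request fails.
    static long contentLength(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("HEAD");
            long length = conn.getContentLengthLong(); // -1 if the header is missing
            conn.disconnect();
            return Math.max(length, 0);
        } catch (IOException | IllegalArgumentException e) {
            return 0;
        }
    }

    // Sums the sizes of all resources; the lookup is passed in so it can be
    // swapped (real HEAD requests, cached values, sizes of local files, ...).
    static long totalBytes(List<String> urls, ToLongFunction<String> sizeOf) {
        long total = 0;
        for (String url : urls) {
            total += sizeOf.applyAsLong(url);
        }
        return total;
    }
}
```

With a list of image, script and stylesheet URLs in hand, PageWeight.totalBytes(urls, PageWeight::contentLength) would give an approximate total. It will differ from what a browser reports when responses are compressed or when scripts load more resources at runtime.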
I have already tested the window.print() command for this purpose, but it does not fulfill my requirement.
I also tried printing the content of an iframe whose source is a PDF file, but that only works in Chrome, not in other browsers.
I want to print PDF files automatically from code, instead of opening each file and printing it.
For example, if there are two files such as 1.pdf and 2.pdf in some directory and their source path is given, how can I print both files using JavaScript, PHP, or both?
Million thanks in advance.
This is not possible, since most browsers, unlike Google Chrome (where it works), don't have a built-in PDF viewer.
Printing a PDF document is up to the PDF reader (whether or not it is installed as a browser plugin), not the browser.
I fixed this issue by merging multiple PDFs, images, or both using ImageMagick.
The command below merges a PDF and an image:
<?php
// Merge test.pdf and test.jpeg into final.pdf with ImageMagick's convert command.
$cmd = "test.pdf test.jpeg final.pdf";
exec("convert $cmd");
?>
After the merge completes, open final.pdf automatically from code so that the user can print it easily.
I am trying to create an Android application that saves webpages for offline browsing. I was able to save the webpage itself, but the problem is its contents (images, JavaScript, etc.). Is there a way to save these programmatically? I use Eclipse and test my work on an emulator.
Hm, I am afraid you will have to parse the HTML yourself (I mean with a proper library) and store all the resources (CSS, JS, images, videos, etc.) too.
See how it is done in a Java crawler: open source crawlers
You will need to search for all images, JavaScript files, CSS files, etc. and download them, saving them at the same path relative to the HTML files. This assumes the HTML is coded with relative paths (images/image.png) and not absolute paths (http://www.domain.com/image/image.png).
You can pretty easily search the HTML string for <img, <script, <link, etc. and parse from there, or you can find a third-party HTML parser.
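As a rough illustration of the "search the HTML string" route, the sketch below finds src attributes on <img> and <script> tags and href attributes on <link> tags, resolves each against the page URL, and suggests a local path that keeps the same directory layout. The class ResourceCollector and its methods are hypothetical names, the regex handles quoted attributes only, and the actual downloading and file writing are left out.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; a real HTML parser is more robust.
class ResourceCollector {
    // Group 1: src of <img>/<script>; group 2: href of <link>.
    private static final Pattern RESOURCE = Pattern.compile(
            "<(?:img|script)\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']"
            + "|<link\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns absolute URIs for every referenced resource, in document order.
    // URI.resolve throws IllegalArgumentException for malformed references.
    static List<URI> resourceUris(String html, URI pageUri) {
        List<URI> found = new ArrayList<>();
        Matcher m = RESOURCE.matcher(html);
        while (m.find()) {
            String ref = m.group(1) != null ? m.group(1) : m.group(2);
            found.add(pageUri.resolve(ref));
        }
        return found;
    }

    // Maps an absolute resource URI to a relative path under the save folder,
    // e.g. http://www.domain.com/a/b.png -> www.domain.com/a/b.png
    static String localPath(URI resource) {
        String path = resource.getPath();
        return resource.getHost() + (path.isEmpty() ? "/index.html" : path);
    }
}
```

Each returned URI can then be downloaded and written to localPath(uri) under the app's storage directory. If the saved HTML uses absolute URLs, the references would also need rewriting to the local paths, which this sketch does not do.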
We have Flex on the front end and Java on the back end. When a user requests a PDF file, the request goes to the Java back end, where the PDF file is generated using JasperReports. What we don't know is how to display this PDF file in the browser, since we don't want to use JSP/Servlets etc. It has to be Flex only. Any suggestions?
Flash Player cannot natively render PDF files. This is possible using Adobe AIR but not in a Flex application. Your best bet is to call navigateToURL() and open a Servlet in a new browser tab/window. The Servlet can simply write contents of the PDF file to the OutputStream and set the appropriate HTTP headers.
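For what the server side of that could look like, here is a minimal sketch. It is not a Servlet: to keep it runnable without a servlet container it uses the JDK's built-in com.sun.net.httpserver.HttpServer as a stand-in, and the class name PdfEndpoint, the /report.pdf path and the file name are assumptions made up for the example. A real Servlet would set the same headers on HttpServletResponse and write the same bytes to its OutputStream.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// Hypothetical stand-in for the Servlet described above.
class PdfEndpoint {
    // Starts a server on the given port (0 picks any free port) that serves
    // the given PDF bytes at /report.pdf.
    static HttpServer start(int port, byte[] pdfBytes) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/report.pdf", exchange -> writePdf(exchange, pdfBytes));
        server.start();
        return server;
    }

    // Sets the HTTP headers and writes the PDF content to the response stream.
    // Assumes pdfBytes is non-empty (a length of 0 would mean chunked encoding).
    static void writePdf(HttpExchange exchange, byte[] pdfBytes) throws IOException {
        exchange.getResponseHeaders().set("Content-Type", "application/pdf");
        // "inline" asks the browser to display the file rather than download it.
        exchange.getResponseHeaders().set(
                "Content-Disposition", "inline; filename=\"report.pdf\"");
        exchange.sendResponseHeaders(200, pdfBytes.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(pdfBytes);
        }
    }
}
```

The Flex side would then call navigateToURL() with the URL of this endpoint. Whether the PDF is shown inline or downloaded still depends on the user's browser and plugin settings.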
I think this question is old, but this may help others: there's a new library developed by JasperForge themselves which deals with JasperReports directly. I mean it's not a PDF viewer but a JasperReports exporting tool. You can download it from here
I tried it using JasperServer: when viewing reports you can choose from different export options, one of which is Flash, and it works nicely.
Well, for starters, PDFs don't always display in the browser. It depends on the user's settings. You essentially send them the PDF file with the appropriate headers, and either they download it or a program like Acrobat Reader opens in the browser to display it.
I'm not sure how this is done in Flex; I would imagine that if you're using Java, one simple servlet could do it.