I want to read the entire contents of a web page, including dynamic content (HTML loaded by JavaScript inside iframes and nested iframes). I could then simply rebuild the page from the printed contents. I tried HtmlUnit in Java (see "How to print external script inside iframe using htmlunit?") but did not succeed. Any suggestions?
Related
I've already tried using HttpConnection and HttpClient to get the content of index.html from a website. The problem is that this only gives me the original HTML and then doesn't continue or wait to read more. The site I want to read constantly changes its HTML, since it updates itself periodically while you are on it.
How can I open a "session" in Java so that it retrieves the content after it updates, or at least waits a little longer for the website to load? The image elements of the website are generated through PHP afterwards.
I need to grab the whole HTML of a web page, including dynamically generated HTML. Is this possible in Java or in PHP?
When I use Firebug's Inspect Element, it shows the whole HTML, but when I use Java I get only the static HTML file.
I am generating PDF files dynamically in my application using the Apache PDFBox library.
I have a JSP page with a Print button. When the user clicks that button, I want to generate the PDF file, show it in the browser, and call window.print() at the same time.
How can I achieve this in my JSP page?
Create a PDF link on your page, mapped to the actual location of the PDF on your server.
The browser decides what to do with the PDF based on its settings: download it or open it via a plugin. The bottom line is that you cannot control this from server-side code.
In either case you cannot apply window.print(), because it applies only to the browser window, not to the PDF plugin; and if the PDF is downloaded, the user has to open it manually.
There is an alternative: show the PDF in a div in your HTML and print that div.
For how to show a PDF in an HTML div, see "Display Adobe pdf inside a div".
For printing a div or any other HTML element, there are jQuery plugins available. I have used print.js, which prints an HTML div and also preserves your CSS.
So when the user clicks the Print button, first show the PDF in a div and then call the print function to print that div.
I have an HTML file containing some JavaScript tags. When I open this file in a browser such as IE, some content is loaded from its source and displayed in the browser (for example, the weather of some cities). How can I run this HTML file and get the content that the browser would display? I don't want to display the content in my application; I want to parse the returned data and extract specific content (for example, the weather of each city).
Can anyone guide me, please?
What you're trying to do is called HTML scraping.
Your best option is to use a library, since this is a common and complex task.
See this question: Options for HTML scraping?
Selenium is a good bet. It supports HtmlUnit, Firefox, Chrome amongst other browsers.
Link: http://seleniumhq.org/
I am developing a Java project with a sub-module that needs to extract content (text, images, colors) from a web page and compare it with another web page. I am planning to use WinHTTrack to download the web page locally, but the problem is that it doesn't save it as HTML. How can I download a web page with an .html extension using software such as WinHTTrack (or is saving the page with Ctrl+S enough)? I am also planning to use an HTML parser to extract the three content types (text, images, colors) after downloading the web page locally. Which parser should I use?
Well, I use HTTrack and it fetches HTML files as well. You are probably taking the WinHTTrack project file as the only output file, but if you check inside the project directory there are HTML files (together with images, etc.). I would suggest using http://htmlparser.sourceforge.net/. It is a Java library, and since your project is a Java project it should be fairly easy to use. You can also save the whole website locally using org.htmlparser.parserapplications.SiteCapturer (and specify whether resources such as images should be captured as well). Hope it helps.
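To give a rough idea of what the extraction step could look like once the HTML is saved locally, here is a minimal sketch. It does not use the htmlparser library recommended above; it swaps in the HTML parser that ships with the JDK (javax.swing.text.html), which is lenient but dates from the HTML 3.2 era. The class name PageContentExtractor and its fields are hypothetical, and it only picks up colors given as color/bgcolor attributes, not colors set through CSS.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper: collects text, image URLs and color attributes from HTML. */
final class PageContentExtractor extends HTMLEditorKit.ParserCallback {
    final List<String> text = new ArrayList<>();
    final Set<String> images = new LinkedHashSet<>();
    final Set<String> colors = new LinkedHashSet<>();

    @Override
    public void handleText(char[] data, int pos) {
        // Keep only non-blank text runs.
        String s = new String(data).trim();
        if (!s.isEmpty()) {
            text.add(s);
        }
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        collectColors(attrs);
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        // <img> has no closing tag, so it is reported as a "simple" tag.
        if (tag == HTML.Tag.IMG) {
            Object src = attrs.getAttribute(HTML.Attribute.SRC);
            if (src != null) {
                images.add(src.toString());
            }
        }
        collectColors(attrs);
    }

    /** Records color="..." and bgcolor="..." attribute values (CSS is ignored). */
    private void collectColors(MutableAttributeSet attrs) {
        HTML.Attribute[] keys = {HTML.Attribute.COLOR, HTML.Attribute.BGCOLOR};
        for (HTML.Attribute key : keys) {
            Object value = attrs.getAttribute(key);
            if (value != null) {
                colors.add(value.toString());
            }
        }
    }

    /** Parses an HTML string and returns the collected content. */
    static PageContentExtractor extract(String html) {
        PageContentExtractor callback = new PageContentExtractor();
        try {
            // true = ignore any charset declared in a <meta> tag
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback;
    }
}
```

To compare two pages, one option would be to call extract on each saved file's contents and compare the resulting collections, for example a.images.equals(b.images). For pages that rely on modern HTML or CSS, a dedicated parser library is likely a better fit than this sketch.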