I need to grab the whole HTML of a web page, including dynamically generated HTML. Is this possible in Java or in PHP?
When I inspect the element with Firebug, it shows the whole HTML, but when I use Java I get only the static HTML file.
Related
I am trying to scrape data from a webpage using the JSoup library in Java. However, the data that I want to scrape is loaded from XML, so when I try to parse it from the HTML it displays
<div id="report-details-container">
<!-- Container where HTML template will be loaded based on XML -->
</div>
Instead of the full HTML, it shows just this comment.
How can I scrape that data? In inspect element I can see the full HTML.
You can't scrape the original XML out of the HTML. The XML is not in the HTML.
However:
You might be able to reverse engineer the original XML ... provided that you know the rules for the transformation from XML to HTML (e.g. you have the XSLT file), AND the transformation is not lossy.
If the transformation from XML to HTML is done using client side execution of (say) an XSLT, then you should be able to capture the XML before the transformation is applied.
There might be a way to get the server to send XML instead of HTML. That will depend on the server itself.
However, if all you have is an HTML comment like you have shown us, then you will first need to reverse engineer the process that loads the XML. It is probably done with some client-side scripting.
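To illustrate the XML-to-HTML step described above, here is a minimal Java sketch using the JDK's built-in javax.xml.transform API. The XML payload, the stylesheet, and the class name XsltReplay are all hypothetical stand-ins, since the real site's XML and XSLT are not shown in the question. It only suggests that, once both have been captured, the HTML seen in the inspector could be regenerated outside the browser.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltReplay {
    // Hypothetical XML, standing in for whatever the page's script downloads.
    static final String SAMPLE_XML =
        "<report><title>Q1 sales</title><total>1200</total></report>";

    // Hypothetical stylesheet, standing in for the site's real XSLT.
    static final String SAMPLE_XSLT =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"html\" indent=\"no\"/>"
      + "<xsl:template match=\"/report\">"
      + "<div id=\"report-details-container\">"
      + "<h1><xsl:value-of select=\"title\"/></h1>"
      + "<p><xsl:value-of select=\"total\"/></p>"
      + "</div>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the XML and returns the generated HTML.
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Prints the HTML fragment that the browser would otherwise build.
        System.out.println(transform(SAMPLE_XML, SAMPLE_XSLT));
    }
}
```

Going the other way (from HTML back to XML) is only possible if the stylesheet keeps every piece of the XML, which is the "not lossy" condition mentioned above.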
I'm trying to get the Selenium versions available for download from this page: http://selenium-release.storage.googleapis.com/index.html just as test values. But I can't use APIs that find elements with XPath or CSS (like Jsoup), because HttpURLConnection returns only the JS source of the page. So my question is: how can I get the full HTML of that page so that I can at least parse the Selenium versions? My only restriction is that only Java code can be used.
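One possible angle, sketched here under assumptions rather than verified against the live page: public storage buckets of this kind often return an XML listing of their object keys from the bucket root, and the page's JavaScript may simply be rendering that listing. If that holds here, the versions could be read from the XML with the JDK's DOM parser and no JavaScript engine at all. The sample listing, its keys, and the class name BucketVersions below are hypothetical; in real use the XML would be downloaded with HttpURLConnection instead of being hard-coded.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class BucketVersions {
    // Hypothetical sample of a bucket listing; the real response may differ.
    static final String SAMPLE_LISTING =
        "<ListBucketResult xmlns=\"http://doc.s3.amazonaws.com/2006-03-01\">"
      + "<Name>selenium-release</Name>"
      + "<Contents><Key>2.39/selenium-java-2.39.0.zip</Key></Contents>"
      + "<Contents><Key>2.39/selenium-server-standalone-2.39.0.jar</Key></Contents>"
      + "<Contents><Key>2.40/selenium-java-2.40.0.zip</Key></Contents>"
      + "<Contents><Key>index.html</Key></Contents>"
      + "</ListBucketResult>";

    // Collects the leading path segment of every <Key>, assuming keys look
    // like "<version>/<file>". Keys without a slash are skipped.
    static Set<String> versions(String listingXml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(
                listingXml.getBytes(StandardCharsets.UTF_8)));
            NodeList keys = doc.getElementsByTagName("Key");
            // TreeSet removes duplicates; ordering is lexicographic,
            // not true version order.
            Set<String> result = new TreeSet<>();
            for (int i = 0; i < keys.getLength(); i++) {
                String key = keys.item(i).getTextContent();
                int slash = key.indexOf('/');
                if (slash > 0) {
                    result.add(key.substring(0, slash));
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse bucket listing", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(versions(SAMPLE_LISTING));
    }
}
```

If the bucket does not expose such a listing, a browser-driving tool that executes the page's JavaScript would be needed instead.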
I would like to generate static HTML5 pages, defining my own tags and rendering more complex HTML in the generated pages.
I like the polymer architecture approach to define new sets of tags.
When I want to generate my pages, I'm not in a browser, so I can use Java or NodeJS engines to compute the final HTML pages (from a console for instance).
To sum up, I want to define my own tag libraries using the polymer approach, code some HTML using those new tags, and "print" the result DOM in a static HTML file, running all that from a console program (using Java or NodeJS).
Does somebody know how to do so?
It seems I need a DOM interpreter. I know that in Java I can use jsoup, but it probably lacks a JavaScript interpreter. Can NodeJS do this more simply?
I am displaying a few tables using the HTML table tag and CSS. I am using Struts 2 and would like to include "Export to PDF" functionality. Right now it's just one page where I have to use this. Later on there will be one or two more pages where I have to use this feature. I am looking for an easy-to-implement plugin, jar, or anything else that can be used to do that.
There is a Java API for generating PDF.
Here it is: http://itextpdf.com/download.php
Call it from your Servlet or Struts Action, and use HttpServletResponse.getOutputStream to direct the PDF document back to the browser.
I have an HTML file containing some JavaScript tags. When I run this file in a browser such as IE, some contents are loaded from their source and displayed in the browser (for example, the weather of some cities). How can I run this HTML file and get the contents of the web page as they were displayed in the browser? I don't want to display the contents in my application; I want to parse the returned data and extract some specific contents (for example, the weather of each city).
Can anyone guide me, please?
What you're trying to do is called HTML scraping.
Your best option is to get help in the form of a library, since this is a common and complex task.
See this question: Options for HTML scraping?
Selenium is a good bet. It supports HtmlUnit, Firefox, Chrome amongst other browsers.
Link: http://seleniumhq.org/