I'm trying to parse an HTML page with a DOM parser and the jsoup library.
The problem I'm facing is this:
The web site has two buttons, each of which shows a different table.
I need to parse the table that is shown when the second button is clicked.
Clicking the second button sets different attribute values.
When I call Jsoup.connect("example.com"),
I get the response as if the first button were selected, and I don't need that data.
Is there a way to click the second button and then start parsing and retrieving data from the web site?
Jsoup is just a parser, i.e. it can't handle events such as clicking on buttons. Have a look at browser automation tools (e.g. Selenium) to perform this kind of job.
JSoup is an HTML parser, not a browser alternative. Take a look at HtmlUnit.
HtmlUnit is a "GUI-Less browser for Java programs". It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc... just like you do in your "normal" browser.
JSoup can't control the web page; it can only parse the content. For manipulation and interaction there are other tools. I recommend Geb, which uses a Groovy DSL with a jQuery-like syntax, making it very fluent. It's also pretty easy to parse XML/HTML with it.
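To make the "automate the click, then parse" idea from the answers above concrete, here is a minimal sketch that uses Selenium to click the button and then hands the resulting HTML to jsoup. It assumes Selenium 4 and a local Firefox installation. The selectors `#second-button` and `#second-table` and the class name `ClickThenParse` are hypothetical placeholders, not taken from the question.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class ClickThenParse {

    // Hypothetical selectors: replace them with the real ones from the target page.
    static final String SECOND_BUTTON = "#second-button";
    static final String SECOND_TABLE = "#second-table";

    /** Loads the page in a real browser, clicks the button, and returns the resulting HTML. */
    static String htmlAfterClick(String url) {
        WebDriver driver = new FirefoxDriver();
        try {
            driver.get(url);
            driver.findElement(By.cssSelector(SECOND_BUTTON)).click();
            // Wait (up to 10 seconds) until the second table is visible.
            new WebDriverWait(driver, Duration.ofSeconds(10))
                    .until(ExpectedConditions.visibilityOfElementLocated(By.cssSelector(SECOND_TABLE)));
            return driver.getPageSource();
        } finally {
            driver.quit();
        }
    }

    /** Parses HTML with jsoup and returns the cell texts of each row of the selected table. */
    static List<List<String>> extractRows(String html, String tableSelector) {
        Document doc = Jsoup.parse(html);
        List<List<String>> rows = new ArrayList<>();
        for (Element row : doc.select(tableSelector + " tr")) {
            List<String> cells = new ArrayList<>();
            for (Element cell : row.select("th, td")) {
                cells.add(cell.text());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = htmlAfterClick("http://example.com"); // placeholder URL
        extractRows(html, SECOND_TABLE).forEach(System.out::println);
    }
}
```

The browser does the clicking and script execution; jsoup only sees the HTML snapshot afterwards, which matches the division of labour described in the answers.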
Related
Can I fill out forms, trigger events, and execute JavaScript functions in Jsoup? If so, how? Or should I use another parser?
JSoup is just an HTML parser/"tidyfier", not a browser emulator. To interact with HTML pages (execute JavaScript, fill out forms, etc.) you should use a tool like HtmlUnit or Selenium.
Use Selenium - if you use Selenium 2 WebDriver API, the main classes there are WebDriver, FirefoxDriver, and JavascriptExecutor.
I would like to generate static HTML5 pages, defining my own tags and rendering more complex HTML in the generated pages.
I like the Polymer approach of defining new sets of tags.
When I generate my pages I'm not in a browser, so I can use a Java or Node.js engine to compute the final HTML pages (from a console, for instance).
To sum up, I want to define my own tag libraries using the Polymer approach, write some HTML using those new tags, and "print" the resulting DOM to a static HTML file, running all of that from a console program (using Java or Node.js).
Does somebody know how to do so?
It seems I must have some DOM interpreters, and I know that in Java I can use jsoup, but it will probably lack some JavaScript interpreter? Can NodeJS do that more simply?
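As a rough illustration of the "expand custom tags from a console program" idea, here is a sketch that uses only the JDK's XML DOM API. It does not run Polymer or any JavaScript; it only shows the general shape of replacing a custom tag with plain HTML. It assumes the input is well-formed XML without a DOCTYPE, and the tag name `my-card`, its `title` attribute, and the class `CustomTagExpander` are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CustomTagExpander {

    /** Replaces every <my-card title="..."> (hypothetical tag) with a div/h2/p structure. */
    static String expand(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        NodeList cards = doc.getElementsByTagName("my-card");
        // The NodeList is live, so keep replacing the first match until none are left.
        while (cards.getLength() > 0) {
            Element card = (Element) cards.item(0);

            Element div = doc.createElement("div");
            div.setAttribute("class", "card");
            Element h2 = doc.createElement("h2");
            h2.setTextContent(card.getAttribute("title"));
            Element p = doc.createElement("p");
            // Move the original children of the custom tag under the <p>.
            while (card.getFirstChild() != null) {
                p.appendChild(card.getFirstChild());
            }
            div.appendChild(h2);
            div.appendChild(p);
            card.getParentNode().replaceChild(div, card);
        }

        // Serialize the modified DOM back to a string without an XML declaration.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(expand("<body><my-card title=\"Hello\">Some text</my-card></body>"));
    }
}
```

Real Polymer elements rely on JavaScript running in a browser-like environment, so this kind of static expansion only covers tags whose output can be computed without scripts.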
I am displaying a few tables using the HTML table tag and CSS. I am using Struts 2 and would like to add "Export to PDF" functionality. Right now there is just one page where I need this; later on there will be one or two more pages. I am looking for an easy-to-implement plugin, JAR, or anything else that can do that.
There is a Java API for generating PDF.
Here it is: http://itextpdf.com/download.php
Call it from your Servlet or Struts Action, and use HttpServletResponse.getOutputStream to send the PDF document back to the browser.
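A minimal sketch of that approach, assuming iText 5 (the `com.itextpdf.text` packages) is on the classpath. The class `TablePdfExporter` and its method are hypothetical names for illustration, and the rows are assumed to be non-empty and of equal length.

```java
import java.io.OutputStream;
import java.util.List;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfPTable;
import com.itextpdf.text.pdf.PdfWriter;

public class TablePdfExporter {

    /**
     * Writes the given rows as a single PDF table to the stream.
     * Assumes at least one row and that all rows have the same number of cells.
     */
    static void writeTable(List<List<String>> rows, OutputStream out) throws DocumentException {
        Document document = new Document();
        PdfWriter.getInstance(document, out);
        document.open();

        // One PDF column per cell in the first row.
        PdfPTable table = new PdfPTable(rows.get(0).size());
        for (List<String> row : rows) {
            for (String cell : row) {
                table.addCell(cell);
            }
        }
        document.add(table);
        document.close(); // finishes writing the PDF to the stream
    }
}
```

In a Servlet or Struts Action you would typically call `response.setContentType("application/pdf")` and then pass `response.getOutputStream()` as the `out` argument.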
I have an HTML file containing some JavaScript tags. When I open this file in a browser such as IE, some content is loaded from its source and displayed in the browser (for example, the weather in some cities). How can I run this HTML file and get the contents of the web page as the browser would display them? I don't want to display the contents in my application; I want to parse the returned data and extract specific content (for example, the weather for each city).
Can anyone guide me, please?
What you're trying to do is called html scraping.
Your best option is to get help in the form of a library, since this is a common and complex task.
See this question: Options for HTML scraping?
Selenium is a good bet. It supports HtmlUnit, Firefox, Chrome amongst other browsers.
Link: http://seleniumhq.org/
In my GWT application, on the client side, I have a string containing HTML. Is there a good way to parse it, find specific HTML tags within it, and return the ids of those tags?
Any help would be much appreciated, thanks!
Check out GWT query. It is a jQuery like API for GWT that allows easily traversing and manipulating HTML.
You could attach your HTML string to the DOM using Element.setInnerHTML(yourString). That way you're using the browser's parser. Attaching it to an invisible element or an invisible iframe should hide what's happening from the user.
For the querying you can use GWT's DOM functions if you want to stick with plain GWT. Using JavaScript directly, or a JavaScript library like jQuery, is also an option. GWT query might be another option, but I haven't used it yet.
UPDATE:
This approach can be abused by XSS (cross site scripting) attacks - so you must either trust or sanitize the HTML string.