Are there any tools for parsing HTML using GWT - java

In my GWT application, I have a string containing HTML on the client side. Is there a good way to parse it, find specific HTML tags within it, and return the ids of those tags?
Any help would be much appreciated, thanks!

Check out GWT Query (GwtQuery). It is a jQuery-like API for GWT that makes it easy to traverse and manipulate HTML.

You could attach your HTML string to the DOM using Element.setInnerHTML(yourString). That way you're using the browser's own parser. Attaching it to an invisible element or an invisible iframe should hide what's happening from the user.
For the querying, you can use GWT's DOM functions if you want to stick with plain GWT. Using JavaScript directly or a JavaScript library such as jQuery is also an option. GWT Query might work too, but I haven't used it yet.
UPDATE:
This approach is open to XSS (cross-site scripting) attacks, so you must either trust or sanitize the HTML string.

Related

Using polymer in Java or NodeJS

I would like to generate static HTML5 pages, defining my own tags and rendering more complex HTML in the generated pages.
I like the polymer architecture approach to define new sets of tags.
When I want to generate my pages, I'm not in a browser, so I can use Java or NodeJS engines to compute the final HTML pages (from a console for instance).
To sum up, I want to define my own tag libraries using the polymer approach, code some HTML using those new tags, and "print" the result DOM in a static HTML file, running all that from a console program (using Java or NodeJS).
Does somebody know how to do so?
It seems I need some kind of DOM interpreter. I know that in Java I can use jsoup, but it will probably lack a JavaScript interpreter. Can NodeJS do this more simply?

Java: extract all resources links from HTML

I am looking for a way to extract all resource links from an HTML page in Java (URL links, links to files, and so on).
I first thought of extracting the values of all src and href attributes, but that list would not be exhaustive. There is an example of code here: Jsoup, extract links, images, from website. Exception on runtime.
As a tricky example, I want to be able to detect links hidden inside JavaScript (which can also be hidden anywhere in the HTML DOM):
<IMG onmouseover="window.open('http://www.evil.com/image.jpg')">
EDIT:
1) I am not looking for a regex-based solution, because regexes are not reliable for dealing with HTML documents.
2) I have tried an HTML DOM parser, JSoup. It extracts tags and their attributes quite well. However, I have not found a way to detect links inside JavaScript with it.
3) Maybe there is an API available that tries to render the page and detects which resources need to be loaded?
Do you have any thoughts?
Thanks.
If you want to use PHP with a bit of programming knowledge here is a library.
http://simplehtmldom.sourceforge.net/
I used this library to extract info from tags, even from properties of tags. This is exactly what you need to do what you want without working with complicated code.

Perform click on Web page element before parsing in Java

I'm trying to parse an HTML page with a DOM parser and the jsoup library.
The problem I'm facing is this:
On the website there are two buttons, which show two different tables.
I need to parse the table that is shown when the second button is clicked.
Different attribute values are set after the second button is clicked.
When I do Jsoup.connect("example.com")
I get the response as if the first button were selected, and I don't need that data.
Is there a way to click the second button and then start parsing and retrieving data from the website?
Jsoup is just a parser, i.e. it can't handle events such as clicking on buttons. Have a look at browser automation tools (e.g. Selenium) to perform this kind of job.
JSoup is an HTML parser, not a browser alternative. Take a look at HtmlUnit.
HtmlUnit is a "GUI-Less browser for Java programs". It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc... just like you do in your "normal" browser.
JSoup can't control the web page; it only parses the content. For manipulation and interaction, there are other tools. I recommend Geb, which uses a Groovy DSL with a jQuery-like syntax, making it very fluent. It's also pretty easy to parse XML/HTML with it.

Java Parser HTML using plain String methods?

Is it a good idea? I have used third-party libraries like JSoup and they work great, but this project is different. Is it worth loading and parsing a whole document when you just want to get one item from it? Some of the HTML pages are simple, so I could use String methods instead. The reason is that memory will be an issue, and loading the document also takes some time. When parsing XML I always use a SAX parser because it doesn't load the document into memory and it is fast. Could I use the same approach on HTML documents, or is there already a parser like this out there? A lightweight, non-DOM HTML parser would be great too.
If the HTML is XML compliant (i.e. it's XHTML) then you can use a standard SAX parser. Here you can find a list of HTML parsers in Java to choose from: http://java-source.net/open-source/html-parsers. HotSax probably will handle all your use cases.
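As a rough illustration of the SAX route for well-formed XHTML, the sketch below uses the JDK's built-in SAX parser to pull a single element's text out of a document without building a DOM. The class name XhtmlTitleFinder, the method firstTitle, and the choice of the title element are made up for this example. It assumes the input is well-formed XML and carries no DOCTYPE that points at an external DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: streams through XHTML and keeps only the <title> text.
public class XhtmlTitleFinder {

    // SAX callback handler that records the text of the first <title> element.
    private static final class TitleHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private boolean inTitle;
        private boolean found;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            // The default factory is not namespace-aware, so qName holds the tag name.
            if (!found && "title".equalsIgnoreCase(qName)) {
                inTitle = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so append rather than assign.
            if (inTitle) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (inTitle && "title".equalsIgnoreCase(qName)) {
                inTitle = false;
                found = true;
            }
        }

        String result() {
            return found ? text.toString().trim() : null;
        }
    }

    /** Returns the trimmed text of the first title element, or null if there is none. */
    public static String firstTitle(String xhtml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            TitleHandler handler = new TitleHandler();
            parser.parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.result();
        } catch (Exception e) {
            // Malformed input (i.e. HTML that is not valid XML) ends up here.
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }
}
```

Only the handler's small buffer is held in memory, which is the property the question asks about. Pages that are not well-formed XML will make the parser throw, and that is the case where one of the lenient HTML parsers from the list above would be needed instead.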

How to parse text from web content in java?

I would like to parse web content and get only the text from it. I am getting the web content as HTML/JavaScript, and I need only the text.
Can someone help me with this? I am using an HTML parser to do it.
For example, I need the text content below, which is in bold in the file.
The URLConnection class contains many methods that let you communicate with
the URL over the network. URLConnection is an HTTP-centric class; that
is, many of its methods are useful only when you are working with HTTP
URLs. However, most URL protocols allow you to read from and write to
the connection. This section describes both functions.
Can someone suggest an approach or provide some sample code to do this?
Thanks in advance.
You could use an HTML parser. A safe choice would be HtmlParser.
An unorthodox method that I like is to use a tool like HtmlUnit. It is basically meant for unit testing, but it has advanced XPath parsing capabilities and also provides automatic login and session handling.
I recommend using HtmlUnit for downloading the pages and Jsoup as the HTML/XML parser.
I use them to extract information from websites (Google search too).
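For comparison, here is a minimal sketch of text-only extraction that needs no third-party library. It swaps in the HTML parser that ships with the JDK (javax.swing.text.html) in place of the HtmlParser, HtmlUnit, and Jsoup libraries recommended above. The class name HtmlTextExtractor and the method extractText are hypothetical, and joining text chunks with a single space is an arbitrary choice for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text nodes of an HTML string and drops the markup.
public class HtmlTextExtractor {

    public static String extractText(String html) {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text; separate runs with one space.
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(data);
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations in the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Calling HtmlTextExtractor.extractText on a downloaded page body returns its visible text as one string. The bundled parser targets old HTML versions, so for modern pages the libraries suggested above are likely to give cleaner results.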
