I need to parse a website using Codename One. There is a class named HTMLParser (https://www.codenameone.com/javadoc/com/codename1/ui/html/HTMLParser.html), but it does not seem to work; at least I can't get it to run.
As an alternative I tried the XMLParser, which did work. But while parsing HTML with it I ran into problems with tags that are not XHTML-conformant, like line breaks (br). They malform my HTML, so I can't parse it predictably.
Is there any way to get the HTML Parser to work or some other way to do it?
EDIT:
I've chosen to write a Servlet doing the parsing work for me using JSoup. Seems to be a good practice.
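For anyone following the same route, here is a minimal sketch of the JSoup side of such a servlet, reduced to a plain main() for brevity (the URL is a placeholder and the selector just grabs anchor tags):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractor {
    public static void main(String[] args) throws Exception {
        // JSoup fetches and repairs the markup, so stray <br> tags are not a problem
        Document doc = Jsoup.connect("https://example.com").get();
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href") + " -> " + link.text());
        }
    }
}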
The HTMLParser class was used by the deprecated HTMLComponent. It should have been deprecated too, as it is useless without it.
XMLParser includes all the HTML parsing functionality built into Codename One. It should work with non-conforming br tags as well; it might be inconsistent for things like self-closing vs. open tags, but it should still allow you to implement most such use cases.
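A rough sketch of that usage, based on the com.codename1.xml API (the method names below are written from memory, so double-check them against the current Javadoc; the InputStream would typically come from a ConnectionRequest):

import java.io.InputStream;
import java.io.InputStreamReader;
import com.codename1.xml.Element;
import com.codename1.xml.XMLParser;

public class PageScraper {
    // Prints the href of every anchor tag found in the given HTML stream
    public static void printLinks(InputStream htmlStream) throws Exception {
        XMLParser parser = new XMLParser();
        Element root = parser.parse(new InputStreamReader(htmlStream, "UTF-8"));
        walk(root);
    }

    private static void walk(Element e) {
        if (e.isTextElement()) {
            return; // text nodes have no tag name or attributes
        }
        if ("a".equalsIgnoreCase(e.getTagName()) && e.getAttribute("href") != null) {
            System.out.println(e.getAttribute("href"));
        }
        for (int i = 0; i < e.getNumChildren(); i++) {
            walk(e.getChildAt(i));
        }
    }
}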
Related
I am looking for a way to extract all resource links from an HTML page in Java (URL links, links to files, ...).
I first thought of extracting the values of all src and href attributes, but that list would not be exhaustive. There is an example of code here: Jsoup, extract links, images, from website. Exception on runtime.
As a tricky example, I want to be able to detect links hidden inside JavaScript (which can also be hidden anywhere in the HTML DOM):
<IMG onmouseover="window.open('http://www.evil.com/image.jpg')">
EDIT:
1) I am not looking for a regex-based solution, because regexes are not reliable for dealing with HTML documents.
2) I have tried HTML DOM parsers like JSoup. They allow extraction of tags and their attributes quite well, but I have not found a way to detect links inside JavaScript with them.
3) Maybe there is an API available that tries to render the page and detects which resources need to be loaded?
Do you have any thoughts?
Thanks.
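Building on point 2 above, one pragmatic approach is to let JSoup collect the regular src/href attributes and then run a simple URL pattern only over inline scripts and event-handler attributes. A rough sketch, assuming JSoup on the classpath (the class name and the URL regex are illustrative; URLs that are assembled dynamically at runtime would still be missed, and only something that actually renders the page, such as HtmlUnit, could catch those):

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ResourceFinder {
    // Rough pattern for absolute URLs embedded in scripts/event handlers (illustrative only)
    private static final Pattern URL = Pattern.compile("https?://[^\\s'\"()<>]+");

    public static Set<String> findResources(String html, String baseUri) {
        Set<String> found = new LinkedHashSet<>();
        Document doc = Jsoup.parse(html, baseUri);

        // 1) Regular resource attributes, resolved against the base URI
        for (Element e : doc.select("[src], [href]")) {
            found.add(e.hasAttr("src") ? e.attr("abs:src") : e.attr("abs:href"));
        }

        // 2) URLs hidden in event-handler attributes (onclick, onmouseover, ...) and inline scripts
        for (Element e : doc.getAllElements()) {
            for (Attribute a : e.attributes()) {
                if (a.getKey().startsWith("on")) {
                    collect(a.getValue(), found);
                }
            }
        }
        for (Element script : doc.select("script")) {
            collect(script.data(), found);
        }
        return found;
    }

    private static void collect(String text, Set<String> out) {
        Matcher m = URL.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
    }
}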
If you want to use PHP and have a bit of programming knowledge, here is a library:
http://simplehtmldom.sourceforge.net/
I used this library to extract info from tags, even from tag attributes. It does exactly what you need, without complicated code.
Is it a good idea? Well, I have used other third-party libraries like JSoup and it works great, but this project is different. Is it worth loading and parsing a whole document when you just want one item from it? Some of the HTML pages are simple too, so I could also use String methods. The reason is that memory will be an issue, and loading the document takes time as well. When parsing XML I always use a SAX parser because it doesn't load the document into memory and it is fast. Could I use the same approach on HTML documents, or is there already a parser like that out there? A non-DOM, lightweight HTML parser would be great.
If the HTML is XML compliant (i.e. it's XHTML), then you can use a standard SAX parser. Here you can find a list of HTML parsers in Java to choose from: http://java-source.net/open-source/html-parsers. HotSax will probably handle all your use cases.
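If the pages really are XHTML, a streaming SAX pass needs nothing beyond the JDK and never builds a DOM in memory. A minimal sketch (the handler just prints anchor hrefs; the input string is a placeholder):

import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class XhtmlSaxExample {
    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><a href=\"http://example.com\">link</a></body></html>";
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xhtml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                // React to each element as it streams past instead of building a tree
                if ("a".equalsIgnoreCase(qName)) {
                    System.out.println("href = " + attrs.getValue("href"));
                }
            }
        });
    }
}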
Is there any way, apart from XSLT, to dynamically generate an HTML form based on metadata specified inside an XML file? Note that I'm developing a Java web application here. There won't be a lot of metadata inside the XML, which means the XML is very simple. In the worst case, I would just build my own XML processor and generate the HTML with Java.
Consider JAXB to map your XML to Java objects. Once you have the data in Java, you can plug it into the templating engine of your choice.
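A small sketch of that idea, assuming a made-up metadata format with <form> and <field> elements (the FormMeta/FieldMeta classes and their attributes are hypothetical, and the StringBuilder just stands in for whatever templating engine you choose):

import java.io.StringReader;
import java.util.List;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.annotation.XmlAttribute;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

// Hypothetical metadata format: <form><field name="email" label="E-mail"/></form>
@XmlRootElement(name = "form")
class FormMeta {
    @XmlElement(name = "field")
    public List<FieldMeta> fields;
}

class FieldMeta {
    @XmlAttribute
    public String name;
    @XmlAttribute
    public String label;
}

public class FormGenerator {
    public static void main(String[] args) throws Exception {
        String xml = "<form><field name=\"email\" label=\"E-mail\"/></form>";
        FormMeta meta = (FormMeta) JAXBContext.newInstance(FormMeta.class)
                .createUnmarshaller()
                .unmarshal(new StringReader(xml));

        // Feed the objects into any templating engine; a plain StringBuilder is shown for brevity
        StringBuilder html = new StringBuilder("<form>\n");
        for (FieldMeta f : meta.fields) {
            html.append("  <label>").append(f.label).append("</label>")
                .append("<input name=\"").append(f.name).append("\"/>\n");
        }
        html.append("</form>");
        System.out.println(html);
    }
}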
One - less recommended - way is to display and style the XML with CSS. See here for an example.
I would tend to go for XSLT in 99% of the cases where you need to go from one XML format to another (HTML in this situation). I'm not sure why you have already ruled that out as an option.
Cheers,
Wim
It's answered here: Generate HTML form dynamically using xml and reusable xslt.
A complete example is described here: http://ganeshtiwaridotcomdotnp.blogspot.com/2011/09/xslt-using-reusable-xsl-to-generate.html
You have to extend the XSL file (answered there) for complex HTML forms.
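For completeness, applying such a reusable XSL from Java needs nothing beyond the built-in javax.xml.transform API; a small sketch (the three file names are placeholders):

import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class FormFromXslt {
    public static void main(String[] args) throws Exception {
        // form.xml holds the metadata, form.xsl is the reusable stylesheet, form.html is the output
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("form.xsl")));
        t.transform(new StreamSource(new File("form.xml")),
                    new StreamResult(new File("form.html")));
    }
}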
In my GWT application, on the client side I have a string containing HTML. Is there a good way to parse it, find specific HTML tags within it, and return the ids of those tags?
Any help would be much appreciated, thanks!
Check out GWT query. It is a jQuery-like API for GWT that allows easy traversal and manipulation of HTML.
You could attach your HTML string to the DOM using Element.setInnerHTML(yourString). That way you're using the browser's parser. Attaching it to an invisible element or an invisible iframe should hide what's happening from the user.
For the querying you can use GWT's DOM functions if you want to stick with plain GWT. Using JavaScript directly, or any JavaScript library like jQuery, is also an option. GWT query might also be an option, but I haven't used it yet.
UPDATE:
This approach can be abused by XSS (cross-site scripting) attacks, so you must either trust or sanitize the HTML string.
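A rough sketch of this approach, using GWT's com.google.gwt.dom.client classes (the HtmlIdScanner helper and its idsOf method are made up for illustration, and the HTML string is assumed to be trusted or already sanitized as noted above):

import java.util.ArrayList;
import java.util.List;
import com.google.gwt.dom.client.DivElement;
import com.google.gwt.dom.client.Document;
import com.google.gwt.dom.client.Element;
import com.google.gwt.dom.client.NodeList;

public final class HtmlIdScanner {
    private HtmlIdScanner() { }

    // Returns the ids of all elements with the given tag name found in the HTML string
    public static List<String> idsOf(String html, String tagName) {
        // Let the browser parse the string; the element is never attached, so nothing is shown
        DivElement holder = Document.get().createDivElement();
        holder.setInnerHTML(html); // trust or sanitize the string first (see the XSS note above)

        List<String> ids = new ArrayList<String>();
        NodeList<Element> matches = holder.getElementsByTagName(tagName);
        for (int i = 0; i < matches.getLength(); i++) {
            String id = matches.getItem(i).getId();
            if (id != null && !id.isEmpty()) {
                ids.add(id);
            }
        }
        return ids;
    }
}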
I am writing a screen-scraping app that reads various pages and extracts the data. I'm using SAXParserFactory to get a SAXParser, which in turn gets me an XMLReader. I have configured the factory like this:
spf = SAXParserFactory.newInstance();
spf.setValidating(false);
spf.setFeature("http://xml.org/sax/features/validation", false);
spf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
spf.setFeature("http://xml.org/sax/features/use-entity-resolver2", false);
However, whenever I parse a document that contains the &nbsp; entity I get an
SEVERE: null
org.xml.sax.SAXParseException: The entity "nbsp" was referenced, but not declared.
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1231)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
I can understand that it can't find the entity, since I told the factory not to read the DTD, but how do I disable entity checking altogether?
EDIT: This is for an Android app, which is why I am reluctant to use an API/library that isn't in the standard environment.
SAX doesn't seem capable of this, but the StAX API does. See this previous question/answer for how to set this up.
If you're writing the XML processor by hand, the StAX API is a lot easier to deal with than the SAX API, so you win on both counts.
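A small sketch of the StAX route (the input string is just a placeholder; exact behavior for undeclared entities can vary between StAX implementations, and javax.xml.stream is not part of the standard Android SDK, so on Android you would need to bundle an implementation):

import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class LenientStaxExample {
    public static void main(String[] args) throws Exception {
        String xml = "<p>one&nbsp;two</p>";

        XMLInputFactory f = XMLInputFactory.newInstance();
        // Don't try to resolve entities; report them as ENTITY_REFERENCE events instead
        f.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);
        f.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);

        XMLStreamReader r = f.createXMLStreamReader(new StringReader(xml));
        while (r.hasNext()) {
            switch (r.next()) {
                case XMLStreamConstants.CHARACTERS:
                    System.out.print(r.getText());
                    break;
                case XMLStreamConstants.ENTITY_REFERENCE:
                    System.out.print("[" + r.getLocalName() + "]"); // e.g. "nbsp"
                    break;
            }
        }
    }
}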
If it's HTML pages that you're reading, I'd strongly recommend using one of the libraries that deal with the fact that even valid HTML isn't XML and most HTML isn't valid. Try one of these:
NekoHTML
TagSoup
Edit: Just saw that it's an Android app. That is going to make it tougher. NekoHTML comes in at 109kb and TagSoup at 89kb.
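For what it's worth, TagSoup plugs straight into the SAX interfaces you are already using, so the handler code barely changes. A minimal sketch (the HTML string is a placeholder):

import java.io.StringReader;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class TagSoupExample {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>unclosed paragraph<br><a href=http://example.com>link</body>";

        // TagSoup's Parser is a SAX XMLReader that repairs broken HTML on the fly
        XMLReader reader = new Parser();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("a".equalsIgnoreCase(localName)) {
                    System.out.println("href = " + atts.getValue("href"));
                }
            }
        });
        reader.parse(new InputSource(new StringReader(html)));
    }
}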
It seems to me that you've disabled the parser's ability to understand what to do with &nbsp;. What would you expect the SAX parser to do, given that it doesn't understand this entity at all?
Perhaps if you're scraping HTML, you may be better off using JTidy ? It's an HTML parser that presents the HTML in a DOM for further analysis.
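A short sketch of that, assuming JTidy's org.w3c.tidy.Tidy class (the setter and parseDOM calls are written from memory, so verify them against the JTidy docs; the HTML string is a placeholder):

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class JTidyExample {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>messy<br>markup<a href='http://example.com'>link</a></body>";

        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);

        // Second argument is an optional OutputStream for the cleaned-up markup; null skips it
        Document doc = tidy.parseDOM(
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)), null);

        NodeList links = doc.getElementsByTagName("a");
        for (int i = 0; i < links.getLength(); i++) {
            System.out.println(((Element) links.item(i)).getAttribute("href"));
        }
    }
}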
I think it is possible to intercept these errors by writing your own DOMErrorHandler instance - more details here:
http://xerces.apache.org/xerces2-j/faq-write.html
I've used this approach to work around a problem where I was parsing a drawing as an XML SVG document generated by Corel Draw 12, which sometimes breaks the SVG DTD rules in the documents it outputs.
Why have you told it not to read the DTD? Is that because you don't want it to fetch the DTD from the W3C servers over the internet, and you want a standalone, off-network solution with a local DTD instead? I needed the same: I downloaded the SVG DTD and modules locally and used this Java library to force local DTD access: http://doctypechanger.sourceforge.net/