I am writing a screen-scraping app that reads various pages and extracts the data. I'm using the SAXParserFactory to get a SAXParser, which in turn gets me an XMLReader. I have configured the factory like this:
spf = SAXParserFactory.newInstance();
spf.setValidating(false);
spf.setFeature("http://xml.org/sax/features/validation", false);
spf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
spf.setFeature("http://xml.org/sax/features/use-entity-resolver2", false);
However, whenever I parse a document that contains the &nbsp; entity I get an
SEVERE: null
org.xml.sax.SAXParseException: The
entity "nbsp" was referenced, but not declared.
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1231)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
I can understand that it can't find the entity, since I told the factory not to read the DTD, but how do I disable entity checking altogether?
EDIT: This is for an Android app, which is why I am reluctant to use an API/library that isn't in the standard environment.
SAX doesn't seem capable of this, but the StAX API does. See this previous question/answer for how to set this up.
If you're writing the XML processor by hand, the StAX API is a lot easier to deal with than the SAX API, so you win on both counts.
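A minimal sketch of that StAX setup (the class and method names here are illustrative, not from the original question): with `SUPPORT_DTD` off and entity replacement disabled, the reader should report `&nbsp;` as an `ENTITY_REFERENCE` event instead of failing, and you can substitute whatever text you like:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxEntityDemo {
    // Collects character data, replacing any entity reference with a space.
    public static String extractText(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Don't try to load or process the DTD at all.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        // Report entities as ENTITY_REFERENCE events instead of resolving them.
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);

        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.CHARACTERS:
                    text.append(reader.getText());
                    break;
                case XMLStreamConstants.ENTITY_REFERENCE:
                    text.append(' '); // treat &nbsp; (or anything else) as a space
                    break;
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(extractText("<p>no&nbsp;break</p>"));
    }
}
```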
If it's HTML pages that you're reading, I'd strongly recommend using one of the libraries that deals with the fact that even valid HTML isn't XML and most HTML isn't valid. Try one of these:
NekoHTML
TagSoup
Edit: Just saw that it's an Android app. That is going to make it tougher. NekoHTML comes in at 109kb and TagSoup at 89kb.
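For what it's worth, TagSoup's `Parser` class implements `org.xml.sax.XMLReader`, so an existing SAX `ContentHandler` plugs straight in. A rough sketch (the class and method names are made up for illustration):

```java
import java.io.StringReader;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class TagSoupDemo {
    // Extracts the text of the <title> element, however broken the page is.
    public static String title(String html) throws Exception {
        XMLReader reader = new Parser(); // TagSoup's SAX-compatible parser
        StringBuilder title = new StringBuilder();
        reader.setContentHandler(new DefaultHandler() {
            private boolean inTitle;
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("title".equalsIgnoreCase(local)) inTitle = true;
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                if ("title".equalsIgnoreCase(local)) inTitle = false;
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                if (inTitle) title.append(ch, start, length);
            }
        });
        // Malformed HTML (unclosed tags, undeclared entities, etc.) is fine here.
        reader.parse(new InputSource(new StringReader(html)));
        return title.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(title("<html><head><title>Hi</title><body><p>unclosed"));
    }
}
```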
It seems to me that you've disabled the parser's ability to understand what to do with &nbsp;. What would you expect the SAX parser to do, given that it doesn't understand this entity at all?
Perhaps, if you're scraping HTML, you may be better off using JTidy? It's an HTML parser that presents the HTML as a DOM for further analysis.
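A minimal sketch of what that looks like with JTidy's `org.w3c.tidy.Tidy` class (the helper names are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

public class JTidyDemo {
    // Cleans up arbitrary HTML and returns it as a W3C DOM Document.
    public static Document toDom(String html) {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);         // suppress per-document console chatter
        tidy.setShowWarnings(false); // don't print a warning per markup problem
        InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        return tidy.parseDOM(in, null); // null: don't write the cleaned output anywhere
    }

    public static void main(String[] args) {
        // Unclosed tags and &nbsp; are repaired rather than rejected.
        Document doc = toDom("<html><body><p>hello &nbsp; world");
        System.out.println(doc.getElementsByTagName("p").getLength());
    }
}
```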
I think it is possible to intercept these errors by writing your own DOMErrorHandler instance - more details here:
http://xerces.apache.org/xerces2-j/faq-write.html
I've used this approach to work around a problem where I'm parsing a drawing as an XML SVG document generated by Corel Draw 12, which sometimes breaks the SVG DTD rules in the documents it outputs.
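With plain JAXP the analogous hook is an `org.xml.sax.ErrorHandler`: recoverable validity errors are routed to `error()`, where you can log them and let parsing continue. A sketch under those assumptions (class and method names are made up for illustration):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class LenientParse {
    public static final List<String> errors = new ArrayList<>();

    public static Document parseLeniently(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // we want validity errors reported, not fatal
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { errors.add(e.getMessage()); }
            public void error(SAXParseException e)   { errors.add(e.getMessage()); }
            // Well-formedness problems are still fatal; rethrow those.
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // <root/> is not declared in the (empty) DTD, so a validity error fires,
        // but parsing still completes and we get a usable Document back.
        Document doc = parseLeniently("<!DOCTYPE root [ ]><root/>");
        System.out.println(doc.getDocumentElement().getNodeName()
                + " / errors recorded: " + errors.size());
    }
}
```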
Why have you told it not to read the DTD? Is that because you don't want it to access this from the W3C servers by connecting to the internet; you want a standalone, off-network solution with a local DTD? I needed the same: I downloaded the SVG DTD and modules locally and used this Java library to force local DTD access: http://doctypechanger.sourceforge.net/
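If pulling in DoctypeChanger isn't an option, the standard `EntityResolver` hook can do the same job without an extra library. A sketch (the local DTD path in the comment and the method names are illustrative): the resolver intercepts every external-entity fetch, so you can serve a local copy of the DTD, or an empty one to skip it entirely.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class OfflineDtd {
    public static XMLReader offlineReader() throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(new EntityResolver() {
            public InputSource resolveEntity(String publicId, String systemId) {
                // Serve a local copy for the DTDs you care about, e.g.:
                // if (systemId.endsWith("svg11.dtd"))
                //     return new InputSource(new FileReader("dtd/svg11.dtd"));
                // For everything else, hand back an empty DTD so nothing is fetched.
                return new InputSource(new StringReader(""));
            }
        });
        return reader;
    }

    public static void main(String[] args) throws Exception {
        // The SYSTEM id points at the network, but the resolver intercepts it.
        offlineReader().parse(new InputSource(new StringReader(
                "<!DOCTYPE svg SYSTEM \"http://example.org/svg11.dtd\"><svg/>")));
        System.out.println("parsed without touching the network");
    }
}
```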
I have to generate Word documents from my application for an entity, containing some information about that entity; for this I am using POI. While using POI I have to decide where to create a paragraph, where to make text bold/italic, etc., based on configuration in the entity object, which I can handle easily enough in code.
But is there any way to define all this style/alignment information in XML/XSL or some other type of config, so I can get rid of the styling in my Java code?
Regarding your title question, see Where can I find the XSDs of DOCX XML files?
Regarding your body question,
But is there any way so that I can just define all this style/alignment etc. information in any XML/XSL or in any other type of config so I can get rid of styling in my Java code?
Yes, of course, and it would be a wise design decision to do so. Since DOCX is OOXML (within OPC) your XSLT will be able to generate OOXML character level formatting via w:rPr settings such as w:b, w:i, etc.
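For instance, a template along these lines maps inline markup to runs with character properties (the input-side element name `em` is made up for illustration; the `w:` prefix is bound to the WordprocessingML main namespace):

```xml
<xsl:template match="em"
              xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
              xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r>
    <w:rPr>
      <w:i/> <!-- italic run property -->
      <w:b/> <!-- bold, if the config calls for it -->
    </w:rPr>
    <w:t><xsl:value-of select="."/></w:t>
  </w:r>
</xsl:template>
```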
The challenge you'll be facing, however, is that you'll be forgoing the convenience provided by the POI API. You'll also have to reconstruct the OPC if you want to produce a proper DOCX file rather than just an importable OOXML file. For small projects, the learning curve required to wield OOXML directly is likely to be too steep to merit a direct-to-OOXML approach.
I need to parse a Website using Codename One. There is a class named HTMLParser (https://www.codenameone.com/javadoc/com/codename1/ui/html/HTMLParser.html) but it does not seem to work. At least I can't get it to run.
As an alternative I tried the XML parser, which gladly worked. But while parsing HTML with it I ran into problems with tags that aren't XHTML-conformant, like line breaks (br). They malform my HTML and thus I can't parse it predictably.
Is there any way to get the HTML Parser to work or some other way to do it?
EDIT:
I've chosen to write a Servlet doing the parsing work for me using JSoup. Seems to be a good practice.
The HTMLParser class was used by the deprecated HTMLComponent. It should have been deprecated too, as it is useless without it.
XMLParser includes all the HTML parsing functionality built into Codename One. It should work with non-conforming br tags as well; it might be inconsistent for things like self-closing tags vs. open tags, but it should still allow you to implement most such use cases.
Is it a good idea? Well, I have used other third-party libraries like JSoup and they work great, but this project is different. Is it worth loading and parsing a whole document when you just want one item from it? Some of the HTML pages are simple too, so I could also use String methods. The reason I ask is that memory will be an issue, and loading the whole document takes time as well. When parsing XML I always use a SAX parser, because it doesn't load the document into memory and it is fast. Could I use the same approach on HTML documents, or is there already a parser like this out there? A non-DOM, lightweight HTML parser would be great.
If the HTML is XML compliant (i.e. it's XHTML) then you can use a standard SAX parser. Here you can find a list of HTML parsers in Java to choose from: http://java-source.net/open-source/html-parsers. HotSax probably will handle all your use cases.
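If the input really is well-formed XHTML, the stock JAXP SAX parser alone covers the "fast, no DOM in memory" requirement. A small sketch (class and method names are illustrative) that streams through a page and pulls out anchor hrefs:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class XhtmlLinks {
    public static List<String> hrefs(String xhtml) throws Exception {
        List<String> out = new ArrayList<>();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xhtml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String local,
                                             String qName, Attributes atts) {
                        // Streaming: each element is seen once, nothing is retained.
                        if ("a".equals(qName) && atts.getValue("href") != null) {
                            out.add(atts.getValue("href"));
                        }
                    }
                });
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hrefs(
                "<html><body><a href=\"/x\">x</a><a href=\"/y\">y</a></body></html>"));
    }
}
```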
The JPedal library in Java is usually used to convert PDF to XML or HTML. However, I need to know whether we can extract data from an HTML5 document and save it to XML using the JPedal library API.
Is there any other possible alternative to this?
Also, I am trying to parse an HTML5 document using Java and store it in XML. Are there any good solutions for finding just specific tags and rendering an XML document from them?
Please let me know. Thank you.
There are a number of Java HTML parsers out there, but I recommend using the HTML5 parser from validator.nu available for download from here: http://about.validator.nu/htmlparser/.
Written by one of the main protagonists of HTML5, Henri Sivonen of Mozilla, to implement the HTML5 parsing algorithm, it is as reliable an HTML parser as you will find. It creates a true DOM that can be manipulated using standard XML tools and queried for hyperlinks using XPath. There are examples of how to apply XSLT transformations to it and how to get an XML serialization of the created DOM.
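A minimal sketch of its use (the class and method names here are illustrative): `HtmlDocumentBuilder` extends the standard JAXP `DocumentBuilder`, so the result is an ordinary `org.w3c.dom.Document`.

```java
import java.io.StringReader;
import nu.validator.htmlparser.dom.HtmlDocumentBuilder;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class Html5Demo {
    public static Document parse(String html) throws Exception {
        // Runs the HTML5 tree-construction algorithm; broken markup is
        // repaired the same way a browser would repair it.
        HtmlDocumentBuilder builder = new HtmlDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(html)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<p>one<p>two"); // unclosed paragraphs are fine
        System.out.println(doc.getElementsByTagName("p").getLength());
    }
}
```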
Is there any way, apart from XSLT, to dynamically generate an HTML form based on metadata specified inside an XML file? Note that I'm developing a Java web application here. There won't be a lot of metadata inside the XML, which means the XML is very simple. In the worst case, I would just build my own XML processor and generate the HTML with Java.
Consider JAXB to map your XML to Java objects. Once you have the data in Java, you can plug it into the templating engine of your choice.
One - less recommended - way is to display and style the XML by using CSS. See here for an example.
I would tend to go for XSLT if you need to go from one XML format to another (HTML in this situation) in 99% of cases. Not sure why you've already scratched that as an option.
Cheers,
Wim
It's answered here: Generate HTML form dynamically using xml and reusable xslt.
A complete example is described here: http://ganeshtiwaridotcomdotnp.blogspot.com/2011/09/xslt-using-reusable-xsl-to-generate.html
You have to extend the XSL file (answered there) to handle complex HTML forms.