Parsing HTML in Java using plain String methods? - java

Is it a good idea? I have used third-party libraries like JSoup, and they work great, but this project is different. Is it worth loading and parsing a whole document when you only want one item from it? Some of the HTML pages are simple, so I could use String methods instead. The reason is that memory will be an issue, and loading the document also takes time. When parsing XML I always use a SAX parser because it doesn't load the whole document into memory and it is fast. Could I do the same with HTML documents, or is there already a parser like this out there? A lightweight, non-DOM HTML parser would be great too.

If the HTML is XML-compliant (i.e. it's XHTML), then you can use a standard SAX parser. Here you can find a list of HTML parsers in Java to choose from: http://java-source.net/open-source/html-parsers. HotSax will probably handle all your use cases.
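As a rough illustration of the SAX approach for well-formed XHTML, here is a minimal sketch that pulls out a single item (the text of the first title element) and stops parsing as soon as it has been found. The class name TitleExtractor and the method firstTitle are hypothetical, and the sketch assumes the input is well-formed XML with no DOCTYPE that would have to be fetched.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: streams through XHTML and returns one item without building a DOM.
public class TitleExtractor {

    /** Returns the text of the first title element, or null if there is none. */
    public static String firstTitle(String xhtml) throws Exception {
        final StringBuilder title = new StringBuilder();
        final boolean[] found = {false};

        DefaultHandler handler = new DefaultHandler() {
            private boolean inTitle;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("title".equalsIgnoreCase(qName)) {
                    inTitle = true;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inTitle) {
                    title.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) throws SAXException {
                if ("title".equalsIgnoreCase(qName)) {
                    found[0] = true;
                    // Abort the parse early: the rest of the document is not needed.
                    throw new SAXException("done");
                }
            }
        };

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try {
            parser.parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (SAXException e) {
            if (!found[0]) {
                throw e; // a real parse error, not our early exit
            }
        }
        return found[0] ? title.toString() : null;
    }
}
```

Throwing from endElement is one common way to stop a SAX parse early; for real-world HTML that is not well-formed, one of the tolerant parsers from the list above would still be needed.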

Related

HTML parsing in Codename One without having to use the XML Parser

I need to parse a Website using Codename One. There is a class named HTMLParser (https://www.codenameone.com/javadoc/com/codename1/ui/html/HTMLParser.html) but it does not seem to work. At least I can't get it to run.
As an alternative I tried the XML Parser, which fortunately worked. But while parsing HTML with it I ran into problems with tags that are not XHTML-conformant, such as breaks (br). They make my HTML malformed, so I can't parse it predictably.
Is there any way to get the HTML Parser to work or some other way to do it?
EDIT:
I've chosen to write a Servlet doing the parsing work for me using JSoup. Seems to be a good practice.
The HTMLParser class was used by the deprecated HTMLComponent. It should have been deprecated too, as it is useless without it.
XMLParser includes all the HTML parsing functionality built into Codename One. It should work with non-conforming br tags as well. It might be inconsistent for things like self-closing tags vs. open tags, but it should still allow you to implement most such use cases.

Converting XML to PDF, using styles from XSL

I have the following problem: I have an XML file with an XSL stylesheet that renders the XML as a neat HTML table when I load it in a web browser. Now I need to make a PDF that looks EXACTLY like that XSL-styled XML in the web browser, without having to write custom FOs for every file. Everything must be done in Java.
I need to make a PDF that is looking EXACTLY like that XSL-styled XML in web browser
Think again about this requirement. Paged media such as PDF and non-paged media such as HTML may only look "close enough", but never "exactly like" each other. This is even more obvious if you consider your HTML being displayed on devices with different screen sizes.
If you relax the above requirement somewhat, you'll probably agree that XSL-FO is the best choice. You definitely do not need to write "custom FO's for every file": write an XSLT just once, and use it on-the-fly to convert your XML to XSL-FO, and then use a rendering engine to process XSL-FO to PDF. Simple.
XSL-FO does sound like exactly what you need. But if that's not an option, you could first run the XSLT transform on the XML explicitly in Java and then convert the resulting HTML (which by then is a String/byte array/DOM/whatever you want) to PDF using an additional library. There are some libraries that support HTML to PDF, iText for example. XSLT transformations in Java are really simple and involve little code.
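To give an idea of how little code the XSLT step takes, here is a minimal sketch using the standard javax.xml.transform API. The class name XsltToHtml and its transform method are hypothetical; the same call would produce XSL-FO instead of HTML if the stylesheet emits FO, and the PDF rendering step is not shown.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: applies an XSL stylesheet to an XML document in memory.
public class XsltToHtml {

    /** Transforms the given XML with the given stylesheet and returns the output as a String. */
    public static String transform(String xml, String xsl) throws Exception {
        // Compile the stylesheet into a Transformer.
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));

        // Run the transformation and capture the result in memory.
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

The returned String could then be handed to whatever HTML-to-PDF or FO-to-PDF library is chosen; for files on disk, StreamSource and StreamResult also accept File arguments.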

Using JPedal to extract hyperlinks from HTML? - java

The JPedal library in Java is usually used to convert PDF to XML or HTML. However, I need to know whether it is possible to extract data from an HTML5 document and save it to XML using the JPedal library API.
Is there any possible alternative to this?
Also, I am trying to parse an HTML5 document using Java and store it as XML. Are there any good solutions for finding just specific tags and rendering XML from them?
Please let me know. Thank you.
There are a number of Java HTML parsers out there, but I recommend using the HTML5 parser from validator.nu available for download from here: http://about.validator.nu/htmlparser/.
Written to use the HTML5 parser algorithm by one of the main protagonists of HTML5, Henri Sivonen of Mozilla, you won't find a more reliable HTML parser and it creates a true DOM that can be manipulated using standard XML tools and queried for hyperlinks using XPath. There are examples of how to use XSLT transformations with it and how to get an XML serialization of the created DOM.
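As an illustration of querying a DOM for hyperlinks with XPath, here is a minimal sketch using only the standard javax.xml.xpath API. The class LinkExtractor and its methods are hypothetical. To stay self-contained, the sketch builds its Document from well-formed XHTML with the JDK's DocumentBuilder; that is an assumption for demonstration only, and in practice the Document would come from the HTML5 parser mentioned above. The XPath uses local-name() so that it matches a elements whether or not they are in the XHTML namespace.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects href values from a DOM using XPath.
public class LinkExtractor {

    /** Returns the href attribute of every a element, in document order. */
    public static List<String> hrefs(Document doc) throws Exception {
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//*[local-name()='a']/@href", doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getNodeValue());
        }
        return result;
    }

    /** Stand-in for a real HTML parser: only accepts well-formed XHTML. */
    public static Document parseWellFormed(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xhtml)));
    }
}
```

The collected values could then be written out as XML, for example with a Transformer, as the answer suggests.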

Generate HTML form based on XML

Is there any way, apart from XSLT, to dynamically generate an HTML form based on metadata specified inside an XML file? Note that I'm developing a Java web application. There won't be a lot of metadata inside the XML, so the XML is very simple. In the worst case, I would just build my own XML processor and generate the HTML code with Java.
Consider JAXB to map your XML to Java objects. Once you have the data in Java, you can plug it into the templating engine of your choice.
One less recommended way is to display and style the XML using CSS. See here for an example.
In 99% of cases I would go for XSLT when you need to get from one XML format to another (HTML in this situation). I'm not sure why you have already ruled it out as an option.
Cheers,
Wim
It's answered here: Generate HTML form dynamically using xml and reusable xslt.
A complete example is described here: http://ganeshtiwaridotcomdotnp.blogspot.com/2011/09/xslt-using-reusable-xsl-to-generate.html
For complex HTML forms you will have to extend the XSL file given in that answer.

How to configure Java's SaxParserFactory to disable entity checking?

I am writing a screen-scraping app that reads various pages and extracts the data. I'm using the SAXParserFactory to get a SAXParser, which in turn gets me an XMLReader. I have configured the factory like this:
spf = SAXParserFactory.newInstance();
spf.setValidating(false);
spf.setFeature("http://xml.org/sax/features/validation", false);
spf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
spf.setFeature("http://xml.org/sax/features/use-entity-resolver2", false);
However, whenever I parse a document that contains the &nbsp; entity I get:
SEVERE: null
org.xml.sax.SAXParseException: The entity "nbsp" was referenced, but not declared.
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1231)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
I can understand that it can't find the entity, since I told the factory not to read the DTD, but how do I disable entity checking altogether?
EDIT: This is for an Android app, which is why I am reluctant to use an API/library that isn't in the standard environment.
SAX doesn't seem capable of this, but the StAX API does. See this previous question/answer for how to set this up.
If you're writing the XML processor by hand, the StAX API is a lot easier to deal with than the SAX API, so you win on both counts.
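The linked answer is not reproduced here, but a minimal sketch of the StAX approach might look like the following. It assumes a standard JDK where javax.xml.stream is available, and it relies on the JDK's built-in StAX implementation reporting an undeclared reference as an ENTITY_REFERENCE event once IS_REPLACING_ENTITY_REFERENCES is switched off; other implementations may behave differently. The class name EntityTolerantReader and the method entityNames are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: reads XML without resolving entity references.
public class EntityTolerantReader {

    /** Returns the names of all entity references found in element content. */
    public static List<String> entityNames(String xml) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the parser to hand entity references to us instead of expanding them.
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);

        List<String> names = new ArrayList<>();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.ENTITY_REFERENCE) {
                    // For this event type, getLocalName() is the entity name, e.g. "nbsp".
                    names.add(reader.getLocalName());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }
}
```

In a real scraper the ENTITY_REFERENCE branch would map known names such as nbsp to their replacement characters rather than just collecting them.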
If it's HTML pages that you're reading, I'd strongly recommend using one of the libraries that deals with the fact that even valid HTML isn't XML and most HTML isn't valid. Try one of these:
NekoHTML
TagSoup
Edit: Just saw that it's an Android app. That is going to make it tougher. NekoHTML comes in at 109kb and TagSoup at 89kb.
It seems to me that you've disabled the parser's ability to understand what to do with &nbsp;. What would you expect the SAX parser to do, given that it doesn't understand this entity at all?
Perhaps if you're scraping HTML, you may be better off using JTidy ? It's an HTML parser that presents the HTML in a DOM for further analysis.
I think it is possible to intercept these errors by writing your own DOMErrorHandler instance - more details here:
http://xerces.apache.org/xerces2-j/faq-write.html
I've used this approach to work around a problem where I was parsing a drawing as an SVG (XML) document generated by Corel Draw 12, which sometimes breaks the SVG DTD rules in the documents it outputs.
Why have you told it not to read the DTD? Is that because you don't want it to access this from the W3C servers by connecting to the internet; you want a standalone, off-network solution with a local DTD? I needed the same: I downloaded the SVG DTD and modules locally and used this Java library to force local DTD access: http://doctypechanger.sourceforge.net/
