NekoHTML SAX fragment parsing


I'm trying to parse a simple fragment of HTML with NekoHTML:
<h1>This is a basic test</h1>
To do so, I've set a specific Neko feature so that no HTML, HEAD or BODY tag triggers the startElement(..) callback.
Unfortunately, it doesn't work for me. I've certainly missed something but can't figure out what.
Here is some very simple code to reproduce my problem:
public static class MyContentHandler implements ContentHandler {
public void characters(char[] ch, int start, int length) throws SAXException {
String text = String.valueOf(ch, start, length);
System.out.println(text);
}
public void startElement(String nameSpaceURI, String localName, String rawName, Attributes attributes) throws SAXException {
System.out.println(rawName);
}
public void endElement(String nameSpaceURI, String localName, String rawName) throws SAXException {
System.out.println("end " + localName);
}
}
And the main() to launch a test :
public static void main(String[] args) throws SAXException, IOException {
SAXParser saxReader = new SAXParser();
// set the feature like explained in documentation : http://nekohtml.sourceforge.net/faq.html#fragments
saxReader.setFeature("http://cyberneko.org/html/features/balance-tags/document-fragment", true);
saxReader.setContentHandler(new MyContentHandler());
saxReader.parse(new InputSource(new StringInputStream("<h1>This is a basic test</h1>")));
}
The corresponding output :
HTML
HEAD
end HEAD
BODY
H1
This is a basic test
end H1
end BODY
end HTML
whereas I was expecting
H1
This is a basic test
end H1
Any idea ?

I finally got it!
I was parsing my HTML string in a GWT application to which I had added the gwt-dev.jar dependency. This jar bundles a lot of external libraries, including xercesImpl, but the version of the embedded Xerces classes does not match the one required by NekoHTML.
As a (strange) result, the NekoHTML SAX parser appeared to ignore any custom feature when running against the Xerces version embedded in gwt-dev.
So I had to rework some code to remove the gwt-dev dependency, which, by the way, is not recommended in any standard GWT project.
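To check for this kind of conflict, it can help to print which jar a given class is actually loaded from. The following is only a diagnostic sketch: the class name ClassOrigin and the default Xerces class name are illustrative assumptions, not something taken from the question.

```java
import java.security.CodeSource;

class ClassOrigin {
    /** Returns the location a class was loaded from, or a marker for JDK classes. */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(JDK / bootstrap class loader)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) {
        // Assumed default: a Xerces class that may or may not be on your classpath.
        String className = args.length > 0
                ? args[0]
                : "org.apache.xerces.parsers.AbstractSAXParser";
        try {
            System.out.println(className + " <- " + locationOf(Class.forName(className)));
        } catch (ClassNotFoundException e) {
            System.out.println(className + " is not on the classpath");
        }
    }
}
```

If the printed location points at gwt-dev.jar rather than a standalone xercesImpl jar, the embedded copy is the one being used.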

Related

JSoup not showing correct text

I wanted to create a Java app which crawls the song name from a website called chillstep.info and saves it into a .txt file. However, Jsoup prints this out:
<div id="titel">
♫
</div>
Here's the code:
public class Crawltitle {
public static void getTitle() throws IOException{
Document doc = Jsoup.connect("http://chillstep.info/").get();
String title = doc.getElementById("titel").outerHtml();
System.out.println(title);
}
public static void main(String[] args) throws IOException{
getTitle();
}
}
Is this problem caused by the website (if so, why, and how can I solve it) or by Jsoup?

The title is loaded dynamically via
http://chillstep.info/jsonInfo.php
You can still use Jsoup to fetch this if you tell it to ignore the content type, which it would otherwise reject:
Connection con = Jsoup
.connect("http://chillstep.info/jsonInfo.php")
.ignoreContentType(true);
Response res = con.execute();
String rawJSON = res.body();
Note that I did not use the Jsoup parser, so you might as well use any other library to fetch HTTP content, such as Apache HttpClient.
At this point you can parse the answer with a JSON library of your choice, or do it "by hand" since it is so simple:
String title = rawJSON.replaceAll(".*:\"([^\"]*).*","$1");
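A slightly more defensive variant of the "by hand" approach might look like the sketch below. It assumes the response contains a string field named "title"; that field name, the sample payload, and the class TitleExtractor are assumptions made for illustration, since the real response format is not shown here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleExtractor {
    // Captures the value of "title":"..."; escaped quotes inside the value are not handled.
    private static final Pattern TITLE =
            Pattern.compile("\"title\"\\s*:\\s*\"([^\"]*)\"");

    /** Returns the captured title, or null when the field is absent. */
    static String extractTitle(String rawJson) {
        Matcher m = TITLE.matcher(rawJson);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up sample payload; the real response may look different.
        String rawJson = "{\"title\":\"Some Artist - Some Song\"}";
        System.out.println(extractTitle(rawJson));
    }
}
```

Unlike replaceAll, which returns the input unchanged when the pattern does not match, this returns null so a missing field is easy to detect. For anything beyond a flat payload, a real JSON library is the safer choice.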

Getting text off a website and set it as a string in Java

I'm trying to get some text from a website and set it as a String in Java.
I have little to no experience with web connections in Java and would appreciate some help.
Here's what I've got so far:
static String wgetURL = "http://www.realmofthemadgod.com/version.txt";
static Document Version;
static String displayLink = "http://www.realmofthemadgod.com/AGCLoader" + Version + ".swf";
public static void main(String[] args) throws IOException{
Version = Jsoup.connect(wgetURL).get();
System.out.println(Version);
JOptionPane.showMessageDialog(null, Version, "RotMG SWF Finder", JOptionPane.DEFAULT_OPTION);
}
I'm trying to use Jsoup but I keep getting startup errors (it has issues when starting up).

Your problem is not Jsoup related.
You are building displayLink from Version in a static initializer, before Version has been assigned, so the string is built from null.
Change your code to:
public static void main(String[] args) throws IOException{
String url = "http://www.realmofthemadgod.com/version.txt";
Document doc = Jsoup.connect(url).get();
System.out.println(doc);
// query doc using jsoup ...
}
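The initialization-order issue can be reproduced without any network access. In this minimal sketch, the class name InitOrderDemo, the example.invalid host, and the value "123" are placeholders chosen for illustration:

```java
class InitOrderDemo {
    // Static initializers run top to bottom, so version is still null on the next line.
    static String version;
    static String eagerLink = "http://example.invalid/AGCLoader" + version + ".swf";

    // Building the link on demand uses the value passed in at call time.
    static String buildLink(String currentVersion) {
        return "http://example.invalid/AGCLoader" + currentVersion + ".swf";
    }

    public static void main(String[] args) {
        version = "123"; // stand-in for the text fetched from version.txt
        System.out.println(eagerLink);          // still contains "null"
        System.out.println(buildLink(version)); // contains "123"
    }
}
```

Assigning version later does not change eagerLink, because the concatenation already happened when the class was initialized.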

Strip tags from generated XML file

I have a Java class which returns the statuses of particular systems. It returns a new ResponseEntity, from which an XML file is generated. I want to strip the XML tags from the file and display just the content.
Java:
@RequestMapping(method = RequestMethod.GET)
public ResponseEntity<StatusData> getStatus() throws IOException {
StatusData status = new StatusData();
status.setIsDbUp(statusService.isDbUp());
status.setIsAppUp(statusService.isAppUp());
return new ResponseEntity<StatusData>(status, HttpStatus.OK);
}
Generated XML:
<com.ck.app.StatusData>
<isDbUp>DB: UP</isDbUp>
<isAppUp>APP: UP</isAppUp>
</com.ck.app.StatusData>
I wrote an XSL script but am unsure how to apply it.

Check this for examples of how to run an XSLT transformation from within Java: http://docs.oracle.com/javase/tutorial/jaxp/xslt/transformingXML.html
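As a rough sketch of what applying a stylesheet looks like with the standard javax.xml.transform API: the stylesheet below is an illustrative one that prints the text of every leaf element on its own line. It is not the asker's XSL script, and the class name StripTags is made up.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class StripTags {
    // Illustrative stylesheet: text output, one line per leaf element.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:strip-space elements='*'/>"
      + "<xsl:template match='*[not(*)]'>"
      + "<xsl:value-of select='.'/><xsl:text>&#10;</xsl:text>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to the given XML and returns the text result. */
    static String textOnly(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String xml = "<com.ck.app.StatusData><isDbUp>DB: UP</isDbUp>"
                   + "<isAppUp>APP: UP</isAppUp></com.ck.app.StatusData>";
        System.out.print(textOnly(xml));
    }
}
```

To use your own script, pass a StreamSource for the .xsl file to newTransformer instead of the inline string.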

I have used jsoup before to do this with HTML, and I expect it will work for your XML as well:
http://jsoup.org
Here is an example
jsoup - strip all formatting and link tags, keep text only

Take a look at this example, which strips the XML tags and shows only the content:
http://www.w3schools.com/xsl/tryxslt.asp?xmlfile=cdcatalog&xsltfile=cdcatalog

You can recurse through the DOM tree and print the text nodes.
public static void main(String[] args) throws IOException, SAXException, ParserConfigurationException {
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(Main.class.getResourceAsStream("/file.xml"));
visit(doc);
}
public static void visit(Node node) {
NodeList nl = node.getChildNodes();
for (int i = 0; i < nl.getLength(); i++) {
Node child = nl.item(i);
if (child.getNodeType() == Node.TEXT_NODE)
System.out.println(child.getTextContent());
visit(child);
}
}

Error transforming XML

I have a problem parsing an xml, actually transforming it.
The error I get is:
ERROR: 'Namespace for prefix 'SOAP-ENV' has not been declared.'
Jul 8, 2011 3:24:54 PM kumar.runs.start$2 run
SEVERE: null
javax.xml.transform.TransformerException: java.lang.RuntimeException: Namespace for prefix 'SOAP-ENV' has not been declared.
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:716)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:313).........
The code I use is:
SAXParserFactory saxFactory = SAXParserFactory.newInstance();
SAXParser parser = saxFactory.newSAXParser();
XMLReader reader = new XMLTrimFilter(parser.getXMLReader());
TransformerFactory factory = TransformerFactory.newInstance();
Transformer transformer = factory.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "no");
DOMResult result = new DOMResult();
SAXSource ss = new SAXSource(reader, is);
transformer.transform(ss, result);
return (Document)result.getNode();
XMLTrimFilter is custom implementation, extends XMLFilterImpl.
Also I came across this:
A Bug
but it is a rather old issue.
Does anybody have an idea how to fix it?
Thanks!
[Edit:
the xml:
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:enc="http://schemas.xmlsoap.org/soap/encoding/" xmlns:env="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
<SOAP-ENV:Header />
<SOAP-ENV:Body>
<swp:addOwnRet xmlns:swbep="urn:SWBEP">
<apples>33</apples>
<bucket>
<orange>5</orange>
<banana>5</banana>
</bucket>
</swp:addOwnRet>
</SOAP-ENV:Body>
]
Edit 2:
XMLTrimFilter:
package kumar.srcs;
import java.io.CharArrayWriter;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;
public class XMLTrimFilter extends XMLFilterImpl{
private CharArrayWriter contents = new CharArrayWriter();
public XMLTrimFilter(XMLReader parent){
super(parent);
}
public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException{
writeContents();
super.startElement(uri, localName, qName, atts);
}
public void characters(char ch[], int start, int length){
contents.write(ch, start, length);
}
public void endElement(String uri, String localName, String qName) throws SAXException{
writeContents();
super.endElement(uri, localName, qName);
}
public void ignorableWhitespace(char ch[], int start, int length){}
private void writeContents() throws SAXException{
char ch[] = contents.toCharArray();
if(!isWhiteSpace(ch))
super.characters(ch, 0, ch.length);
contents.reset();
}
private boolean isWhiteSpace(char ch[]){
for(int i = 0; i<ch.length; i++){
if(!Character.isWhitespace(ch[i]))
return false;
}
return true;
}
}

We don't have enough information, but the first two things I would suspect are:
The input XML doesn't declare the namespace properly; i.e. it is invalid XML.
There is a bug in your custom XMLTrimFilter class.
The Sun bug is against a really old version of JAXP and was fixed a long time ago. And it doesn't much resemble your case ... to me.
The XML that you pasted is missing a namespace declaration, and will give errors if you try to parse it with a validating namespace aware XML parser. This could be the cause of your problems, though the error message doesn't seem right. A more likely cause is your custom filter, IMO.

After checking your XML in a suitable editor, I noticed that there's no namespace defined for the prefix "swp", the one that element addOwnRet falls under. It's possible that for SOAP usage purposes this is alright (I'm not very familiar with the protocol), but for an XSLT processor this is simply an XML document and nothing more.
Now, your exception said "namespace for prefix SOAP-ENV has not been declared". It says nothing about "swp". But it's not impossible that there's some bug in the exception reporting that puts the wrong prefix name in the message.
It would make sense that other processing doesn't fail, since an undeclared namespace prefix makes an XML document invalid, but doesn't necessarily make it non-well-formed. An XSLT processor must make use of namespace scopes to properly determine which templates an input node fits, so it requires the URI that the prefix is bound to.
If you can manually supply an XML document to your transformation, I suggest sending it through without that "swp" prefix, or simply declare some random namespace URI for it. Then see if this still happens. It's also possible that swbep should be used and the swp is a mistake.
The closing tag for the document is also missing, but I assume that simply fell off when pasting it into your post.
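One way to test the undeclared-prefix theory in isolation is to feed a cut-down document to a namespace-aware parser, once with the prefix unbound and once with it bound. This is a sketch under assumptions: the class name PrefixCheck is made up, and binding "swp" to urn:SWBEP is only a guess at what the document author intended.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class PrefixCheck {
    /** Returns true if the XML parses with namespace processing switched on. */
    static boolean parsesWithNamespaces(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // JAXP factories are not namespace-aware by default
        DocumentBuilder builder = factory.newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            System.out.println("Rejected: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // "swp" is used but only "swbep" is declared, as in the posted XML.
        String unbound = "<Body><swp:addOwnRet xmlns:swbep='urn:SWBEP'/></Body>";
        // Same element with the prefix bound (the URI is a guess).
        String bound = "<Body><swp:addOwnRet xmlns:swp='urn:SWBEP'/></Body>";
        System.out.println(parsesWithNamespaces(unbound));
        System.out.println(parsesWithNamespaces(bound));
    }
}
```

Since SAXParserFactory is also not namespace-aware by default, it may be worth checking whether calling setNamespaceAware(true) on the factory in the question changes the error as well.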

How to get a node from xml not knowing its level in Java

We have tree structure like this:
<Tree>
<child1>
<child2>
</child2>
</child1>
</Tree>
Here the child2 can be at any level. Is there any way we can access child2 without knowing the hierarchy?
Thanks for all the answers. Is there any way to do this in Castor? We are using Castor for marshalling and unmarshalling.
Here is a similar question: How to get a node from xml not knowing its level in flex?

Using XPath, you could do it something like this:
XPath xpath = XPathFactory.newInstance().newXPath();
NodeList child2Nodes= (NodeList) xpath.evaluate("//child2", doc,
XPathConstants.NODESET);
Here doc is your org.w3c.dom.Document instance.
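For completeness, here is a self-contained sketch of the XPath approach that also builds the Document from a string. The class name FindChild2 and the sample XML are illustrative assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FindChild2 {
    /** Returns every element with the given name, at any depth, in document order. */
    static NodeList findAll(String xml, String elementName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // "//name" selects matching elements anywhere below the root.
        return (NodeList) xpath.evaluate("//" + elementName, doc, XPathConstants.NODESET);
    }

    public static void main(String[] args) throws Exception {
        // Sample tree with child2 at two different levels.
        String xml = "<Tree><child1><child2>a</child2>"
                   + "<other><child2>b</child2></other></child1></Tree>";
        NodeList nodes = findAll(xml, "child2");
        for (int i = 0; i < nodes.getLength(); i++) {
            System.out.println(nodes.item(i).getTextContent());
        }
    }
}
```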

Use XPath to get the nodes.
//child2 - to get the list of all "child2" elements

If you can use a SAX parser, then it is easy. Here is your ContentHandler:
public class CH extends DefaultHandler
{
@Override
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException
{
if (qName.equals("child2"))
{
// here you go do what you need here with the attributes
}
}
}
Pass it to the parser and you are done, like this:
import org.xml.sax.*;
public class TestParse {
public static void main(String[] args) {
try {
XMLReader parser =
org.xml.sax.helpers.XMLReaderFactory.createXMLReader();
// Create a new instance and register it with the parser
ContentHandler contentHandler = new CH();
parser.setContentHandler(contentHandler);
parser.parse("foo.xml"); // see javadoc you can give it a string or stream
} catch (Exception e) {
e.printStackTrace();
}
}
}
