I was going to ask an entirely different question, but magically managed to solve that. So, a new problem.
I'm working on an Android app that uses a SAX parser. I have an XML file that mostly contains elements like
<content:encoded>bla bla bla</content:encoded>
And I know I can use localName encoded to get that one.
@Override
public void endElement(String uri, String localName, String qName)
throws SAXException {
if (localName.equalsIgnoreCase(descriptionId))
{
if (isItem){descriptionList.add(buff.toString());}
}
... etc etc
but then there's this:
<enclosure url="SOME URL" length="100623688" type="audio/mpeg"/>
And I want to extract SOME URL. Does anyone know how I would do that?
Thanks a lot,
Never developed for android, but if I understand you correctly, you need to read the attributes of that XML element.
In the startElement method of your SaxParser you'll have an argument "Attributes attrs" or something along those lines (at least this is what I remember from the Xerces SAX Parser).
That Attributes object contains the various... attributes of the element =)
It isn't literally a Map, but it can be queried by name as well as by index, e.g. attrs.getValue("url"). You can check that quickly in a debugger.
Hope it helps.
Here SOME URL is the value of the attribute url which belongs to the enclosure tag.
Here is a sample taken from
http://www.exampledepot.com/egs/org.xml.sax/GetAttr.html
// Create a handler for SAX events
DefaultHandler handler = new MyHandler();
// Parse an XML file using SAX. parseXmlFile comes from another exampledepot sample,
// "The Quintessential Program to Parse an XML File Using SAX"
parseXmlFile("infilename.xml", handler, true);
// This class listens for startElement SAX events
static class MyHandler extends DefaultHandler {
// This method is called when an element is encountered
public void startElement(String namespaceURI, String localName,
String qName, Attributes atts) {
// Get the number of attributes
int length = atts.getLength();
// Process each attribute
for (int i=0; i<length; i++) {
// Get names and values for each attribute
String name = atts.getQName(i);
String value = atts.getValue(i);
// The following methods are valid only if the parser is namespace-aware
// The uri of the attribute's namespace
String nsUri = atts.getURI(i);
// This is the name without the prefix
String lName = atts.getLocalName(i);
}
}
}
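Applied to the enclosure element from the question, the same idea might look like the sketch below. EnclosureUrlExample, EnclosureHandler and extractUrls are made-up names for illustration, and the sketch uses the desktop JAXP parser, which is not namespace-aware by default, so it matches on qName. On Android you would probably compare localName instead, as in the question's code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EnclosureUrlExample {

    // Hypothetical handler: collects the url attribute of every <enclosure> element.
    static class EnclosureHandler extends DefaultHandler {
        final List<String> urls = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Attributes are delivered with the start tag, so no characters() handling is needed.
            if ("enclosure".equalsIgnoreCase(qName)) {
                String url = atts.getValue("url"); // lookup by name; null if the attribute is absent
                if (url != null) {
                    urls.add(url);
                }
            }
        }
    }

    // Parses the given XML text and returns all enclosure URLs in document order.
    static List<String> extractUrls(String xml) throws Exception {
        EnclosureHandler handler = new EnclosureHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.urls;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<item><enclosure url=\"SOME URL\" length=\"100623688\" type=\"audio/mpeg\"/></item>";
        System.out.println(extractUrls(xml)); // prints [SOME URL]
    }
}
```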
When XML is passed to a SAX parser as an input stream and some of its elements are empty,
the handler's characters() method is not called for them and I get the wrong result.
For example,
XML input:
<root>
<salutation>Hello Sir</salutation>
<userName />
<parent>
<child>a</child>
</parent>
<parent>
<child>b</child>
</parent>
<parent>
<child>c</child>
</parent>
<success>yes</success>
<hoursSpent />
</root>
Parser Implementation:
public class MyContentHandler implements ContentHandler {
private String salutation;
private String userName;
private String success;
private String hoursSpent;
String tmpValue="";
public void endElement(String uri, String localName, String qName) throws SAXException {
if ("salutation".equals(qName)) {
salutation=tmpValue;
} else if ("userName".equals(qName)) {
userName=tmpValue;
} else if ("success".equals(qName)) {
success=tmpValue;
} else if ("hoursSpent".equals(qName)) {
hoursSpent=tmpValue;
}
}
public void characters(char[] ch, int begin, int length) throws SAXException {
tmpValue = new String(ch, begin, length).trim();
}
// getters and the remaining ContentHandler methods omitted
}
Main Program:
public class MainProgram{
public static void main(String[] args) throws Exception {
SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
SAXParser saxParser = saxParserFactory.newSAXParser();
XMLReader xmlReader = saxParser.getXMLReader();
MyContentHandler contentHandler = new MyContentHandler(xmlReader);
xmlReader.setContentHandler(contentHandler);
String input = "<root><salutation>Hello Sir</salutation><userName /><parent><child>a</child></parent><parent><child>b</child></parent><success>yes</success><hoursSpent /></root>";
InputStream stream = new ByteArrayInputStream(input.getBytes());
xmlReader.parse(new InputSource(stream));
System.out.println(contentHandler.getUserName()); // prints "Hello Sir" instead of null
System.out.println(contentHandler.getHoursSpent()); // prints "yes" instead of null
}
}
If an empty element is written in self-closing form,
<userName />
rather than as <userName></userName>, the characters() method of the handler is not called and the wrong value is set. This only happens when I pass the XML as an input stream. Please help me solve this.
The parser is behaving exactly as specified, it is your code that is wrong.
In general the parser makes zero-to-many calls on the characters() method between a start tag and the corresponding end tag. You need to initialize an empty buffer in startElement(), append to the buffer in characters(), and then use the accumulated value in endElement().
The way you have written it, you will not only get the wrong result for an empty element, you will also get the wrong result if the parser breaks the text up into multiple calls, which often happens if (a) there are entity references in the text, or (b) the text is very long, or (c) the text happens to span two chunks that are read from the input stream in separate read() calls.
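A minimal sketch of that pattern, assuming you only care about the text of leaf elements. BufferingHandlerExample, BufferingHandler and parse are names invented for this example, and the map of values stands in for the individual fields in the question.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BufferingHandlerExample {

    // Hypothetical handler: records the trimmed text of each element, keyed by qName.
    static class BufferingHandler extends DefaultHandler {
        final Map<String, String> values = new LinkedHashMap<>();
        private final StringBuilder buffer = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            buffer.setLength(0); // every element starts with an empty buffer
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            buffer.append(ch, start, length); // called zero or more times per element
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            values.put(qName, buffer.toString().trim()); // use the accumulated text
            buffer.setLength(0);
        }
    }

    // Parses the given XML text and returns the collected element values.
    static Map<String, String> parse(String xml) throws Exception {
        BufferingHandler handler = new BufferingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.values;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> v = parse("<root><salutation>Hello Sir</salutation><userName />"
                + "<success>yes</success><hoursSpent /></root>");
        System.out.println("[" + v.get("userName") + "]");   // prints []
        System.out.println("[" + v.get("hoursSpent") + "]"); // prints []
    }
}
```

With this handler an empty element yields an empty string rather than the previous element's text. If you need null for empty elements, you could map the empty string to null in endElement().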
This may be one of the insane / stupid / dumb / lengthy questions as I am newbie to web services.
I want to write a web service which will return answer in XML format (I am using my service for YUI autocomplete). I am using Eclipse and Axis2 and following http://www.softwareagility.gr/index.php?q=node/21
I want response in following format
<codes>
<code value="Pegfilgrastim"/>
<code value="Peggs"/>
<code value="Peggy"/>
<code value="Peginterferon alfa-2 b"/>
<code value="Pegram"/>
</codes>
The number of code elements may vary depending on the response.
So far I have tried the following approaches:
1) Create the XML using a StringBuffer and return the string. (I am providing partial code to avoid confusion.)
public String myService ()
{
// Some other stuff
StringBuffer outputXML = new StringBuffer();
outputXML.append("<?xml version='1.0' standalone='yes'?>");
outputXML.append("<codes>");
while(SOME_CONDITION)
{
// Some business logic
outputXML.append("<code value=\""+tempStr+"\">"+"</code>");
}
outputXML.append("</codes>");
return (outputXML.toString());
}
It gives the following response, with unwanted <ns:myServiceResponse> and <ns:return> elements:
<ns:myServiceResponse>
<ns:return>
<?xml version='1.0' standalone='yes'?><codes><code value="Peg-shaped teeth"></code><code value="Pegaspargase"></code><code value="Pegfilgrastim"></code><code value="Peggs"></code><code value="Peggy"></code><code value="Peginterferon alfa-2 b"></code><code value="Pegram"></code></codes>
</ns:return>
</ns:myServiceResponse>
But it didn't work with YUI autocomplete (maybe because it requires the response in the format shown above).
2) Using DocumentBuilderFactory, like this:
public Element myService ()
{
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = factory.newDocumentBuilder();
Document doc = docBuilder.newDocument();
Element codes = doc.createElement("codes");
while(SOME_CONDITION)
{
// Some business logic
Element code = doc.createElement("code");
code.setAttribute("value", tempStr);
codes.appendChild(code);
}
return(codes);
}
I got the following error:
org.apache.axis2.AxisFault: Mapping qname not fond for the package: com.sun.org.apache.xerces.internal.dom
3) Using servlet : I tried to get same response using simple servlet and it worked. Here is my servlet
protected void doGet(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException
{
StringBuffer outputXML = new StringBuffer();
response.setContentType("text/xml");
PrintWriter out = response.getWriter();
outputXML.append("<?xml version='1.0' standalone='yes'?>");
outputXML.append("<codes>");
while(SOME_CONDITION)
{
// Some business logic
outputXML.append("<code value=\"" + tempStr + "\">" + "</code>");
}
outputXML.append("</codes>");
out.println(outputXML.toString());
}
It gave the response in the format shown above, without any extra tags, and it worked with YUI autocomplete.
Can you please tell me how I can get an XML response without any unwanted elements?
Thanks.
Axis2 is for delivering objects back to the caller. That's why it adds extra stuff to the response, even if it is a simple String object.
With the second approach your service returns a complex Java object (an Element instance) that describes an XML fragment. The caller would have to know about this object to be able to deserialize it and restore the Java object that contains the XML data.
The third approach is the simplest and best in your case regarding the return type: it doesn't return a serialized Java object, only the plain XML text. Of course you could use DocumentBuilder to prepare the XML, but in the end you have to turn it into a String, for example by serializing the DOM with a javax.xml.transform.Transformer.
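If you do want to build the XML with DocumentBuilder and still return plain text, one option is to serialize the DOM with the standard javax.xml.transform API. This is only a sketch: CodesXmlExample and toXml are hypothetical names, and the list of values stands in for whatever your business logic produces.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class CodesXmlExample {

    // Builds <codes><code value="..."/>...</codes> as a DOM tree and returns it as text.
    static String toXml(List<String> values) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element codes = doc.createElement("codes");
        doc.appendChild(codes);
        for (String value : values) {
            Element code = doc.createElement("code");
            code.setAttribute("value", value); // special characters are escaped on output
            codes.appendChild(code);
        }

        // An identity transform copies the DOM into a character stream.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml(Arrays.asList("Pegfilgrastim", "Peggs", "Peggy")));
    }
}
```

In the servlet you would then write the returned string to response.getWriter(). Compared with concatenating strings, this takes care of escaping characters such as & and " in the attribute values.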
I finally got it to work, though I am not able to remove the unwanted elements. (I won't bother with that until everything else is in place.) I used AXIOM to generate the response.
public OMElement myService ()
{
OMFactory fac = OMAbstractFactory.getOMFactory();
OMNamespace omNs = fac.createOMNamespace("", "");
OMElement codes = fac.createOMElement("codes", omNs);
while(SOME_CONDITION)
{
OMElement code = fac.createOMElement("code", null, codes);
OMAttribute value = fac.createOMAttribute("value", null, tempStr);
code.addAttribute(value);
}
return(codes);
}
Links : 1) http://songcuulong.com/public/html/webservice/create_ws.html
2) http://sv.tomicom.ac.jp/~koba/axis2-1.3/docs/xdocs/1_3/rest-ws.html
I think you cannot return your custom xml with Axis. It will wrap it into its envelope anyways.
We need to extract a tree-like structure from a given text document using Java. The file type should be common and open (rtf, odt, ...). Currently we use Apache Tika to parse plain text from multiple documents.
What file type and API should we use to parse the correct structure most reliably? If this is possible with Tika, I would be happy to see a demonstration.
For example, we should get this kind of data from the given document:
Main Heading
Heading 1
Heading 1.1
Heading 2
Heading 2.2
Main Heading is the title of the paper. The paper has two main headings, Heading 1 and Heading 2, and each has one subheading. We should also get the content under each heading (the paragraph text).
Any help is appreciated.
OpenDocument (.odt) is essentially a zip package containing multiple XML files. content.xml contains the actual textual content of the document. We are interested in the headings, which are found inside text:h elements.
I found an implementation for extracting headings from .odt files with QueryPath.
Since the original question was about Java, here is a Java version. First we get access to content.xml using ZipFile. Then we use SAX to parse the XML in content.xml. The sample code simply prints all the headings, each preceded by its outline level. Output for Test3.odt:
Test3.odt
content.xml
3764
1 My New Great Paper
2 Abstract
2 Introduction
2 Content
3 More content
3 Even more
2 Conclusions
Sample code:
public void printHeadingsOfOdtFIle(File odtFile) {
try {
ZipFile zFile = new ZipFile(odtFile);
System.out.println(zFile.getName());
ZipEntry contentFile = zFile.getEntry("content.xml");
System.out.println(contentFile.getName());
System.out.println(contentFile.getSize());
XMLReader xr = XMLReaderFactory.createXMLReader();
OdtDocumentContentHandler handler = new OdtDocumentContentHandler();
xr.setContentHandler(handler);
xr.parse(new InputSource(zFile.getInputStream(contentFile)));
} catch (Exception e) {
e.printStackTrace();
}
}
public static void main(String[] args) {
new OdtDocumentStructureExtractor().printHeadingsOfOdtFIle(new File("Test3.odt"));
}
Relevant parts of used ContentHandler look like this:
@Override
public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
temp = "";
if("text:h".equals(qName)) {
String headingLevel = atts.getValue("text:outline-level");
if(headingLevel != null) {
System.out.print(headingLevel + " ");
}
}
}
@Override
public void characters(char[] ch, int start, int length) throws SAXException {
// characters() may be called several times per element, so append rather than overwrite
String chunk = new String(ch, start, length);
temp += chunk;
fullText.append(chunk);
}
@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
if("text:h".equals(qName)) {
System.out.println(temp);
}
}
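To get from the printed (level, title) pairs to the tree the question asks for, one possible approach is to keep a stack of the currently open headings. The sketch below is independent of the ODT parsing. Node, OutlineBuilder and render are made-up names, and the handler above would call addHeading(level, title) in endElement() instead of printing.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class HeadingTreeExample {

    static class Node {
        final int level;
        final String title;
        final List<Node> children = new ArrayList<>();

        Node(int level, String title) {
            this.level = level;
            this.title = title;
        }
    }

    // Hypothetical builder: feed it headings in document order.
    static class OutlineBuilder {
        private final Node root = new Node(0, "(document)");
        private final Deque<Node> open = new ArrayDeque<>();

        OutlineBuilder() {
            open.push(root);
        }

        void addHeading(int level, String title) {
            if (level < 1) {
                throw new IllegalArgumentException("outline level must be >= 1");
            }
            // Close every open heading at the same or a deeper level.
            while (open.peek().level >= level) {
                open.pop();
            }
            Node node = new Node(level, title);
            open.peek().children.add(node); // the nearest shallower heading is the parent
            open.push(node);
        }

        Node root() {
            return root;
        }
    }

    // Renders the children of the given node as an indented outline.
    static String render(Node node, String indent) {
        StringBuilder sb = new StringBuilder();
        for (Node child : node.children) {
            sb.append(indent).append(child.title).append('\n');
            sb.append(render(child, indent + "  "));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        OutlineBuilder builder = new OutlineBuilder();
        builder.addHeading(1, "My New Great Paper");
        builder.addHeading(2, "Abstract");
        builder.addHeading(2, "Content");
        builder.addHeading(3, "More content");
        builder.addHeading(2, "Conclusions");
        System.out.print(render(builder.root(), ""));
    }
}
```

Paragraph text could be attached in the same way, by adding each text:p element's content to the node on top of the stack.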
I have the following XML to parse:
<feed>
<feed_id>12941450184d2315fa63d6358242</feed_id>
<content> <fieldset><table cellpadding='0' border='0' cellspacing='0' style="clear :both"><tr valign='top' ><td width='35' ><a href='http://mypage.rediff.com/android/32868898' class='space' onmousedown="return enc(this,'http://track.rediff.com/click?url=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F32868898___&service=mypage_feeds&clientip=202.137.232.113&pos=0&feed_id=12941450184d2315fa63d6358242&prc_id=32868898&rowid=674061088')" ><div style='width:25px;height:25px;overflow:hidden;'><img src='http://socialimg04.rediff.com/image.php?uid=32868898&type=thumb' width='25' vspace='0' /></div></a></td> <td><span><a href='http://mypage.rediff.com/android/32868898' class="space" onmousedown="return enc(this,'http://track.rediff.com/click?url=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F32868898___&service=mypage_feeds&clientip=202.137.232.113&pos=0&feed_id=12941450184d2315fa63d6358242&prc_id=32868898&rowid=674061088')" >Android </a> </span><span style='color:#000000 !important;'>testing</span><div class='divtext'></div></td></tr><tr><td height='5' ></td></tr></table></fieldset><br/></content>
<action>status updated</action>
</feed>
The <content> tag contains HTML, which holds the data I need. I am using a SAX parser. Here's what I am doing:
private Timeline timeLine; //Object
private String tempStr;
public void characters(char[] ch, int start, int length)
throws SAXException {
tempStr = new String(ch, start, length);
}
public void endElement(String uri, String localName, String qName)
throws SAXException {
if (localName.equalsIgnoreCase("content")) {
if (timeLine != null) {
timeLine.setContent(tempStr);
}
}
}
Will this logic work? If not, how should I extract the embedded HTML from the XML using a SAX parser?
You can parse the HTML, provided it is well-formed (XHTML), since it is then also XML. There is a similar question on Stack Overflow that you can try: How to parse the html content in android using SAX PARSER
On start element:
if the element is content, initialize your tempStr buffer;
else if content has already started, capture the current start element and its attributes and append them to the tempStr buffer.
On characters:
if content has started, append the characters to the buffer.
On end element:
if content has started, capture the end tag and append it to the buffer.
My assumption:
the XML will have only one content tag.
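The steps above can be sketched roughly as follows, assuming the markup inside <content> is well-formed. InnerMarkupExample, InnerMarkupHandler, extractContent and escape are names invented for this sketch, and a depth counter replaces the "content started" flag so that nested elements are tracked.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class InnerMarkupExample {

    // Hypothetical handler: rebuilds the markup found inside <content> as a string.
    static class InnerMarkupHandler extends DefaultHandler {
        private final StringBuilder html = new StringBuilder();
        private int depth = 0; // 0 = outside <content>, 1 = directly inside, >1 = nested
        String content;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (depth > 0) {
                // Inside <content>: write the start tag and its attributes back out.
                html.append('<').append(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    html.append(' ').append(atts.getQName(i))
                        .append("=\"").append(escape(atts.getValue(i))).append('"');
                }
                html.append('>');
                depth++;
            } else if ("content".equals(qName)) {
                html.setLength(0);
                depth = 1;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) {
                html.append(escape(new String(ch, start, length)));
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (depth > 1) {
                html.append("</").append(qName).append('>');
                depth--;
            } else if (depth == 1) {
                content = html.toString(); // reached </content>
                depth = 0;
            }
        }
    }

    // The parser hands over unescaped text, so it has to be escaped again on the way out.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    static String extractContent(String xml) throws Exception {
        InnerMarkupHandler handler = new InnerMarkupHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.content;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractContent(
                "<feed><content><span style='color:#000000'>testing</span></content></feed>"));
    }
}
```

Note that self-closing elements such as <br/> come back as <br></br>, because SAX reports them as a start event followed by an end event.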
If the html is actually xhtml, you can parse it using SAX and extract the xhtml contents of the <content> tag, but not nearly this easily.
You would have to make your handler actually respond to the events that will be raised by all the xhtml tags inside the <content> tag, and either build something resembling a DOM structure, which you could then serialize back out to xml form, or on-the-fly directly write into an xml string buffer replicating the contents.
If you modify your xml so that the html inside the content tag is wrapped in a CDATA element as suggested in How to parse the html content in android using SAX PARSER, something not too far from your code should indeed work.
But you can't just put the contents into your String tempStr variable in the characters method as you're doing. You'll need to have a startElement method that initializes a buffer for the string on seeing the <content> tag, collect into that buffer in the characters method, and put the result somewhere in the endElement for the <content> tag.
I found a solution this way:
Note: in this solution I want to get the HTML content between <chapter> tags (<chapter> ... html content ... </chapter>).
DefaultHandler handler = new DefaultHandler() {
boolean chap = false;
public char[] temp;
int chapterStart;
int chapterEnd;
public void startElement(String uri, String localName,
String qName, Attributes attributes)
throws SAXException {
System.out.println("Start Element :" + qName);
if (qName.equalsIgnoreCase("chapter")) {
chap = true;
}
}
public void endElement(String uri, String localName,
String qName) throws SAXException {
if (qName.equalsIgnoreCase("chapter")) {
System.out.println(new String(temp, chapterStart, chapterEnd-chapterStart));
}
System.out.println("End Element :" + qName);
}
public void characters(char ch[], int start, int length)
throws SAXException {
if (chap) {
temp = ch;
chapterStart = start;
chap = false;
}
chapterEnd = start + length;
}
};
Update:
My code has a bug: the ch[] array passed to characters() varies in length and content from call to call, so offsets saved during one call cannot be relied on in a later one!
How can I force a SAX parser (specifically, Xerces in Java) to use a DTD when parsing a document without having any doctype in the input document? Is this even possible?
Here are some more details of my scenario:
We have a bunch of XML documents that conform to the same DTD that are generated by multiple different systems (none of which I can change). Some of these systems add a doctype to their output documents, others do not. Some use named character entities, some do not. Some use named character entities without declaring a doctype. I know that's not kosher, but it's what I have to work with.
I'm working on a system that needs to parse these files in Java. Currently, it's handling the above cases by first reading in the XML document as a stream, attempting to detect if it has a doctype defined, and adding a doctype declaration if one isn't already present. The problem is that this code is buggy, and I'd like to replace it with something cleaner.
The files are large, so I can't use a DOM-based solution. I'm also trying to get character entities resolved, so it doesn't help to use an XML Schema.
If you have a solution, could you please post it directly instead of linking to it? It doesn't do Stack Overflow much good if in the future there's a correct solution with a dead link.
I think there is no sane way to set a DOCTYPE if the document doesn't have one. A possible solution is to write a fake one, as you already do. If you're using SAX, you can use the fake InputStream and DefaultHandler implementation below. (It will work only for Latin-1 or another single-byte encoding.)
I know this solution is also ugly, but it is the only one that works well with big data streams.
Here is some code.
private enum State {readXmlDec, readXmlDecEnd, writeFakeDoctipe, writeEnd};
private class MyInputStream extends InputStream{
private final InputStream is;
private StringBuilder sb = new StringBuilder();
private int pos = 0;
private String doctype = "<!DOCTYPE register SYSTEM \"fake.dtd\">";
private State state = State.readXmlDec;
private MyInputStream(InputStream source) {
is = source;
}
@Override
public int read() throws IOException {
int bit;
switch (state){
case readXmlDec:
bit = is.read();
sb.append(Character.toChars(bit));
if(sb.toString().equals("<?xml")){
state = State.readXmlDecEnd;
}
break;
case readXmlDecEnd:
bit = is.read();
if(Character.toChars(bit)[0] == '>'){
state = State.writeFakeDoctipe;
}
break;
case writeFakeDoctipe:
bit = doctype.charAt(pos++);
if(doctype.length() == pos){
state = State.writeEnd;
}
break;
default:
bit = is.read();
break;
}
return bit;
}
@Override
public void close() throws IOException {
super.close();
is.close();
}
}
private static class MyHandler extends DefaultHandler {
@Override
public InputSource resolveEntity(String publicId, String systemId) throws IOException, SAXException {
System.out.println("resolve "+ systemId);
// get real dtd
InputStream is = ClassLoader.class.getResourceAsStream("/register.dtd");
return new InputSource(is);
}
... // rest of code
}
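A somewhat tighter variant of the same trick is sketched below. It peeks at the start of the stream, leaves documents that already contain a DOCTYPE alone, and otherwise splices the fake declaration in after the XML declaration (or at the very start if there is none). DoctypeInjector, withDoctype and readAll are hypothetical names. The sketch assumes an ASCII-compatible encoding and that the prolog fits in the first 1 KB, and it would be fooled by the text <!DOCTYPE appearing inside a leading comment.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;

public class DoctypeInjector {

    private static final int PEEK = 1024; // assumption: the prolog fits in the first 1 KB

    // Returns a stream with doctypeDecl inserted, unless the input already has a DOCTYPE.
    static InputStream withDoctype(InputStream in, String doctypeDecl) throws IOException {
        byte[] head = new byte[PEEK];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // end of stream before the peek buffer filled up
            }
            n += r;
        }
        // ISO-8859-1 maps each byte to one char, so string offsets equal byte offsets.
        String text = new String(head, 0, n, StandardCharsets.ISO_8859_1);

        if (text.contains("<!DOCTYPE")) {
            // Already declared: replay the peeked bytes unchanged.
            return new SequenceInputStream(new ByteArrayInputStream(head, 0, n), in);
        }
        int start = text.startsWith("\u00EF\u00BB\u00BF") ? 3 : 0; // skip a UTF-8 BOM
        int declEnd = text.startsWith("<?xml", start) ? text.indexOf("?>", start) : -1;
        int pos = declEnd >= 0 ? declEnd + 2 : start; // insertion point in bytes

        byte[] doctype = doctypeDecl.getBytes(StandardCharsets.US_ASCII);
        return new SequenceInputStream(Collections.enumeration(Arrays.<InputStream>asList(
                new ByteArrayInputStream(head, 0, pos),
                new ByteArrayInputStream(doctype),
                new ByteArrayInputStream(head, pos, n - pos),
                in)));
    }

    // Test helper: drains a stream into a string, one char per byte.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int r;
        while ((r = in.read(buf)) >= 0) {
            out.write(buf, 0, r);
        }
        return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        byte[] xml = "<?xml version=\"1.0\"?><register>&nbsp;</register>"
                .getBytes(StandardCharsets.US_ASCII);
        InputStream patched = withDoctype(new ByteArrayInputStream(xml),
                "<!DOCTYPE register SYSTEM \"fake.dtd\">");
        System.out.println(readAll(patched));
    }
}
```

The patched stream is then passed to the parser as before, with resolveEntity() in the handler mapping fake.dtd to the real DTD. Only the first kilobyte is buffered, so large files are still streamed.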