How does the SAX parsing method characters() work - java

I am trying to read data from an XML file and store it in a list using SAX parsing. My problem is with storing the values of my data in the characters() method. For each element I create an object that holds the element's value plus some extra information for later use, but when I try to store the value it stores spaces instead. I tried printing inside the method, and while it seems to go through the whole XML file, it prints only a couple of the elements and not even in the right order. Can someone explain to me what I am missing?
XML FILE:
<?xml version="1.0" encoding="UTF-8"?>
<CarModel>
<Audi model = "TT" year = "2006" starting_price = "33.000$">
<type>sport</type>
<horse_power>222hp</horse_power>
<drivetrain>quattro</drivetrain>
<transmission>6_Manual</transmission>
</Audi>
<Mercedes model = "W222_S400" year = "2013" starting_price = "63.000$">
<type>luxury</type>
<horse_power>302hp</horse_power>
<drivetrain>front_wheel_drive</drivetrain>
<transmission>7_Automatic</transmission>
</Mercedes>
</CarModel>
JAVA CODE :
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
lvl_cnt++;
System.out.println(lvl_cnt);
obj = new xml_obj();
obj.setLvl(lvl_cnt);
System.out.println("LVL "+obj.getLvl());
if (lvl_cnt == 0) {
obj.setValue(qName);
obj.setParent("root");
System.out.println(obj.getParent());
}
else {
xml_obj tmp = objListofLists.get(objListofLists.size()-1);
int lvl_before = tmp.getLvl();
System.out.println("AAA" + lvl_before);
if (lvl_cnt > lvl_before) {
obj.setParent(tmp.getValue());
}
else if (lvl_cnt < lvl_before) {
int j = 0;
while (objListofLists.get(j).getLvl() != lvl_cnt) j++;
tmp = objListofLists.get(j);
obj.setParent(tmp.getParent());
}
else {
obj.setParent(tmp.getParent());
}
System.out.println(obj.getParent());
}
obj.attributes = attributes;
objListofLists.add(obj);
}
public void endElement(String uri, String localName, String qName) throws SAXException {
lvl_cnt--;
}
public void characters (char ch[], int start, int length) throws SAXException {
String help = new String(ch, start, length);
System.out.println(help);
objListofLists.get(objListofLists.size()-1).setValue(help);
}
}

The characters() method only reports the text content of elements (no attributes), but that also includes all the spaces, tabs, and newlines that appear between tags anywhere inside the root element. The parser may also deliver the text of a single element in more than one call.
If you want to read text that is inside a particular element, you should set a flag when you enter that element (in startElement()), then unset it in endElement(), and inside characters() you test if you are currently in the element from which you wish to extract text.
private boolean inTypeTag = false;
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
    if (qName.equals("type")) {
        inTypeTag = true;
    }
    // ...
}
public void endElement(String uri, String localName, String qName) throws SAXException {
    if (qName.equals("type")) {
        inTypeTag = false;
    }
    // ...
}
public void characters(char ch[], int start, int length) throws SAXException {
    if (inTypeTag) {
        // do something with the text ("sport") which was found in here
    }
    // ...
}
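The flag approach above can be sketched as a complete program. This is only a minimal illustration, not the asker's handler: the class name TypeTextExample, the helper parseTypes, and the shortened XML string are assumptions made up for the example. It also appends to a StringBuilder instead of using the text directly, since characters() may be called more than once per element.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TypeTextExample {

    // Collects the text of every <type> element; all other text
    // (including whitespace between tags) is ignored.
    static class TypeHandler extends DefaultHandler {
        final List<String> types = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private boolean inTypeTag = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            if (qName.equals("type")) {
                inTypeTag = true;
                text.setLength(0); // start a fresh buffer for this element
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inTypeTag) {
                text.append(ch, start, length); // may run several times per element
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("type")) {
                inTypeTag = false;
                types.add(text.toString().trim());
            }
        }
    }

    // Hypothetical helper: parses an XML string and returns the <type> values in document order.
    static List<String> parseTypes(String xml) {
        TypeHandler handler = new TypeHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return handler.types;
    }

    public static void main(String[] args) {
        String xml = "<CarModel>\n"
                + "  <Audi model=\"TT\"><type>sport</type></Audi>\n"
                + "  <Mercedes model=\"W222_S400\"><type>luxury</type></Mercedes>\n"
                + "</CarModel>";
        System.out.println(parseTypes(xml)); // prints [sport, luxury]
    }
}
```

The value is stored in endElement() rather than in characters(), because only at that point is the element's text known to be complete.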

Related

Java: using SAX to parse XML files. Can't get the complete content when it contains &amp; [duplicate]

This question already has answers here:
SAX parsing and special characters
(2 answers)
Closed 5 years ago.
I have some issues with parsing xml files by sax.
The Java ContentHandler code looks like this:
boolean rcontent = false;
@Override
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
if (qName.equalsIgnoreCase("content")) {
rcontent = true;
}
}
@Override
public void characters(char ch[], int start, int length) throws SAXException {
if (rcontent){
System.out.println("content: " + new String(ch, start, length));
rcontent = false;
}
}
Xml file content is like this:
But the output is:
I want to say
which is not complete.
It's likely that characters(...) is being called multiple times for the single <content> block. Try something like
StringBuilder builder;
@Override
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
    if (qName.equalsIgnoreCase("content")) {
        // Start collecting text for this <content> element
        builder = new StringBuilder();
    }
}
@Override
public void characters(char ch[], int start, int length) throws SAXException {
    if (builder != null) {
        // Append every chunk; the text may arrive in several calls
        builder.append(ch, start, length);
    }
}
@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
    if (builder != null) {
        System.out.println("Content = " + builder);
        builder = null;
    }
}
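To see the chunking for yourself, a small diagnostic along these lines may help. It is a sketch under assumptions: the class ChunkCountExample, its CountingHandler, and the sample text are invented for illustration (the asker's actual XML is not shown above). The number of characters() calls depends on the parser, so the example prints it without relying on any particular value.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ChunkCountExample {

    // Counts how often characters() fires inside <content> and joins the chunks.
    static class CountingHandler extends DefaultHandler {
        int calls = 0;
        final StringBuilder content = new StringBuilder();
        private boolean inContent = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            if (qName.equalsIgnoreCase("content")) {
                inContent = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inContent) {
                calls++;
                content.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equalsIgnoreCase("content")) {
                inContent = false;
            }
        }
    }

    // Hypothetical helper: parses an XML string and returns the handler with its results.
    static CountingHandler parse(String xml) {
        CountingHandler handler = new CountingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        // Made-up sample text containing an entity reference
        CountingHandler h = parse("<doc><content>I want to say A &amp; B</content></doc>");
        System.out.println("characters() calls: " + h.calls); // parser-dependent
        System.out.println("content: " + h.content);
    }
}
```

Whatever the call count turns out to be, the joined content is the full text, which is why accumulating is safer than printing inside characters().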

Parsing xml special chars issue

I'm parsing XML received from a web service using SAX.
One of the fields is a link, like the following:
<link_site>
http://www.ownhosting.com/webservice_332.asp?id_user=21395&id_parent=33943
</link_site>
I have to get this link and save it, but it is saved like so: id_parent=33943.
Parser snippet:
//inside method startElement
else if(localName.equals("link_site")){
this.in_link=true;
}
...
//inside method endElement
else if(localName.equals("link_site")){
this.in_link=false;
}
Then, I get the content
else if(this.in_link){
xmlparsing.setOrderLink(count, Html.fromHtml(new String(ch, start, length)).toString());
}//I get it and put in a HashMap<Integer,String>
I know that this issue is due to the special characters encoding.
What can I do?
The & makes the parser split the text and call the characters() method several times. You need to concatenate the chunks. Something like this:
SAXParserFactory.newInstance().newSAXParser()
.parse(new File("1.xml"), new DefaultHandler() {
String url;
String element;
@Override
public void startElement(String uri, String localName, String qName,
Attributes attributes) throws SAXException {
element = qName;
url = "";
}
@Override
public void characters(char[] ch, int start, int length) throws SAXException {
if (element.equals("link_site")) {
url += new String(ch, start, length);
}
}
@Override
public void endElement(String uri, String localName, String qName)
throws SAXException {
if (element.equals("link_site")) {
System.out.println(url.trim());
element = "";
}
}
});
prints
http://www.ownhosting.com/webservice_332.asp?id_user=21395&id_parent=33943
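A variant of the same idea that uses a StringBuilder and reads from a string instead of a file might look like the following. It is a sketch, not the answerer's code: the class LinkSiteExample, the helper readLink, and the wrapping <order> element are assumptions, and the ampersand is written as &amp; because a bare & is not well-formed XML.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class LinkSiteExample {

    // Hypothetical helper: returns the trimmed text found inside <link_site>.
    static String readLink(String xml) {
        final StringBuilder link = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inLink = false;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                if (qName.equals("link_site")) {
                    inLink = true;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inLink) {
                    // Chunks before and after the entity are joined here
                    link.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("link_site")) {
                    inLink = false;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return link.toString().trim();
    }

    public static void main(String[] args) {
        String xml = "<order><link_site>\n"
                + "http://www.ownhosting.com/webservice_332.asp?id_user=21395&amp;id_parent=33943\n"
                + "</link_site></order>";
        // prints http://www.ownhosting.com/webservice_332.asp?id_user=21395&id_parent=33943
        System.out.println(readLink(xml));
    }
}
```

The parser already decodes &amp; to &, so no extra HTML decoding step should be needed on the accumulated string.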

My Parser Skips Elements

I'm trying to make an RSS reader that uses an XML from the web. For some reason it only reads the last element.
This is pretty much the XML file:
<rss version="2.0">
<channel>
<item>
<mainTitle>...</mainTitle>
<headline>
<title>...</title>
<description>...</description>
<subTitle>...</subTitle>
<link>...</link>
</headline>
<headline>
<title>...</title>
<description>...</description>
<subTitle>...</subTitle>
<link>...</link>
</headline>
</item>
<item>
<mainTitle>...</mainTitle>
<headline>
<title>...</title>
<description>...</description>
<subTitle>...</subTitle>
<link>...</link>
</headline>
<headline>
<title>...</title>
<description>...</description>
<subTitle>...</subTitle>
<link>...</link>
</headline>
</item>
</channel>
</rss>
This is the parser:
public class RssHandler extends DefaultHandler {
// Feed and Article objects to use for temporary storage
private Article currentArticle = new Article();
private List<Article> articleList = new ArrayList<Article>();
// Number of articles added so far
private int articlesAdded = 0;
// Number of articles to download
private static final int ARTICLES_LIMIT = 20;
// Current characters being accumulated
StringBuffer chars = new StringBuffer();
// Current characters being accumulated
int cap = new StringBuffer().capacity();
// Basic Booleans
private boolean wantedItem = false;
private boolean wantedHeadline = false;
private boolean wantedTitle = false;
public List<Article> getArticleList() {
return articleList;
}
public Article getParsedData() {
return this.currentArticle;
}
public RssHandler() {
this.currentArticle = new Article();
}
public void startElement(String uri, String localName, String qName,
Attributes atts) {
chars = new StringBuffer();
}
public void endElement(String uri, String localName, String qName)
throws SAXException {
if (localName.equalsIgnoreCase("title")) {
currentArticle.setTitle(chars.toString());
} else if (localName.equalsIgnoreCase("subtitle")) {
currentArticle.setDescription(chars.toString());
} else if (localName.equalsIgnoreCase("pubdate")) {
currentArticle.setPubDate(chars.toString());
} else if (localName.equalsIgnoreCase("guid")) {
currentArticle.setGuid(chars.toString());
} else if (localName.equalsIgnoreCase("author")) {
currentArticle.setAuthor(chars.toString());
} else if (localName.equalsIgnoreCase("link")) {
currentArticle.setEncodedContent(chars.toString());
} else if (localName.equalsIgnoreCase("item"))
// Check if looking for article, and if article is complete
if (localName.equalsIgnoreCase("item")) {
articleList.add(currentArticle);
currentArticle = new Article();
// Lets check if we've hit our limit on number of articles
articlesAdded++;
if (articlesAdded >= ARTICLES_LIMIT) {
throw new SAXException();
}
}
chars.setLength(0);
}
public void characters(char ch[], int start, int length) {
chars.append(ch, start, length);
}
}
Whenever I debug the application qName is never a direct child of Item.
It reads rss -> channel -> item -> title -> description ...
I'm clueless. Please help!
1) At the end of the endElement() method, you are not resetting the chars length, i.e.:
public void endElement(String uri, String localName, String qName)
throws SAXException {
//...
//Reset 'chars' length at the end always.
chars.setLength(0);
}
2) Change your characters(...) method like below:
public void characters(char ch[], int start, int length) {
chars.append(ch, start, length);
}
[EDIT
3) Move the initialization of 'chars' from 'startElement' to the constructor, i.e.:
public RssHandler() {
this.currentArticle = new Article();
//Add here..
chars = new StringBuffer();
}
and,
public void startElement(String uri, String localName, String qName,
Attributes atts) {
//Remove below line..
//chars = new StringBuffer();
}
4) Finally, use qName to check matching tags instead of localName, i.e.:
if (qName.equalsIgnoreCase("title")) {
currentArticle.setTitle(chars.toString().trim());
} else if (qName.equalsIgnoreCase("subtitle")) {
currentArticle.setDescription(chars.toString().trim());
} //...
EDIT]
More info @ Using SAXParser in Android
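Point 4 matters because, according to the SAX ContentHandler documentation, localName is the empty string when namespace processing is not being performed, while qName still carries the tag name. The sketch below prints what a parser reports in both modes; the class NameReportingExample and the helper firstElementNames are made up for the illustration, and the exact localName value without namespace awareness can differ between parser implementations.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class NameReportingExample {

    // Hypothetical helper: returns {localName, qName} as reported for the first element.
    static String[] firstElementNames(String xml, boolean namespaceAware) {
        final String[] names = new String[2];
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(namespaceAware);
        try {
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes atts) {
                    if (names[1] == null) { // record only the first element
                        names[0] = localName;
                        names[1] = qName;
                    }
                }
            });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<rss version=\"2.0\"><channel/></rss>";
        for (boolean aware : new boolean[] {false, true}) {
            String[] n = firstElementNames(xml, aware);
            System.out.println("namespaceAware=" + aware
                    + " localName='" + n[0] + "' qName='" + n[1] + "'");
        }
    }
}
```

If a handler has to work in both modes, comparing against qName (or enabling namespace awareness explicitly) avoids depending on what localName contains.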

Quotes Issue when Reading from XML File in JAVA

I'm trying to read from XML and store the data in a text file.
My code works very well in reading and storing the data, EXCEPT when the paragraph from the XML file contains double quotes.
For example:
<Agent> "The famous spy" James Bond </Agent>
The output will ignore any data with quotes, and the result would be: James Bond
I'm using SAX, and here is part of my code that might have the issue:
public void characters(char[] ch, int start, int length) throws SAXException
{
tempVal = new String(ch, start, length);
}
I think I should replace the Quotes before storing the string in my tempVal.
Any ideas???
HERE is the complete code just in case:
public class Entailment {
private String Text;
private String Hypothesis;
private String ID;
private String Entailment;
}
//Event Handlers
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
//reset
tempVal = "";
if(qName.equalsIgnoreCase("pair")) {
//create a new instance of Entailment
tempEntailment = new Entailment();
tempEntailment.setID(attributes.getValue("id"));
tempEntailment.setEntailment(attributes.getValue("entailment"));
}
}
public void characters(char[] ch, int start, int length) throws SAXException {
tempVal = new String(ch, start, length);
}
public void endElement(String uri, String localName, String qName) throws SAXException {
if(qName.equalsIgnoreCase("pair")) {
//add it to the list
Entailments.add(tempEntailment);
}else if (qName.equalsIgnoreCase("t")) {
tempEntailment.setText(tempVal);
}else if (qName.equalsIgnoreCase("h")) {
tempEntailment.setHypothesis(tempVal);
}
}
public static void main(String[] args){
XMLtoTXT spe = new XMLtoTXT();
spe.runExample();
}
Your characters() method is being invoked multiple times because the parser is treating the input as several adjacent text nodes. The way your code is written (which you did not show), you are probably keeping only the last text node.
You need to accumulate the contents of adjacent text nodes yourself.
StringBuilder tempVal = null;
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
//reset
tempVal = new StringBuilder();
....
}
public void characters(char[] ch, int start, int length) throws SAXException {
tempVal.append(ch, start, length);
}
public void endElement(String uri, String localName, String qName) throws SAXException {
String textValue = tempVal.toString();
....
}
}
Interestingly enough I simulated your situation and my SAX parser works fine. I'm using jdk 1.6.0_20, and this is how I create my parser:
// Obtain a new instance of a SAXParserFactory.
SAXParserFactory factory = SAXParserFactory.newInstance();
// Specifies that the parser produced by this code will provide support for XML namespaces.
factory.setNamespaceAware(true);
// Specifies that the parser produced by this code will validate documents as they are parsed.
factory.setValidating(true);
// Creates a new instance of a SAXParser using the currently configured factory parameters.
saxParser = factory.newSAXParser();
My XML header is:
<?xml version="1.0" encoding="iso-8859-1"?>
What about you?

Parsing XML with TagSoup : bug with long attributes?

I'm trying to parse ugly HTML with TagSoup to extract value of a given tag.
Here is the tag :
<input type="hidden" name="hash_check" value="ffc39410ed8da309408a9382450ddc85" />
I want to retrieve value of attribute "value" ("ffc39410ed8da309408a9382450ddc85")
And here is my code, in my SAX handler :
public void startElement(String namespaceURI, String localName, String qName, Attributes atts) throws SAXException
{
if (localName.equals("input"))
{
Log.v(TAG, Integer.toString(atts.getLength()));
if (atts.getValue("name").equals("hash_check"))
{
in_input = true;
Log.v(TAG, atts.getValue("name"));
if (atts.getValue("value") != null)
Log.v(TAG, atts.getValue("value"));
}
}
}
Logs are here for debugging purposes. Logcat correctly gives me "hash_check" for atts.getValue("name"), but an empty string for atts.getValue("value"), although the parser is positioned on the right "input" (the one and only in my HTML document).
What's wrong? A bug in TagSoup?
Thanks
Edit @bkail: thank you for your comment. Here are more details and code.
First, the URL that I'm trying to parse : http://forum.hardware.fr/hfr/Programmation/Divers-6/experts-puissant-internet-sujet_37483_1.htm
And the code used to instantiate the parser:
private static final String FORUM_URI = "http://forum.hardware.fr/hfr/Programmation/Divers-6/experts-puissant-internet-sujet_37483_1.htm";
URL hfrUrl = new URL(FORUM_URI);
Parser parser = new Parser();
HfrSAXHandler sh = new HfrSAXHandler();
parser.setContentHandler(sh);
parser.parse(new InputSource(hfrUrl.openStream()));
And finally, the whole code for my SAX parser :
public class HfrSAXHandler extends DefaultHandler
{
private boolean in_input = false;
private static final String TAG = "hfr4droid";
@Override
public void startDocument() throws SAXException
{
Log.v(TAG, "start of parsing");
}
@Override
public void endDocument() throws SAXException
{
}
@Override
public void startElement(String namespaceURI, String localName, String qName, Attributes atts) throws SAXException
{
if (localName.equals("input"))
{
Log.v(TAG, Integer.toString(atts.getLength()));
if (atts.getValue("name") != null)
{
in_input = true;
Log.v(TAG, atts.getValue("name"));
if (atts.getValue("value") != null)
Log.v(TAG, Integer.toString(atts.getValue("value")));
}
}
}
@Override
public void endElement(String namespaceURI, String localName, String qName) throws SAXException
{
if (localName.equals("input"))
in_input = false;
}
}
Thanks for giving it a try.
Using Integer.toString() is the problem. Change this:
Log.v(TAG, Integer.toString(atts.getValue("value")));
to this:
Log.v(TAG, atts.getValue("value") );
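For reference, reading an attribute in startElement() with null checks can be sketched as below. Note the assumptions: this uses the JDK's own SAX parser on a small well-formed snippet rather than TagSoup on the real page, and the class HiddenInputExample and helper findHiddenValue are invented names. Since the question registers its handler through setContentHandler(), a handler with the same startElement() logic should carry over to the TagSoup setup.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class HiddenInputExample {

    // Hypothetical helper: returns the "value" attribute of the first <input>
    // whose "name" attribute equals wantedName, or null if there is none.
    static String findHiddenValue(String xml, final String wantedName) {
        final String[] result = {null};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // getValue() returns null for a missing attribute, so compare
                // from the known string to avoid a NullPointerException.
                if (result[0] == null
                        && "input".equals(qName)
                        && wantedName.equals(atts.getValue("name"))) {
                    result[0] = atts.getValue("value");
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        String xml = "<form>"
                + "<input type=\"text\" name=\"pseudo\" value=\"\"/>"
                + "<input type=\"hidden\" name=\"hash_check\" value=\"ffc39410ed8da309408a9382450ddc85\"/>"
                + "</form>";
        System.out.println(findHiddenValue(xml, "hash_check")); // prints ffc39410ed8da309408a9382450ddc85
    }
}
```

The attribute value is already a String, so it can be logged or stored directly without any conversion.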
