Parsing xml special chars issue - java

I'm parsing XML returned by a web service using SAX.
One of the fields is a link, like the following:
<link_site>
http://www.ownhosting.com/webservice_332.asp?id_user=21395&id_parent=33943
</link_site>
I have to get this link and save it, but only the last part is saved: id_parent=33943.
Parser snippet:
//inside method startElement
else if(localName.equals("link_site")){
this.in_link=true;
}
...
//inside method endElement
else if(localName.equals("link_site")){
this.in_link=false;
}
Then, I get the content
else if(this.in_link){
xmlparsing.setOrderLink(count, Html.fromHtml(new String(ch, start, length)).toString());
}//I get it and put in a HashMap<Integer,String>
I know that this issue is due to the special characters encoding.
What can I do?

The & (an &amp; entity in the XML) makes the parser split the text and make several calls to the characters() method. You need to concatenate the chunks. Something like this:
SAXParserFactory.newInstance().newSAXParser()
.parse(new File("1.xml"), new DefaultHandler() {
String url;
String element;
@Override
public void startElement(String uri, String localName, String qName,
Attributes attributes) throws SAXException {
element = qName;
url = "";
}
@Override
public void characters(char[] ch, int start, int length) throws SAXException {
if (element.equals("link_site")) {
url += new String(ch, start, length);
}
}
@Override
public void endElement(String uri, String localName, String qName)
throws SAXException {
if (element.equals("link_site")) {
System.out.println(url.trim());
element = "";
}
}
});
prints
http://www.ownhosting.com/webservice_332.asp?id_user=21395&id_parent=33943
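For reference, here is a minimal self-contained sketch of the same idea that parses from a string instead of a file. The class name LinkSiteDemo, the SAMPLE document, and the extractLink helper are made up for illustration, and the sample assumes the ampersand is written as &amp; in the XML, as well-formed XML requires. It buffers every chunk in a StringBuilder and reads the buffer only in endElement.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class LinkSiteDemo {

    // Hypothetical sample document; the real one comes from the web service.
    static final String SAMPLE = "<root><link_site>\n"
            + "http://www.ownhosting.com/webservice_332.asp?id_user=21395&amp;id_parent=33943\n"
            + "</link_site></root>";

    /** Collects the text of <link_site>, however many chunks SAX delivers. */
    static class LinkHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private boolean inLink = false;
        String link; // last completed link, or null if none was seen

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            if ("link_site".equals(qName)) {
                inLink = true;
                text.setLength(0); // start a fresh buffer for this element
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inLink) {
                text.append(ch, start, length); // may run several times per element
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("link_site".equals(qName)) {
                link = text.toString().trim(); // the element is complete only here
                inLink = false;
            }
        }
    }

    static String extractLink(String xml) {
        try {
            LinkHandler handler = new LinkHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.link;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractLink(SAMPLE));
    }
}
```

The parser already decodes &amp; to &, so a call like Html.fromHtml() on top of the SAX text should not be needed for this purpose.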

Related

Java: using SAX to parse XML files, can't get the correct content when encountering &amp;

I have some issues with parsing xml files by sax.
The Java contenthandler code looks like this:
boolean rcontent = false;
@Override
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
if (qName.equalsIgnoreCase("content")) {
rcontent = true;
}
}
@Override
public void characters(char ch[], int start, int length) throws SAXException {
if (rcontent){
System.out.println("content: " + new String(ch, start, length));
rcontent = false;
}
}
Xml file content is like this:
But the output is:
I want to say
which is not complete.
It's likely that characters(...) is being called multiple times for the single <content> block. Try something like
StringBuilder builder;
@Override
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
if (qName.equalsIgnoreCase("content")) {
builder = new StringBuilder();
}
}
@Override
public void characters(char ch[], int start, int length) throws SAXException {
if (builder != null){
builder.append(new String(ch, start, length));
}
}
@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
if (builder != null) {
System.out.println("Content = " + builder);
builder = null;
}
}
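To see the splitting for yourself, a small sketch like the following records each chunk separately. The class name ChunkLogDemo and the sample text are hypothetical, since the original XML was not shown. How many chunks you get depends on the parser, so the sketch prints them rather than assuming a count; only the joined result is guaranteed to be the full text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ChunkLogDemo {

    // Hypothetical content; the entity is a typical place for a parser to split the text.
    static final String SAMPLE = "<doc><content>I want to say &amp; do more</content></doc>";

    /** Records every chunk passed to characters() while inside <content>. */
    static class ChunkHandler extends DefaultHandler {
        final List<String> chunks = new ArrayList<>();
        private boolean inContent = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            if ("content".equalsIgnoreCase(qName)) {
                inContent = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inContent) {
                chunks.add(new String(ch, start, length));
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("content".equalsIgnoreCase(qName)) {
                inContent = false;
            }
        }
    }

    static List<String> chunksOf(String xml) {
        try {
            ChunkHandler handler = new ChunkHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.chunks;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        List<String> chunks = chunksOf(SAMPLE);
        for (String chunk : chunks) {
            System.out.println("chunk: [" + chunk + "]");
        }
        System.out.println("joined: " + String.join("", chunks));
    }
}
```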

How does the SAX parsing method characters() work

I am trying to read data from an XML file and store it in a list using SAX. My problem is storing the element values in the characters() method. For each element I create an object that holds its value plus some extra information for later use, but when I try to store the value, it stores whitespace instead. I tried printing inside the method: although it seems to go through the whole XML file, it prints only a couple of the elements, and not even in the right order. Can someone explain what I am missing?
XML FILE:
<?xml version="1.0" encoding="UTF-8"?>
<CarModel>
<Audi model = "TT" year = "2006" starting_price = "33.000$">
<type>sport</type>
<horse_power>222hp</horse_power>
<drivetrain>quattro</drivetrain>
<transmission>6_Manual</transmission>
</Audi>
<Mercedes model = "W222_S400" year = "2013" starting_price = "63.000$">
<type>luxury</type>
<horse_power>302hp</horse_power>
<drivetrain>front_wheel_drive</drivetrain>
<transmission>7_Automatic</transmission>
</Mercedes>
</CarModel>
JAVA CODE :
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
lvl_cnt++;
System.out.println(lvl_cnt);
obj = new xml_obj();
obj.setLvl(lvl_cnt);
System.out.println("LVL "+obj.getLvl());
if (lvl_cnt == 0) {
obj.setValue(qName);
obj.setParent("root");
System.out.println(obj.getParent());
}
else {
xml_obj tmp = objListofLists.get(objListofLists.size()-1);
int lvl_before = tmp.getLvl();
System.out.println("AAA" + lvl_before);
if (lvl_cnt > lvl_before) {
obj.setParent(tmp.getValue());
}
else if (lvl_cnt < lvl_before) {
int j = 0;
while (objListofLists.get(j).getLvl() != lvl_cnt) j++;
tmp = objListofLists.get(j);
obj.setParent(tmp.getParent());
}
else {
obj.setParent(tmp.getParent());
}
System.out.println(obj.getParent());
}
obj.attributes = attributes;
objListofLists.add(obj);
}
public void endElement(String uri, String localName, String qName) throws SAXException {
lvl_cnt--;
}
public void characters (char ch[], int start, int length) throws SAXException {
String help = new String(ch, start, length);
System.out.println(help);
objListofLists.get(objListofLists.size()-1).setValue(help);
}
}
The characters() method only receives the text inside elements (not attributes), and that includes all the whitespace, tabs, and newlines between the tags anywhere under the root element. Because your code overwrites the stored value on every call, a whitespace-only chunk that follows an element replaces the real text.
If you want to read text that is inside a particular element, you should set a flag when you enter that element (in startElement()), then unset it in endElement(), and inside characters() you test if you are currently in the element from which you wish to extract text.
private boolean inTypeTag = false;
public void startElement(String uri, String localName, String qName, ...) ...{
if (qName.equals("type")) {
inTypeTag = true;
}
...
}
public void endElement(String uri, String localName, String qName, ...) ...{
if (qName.equals("type")) {
inTypeTag = false;
}
...
}
public void characters(char ch[], int start, int length) ... {
if (inTypeTag) {
// do something with the text ("sport") which was found in here
}
...
}
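As a runnable illustration of the flag approach, here is a sketch that collects the text of every <type> element. The class name TypeTagDemo, the typesOf helper, and the trimmed-down sample document are assumptions made for this example. It combines the flag with a buffer, so whitespace between tags is ignored and text split across calls is still joined.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TypeTagDemo {

    // Shortened, well-formed version of the CarModel document (hypothetical sample).
    static final String SAMPLE = "<CarModel>\n"
            + "  <Audi model=\"TT\" year=\"2006\">\n"
            + "    <type>sport</type>\n"
            + "    <horse_power>222hp</horse_power>\n"
            + "  </Audi>\n"
            + "  <Mercedes model=\"W222_S400\" year=\"2013\">\n"
            + "    <type>luxury</type>\n"
            + "  </Mercedes>\n"
            + "</CarModel>";

    static class TypeHandler extends DefaultHandler {
        final List<String> types = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private boolean inTypeTag = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            if ("type".equals(qName)) {
                inTypeTag = true;
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Whitespace between other tags arrives here too; the flag filters it out.
            if (inTypeTag) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("type".equals(qName)) {
                types.add(text.toString().trim());
                inTypeTag = false;
            }
        }
    }

    static List<String> typesOf(String xml) {
        try {
            TypeHandler handler = new TypeHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.types;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(typesOf(SAMPLE)); // prints [sport, luxury]
    }
}
```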

Quotes Issue when Reading from XML File in JAVA

I'm trying to read from XML and store the data in a text file.
My code works very well in reading and storing the data, EXCEPT when the paragraph from the XML file contains double quotes.
For example:
<Agent> "The famous spy" James Bond </Agent>
The output will ignore any data with quotes, and the result would be: James Bond
I'm using SAX, and here is part of my code that might have the issue:
public void characters(char[] ch, int start, int length) throws SAXException
{
tempVal = new String(ch, start, length);
}
I think I should replace the Quotes before storing the string in my tempVal.
Any ideas???
HERE is the complete code just in case:
public class Entailment {
private String Text;
private String Hypothesis;
private String ID;
private String Entailment;
}
//Event Handlers
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
//reset
tempVal = "";
if(qName.equalsIgnoreCase("pair")) {
//create a new instance of Entailment
tempEntailment = new Entailment();
tempEntailment.setID(attributes.getValue("id"));
tempEntailment.setEntailment(attributes.getValue("entailment"));
}
}
public void characters(char[] ch, int start, int length) throws SAXException {
tempVal = new String(ch, start, length);
}
public void endElement(String uri, String localName, String qName) throws SAXException {
if(qName.equalsIgnoreCase("pair")) {
//add it to the list
Entailments.add(tempEntailment);
}else if (qName.equalsIgnoreCase("t")) {
tempEntailment.setText(tempVal);
}else if (qName.equalsIgnoreCase("h")) {
tempEntailment.setHypothesis(tempVal);
}
}
public static void main(String[] args){
XMLtoTXT spe = new XMLtoTXT();
spe.runExample();
}
Your characters() method is being invoked multiple times because the parser is treating the input as several adjacent text nodes. The way your code is written (which you did not show), you are probably keeping only the last text node.
You need to accumulate the contents of adjacent text nodes yourself.
StringBuilder tempVal = null;
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
//reset
tempVal = new StringBuilder();
....
}
public void characters(char[] ch, int start, int length) throws SAXException {
tempVal.append(ch, start, length);
}
public void endElement(String uri, String localName, String qName) throws SAXException {
String textValue = tempVal.toString();
....
}
}
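Filled in for the t and h elements from the question, the outline above might look like the sketch below. The class name EntailmentTextDemo, the textsOf helper, and the sample pair (including its attribute values and the use of &quot; entities) are hypothetical; the real input may differ.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EntailmentTextDemo {

    // Hypothetical sample; the quotes are written as entities here.
    static final String SAMPLE = "<pair id=\"1\" entailment=\"YES\">"
            + "<t> &quot;The famous spy&quot; James Bond </t>"
            + "<h>James Bond is a spy</h>"
            + "</pair>";

    static class PairHandler extends DefaultHandler {
        final Map<String, String> texts = new LinkedHashMap<>();
        private StringBuilder tempVal = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            tempVal = new StringBuilder(); // reset for the element that just opened
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            tempVal.append(ch, start, length); // accumulate adjacent text chunks
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("t".equalsIgnoreCase(qName) || "h".equalsIgnoreCase(qName)) {
                texts.put(qName, tempVal.toString().trim());
            }
        }
    }

    static Map<String, String> textsOf(String xml) {
        try {
            PairHandler handler = new PairHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.texts;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        textsOf(SAMPLE).forEach((name, text) -> System.out.println(name + " = " + text));
    }
}
```

Resetting the buffer in startElement is fine for leaf elements such as t and h. For mixed content (text interleaved with child elements) a stack of buffers would be needed instead.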
Interestingly enough I simulated your situation and my SAX parser works fine. I'm using jdk 1.6.0_20, and this is how I create my parser:
// Obtain a new instance of a SAXParserFactory.
SAXParserFactory factory = SAXParserFactory.newInstance();
// Specifies that the parser produced by this code will provide support for XML namespaces.
factory.setNamespaceAware(true);
// Specifies that the parser produced by this code will validate documents as they are parsed.
factory.setValidating(true);
// Creates a new instance of a SAXParser using the currently configured factory parameters.
saxParser = factory.newSAXParser();
My XML header is:
<?xml version="1.0" encoding="iso-8859-1"?>
What about you?

problem with using SAX XML Parser

I am using the SAX Parser for XML Parsing. The problem is for the following XML code:
<description>
Designer:Paul Smith Color:Plain Black Fabric/Composition:100% cotton Weave/Pattern:pinpoint Sleeve:Long-sleeved Fit:Classic Front style:Placket front Back style:Side pleat back Collar:Classic/straight collar Button:Pearlescent front button Pocket:rounded chest pocket Hem:Rounded hem
</description>
I get this:
Designer:Paul Smith
Color:Plain Black
The other parts are missing. The same thing happens for a few other lines. Can anyone tell me what's wrong with my approach?
My code is given below:
Parser code:
try {
/** Handling XML */
SAXParserFactory spf = SAXParserFactory.newInstance();
SAXParser sp = spf.newSAXParser();
XMLReader xr = sp.getXMLReader();
/** Send URL to parse XML Tags */
URL sourceUrl = new URL(
"http://50.19.125.224/Demo/VeryGoodSex_and_the_City_S6E6.xml");
/** Create handler to handle XML Tags ( extends DefaultHandler ) */
MyXMLHandler myXMLHandler = new MyXMLHandler();
xr.setContentHandler((ContentHandler) myXMLHandler);
xr.parse(new InputSource(sourceUrl.openStream()));
} catch (Exception e) {
System.out.println("XML Parsing Exception = " + e);
}
Object to hold XML parsed Info:
public class ParserObject {
String name=null;
String description=null;
String bitly=null; //single
String productLink=null;//single
String productPrice=null;//single
Vector<String> price=new Vector<String>();
}
Handler class:
public void endElement(String uri, String localName, String qName)
throws SAXException {
currentElement = false;
if (qName.equalsIgnoreCase("title"))
{
xmlDataObject[index].name=currentValue;
}
else if (qName.equalsIgnoreCase("artist"))
{
xmlDataObject[index].artist=currentValue;
}
}
public void startElement(String uri, String localName, String qName,
Attributes attributes) throws SAXException {
currentElement = true;
if (qName.equalsIgnoreCase("allinfo"))
{
System.out.println("started");
}
else if (qName.equalsIgnoreCase("tags"))
{
insideTag=1;
}
}
public void characters(char[] ch, int start, int length)
throws SAXException {
if (currentElement) {
currentValue = new String(ch, start, length);
currentElement = false;
}
}
You have to concatenate characters which the parser gives to you until it calls endElement.
Try removing currentElement = false; from the characters() handler, and accumulate the text instead:
currentValue = currentValue + new String(ch, start, length);
Initialize currentValue with an empty string or handle null value in the expression above.
I think characters() delivers some, but not necessarily all, of an element's text in a single call.
Thus, you only get the first "chunk".
Try printing each chunk on a separate line as a debugging aid (before the if).

Parsing XML with TagSoup : bug with long attributes?

I'm trying to parse ugly HTML with TagSoup to extract value of a given tag.
Here is the tag :
<input type="hidden" name="hash_check" value="ffc39410ed8da309408a9382450ddc85" />
I want to retrieve value of attribute "value" ("ffc39410ed8da309408a9382450ddc85")
And here is my code, in my SAX handler :
public void startElement(String namespaceURI, String localName, String qName, Attributes atts) throws SAXException
{
if (localName.equals("input"))
{
Log.v(TAG, Integer.toString(atts.getLength()));
if (atts.getValue("name").equals("hash_check"))
{
in_input = true;
Log.v(TAG, atts.getValue("name"));
if (atts.getValue("value") != null)
Log.v(TAG, atts.getValue("value"));
}
}
}
The logs are there for debugging purposes. Logcat correctly gives me "hash_check" for atts.getValue("name"), but an empty string for atts.getValue("value"), although the parser is positioned on the right "input" (the one and only in my HTML document).
What's wrong? Is it a bug in TagSoup?
Thanks
Edit @bkail: thank you for your comment. Here are more details and code.
First, the URL that I'm trying to parse : http://forum.hardware.fr/hfr/Programmation/Divers-6/experts-puissant-internet-sujet_37483_1.htm
And the code used to instantiate the parser:
private static final String FORUM_URI = "http://forum.hardware.fr/hfr/Programmation/Divers-6/experts-puissant-internet-sujet_37483_1.htm";
URL hfrUrl = new URL(FORUM_URI);
Parser parser = new Parser();
HfrSAXHandler sh = new HfrSAXHandler();
parser.setContentHandler(sh);
parser.parse(new InputSource(hfrUrl.openStream()));
And finally, the whole code for my SAX parser :
public class HfrSAXHandler extends DefaultHandler
{
private boolean in_input = false;
private static final String TAG = "hfr4droid";
@Override
public void startDocument() throws SAXException
{
Log.v(TAG, "start of parsing");
}
@Override
public void endDocument() throws SAXException
{
}
@Override
public void startElement(String namespaceURI, String localName, String qName, Attributes atts) throws SAXException
{
if (localName.equals("input"))
{
Log.v(TAG, Integer.toString(atts.getLength()));
if (atts.getValue("name") != null)
{
in_input = true;
Log.v(TAG, atts.getValue("name"));
if (atts.getValue("value") != null)
Log.v(TAG, Integer.toString(atts.getValue("value")));
}
}
}
@Override
public void endElement(String namespaceURI, String localName, String qName) throws SAXException
{
if (localName.equals("input"))
in_input = false;
}
}
Thanks for giving it a try.
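Using Integer.toString() is the problem: atts.getValue("value") already returns a String, and Integer.toString() has no overload that takes a String. Change this: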
Log.v(TAG, Integer.toString(atts.getValue("value")));
to this:
Log.v(TAG, atts.getValue("value") );
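A small sketch of reading the attribute as a plain String follows. Note the assumptions: it uses the JDK's built-in SAX parser on a well-formed snippet rather than TagSoup, it matches on qName because the default JDK parser is not namespace-aware, and the class name AttributeValueDemo and the hashCheckOf helper are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AttributeValueDemo {

    // Well-formed stand-in for the HTML page (hypothetical wrapper element).
    static final String SAMPLE = "<form><input type=\"hidden\" name=\"hash_check\" "
            + "value=\"ffc39410ed8da309408a9382450ddc85\" /></form>";

    static class InputHandler extends DefaultHandler {
        String hashCheck; // stays null if no matching input is found

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Constant-first equals avoids a NullPointerException when "name" is absent.
            if ("input".equals(qName) && "hash_check".equals(atts.getValue("name"))) {
                hashCheck = atts.getValue("value"); // already a String, no conversion needed
            }
        }
    }

    static String hashCheckOf(String xml) {
        try {
            InputHandler handler = new InputHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.hashCheck;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hashCheckOf(SAMPLE));
    }
}
```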
