problem with using SAX XML Parser - java

I am using the SAX Parser for XML Parsing. The problem is for the following XML code:
<description>
Designer:Paul Smith Color:Plain Black Fabric/Composition:100% cotton Weave/Pattern:pinpoint Sleeve:Long-sleeved Fit:Classic Front style:Placket front Back style:Side pleat back Collar:Classic/straight collar Button:Pearlescent front button Pocket:rounded chest pocket Hem:Rounded hem
</description>
I get this:
Designer:Paul Smith
Color:Plain Black
The other parts are missing. The same thing happens for a few other lines. Can anyone kindly tell me what's the problem with my approach?
My code is given below:
Parser code:
try {
/** Handling XML */
SAXParserFactory spf = SAXParserFactory.newInstance();
SAXParser sp = spf.newSAXParser();
XMLReader xr = sp.getXMLReader();
/** Send URL to parse XML Tags */
URL sourceUrl = new URL(
"http://50.19.125.224/Demo/VeryGoodSex_and_the_City_S6E6.xml");
/** Create handler to handle XML Tags ( extends DefaultHandler ) */
MyXMLHandler myXMLHandler = new MyXMLHandler();
xr.setContentHandler((ContentHandler) myXMLHandler);
xr.parse(new InputSource(sourceUrl.openStream()));
} catch (Exception e) {
System.out.println("XML Parsing Exception = " + e);
}
Object to hold XML parsed Info:
public class ParserObject {
String name=null;
String description=null;
String bitly=null; //single
String productLink=null;//single
String productPrice=null;//single
Vector<String> price=new Vector<String>();
}
Handler class:
public void endElement(String uri, String localName, String qName)
throws SAXException {
currentElement = false;
if (qName.equalsIgnoreCase("title"))
{
xmlDataObject[index].name=currentValue;
}
else if (qName.equalsIgnoreCase("artist"))
{
xmlDataObject[index].artist=currentValue;
}
}
public void startElement(String uri, String localName, String qName,
Attributes attributes) throws SAXException {
currentElement = true;
if (qName.equalsIgnoreCase("allinfo"))
{
System.out.println("started");
}
else if (qName.equalsIgnoreCase("tags"))
{
insideTag=1;
}
}
public void characters(char[] ch, int start, int length)
throws SAXException {
if (currentElement) {
currentValue = new String(ch, start, length);
currentElement = false;
}
}

You have to concatenate the characters the parser gives you until it calls endElement.
Try removing currentElement = false; from the characters handler, and use
currentValue = currentValue + new String(ch, start, length);
Initialize currentValue with an empty string (and reset it in startElement so text from one element does not leak into the next), or handle the null value in the expression above.
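The advice above can be sketched as a small, self-contained handler. This is only an illustration under stated assumptions: the class names TextCollectorDemo and TextCollector and the sample XML are made up for this example and are not part of the question's code. It uses a StringBuilder instead of string concatenation, resets it in startElement, and reads the text only in endElement.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original question's code.
class TextCollectorDemo {

    /** Stores the text of each element keyed by qName (intended for leaf elements). */
    static class TextCollector extends DefaultHandler {
        final Map<String, String> values = new HashMap<>();
        private final StringBuilder buffer = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            buffer.setLength(0); // start with an empty buffer for every element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            buffer.append(ch, start, length); // may be called several times per element
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            values.put(qName, buffer.toString().trim()); // the text is complete only here
            buffer.setLength(0);
        }
    }

    /** Parses the given XML string and returns the collected element texts. */
    static Map<String, String> parse(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            TextCollector handler = new TextCollector();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.values;
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<item><description>\nDesigner:Paul Smith Color:Plain Black Hem:Rounded hem\n</description></item>";
        System.out.println(parse(xml).get("description"));
    }
}
```

Because the buffer is read only in endElement, it does not matter how many chunks the parser chooses to deliver.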

I think characters() delivers some, but not necessarily all, of an element's text in a single call.
Thus, you only get the first "chunk".
To debug, try printing each character chunk on a separate line (before the if).

Related

Parsing a big xml file Java

I have big xml files (~1GB) with this structure:
<?xml version="1.0" encoding="UTF-8"?>
<GenoExchange xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.ncbi.nlm.nih.gov/SNP/geno" xsi:schemaLocation="http://www.ncbi.nlm.nih.gov/SNP/geno ftp://ftp.ncbi.nlm.nih.gov/snp/specs/genoex_1_5.xsd" dbSNPBuildNo="146" reportId="MT" reportType="chromosome">
<Population popId="638" handle="TSC-CSHL" locPopId="TSC_42_AA">
<popClass self="NORTH AMERICA"/>
</Population>
<SnpInfo rsId="1041870" observed="C/T">
<SnpLoc genomicAssembly="107:GRCh38.p2" geneId="4512" geneSymbol="COX1" chrom="MT" start="6150" locType="2" rsOrientToChrom="fwd" contigAllele="T" contig="NC_012920:1"/>
<SsInfo ssId="1508548" locSnpId="TSC0349089" ssOrientToRs="fwd">
<ByPop popId="1303" sampleSize="184">
<AlleleFreq allele="T" freq="1"/>
<AlleleFreq allele="C" freq="0"/>
</ByPop>
</SsInfo>
</SnpInfo>
<SnpInfo rsId="1029293" observed="C/T">
<SnpLoc genomicAssembly="107:GRCh38.p2" geneId="4512" geneSymbol="COX1" chrom="MT" start="6307" locType="2" rsOrientToChrom="fwd" contigAllele="C" contig="NC_012920:1"/>
<SsInfo ssId="1494519" locSnpId="TSC0254145" ssOrientToRs="fwd">
<ByPop popId="639" sampleSize="82">
<AlleleFreq allele="T" freq="0"/>
<AlleleFreq allele="C" freq="1"/>
</ByPop>
<ByPop popId="1303" sampleSize="184">
<AlleleFreq allele="T" freq="0"/>
<AlleleFreq allele="C" freq="1"/>
</ByPop>
</SsInfo>
</SnpInfo>
I want to find a specific rsId, for example rsId="1029293", and extract all the information inside that node. I don't want to process the whole file: I only want to find that ID, extract its information and stop.
From what I read, it's better to use a SAX or StAX parser. I'm using SAX; this is my code:
class UserHandler extends DefaultHandler {
String rsID = null;
String i = "1029293";
@Override
public void startElement(String uri,
String localName, String qName, Attributes attributes) throws SAXException {
if (qName.equalsIgnoreCase("SnpInfo")) {
rsID = attributes.getValue("rsId");
//System.out.println("value: " + rsID);
}
if((i).equals(rsID) &&
qName.equalsIgnoreCase("SnpInfo")){
System.out.println("Start Element: " + qName + " " + rsID);
}
if ((i).equals(rsID) && qName.equalsIgnoreCase("SsInfo")) {
String a = attributes.getValue("ssId");
System.out.println("SSID: " + a);
}
if ((i).equals(rsID) && qName.equalsIgnoreCase("ByPop")) {
String p = attributes.getValue("popId");
System.out.println("POPID: " + p);
}
if ((i).equals(rsID) && qName.equalsIgnoreCase("AlleleFreq")) {
String p = attributes.getValue("allele");
String f = attributes.getValue("freq");
System.out.println("ALLELE: " + p + " FREQ: " + f);
}
if ((i).equals(rsID) && qName.equalsIgnoreCase("GTypeFreq")) {
String p = attributes.getValue("gtype");
String f = attributes.getValue("freq");
System.out.println("GTYPE: " + p + " FREQ: " + f);
}
}
@Override
public void endElement(String uri,
String localName, String qName) throws SAXException {
if (qName.equalsIgnoreCase("SnpInfo")) {
if((i).equals(rsID)
&& qName.equalsIgnoreCase("SnpInfo"))
System.out.println("End Element: " + qName);
}
}
}
public class XMLParser {
public static void main(String argv[]) {
try {
InputStream fileStream = new FileInputStream("/home/xml/gt_chr10.xml.gz");
InputStream gzipStream = new GZIPInputStream(fileStream);
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
UserHandler userhandler = new UserHandler();
saxParser.parse(gzipStream, userhandler);
} catch (Exception e) {
e.printStackTrace();
}
}
My problem is that my code searches the whole file for the ID, which takes more than 2 minutes each time. I can't afford code that takes so long.
Is there a better approach for this?
Using StAX gives you more control when parsing XML, since you actively pull events from the stream. This way you can pull the next event, handle it, and once you have found your data, simply terminate the loop (using a flag, or even a return statement if you must).
InputStream in = ...
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLEventReader eventReader = factory.createXMLEventReader(in);
boolean found = false;
while (!found && eventReader.hasNext()) {
XMLEvent event = eventReader.nextEvent();
switch (event.getEventType()) {
case XMLStreamConstants.START_ELEMENT:
// your logic here
// once you found your element, you can terminate the loop
found = true;
break;
case XMLStreamConstants.END_ELEMENT:
// your logic here
break;
}
}
(omitted exception and resource handling for brevity)
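A slightly fuller sketch of the same idea, applied to the SnpInfo structure from the question, might look like this. It is an illustration only: the class name SnpLookupDemo, the method findAlleleFreqs and the "allele=freq" output format are assumptions made for this example. It reads events until the wanted SnpInfo has been closed and then leaves the loop.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical demo class; the names here are not from the original post.
class SnpLookupDemo {

    /** Returns "allele=freq" entries of the first SnpInfo with the given rsId, then stops reading. */
    static List<String> findAlleleFreqs(Reader in, String wantedRsId) {
        List<String> result = new ArrayList<>();
        try {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
            try {
                boolean inside = false; // true while we are within the wanted SnpInfo
                while (reader.hasNext()) {
                    XMLEvent event = reader.nextEvent();
                    if (event.isStartElement()) {
                        StartElement el = event.asStartElement();
                        String name = el.getName().getLocalPart();
                        if (name.equals("SnpInfo")) {
                            inside = wantedRsId.equals(attr(el, "rsId"));
                        } else if (inside && name.equals("AlleleFreq")) {
                            result.add(attr(el, "allele") + "=" + attr(el, "freq"));
                        }
                    } else if (inside && event.isEndElement()
                            && event.asEndElement().getName().getLocalPart().equals("SnpInfo")) {
                        break; // wanted element fully read: stop pulling events
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML reading failed", e);
        }
        return result;
    }

    /** Value of an unprefixed attribute, or null if it is absent. */
    private static String attr(StartElement el, String name) {
        Attribute a = el.getAttributeByName(new QName(name));
        return a == null ? null : a.getValue();
    }

    public static void main(String[] args) {
        String xml = "<GenoExchange xmlns=\"http://www.ncbi.nlm.nih.gov/SNP/geno\">"
                + "<SnpInfo rsId=\"1029293\"><SsInfo ssId=\"1494519\"><ByPop popId=\"639\">"
                + "<AlleleFreq allele=\"T\" freq=\"0\"/><AlleleFreq allele=\"C\" freq=\"1\"/>"
                + "</ByPop></SsInfo></SnpInfo></GenoExchange>";
        System.out.println(findAlleleFreqs(new StringReader(xml), "1029293"));
    }
}
```

For the gzipped file in the question, the Reader could presumably be replaced by the GZIPInputStream already used there, since XMLInputFactory also accepts an InputStream. Everything before the wanted element still has to be read, so the saving depends on where the element sits in the file.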
On a side note, you will gain some performance by combining your if ((i).equals(rsID) && ... checks into a single one, with the detail checks in nested ifs:
if ((i).equals(rsID)) {
if(qName.equalsIgnoreCase("GTypeFreq")) {
...
}
}
You can throw an exception in your endElement handler to make the parser abort parsing (http://www.ibm.com/developerworks/library/x-tipsaxstop/):
@Override
public void endElement(String uri,
String localName, String qName) throws SAXException {
if (qName.equalsIgnoreCase("SnpInfo") && i.equals(rsID)) {
System.out.println("End Element: " + qName);
// Abort parsing; the caller must catch this exception and treat it as "found".
throw new SAXException("Element found.");
}
}
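Here is a minimal sketch of how the caller side of this technique could look. The names SaxEarlyStopDemo, StopParsingException and the counters are hypothetical and exist only for this illustration. The handler throws a dedicated SAXException subclass once the wanted SnpInfo has ended, and the caller treats that as the normal "found" outcome rather than as an error.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the names here are not from the original post.
class SaxEarlyStopDemo {

    /** Marker exception used only to stop parsing once the target was handled. */
    static class StopParsingException extends SAXException {
        StopParsingException() {
            super("target element processed");
        }
    }

    static class Handler extends DefaultHandler {
        private final String wantedRsId;
        private boolean inside;          // true while within the wanted SnpInfo
        boolean found;                   // set just before parsing is aborted
        int snpInfoSeen;                 // how many SnpInfo elements were started
        final List<String> ssIds = new ArrayList<>();

        Handler(String wantedRsId) {
            this.wantedRsId = wantedRsId;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            if (qName.equals("SnpInfo")) {
                snpInfoSeen++;
                inside = wantedRsId.equals(attributes.getValue("rsId"));
            } else if (inside && qName.equals("SsInfo")) {
                ssIds.add(attributes.getValue("ssId"));
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (inside && qName.equals("SnpInfo")) {
                found = true;
                throw new StopParsingException(); // abort: nothing after this point is parsed
            }
        }
    }

    /** Parses until the wanted SnpInfo has ended (or the document ends) and returns the handler. */
    static Handler search(String xml, String rsId) {
        Handler handler = new Handler(rsId);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            if (!handler.found) {
                throw new IllegalStateException("XML parsing failed", e); // a real parse error
            }
            // otherwise: expected, the handler stopped the parser on purpose
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        String xml = "<GenoExchange><SnpInfo rsId=\"1\"/><SnpInfo rsId=\"2\"><SsInfo ssId=\"9\"/></SnpInfo>"
                + "<SnpInfo rsId=\"3\"/></GenoExchange>";
        Handler h = search(xml, "2");
        System.out.println(h.ssIds + " after " + h.snpInfoSeen + " SnpInfo elements");
    }
}
```

Checking the found flag rather than the exception type keeps the caller working even if a parser implementation were to wrap the exception.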
The only way to avoid parsing the whole file every time you run this is to put the data in an XML database. Parsing a 1 GB file is going to take about a minute, plus or minus depending on the speed of your machine and what processing you do on each node.
A streamed XSLT 3.0 solution is simply:
<xsl:transform version="3.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xpath-default-namespace="http://www.ncbi.nlm.nih.gov/SNP/geno">
<xsl:template name="xsl:initial-template">
<xsl:stream href="input.xml">
<xsl:copy-of select="/GenoExchange/SnpInfo[@rsId='1041870'][1]"/>
</xsl:stream>
</xsl:template>
</xsl:transform>
No need to write all that pesky SAX or StAX code.
I put the "[1]" predicate in to allow the processor to abandon the search when it has found the first hit.
The best approach is to use vtd-xml and XPath. A 1 GB XML file takes about 1.5 GB of heap space and less than 10 seconds on a 3 to 4 year old Intel processor; see the code example below. One more thing: if you want to eliminate parsing entirely, you can create a VTD+XML index file so that any subsequent query can directly access the VTD index portion, which could easily triple or quadruple your app's performance.
import com.ximpleware.*;
public class simpleXpathSearch{
public static void main(String s[]) throws VTDException,java.io.UnsupportedEncodingException,java.io.IOException{
VTDGen vg = new VTDGen();
vg.setLCLevel(5);
if (!vg.parseFile("input.xml", false))
return;
VTDNav vn = vg.getNav();
AutoPilot ap = new AutoPilot(vn);
ap.selectXPath("/*/*[@rsId='1029293']");
int i=0;
while((i=ap.evalXPath())!=-1){
// your code logic here
}
}
}
//Main class
public static void main(String[] args) {
SAXReader.read();
}
//SAXReader
public static void read(){
try {
XMLReader processor = XMLReaderFactory.createXMLReader();
processor.setContentHandler(new SAXController());
processor.parse(new InputSource("MyXML.xml"));
} catch (SAXException | IOException e) {
System.err.println(e.getMessage());
}
}
//SAXController
// The SAXController extends DefaultHandler
private int tab = 0;
private void tabulation() {
for (int i=0; i<tab; i++)
System.out.print(" ");
}
@Override
public void startDocument() {
tabulation();
System.out.println("Starting XML Document");
tab++;
}
@Override
public void endDocument() {
tab--;
tabulation();
System.out.println("Ending XML Document");
}
@Override
public void startElement(String uri, String localName, String qName, Attributes attributes)
throws SAXException {
tabulation();
System.out.print(localName);
if (attributes.getLength()>0) {
for (int i=0; i<attributes.getLength(); i++) {
System.out.print(attributes.getLocalName(i)+": "+attributes.getValue(i));
}
}
System.out.println();
tab++;
}
@Override
public void endElement(String uri, String localName, String qName)
throws SAXException {
tab--;
tabulation();
System.out.println(localName);
}
@Override
public void characters(char[] ch, int start, int length)
throws SAXException {
String content= new String(ch, start, length);
content= content.replaceAll("[\t\n]", "").trim();
if (!content.equals("")) {
tabulation();
System.out.println(content);
}
}

java parser sax doesn't get value & on my field

Several elements in my XML file contain &amp; and other HTML character entities such as &gt;.
I tested my code, but it returns only the first part of the field. For example:
SERIES &amp; FILMS
gives only the word SERIES.
Another example:
C&gt;SUDO
gives only C.
My code is below; the field name is "summary":
try {
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
chars = new StringBuffer();
DefaultHandler handler = new DefaultHandler() {
public void startElement(String uri, String localName,
String qName, Attributes attributes)
throws SAXException {
System.out.println("Start Element :" + qName);
if (qName.equals(SUMMARY2)) {
bfSummary = true;
}
if (qName.equals(SERVICE_DATA)) {
idServiceData = attributes.getValue("id");
bfServicedata = true;
}
}
public void endElement(String uri, String localName,
String qName) throws SAXException {
System.out.println("End Element :" + qName + ""
+ mListBaseLineByEpgId.size());
// maliste.put(listeId, summary);
malisteParThem.add(summary);
if (mListBaseLineByEpgId.get(idServiceData) != null) {
List<String> listeModif = mListBaseLineByEpgId
.get(idServiceData);
for (String chaine : malisteParThem) {
listeModif.add(chaine);
}
mListBaseLineByEpgId.replace(idServiceData, listeModif);
} else {
mListBaseLineByEpgId.put(idServiceData, malisteParThem);
}
malisteParThem = new ArrayList<String>();
}
public void characters(char ch[], int start, int length)
throws SAXException {
if (bfSummary) {
summary = new String(ch, start, length);
summary = summary.replace(BEFORETILESUMMARY, "");
// chars.append(summary);
// summary=chars.toString();
summary = removeHtmlFrom(summary);
System.out.println("Summary : " + summary);
bfSummary = false;
}
if (bfServicedata) {
System.out.println("listeId : " + idServiceData);
bfServicedata = false;
}
}
};
File file = new File(cheminFichier);
InputStream inputStream = new FileInputStream(file);
Reader reader = new InputStreamReader(inputStream);
InputSource is = new InputSource(reader);
//is.setEncoding("ISO-8859-1");
saxParser.parse(is, handler);
} catch (Exception e) {
e.printStackTrace();
}
Thank you.
This problem is probably related to a behavior of SAX parsers that often surprises people: a parser is allowed (per the spec) to split the text of an element and call characters() multiple times for the same element. In practice, entity references such as &amp; are often where such a split occurs.
What you need is a StringBuffer or StringBuilder instance variable: initialize it in startElement(), append to it in characters(), and read the full text in endElement().
See this question for more info: JAVA SAX parser split calls to characters()
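To see the splitting for yourself, a small sketch like the following records every chunk passed to characters() next to the accumulated text. The class name ChunkLoggingDemo and the sample document are assumptions made for this illustration; how many chunks you get depends on the parser implementation, so only the joined result should be relied on.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original question's code.
class ChunkLoggingDemo {

    static class SummaryHandler extends DefaultHandler {
        final List<String> chunks = new ArrayList<>(); // every piece the parser delivered
        private StringBuilder text;                    // non-null only inside <summary>
        String summary;                                // complete text, set in endElement

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            if (qName.equals("summary")) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (text != null) {
                chunks.add(new String(ch, start, length));
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("summary")) {
                summary = text.toString();
                text = null;
            }
        }
    }

    /** Parses the XML string and returns the handler with its recorded chunks and summary. */
    static SummaryHandler parse(String xml) {
        SummaryHandler handler = new SummaryHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        SummaryHandler h = parse("<service><summary>SERIES &amp; FILMS</summary></service>");
        System.out.println("chunks: " + h.chunks);
        System.out.println("summary: " + h.summary);
    }
}
```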

Android: get HTML text from XML

I have implemented XML reading in my app and it works fine, but I want to format the text. I've tried this in the XML:
<monumento>
<horarios><b>L-V:</b> 10 a 20<br/>S-D: 11 a 15</horarios>
<tarifas>4000</tarifas>
</monumento>
But when I put HTML tags in the XML, the text is not displayed in my app at all.
I'll have many XML files, so I won't always know where <b>, <br/>, etc. will be placed.
How can I do this?
Main
StringBuilder builder = new StringBuilder();
for (HorariosTarifasObj post : helper.posts) {
builder.append(post.getHorarios());
}
horario2.setText(builder.toString());
builder = new StringBuilder();
for (HorariosTarifasObj post : helper.posts) {
builder.append(post.getTarifas());
}
tarifa2.setText(builder.toString());
XMLReader
public void get() {
try {
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
XMLReader reader = parser.getXMLReader();
reader.setContentHandler(this);
InputStream inputStream = new URL(URL + monumento + ".xml").openStream();
reader.parse(new InputSource(inputStream));
} catch (Exception e) {
}
}
@Override
public void startElement(String uri, String localName, String qName,
Attributes attributes) throws SAXException {
currTag = true;
currTagVal = "";
if (localName.equals("monumento")) {
post = new HorariosTarifasObj();
}
}
@Override
public void endElement(String uri, String localName, String qName)
throws SAXException {
currTag = false;
if(localName.equalsIgnoreCase("horarios")) {
post.setHorarios(currTagVal);
} else if(localName.equalsIgnoreCase("tarifas")) {
post.setTarifas(currTagVal);
} else if (localName.equalsIgnoreCase("monumento")) {
posts.add(post);
}
}
@Override
public void characters(char[] ch, int start, int length)
throws SAXException {
if (currTag) {
currTagVal = currTagVal + new String(ch, start, length);
currTag = false;
}
}
Try CDATA:
<monumento>
<horarios><![CDATA[<b>L-V:</b> 10 a 20<br/>S-D: 11 a 15]]></horarios>
<tarifas>4000</tarifas>
</monumento>
In XML, the < and > characters are reserved for XML tags. You will need to protect them by replacing them with their entity references.
You can use &gt; for > and &lt; for <
(edit) Eomm's answer is right: CDATA does this as well, and is simpler.
Also, to render HTML markup in a TextView, you will need to use the Html.fromHtml() method.
For instance:
tarifa2.setText(Html.fromHtml(builder.toString()));
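As a rough check that both suggestions hand the markup to the handler as plain text, here is a sketch that runs outside Android. The class name CdataDemo and the helper horarios are hypothetical, and Html.fromHtml() is not exercised here. Note that it accumulates every chunk in characters(); the question's handler sets currTag = false after the first chunk, which could truncate longer values.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original question's code.
class CdataDemo {

    /** Returns the complete text content of the first <horarios> element, or null if absent. */
    static String horarios(String xml) {
        final String[] result = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only inside <horarios>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                if (qName.equals("horarios") && result[0] == null) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length); // CDATA and unescaped entities both arrive here
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("horarios") && text != null) {
                    result[0] = text.toString();
                    text = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        String xml = "<monumento><horarios><![CDATA[<b>L-V:</b> 10 a 20<br/>S-D: 11 a 15]]></horarios>"
                + "<tarifas>4000</tarifas></monumento>";
        System.out.println(horarios(xml)); // this string is what would be passed to Html.fromHtml()
    }
}
```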

Quotes Issue when Reading from XML File in JAVA

I'm trying to read from XML and store the data in a text file.
My code works very well in reading and storing the data, EXCEPT when the paragraph from the XML file contains double quotes.
For example:
<Agent> "The famous spy" James Bond </Agent>
The output will ignore any data with quotes, and the result would be: James Bond
I'm using SAX, and here is part of my code that might have the issue:
public void characters(char[] ch, int start, int length) throws SAXException
{
tempVal = new String(ch, start, length);
}
I think I should replace the Quotes before storing the string in my tempVal.
Any ideas???
HERE is the complete code just in case:
public class Entailment {
private String Text;
private String Hypothesis;
private String ID;
private String Entailment;
}
//Event Handlers
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
//reset
tempVal = "";
if(qName.equalsIgnoreCase("pair")) {
//create a new instance of Entailment
tempEntailment = new Entailment();
tempEntailment.setID(attributes.getValue("id"));
tempEntailment.setEntailment(attributes.getValue("entailment"));
}
}
public void characters(char[] ch, int start, int length) throws SAXException {
tempVal = new String(ch, start, length);
}
public void endElement(String uri, String localName, String qName) throws SAXException {
if(qName.equalsIgnoreCase("pair")) {
//add it to the list
Entailments.add(tempEntailment);
}else if (qName.equalsIgnoreCase("t")) {
tempEntailment.setText(tempVal);
}else if (qName.equalsIgnoreCase("h")) {
tempEntailment.setHypothesis(tempVal);
}
}
public static void main(String[] args){
XMLtoTXT spe = new XMLtoTXT();
spe.runExample();
}
Your characters() method is being invoked multiple times because the parser is treating the input as several adjacent text nodes. The way your code is written, you are probably keeping only the last text node.
You need to accumulate the contents of adjacent text nodes yourself.
StringBuilder tempVal = null;
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
//reset
tempVal = new StringBuilder();
....
}
public void characters(char[] ch, int start, int length) throws SAXException {
tempVal.append(ch, start, length);
}
public void endElement(String uri, String localName, String qName) throws SAXException {
String textValue = tempVal.toString();
....
}
}
Interestingly enough I simulated your situation and my SAX parser works fine. I'm using jdk 1.6.0_20, and this is how I create my parser:
// Obtain a new instance of a SAXParserFactory.
SAXParserFactory factory = SAXParserFactory.newInstance();
// Specifies that the parser produced by this code will provide support for XML namespaces.
factory.setNamespaceAware(true);
// Specifies that the parser produced by this code will validate documents as they are parsed.
factory.setValidating(true);
// Creates a new instance of a SAXParser using the currently configured factory parameters.
saxParser = factory.newSAXParser();
My XML header is:
<?xml version="1.0" encoding="iso-8859-1"?>
What about you?

Read multiple lines from xml

I am trying to fetch data from an XML file in Java using a SAX parser. It works for small amounts of data, but when the data is large and spans multiple lines, I get only two lines of it, not all of them. I am using the following code:
InputStreamReader isr = new InputStreamReader(is);
InputSource source = new InputSource(isr);
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
SAXParser parser = factory.newSAXParser();
XMLReader xr = parser.getXMLReader();
GeofenceParametersXMLHandler handler = new GeofenceParametersXMLHandler();
xr.setContentHandler(handler);
xr.parse(source);
And my GeofenceParametersXMLHandler is-
private boolean inTimeZone = false;
private boolean inCoordinate = false;
private boolean outerBoundaryIs = false;
private boolean innerBoundaryIs = false;
private String timeZone;
private List<String> innerCoordinates = new ArrayList<String>();
private String outerCoordinates;
public String getTimeZone() {
return timeZone;
}
public List<String> getInnerCoordinates() {
return innerCoordinates;
}
public String getOuterCoordinates() {
return outerCoordinates;
}
@Override
public void characters(char[] ch, int start, int length) throws SAXException {
super.characters(ch, start, length);
if (this.inTimeZone) {
this.timeZone = new String(ch, start, length);
this.inTimeZone = false;
}
if (this.inCoordinate && this.innerBoundaryIs) {
this.innerCoordinates.add(new String(ch, start, length));
this.inCoordinate = false;
this.innerBoundaryIs = false;
}
if (this.inCoordinate && this.outerBoundaryIs) {
this.outerCoordinates = new String(ch, start, length);
this.inCoordinate = false;
this.outerBoundaryIs = false;
}
}
@Override
public void endElement(String uri, String localName, String name) throws SAXException {
super.endElement(uri, localName, name);
}
@Override
public void startDocument() throws SAXException {
super.startDocument();
}
@Override
public void startElement(String uri, String localName, String name, Attributes attributes) throws SAXException {
super.startElement(uri, localName, name, attributes);
if (localName.equalsIgnoreCase("timezone")) {
this.inTimeZone = true;
}
if (localName.equalsIgnoreCase("outerBoundaryIs")) {
this.outerBoundaryIs = true;
}
if (localName.equalsIgnoreCase("innerBoundaryIs")) {
this.innerBoundaryIs = true;
}
if (localName.equalsIgnoreCase("coordinates")) {
this.inCoordinate = true;
}
}
And the xml file is-
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2"
xmlns:gx="http://www.google.com/kml/ext/2.2">
<Placemark>
<name>gx:altitudeMode Example</name>
<timezone>EASTERN</timezone>
<Polygon>
<extrude>1</extrude>
<altitudeMode>relativeToGround</altitudeMode>
<outerBoundaryIs>
<LinearRing>
<coordinates>
-77.05788457660967,38.87253259892824,100
-77.05465973756702,38.87291016281703,100
-77.05315536854791,38.87053267794386,100
-77.05552622493516,38.868757801256,100
-77.05844056290393,38.86996206506943,100
-77.05788457660967,38.87253259892824,100
</coordinates>
</LinearRing>
</outerBoundaryIs>
</Polygon>
I always get only two lines of data for the coordinates, but when they are on a single line I get the complete data. How can I fetch the complete data when it spans multiple lines?
Thanks in advance.
The characters() method won't necessarily give you all the text data in one go (this is a very common misconception, btw).
The proper approach is to concatenate all the data returned by successive calls to characters() (using a StringBuilder or similar). Once your endElement() method is called, you can then treat that text buffer as complete and process it as such.
From the doc:
The Parser will call this method to report each chunk of character
data. SAX parsers may return all contiguous character data in a single
chunk, or they may split it into several chunks
Often you see that for a small XML doc one call to characters() will suffice. However as your XML doc increases in size, you'll find that due to buffering etc. you'll start getting multiple calls. Consequently each call treated on its own appears to be incomplete.
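Applied to the KML file from the question, the approach could be sketched as follows. The class name KmlCoordinatesDemo and the choice to split the buffered text on whitespace into one entry per coordinate tuple are assumptions made for this illustration. The buffer is filled in characters() and only interpreted in endElement().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original question's code.
class KmlCoordinatesDemo {

    static class Handler extends DefaultHandler {
        private boolean inOuterBoundary;
        private StringBuilder coords; // non-null only inside an outer <coordinates>
        final List<String> outerCoordinates = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            if (localName.equals("outerBoundaryIs")) {
                inOuterBoundary = true;
            } else if (localName.equals("coordinates") && inOuterBoundary) {
                coords = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (coords != null) {
                coords.append(ch, start, length); // keep every chunk, however many arrive
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (localName.equals("coordinates") && coords != null) {
                // The text is complete now: one tuple per whitespace-separated token.
                for (String tuple : coords.toString().trim().split("\\s+")) {
                    if (!tuple.isEmpty()) {
                        outerCoordinates.add(tuple);
                    }
                }
                coords = null;
            } else if (localName.equals("outerBoundaryIs")) {
                inOuterBoundary = false;
            }
        }
    }

    /** Parses the KML string and returns the outer boundary coordinate tuples. */
    static List<String> outerCoordinates(String kml) {
        Handler handler = new Handler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so that localName is populated
            factory.newSAXParser().parse(new InputSource(new StringReader(kml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return handler.outerCoordinates;
    }

    public static void main(String[] args) {
        String kml = "<kml xmlns=\"http://www.opengis.net/kml/2.2\"><Placemark><Polygon><outerBoundaryIs>"
                + "<LinearRing><coordinates>\n-77.05788457660967,38.87253259892824,100\n"
                + "-77.05465973756702,38.87291016281703,100\n</coordinates></LinearRing>"
                + "</outerBoundaryIs></Polygon></Placemark></kml>";
        outerCoordinates(kml).forEach(System.out::println);
    }
}
```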
