Problem in using stream reader - java

I have XML data as a string that has to be parsed. I am converting the XML string to an InputSource using the following code:
StringReader reader1 = new StringReader(xmlstring);
InputSource inputSource1 = new InputSource(reader1);
And I am passing the input source to
Document doc = builder.build(inputSource1);
I want to use the same inputSource1 in another parser class as well, but I get a "stream closed" error.
How should I handle this, or is there another way to pass XML data to a parser other than a file?

Looking at the JavaDoc, it seems that InputSource is not designed to be shared and reused by multiple parsers.
standard processing of both byte and character streams is to close them on as part of end-of-parse cleanup, so applications should not attempt to re-use such streams after they have been handed to a parser.
So you will have to create a new InputSource. If you are really reading from a String, there would be no difference in I/O or memory cost anyway.
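A minimal sketch of that advice, assuming the XML is held in a String. It uses the JDK's DocumentBuilder rather than the builder from the question, and the class and method names are made up for illustration. Each parse gets its own StringReader and InputSource, so it does not matter that the parser closes the reader when it finishes.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class FreshInputSourceDemo {
    // Build a brand-new Reader and InputSource every time the XML is needed.
    static InputSource sourceFor(String xml) {
        return new InputSource(new StringReader(xml));
    }

    // Parse the string and return the name of the root element.
    static String rootName(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(sourceFor(xml));
            return doc.getDocumentElement().getNodeName();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<order><item>1</item></order>";
        // Two independent parses of the same string, each with its own InputSource.
        System.out.println(rootName(xml));
        System.out.println(rootName(xml));
    }
}
```

The String itself is the reusable object here; the InputSource is treated as a throwaway wrapper.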

Related

Why using InputSource fixes SAX parser when file contains special UTF-8 characters

I'm looking to get an explanation on why my SAX parser fails when some special UTF-8 characters are inside my XML file.
To parse the XML file I use Document doc = builder.parse(file);
However when I use an inputSource it works fine:
DocumentBuilder builder = factory.newDocumentBuilder();
InputStream in = new FileInputStream(file);
InputSource inputSource = new InputSource(new InputStreamReader(in));
Document doc = builder.parse(inputSource);
I don't quite understand why the latter works. I've seen examples of it being used, but none of them explain why it works.
Does the second parse a string rather than a file, therefore the encoding will be UTF-8?
I suspect your document isn't really in the encoding you've declared. This line:
InputSource inputSource = new InputSource(new InputStreamReader(in));
will use the platform default encoding to convert the binary data into text within InputStreamReader. The XML parser doesn't get to do it any more - it doesn't get to see the raw bytes.
If this is working, your XML file is probably subtly broken: it may declare that it is in UTF-8 while actually being encoded in the platform default encoding (e.g. Windows-1252). Rather than use the workaround, you should fix the XML if you have any choice about it.
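To illustrate the mechanism, here is a small sketch with an in-memory document that declares UTF-8 but is really encoded as ISO-8859-1 (standing in for the platform default). The sample data and class name are assumptions for the demo, not taken from the question. Handed the raw bytes, the JDK parser trusts the declaration and rejects the invalid UTF-8 sequence; handed a Reader, it never sees the bytes and the declaration is ignored.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class MislabelledEncodingDemo {
    // Declares UTF-8, but the bytes are ISO-8859-1: the e-acute becomes the single byte 0xE9.
    static byte[] sampleBytes() {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><word>caf\u00e9</word>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    static String parse(InputSource source) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(source).getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    // Byte stream: the parser decodes the bytes itself, using the declared encoding.
    static String viaByteStream() {
        return parse(new InputSource(new ByteArrayInputStream(sampleBytes())));
    }

    // Character stream: the Reader decodes first, so the declaration is never consulted.
    static String viaReader() {
        return parse(new InputSource(new InputStreamReader(
                new ByteArrayInputStream(sampleBytes()), StandardCharsets.ISO_8859_1)));
    }

    public static void main(String[] args) {
        System.out.println("via Reader: " + viaReader());
        try {
            viaByteStream();
        } catch (IllegalStateException e) {
            System.out.println("via byte stream: " + e.getCause());
        }
    }
}
```

The Reader variant only "works" because the charset passed to InputStreamReader happens to match the real encoding of the bytes, which is exactly the situation described above when the platform default matches the file.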

parsing a large xml string in java using XMLReader

I have the following bit of code, which parses an XML string returned from a database:
XMLReader xReader = XMLReaderFactory.createXMLReader();
xReader.setContentHandler(parser);
xReader.parse(new InputSource(new StringReader(theResponseStringFromTheDatabase)));
Whenever theResponseStringFromTheDatabase is too large, the parsing fails. Is there a way to modify the code so it will parse large strings?
best wishes,
ck
I would suggest that you need to get a Reader accessing that column.
InputSource can take a Reader object, and that reader should be one capable of pulling the XML incrementally (suitable for the SAX reader underlying the XMLReader)
ResultSet.getCharacterStream() may do the trick.
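A rough sketch of that idea, assuming the XML lives in a character column of a JDBC ResultSet. The class, the element-counting handler and the column label are hypothetical; whether the data is really fetched incrementally depends on the JDBC driver.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ColumnXmlDemo {
    // Stand-in for the question's content handler: it just counts elements.
    static final class ElementCounter extends DefaultHandler {
        int count;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            count++;
        }
    }

    // Parse XML from any Reader; the full text is never built up as one String here.
    static int countElements(Reader xml) {
        ElementCounter counter = new ElementCounter();
        try {
            XMLReader xmlReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            xmlReader.setContentHandler(counter);
            xmlReader.parse(new InputSource(xml));
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("XML parsing failed", e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counter.count;
    }

    // Assumption: rs is positioned on a row and the column holds non-null XML text.
    static int countElementsInColumn(ResultSet rs, String columnLabel) throws SQLException {
        Reader xml = rs.getCharacterStream(columnLabel);
        return countElements(xml);
    }
}
```

With a real query you would call something like countElementsInColumn(rs, "xml_payload") after rs.next(), where "xml_payload" is a made-up column name.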

How to do a search/replace on a file on the fly?

My java application loads an XML file and then parses the XML.
What I would like to do is a search/replace on the file before I create the SAXBuilder. How can I do this in memory (without having to write to the file)?
Here's my code, and where I envision doing the search/replace :
private String xmlFile = "D:\\mycomputer\\extract.xml";
File myXMLFile = new File(xmlFile);
// TODO
// REPLACE ALL "<content>" in xmlFile with "<content><![CDATA["
// REPLACE ALL "</content>" with "]]></content>"
SAXBuilder builder = new SAXBuilder("org.apache.xerces.parsers.SAXParser");
document = builder.build(myXMLFile);
Read the file into memory, do the search/replace, and pass a StringReader to SAXBuilder.build(Reader).
You can first read the file into a string with Apache Commons IO and then change the input source for the SAXBuilder, as in the following code snippet:
String fileStr = FileUtils.readFileToString(myXMLFile);
fileStr = fileStr.replaceAll("<content>","<content><![CDATA[");
fileStr = fileStr.replaceAll("</content>","]]></content>");
SAXBuilder builder = new SAXBuilder("org.apache.xerces.parsers.SAXParser");
document = builder.build(new ByteArrayInputStream(fileStr.getBytes()));
You answered the question yourself: read the whole file into a StringBuilder, perform the replacement in it, and then call the SAXBuilder.
The string can be passed to SAXBuilder using a StringReader:
StringBuilder sb = new StringBuilder();
loadFileContent(filePath, sb); // your own helper that reads the file into sb
document = builder.build(new StringReader(sb.toString()));
P.S.: follow up to theglauber's answer:
If the file is really big (~100 MB), it is impractical to read it fully into memory and then parse it into a DOM tree. In that case you should consider using a SAX parser and doing the replacement while the file is being parsed.
Depending on how large these files are, either read the file into a String, do your replacements in memory and build the XML from the String, or spawn a new thread to read the file, do the replacements and output, then build the XML from the output of that thread.
(I would suggest parsing and modifying the XML tree or using an XML filter, but I suspect you want to do this string-based replacement because the current content of your files is not correct XML.)
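For the in-memory route, here is a minimal sketch that needs no extra libraries. It swaps JDOM's SAXBuilder for the JDK's DocumentBuilder, and it assumes the file is UTF-8 and small enough to hold in memory; the class and method names are invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class CdataWrapDemo {
    // Literal (non-regex) replacement of the opening and closing tags.
    static String wrapContentInCdata(String xml) {
        return xml.replace("<content>", "<content><![CDATA[")
                  .replace("</content>", "]]></content>");
    }

    // Parse already-patched XML text; nothing is written back to disk.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    // Assumption: the file is UTF-8 and fits comfortably in memory.
    static Document loadPatched(Path file) throws IOException {
        String raw = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return parse(wrapContentInCdata(raw));
    }

    public static void main(String[] args) {
        // Not well-formed as it stands: bare '<' and '&' inside <content>.
        String raw = "<doc><content>a < b & c</content></doc>";
        Document doc = parse(wrapContentInCdata(raw));
        System.out.println(doc.getElementsByTagName("content").item(0).getTextContent());
    }
}
```

Note that the replacement breaks if the content itself ever contains "]]>", so this only suits input where that cannot happen.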

Stream xml input to sax Parser, How to print the xml streamed?

I am trying to connect to a remote server via a socket, and I get big XML responses back from the socket, delimited by a '\n' character.
<?xml version="1.0" encoding="UTF-8"?>
<Response>
<data>
.......
.......
</data>
</Response>\n <---- \n acts as delimiter
<?xml version="1.0" encoding="UTF-8"?>
<Response>
<data>
....
....
</data>
</Response>\n
..
I am trying to parse these XML documents using a SAX parser. Ideally I would read one full response into a string by searching for '\n' and hand that response to the parser. But since a single response is very large, I get an out-of-memory error when holding such a large XML document in a string. So the only remaining option was to stream the XML to SAX.
SAXParserFactory spfactory = SAXParserFactory.newInstance();
SAXParser saxParser = spfactory.newSAXParser();
XMLReader xmlReader = saxParser.getXMLReader();
xmlReader.setContentHandler(new MyDefaultHandler(context));
InputSource xmlInputSource = new InputSource(new CloseShieldInputStream(mySocket.getInputStream()));
xmlReader.parse(xmlInputSource);
I am using CloseShieldInputStream to prevent SAX from closing my socket stream when it hits an exception because of the '\n'. I asked a previous question about that.
Now sometimes I get a parse error:
org.apache.harmony.xml.ExpatParser$ParseException: At line 1, column 8: not well-formed (invalid token)
I searched for it and found that this error normally occurs when the encoding of the actual XML is not what SAX expects. I wrote a C program to print out the XML, and all of it is encoded in UTF-8.
Now my questions:
1. Is there any other reason for the above error in XML parsing, other than an encoding issue?
2. Is there any way to print (or write to a file) the input to SAX as it streams from the socket?
After trying Hemal Pandya's answer..
OutputStream log = new BufferedOutputStream(new FileOutputStream("log.txt"));
InputSource xmlInputSource = new InputSource(new CloseShieldInputStream(new TeeInputStream(mReadStream, log)));
xmlReader.parse(xmlInputSource);
A new file named log.txt gets created when I mount the SD card, but it is empty. Am I using this right?
How I finally did it:
I worked it out with TeeInputStream itself. Thanks to Hemal Pandya for suggesting it.
// open a log file in append mode
OutputStream log = new BufferedOutputStream(new FileOutputStream("log.txt", true));
InputSource xmlInputSource = new InputSource(new CloseShieldInputStream(new TeeInputStream(mReadStream, log)));
try {
    xmlReader.parse(xmlInputSource);
} catch (SAXException e) {
    // parsing failed; the log is still flushed in the finally block
} finally {
    // flush the buffered log to the file whether or not parsing succeeded
    log.flush();
}
Is there any way to print (or write to any file) the input to SAX as
it streams from socket?
Apache Commons has a TeeInputStream that should be useful.
OutputStream log = new BufferedOutputStream(new FileOutputStream("response.xml"));
InputSource xmlInputSource = new InputSource(new CloseShieldInputStream(new TeeInputStream(mySocket.getInputStream(), log)));
I haven't used it, so you might want to try it first in a standalone program to figure out the close semantics. Judging from the docs and your requirements, it looks like you would want to close the log stream separately at the end.
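If you want to experiment with the idea in a standalone program without Commons IO, a tee stream is easy to approximate with the JDK alone. The sketch below is not the Commons IO implementation; LoggingInputStream and TeeDemo are made-up names, and the in-memory byte array stands in for the socket.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Copies every byte the consumer reads into a second stream.
class LoggingInputStream extends FilterInputStream {
    private final OutputStream log;

    LoggingInputStream(InputStream in, OutputStream log) {
        super(in);
        this.log = log;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            log.write(b);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            log.write(buf, off, n);
        }
        return n;
    }

    // Refuse mark/reset so that re-read bytes cannot be logged twice.
    @Override
    public boolean markSupported() {
        return false;
    }
}

class TeeDemo {
    // Parse the bytes with SAX and return everything the parser actually read.
    static String parseAndCapture(byte[] xml) {
        ByteArrayOutputStream log = new ByteArrayOutputStream();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler());
            reader.parse(new InputSource(new LoggingInputStream(new ByteArrayInputStream(xml), log)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Parsing failed; whatever was read so far is still in the log.
        }
        return new String(log.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

With a buffered file stream as the log, you would additionally flush or close it yourself after parsing, since closing the wrapper only closes the wrapped input.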
I am not familiar with Expat, but to accomplish what you are describing in general, you need a SAX parser that supports pushing data into the parser instead of having the parser pull data from a source. Check whether Expat supports a push model. If it does, you can simply read a chunk of data from the socket, push it to the parser, and it will parse whatever it can from the chunk, caching any remaining data for use during the next push. Repeat as needed until you are ready to close the socket connection.
In this model, the \n separator would be treated as miscellaneous whitespace between nodes, so you have to use the SAX events to detect when a new <Response> node opens and closes.
Also, because you are receiving multiple <Response> nodes in the data, and XML does not allow more than one top-level document node, you would need to push a custom opening tag into the parser before you start pushing the socket data. The custom opening tag then becomes the top-level document node, and the <Response> nodes become its children.

Converting document encoding when reading with dom4j

Is there any way I can convert a document being parsed by dom4j's SAXReader from the ISO-8859-2 encoding to UTF-8? I need that to happen while parsing, so that the objects created by dom4j are already Unicode/UTF-8 and running code such as:
"some text".equals(node.getText());
returns true.
This is done automatically by dom4j. All String instances in Java are in a common, decoded form; once a String is created, it isn't possible to tell what the original character encoding was (or even if the string was created from encoded bytes).
Just make sure that the XML document specifies its character encoding (which is required unless it is UTF-8 or UTF-16).
The decoding happens in (or before) the InputSource (before the SAXReader). From that class's javadocs:
The SAX parser will use the InputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly, disregarding any text encoding declaration found in that stream. If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification. If neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier.
So it depends on how you are creating the InputSource. To guarantee the proper decoding you can use something like the following:
InputStream stream = ...; // however you obtain the raw bytes of the document
Charset charset = Charset.forName("ISO-8859-2");
Reader reader = new BufferedReader(new InputStreamReader(stream, charset));
InputSource source = new InputSource(reader);
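As a self-contained illustration, the sketch below decodes ISO-8859-2 bytes in two ways: through a Reader with an explicit charset, and through a byte stream with the encoding set on the InputSource. It uses the JDK's DOM parser rather than dom4j, and the sample text and class name are assumptions for the demo. Either way, the resulting String compares equal to an ordinary Java literal.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class Latin2Demo {
    static final Charset LATIN2 = Charset.forName("ISO-8859-2");
    // Polish sample word, written with escapes so the source file encoding is irrelevant.
    static final String WORD = "\u017c\u00f3\u0142w";

    // No encoding declaration: the parser cannot know these bytes are ISO-8859-2.
    static byte[] sampleBytes() {
        return ("<node>" + WORD + "</node>").getBytes(LATIN2);
    }

    static String rootText(InputSource source) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(source).getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    // Option 1: decode the bytes yourself and hand the parser a character stream.
    static String viaReader() {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(sampleBytes()), LATIN2);
        return rootText(new InputSource(reader));
    }

    // Option 2: hand over the byte stream and name the encoding on the InputSource.
    static String viaDeclaredEncoding() {
        InputSource source = new InputSource(new ByteArrayInputStream(sampleBytes()));
        source.setEncoding("ISO-8859-2");
        return rootText(source);
    }

    public static void main(String[] args) {
        System.out.println(WORD.equals(viaReader()));
        System.out.println(WORD.equals(viaDeclaredEncoding()));
    }
}
```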
