Java SAX: Parsing XML file on the fly - java

I'm trying to get information from an XML file in Java with SAX.
I found some examples with a class that implements the ContentHandler
interface, and it works well when I run the parse method on an entire
well-formed file with the XMLReaderFactory class.
But my goal is to parse an XML file on the fly, from stdin for example.
I'd like to get the XML information markup by markup, like:
> <foo>
markup = foo
> <bar a="baz">
markup = bar
attribute a = baz
> </bar>
end markup bar
> </foo>
end markup foo
But when I pass these inputs step by step to the parser, it stops at the first entry
and says
[Fatal Error] :1:10: XML document structures must start and end within the same entity.
Is there a way to do this?
I'm only allowed to use SAX to do this :-( for my school exercise.
Thanks for your help,
Arthur.

Your markup is invalid XML. Specifically, the foo element is ended before the bar element:
<foo>
<bar>
</foo>
</bar>
If the markup were correct, you should be able to do what you like.
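If the fatal error comes from calling parse once per input line (so that the parser sees "<foo>" alone as a complete document), one option is to hand the parser the whole stream a single time and let the handler report events as they are read. The following is a minimal sketch under that assumption; the class name StreamingTrace and the trace method are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class StreamingTrace {
    /** Parses ONE document from the stream and reports each event to the sink as it is seen. */
    static void trace(InputStream in, Consumer<String> sink) throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                sink.accept("markup = " + qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    sink.accept("attribute " + atts.getQName(i) + " = " + atts.getValue(i));
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                sink.accept("end markup " + qName);
            }
        };
        // The stream is handed over once; the parser pulls bytes from it as it needs them.
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
    }

    public static void main(String[] args) throws Exception {
        // Sample input standing in for what would be typed on stdin.
        String sample = "<foo>\n<bar a=\"baz\">\n</bar>\n</foo>\n";
        trace(new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)), System.out::println);
    }
}
```

To read from standard input instead, the call would be trace(System.in, System.out::println). The parser may buffer its input, so events are not guaranteed to appear immediately after each typed line.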

Related

Missing NameSpace Information In XML file using EXIficient

I am using EXIficient to convert XML data to EXI and back to XML. Here, I use their EXIficientDemo class. Sample code:
EXIficientDemo sample = new EXIficientDemo();
sample.parseAndProofFileLocations("FilePath");
sample.codeSchemaLess();
It first converts the XML file to EXI and then back to XML. When it generates XML from the previously generated EXI file, it loses some namespace information.
Actual XML file:
<?xml version="1.0" encoding="utf-8"?>
<tt xml:lang="ja" xmlns="http://www.w3.org/ns/ttml"
xmlns:tts="http://www.w3.org/ns/ttml#styling"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<body>
<div>
<p xml:id="s1">
<span tts:origin="somethings">somethings</span>
</p>
</div>
</body>
XML file generated by EXIficient:
<?xml version="1.0" encoding="UTF-8"?>
<ns3:tt xmlns:ns3="http://www.w3.org/ns/ttml"
xml:lang="ja" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns3:body><ns3:div>
<ns3:p xml:id="s1">
<ns3:span xmlns:ns4="http://www.w3.org/ns/ttml#styling"
ns4:origin="somethings">somethings</ns3:span>
</ns3:p>
</ns3:div></ns3:body>
The generated XML file is missing xmlns:tts="http://www.w3.org/ns/ttml#styling" on the root element.
How can I fix this problem? If you can, please help me.
EXIficient may be suppressing unused namespaces. Your example doesn't show any use of the ttm namespace.
As you can see, it didn't retain the namespace prefix for the ttml namespace either (changed to ns3). The generated XML is perfectly valid if the ttml#metadata namespace is unused.
Update
With the updated question, where namespace ttml#styling is used by the origin attribute of the span element, the namespace is retained in the rebuilt XML, but it has been moved to the span element.
This is still a very valid XML document.
Namespace declarations (xmlns) can appear on any element in an XML document. A declaration applies to the element on which it appears and to all of its subelements (unless overridden, which is very unusual).
The same namespace can be declared many times on different elements. For simplicity and/or optimization, it is common to declare all namespaces up front, on the root element, using different prefixes, but it is not required to do so.
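To illustrate why the two documents are equivalent, here is a minimal sketch that looks up the origin attribute by namespace URI and local name rather than by prefix. It uses the JDK's namespace-aware DOM parser; the class name NamespaceEquivalence and the shortened sample documents are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class NamespaceEquivalence {
    static final String TTML = "http://www.w3.org/ns/ttml";
    static final String STYLING = "http://www.w3.org/ns/ttml#styling";

    /** Returns the styling-namespace "origin" attribute of the first TTML span element. */
    static String origin(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // without this, prefixes are treated as part of the name
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        Element span = (Element) doc.getElementsByTagNameNS(TTML, "span").item(0);
        return span.getAttributeNS(STYLING, "origin");
    }

    public static void main(String[] args) throws Exception {
        // Declarations on the root element, original prefixes.
        String original = "<tt xmlns=\"" + TTML + "\" xmlns:tts=\"" + STYLING + "\">"
                + "<span tts:origin=\"somethings\">somethings</span></tt>";
        // Generated prefixes, styling namespace declared on the span itself.
        String rebuilt = "<ns3:tt xmlns:ns3=\"" + TTML + "\">"
                + "<ns3:span xmlns:ns4=\"" + STYLING + "\" ns4:origin=\"somethings\">somethings</ns3:span></ns3:tt>";
        System.out.println(origin(original));
        System.out.println(origin(rebuilt));
    }
}
```

Code that addresses elements and attributes this way does not depend on which prefix is used or on where the declaration sits.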
I read this question by accident and rather late unfortunately.
Just in case people are still struggling with this and are wondering what they can do.
As was pointed out, EXIficient behaves just fine with regard to namespace handling.
Having said that, the EXI specification allows one to preserve prefixes and namespaces (see Preserve Options).
In EXIficient one can set these options accordingly, e.g. on an EXIFactory instance (here called exiFactory):
exiFactory.getFidelityOptions().setFidelity(FidelityOptions.FEATURE_PREFIX, true);

Parsing data inside CDATA element

I need to parse an XML file that looks like this:
1.<?xml version="1.0" encoding="UTF-8"?>
2.<Root>
3.<Record>
4.<in><![CDATA[<?xml version="1.0" encoding="UTF-8"?><XML><Attribute AttrID="A">Test</Attribute>-<Attribute AttrID="B"> <![CDATA[Aap Noot Mies]]> </Attribute>]]></XML></in>
5.<out><![CDATA[]]></out>
6.</Record>
7.</Root>
I am getting an error while parsing line number 4. Is there any way to escape a CDATA end token ( ]]> ) within a CDATA section in an XML document?
Your input is not well formed; there are several errors. I think you need to fix whatever generated it so that it produces something more like:
<?xml version="1.0" encoding="UTF-8"?>
<Root>
<Record>
<in><![CDATA[<?xml version="1.0" encoding="UTF-8"?><!-- - --><XML><Attribute AttrID="A">Test</Attribute>-<Attribute AttrID="B"> <![CDATA[Aap Noot Mies]]]]><![CDATA[> </Attribute></XML>]]></in>
<out><![CDATA[]]></out>
</Record>
</Root>
Note that the outer CDATA needs <![CDATA[, not <!CDATA[, and the first use of ]]> needs to be quoted (for example by stopping and starting the outer CDATA section, as here). The outer ]]> needs to be moved after the </XML> so that the end as well as the start of the element is quoted.
That makes the file technically well formed, although names such as XML (or, in general, names starting with xml in upper or lower case) are reserved by the W3C for use in XML-related specifications and should not be used in user XML files unless they are a specific element or attribute (such as xmlns) defined by the W3C.
In addition, I added a (quoted) comment around the dash after the XML declaration: if that CDATA section were extracted and made into an XML document, the bare dash would make the resulting document non-well-formed, as only white space, comments and PIs are allowed before the first element.
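If the generator is written in Java, the "stop and restart the CDATA section" trick can be applied mechanically. The following is a minimal sketch, not the original generator's code; CdataEscaper, wrap and roundTrip are hypothetical names. It splits every ]]> in the payload across two CDATA sections and then checks with the JDK's DOM parser that the text comes back unchanged.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class CdataEscaper {
    /** Wraps text in CDATA, ending and reopening the section wherever "]]>" occurs. */
    static String wrap(String text) {
        // "]]>" becomes "]]" + "]]>" (end section) + "<![CDATA[" (reopen) + ">"
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    /** Embeds the wrapped text in an <in> element, parses it, and returns the text the parser sees. */
    static String roundTrip(String text) throws Exception {
        String xml = "<in>" + wrap(text) + "</in>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // getTextContent concatenates the adjacent CDATA sections again.
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String inner = "<XML><Attribute AttrID=\"B\"> <![CDATA[Aap Noot Mies]]> </Attribute></XML>";
        System.out.println(wrap(inner));
        System.out.println(roundTrip(inner).equals(inner));
    }
}
```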

Parse XML with nested xml opening tags <?xml ...?> in java

Can you help me parse XML with nested <?xml version="1.0" encoding="utf-8"?> tags? When I try to parse this XML, I get a parsing error.
<?xml version="1.0" encoding="utf-8"?>
<soap>
<soapenvBody>
<serviceResponse>
<?xml version="1.0" encoding="UTF-8"?>
<data>
<respCode>0</respCode>
</data>
</serviceResponse>
</soapenvBody>
</soap>
I don't think this is really a Java problem. Having a second XML declaration within the XML body is just illegal, so I don't think you'll be able to get any XML parsers to parse that. If you have control over the XML (it looks like you're generating it to store a response) then you could try wrapping the inner-XML document with CDATA:
<?xml version="1.0" encoding="utf-8"?>
<soap>
<soapenvBody>
<serviceResponse>
<![CDATA[
<?xml version="1.0" encoding="UTF-8"?>
<data>
<respCode>0</respCode>
</data>
]]>
</serviceResponse>
</soapenvBody>
</soap>
EDIT:
I'm thinking that you most likely don't want the extra XML declaration inside that response at all. Do you have control over the code that creates the response? My guess is that the XML snippet <data>...</data> is created as a separate DOM object and then the string is spliced in the middle of the response. Writing out the entire XML document object results in the XML declaration being included, but if you just grab the document root node object (<data>) and write that out as a string then it probably won't include the extra XML declaration that's causing you all this trouble.
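As a rough illustration of writing out only the root element, here is a minimal sketch that assumes the inner document is available as an org.w3c.dom.Document and is serialized with the JDK's javax.xml.transform API. The answer does not say which API the real code uses, and FragmentWriter and its methods are hypothetical names.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class FragmentWriter {
    /** Serializes the document's root element without a leading XML declaration. */
    static String withoutDeclaration(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc.getDocumentElement()), new StreamResult(out));
        return out.toString();
    }

    /** Helper for the demo: parses a string into a DOM document. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document inner = parse("<?xml version=\"1.0\" encoding=\"UTF-8\"?><data><respCode>0</respCode></data>");
        // The returned string can be spliced into the outer response.
        System.out.println(withoutDeclaration(inner));
    }
}
```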
It occurred to me that a parser made for dealing with HTML might be able to do what you want. Since HTML tends to be a total mess compared to strict XML, HTML parsers are usually much more error-tolerant. A quick search turned up jsoup. I was able to pull the respCode from your sample XML above with roughly this code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
String data = "your xml goes here";
Document doc = Jsoup.parse(data);
String respCodeRaw = doc.select("respCode").first().text();
int respCode = Integer.parseInt(respCodeRaw);
(I actually tested the library in the Clojure repl, but the code above should work!)
A tag that starts with <? is a processing instruction. <?xml ...?> is an XML declaration and can only appear at the very beginning of the XML content. It's not allowed in the XML body.
Why does your soap body contain this? Do you have the option of removing it?
I did not find any parser in Java that can parse such embedded XML, as it is not valid XML, and I guess almost all parsers validate the XML before parsing it. So I chose to preprocess the XML: I selected the inner XML, then parsed it with a SAX parser and retrieved the values. Thanks for your replies, guys.
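The preprocessing step could look roughly like the sketch below. This is not the asker's actual code: it takes the slightly different route of deleting every XML declaration that is not at the very start of the text and then parsing the whole thing with SAX. NestedDeclarationStripper and its methods are hypothetical names, and the regular expression would also remove a declaration that appears inside a comment or CDATA section.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class NestedDeclarationStripper {
    // "<?xml" followed by whitespace, so "<?xml-stylesheet ...?>" is left alone.
    private static final Pattern XML_DECL = Pattern.compile("<\\?xml\\s.*?\\?>", Pattern.DOTALL);

    /** Removes every XML declaration except one at offset 0, which is legal. */
    static String strip(String xml) {
        Matcher m = XML_DECL.matcher(xml);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            if (m.start() == 0) {
                continue; // keep the leading declaration
            }
            out.append(xml, last, m.start());
            last = m.end();
        }
        out.append(xml, last, xml.length());
        return out.toString();
    }

    /** Strips nested declarations, then collects the text of the respCode element with SAX. */
    static String respCode(String xml) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inside;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("respCode".equals(qName)) inside = true;
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("respCode".equals(qName)) inside = false;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inside) text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(strip(xml))), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String response = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<soap><soapenvBody><serviceResponse>\n"
                + "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<data><respCode>0</respCode></data>\n"
                + "</serviceResponse></soapenvBody></soap>";
        System.out.println(respCode(response));
    }
}
```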

Removing nodes with invalid tag names from a xml document

I transform XML with the Saxon XSLT 2 processor (using Java + the Saxon S9API) and have to deal with source documents that contain invalid characters in tag names and therefore can't be parsed by the document builder.
Example:
<A>
<B />
<C>
<D />
</C>
<E!_RANDOM_ />
< />
</A>
Code:
import net.sf.saxon.s9api.*;
[...]
/* XSLT Processor & Compiler */
proc = new Processor(false);
/* build document from input*/
XdmNode source = proc.newDocumentBuilder().build(new StreamSource(input));
Error:
Error on line X column Y
SXXP0003: Error reported by XML parser: Element type
"E" must be followed by either attribute specifications, ">" or "/>".
The exclamation mark and a tag name consisting only of a space are currently my only invalid tags.
I am searching for a more robust solution than just removing whole lines of the (formatted) XML.
With some mind-bending I could come up with a regular expression to identify the invalid strings, but would struggle with the removal of the nodes containing attributes and child-nodes.
Thank you for your help!
If the input contains invalid tags then it is not XML. It's best to get your mind-set right by referring to these as non-XML documents rather than XML documents; that helps to make it clear that to process non-XML documents, you need non-XML tools. (Forget about "nodes" - there are no nodes until the document has been parsed, and it can't be parsed until you have turned it into well-formed XML). To turn non-XML into XML, you will typically want to use non-XML tools that are good at text manipulation, such as Perl. Of course, it's much better to fix the problem at source: all the benefits of XML are lost if people generate data in private non-XML formats.
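If the clean-up has to stay in Java, the same text-level approach can be sketched with regular expressions, applied before any XML parser sees the input. The sketch below is deliberately narrow: it only drops empty-element tags (those ending in />) whose name is not a valid name, it approximates XML names with ASCII characters, and it does not attempt to remove invalid elements that have children. InvalidTagFilter is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InvalidTagFilter {
    // Any "<.../>" that contains no other angle bracket.
    private static final Pattern EMPTY_TAG = Pattern.compile("<([^<>]*)/>");
    // ASCII approximation of an XML name, optionally followed by whitespace and attributes.
    private static final Pattern VALID_INNER =
            Pattern.compile("[A-Za-z_:][A-Za-z0-9_:.-]*(\\s.*)?", Pattern.DOTALL);

    /** Drops empty-element tags whose name is missing or contains invalid characters. */
    static String dropInvalidEmptyTags(String text) {
        Matcher m = EMPTY_TAG.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            if (!VALID_INNER.matcher(m.group(1)).matches()) {
                out.append(text, last, m.start()); // keep everything before the bad tag
                last = m.end();                    // and skip the tag itself
            }
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "<A><B /><C><D /></C><E!_RANDOM_ />< /></A>";
        System.out.println(dropInvalidEmptyTags(input));
    }
}
```

The cleaned text can then be passed to the document builder as usual.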

The markup must be well-formed

First off, let me say I am new to SAX and Java.
I am trying to read information from an XML file that is not well formed.
When I try to use the SAX or DOM Parser I get the following error in response:
The markup in the document following the root element must be well-formed.
This is how I set up my XML file:
<format type="filename" t="13241">0;W650;004;AG-Erzgeb</format>
<format type="driver" t="123412">001;023</format>
...
Can I force the SAX or DOM to parse XML files even if they are not well formed XML?
Thank you for your help. Much appreciated.
Haythem
Your best bet is to make the XML well-formed, probably by pre-processing it a bit. In this case, you can achieve that simply by putting an XML declaration on (and even that's optional) and providing a root element (which is not optional), like this:
<?xml version="1.0"?>
<wrapper>
<format type="filename" t="13241">0;W650;004;AG-Erzgeb</format>
<format type="driver" t="123412">001;023</format>
</wrapper>
There I've arbitrarily picked the name "wrapper" for the root element; it can be whatever you like.
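The wrapping does not have to be done by editing the file. As a minimal sketch, the root element can be added on the fly by concatenating streams before handing them to SAX. This assumes the fragment is UTF-8 and carries no XML declaration of its own; RootWrapper and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class RootWrapper {
    private static InputStream bytes(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8));
    }

    /** Surrounds the fragment with an arbitrary <wrapper> root element. */
    static InputStream wrap(InputStream fragment) {
        List<InputStream> parts = Arrays.asList(bytes("<wrapper>"), fragment, bytes("</wrapper>"));
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    /** Parses the wrapped fragment with SAX and collects the type attribute of each format element. */
    static List<String> formatTypes(InputStream fragment) throws Exception {
        List<String> types = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("format".equals(qName)) types.add(atts.getValue("type"));
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(wrap(fragment), handler);
        return types;
    }

    public static void main(String[] args) throws Exception {
        String fragment = "<format type=\"filename\" t=\"13241\">0;W650;004;AG-Erzgeb</format>\n"
                + "<format type=\"driver\" t=\"123412\">001;023</format>\n";
        System.out.println(formatTypes(bytes(fragment)));
    }
}
```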
Hint: using SAX or StAX you can successfully parse a non-well-formed XML document until the FIRST well-formedness error is encountered.
(I know that this is not of too much help...)
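A small sketch of that behaviour with the JDK's SAX parser: the handler records element names as they arrive, and the exception thrown at the first well-formedness error is caught afterwards. PartialParse and elementsBeforeError are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class PartialParse {
    /** Returns the element names seen before the first error, followed by the error message if any. */
    static List<String> elementsBeforeError(String text) throws Exception {
        List<String> seen = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                seen.add(qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(text)), handler);
        } catch (SAXParseException e) {
            // Events delivered before this point have already been handled.
            seen.add("error: " + e.getMessage());
        }
        return seen;
    }

    public static void main(String[] args) throws Exception {
        String twoRoots = "<format type=\"filename\">a</format>\n<format type=\"driver\">b</format>";
        elementsBeforeError(twoRoots).forEach(System.out::println);
    }
}
```

With two top-level elements, only the first one reaches the handler, so this is a way to salvage a prefix of the data rather than a way to read the whole file.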
Because DOM scans your XML file and then builds a tree, the document needs a single root node, like the wrapper element in the first answer. If the parser can't find a root element, it can't even build the tree. So it's better to pre-process the XML file before parsing it with DOM or SAX.
