i need to parse a XML file that looks like this
1.<?xml version="1.0" encoding="UTF-8"?>
2.<Root>
3.<Record>
4.<in><![CDATA[<?xml version="1.0" encoding="UTF-8"?><XML><Attribute AttrID="A">Test</Attribute>-<Attribute AttrID="B"> <![CDATA[Aap Noot Mies]]> </Attribute>]]></XML></in>
5.<out><![CDATA[]]></out>
6.</Record>
7.</Root>
I am getting a erro while parsing line number 4 Is there any way to escape a CDATA end token ( ]]> ) within a CDATA section in an xml document.
Your input is not well formed there are several errors I think you need to fix whatever generated that to generate something more like
<?xml version="1.0" encoding="UTF-8"?>
<Root>
<Record>
<in><![CDATA[<?xml version="1.0" encoding="UTF-8"?><!-- - --><XML><Attribute AttrID="A">Test</Attribute>-<Attribute AttrID="B"> <![CDATA[Aap Noot Mies]]<![CDATA[> </Attribute></XML>]]></in>
<out><![CDATA[]]></out>
</Record>
</Root>
Note that the outer CDATA needs <![CDATA[ not <!CDATA[ the first use of ]]> needs to be quoted (for example by stopping and starting the outer CDATA section as here). The outer ]]> needs to be moved after the </XML> so that the end as well as the start of the element is quoted.
That makes the file technically well formed, although elements with name XML (or in general starting with xml in upper or lower case are reserved by the W3C for use in XML related specifications and should not be used in user XML files unless it is a specific element or attribute (such as xmlns defined by the W3C)
In addition I added a (quoted) comment around the dash after the XML declaration as if that CDATA section were extracted and made into an XML document it would make the resulting document non-well formed as only white space or comments and PIs are allowed before the first element.
Related
I am using EXIficient to convert XML data to EXI and back to XML. Here, i use their EXIficientDemo class. Sample Code:
EXIficientDemo sample = new EXIficientDemo();
sample.parseAndProofFileLocations("FilePath");
sample.codeSchemaLess();
Firstly it converted xml file to EXI then back to XML, when it generate XML from previously generated EXI's file, it loses some information about Namespace.
Actual XML File:
<?xml version="1.0" encoding="utf-8"?>
<tt xml:lang="ja" xmlns="http://www.w3.org/ns/ttml"
xmlns:tts="http://www.w3.org/ns/ttml#styling"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<body>
<div>
<p xml:id="s1">
<span tts:origin="somethings">somethings</span>
</p>
</div>
</body>
Generated XML File By EXIficient
<?xml version="1.0" encoding="UTF-8"?>
<ns3:tt xmlns:ns3="http://www.w3.org/ns/ttml"
xml:lang="ja"xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns3:body><ns3:div>
<ns3:p xml:id="s1">
<ns3:span xmlns:ns4="http://www.w3.org/ns/ttml#styling"
ns4:origin="somethings">somethings</ns3:span>
</ns3:p>
</ns3:div></ns3:body>
In the generated XML file, it is missing xmlns:tts="http://www.w3.org/ns/ttml#styling"
How to fixed this problem? If you can, please help me.
EXIficient may be suppressing unused namespaces. Your example doesn't show any use of the ttm namespace.
As you can see, it didn't retain the namespace prefix for the ttml namespace either (changed to ns3). The generated XML is perfectly valid if the ttml#metadata namespace is unused.
Update
With the updated question, where namespace ttml#styling is used by the origin attribute of the span element, the namespace is retained in the rebuilt XML, but it has been moved to the span element.
This is still a very valid XML document.
Namespace declarations (xmlns) can appear anywhere in a XML document, and applies to the element on which it appears, and all subelements (unless overridden, which is very unusual).
The same namespace can be declared many times on different elements. For simplicity and/or optimization, it is common to declare all namespaces up front, on the root element, using different prefixes, but it is not required to do so.
I read this question by accident and rather late unfortunately.
Just in case people are still struggling with this and are wondering what they can do.
As it was pointed out EXIficient behaves just fine with regards to namespace handling.
Having said that, the EXI specification allows one to preserve prefixes and namespaces (see Preserve Options).
In EXIficient one can set these options accordingly,
e.g.,
EXIFactory.getFidelityOptions().setFidelity(FidelityOptions.FEATURE_PREFIX, true);
I have this XML:
<Body xmlns:wsu="http://mynamespace">
<Ticket xmlns="http://othernamespace">
<Customer xlmns="">Robert</Customer>
<Products xmlns="">
<Product>a product</>
</Products>
</Ticket>
<Delivered xmlns="" />
<Payment xlmns="">cash</Payment>
</Body>
I am using Java to read it as a DOM document. I want remove the empty namespace attributes (i.e., xmlns=""). Is there any way to do that?
You need to understand that xmlns is a very special attribute. Basically, the xmlns="" is so that your Customer element is in the "unnamed" namespace, rather than the http://othernamespace namespace (and likewise for other elements which would otherwise inherit a default namespace from their ancestors).
If you want to get rid of the xmlns="", you basically need to put the elements into the appropriate namespace - so it's changing the element name. I don't think the W3C API lets you change the name of an element - you may well need to create a new element with the appropriate namespaced-name, and copy the content. Or if you're responsible for creating the document to start with, just use the right namespace.
I have below xml. Namespace is missing in the below xml.
<?xml version="1.0" encoding="UTF-8"?>
<policy>
<num-drivers>123</num-drivers>
<risk-policy-ind>false</risk-policy-ind>
<premium-amt>23.00</premium-amt>
</policy>
Looking for java code to take the above xml as input and add namespace (xmlns) element to it? The expected output xml is as below:
<?xml version="1.0" encoding="UTF-8"?>
<policy xmlns="http://aaa.bbb.com">
<num-drivers>123</num-drivers>
<risk-policy-ind>false</risk-policy-ind>
<premium-amt>23.00</premium-amt>
</policy>
First of all, in the above xml, risk-policy-ind tag is not properly closed. In XML all tags are custom tags, they should close. Moreover,in xml, tags are performed without namespace.
If u just want to add xmlns attribute to policy tag, create policy element using w3c.dom.Element and set attribute using setAttribute function
can you help me in parsing xml with nested <?xml version="1.0" encoding="utf-8"?> tags. when i am trying to parse this xml, i m getting parsing error.
<?xml version="1.0" encoding="utf-8"?>
<soap>
<soapenvBody>
<serviceResponse>
<?xml version="1.0" encoding="UTF-8"?>
<data>
<respCode>0</respCode>
</data>
</serviceResponse>
</soapenvBody>
</soap>
I don't think this is really a Java problem. Having a second XML declaration within the XML body is just illegal, so I don't think you'll be able to get any XML parsers to parse that. If you have control over the XML (it looks like you're generating it to store a response) then you could try wrapping the inner-XML document with CDATA:
<?xml version="1.0" encoding="utf-8"?>
<soap>
<soapenvBody>
<serviceResponse>
<![CDATA[
<?xml version="1.0" encoding="UTF-8"?>
<data>
<respCode>0</respCode>
</data>
]]>
</serviceResponse>
</soapenvBody>
</soap>
EDIT:
I'm thinking that you most likely don't want the extra XML declaration inside that response at all. Do you have control over the code that creates the response? My guess is that the XML snippet <data>...</data> is created as a separate DOM object and then the string is spliced in the middle of the response. Writing out the entire XML document object results in the XML declaration being included, but if you just grab the document root node object (<data>) and write that out as a string then it probably won't include the extra XML declaration that's causing you all this trouble.
It occurred to me that a parser made for dealing with HTML might be able to do what you want. Since HTML tends to be a total mess compared to strict XML, HTML parsers are usually much more error-tolerant. A quick search turned up jsoup. I was able to pull the respCode from your sample XML above with roughly this code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
String data = "your xml goes here";
Document doc = Jsoup.parse(data);
String respCodeRaw = doc.select("respCode").first().text();
int respCode = Integer.valueOf(respCodeRaw);
(I actually tested the library in the Clojure repl, but the code above should work!)
A tag that starts with like <? is a processing instruction. <?xml...> is an XML declaration, and can only be present at the beginning of the xml content. It's not allowed in the XML body.
Why does your soap body contain this? Do you have the option of removing it?
i did not find any parser in java to parse such embedded xml as it is not a valid xml and i guess almost all parses validate the xml before parsing it. so i choose the option to preprocess the xml and selected the inner xml then using SAX parser i parsed the xml and retrieved the values from xml. Guys thanks for your replies.
I'm having big problems with Xpath evaluation using Jaxen.
Here's part of XML i'm evaluating on:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2011-05-31T13:04:08+00:00</responseDate>
<request metadataPrefix="oai_dc" verb="ListRecords">http://citeseerx.ist.psu.edu/oai2</request>
<ListRecords>
<record>
<header>
<identifier>oai:CiteSeerXPSU:10.1.1.1.1484</identifier>
<datestamp>2009-05-24</datestamp>
</header>
<metadata>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Winner-Take-All..</dc:title>
<dc:relation>10.1.1.134.6077</dc:relation>
<dc:relation>10.1.1.65.2144</dc:relation>
<dc:relation>10.1.1.54.7277</dc:relation>
<dc:relation>10.1.1.48.5282</dc:relation>
</oai_dc:dc>
</metadata>
</record>
<resumptionToken>10.1.1.1.2041-1547151-500-oai_dc</resumptionToken>
</ListRecords>
</OAI-PMH>
I'm using Jaxen because in my use case it's much faster then Apache implementation. I'm using W3C DOM for XML representation.
I need to select all record arguments, and then on selected nodes evaluate other xpaths (it's needed because of my processing architecture).
I'm selecting all record nodes (this works):
/OAI-PMH/ListRecords/record
Then on every selected record node I'm evaluating other xpaths to get needed data:
Select identifier text value (this works):
header/identifier/text()
Select title text value (this does NOT work):
metadata/oai_dc:dc/dc:title/text()
I've registered namespaces prefixes with their URIs (oai_dc and dc). I also tried other xpaths but none of them work:
metadata/dc/title/text()
metadata//dc:title/text()
I've read other stackoverflow questions about xpaths, namespaces and solution to add prefix "oai" with URI "http://www.openarchives.org/OAI/2.0/". I tried adding that "oai:" prefix to nodes without defined prefix but as result I even didn't select record nodes. Any ideas what I'm doing wrong?
Solution:
Problem was about parser (thanks jasso). It wasn't set to be namespace aware - after changing that setting everything works fine, as expected.
I can't see how the XPath expression /OAI-PMH/ListRecords/record can possibly select anything, since your document does not have a {}OAI-PMH element, only a {http://www.openarchives.org/OAI/2.0/}OAI-PMH element. See http://jaxen.codehaus.org/faq.html