Java CDATA extract xml - java

For some reason, someone changed the web service XML response that I need, so the information I have to fetch is now inside a CDATA tag.
The problem is that all "<" and ">" characters have been replaced with "&lt;" and "&gt;".
Here is an example of how it should look:
<MapAAAResult><![CDATA[<map>http://tstgis.xxxxxxx.xxx/gis_n/WebService1/Users/Image/xxxxxbinkor4.png|vialcap:2</map>
<nbr>234</nbr>
<nbrProcess>97 ....
And this is how I am receiving it:
<MapAAAResult>
&lt;map&gt;http://tstgis.xxxxxxx.xxx/gis_n/WebService1/Users/Image/xxxxxxxxbi542m4.png|vialcap:1&lt;/map&gt;
&lt;nbr&gt;234&lt;/nbr&gt;
&lt;nbrProcess&gt;97 .....
How can I get the information back into its original form? More precisely, how can I transform it back into XML?
Any ideas?
Thanks!!

Possibly related to the character escaping issue:
HTML inside XML CDATA being converted with &lt; and &gt; brackets
Characters such as "<" and "&" are illegal in XML element content (">" is usually escaped too), and they can be escaped either with CDATA or with character replacement. It looks like the web service changed its schema somewhere along the way.
I've encountered a similar issue where I had to parse escaped XML. A quick way to get the XML back is to use replaceAll():
String data = "&lt;MapAAAResult&gt;"
+ "&lt;map&gt;http://tstgis.xxxxxxx.xxx/gis_n/WebService1/Users/Image/xxxxxxxxbi542m4.png|vialcap:1&lt;/map&gt;&lt;nbr&gt;234&lt;/nbr&gt;"
+ "&lt;nbrProcess&gt;97";
// Replace &amp; last so that input such as "&amp;lt;" is not unescaped twice.
data = data.replaceAll("&lt;", "<");
data = data.replaceAll("&gt;", ">");
data = data.replaceAll("&amp;", "&");
System.out.println(data);
You will get back:
<MapAAAResult><map>http://tstgis.xxxxxxx.xxx/gis_n/WebService1/Users/Image/xxxxxxxxbi542m4.png|vialcap:1</map><nbr>234</nbr><nbrProcess>97...
It can get more complex with CDATA sections embedded inside the first CDATA section, because the XML parser can be confused by the first closing "]]>", as in:
<xml><![CDATA[ <tag><![CDATA[data]]></tag> ]]></xml>
Thus, escaping the embedded data with &lt;, &gt;, and &amp; is more resilient, though it can introduce extra processing. Also note that some parsers and XML readers recognize the escaped characters and decode them automatically.
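As a rough sketch of the parser-based route, the JDK's DOM parser decodes &lt;, &gt;, and &amp; on its own when it reads element text, so the decoded text can be parsed a second time as XML. The class name, the `innerValue` helper, the `wrapper` element, and the completed sample response below are assumptions made for illustration; the wrapper is needed only because the embedded fragment has several top-level elements.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EscapedXmlDemo {

    // Parses a string into a DOM document; checked exceptions are wrapped for brevity.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Hypothetical helper: reads the escaped payload of the root element,
    // parses it again as XML, and returns the text of the first <tag> element.
    static String innerValue(String response, String tag) {
        // getTextContent() returns the text with entities already decoded.
        String inner = parse(response).getDocumentElement().getTextContent();
        // The fragment has several top-level elements, so give it a single root.
        Document innerDoc = parse("<wrapper>" + inner + "</wrapper>");
        return innerDoc.getElementsByTagName(tag).item(0).getTextContent();
    }

    public static void main(String[] args) {
        // Made-up, complete response in the escaped form described in the question.
        String response = "<MapAAAResult>"
                + "&lt;map&gt;http://example.invalid/image.png|vialcap:1&lt;/map&gt;"
                + "&lt;nbr&gt;234&lt;/nbr&gt;"
                + "</MapAAAResult>";
        System.out.println(innerValue(response, "nbr")); // prints 234
    }
}
```

Compared with chained replaceAll() calls, this avoids having to think about the order in which the entities are decoded.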
Some other related threads:
XSL unescape HTML inside CDATA
When to CDATA vs. Escape & Vice Versa?

Related

How to escape ampersand in the XML present inside the CDATA section

Hi, I tried to find a solution for this but couldn't find anything helpful.
The problem is that the CDATA section contains XML, and I want to escape the special character '&' in it. I'm using XMLBeans and tried XmlOptionCharEscapeMap, but it throws an exception while parsing:
`XmlObject.Factory.parse(XMLString, xmlOptionsObj);`
Here, setSaveSubstituteCharacters on xmlOptionsObj was set with an XmlOptionCharEscapeMap object.
XML example:
<Message xmlns="http://www.com.test/XMLSchema">
<Header></Header>
<Body><![CDATA[<Inner xmlns="http://www.com.test/XMLSchema">
<TagA>...</TagA>
</Inner>]]></Body>
</Message>

Jsoup changes output from single quote to double quote on HTML attributes

We are using Jsoup to parse, manipulate and extend an HTML template. Everything works fine until it comes to single quotes used in combination with HTML attributes:
<span data-attr='JSON'></span>
That HTML snippet is converted to
<span data-attr="JSON"></span>
which conflicts with the inner JSON data, which is valid with double quotes only:
{"param" : "value"} //valid
{'param' : 'value'} //invalid
so we need to force Jsoup to NOT change those single quotes to double quotes, but how? Currently that is our code to parse and produce html content.
pageTemplate = Jsoup.parse(new File(mainTemplateFilePath), "UTF-8");
pageTemplate.outputSettings().escapeMode(Entities.EscapeMode.xhtml);
pageTemplate.outputSettings().charset("UTF-8");
... adding some html
pageTemplate.html(); // will output the double quoted attributes :(
You need to HTML-encode the JSON value before putting it into the data-attr attribute. When you do so, you should end up with this:
<span data-attr="{&quot;param&quot;:&quot;value&quot;}"></span>
Although that looks fairly daunting, it is actually valid HTML. When your corresponding JavaScript executes someSpan.getAttribute("data-attr"), the &quot; entities are automatically decoded to " characters, giving you access to the original valid JSON string.
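A minimal sketch of that encoding step in plain Java, assuming the attribute is always written with double quotes. The class and the `escapeAttribute` helper are hypothetical names, not part of Jsoup; an existing escaping utility would do the same job.

```java
public class AttrEscapeDemo {

    // Hypothetical helper: escapes the characters that matter inside a
    // double-quoted HTML attribute value.
    static String escapeAttribute(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '"': out.append("&quot;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String json = "{\"param\":\"value\"}";
        // prints <span data-attr="{&quot;param&quot;:&quot;value&quot;}"></span>
        System.out.println("<span data-attr=\"" + escapeAttribute(json) + "\"></span>");
    }
}
```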

Remove special characters java

Hi I'm trying to figure out a way to remove the tags from the results returned from the Google Feed API. Their result is
Breaking \u003cb\u003eNews\u003c/b\u003e Updates
How can we remove these characters?
I'm not sure if RegEx would be better (or worse). Does anyone have an idea on how to remove these? Google does not supply an option to remove tags from the results in Java.
I pull those routinely with
String.replaceAll("\\p{Cntrl}","")
You can use the regex below:
String str = "Breaking \u003cb\u003eNews\u003c/b\u003e Updates";
str = str.replaceAll("\\<(.*)?\\>(.*)\\</\\1\\>", "$2");
System.out.println(str);
Output:
Breaking News Updates
\\<(.*)?\\> matches the first opening tag - <b>
\\</\\1\\> matches the corresponding closing tag - </b>
\\1 is a backreference to the tag name, so that only a correctly matched pair of tags is removed.
So for <b>news <update></b>, the <update> tag will not be removed.
The best solution would be to use JSON to convert the data.
JSON.parse(JSON.stringify({a : '<put your string here>'}));
It will be proper as the data you will get from Google API will be in the form of JSON.
This is HTML. \u003cb\u003e translates to <b>.
You'll want to use an HTML parser as HTML is not fully parse-able by a regular expression.
With a library like Jsoup you could do this as follows:
String data = Jsoup.parse(html).body().text();
This will get you "Breaking News Updates".
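If the response reaches your code as raw text rather than through a JSON parser, the backslash-u sequences are still literal characters and have to be decoded before an HTML parser sees any tags. The sketch below shows only that decoding step with the JDK's regex classes; the class and method names are assumptions for illustration, and a real JSON parser would normally do this for you.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeEscapeDemo {

    // Matches a literal backslash, the letter u, and four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Hypothetical helper: turns each escape sequence into the character it encodes.
    static String decodeUnicodeEscapes(String raw) {
        Matcher m = ESCAPE.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' and '\' in the decoded character.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes keep the escapes literal, as in raw JSON text.
        String raw = "Breaking \\u003cb\\u003eNews\\u003c/b\\u003e Updates";
        System.out.println(decodeUnicodeEscapes(raw)); // prints Breaking <b>News</b> Updates
    }
}
```

The decoded string can then be passed to an HTML parser to strip the tags.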

how to place < symbol in xml file which is to be read by the java program?

I am placing an SQL query (which contains the < symbol) inside an XML file, and I am trying to read that query in a Java program.
But it throws the exception
"org.xml.sax.SAXParseException: The content of elements must consist
of well-formed character data or markup."
Can anyone help me fix this issue?
You need to escape using XML entities:
& encode as &amp;
< encode as &lt;
Technically, you don't need to escape the following, but it is common to do so:
> encode as &gt;
" encode as &quot;
' encode as &apos;
For more information, see this Wikipedia article.
Make use of CDATA
CDATA - (Unparsed) Character Data
CDATA stands for Character Data; it means that the data between these tags includes text that could be interpreted as XML markup but should not be.
The term CDATA is used about text data that should not be parsed by the XML parser.
Characters like "<" and "&" are illegal in XML elements.
"<" will generate an error because the parser interprets it as the start of a new element.
"&" will generate an error because the parser interprets it as the start of a character entity.
Some text, like JavaScript code, contains a lot of "<" or "&" characters. To avoid errors script code can be defined as CDATA.
Everything inside a CDATA section is ignored by the parser.
Example:
<![CDATA[ select <abcddata> ]]>
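To illustrate that both approaches end up with the same text, here is a small sketch using the JDK's DOM parser. The class name, the `textOf` helper, the `query` element, and the sample SQL are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class SqlInXmlDemo {

    // Hypothetical helper: returns the text content of the document's root element.
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Same query, once with an entity and once inside a CDATA section.
        String escaped = "<query>select * from t where a &lt; 5</query>";
        String cdata = "<query><![CDATA[select * from t where a < 5]]></query>";

        System.out.println(textOf(escaped)); // prints select * from t where a < 5
        System.out.println(textOf(cdata));   // prints select * from t where a < 5
    }
}
```

An unescaped "<" in the first form would reproduce the SAXParseException from the question.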
&lt; = <
&gt; = >
These are the HTML entities and should be accepted.

Removing nodes with invalid tag names from a xml document

I transform xml with the Saxon XSLT2 processor (using Java + the Saxon S9API) and have to deal with xml-documents as the source, which contain invalid characters as tag names and therefore can't be parsed by the document-builder.
Example:
<A>
<B />
<C>
<D />
</C>
<E!_RANDOM_ />
< />
</A>
Code:
import net.sf.saxon.s9api.*;
[...]
/* XSLT Processor & Compiler */
proc = new Processor(false);
/* build document from input*/
XdmNode source = proc.newDocumentBuilder().build(new StreamSource(input));
Error:
Error on line X column Y
SXXP0003: Error reported by XML parser: Element type
"E" must be followed by either attribute specifications, ">" or "/>".
The exclamation mark and the tag name just being space are currently my only invalid tags.
I am searching for a more robust solution than just removing whole lines of the (formatted) XML.
With some mind-bending I could come up with a regular expression to identify the invalid strings, but would struggle with the removal of the nodes containing attributes and child-nodes.
Thank you for your help!
If the input contains invalid tags then it is not XML. It's best to get your mind-set right by referring to these as non-XML documents rather than XML documents; that helps to make it clear that to process non-XML documents, you need non-XML tools. (Forget about "nodes" - there are no nodes until the document has been parsed, and it can't be parsed until you have turned it into well-formed XML). To turn non-XML into XML, you will typically want to use non-XML tools that are good at text manipulation, such as Perl. Of course, it's much better to fix the problem at source: all the benefits of XML are lost if people generate data in private non-XML formats.
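The answer suggests a text tool such as Perl; the same kind of text-level cleanup can be sketched in Java with a regular expression, applied before the result is handed to the XML parser. This is only an illustration under a strong assumption: the invalid elements are always self-closing and contain no "<" or ">" inside, as in the two examples from the question. The class name, the `stripInvalidEmptyTags` helper, and the pattern are hypothetical and would need rework for invalid elements with child nodes.

```java
import java.util.regex.Pattern;

public class InvalidTagCleanupDemo {

    // Matches a self-closing tag whose name is not a plausible XML name.
    // First lookahead: skip tags that start with a name followed by space, '/' or '>'.
    // Second lookahead: skip closing tags, comments, CDATA and processing instructions.
    private static final Pattern INVALID_EMPTY_TAG =
            Pattern.compile("<(?![A-Za-z_:][\\w.:-]*[\\s/>])(?![/!?])[^<>]*/>");

    // Hypothetical helper: removes the invalid self-closing tags from the raw text.
    static String stripInvalidEmptyTags(String text) {
        return INVALID_EMPTY_TAG.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "<A><B /><C><D /></C><E!_RANDOM_ />< /></A>";
        System.out.println(stripInvalidEmptyTags(input)); // prints <A><B /><C><D /></C></A>
    }
}
```

The cleaned string can then be wrapped in a StreamSource and passed to the document builder as before.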
