How to replace invalid characters in XML string?

I have a string that was encoded in UTF-16. When parsing it with javax.xml.parsers.DocumentBuilder, I got an error like this:
Character reference "&#x0" is an invalid XML character
Here is the code I used to parse the XML:
InputSource inputSource = new InputSource();
inputSource.setCharacterStream(new StringReader(xmlString));
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = factory.newDocumentBuilder();
org.w3c.dom.Document document = parser.parse(inputSource);
My question is: how can I replace the invalid characters with a space?

You just need to use String.replaceAll and pass the pattern of invalid characters.
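A minimal sketch of that idea, assuming the goal is to swap every character that XML 1.0 forbids for a space. Note that the reported error is about a character reference (&#x0), which a character class for literal characters alone would not catch, so the sketch handles both cases. The class and method names (XmlSanitizer, sanitize, isXmlChar) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlSanitizer {
    // Matches anything outside the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final Pattern INVALID_CHAR = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // Matches numeric character references such as &#0; or &#x1F;
    private static final Pattern CHAR_REF = Pattern.compile("&#(x[0-9A-Fa-f]+|[0-9]+);");

    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String sanitize(String xml) {
        // Step 1: replace literal invalid characters with a space.
        String withoutLiterals = INVALID_CHAR.matcher(xml).replaceAll(" ");

        // Step 2: replace character references that point at invalid characters.
        Matcher m = CHAR_REF.matcher(withoutLiterals);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            int cp;
            try {
                cp = body.startsWith("x")
                        ? Integer.parseInt(body.substring(1), 16)
                        : Integer.parseInt(body);
            } catch (NumberFormatException e) {
                cp = -1; // too large to be a code point, so treat it as invalid
            }
            String replacement = isXmlChar(cp) ? m.group() : " ";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With this in place, the parsing code from the question could be fed new StringReader(XmlSanitizer.sanitize(xmlString)) instead of the raw string.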

You are trying to parse an invalid XML entity, and that is what raises the exception. It seems you do not need to worry about UTF-16 in your situation.
Find some explanation and example here.
As an example, a bare & character cannot appear in valid XML; we need to write &amp; instead. Here &amp; is the XML entity.
The example above should be enough to show what an XML entity is.
As I understand it, some XML entities are not valid. But no worries: it is possible to declare and add new XML entities. Take a look at the above article for more detail.
EDIT: Assuming there is an & character making the XML invalid.

StringEscapeUtils()
escapeXml
public static void escapeXml(java.io.Writer writer,
java.lang.String str)
throws java.io.IOException
Escapes the characters in a String using XML entities.
For example: "bread" & "butter" => &quot;bread&quot; &amp; &quot;butter&quot;.
Supports only the five basic XML entities (gt, lt, quot, amp, apos).
Does not support DTDs or external entities.
Note that unicode characters greater than 0x7f are currently escaped to their
numerical \\u equivalent. This may change in future releases.
Parameters:
writer - the writer receiving the unescaped string, not null
str - the String to escape, may be null
Throws:
java.lang.IllegalArgumentException - if the writer is null
java.io.IOException - if there is a problem writing
See Also:
unescapeXml(java.lang.String)
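For readers without Commons Lang on the classpath, the five-entity escaping described above can be sketched with the JDK alone. This is a hand-written stand-in, not the library's implementation: the class name XmlEscaper is hypothetical, it returns a String instead of writing to a Writer, and it leaves characters above 0x7f untouched. Like the library method, it only escapes markup characters; it does not remove characters such as NUL that XML forbids outright.

```java
final class XmlEscaper {
    // Escapes the five predefined XML entities: amp, lt, gt, quot, apos.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }
}
```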

Related

Java XML Document converting &quot; to " (literal quote) upon parsing/converting to Document

I have a problem where I need to send a request to a SOAP web service that requires the root tag to contain XML data. This is the XML that I'm trying to send:
<root>&lt;test key=&quot;Applicants&quot;&gt;this is a data&lt;/test&gt;</root>
I need to append this to the SoapBody object as a document with this code:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
factory.setExpandEntityReferences(false);
DocumentBuilder builder = factory.newDocumentBuilder();
Document result = builder.parse(new ByteArrayInputStream(request.getRequest().getBytes()));
Then adding it to the SoapBody to be sent to the webservice.
However, upon sending this request and tracing the logs, it is actually reverting the &quot; entity to literal quotes (").
This is the XML being sent:
<root>&lt;test key="Applicants"&gt;this is a data&lt;/test&gt;</root>
As you can see, the &quot; is being transformed to literal quotes. How can I keep the original data within the root tag (which has the &quot;)? It seems to be transformed when I convert it to a Document object.
Would appreciate any help. Thanks.
Edit:
The web service actually requires this format (according to their documentation and sample XML requests). If this isn't possible, is it a limitation? Should I use another framework?

The &quot; and " are completely equivalent in this context. You haven't actually said whether this is causing a problem: if it is, then it's because some recipient of the XML isn't processing it correctly. Incidentally, it would also be legitimate to convert the &gt; to >.
When you parse XML and re-serialise it, irrelevant details like redundant whitespace get lost, just as the line-wrapping and font size get lost if you copy this text into your text editor.
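The equivalence can be checked with a small sketch: parse the same element once with &quot; and once with literal quotes in its text content, then compare what the DOM holds. The class name QuoteEquivalence and the helper textOf are hypothetical, and the sample strings only mirror the shape of the question's data.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class QuoteEquivalence {
    // Parses the XML and returns the text content of its root element.
    static String textOf(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        String escaped = "<root>&lt;test key=&quot;Applicants&quot;&gt;</root>";
        String literal = "<root>&lt;test key=\"Applicants\"&gt;</root>";
        // Both documents carry the same text once parsed.
        System.out.println(textOf(escaped).equals(textOf(literal)));
    }
}
```

Since the parsed text is identical, a conforming recipient has no way to tell which spelling the sender used.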

How to parse json string with UTF-8 characters using java?

I have a JSON string that contains a SUBSTITUTE (U+001A) control character. I get a parsing exception when I try to convert the JSON string to a Java object using Jackson. Can you please let me know how to encode and decode UTF-8 characters?
ObjectMapper mapper = new ObjectMapper();
mapper.readValue(jsonString, MY_DOMAIN_OBJECT.class);
jsonString:
{"studentId":"753253-2274", "information":[{"key":"1","value":"Get alerts on your phone(SUBSTITUTE character is present here. Unable to paste it)To subscribe"}]}
Error:
Illegal unquoted character ((CTRL-CHAR, code 26)): has to be escaped using backslash to be included in string value

Can you try this?
ObjectMapper mapper = new ObjectMapper();
mapper.configure(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS, true);
mapper.readValue(jsonString, MY_DOMAIN_OBJECT.class);
I hope it helps you:
Javadoc
Feature that determines whether parser will allow JSON Strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed characters) or not. If feature is set false, an exception is thrown if such a character is encountered.
Since JSON specification requires quoting for all control characters, this is a non-standard feature, and as such disabled by default.
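If changing the parser configuration is not an option, a different technique is to pre-process the text so that the control characters become standard JSON escapes before Jackson sees them. The sketch below is JDK-only and makes a loud assumption: such control characters occur only inside string values, as in the question. Tab, line feed and carriage return are left alone because they are legal whitespace between JSON tokens, so raw ones inside string values would still be rejected by a strict parser. The class and method names are hypothetical.

```java
final class JsonControlChars {
    // Rewrites raw control characters (other than tab, LF, CR) as JSON escapes.
    // Assumption: these characters only appear inside JSON string values.
    static String escapeControlChars(String json) {
        StringBuilder sb = new StringBuilder(json.length());
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            boolean jsonWhitespace = c == '\t' || c == '\n' || c == '\r';
            if (c < 0x20 && !jsonWhitespace) {
                // Emit a six-character JSON escape: backslash, 'u', four hex digits.
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

The result could then be passed to mapper.readValue in place of the original jsonString, and the decoded Java string would still contain the SUBSTITUTE character.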

Transform encoded UTF-8 characters to special accented characters in android

In Java, I have a string obtained from an API, which looks like:
Hola, &#233;sto es una frase con acentos.
And I want to have:
Hola, ésto es una frase con acentos.
Not only for this example, I need it for all UTF-8 encoded characters.
I've been looking for this for an hour but I haven't found a solution.

This isn't encoding, it's an HTML numeric character reference.
The easiest way to deal with it is to add the Apache Commons Lang library to your project and call StringEscapeUtils.unescapeHtml4.
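If adding a dependency is not possible, numeric character references alone (for instance &#233; or &#xE9; for é) can be decoded with a short JDK-only sketch. This is not what unescapeHtml4 does internally, and unlike the library it ignores named entities such as &eacute;. The class name NumericReferenceDecoder is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumericReferenceDecoder {
    // Group 1 captures hexadecimal digits, group 2 captures decimal digits.
    private static final Pattern REF =
            Pattern.compile("&#(?:[xX]([0-9A-Fa-f]{1,6})|([0-9]{1,7}));");

    static String decode(String text) {
        Matcher m = REF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            // Leave the reference as written if it is not a valid code point.
            String replacement = Character.isValidCodePoint(cp)
                    ? new String(Character.toChars(cp))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```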

Well, if your text is encoded with SGML entities, a possible approach is to use an XML parser to decode it (though it might not be so smart):
import java.io.StringReader;
import org.xml.sax.InputSource;

public static String decodeSgml(String src)
        throws org.xml.sax.SAXException,
               javax.xml.parsers.ParserConfigurationException,
               java.io.IOException
{
    // Wrap the fragment in a dummy root element so that it parses as a document.
    InputSource inputSource = new InputSource(new StringReader("<x>" + src + "</x>"));
    javax.xml.parsers.DocumentBuilderFactory factory = javax.xml.parsers.DocumentBuilderFactory.newInstance();
    javax.xml.parsers.DocumentBuilder docBuilder = factory.newDocumentBuilder();
    org.w3c.dom.Document doc = docBuilder.parse(inputSource);
    // The parser resolves the character references while building the text node.
    return doc.getDocumentElement().getTextContent();
}
(If the number of exceptions thrown by the method looks excessive, you could maybe re-throw some of them as ServiceConfigurationErrors, or store some of the variables as static members.)

Decoding a base64 XML cuts off the last part

I have a base64-encoded string, which represents an XML Schema (xsd). I decode this using Apache's Base64 utilities, put the resulting byte array into an InputSource and let an XMLSchemaCollection read this InputSource:
String base64String = ......
byte[] decoded = Base64.decodeBase64(base64String);
InputSource inputSource = new InputSource(new ByteArrayInputStream(decoded));
xmlSchemaCollection.read(inputSource, new ValidationEventHandler());
This gives an error:
XML document structure must start and end within the same entity
Which usually means the XML structure isn't valid. I performed two tests to see what the base64 actually holds. First is printing it out to the console:
System.out.println(new String(decoded,"UTF-8"));
In Eclipse, I see my XML is suddenly cut off, as if part of it is missing. However, if I use any online website, such as https://www.base64decode.org/, and I copy/paste my base64, I see the complete XML. If I validate this XML, the validation succeeds. So I'm a bit confused as to why Eclipse seemingly cuts off my XML after decoding.

Errors like this are usually indicative of a badly formatted document:
XML document structures must start and end within the same entity...
A few things you can do to debug this:
1. Print out the XML document to a log and run it through some sort of XML validator.
2. Check to make sure that there are no invalid characters (e.g. UTF-16 characters in a UTF-8 document).
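The two checks above can be combined in a rough sketch: decode the base64, then scan the text for the first character that XML 1.0 does not allow. The sketch assumes the schema is UTF-8 and uses java.util.Base64 from the JDK in place of Apache's Base64 utilities from the question. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class DecodedXmlCheck {
    // Decodes base64 (tolerating line breaks) and interprets the bytes as UTF-8.
    static String decode(String base64) {
        byte[] bytes = Base64.getMimeDecoder().decode(base64);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Returns the index of the first character not allowed in XML 1.0, or -1.
    static int firstInvalidXmlChar(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            boolean allowed = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (!allowed) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

If firstInvalidXmlChar returns a non-negative index, logging the text around that position shows where the document stops being well-formed.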

How can I convert a string to a Document (DOM) with charset ISO-8859-1

I'm converting a string received in a web service to a Document (DOM) xml, like this:
Document file= null;
String xmlFile = "blablabla"; // Latin-1 encoding
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
this.file = builder.parse(new InputSource(new StringReader(xmlFile)));
But the string is encoded with ISO-8859-1 (Latin-1), and when I read a node of this Document, I get some errors. How can I correctly create a DOM object with ISO-8859-1 encoding? Or how can I read a node with Latin-1 encoding into a string?

Try this:
this.file = builder.parse(new ByteArrayInputStream(xmlFile.getBytes("ISO-8859-1")));

Foreword
Strings have no encoding, as they represent a sequence of characters (which are abstract entities defined in the Unicode standard).
Byte sequences have an encoding and may be interpreted as a sequence of characters (provided that you tell Java how to interpret them).
Your problem
In your problem, your data is stored in a String. Hence it has already been interpreted as a sequence of characters. Apparently the interpretation was incorrect.
Depending on your problem and the way you know the encoding of your data, there are 2 options:
Solution 1 (may be the best):
DO NOT INTERPRET the data you receive; keep it as a byte sequence (Stream or byte[] or ByteArray). Then pass this byte sequence directly to your DOM parser (it will correctly decode the XML file, whatever its encoding, provided that the markup is correct).
Solution 2 (may be the only one possible, depending on the way you get the data):
Re-encode the String as a byte array, as mentioned in @ThOrndike's answer:
this.file = builder.parse(new ByteArrayInputStream(xmlFile.getBytes("ISO-8859-1")));
This will only work if you are sure the String has been correctly interpreted in the first place.
Apparently that is not the case here: it seems that the library that gives you the String has already interpreted it as a UTF-8 byte sequence (replacing all erroneous bytes with '?', which is the behavior of UTF-8 readers). In that case, you cannot do anything, as the original bytes have been lost.
Your only hope is solution 1, or finding a way to force the library that gives you the String to interpret it correctly.
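Solution 1 can be sketched as follows, assuming the raw bytes are available and the document starts with an XML declaration that names its encoding (without one, the parser assumes UTF-8 or UTF-16). The class name RawBytesParsing is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

final class RawBytesParsing {
    // Hands the undecoded bytes to the parser, which reads the encoding
    // from the XML declaration and decodes the content itself.
    static Document parse(byte[] xmlBytes) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML bytes", e);
        }
    }

    public static void main(String[] args) {
        // Simulates bytes received from the web service in Latin-1.
        byte[] received = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a>caf\u00e9</a>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(parse(received).getDocumentElement().getTextContent());
    }
}
```

The point of the design is that no String is created before parsing, so no charset guess can corrupt the data.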
