Illegal characters in XML are not being replaced - java

SOLUTION: This turned out not to be an XML issue at all. My XML escapes were done properly; the real problem was an encoding issue. I would like to share my solution in the hope that others find it useful.
public static String entityEncode(String text) throws UnsupportedEncodingException {
    String result = text;
    if (result == null) {
        return result;
    }
    byte[] ptext = result.getBytes("ISO-8859-1");
    String value = new String(ptext, "UTF-8");
    String temp = XMLStringUtil.escapeControlChrs(value);
    return temp;
}
EXPLANATION: The XML function above is for XML 1.0. We take the given text and convert it into a byte array using ISO-8859-1, since a String does not keep track of the encoding its characters originally came from. We then create a new String from those bytes, decoded as "UTF-8". This presumably works because the text was really UTF-8 that had been decoded as ISO-8859-1 somewhere upstream. That is also why Java was only reporting that character reference error with &#: it couldn't recognize the character at fault. Now that the text is re-decoded as UTF-8, there are no issues and the XML escape proceeds properly!
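To make the byte round trip above concrete, here is a minimal sketch. It assumes the input really is UTF-8 text that was mis-decoded as ISO-8859-1; the class and method names are made up for illustration, and StandardCharsets is used instead of charset names so no checked exception is needed.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeRepair {
    // Re-decodes text whose UTF-8 bytes were mistakenly read as ISO-8859-1.
    static String repair(String misdecoded) {
        if (misdecoded == null) {
            return null;
        }
        // ISO-8859-1 maps chars 0..255 to single bytes, so this recovers the raw bytes.
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "cafe" with an acute accent on the e
        // Simulate the upstream bug: UTF-8 bytes decoded with the wrong charset.
        String misdecoded = new String(
                original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(repair(misdecoded).equals(original));
    }
}
```

Note that this is only safe when the input is known to be mis-decoded. Applying it to text that was decoded correctly can damage any non-ASCII characters in it.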
EDIT: How do I print out all illegal XML characters in the provided string, according to the StringEscapeUtils.escapeXml parameters? The problem is that I don't want to escape everything, because the result doesn't decode properly afterwards. So right now I just need to find out which characters in the text are invalid: the ones that are causing issues and need to be encoded.
I have the following error message:
ERROR: 'Character reference "&#'
ERROR: 'com.sun.org.apache.xml.internal.utils.WrappedRuntimeException: Character reference "&#'
It does not tell me specifically which character it is, which is a problem.
I do my original XML parse to convert to an XML document, and after that I sanitize further to remove illegal characters:
String xml10pattern = "[^"
+ "\u0009\r\n"
+ "\u0020-\uD7FF"
+ "\uE000-\uFFFD"
+ "\ud800\udc00-\udbff\udfff"
+ "]";
However, it's not removing them, so I'm not sure how to go about this. Currently I have:
String temp = entityEncode(temp);
String legal = temp.replaceAll(xml10pattern , "");
item.setResponseBody(legal);
entityEncode just uses a standard XML utility class to escape characters: XMLStringUtil.escapeControlChrs, which is based on StringEscapeUtils.escapeXml and only adds extra escapes; nothing is removed. But something is being missed.
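For the "which characters are invalid?" part of the question, a minimal sketch is to walk the string by code point and report everything outside the XML 1.0 Char ranges (the same ranges as the regex above). The class and method names here are hypothetical, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

final class XmlCharScanner {
    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns one "index N: U+XXXX" entry per illegal code point.
    static List<String> findIllegal(String text) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i); // an unpaired surrogate comes back as itself
            if (!isLegalXml10(cp)) {
                found.add(String.format("index %d: U+%04X", i, cp));
            }
            i += Character.charCount(cp);
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "ok\u0002text\u000Bmore";
        findIllegal(sample).forEach(System.out::println);
    }
}
```

Running this on the text before it is escaped should show which characters the sanitizer needs to remove or replace.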

Related

Why is java returning encoded values different

I am not quite sure why Java returns %27+ for special characters in the name.
For example, the value I am trying to encode is "Mc' Donald". It is encoded as "Mc%27+Donald" when it should be "Mc%27%20Donald". The reason I do the replace in the first place is that the db has ' instead of ', so I am replacing and then encoding again.
lastName = URLEncoder.encode(lastName.replace("'", "'"), "UTF-8");
In HTML form encoding (application/x-www-form-urlencoded), which is the format URLEncoder produces, + is a valid replacement for SPACE (%20) as well.
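If the receiving side insists on %20, one common workaround is to post-process the URLEncoder output. This is a minimal sketch with made-up class and method names. It relies on the fact that URLEncoder turns a literal + in the input into %2B, so any + left in the output stands for a space.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class SpaceEncoding {
    // Form-encodes the value, then rewrites '+' (encoded space) as %20.
    static String encodeWithPercent20(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(URLEncoder.encode("Mc' Donald", "UTF-8")); // Mc%27+Donald
        System.out.println(encodeWithPercent20("Mc' Donald"));        // Mc%27%20Donald
    }
}
```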

How to solve the IllegalDataException in jdom2 library?

I am using JDOM 2.0.6 and received this IllegalDataException:
Error in setText for tokenization:
It fails when calling the setText() method.
Element text = new Element("Text");
text.setText(doc.getText());
It seems there are some characters in 'text' that it doesn't accept. Two examples:
Originally Posted by Yvette H( 45) Odd socks, yes, no undies yes, no coat yes, no shoes odd. 🏻
ParryOtter said: Posted
Should I specify an encoding somewhere, or is there some other reason?
In fact, you just have to wrap the text that contains illegal characters in CDATA:
Element text = new Element("Text");
text.setContent(new CDATA(doc.getText()));
The reverse operation (reading text escaped with CDATA) is transparent in JDOM2, so you won't have to unescape it.
For my tests, I added an illegal character at the end of my text by creating one from the hex value 0x2, like this:
String text = doc.getText();
int hex = 0x2;
text += (char) hex;

Special characters getting converted into ascii

I need to pass a string as-is, but at the back end the special characters are getting converted. How can I avoid this? Below is the code showing how I am doing it.
My string: "1nIH6iLxXVYBj0J\/JhDlQmSm9aAtjz7ynZpaJ4bxcko="
At the back end: "1nIH6iLxXVYBj0J%5C%2FJhDlQmSm9aAtjz7ynZpaJ4bxcko%3D"
Part of my code:
oauthMessage = computeOAuthRequest("POST", appId, appSecret,URLEncoder.encode(authToken,"UTF-8"), url);
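The conversion seen at the back end matches what URLEncoder.encode itself produces, as the small sketch below shows (class and method names are made up). Whether computeOAuthRequest expects a raw or a pre-encoded token is not stated in the question, so treat this only as a way to confirm where the percent-escapes come from and how to reverse them.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class TokenEncodingCheck {
    static String encode(String token) {
        try {
            return URLEncoder.encode(token, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The Java literal "\\/" is the two characters backslash and slash.
        String token = "1nIH6iLxXVYBj0J\\/JhDlQmSm9aAtjz7ynZpaJ4bxcko=";
        String encoded = encode(token);
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(token));
    }
}
```

If the receiver needs the original value, either pass the token without the URLEncoder.encode call or decode it on the receiving side.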

Base64 String to Windows1251 (cyrillic symbols)

I am having trouble converting an email attachment (a simple text file in windows-1251 encoding with Latin and Cyrillic symbols) to a String. Specifically, I have a problem with converting the Cyrillic.
I get the attachment file as a Base64-encoded String like this:
[Image: Base64-encoded email attachment]
[Image: original file]
When I try to decode it, I get "?" instead of the Cyrillic symbols.
How can I get the right Cyrillic (Russian) symbols instead of "?"?
I've already tried this code with all encodings, but nothing helps to get the correct Russian symbols.
BASE64Decoder dec = new BASE64Decoder();
for (String key : Charset.availableCharsets().keySet()) {
    System.out.println("K=" + key + " Value:" +
            Charset.availableCharsets().get(key));
    try {
        System.out.println(new String(dec.decodeBuffer(encoded), key));
    } catch (Exception e) {
        continue;
    }
}
Thank you in advance.
I am not very familiar with BPEL and the protocols it uses. If you communicate between nodes using some binary protocol, then you must 1) ensure that client and receiver use the same charset and 2) convert the Java string into the proper bytes in that encoding. Java stores strings internally in UTF-16 format. So when you execute String correct = new String(commonName.getBytes("ISO-8859-1"), "ISO-8859-5") you will get the correct string in UTF-16. Then you need to export it to bytes in the requested encoding, e.g. byte[] buff = correct.getBytes("UTF-8"), assuming the encoding you use between nodes is UTF-8. If the encoding happens to be different, then you must make sure it actually supports Cyrillic characters (e.g. ISO-8859-1 does not support them).
If you use XML for data exchange, make sure it declares a suitable encoding in <?xml version="1.0" encoding="UTF-8"?>. Then you don't need to play with bytes; you just need to correctly "import" the string (see the correct variable). Writing to XML converts characters automatically, but the encoding must support the characters you want to write. So if you set encoding="ISO-8859-1", you will get those question marks again.
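For the attachment case specifically, a minimal sketch is to decode the Base64 text to bytes and then decode those bytes with the charset the file was saved in. It assumes the file really is windows-1251 and uses java.util.Base64 (Java 8+) instead of the internal BASE64Decoder; the class and method names are made up.

```java
import java.nio.charset.Charset;
import java.util.Base64;

final class Cp1251Attachment {
    static final Charset WINDOWS_1251 = Charset.forName("windows-1251");

    // Base64 text -> raw bytes -> String, interpreting the bytes as windows-1251.
    static String decode(String base64) {
        // The MIME decoder ignores the line breaks that email bodies usually contain.
        byte[] raw = Base64.getMimeDecoder().decode(base64);
        return new String(raw, WINDOWS_1251);
    }

    public static void main(String[] args) {
        // Russian word for "hello", written with escapes so the source file encoding does not matter.
        String sample = "\u041f\u0440\u0438\u0432\u0435\u0442";
        String encoded = Base64.getEncoder().encodeToString(sample.getBytes(WINDOWS_1251));
        System.out.println(decode(encoded).equals(sample));
    }
}
```

If the decoded String is correct but still prints as "?", the console's own encoding may be the remaining problem rather than the decoding step.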

What is the most efficient way to format UTF-8 strings in java?

I am doing the following:
String url = String.format(WEBSERVICE_WITH_CITYSTATE, cityName, stateName);
String urlUtf8 = new String(url.getBytes(), "UTF8");
Log.d(TAG, "URL: [" + urlUtf8 + "]");
Reader reader = WebService.queryApi(url);
The output that I am looking for is essentially to get the city name with blanks (e.g., "Overland Park") to be formatted as Overland%20Park.
Is this the best way?
Assuming you actually want to encode your string for use in a URL (i.e., "Overland Park" can also be formatted as "Overland+Park"), you want URLEncoder.encode(..., "UTF-8"). Other unsafe characters will be converted to the %xx format you are asking for. Apply it to the individual values (cityName, stateName) rather than to the whole URL, because it would also escape characters such as ':' and '/'.
The simple answer is to use URLEncoder.encode(...) as stated by @Recurse. However, if part or all of the URL has already been encoded, then this can lead to double encoding. For example:
http://foo.com/pages/Hello%20There
or
http://foo.com/query?keyword=what%3f
Another concern with URLEncoder.encode(...) is that it doesn't understand that certain characters should be escaped in some contexts and not others. So for example, a '?' in a query parameter should be escaped, but the '?' that marks the start of the "query part" should not be escaped.
I think that a safer way to add missing escapes would be the following:
String safeURI = new URI(url).toASCIIString();
However, I haven't tested this ...
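As far as I know, the single-argument URI constructor does not add escapes: it rejects a raw space with a URISyntaxException. The multi-argument constructors do quote illegal characters per component, so a sketch along these lines may be closer to the intent. The host foo.com is taken from the examples above; the class and method names are made up.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriQuoting {
    // Builds an http URI on foo.com, letting java.net.URI quote illegal characters.
    static String build(String path, String query) {
        try {
            return new URI("http", "foo.com", path, query, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("/pages/Hello There", null));
        System.out.println(build("/query", "keyword=Overland Park"));
    }
}
```

These constructors always quote '%', so passing an already encoded component such as Hello%20There would still be double-encoded. Components need to be supplied in raw form.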
