How to determine if a String contains invalid encoded characters


Usage scenario
We have implemented a web service that our web frontend developers use internally (via a PHP API) to display product data. On the website the user enters something (e.g. a query string). Internally the website calls the service via the API.
Note: We use Restlet, not Tomcat.
Original Problem
Firefox 3.0.10 seems to respect the encoding selected in the browser and encodes a URL accordingly. This results in different query strings for ISO-8859-1 and UTF-8.
Our website forwards the user's input without converting it (which it should do), so it may call the service via the API with a query string that contains German umlauts.
E.g. for a query part looking like
...v=abcädef
if "ISO-8859-1" is selected, the sent query part looks like
...v=abc%E4def
but if "UTF-8" is selected, the sent query part looks like
...v=abc%C3%A4def
Desired Solution
As we control the service (we implemented it), we want to check on the server side whether the call contains non-UTF-8 characters and, if so, respond with a 4xx HTTP status.
Current Solution In Detail
For each character ( == string.substring(i,i+1) ), check
whether character.getBytes()[0] equals 63 (the code for '?'), and
whether Character.getType(character.charAt(0)) returns OTHER_SYMBOL.
Code
protected List< String > getNonUnicodeCharacters( String s ) {
    final List< String > result = new ArrayList< String >();
    for ( int i = 0 , n = s.length() ; i < n ; i++ ) {
        final String character = s.substring( i , i + 1 );
        final boolean isOtherSymbol =
            ( int ) Character.OTHER_SYMBOL
                == Character.getType( character.charAt( 0 ) );
        final boolean isNonUnicode = isOtherSymbol
            && character.getBytes()[ 0 ] == ( byte ) 63;
        if ( isNonUnicode )
            result.add( character );
    }
    return result;
}
Question
Will this catch all invalid (non utf encoded) characters?
Does anyone have a better (easier) solution?
Note: I checked URLDecoder with the following code
final String[] test = new String[]{
    "v=abc%E4def",
    "v=abc%C3%A4def"
};
for ( int i = 0 , n = test.length ; i < n ; i++ ) {
    System.out.println( java.net.URLDecoder.decode(test[i],"UTF-8") );
    System.out.println( java.net.URLDecoder.decode(test[i],"ISO-8859-1") );
}
This prints:
v=abc?def
v=abcädef
v=abcädef
v=abcÃ¤def
and it does not throw an IllegalArgumentException (sigh).

I asked the same question,
Handling Character Encoding in URI on Tomcat
I recently found a solution and it works pretty well for me. You might want to give it a try. Here is what you need to do:
Leave your URI encoding as Latin-1. On Tomcat, add URIEncoding="ISO-8859-1" to the Connector in server.xml.
If you have to URL-decode manually, use Latin-1 as the charset as well.
Use the fixEncoding() function to fix up encodings.
For example, to get a parameter from the query string:
String name = fixEncoding(request.getParameter("name"));
You can always do this. A string with the correct encoding is not changed.
The code is attached. Good luck!
public static String fixEncoding(String latin1) {
    if (latin1 == null)
        return null; // getParameter() may return null
    try {
        byte[] bytes = latin1.getBytes("ISO-8859-1");
        if (!validUTF8(bytes))
            return latin1;
        return new String(bytes, "UTF-8");
    } catch (UnsupportedEncodingException e) {
        // Impossible, throw unchecked
        throw new IllegalStateException("No Latin1 or UTF-8: " + e.getMessage());
    }
}
public static boolean validUTF8(byte[] input) {
    int i = 0;
    // Skip a leading BOM, if present
    if (input.length >= 3 && (input[0] & 0xFF) == 0xEF
            && (input[1] & 0xFF) == 0xBB && (input[2] & 0xFF) == 0xBF) {
        i = 3;
    }
    int end;
    for (int j = input.length; i < j; ++i) {
        int octet = input[i];
        if ((octet & 0x80) == 0) {
            continue; // ASCII
        }
        // Check for UTF-8 leading byte
        if ((octet & 0xE0) == 0xC0) {
            end = i + 1;
        } else if ((octet & 0xF0) == 0xE0) {
            end = i + 2;
        } else if ((octet & 0xF8) == 0xF0) {
            end = i + 3;
        } else {
            // Not a valid leading byte (stray trailing byte or 0xF8-0xFF)
            return false;
        }
        if (end >= j) {
            // Sequence is truncated: not enough trailing bytes left
            return false;
        }
        while (i < end) {
            i++;
            octet = input[i];
            if ((octet & 0xC0) != 0x80) {
                // Not a valid trailing byte
                return false;
            }
        }
    }
    return true;
}
EDIT: Your approach doesn't work, for various reasons. When there are encoding errors, you can't count on what you get from Tomcat. Sometimes you get � or ?. Other times you get nothing at all: getParameter() returns null. And even if you could check for "?", what happens when your query string contains a valid "?"?
Besides, you shouldn't reject any request. This is not your user's fault. As I mentioned in my original question, the browser may encode the URL in either UTF-8 or Latin-1. The user has no control over this. You need to accept both. Changing your servlet to Latin-1 will preserve all the characters, even if they are wrong, giving you a chance to fix them up or throw them away.
The solution I posted here is not perfect, but it's the best one we have found so far.

You can use a CharsetDecoder configured to throw an exception if invalid chars are found:
CharsetDecoder UTF8Decoder =
Charset.forName("UTF8").newDecoder().onMalformedInput(CodingErrorAction.REPORT);
See CodingErrorAction.REPORT
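As a minimal sketch of how such a decoder might be used, assuming the raw bytes of the value are available: the class name StrictUtf8 and the method isValidUtf8 are made up for this illustration, everything else is from java.nio.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class StrictUtf8 {

    /** Returns true if the given bytes are well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // With REPORT, decode() throws on the first bad byte sequence
            // instead of silently substituting U+FFFD.
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = {'a', (byte) 0xC3, (byte) 0xA4};  // "aä" encoded as UTF-8
        byte[] latin1 = {'a', (byte) 0xE4, 'd'};         // "aäd" encoded as ISO-8859-1
        System.out.println(isValidUtf8(utf8));
        System.out.println(isValidUtf8(latin1));
    }
}
```

A new decoder is created per call because CharsetDecoder instances are not safe for concurrent use.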

This is what I used to check the encoding:
CharsetDecoder ebcdicDecoder = Charset.forName("IBM1047").newDecoder();
ebcdicDecoder.onMalformedInput(CodingErrorAction.REPORT);
ebcdicDecoder.onUnmappableCharacter(CodingErrorAction.REPORT);
CharBuffer out = CharBuffer.wrap(new char[3200]);
CoderResult result = ebcdicDecoder.decode(ByteBuffer.wrap(bytes), out, true);
// isError() covers malformed and unmappable input; underflow here means
// that all input was consumed, so it is not an error
if (result.isError() || result.isOverflow())
{
    System.out.println("Cannot decode EBCDIC");
}
else
{
    result = ebcdicDecoder.flush(out);
    if (result.isOverflow())
        System.out.println("Cannot decode EBCDIC");
    if (result.isUnderflow())
        System.out.println("EBCDIC decoded successfully");
}
Edit: updated with Vouze suggestion

Replace all control characters with an empty string:
value = value.replaceAll("\\p{Cntrl}", "");

URLDecoder will decode to a given encoding. This should flag errors appropriately. However the documentation states:
There are two possible ways in which this decoder could deal with illegal strings. It could either leave illegal characters alone or it could throw an IllegalArgumentException. Which approach the decoder takes is left to the implementation.
So you should probably try it. Note also (from the decode() method documentation):
The World Wide Web Consortium Recommendation states that UTF-8 should be used. Not doing so may introduce incompatibilities.
so there's something else to think about!
EDIT: Apache Commons URLDecode claims to throw appropriate exceptions for bad encodings.
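Since URLDecoder replaces bad sequences rather than reporting them, one alternative is to percent-decode to raw bytes by hand and then decode those bytes strictly. The following is only a sketch under that assumption; StrictQueryDecoder and decodeStrict are hypothetical names, and treating '+' as a space assumes form-style query encoding.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class StrictQueryDecoder {

    /**
     * Percent-decodes a raw query value into bytes and decodes them as strict UTF-8.
     * Throws IllegalArgumentException for broken escapes or non-UTF-8 byte sequences.
     */
    static String decodeStrict(String raw) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '%') {
                if (i + 2 >= raw.length()) {
                    throw new IllegalArgumentException("Incomplete escape at index " + i);
                }
                int hi = Character.digit(raw.charAt(i + 1), 16);
                int lo = Character.digit(raw.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Invalid escape at index " + i);
                }
                bytes.write((hi << 4) | lo);
                i += 2; // skip the two hex digits
            } else if (c == '+') {
                bytes.write(' '); // assumption: form-style encoding
            } else if (c < 0x80) {
                bytes.write(c);
            } else {
                throw new IllegalArgumentException("Unescaped non-ASCII character at index " + i);
            }
        }
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes.toByteArray()))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Value is not valid UTF-8", e);
        }
    }
}
```

With this sketch, decodeStrict("abc%C3%A4def") yields "abcädef", while decodeStrict("abc%E4def") throws, which a service could map to a 4xx response.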

I've been working on a similar "guess the encoding" problem. The best solution involves knowing the encoding. Barring that, you can make educated guesses to distinguish between UTF-8 and ISO-8859-1.
To answer the general question of how to detect if a string is properly encoded UTF-8, you can verify the following things:
No byte is 0x00, 0xC0, 0xC1, or in the range 0xF5-0xFF.
Tail bytes (0x80-0xBF) are always preceded by a head byte 0xC2-0xF4 or another tail byte.
Head bytes should correctly predict the number of tail bytes (e.g., any byte in 0xC2-0xDF should be followed by exactly one byte in the range 0x80-0xBF).
If a string passes all those tests, then it's interpretable as valid UTF-8. That doesn't guarantee that it is UTF-8, but it's a good predictor.
Legal input in ISO-8859-1 will likely have no control characters (0x00-0x1F and 0x80-0x9F) other than line separators. Looks like 0x7F isn't defined in ISO-8859-1 either.
(I'm basing this off of Wikipedia pages for UTF-8 and ISO-8859-1.)

You might want to include a known parameter in your requests, e.g. "...&encTest=ä€", to safely differentiate between the different encodings.
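A rough sketch of how such a probe parameter could be checked, assuming the server can see the raw (still percent-encoded) value of encTest. The class and method names are invented for this example; the constant is simply "ä€" percent-encoded as UTF-8.

```java
// Hypothetical helper class, not part of any library.
final class EncodingProbe {

    // "ä€" as UTF-8 bytes: ä = C3 A4, € = E2 82 AC
    static final String UTF8_MARKER = "%C3%A4%E2%82%AC";

    /** Returns true if the raw, undecoded probe value was sent as UTF-8. */
    static boolean looksLikeUtf8(String rawProbeValue) {
        // Hex digits in percent-escapes are case-insensitive.
        return UTF8_MARKER.equalsIgnoreCase(rawProbeValue);
    }
}
```

Any other value (for example %E4 followed by whatever the browser sends for the unmappable €) would indicate that the request was not encoded as UTF-8.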

You need to set up the character encoding from the start. Try sending the proper Content-Type header, for example Content-Type: text/html; charset=utf-8, to fix the encoding. The standards refer to UTF-8 and UTF-16 as the proper encodings for web services. Examine your response headers.
Also, on the server side, in case the browser does not properly handle the encoding sent by the server, force the encoding by allocating a new String. You can also check each byte of the UTF-8 encoded string with byte & 0x80 and verify that the result is non-zero.
boolean utfEncoded = true;
byte[] strBytes = queryString.getBytes();
for (int i = 0; i < strBytes.length; i++) {
    if ((strBytes[i] & 0x80) != 0) {
        continue;
    } else {
        /* treat the string as non utf encoded */
        utfEncoded = false;
        break;
    }
}
String realQueryString = utfEncoded ?
    queryString : new String(queryString.getBytes(), "iso-8859-1");
Also, take a look at this article; I hope it helps.

The following regular expression might be of interest to you:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/185624
I use it in Ruby as follows:
module Encoding
UTF8RGX = /\A(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x unless defined? UTF8RGX
def self.utf8_file?(fileName)
count = 0
File.open("#{fileName}").each do |l|
count += 1
unless utf8_string?(l)
puts count.to_s + ": " + l
end
end
return true
end
def self.utf8_string?(a_string)
UTF8RGX === a_string
end
end

Try to use UTF-8 as the default everywhere you can (database, memory, and UI).
Using a single charset encoding avoids a lot of problems, and it can actually improve your web server's performance: a lot of processing power and memory is wasted on encoding and decoding.

Related

URLDecode.decode method not working as expected in Java

I was trying to decode URL encoded post body and came across this problem.
I was using this method to decode (it decodes multiple encoded urls too) :
public static String decodeUrl(String url)
{
try {
String prevURL="";
String decodeURL=url;
while(!prevURL.equals(decodeURL))
{
prevURL=decodeURL;
decodeURL= URLDecoder.decode( decodeURL, "UTF-8" );
}
return decodeURL;
} catch (UnsupportedEncodingException e) {
return "Issue while decoding" +e.getMessage();
}
}
When the input URL was "a%20%2B%20b%20%3D%3D%2013%25!", control somehow never got past the line starting with decodeURL = when debugging. No exception was raised either.
What might be causing the issue? Please use a debugger to reproduce the problem.

Just tested it on Java 8u151. This throws an IllegalArgumentException on the second spin of the loop: "URLDecoder: Incomplete trailing escape (%) pattern". That's because after the first decoding you have "a + b == 13%!", and during the second decoding that % is supposed to introduce an encoding sequence but it does not. I think that's the expected behaviour, even if the standard library of other languages does not agree. Python 3.6 for example:
>>> from urllib.parse import unquote
>>> result = unquote('a%20%2B%20b%20%3D%3D%2013%25!')
>>> result
'a + b == 13%!'
>>> unquote(result)
'a + b == 13%!'
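If repeated decoding is really needed in Java, one option is to stop as soon as the decoder rejects the input. This is only a sketch; RepeatedUrlDecoder and decodeRepeatedly are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper class, not part of any library.
final class RepeatedUrlDecoder {

    /** Decodes until the value stops changing or a literal '%' makes decoding fail. */
    static String decodeRepeatedly(String input) {
        String current = input;
        while (true) {
            String next;
            try {
                next = URLDecoder.decode(current, "UTF-8");
            } catch (IllegalArgumentException e) {
                // A '%' that does not start a valid escape: treat the
                // current value as fully decoded and stop.
                return current;
            } catch (UnsupportedEncodingException e) {
                throw new IllegalStateException(e); // UTF-8 is always supported
            }
            if (next.equals(current)) {
                return current;
            }
            current = next;
        }
    }
}
```

For the input from the question, this returns "a + b == 13%!". Note that blind repeated decoding remains lossy in general, because every pass also turns a literal '+' into a space.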

reading UTF-16 produces unexpected results

I use the beaglebuddy Java library in an Android project for reading/writing ID3 tags of mp3 files. I'm having an issue with reading the text that was previously written using the same library and could not find anything related in their docs.
Assume I write the following info:
MP3 mp3 = new MP3(pathToFile);
mp3.setLeadPerformer("Jon Skeet");
mp3.setTitle("A Million Rep");
mp3.save();
Looking at the source code of the library, I see that UTF-16 encoding is explicitly set, internally it calls
protected ID3v23Frame setV23Text(String text, FrameType frameType) {
return this.setV23Text(Encoding.UTF_16, text, frameType);
}
and
protected ID3v23Frame setV23Text(Encoding encoding, String text, FrameType frameType) {
ID3v23FrameBodyTextInformation frameBody = null;
ID3v23Frame frame = this.getV23Frame(frameType);
if(frame == null) {
frame = this.addV23Frame(frameType);
}
frameBody = (ID3v23FrameBodyTextInformation)frame.getBody();
frameBody.setEncoding(encoding);
frameBody.setText(encoding == Encoding.UTF_16?Utility.getUTF16String(text):text);
return frame;
}
At a later point, I read the data and it gives me some weird Chinese characters:
mp3.getLeadPerformer(); // 䄀 䴀椀氀氀椀漀渀 刀攀瀀
mp3.getTitle(); // 䨀漀渀 匀欀攀攀琀
I took a look at the built-in Utility.getUTF16String(String) method:
public static String getUTF16String(String string) {
String text = string;
byte[] bytes = string.getBytes(Encoding.UTF_16.getCharacterSet());
if(bytes.length < 2 || bytes[0] != -2 || bytes[1] != -1) {
byte[] bytez = new byte[bytes.length + 2];
bytes[0] = -2;
bytes[1] = -1;
System.arraycopy(bytes, 0, bytez, 2, bytes.length);
text = new String(bytez, Encoding.UTF_16.getCharacterSet());
}
return text;
}
I'm not quite getting the point of setting the first 2 bytes to -2 and -1 respectively, is this a pattern stating that the string is UTF-16 encoded?
However, I tried to explicitly call this method when reading the data, that seems to be readable, but always prepends some cryptic characters at the start:
Utility.getUTF16String(mp3.getLeadPerformer()); // ��Jon Skeet
Utility.getUTF16String(mp3.getTitle()); // ��A Million Rep
Since the count of those characters seems to be constant, I created a temporary workaround by simply cutting them off.
Fields like "comments" where the author does not explicitly enforce UTF-16 when writing are read without any issues.
I'm really curious about what's going on here and appreciate any suggestions.
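For what it's worth, the bytes -2 and -1 are 0xFE and 0xFF as signed bytes, which is the UTF-16 big-endian byte order mark (BOM). The garbled output is also what one gets when big-endian UTF-16 bytes are read as little-endian. The sketch below only demonstrates that effect with plain JDK calls; it is a guess at the cause, not a confirmed diagnosis of the library, and the class and method names are made up.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the beaglebuddy library.
final class Utf16ByteOrderDemo {

    /** Encodes text as big-endian UTF-16 and decodes it with the wrong byte order. */
    static String misread(String text) {
        byte[] bigEndian = text.getBytes(StandardCharsets.UTF_16BE);
        return new String(bigEndian, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // Java's "UTF-16" charset writes a big-endian BOM (FE FF) when encoding.
        byte[] withBom = "Jon".getBytes(StandardCharsets.UTF_16);
        System.out.printf("%02X %02X%n", withBom[0] & 0xFF, withBom[1] & 0xFF);

        // 'J' is U+004A; with swapped bytes it becomes U+4A00, a CJK ideograph.
        System.out.println(misread("Jon Skeet"));
    }
}
```

If this is indeed the cause, reading the stored bytes with the byte order indicated by the BOM, rather than cutting characters off afterwards, would be the cleaner fix.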

java writeInt from php,

Hey all, I am trying to build a data output stream in PHP that writes primitive data types back to a Java application.
I created a class that writes the data to an array (in the same way Java does it, copied from the Java code), and finally I write the array back to the client.
It does not seem to work well: for example, the writeInt method sends wrong values to the Java client.
Am I doing this right? Thank you.
Here is my code:
private $buf = array();
public function writeByte($b) {
$this->buf[] = pack('c' ,$b);
}
public function writeInt($v) {
$this->writeByte($this->shiftRight3($v , 24) & 0xFF);
$this->writeByte($this->shiftRight3($v , 16) & 0xFF);
$this->writeByte($this->shiftRight3($v , 8) & 0xFF);
$this->writeByte($this->shiftRight3($v , 0) & 0xFF);
}
private function shiftRight3($a ,$b){
if(is_numeric($a) && $a < 0){
return ($a >> $b) + (2<<~$b);
}else{
return ($a >> $b);
}
}
public function toByteArray(){
return $this->buf;
}
This is how I set up the main PHP file:
header("Content-type: application/octet-stream" ,true);
header("Content-Transfer-Encoding: binary" ,true);
This is how I return the data:
$arrResult = $dataOutputStream->toByteArray();
for ($i = 0 ; $i < count($arrResult) ; $i ++){
echo $arrResult[$i];
}
I have edited the question according to the changes in my code.
On the Java client side, there always seem to be 2 extra bytes to read at the start: 13 and 10, which is \r\n.
Why do I always read them?
(In my test I send one byte to the Java client:
URL u = new URL("http://localhost/jtpc/test/inputTest.php");
URLConnection c = u.openConnection();
InputStream in = c.getInputStream();
int read = 0;
for (int j = 0; read != -1 ; j++) {
read = in.read();
System.out.println("More to read : " + read);
}
)
The output is:
More to read : 13
More to read : 10
More to read : 1 (this is the byte i am sending)

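PHP has a pack() function for turning data into binary form; unpack() reverses the operation. Java's DataInputStream.readInt() expects 4 bytes in big-endian order, which is pack format 'N' ('I' uses the machine's byte order and integer size):
$binaryInt = pack('N', $v);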

The one thing that strikes me as odd is that you are setting the content type to application/zip, but you don't seem to be creating a ZIP encoded output stream. Is this an oversight ... or does PHP perform the encoding for you without you asking?
EDIT
According to RFC 2046, the recommended content type for a binary data format whose content type is not standardized is "application/octet-stream". There is also a practice of defining custom content subtypes with a name starting with "x-" (for experimental), but RFC 2046 says that this practice is now strongly discouraged.

You don't need that shiftRight3() method, just use >>, as you are masking the result, and then turning it into a chr(). Throw it away.
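To illustrate what the Java client expects from the PHP side, here is a small sketch that reads an int from four bytes with DataInputStream. The class and method names are made up for the example; the byte arrays stand in for the HTTP response body.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of any library.
final class BigEndianIntDemo {

    /** Reads one int the way DataInputStream does: 4 bytes, most significant first. */
    static int readInt(byte[] fourBytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(fourBytes))) {
            return in.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // e.g. fewer than 4 bytes available
        }
    }

    public static void main(String[] args) {
        System.out.println(readInt(new byte[] {0, 0, 1, 0}));     // 256
        System.out.println(readInt(new byte[] {-1, -1, -1, -1})); // -1
    }
}
```

Any stray bytes before the payload, such as the \r\n seen in the question, shift every subsequent read, so the PHP script must not emit anything (for example whitespace outside the PHP tags) before the binary data.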

Unable to decode hex values in javascript tooltip

I have quite the process that we go through in order to display some e-mail communications in our application. Trying to keep it as general as possible...
-We make a request to a service via XML
-Get the XML reply string, send the string to a method to encode any invalid characters as follows:
public static String convertUTF8(String value) {
char[] chars = value.toCharArray();
StringBuffer retVal = new StringBuffer(chars.length);
for (int i = 0; i < chars.length; i++) {
char c = chars[i];
int chVal = (int)c;
if (chVal > Byte.MAX_VALUE) {
retVal.append("&#x").append(Integer.toHexString(chVal)).append(";");
} else {
retVal.append(c);
}
}
return retVal.toString();
}
We then send that result of a string to another method to remove any other invalid characters:
public static String removeInvalidCharacters(String inString)
{
if (inString == null){
return null;
}
StringBuffer newString = new StringBuffer();
char ch;
char c[] = inString.toCharArray();
for (int i = 0; i < c.length; i++)
{
ch = c[i];
// remove any characters outside the valid UTF-8 range as well as all control characters
// except tabs and new lines
if ((ch < 0x00FD && ch > 0x001F) || ch == '\t' || ch == '\n' || ch == '\r')
{
newString.append(ch);
}
}
return newString.toString();
}
This string is then "unmarshal'ed" via the SaxParser
The object is then sent back to our Display action which generated the response to the calling jsp/javascript to create the page.
The issue is some text can contain characters which can't be processed correctly. The following is eventually rendered on the JSP just fine:
<PrvwCommTxt>This is a new test. Have a*&#xc7;&#xb4;)&#xa1;.&#xf1;&#xc7;&#xa1;.&#xf1;*&#xc7;&#xb4;)...</PrvwCommTxt>
Which shows up as "This is a new test. Have a*Ç´)¡.ñÇ¡." in the browser.
-The following shows up in a tooltip while hovering over the above text:
<CommDetails>This is a new test. Have a*Ç´)¡.ñÇ¡.ñ*Ç´)¡.ñ*´)(¡.ñÇ(¡.ñÇ* Wonderful Day!</CommDetails>
This then shows up incorrectly when rendered in the tooltip javascript with all the HEX values and not being rendered correctly.
Any suggestions on how to make the unknown characters show correctly in javascript?

Get the XML reply string, send the string to a method to encode any invalid characters as follows:
You should be using Apache Commons Lang StringEscapeUtils#escapeXml() for this.
// remove any characters outside the valid UTF-8 range
This makes no sense. There's nothing outside UTF-8 range. The problem lies somewhere else. Get rid of this method.
The issue is some text can contain characters which can't be processed correctly. The following is eventually rendered on the JSP just fine:
You need to set the response encoding to UTF-8 and instruct the webbrowser to use UTF-8. This can be done by putting the following line in top of JSP:
<%@ page pageEncoding="UTF-8" %>
See also:
Unicode - How to get characters right?

How to replace ï¿½ in a string

I have a string that contains a character ï¿½ I haven't been able to replace it correctly.
String.replace("ï¿½", "");
doesn't work, does anyone know how to remove/replace the ï¿½ in the string?

That's the Unicode Replacement Character, \uFFFD. (info)
Something like this should work:
String strImport = "For some reason my �double quotes� were lost.";
strImport = strImport.replaceAll("\uFFFD", "\"");

Character issues like this are difficult to diagnose because information is easily lost through misinterpretation of characters via application bugs, misconfiguration, cut'n'paste, etc.
As I (and apparently others) see it, you've pasted three characters:
codepoint   glyph   escaped   windows-1252   info
=======================================================================
U+00ef      ï       \u00ef    ef             LATIN_1_SUPPLEMENT, LOWERCASE_LETTER
U+00bf      ¿       \u00bf    bf             LATIN_1_SUPPLEMENT, OTHER_PUNCTUATION
U+00bd      ½       \u00bd    bd             LATIN_1_SUPPLEMENT, OTHER_NUMBER
To identify the character, download and run the program from this page. Paste your character into the text field and select the glyph mode; paste the report into your question. It'll help people identify the problematic character.

You are asking to replace the character "�", but for me that is coming through as three characters: 'ï', '¿' and '½'. This might be your problem. If you are using a Java version prior to 1.5, you only get the UCS-2 characters, that is, only the first 65K Unicode characters (the Basic Multilingual Plane). Based on other comments, it is most likely that the character you are looking for is '�', the Unicode replacement character. This is the character that is "used to replace an incoming character whose value is unknown or unrepresentable in Unicode".
Actually, looking at the comment from Kathy, the other issue that you might be having is that javac is not interpreting your .java file as UTF-8, assuming that you are writing it in UTF-8. Try using:
javac -encoding UTF-8 xx.java
Or, modify your source code to do:
String.replaceAll("\uFFFD", "");

As others have said, you posted 3 characters instead of one. I suggest you run this little snippet of code to see what's actually in your string:
public static void dumpString(String text)
{
for (int i=0; i < text.length(); i++)
{
System.out.println("U+" + Integer.toString(text.charAt(i), 16)
+ " " + text.charAt(i));
}
}
If you post the results of that, it'll be easier to work out what's going on. (I haven't bothered padding the string - we can do that by inspection...)

Change the encoding to UTF-8 while parsing. This will remove the special characters.

Use the unicode escape sequence. First you'll have to find the codepoint for the character you seek to replace (let's just say it is ABCD in hex):
str = str.replaceAll("\uABCD", "");

for detail
import java.io.UnsupportedEncodingException;
/**
* File: BOM.java
*
* check if the bom character is present in the given string print the string
* after skipping the utf-8 bom characters print the string as utf-8 string on a
* utf-8 console
*/
public class BOM
{
private final static String BOM_STRING = "ï»¿Hello World";
private final static String ISO_ENCODING = "ISO-8859-1";
private final static String UTF8_ENCODING = "UTF-8";
private final static int UTF8_BOM_LENGTH = 3;
public static void main(String[] args) throws UnsupportedEncodingException {
final byte[] bytes = BOM_STRING.getBytes(ISO_ENCODING);
if (isUTF8(bytes)) {
printSkippedBomString(bytes);
printUTF8String(bytes);
}
}
private static void printSkippedBomString(final byte[] bytes) throws UnsupportedEncodingException {
int length = bytes.length - UTF8_BOM_LENGTH;
byte[] barray = new byte[length];
System.arraycopy(bytes, UTF8_BOM_LENGTH, barray, 0, barray.length);
System.out.println(new String(barray, ISO_ENCODING));
}
private static void printUTF8String(final byte[] bytes) throws UnsupportedEncodingException {
System.out.println(new String(bytes, UTF8_ENCODING));
}
private static boolean isUTF8(byte[] bytes) {
    // True only if the input starts with the UTF-8 BOM (EF BB BF)
    if (bytes.length >= UTF8_BOM_LENGTH &&
            (bytes[0] & 0xFF) == 0xEF &&
            (bytes[1] & 0xFF) == 0xBB &&
            (bytes[2] & 0xFF) == 0xBF) {
        return true;
    }
    return false;
}
}

Dissect the URL encoding and the Unicode error. This symbol has also shown up for me in Google Translate, in Armenian text and sometimes in broken Burmese.

profilage basï¿½ sur l'analyse de l'esprit (french)
should be translated as:
profilage basé sur l'analyse de l'esprit
so, in this case ï¿½ = é

None of the answers above resolved my issue. When I download the XML, ï»¿ is prepended to <xml. I simply did
xml = parser.getXmlFromUrl(url);
xml = xml.substring(3); // removes the first three characters from the string
and now it runs correctly.
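A slightly more defensive variant only strips the prefix when it is actually there. This is a sketch with made-up names (BomStripper, stripBom); it handles both a correctly decoded BOM (the single character U+FEFF) and the three-character form that appears when the UTF-8 BOM bytes were decoded as Latin-1.

```java
// Hypothetical helper class, not part of any library.
final class BomStripper {

    private static final String BOM = "\uFEFF";                       // BOM decoded correctly
    private static final String BOM_AS_LATIN1 = "\u00EF\u00BB\u00BF"; // EF BB BF read as Latin-1

    /** Removes a leading BOM in either form; returns other strings unchanged. */
    static String stripBom(String s) {
        if (s.startsWith(BOM)) {
            return s.substring(BOM.length());
        }
        if (s.startsWith(BOM_AS_LATIN1)) {
            return s.substring(BOM_AS_LATIN1.length());
        }
        return s;
    }
}
```

Unlike an unconditional substring(3), this leaves documents without a BOM untouched.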
