Java UTF-8 strange behaviour

Java UTF-8 strange behaviour - java

I am trying to decode some UTF-8 strings in Java.
These strings contain some combining unicode characters, such as CC 88 (combining diaresis).
The character sequence seems ok, according to http://www.fileformat.info/info/unicode/char/0308/index.htm
But the output after conversion to String is invalid.
Any idea ?
byte[] utf8 = { 105, -52, -120 };
System.out.print("{{");
for(int i = 0; i < utf8.length; ++i)
{
int value = utf8[i] & 0xFF;
System.out.print(Integer.toHexString(value));
}
System.out.println("}}");
System.out.println(">" + new String(utf8, "UTF-8"));
Output:
{{69cc88}}
>i?

The console which you're outputting to (e.g. windows) may not support unicode, and may mangle the characters. The console output is not a good representation of the data.
Try writing the output to a file instead, making sure the encoding is correct on the FileWriter, then open the file in a unicode-friendly editor.
Alternatively, use a debugger to make sure the characters are what you expect. Just don't trust the console.

Here how I finally solved the problem, in Eclipse on Windows:
Click Run Configuration.
Click Arguments tab.
Add -Dfile.encoding=UTF-8
Click Common tab.
Set Console Encoding to UTF-8.
Modify the code:
byte[] utf8 = { 105, -52, -120 };
System.out.print("{{");
for(int i = 0; i < utf8.length; ++i)
{
int value = utf8[i] & 0xFF;
System.out.print(Integer.toHexString(value));
}
System.out.println("}}");
PrintStream sysout = new PrintStream(System.out, true, "UTF-8");
sysout.print(">" + new String(utf8, "UTF-8"));
Output:
{{69cc88}}
> ï

The code is fine, but as skaffman said your console probably doesn't support the appropriate character.
To test for sure, you need to print out the unicode values of the character:
public class Test {
public static void main(String[] args) throws Exception {
byte[] utf8 = { 105, -52, -120 };
String text = new String(utf8, "UTF-8");
for (int i=0; i < text.length(); i++) {
System.out.println(Integer.toHexString(text.charAt(i)));
}
}
}
This prints 69, 308 - which is correct (U+0069, U+0308).

Java, not unreasonably, encodes Unicode characters into native system encoded bytes before it writes them to stdout. Some operating systems, like many Linux distros, use UTF-8 as their default character set, which is nice.
Things are a bit different on Windows for a variety of backwards-compatibility reasons. The default system encoding will be one of the "ANSI" codepages and if you open the default command prompt (cmd.exe) it will be one of the old "OEM" DOS codepages (though it is possible to get ANSI and Unicode there with a bit of work).
Since U+0308 isn't in any of the "ANSI" character sets (probably 1252 in your case), it'll get encoded as an error character (usually a question mark).
An alternative to Unicode-enabling everything is to normalize the combining sequence U+0069 U+0308 to the single character U+00EF:
public static void emit(String foo) throws IOException {
System.out.println("Literal: " + foo);
System.out.print("Hex: ");
for (char ch : foo.toCharArray()) {
System.out.print(Integer.toHexString(ch & 0xFFFF) + " ");
}
System.out.println();
}
public static void main(String[] args) throws IOException {
String foo = "\u0069\u0308";
emit(foo);
foo = Normalizer.normalize(foo, Normalizer.Form.NFC);
emit(foo);
}
Under windows-1252, this code will emit:
Literal: i?
Hex: 69 308
Literal: ï
Hex: ef

Related

Which encoding for ProcessBuilder parameters

Using ProcessBuilder, I need to be able to send non-ASCII parameters to another Java program.
In this case, a program Abc needs to send e.g. Arabic characters to Def program through the parameters. I have control of Abc code, but not of Def.
Using the normal way of ProcessBuilder without any playing with the encoding, it was mentioned here, it is not possible. Def recieves question marks "?????".
However, I am able to get some result, but different encodings can be used for different scenarios.
E.g. I am trying all encodings to send to the recipient, and comparing the result of what is expected.
Windows, IntelliJ console:
Default charset: UTF-8
Found charsets: windows-1252, windows-1254 and windows-1258
Windows, command prompt:
Default charset: windows-1252
Found charsets: CESU-8 and UTF-8
Ubuntu, command prompt:
Default charset: ISO-8859-1
Found charsets: ISO-2022-CN, ISO-2022-KR, ISO-8859-1, ISO-8859-15, ISO-8859-9, x-IBM1129, x-ISO-2022-CN-CNS and x-ISO-2022-CN-GB
My question is: how to programmatically know which correct encoding to use, since I need to have something universal?
In other words, what is the relation between the default charset and the found ones?
public class Abc {
private static final Path PATH = Paths.get("."); // With maven: ./target/classes
public static void main(String[] args) throws Exception {
var string = "hello أحمد";
var bytes = string.getBytes();
System.out.println("Original string: " + string);
System.out.println("Default charset: " + Charset.defaultCharset());
for (var c : Charset.availableCharsets().values()) {
var newString = new String(bytes, c);
var process = new ProcessBuilder().command("java", "-cp",
PATH.toAbsolutePath().toString(),
"Def", newString).start();
process.waitFor();
var output = asString(process.getInputStream());
if (output.contains(string)) {
System.out.println("Found " + c + " " + output);
}
}
}
private static String asString(InputStream is) throws IOException {
try (var reader = new BufferedReader(new InputStreamReader(is))) {
var builder = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
if (builder.length() != 0) {
builder.append(System.lineSeparator());
}
builder.append(line);
}
return builder.toString();
}
}
}
public class Def {
public static void main(String[] args) {
System.out.println(args[0]);
}
}

Under the hood, what's actually being passed around is bytes, not chars. Normally, you'd expect the java method that ends up turning characters into bytes to have an overload that lets you specify charset, but, for whatever reason, it does not exist here.
How it should work is thusly:
You pass a string to ProcessBuilder
PB will turn that string into bytes using Charset.defaultCharset() (why? Because PB is all about making the OS do things, and the default charset reflects the OS's preferred charset).
These bytes are then fed to the process.
The process starts up. If it is java, and we're talking the args in psv main(String[] args), the same is done in reverse: Java takes the bytes and turns them back to characters via Charset.defaultCharset(), again.
This does show an immediate issue: If the default charset is not capable of representing a certain character, then in theory you are out of luck.
That would strongly suggest that using java to fire up java.exe should ordinarily mean you can pass whatever you want (unless the characters involved aren't representable in the system's charset).
Your code is odd. In particular, this line is the problem:
var bytes = string.getBytes();
This is short for string.getBytes(Charset.defaultCharset()). So now you have your bytes in the provided charset.
var newString = new String(bytes, c);
and now you're taking those bytes and turning them into a string using a completely different charset. I'm not sure what you're trying to accomplish with this. Pure gobbledygook would come out.
In other words, what is the relation between the default charset and the found ones?
What do you mean by 'found ones'? The string "Found charsets" appears nowhere in your code. If you mean: What Charset.availableCharsets() returns - there is no relationship at all. availableCharsets isn't relevant for ProcessBuilder.

One possibility is to convert your String to Unicode sequences string and then pass it to another process and there convert it back to a regular String. String of Unicode sequences will always contain ASCI characters only. Here is how it may look like:
String encoded = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("hello أحمد"));
The result will be that String encode will hold this value:
"\u0068\u0065\u006c\u006c\u006f\u0020\u0623\u062d\u0645\u062f"
This String you can safely pass to another process. In that other process, you can do the following:
String originalString = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(encodedString);
And the result will be that originalString will now hold this value:
"hello أحمد"
Class StringUnicodeEncoderDecoder could be found in an Open Source library called MgntUtils. You can get this library as Maven Artifact or get it on Github (including source code and JavaDoc). JavaDoc online is available here
This library and this particular feature is used and well tested by multiple users.
Disclamer: This library is written by me

Print unicode character in java

Displaying unicode character in java shows "?" sign. For example, i tried to print "अ". Its unicode Number is U+0905 and html representation is "अ".
The below codes prints "?" instead of unicode character.
char aa = '\u0905';
String myString = aa + " result" ;
System.out.println(myString); // displays "? result"
Is there a way to display unicode character directly from unicode itself without using unicode numbers? i.e "अ" is saved in file now display the file in jsp.

Java defines two types of streams, byte and character.
The main reason why System.out.println() can't show Unicode characters is that System.out.println() is a byte stream that deal with only the low-order eight bits of character which is 16-bits.
In order to deal with Unicode characters(16-bit Unicode character), you have to use character based stream i.e. PrintWriter.
PrintWriter supports the print( ) and println( ) methods. Thus, you can use these methods
in the same way as you used them with System.out.
PrintWriter printWriter = new PrintWriter(System.out,true);
char aa = '\u0905';
printWriter.println("aa = " + aa);

try to use utf8 character set -
Charset utf8 = Charset.forName("UTF-8");
Charset def = Charset.defaultCharset();
String charToPrint = "u0905";
byte[] bytes = charToPrint.getBytes("UTF-8");
String message = new String(bytes , def.name());
PrintStream printStream = new PrintStream(System.out, true, utf8.name());
printStream.println(message); // should print your character

Your myString variable contains the perfectly correct value. The problem must be the output from System.out.println(myString) which has to send some bytes to some output to show the glyphs that you want to see.
System.out is a PrintStream using the "platform default encoding" to convert characters to byte sequences - maybe your platform doesn't support that character. E.g. on my Windows 7 computer in Germany, the default encoding is CP1252, and there's no byte sequence in this encoding that corresponds to your character.
Or maybe the encoding is correct, but simply the font that creates graphical glyphs from characters doesn't have that charater.
If you are sending your output to a Windows CMD.EXE window, then maybe both reasons apply.
But be assured, your string is correct, and if you send it to a destination that can handle it (e.g. a Swing JTextField), it'll show up correctly.

I ran into the same problem wiht Eclipse. I solved my problem by switching the Encoding format for the console from ISO-8859-1 to UTF-8. You can do in the Run/Run Configurations/Common menu.
https://eclipsesource.com/blogs/2013/02/21/pro-tip-unicode-characters-in-the-eclipse-console/

Unicode is a unique code which is used to print any character or symbol.
You can use unicode from --> https://unicode-table.com/en/
Below is an example for printing a symbol in Java.
package Basics;
/**
*
* #author shelc
*/
public class StringUnicode {
public static void main(String[] args) {
String var1 = "Cyntia";
String var2 = new String(" is my daughter!");
System.out.println(var1 + " \u263A" + var2);
//printing heart using unicode
System.out.println("Hello World \u2665");
}
}
******************************************************************
OUTPUT-->
Cyntia ☺ is my daughter!
Hello World ♥

Print all Unicode characters within a specific range

I can't find the right API for this. I tried this;
public static void main(String[] args) {
for (int i = 2309; i < 3000; i++) {
String hex = Integer.toHexString(i);
System.out.println(hex + " = " + (char) i);
}
}
This code only prints like this in Eclipse IDE.
905 = ?
906 = ?
907 = ?
...
How can I make us of these decimal and hex values to get the Unicode characters?

It prints like that because all consoles use a mono spaced font. Try that on a JLabel in a frame and it should display fine.
EDIT:
Try creating a unicode printstream
PrintStream out = new PrintStream(System.out, true, "UTF-8");
And then print to it.
Here's the output in CMD window.

I forgot to save it in UTF-8 format by changing it from
File > Properties > Select the text file encoding
This will properly print the right character from the Eclipse console. The default is cp1252 which will print only ? for those characters it does not understand.

How to replace ï¿½ in a string

I have a string that contains a character ï¿½ I haven't been able to replace it correctly.
String.replace("ï¿½", "");
doesn't work, does anyone know how to remove/replace the ï¿½ in the string?

That's the Unicode Replacement Character, \uFFFD. (info)
Something like this should work:
String strImport = "For some reason my �double quotes� were lost.";
strImport = strImport.replaceAll("\uFFFD", "\"");

Character issues like this are difficult to diagnose because information is easily lost through misinterpretation of characters via application bugs, misconfiguration, cut'n'paste, etc.
As I (and apparently others) see it, you've pasted three characters:
codepoint glyph escaped windows-1252 info
=======================================================================
U+00ef ï \u00ef ef, LATIN_1_SUPPLEMENT, LOWERCASE_LETTER
U+00bf ¿ \u00bf bf, LATIN_1_SUPPLEMENT, OTHER_PUNCTUATION
U+00bd ½ \u00bd bd, LATIN_1_SUPPLEMENT, OTHER_NUMBER
To identify the character, download and run the program from this page. Paste your character into the text field and select the glyph mode; paste the report into your question. It'll help people identify the problematic character.

You are asking to replace the character "�" but for me that is coming through as three characters 'ï', '¿' and '½'. This might be your problem... If you are using Java prior to Java 1.5 then you only get the UCS-2 characters, that is only the first 65K UTF-8 characters. Based on other comments, it is most likely that the character that you are looking for is '�', that is the Unicode replacement character. This is the character that is "used to replace an incoming character whose value is unknown or unrepresentable in Unicode".
Actually, looking at the comment from Kathy, the other issue that you might be having is that javac is not interpreting your .java file as UTF-8, assuming that you are writing it in UTF-8. Try using:
javac -encoding UTF-8 xx.java
Or, modify your source code to do:
String.replaceAll("\uFFFD", "");

As others have said, you posted 3 characters instead of one. I suggest you run this little snippet of code to see what's actually in your string:
public static void dumpString(String text)
{
for (int i=0; i < text.length(); i++)
{
System.out.println("U+" + Integer.toString(text.charAt(i), 16)
+ " " + text.charAt(i));
}
}
If you post the results of that, it'll be easier to work out what's going on. (I haven't bothered padding the string - we can do that by inspection...)

Change the Encoding to UTF-8 while parsing .This will remove the special characters

Use the unicode escape sequence. First you'll have to find the codepoint for the character you seek to replace (let's just say it is ABCD in hex):
str = str.replaceAll("\uABCD", "");

for detail
import java.io.UnsupportedEncodingException;
/**
* File: BOM.java
*
* check if the bom character is present in the given string print the string
* after skipping the utf-8 bom characters print the string as utf-8 string on a
* utf-8 console
*/
public class BOM
{
private final static String BOM_STRING = "ï»¿Hello World";
private final static String ISO_ENCODING = "ISO-8859-1";
private final static String UTF8_ENCODING = "UTF-8";
private final static int UTF8_BOM_LENGTH = 3;
public static void main(String[] args) throws UnsupportedEncodingException {
final byte[] bytes = BOM_STRING.getBytes(ISO_ENCODING);
if (isUTF8(bytes)) {
printSkippedBomString(bytes);
printUTF8String(bytes);
}
}
private static void printSkippedBomString(final byte[] bytes) throws UnsupportedEncodingException {
int length = bytes.length - UTF8_BOM_LENGTH;
byte[] barray = new byte[length];
System.arraycopy(bytes, UTF8_BOM_LENGTH, barray, 0, barray.length);
System.out.println(new String(barray, ISO_ENCODING));
}
private static void printUTF8String(final byte[] bytes) throws UnsupportedEncodingException {
System.out.println(new String(bytes, UTF8_ENCODING));
}
private static boolean isUTF8(byte[] bytes) {
if ((bytes[0] & 0xFF) == 0xEF &&
(bytes[1] & 0xFF) == 0xBB &&
(bytes[2] & 0xFF) == 0xBF) {
return true;
}
return false;
}
}

dissect the URL code and unicode error. this symbol came to me as well on google translate in the armenian text and sometimes the broken burmese.

profilage basï¿½ sur l'analyse de l'esprit (french)
should be translated as:
profilage basé sur l'analyse de l'esprit
so, in this case ï¿½ = é

No above answer resolve my issue. When i download xml it apppends ï»¿<xml to my xml. I simply
xml = parser.getXmlFromUrl(url);
xml = xml.substring(3);// it remove first three character from string,
now it is running accurately.

How to determine if a String contains invalid encoded characters

Usage scenario
We have implemented a webservice that our web frontend developers use (via a php api) internally to display product data. On the website the user enters something (i.e. a query string). Internally the web site makes a call to the service via the api.
Note: We use restlet, not tomcat
Original Problem
Firefox 3.0.10 seems to respect the selected encoding in the browser and encode a url according to the selected encoding. This does result in different query strings for ISO-8859-1 and UTF-8.
Our web site forwards the input from the user and does not convert it (which it should), so it may make a call to the service via the api calling a webservice using a query string that contains german umlauts.
I.e. for a query part looking like
...v=abcädef
if "ISO-8859-1" is selected, the sent query part looks like
...v=abc%E4def
but if "UTF-8" is selected, the sent query part looks like
...v=abc%C3%A4def
Desired Solution
As we control the service, because we've implemented it, we want to check on server side wether the call contains non utf-8 characters, if so, respond with an 4xx http status
Current Solution In Detail
Check for each character ( == string.substring(i,i+1) )
if character.getBytes()[0] equals 63 for '?'
if Character.getType(character.charAt(0)) returns OTHER_SYMBOL
Code
protected List< String > getNonUnicodeCharacters( String s ) {
final List< String > result = new ArrayList< String >();
for ( int i = 0 , n = s.length() ; i < n ; i++ ) {
final String character = s.substring( i , i + 1 );
final boolean isOtherSymbol =
( int ) Character.OTHER_SYMBOL
== Character.getType( character.charAt( 0 ) );
final boolean isNonUnicode = isOtherSymbol
&& character.getBytes()[ 0 ] == ( byte ) 63;
if ( isNonUnicode )
result.add( character );
}
return result;
}
Question
Will this catch all invalid (non utf encoded) characters?
Does any of you have a better (easier) solution?
Note: I checked URLDecoder with the following code
final String[] test = new String[]{
"v=abc%E4def",
"v=abc%C3%A4def"
};
for ( int i = 0 , n = test.length ; i < n ; i++ ) {
System.out.println( java.net.URLDecoder.decode(test[i],"UTF-8") );
System.out.println( java.net.URLDecoder.decode(test[i],"ISO-8859-1") );
}
This prints:
v=abc?def
v=abcädef
v=abcädef
v=abcÃ¤def
and it does not throw an IllegalArgumentException sigh

I asked the same question,
Handling Character Encoding in URI on Tomcat
I recently found a solution and it works pretty well for me. You might want give it a try. Here is what you need to do,
Leave your URI encoding as Latin-1. On Tomcat, add URIEncoding="ISO-8859-1" to the Connector in server.xml.
If you have to manually URL decode, use Latin1 as charset also.
Use the fixEncoding() function to fix up encodings.
For example, to get a parameter from query string,
String name = fixEncoding(request.getParameter("name"));
You can do this always. String with correct encoding is not changed.
The code is attached. Good luck!
public static String fixEncoding(String latin1) {
try {
byte[] bytes = latin1.getBytes("ISO-8859-1");
if (!validUTF8(bytes))
return latin1;
return new String(bytes, "UTF-8");
} catch (UnsupportedEncodingException e) {
// Impossible, throw unchecked
throw new IllegalStateException("No Latin1 or UTF-8: " + e.getMessage());
}
}
public static boolean validUTF8(byte[] input) {
int i = 0;
// Check for BOM
if (input.length >= 3 && (input[0] & 0xFF) == 0xEF
&& (input[1] & 0xFF) == 0xBB & (input[2] & 0xFF) == 0xBF) {
i = 3;
}
int end;
for (int j = input.length; i < j; ++i) {
int octet = input[i];
if ((octet & 0x80) == 0) {
continue; // ASCII
}
// Check for UTF-8 leading byte
if ((octet & 0xE0) == 0xC0) {
end = i + 1;
} else if ((octet & 0xF0) == 0xE0) {
end = i + 2;
} else if ((octet & 0xF8) == 0xF0) {
end = i + 3;
} else {
// Java only supports BMP so 3 is max
return false;
}
while (i < end) {
i++;
octet = input[i];
if ((octet & 0xC0) != 0x80) {
// Not a valid trailing byte
return false;
}
}
}
return true;
}
EDIT: Your approach doesn't work for various reasons. When there are encoding errors, you can't count on what you are getting from Tomcat. Sometimes you get � or ?. Other times, you wouldn't get anything, getParameter() returns null. Say you can check for "?", what happens your query string contains valid "?" ?
Besides, you shouldn't reject any request. This is not your user's fault. As I mentioned in my original question, browser may encode URL in either UTF-8 or Latin-1. User has no control. You need to accept both. Changing your servlet to Latin-1 will preserve all the characters, even if they are wrong, to give us a chance to fix it up or to throw it away.
The solution I posted here is not perfect but it's the best one we found so far.

You can use a CharsetDecoder configured to throw an exception if invalid chars are found:
CharsetDecoder UTF8Decoder =
Charset.forName("UTF8").newDecoder().onMalformedInput(CodingErrorAction.REPORT);
See CodingErrorAction.REPORT

This is what I used to check the encoding:
CharsetDecoder ebcdicDecoder = Charset.forName("IBM1047").newDecoder();
ebcdicDecoder.onMalformedInput(CodingErrorAction.REPORT);
ebcdicDecoder.onUnmappableCharacter(CodingErrorAction.REPORT);
CharBuffer out = CharBuffer.wrap(new char[3200]);
CoderResult result = ebcdicDecoder.decode(ByteBuffer.wrap(bytes), out, true);
if (result.isError() || result.isOverflow() ||
result.isUnderflow() || result.isMalformed() ||
result.isUnmappable())
{
System.out.println("Cannot decode EBCDIC");
}
else
{
CoderResult result = ebcdicDecoder.flush(out);
if (result.isOverflow())
System.out.println("Cannot decode EBCDIC");
if (result.isUnderflow())
System.out.println("Ebcdic decoded succefully ");
}
Edit: updated with Vouze suggestion

Replace all control chars into empty string
value = value.replaceAll("\\p{Cntrl}", "");

URLDecoder will decode to a given encoding. This should flag errors appropriately. However the documentation states:
There are two possible ways in which this decoder could deal with illegal strings. It could either leave illegal characters alone or it could throw an IllegalArgumentException. Which approach the decoder takes is left to the implementation.
So you should probably try it. Note also (from the decode() method documentation):
The World Wide Web Consortium Recommendation states that UTF-8 should be used. Not doing so may introduce incompatibilites
so there's something else to think about!
EDIT: Apache Commons URLDecode claims to throw appropriate exceptions for bad encodings.

I've been working on a similar "guess the encoding" problem. The best solution involves knowing the encoding. Barring that, you can make educated guesses to distinguish between UTF-8 and ISO-8859-1.
To answer the general question of how to detect if a string is properly encoded UTF-8, you can verify the following things:
No byte is 0x00, 0xC0, 0xC1, or in the range 0xF5-0xFF.
Tail bytes (0x80-0xBF) are always preceded by a head byte 0xC2-0xF4 or another tail byte.
Head bytes should correctly predict the number of tail bytes (e.g., any byte in 0xC2-0xDF should be followed by exactly one byte in the range 0x80-0xBF).
If a string passes all those tests, then it's interpretable as valid UTF-8. That doesn't guarantee that it is UTF-8, but it's a good predictor.
Legal input in ISO-8859-1 will likely have no control characters (0x00-0x1F and 0x80-0x9F) other than line separators. Looks like 0x7F isn't defined in ISO-8859-1 either.
(I'm basing this off of Wikipedia pages for UTF-8 and ISO-8859-1.)

You might want to include a known parameter in your requests, e.g. "...&encTest=ä€", to safely differentiate between the different encodings.

You need to setup the character encoding from the start. Try sending the proper Content-Type header, for example Content-Type: text/html; charset=utf-8 to fix the right encoding. The standard conformance refers to utf-8 and utf-16 as the proper encoding for Web Services. Examine your response headers.
Also, at the server side — in the case which the browser do not handles properly the encoding sent by the server — force the encoding by allocating a new String. Also you can check each byte in the encoded utf-8 string by doing a single each_byte & 0x80, verifying the result as non zero.
boolean utfEncoded = true;
byte[] strBytes = queryString.getBytes();
for (int i = 0; i < strBytes.length(); i++) {
if ((strBytes[i] & 0x80) != 0) {
continue;
} else {
/* treat the string as non utf encoded */
utfEncoded = false;
break;
}
}
String realQueryString = utfEncoded ?
queryString : new String(queryString.getBytes(), "iso-8859-1");
Also, take a look on this article, I hope it would help you.

the following regular expression might be of interest for you:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/185624
I use it in ruby as following:
module Encoding
UTF8RGX = /\A(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x unless defined? UTF8RGX
def self.utf8_file?(fileName)
count = 0
File.open("#{fileName}").each do |l|
count += 1
unless utf8_string?(l)
puts count.to_s + ": " + l
end
end
return true
end
def self.utf8_string?(a_string)
UTF8RGX === a_string
end
end

Try to use UTF-8 as a default as always in anywhere you can touch. (Database, memory, and UI)
One and single charset encoding could reduce a lot of problems, and actually it can speed up your web server performance. There are so many processing power and memory wasted to encoding/decoding.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Java UTF-8 strange behaviour - java

Related

Which encoding for ProcessBuilder parameters

Print unicode character in java

Print all Unicode characters within a specific range

How to replace ï¿½ in a string

How to determine if a String contains invalid encoded characters

Categories

Resources