UTF-8 issue in linux - java

String departmentName = request.getParameter("dept_name");
departmentName = new String(departmentName.getBytes(Charset.forName("UTF8")),"UTF8");
System.out.println(departmentName);//O/p: composés
On Windows, the output is what I expect, and the query that matches records by department name returns the right record.
On Linux, however, it prints "compos??s", so my DB query fails.
Can anyone suggest a solution?

Maybe because UTF8 is not the canonical charset name. Use UTF-8 (or StandardCharsets.UTF_8) instead. See the Charset Javadoc.

First of all, printing Unicode text with System.out.println is not a good indicator, because the result depends on the console encoding. Write the value to a file through an OutputStreamWriter with the encoding explicitly set to UTF-8; then you can tell whether the request parameter is encoded correctly or not.
Second, there may be a database connection encoding issue. For MySQL you need to specify the encoding explicitly in the connection string. For other databases, the default system encoding may be used when none is specified.
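A minimal sketch of the file check described above, assuming a temporary file is acceptable as the dump target. The class name Utf8FileDump and the host and database in JDBC_URL are hypothetical placeholders; useUnicode and characterEncoding are MySQL Connector/J connection properties.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8FileDump {
    // Hypothetical connection string; host and database name are placeholders.
    static final String JDBC_URL =
        "jdbc:mysql://localhost/mydb?useUnicode=true&characterEncoding=UTF-8";

    /** Writes the text as UTF-8, regardless of the platform default encoding. */
    static void dump(String text, Path file) throws IOException {
        try (Writer out = new OutputStreamWriter(
                Files.newOutputStream(file), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("dept", ".txt");
        // Stand-in for request.getParameter("dept_name")
        dump("compos\u00e9s", file);
        System.out.println("Wrote " + Files.size(file) + " bytes to " + file);
    }
}
```

If the parameter was decoded correctly, the é shows up in a hex dump of the file as the two bytes C3 A9. Any other byte sequence there suggests the value was already damaged before it reached this code.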

First of all, try to figure out the encoding you have in every particular place.
For example, the string might already have the proper encoding if your Linux system is running with a UTF-8 charset; that hack may only have been needed on Windows.
Last but not least, how do you know the value is incorrect, and that it is not your viewer that is wrong? What character set does your console or log viewer use?
You need to perform a number of checks to find out where exactly the encoding is different from what is expected at that point of your application.
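One way to run such checks without trusting the console is to log only ASCII: the code points of the string and its UTF-8 bytes in hex. This is a rough sketch; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class EncodingProbe {
    /** Returns the code points of s as space-separated "U+XXXX" tokens. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    /** Returns the UTF-8 encoding of s as space-separated hex bytes. */
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String value = "compos\u00e9s"; // stand-in for the request parameter
        System.out.println(codePoints(value));
        System.out.println(utf8Hex(value));
    }
}
```

A correctly decoded é appears as U+00E9. If the log shows U+00C3 U+00A9 instead, the UTF-8 bytes were decoded as ISO-8859-1 somewhere upstream; if it shows U+003F, the character was already replaced by a question mark before this point.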

Related

Outputting International Characters from MySQL to Java/Android

Let's say someone uses this letter: ë. They type it into an EditText box and it is stored correctly in the MySQL database (via a PHP script). But grabbing that database field with the special character produces an output of "null" in Java/Android.
It appears my database is set up and storing correctly, but retrieving is the issue. Do I have to fix this on the PHP side or handle it in Java/Android? EDIT: I don't believe this has anything to do with the PHP side anymore, so I am more interested in the Java side.

Sounds similar to: android, UTF8 - How do I ensure UTF8 is used for a shared preference
I suspect that the problem occurs over the web interface between the web service and the Android App. One side is sending UTF-16 or ISO 8859-1 characters, and the other is interpreting it as UTF-8 (or vice versa). Make sure:
That the web request from Android is using UTF-8
That the web service replies using UTF-8.
As in the other answer, use a HTTP debugging proxy to check that the characters being sent between the Android App and the web service are what you expect.

I suggest extracting your database access code into a standard Java environment, then compiling and testing it there. This will help you isolate the problem.
Usually you won't get null even if there is an encoding problem. Check for other problems, and whether another exception is being thrown.
It is definitely not a PHP problem if you are sure the string is inserted correctly.

Probably a confusion between UTF-8 and UTF-16, or any other character set that you might be using to store these international characters. In UTF-16 (big-endian byte order), the character ë is stored as two bytes, the first of which is the null byte (0x00). If this byte pair is incorrectly transmitted back as, say, UTF-8, the null byte will be seen as the end-of-string terminator, resulting in a null value instead of the international character.
First, you need to be 100% sure that your international characters are stored correctly in the database. Seeing the correct result in a PHP page on a web site is no guarantee of that, as two wrongs can make a right. In the past, I have often seen incorrectly stored characters in a database that were still displayed correctly on a web page or some other system. This will look OK until you need to access your database from another system, and at that point everything breaks loose because you cannot repeat the same kind of errors on the second system.
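The byte-level difference is easy to check in plain Java. This is only an illustration of the two encodings of ë; the class name is made up, and whether this mix-up is what actually happens in the app above is still a guess.

```java
import java.nio.charset.StandardCharsets;

public class EDiaeresisBytes {
    /** Formats a byte array as space-separated hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String e = "\u00eb"; // ë
        byte[] utf16 = e.getBytes(StandardCharsets.UTF_16BE);
        byte[] utf8 = e.getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-16BE: " + hex(utf16)); // prints "UTF-16BE: 00 EB"
        System.out.println("UTF-8:    " + hex(utf8));  // prints "UTF-8:    C3 AB"

        // Reading the UTF-16BE bytes as UTF-8 yields a NUL character first.
        String misread = new String(utf16, StandardCharsets.UTF_8);
        System.out.println("first char is NUL: " + (misread.charAt(0) == '\u0000'));
    }
}
```

Code that treats NUL as a string terminator (C-based layers, for instance) would stop at that first character, which matches the empty or null result described above.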

utf-8 invisible characters

I have a website, and need to store data from a text field into a mysql database.
The frontend is perl. I used utf8::encode to encode the data into utf8.
The request is made to the Java backend which connects to the mysql db and inserts this text.
For the table the default charset is set to utf8.
This works in many cases, but it fails in some cases.
If I use テスト, the data stored in the database shows question marks: ã??ã?¹ã??.
If I try to insert the utf8 encoded string directly from the sql browser, everything works fine.
Update events set summary = ãã¹ã where event_id = 11657;
While inserting I noticed there are some blank characters that show up in the mysql query browser, something like: ã ã¹ ã.
After inserting from here, data in the database shows some boxes in the database instead of these spaces, and テスト displays correctly on the website after utf8 decoding it.
The problem is only when I insert directly from the website, these special characters come up as question marks in the database.
Can someone please help me with these special characters? Do I need to handle them differently?

We faced a similar issue in one of our projects, so we had to write a small routine that converts those UTF-8 characters into HTML entities before storing them in the database.

Use StringEscapeUtils from Apache Commons Lang:
import static org.apache.commons.lang.StringEscapeUtils.escapeHtml;
// ...
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = escapeHtml(source);

If the database really stored テスト, that's what you should see in the sql browser instead of mojibake.
It sounds like the Java backend is interpreting what Perl sends as ISO-8859-1 rather than UTF-8. This explains how テ gets converted into \u00E3\u0083\u0086. Then the backend tries to send the data to the database in Windows-1252, the MySQL default encoding. Unfortunately Windows-1252 cannot represent the Unicode characters in the range \u0080-\u009F, so the last two characters are replaced by question marks.
So you have two problems:
You should make the Java backend read the request in UTF-8 rather than in ISO-8859-1.
The backend should use UTF-8 when talking with the database. The easiest way to do this is adding characterEncoding=utf8 to the connection parameters.
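The first step of that diagnosis can be reproduced in isolation. This sketch takes the UTF-8 bytes of テ, decodes them as ISO-8859-1 the way the backend is suspected of doing, and then shows that the damage is reversible as long as no question marks have been substituted yet. The class and method names are illustrative only.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    /** Simulates a receiver that wrongly decodes UTF-8 bytes as ISO-8859-1. */
    static String misreadAsLatin1(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Reverses the mistake: recover the raw bytes, then decode them as UTF-8. */
    static String repair(String misread) {
        byte[] raw = misread.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u30C6"; // テ, UTF-8 bytes E3 83 86
        String misread = misreadAsLatin1(original);
        misread.chars().forEach(c -> System.out.printf("U+%04X ", c));
        System.out.println(); // the line above prints U+00E3 U+0083 U+0086
        System.out.println(repair(misread).equals(original)); // prints true
    }
}
```

The repair step is only a diagnostic aid. The actual fix is still the two points above: decode the request as UTF-8 and use UTF-8 on the database connection.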

I'm assuming that you are sending POST parameters.
I think that the most likely cause of your initial problem is one of the following:
If the parameters are being sent in the HTTP request body, your Perl front-end is probably not setting the encoding in the content type header of the request, and the web server is probably assuming ISO-8859-1. The solution to this is to set the request content type properly.
If the parameters are sent in the HTTP request URL, your web server is using the wrong character set when decoding the request parameters. The solution to this is going to be web-server specific ...
It sounds like there might also be a character set problem in talking to the database, but that might just be a consequence of earlier mangling.

Encoding problems exporting file

I'm trying to find out what has happened in an integration project. We just can't get the encoding right at the end.
A Lithuanian file was imported to the AS/400, where text is stored in EBCDIC. The data is then exported to an ANSI file and read as windows-1257. ASCII characters work fine and so do some Lithuanian ones, but the rest come out garbled, with characters like ~, ¶ and ].
Example string going through the pipe:
Start file
Tuskulënö
as400
Tuskulënö
EAA9A9596
34224335A
exported file (after conversion to windows-1257)
Tuskulėnö
expected result for exported file
Tuskulėnų
Any ideas?
Regards,
Karl

EBCDIC isn't a single encoding, it's a family of encodings (in this case called codepages), similar to how ISO-8859-* is a family of encodings: the encodings within the families share about half the codes for "basic" letters (roughly what is present in ASCII) and differ on the other half.
So if you say that it's stored in EBCDIC, you need to tell us which codepage is used.
A similar problem exists with ANSI: when used for an encoding it refers to a Windows default encoding. Unfortunately the default encoding of a Windows installation can vary based on the locale configured.
So again: you need to find out which actual encoding is used here (these are usually from the Windows-* family; the "normal" English one is Windows-1252).
Once you actually know what encoding you have and want at each point, you can move on to the second step: fixing it.
My personal preference for this kind of problem is this: have only one step where encodings are converted. Take whatever the initial tool produces and convert it to UTF-8 in the first step. From then on, always use UTF-8 to handle that data. If necessary, convert UTF-8 to some other encoding in the last step (but avoid this if possible).
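As a sketch of that single conversion step in Java: decode the incoming bytes with the charset they are actually in, and keep only UTF-8 from then on. The helper name toUtf8 is made up. The charset name "windows-1257" below is an assumption based on the question; it is not one of the charsets every JVM is required to support, so the sketch checks for it first.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ConvertOnce {
    /** Decodes input using the given source charset and re-encodes it as UTF-8. */
    static byte[] toUtf8(byte[] input, Charset source) {
        return new String(input, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed source encoding; replace with the codepage actually in use.
        String sourceName = "windows-1257";
        if (!Charset.isSupported(sourceName)) {
            System.out.println(sourceName + " is not available in this JVM");
            return;
        }
        byte[] exported = {(byte) 0x54, (byte) 0x75}; // "Tu" as sample input
        byte[] utf8 = toUtf8(exported, Charset.forName(sourceName));
        System.out.println(utf8.length + " UTF-8 bytes");
    }
}
```

The same helper works for the EBCDIC side once the exact codepage is known. Guessing the codepage just moves the garbling to a different set of characters.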

Get file's encoding in Java [duplicate]

Possible Duplicate:
Java : How to determine the correct charset encoding of a stream
A user uploads a CSV file to the server, and the server needs to check whether the CSV file is encoded as UTF-8. If so, it needs to inform the user that (s)he uploaded a file with the wrong encoding. The problem is how to detect whether the uploaded file is UTF-8 encoded. The back end is written in Java. Does anyone have a suggestion?

At least in the general case, there's no way to be certain what encoding is used for a file -- the best you can do is a reasonable guess based on heuristics. You can eliminate some possibilities, but at best you're narrowing down the possibilities without confirming any one. For example, most of the ISO 8859 variants allow any byte value (or pattern of byte values), so almost any content could be encoded with almost any ISO 8859 variant (and I'm only using "almost" out of caution, not any certainty that you could eliminate any of the possibilities).
You can, however, make some reasonable guesses. For example, if a file starts with the three bytes of a UTF-8 encoded BOM (EF BB BF), it's probably safe to assume it's really UTF-8. Likewise, if you see sequences like 110xxxxx 10xxxxxx, it's a pretty fair guess that what you're seeing is encoded with UTF-8. You can eliminate the possibility that something is (correctly) UTF-8 encoded if you ever see a sequence like 110xxxxx 110xxxxx. (110xxxxx is a lead byte of a sequence, which must be followed by a non-lead byte, not another lead byte, in properly encoded UTF-8.)
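The elimination check can be left to the JDK's own UTF-8 decoder, configured to report malformed input instead of silently replacing it. A rough sketch follows; the class and method names are made up. A true result only means the bytes are well-formed UTF-8, not that the file was really meant as UTF-8, since pure ASCII content passes too.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    /** Returns true if the bytes start with the UTF-8 BOM (EF BB BF). */
    static boolean hasUtf8Bom(byte[] b) {
        return b.length >= 3
            && (b[0] & 0xFF) == 0xEF
            && (b[1] & 0xFF) == 0xBB
            && (b[2] & 0xFF) == 0xBF;
    }

    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isWellFormedUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = {(byte) 0xC3, (byte) 0xA9}; // 110xxxxx 10xxxxxx
        byte[] bad = {(byte) 0xC3, (byte) 0xC3};  // 110xxxxx 110xxxxx
        System.out.println(isWellFormedUtf8(good)); // prints true
        System.out.println(isWellFormedUtf8(bad));  // prints false
    }
}
```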

You can try and guess the encoding using a 3rd party library, for example: http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding

Well, you can't. You could show kind of a "preview" (or should I say review?) with some sample data from the file so the user can check if it looks okay. Perhaps with the possibility of selecting different encoding options to help determine the correct one.

UTF-8 and Servlets on Tomcat/Linux

I've had some problems reading and writing UTF-8 from servlets on Tomcat 6 / Linux. The request and response were UTF-8, the browser was UTF-8, and URIEncoding was set in server.xml on both connectors and hosts. In short, everything I knew of, in the code itself and in the server configuration, was UTF-8.
When reading the request, I had to take the byte array from the String and then convert that byte array into a String again. When writing the response, I had to write bytes, not the String itself, in order to get a proper response (otherwise I would get an exception saying that some non-ASCII character is not valid ISO 8859-1).

Changing the LANG environment variable is one way to solve the problem.
The official way is to set the character encoding in a servlet filter: http://wiki.apache.org/tomcat/Tomcat/UTF-8
Some background information: http://www.crazysquirrel.com/computing/general/form-encoding.jspx

The solution was to set the LANG environment variable to (in my case) en_US.UTF-8, or probably any other UTF-8 locale. I'm still puzzled by the fact that I couldn't do anything from code to make my servlet behave properly. If there is no way to do it, then it's a bug from my point of view.
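A likely reason LANG matters is that on older JVMs the default charset is derived from the locale, and calls such as String.getBytes() or new String(bytes) without a charset argument use that default. (From JDK 18 on, the default is UTF-8 regardless of locale.) A small sketch to see what the JVM picked up and to sidestep the default by naming the charset; the class and method names are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetCheck {
    /** Reports the charset this JVM uses when none is specified. */
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
            + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    /** Encodes with an explicit charset, so LANG has no influence on the result. */
    static byte[] explicitUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(describe());
        System.out.println(explicitUtf8("\u00e9").length + " bytes");
    }
}
```

Running this once under the Tomcat user with and without LANG set shows whether the default charset is what changed between the two setups.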
