Issues while converting UTF-8 to Arabic - Java

Arabic characters are coming in an XML request without any character set specified in the request header. When written to a text file the characters come through correctly, but if we insert them into an Oracle table they are stored as غ.ب (Ø´Ù?ر).
Manual insertion into the table works fine.
We tried conversions with different charsets, for example System.out.println(URLDecoder.decode(Value, "ISO-8859-9")), and also tried ByteBuffer and ByteArrayInputStream.
One more thing we noticed:
If we set charset=UTF-8 in the header, everything works fine, and printing the encoded string gives: 50+%D8%BA.%D8%A8+%28%D8%B4%D9%87%D8%B1%29
If we do not set the charset in the header, the string prints as: 50+%C3%98%C2%BA.%C3%98%C2%A8+%28%C3%98%C2%B4%C3%99%C2%87%C3%98%C2%B1%29
and it is this second string that goes into the text file correctly.
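For reference, a minimal Java sketch of what the two strings above imply (a sketch only; the second, double-encoded form can usually be repaired by an ISO-8859-1 round trip):

import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    public static void main(String[] args) throws Exception {
        // Form seen when charset=UTF-8 is set in the header: decodes directly.
        String ok = URLDecoder.decode(
                "50+%D8%BA.%D8%A8+%28%D8%B4%D9%87%D8%B1%29", "UTF-8");    // "50 غ.ب (شهر)"

        // Form seen without a charset in the header: decoding as UTF-8 yields mojibake ...
        String once = URLDecoder.decode(
                "50+%C3%98%C2%BA.%C3%98%C2%A8+%28%C3%98%C2%B4%C3%99%C2%87%C3%98%C2%B1%29",
                "UTF-8");
        // ... which can be repaired by re-reading its ISO-8859-1 bytes as UTF-8.
        String repaired = new String(once.getBytes(StandardCharsets.ISO_8859_1),
                                     StandardCharsets.UTF_8);              // "50 غ.ب (شهر)"

        System.out.println(ok);
        System.out.println(repaired);
    }
}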
Can someone please suggest something?

Don't just conclude that the data is wrong in the DB. Change the font and look at the data in the Oracle DB again: the Arabic characters may be stored correctly but simply not supported by the font you currently have selected in SQL Developer.

Related

Issues writing utf8 strings with emojis from Kotlin to MySQL - utf8 vs utf8mb4 [duplicate]

I tried to use UTF-8 and ran into trouble.
I have tried so many things; here are the results I have gotten:
???? instead of Asian characters. Even for European text, I got Se?or for Señor.
Strange gibberish (Mojibake), such as SeÃ±or for Señor, or æ–°æµªæ–°é—» for 新浪新闻.
Black diamonds, such as Se�or.
Finally, I got into a situation where the data was lost, or at least truncated: Se for Señor.
Even when I got text to look right, it did not sort correctly.
What am I doing wrong? How can I fix the code? Can I recover the data, if so, how?
This problem plagues the participants of this site, and many others.
You have listed the five main cases of CHARACTER SET troubles.
Best Practice
Going forward, it is best to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci. (There is a newer version of the Unicode collation in the pipeline.)
utf8mb4 is a superset of utf8 in that it handles 4-byte utf8 codes, which are needed by Emoji and some of Chinese.
Outside of MySQL, "UTF-8" refers to all size encodings, hence effectively the same as MySQL's utf8mb4, not utf8.
I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.
Overview of what you should do
Have your editor, etc. set to UTF-8.
HTML forms should start like <form accept-charset="UTF-8">.
Have your bytes encoded as UTF-8.
Establish UTF-8 as the encoding being used in the client.
Have the column/table declared CHARACTER SET utf8mb4 (Check with SHOW CREATE TABLE.)
<meta charset=UTF-8> at the beginning of HTML
Stored Routines acquire the current charset/collation. They may need rebuilding.
UTF-8 all the way through
More details for computer languages (and its following sections)
Test the data
Viewing the data with a tool or with SELECT cannot be trusted.
Too many such clients, especially browsers, try to compensate for incorrect encodings, and show you correct text even if the database is mangled.
So, pick a table and column that has some non-English text and do
SELECT col, HEX(col) FROM tbl WHERE ...
The HEX for correctly stored UTF-8 will be
For a blank space (in any language): 20
For English: 4x, 5x, 6x, or 7x
For most of Western Europe, accented letters should be Cxyy
Cyrillic, Hebrew, and Farsi/Arabic: Dxyy
Most of Asia: Exyyzz
Emoji and some of Chinese: F0yyzzww
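As a quick cross-check, a small Java sketch that prints the UTF-8 hex of a known-good reference string, to compare against the HEX(col) output (the sample text is arbitrary):

import java.nio.charset.StandardCharsets;

public class Utf8Hex {
    public static void main(String[] args) {
        String reference = "Señor";                        // any known-good sample text
        StringBuilder hex = new StringBuilder();
        for (byte b : reference.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        System.out.println(hex);                           // 5365C3B16F72; ñ is C3B1, in the Cxyy range
    }
}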
More details
Specific causes and fixes of the problems seen
Truncated text (Se for Señor):
The bytes to be stored are not encoded as utf8mb4. Fix this.
Also, check that the connection during reading is UTF-8.
Black Diamonds with question marks (Se�or for Señor);
one of these cases exists:
Case 1 (original bytes were not UTF-8):
The bytes to be stored are not encoded as utf8. Fix this.
The connection (or SET NAMES) for the INSERT and the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Case 2 (original bytes were UTF-8):
The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Black diamonds occur only when the browser is set to <meta charset=UTF-8>.
Question Marks (regular ones, not black diamonds) (Se?or for Señor):
The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
The column in the database is not CHARACTER SET utf8 (or utf8mb4). Fix this. (Use SHOW CREATE TABLE.)
Also, check that the connection during reading is UTF-8.
Mojibake (SeÃ±or for Señor):
(This discussion also applies to Double Encoding, which is not necessarily visible.)
The bytes to be stored need to be UTF-8-encoded. Fix this.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
HTML should start with <meta charset=UTF-8>.
If the data looks correct, but won't sort correctly, then
either you have picked the wrong collation,
or there is no collation that suits your need,
or you have Double Encoding.
Double Encoding can be confirmed by doing the SELECT .. HEX .. described above.
é should come back C3A9, but instead shows C383C2A9
The Emoji 👽 should come back F09F91BD, but comes back C3B0C5B8E28098C2BD
That is, the hex is about twice as long as it should be.
This is caused by converting from latin1 (or whatever) to utf8, then treating those
bytes as if they were latin1 and repeating the conversion.
The sorting (and comparing) does not work correctly because it is, for example, sorting as if the string were SeÃ±or.
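A small Java sketch of how Double Encoding arises (ISO-8859-1 stands in for whatever single-byte charset was involved):

import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {
    public static void main(String[] args) {
        byte[] utf8 = "é".getBytes(StandardCharsets.UTF_8);              // C3 A9
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);  // "Ã©" -- the bytes misread as latin1
        byte[] doubled = misread.getBytes(StandardCharsets.UTF_8);       // C3 83 C2 A9
        for (byte b : doubled) {
            System.out.printf("%02X", b & 0xFF);                          // prints C383C2A9, the symptom above
        }
        System.out.println();
    }
}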
Fixing the Data, where possible
For Truncation and Question Marks, the data is lost.
For Mojibake / Double Encoding, ...
For Black Diamonds, ...
The Fixes are listed here. (5 different fixes for 5 different situations; pick carefully): http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
I had similar issues with two of my projects after a server migration. After searching and trying a lot of solutions, I came across this one:
mysqli_set_charset($con,"utf8mb4");
After adding this line to my configuration file, everything works fine!
I found this solution for MySQLi in the PHP mysqli set_charset() function documentation while looking to fix an insert from an HTML form.
I was also searching for the same issue. It took me nearly one month to find the appropriate solution.
First of all, you will have to update your database's CHARACTER SET and COLLATION to utf8mb4, or at least to one that supports UTF-8 data.
For Java:
When making a JDBC connection, add useUnicode=yes&characterEncoding=UTF-8 as parameters to the connection URL and it will work.
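For example (host, database name, and credentials are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;

public class JdbcUtf8Connection {
    public static void main(String[] args) throws Exception {
        // useUnicode / characterEncoding make Connector/J talk to the server in UTF-8.
        String url = "jdbc:mysql://localhost:3306/mydb?useUnicode=yes&characterEncoding=UTF-8";
        try (Connection con = DriverManager.getConnection(url, "user", "password")) {
            System.out.println("Connected with UTF-8 character encoding");
        }
    }
}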
For Python:
Before querying the database, try enforcing this on the cursor:
cursor.execute('SET NAMES utf8mb4')
cursor.execute("SET CHARACTER SET utf8mb4")
cursor.execute("SET character_set_connection=utf8mb4")
If it does not work, happy hunting for the right solution.
Set your IDE's encoding to UTF-8.
Add <meta charset="utf-8"> to the header of the web page where you collect data from the form.
Check that your MySQL table definition looks like this:
CREATE TABLE your_table (
...
) ENGINE=InnoDB DEFAULT CHARSET=utf8
If you are using PDO, make sure
$options = array(PDO::MYSQL_ATTR_INIT_COMMAND=>'SET NAMES utf8');
$dbL = new PDO($pdo, $user, $pass, $options);
If you already have a large database with the above problem, you can try SIDU to export it with the correct charset and import it back as UTF-8.
Depending on how the server is set up, you have to change the encoding accordingly. From what you said, utf8 should work best. However, if you're getting weird characters, it might help to change the webpage encoding to ANSI.
This helped me when I was setting up a PHP MySQLi. This might help you understand more: ANSI to UTF-8 in Notepad++

SQL Function to Replace Microsoft Characters (especially "smart quotes") with UTF-8 compatible characters [duplicate]


Select japanese character from sqlite database

I created a database from the EDICT files with Java, using SQLite.
SQLite encodes strings as UTF-8 by default.
Here is a sample of the database: sample
If I do
SELECT * FROM entry
in Java, I get the Japanese words in their "correct" form (the graphical representation, at least).
But if I try
SELECT * FROM entry WHERE wordJP LIKE '食べる'
I get nothing, which makes it very hard to find the definition of a word.
Can someone explain why this is occurring, and how to solve it?
I understand that it is an encoding problem, but I don't understand where it happens or why.
So I managed to solve this by:
Using iconv on Linux to convert the file from EUC-JP to UTF-8
Setting SQLite to UTF-8
Java handles Unicode natively, but Eclipse sets projects to some ISO-xxxx encoding by default, so you need to change that by right-clicking your project > Properties > Text file encoding > Other (scroll the list)
From your link,
[EDICT] is a plain text document in EUC-JP coding.
If your query strings are encoded in UTF-8, matching will fail.
You should convert the data to UTF-8 when you fill your SQLite database.
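A minimal sketch of that conversion in Java (assuming the Xerial sqlite-jdbc driver, the entry/wordJP names from the question, and placeholder file paths): the EDICT file is read with the EUC-JP charset, and the resulting Unicode strings are stored by SQLite as UTF-8.

import java.io.BufferedReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class EdictImport {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:sqlite:edict.db");
             PreparedStatement ps = con.prepareStatement("INSERT INTO entry(wordJP) VALUES (?)");
             BufferedReader in = Files.newBufferedReader(Paths.get("edict.txt"),
                                                         Charset.forName("EUC-JP"))) {
            String line;
            while ((line = in.readLine()) != null) {
                ps.setString(1, line);      // Java strings are Unicode; SQLite stores text as UTF-8
                ps.executeUpdate();
            }
        }
    }
}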

I am trying to read in .properties files with Chinese characters encoded in utf-8

I am trying to read .properties files that contain Chinese characters. When I read the values by key, they print like ??????. I am writing a JSF application that needs to be translated into Chinese. On the UI, JSF shows all characters correctly, as it should, but in my Java code they show as ?????, and I don't know why. I also tried "手提電話" and "\u88dc\u7fd2\u500b\u6848" directly, printing them to the console from a main method, and they print correctly as Chinese characters. My properties file is encoded as UTF-8.
To clarify: I am able to display Chinese on the console like this:
String str="Алексей";
String str="\u88dc\u7fd2\u500b\u6848";
System.out.println("direct output: "+str);
This works fine in a main method, but after reading from the properties file it shows ???.
Hope it is clear now.
Also, my database is receiving ?? in place of the actual Chinese characters.
Please help. If any other clarification is required, let me know and I will update the question.
Here is my code for reading the properties file, which returns the bundle for the relevant locale:
FacesContext context = FacesContext.getCurrentInstance();
bundle = context.getApplication().getResourceBundle(context, "hardvalue");
After this, I call the following to access the value:
bundle.getString("tutorsearch.header"); // results in ?????
If you need any other code, please let me know.
Multiple question marks usually come from:
you had utf8-encoded data (good)
SET NAMES latin1 was in effect (default, but wrong)
the column was declared CHARACTER SET latin1 (default, but wrong)
For Chinese (or Emoji), you need to use MySQL's utf8mb4.
The cure (for future INSERTs):
utf8-encoded data (good)
mysqli_set_charset('utf8mb4') (or whatever your client needs for establishing the CHARACTER SET)
check that the column(s) and/or table default are CHARACTER SET utf8mb4
If you are displaying on a web page, <meta...charset=utf-8> should be near the top. (Note different spelling.)
Adding the tags below resolved my issue:
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
Thanks to All
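For completeness, a standalone sketch (outside the JSF ResourceBundle mechanism; the file name and key are taken from the question). Before Java 9, .properties files loaded through ResourceBundle are read as ISO-8859-1 by default, which is one common source of the ????? output; a Reader-based Properties.load keeps the UTF-8 characters intact:

import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

public class Utf8PropertiesCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        try (Reader reader = Files.newBufferedReader(Paths.get("hardvalue.properties"),
                                                     StandardCharsets.UTF_8)) {
            props.load(reader);   // Reader-based load preserves the UTF-8 characters
        }
        System.out.println(props.getProperty("tutorsearch.header"));
    }
}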

utf-8 invisible characters

I have a website, and need to store data from a text field into a mysql database.
The frontend is Perl. I used utf8::encode to encode the data into UTF-8.
The request is made to the Java backend which connects to the mysql db and inserts this text.
For the table the default charset is set to utf8.
This works in many cases, but it fails in some cases.
If I use テスト, the data stored in the database shows question marks: ã??ã?¹ã??.
If I try to insert the utf8 encoded string directly from the sql browser, everything works fine.
Update events set summary = ãã¹ã where event_id = 11657;
While inserting I noticed there are some blank characters that show up in the mysql query browser, something like: ã ã¹ ã.
After inserting from there, the data in the database shows some boxes instead of those spaces, and テスト displays correctly on the website after UTF-8 decoding it.
The problem is only when I insert directly from the website, these special characters come up as question marks in the database.
Can someone please help me with these special characters? Do I need to handle them differently?
We also faced a similar issue in one of our projects, so we had to write a small routine to convert those UTF-8 characters to HTML-encoded form and store them in the database.
Use StringEscapeUtils from Apache Commons Lang:
import static org.apache.commons.lang.StringEscapeUtils.escapeHtml;
// ...
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = escapeHtml(source);
If the database really stored テスト, that's what you should see in the SQL browser instead of mojibake.
It sounds like the Java backend is interpreting what Perl sends as ISO-8859-1 rather than UTF-8. This explains how テ gets converted into \u00E3\u0083\u0086. Then the backend tries to send the data to the database in Windows-1252, the MySQL default encoding. Unfortunately Windows-1252 cannot represent the Unicode characters in the range \u0080-\u009F, so the last two characters are replaced by question marks.
So you have two problems:
You should make the Java backend read the request in UTF-8 rather than in ISO-8859-1.
The backend should use UTF-8 when talking with the database. The easiest way to do this is adding characterEncoding=utf8 to the connection parameters.
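A sketch of both fixes on the Java side (the servlet and parameter names are hypothetical, as are the connection details; your framework may have its own place to configure this):

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SummaryServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        req.setCharacterEncoding("UTF-8");                 // 1. decode the POST body as UTF-8, not ISO-8859-1
        String summary = req.getParameter("summary");      // e.g. テスト now arrives intact

        String url = "jdbc:mysql://localhost:3306/eventsdb"
                   + "?useUnicode=yes&characterEncoding=utf8";   // 2. UTF-8 on the JDBC connection
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "UPDATE events SET summary = ? WHERE event_id = ?")) {
            ps.setString(1, summary);
            ps.setInt(2, 11657);
            ps.executeUpdate();
        } catch (Exception e) {
            throw new IOException(e);
        }
    }
}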
I'm assuming that you are sending POST parameters.
I think that the most likely cause of your initial problem is one of the following:
If the parameters are being sent in the HTTP request body, your Perl front-end is probably not setting the encoding in the Content-Type header of the request, and the web server is probably assuming ISO-8859-1. The solution is to set the request content type properly.
If the parameters are sent in the HTTP request URL, your web server is using the wrong character set when decoding the request parameters. The solution to this is going to be web-server specific ...
It sounds like there might also be a character set problem in talking to the database, but that might just be a consequence of earlier mangling.
