I am having trouble displaying the "velar nasal" character (ŋ) (but I assume the same problem would arise with other rare characters).
I have a MySQL table which contains a word with this character.
When my code retrieves it to display in my HTML page, it is displayed as a question mark.
I have tried a number of things:
1) Tried using MySQL's CONVERT to convert the retrieved string to UTF-8 because I understood that the string is stored in my table as "Latin1":
SELECT CONVERT(Name USING utf8)
Instead of:
SELECT Name
This did not help. Also, when I saved a string with the problematic word ("Yolŋu") directly in my Java code and passed that String through the rest of the code, the problem still occurred (i.e. the problem does not lie in the character encoding my DB uses).
2) I also tried creating a new String from bytes:
new String(name.getBytes("UTF-8"));
The String is being passed from java to the html via a JSONObject that is passed to a javascript file:
Relevant JSON code:
JSONArray names = new JSONArray();
for (int iD : iDs)
{
    JSONObject nameData = new JSONObject();
    String name = NameDB.getNameName(iD);
    nameData.put("label", name);
    nameData.put("value", iD);
    names.put(nameData);
}
return names;
Relevant servlet code:
response.setContentType("application/json");
try (PrintWriter out = response.getWriter())
{
    out.print(namesJSONArray);
}
Relevant js code:
An ajax call to the servlet is made via jquery ui's autocomplete "source" option.
I am pretty new to coding in general and very new to the character encoding topic.
Thank you.
First, a Java String should already hold correct Unicode, so new String(string.getBytes(...), ...) is a hack with its own troubles.
1. The database
It would be nice if the database held the text in UTF-8. The encoding can be set at the database, table and column level. The first thing is to investigate how the text is actually stored; a table dump (mysqldump) would be least error prone.
If you can use UTF-8, it must be set for MySQL on the database engine, and for data transfer in the Java driver.
In any case you can check the round trip in Java via JDBC by writing a table field and reading it back, and also by reading the existing troublesome field.
Dump the code points of the string.
String dump(String s) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);
        if (32 < cp && cp < 128) {
            sb.append((char) cp);
        } else {
            sb.append("U+").append(Integer.toHexString(cp));
        }
        sb.append(' ');
        i += Character.charCount(cp);
    }
    return sb.toString();
}
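For example, run on the problem word from the question, the dump shows the velar nasal as a code point rather than a raw character. A quick sketch, wrapping the method above in a main for illustration:

```java
public class DumpDemo {
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (32 < cp && cp < 128) {
                sb.append((char) cp);
            } else {
                sb.append("U+").append(Integer.toHexString(cp));
            }
            sb.append(' ');
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Yolŋu" → "Y o l U+14b u " (ŋ is U+014B, LATIN SMALL LETTER ENG)
        System.out.println(dump("Yol\u014Bu"));
    }
}
```

If the troublesome field comes back from JDBC as U+fffd (the replacement character) or U+3f ('?'), the damage already happened before your display code ran.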
2. The output
Here probably lies the error. Call at the beginning:
response.setCharacterEncoding("UTF-8");
... response.getWriter(); // Now converts java's Unicode text to UTF-8.
For HTML a charset specification is in order too. Especially when the HTML page is saved to the file system, the encoding header would be lost.
You should make sure of the following things:
Your JVM must run with the file.encoding=UTF-8 parameter
The MySQL table that contains the special characters must be created with encoding UTF-8
Your web UI should specify a meta tag with the encoding you saved the web page in your editor, i.e. UTF-8
If the problem persists, try using HTML entities (e.g. &#x14B; for ŋ) instead.
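If you fall back to HTML entities, non-ASCII characters can be escaped to numeric character references before the text reaches the page. A minimal sketch (the method name is illustrative, not from any library):

```java
public class EntityEscape {
    // Replace every code point outside printable ASCII with a numeric HTML entity.
    static String toHtmlEntities(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp >= 32 && cp < 128) {
                sb.appendCodePoint(cp);
            } else {
                sb.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHtmlEntities("Yol\u014Bu")); // Yol&#x14b;u
    }
}
```

The browser renders &#x14b; as ŋ regardless of the page's charset, which is why this works even when the encoding chain is broken somewhere.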
Related
I am encountering issues displaying names in reports. My application uses several technologies: PHP, Perl, and Pentaho for BI.
We are using MySQL as the DB and my table has CHARSET=utf8.
My table stores values in its rows as below, which is wrong:
Row1 = Ãx—350
Row2 = Ñz–401
PHP and Perl use different built-in functions to convert the above values stored in the DB, and the UI displays them as below, which is correct:
Expected Row1 = Áx—350
Expected Row2 = Ñz–401
Coming to the reports, which use Pentaho, I use ETL to transform the data before showing it. To convert the above DB-stored values I am trying to convert the data through a Java step as below:
new java.lang.String(new java.lang.String(CODE).getBytes("Windows-1252"), "UTF-8")
But it is not converting the values properly: of the two wrong values above, only Row2 is converted properly, while Row1 is wrongly converted as below:
Converted Row1 = �?x—350
Converted Row2 = Ñz–401
Please suggest how I can convert the values properly so that, for example, Row1 is converted to Áx—350.
I wrote a small Java program as below to convert the string Ãx—350 to Áx—350:
String input = "Ãx—350";
byte[] b1 = input.getBytes("Windows-1252");
System.out.println("Input Get Bytes = "+b1.toString());
String szUT8 = new String(b1, "UTF-8");
System.out.println("Input Encoded = " + szUT8);
The output from the above code is as below
Input Get Bytes = [B#157ee3e5
Input Encoded = �?x—350-350—É1
Looking at the output, the string is wrong; the actual expected output is Áx—350.
To confirm the encoding/decoding schemes I tested the string online: with input Ãx—350 the output is Áx—350, which is correct.
So can anyone please point out why the Java code is not able to convert properly even though I am using the proper encoding/decoding schemes? Is there anything else I am missing, or is my approach wrong?
The CHARSET setting in your DB being set to utf-8 doesn't necessarily mean that the data there is properly encoded in UTF-8 (or in UTF-8 at all), as we can see. It looks like you are dealing with mojibake: characters that were at one time decoded using the wrong encoding scheme, and therefore then encoded wrongly in turn. Fixing that is usually a tedious process of figuring out the past decode/encode errors and then undoing them.
Long story short: if you have mojibake, there aren't any automatic conversions you can do unless you know (or can figure out) what conversions were made in the past.
Converting is a matter of first decoding, then encoding. To convert in Perl:
my $string = "some windows-1252 string";
use Encode;
my $raw = decode('windows-1252',$string);
my $encoded = encode('utf-8',$raw);
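The same decode-then-encode repair can be sketched in Java. Whether it works depends on every byte of the original UTF-8 surviving the bad round trip; this is likely why Row2 repairs but Row1 does not, since Á (U+00C1) encodes in UTF-8 with the byte 0x81, which Windows-1252 leaves undefined and which can therefore be dropped along the way. A self-contained demo using ñ, whose two UTF-8 bytes both map cleanly:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String original = "\u00F1"; // ñ
        // Simulate the damage: the UTF-8 bytes (C3 B1) get mis-decoded as Windows-1252.
        String mojibake = new String(original.getBytes(StandardCharsets.UTF_8), cp1252);
        System.out.println(mojibake); // ñ
        // Undo it: re-encode as Windows-1252, then decode as UTF-8.
        String repaired = new String(mojibake.getBytes(cp1252), StandardCharsets.UTF_8);
        System.out.println(original.equals(repaired)); // true
    }
}
```

The repair is only safe when the mis-decode was lossless; once a byte has been dropped or replaced, as with Row1, the original cannot be reconstructed mechanically.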
I am trying to submit a form with fields containing special characters, such as €ŠšŽžŒœŸ. As far as I can see from the ISO-8859-15 Wikipedia page, these characters are included in the standard. Even though the encoding for both request and response is set to ISO-8859-15, when I try to display the values (using FreeMarker 2.3.18 in a Java EE environment), the values are ???????. I have set the form's accepted charset to ISO-8859-15 and checked that the form is submitted with content-type text/html;charset=ISO-8859-15 (using Firebug), but I can't figure out how to display the correct characters. If I run the following code, the correct hex value is displayed (e.g. Ÿ = be):
System.out.println(Integer.toHexString(myString.charAt(i)));
What am I missing? Thank you in advance!
EDIT:
I am having the following code as I process the request:
PrintStream ps = new PrintStream(System.out, true, "ISO-8859-15");
String firstName = request.getParameter("firstName");
// check for null before
for (int i = 0; i < firstName.length(); i++) {
    ps.println(firstName.charAt(i)); // prints "?"
}
BufferedWriter file = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(path), "ISO-8859-15"));
file.write(firstName); // writes "?" to file (checked with Notepad++, correct encoding set)
file.close();
According to the hex value, the form data is submitted correctly.
The problem seems to be related to the output: Java replaces a character with ? if it cannot be represented in the charset in use.
You have to use the correct charset when constructing the output stream. What commands do you use for that? I do not know FreeMarker, but there will probably be something like:
Writer out = new OutputStreamWriter(System.out);
This should be replaced with something resembling this:
Writer out = new OutputStreamWriter(System.out, "iso-8859-15");
By the way, UTF-8 is usually a much better choice for the encoding charset.
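The replacement behaviour is easy to demonstrate: Ÿ (U+0178) sits at 0xBE in ISO-8859-15 but does not exist in ISO-8859-1, so encoding it with the wrong charset silently yields '?' (0x3F). A small sketch:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    public static void main(String[] args) {
        String s = "\u0178"; // Ÿ
        byte[] right = s.getBytes(Charset.forName("ISO-8859-15"));
        byte[] wrong = s.getBytes(StandardCharsets.ISO_8859_1);
        // Mask with 0xFF so the byte prints as an unsigned hex value.
        System.out.printf("ISO-8859-15: %02x%n", right[0] & 0xFF); // be
        System.out.printf("ISO-8859-1:  %02x%n", wrong[0] & 0xFF); // 3f ('?')
    }
}
```

This matches the asker's observation that the in-memory char is correct (hex be): the data arrives intact, and the '?' is manufactured at output time by an encoder that cannot represent the character.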
I am trying to read from an Oracle DB which stores data in Windows-1252 encoding. I read that data using JDBC and write it to an XML file with UTF-8 encoding.
While writing to these files, I get '?' characters instead of the Latin characters; e.g. instead of í, I get ?:
'Coquí' is being written to the XML as 'Coqu?'
I use this file to upload to Solr later on.
I have only put the relevant code here, not the whole method, since it is long (legacy code that I inherited) and complicated.
BufferedWriter result = new BufferedWriter(new FileWriter(OUTPUT_FILE));
stmt = conn.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE, ResultSet.CONCUR_READ_ONLY);
rst = stmt.executeQuery(sql);
if (rst.getFetchSize() < 1)
    return;
rst.beforeFirst();
while (rst.next()) {
    Profile p = new Profile();
    p.business_name = rst.getString("business_name");
    p.business_name_sort = rst.getString("business_name_sort");
    result.write(p.business_name);
    result.write(p.business_name_sort);
}
By the sounds of it, you aren't handling the character set conversion in the right place. The JDBC driver already decodes the Windows-1252 column data into Unicode Strings for you, so the String itself needs no byte-level fixing; note that new String(originalText.getBytes("UTF-8"), "UTF-8") would be a no-op. The '?' characters come from the output side: FileWriter always uses the platform default encoding. Construct the writer with an explicit UTF-8 charset instead:
BufferedWriter result = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(OUTPUT_FILE), StandardCharsets.UTF_8));
My application has a Java servlet that reads a JSONObject out of the request and constructs some Java objects that are used elsewhere. I'm running into a problem because there are strings in the JSON that are encoded in ISO-8859-1. When I extract them into Java strings, the encoding appears to get interpreted as UTF-16. I need to be able to get the correctly encoded string back at some point to put into another JSON object.
I've tried mucking around with ByteBuffers and CharBuffers, but then I don't get any characters at all. I can't change the encoding, as I have to play nicely with other applications that use ISO-8859-1.
Any tips would be greatly appreciated.
It's a legacy application using Struts 1.3.8. I'm using net.sf.json 2.2.4 for JSONObject and JSONArray.
A snippet of the parsing code is:
final JSONObject a = (JSONObject) i;
final JSONObject attr = a.getJSONObject("attribute");
final String category = attr.getString("category");
final String value = attr.getString("value");
I then create POJOs using that information, that are retrieved by another action class to create JSON to pass to the client for display, or to pass to other applications.
So to clarify: if the JSON contains the string "Juan Guzmán", the Java String contains something like Juan Guzm?_An (I don't have the exact one in front of me). I'm not sure how to get the correct diacritical back. I believe that if I can get a Java String with the correct representation, Mezzie's solution below will let me create the string with the correct encoding to put back into the JSON and serve back.
I had the same issue and I am using the same technology as you. In our case it was UTF-8; just change the charsets below to match your case:
public static String UTF8toISO(String str)
{
    try
    {
        return new String(str.getBytes("ISO-8859-1"), "UTF-8");
    }
    catch (UnsupportedEncodingException e)
    {
        e.printStackTrace();
    }
    return str;
}
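Applied to the example from the question, the method turns the mangled name back into the intended one, assuming the corruption is the usual UTF-8-bytes-read-as-ISO-8859-1 and no bytes were lost (UTF8toISO below is the answer's method, reproduced so the sketch is self-contained):

```java
import java.io.UnsupportedEncodingException;

public class NameFix {
    public static String UTF8toISO(String str) {
        try {
            return new String(str.getBytes("ISO-8859-1"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        return str;
    }

    public static void main(String[] args) {
        // "Juan Guzmán" is "Juan Guzmán" whose UTF-8 bytes (C3 A1 for á)
        // were mis-read as ISO-8859-1, producing the two characters Ã and ¡.
        System.out.println(UTF8toISO("Juan Guzm\u00C3\u00A1n")); // Juan Guzmán
    }
}
```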
In my Java code, I retrieve some multibyte data from a database, build an XML DOM with that data as the value of some node, convert the DOM to a String, and post the bytes to an ASP page via HttpURLConnection. But somehow, at the receiver end, the data appears as ???? instead of the multibyte values. Please suggest what to do.
Things that I am already doing:
1) I have set -Dfile.encoding=UTF8 as a system property
2) While using TransformerFactory to convert my XML DOM to a String, I have set
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8")
to make sure that the encoding is proper there.
Please suggest where I am going wrong.
@Jon Skeet A few more things to add here: 1) I am getting the data from the database correctly. 2) The transformed XML also appears to be proper, as I checked by saving it to my local file system.
For posting, earlier I was using something like
dout = new DataOutputStream(urlconn.getOutputStream());
dout.write(strXML.getBytes());
and the resulting data at the receiver end was getting converted to ?????, but then I switched to
dout = new OutputStreamWriter(urlconn.getOutputStream(), "UTF8");
dout.write(strXML);
and the data at the receiver end now appears to be proper. But a problem occurs with the way it is handled at the receiver end in this case: in my receiver ASP code I am using objStream.WriteLine(oXMLDom.xml), and here it fails and starts to give an internal server error. Please suggest what is wrong with the second approach.
There are lots of potential conversions going on there. You should verify the data at every step:
Check that you're getting it out of the database correctly
See what the transformed XML looks like
Watch what goes over the network (including HTTP headers)
Check exactly what you're getting in ASP
Don't just print out the strings as strings - log the Unicode value of each character, by casting it to int:
for (int i = 0; i < text.length(); i++)
{
    char c = text.charAt(i);
    log("Character " + c + " - " + (int) c);
}
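As a concrete check, running that loop over the problem word from the first question (with System.out standing in for the hypothetical log helper) shows the velar nasal as the Unicode value 331, i.e. hex 14b:

```java
public class CharLogDemo {
    public static void main(String[] args) {
        String text = "Yol\u014Bu"; // "Yolŋu"
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // e.g. "Character ŋ - 331"
            System.out.println("Character " + c + " - " + (int) c);
        }
    }
}
```

If a step in the pipeline prints 63 ('?') or 65533 (U+FFFD) here instead of 331, that step is where the conversion went wrong.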