I have a small application which reads from an Oracle 9i database and sends the data via e-mail, using JavaMail. The database has NLS_CHARACTERSET = "WE8MSWIN1252", that is, CP1252.
If I run the app without any parameters, it works fine and the e-mails are sent correctly. However, I have a requirement that forces me to run the app with the -Dfile.encoding=UTF-8 parameter, which results in the text being sent with corrupted characters.
I've tried to change the encoding of the data read from the database, with:
String textToSend = new String(textRead.getBytes("CP1252"), "UTF-8");
But it doesn't help. I've tried all the possible combinations of CP1252, windows-1252, ISO-8859-1 and UTF-8, but still had no luck.
Any ideas?
Update to clarify my problem: when I do the following:
Statement stat = connection.createStatement(ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY);
stat.executeQuery("SELECT blah FROM blahblah ...");
ResultSet rs = stat.getResultSet();
String textRead = rs.getString("whatever");
I get textRead corrupted, because the database is CP1252 and the application is running in UTF-8. Another approach that I've tried, which also failed:
InputStream is = rs.getBinaryStream("whatever");
Writer writer = new StringWriter();
char[] buffer = new char[1024];
Reader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
int n;
while ((n = reader.read(buffer)) != -1) {
    writer.write(buffer, 0, n);
}
String textRead = writer.toString();
Your driver should do the conversion automatically, and since every cp1252 character can be represented in UTF-8, you shouldn't lose information.
Can you try the following: get the String with ResultSet.getString and write it to a file. Open the file with an editor that lets you specify the UTF-8 character set (jEdit, for example).
The file should contain UTF-8 data.
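A minimal sketch of that check ("dump.txt" is just a placeholder name; "whatever" is the column from your query):
String text = rs.getString("whatever");
// Write the string out explicitly as UTF-8, bypassing the platform default encoding
Writer out = new OutputStreamWriter(new FileOutputStream("dump.txt"), "UTF-8");
out.write(text);
out.close();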
You seem to be getting lost in charset space -- I understand this... :-)
This line
String textToSend = new String(textRead.getBytes("CP1252"), "UTF-8");
does not make much sense. You already have text; this converts it to a "cp1252"-encoded byte[], and then tells the VM to treat the bytes as if they were "UTF-8" (which is a lie...).
In short: if you already have a String, as in textRead, you don't have to convert it at all. If something goes wrong, either the text is already rotten (look at it in the debugger) or it gets rotten in the API later on. Check this and come back with more detail: where exactly is the text wrong, and where exactly do you read it from or write it to?
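To see why that round trip corrupts data, here is a small illustration (the variable names are just for demonstration):
String s = "ö"; // U+00F6
byte[] cp1252Bytes = s.getBytes("windows-1252"); // the single byte 0xF6
// 0xF6 on its own is not a valid UTF-8 sequence, so decoding it as UTF-8
// yields the replacement character U+FFFD instead of 'ö'
String broken = new String(cp1252Bytes, "UTF-8");
System.out.println(broken); // prints '?' or '\uFFFD' depending on the console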
Your database data is in windows-1252. So -- assuming it's being handed back verbatim by the JDBC driver -- when you try to convert it to a Java String, that's the charset you need to specify:
Statement stat = connection.createStatement(ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY);
ResultSet rs = stat.executeQuery("SELECT blah FROM blahblah ...");
byte[] rawbytes = rs.getBytes("whatever");
String textRead = new String(rawbytes, "windows-1252");
Is part of the requirement that the data be mailed out as UTF-8? If so, the UTF-8 part needs to occur on the output side, not the input side. When you have String data in Java, it's stored internally as UTF-16. So when you serialize it out to the MimeMessage, you again need to pick a charset:
mimebodypart.setText(textRead, "UTF-8");
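Putting both sides together, a sketch of the mail construction might look like this (session and subject stand for your existing mail session and subject line; only the charset arguments matter here):
MimeMessage message = new MimeMessage(session);
message.setSubject(subject, "UTF-8");
// Encodes the body as UTF-8 and records the charset in the Content-Type header
message.setText(textRead, "UTF-8");
Transport.send(message);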
I had the same problem:
Oracle database using the WE8MSWIN1252 charset, some VARCHAR2 column data/text containing the euro sign (€) in it. Sending the text using JavaMail gave problems on the euro sign.
Finally it works. Two important things you should check/do:
1. Be sure to use the most recent Oracle JDBC driver for the Java version you use.
2. Specify the charset (preferably UTF-8) in JavaMail, e.g. MimeMessage.setSubject(String text, "UTF-8") and MimeMessage.setText(String text, "UTF-8"). That way the email text gets UTF-8 encoded.
NOTE: Because RFC 821 restricts mail messages to 7-bit US-ASCII, 8-bit character or binary data needs to be encoded into a 7-bit format. The email header "Content-Transfer-Encoding" specifies the encoding used. For more information: http://www.w3.org/Protocols/rfc1341/5_Content-Transfer-Encoding.html
Can you do the conversion in the database? Instead of:
SELECT blah FROM blahblah
Try
SELECT convert(blah, 'WE8MSWIN1252', 'UTF8') FROM blahblah
Related
I have a CSV file, which I saved from Excel as "CSV UTF-8" encoded.
My Java code reads the file into a byte array, then:
String result = new String(b, 0, b.length, "UTF-8");
But somehow the content "Montréal" becomes "Montr?al" when saved to the DB; what might be the problem?
The environment is unix with:
-bash-4.1$ locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
BTW it works on my Windows machine when I run my code, and I see the correct "Montréal" in the DB. So my guess is that the environment has some default locale setting that forces the use of a default encoding.
Thanks
I don't have your complete code, but I tried the following code and it works for me:
String x = "c:/Book2.csv";
BufferedReader br = null;
try{
br = new BufferedReader(new InputStreamReader(new FileInputStream(
x), "UTF8"));
String b;
while ((b = br.readLine()) != null) {
System.out.println(b);
}
} finally {
if (br != null){
br.close();
}
}
If you see "Montr?al" printed on your console, don't worry. It does not mean that the program is not working. Now, you may want to check if your console supports printing UTF-8 characters. So, you can put a debug and inspect the variable to check if has what you want.
If you see the correct value in debug but it prints a "?" in your output, you can rest assured that the String variable holds the right value, and you can write it back to any file or DB as needed.
If you see "?" when you query your DB, the tool you may be using is not printing the output correctly. Try reading the DB value in java code an check by putting a debug in you code. I usually use putty to query the DB to see the double byte characters correctly. That's all I have, hope that helps.
You have to use ISO/IEC 8859, not UTF-8; if you look at the list of character encodings on the Wikipedia page, you'll understand the difference.
Basically, UTF-8 is the common encoding used by Western countries...
Also, you can check your terminal encoding; maybe the problem is there.
I am trying to submit a form with fields containing special characters, such as €ŠšŽžŒœŸ. As far as I can see from the ISO-8859-15 Wikipedia page, these characters are included in the standard. Even though the encoding for both request and response is set to ISO-8859-15, when I try to display the values (using FreeMarker 2.3.18 in a Java EE environment), the values come out as ???????. I have set the form's accepted charset to ISO-8859-15, and I have checked that the form is submitted with content-type text/html;charset=ISO-8859-15 (using Firebug), but I can't figure out how to display the correct characters. If I run the following code, the correct hex value is displayed (e.g. Ÿ = be).
System.out.println(Integer.toHexString(myString.charAt(i)));
What am I missing? Thank you in advance!
EDIT:
I have the following code as I process the request:
PrintStream ps = new PrintStream(System.out, true, "ISO-8859-15");
String firstName = request.getParameter("firstName");
// check for null before
for (int i = 0; i < firstName.length(); i++) {
ps.println(firstName.charAt(i)); // prints "?"
}
BufferedWriter file=new BufferedWriter(new OutputStreamWriter(new FileOutputStream(path), "ISO-8859-15"));
file.write(firstName); // writes "?" to file (checked with notepad++, correct encoding set)
file.close();
According to the hex value, the form data is submitted correctly.
The problem seems to be related to the output. Java replaces a character with ? if it cannot be represented with the charset in use.
You have to use a correct charset when constructing the output stream. What commands do you use for that? I do not know FreeMarker, but there will probably be something like
Writer out = new OutputStreamWriter(System.out);
This should be replaced with something resembling this:
Writer out = new OutputStreamWriter(System.out, "iso-8859-15");
By the way, UTF-8 is usually a much better choice for the encoding charset.
I am trying to read from an Oracle DB which stores data in Windows-1252 encoding. I am reading that data using JDBC and writing it to an XML file with UTF-8 encoding.
While writing to these files, I am getting '?' characters instead of the Latin characters, e.g. instead of í I get a ?:
'Coquí' is being written to the XML as 'Coqu?'.
I use this file to upload to Solr later on.
I have only put the relevant code here and not the whole thing, since it's a long method (legacy code that I have inherited) which is complicated.
BufferedWriter result = new BufferedWriter(new FileWriter(OUTPUT_FILE)); // note: FileWriter uses the platform default encoding
stmt = conn.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE, ResultSet.CONCUR_READ_ONLY);
rst = stmt.executeQuery(sql);
if (rst.getFetchSize() < 1)
return;
rst.beforeFirst();
while (rst.next()) {
Profile p = new Profile();
p.business_name = rst.getString("business_name");
p.business_name_sort = rst.getString("business_name_sort");
result.write(p.business_name);
result.write(p.business_name_sort);
}
By the sounds of it (you haven't given us the relevant code, so I can't be certain), you aren't handling character set conversion properly. Java doesn't perform any automatic character set conversions for you - you've got to do it yourself.
A Java String is already Unicode in memory, so there is nothing to convert on the String itself; the conversion has to happen where characters are turned back into bytes, i.e. when you write the file. FileWriter always uses the platform default encoding, and any character that encoding cannot represent is written out as '?', which is what you are seeing.
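A sketch of that output-side fix, reusing OUTPUT_FILE from the question:
// Encode explicitly as UTF-8 instead of the platform default used by FileWriter
BufferedWriter result = new BufferedWriter(
        new OutputStreamWriter(new FileOutputStream(OUTPUT_FILE), "UTF-8"));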
I have a Postgres 9.2 database whose encoding is UTF-8.
I have an application (written in Java) that updates this database by reading .sql files and executing them against it.
But I found a problem:
In one of those .sql files, I have the following instruction:
insert into usuario(nome)
values('Usuário Padrão');
After executing this, when I go to the table data, what was inserted was this: "Usuário Padrão"
If I execute this command directly from pgAdmin, it is inserted correctly.
So I don't know if it's a problem in the database, or in the program that executes the scripts.
---EDIT---
Here is how I get a JDBC connection:
public static Connection getConnection() throws SQLException{
Connection connection;
String url="jdbc:postgresql://"+servidor+":"+porta+"/"+nomeBanco;
Properties props = new Properties();
props.put("user", usuario);
props.put("password", senha);
connection=DriverManager.getConnection(url,props);
connection.setAutoCommit(false);
return connection;
}
And here is the code I use to read the file, but this looks correct, because if I print the String read from the file, it shows the correct String.
public static String lerArquivo(File arquivo){
StringBuilder conteudo=new StringBuilder();
BufferedReader br = null;
try {
br=new BufferedReader(new FileReader(arquivo));
String linha;
while((linha=br.readLine())!=null){
conteudo.append(linha).append("\n");
}
} catch (IOException e) {
FrameErroBasico f=new FrameErroBasico(null, true);
f.setText("Erro ao ler arquivo.",e);
f.setVisible(true);
}finally{
try{br.close();}catch(Exception e){}
}
return conteudo.toString();
}
This is most likely the problematic line:
br=new BufferedReader(new FileReader(arquivo));
(looks like my crystal ball is still working well!)
To be sure I'd need to see the code that reads the SQL file in, but (as pointed out by jtahlborn) I'd say you're reading the file with an encoding other than the encoding it really has.
PgJDBC uses Unicode on the Java side and takes care of client/server encoding differences by always communicating with the server in utf-8, letting the server do any required encoding conversions. So unless you set client_encoding via your PgJDBC connection - something PgJDBC tries to detect and warn you about - the problem won't be on the PostgreSQL/PgJDBC side, it'll be with misreading the file.
Specifically, it looks like the file is utf-8 encoded, but you are reading it in as if it were latin-1 (ISO-8859-1) encoded. Witness this simple demo in Python, which replicates the results you are getting by converting a native Unicode string to utf-8 and then decoding it as if it were latin-1:
>>> print u'Usuário Padrão'.encode("utf-8").decode("latin-1");
Usuário Padrão
Your application most likely reads the file into a String in a manner that performs inappropriate text encoding conversions from the file encoding to the Unicode text that Java works with internally. There is no reliable way to "auto-detect" the encoding of a file, so you must specify the text encoding of the input when reading it. Java typically defaults to the system encoding, but that can be overridden. If you know the encoding of the file, you should explicitly pass it when opening the file for reading.
You haven't shown the code that reads the file, so it's hard to be more specific, but this is really a Java-side issue, not a PostgreSQL-side one. If you System.out.println your SQL string from Java, you'll see that it is already mangled in your Java String before you send it to the database server.
As jtahlborn said, the right way to read the file is like this:
br=new BufferedReader(new InputStreamReader(new FileInputStream(arquivo),"UTF-8"));
That was my problem; doing it like this, it works like a charm.
I have some strings in Java (originally from an Excel sheet) that I presume are in the Windows-1252 codepage. I want them converted to Java's own Unicode format. The Excel file was parsed using the JXL package, in case that matters.
I will clarify: apparently the strings read from the Excel file already look like some kind of Unicode.
WorkbookSettings ws = new WorkbookSettings();
ws.setCharacterSet(someInteger);
Workbook workbook = Workbook.getWorkbook(new File(filename), ws);
Sheet s = workbook.getSheet(sheet);
row = s.getRow(4);
String contents = row[0].getContents();
This is where contents seems to contain something Unicode: the åäö are multibyte characters, while the ASCII ones are normal single-byte characters. It is most definitely not Latin-1. If I print the contents string with println and redirect it to a hello.txt file, I find that the letter "ö" is represented with two bytes, C3 B6 in hex (195 and 182 in decimal).
[edit]
I have tried the suggestions with different codepages etc. given below, and tried converting from Cp1252 etc. There was some kind of conversion, because I would get some other kind of gibberish instead. As a reference I always printed an "ö" string hard-coded into the source code, to verify that there was not something wrong with my terminal or typefaces or anything. The manually typed "ö" always worked.
[edit]
I also tried WorkbookSettings as suggested in the comments, but I looked in the code for JXL and the character set seems to be ignored by the parsing code. I think the parsing code just looks at whatever encoding the XLS file is supposed to be in.
WorkbookSettings ws = new WorkbookSettings();
ws.setEncoding("CP1250");
Worked for me.
If none of the answers above solves the problem, the trick might be done like this:
String myOutput = new String(myInput, "UTF-8");
This decodes the incoming byte array as UTF-8 (assuming the bytes really are UTF-8 encoded).
When Java parses a file it uses some encoding to read the bytes on the disk and create bytes in memory. The default encoding varies from platform to platform. Java's internal String representation is Unicode already, so if it parses the file with the right encoding then you are already done; just write out the data in any encoding you want.
If your strings appear corrupted when you look at them in Java, it is probably because you are using the wrong encoding to read the data. Excel is probably using UTF-16 (little-endian, I think), but I'd expect a library like JXL to be able to detect it appropriately. I've looked at the Javadocs for JXL and it doesn't do anything with character encodings; I imagine it auto-detects any encodings as it needs to.
Do you just need to write the already loaded strings to a text file? If so, then something like the following will work:
String text = getCP1252Text(); // doesn't matter what the original encoding was, Java always uses Unicode
FileOutputStream fos = new FileOutputStream("test.txt"); // Open file
OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-16"); // Specify character encoding
PrintWriter pw = new PrintWriter(osw);
pw.print(text); // repeat as needed
pw.close(); // cleanup
osw.close();
fos.close();
If your problem is something else please edit your question and provide more details.
You need to specify the correct encoding when the file is parsed - once you have a Java String based on the wrong encoding, it's too late.
JXL allows you to specify the encoding by passing a WorkbookSettings object to the factory method.
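For example, a sketch assuming the file really is Cp1252-encoded (filename is the variable from the question):
WorkbookSettings ws = new WorkbookSettings();
ws.setEncoding("Cp1252"); // the encoding the XLS file was written with
Workbook workbook = Workbook.getWorkbook(new File(filename), ws);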
"windows-1252"/"Cp1252" is not required to be supported by JREs, but is by Sun's (and presumably most others). See the "Supported Encodings" in your JDK documentation. Then it's just a matter of using String, InputStreamReader or similar to decode the bytes into chars.
FileInputStream fis = new FileInputStream (yourFile);
BufferedReader reader = new BufferedReader(new InputStreamReader(fis,"CP1250"));
And do with the reader whatever you'd do directly with the file.
Your description indicates that the encoding is UTF-8 and indeed C3 B6 is the UTF-8 encoding for 'ö'.
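You can confirm this with a couple of lines:
byte[] bytes = "ö".getBytes("UTF-8");
for (byte b : bytes) {
    System.out.printf("%02X ", b); // prints: C3 B6
}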