Encoding problems in database

Encoding problems in database - java

I have a postgres 9.2 database, which encoding is UTF-8.
I have an application(written in java) to update this database, reading .sql files and executing them in the database.
But i found a problem:
In one of those .sql files, i have the following instruction:
insert into usuario(nome)
values('Usuário Padrão');
After executing this, when i go to the table data, what was inserted was this: "UsuÃ¡rio PadrÃ£o"
If i execute this command directly from pgAdmin, it creates correctly.
So i don't know if it's a problem in database, or in the program that executes the scripts.
---EDIT---
Here is how i get a JDBC connection:
public static Connection getConnection() throws SQLException{
Connection connection;
String url="jdbc:postgresql://"+servidor+":"+porta+"/"+nomeBanco;
Properties props = new Properties();
props.put("user", usuario);
props.put("password", senha);
connection=DriverManager.getConnection(url,props);
connection.setAutoCommit(false);
return connection;
}
And here is the code i use to read the file, but this looks correct, because if i print the String read from the file, it shows the correct String.
public static String lerArquivo(File arquivo){
StringBuilder conteudo=new StringBuilder();
BufferedReader br = null;
try {
br=new BufferedReader(new FileReader(arquivo));
String linha;
while((linha=br.readLine())!=null){
conteudo.append(linha).append("\n");
}
} catch (IOException e) {
FrameErroBasico f=new FrameErroBasico(null, true);
f.setText("Erro ao ler arquivo.",e);
f.setVisible(true);
}finally{
try{br.close();}catch(Exception e){}
}
return conteudo.toString();
}

This is most likely the problematic line:
br=new BufferedReader(new InputStreamReader(new FileInputStream(arquivo), "UTF-8"));
(looks like my crystal ball is still working well!)

To be sure I'd need to see the code that reads the SQL file in, but (as pointed out by jtahlborn) I'd say you're reading the file with an encoding other than the encoding it really has.
PgJDBC uses Unicode on the Java side and takes care of client/server encoding differences by always communicating with the server in utf-8, letting the server do any required encoding conversions. So unless you set client_encoding via your PgJDBC connection - something PgJDBC tries to detect and warn you about - the problem won't be on the PostgreSQL/PgJDBC side, it'll be with misreading the file.
Specifically, it looks like the file is utf-8 encoded, but you are reading it in as if it was latin-1 (ISO-8859-1) encoded. Witness this simple demo in Python to replicate the results you are getting by converting a native Unicode string to utf-8 then decoding it as if it was latin-1:
>>> print u'Usuário Padrão'.encode("utf-8").decode("latin-1");
UsuÃ¡rio PadrÃ£o
Your application most likely reads the file into a String in a manner that performs inappropriate text encoding conversions from the file encoding to the unicode text that Java works with internally. There is no reliable way to "auto-detect" the encoding of a file, so you must specify the text encoding of the input when reading a file. Java typically defaults to the system encoding, but that can be overridden. If you know the encoding of the file, you should explicitly pass it when opening the file for reading
You haven't shown the code that reads the file so it's hard to be more specific, but this is really a Java side issue not PostgreSQL-side. If you System.out.println your SQL file from Java you'll see that it already mangled in your Java string before you send it to the database server.

As jtahlborn said, the right way to read the file is like this:
br=new BufferedReader(new InputStreamReader(new FileInputStream(arquivo),"UTF-8"));
That was my problem, doing like this, it works like a charm.

Related

UTF-8 encoding CSV file

I have a CSV file, which using Excel to save as CSV UTF-8 encoded.
I have my java code read the file as byte array
then
String result = new String(b, 0, b.length, "UTF-8");
But somehow the content "Montréal" becomes "Montr?al" when save to DB, what might be the problem?
The environment is unix with:
-bash-4.1$ locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
BTW it works on my windows machine when I run my code and see in DB the correct "Montréal". So my guess is that the environment has some default locale setting that forces the use of dedault encoding.
Thanks

I don't have your complete code, but I tried the following code and it works for me:
String x = "c:/Book2.csv";
BufferedReader br = null;
try{
br = new BufferedReader(new InputStreamReader(new FileInputStream(
x), "UTF8"));
String b;
while ((b = br.readLine()) != null) {
System.out.println(b);
}
} finally {
if (br != null){
br.close();
}
}
If you see "Montr?al" printed on your console, don't worry. It does not mean that the program is not working. Now, you may want to check if your console supports printing UTF-8 characters. So, you can put a debug and inspect the variable to check if has what you want.
If you see correct value in debug and it prints a "?" in your output, you can rest assured that the String variable is having the right value and you can write it back to any file or DB as needed.
If you see "?" when you query your DB, the tool you may be using is not printing the output correctly. Try reading the DB value in java code an check by putting a debug in you code. I usually use putty to query the DB to see the double byte characters correctly. That's all I have, hope that helps.

You have to use ISO/IEC 8859, not UTF-8, if you look at the list of character encodings on Wikipedia page you'll understand the difference.
Basically, UTF-8 its the commom encoding used by western country...
Also, you can check your terminal encoding, maybe the problem is there.

Java Character Encoding Writing to Text File

My Issue is as follows:
Having issue with character encoding when writing to text file. The issue is characters are not showing the intended value. for example I am writing ' '(which is probably a Tab character) and 'Â' is what is displayed in the text file.
Background information
This data is being stored on a MSQL Database. The Database Collation is SQL_Latin1_General_CP1_CI_AS and the fields are varchar. I've come to learn the collation and type determine what character encoding is used on the database side. Values are stored correctly so no issues here.
My Java application runs queries to pull the data from the DB and this too also looks OK. I have debugged the code and seen all the Strings have the correct representation before writing to the file.
Next I write the text to the .TXT file using a OutputStreamWriter as follows:
public OfferFileBuilder(String clientAppName, boolean isAppend) throws IOException, URISyntaxException {
String exportFileLocation = getExportedFileLocation();
File offerFile = new File(getDatedFileName(exportFileLocation+"/"+clientAppName+"_OFFERRECORDS"));
bufferedWriter = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(offerFile, isAppend), "UTF-8"));
}
Now once I open up the file on the Linux server by running cat command on file or open up the file using notepad++ some of the characters are incorrectly displaying.
I've ran the following commands on the server to see its encoding locale charmap which prints UTF-8, echo $LANG which prints en_US.UTF-8, and echo $LC_CTYPE` prints nothing.
Here is what I've attempted so far.
I've attempted to change the Character encoding used by the OutputStreamWriter I've tried UTF-8, and CP1252. When switching encoding some characters are fixed when others are then improperly displayed.
My Question is this:
Which encoding should my OutputStreamWriter be using?
(Bonus Questions) how are we supposed to avoid issues like this from happening. The rule of thumb i was provided was use UTF-8 and you will never run into problems, but this isn't the case for me right now.

running file -bi command on the server revealed that the file was encoded with ascii instead of utf8. Removing the file completely and rerunning the process fixed this for me.

jar java program encoded

I have made a small java program in netbeans that's read a text file. When I run the program in my netbeans, everything goes fine. So I made an executable jar of my program, but when I run that jar I get wired characters when the program read the text file.
For example:
I get "CÃ©leste" but it has to be Céleste.
That's my code to read the file:
private void readFWFile(File file){
try {
FileReader fr = new FileReader(file);
BufferedReader br = new BufferedReader(fr);
String ligne;
while((ligne = br.readLine()) != null) {
System.out.println(ligne);
}
fr.close();
} catch (IOException ex) {
Logger.getLogger(FWFileReader.class.getName()).log(Level.SEVERE, null, ex);
}
}

The FileReader class uses the "platform default character encoding" to decode bytes in the file into characters. It seems that your file is encoded in UTF-8, while the default encoding is something else on your system.
You can read the file in a specific encoding using InputStreamReader:
Reader fr = new InputStreamReader(new FileInputStream(file), "UTF-8");

This kind of output is caused by a mismatch somewhere - your file is encoded in UTF-8 but the console where you print the data expects a single-byte encoding such as Windows-1252.
You need to (a) ensure you read the file as UTF-8 and (b) ensure you write to the console using the encoding it expects.
FileReader always uses the platform default encoding when reading files. If this is UTF-8 then
your Java code reads the file as UTF-8 and sees Céleste
you then print out that data as UTF-8
in NetBeans the console clearly expects UTF-8 and displays the data correctly
outside NetBeans the console expects a single-byte encoding and displays the incorrect rendering.
Or if your default encoding is a single byte one then
your Java code reads the file as a single byte encoding and sees CÃ©leste
you then print out that data as the same encoding
NetBeans treats the bytes you wrote as UTF-8 and displays Céleste
outside NetBeans you see the wrong data you originally read.
Use an InputStreamReader with a FileInputStream to ensure you read the data in the correct encoding, and make sure that when you print data to the console you do so using the encoding that the console expects.

Reading Arabic chars from text file

I had finished a project in which I read from a text file written with notepad.
The characters in my text file are in Arabic language,and the file encoding type is UTF-8.
When launching my project inside Netbeans(7.0.1) everything seemed to be ok,but when I built the project as a (.jar) file the characters where displayed in this way: ÇáãæÇÞÚááÊØæíÑ.
How could I solve This problem please?

Most likely you are using JVM default character encoding somewhere. If you are 100% sure your file is encoded using UTF-8, make sure you explicitly specify UTF-8 when reading as well. For example this piece of code is broken:
new FileReader("file.txt")
because it uses JVM default character encoding - which you might not have control over and apparently Netbeans uses UTF-8 while your operating system defines something different. Note that this makes FileReader class completely useless if you want your code to be portable.
Instead use the following code snippet:
new InputStreamReader(new FileInputStream("file.txt"), "UTF-8");
You are not providing your code, but this should give you a general impression how this should be implemented.

Maybe this example will help a little. I will try to print content of utf-8 file to IDE console and system console that is encoded in "Cp852".
My d:\data.txt contains ąźżćąś adsfasdf
Lets check this code
//I will read chars using utf-8 encoding
BufferedReader in = new BufferedReader(new InputStreamReader(
new FileInputStream("d:\\data.txt"), "utf-8"));
//and write to console using Cp852 encoding (works for my windows7 console)
PrintWriter out = new PrintWriter(new OutputStreamWriter(System.out,
"Cp852"),true); // "Cp852" is coding used in
// my console in Win7
// ok, lets read data from file
String line;
while ((line = in.readLine()) != null) {
// here I use IDE encoding
System.out.println(line);
// here I print data using Cp852 encoding
out.println(line);
}
When I run it in Eclipse output will be
ąźżćąś adsfasdf
Ą«ľ†Ą? adsfasdf
but output from system console will be

Encoding problem from database to javamail

I have a small application which reads from a Oracle 9i database and sends the data via e-mail, using JavaMail. The database has NLS_CHARACTERSET = "WE8MSWIN1252", that's it, CP1252.
If I run the app without any parameter, it works fine and the e-mails are sent correctly. However, I've a requeriment that enforces me to run the app with the -Dfile-encoding=utf8 parameter, which results in the text being sent with corrupted characters.
I've tried to change the encoding of the data read from the database, with:
String textToSend = new String(textRead.getBytes("CP1252"), "UTF-8");
But it doesn't help. I've tried all the possible combinations with CP1252, windows-1252, ISO-8859-1 and UTF-8, but still had no luck.
Any ideas?
Update to clarify my problem: when I do the following:
Statement stat = connection.createStatement(ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY);
stat.executeQuery("SELECT blah FROM blahblah ...");
ResultSet rs = stat.getResultSet();
String textRead = rs.getString("whatever");
I get textRead corrupted, because the database is CP1252 and the application is running in UTF-8. Another approach that I've tried but also failed:
InputStream is = rs.getBinaryStream("whatever");
Writer writer = new StringWriter();
char[] buffer = new char[1024];
Reader reader = new BufferedReader(new InputStreamReader(stream, "UTF-8"));
while ((n = reader.read(buffer)) != -1) {
writer.write(buffer, 0, n);
}
String textRead = writer.toString();

Your driver should do the conversion automatically and since cp-1252 is a subset of UTF-8 you shouldn't lose information.
Can you try the following: get the String with ResultSet.getString, write the string to a file. Open the file with an editor with which you can specify UTF-8 character set (jEdit for example).
The file should contain UTF-8 data.

You seem to get lost in charset space -- i understand this... :-)
This line
String textToSend = new String(textRead.getBytes("CP1252"), "UTF-8");
does not make much sense. You have already text, convert it to a "cp1252" encoded byte []. Then you tell the VM to treat the bytes as if they were "UTF-8" (which is a lie...).
In short: if you have a String as in textRead, you don't have to convert it at all. If something goes wrong, either the text is already rotten (look at it in the debugger) or gets rotten in the API later on. Check this and come back with more detail? Where is the text that is wrong and where do you exactly read it from or write it to...

Your database data is in windows-1252. So -- assuming it's being handed back verbatim by the JDBC driver -- when you try to convert it to a Java String, that's the charset you need to specify:
Statement stat = connection.createStatement(ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY);
ResultSet rs = stat.executeQuery("SELECT blah FROM blahblah ...");
byte[] rawbytes = rs.getBytes("whatever");
String textRead = new String(rawbytes, "windows-1252");
Is part of the requirement that the data be mailed out as UTF-8? If so, the UTF-8 part needs to occur on the output side, not the input side. When you have String data in Java, it's stored internally as UTF-16. So when you serialize it out to the MimeMessage, you again need to pick a charset:
mimebodypart.setText(textRead, "UTF-8");

I had the same problem:
Orace database using WE8MSWIN1252 charset, some VARCHAR2 column data/text containing the euro-sign (€) in it. Sending the text using JavaMail gave problems on the euro-sign.
Finally it works. Two important things you should check/do:
be sure to use the most recent Oracle JDBC driver for the Java version you use.
specify the charset (prefer: UTF-8) in JavaMail, e.g.MimeMessage.setSubject(String text, "UTF-8")MimeMessage.setText(String text, "UTF-8").That way the email text gets UTF-8 encoded.NOTE: Because RFC 821 restricts mail messages to 7-bit US-ASCII, 8-bit character or binary data needs to be encoded into a 7-bit format. The email header "Content-Transfer-Encoding" specifies the encoding used. For more information: http://www.w3.org/Protocols/rfc1341/5_Content-Transfer-Encoding.html

Can you do the conversion in the database? Instead of:
SELECT blah FROM blahblah
Try
SELECT convert(blah, 'WE8MSWIN1252', 'UTF8') FROM blahblah

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.