UTF-8 encoding CSV file - java

I have a CSV file, which I used Excel to save as CSV UTF-8 encoded.
My Java code reads the file into a byte array, then:
String result = new String(b, 0, b.length, "UTF-8");
But somehow the content "Montréal" becomes "Montr?al" when saved to the DB. What might be the problem?
The environment is unix with:
-bash-4.1$ locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
BTW, it works on my Windows machine: when I run my code there, I see the correct "Montréal" in the DB. So my guess is that the environment has some default locale setting that forces the use of the default encoding.
Thanks

I don't have your complete code, but I tried the following code and it works for me:
String x = "c:/Book2.csv";
BufferedReader br = null;
try {
    br = new BufferedReader(new InputStreamReader(new FileInputStream(x), "UTF-8"));
    String b;
    while ((b = br.readLine()) != null) {
        System.out.println(b);
    }
} finally {
    if (br != null) {
        br.close();
    }
}
If you see "Montr?al" printed on your console, don't worry. It does not mean that the program is not working. Now, you may want to check if your console supports printing UTF-8 characters. So, you can put a debug and inspect the variable to check if has what you want.
If you see correct value in debug and it prints a "?" in your output, you can rest assured that the String variable is having the right value and you can write it back to any file or DB as needed.
If you see "?" when you query your DB, the tool you may be using is not printing the output correctly. Try reading the DB value in java code an check by putting a debug in you code. I usually use putty to query the DB to see the double byte characters correctly. That's all I have, hope that helps.

You have to use ISO/IEC 8859, not UTF-8; if you look at the list of character encodings on the Wikipedia page you'll understand the difference.
Basically, UTF-8 is the common encoding used by Western countries...
Also, check your terminal encoding; maybe the problem is there.
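One quick way to check from the Java side is to print the JVM's default charset, which is derived from the locale environment. A minimal sketch; with the C locale shown in the question this will typically report US-ASCII, which cannot represent "é" and would explain the "?" substitution:
import java.nio.charset.Charset;

public class CharsetCheck {
    public static void main(String[] args) {
        // The default charset is derived from LANG/LC_* at JVM startup
        System.out.println(Charset.defaultCharset());
        System.out.println(System.getProperty("file.encoding"));
    }
}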

Related

Encoding problems in database

I have a PostgreSQL 9.2 database whose encoding is UTF-8.
I have an application (written in Java) that updates this database by reading .sql files and executing them against the database.
But I found a problem:
In one of those .sql files, I have the following instruction:
insert into usuario(nome)
values('Usuário Padrão');
After executing this, when I look at the table data, what was inserted was this: "Usuário Padrão"
If I execute this command directly from pgAdmin, it is created correctly.
So I don't know if it's a problem in the database, or in the program that executes the scripts.
---EDIT---
Here is how I get a JDBC connection:
public static Connection getConnection() throws SQLException {
    Connection connection;
    String url = "jdbc:postgresql://" + servidor + ":" + porta + "/" + nomeBanco;
    Properties props = new Properties();
    props.put("user", usuario);
    props.put("password", senha);
    connection = DriverManager.getConnection(url, props);
    connection.setAutoCommit(false);
    return connection;
}
And here is the code I use to read the file. This looks correct to me, because if I print the String read from the file, it shows the correct String.
public static String lerArquivo(File arquivo) {
    StringBuilder conteudo = new StringBuilder();
    BufferedReader br = null;
    try {
        br = new BufferedReader(new FileReader(arquivo));
        String linha;
        while ((linha = br.readLine()) != null) {
            conteudo.append(linha).append("\n");
        }
    } catch (IOException e) {
        FrameErroBasico f = new FrameErroBasico(null, true);
        f.setText("Erro ao ler arquivo.", e);
        f.setVisible(true);
    } finally {
        try { br.close(); } catch (Exception e) {}
    }
    return conteudo.toString();
}
This is most likely the problematic line:
br = new BufferedReader(new FileReader(arquivo));
(looks like my crystal ball is still working well!)
To be sure I'd need to see the code that reads the SQL file in, but (as pointed out by jtahlborn) I'd say you're reading the file with an encoding other than the encoding it really has.
PgJDBC uses Unicode on the Java side and takes care of client/server encoding differences by always communicating with the server in utf-8, letting the server do any required encoding conversions. So unless you set client_encoding via your PgJDBC connection - something PgJDBC tries to detect and warn you about - the problem won't be on the PostgreSQL/PgJDBC side, it'll be with misreading the file.
Specifically, it looks like the file is utf-8 encoded, but you are reading it in as if it was latin-1 (ISO-8859-1) encoded. Witness this simple demo in Python to replicate the results you are getting by converting a native Unicode string to utf-8 then decoding it as if it was latin-1:
>>> print u'Usuário Padrão'.encode("utf-8").decode("latin-1");
Usuário Padrão
Your application most likely reads the file into a String in a manner that performs an inappropriate text-encoding conversion from the file's encoding to the Unicode text that Java works with internally. There is no reliable way to "auto-detect" the encoding of a file, so you must specify the text encoding of the input when reading it. Java typically defaults to the system encoding, but that can be overridden. If you know the encoding of the file, you should pass it explicitly when opening the file for reading.
You haven't shown the code that reads the file, so it's hard to be more specific, but this is really a Java-side issue, not a PostgreSQL-side one. If you System.out.println your SQL file from Java you'll see that it is already mangled in your Java String before you send it to the database server.
As jtahlborn said, the right way to read the file is like this:
br = new BufferedReader(new InputStreamReader(new FileInputStream(arquivo), "UTF-8"));
That was my problem; reading it like this, it works like a charm.
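For completeness, here is how the lerArquivo method from the question might look with the fix applied (a sketch, keeping the original structure and error handling):
public static String lerArquivo(File arquivo) {
    StringBuilder conteudo = new StringBuilder();
    BufferedReader br = null;
    try {
        // read the file as UTF-8 regardless of the platform default encoding
        br = new BufferedReader(new InputStreamReader(new FileInputStream(arquivo), "UTF-8"));
        String linha;
        while ((linha = br.readLine()) != null) {
            conteudo.append(linha).append("\n");
        }
    } catch (IOException e) {
        FrameErroBasico f = new FrameErroBasico(null, true);
        f.setText("Erro ao ler arquivo.", e);
        f.setVisible(true);
    } finally {
        try { br.close(); } catch (Exception e) {}
    }
    return conteudo.toString();
}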

Error when reading non-English language character from file

I am building an app where users have to guess a secret word, and I have *.txt files in the assets folder. The problem is that the words are in Albanian. Our language uses letters like "ë" and "ç", so whenever I try to read a word containing any of those characters from the file I get some wicked symbol, and I cannot use string.compare() for these characters. I have tried many options with UTF-8 and changed Eclipse settings, but still the same error.
I would really appreciate it if someone has any advice.
The code I use to read the files is:
AssetManager am = getAssets();
strOpenFile = "fjalet.txt";
InputStream fins = am.open(strOpenFile);
reader = new BufferedReader(new InputStreamReader(fins));
ArrayList<String> stringList = new ArrayList<String>();
while ((aDataRow = reader.readLine()) != null) {
    aBuffer += aDataRow + "\n";
    stringList.add(aDataRow);
}
Otherwise the code works fine, except for the mentioned characters.
It seems pretty clear that the default encoding that is in force when you create the InputStreamReader does not match the file.
If the file you are trying to read is UTF-8, then this should work:
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
If the file is not UTF-8, then that won't work. Instead you should use the name of the file's true encoding. (My guess is that it is in ISO/IEC_8859-1 or ISO/IEC_8859-16.)
Once you have figured out what the file's encoding really is, you need to try to understand why it does not correspond to your Java platform's default encoding ... and then make a pragmatic decision on what to do about it. (Should you hard-wire the encoding into your application ... as above? Should you make it a configuration property or command parameter? Should you change the default encoding? Should you change the file?)
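If you go the configuration-property route, a sketch like the following would work; the property name app.input.encoding is made up for illustration, and fins is the stream from the question:
// hypothetical property name; falls back to UTF-8 when not set
String encoding = System.getProperty("app.input.encoding", "UTF-8");
reader = new BufferedReader(new InputStreamReader(fins, encoding));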
You need to determine the character encoding that was used when creating the file, and specify this encoding when reading it. If it's UTF-8, for example, use
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
or
reader = new BufferedReader(new InputStreamReader(fins, StandardCharsets.UTF_8));
if you're on Java 7 or later.
Text editors like Notepad++ have good heuristics to guess what the encoding of a file is. Try opening it with such an editor and see which encoding it has guessed (if the characters appear correctly).
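You can do a rough version of the same guessing in Java by strictly decoding the file's raw bytes with a few candidate charsets. A sketch (the candidate list is illustrative, and the bytes are read from wherever your file lives); note that single-byte charsets such as ISO-8859-1 accept every byte sequence, so this can only rule encodings out, never prove one right:
byte[] raw = Files.readAllBytes(Paths.get("fjalet.txt")); // Java 7+
for (String name : new String[] { "UTF-8", "ISO-8859-1", "windows-1250" }) {
    CharsetDecoder decoder = Charset.forName(name).newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
        decoder.decode(ByteBuffer.wrap(raw)); // throws if the bytes are invalid for this charset
        System.out.println(name + ": decodes cleanly");
    } catch (CharacterCodingException e) {
        System.out.println(name + ": not valid for this file");
    }
}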
You should know the encoding of the file.
The InputStream class reads the file as binary. Although you can interpret the input as characters, that would be implicit guessing, which may be wrong.
The InputStreamReader class converts binary to chars, but it needs to be told the character set.
You should use the following version, which passes the character set explicitly.
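(The snippet presumably looked like this, reusing fins from the question:)
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8")); // pass the file's real charset here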
UPDATE
Don't assume you have a UTF-8 encoded file; that may be wrong. Here in Russia we have encodings such as CP866, WIN1251 and KOI8, which all differ from UTF-8. Probably you have some popular Albanian encoding for text files. Check your OS settings to guess.

Reading Arabic chars from text file

I finished a project in which I read from a text file written with Notepad.
The characters in my text file are in Arabic, and the file's encoding is UTF-8.
When launching my project inside NetBeans (7.0.1) everything seemed to be OK, but when I built the project as a .jar file the characters were displayed this way: ÇáãæÇÞÚááÊØæíÑ.
How can I solve this problem, please?
Most likely you are using JVM default character encoding somewhere. If you are 100% sure your file is encoded using UTF-8, make sure you explicitly specify UTF-8 when reading as well. For example this piece of code is broken:
new FileReader("file.txt")
because it uses the JVM default character encoding, which you might not have control over; apparently NetBeans uses UTF-8 while your operating system defaults to something different. Note that this makes the FileReader class completely useless if you want your code to be portable.
Instead use the following code snippet:
new InputStreamReader(new FileInputStream("file.txt"), "UTF-8");
You are not providing your code, but this should give you a general impression of how it should be implemented.
Maybe this example will help a little. I will try to print the content of a UTF-8 file to the IDE console and to a system console that is encoded in "Cp852".
My d:\data.txt contains ąźżćąś adsfasdf
Let's check this code:
// I will read chars using utf-8 encoding
BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream("d:\\data.txt"), "utf-8"));
// and write to console using Cp852 encoding (works for my Windows 7 console)
PrintWriter out = new PrintWriter(new OutputStreamWriter(System.out,
        "Cp852"), true); // "Cp852" is the encoding used in my console in Win7
// ok, let's read data from the file
String line;
while ((line = in.readLine()) != null) {
    // here I use the IDE encoding
    System.out.println(line);
    // here I print data using Cp852 encoding
    out.println(line);
}
When I run it in Eclipse, the output will be
ąźżćąś adsfasdf
Ą«ľ†Ą? adsfasdf
but the output from the system console will differ (the original post showed it as a screenshot).

Java linux character encoding issue

I'm facing an issue with character encoding in Linux. I'm retrieving content from Amazon S3 that was saved using UTF-8 encoding. The content is in Chinese, and I'm able to see it correctly in the browser.
I'm using the Amazon SDK to retrieve the content and make some updates to it. Here's the code I'm using:
StringBuilder builder = new StringBuilder();
S3Object object = client.getObject(new GetObjectRequest(bucketName, key));
BufferedReader reader = new BufferedReader(
        new InputStreamReader(object.getObjectContent(), "utf-8"));
while (true) {
    String line = reader.readLine();
    if (line == null)
        break;
    builder.append(line);
}
This piece of code works fine in a Windows environment: I was able to update the content and save it back without messing up any Chinese characters in it.
But it acts differently in the Linux environment. The code is unable to translate the characters properly; the Chinese characters are rendered as ???.
I'm not sure what's going wrong here. Any pointers will be appreciated.
-Thanks
The default charset is different for the two OSes you're using.
To start off, you can confirm the difference by printing out the default charset:
Charset.defaultCharset().name()
Somewhere in your code, I think this default charset is being used for some String conversion. The correct procedure would be to track that down and specify UTF-8.
Without seeing that code, I can only suggest the 'cheating' way to do it: set the default charset explicitly, near the beginning of your code, or at Java startup. See here for changing default charset: Setting the default Java character encoding?
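For example, the default charset can be forced at JVM startup with the standard file.encoding property (yourapp.jar is just a placeholder); note that calling System.setProperty("file.encoding", ...) after startup is generally too late, because the default charset is cached when the JVM boots:
java -Dfile.encoding=UTF-8 -jar yourapp.jar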
HTH

Greek String doesn't match regex when read from keyboard

public static void main(String[] args) throws IOException {
    String str1 = "ΔΞ123456";
    System.out.println(str1 + "-" + str1.matches("^\\p{InGreek}{2}\\d{6}")); // ΔΞ123456-true
    BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
    String str2 = br.readLine(); // ΔΞ123456, same as str1
    System.out.println(str2 + "-" + str2.matches("^\\p{InGreek}{2}\\d{6}")); // Δ�123456-false
    System.out.println(str1.equals(str2)); // false
}
The same String doesn't match the regex when read from the keyboard.
What causes this problem, and how can we solve it?
Thanks in advance.
EDIT: I used System.console() for input and output.
public static void main(String[] args) throws IOException {
    PrintWriter pr = System.console().writer();
    String str1 = "ΔΞ123456";
    pr.println(str1 + "-" + str1.matches("^\\p{InGreek}{2}\\d{6}") + "-" + str1.length());
    String str2 = System.console().readLine();
    pr.println(str2 + "-" + str2.matches("^\\p{InGreek}{2}\\d{6}") + "-" + str2.length());
    pr.println("str1.equals(str2)=" + str1.equals(str2));
}
Output:
ΔΞ123456-true-8
ΔΞ123456
ΔΞ123456-true-8
str1.equals(str2)=true
There are multiple places where transcoding errors can take place here.
Ensure that your class is being compiled correctly (unlikely to be an issue in an IDE):
- Ensure that the compiler is using the same encoding as your editor (i.e. if you save as UTF-8, set your compiler to use that encoding).
- Or switch to escaping to the ASCII subset that most encodings are a superset of (i.e. change the string literal to "\u0394\u039e123456").
Ensure you are reading input using the correct encoding:
- Use the Console to read input - this class will detect the console encoding.
- Or configure your Reader to use the correct encoding (probably windows-1253), or set the console to Java's default encoding.
Note that System.console() returns null in an IDE, but there are things you can do about that.
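A sketch combining those two suggestions: use the Console when it is available, and fall back to an explicit charset when streams are redirected (windows-1253 here is just the guess from above):
Console console = System.console();
BufferedReader br;
if (console != null) {
    // Console knows the real console encoding
    br = new BufferedReader(console.reader());
} else {
    // fallback guess for redirected streams (e.g. running inside an IDE)
    br = new BufferedReader(new InputStreamReader(System.in, "windows-1253"));
}
String str2 = br.readLine();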
If you use Windows, it may be caused by the fact that the console character encoding ("OEM code page") is not the same as the system encoding ("ANSI code page").
An InputStreamReader without an explicit encoding parameter assumes the input data is in the system default encoding, so characters read from the console are decoded incorrectly.
In order to correctly read non-US-ASCII characters in the Windows console you need to specify the console encoding explicitly when constructing the InputStreamReader (the required codepage number can be found by executing mode con cp in the command line):
BufferedReader br = new BufferedReader(
        new InputStreamReader(System.in, "CP737"));
The same problem applies to the output; you need to construct the PrintWriter with the proper encoding:
PrintWriter out = new PrintWriter(new OutputStreamWriter(System.out, "CP737"));
Note that since Java 1.6 you can avoid these workarounds by using Console object obtained from System.console(). It provides Reader and Writer with correctly configured encoding as well as some utility methods.
However, System.console() returns null when streams are redirected (for example, when running from IDE). A workaround for this problem can be found in McDowell's answer.
See also:
Code page
I get true in both cases with nothing changed in your code. (I tested with a Greek-layout keyboard - I'm from Greece :])
Probably your keyboard input is arriving as ISO-8859-7 rather than UTF-8. Mine sends UTF-8.
EDIT: I still get true with the addition of the equals check:
System.out.println(str1.equals(str2));
Check if you can get it working by changing everything to Greek in the regional options (if you are using Windows).
Rundll32 Shell32.dll,Control_RunDLL Intl.cpl,,0
If this is the case then you can act accordingly, as 'axtavt' said.
The keyboard is likely not sending the characters as UTF-8, but as the operating system's default character encoding.
See also
Java : How to determine the correct charset encoding of a stream
Java App : Unable to read iso-8859-1 encoded file correctly
