Parse CSV file with special characters like µ using Java

I have a situation where I have to read a CSV file which contains special characters like 'µ'. This has to be done using Java. I am doing:
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(new FileInputStream(filename), "UTF-8"));
On Windows it runs OK, but in a Red Hat Linux environment it converts those special characters to '?'. Any help is highly appreciated.

Output written to System.out will be encoded using the "platform default encoding", which on Linux is determined from locale environment variables (see the output of the locale command); those in turn are set in user- or system-level configuration files.
On server installations, the default encoding is often ASCII. "µ" is not an ASCII character so it will be converted to "?" when it is printed.
There are a couple of ways to change the default encoding:
Set the Java file.encoding system property when you run your program, e.g.
java -Dfile.encoding=utf-8 yourprogram
Set the LC_CTYPE environment variable before you run your program, for example:
export LC_CTYPE=en_US.UTF-8
java yourprogram
Those methods also change the default encoding for input and file names etc. You can change the encoding specifically for System.out with Java code:
PrintStream originalOut = System.out; // in case you need it later
System.setOut(new PrintStream(System.out, true, "utf-8"));
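If you are not sure which default the JVM actually picked up in a given environment, you can print it from Java itself. A minimal sketch, using only standard library calls (the class name is just for illustration):
import java.nio.charset.Charset;

public class ShowEncoding {
    public static void main(String[] args) {
        // The charset the JVM derived from file.encoding / the locale settings
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
Running this with and without -Dfile.encoding=utf-8 (or with different LC_CTYPE values) makes it easy to see why the Windows and Red Hat machines behave differently.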

Related

How to print UTF8 when running code with java -jar

I'm writing a project which parses a UTF-8 encoded file.
I'm doing it this way
ArrayList<String> al = new ArrayList<>();
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(new FileInputStream(filename), "UTF8"));
String line = null;
while ((line = bufferedReader.readLine()) != null)
{
al.add(line);
}
return al;
The strange thing is that it reads the file properly when I run it in IntelliJ, but not when I run it through java -jar (It gives me garbage values instead of UTF8).
What can I do to either
Run my Java through java -jar in the same environment as IntelliJ, or
Fix my code so that it reads UTF-8 into the string
I think what is going on here is that you just don't have your terminal set up correctly for your default encoding. Basically, if your program runs correctly, then it's grabbing the UTF-8 bytes, storing them as Java strings, then outputting them to the terminal in whatever the default encoding scheme is. To find out what your default encoding scheme is, see this question. Then you need to ensure that the terminal you are running your java -jar command from is compatible with it. For example, see my terminal settings/preferences on my Mac.
Oracle docs give a pretty straightforward answer about Charset:
Standard charsets
Every implementation of the Java platform is required to support the following standard charsets. Consult the release documentation for your implementation to see if any other charsets are supported. The behavior of such optional charsets may differ between implementations.
...
UTF-8
Eight-bit UCS Transformation Format
So you should use new InputStreamReader(new FileInputStream(filename), "UTF-8").
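On Java 7 and later you can also avoid charset-name strings entirely and let the compiler catch mistakes. This is only a sketch of the same read loop, assuming the path is in a filename variable as in the question (the class and method names are just for illustration):
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;

public class Utf8Lines {
    static ArrayList<String> readLines(String filename) throws IOException {
        ArrayList<String> al = new ArrayList<>();
        // StandardCharsets.UTF_8 is a Charset constant, so there is no charset-name typo to get wrong
        try (BufferedReader bufferedReader = Files.newBufferedReader(Paths.get(filename), StandardCharsets.UTF_8)) {
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                al.add(line);
            }
        }
        return al;
    }
}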

How can I change the Standard Out to "UTF-8" in Java

I download a file from a website using a Java program and the header looks like below
Content-Disposition: attachment;filename="Textkürzung.asc";
There is no encoding specified
What I do is after downloading I pass the name of the file to another application for further processing. I use
System.out.println(filename);
In the standard out the string is printed as Textk³rzung.asc
How can I change the Standard Out to "UTF-8" in Java?
I tried to encode to "UTF-8" and the content is still the same
Update:
I was able to fix this without any code change. In the place where I call my jar file from the other application, I did the following:
java -Dfile.encoding=UTF-8 -jar ....
This seems to have fixed the issue.
Thank you all for your support.
The default encoding of System.out is the operating system default. On international versions of Windows this is usually the windows-1252 codepage. If you're running your code on the command line, that is also the encoding the terminal expects, so special characters are displayed correctly. But if you are running the code some other way, or sending the output to a file or another program, it might be expecting a different encoding. In your case, apparently, UTF-8.
You can actually change the encoding of System.out by replacing it:
try {
System.setOut(new PrintStream(new FileOutputStream(FileDescriptor.out), true, "UTF-8"));
} catch (UnsupportedEncodingException e) {
throw new InternalError("VM does not support mandatory encoding UTF-8");
}
This works for cases where using a new PrintStream is not an option, for instance because the output is coming from library code which you cannot change, and where you have no control over system properties, or where changing the default encoding of all files is not appropriate.
The result you're seeing suggests your console expects text to be in Windows "code page 850" encoding - the character ü has Unicode code point U+00FC. The byte value 0xFC renders in Windows code page 850 as ³. So if you want the name to appear correctly on the console then you need to print it using the encoding "Cp850":
PrintWriter consoleOut = new PrintWriter(new OutputStreamWriter(System.out, "Cp850"));
consoleOut.println(filename);
Whether this is what your "other application" expects is a different question - the other app will only see the correct name if it is reading its standard input as Cp850 too.
Try to use:
PrintStream out = new PrintStream(System.out, true, "UTF-8");
out.println(test);

Encoding cp-1252 as utf-8?

I am trying to write a Java app that will run on a Linux server but that will process files generated on legacy Windows machines using cp-1252 as the character set. Is there any way to encode these files as UTF-8 instead of the cp-1252 they are generated as?
If the file names as well as the content are a problem, the easiest way to solve it is setting the locale on the Linux machine to something based on ISO-8859-1 rather than UTF-8. You can use locale -a to list available locales. For example, if you have en_US.iso88591 you could use:
export LANG=en_US.iso88591
This way Java will use ISO-8859-1 for file names, which is probably good enough. To run the Java program you still have to set the file.encoding system property:
java -Dfile.encoding=cp1252 -cp foo.jar:bar.jar blablabla
If no ISO-8859-1 locale is available you can generate one with localedef. Installing it requires root access though. In fact, you could generate a locale that uses CP-1252, if it is available on your system. For example:
sudo localedef -f CP1252 -i en_US en_US.cp1252
export LANG=en_US.cp1252
This way Java should use CP1252 by default for all I/O, including file names.
Expanded further here: http://jonisalonen.com/2012/java-and-file-names-with-invalid-characters/
You can read and write text data in any encoding that you wish. Here's a quick code example:
public static void main(String[] args) throws Exception
{
    // List all supported encodings
    for (String cs : Charset.availableCharsets().keySet())
        System.out.println(cs);
    File file = new File("SomeWindowsFile.txt");
    StringBuilder builder = new StringBuilder();
    // Construct a reader for a specific encoding
    Reader reader = new InputStreamReader(new FileInputStream(file), "windows-1252");
    int c;
    while ((c = reader.read()) != -1)
    {
        builder.append((char) c);
    }
    reader.close();
    String string = builder.toString();
    // Construct a writer for a specific encoding
    Writer writer = new OutputStreamWriter(new FileOutputStream(file), "UTF8");
    writer.write(string);
    writer.flush();
    writer.close();
}
If this still 'chokes' on read, see if you can verify that the original encoding is what you think it is. In this case I've specified windows-1252, which is the Java charset name for cp-1252.
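For a more concise alternative on Java 7 or later, the same conversion can be done with java.nio.file in a few lines. This is only a sketch, reusing the SomeWindowsFile.txt name from the snippet above and assuming the file fits in memory:
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class Cp1252ToUtf8 {
    public static void main(String[] args) throws Exception {
        Path path = Paths.get("SomeWindowsFile.txt");
        // Decode the existing bytes as windows-1252 ...
        List<String> lines = Files.readAllLines(path, Charset.forName("windows-1252"));
        // ... and write them back out encoded as UTF-8
        Files.write(path, lines, StandardCharsets.UTF_8);
    }
}
Note that reading and rewriting line by line like this normalizes line endings to the platform default.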

How do you specify a Java file.encoding value consistent with the underlying Windows code page?

I have a Java application that receives data over a socket using an InputStreamReader. It reports "Cp1252" from its getEncoding method:
/* java.net. */ Socket Sock = ...;
InputStreamReader is = new InputStreamReader(Sock.getInputStream());
System.out.println("Character encoding = " + is.getEncoding());
// Prints "Character encoding = Cp1252"
That doesn't necessarily match what the system reports as its code page. For example:
C:\>chcp
Active code page: 850
The application may receive byte 0x81, which in code page 850 represents the character ü. The program interprets that byte with code page 1252, which doesn't define any character at that value, so I get a question mark instead.
I was able to work around this problem for one customer who used code page 850 by adding another command-line option in the batch file that launches the application:
java.exe -Dfile.encoding=Cp850 ...
But not all my customers use code page 850, of course. How can I get Java to use a code page that's compatible with the underlying Windows system? My preference would be something I could just put in the batch file, leaving the Java code untouched:
ENC=...
java.exe -Dfile.encoding=%ENC% ...
The default encoding used by cmd.exe is Cp850 (or whatever "OEM" CP is native to the OS); the system encoding is Cp1252 (or whatever "ANSI" CP is native to the OS). Gory details here. One way to discover the console encoding would be to do it via native code (see GetConsoleOutputCP for current console encoding; see GetACP for default "ANSI" encoding; etc.).
Altering the encoding via the -D switch is going to affect all your default encoding mechanisms, including redirected stdout/stdin/stderr. It is not an ideal solution.
I came up with this WSH script that can set the console to the system ANSI codepage, but haven't figured out how to programmatically switch to a TrueType font.
'file: setacp.vbs
'usage: cscript /Nologo setacp.vbs
Set objShell = CreateObject("WScript.Shell")
'replace ACP (ANSI) with OEMCP for default console CP
cp = objShell.RegRead("HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001" &_
"\Control\Nls\CodePage\ACP")
WScript.Echo "Switching console code page to " & cp
objShell.Exec "chcp.com " & cp
(This is my first WSH script, so it may be flawed - I'm not familiar with registry read permissions.)
Using a TrueType font is another requirement for using ANSI/Unicode with cmd.exe. I'm going to look at a programmatic switch to a better font when time permits.
Regarding the code snippet, the right answer is to use the InputStreamReader constructor that takes an explicit charset and does the correct conversion. That way it doesn't matter what the system default encoding is; you know you are getting an encoding that corresponds to what you are receiving on the socket.
Then you can specify the encoding when you write out files if you need to, rather than relying on the system encoding. Of course, when files are opened on that system they may still cause issues, but modern Windows systems support UTF-8, so you can write the file out in UTF-8 if you need to (internally Java represents all Strings as 16-bit Unicode).
I would think this is the "right" solution in general, and the one most compatible with the largest range of underlying systems.
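For the socket in the question that simply means passing the charset to the InputStreamReader constructor instead of relying on the default. A minimal sketch, assuming the peer really does send Cp850-encoded text (substitute whatever encoding the protocol specifies; the class and method names are just for illustration):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;

public class SocketReadSketch {
    static void readFrom(Socket sock) throws IOException {
        // Decode the incoming bytes with an explicit charset rather than the platform default
        BufferedReader in = new BufferedReader(
                new InputStreamReader(sock.getInputStream(), "Cp850"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
    }
}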
If the code page value that comes back from the chcp command is the value that you need, you can use the following command to capture it:
C:\>for /F "Tokens=4" %I in ('chcp') Do Set CodePage=%I
This sets the variable CodePage to the code page value returned from chcp
C:\>echo %CodePage%
437
You could use this value in your bat file by prefixing it with Cp
C:\>echo Cp%CodePage%
Cp437
When you put this into a bat file, the %I values in the first command will need to be replaced with %%I.
Windows has the added complication of having two active codepages. In your example both 1252 and 850 are correct, but they depend on the way the program is being run. For GUI applications, Windows will use the ANSI code page, which for Western European languages will typically be 1252. However, the command line will report the OEM codepage which is 850 for the same locales.
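If you would rather detect the console code page from inside the program than from a batch file, one possible approach is to run chcp and wrap System.out accordingly. This is only a Windows-specific sketch; it assumes chcp is reachable via cmd.exe and that its output ends with the numeric code page:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintStream;

public class ConsoleCodePage {
    public static void main(String[] args) throws Exception {
        // Ask the console for its active (OEM) code page; output looks like "Active code page: 850"
        Process p = new ProcessBuilder("cmd.exe", "/c", "chcp").start();
        BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
        String chcpOutput = r.readLine();
        p.waitFor();
        // Keep only the digits and build the Java charset name, e.g. "Cp850"
        String charsetName = "Cp" + chcpOutput.replaceAll("\\D+", "");
        System.setOut(new PrintStream(System.out, true, charsetName));
        System.out.println("Console code page detected as " + charsetName);
    }
}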

Accents in file name using Java on Solaris

I have a problem where I can't write files with accents in the file name on Solaris.
Given the following code:
public static void main(String[] args) {
    System.out.println("Charset = " + Charset.defaultCharset().toString());
    System.out.println("testéörtkuoë");
    FileWriter fw = null;
    try {
        fw = new FileWriter("testéörtkuoë");
        fw.write("testéörtkuoëéörtkuoë");
        fw.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
I get the following output:
Charset = ISO-8859-1
test??rtkuo?
and I get a file called "test??rtkuo?"
Based on info I found on StackOverflow, I tried to call the Java app by adding "-Dfile.encoding=UTF-8" at startup.
This returns following output
Charset = UTF-8
testéörtkuoë
But the filename is still "test??rtkuo?"
Any help is much appreciated.
Stef
All these characters are present in ISO-8859-1. I suspect part of the problem is that the code editor is saving files in a different encoding to the one your operating system is using.
If the editor is using ISO-8859-1, I would expect it to encode ëéö as:
eb e9 f6
If the editor is using UTF-8, I would expect it to encode ëéö as:
c3ab c3a9 c3b6
Other encodings will produce different values.
The source file would be more portable if you used Unicode escape sequences. At least be certain your compiler is using the same encoding as the editor.
Examples:
ë \u00EB
é \u00E9
ö \u00F6
You can look up these values using the Unicode charts.
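For example, the snippet from the question could be written with escapes so it no longer matters which encoding the compiler assumes for the source file (a sketch only; the class name is made up):
import java.io.FileWriter;
import java.io.IOException;

public class AccentEscapes {
    public static void main(String[] args) throws IOException {
        // "testéörtkuoë" written with Unicode escapes: the resulting strings are identical,
        // but the source no longer depends on the editor and compiler agreeing on an encoding
        FileWriter fw = new FileWriter("test\u00E9\u00F6rtkuo\u00EB");
        fw.write("test\u00E9\u00F6rtkuo\u00EB\u00E9\u00F6rtkuo\u00EB");
        fw.close();
    }
}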
Changing the default file encoding using -Dfile.encoding=UTF-8 might have unintended consequences for how the JVM interacts with the system.
There are parallels here with problems you might see on Windows.
I'm unable to reproduce the problem directly - my version of OpenSolaris uses UTF-8 as the default encoding.
If you attempt to list the filenames with the Java I/O APIs, what do you see? Are they encoded correctly? I'm curious as to whether the real problem is with encoding the filenames or with the tools that you are using to check them.
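Something along these lines would show how the JVM itself decodes the directory entries (a minimal sketch, listing the current directory; the class name is just for illustration):
import java.io.File;

public class ListNames {
    public static void main(String[] args) {
        // Print each file name in the current directory as the JVM decodes it
        for (String name : new File(".").list()) {
            System.out.println(name);
        }
    }
}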
What happens when you do:
ls > testéörtkuoë
If that works (writes to the file correctly), then you know you can write to files with accents.
I had a similar problem. Contrary to that example, the program was unable to list the files correctly using System.out.println, even though ls was showing the correct values.
As described in the documentation, the file.encoding system property should not be used to define the charset and, in this case, the JVM ignores it.
The symptom:
I could not type accents in the shell.
ls was showing correct values
File.list() was printing incorrect values
the file.encoding property was not affecting the output
the user.language / user.country properties were not affecting the output
The solution:
Although the LC_* environment variables were set in the shell with values inherited from /etc/default/init, as listed by the set command, locale showed different values.
$ set | grep LC
LC_ALL=pt_BR.ISO8859-1
LC_COLLATE=pt_BR.ISO8859-1
LC_CTYPE=pt_BR.ISO8859-1
LC_MESSAGES=C
LC_MONETARY=pt_BR.ISO8859-1
LC_NUMERIC=pt_BR.ISO8859-1
LC_TIME=pt_BR.ISO8859-1
$ locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
The solution was simply exporting LANG. This environment variable really affects the JVM:
LANG=pt_BR.ISO8859-1
export LANG
Java uses the operating system's default encoding when reading and writing files. One should never rely on that; it's always good practice to specify the encoding explicitly.
In Java you can use the following for reading and writing:
Reading:
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputPath),"UTF-8"));
Writing:
PrintWriter pw = new PrintWriter(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputPath), "UTF-8")));
