Process.getInputStream() encoding issue - java

I have the following lines of code and I want to use the proper encoding scheme:
Process process = processBuilder.start();
InputStreamReader isr = new InputStreamReader(process.getInputStream());
My Eclipse uses Windows-1252 encoding by default, while when I run the chcp command on the command prompt, the result is code page 437.
This means the stream of bytes that I get from the command line is encoded with a different scheme (code page 437) than the one used by the JVM (Windows-1252). How do I synchronise the two when I want my application to run across different platforms? (I cannot hardcode code page 437 in my Java application.)

Eclipse has nothing to do with it. At runtime, your constants are UTF-16 strings, independent of whatever encoding for Java source you have set in Eclipse. Your program that reads from the stream simply has to know the encoding in use in the process you launch. That will, as you note, depend on what sort of computer you are running on, what your settings are, and choices made by the creator of the program you launch. I'd expect the literal values of the bytes written by a native non-_UNICODE program on Windows to appear on the stream. If the program you are running was built as a _UNICODE application, it's an interesting question what would appear on the stream ... UTF-16? In any case, any programmer creating any command-line program can send whatever they like down the standard output stream: even if every other program on the system is coughing up, say, Windows-1252, one particular program might write UTF-8 and be documented to do so for use with > redirection into a file. You just have to know.
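To make that concrete, here is a minimal sketch of reading a process's output with an explicitly chosen charset instead of the JVM default. The command and the hardcoded "Cp437" are assumptions for illustration; a portable application would have to discover or configure the right encoding per platform:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class ProcessOutputReader {
    public static void main(String[] args) throws Exception {
        Process process = new ProcessBuilder("cmd", "/c", "dir").start();
        // The charset must match whatever the launched program actually
        // writes; here it is named explicitly instead of defaulting.
        Charset consoleCharset = Charset.forName("Cp437");
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), consoleCharset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        process.waitFor();
    }
}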


Why does System.out.println() in Java print to the console?

I read several posts online explaining what System.out.println() is in Java. Most of them go like this:
System is a final class in the java.lang package.
out is a public static object inside the System class of type PrintStream.
println() prints a line of text to the output stream.
My question is: when we do System.out.println() in our code, why does it end up writing to the console? This article explains how we can make it write to a file by calling System.setOut(). So my question translates to: where is System.setOut() called to direct its output to the console?
I checked System.setOut()'s source. It makes a call to setOut0() which is a native method. This method is directly called inside the initializeSystemClass() method by passing it fdOut which is a FileOutputStream defined here. I did not find a console output stream passed to setOut0() anywhere, nor did I find a call to the non-native setOut() done anywhere. Is it done somewhere else outside the System class by the JVM while starting execution? If so, can someone point me to it?
When we do System.out.println() in our code, why does it end up writing to the console?
In any POSIX compliant shell, each process gets three "standard" streams when the shell starts it:
The "standard input" stream is for reading input.
The "standard output" stream is for writing ordinary output.
The "standard error" stream is for writing error output.
(The same idea is used in many non-POSIX-compliant shells as well.)
For an interactive POSIX shell, the default is for these streams to read from and write to the shell's "console" ... which could be a physical console, but is more likely to be a "terminal emulator" on the user's (ultimate) desktop machine. (Details vary.)
A POSIX shell allows you to redirect the standard streams in various ways; e.g.
$ some-command < file # read stdin from 'file'
$ some-command > file # write stdout to 'file'
$ some-command 2> file # write stderr to 'file'
$ some-command << EOF # read stdin from a 'here' document
lines of input
...
EOF
$ some-command | another # connect stdout for one command to
# stdin for the next one in a pipeline
and so on. If you do this, one or more of the standard streams is NOT connected to the console.
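As a quick illustration from the Java side, here is a minimal sketch of a filter-style program that uses all three standard streams; whether they are connected to a console, a file or a pipe is decided entirely by whoever launches it:
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class UppercaseFilter {
    public static void main(String[] args) throws Exception {
        // System.in, System.out and System.err are Java's views of the
        // three standard streams handed to the process by its parent.
        BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = stdin.readLine()) != null) {
            System.out.println(line.toUpperCase()); // ordinary output
        }
        System.err.println("done");                 // diagnostic output
    }
}
Running java UppercaseFilter < notes.txt > out.txt still shows "done" on the terminal, because only stdout was redirected.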
Further reading:
"What are stdin, stdout and stderr on Linux?"
"Standard Streams"
So how does this relate to the question?
When a Java program starts, the System.in/out/err streams are connected to the standard input / output / error streams specified by the parent process, typically a shell.
In the case of System.out, that could be the console (however you define that) or it could be a file, or another program or ... /dev/null. But where the output goes is determined by how the JVM was launched.
So, the literal answer is "because that is what the parent process has told the Java program to do".
How does the shell internally communicate with the JVM to set up standard input / output on both Windows and Linux?
This is what happens with Linux, UNIX, Mac OSX and similar. (I don't know for Windows ... but I imagine it is similar.)
Suppose that the shell is going to run aaa > bbb.txt.
The parent shell forks a child process, which starts out as a copy of the parent shell.
The child process closes file descriptor 1 (the standard output file descriptor).
The child process opens "bbb.txt" for writing on file descriptor 1.
The child process execs the "aaa" command, and it becomes the "aaa" command process. The file descriptors 0, 1, and 2 are preserved by the exec call.
The "aaa" command starts ...
When the "aaa" command starts, it finds that file descriptors 0 and 2 (stdin and stderr) refer to the same "file" as the parent shell. File descriptor 1 (stdout) refers to "bbb.txt".
The same thing happens when "aaa" is the java command.
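Java exposes the same plumbing through ProcessBuilder. As a rough sketch (the "aaa" command is just the placeholder from the example above), this is how a Java parent would reproduce aaa > bbb.txt:
import java.io.File;

public class RedirectDemo {
    public static void main(String[] args) throws Exception {
        ProcessBuilder pb = new ProcessBuilder("aaa");
        // Equivalent of the shell's "> bbb.txt": the child's file
        // descriptor 1 (stdout) is attached to the file before it runs.
        pb.redirectOutput(new File("bbb.txt"));
        // stdin and stderr stay connected to this JVM's own streams,
        // like file descriptors 0 and 2 surviving the exec call.
        pb.redirectInput(ProcessBuilder.Redirect.INHERIT);
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        pb.start().waitFor();
    }
}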
It doesn't have to. We can redirect it somewhere else. Here is the code to redirect it into a file:
PrintStream output = new PrintStream(new File("output.txt"));
System.setOut(output);
System.out.println("This will be written to file");
By default, System.out (the standard output stream) writes to the console in Java.
System.out.println does not print to the console, it prints to the standard output stream (System.out is Java's name for the standard output stream). The standard output stream is usually the console, but it doesn't have to be. The Java runtime just wraps the standard output stream of the operating system in a nice Java object.
A non-interactive program often uses the standard input and output channels: it reads input from the standard input stream, does some operations on it, and produces output on the standard output stream. The standard output stream can be the console, but it can also be piped to the standard input stream of another program or to a file. In the end, the operating system running the program decides where the standard output stream's output goes.
For example, in Unix terminals you can do something like:
java -jar your.program.jar > output.txt
and store the output of your program in a text file, or
java -jar your.program.jar | grep hello
to only display the lines of the output which contain 'hello'. Only if you don't specify another destination does the standard output stream write to the console.

java println output encoding on windows

This question originates from a question I asked here. There were suggestions that this may be a Java problem instead so I posted another question.
What determines the output encoding of the System.out.println command? Basically, I'm executing a Python program from the command prompt, which spawns a child process running Java (the Stanford parser). It takes my input document encoded in UTF-8, processes it, and printlns my input in specific formats. Back in the Python program, I could not decode the output from stdout with UTF-8. This works on OSX, so I suspect it could be a console encoding issue.
I have tried setting chcp 65001 and changing the font type, but these do not work.
It uses the default encoding, which on Windows will be an obsolete "ANSI" encoding. The documented way to change this is "via the operating system", though the documentation goes no further than that. You can also call System.setOut to provide your own mechanism:
System.setOut(new PrintStream(System.out, true, "UTF-8")); // declares UnsupportedEncodingException
See here for more depth.

java/c++ How does output work? cout<< System.out.print

I am mostly concerned with Linux but answers involving Windows are welcome.
When I use System.out.println or cout<<, what is actually happening? And what happens when I do a cout in a GUI application (does it go anywhere)?
One case that I am interested in is the NetBeans IDE. When I run a Java program in NetBeans, what makes it possible for the IDE to "steal" the output from the program and display it?
Update/Sidenote
http://www.linfo.org/standard_output.html
One of the features of standard output is that it has a default destination but can easily be redirected (i.e., diverted) to another destination. That default destination is the display screen on the computer that initiated the program. Because the standard streams are plain text, they are by definition human readable.
What is meant by "initiate the program"?
I'm not very familiar with how the execution of a program begins, but in the case of my NetBeans example it's pretty clear that the IDE initiated the program. So what does that mean? When the program is being set up to be executed, is there some metadata floating around letting the OS know that NetBeans is initiating the program?
When the program gets executed, three special file descriptors, stdin, stdout and stderr, are associated with some device to determine how input and output are managed. If you execute a program from a terminal shell, stdin is associated with the keyboard, and stdout and stderr with the terminal window. When you execute the program in a development environment, stdout and stderr are usually displayed in some special console tabs. In other situations the output goes to some log file or maybe gets discarded...
System.out and cout are the objects representing the stdout stream in Java and C++.
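This is essentially how an IDE like NetBeans "steals" the output: it starts your program as a child process whose stdout is a pipe, reads that pipe, and renders the text in its console tab. A minimal sketch of the idea (the command line is a placeholder):
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class MiniIdeConsole {
    public static void main(String[] args) throws Exception {
        // Launch the target program; by default its stdout is a pipe
        // that the parent can read.
        Process child = new ProcessBuilder("java", "-version")
                .redirectErrorStream(true) // merge stderr into stdout
                .start();
        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(child.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                // A real IDE would append this to its console widget.
                System.out.println("[console tab] " + line);
            }
        }
        child.waitFor();
    }
}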

Encoding issue on filename with Java 7 on OSX with jnlp/webstart

I have this problem that has been dropped on me, and it has been a couple of days of unsuccessful searches and workaround attempts.
I have an internal Java Swing program, distributed by JNLP/Web Start to OSX and Windows computers, that, among other things, downloads some files from WebDAV.
Recently, on a test machine with OSX 10.8 and Java 7, filenames and directory names with accented characters started having those replaced by question marks.
No problem on OSX with versions of Java before 7.
example :
XXXYYY_è_ABCD/
becomes
XXXYYY_?_ABCD/
Using java.text.Normalizer (NFD, NFC, NFKD, NFKC) on the original string, the result is different but still wrong:
XXXYYY_e?_ABCD/
or
XXXYYY_e_ABCD/
I know, from correspondence between [andrew.brygin at oracle.com] and [mik3hall at gmail.com] that
Yes, file.encoding is set based on the locale that the JVM is running on, and if you run your Java VM in an xxxx.UTF-8 locale, the file.encoding should be UTF-8; set to MacRoman it will be problematic. So I believe Oracle/OpenJDK 7 behaves correctly. That said, as Andrew Thompson pointed out, if all previous Apple JDK releases use MacRoman as the file.encoding for the english/UTF-8 locale, there is a "compatibility" concern here; it might be worth putting something in the release notes to give Oracle/OpenJDK MacOS users a heads up.
original mail
From Joni Salonen's blog (java-and-file-names-with-invalid-characters) I know that:
You probably know that Java uses a “default character encoding” to convert binary data to Strings. To read or write text using another encoding you can use an InputStreamReader or OutputStreamWriter. But for data-to-text conversions deep in the API you have no choice but to change the default encoding.
and
What about file.encoding?
The file.encoding system property can also be used to set the default character encoding that Java uses for I/O. Unfortunately it seems to have no effect on how file names are decoded into Strings.
Executing locale from inside the JNLP invariably prints:
LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
The most similar problem on Stack Overflow with a solution is this:
encoding-issues-on-java-7-file-names-in-os-x
but the solution is wrapping the execution of the java program in a script with
#!/bin/bash
export LC_CTYPE="UTF-8" # Try other options if this doesn't work
exec java your.program.Here
but I don't think this option is available to me because of Web Start, and I haven't found any way to set the LC_CTYPE environment variable from within the program.
Any solutions or workarounds?
P.S. :
If we run the program directly from the shell, it writes the file/directory correctly, even on OSX 10.8 + Java 7.
The problem appears only with the combination of JNLP+OSX+Java7
I take it that it's acceptable to have a maximal ASCII representation of the file name, which works in virtually any encoding.
First, you want to use specifically NFKD, so that the maximum information is retained in the ASCII form. For example, "2⁵" becomes "25" rather than just "2", and the ligature "ﬁ" becomes "fi" rather than being dropped entirely, once the non-ASCII and non-control characters are filtered out.
// Requires: import java.text.Normalizer;
String str = "XXXYYY_è_ABCD/";
str = Normalizer.normalize(str, Normalizer.Form.NFKD); // decompose "è" into "e" + combining accent
str = str.replaceAll("[^\\x20-\\x7E]", "");            // keep only printable ASCII
// The file name will be XXXYYY_e_ABCD/ no matter what the system encoding is
You would then always pass filenames through this filter to get their filesystem name. The only thing you lose is some uniqueness, i.e. the file asdé.txt is the same as asde.txt, and in this scheme they cannot be differentiated.
EDIT: After experimenting with OS X some more I realized my answer was totally wrong, so I'm redoing it.
If your JVM supports -Dfile.encoding=UTF-8 on the JVM command line, that might fix the issue. I believe that is a standard property but I'm not certain about that.
HFS Plus, like other POSIX-compliant file systems, stores filenames as bytes. But unlike Linux's ext3 filesystem, it forces filenames to be valid decomposed UTF-8. This can be seen here with the Python interpreter on my OS X system, starting in an empty directory.
$ python
Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53)
>>> import os
>>> os.mkdir('\xc3\xa8')
>>> os.mkdir('e\xcc\x80')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OSError: [Errno 17] File exists: 'e\xcc\x80'
>>> os.mkdir('\x8f')
>>> os.listdir('.')
['%8F', 'e\xcc\x80']
>>> ^D
$ ls
%8F è
This proves that the directory name on your filesystem cannot be Mac-Roman encoded (i.e. with byte value 8F where the è is seen), as long as it's an HFS Plus filesystem. But of course, the JVM is not assured of an HFS Plus filesystem, and SMB and NFS do not have the same encoding guarantees, so the JVM should not assume this scheme.
Therefore, you have to convince the JVM to interpret file and directory names with UTF-8 encoding, in order to read the names as java.lang.String objects correctly.
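Since HFS Plus stores names in decomposed form, a related gotcha is that a name you typed as the single character "è" may come back from the filesystem as "e" plus a combining accent, and a naive String comparison will fail. As a small illustrative sketch (not specific to the original JNLP problem), you can normalize both sides to NFC before comparing:
import java.text.Normalizer;

public class ComposedCompare {
    public static void main(String[] args) {
        String typed  = "XXXYYY_\u00E8_ABCD";   // precomposed è (NFC)
        String fromFs = "XXXYYY_e\u0300_ABCD";  // e + combining grave (NFD), as HFS Plus stores it

        System.out.println(typed.equals(fromFs));   // false: different code points
        System.out.println(Normalizer.normalize(fromFs, Normalizer.Form.NFC)
                .equals(typed));                    // true after normalizing to NFC
    }
}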
Shot in the dark: file.encoding does not influence the way the file names are created, just how the content gets written into the file. Check this guy here: http://jonisalonen.com/2012/java-and-file-names-with-invalid-characters/
Here is a short entry from Apple: http://developer.apple.com/library/mac/#qa/qa1173/_index.html
Comparing this to http://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html I would assume you want to use
normalized_string = Normalizer.normalize(target_chars, Normalizer.Form.NFD);
to normalize the file names before you pass them to the File constructor. Does this help?
I don't think there is a real solution to this problem, right now.
In the meantime I came to the conclusion that the "C" environment variables printed from inside the program come from the Java Web Start sandbox, and (by design, apparently) you can't influence those via the JNLP.
The accepted (as accepted by the company) workaround/compromise was to launch the JNLP using javaws from a bash script.
Apparently, launching the JNLP from the browser or from Finder creates a new sandbox environment with LANG not set (so it falls back to "C", which is equivalent to ASCII).
Launching the JNLP from the command line instead picks up the right LANG from the system default, inheriting it from the shell.
This at least preserves the auto-updating feature of the JNLP and its dependencies.
Anyway, we sent a bug report to Oracle, but personally I'm not expecting it to be resolved anytime soon, if ever.
It's a bug in the old-school java.io.File API, maybe just on a Mac? Anyway, the new java.nio API works much better. I have several files containing Unicode characters and content that failed to load using java.io.File and related classes. After converting all my code to use java.nio.Path, EVERYTHING started working. And I replaced org.apache.commons.io.FileUtils (which has the same problem) with java.nio.Files...
...and be sure to read and write the content of the file using an appropriate charset, for example:
Files.readAllLines(myPath, StandardCharsets.UTF_8)
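For completeness, a minimal sketch of the java.nio style this answer advocates; the file name and content are placeholder assumptions:
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class NioUnicodeDemo {
    public static void main(String[] args) throws Exception {
        // The Path API carries the file name; the explicit charset covers the content.
        Path myPath = Paths.get("XXXYYY_\u00E8_ABCD.txt");   // "XXXYYY_è_ABCD.txt"
        Files.write(myPath, Arrays.asList("ligne accentu\u00E9e"), StandardCharsets.UTF_8);
        for (String line : Files.readAllLines(myPath, StandardCharsets.UTF_8)) {
            System.out.println(line);
        }
    }
}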

Java application failing on special characters

An application I am working on reads information from files to populate a database. Some of the characters in the files are non-English, for example accented French characters.
The application is working fine on Windows, but on our Solaris machine it fails to recognise the special characters and throws an exception. For example, when it encounters the accented e in "Gérer" it says:
Encountered: "\u0161" (353), after : "\'G\u00c3\u00a9rer les mod\u00c3"
(an exception which is thrown from our application)
I suspect that in order to stop this from happening I need to change the file.encoding property of the JVM. I tried to do this via System.setProperty() but it has not stopped the error from occurring.
Are there any suggestions for what I could do? I was thinking about setting the basic locale of the Solaris platform in /etc/default/init to UTF-8. Does anyone think this might help?
Any thoughts are much appreciated.
That looks like a file that was converted by native2ascii using the wrong parameters. To demonstrate, create a file with the contents
Gérer les modÚ
and save it as "a.txt" with the encoding UTF-8. Then run this command:
native2ascii -encoding windows-1252 a.txt b.txt
Open the new file and you should see this:
G\u00c3\u00a9rer les mod\u00c3\u0161
Now reverse the process, but specify ISO-8859-1 this time:
native2ascii -reverse -encoding ISO-8859-1 b.txt c.txt
Read the new file as UTF-8 and you should see this:
Gérer les modÀ\u0161
It recovers the "é" okay, but chokes on the "Ú", like your app did.
I don't know everything that's going wrong in your app, but I'm pretty sure incorrect use of native2ascii is part of it. And that was probably the result of letting the app use the system default encoding. You should always specify the encoding when you save text, whether it's to a file or a database or whatever; never let it default. And if you don't have a good reason to choose something else, use UTF-8.
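A sketch of the same double-encoding effect in plain Java, for anyone who wants to reproduce it without native2ascii (the strings are taken from the example above):
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "G\u00E9rer les mod\u00DA";   // "Gérer les modÚ"
        // Encode as UTF-8: "é" becomes 0xC3 0xA9, "Ú" becomes 0xC3 0x9A ...
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        // ... then decode those bytes as windows-1252, where each byte is
        // its own character: "é" turns into "Ã©" and "Ú" into "Ãš".
        String mangled = new String(utf8Bytes, Charset.forName("windows-1252"));
        System.out.println(mangled);   // GÃ©rer les modÃš
    }
}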
Try to use
java -Dfile.encoding=UTF-8 ...
when starting the application in both systems.
Another way to solve the problem is to change the encoding of both systems to UTF-8, but I prefer the first option (it is less intrusive on the system).
EDIT:
Check this answer on Stack Overflow, it might also help:
Changing the default encoding for String(byte[])
Instead of setting the system-wide character encoding, it might be easier and more robust, to specify the character encoding when reading and writing specific text data. How is your application reading the files? All the Java I/O package readers and writers support passing in a character encoding name to be used when reading/writing text to/from bytes. If you don't specify one, it will then use the platform default encoding, as you are likely experiencing.
Some databases are surprisingly limited in the text encodings they can accept. If your Java application reads the files as text, in the proper encoding, then it can output it to the database however it needs it. If your database doesn't support any encoding whose character repertoire includes the non-ASCII characters you have, then you may need to encode your non-English text first, for example into UTF-8 bytes, then Base64 encode those bytes as ASCII text.
PS: Never use String.getBytes() with no character encoding argument for exactly the reasons you are seeing.
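A short sketch combining both suggestions, written against java.util.Base64 (Java 8+): convert text to bytes with an explicit charset, and Base64-encode those bytes as ASCII for a store that can't take them directly:
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class SafeTextStorage {
    public static void main(String[] args) {
        String text = "G\u00E9rer";   // "Gérer"
        // Always name the charset; text.getBytes() alone would use the
        // platform default and behave differently on Windows vs Solaris.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        // Base64 turns the bytes into plain ASCII for a database that
        // cannot store non-ASCII text directly.
        String stored = Base64.getEncoder().encodeToString(utf8);
        // Reverse the steps to get the original text back.
        String restored = new String(Base64.getDecoder().decode(stored),
                StandardCharsets.UTF_8);
        System.out.println(stored + " -> " + restored);
    }
}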
I managed to get past this error by running the command
export LC_ALL='en_GB.UTF-8'
This command set the locale for the shell that I was in, setting all of the LC_ environment variables to the Unicode file encoding.
Many thanks for all of your suggestions.
You can also set the encoding at the command line, like so java -Dfile.encoding=utf-8.
I think we'll need more information to be able to help you with your problem:
What exception are you getting exactly, and which method are you calling when it occurs?
What is the encoding of the input file? UTF-8? UTF-16/Unicode? ISO-8859-1?
It'll also be helpful if you could provide us with relevant code snippets.
Also, a few things I want to point out:
The problem isn't occurring at the 'é' but later on.
It sounds like the character encoding may be hard coded in your application somewhere.
Also, you may want to verify that operating system packages to support UTF-8 (SUNWeulux, SUNWeuluf etc) are installed.
Java uses the operating system's default encoding while reading and writing files. One should never rely on that; it's always good practice to specify the encoding explicitly.
In Java you can use following for reading and writing:
Reading:
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputPath),"UTF-8"));
Writing:
PrintWriter pw = new PrintWriter(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputPath), "UTF-8")));
