Which encoding does Process.getInputStream() use?

In a Java program, I spawn a new Process via ProcessBuilder.
args[0] = directory.getAbsolutePath() + File.separator + program;
ProcessBuilder pb = new ProcessBuilder(args);
pb.directory(directory);
final Process process = pb.start();
Then, I read the process standard output with a new Thread
new Thread() {
    public void run() {
        try {
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream()));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}.start();
However, when the process outputs non-ASCII characters (such as 'é'), the line has character '\uFFFD' instead.
What is the encoding in the InputStream returned by getInputStream (my platform is Windows in Europe)?
How can I change things so that line contains the expected data (i.e. '\u00E9' for 'é')?
Edit: I tried new InputStreamReader(...,"UTF-8"):
é becomes \uFFFD

An InputStream is a binary stream, so there is no encoding. When you create the Reader, you need to know what character encoding to use, and that would depend on what the program you called produces (Java will not convert it in any way).
If you do not specify anything for InputStreamReader, it will use the platform default encoding, which may not be appropriate. There is another constructor that allows you to specify the encoding.
If you know what encoding to use (and you really have to know):
new InputStreamReader(process.getInputStream(), "UTF-8") // for example
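For example, a minimal sketch of the reading thread with the charset given explicitly (this assumes the spawned program really writes UTF-8, which you have to verify for your particular program; it needs the usual java.io imports plus java.nio.charset.StandardCharsets, Java 7+):

new Thread() {
    public void run() {
        try {
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}.start();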

Interestingly enough, when running on Windows:
ProcessBuilder pb = new ProcessBuilder("cmd", "/c dir");
Process process = pb.start();
Then the CP437 code page works quite well for
new InputStreamReader(process.getInputStream(), "CP437");
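Put together, that looks roughly like this (a sketch; CP437 is only a common OEM code page on Western-European Windows installs, other machines may use CP850 or something else, so treat the charset name as an assumption):

public static void main(String[] args) throws IOException {
    ProcessBuilder pb = new ProcessBuilder("cmd", "/c", "dir");
    Process process = pb.start();
    try (BufferedReader r = new BufferedReader(
            new InputStreamReader(process.getInputStream(), "CP437"))) {
        String line;
        while ((line = r.readLine()) != null) {
            System.out.println(line);
        }
    }
}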

As I understand it, operating-system streams are byte streams; there are no characters at that level. The InputStreamReader constructor uses the JVM default character set (java.nio.charset.Charset#defaultCharset()); you can use another constructor to specify a character set explicitly.
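To see what that default actually is on your machine, a trivial check (just a sketch):

// Typically prints windows-1252 on a Western-European Windows setup.
System.out.println(java.nio.charset.Charset.defaultCharset());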

According to http://www.fileformat.info/info/unicode/char/e9/index.htm, '\u00E9' is the Unicode code point for 'é' ('\uFFFD' is the replacement character that appears when something could not be decoded or displayed).
The Windows console does not support Unicode by default, so console output is not a reliable way to judge what was actually read; your problem may be in the writing rather than the reading. If you want to test your code, write the stream to a file instead. But do not forget to set the encoding to UTF-8.
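For example (a rough sketch; the file name is arbitrary, the reader still has to use whatever charset the child process actually emits, and it belongs inside a method that throws IOException):

try (BufferedReader reader = new BufferedReader(
         new InputStreamReader(process.getInputStream()));
     PrintWriter out = new PrintWriter(new OutputStreamWriter(
         new FileOutputStream("process-output.txt"), StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        out.println(line); // inspect process-output.txt in a UTF-8 aware editor
    }
}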

A more scientific approach. On Windows this works perfectly:
private static final Charset CONSOLE_ENCODING;

static {
    Charset enc = Charset.defaultCharset();
    try {
        String example = "äöüßДŹす";
        String command = File.separatorChar == '/' ? "echo " + example : "cmd.exe /c echo " + example;
        Process exec = Runtime.getRuntime().exec(command);
        InputStream inputStream = exec.getInputStream();
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        while (exec.isAlive()) {
            Thread.sleep(100);
        }
        byte[] buff = new byte[inputStream.available()];
        if (buff.length > 0) {
            int count = inputStream.read(buff);
            baos.write(buff, 0, count);
        }
        byte[] array = baos.toByteArray();
        for (Charset charset : Charset.availableCharsets().values()) {
            String s = new String(array, charset);
            if (s.equals(example)) {
                enc = charset;
                break;
            }
        }
    } catch (InterruptedException e) {
        throw new Error("Could not determine console charset.", e);
    } catch (IOException e) {
        throw new Error("Could not determine console charset.", e);
    }
    CONSOLE_ENCODING = enc;
}
Note that the specification gives no guarantee about the console encoding staying fixed at runtime; we cannot be sure the detected charset is still correct if the encoding changes while the JVM is running.
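With that caveat in mind, the detected charset can then be used wherever the process output is decoded, e.g. (a short sketch):

BufferedReader reader = new BufferedReader(
        new InputStreamReader(process.getInputStream(), CONSOLE_ENCODING));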

If you, like me, know what encoding you want to use for all input/output, you can pass it explicitly to the constructors that create a Reader, as some other answers have pointed out.
But this hard-codes it in the source, which might or might not be OK.
I found a better way after reading this answer, which reveals that you can set the default encoding to what you need before the JVM starts up:
java -Dfile.encoding=ISO-8859-1 ...
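You can check at runtime that the flag took effect, since it becomes the JVM default charset (a quick sketch; note that recent JVMs, 18 and later, default to UTF-8 and no longer honor arbitrary -Dfile.encoding values):

// Started with: java -Dfile.encoding=ISO-8859-1 ...
System.out.println(System.getProperty("file.encoding"));       // ISO-8859-1
System.out.println(java.nio.charset.Charset.defaultCharset()); // ISO-8859-1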

I put this as a comment, but I see there was an answer posted afterwards, so it might be redundant now :)
BufferedReader br = new BufferedReader(
new InputStreamReader(conn.getInputStream(), "UTF-8"));

You can use StringEscapeUtils.escapeHtml from the commons-lang jar, but note that it operates on Strings, not streams, so apply it to the text after reading:
BufferedReader br = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"));
String escaped = StringEscapeUtils.escapeHtml(br.readLine());

Related

My (String).split("=") isn't working?

I wanted to turn a String into a String[], but it isn't working the way I expected. My code:
public static void get(HashMap<String, String> saves, File file) throws UnsupportedEncodingException, FileNotFoundException, IOException {
    if (!file.exists()) {
        return;
    }
    InputStreamReader reader;
    reader = new InputStreamReader(new FileInputStream(file), "UTF-16");
    String r = null;
    String[] s;
    BufferedReader bufreader = new BufferedReader(reader);
    while ((r = bufreader.readLine()) != null) {
        s = r.split("=");
        if (s.length < 2) {
            System.out.println(s.length);
            System.out.println(s[0]);
            return;
        }
        saves.put(s[0].toString(), s[1].toString());
        s = null;
    }
}
And also, when I tell it to println the String to the console
System.out.println(s.length);
System.out.println(s[0]);
it just prints:
1
??????????????????
-
-
What it should be reading (What is in the file):
1=welcome
2=hello
3=bye
4=goodbye
So I want it to put the values into the HashMap:
saves.put("1", "welcome");
saves.put("2", "hello");
saves.put("3", "bye");
saves.put("4", "goodbye");
but s = r.split("=") is not splitting; it just turns the String into "?????????".
Thank you!
It seems you're using the wrong encoding.
Your input file is not really UTF-16, as the Java code expects it.
I saved your example data in a file, and the result was similarly broken.
The default encoding on my system is UTF-8, so I changed the encoding of the file with the command:
iconv -f utf-8 -t utf-16 orig.txt > converted.txt
When using your program on converted.txt,
it produces the expected output.
It also produces the expected output if I use orig.txt,
and make this simple change in your program:
reader = new InputStreamReader(new FileInputStream(file), "UTF-8");
You can either make sure the file is UTF-16 encoded,
and if not, convert it,
or use the correct encoding when you create the InputStreamReader.
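For example, a small self-contained sketch of the reading side with an explicit charset (UTF-8 here; use StandardCharsets.UTF_16 instead if you convert the file, and the file name "saves.txt" is just a placeholder):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;

public class SplitDemo {
    public static void main(String[] args) throws IOException {
        HashMap<String, String> saves = new HashMap<String, String>();
        try (BufferedReader br = Files.newBufferedReader(
                Paths.get("saves.txt"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] s = line.split("=", 2);
                if (s.length == 2) {
                    saves.put(s[0], s[1]);
                }
            }
        }
        System.out.println(saves); // e.g. {1=welcome, 2=hello, 3=bye, 4=goodbye}
    }
}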

Java replace line in a text file

I found this code from another question
private void updateLine(String toUpdate, String updated) throws IOException {
BufferedReader file = new BufferedReader(new FileReader(data));
String line;
String input = "";
while ((line = file.readLine()) != null)
input += line + "\n";
input = input.replace(toUpdate, updated);
FileOutputStream os = new FileOutputStream(data);
os.write(input.getBytes());
file.close();
os.close();
}
This is my file before I replace some lines
example1
example2
example3
But when I replace a line, the file now looks like this
example1example2example3
Which makes it impossible to read the file when there are a lot of lines in it.
How would I go about editing the code above to make my file look what it looked like at the start?
Use System.lineSeparator() instead of \n.
while ((line = file.readLine()) != null)
input += line + System.lineSeparator();
The issue is that on Unix systems, the line separator is \n while on Windows systems, it's \r\n.
In Java versions older than Java 7, you would have to use System.getProperty("line.separator") instead.
As pointed out in the comments, if you have concerns about memory usage, it would be wise to not store the entire output in a variable, but write it out line-by-line in the loop that you're using to process the input.
If you read and modify line by line, you have the advantage that you don't need to fit the whole file in memory. Not sure if this is possible in your case, but it is generally a good thing to aim for streaming. Here it also removes the need to concatenate the string, and you don't need to pick a line terminator, because you can write each transformed line with println(). It requires writing to a different file, which is generally a good thing as it is crash-safe: if you rewrite a file in place and get aborted, you lose data.
private void updateLine(String toUpdate, String updated) throws IOException {
    BufferedReader file = new BufferedReader(new FileReader(data));
    PrintWriter writer = new PrintWriter(new File(data + ".out"), "UTF-8");
    String line;
    while ((line = file.readLine()) != null) {
        line = line.replace(toUpdate, updated);
        writer.println(line);
    }
    file.close();
    if (writer.checkError()) {
        throw new IOException("cannot write");
    }
    writer.close();
}
This assumes that the replacement only needs to happen within a single line, not across multiple lines. I also added an explicit encoding and used a writer, since you have a string to output.
This is because you used an OutputStream, which is better suited to binary data. Try using a PrintWriter instead and don't add any line terminator at the end of the lines yourself; println() does that for you. Example is here
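Since that example is external, here is a minimal sketch of the same idea (same in-memory approach as the snippet above, with data assumed to be the same file field; needs java.util.List/ArrayList and the usual java.io imports):

private void updateLine(String toUpdate, String updated) throws IOException {
    // Collect the transformed lines, then let println() append the
    // platform line separator when writing them back.
    List<String> lines = new ArrayList<String>();
    BufferedReader file = new BufferedReader(new FileReader(data));
    String line;
    while ((line = file.readLine()) != null) {
        lines.add(line.replace(toUpdate, updated));
    }
    file.close();
    PrintWriter writer = new PrintWriter(new FileWriter(data));
    for (String l : lines) {
        writer.println(l);
    }
    writer.close();
}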

Call and return the output of a jar to another java program

I am calling a jar from a Java program. The inner jar returns some output; how should I read and display it in the following program?
I am able to call the jar successfully, but how do I display the output?
import java.io.InputStream;

public class call_xml_jar {
    public static void main(String argv[]) {
        try {
            // Run a java app in a separate system process
            Process proc = Runtime.getRuntime().exec("java -jar xml_validator.jar");
            // Then retrieve the process output
            InputStream in = proc.getInputStream();
            InputStream err = proc.getErrorStream();
            System.out.println("Completed...");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Output: Completed...
I want to print jar output as well
With the lines
InputStream in = proc.getInputStream();
InputStream err = proc.getErrorStream();
you are already on the way.
These streams give you access to the other application's standard output and standard error streams (respectively). By the way: You retrieve the other application's standard output stream by calling getInputStream(), as this is the view of your current application; you are inputting the other application's data.
Just to make it clear: the standard output and the standard error streams are accessed in an application by printing calls to System.out and System.err (respectively).
So, if you have - for example - System.out.println("Hello world") in the other application, you will retrieve the corresponding bytes (see below) in the input stream that you reference with the variable in of the above code snippet.
Normally, you are not interested in the bytes but you want to retrieve the String that you have placed into the output. So you must convert the bytes to a String. For this you normally must provide an encoding (the Java class for that is Charset). In fact, the platform's default encoding works in such cases.
The easiest way is to wrap the input stream in a buffered reader:
BufferedReader outReader = new BufferedReader(new InputStreamReader(proc.getInputStream()));
The above-mentioned platform default encoding is used when you do not specify any character set in the InputStreamReader constructor.
A BufferedReader knows a method readLine(), which you must use to get all the other application's output.
while (outReader.ready())
    System.out.println(outReader.readLine());
One word about flushing: if the other application writes data to its standard output stream, that stream is typically flushed only when a newline is also written, which is what calls to System.out.println(...) do. And this is the reason why you must read entire lines from the reader.
Are you now able to assemble some code that reads out the other application's output? If not, you maybe should post another question.
I solved it myself.. Here is my solution...
// Then retrieve the process output
InputStream in = proc.getInputStream();
InputStream err = proc.getErrorStream();
System.out.println("Completed...");

InputStreamReader is = new InputStreamReader(in);
StringBuilder sb = new StringBuilder();
BufferedReader br = new BufferedReader(is);
String read = br.readLine();
while (read != null) {
    //System.out.println(read);
    sb.append(read);
    sb.append("\n");
    read = br.readLine();
}
System.out.println(sb);
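One caveat about this solution: the error stream is obtained but never read, so a child process that writes a lot to stderr can fill its pipe buffer and block. A sketch of a variant that avoids this by merging stderr into stdout with ProcessBuilder (same command as above; waitFor() needs InterruptedException handled, e.g. by the surrounding catch):

ProcessBuilder pb = new ProcessBuilder("java", "-jar", "xml_validator.jar");
pb.redirectErrorStream(true); // error output arrives on the same stream
Process proc = pb.start();
BufferedReader br = new BufferedReader(new InputStreamReader(proc.getInputStream()));
StringBuilder sb = new StringBuilder();
String read;
while ((read = br.readLine()) != null) {
    sb.append(read);
    sb.append("\n");
}
proc.waitFor();
System.out.println(sb);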

Read a CSV file in UTF-8 format

I am reading a CSV file in Java, adding a new column with new information, and exporting it back to a CSV file. I have a problem reading the CSV file in UTF-8 format. I read line by line and store it in a StringBuilder, but when I print the line I can see that the information I'm reading is not in UTF-8 but in ANSI. I used both System.out.print and a PrintStream in UTF-8, and the information still appears in ANSI. This is my code:
BufferedReader br;
try {
    br = new BufferedReader(new InputStreamReader(new FileInputStream(
            "./users.csv"), "UTF8"));
    String line;
    while ((line = br.readLine()) != null) {
        if (line.contains("none#none.com")) {
            continue;
        }
        if (!line.contains("#") && !line.contains("FirstName")) {
            continue;
        }
        PrintStream ps = new PrintStream(System.out, true, "UTF-8");
        ps.print(line + "\n");
        sbusers.append(line);
        sbusers.append("\n");
        sbusers2.append(line);
        sbusers2.append(",");
    }
    br.close();
} catch (IOException e) {
    System.out.println("Failed to read users file.");
} finally {
}
It prints out information like "Professor -P�s". Since the reading isn't being done correctly the output to the new file is also being exported in ANSI.
Are you sure your CSV is UTF-8 encoded? My guess is that it's not. Try using ISO-8859-1 for reading the file, but keep the output as UTF-8. (UTF8 and UTF-8 both tend to work, but you should use UTF-8 as #Marcelo suggested)
In the line:
br = new BufferedReader(new InputStreamReader(new FileInputStream("./users.csv"),"UTF8"));
Your charset should be "UTF-8" not "UTF8".
Printing to System.out using UTF encoding ????????????
Why would you do that ? System.out and the encoding it uses is determined at the OS level (it becomes the default charset in the JVM), and that's the only one you want to use on System.out.
First, as suggested by @Marcelo, use "UTF-8" instead of "UTF8":
BufferedReader in = new BufferedReader(
        new InputStreamReader(
                new FileInputStream("./users.csv"), "UTF-8"));
Second, forget about the PrintStream, just use System.out, or better yet, a logging API. You don't need to worry about how Java will output your string to the console (number one rule about character encoding: After you've read things successfully, let Java handle the encoding and only worry about it again when you are writing to an external file / database / etc).
Third and more important, check that your file is really encoded in UTF-8, this is the cause of 99% of the encoding problems.
Make sure that you test with a real UTF-8 file (use tools like iconv to convert to UTF-8 and be sure about it).
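One way to let Java itself tell you (a sketch, Java 7+, using java.nio.file.Files/Paths and StandardCharsets): Files.newBufferedReader uses a strict decoder, so it throws MalformedInputException while reading instead of silently inserting replacement characters when the file is not really UTF-8:

try (BufferedReader br = Files.newBufferedReader(
        Paths.get("./users.csv"), StandardCharsets.UTF_8)) {
    String line;
    while ((line = br.readLine()) != null) {
        // process the line; a MalformedInputException here means the
        // file is not valid UTF-8
    }
}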
I found a potential solution (I had the same problem). Depending on how your file is actually encoded, you may need to specify the charset differently...
Replace:
br = new BufferedReader(new InputStreamReader(new FileInputStream(
        "./users.csv"), "UTF8"));
With:
br = new BufferedReader(new InputStreamReader(new FileInputStream(
        "./users.csv"), "ISO-8859-1"));
For further understanding: https://mincong.io/2019/04/07/understanding-iso-8859-1-and-utf-8/

PHP and a file written by Java FileOutputStream

I have a text file that is written by a Java FileOutputStream.
When I read that file using file_get_contents, everything is on the same line and there are no separators between the different strings.
I need to know how to read/parse that file so that I have some kind of separator between the strings.
I'm using something like this to save the file:
Stream stream = new Stream(30000, 30000);
stream.outOffset = 0;
stream.writeString("first string");
stream.writeString("second string");
FileOutputStream out = new FileOutputStream("file.txt");
out.write(stream.outBuffer, 0, stream.outOffset);
out.flush();
out.close();
out = null;
I have no idea what that Stream thing in your code represents, but the usual approach to write String lines to a file is using a PrintWriter.
PrintWriter writer = new PrintWriter(new OutputStreamWriter(new FileOutputStream("/file.txt"), "UTF-8"));
writer.println("first line");
writer.println("second line");
writer.close();
This way each line is separated by the platform default newline, which is the same as you obtain by System.getProperty("line.separator"). On Windows machines this is usually \r\n. On the PHP side, you can then just explode() on that.
file_get_contents returns the content of the file as a string. There are no lines in a string.
Are you familiar with newlines?
See wikipedia
So, what you are probably looking for is either reading your file line by line in PHP,
or reading it with file_get_contents like you did and then explode-ing it into lines (use "\n" as separator).
There is no indication in your code that you are writing a line separator to the output stream. You need to do something like this:
String nl = System.getProperty("line.separator");
Stream stream = new Stream(30000, 30000);
stream.outOffset = 0;
stream.writeString("first string");
stream.writeString(nl);
stream.writeString("second string");
stream.writeString(nl);
FileOutputStream out = null;
try
{
    out = new FileOutputStream("file.txt");
    out.write(stream.outBuffer, 0, stream.outOffset);
    out.flush();
}
finally
{
    try
    {
        if (out != null)
            out.close();
    }
    catch (IOException ioex) { ; }
}
Using PHP, you can use the explode function to fill an array full of strings from the file you are reading in:
<?php
$data = file_get_contents('file.txt');
$lines = explode("\n", $data);
foreach ($lines as $line)
{
    echo $line;
}
?>
Note that depending on your platform, you may need to use "\r\n" (double-quoted, so PHP interprets the escapes) as the first explode parameter, or some of your lines may have carriage returns on the end of them.
