Java - Problems with "ü/ä/ö" after SCP

I am writing a program that can load local or remote log files.
If I load a local file, there is no error.
But if I first copy the file to my local machine with SCP (using this code: http://www.jcraft.com/jsch/examples/ScpFrom.java.html) and then read it, I get an error and the letters "ü/ä/ö" are shown as �.
How can I fix this?
Remote : Linux-Server
Local: Windows-PC
Code for SCP :
http://www.jcraft.com/jsch/examples/ScpFrom.java.html
Code for reading out :
protected void openTempRemoteFile() throws IOException {
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream( lfile )));
String strLine;
DefaultTableModel dtm = new DefaultTableModel(0, 0);
String header[] = new String[]{ "Timestamp", "Session-ID", "Log" };
dtm.setColumnIdentifiers(header);
table.setModel(dtm);
while ((strLine = reader.readLine()) != null) {
String[] sparts = strLine.split(" ");
String[] bparts = strLine.split(" : ");
String Timestamp = sparts[0] + " " + sparts[1];
String SessionID = sparts[4];
String Log = bparts[1];
dtm.addRow(new Object[] {Timestamp, SessionID, Log});
}
reader.close();
}
EDIT :
Encoding Format of the Local-Files: UTF-8
Encoding Format of the SCP-Remote-Files from Linux-Server: WINDOWS-1252

Supply the appropriate Charset to the InputStreamReader constructor, e.g.:
import java.nio.charset.StandardCharsets;
...
BufferedReader reader = new BufferedReader(
new InputStreamReader(
new FileInputStream( lfile ),
StandardCharsets.UTF_8)); // try also ISO_8859_1 if UTF_8 doesn't help.
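Since the edit to the question says the copied files are WINDOWS-1252, a minimal sketch of the mismatch may help. The class and method names are made up for illustration and are not part of the question's program; the three bytes are the windows-1252 encodings of the letters u-umlaut, a-umlaut and o-umlaut.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the question's program.
class CharsetMismatchDemo {
    // Decodes raw bytes with an explicitly chosen charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // The three umlauts as a windows-1252 file stores them: one byte each.
        byte[] raw = {(byte) 0xFC, (byte) 0xE4, (byte) 0xF6};

        // Matching charset: the umlauts come back intact.
        System.out.println(decode(raw, Charset.forName("windows-1252")));

        // Wrong charset: these bytes are not valid UTF-8, so the decoder
        // substitutes U+FFFD, the replacement character.
        System.out.println(decode(raw, StandardCharsets.UTF_8));
    }
}
```

If this matches what you see, passing Charset.forName("windows-1252") for the files copied from the server is the likely fix.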

To fix your problem you have at least two options:
You can specify the encoding for your files directly in your code, updating it as follows:
BufferedReader reader = new BufferedReader(
new InputStreamReader(
new FileInputStream( lfile ),
"UTF8"
)
);
or set the default file encoding when starting the JVM with:
java -Dfile.encoding=UTF-8 … com.example.Main
I definitely prefer the first way, and you can parameterize the "UTF8" value too if you need to.
With the latter way you could still face the same issue if you forget to pass the option.
You can replace the encoding with whatever you need (see https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html for the supported encodings); on Windows, "Cp1252" is usually the default encoding.
Remember, you can always query the file.encoding property or Charset.defaultCharset() to find the current default encoding for your application, e.g.:
byte[] byteArray = "blablabla".getBytes();
InputStream inputStream = new ByteArrayInputStream(byteArray);
InputStreamReader reader = new InputStreamReader(inputStream);
String defaultEncoding = reader.getEncoding();
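A shorter way to inspect the default, sketched here with a made-up class name, is to ask Charset.defaultCharset() and the file.encoding system property directly. What gets printed depends on the JVM and the operating system it runs on.

```java
import java.nio.charset.Charset;

// Hypothetical helper for printing the JVM's default encoding.
class DefaultCharsetDemo {
    // Canonical name of the charset used when no charset is specified.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Charset.defaultCharset(): " + defaultCharsetName());
        System.out.println("file.encoding property:   " + System.getProperty("file.encoding"));
    }
}
```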

Working with encodings is tricky. If your system regularly receives files like this (from different environments), you should first detect the charset and then read the file with the detected charset. I had a similar problem and used
juniversalchardet
to detect the charset, then passed it to InputStreamReader(stream, Charset).
In your case it would look like this:
protected void openTempRemoteFile() throws IOException {
String encoding = UniversalDetector.detectCharset(lfile);
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream( lfile ), Charset.forName(encoding)));
....
If it is only a one-time job, open the file in a text editor (Notepad++, for example), save it in your encoding, and then use it in your program.
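If adding a library is not an option, a much cruder heuristic can be sketched with the standard library alone. This is not what juniversalchardet does. It assumes there are only two candidates, UTF-8 and one known fallback (windows-1252 in the question), and relies on the fact that text in a single-byte encoding containing umlauts is almost never valid UTF-8. The class and method names are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: try strict UTF-8 first, otherwise use a fallback.
class Utf8OrFallback {
    static String decode(byte[] bytes, Charset fallback) {
        // REPORT makes the decoder throw instead of silently inserting U+FFFD.
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strict.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume the fallback encoding instead.
            return new String(bytes, fallback);
        }
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] utf8Bytes = {(byte) 0xC3, (byte) 0xBC}; // u-umlaut in UTF-8
        byte[] cp1252Bytes = {(byte) 0xFC};            // u-umlaut in windows-1252
        System.out.println(decode(utf8Bytes, cp1252).equals(decode(cp1252Bytes, cp1252)));
    }
}
```

Pure ASCII input decodes identically either way, so the heuristic only has to decide when non-ASCII bytes are present.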

Related

Reading from a file doesn't get the expected text [duplicate]

I tried to use java.io.FileReader to read some text files and convert them into a string, but I found the result is wrongly encoded and not readable at all.
Here's my environment:
Windows 2003, OS encoding: CP1252
Java 5.0
My files are UTF-8 encoded or CP1252 encoded, and some of them (UTF-8 encoded files) may contain Chinese (non-Latin) characters.
I use the following code to do my work:
private static String readFileAsString(String filePath)
throws java.io.IOException{
StringBuffer fileData = new StringBuffer(1000);
FileReader fileReader = new FileReader(filePath);
//System.out.println(fileReader.getEncoding());
BufferedReader reader = new BufferedReader(fileReader);
char[] buf = new char[1024];
int numRead=0;
while((numRead=reader.read(buf)) != -1){
String readData = String.valueOf(buf, 0, numRead);
fileData.append(readData);
buf = new char[1024];
}
reader.close();
return fileData.toString();
}
The above code doesn't work. I found the FileReader's encoding is CP1252 even if the text is UTF-8 encoded. But the JavaDoc of java.io.FileReader says that:
The constructors of this class assume
that the default character encoding
and the default byte-buffer size are
appropriate.
Does this mean that I am not required to set the character encoding myself if I am using FileReader? But I currently get wrongly encoded data, so what is the correct way to deal with my situation? Thanks.

Yes, you need to specify the encoding of the file you want to read.
Yes, this means that you have to know the encoding of the file you want to read.
No, there is no general way to guess the encoding of any given "plain text" file.
The one-argument constructors of FileReader always use the platform default encoding, which is generally a bad idea.
Since Java 11, FileReader has also gained constructors that accept an encoding: new FileReader(file, charset) and new FileReader(fileName, charset).
In earlier versions of Java, you need to use new InputStreamReader(new FileInputStream(pathToFile), <encoding>).

FileReader uses Java's platform default encoding, which depends on the system settings of the computer it's running on and is generally the most popular encoding among users in that locale.
If this "best guess" is not correct, you have to specify the encoding explicitly. Before Java 11, FileReader did not allow this (a major oversight in the API); instead, you had to use new InputStreamReader(new FileInputStream(filePath), encoding). Ideally, get the encoding from metadata about the file.

For Java 7+ you can use this:
BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
See the Charsets documentation for all available charsets.
For example, if your file is in CP1252, obtain the charset with
Charset.forName("windows-1252");
The Java documentation on supported encodings lists the other canonical names, for both the IO and NIO APIs.
If you do not know exactly which encoding a file uses, you can try a third-party library such as the tool from Google (linked in the original answer), which works fairly well.
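As a rough sketch of how this replaces the FileReader loop from the question, the following reads a whole file with an explicit charset and closes the reader with try-with-resources. The class name and the roundTrip helper are invented for the example; roundTrip only exists so the snippet can be tried without a file of your own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not from the original answers.
class ExplicitCharsetRead {
    // Reads the whole file, decoding its bytes with the given charset.
    static String readAll(Path path, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = Files.newBufferedReader(path, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Writes text to a temporary file in the given charset and reads it back.
    static String roundTrip(String text, Charset charset) {
        try {
            Path tmp = Files.createTempFile("charset-demo", ".txt");
            try {
                Files.write(tmp, text.getBytes(charset));
                return readAll(tmp, charset);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that Files.newBufferedReader throws an IOException on malformed input rather than substituting replacement characters, so a wrong charset tends to fail loudly instead of producing garbled text.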

Since Java 11 you can use this constructor:
public FileReader(String fileName, Charset charset) throws IOException;

Before Java 11, combining FileInputStream with InputStreamReader was better than using FileReader directly, because FileReader did not let you specify the charset.
Here is an example using BufferedReader, FileInputStream and InputStreamReader together, so that you could read lines from a file.
List<String> words = new ArrayList<>();
List<String> meanings = new ArrayList<>();
public void readAll( ) throws IOException{
String fileName = "College_Grade4.txt";
String charset = "UTF-8";
BufferedReader reader = new BufferedReader(
new InputStreamReader(
new FileInputStream(fileName), charset));
String line;
while ((line = reader.readLine()) != null) {
line = line.trim();
if( line.length() == 0 ) continue;
int idx = line.indexOf("\t");
words.add( line.substring(0, idx ));
meanings.add( line.substring(idx+1));
}
reader.close();
}

For non-Latin scripts, such as Cyrillic, you can use something like this (Java 11+):
FileReader fr = new FileReader("src/text.txt", StandardCharsets.UTF_8);
and make sure your .txt file is saved as UTF-8 (not in the default ANSI format). Cheers!

Converting utf8 to gb2312 in java

Just look at the code below:
try {
String str = "上海上海";
String gb2312 = new String(str.getBytes("utf-8"), "gb2312");
String utf8 = new String(gb2312.getBytes("gb2312"), "utf-8");
System.out.println(str.equals(utf8));
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
It prints false!
I ran this code under both JDK 7 and JDK 8, and my IDE's source encoding is UTF-8.
Can anyone help me?

What you are looking for is the encoding/decoding that happens when you output/input.
As @kalpesh said, internally it is all Unicode. If you want to READ a stream in a specific encoding and then WRITE it in a different one, you have to specify the encoding for the conversion from bytes (in the stream) to strings (in Java), and then from strings (in Java) to bytes (in the output stream), like so:
InputStream is = new FileInputStream("utf8_encoded_text.txt");
OutputStream os = new FileOutputStream("gb2312_encoded.txt");
Reader r = new InputStreamReader(is,"utf-8");
BufferedReader br = new BufferedReader(r);
Writer w = new OutputStreamWriter(os, "gb2312");
BufferedWriter bw = new BufferedWriter(w);
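String s;
while ((s = br.readLine()) != null) {
bw.write(s);
bw.newLine(); // readLine() strips the line terminator, so write one back
}
br.close();
bw.close(); // closing the writer flushes it and closes the underlying stream
Of course, you still have to do proper exception handling to make sure everything is properly closed.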

String gb2312 = new String(str.getBytes("utf-8"), "gb2312");
This statement is incorrect because the String constructor must be given a byte array together with the charset those bytes were encoded in. Here the bytes are UTF-8, but the constructor is told they are GB2312.
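To make the difference concrete, here is a small sketch (class and method names are made up) that contrasts a matching encode/decode pair with the mismatched pair from the question. It uses Unicode escapes for the question's string so the source file's own encoding does not matter, and it assumes the JDK provides the GB2312 charset, which full JDK builds normally do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of matching vs. mismatched charsets.
class Gb2312Conversion {
    static final Charset GB2312 = Charset.forName("GB2312");

    // Correct: encode to GB2312 bytes, then decode those bytes as GB2312.
    static String correctRoundTrip(String s) {
        byte[] gbBytes = s.getBytes(GB2312);
        return new String(gbBytes, GB2312);
    }

    // The question's approach: UTF-8 bytes are decoded as if they were GB2312.
    // Byte sequences that are not valid GB2312 are replaced, so data is lost.
    static String mismatchedRoundTrip(String s) {
        String wrong = new String(s.getBytes(StandardCharsets.UTF_8), GB2312);
        return new String(wrong.getBytes(GB2312), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String str = "\u4e0a\u6d77\u4e0a\u6d77"; // the string from the question
        System.out.println(correctRoundTrip(str).equals(str));
        System.out.println(mismatchedRoundTrip(str).equals(str));
    }
}
```

To actually produce GB2312 output, keep the bytes from str.getBytes(GB2312) and write them to a stream or file; a String itself has no encoding.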

Unable to recover a full image using java bufferwriter?

I am reading image data from an InputStream and converting it to a String. I then write that String directly to an image file as follows:
final BufferedReader reader = new BufferedReader(new InputStreamReader(in));
final char[] cbuf = new char[1024];
final int length = reader.read(cbuf);
String packet=new String(cbuf,0,length);
BufferedWriter out = null ;
FileWriter fstream ;
File file = new File(fileName);
fstream = new FileWriter(file);
out.write(packet);
Please guide me in this issue.
I am not getting full image.

final BufferedReader reader = new BufferedReader(new InputStreamReader(in));
Decodes input using default encoding potentially corrupting data.
out.write(packet);
Encodes characters using default encoding potentially corrupting data.
Read documentation on API you use. Only perform conversion with default or unknown encoding when you absolutely need it.
Read/convert an InputStream to a String
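For binary data such as an image, one way to avoid both conversions is to skip Reader and Writer entirely and copy raw bytes. The sketch below is an illustration with made-up names, not the asker's code. It also loops until the end of the stream, whereas the question's code calls read only once and so handles at most 1024 characters.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical helper: copies bytes without any character decoding.
class BinaryCopy {
    // Copies everything from in to out and returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
        }
        return total;
    }

    // In-memory round trip, only here so the copy can be tried without files.
    static byte[] roundTrip(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            copy(new ByteArrayInputStream(data), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

In the question's setting, the output side would be a FileOutputStream for the target file instead of the in-memory stream used here.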

Java decode special characters Â¡ and Ì§ becomes A?¡ and I?§

I'm trying to read a file name off XML, whose encoding can be changed.
The file name on the XML has string such as "Ì§oÌ" which is supposed to be read by my code as "Ì§oÌ". However, I keep getting I?§.
Similar problem for Â and A?¡
Below is my code:
Socket s = new Socket();
InputStream is = s.getInputStream();
ByteArrayInputStream bAis = new ByteArrayInputStream(buf, 0, rlen);
BufferedReader bReader = new BufferedReader( new InputStreamReader( hbis, "ISO-8859-1" ));
String theStringINeed = bReader.readLine();
Any help would be appreciated.

new InputStreamReader( hbis, "ISO-8859-1" )
If you lie about the encoding of a file, bad things will happen.
You need to read the file using the encoding it was actually written in, which is probably UTF-8.
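Assuming the data really is UTF-8, as guessed above, the garbling in the question can be reproduced in a few lines. The names below are invented for the sketch. The inverted exclamation mark U+00A1 is two bytes in UTF-8 (0xC2 0xA1), and reading those bytes as ISO-8859-1 yields two characters, U+00C2 followed by U+00A1, which is the pattern in the question's title.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of UTF-8 bytes being misread as ISO-8859-1.
class MojibakeDemo {
    // Simulates the bug: UTF-8 bytes decoded with the wrong charset.
    static String misreadAsLatin1(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undoes the damage. This works because ISO-8859-1 maps every byte to
    // exactly one character, so the original bytes can be recovered.
    static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = misreadAsLatin1("\u00a1");
        System.out.println(garbled.length());               // number of chars after misreading
        System.out.println(repair(garbled).equals("\u00a1")); // whether repair restores the input
    }
}
```

The repair step is only a diagnostic aid; the real fix is to pass the correct charset to InputStreamReader in the first place.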
