BufferedReader encoding charset for Russian characters - Java

Currently we are facing an issue wherein Russian characters get converted to junk data, shown as rectangles in Notepad. Below is the code we are using; it runs on a Linux server with Java 1.8:
BufferedReader buff = new BufferedReader(new FileReader(new File("text.txt")));
String line;
StringBuffer result = new StringBuffer();
while ((line = buff.readLine()) != null) {
    result.append(line).append('\n');
}
return result.toString().getBytes();
Earlier the same code used to work in an AIX environment with Java 1.6.
Can anyone please give me a hint as to what might be going wrong? This seems to be entirely environmental, since no code changes have been made.

Your code seems to read the whole file into a byte array. That can be done this way:
static byte[] getFileBytes(String filename)
        throws java.io.FileNotFoundException, java.io.IOException {
    java.io.File f = new java.io.File(filename);
    java.io.FileInputStream fi = new java.io.FileInputStream(f);
    long fsize = f.length();
    byte[] b = new byte[(int) fsize];
    int rsize = fi.read(b, 0, (int) fsize);
    fi.close();
    if (rsize != fsize) {
        // read() may return fewer bytes than requested; trim the array to fit
        byte[] btmp = new byte[rsize];
        System.arraycopy(b, 0, btmp, 0, rsize);
        b = btmp;
    }
    return b;
}
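As an aside, if you can rely on Java 7 or later, the NIO Files API does the same thing in one call, without the manual short-read handling. A sketch, where a temp file stands in for your real file:

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadAllBytesExample {
    public static void main(String[] args) throws Exception {
        // stand-in for your real text.txt
        Path file = Files.createTempFile("text", ".txt");
        Files.write(file, "привет".getBytes("UTF-8"));

        // reads the whole file into a byte[], no charset conversion involved
        byte[] data = Files.readAllBytes(file);
        System.out.println(data.length); // 6 Cyrillic letters = 12 UTF-8 bytes
    }
}
```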
Or, within your code, you can pick an encoding and use it in both conversions:
static byte[] getFileByteArray(String filename) throws Exception {
    String cset = "ISO-8859-1"; /* any one-byte encoding */
    java.io.BufferedReader buff = new java.io.BufferedReader(
            new java.io.InputStreamReader(
                    new java.io.FileInputStream(filename), cset));
    String line;
    StringBuffer result = new StringBuffer();
    while ((line = buff.readLine()) != null) {
        result.append(line).append('\n');
    }
    return result.toString().getBytes(cset);
}

Try this:
BufferedReader buff = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(fileDir), "UTF-8"));
Edit: make sure the file is saved as UTF-8.
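If you are unsure whether the file really is UTF-8, one quick check is for the optional UTF-8 byte order mark at the start of the file. A sketch (hasUtf8Bom is a hypothetical helper, not part of the answer above); note that many UTF-8 files carry no BOM, so a negative result proves nothing:

```java
import java.io.FileInputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomCheck {
    // returns true if the file starts with the UTF-8 byte order mark EF BB BF
    static boolean hasUtf8Bom(String path) throws Exception {
        try (FileInputStream in = new FileInputStream(path)) {
            byte[] head = new byte[3];
            int n = in.read(head);
            return n == 3
                    && (head[0] & 0xFF) == 0xEF
                    && (head[1] & 0xFF) == 0xBB
                    && (head[2] & 0xFF) == 0xBF;
        }
    }

    public static void main(String[] args) throws Exception {
        // build a tiny file that does start with the BOM
        Path p = Files.createTempFile("bom", ".txt");
        Files.write(p, new byte[] { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i' });
        System.out.println(hasUtf8Bom(p.toString())); // true
    }
}
```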

Related

Java - Problems with "ü/ä/ö" after SCP

I created a program which can load local or remote log files.
If I load a local file there is no error.
But if I first copy the file to my local machine with SCP (where I use this code: http://www.jcraft.com/jsch/examples/ScpFrom.java.html) and then read it, I get an error and the letters "ü/ä/ö" are shown as �.
How can I fix this?
Remote : Linux-Server
Local: Windows-PC
Code for SCP :
http://www.jcraft.com/jsch/examples/ScpFrom.java.html
Code for reading out :
protected void openTempRemoteFile() throws IOException {
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(lfile)));
    String strLine;
    DefaultTableModel dtm = new DefaultTableModel(0, 0);
    String[] header = new String[] { "Timestamp", "Session-ID", "Log" };
    dtm.setColumnIdentifiers(header);
    table.setModel(dtm);
    while ((strLine = reader.readLine()) != null) {
        String[] sparts = strLine.split(" ");
        String[] bparts = strLine.split(" : ");
        String timestamp = sparts[0] + " " + sparts[1];
        String sessionID = sparts[4];
        String log = bparts[1];
        dtm.addRow(new Object[] { timestamp, sessionID, log });
    }
    reader.close();
}
EDIT :
Encoding Format of the Local-Files: UTF-8
Encoding Format of the SCP-Remote-Files from Linux-Server: WINDOWS-1252
Supply an appropriate Charset to the InputStreamReader constructor, e.g.:
import java.nio.charset.StandardCharsets;
...
BufferedReader reader = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(lfile),
                StandardCharsets.UTF_8)); // try also ISO_8859_1 if UTF_8 doesn't help
To fix your problem you have at least two options:
You can specify the encoding for your files directly in your code, updating it as follows:
BufferedReader reader = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(lfile),
                "UTF8"));
or set the default file encoding when starting the JVM with:
java -Dfile.encoding=UTF-8 … com.example.Main
I definitely prefer the first way, and you can parametrize the "UTF8" value too, if you need to.
With the latter approach you could still face the same issues if you forget to specify the flag.
You can replace the encoding with whatever you prefer (refer to https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html for the supported encodings); on Windows, "Cp1252" is usually the default encoding.
Remember, you can always query the file.encoding property or Charset.defaultCharset() to find the current default encoding for your application, e.g.:
byte[] byteArray = "blablabla".getBytes();
InputStream inputStream = new ByteArrayInputStream(byteArray);
InputStreamReader reader = new InputStreamReader(inputStream);
String defaultEncoding = reader.getEncoding();
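A minimal, self-contained way to see the default is to print it directly; both calls below are standard JDK API:

```java
import java.nio.charset.Charset;

public class DefaultEncoding {
    public static void main(String[] args) {
        // the JVM-wide default charset, fixed at startup from file.encoding
        System.out.println(Charset.defaultCharset().name());
        // the raw system property it is derived from
        System.out.println(System.getProperty("file.encoding"));
    }
}
```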
Working with encodings is a tricky thing. If your system regularly receives these kinds of files (from a different environment) then you should first detect the charset, then read the file with the detected charset. I had a similar problem and used
juniversalchardet
to detect the charset, then used InputStreamReader(stream, Charset).
In your case it would look like:
protected void openTempRemoteFile() throws IOException {
    String encoding = UniversalDetector.detectCharset(lfile);
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(lfile),
                    Charset.forName(encoding)));
    ....
If it is only a one-time job, open the file in a text editor (Notepad++ for example) and save it in your encoding, then use it in the program.

Converting utf8 to gb2312 in java

Just look at the code below:
try {
    String str = "上海上海";
    String gb2312 = new String(str.getBytes("utf-8"), "gb2312");
    String utf8 = new String(gb2312.getBytes("gb2312"), "utf-8");
    System.out.println(str.equals(utf8));
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
It prints false!
I ran this code under both JDK 7 and JDK 8, and my IDE's file encoding is UTF-8.
Can anyone help me?
What you are looking for is the encoding/decoding when you output/input.
As @kalpesh said, internally it is all Unicode. If you want to READ a stream in a specific encoding and then WRITE it in a different one, you will have to specify the encoding for the conversion between bytes (in the stream) and strings (in Java), and then between strings (in Java) and bytes (the output stream), like so:
InputStream is = new FileInputStream("utf8_encoded_text.txt");
OutputStream os = new FileOutputStream("gb2312_encoded.txt");
Reader r = new InputStreamReader(is, "utf-8");
BufferedReader br = new BufferedReader(r);
Writer w = new OutputStreamWriter(os, "gb2312");
BufferedWriter bw = new BufferedWriter(w);
String s = null;
while ((s = br.readLine()) != null) {
    bw.write(s); // note: readLine() strips the line terminators
}
br.close();
bw.close(); // closing the writer flushes it and closes the underlying stream
Of course, you still have to do proper exception handling to make sure everything is properly closed.
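The same conversion can be sketched with try-with-resources (Java 7+), which flushes and closes everything even if an exception is thrown; the temp files here stand in for the real input and output paths:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.file.Files;
import java.nio.file.Path;

public class Transcode {
    public static void main(String[] args) throws IOException {
        // temp files stand in for utf8_encoded_text.txt / gb2312_encoded.txt
        Path in = Files.createTempFile("in", ".txt");
        Path out = Files.createTempFile("out", ".txt");
        Files.write(in, "上海\n".getBytes("UTF-8"));

        // try-with-resources flushes and closes both streams automatically
        try (BufferedReader br = new BufferedReader(
                     new InputStreamReader(new FileInputStream(in.toFile()), "utf-8"));
             BufferedWriter bw = new BufferedWriter(
                     new OutputStreamWriter(new FileOutputStream(out.toFile()), "gb2312"))) {
            String s;
            while ((s = br.readLine()) != null) {
                bw.write(s);
                bw.write("\n"); // readLine() strips line terminators, so add one back
            }
        }

        // 上海 is 2 bytes per character in GB2312, plus 1 byte for the newline
        System.out.println(Files.readAllBytes(out).length); // 5
    }
}
```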
String gb2312 = new String(str.getBytes("utf-8"), "gb2312");
This statement is incorrect because the String constructor is supposed to take a byte array together with the charset those bytes are actually encoded in; here you produce UTF-8 bytes but tell the constructor they are GB2312.
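To make that concrete, here is a sketch of the round trip done correctly: always decode bytes with the same charset they were encoded with, and only re-encode afterwards (this assumes the JVM supports GB2312, which standard JDK builds do):

```java
import java.nio.charset.Charset;

public class RoundTrip {
    public static void main(String[] args) {
        String str = "上海上海";
        Charset utf8 = Charset.forName("UTF-8");
        Charset gb = Charset.forName("GB2312");

        // decode with the SAME charset that was used to encode
        String decoded = new String(str.getBytes(utf8), utf8);
        // then re-encode/decode as GB2312 (these characters exist in GB2312)
        String fromGb = new String(decoded.getBytes(gb), gb);

        System.out.println(str.equals(decoded)); // true
        System.out.println(str.equals(fromGb));  // true
    }
}
```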

Can someone explain me how this code works

I got it from this page:
Android AsyncTask method that I dont know how to solve
but I am not sure how it works completely. Can someone explain to me what the while loop is for, and this part: "iso-8859-1"?
I understood that the 8 is the number of characters, but I could be wrong.
static InputStream is = null;
static String json = "";

is = httpEntity.getContent();
BufferedReader reader = new BufferedReader(new InputStreamReader(
        is, "iso-8859-1"), 8);
StringBuilder sb = new StringBuilder();
String line = null;
while ((line = reader.readLine()) != null) {
    sb.append(line + "\n");
}
is.close();
json = sb.toString();
Your code basically reads from an InputStream obtained from the HttpEntity, collects it into a StringBuilder, and finally converts it into a JSON string.
For understanding the api codes, javadoc is your friend.
Here is what I found in the BufferedReader javadoc:
public BufferedReader(Reader in, int sz)
Creates a buffering character-input stream that uses an input buffer of the specified size.
Parameters: in - A Reader; sz - Input-buffer size
Throws: IllegalArgumentException - If sz is <= 0
http://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html
As a reader, InputStreamReader is used in your code. Here is the relevant javadoc for the InputStreamReader
public InputStreamReader(InputStream in, Charset cs)
Creates an InputStreamReader that uses the given charset.
Parameters: in - An InputStream; cs - A charset
http://docs.oracle.com/javase/7/docs/api/java/io/InputStreamReader.html#InputStreamReader(java.io.InputStream, java.nio.charset.Charset)
So "iso-8859-1" is the charset specified, and the 8 is simply the size of the input buffer in characters; it has nothing to do with the encoding.
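As a side note, since Java 7 the charset can be passed as a StandardCharsets constant instead of a string, which avoids typos and the checked UnsupportedEncodingException. A sketch, where a byte-array stream stands in for httpEntity.getContent():

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class CharsetConstant {
    public static void main(String[] args) throws Exception {
        // stand-in for the HTTP entity's content stream
        InputStream is = new ByteArrayInputStream(
                "{\"ok\":true}".getBytes(StandardCharsets.ISO_8859_1));

        // same structure as the question's code, with a Charset constant
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(is, StandardCharsets.ISO_8859_1), 8);

        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line).append('\n');
        }
        reader.close();
        System.out.println(sb.toString().trim()); // {"ok":true}
    }
}
```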

InputStreamReader don't limit returned length

I am learning Java and am going through the examples on the Android website. I am fetching the contents of a remote XML file. I am able to get the contents of the file, but then I need to convert the InputStream into a String.
public String readIt(InputStream stream, int len) throws IOException, UnsupportedEncodingException {
    InputStreamReader reader = new InputStreamReader(stream, "UTF-8");
    char[] buffer = new char[len];
    reader.read(buffer);
    return new String(buffer);
}
The issue I am having is that I don't want the string to be limited by the len var, but I don't know Java well enough to know how to change this.
How can I create the char array without a fixed length?
Generally speaking it's bad practice not to have a maximum length on input strings like that, due to the possibility of running out of memory to store them.
That said, you could ignore the len variable and just loop on reader.read(...), appending the buffer to your result until you've read the entire InputStream, like so:
public String readIt(InputStream stream, int len) throws IOException, UnsupportedEncodingException {
    StringBuilder result = new StringBuilder();
    InputStreamReader reader = new InputStreamReader(stream, "UTF-8");
    char[] buffer = new char[len];
    int charsRead;
    while ((charsRead = reader.read(buffer)) >= 0) {
        // append only the characters actually read; the final read
        // usually does not fill the whole buffer
        result.append(buffer, 0, charsRead);
    }
    return result.toString();
}

Java: problem with reading a file

I'm loading an XML file with this method:
public static String readTextFile(String fullPathFilename) throws IOException {
    StringBuffer sb = new StringBuffer(1024);
    BufferedReader reader = new BufferedReader(new FileReader(fullPathFilename));
    char[] chars = new char[1024];
    while (reader.read(chars) > -1) {
        sb.append(String.valueOf(chars));
    }
    reader.close();
    return sb.toString();
}
But it doesn't load all the data. Instead of 25634 characters, it loads 10 fewer (25624). Why is that?
Thanks,
Ivan
With BufferedReader you get the readLine() method, which works well for me:
StringBuffer sb = new StringBuffer(1024);
BufferedReader reader = new BufferedReader(new FileReader(fullPathFilename));
while (true) {
    String line = reader.readLine();
    if (line == null) {
        break;
    }
    sb.append(line); // note: readLine() strips the line terminators
}
reader.close();
I think there's a bug in your code: the last read might not fill the char[] completely, but you still append all of it to the string. To account for this you need to do something like:
StringBuilder res = new StringBuilder();
InputStreamReader r = new InputStreamReader(new BufferedInputStream(is));
char[] c = new char[1024];
while (true) {
    int charCount = r.read(c);
    if (charCount == -1) {
        break;
    }
    res.append(c, 0, charCount);
}
r.close();
Also, how do you know you're expecting 25634 chars?
(And use StringBuilder instead of StringBuffer; the former is not thread-safe and therefore slightly faster.)
Perhaps you have 25634 bytes in your file that represent only 25624 characters? This can happen with multibyte character sets like UTF-8. Every InputStreamReader (including FileReader) automatically does this conversion using a Charset (either an explicitly given one, or the default encoding, which depends on the platform).
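This is easy to demonstrate: in a multibyte encoding like UTF-8, non-ASCII characters take more than one byte, so the byte count and the character count of the same text differ. A small sketch:

```java
import java.nio.charset.StandardCharsets;

public class ByteVsChar {
    public static void main(String[] args) {
        String s = "Grüße"; // contains two non-ASCII characters
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

        System.out.println(s.length());  // 5 characters
        System.out.println(utf8.length); // 7 bytes: ü and ß take 2 bytes each
    }
}
```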
Use a FileInputStream to read raw bytes, so that no multibyte sequences get decoded as UTF-8 characters:
StringBuffer sb = new StringBuffer(1024);
FileInputStream fis = new FileInputStream(filename);
byte[] bytes = new byte[1024];
int n;
while ((n = fis.read(bytes)) > -1) {
    // ISO-8859-1 maps each byte to exactly one character
    sb.append(new String(bytes, 0, n, "ISO-8859-1"));
}
fis.close();
return sb.toString();
