Converting utf8 to gb2312 in java

Just look at the code below:
try {
String str = "上海上海";
String gb2312 = new String(str.getBytes("utf-8"), "gb2312");
String utf8 = new String(gb2312.getBytes("gb2312"), "utf-8");
System.out.println(str.equals(utf8));
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
This prints false!
I ran this code under both JDK 7 and JDK 8, and my IDE's file encoding is UTF-8.
Can anyone help me?

What you are looking for is the encoding/decoding that happens on input and output.
As @kalpesh said, internally it is all Unicode. If you want to READ a stream in one encoding and then WRITE it in a different one, you have to specify the encoding for the conversion from bytes (in the stream) to strings (in Java), and then from strings (in Java) to bytes (in the output stream), like so:
InputStream is = new FileInputStream("utf8_encoded_text.txt");
OutputStream os = new FileOutputStream("gb2312_encoded.txt");
// Decode the input bytes as UTF-8 and encode the output bytes as GB2312.
Reader r = new InputStreamReader(is, "utf-8");
BufferedReader br = new BufferedReader(r);
Writer w = new OutputStreamWriter(os, "gb2312");
BufferedWriter bw = new BufferedWriter(w);
String s = null;
while ((s = br.readLine()) != null) {
    bw.write(s);
    bw.newLine(); // readLine() strips the line terminator, so write one back
}
br.close();
bw.close(); // flushes the writer and closes the underlying stream
Of course, you still have to do proper exception handling to make sure everything is properly closed.
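One way to get the closing right is try-with-resources. The following is only a minimal sketch, assuming Java 7 or later; the class name FileTranscoder and the method transcode are hypothetical, and temporary files stand in for the two file names used above.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the answer above.
final class FileTranscoder {

    // Reads source as sourceCharset and writes the same text to target as targetCharset.
    static void transcode(Path source, Charset sourceCharset,
                          Path target, Charset targetCharset) {
        // try-with-resources closes both files even if reading or writing fails.
        try (BufferedReader reader = Files.newBufferedReader(source, sourceCharset);
             BufferedWriter writer = Files.newBufferedWriter(target, targetCharset)) {
            char[] buffer = new char[8192];
            int count;
            // Copy characters in chunks, which keeps the original line terminators.
            while ((count = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Temporary files stand in for utf8_encoded_text.txt and gb2312_encoded.txt.
        Path source = Files.createTempFile("utf8_encoded_text", ".txt");
        Path target = Files.createTempFile("gb2312_encoded", ".txt");
        // The escapes below spell the two characters of "Shanghai" from the question.
        Files.write(source, "\u4e0a\u6d77\n".getBytes(StandardCharsets.UTF_8));

        transcode(source, StandardCharsets.UTF_8, target, Charset.forName("GB2312"));
        System.out.println(Files.size(target) + " bytes written as GB2312");
    }
}
```

Copying characters in chunks rather than line by line avoids having to re-add line terminators, at the cost of not seeing individual lines.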

String gb2312 = new String(str.getBytes("utf-8"), "gb2312");
This statement is incorrect because the String constructor expects the byte array and the charset to match. Here the bytes are UTF-8, but you tell the constructor they are GB2312.
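For comparison, here is a minimal sketch of an in-memory conversion where each byte array is paired with the charset that produced it. The class name Utf8ToGb2312 and its method are hypothetical, and the sketch assumes the JDK in use ships the GB2312 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the question or the answers.
final class Utf8ToGb2312 {

    static final Charset GB2312 = Charset.forName("GB2312");

    // Decode with the charset the bytes are really in, then encode with the target charset.
    static byte[] utf8ToGb2312(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        return text.getBytes(GB2312);
    }

    public static void main(String[] args) {
        // Same four characters as in the question, written as Unicode escapes.
        String str = "\u4e0a\u6d77\u4e0a\u6d77";

        byte[] utf8Bytes = str.getBytes(StandardCharsets.UTF_8);
        byte[] gb2312Bytes = utf8ToGb2312(utf8Bytes);

        // Decoding the GB2312 bytes as GB2312 gives back the original text.
        String roundTrip = new String(gb2312Bytes, GB2312);
        System.out.println(str.equals(roundTrip));
    }
}
```

The String in the middle has no encoding of its own to "convert"; only the byte arrays on either side do.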

Related

Cannot convert and save UTF-8 string to ANSI in java

Here is my code. I have to write a string to the console in UTF-8 but save it to a file in ANSI. When I open the file, it is in UTF-8. What should I do?
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
String message = bufferedReader.readLine();
bufferedReader.close();
String utfString = new String(message.getBytes(), "UTF-8");
String ansiMessage = new String(utfString.getBytes(), "WINDOWS-1251");
writeToFile(ansiMessage, "ANSI.txt", "WINDOWS-1251");
private static void writeToFile(String string, String path, String enc) throws IOException {
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(path), enc));
writer.write(string);
writer.close();
}

First, Java strings are UTF-16 internally, but getBytes() with no argument encodes them in the platform's default charset (typically UTF-8 or a Windows code page, depending on the system). Second, new String(byte[], String) interprets the bytes as text in the charset you name; it doesn't convert them. So:
new String(message.getBytes(), "UTF-8")
decodes default-charset bytes as if they were UTF-8, which corrupts the text whenever the default is not UTF-8. Then:
new String(utfString.getBytes(), "WINDOWS-1251")
decodes default-charset bytes as WINDOWS-1251, which is equally bad unless the default happens to be WINDOWS-1251.
Unless the default charset happens to match at each step, your string is destroyed at this point.
You can just call getBytes(Charset) to get the bytes of your string in the charset you want. But in your case you don't even need to do that, because your writeToFile(...) method already does the charset conversion when writing to the file, so you can just give it the original message.
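A minimal sketch of that last point, assuming Java 7 or later: the string is handed to the writer unchanged and the writer does the encoding. The class name AnsiWriter is hypothetical, the method takes a Path and a Charset instead of the two strings in the question, and a temporary file stands in for ANSI.txt.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical variant of the question's writeToFile method.
final class AnsiWriter {

    // Writes text to path, letting the writer encode it in the given charset.
    static void writeToFile(String text, Path path, Charset charset) {
        try (BufferedWriter writer = Files.newBufferedWriter(path, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Six Cyrillic letters, written as escapes so the source file encoding does not matter.
        String message = "\u041f\u0440\u0438\u0432\u0435\u0442";
        Path file = Files.createTempFile("ANSI", ".txt");

        // No intermediate getBytes() / new String(...) calls are needed.
        writeToFile(message, file, Charset.forName("windows-1251"));

        // windows-1251 is a single-byte charset, so the size equals the character count.
        System.out.println(Files.size(file) + " bytes for " + message.length() + " characters");
    }
}
```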

Buffer Reader encoding charset for Russian characters

We are currently facing an issue where Russian characters are converted to junk data, which shows up as rectangles in Notepad. Below is the code we are using; it runs on a Linux server with Java 1.8.
BufferedReader buff = new BufferedReader(new FileReader(new File("text.txt")));
String line;
StringBuffer result = new StringBuffer();
while ((line = buff.readLine()) != null)
{
    result.append(line).append('\n');
}
return result.toString().getBytes();
Earlier, the same code used to work in an AIX environment with Java 1.6.
Can anyone please give me a hint about what might be going wrong? This seems to be entirely environmental, since no code changes have been made.

Your code seems to be reading the whole file into a byte-array. That can be done this way:
static byte [] GetFileBytes (String filename)
throws java.io.FileNotFoundException,
java.io.IOException {
java.io.File f= new java.io.File (filename);
java.io.FileInputStream fi= new java.io.FileInputStream (f);
long fsize = f.length ();
byte b [] = new byte [(int)fsize];
int rsize= fi.read (b, 0, (int)fsize);
fi.close ();
if (rsize!=fsize) {
byte [] btmp= new byte [rsize];
System.arraycopy (b, 0, btmp, 0, rsize);
b= btmp;
}
return b;
}
Or, within your code, you can pick an encoding and use it in both conversions:
static byte [] GetFileByteArray (String filename)
throws Exception {
String cset= "ISO-8859-1"; /* any one-byte encoding */
java.io.BufferedReader buff=
new java.io.BufferedReader
(new java.io.InputStreamReader
(new java.io.FileInputStream(filename), cset));
String line;
StringBuffer result= new StringBuffer ();
while((line=buff.readLine())!=null)
{
result.append(line).append('\n');
}
return result.toString().getBytes(cset);
}
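On Java 7 or later, the same two ideas can be sketched more briefly with java.nio.file.Files. This is only an illustration: the class FileBytes and its methods are hypothetical names, and the caller is assumed to know which charset the file is really in.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the answer above.
final class FileBytes {

    // Raw bytes need no charset at all: nothing is decoded, so nothing can be corrupted.
    static byte[] readRaw(Path file) {
        try {
            return Files.readAllBytes(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Text needs the charset the file was written in, stated explicitly.
    static String readText(Path file, Charset charset) {
        return new String(readRaw(file), charset);
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for text.txt; the escapes are six Cyrillic letters.
        Path file = Files.createTempFile("text", ".txt");
        Files.write(file, "\u041f\u0440\u0438\u0432\u0435\u0442\n".getBytes(StandardCharsets.UTF_8));

        System.out.println(readRaw(file).length + " raw bytes");
        System.out.println(readText(file, StandardCharsets.UTF_8).length() + " characters");
    }
}
```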

Try this:
BufferedReader buff = new BufferedReader(
    new InputStreamReader(
        new FileInputStream(fileDir), "UTF-8"));
Edit: make sure the file is saved as UTF-8.

Java - Problems with "ü/ä/ö" after SCP

I wrote a program that can load local or remote log files.
If I load a local file, there is no error.
But if I first copy the file to my local machine with SCP (using this code: http://www.jcraft.com/jsch/examples/ScpFrom.java.html) and then read it, I get an error and the letters "ü/ä/ö" are shown as �.
How can I fix this?
Remote: Linux server
Local: Windows PC
Code for SCP:
http://www.jcraft.com/jsch/examples/ScpFrom.java.html
Code for reading the file:
protected void openTempRemoteFile() throws IOException {
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream( lfile )));
String strLine;
DefaultTableModel dtm = new DefaultTableModel(0, 0);
String header[] = new String[]{ "Timestamp", "Session-ID", "Log" };
dtm.setColumnIdentifiers(header);
table.setModel(dtm);
while ((strLine = reader.readLine()) != null) {
String[] sparts = strLine.split(" ");
String[] bparts = strLine.split(" : ");
String Timestamp = sparts[0] + " " + sparts[1];
String SessionID = sparts[4];
String Log = bparts[1];
dtm.addRow(new Object[] {Timestamp, SessionID, Log});
}
reader.close();
}
EDIT :
Encoding Format of the Local-Files: UTF-8
Encoding Format of the SCP-Remote-Files from Linux-Server: WINDOWS-1252

Supply the appropriate Charset to the InputStreamReader constructor, e.g.:
import java.nio.charset.StandardCharsets;
...
BufferedReader reader = new BufferedReader(
new InputStreamReader(
new FileInputStream( lfile ),
StandardCharsets.UTF_8)); // try also ISO_8859_1 if UTF_8 doesn't help.

To fix your problem you have at least two options.
You can specify the encoding for your files directly in your code, updating it as follows:
BufferedReader reader = new BufferedReader(
new InputStreamReader(
new FileInputStream( lfile ),
"UTF8"
)
);
Or you can set the default file encoding when starting the JVM:
java -Dfile.encoding=UTF-8 … com.example.Main
I definitely prefer the first way, and you can parametrize the "UTF8" value too if you need to.
With the second way, you could still face the same issue if you forget to pass the option.
You can replace the encoding with whatever you prefer (refer to https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html for the supported encodings); on Windows, "Cp1252" is usually the default encoding.
Remember, you can always query the file.encoding property or Charset.defaultCharset() to find the current default encoding of your application, e.g.:
byte[] byteArray = "blablabla".getBytes();
InputStream inputStream = new ByteArrayInputStream(byteArray);
// With no charset argument, the reader uses the default encoding.
InputStreamReader reader = new InputStreamReader(inputStream);
String defaultEncoding = reader.getEncoding();
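The two direct queries mentioned above can be sketched as follows. The class and method names are hypothetical; only System.getProperty and Charset.defaultCharset() come from the JDK.

```java
import java.nio.charset.Charset;

// Hypothetical helper for inspecting the JVM's default encoding.
final class DefaultEncoding {

    // Builds a one-line summary of the two values discussed above.
    static String describeDefaults() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // The output depends on the platform and on any -Dfile.encoding option.
        System.out.println(describeDefaults());
    }
}
```

Logging this once at startup can help when the same code behaves differently on two machines.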

Working with encodings is tricky. If your system regularly receives this kind of file (from a different environment), you should first detect the charset and then read the file with the detected charset. I had a similar problem and used
juniversalchardet
to detect the charset, then passed it to InputStreamReader(stream, Charset).
In your case it would look like this:
protected void openTempRemoteFile() throws IOException {
String encoding = UniversalDetector.detectCharset(lfile);
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream( lfile ), Charset.forName(encoding)));
....
If it is only a one-time job, open the file in a text editor (Notepad++, for example), save it in your encoding, and then use it in your program.

Convert Windows-1252 to UTF-16 in Java

I am trying to convert all Windows special characters to their Unicode equivalent. We have a Flex application, where a user saves some Rich Text, and then it is emailed through a Java Emailer to their recipient. However, we keep running into Word's special characters that just show up in the email as a ?.
So far I've tried
private String replaceWordChars(String text_in) {
    String s = text_in;
    // Note: no '|' inside the character classes; there it would match a literal pipe.
    // smart single quotes and apostrophe
    s = s.replaceAll("[\\u2018\\u2019\\u201A]", "'");
    // smart double quotes
    s = s.replaceAll("[\\u201C\\u201D\\u201E]", "\"");
    // ellipsis
    s = s.replaceAll("\\u2026", "...");
    // dashes
    s = s.replaceAll("[\\u2013\\u2014]", "-");
    // circumflex
    s = s.replaceAll("\\u02C6", "^");
    // open angle bracket
    s = s.replaceAll("\\u2039", "<");
    // close angle bracket
    s = s.replaceAll("\\u203A", ">");
    // spaces
    s = s.replaceAll("[\\u02DC\\u00A0]", " ");
    return s;
}
This works, but I don't want to hand-encode every Windows-1252 character to its UTF-16 equivalent (assuming that's what the default Java character set is).
However, our users keep finding more characters from Microsoft Word that Java just can't handle. So I searched and searched, and found this example:
private String replaceWordChars(String text_in) {
String s = text_in;
try {
byte[] b = s.getBytes("Cp1252");
byte[] encoded = new String(b, "Cp1252").getBytes("UTF-16");
s = new String(encoded, "UTF-16");
} catch (UnsupportedEncodingException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return s;
}
But when I watch the encoding happen in the Eclipse debugger, nothing changes.
There has to be a simple solution to dealing with Microsoft's lovely encoding with Java.
Any thoughts?

You could try using java.nio.charset.Charset:
final Charset windowsCharset = Charset.forName("windows-1252");
final Charset utfCharset = Charset.forName("UTF-16");
final CharBuffer windowsEncoded = windowsCharset.decode(ByteBuffer.wrap(new byte[] {(byte) 0x91}));
// Copy only the encoded bytes; array() may expose unused trailing capacity.
final ByteBuffer utfBuffer = utfCharset.encode(windowsEncoded);
final byte[] utfEncoded = new byte[utfBuffer.remaining()];
utfBuffer.get(utfEncoded);
System.out.println(new String(utfEncoded, utfCharset));
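To see what the decoding step does on its own, here is a minimal sketch using the String API instead of buffers. The class Cp1252Decode and its methods are hypothetical names; UTF-16BE is chosen here (an assumption, not something the question specifies) so that no byte-order mark is added to the output.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the answer above.
final class Cp1252Decode {

    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Bytes in, characters out: this is the only place windows-1252 matters.
    static String decodeCp1252(byte[] bytes) {
        return new String(bytes, WINDOWS_1252);
    }

    // Characters in, bytes out: only needed when UTF-16 bytes must leave the program.
    static byte[] encodeUtf16(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // 0x91 is the left single quotation mark in windows-1252.
        String decoded = decodeCp1252(new byte[] {(byte) 0x91});
        System.out.printf("U+%04X%n", (int) decoded.charAt(0)); // prints U+2018

        byte[] utf16 = encodeUtf16(decoded);
        System.out.println(utf16.length + " bytes in UTF-16BE");
    }
}
```

A Java String already holds UTF-16 code units, so once the bytes have been decoded with the right charset there is nothing further to convert inside the program.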

Use the following steps:
1. Create an InputStreamReader using the source file's encoding (Windows-1252).
2. Create an OutputStreamWriter using the destination file's encoding (UTF-16).
3. Copy the information read from the reader to the writer. You can use BufferedReader and BufferedWriter to write the contents line by line.
So your code may look like this:
public void reencode(InputStream source, OutputStream dest,
        String sourceEncoding, String destEncoding)
        throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(source, sourceEncoding));
    BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(dest, destEncoding));
    String in;
    while ((in = reader.readLine()) != null) {
        writer.write(in);
        writer.newLine();
    }
    // Push buffered characters through to dest; closing the streams is left to the caller.
    writer.flush();
}
This, of course, excludes try/catch stuff and delegates it to the caller.
If you're just trying to get the contents as a string of sorts, you can replace the writer with StringWriter and return its toString value. Then you don't need a destination stream or encoding, just a place to dump characters:
public String decode(InputStream source, String sourceEncoding)
throws IOException {
BufferedReader reader = new BufferedReader(new InputStreamReader(source, sourceEncoding));
StringWriter writer = new StringWriter();
String in;
while ((in = reader.readLine()) != null) {
writer.write(in);
writer.write('\n'); // Java newline should be fine, test this just in case
}
return writer.toString();
}

What seems to work so far for everything I've tested is:
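private String replaceWordChars(String text_in) {
    String s = text_in;
    final Charset windowsCharset = Charset.forName("windows-1252");
    final Charset utfCharset = Charset.forName("UTF-16");
    // Note: getBytes() with no argument uses the platform default charset.
    byte[] incomingBytes = s.getBytes();
    final CharBuffer windowsEncoded =
            windowsCharset.decode(ByteBuffer.wrap(incomingBytes));
    // Copy only the encoded bytes; array() may expose unused trailing capacity.
    final ByteBuffer utfBuffer = utfCharset.encode(windowsEncoded);
    final byte[] utfEncoded = new byte[utfBuffer.remaining()];
    utfBuffer.get(utfEncoded);
    // Decode with the same charset that produced the bytes.
    s = new String(utfEncoded, utfCharset);
    return s;
}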

Splitting strings by newline trouble

I am reading in a file that is being sent through a socket and then trying to split it on newlines (\n). When I read in the file I use a byte[], and I convert the byte array to a String so that I can split it.
public String getUserFileData()
{
try
{
byte[] mybytearray = new byte[1024];
InputStream is = clientSocket.getInputStream();
int bytesRead = is.read(mybytearray, 0, mybytearray.length);
is.close();
return new String(mybytearray);
}
catch(IOException e)
{
}
return "";
}
Here is the code that attempts to split the String:
public void readUserFile(String userData, Log logger)
{
String[] data;
String companyName;
data = userData.split("\n");
username = data[0];
password = data[1].toCharArray();
companyName = data[2];
quota = Float.parseFloat(data[3]);
company = new Company();
company.readCompanyFile("C:\\Users\\Chris\\Documents\\NetBeansProjects\\ArFile\\ArFile Clients\\" + companyName + "\\"
+ companyName + ".cmp");
cloudFiles = new CloudFiles();
cloudFiles.readCloudFiles(this, logger);
}
It causes this error
Exception in thread "AWT-EventQueue-1" java.lang.ArrayIndexOutOfBoundsException

You can use the readLine method of the BufferedReader class.
Wrap the InputStream in an InputStreamReader, and wrap that in a BufferedReader:
InputStream is = clientSocket.getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(is));
Please also check the encoding of the stream - you might need to specify the encoding in the constructor of InputStreamReader.

As stated in comments, using a BufferedReader would be best - you should be using an InputStreamReader anyway in order to convert from binary to text.
// Or use a different encoding - whatever's appropriate
BufferedReader reader = new BufferedReader(
    new InputStreamReader(clientSocket.getInputStream(), "UTF-8"));
try {
String line;
// I'm assuming you want to read every incoming line
while ((line = reader.readLine()) != null) {
processLine(line);
}
} finally {
reader.close();
}
Note that it's important to state which encoding you want to use - otherwise it'll use the platform's default encoding, which will vary from machine to machine, whereas presumably the data is in one specific encoding. If you don't know which encoding that is yet, you need to find out. Until then, you simply can't reliably understand the data.
(I hope your real code doesn't have an empty catch block, by the way.)
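As a self-contained illustration of the same approach, here is a minimal sketch in which a ByteArrayInputStream stands in for clientSocket.getInputStream(). The class LineReader, the method readLines and the sample field values are all hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the answer above.
final class LineReader {

    // Reads every line until end of stream, decoding with an explicit charset.
    static List<String> readLines(InputStream in, Charset charset) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up sample data in the four-line layout the question expects.
        byte[] sample = "alice\nsecret\nAcme\n12.5\n".getBytes(StandardCharsets.UTF_8);
        List<String> data = readLines(new ByteArrayInputStream(sample), StandardCharsets.UTF_8);

        // Check the line count before indexing, to avoid an out-of-bounds access.
        if (data.size() < 4) {
            throw new IllegalStateException("expected 4 lines, got " + data.size());
        }
        System.out.println("quota = " + Float.parseFloat(data.get(3)));
    }
}
```

Collecting lines into a list sidesteps the fixed 1024-byte buffer and the single read() call in the question, neither of which guarantees that the whole file has arrived.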
