Here is my code. I read a string from the console in UTF-8, but I need to save it to a file in ANSI. When I open the file, it's in UTF-8. What am I doing wrong?
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
String message = bufferedReader.readLine();
bufferedReader.close();
String utfString = new String(message.getBytes(), "UTF-8");
String ansiMessage = new String(utfString.getBytes(), "WINDOWS-1251");
writeToFile(ansiMessage, "ANSI.txt", "WINDOWS-1251");
private static void writeToFile(String string, String path, String enc) throws IOException {
    BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(path), enc));
    writer.write(string);
    writer.close();
}
First, getBytes() with no argument returns the bytes of the string in the platform's default charset (Java strings are stored internally as UTF-16, but that is not what getBytes() gives you, and the default charset varies from machine to machine). Second, new String(byte[], String) interprets the given bytes as text encoded in the charset provided; it does not convert anything. So:
new String(message.getBytes(), "UTF-8")
tries to decode default-charset bytes as UTF-8: bad. Then:
new String(utfString.getBytes(), "WINDOWS-1251")
tries to decode the resulting bytes as WINDOWS-1251: equally bad.
I'm sure at this point your string is destroyed.
You can just call getBytes(Charset) to get the bytes of your string in the charset you want. But in your case you don't even need to do that, because your writeToFile(...) method already does charset conversion when writing to the file, so you can just give it the original message.
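To make that concrete, here is a minimal self-contained sketch (the Cyrillic sample text and the file name "ansi.txt" are just placeholders) showing both points: getBytes with an explicit charset when you really need raw bytes, and simply handing the original String to a writeToFile-style method that does the conversion itself:

```java
import java.io.*;
import java.nio.file.*;

public class EncodeDemo {
    // Same shape as the writeToFile(...) above: the OutputStreamWriter
    // performs the charset conversion while writing.
    static void writeToFile(String s, String path, String enc) throws IOException {
        Writer w = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(path), enc));
        w.write(s);
        w.close();
    }

    public static void main(String[] args) throws IOException {
        String message = "\u043f\u0440\u0438\u0432\u0435\u0442"; // sample Cyrillic text
        // If you ever do need raw bytes in a specific charset:
        byte[] cp1251 = message.getBytes("windows-1251");
        // But for the file, just hand the original String to the writer:
        writeToFile(message, "ansi.txt", "windows-1251");
        byte[] onDisk = Files.readAllBytes(Paths.get("ansi.txt"));
        // The file contains exactly the Windows-1251 bytes of the original string
        System.out.println(java.util.Arrays.equals(cp1251, onDisk));
    }
}
```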
Just look at the code below:
try {
    String str = "上海上海";
    String gb2312 = new String(str.getBytes("utf-8"), "gb2312");
    String utf8 = new String(gb2312.getBytes("gb2312"), "utf-8");
    System.out.println(str.equals(utf8));
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
It prints false!
I ran this under both JDK 7 and JDK 8, and my IDE's file encoding is UTF-8.
Can anyone help me?
What you are looking for is the encoding/decoding you apply when you input/output.
As @kalpesh said, internally it is all Unicode. If you want to READ a stream in a specific encoding and then WRITE it in a different one, you have to specify the encoding for the conversion between bytes (in the input stream) and strings (in Java), and again between strings (in Java) and bytes (in the output stream), like so:
InputStream is = new FileInputStream("utf8_encoded_text.txt");
OutputStream os = new FileOutputStream("gb2312_encoded.txt");
Reader r = new InputStreamReader(is, "utf-8");
BufferedReader br = new BufferedReader(r);
Writer w = new OutputStreamWriter(os, "gb2312");
BufferedWriter bw = new BufferedWriter(w);
String s = null;
while ((s = br.readLine()) != null) {
    bw.write(s);
}
br.close();
bw.close(); // flushes the BufferedWriter and closes the underlying stream
Of course, you still have to do proper exception handling to make sure everything is properly closed.
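One way to get that closing right (on Java 7 and later) is try-with-resources. A self-contained sketch of the same copy loop, with the input file created up front so it can actually run (the file names are just placeholders):

```java
import java.io.*;
import java.nio.file.*;

public class Recode {
    public static void main(String[] args) throws IOException {
        // Create a sample UTF-8 input file so the demo is self-contained
        Files.write(Paths.get("utf8_in.txt"), "\u4e0a\u6d77\u4e0a\u6d77".getBytes("utf-8"));
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                 new FileInputStream("utf8_in.txt"), "utf-8"));
             BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(
                 new FileOutputStream("gb2312_out.txt"), "gb2312"))) {
            String s;
            while ((s = br.readLine()) != null) {
                bw.write(s);
                bw.newLine();
            }
        } // both chains are flushed and closed here, even if an exception is thrown
        // Decoding the new file as GB2312 recovers the original text
        String back = new String(Files.readAllBytes(Paths.get("gb2312_out.txt")), "gb2312");
        System.out.println(back.trim().equals("\u4e0a\u6d77\u4e0a\u6d77"));
    }
}
```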
String gb2312 = new String(str.getBytes("utf-8"), "gb2312");
This statement is incorrect because the String constructor expects the byte array and the charset to match: you are passing UTF-8 bytes but telling it they are GB2312.
I wanted to convert some files from ANSI (Arabic) to UTF-8. It works, but the newly created file is missing some lines at the end. Any ideas why?
This is the code:
public class CustomFileConverter {
    private static final char BYTE_ORDER_MARK = '\uFEFF';
    public void createFile(String inputFile, String outputFile) throws IOException {
        FileInputStream input = new FileInputStream(inputFile);
        InputStreamReader inputStreamReader = new InputStreamReader(input, "windows-1256"); // Arabic
        char[] data = new char[1024];
        int i = inputStreamReader.read(data);
        if (new File(outputFile).exists()) {
            new File(outputFile).delete();
        }
        FileOutputStream output = new FileOutputStream(outputFile, true);
        Writer writer = new OutputStreamWriter(output, "UTF-8");
        String text = "";
        writer.write(BYTE_ORDER_MARK);
        while (i != -1) {
            String str = new String(data, 0, i);
            text = text + str;
            i = inputStreamReader.read(data);
        }
        // System.out.print(text); It is printed completely
        writer.write(text);
        // File lacks some final lines...
        output.close();
        input.close();
    }
}
When wrapping an output stream in a writer and writing to the writer, the writer may cache data before actually forwarding it to the output stream.
Since you're closing the output stream (file) before flushing the writer, there may be unwritten data which can no longer be written to the file since the output stream is closed.
Instead of closing the FileOutputStream output, close the writer writer; that will flush the writer's contents to the file and close both the writer itself and the wrapped FileOutputStream.
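For reference, a corrected sketch of createFile(...) using try-with-resources, which closes (and therefore flushes) the writer automatically, plus a small self-check; the file names and sample lines are just placeholders:

```java
import java.io.*;
import java.nio.file.*;

public class ConvertDemo {
    private static final char BYTE_ORDER_MARK = '\uFEFF';

    public static void createFile(String inputFile, String outputFile) throws IOException {
        try (Reader in = new InputStreamReader(new FileInputStream(inputFile), "windows-1256");
             Writer writer = new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8")) {
            writer.write(BYTE_ORDER_MARK);
            char[] data = new char[1024];
            int i;
            while ((i = in.read(data)) != -1) {
                writer.write(data, 0, i);
            }
        } // closing the writer flushes its buffer into the FileOutputStream
    }

    public static void main(String[] args) throws IOException {
        Files.write(Paths.get("ansi_in.txt"), "line1\nline2\nline3".getBytes("windows-1256"));
        createFile("ansi_in.txt", "utf8_out.txt");
        String out = new String(Files.readAllBytes(Paths.get("utf8_out.txt")), "UTF-8");
        System.out.println(out.endsWith("line3")); // the final line is no longer lost
    }
}
```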
I am reading image data from an InputStream and converting it to a String; then I write the String straight to an image file, like this:
final BufferedReader reader = new BufferedReader(new InputStreamReader(in));
final char[] cbuf = new char[1024];
final int length = reader.read(cbuf);
String packet=new String(cbuf,0,length);
BufferedWriter out = null ;
FileWriter fstream ;
File file = new File(fileName);
fstream = new FileWriter(file);
out.write(packet);
Please guide me on this issue.
I am not getting the full image.
final BufferedReader reader = new BufferedReader(new InputStreamReader(in));
Decodes input using default encoding potentially corrupting data.
out.write(packet);
Encodes characters using default encoding potentially corrupting data.
Read the documentation on the APIs you use, and only fall back on the default (or an unknown) encoding when you absolutely must; binary data such as an image should not go through a character conversion at all.
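Since the data is an image, the safest fix is to skip characters entirely and copy raw bytes with streams; a minimal sketch (the byte array stands in for real image data):

```java
import java.io.*;

public class BinaryCopy {
    // Binary data must be copied as bytes, never through Reader/Writer,
    // which decode and re-encode characters and can corrupt it.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] fake = {(byte) 0x89, 'P', 'N', 'G', (byte) 0xFF, 0, 1}; // arbitrary bytes
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        copy(new ByteArrayInputStream(fake), sink);
        // Every byte survives the copy unchanged
        System.out.println(java.util.Arrays.equals(fake, sink.toByteArray()));
    }
}
```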
I am trying to convert all Windows special characters to their Unicode equivalent. We have a Flex application, where a user saves some Rich Text, and then it is emailed through a Java Emailer to their recipient. However, we keep running into Word's special characters that just show up in the email as a ?.
So far I've tried
private String replaceWordChars(String text_in) {
    String s = text_in;
    // smart single quotes and apostrophe
    s = s.replaceAll("[\\u2018|\\u2019|\\u201A]", "\'");
    // smart double quotes
    s = s.replaceAll("[\\u201C|\\u201D|\\u201E]", "\"");
    // ellipsis
    s = s.replaceAll("\\u2026", "...");
    // dashes
    s = s.replaceAll("[\\u2013|\\u2014]", "-");
    // circumflex
    s = s.replaceAll("\\u02C6", "^");
    // open angle bracket
    s = s.replaceAll("\\u2039", "<");
    // close angle bracket
    s = s.replaceAll("\\u203A", ">");
    // spaces
    s = s.replaceAll("[\\u02DC|\\u00A0]", " ");
    return s;
}
Which works, but I don't want to hand-encode every Windows-1252 character to its UTF-16 equivalent (assuming that's what the default Java character set is).
However our users keep finding more characters from Microsoft Word that Java just can't handle. So I searched and searched, and found this example
private String replaceWordChars(String text_in) {
    String s = text_in;
    try {
        byte[] b = s.getBytes("Cp1252");
        byte[] encoded = new String(b, "Cp1252").getBytes("UTF-16");
        s = new String(encoded, "UTF-16");
    } catch (UnsupportedEncodingException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return s;
}
But when I watch the encoding happen in the Eclipse debugger, nothing changes.
There has to be a simple solution to dealing with Microsoft's lovely encoding with Java.
Any thoughts?
You could try using java.nio.charset.Charset:
final Charset windowsCharset = Charset.forName("windows-1252");
final Charset utfCharset = Charset.forName("UTF-16");
final CharBuffer windowsEncoded = windowsCharset.decode(ByteBuffer.wrap(new byte[] {(byte) 0x91}));
final byte[] utfEncoded = utfCharset.encode(windowsEncoded).array();
System.out.println(new String(utfEncoded, utfCharset.displayName()));
Use the following steps:
Create an InputStreamReader using the source file's encoding (Windows-1252)
Create an OutputStreamWriter using the destination file's encoding (UTF-16)
Copy the information read from the reader to the writer. You can use BufferedReader and BufferedWriter to write contents line-by-line.
So your code may look like this:
public void reencode(InputStream source, OutputStream dest,
                     String sourceEncoding, String destEncoding)
        throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(source, sourceEncoding));
    BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(dest, destEncoding));
    String in;
    while ((in = reader.readLine()) != null) {
        writer.write(in);
        writer.newLine();
    }
    writer.flush(); // push buffered characters into dest; closing is left to the caller
}
This, of course, excludes try/catch stuff and delegates it to the caller.
If you're just trying to get the contents as a string of sorts, you can replace the writer with StringWriter and return its toString value. Then you don't need a destination stream or encoding, just a place to dump characters:
public String decode(InputStream source, String sourceEncoding)
        throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(source, sourceEncoding));
    StringWriter writer = new StringWriter();
    String in;
    while ((in = reader.readLine()) != null) {
        writer.write(in);
        writer.write('\n'); // Java newline should be fine, test this just in case
    }
    return writer.toString();
}
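As a quick check of the decode(...) helper (copied here so the sketch is self-contained), feeding it a couple of Windows-1252 "smart quote" bytes, the very characters the question is fighting with:

```java
import java.io.*;

public class DecodeDemo {
    public static String decode(InputStream source, String sourceEncoding) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(source, sourceEncoding));
        StringWriter writer = new StringWriter();
        String in;
        while ((in = reader.readLine()) != null) {
            writer.write(in);
            writer.write('\n');
        }
        return writer.toString();
    }

    public static void main(String[] args) throws IOException {
        // 0x93/0x94 are Word's smart double quotes in windows-1252
        byte[] cp1252 = {(byte) 0x93, 'h', 'i', (byte) 0x94};
        String s = decode(new ByteArrayInputStream(cp1252), "windows-1252");
        // They decode to the real Unicode characters U+201C and U+201D
        System.out.println(s.trim().equals("\u201Chi\u201D"));
    }
}
```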
What seems to work so far for everything I've tested is:
private String replaceWordChars(String text_in) {
    String s = text_in;
    final Charset windowsCharset = Charset.forName("windows-1252");
    final Charset utfCharset = Charset.forName("UTF-16");
    byte[] incomingBytes = s.getBytes();
    final CharBuffer windowsEncoded =
        windowsCharset.decode(ByteBuffer.wrap(incomingBytes));
    final byte[] utfEncoded = utfCharset.encode(windowsEncoded).array();
    s = new String(utfEncoded);
    return s;
}
I am reading a file that is sent through a socket and then trying to split it on newlines (\n). When I read the file I use a byte[], and I convert the byte array to a String so that I can split it.
public String getUserFileData()
{
    try
    {
        byte[] mybytearray = new byte[1024];
        InputStream is = clientSocket.getInputStream();
        int bytesRead = is.read(mybytearray, 0, mybytearray.length);
        is.close();
        return new String(mybytearray);
    }
    catch(IOException e)
    {
    }
    return "";
}
Here is the code used to attempt to split the String:
public void readUserFile(String userData, Log logger)
{
    String[] data;
    String companyName;
    data = userData.split("\n");
    username = data[0];
    password = data[1].toCharArray();
    companyName = data[2];
    quota = Float.parseFloat(data[3]);
    company = new Company();
    company.readCompanyFile("C:\\Users\\Chris\\Documents\\NetBeansProjects\\ArFile\\ArFile Clients\\" + companyName + "\\"
        + companyName + ".cmp");
    cloudFiles = new CloudFiles();
    cloudFiles.readCloudFiles(this, logger);
}
It causes this error:
Exception in thread "AWT-EventQueue-1" java.lang.ArrayIndexOutOfBoundsException
You can use the readLine method of the BufferedReader class.
Wrap the InputStream in an InputStreamReader, and wrap that in a BufferedReader:
InputStream is = clientSocket.getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(is));
Please also check the encoding of the stream - you might need to specify the encoding in the constructor of InputStreamReader.
As stated in comments, using a BufferedReader would be best - you should be using an InputStreamReader anyway in order to convert from binary to text.
// Or use a different encoding - whatever's appropriate
BufferedReader reader = new BufferedReader(
    new InputStreamReader(clientSocket.getInputStream(), "UTF-8"));
try {
    String line;
    // I'm assuming you want to read every incoming line
    while ((line = reader.readLine()) != null) {
        processLine(line);
    }
} finally {
    reader.close();
}
Note that it's important to state which encoding you want to use - otherwise it'll use the platform's default encoding, which will vary from machine to machine, whereas presumably the data is in one specific encoding. If you don't know which encoding that is yet, you need to find out. Until then, you simply can't reliably understand the data.
(I hope your real code doesn't have an empty catch block, by the way.)
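To make that concrete, a small sketch showing why relying on the default is fragile and how the StandardCharsets constants (available since Java 7) make the choice explicit, with the bonus that they avoid the checked UnsupportedEncodingException entirely:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        // The platform default varies between machines and JVM configurations...
        System.out.println(Charset.defaultCharset());
        // ...so name the charset explicitly. A Charset argument (unlike a
        // String name) cannot throw UnsupportedEncodingException.
        String text = "h\u00e9llo"; // sample text with a non-ASCII character
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(bytes, StandardCharsets.UTF_8).equals(text));
    }
}
```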