Convert Windows-1252 to UTF-16 in Java

I am trying to convert all Windows special characters to their Unicode equivalent. We have a Flex application, where a user saves some Rich Text, and then it is emailed through a Java Emailer to their recipient. However, we keep running into Word's special characters that just show up in the email as a ?.
So far I've tried:
private String replaceWordChars(String text_in) {
    String s = text_in;
    // smart single quotes and apostrophe
    s = s.replaceAll("[\\u2018\\u2019\\u201A]", "'");
    // smart double quotes
    s = s.replaceAll("[\\u201C\\u201D\\u201E]", "\"");
    // ellipsis
    s = s.replaceAll("\\u2026", "...");
    // dashes
    s = s.replaceAll("[\\u2013\\u2014]", "-");
    // circumflex
    s = s.replaceAll("\\u02C6", "^");
    // open angle bracket
    s = s.replaceAll("\\u2039", "<");
    // close angle bracket
    s = s.replaceAll("\\u203A", ">");
    // small tilde and non-breaking space
    s = s.replaceAll("[\\u02DC\\u00A0]", " ");
    return s;
}
This works, but I don't want to hand-encode all Windows-1252 characters to their UTF-16 equivalents (Java strings are UTF-16 internally; the default charset only matters when converting to and from bytes).
However, our users keep finding more characters from Microsoft Word that Java just can't handle. So I searched and searched, and found this example:
private String replaceWordChars(String text_in) {
    String s = text_in;
    try {
        byte[] b = s.getBytes("Cp1252");
        // Note: decoding bytes with the same charset they were encoded with
        // reproduces the original string, so this round trip is a no-op,
        // which is why nothing changes in the debugger below.
        byte[] encoded = new String(b, "Cp1252").getBytes("UTF-16");
        s = new String(encoded, "UTF-16");
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
    return s;
}
But when I watch the encoding happen in the Eclipse debugger, nothing changes.
There has to be a simple solution to dealing with Microsoft's lovely encoding with Java.
Any thoughts?

You could try using java.nio.charset.Charset:
final Charset windowsCharset = Charset.forName("windows-1252");
final Charset utfCharset = Charset.forName("UTF-16");
final CharBuffer windowsEncoded = windowsCharset.decode(ByteBuffer.wrap(new byte[] {(byte) 0x91}));
// Caution: array() exposes the buffer's whole backing array, which can be
// longer than the encoded content; in real code copy only up to limit().
final byte[] utfEncoded = utfCharset.encode(windowsEncoded).array();
System.out.println(new String(utfEncoded, utfCharset.displayName()));
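If the goal is simply to turn Windows-1252 bytes into a Java String, you can skip the intermediate UTF-16 byte array entirely, since Java strings are already UTF-16 internally. A minimal sketch (the helper name is mine):
// Hypothetical helper: decode Windows-1252 bytes straight into a String.
// Java strings are UTF-16 internally, so no second conversion is needed.
static String fromWindows1252(byte[] bytes) {
    return new String(bytes, Charset.forName("windows-1252"));
}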

Use the following steps:
Create an InputStreamReader using the source file's encoding (Windows-1252)
Create an OutputStreamWriter using the destination file's encoding (UTF-16)
Copy the information read from the reader to the writer. You can use BufferedReader and BufferedWriter to write contents line-by-line.
So your code may look like this:
public void reencode(InputStream source, OutputStream dest,
        String sourceEncoding, String destEncoding)
        throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(source, sourceEncoding));
    BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(dest, destEncoding));
    String in;
    while ((in = reader.readLine()) != null) {
        writer.write(in);
        writer.newLine();
    }
    writer.flush(); // flush the buffer, or the last chunk may never reach dest
}
This, of course, excludes try/catch stuff and delegates it to the caller.
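For example, a caller might wire it up like this (the file names are hypothetical, and Java 7's try-with-resources takes care of closing):
try (InputStream in = new FileInputStream("input-cp1252.txt");
        OutputStream out = new FileOutputStream("output-utf16.txt")) {
    reencode(in, out, "windows-1252", "UTF-16");
}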
If you're just trying to get the contents as a string of sorts, you can replace the writer with StringWriter and return its toString value. Then you don't need a destination stream or encoding, just a place to dump characters:
public String decode(InputStream source, String sourceEncoding)
        throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(source, sourceEncoding));
    StringWriter writer = new StringWriter();
    String in;
    while ((in = reader.readLine()) != null) {
        writer.write(in);
        writer.write('\n'); // a plain \n should be fine here, but test just in case
    }
    return writer.toString();
}

What seems to work so far for everything I've tested is:
private String replaceWordChars(String text_in) {
    String s = text_in;
    final Charset windowsCharset = Charset.forName("windows-1252");
    final Charset utfCharset = Charset.forName("UTF-16");
    // Caution: getBytes() with no argument encodes with the platform default
    // charset, so this step assumes the default is windows-1252.
    byte[] incomingBytes = s.getBytes();
    final CharBuffer windowsEncoded =
            windowsCharset.decode(ByteBuffer.wrap(incomingBytes));
    final byte[] utfEncoded = utfCharset.encode(windowsEncoded).array();
    // Caution: this also decodes with the platform default charset, not UTF-16.
    s = new String(utfEncoded);
    return s;
}
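If you keep this approach, it may be safer to spell out every conversion instead of relying on the platform default charset. A sketch of the same logic with explicit charsets (the buffer handling copies only the valid region):
private String replaceWordChars(String text_in) {
    final Charset windowsCharset = Charset.forName("windows-1252");
    final Charset utfCharset = Charset.forName("UTF-16");
    // Explicit charsets: the result no longer depends on the platform default
    byte[] incomingBytes = text_in.getBytes(windowsCharset);
    CharBuffer decoded = windowsCharset.decode(ByteBuffer.wrap(incomingBytes));
    ByteBuffer utfBuffer = utfCharset.encode(decoded);
    // Copy only up to limit(): the backing array may be longer than the content
    return new String(utfBuffer.array(), 0, utfBuffer.limit(), utfCharset);
}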

Related

Java - Reading UTF8 bytes from File into String in a system independent way

How to read a UTF8 encoded file in Java into a String accurately?
When I change the encoding of this .java file to UTF-8 (Eclipse > Right-click on App.java > Properties > Resource > Text file encoding), it works fine from within Eclipse but not from the command line. It seems like Eclipse sets the file.encoding parameter when running App.
Why should the encoding of the source file have any impact on creating a String from bytes? What is the fool-proof way to create a String from bytes when the encoding is known?
I may have files with different encodings. Once the encoding of a file is known, I should be able to read it into a String regardless of the value of file.encoding, right?
The content of the UTF-8 file is below:
English Hello World.
Korean 안녕하세요.
Japanese 世界こんにちは。
Russian Привет мир.
German Hallo Welt.
Spanish Hola mundo.
Hindi हैलो वर्ल्ड।
Gujarati હેલો વર્લ્ડ.
Thai สวัสดีชาวโลก.
-end of file-
The code is below. My observations are in the comments within.
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class App {
    public static void main(String[] args) {
        String slash = System.getProperty("file.separator");
        File inputUtfFile = new File("C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text.txt");
        File outputUtfFile = new File("C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text_out.txt");
        File outputUtfByteWrittenFile = new File(
                "C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text_byteout.txt");
        outputUtfFile.delete();
        outputUtfByteWrittenFile.delete();
        try {
            /*
             * Read a UTF-8 text file with internationalized strings into bytes.
             * There should be no information loss here, when read into raw bytes.
             * We are sure that this file is UTF-8 encoded.
             * Input file created using Notepad++. Text copied from Google Translate.
             */
            byte[] fileBytes = readBytes(inputUtfFile);
            /*
             * Create a string from these bytes. Specify that the bytes are UTF-8 bytes.
             */
            String str = new String(fileBytes, StandardCharsets.UTF_8);
            /*
             * The console is incapable of displaying this string,
             * so we write into another file. Open it in Notepad++ to check.
             */
            ArrayList<String> list = new ArrayList<>();
            list.add(str);
            writeLines(list, outputUtfFile);
            /*
             * Works fine when I read bytes and write bytes.
             * Open the other output file in Notepad++ and check.
             */
            writeBytes(fileBytes, outputUtfByteWrittenFile);
            /*
             * I am using JDK 8u60.
             * I tried running this on the command line instead of Eclipse. Does not work.
             * I tried using the Apache Commons IO library. Does not work.
             *
             * This means that new String(bytes, charset) does not work correctly.
             * There is no real effect of specifying the charset to the string.
             */
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void writeLines(List<String> lines, File file) throws IOException {
        BufferedWriter writer = null;
        OutputStreamWriter osw = null;
        OutputStream fos = null;
        try {
            fos = new FileOutputStream(file);
            osw = new OutputStreamWriter(fos);
            writer = new BufferedWriter(osw);
            String lineSeparator = System.getProperty("line.separator");
            for (int i = 0; i < lines.size(); i++) {
                String line = lines.get(i);
                writer.write(line);
                if (i < lines.size() - 1) {
                    writer.write(lineSeparator);
                }
            }
        } catch (IOException e) {
            throw e;
        } finally {
            close(writer);
            close(osw);
            close(fos);
        }
    }

    public static byte[] readBytes(File file) {
        FileInputStream fis = null;
        byte[] b = null;
        try {
            fis = new FileInputStream(file);
            b = readBytesFromStream(fis);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            close(fis);
        }
        return b;
    }

    public static void writeBytes(byte[] inBytes, File file) {
        FileOutputStream fos = null;
        try {
            fos = new FileOutputStream(file);
            writeBytesToStream(inBytes, fos);
            fos.flush();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            close(fos);
        }
    }

    public static void close(InputStream inStream) {
        try {
            inStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        inStream = null;
    }

    public static void close(OutputStream outStream) {
        try {
            outStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        outStream = null;
    }

    public static void close(Writer writer) {
        if (writer != null) {
            try {
                writer.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
            writer = null;
        }
    }

    public static long copy(InputStream readStream, OutputStream writeStream) throws IOException {
        int bytesread = -1;
        byte[] b = new byte[4096]; // 4096 is the default cluster size in Windows for < 2TB NTFS partitions
        long count = 0;
        bytesread = readStream.read(b);
        while (bytesread != -1) {
            writeStream.write(b, 0, bytesread);
            count += bytesread;
            bytesread = readStream.read(b);
        }
        return count;
    }

    public static byte[] readBytesFromStream(InputStream readStream) throws IOException {
        ByteArrayOutputStream writeStream = null;
        byte[] byteArr = null;
        writeStream = new ByteArrayOutputStream();
        try {
            copy(readStream, writeStream);
            writeStream.flush();
            byteArr = writeStream.toByteArray();
        } finally {
            close(writeStream);
        }
        return byteArr;
    }

    public static void writeBytesToStream(byte[] inBytes, OutputStream writeStream) throws IOException {
        ByteArrayInputStream bis = null;
        bis = new ByteArrayInputStream(inBytes);
        try {
            copy(bis, writeStream);
        } finally {
            close(bis);
        }
    }
}
Edit: for @JB Nizet, and everyone :)
//writeLines(list, outputUtfFile, StandardCharsets.UTF_16BE); //does not work
//writeLines(list, outputUtfFile, Charset.defaultCharset()); //does not work.
writeLines(list, outputUtfFile, StandardCharsets.UTF_16LE); //works
I need to specify the encoding of the bytes when reading them into a String.
I need to specify the encoding when writing the bytes of a String to a file.
Once I have a String in the JVM, I do not need to remember the source byte encoding, am I right?
When I write to a file, it should convert the String using the default Charset of my machine (be it UTF8 or ASCII or cp1252). That is failing.
It fails for UTF-16BE too. Why does it fail for some Charsets?
The Java source file encoding is indeed irrelevant. And the reading part of your code is correct (although inefficient). What is incorrect is the writing part:
osw = new OutputStreamWriter(fos);
should be changed to
osw = new OutputStreamWriter(fos, StandardCharsets.UTF_8);
Otherwise, you use the default encoding (which doesn't seem to be UTF8 on your system) instead of using UTF8.
Note that Java allows using forward slashes in file paths, even on Windows. You could simply write
File inputUtfFile = new File("C:/sources/TestUtfRead/utf8text.txt");
EDIT:
Once I have a String in JVM, I do not need to remember the source byte encoding, am I right?
Yes, you're right.
When I write to file, it should convert the String into the default Charset of my machine (be it UTF8 or ASCII or cp1252). That is failing.
If you don't specify any encoding, Java will indeed use the platform default encoding to transform the characters into bytes. If you specify an encoding (as suggested in the beginning of this answer), then it uses the encoding you tell it to use.
But not all encodings can represent every Unicode character the way UTF-8 can. ASCII, for example, only supports 128 different characters, and Cp1252, AFAIK, only 256. So the encoding succeeds, but it substitutes a replacement character (typically ?) for each unencodable character, which means: I can't encode this Thai or Russian character because it's not part of my supported character set.
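A quick way to see that replacement in action (a small sketch):
byte[] b = "Привет".getBytes(StandardCharsets.US_ASCII);
System.out.println(new String(b, StandardCharsets.US_ASCII)); // prints ??????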
UTF16 encoding should be fine. But make sure to also configure your text editor to use UTF16 when reading and displaying the content of the file. If it's configured to use another encoding, the displayed content won't be correct.

Java fast stream copy with ISO-8859-1

I have the following code, which will read in files in ISO-8859-1, as that's what is required in this application:
private static String readFile(String filename) throws IOException {
    String lineSep = System.getProperty("line.separator");
    File f = new File(filename);
    StringBuffer sb = new StringBuffer();
    if (f.exists()) {
        BufferedReader br =
                new BufferedReader(
                        new InputStreamReader(
                                new FileInputStream(filename), "ISO-8859-1"));
        String nextLine = "";
        while ((nextLine = br.readLine()) != null) {
            sb.append(nextLine + " ");
            // note: BufferedReader strips the EOL character.
            // sb.append(lineSep);
        }
        br.close();
    }
    return sb.toString();
}
The problem is that it is pretty slow. I have this function, which is MUCH faster, but I cannot seem to find where to put the character encoding:
private static String fastStreamCopy(String filename)
{
    String s = "";
    FileChannel fc = null;
    try
    {
        fc = new FileInputStream(filename).getChannel();
        MappedByteBuffer byteBuffer = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        int size = byteBuffer.capacity();
        if (size > 0)
        {
            byteBuffer.clear();
            byte[] bytes = new byte[size];
            byteBuffer.get(bytes, 0, bytes.length);
            s = new String(bytes);
        }
        fc.close();
    }
    catch (FileNotFoundException fnfx)
    {
        System.out.println("File not found: " + fnfx);
    }
    catch (IOException iox)
    {
        System.out.println("I/O problems: " + iox);
    }
    finally
    {
        if (fc != null)
        {
            try
            {
                fc.close();
            }
            catch (IOException ignore)
            {
            }
        }
    }
    return s;
}
Any one have an idea of where i should be putting the ISO encoding?
From the code you posted, you're not trying to "copy" the stream, but read it into a string.
You can simply provide the encoding in the String constructor:
s = new String(bytes, "ISO-8859-1");
Personally I'd just replace the whole method with a call to the Guava method Files.toString():
String content = Files.toString(new File(filename), StandardCharsets.ISO_8859_1);
If you're using Java 6 or earlier, you'll need to use the Guava field Charsets.ISO_8859_1 instead of StandardCharsets.ISO_8859_1 (which was only introduced in Java 7).
However your use of the term "copy" suggests that you want to write the result to some other file (or stream). If that is true, then you don't need to care about the encoding at all, since you can just handle the byte[] directly and avoid the (unnecessary) conversion to and from String.
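If that is the case, a plain byte-for-byte copy might look like this (a sketch; no charset ever enters the picture):
static void copyBytes(InputStream in, OutputStream out) throws IOException {
    byte[] buffer = new byte[8192];
    int n;
    while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n); // raw bytes pass through undecoded
    }
}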
You put the encoding wherever you are converting bytes to a string, e.g. s = new String(bytes, encoding); or vice versa.

Changing encoding in Java

I am writing a function that should detect the charset in use and then switch it to UTF-8. I am using juniversalchardet, which is a Java port of Mozilla's universalchardet.
This is my code:
private List<List<String>> setProperEncoding(List<List<String>> input) {
    try {
        // Detect the charset in use
        UniversalDetector detector = new UniversalDetector(null);
        int position = 0;
        while ((position < input.size()) && (!detector.isDone())) {
            String row = ""; // must start empty, not null, or "null" gets prepended
            for (String cell : input.get(position)) {
                row += cell;
            }
            byte[] bytes = row.getBytes(); // note: encodes with the platform default charset
            detector.handleData(bytes, 0, bytes.length);
            position++;
        }
        detector.dataEnd();
        // note: getDetectedCharset() returns null when detection fails,
        // and Charset.forName(null) throws
        Charset charset = Charset.forName(detector.getDetectedCharset());
        Charset utf8 = Charset.forName("UTF-8");
        System.out.println("Detected charset: " + charset);
        // rewrite input using the detected charset
        List<List<String>> newLines = new ArrayList<List<String>>();
        for (List<String> row : input) {
            List<String> newRow = new ArrayList<String>();
            for (String cell : row) {
                //newRow.add(new String(cell.getBytes(charset)));
                ByteBuffer bb = ByteBuffer.wrap(cell.getBytes(charset));
                CharBuffer cb = charset.decode(bb);
                bb = utf8.encode(cb);
                newRow.add(new String(bb.array()));
            }
            newLines.add(newRow);
        }
        return newLines;
    } catch (Exception e) {
        e.printStackTrace();
        return input;
    }
}
My problem is that when I read a file with characters from, for example, the Polish alphabet, letters like ł, ą, ć and similar are replaced by ? and other strange things. What am I doing wrong?
EDIT:
For compilation I am using Eclipse.
The method parameter is the result of reading a MultipartFile: I just use a FileInputStream to get every line and then split each line by some separator (it is prepared for xls, xlsx and csv files). Nothing special there.
First of all, you have your data somewhere in a binary format. For the sake of simplicity, I suppose it comes from an InputStream.
You want to write the output as a UTF-8 String; I suppose it goes to an OutputStream.
I would recommend creating an AutoDetectInputStream:
public class AutoDetectInputStream extends InputStream {
    private InputStream is;
    private byte[] sampleData = new byte[4096];
    private int sampleLen;
    private int sampleIndex = 0;

    public AutoDetectInputStream(InputStream is) throws IOException {
        this.is = is;
        // pre-read the sample data
        sampleLen = is.read(sampleData);
    }

    public Charset getCharset() {
        // detect the charset from the sample
        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(sampleData, 0, sampleLen);
        detector.dataEnd();
        // getDetectedCharset() returns a charset name (or null), so resolve it
        return Charset.forName(detector.getDetectedCharset());
    }

    @Override
    public int read() throws IOException {
        // replay the sample first, then continue with the underlying stream
        if (sampleIndex < sampleLen) {
            return sampleData[sampleIndex++] & 0xFF; // mask so bytes >= 0x80 aren't negative
        }
        return is.read();
    }
}
The second task is quite simple, because Java handles strings as characters internally (UTF-16 chars, independent of any byte encoding), so just use a simple OutputStreamWriter configured for UTF-8. So, here's your code:
// open input with the detector stream
// we use BufferedReader so we can read lines
InputStream is = new FileInputStream("in.txt");
AutoDetectInputStream detector = new AutoDetectInputStream(is);
Charset charset = detector.getCharset();
// here we can use the charset to decode the bytes into characters
BufferedReader rdr = new BufferedReader(new InputStreamReader(detector, charset));
// open output to write to
OutputStream os = new FileOutputStream("out.txt");
Writer utf8Writer = new OutputStreamWriter(os, Charset.forName("UTF-8"));
// copy the whole file
String line;
while ((line = rdr.readLine()) != null) {
    utf8Writer.append(line);
    utf8Writer.append('\n'); // readLine() strips line terminators, so add one back
}
// close streams
rdr.close();
utf8Writer.flush();
utf8Writer.close();
So, finally, you have the whole text file transcoded to UTF-8.
Note that the 4 KB sample buffer must be big enough for UniversalDetector to reach a confident guess.
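If the fixed sample ever proves too small, a variant can keep feeding the detector until it is confident (a sketch using the same juniversalchardet API as the code above):
UniversalDetector detector = new UniversalDetector(null);
byte[] buf = new byte[4096];
int nread;
while ((nread = is.read(buf)) > 0 && !detector.isDone()) {
    detector.handleData(buf, 0, nread);
}
detector.dataEnd();
String detectedName = detector.getDetectedCharset(); // may be null if nothing was detected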

Splitting strings by newline trouble

I am reading in a file that is being sent through a socket and then trying to split it via newlines (\n). When I read in the file I am using a byte[], and I convert the byte array to a String so that I can split it.
public String getUserFileData()
{
    try
    {
        byte[] mybytearray = new byte[1024];
        InputStream is = clientSocket.getInputStream();
        int bytesRead = is.read(mybytearray, 0, mybytearray.length);
        is.close();
        return new String(mybytearray);
    }
    catch(IOException e)
    {
    }
    return "";
}
Here is the code used to attempting to split the String
public void readUserFile(String userData, Log logger)
{
    String[] data;
    String companyName;
    data = userData.split("\n");
    username = data[0];
    password = data[1].toCharArray();
    companyName = data[2];
    quota = Float.parseFloat(data[3]);
    company = new Company();
    company.readCompanyFile("C:\\Users\\Chris\\Documents\\NetBeansProjects\\ArFile\\ArFile Clients\\" + companyName + "\\"
            + companyName + ".cmp");
    cloudFiles = new CloudFiles();
    cloudFiles.readCloudFiles(this, logger);
}
It causes this error
Exception in thread "AWT-EventQueue-1" java.lang.ArrayIndexOutOfBoundsException
You can use the readLine method of the BufferedReader class.
Wrap the InputStream in an InputStreamReader, and wrap that in a BufferedReader:
InputStream is = clientSocket.getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(is));
Please also check the encoding of the stream - you might need to specify the encoding in the constructor of InputStreamReader.
As stated in comments, using a BufferedReader would be best - you should be using an InputStreamReader anyway in order to convert from binary to text.
// Or use a different encoding - whatever's appropriate
BufferedReader reader = new BufferedReader(
        new InputStreamReader(clientSocket.getInputStream(), "UTF-8"));
try {
    String line;
    // I'm assuming you want to read every incoming line
    while ((line = reader.readLine()) != null) {
        processLine(line);
    }
} finally {
    reader.close();
}
Note that it's important to state which encoding you want to use - otherwise it'll use the platform's default encoding, which will vary from machine to machine, whereas presumably the data is in one specific encoding. If you don't know which encoding that is yet, you need to find out. Until then, you simply can't reliably understand the data.
(I hope your real code doesn't have an empty catch block, by the way.)

Convert InputStream to String with encoding given in stream data

My input is an InputStream which contains an XML document. The encoding used in the XML is not known in advance; it is declared in the first line of the XML document.
From this InputStream, I want to read the whole document into a String.
To do this, I use a BufferedInputStream to mark the beginning of the file and start reading the first line. I read this first line to get the encoding and then use an InputStreamReader to generate a String with the correct encoding.
It seems that this is not the best way to achieve this goal, because it produces an OutOfMemoryError.
Any idea how to do it?
public static String streamToString(final InputStream is) {
    String result = null;
    if (is != null) {
        BufferedInputStream bis = new BufferedInputStream(is);
        bis.mark(Integer.MAX_VALUE);
        final StringBuilder stringBuilder = new StringBuilder();
        try {
            // stream reader that handles the encoding
            final InputStreamReader readerForEncoding = new InputStreamReader(bis, "UTF-8");
            final BufferedReader bufferedReaderForEncoding = new BufferedReader(readerForEncoding);
            String encoding = extractEncodingFromStream(bufferedReaderForEncoding);
            if (encoding == null) {
                encoding = DEFAULT_ENCODING;
            }
            // stream reader that handles the encoding
            bis.reset();
            final InputStreamReader readerForContent = new InputStreamReader(bis, encoding);
            final BufferedReader bufferedReaderForContent = new BufferedReader(readerForContent);
            String line = bufferedReaderForContent.readLine();
            while (line != null) {
                stringBuilder.append(line);
                line = bufferedReaderForContent.readLine();
            }
            bufferedReaderForContent.close();
            bufferedReaderForEncoding.close();
        } catch (IOException e) {
            // reset string builder
            stringBuilder.delete(0, stringBuilder.length());
        }
        result = stringBuilder.toString();
    } else {
        result = null;
    }
    return result;
}
The call to mark(Integer.MAX_VALUE) is causing the OutOfMemoryError, since it's trying to allocate 2GB of memory.
You can solve this by using an iterative approach. Set the mark readLimit to a reasonable value, say 8K. In 99% of cases this will work, but in pathological cases, e.g. 16K of spaces between the attributes in the declaration, you will need to try again. Thus, have a loop that tries to find the encoding, and if it doesn't find it within the given mark region, tries again, doubling the requested mark readLimit size.
To be sure you don't advance the input stream past the mark limit, you should read the InputStream yourself, up to the mark limit, into a byte array. You then wrap the byte array in a ByteArrayInputStream and pass that to the constructor of the InputStreamReader assigned to 'readerForEncoding'.
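A sketch of that loop (extractEncodingFromStream is the helper from the question; the method name and the 8K starting value here are illustrative):
private static final int INITIAL_READ_LIMIT = 8 * 1024;

static String detectEncoding(BufferedInputStream bis) throws IOException {
    int readLimit = INITIAL_READ_LIMIT;
    while (true) {
        bis.mark(readLimit);
        byte[] head = new byte[readLimit];
        int n = 0, r;
        while (n < readLimit && (r = bis.read(head, n, readLimit - n)) != -1) {
            n += r;
        }
        bis.reset(); // we never read past the mark, so this is safe
        String encoding = extractEncodingFromStream(new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(head, 0, n), "UTF-8")));
        if (encoding != null || n < readLimit) {
            return encoding; // found it, or the stream ended inside the window
        }
        readLimit *= 2; // declaration not found yet; retry with a doubled window
    }
}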
You can use this method to convert an InputStream to a String; it might help you:
private String convertStreamToString(InputStream input) throws Exception {
    // note: this uses the platform default charset and ignores the
    // encoding declared inside the stream
    BufferedReader reader = new BufferedReader(new InputStreamReader(input));
    StringBuilder sb = new StringBuilder();
    String line = null;
    while ((line = reader.readLine()) != null) {
        sb.append(line);
    }
    input.close();
    return sb.toString();
}
