Understanding encoding in character streams - java

I'm trying to understand how encodings are applied by character streams in Java. For the discussion let's use the following code example:
public static void main(String[] args) throws Exception {
    byte[] utf8Input = new byte[] { (byte) 0xc3, (byte) 0xb6 }; // 'ö' in UTF-8
    ByteArrayOutputStream utf160Out = new ByteArrayOutputStream();
    InputStreamReader is = new InputStreamReader(new ByteArrayInputStream(utf8Input), StandardCharsets.UTF_8);
    OutputStreamWriter os = new OutputStreamWriter(utf160Out, StandardCharsets.UTF_16);
    int len;
    while ((len = is.read()) != -1) {
        os.write(len);
    }
    os.close();
}
The program reads the UTF-8 encoded character 'ö' from the byte array utf8Input and writes it UTF-16 encoded to utf160Out. In particular, the ByteArrayInputStream on utf8Input just streams the bytes 'as-is', and the InputStreamReader then decodes that input with a UTF-8 decoder. Dumping the value of the len variable yields 0xF6, which is the Unicode code point for 'ö'. The OutputStreamWriter writes using the UTF-16 encoding without having any knowledge of the input encoding.
How does the OutputStreamWriter know the input encoding (here: UTF-8)? Is there an internal representation that the InputStreamReader also maps to? So basically, are we saying: read this input, it is UTF-8 encoded, and decode it to our internal encoding X; the OutputStreamWriter is given the target encoding and expects its input to be encoded with X? Is this correct? If so, what is the internal encoding? UTF-16, as mentioned in "What is Java's internal representation for String? Modified UTF-8? UTF-16?"?

The read() method returns a Java char value, which is an unsigned 2-byte binary number (0-65535).
The actual return type is int (a signed 4-byte binary number) to allow for the special value -1, meaning end-of-stream.
A Java char is a UTF-16 code unit. This means that all characters from the Basic Multilingual Plane appear unencoded, i.e. the char value is the Unicode code point value.
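To make that concrete, here is a minimal sketch based on the question's code (class name and comments are mine): the Reader decodes UTF-8 bytes into char values, and the Writer only ever sees those chars, which is why it needs no knowledge of the input encoding.

import java.io.*;
import java.nio.charset.StandardCharsets;

public class CharPivotDemo {
    public static void main(String[] args) throws Exception {
        byte[] utf8Input = new byte[] { (byte) 0xc3, (byte) 0xb6 }; // 'ö' in UTF-8
        ByteArrayOutputStream utf16Out = new ByteArrayOutputStream();
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(utf8Input), StandardCharsets.UTF_8);
             Writer out = new OutputStreamWriter(utf16Out, StandardCharsets.UTF_16)) {
            int c;
            while ((c = in.read()) != -1) {
                System.out.printf("decoded char: U+%04X%n", c); // U+00F6
                out.write(c); // the writer encodes this char value, not the original bytes
            }
        }
        for (byte b : utf16Out.toByteArray()) {
            System.out.printf("%02X ", b); // typically FE FF 00 F6 (BOM + UTF-16 code unit for 'ö')
        }
    }
}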

Related

Is InputStream same as InputStreamReader when a character has 8 bits?

I was reading about InputStream and InputStreamReader.
Most of the people said that InputStream is for bytes and InputStreamReader is for text.
So I created a simple file with only one character, which was 'a'.
When I used an InputStream to read the file and converted the byte to a char, it printed the letter 'a'. And when I did the same thing with an InputStreamReader, it also gave me the same result.
So where was the difference? I thought InputStream would not be able to give the letter 'a'.
Does this mean that when a character has 8 bits, there will be no difference between InputStream and InputStreamReader? Is it true that there will be only a difference between them when a character has more than one byte?
No, InputStream and InputStreamReader are not the same, even for 8-bit characters.
Look at InputStream's read() method without parameters. It returns an int, but according to the documentation a byte (range 0 to 255) is returned, or -1 for EOF. The other read methods work with arrays of bytes.
InputStreamReader inherits from Reader. Reader's read() method without parameters also returns an int, but here the int value (range 0 to 65535) is interpreted as a character, or -1 for EOF. The other read methods work with arrays of char directly.
The difference is the encoding. InputStreamReader's constructors either take an explicit encoding or use the platform's default encoding. The encoding is the translation between bytes and characters.
You said: "When I used an InputStream to read the file and converted the byte to a char, it printed the letter 'a'." So you read the byte and converted it to a character manually. This conversion is exactly what InputStreamReader does for you, using an encoding for the translation.
Even for one-byte character sets there are differences. Take your example: the letter 'a' has the hex value 61 in the Windows ANSI encoding (named "Cp1252" in Java), but in the IBM-Thai encoding the byte 0x61 is interpreted as '/'.
So the people were right: InputStream is for binary data, and on top of it there is InputStreamReader, which is for text, translating between binary data and text according to an encoding.
Here is a simple example:
import java.io.*;

public class EncodingExample {

    public static void main(String[] args) throws Exception {
        // Prepare the byte buffer for the character 'a' in Windows-ANSI
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        final PrintWriter writer = new PrintWriter(new OutputStreamWriter(baos, "Cp1252"));
        writer.print('a');
        writer.flush();
        final byte[] buffer = baos.toByteArray();

        readAsBytes(new ByteArrayInputStream(buffer));
        readWithEncoding(new ByteArrayInputStream(buffer), "Cp1252");
        readWithEncoding(new ByteArrayInputStream(buffer), "IBM-Thai");
    }

    /**
     * Reads and displays the InputStream's bytes as hexadecimal.
     * @param in The input stream
     * @throws Exception
     */
    private static void readAsBytes(InputStream in) throws Exception {
        int c;
        while ((c = in.read()) != -1) {
            final byte b = (byte) c;
            System.out.println(String.format("Hex: %x ", b));
        }
    }

    /**
     * Reads the InputStream with an InputStreamReader and the given encoding.
     * Prints the resulting text to the console.
     * @param in The input stream
     * @param encoding The encoding
     * @throws Exception
     */
    private static void readWithEncoding(InputStream in, String encoding) throws Exception {
        Reader reader = new InputStreamReader(in, encoding);
        int c;
        final StringBuilder sb = new StringBuilder();
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        System.out.println(String.format("Interpreted with encoding '%s': %s", encoding, sb.toString()));
    }
}
The output is:
Hex: 61
Interpreted with encoding 'Cp1252': a
Interpreted with encoding 'IBM-Thai': /

Java Charset InputStreamReader, File Channel Differences

I'm trying to read a (Japanese) file that is encoded as a UTF-16 file.
When I read it using an InputStreamReader with a charset of "UTF-16" the file is read correctly:
try {
    InputStreamReader read = new InputStreamReader(new FileInputStream("JapanTest.txt"), "UTF-16");
    BufferedReader in = new BufferedReader(read);
    String str;
    while ((str = in.readLine()) != null) {
        System.out.println(str);
    }
    in.close();
} catch (Exception e) {
    System.out.println(e);
}
However, when I use File Channels and read from a byte array the Strings aren't always converted correctly:
File f = new File("JapanTest.txt");
FileInputStream fis = new FileInputStream(f);
FileChannel channel = fis.getChannel();
MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0L, channel.size());
buffer.position(0);
int get = Math.min(buffer.remaining(), 1024);
byte[] barray = new byte[1024];
buffer.get(barray, 0, get);
Charset charSet = Charset.forName("UTF-16");
// endOfLinePos is a calculated value and defines the number of bytes to read
String rowString = new String(barray, 0, endOfLinePos, charSet);
System.out.println(rowString);
The problem I've found is that I can only read characters correctly if the MappedByteBuffer is at position 0. If I increment the position of the MappedByteBuffer and then read a number of bytes into a byte array, which is then converted to a string using the charset UTF-16, the bytes are not converted correctly. I haven't faced this issue with files encoded in UTF-8, so is this only an issue with UTF-16?
More Details:
I need to be able to read any line from the file channel, so to do this I build a list of line ending byte positions and then use those positions to be able to get the bytes for any given line and then convert them to a string.
The code unit of UTF-16 is 2 bytes, not a single byte as in UTF-8. The byte patterns and the one-byte code unit make UTF-8 self-synchronizing: a decoder can start reading at any point, and if it lands on a continuation byte it can either backtrack or lose at most a single character.
With UTF-16 you must always work with pairs of bytes: you cannot start or stop reading at an odd byte offset. You also must know the endianness, and use either UTF-16LE or UTF-16BE when not reading from the start of the file, because there will be no BOM.
You can also encode the file as UTF-8.
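To illustrate (a small sketch, not from the original post): decoding a slice of UTF-16 data only works when the slice starts at an even byte offset and the byte order is given explicitly, since a mid-file slice carries no BOM.

import java.nio.charset.StandardCharsets;

public class Utf16SliceDemo {
    public static void main(String[] args) {
        // "abc" encoded as UTF-16LE: 61 00 62 00 63 00
        byte[] data = "abc".getBytes(StandardCharsets.UTF_16LE);

        // A slice starting at an even offset decodes correctly...
        System.out.println(new String(data, 2, 4, StandardCharsets.UTF_16LE)); // "bc"

        // ...while a slice starting at an odd offset pairs the bytes wrongly
        System.out.println(new String(data, 1, 4, StandardCharsets.UTF_16LE)); // two unrelated characters
    }
}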
Possibly, the InputStreamReader does some transformations the normal new String(...) does not. As a work-around (and to verify this assumption) you could try to wrap the data read from the channel like new InputStreamReader( new ByteArrayInputStream( barray ) ).
Edit: Forget that :) - Channels.newReader() would be the way to go.
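For reference, a sketch of that approach (it assumes the same "JapanTest.txt" file as above): Channels.newReader() puts a Reader on top of the channel, so the charset decoding behaves as it does with InputStreamReader.

import java.io.*;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ChannelReaderDemo {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get("JapanTest.txt"), StandardOpenOption.READ);
             BufferedReader in = new BufferedReader(Channels.newReader(channel, "UTF-16"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}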

Corrupt Gzip string due to character encoding

I have some corrupted gzip log files that I'm trying to restore. The files were transferred to our servers through a Java-backed web page. The files have always been sent as plain text, but we recently started to receive log files that were gzipped. These gzipped files appear to be corrupted and cannot be unzipped, and the originals have been deleted. I believe this is caused by the character encoding in the method below.
Is there any way to reverse the process and restore the files to their original zipped format? I have the resulting String's binary data in a database BLOB.
Thanks for any help you can give!
private String convertStreamToString(InputStream is) throws IOException {
    /*
     * To convert the InputStream to a String we use the
     * Reader.read(char[] buffer) method. We iterate until the
     * Reader returns -1, which means there's no more data to
     * read. We use the StringWriter class to produce the string.
     */
    if (is != null) {
        Writer writer = new StringWriter();
        char[] buffer = new char[1024];
        try {
            Reader reader = new BufferedReader(
                    new InputStreamReader(is, "UTF-8"));
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        } finally {
            is.close();
        }
        return writer.toString();
    } else {
        return "";
    }
}
If this is the method that was used to convert the InputStream to a String, then your data is almost certainly lost.
The problem is that UTF-8 has quite a few byte sequences that are simply not legal (i.e. they don't represent any value). These sequences will be replaced with the Unicode replacement character.
That character is the same no matter which invalid byte sequence was decoded. Therefore the specific information in those bytes is lost.
If that's the code you have, you should never have converted to a Reader (or, in fact, to a String); only preserving the data as a stream (or byte array) would have avoided corrupting the binary files. Once it has been read into the string, illegal sequences (and there are many in UTF-8) WILL be discarded.
So no, unless you are quite lucky, there is no way to recover the information. You'll have to provide another process where you handle the raw stream and insert it as a pure BLOB, not a CLOB.
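As a sketch of such a process (the helper name is mine): copy the raw bytes straight through, with no Reader and no String in between, and store that byte array as the BLOB.

import java.io.*;

public class RawCopy {
    /** Reads the stream to the end and returns the raw bytes, with no charset decoding involved. */
    static byte[] toByteArray(InputStream is) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = is.read(buffer)) != -1) {
            out.write(buffer, 0, n); // bytes are copied as-is, so gzip data stays intact
        }
        return out.toByteArray();
    }
}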

Why does US-ASCII encoding accept non US-ASCII characters?

Consider the following code:
public class ReadingTest {

    public void readAndPrint(String usingEncoding) throws Exception {
        ByteArrayInputStream bais = new ByteArrayInputStream(
                new byte[] { (byte) 0xC2, (byte) 0xB5 }); // UTF-8 representation of the 'micro' sign
        InputStreamReader isr = new InputStreamReader(bais, usingEncoding);
        char[] cbuf = new char[2];
        isr.read(cbuf);
        System.out.println(cbuf[0] + " " + (int) cbuf[0]);
    }

    public static void main(String[] argv) throws Exception {
        ReadingTest w = new ReadingTest();
        w.readAndPrint("UTF-8");
        w.readAndPrint("US-ASCII");
    }
}
Observed output:
µ 181
? 65533
Why does the second call of readAndPrint() (the one using US-ASCII) succeed? I would expect it to throw an error, since the input is not a proper character in this encoding. What is the place in the Java API or JLS which mandates this behavior?
The default behavior when undecodable bytes are found in the input stream is to replace them with the Unicode character U+FFFD REPLACEMENT CHARACTER.
If you want to change that, you can pass a CharsetDecoder to the InputStreamReader which has a different CodingErrorAction configured:
CharsetDecoder decoder = Charset.forName(usingEncoding).newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPORT);
InputStreamReader isr = new InputStreamReader(bais, decoder);
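For illustration, a runnable sketch using the bytes from the question (class name is mine): with CodingErrorAction.REPORT configured, the read fails instead of silently substituting U+FFFD.

import java.io.*;
import java.nio.charset.*;

public class ReportDemo {
    public static void main(String[] args) throws Exception {
        byte[] input = { (byte) 0xC2, (byte) 0xB5 }; // not valid US-ASCII
        CharsetDecoder decoder = StandardCharsets.US_ASCII.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try (Reader isr = new InputStreamReader(new ByteArrayInputStream(input), decoder)) {
            isr.read(); // throws java.nio.charset.MalformedInputException
        }
    }
}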
I'd say this is the same as for the constructor
String(byte bytes[], int offset, int length, Charset charset):
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string. The java.nio.charset.CharsetDecoder class should be used when more control over the decoding process is required.
Using CharsetDecoder you can specify a different CodingErrorAction.
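A short sketch of that difference (again with the question's bytes; class name is mine): the String constructor silently substitutes, while a freshly created CharsetDecoder defaults to REPORT and throws.

import java.nio.ByteBuffer;
import java.nio.charset.*;

public class DecodeComparison {
    public static void main(String[] args) throws Exception {
        byte[] input = { (byte) 0xC2, (byte) 0xB5 };

        // Constructor: malformed bytes become U+FFFD replacement characters
        System.out.println(new String(input, StandardCharsets.US_ASCII));

        // CharsetDecoder: the default error action is REPORT, so this throws MalformedInputException
        CharsetDecoder decoder = StandardCharsets.US_ASCII.newDecoder();
        decoder.decode(ByteBuffer.wrap(input));
    }
}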

Java InputStream encoding/charset

Running the following (example) code
import java.io.*;

public class test {
    public static void main(String[] args) throws Exception {
        byte[] buf = { -27 };
        InputStream is = new ByteArrayInputStream(buf);
        BufferedReader r = new BufferedReader(
                new InputStreamReader(is, "ISO-8859-1"));
        String s = r.readLine();
        System.out.println("test.java:9 [byte] (char)" + (char) s.getBytes()[0] +
                " (int)" + (int) s.getBytes()[0]);
        System.out.println("test.java:10 [char] (char)" + (char) s.charAt(0) +
                " (int)" + (int) s.charAt(0));
        System.out.println("test.java:11 string below");
        System.out.println(s);
        System.out.println("test.java:13 string above");
    }
}
gives me this output
test.java:9 [byte] (char)? (int)63
test.java:10 [char] (char)? (int)229
test.java:11 string below
?
test.java:13 string above
How do I retain the correct byte value (-27) in the line-9 printout, and consequently get the expected output (å) from the System.out.println(s) call?
If you want to retain byte values, don't use a Reader at all, ideally. To represent arbitrary binary data in text and convert it back to binary data later, you should use base16 or base64 encoding.
However, to explain what's going on, when you call s.getBytes() that's using the default character encoding, which apparently doesn't include Unicode character U+00E5.
If you call s.getBytes("ISO-8859-1") everywhere instead of s.getBytes() I suspect you'll get back the right byte value... but relying on ISO-8859-1 for this is kinda dirty IMO.
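A quick sketch of that (class name is mine): decoding and re-encoding with ISO-8859-1 round-trips the original byte value.

import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) {
        byte[] buf = { -27 }; // 0xE5, 'å' in ISO-8859-1
        String s = new String(buf, StandardCharsets.ISO_8859_1);
        byte[] back = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println((int) s.charAt(0)); // 229, i.e. U+00E5
        System.out.println(back[0]);           // -27, the original byte is retained
    }
}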
As noted, getBytes() with no arguments uses the Java platform's default encoding, which may not be ISO-8859-1. Simply printing the string should work, provided your terminal and the default encoding match and support the character. For instance, on my system the terminal and the default Java encoding are both UTF-8. The fact that you're seeing a '?' indicates that yours don't match or that å is not supported.
If you want to manually encode to UTF-8 on your system, do:
String s = r.readLine();
byte[] utf8Bytes = s.getBytes("UTF-8");
It should give a byte array with {-61, -91}.
