I'm trying to read a (Japanese) file that is encoded as a UTF-16 file.
When I read it using an InputStreamReader with a charset of "UTF-16", the file is read correctly:
try {
    InputStreamReader read = new InputStreamReader(new FileInputStream("JapanTest.txt"), "UTF-16");
    BufferedReader in = new BufferedReader(read);
    String str;
    while ((str = in.readLine()) != null) {
        System.out.println(str);
    }
    in.close();
} catch (Exception e) {
    System.out.println(e);
}
However, when I use File Channels and read from a byte array the Strings aren't always converted correctly:
File f = new File("JapanTest.txt");
FileInputStream fis = new FileInputStream(f);
FileChannel channel = fis.getChannel();
MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0L, channel.size());
buffer.position(0);
int get = Math.min(buffer.remaining(), 1024);
byte[] barray = new byte[1024];
buffer.get(barray, 0, get);
Charset charSet = Charset.forName("UTF-16");
// endOfLinePos is a calculated value and defines the number of bytes to read
String rowString = new String(barray, 0, endOfLinePos, charSet);
System.out.println(rowString);
The problem I've found is that I can only read characters correctly if the MappedByteBuffer is at position 0. If I increment the position of the MappedByteBuffer and then read a number of bytes into a byte array, which is then converted to a string using the charset UTF-16, then the bytes are not converted correctly. I haven't faced this issue if a file is encoded in UTF-8, so is this only an issue with UTF-16?
More Details:
I need to be able to read any line from the file channel, so to do this I build a list of line ending byte positions and then use those positions to be able to get the bytes for any given line and then convert them to a string.
The code unit of UTF-16 is 2 bytes, not 1 byte as in UTF-8. The byte patterns and the single-byte code unit make UTF-8 self-synchronizing: you can start reading at any position, and if you land on a continuation byte you can either backtrack or lose at most one character.
With UTF-16 you must always work in pairs of bytes: you cannot start or stop reading at an odd byte offset. You also must know the endianness and use either UTF-16LE or UTF-16BE when you are not reading from the start of the file, because there will be no BOM.
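For example, a minimal sketch along those lines (readLineAt, lineStart and lineLength are hypothetical names; it assumes a big-endian file, so use UTF_16LE instead if yours is little-endian, and that the offsets come from your own index of line-ending positions):
import java.nio.MappedByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads one line, given a precomputed byte offset and
// length. Both values must be even, and the offset must skip the 2-byte BOM
// at the start of the file if one is present.
static String readLineAt(MappedByteBuffer buffer, int lineStart, int lineLength) {
    byte[] lineBytes = new byte[lineLength];
    buffer.position(lineStart);             // even offsets only
    buffer.get(lineBytes, 0, lineLength);   // even lengths only
    // Explicit byte order: plain "UTF-16" only recognises a BOM at offset 0.
    return new String(lineBytes, StandardCharsets.UTF_16BE);
}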
You can also encode the file as UTF-8.
Possibly the InputStreamReader does some transformation that a plain new String(...) does not. As a workaround (and to verify this assumption) you could try wrapping the data read from the channel, e.g. new InputStreamReader(new ByteArrayInputStream(barray)).
Edit: Forget that :) - Channels.newReader() would be the way to go.
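For instance, a rough sketch with Channels.newReader() on the same JapanTest.txt could look like this; it simply decodes the whole file, and any position you set on the channel beforehand would have to fall on a code-unit boundary:
import java.io.BufferedReader;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Let Channels.newReader() do the UTF-16 decoding straight off the channel.
try (FileChannel channel = FileChannel.open(Paths.get("JapanTest.txt"), StandardOpenOption.READ);
     BufferedReader in = new BufferedReader(Channels.newReader(channel, "UTF-16"))) {
    String str;
    while ((str = in.readLine()) != null) {
        System.out.println(str);
    }
}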
Related
I'm trying to understand how encodings are applied by character streams in Java. For the discussion let's use the following code example:
public static void main(String[] args) throws Exception {
    byte[] utf8Input = new byte[] { (byte) 0xc3, (byte) 0xb6 }; // 'ö'
    ByteArrayOutputStream utf160Out = new ByteArrayOutputStream();
    InputStreamReader is = new InputStreamReader(new ByteArrayInputStream(utf8Input), StandardCharsets.UTF_8);
    OutputStreamWriter os = new OutputStreamWriter(utf160Out, StandardCharsets.UTF_16);
    int len;
    while ((len = is.read()) != -1) {
        os.write(len);
    }
    os.close();
}
The program reads the UTF-8 encoded character 'ö' from the byte array utf8Input and writes it UTF-16 encoded to utf160Out. In particular, the ByteArrayInputStream on utf8Input just streams the bytes as-is, and the InputStreamReader then decodes that input with a UTF-8 decoder. Printing the value of the len variable yields 0xf6, which is the Unicode code point for 'ö'. The OutputStreamWriter writes using UTF-16 encoding without having any knowledge of the input encoding.
How does the OutputStreamWriter know the input encoding (here: UTF-8)? Is there an internal representation that is assumed which is also mapped to by an InputStreamReader? So basically, we are saying then: Read this input, it is UTF-8 encoded and decode it to our internal encoding X. An OutputStreamWriter is given the target encoding and expects the input to be encoded with X. Is this correct? If so, what is the internal encoding? UTF-16 as mentioned in What is the Java's internal represention for String? Modified UTF-8? UTF-16??
The read() method returns a Java char value, which is an unsigned 16-bit number (0-65535).
The actual return type is int (a signed 32-bit number) to allow for the special value -1 meaning end-of-stream.
A Java char is a UTF-16 code unit. This means that all characters from the Basic Multilingual Plane appear unencoded, i.e. the char value is the Unicode code point.
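A quick illustration with the 'ö' bytes from the question (just a sanity check, not part of the original code):
import java.nio.charset.StandardCharsets;

byte[] utf8Input = new byte[] { (byte) 0xc3, (byte) 0xb6 };      // 'ö' in UTF-8
String decoded = new String(utf8Input, StandardCharsets.UTF_8);  // decodes to a single char
System.out.println((int) decoded.charAt(0) == 0xF6);             // true: char value == code point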
I have this example. It reads a line "hello" from a file saved as UTF-8. Here is my question:
Strings are stored in Java in UTF-16 format. So when it reads the line "hello" it converts it to UTF-16. So string s is UTF-16 with a UTF-16 BOM... am I right?
FileReader filereader = new FileReader(file);
BufferedReader read = new BufferedReader(filereader);
String s = null;
while ((s = read.readLine()) != null)
{
    System.out.println(s);
}
So when I do this:
s= s.replace("\uFEFF","A");
nothing happens. Should the above find and replace the UTF-16 BOM? Or is it ultimately in UTF-8 format? I'm a little bit confused about this.
Thank you
Try using the Apache Commons library and the class org.apache.commons.io.input.BOMInputStream to get rid of this kind of problem.
Example:
String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(file);
try
{
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    // your code...
}
finally
{
    inputStream.close();
}
As for the BOM itself, as @seand said, it's just metadata used when reading/writing/storing strings. It's present in the encoded bytes, but you cannot replace or modify it unless you work at the binary level or re-encode the strings.
Let's make a few examples:
String str = "Hadoop";
byte bt1[] = str.getBytes();
System.out.println(bt1.length); // 6
byte bt2a[] = str.getBytes("UTF-16");
System.out.println(bt2a.length); // 14
byte bt2b[] = str.getBytes("UTF-16BE");
System.out.println(bt2b.length); // 12
byte bt3[] = str.getBytes("UTF-16LE");
System.out.println(bt3.length); // 12
In the UTF-16 version you get 14 bytes because a BOM is inserted to identify the byte order (Java's UTF-16 encoder defaults to big-endian). If you specify UTF-16BE or UTF-16LE explicitly, no BOM is added and you get 12 bytes.
You cannot strip the BOM from a string with a simple replace, as you tried, because the BOM, if present, belongs to the underlying byte stream that the Java runtime decodes into a String in memory. You can't manipulate it the way you manipulate characters that are part of the string itself.
I made an InputStream object from a file and an InputStreamReader from that.
InputStream ips = new FileInputStream("c:\\data\\input.txt");
InputStreamReader isr = new InputStreamReader(ips);
I will basically read data in the form of bytes into a buffer, but when I need to read chars I will 'switch mode' and read with the InputStreamReader:
byte[] bbuffer = new byte[20];
char[] cbuffer = new char[20];
while (ips.read(bbuffer, 0, 20) != -1) {
    doSomethingWithbBuffer(bbuffer);
    // check every 20th byte and if it is 0 start reading as char
    if (bbuffer[19] == 0) {
        while (isr.read(cbuffer, 0, 20) != -1) {
            doSomethingWithcBuffer(cbuffer);
            // check every 20th char and if it is '#' return to reading as byte
            if (cbuffer[19] == '#') {
                break;
            }
        }
    }
}
Is this a safe way to read files that have mixed char and byte data?
No, this is not safe. The InputStreamReader may read "too much" data from the underlying stream (it uses internal buffers) and corrupt your attempt to read from the underlying byte stream. You can use something like DataInputStream if you want to mix reading characters and bytes.
Alternatively, just read the data as bytes and use the correct character encoding to convert those bytes to characters/Strings.
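As a rough sketch of that second approach (readMixed is a hypothetical name; the 20-byte records, the 0 marker and the ASCII assumption are taken from your example, not detected by the code):
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Read fixed-size records as raw bytes and decode the text ones explicitly,
// instead of switching to a Reader on the same stream.
static void readMixed(String path) throws IOException {
    try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
        byte[] record = new byte[20];
        while (true) {
            try {
                in.readFully(record);              // exactly 20 bytes, or EOFException
            } catch (EOFException end) {
                break;
            }
            if (record[19] == 0) {
                // this record is character data: decode it yourself
                String text = new String(record, StandardCharsets.US_ASCII);
                System.out.println(text);
            } else {
                // this record is binary data: work with the raw bytes
            }
        }
    }
}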
I have some corrupted Gzip log files that I'm trying to restore. The files were transferred to our servers through a Java-backed web page. The files had always been sent as plain text, but we recently started to receive log files that were gzipped. These gzipped files appear to be corrupted and cannot be unzipped, and the originals have been deleted. I believe this is due to the character encoding in the method below.
Is there any way to revert the process to restore the files to their original zipped format? I have the binary data of the resulting Strings in a database BLOB.
Thanks for any help you can give!
private String convertStreamToString(InputStream is) throws IOException {
    /*
     * To convert the InputStream to String we use the
     * Reader.read(char[] buffer) method. We iterate until the
     * Reader returns -1, which means there's no more data to
     * read. We use the StringWriter class to produce the string.
     */
    if (is != null) {
        Writer writer = new StringWriter();
        char[] buffer = new char[1024];
        try {
            Reader reader = new BufferedReader(
                    new InputStreamReader(is, "UTF-8"));
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        } finally {
            is.close();
        }
        return writer.toString();
    } else {
        return "";
    }
}
If this is the method that was used to convert the InputStream to a String, then your data is almost certainly lost.
The problem is that UTF-8 has quite a few byte sequences that are simply not legal (i.e. they don't represent any value). These sequences will be replaced with the Unicode replacement character.
That character is the same no matter which invalid byte sequence was decoded. Therefore the specific information in those bytes is lost.
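A small illustration of that (the byte values are arbitrary invalid sequences, chosen only for the demo):
import java.nio.charset.StandardCharsets;

// Two different invalid UTF-8 byte sequences...
byte[] a = { (byte) 0xC0, (byte) 0xAF };
byte[] b = { (byte) 0xFF, (byte) 0xFE };

// ...both decode to the replacement character U+FFFD, so the original
// byte values can no longer be recovered from the resulting String.
System.out.println(Integer.toHexString(new String(a, StandardCharsets.UTF_8).charAt(0))); // fffd
System.out.println(Integer.toHexString(new String(b, StandardCharsets.UTF_8).charAt(0))); // fffd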
If that's the code you have, you should never have converted to a Reader (or in fact to a String) at all. Only preserving the data as a stream (or byte array) would have avoided corrupting the binary files. And once it has been read into the String, the illegal sequences (and there are many in UTF-8) WILL have been discarded.
So no, unless you are quite lucky, there is no way to recover the information. You'll have to provide another process where you handle the pure stream and insert it as a pure BLOB, not a CLOB.
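For the replacement upload path, a minimal byte-preserving counterpart of the method above might look like this (convertStreamToBytes is a hypothetical name):
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Byte-preserving replacement for convertStreamToString: no Reader, no
// charset, so gzip (or any other binary) content passes through unchanged.
private byte[] convertStreamToBytes(InputStream is) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[1024];
    try {
        int n;
        while ((n = is.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    } finally {
        is.close();
    }
    return out.toByteArray(); // store this as a BLOB, not a CLOB
}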
I have a file that contains some amount of plain text at the start, followed by binary content at the end. The size of the binary content is specified on one of the plain text lines I read.
I was using a BufferedReader to read the individual lines; however, it exposes no method for reading into a byte array. DataInputStream's readUTF doesn't read all the way to the end of the line, and its readLine method is deprecated.
Using the underlying FileInputStream to read returns empty byte arrays. Any suggestions on how to go about this?
private DOTDataInfo parseFile(InputStream stream) throws IOException {
    DOTDataInfo info = new DOTDataInfo();
    BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
    int binSize = 0;
    String line;
    while ((line = reader.readLine()) != null) {
        if (line.length() == 0)
            break;
        DOTProperty prop = parseProperty(line);
        info.getProperties().add(prop);
        if (prop.getName().equals("ContentSize"))
            binSize = Integer.parseInt(prop.getValue());
    }
    byte[] content = new byte[binSize];
    stream.read(content); // It's all empty now. If I use a DataInputStream instead, it's got the values from the file
    return info;
}
You could use RandomAccessFile. Use readLine() to read the plain text at the start (note the limitations of this, as described in the API), and then readByte() or readFully() to read the subsequent binary data.
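A rough sketch of that (the data.dot file name and the name=value property format are assumptions; ContentSize is the property named in the question):
import java.io.IOException;
import java.io.RandomAccessFile;

// Read the text header with readLine(), then the binary payload with readFully().
// readLine() maps each byte straight to a char (effectively ISO-8859-1),
// which is fine for an ASCII header.
try (RandomAccessFile raf = new RandomAccessFile("data.dot", "r")) {
    int binSize = 0;
    String line;
    while ((line = raf.readLine()) != null && line.length() > 0) {
        if (line.startsWith("ContentSize")) {
            binSize = Integer.parseInt(line.substring(line.indexOf('=') + 1).trim());
        }
    }
    byte[] content = new byte[binSize];
    raf.readFully(content);   // the file pointer is already just past the header
}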
Using the underlying FileInputStream to read returns empty byte arrays.
That's because you have wrapped the stream in a BufferedReader, which has probably consumed all the bytes from the stream when filling up its buffer.
If you genuinely have a file (rather than something harder to seek in, e.g. a network stream) then I suggest something like this:
Open the file as a FileInputStream
Wrap it in InputStreamReader and a BufferedReader
Read the text, so you can find out how much content there is
Close the BufferedReader (which will close the InputStreamReader which will close the FileInputStream)
Reopen the file
Skip to (total file length - binary content length)
Read the rest of the data as normal
You could avoid reopening the file by calling mark() at the start and then reset() and skip() to get to the right place, but note that FileInputStream itself doesn't support mark/reset, so you'd have to mark a BufferedInputStream wrapped around it. (I was looking for an InputStream.seek() but I can't see one - I can't remember wanting it before in Java, but does it really not have one? Ick.)
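A rough sketch of the reopen-and-skip step (readBinaryTail is a hypothetical helper; binSize is whatever the text pass determined the binary length to be):
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// Second pass over the file: skip straight to the binary tail and read it.
static byte[] readBinaryTail(File f, int binSize) throws IOException {
    try (DataInputStream in = new DataInputStream(new FileInputStream(f))) {
        long toSkip = f.length() - binSize;
        while (toSkip > 0) {
            long skipped = in.skip(toSkip);   // skip() may skip fewer bytes than asked
            if (skipped <= 0) break;
            toSkip -= skipped;
        }
        byte[] content = new byte[binSize];
        in.readFully(content);                // throws EOFException if the lengths don't add up
        return content;
    }
}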
You need to use an InputStream. Readers are for character data. Look into wrapping your input stream with a DataInputStream, like:
stream=new DataInputStream(new BufferedInputStream(new FileInputStream(...)));
The data input stream will give you many useful methods to read various types of data, and of course, the base InputStream methods for reading bytes.
(This is actually exactly what an HTTP server must do to read a request with content.)
readUTF doesn't read a line; it reads a string that was written in (modified) UTF-8 format - refer to the Javadoc.
Alas, DataInputStream's readLine() is deprecated and does not handle UTF. But this should help (it reads a line from a binary stream, without any lookahead).
public static String lineFrom(InputStream in) throws IOException {
    byte[] buf = new byte[128];
    int pos = 0;
    for (;;) {
        int ch = in.read();
        if (ch == '\n' || ch < 0) break;
        buf[pos++] = (byte) ch;
        if (pos == buf.length) buf = Arrays.copyOf(buf, pos + 128);
    }
    return new String(Arrays.copyOf(buf, pos), "UTF-8");
}
The correct way is to use an InputStream of some form, probably a FileInputStream unless this becomes a performance barrier.
What do you mean "Using the underlying FileInputStream to read returns empty byte arrays."? This seems very unlikely and is probably where your mistake is. Can you show us the example code you've tried?
You can read the text with BufferedReader. When you know where the binary starts you can close the file and open it with RandomAccessFile and read binary from any point in the file.
Or you can read the file as binary and convert to text the sections you identify as text, using new String(bytes, encoding).
I recommend using DataInputStream. You have the following options:
Read both text and binary content with DataInputStream
Open a BufferedReader, read the text and close the stream. Then open a DataInputStream, skip bytes equal to the size of the text and read the binary data (see the sketch below).
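A sketch of the second option, assuming an ASCII header with '\n' line endings so that each char read corresponds to exactly one byte (the data.dot name and ContentSize property are borrowed from the earlier code for illustration; with '\r\n' endings or a multi-byte encoding the offset arithmetic changes):
import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.IOException;

// First pass: read the header, counting how many bytes it occupies.
int headerBytes = 0;
int binSize = 0;
try (BufferedReader reader = new BufferedReader(new FileReader("data.dot"))) {
    String line;
    while ((line = reader.readLine()) != null && line.length() > 0) {
        headerBytes += line.length() + 1;        // +1 for the '\n'
        if (line.startsWith("ContentSize")) {
            binSize = Integer.parseInt(line.substring(line.indexOf('=') + 1).trim());
        }
    }
    headerBytes += 1;                            // the blank line that ends the header
}

// Second pass: skip the header bytes and read the binary content.
byte[] content = new byte[binSize];
try (DataInputStream in = new DataInputStream(new FileInputStream("data.dot"))) {
    in.skipBytes(headerBytes);                   // skipBytes() loops internally over skip()
    in.readFully(content);
}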