Why does US-ASCII encoding accept non US-ASCII characters?

Consider the following code:
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;

public class ReadingTest {
    public void readAndPrint(String usingEncoding) throws Exception {
        ByteArrayInputStream bais = new ByteArrayInputStream(new byte[]{(byte) 0xC2, (byte) 0xB5}); // 'micro' sign UTF-8 representation
        InputStreamReader isr = new InputStreamReader(bais, usingEncoding);
        char[] cbuf = new char[2];
        isr.read(cbuf);
        System.out.println(cbuf[0] + " " + (int) cbuf[0]);
    }

    public static void main(String[] argv) throws Exception {
        ReadingTest w = new ReadingTest();
        w.readAndPrint("UTF-8");
        w.readAndPrint("US-ASCII");
    }
}
Observed output:
µ 181
? 65533
Why does the second call of readAndPrint() (the one using US-ASCII) succeed? I would expect it to throw an error, since the input is not a proper character in this encoding. What is the place in the Java API or JLS which mandates this behavior?

The default behavior on encountering undecodable bytes in the input stream is to replace them with the Unicode character U+FFFD REPLACEMENT CHARACTER.
If you want to change that, you can pass the InputStreamReader a CharsetDecoder that has a different CodingErrorAction configured:
CharsetDecoder decoder = Charset.forName(usingEncoding).newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPORT);
InputStreamReader isr = new InputStreamReader(bais, decoder);
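As a minimal, self-contained sketch of that approach, using the same two input bytes as the question: the class name StrictDecodingDemo and the helper decodeStrict are made up for illustration, and the sketch also sets onUnmappableCharacter(REPORT), which the snippet above does not do.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class StrictDecodingDemo {

    // Hypothetical helper: decodes the bytes with the named charset and throws
    // on bad input instead of silently substituting U+FFFD.
    public static String decodeStrict(byte[] input, String charsetName) throws IOException {
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        StringBuilder sb = new StringBuilder();
        try (InputStreamReader isr = new InputStreamReader(new ByteArrayInputStream(input), decoder)) {
            int c;
            while ((c = isr.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] micro = {(byte) 0xC2, (byte) 0xB5}; // UTF-8 bytes of the micro sign

        // Valid UTF-8: decodes to the single char U+00B5 (decimal 181).
        System.out.println((int) decodeStrict(micro, "UTF-8").charAt(0));

        // Not valid US-ASCII: with REPORT the reader throws while decoding.
        try {
            decodeStrict(micro, "US-ASCII");
            System.out.println("decoded without error");
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```

The exception is a CharacterCodingException, which extends IOException, so code that already handles IOException from the reader will see it without further changes.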

I'd say, this is the same as for the constructor
String(byte bytes[], int offset, int length, Charset charset):
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string. The java.nio.charset.CharsetDecoder class should be used when more control over the decoding process is required.
Using CharsetDecoder you can specify a different CodingErrorAction.

Related

Understanding encoding in character streams

I'm trying to understand how encodings are applied by character streams in Java. For the discussion let's use the following code example:
public static void main(String[] args) throws Exception {
    byte[] utf8Input = new byte[] { (byte) 0xc3, (byte) 0xb6 }; // 'ö'
    ByteArrayOutputStream utf160Out = new ByteArrayOutputStream();
    InputStreamReader is = new InputStreamReader(new ByteArrayInputStream(utf8Input), StandardCharsets.UTF_8);
    OutputStreamWriter os = new OutputStreamWriter(utf160Out, StandardCharsets.UTF_16);
    int len;
    while ((len = is.read()) != -1) {
        os.write(len);
    }
    os.close();
}
The program reads the UTF-8 encoded character 'ö' from the byte array utf8Input and writes it UTF-16 encoded to utf160Out. In particular, the ByteArrayInputStream on utf8Input just streams the bytes 'as-is' and the InputStreamReader subsequently decodes the read input with a UTF-8 decoder. Dumping the result of the len variable yields '0xf6', which represents the Unicode code point for 'ö'. The OutputStreamWriter writes using UTF-16 encoding without having any knowledge about the input encoding.
How does the OutputStreamWriter know the input encoding (here: UTF-8)? Is there an internal representation that is assumed which is also mapped to by an InputStreamReader? So basically, we are saying then: Read this input, it is UTF-8 encoded and decode it to our internal encoding X. An OutputStreamWriter is given the target encoding and expects the input to be encoded with X. Is this correct? If so, what is the internal encoding? UTF-16 as mentioned in What is the Java's internal represention for String? Modified UTF-8? UTF-16??

The read() method has returned a Java char value, which is an unsigned 2-byte number (0-65535).
The actual return type is int (a signed 4-byte number) to allow for the special value -1, meaning end of stream.
A Java char is a UTF-16 code unit. This means that every character from the Basic Multilingual Plane appears unencoded, i.e. the char value is the Unicode code point; characters outside that plane are represented by a pair of surrogate chars. So the internal encoding you call X is UTF-16: the InputStreamReader decodes into it and the OutputStreamWriter encodes from it.
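To make this concrete, here is a small sketch that separates the two steps from the question. The class name TranscodeDemo and the helpers readChars and writeUtf16 are illustrative names, not part of any API. It also reads one character from outside the Basic Multilingual Plane (U+1F600, chosen only as an example) to show that read() then returns two surrogate values.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class TranscodeDemo {

    // Decodes UTF-8 bytes and collects every value returned by Reader.read().
    public static List<Integer> readChars(byte[] utf8) throws IOException {
        List<Integer> units = new ArrayList<>();
        try (InputStreamReader in = new InputStreamReader(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int c;
            while ((c = in.read()) != -1) {
                units.add(c);
            }
        }
        return units;
    }

    // Encodes the given char values with Java's "UTF-16" charset.
    public static byte[] writeUtf16(List<Integer> units) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (OutputStreamWriter writer = new OutputStreamWriter(out, StandardCharsets.UTF_16)) {
            for (int unit : units) {
                writer.write(unit);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // 'ö' (U+00F6) is in the BMP: one char whose value is the code point.
        List<Integer> oUmlaut = readChars(new byte[]{(byte) 0xC3, (byte) 0xB6});
        System.out.println(Integer.toHexString(oUmlaut.get(0)));

        // The writer only ever sees chars; it prepends a byte order mark.
        for (byte b : writeUtf16(oUmlaut)) {
            System.out.printf("%02x ", b);
        }
        System.out.println();

        // U+1F600 is outside the BMP: read() returns two surrogate values.
        List<Integer> emoji = readChars(new byte[]{(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80});
        System.out.println(emoji.size());
    }
}
```

The writer never learns that the input was UTF-8. It only receives char values, which is why the two ends of the pipeline can be configured independently.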

Is InputStream same as InputStreamReader when a character has 8 bits?

I was reading about InputStream and InputStreamReader.
Most of the people said that InputStream is for bytes and InputStreamReader is for text.
So I created a simple file with only one character, which was 'a'.
When I used InputStream to read the file and convert it to a char it printed the letter 'a'. And when I did the same thing but this time with InputStreamReader it also gave me the same result.
So where was the difference? I thought InputStream would not be able to give the letter 'a'.
Does this mean that when a character has 8 bits, there will be no difference between InputStream and InputStreamReader? Is it true that there will be only a difference between them when a character has more than one byte?

No, InputStream and InputStreamReader are not the same even for 8 bit characters.
Look at InputStream's read() method without parameter. It returns an int but according to the documentation, a byte (range 0 to 255) is returned or -1 for EOF. The other read methods work with arrays of bytes.
InputStreamReader inherits from Reader. Reader's read() method without a parameter also returns an int. But here the int value (range 0 to 65535) is interpreted as a character or -1 for EOF. The other read methods work with arrays of char directly.
The difference is the encoding. InputStreamReader's constructors require an explicit encoding or the platform's default encoding is used. The encoding is the translation between bytes and characters.
You said: "When I used InputStream to read the file and convert it to a char it printed the letter 'a'." So you read the byte and converted it to a character manually. This conversion part is built into InputStreamReader using an encoding for the translation.
Even for single-byte character sets there are differences. Your example is the letter "a", which has hex value 61 in the Windows ANSI encoding (named "Cp1252" in Java). But in the encoding IBM-Thai the byte 0x61 is interpreted as "/".
So what people told you is right. InputStream is for binary data, and InputStreamReader sits on top of it for text, translating between binary data and text according to an encoding.
Here is a simple example:
import java.io.*;

public class EncodingExample {
    public static void main(String[] args) throws Exception {
        // Prepare the byte buffer for character 'a' in Windows-ANSI
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        final PrintWriter writer = new PrintWriter(new OutputStreamWriter(baos, "Cp1252"));
        writer.print('a');
        writer.flush();
        final byte[] buffer = baos.toByteArray();
        readAsBytes(new ByteArrayInputStream(buffer));
        readWithEncoding(new ByteArrayInputStream(buffer), "Cp1252");
        readWithEncoding(new ByteArrayInputStream(buffer), "IBM-Thai");
    }

    /**
     * Reads and displays the InputStream's bytes as hexadecimal.
     * @param in The inputStream
     * @throws Exception
     */
    private static void readAsBytes(InputStream in) throws Exception {
        int c;
        while ((c = in.read()) != -1) {
            final byte b = (byte) c;
            System.out.println(String.format("Hex: %x ", b));
        }
    }

    /**
     * Reads the InputStream with an InputStreamReader and the given encoding.
     * Prints the resulting text to the console.
     * @param in The input stream
     * @param encoding The encoding
     * @throws Exception
     */
    private static void readWithEncoding(InputStream in, String encoding) throws Exception {
        Reader reader = new InputStreamReader(in, encoding);
        int c;
        final StringBuilder sb = new StringBuilder();
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        System.out.println(String.format("Interpreted with encoding '%s': %s", encoding, sb.toString()));
    }
}
The output is:
Hex: 61
Interpreted with encoding 'Cp1252': a
Interpreted with encoding 'IBM-Thai': /
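The difference becomes even more visible once a character needs more than one byte. The following sketch is an addition, not part of the answer above: the class BytesVersusChars and its two counting helpers are hypothetical names. It feeds the two UTF-8 bytes of 'é' to a plain InputStream and to InputStreamReaders with two different charsets.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BytesVersusChars {

    // Counts how many values InputStream.read() returns before end of stream.
    public static int countBytes(byte[] data) throws IOException {
        int n = 0;
        try (InputStream in = new ByteArrayInputStream(data)) {
            while (in.read() != -1) {
                n++;
            }
        }
        return n;
    }

    // Counts how many chars a Reader produces from the same bytes.
    public static int countChars(byte[] data, Charset charset) throws IOException {
        int n = 0;
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            while (reader.read() != -1) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        byte[] eAcute = {(byte) 0xC3, (byte) 0xA9}; // UTF-8 bytes of 'é' (U+00E9)

        System.out.println("bytes: " + countBytes(eAcute));
        System.out.println("chars as UTF-8: " + countChars(eAcute, StandardCharsets.UTF_8));
        System.out.println("chars as ISO-8859-1: " + countChars(eAcute, StandardCharsets.ISO_8859_1));
    }
}
```

The stream always reports two bytes. The reader reports one char when told the bytes are UTF-8 and two chars when told they are ISO-8859-1, so the result depends entirely on the encoding handed to the reader.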

Convert a String to a byte array and then back to the original String

Is it possible to convert a string to a byte array and then convert it back to the original string in Java or Android?
My objective is to send some strings to a microcontroller (Arduino) and store them in EEPROM (which is only 1 KB). I tried to use an MD5 hash, but it seems it's only one-way encryption. What can I do to deal with this issue?

I would suggest using the members of String, but with an explicit encoding:
byte[] bytes = text.getBytes("UTF-8");
String text = new String(bytes, "UTF-8");
By using an explicit encoding (and one which supports all of Unicode) you avoid the problems of just calling text.getBytes() etc:
You're explicitly using a specific encoding, so you know which encoding to use later, rather than relying on the platform default.
You know it will support all of Unicode (as opposed to, say, ISO-Latin-1).
EDIT: Even though UTF-8 is the default encoding on Android, I'd definitely be explicit about this. For example, this question only says "in Java or Android" - so it's entirely possible that the code will end up being used on other platforms.
Basically given that the normal Java platform can have different default encodings, I think it's best to be absolutely explicit. I've seen way too many people using the default encoding and losing data to take that risk.
EDIT: In my haste I forgot to mention that you don't have to use the encoding's name - you can use a Charset instead. Using Guava I'd really use:
byte[] bytes = text.getBytes(Charsets.UTF_8);
String text = new String(bytes, Charsets.UTF_8);
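Without Guava, the JDK's own java.nio.charset.StandardCharsets constants (available since Java 7) serve the same purpose. The sketch below is only an illustration: the class RoundTrip, the helper roundTrip and the sample text are made up. It checks that the round trip is lossless with UTF-8 and shows what happens with a charset that cannot represent every character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTrip {

    // Encodes the text to bytes and decodes it again with the same charset.
    public static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "Gr\u00fc\u00dfe \u20ac 5"; // contains ü, ß and the euro sign

        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println(original.equals(roundTrip(original, StandardCharsets.UTF_8)));

        // US-ASCII cannot; unmappable characters are replaced when encoding.
        System.out.println(original.equals(roundTrip(original, StandardCharsets.US_ASCII)));

        // The byte count matters when storage is tight, as with a 1 KB EEPROM.
        System.out.println(original.getBytes(StandardCharsets.UTF_8).length + " bytes in UTF-8");
    }
}
```

The Charset overloads also avoid the checked UnsupportedEncodingException that the String-name overloads declare.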

You can do it like this.
String to byte array
String stringToConvert = "This String is 76 characters long and will be converted to an array of bytes";
byte[] theByteArray = stringToConvert.getBytes();
http://www.javadb.com/convert-string-to-byte-array
Byte array to String
byte[] byteArray = new byte[] {87, 79, 87, 46, 46, 46};
String value = new String(byteArray);
http://www.javadb.com/convert-byte-array-to-string

Use String.getBytes() to convert to bytes and the String(byte[] data) constructor to convert back to a string.

byte[] pdfBytes = Base64.decode(myPdfBase64String, Base64.DEFAULT);

import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;
import org.apache.commons.codec.binary.Hex; // from Apache Commons Codec

public class FileHashStream
{
    // reads everything from the input stream and returns its hash as a new byte array
    public static byte[] read(InputStream is) throws Exception
    {
        // the original snippet never created "md"; MD5 is an assumption here,
        // chosen only because the question mentions it
        MessageDigest md = MessageDigest.getInstance("MD5");
        // we need a byte buffer; we will use 16 kilobytes
        byte[] buf = new byte[1024 * 16];
        int len = 0;
        // use the buffer to update our "MessageDigest" instance
        while (true)
        {
            len = is.read(buf);
            if (len < 0) break;
            md.update(buf, 0, len);
        }
        // close the input stream
        is.close();
        // call the "digest" method for obtaining the final hash-result
        return md.digest();
    }

    public static void main(String[] args) throws Exception
    {
        // pass the absolute path for the 'commons-codec-1.10-bin.zip' as the first argument
        String path = args[0];
        // we need a new input stream
        byte[] ret = read(new FileInputStream(path));
        System.out.println("Length of Hash: " + ret.length);
        for (byte b : ret)
        {
            System.out.println(b + ", ");
        }
        String compare = "49276d206b696c6c696e6720796f757220627261696e206c696b65206120706f69736f6e6f7573206d757368726f6f6d";
        String verification = Hex.encodeHexString(ret);
        System.out.println();
        System.out.println("===");
        System.out.println(verification);
        System.out.println("Equals? " + verification.equals(compare));
    }
}

How best to convert a byte[] array to a string buffer

I have a number of byte[] array variables that I need to convert to string buffers.
Is there a method for this type of conversion?
Thanks
Thank you all for your responses. However, I didn't make myself clear.
I'm using some byte[] arrays predefined as public static fields under the class declaration of my Java program. These fields are reused during the life of the process.
As the program issues status messages (written to a file), I've defined a string buffer (mesg_data) that is used to format a status message.
So as the program executes, I tried msg2 = String(byte_array2)
I get a compiler error:
cannot find symbol
symbol : method String(byte[])
location: class APPC_LU62.java.LU62XnsCvr
convrsID = String(conversation_ID) ;
example:
public class LU62XnsCvr extends Object
.
.
static String convrsID ;
static byte[] conversation_ID = new byte[8] ;
So I can't use a "dynamic" definition of a string variable because the same variable is used in multiple occurrences.
I hope I made myself clear
Thanks ever so much
Guy

String s = new String(myByteArray, "UTF-8");
StringBuilder sb = new StringBuilder(s);

There is a constructor that takes a byte array and an encoding:
byte[] bytes = new byte[200];
//...
String s = new String(bytes, "UTF-8");
In order to translate bytes to characters you need to specify an encoding: the scheme by which sequences (typically of length 1, 2 or 3) of 0-255 values (that is, sequences of bytes) are mapped to characters. UTF-8 is probably the best bet as a default.
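Applied to the situation in the question, where one buffer is reused for many status messages, this could look roughly like the sketch below. Everything in it is an assumption made for illustration: the class StatusBuffer, the format helper, the label text, and the choice of US-ASCII as the encoding of the byte arrays. Note that the compiler error in the question comes from the missing new keyword: String(bytes) is parsed as a call to a method named String.

```java
import java.nio.charset.StandardCharsets;

public class StatusBuffer {

    // One reusable buffer, cleared before each message instead of reallocated.
    private static final StringBuilder MESSAGE = new StringBuilder();

    // Formats "label=<decoded bytes>"; the charset is an assumption and must
    // match whatever actually produced the byte array.
    public static String format(String label, byte[] data) {
        MESSAGE.setLength(0); // discard the previous message
        MESSAGE.append(label)
               .append('=')
               .append(new String(data, StandardCharsets.US_ASCII)); // note the "new"
        return MESSAGE.toString();
    }

    public static void main(String[] args) {
        byte[] conversationId = {65, 66, 67, 68, 69, 70, 71, 72}; // "ABCDEFGH" in ASCII
        System.out.println(format("conversation", conversationId));
        System.out.println(format("conversation", new byte[]{87, 79, 87}));
    }
}
```

StringBuilder is used here because it is unsynchronized; if the buffer is shared between threads, StringBuffer offers the same setLength and append methods.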

You can turn it into a String directly:
byte[] bytearray;
....
String mystring = new String(bytearray);
and then convert that to a StringBuffer:
StringBuffer buffer = new StringBuffer(mystring);

You may use
String str = new String(bytes);
By the way, what the code above does is create a Java String (i.e. UTF-16) by decoding the bytes with the platform's default character encoding.
If the byte array was created from a string encoded in the platform default character encoding, this will work well.
If not, you need to specify the correct character encoding (Charset), as in
String str = new String(bytes, charset);

It depends entirely on the character encoding, but you want:
String value = new String(bytes, "US-ASCII");
This would work for US-ASCII values.
See Charset for other valid character encodings (e.g., UTF-8)

Java InputStream encoding/charset

Running the following (example) code
import java.io.*;
public class test {
    public static void main(String[] args) throws Exception {
        byte[] buf = {-27};
        InputStream is = new ByteArrayInputStream(buf);
        BufferedReader r = new BufferedReader(
                new InputStreamReader(is, "ISO-8859-1"));
        String s = r.readLine();
        System.out.println("test.java:9 [byte] (char)" + (char)s.getBytes()[0] +
                " (int)" + (int)s.getBytes()[0]);
        System.out.println("test.java:10 [char] (char)" + (char)s.charAt(0) +
                " (int)" + (int)s.charAt(0));
        System.out.println("test.java:11 string below");
        System.out.println(s);
        System.out.println("test.java:13 string above");
    }
}
gives me this output
test.java:9 [byte] (char)? (int)63
test.java:10 [char] (char)? (int)229
test.java:11 string below
?
test.java:13 string above
How do I retain the correct byte value (-27) in the line-9 printout? And consequently receive the expected output of the System.out.println(s) command (å).

If you want to retain byte values, don't use a Reader at all, ideally. To represent arbitrary binary data in text and convert it back to binary data later, you should use base16 or base64 encoding.
However, to explain what's going on, when you call s.getBytes() that's using the default character encoding, which apparently doesn't include Unicode character U+00E5.
If you call s.getBytes("ISO-8859-1") everywhere instead of s.getBytes() I suspect you'll get back the right byte value... but relying on ISO-8859-1 for this is kinda dirty IMO.

As noted, getBytes() (no-arguments) uses the Java platform default encoding, which may not be ISO-8859-1. Simply printing it should work, provided your terminal and the default encoding match and support the character. For instance, on my system, the terminal and default Java encoding are both UTF-8. The fact that you're seeing a '?' indicates that yours don't match or å is not supported.
If you want to manually encode to UTF-8 on your system, do:
String s = r.readLine();
byte[] utf8Bytes = s.getBytes("UTF-8");
It should give a byte array with {-61, -91}.
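Both suggestions can be checked with a short sketch. The class name Latin1RoundTrip and the helper decodeLatin1 are hypothetical, and writing to the console through a UTF-8 PrintStream is only one possible way to print the character; it helps only if the terminal itself expects UTF-8.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {

    // Reads one line from the bytes, decoding them as ISO-8859-1 as in the question.
    public static String decodeLatin1(byte[] raw) throws IOException {
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(raw), StandardCharsets.ISO_8859_1))) {
            return r.readLine();
        }
    }

    public static void main(String[] args) throws Exception {
        String s = decodeLatin1(new byte[]{-27});

        // Encoding with the same charset that was used for decoding restores the byte.
        System.out.println(s.getBytes(StandardCharsets.ISO_8859_1)[0]);

        // Encoding with UTF-8 yields two bytes for U+00E5.
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));

        // Printing through an explicitly UTF-8 stream instead of the platform default.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(s);
    }
}
```

ISO-8859-1 maps every byte value 0-255 to the char with the same number, which is why the round trip restores the byte exactly.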
