What is actually the difference between InputStream and Reader in Java?

When I searched for the difference between InputStream and Reader, the answer I found was:
InputStream: byte-based (reads byte by byte)
Reader: character-based (reads char by char)
I pasted the á character into a file. Its value (in ASCII, or maybe some other charset) is 225 on my OS, and a byte's max value is 127. Yet when I used FileInputStream and just called read(), it returned 225. How is it able to read more than one byte, when read() only reads one byte or character at a time?
Or: what is the actual difference between InputStream and Reader?

á does indeed have a Unicode value of 225 (that's its code point, and is unrelated to its encoding). When you cast that down to a byte, you'll get -31. But if you take a careful look at the docs for InputStream.read, you'll see:
Reads the next byte of data from the input stream. The value byte is returned as an int in the range 0 to 255.
(emphasis added) The read method returns an int, not a byte, but that int essentially represents an unsigned byte. If you cast that int down to a char, you'll get back to á. If you cast that int down to a byte, it'll wrap down to -31.
A bit more detail:
á has a Unicode value of 225.
chars in Java are represented as UTF-16, which for 225 has the binary representation 00000000 11100001.
If you cast that down to a byte, it'll drop the high byte, leaving you with 11100001. This has a value of -31 if treated as a signed byte, but 225 if treated as unsigned.
InputStream.read returns an int so that it can represent the stream's end as -1. But if the int is non-negative, then only its bottom 8 bits are set (decimal values 0-255).
When you cast that int down to a byte, Java will drop all but the lowest 8 bits -- leaving you again with 11100001.
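To make those casts concrete, here is a minimal sketch using the value from the question:
int readResult = 225;            // hypothetical return of read() for the byte 11100001
byte asByte = (byte) readResult; // keeps only the low 8 bits
char asChar = (char) readResult; // widens to the 16-bit code unit U+00E1
System.out.println(readResult);  // 225
System.out.println(asByte);      // -31
System.out.println(asChar);      // á (assuming the console can display it)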

The difference is that an InputStream will read the contents of the file as is, with no interpretation: the raw bytes.
A Reader on the other hand will use a CharsetDecoder to process the byte input and turn it into a sequence of chars instead. And the way it will process the byte input will depend on the Charset used.
And this is not a 1 <-> 1 relationship!
Also, forget about "ASCII values"; Java doesn't use ASCII, it uses Unicode, and a char is in fact a UTF-16 code unit. It was a full code point when Java began, but then Unicode defined code points outside the BMP and Java had to adapt: code points over U+FFFF are now represented using a surrogate pair, i.e. two chars.
See here for a more detailed explanation.
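Here is a rough sketch of that difference; the file name input.txt is a placeholder, and the file's charset is assumed to be UTF-8:
import java.io.*;
import java.nio.charset.StandardCharsets;

try (InputStream in = new FileInputStream("input.txt")) {
    System.out.println(in.read()); // the raw first byte, 0-255, no interpretation
}
try (Reader r = new InputStreamReader(new FileInputStream("input.txt"), StandardCharsets.UTF_8)) {
    System.out.println((char) r.read()); // one decoded char; may have consumed several bytes
}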

InputStream.read() returns an int: a value between 0 and 255, or -1 at end of stream.
Byte.MAX_VALUE is 127, but Byte.MIN_VALUE is -128, which is binary 10000000. Java does not support unsigned primitives, so the most significant bit is always the sign bit.

Related

What is the size of char[]?

I'm implementing a hash algorithm; the message block is 512 bits.
In C/C++ I can store it in a char[64], but a Java char takes 2 bytes.
Question: should 512 bits of information be a char[32] or a char[64]?
A char is 16 bits in Java, so a char[32] is enough for 512 bits.
I think using a byte[64] is better, though, because everyone knows a byte is 8 bits, and char[32] makes the code harder to read. Also, you are storing bits, not characters.
From the documentation:
char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).
So in order to store 512 bits, you should have an array of size 32.
Why use a char[]?
A hash value consists of bytes, so the logical choice would be to use a byte[64].
The char datatype is intended to be used as a character, not as a number.
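A minimal sketch of the two layouts; the bit index used below is hypothetical:
byte[] block = new byte[64];        // 64 * 8  = 512 bits, the conventional choice
char[] alt   = new char[32];        // 32 * 16 = 512 bits, but suggests text rather than raw bits

int i = 100;                        // hypothetical bit index within the block
block[i / 8] |= 1 << (7 - (i % 8)); // set bit i, most significant bit first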

Array of chars vs. array of bytes

I've found a few answers about this but none of them seem to apply to my issue.
I'm using the NDK, and the C++ side expects an unsigned char array of 1024 elements, so I need to create this in Java to pass as a parameter.
The unsigned char array is expected to contain both numbers and characters.
I have tried this:
byte[] lMessage = new byte[1024];
lMessage[4] = 'a';
The problem is that the element at index 4 then gets stored as a numerical value instead of keeping the 'a' character.
I have also tried
char[] lMessage = new char[1024];
lMessage[4] = 'a';
While this retains the character, it doubles the size of each element from 8 to 16 bits.
I need the output to be an array of unsigned 8-bit ASCII values.
Any suggestions?
Thanks.
It is wrong to say that the element "gets added as a numerical value". The only thing you can say for sure is that it gets stored as electrostatic charges in eight cells of your RAM.
How you choose to represent those eight bits (01100001) in order to visualize them has little to do with what they really are, so if you choose to see them as a numerical value, you might be tricked into believing that they really are a numerical value. (Kind of like a self-fulfilling prophecy.)
But in fact they are nothing but 8 electrostatic charges, interpretable in whatever way we like. We can choose to interpret them as a two's complement number (97), as a binary-coded decimal number (61), as an ASCII character ('a'), or as an x86 instruction opcode (popa); the list goes on.
The closest thing to an unsigned char in C++ is a byte in Java. That's because the fundamental characteristic of these small data types is how many bits long they are. A char in C++ is 8 bits long, and the only type in Java that is also 8 bits long is the byte.
Unfortunately, a byte in Java tends to be thought of as a numerical quantity rather than as a character, so tools (such as debuggers) that display bytes show them as little numbers. But this is just an arbitrary convention: they could just as easily have chosen to display bytes as ASCII (8-bit) characters, and then you would see an actual 'a' in lMessage[4].
So, don't be fooled by what the tools show; all that counts is that it is an 8-bit quantity. And if the tools show 97 (0x61), then you know that the bit pattern stored in those 8 memory cells can just as legitimately be thought of as an 'a', because the ASCII code of 'a' is 97.
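A small sketch of the same 8-bit pattern viewed both ways:
byte b = (byte) 'a';                                  // stores the bit pattern 01100001
System.out.println(b);                                // 97 (two's complement view)
System.out.println((char) (b & 0xFF));                // a  (ASCII view)
System.out.println(Integer.toBinaryString(b & 0xFF)); // 1100001 (leading zero dropped)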
So, finally, to answer your question: you need to find a way to convert a Java String, which consists of 16-bit UTF-16 code units, to an array of ASCII characters, which would be bytes in Java. You can try this:
import java.nio.charset.StandardCharsets;

String s = "TooManyEduardos";
byte[] bytes = s.getBytes(StandardCharsets.US_ASCII); // the Charset overload avoids the checked UnsupportedEncodingException
Or you can read the answers to this question: Convert character to ASCII numeric value in Java for more ideas.
This will work for ASCII chars:
lMessage[4] = String.valueOf('a').getBytes(StandardCharsets.US_ASCII)[0];

Why does writeBytes discard each character's high eight bits?

I wanted to use DataOutputStream#writeBytes, but was running into errors. Description of writeBytes(String) from the Java Documentation:
Writes out the string to the underlying output stream as a sequence of bytes. Each character in the string is written out, in sequence, by discarding its high eight bits.
I think the problem I'm running into is due to the part about "discarding its high eight bits". What does that mean, and why does it work that way?
Most Western programmers tend to think in terms of ASCII, where one character equals one byte, but Java Strings are 16-bit Unicode. writeBytes just writes out the lower byte, which for ASCII/ISO-8859-1 is the "character" in the C sense.
The char data type is a single 16-bit Unicode character, with a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive). The byte data type, by contrast, is an 8-bit signed two's complement integer, with a minimum value of -128 and a maximum value of 127 (inclusive). That is why this method writes the low-order byte of each char in the string, from first to last; any information in the high-order byte is lost. In other words, it assumes the string contains only characters whose value is between 0 and 255.
You may look into the writeUTF(String s) method, which retains the information in the high-order byte as well as the length of the string. First it writes the length of the encoded string, in bytes, to the underlying output stream as a 2-byte unsigned int (0 to 65,535). Next it encodes the string in (modified) UTF-8 and writes the bytes of the encoded string to the underlying output stream. This allows a data input stream reading those bytes to reconstruct the string completely.
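As a sketch of the contrast (demo.bin is a placeholder path): for 'é' (U+00E9), writeBytes keeps only the low byte, while writeUTF keeps everything:
import java.io.*;

try (DataOutputStream out = new DataOutputStream(new FileOutputStream("demo.bin"))) {
    out.writeBytes("é"); // 1 byte: 0xE9, the low 8 bits of U+00E9
    out.writeUTF("é");   // 2-byte length prefix (2), then the modified-UTF-8 bytes 0xC3 0xA9
}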

How the write method in FileOutputStream truncates its int argument

The write() method in FileOutputStream takes an int, but it drops the top 3 bytes and writes only the low byte to the stream.
If a file contains characters whose (extended) ASCII value is more than 127, and bytes are read from it and then written to an output stream (another text file), how will the character be preserved, given that in Java a byte can have a max value of +127?
Say a text file (input.txt) has the character '›', whose value is 155 in an extended 8-bit charset.
An input stream reads from it:
int in = new FileInputStream("input.txt").read(); // in = 155
Now it writes to another text file (output.txt):
new FileOutputStream("output.txt").write(in);
Here the integer in is truncated to a byte, whose corresponding decimal value is -101.
How does it successfully manage to write the character to the file, even though information about it seems to have been lost?
I just went through the description of the write(int) method in the Java docs, and what I observed was:
The general contract for write is that one byte is written to the output stream. The byte to be written is the eight low-order bits of the argument b. The 24 high-order bits of b are ignored.
So I believe that, contrary to what I thought earlier (that the int in write() is truncated as it would be when downcasting an int to a byte for values greater than 127), the 24 high-order bits are simply ignored and only the 8 least significant bits are considered.
No truncation or conversion to byte occurs.
I believe I am correct.
I think your confusion is caused by the fact that specs for character sets typically take the view that bytes are unsigned, while Java treats bytes as signed.
In fact 155 as an unsigned byte is -101 as a signed byte. (256 - 101 == 155). The bit patterns are identical. It is just a matter of whether you think of them as signed or unsigned.
How the truncation is coded is implementation specific. But there is no loss of information ... assuming that you had an 8-bit code in the first place.
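A round-trip sketch (output.txt is a placeholder) showing the 8-bit pattern surviving intact:
import java.io.*;

try (OutputStream out = new FileOutputStream("output.txt")) {
    out.write(155); // only the low 8 bits (10011011) reach the file
}
try (InputStream in = new FileInputStream("output.txt")) {
    System.out.println(in.read()); // 155 again, returned as an unsigned value in an int
}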

End of stream in Java

I am confused by the following statement that appears here
The basic read() method of the InputStream class reads a single unsigned byte of data and returns the int value of the unsigned byte. This is a number between 0 and 255. If the end of stream is encountered, it returns -1 instead; and you can use this as a flag to watch for the end of stream.
Since one byte can represent only 256 distinct values, I fail to see how it can represent 0 to 255 and also -1. Can someone please point out what I am missing here?
The return type of InputStream#read() is an int; the value represents an actual data byte when it falls in the range 0-255.
Although the read() operation reads just a byte, it actually returns an int, so there is no conflict.
Only values in the range 0-255 are returned, though, aside from the special -1 end-of-stream value.
It returns an int, not a byte, so although it will normally contain only 0-255, it can contain other values.
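That extra headroom in the int is what makes the conventional read loop work (input.txt is a placeholder):
import java.io.*;

try (InputStream in = new FileInputStream("input.txt")) {
    int b;
    while ((b = in.read()) != -1) { // -1 can never collide with a data byte (0-255)
        System.out.println(b);
    }
}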
