clarification on using OutputStream in Java? - java

I was reading through this article. It has this following snippet
OutputStream output = new FileOutputStream("c:\\data\\output-text.txt");
while(moreData) {
int data = getMoreData();
output.write(data);
}
output.close();
It is mentioned:
OutputStreams are used for writing byte based data, one byte at a time. The write() method of an OutputStream takes an int which contains the byte value of the byte to write.
Let's say I am writing the string Hello World to the file, so each character in string gets converted to int using getMoreData() method. and how does it get written? as character or byte in the output-text.txt? If it gets written in byte, what is the advantage of writing in bytes if I have to "reconvert" byte to character?

Each character (and almost anything stored on a file) is a byte / bytes. For example:
Lowercase 'a' is written as one byte with decimal value 97.
Number '1' is written as one byte with decimal value 49
There's no more concept of data types once the information is written into a file, everything is just a stream of bytes. What's important is the encoding used to store the information into the file
Have a look at ascii table, which is very useful for beginners learning information encoding.
To illustrate this, create a file containing the text 'hello world'
$ echo 'hello world' > hello.txt
Then output the bytes written to the file using od command:
$ od -td1 hello.txt
0000000 104 101 108 108 111 32 119 111 114 108 100 10
0000014
The above means, at address 0000000 from the start of the file, I see one byte with decimal value 104 (which is character 'h'), then one byte with decimal value 101 (which is character 'e") and so on..

The article is incomplete, because an OutputStream has overloaded methods for write that take a byte[], a byte[] along with offset and length arguments, or a single int.
In the case of writing a String to a stream when the only interface you have is OutputStream (say you don't know what the underlying implementation is), it would be much better to use output.write(string.getBytes()). Iteratively peeling off a single int at a time and writing it to the file is going to perform horribly compared to a single call to write that passes an array of bytes.

Streams operate on bytes and simply read/write raw data.
Readers and writers interpret the underlying data as strings using character sets such as UTF-8 or US-ASCII. This means they may take 8 bit characters (ASCII) and convert the data into UTF-16 strings.
Streams use bytes, readers/writers use strings (or other complex types).

The Java.io.OutputStream class is the superclass of all classes representing an output stream of bytes. When bytes are written to the OutputStream, it may not write the bytes immediately, instead the write method may put the bytes into a buffer.
There are methods to write as mentioned below:
void write(byte[] b)
This method writes b.length bytes from the specified byte array to this output stream.
void write(byte[] b, int position, int length)
This method writes length bytes from the specified byte array starting at offset position to this output stream.
void write(int b)
This method writes the specified byte to this output stream.

Related

I wanted to Convert any length String to fixed 32 Bytes

I want to convert any length of String to byte32 in Java.
Code
String s="9c46267273a4999031c1d0f7e40b2a59233ce59427c4b9678d6c3a4de49b6052e71f6325296c4bddf71ea9e00da4e88c4d4fcbf241859d6aeb41e1714a0e";
//Convert into byte32
From the comments it became clear that you want to reduce the storage space of that string to 32 bytes.
The given string can easily be compressed from the 124 bytes to 62 bytes by doing a hexadecimal conversion.
However, there is no algorithm and there will not be an algorithm that can compress any data to 32 bytes. Imagine that would be possible: it would have been implemented and you would be able to get ZIP files of just 32 bytes for any file you compress.
So, unfortunately, the answer is: it's not possible.
You can not convert any length string to a byte array of length 32.
Java uses UTF-16 as it's string encoding, so in order to store 100% of the string, 1:1 as a fixed length byte array, you would be at a surface glance be limited to 16 characters.
If you are willing to live with the limitation of 16 characters, byte[] bytes = s.getBytes(); should give you a variable length byte array, but it's best to specify an explicit encoding. e.g. byte [] array2 = str.getBytes("UTF-16");
This doesn't completely solve your problem. You will now likely have to check that the byte array doesn't exceed 32 bytes, and come up with strategies for padding, possible null termination (which may potentially eat into your character budget)
Now, if you don't need the entire UTF-16 string space that Java uses for strings by default, you can get away with longer strings, by using other encodings.
IF this is to be used for any kind of other standard or something ( I see references to etherium being thrown around) then you will need to follow their standards.
Unless you are writing your own library for dealing with it directly, I highly recommend using a library that already exists, and appears to be well tested, and used.
You can achieve with the following function
byte[] bytes = s.getBytes();

Confusion on byte streams

While reading a Java book on byte streams, I came across this example which the book uses to show the difference between the two. The example used is the number 199. According to the book, if this number is written to character stream, then it is written as three different characters: 0x31 0xC7 0x39. But if this is written to byte stream, it is written as single value 0xC7. My doubt is, 199 does not fit into a byte in Java. So, shouldn't it be written as two bytes instead of one? Is 199 written as 1 byte or two bytes in binary streams?
If you call OutputStream.write(int), which is a method for writing a single byte, it will ignore all the bits except the bottom eight. That means that 199 and -57 would be written exactly the same way. For that particular method, that's the way it works because it is only supposed to write a byte.
If you called some other method, it will work differently. For instance, DataOutputStream.writeInt writes an integer as four bytes, because that's what that method is for.

What is actually difference between InputStream and Reader in java

As I searched difference between InputStream and Reader. I got answered that.
InputStream: Byte-Base ( read byte by byte )
Reader: Character-Base ( read char by char )
I paste á character in file that's ASCII (or may be other Charset) is 225 in my OS and byte's max_value is 127. and I used FileInputStream to just read() then why it returning 225? how it is able to read more than one byte? because read() method just read one byte or character at a time.
Or what is the actually difference between InputStream and Reader?
á does indeed have a unicode value of 225 (that's its code point, and is unrelated to its encoding). When you cast that down to a byte, you'll get -31. But if you take a careful look at the docs for InputStream.read, you'll see:
Reads the next byte of data from the input stream. The value byte is returned as an int in the range 0 to 255.
(emphasis added) The read method returns an int, not a byte, but that int essentially represents an unsigned byte. If you cast that int down to a char, you'll get back to á. If you cast that int down to a byte, it'll wrap down to -31.
A bit more detail:
á has a unicode value of 225.
chars in Java are represented as UTF-16, which for 225 has a binary representation of 00000000 11100001
if you cast that down to a byte, it'll drop the high byte, leaving you with 11100001. This has a value of -31 if treated as a signed byte, but 225 if treated as unsigned.
InputStream.read returns an int so that it can represent the stream's end as -1. But if the int is non-negative, then only its bottom 8 bits are set (decimal values 0-255)
When you cast that int down to a byte, Java will drop all but the lowest 8 bits -- leaving you again with 11100001
The difference is that an InputStream will read the contents of the file as is, with no interpretation: the raw bytes.
A Reader on the other hand will use a CharsetDecoder to process the byte input and turn it into a sequence of chars instead. And the way it will process the byte input will depend on the Charset used.
And this is not a 1 <-> 1 relationship!
Also, forget about "ASCII values"; Java doesn't use ASCII, it uses Unicode, and a char is in fact a UTF-16 code unit. It was a full code point when Java began, but then Unicode defined code points outside the BMP and Java had to adapt: code points over U+FFFF are now represented using a surrogate pair, ie two chars.
See here for a more detailed explanation.
InputStream.read() returns an int. That is a value between 0 and 255.
Byte.MAX_VALUE is 127 but Byte.MIN_VALUE is -128 which is binary 10000000. But java does not support unsigned primitives so the most significant byte is always the sign bit.

Huffman compress file (Got the tree but can't compress)- Java

Alright so I am trying to do a file compress using the Huffman tree.
We got the tree that is working just fine but we are unable to figure out how to write the binary string we get into the file.
So for example our tree returns: '110', it should mean this byte: '00000110' right?
And if the returns: '11111111 11111110' it should mean what? Should we just write it in in byte?
So the question is how do we convert the binary string we get into bytes so we can write it on the file?
Thanks alot,
Ara
So for example our tree returns: '110', it should mean this byte:
'00000110' right?
Wrong. You should have a byte buffer of bits into which you write your bits. Write the three bits 110 into the byte. (You will need to decide on a convention for bit ordering in the byte.) You still have five unused bits in the byte, so there it sits. Now you write 10 into the buffer. The byte buffer now has 11010, and three unused bits. So still it sits. Now you try to write 111011 into the byte buffer. The first three bits go into the byte buffer, giving you 11010111. You now have filled the buffer, so only now do you write out your byte to the file. You are left with 011. You clear your byte buffer of bits since you wrote it out, and put in the remaining 011 from your last code. Your byte buffer now has three bits in it, and five bits unused. Continue in this manner.
The buffer does not have to be one byte. 16-bit or 32-bit buffers are common and are more efficient. You write out bytes whenever the bits therein are eight or more, and shift the remaining 0-7 bits to the start of the buffer.
The only tricky part is what to do at the end, since you may have unused bits in your last byte. Your Huffman codes should have an end symbol to mark the end of the stream. Then you know when you should stop looking for more Huffman codes. If you do not have an end code, then you need to assure somehow that either the remaining bits in the byte cannot be a complete Huffman code, or you need to indicate in some other way where the stream of bits end.

how write method in FileOutputStream truncates the int type argument

write() method in FileOutputStream takes an int but truncates the first 3 bytes and writes the byte to stream.
If a file contains characters whose ASCII value is more than 127 and bytes are read from it and then written to output stream(another text file) how will it display it because in Java bytes can have a max value of +127.
If a text file(input.text) has character '›' whose ASCII value is 155.
An input stream,input, reads from it :
int in= new FileInputStream("input.txt").read();//in = 155
Now it writes to another text file(output.txt)
new FileOutputStream("output.txt").write(in);
Here integer "in" is truncated to byte which will have corresponding decimal value : -101.
How it successfully manages to write the character to file even though information about it seems to have been lost?
Just now i went through the description of write(int) method in java docs and what i observed was
The general contract for write is that one byte is written to the output stream. The byte to be written is the eight low-order bits of the argument b. The 24 high-order bits of b are ignored.
So i believe that contrary to what i thought earlier(the int in write() is truncated as would happen while downcasting an integer to byte for values greater than 127) the 24 high order bits are only ignored and only 8 least significant bits are considered.
No truncation and conversion to byte occurs.
I guess i am correct.
I think your confusion is caused by the fact that specs for character sets typically take the view that bytes are unsigned, while Java treats bytes as signed.
In fact 155 as an unsigned byte is -101 as a signed byte. (256 - 101 == 155). The bit patterns are identical. It is just a matter of whether you think of them as signed or unsigned.
How the truncation is coded is implementation specific. But there is no loss of information ... assuming that you had an 8-bit code in the first place.

Categories

Resources