I am trying to convert unsigned bytes from a file to signed bytes in Java. This is the current arrangement I have for reading unsigned bytes from a file in Java:
ByteArrayOutputStream output = new ByteArrayOutputStream();
for (String string : fileKeyString) {
output.write(Integer.valueOf(string).byteValue());
}
return output.toByteArray();
Note: I have to use Java 8 and fileKeyString is a String Array that gets created when reading from a file. The variable string holds the unsigned byte. It outputs a byte array which is required.
How would I exactly convert this from an unsigned byte to signed bytes before it gets placed into output.write and evaluated by .byteValue()?
I don't have much experience with bytes, so any help is appreciated.
Thank you.
Found it! You just subtract 256 if the value is 128 or greater (value is an int here, since byte is a reserved word in Java):
if (value >= 128) { value -= 256; }
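A minimal Java 8 sketch of the rule above, assuming each string holds a decimal value in the range 0..255. The class and method names (`SignedByteDemo`, `toSignedManual`, `toSignedByCast`) are hypothetical and only serve the illustration. It compares the manual subtraction with the narrowing conversion that `byteValue()` performs.

```java
final class SignedByteDemo {
    // Manual rule: values 128..255 map to -128..-1.
    static byte toSignedManual(String unsignedText) {
        int value = Integer.parseInt(unsignedText);
        if (value < 0 || value > 255) {
            throw new IllegalArgumentException("not an unsigned byte: " + value);
        }
        if (value >= 128) {
            value -= 256;
        }
        return (byte) value;
    }

    // byteValue() keeps only the low 8 bits of the int, which yields the same byte.
    static byte toSignedByCast(String unsignedText) {
        return Integer.valueOf(unsignedText).byteValue();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"0", "127", "128", "200", "255"}) {
            System.out.println(s + " -> " + toSignedManual(s) + " / " + toSignedByCast(s));
        }
    }
}
```

Both methods return the same byte for every input in 0..255, which suggests the explicit subtraction is optional when `byteValue()` is already applied. To go back from a signed byte to 0..255, Java 8 offers `Byte.toUnsignedInt(b)`.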
You seem to misunderstand how computers work.
A byte is what it is. Just 01001100 on disk or in memory. What does 01001100 mean? Is that signed or unsigned? The byte doesn't know. Bytes just are 8 bits, that's it. That's all they ever are. It's things that interact with the byte that decide how one is to read it. Is that signed? The byte has no idea - the software (or the human eyeballs) that look at it decide whether it is or not.
Let's make it more interesting and work with the byte 10000000.
What is that? The byte has no idea. Perhaps you have some software that reads this byte and shows the value of it on screen.
Depending on which software you use, you might see any of the following and they are all equally correct:
128 (interpretation: It's an unsigned byte, show it in decimal)
-128 (interpretation: It's a 2s complement signed byte, show in decimal)
80 (interpretation: Show it in hexadecimal, unsigned)
-80 (interpretation: Show in hex, signed)
� (interpretation: It's a Unicode character. Code point 128 (U+0080) in the Unicode table is a control character, and not really a character per se).
-127 (interpretation: It's a 1s complement signed byte, show in decimal)
Nothing appears on screen, instead, the dulcet tones of Unchained Melody blast out of the speaker (interpretation: It's an id of a song, and Unchained Melody's ID is bit sequence 10000000).
Given a file containing just 1 byte, with bit sequence 10000000 (a file is just a sequence of bytes, no metadata), you have no idea which of the above interpretations is correct. In that sense they are ALL correct. I can make you a file which, if you name it 'foo.zip' and unzip it, produces 1 file with the collected works of Shakespeare in plain text inside. If you rename the .zip to .png and open it, you see the Mona Lisa. Same bytes in either case - it's the app that reads them that causes those exact same bytes to mean something completely different.
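As a rough illustration of the list above, the following Java sketch prints several readings of the single byte 10000000. The class `ByteInterpretations` and its helper names are made up for this example; only standard JDK calls are used.

```java
final class ByteInterpretations {
    // Java's own view: a two's complement signed byte.
    static int asSigned(byte b) {
        return b;
    }

    // Mask off the sign extension to read the same bits as 0..255.
    static int asUnsigned(byte b) {
        return b & 0xFF;
    }

    // Same bits, shown as two hexadecimal digits.
    static String asHex(byte b) {
        return String.format("%02X", b & 0xFF);
    }

    // Same bits, shown as eight binary digits.
    static String asBits(byte b) {
        return String.format("%8s", Integer.toBinaryString(b & 0xFF)).replace(' ', '0');
    }

    public static void main(String[] args) {
        byte b = (byte) 0x80;
        System.out.println(asBits(b) + " signed=" + asSigned(b)
                + " unsigned=" + asUnsigned(b) + " hex=" + asHex(b));
    }
}
```

The byte never changes between the calls; only the code that reads it does.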
The exact same principle (it's not the byte itself, it's the software or human eyeballs that decide what it means) applies in reverse as well: If I want to 'write' Unchained Melody to disk, it's the software that decides how to do it.
With that in mind, therefore:
How would I exactly convert this from an unsigned byte to signed bytes before it gets placed into output.write and evaluated by .byteValue()?
That question makes no sense. If I have the number -128 and I want to write it to disk, presumably you just write the bit sequence 10000000 to disk and, yup, that doesn't mean anything unless the user of the computer opens that file again with your app. Or any other app that knows that it is to be interpreted as a signed 2's complement byte.
The code you have already writes 1 byte to disk whose bit sequence is 10000000; you're already doing it, your code is fine as is.
If you are opening it with something and that says 'this file contains +128', and you want that to say '-128' instead, there is nothing you can change in your file writing code. Instead, you need to find different software to open it, or configure that software differently.
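To make the point concrete, here is a small sketch showing that writing 128 and writing -128 through `ByteArrayOutputStream.write(int)` store the same byte. The class name `SameBitsDemo` and the helper `writeOne` are assumptions for illustration, and an in-memory stream stands in for the file.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

final class SameBitsDemo {
    // Writes one value and returns what actually ended up in the stream.
    static byte[] writeOne(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(value); // only the low 8 bits of the int are stored
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] fromUnsigned = writeOne(128);
        byte[] fromSigned = writeOne(-128);
        System.out.println(Arrays.equals(fromUnsigned, fromSigned));
    }
}
```

Whether the stored byte is later shown as 128 or -128 is decided by whatever reads it back.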
Related
I'm learning about text I/O and binary I/O in Java right now. I read that each value that you write to a file is initially stored in binary. For text I/O, the individual digits are converted to their corresponding Unicode values and then encoded to the file-specific encoding, such as ASCII. For binary I/O, the binary value is directly represented in the file. For example, 199 would be represented as 0xC7, which in binary is 11000111. Now I'm confused about one part. If a variable is initially stored in a binary format, does each digit represent a separate byte that is stored, or is the entirety of the number stored as a single byte? For example, is 199 originally stored as 0xC7, which would be 11000111 in binary? Or would it be stored in 3 bytes, with each byte representing the binary value for one digit? If it was stored in 3 separate bytes, does binary I/O convert that 3-byte number to a single byte? If it's stored in a single byte, how does text I/O translate that single byte into 3 separate byte values? I'm not sure how to word this. Hope you can understand what I'm getting at. Thanks
The only things a computer is capable of dealing with are sets of 0/1 bits, which are stored in memory or, if you wish, on a storage device. Those bits can be streamed to monitors and converted to characters by graphics hardware. Same story with keyboards: you press a key and a few bits of data will be sent to the computer.
Bits are stored in memory and are accessible by memory addresses. The addresses are also sets of bits.
For practical reasons the bits are grouped into bytes, words, long words, ... A byte used to be the smallest addressable unit of bits and historically ended up as a group of 8 bits, which is currently used in most of the hardware. Modern memory can store data in multiple byte addressable chunks. Same for the disk, you store data there, using specific addressing mechanisms. But in any case those are just sets of bits.
What you are confused about is the interpretation of those bits. They can represent integer numbers, floating point numbers, characters, addresses, ... The way they are interpreted only depends on the program which uses them.
Characters do not exist in the computer. They are just an abstraction which is provided by programming languages. The programs interpret the bits stored on the computer. There are standards. For example, the ASCII encoding maps English characters plus a few special characters into numbers from 0 to 127. Those fit into a single byte (leaving numbers 128 to 255 for special use). A print command will read those bytes one by one and send them to graphics to form letters on the screen as specified in the encoding standard. A different encoding scheme will display the same bytes differently.
If you write a program with the "hello world" string in it, the program will convert the symbols between the quotes into a set of 11 ASCII bytes. (In C it will add yet another byte, which is equal to 0 and ends the string this way.) Unicode is yet another way to represent characters. Depending on the encoding (UTF-8, UTF-16, UTF-32), a Unicode character is represented by one or more bytes of data. There are other schemes as well. One thing to pay attention to: if you write strings to disk using a certain encoding, you should read them with the same encoding, or your prints will give you garbage. But you can always read and copy them as binary data without interpretation.
So, any variable of any type is just an abstraction and always consists of bytes of data which your program knows how to interpret based on the data type and/or operations it wants to perform. Variables of type int, double, any java object, including String, are just sets of bytes of different sizes. Only the program (and java interpreter is a program) knows what to do with them, use them in calculations or display as characters.
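Applied to the 199 example from the question, the difference between text and binary output can be sketched like this. The class `TextVsBinary` and its two helpers are hypothetical; the binary helper assumes the value fits in one byte (0..255), whereas a Java `int` in memory occupies four bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class TextVsBinary {
    // Text I/O: each decimal digit becomes one encoded character.
    static byte[] asText(int value) {
        return Integer.toString(value).getBytes(StandardCharsets.US_ASCII);
    }

    // Binary I/O: the value itself is stored (assumes 0..255, so one byte is enough).
    static byte[] asBinary(int value) {
        return new byte[] {(byte) value};
    }

    public static void main(String[] args) {
        // Java prints bytes as signed values, so 0xC7 shows up as a negative number.
        System.out.println("text:   " + Arrays.toString(asText(199)));
        System.out.println("binary: " + Arrays.toString(asBinary(199)));
    }
}
```

The text form holds three bytes, the ASCII codes of '1', '9' and '9' (0x31 0x39 0x39), while the binary form holds the single byte 0xC7.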
So we can talk about the endianness of both the bit and byte order.
When I read the next byte from FileInputStream, for example, I practically get an 8-bit signed integer, but I have no idea what is the bit order with which Java calculates the byte's integer value. Which comes first, the most significant or the least significant bit?
(sign bit, 2^6 ..... 2^0)
Or...
(2^0, ..... 2^6, sign bit)
Endianness only really applies when a unit is broken down into other units. So if you were transmitting a byte over a bit stream, you could observe whether the least significant bit was transmitted first or last. And at that point we could say that the stream was little-endian or big-endian.
But within a byte-addressable machine, i.e., where the byte is the smallest unit of storage, there is no "endianness" within the byte. No bit of the byte is "before" any other bit of the byte.
Note that another term for endianness is "byte order". The order of bytes within larger entities.
It is true we like to number bits (0 to 7, for an 8-bit byte) so we can talk about them, but this really does not define endianness, even though the numbering is often chosen to match the byte order of the machine; this is convention.
With respect to FileInputStream - according to its documentation, that transfers bytes: no part of the byte is sent before any other part, at least not as far as FileInputStream is concerned. If the byte has to be sent bitwise over some interconnect (say, a SATA cable), then the decision about which bit goes first is a matter for the hardware. The higher layer code is dealing in bytes (or even blocks).
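A short sketch of the distinction, using `java.nio.ByteBuffer` (the class `ByteOrderDemo` and its helpers are hypothetical): byte order is observable when an `int` is split into bytes, while a bit inside a byte is only reachable by its numeric weight, not by a position in memory.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

final class ByteOrderDemo {
    // Splits an int into 4 bytes in the requested byte order.
    static byte[] encode(int value, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(value).array();
    }

    // Returns the bit with weight 2^index; this is arithmetic, not an address.
    static int bit(byte b, int index) {
        return (b >> index) & 1;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode(0x01020304, ByteOrder.BIG_ENDIAN)));
        System.out.println(Arrays.toString(encode(0x01020304, ByteOrder.LITTLE_ENDIAN)));
        System.out.println(bit((byte) 0x80, 7));
    }
}
```

The two `encode` calls give the same four bytes in opposite orders, whereas `bit` gives the same answer on any machine.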
In an int, the first (most significant) bit is the sign, the rest is the value, and the last bit is the least significant bit.
Alright so I am trying to do a file compress using the Huffman tree.
We got the tree that is working just fine but we are unable to figure out how to write the binary string we get into the file.
So for example our tree returns: '110', it should mean this byte: '00000110' right?
And if it returns: '11111111 11111110', what should that mean? Should we just write it out as bytes?
So the question is: how do we convert the binary string we get into bytes so we can write it to the file?
Thanks a lot,
Ara
So for example our tree returns: '110', it should mean this byte: '00000110' right?
Wrong. You should have a byte buffer of bits into which you write your bits. Write the three bits 110 into the byte. (You will need to decide on a convention for bit ordering in the byte.) You still have five unused bits in the byte, so there it sits. Now you write 10 into the buffer. The byte buffer now has 11010, and three unused bits. So still it sits. Now you try to write 111011 into the byte buffer. The first three bits go into the byte buffer, giving you 11010111. You now have filled the buffer, so only now do you write out your byte to the file. You are left with 011. You clear your byte buffer of bits since you wrote it out, and put in the remaining 011 from your last code. Your byte buffer now has three bits in it, and five bits unused. Continue in this manner.
The buffer does not have to be one byte. 16-bit or 32-bit buffers are common and are more efficient. You write out bytes whenever the bits therein are eight or more, and shift the remaining 0-7 bits to the start of the buffer.
The only tricky part is what to do at the end, since you may have unused bits in your last byte. Your Huffman codes should have an end symbol to mark the end of the stream. Then you know when you should stop looking for more Huffman codes. If you do not have an end code, then you need to ensure somehow that either the remaining bits in the byte cannot be a complete Huffman code, or you need to indicate in some other way where the stream of bits ends.
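The buffering scheme described above could be sketched in Java roughly as follows. This is one possible shape, not the only one: the class `BitWriter`, its methods, and the convention of filling each byte from the most significant bit down are assumptions made for the example, and the last partial byte is simply padded with zero bits (so a real format would still need an end symbol or a bit count).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

final class BitWriter {
    private final OutputStream out;
    private int buffer; // bits collected so far, held in the low `count` bits
    private int count;  // number of valid bits in buffer (0..7)

    BitWriter(OutputStream out) {
        this.out = out;
    }

    // Appends a code given as a string of '0'/'1' characters, first character first.
    void write(String code) throws IOException {
        for (int i = 0; i < code.length(); i++) {
            char c = code.charAt(i);
            if (c != '0' && c != '1') {
                throw new IllegalArgumentException("not a bit: " + c);
            }
            buffer = (buffer << 1) | (c - '0');
            count++;
            if (count == 8) { // buffer is full: emit one byte and start over
                out.write(buffer);
                buffer = 0;
                count = 0;
            }
        }
    }

    // Pads the final partial byte with zero bits on the right and writes it.
    void finish() throws IOException {
        if (count > 0) {
            out.write(buffer << (8 - count));
            buffer = 0;
            count = 0;
        }
        out.flush();
    }

    // Convenience helper: packs the given codes into a byte array.
    static byte[] encode(String... codes) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        BitWriter writer = new BitWriter(bytes);
        try {
            for (String code : codes) {
                writer.write(code);
            }
            writer.finish();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

With the codes from the walkthrough, `BitWriter.encode("110", "10", "111011")` produces the full byte 11010111 followed by 01100000, where the leftover 011 sits at the start of the second byte and the rest is padding.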
write() method in FileOutputStream takes an int but truncates the first 3 bytes and writes the byte to stream.
If a file contains characters whose code is more than 127, and bytes are read from it and then written to an output stream (another text file), how will it display them, given that in Java bytes can have a max value of +127?
Say a text file (input.txt) has the character '›', whose code is 155.
An input stream, input, reads from it:
int in= new FileInputStream("input.txt").read();//in = 155
Now it writes to another text file(output.txt)
new FileOutputStream("output.txt").write(in);
Here the integer "in" is truncated to a byte, which will have the corresponding decimal value -101.
How does it successfully manage to write the character to the file even though information about it seems to have been lost?
Just now I went through the description of the write(int) method in the Java docs, and what I observed was:
The general contract for write is that one byte is written to the output stream. The byte to be written is the eight low-order bits of the argument b. The 24 high-order bits of b are ignored.
So I believe that, contrary to what I thought earlier (that the int in write() is truncated as would happen while downcasting an integer to byte for values greater than 127), the 24 high-order bits are only ignored and only the 8 least significant bits are considered.
No truncation and conversion to byte occurs.
I guess I am correct.
I think your confusion is caused by the fact that specs for character sets typically take the view that bytes are unsigned, while Java treats bytes as signed.
In fact 155 as an unsigned byte is -101 as a signed byte. (256 - 101 == 155). The bit patterns are identical. It is just a matter of whether you think of them as signed or unsigned.
How the truncation is coded is implementation specific. But there is no loss of information ... assuming that you had an 8-bit code in the first place.
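The claim that nothing is lost can be checked with a small round trip. In this sketch in-memory streams stand in for the two files, and the class `RoundTripDemo` is a made-up name for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

final class RoundTripDemo {
    // Returns the byte that write(int) actually stores for the given value.
    static byte storedByte(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(value); // low 8 bits only
        return out.toByteArray()[0];
    }

    // Writes the value, then reads it back the way InputStream.read() reports it.
    static int roundTrip(int value) {
        byte stored = storedByte(value);
        return new ByteArrayInputStream(new byte[] {stored}).read();
    }

    public static void main(String[] args) {
        System.out.println(storedByte(155)); // Java shows the byte as signed
        System.out.println(roundTrip(155));  // read() reports it as 0..255
    }
}
```

The stored byte prints as -101 because Java treats `byte` as signed, yet `read()` returns 155 again, since both are the same bit pattern 10011011.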
I'm aware that this is probably not the best idea but I've been playing around trying to read a file in PHP that was encoded using Java's DataOutputStream.
Specifically, in Java I use:
dataOutputStream.writeInt(number);
Then in PHP I read the file using:
$data = fread($handle, 4);
$number = unpack('N', $data);
The strange thing is that the only format character in PHP that gives the correct value is 'N', which is supposed to represent "unsigned long (always 32 bit, big endian byte order)". I thought that int in java was always signed?
Is it possible to reliably read data encoded in Java in this way or not? In this case the integer will only ever need to be positive. It may also need to be quite large so writeShort() is not possible. Otherwise of course I could use XML or JSON or something.
This is fine, as long as you don't need that extra bit. l (instead of N) would work on a big endian machine.
Note, however, that the maximum number that you can store is 2,147,483,647 unless you want to do some math on the Java side to get the proper negative integer to represent the desired unsigned integer.
Note that a signed Java integer uses the two's complement method to represent a negative number, so it's not as easy as flipping a bit.
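The "math on the Java side" mentioned above can be as small as a narrowing cast, sketched here under the assumption that the unsigned value is held in a `long`. The class `UnsignedIntDemo` and its method names are hypothetical; `Integer.toUnsignedLong` is available from Java 8.

```java
final class UnsignedIntDemo {
    // Packs an unsigned 32-bit value (0..4294967295) into the bits of a Java int.
    static int toIntBits(long unsignedValue) {
        if (unsignedValue < 0 || unsignedValue > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("out of unsigned 32-bit range: " + unsignedValue);
        }
        return (int) unsignedValue; // keeps the low 32 bits unchanged
    }

    // Reads the same 32 bits back as an unsigned value.
    static long fromIntBits(int bits) {
        return Integer.toUnsignedLong(bits);
    }

    public static void main(String[] args) {
        int bits = toIntBits(3_000_000_000L);
        System.out.println(bits);              // negative when viewed as a signed int
        System.out.println(fromIntBits(bits)); // the original unsigned value
    }
}
```

Passing the result of `toIntBits` to `writeInt` would put the same four bytes on disk that an unsigned 32-bit big-endian reader expects.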
DataOutputStream.writeInt:
Writes an int to the underlying output stream as four bytes, high byte first.
The formats available for the unpack function for signed integers all use machine dependent byte order. My guess is that your machine uses a different byte order than Java. If that is true, the DataOutputStream + unpack combination will not work for any signed primitive.
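To see the layout the documentation describes, this sketch writes an `int` with `DataOutputStream.writeInt` into an in-memory stream and returns the resulting bytes. The class `WriteIntLayout` is a hypothetical name, and the in-memory stream stands in for the real file.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class WriteIntLayout {
    // Returns the four bytes that writeInt emits for the given number.
    static byte[] writeInt(int number) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(number);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(writeInt(0x01020304)));
        System.out.println(Arrays.toString(writeInt(-2)));
    }
}
```

The most significant byte always comes first, regardless of the machine, so a reader needs a format that is explicitly big-endian rather than one that follows the host's byte order.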