So we can talk about the endianness of both the bit and byte order.
When I read the next byte from FileInputStream, for example, I practically get an 8-bit signed integer, but I have no idea what is the bit order with which Java calculates the byte's integer value. Which comes first, the most significant or the least significant bit?
(sign bit, 2^6 ..... 2^0)
Or...
(2^0, ..... 2^6, sign bit)
Endianness only really applies when a unit is broken down into other units. So if you were transmitting a byte over a bit stream, you could observe whether the least significant bit was transmitted first or last. And at that point we could say that the stream was little-endian or big-endian.
But within a byte-addressable machine, i.e., where the byte is the smallest unit of storage, there is no "endianness" within the byte. No bit of the byte is "before" any other bit of the byte.
Note that another term for endianness is "byte order". The order of bytes within larger entities.
It is true we like to number bits (0 to 7, for an 8-bit byte) so we can talk about them, but this really does not define endianness, even though the numbering is often chosen to match the byte order of the machine; this is convention.
With respect to FileInputStream - according to its documentation, that transfers bytes: no part of the byte is sent before any other part, at least not as far as FileInputStream is concerned. If the byte has to be sent bitwise over some interconnect (say, a SATA cable), then the decision about which bit goes first is a matter for the hardware. The higher layer code is dealing in bytes (or even blocks).
in int first bit is the sign, the rest is the value, the last bit is the least significant bit.
Related
I am trying to convert unsigned bytes from a file to signed bytes in java. This is the current arrangement I have for reading unsigned bytes from a file in java:
ByteArrayOutputStream output = new ByteArrayOutputStream();
for (String string : fileKeyString) {
output.write(Integer.valueOf(string).byteValue());
}
return output.toByteArray();
Note: I have to use Java 8 and fileKeyString is a String Array that gets created when reading from a file. The variable string holds the unsigned byte. It outputs a byte array which is required.
How would I exactly convert this from an unsigned byte to signed bytes before it gets placed into output.write and evaluated by .byteValue()?
I dont have too much expereience with bytes so any help is appreciated.
Thankyou.
Found it! You just subtract by 256 if it exceeds by 128.
if (byte >= 128) { byte -= 256; }
You seem to misunderstand how computers work.
A byte is what it is. Just 01001100 on disk or in memory. What does 01001100 mean? Is that signed or unsigned? The byte doesn't know. Bytes just are 8 bits, that's it. That's all they ever are. It's things that interact with the byte that decide how one is to read it. Is that signed? The byte has no idea - the software (or the human eyeballs) that look at it decide whether it is or not.
Let's make it more interesting and work with the byte 10000000.
What is that? The byte has no idea. Perhaps you have some software that reads this byte and shows the value of it on screen.
Depending on which software you use, you might see any of the following and they are all equally correct:
128 (interpretation: It's an unsigned byte, show it in decimal)
-128 (interpretation: It's a 2s complement signed byte, show in decimal)
80 (interpretation: Show it in hexadecimal, unsigned)
-80 (interpretation: Show in hex, signed)
� (interpretation: It's a unicode character. The 128th item in the unicode table is 'control', and not really a character perse).
-127 (interpretation: It's a 1s complement signed byte, show in decimal)
Nothing appears on screen, instead, the dulcet tones of Unchained Melody blast out of the speaker (interpretation: It's an id of a song, and Unchained Melody's ID is bit sequence 10000000).
Given a file containing just 1 byte, with bitsequence 10000000 (which is just a sequence of bytes, no metadata), you have no idea which of the above interpretation is correct. In that sense they are ALL correct. I can make you a file which, if you name it 'foo.zip' and unzip it produces 1 file with the collected works of shakespeare in plain text inside. If you rename the .zip to .png, and open it, you see the mona lisa. Same bytes in either case - it's the app that reads them that causes those exact same bytes to mean something completely different.
The exact same principle (it's not the byte itself, it's the software or human eyeballs that decide what it means) applies in reverse as well: If I want to 'write' Unchained Melody to disk, it's the software that decides how to do it.
With that in mind, therefore:
How would I exactly convert this from an unsigned byte to signed bytes before it gets placed into output.write and evaluated by .byteValue()?
That question makes no sense. If I have the number -128 and I want to write it to disk, presumably you just write the bit sequence 10000000 to disk and, yup, that doesn't mean anything unless the user of the computer opens that file again with your app. Or any other app that knows that it is to be interpreted as a signed 2's complement byte.
The code you have already writes 1 byte to disk whose bit sequence is 10000000; you're already doing it, your code is fine as is.
If you are opening it with something and that says 'this file contains +128', and you want that to say '-128' instead, there is nothing you can change in your file writing code. Instead, you need to find different software to open it, or configure that software differently.
I'm learning about Text I/O and Binary I/O in java right now. I read that each value that you write to a file is initially stored in binary. For text I/O, the individual digits are converted to it's corresponding Unicode values and then encoded to the file-specific encoding such as ASCII. For binary I/O, the binary value is directly represented in the file. For example, 199 would be represented as 0xC7 which in binary is 11000111. Now I'm confused on one part. If a variable is initially stored as a binary format, does each digit represent a separate byte that is stored or is the entirety of the number stored as a single byte. For example, is 199 originally stored as 0xc7 which would be 11000111 in binary? Or would it be stored in 3 bytes with each byte representing the binary value for the digit. If it was stored in 3 separate bytes, does binary I/O convert that 3 byte number to a single byte? If it's stored in a single byte, how does text I/O translate that single byte into 3 separate byte values. I'm just confused on how to word this. Hope you can understand what I'm getting at. Thanks
The only thing which a computer is capable of dealing with are sets of 0/1 bits which are stored in memory or, if you wish on a storage device. Those bits can be streamed to monitors and converted to characters by graphical hardware. Sams story with keyboards, you type a key and a few bits of data will be send to the computer.
Bits are stored in memory and are accessible by memory addresses. The addresses are also sets of bits.
For practical reasons the bits are grouped into bytes, words, long words, ... A byte used to be the smallest addressable unit of bits and historically ended up as a group of 8 bits, which is currently used in most of the hardware. Modern memory can store data in multiple byte addressable chunks. Same for the disk, you store data there, using specific addressing mechanisms. But in any case those are just sets of bits.
What you are confused about is the interpretation of those bits. They can represent integer numbers, floating point numbers, characters, addresses, ... The way they are interpreted only depends on the program which uses them.
Characters do not exist in the computer. They are just an abstraction which is provided by programming languages. The programs interpret the bits stored on the computer. There are standards. For example the ASCII encoding maps English characters plus a few special characters into numbers from 0 to 127. Those fit into a single byte (leaving number 128 to 255 for special use). A print command will read those bytes one by one and send them to graphics to form letters on the screen as specified in the encoding standard. Different encoding scheme will display the same bytes differently.
If you write a program wit the "hello world" sting in it, the program will convert the symbols between quotes into a set of 11 ascii bytes. (In 'c' it will add yet another byte which is equal to '0' and ends the string this way). Unicode is yet another way to represent characters. Every unicode character is represented by multiple bytes of data. There are other schemes as well. One thing to pay attention to. If you write strings on the disk using certain encoding, you should read them with the same encoding, or your prints will give you garbage. But you can always read and copy then as binary data without interpretation.
So, any variable of any type is just an abstraction and always consists of bytes of data which your program knows how to interpret based on the data type and/or operations it wants to perform. Variables of type int, double, any java object, including String, are just sets of bytes of different sizes. Only the program (and java interpreter is a program) knows what to do with them, use them in calculations or display as characters.
I'm confused about LEB128 or Little Endian Base 128 format. In the AOSP source code Leb128.java, its read function's return type whether signed or unsigned is int. I know the the size of int in java is 4 bytes aka 32bits. But the max length of LEB128 in AOSP is 5 bytes aka 35 bits. So where are the other lost 3bits.
Thanks for your reply.
Each byte of data in LEB only accounts for 7 bits in the actual output - the remaining bit is used to indicate whether or not it's the end.
From Wikipedia:
To encode an unsigned number using unsigned LEB128 first represent the number in binary. Then zero extend the number up to a multiple of 7 bits (such that the most significant 7 bits are not all 0). Break the number up into groups of 7 bits. Output one encoded byte for each 7 bit group, from least significant to most significant group.
The extra bits aren't so much "lost" as "used to indicate whether or not it's the end of the data".
You can't hope to encode arbitrary 32-bit values and some of them taking less than 4 bytes without some of them taking more than 4 bytes.
Alright so I am trying to do a file compress using the Huffman tree.
We got the tree that is working just fine but we are unable to figure out how to write the binary string we get into the file.
So for example our tree returns: '110', it should mean this byte: '00000110' right?
And if the returns: '11111111 11111110' it should mean what? Should we just write it in in byte?
So the question is how do we convert the binary string we get into bytes so we can write it on the file?
Thanks alot,
Ara
So for example our tree returns: '110', it should mean this byte:
'00000110' right?
Wrong. You should have a byte buffer of bits into which you write your bits. Write the three bits 110 into the byte. (You will need to decide on a convention for bit ordering in the byte.) You still have five unused bits in the byte, so there it sits. Now you write 10 into the buffer. The byte buffer now has 11010, and three unused bits. So still it sits. Now you try to write 111011 into the byte buffer. The first three bits go into the byte buffer, giving you 11010111. You now have filled the buffer, so only now do you write out your byte to the file. You are left with 011. You clear your byte buffer of bits since you wrote it out, and put in the remaining 011 from your last code. Your byte buffer now has three bits in it, and five bits unused. Continue in this manner.
The buffer does not have to be one byte. 16-bit or 32-bit buffers are common and are more efficient. You write out bytes whenever the bits therein are eight or more, and shift the remaining 0-7 bits to the start of the buffer.
The only tricky part is what to do at the end, since you may have unused bits in your last byte. Your Huffman codes should have an end symbol to mark the end of the stream. Then you know when you should stop looking for more Huffman codes. If you do not have an end code, then you need to assure somehow that either the remaining bits in the byte cannot be a complete Huffman code, or you need to indicate in some other way where the stream of bits end.
This isn't a question as much as it's a sanity check!
If you needed to read 4 bytes into Java as a bitmask in Big endian and those bytes were:
0x00, 0x01, 0xB6, 0x02.
Making that into an int would be: 112130
The binary would be: 00000000000000011010011000000010
The endian of a series of bytes wouldn't affect the bit position, would it?
Thanks
Tony
Endian-ness reflects the ordering of bytes, but not the ordering of the bits within those bytes.
Let's say I want to represent the (two-byte) word 0x9001.
If I just type this out in binary, that would be 1001000000000001.
If I dump the bytes (from lower address to higher) on a big-endian machine, I would see 10010000 00000001.
If I dump the bytes (from lower address to higher) on a little-endian machine, I would see 00000001 10010000.
In general, if the thing you're reading from is giving you whole bytes, then you don't need to worry about the order of bits making up those bytes: it is just the order of the bytes that matters, as you correctly suppose.
The time you might have to worry about the "endianness" of individual bits is where you're actually reading/writing a stream of bits rather than whole bytes (e.g. if you were writing a compression algorithm that operated at the bit level, you'd have to make a decision about what order to write the bits in).
The only thing you have to pay attention is how exactly you "read 4 bytes into Java" - that's where endianness matters and you can mess it up (DataInputStream assumes big endian). Once the value you've read has become the int 112130, you're set.