Alright so I am trying to do a file compress using the Huffman tree.
We got the tree that is working just fine but we are unable to figure out how to write the binary string we get into the file.
So for example our tree returns: '110', it should mean this byte: '00000110' right?
And if the returns: '11111111 11111110' it should mean what? Should we just write it in in byte?
So the question is how do we convert the binary string we get into bytes so we can write it on the file?
Thanks alot,
Ara
So for example our tree returns: '110', it should mean this byte:
'00000110' right?
Wrong. You should have a byte buffer of bits into which you write your bits. Write the three bits 110 into the byte. (You will need to decide on a convention for bit ordering in the byte.) You still have five unused bits in the byte, so there it sits. Now you write 10 into the buffer. The byte buffer now has 11010, and three unused bits. So still it sits. Now you try to write 111011 into the byte buffer. The first three bits go into the byte buffer, giving you 11010111. You now have filled the buffer, so only now do you write out your byte to the file. You are left with 011. You clear your byte buffer of bits since you wrote it out, and put in the remaining 011 from your last code. Your byte buffer now has three bits in it, and five bits unused. Continue in this manner.
The buffer does not have to be one byte. 16-bit or 32-bit buffers are common and are more efficient. You write out bytes whenever the bits therein are eight or more, and shift the remaining 0-7 bits to the start of the buffer.
The only tricky part is what to do at the end, since you may have unused bits in your last byte. Your Huffman codes should have an end symbol to mark the end of the stream. Then you know when you should stop looking for more Huffman codes. If you do not have an end code, then you need to assure somehow that either the remaining bits in the byte cannot be a complete Huffman code, or you need to indicate in some other way where the stream of bits end.
Related
I have a sample .txt file that I want to compress using Huffman encoding. My problem is that if one character has a size of one byte and the smallest size you can write is a byte, how do I reduce the size of the sample file?
I converted the sample file into Huffman codes and wrote it to a new empty .txt file which just consists of 0s and 1s as one huge line of characters. Then I took the new file and used the BitSet class in Java to write to a binary file bit by bit. If the character was 0 or 1 in the new file, I wrote 0 or 1 respectively to the binary file. This process was very slow and it crashed my computer multiple times, I was hoping that someone had a more efficient solution. I have written all my code in Java.
Do not write "0" and "1" characters to the file. Write 0 and 1 bits to the file.
You do this by accumulating eight bits into a byte buffer using the shift (<<) and or (|) operators, and then writing that byte to the file. Repeat. At the end you may have less than eight bits in the byte buffer. If so, write that byte to the file, which will have the remaining bits filled with zeros.
E.g. int buf = 0, count = 0;, for each bit: buf |= bit << count++;, check for eight: if (count == 8) { out.writeByte(buf); buf = count = 0; }. At the end, if (count > 0) out.writeByte(buf);.
When decoding the Huffman codes, you may run into a problem with those filler zero bits in the last byte. They could be decoded as an extraneous symbol or symbols. In order to deal with this you will need for the decoder to know when to stop, by either sending the number of symbols before the Huffman codes, or by adding a symbol for end-of-stream.
One way is to use BitSet to set the bits that represent the code as you compute it. Then you can do either BitSet.toByteArray() or BitSet.toLongArray() and write out the information. Both of these store the bits in little endian encoding.
I am trying to convert unsigned bytes from a file to signed bytes in java. This is the current arrangement I have for reading unsigned bytes from a file in java:
ByteArrayOutputStream output = new ByteArrayOutputStream();
for (String string : fileKeyString) {
output.write(Integer.valueOf(string).byteValue());
}
return output.toByteArray();
Note: I have to use Java 8 and fileKeyString is a String Array that gets created when reading from a file. The variable string holds the unsigned byte. It outputs a byte array which is required.
How would I exactly convert this from an unsigned byte to signed bytes before it gets placed into output.write and evaluated by .byteValue()?
I dont have too much expereience with bytes so any help is appreciated.
Thankyou.
Found it! You just subtract by 256 if it exceeds by 128.
if (byte >= 128) { byte -= 256; }
You seem to misunderstand how computers work.
A byte is what it is. Just 01001100 on disk or in memory. What does 01001100 mean? Is that signed or unsigned? The byte doesn't know. Bytes just are 8 bits, that's it. That's all they ever are. It's things that interact with the byte that decide how one is to read it. Is that signed? The byte has no idea - the software (or the human eyeballs) that look at it decide whether it is or not.
Let's make it more interesting and work with the byte 10000000.
What is that? The byte has no idea. Perhaps you have some software that reads this byte and shows the value of it on screen.
Depending on which software you use, you might see any of the following and they are all equally correct:
128 (interpretation: It's an unsigned byte, show it in decimal)
-128 (interpretation: It's a 2s complement signed byte, show in decimal)
80 (interpretation: Show it in hexadecimal, unsigned)
-80 (interpretation: Show in hex, signed)
� (interpretation: It's a unicode character. The 128th item in the unicode table is 'control', and not really a character perse).
-127 (interpretation: It's a 1s complement signed byte, show in decimal)
Nothing appears on screen, instead, the dulcet tones of Unchained Melody blast out of the speaker (interpretation: It's an id of a song, and Unchained Melody's ID is bit sequence 10000000).
Given a file containing just 1 byte, with bitsequence 10000000 (which is just a sequence of bytes, no metadata), you have no idea which of the above interpretation is correct. In that sense they are ALL correct. I can make you a file which, if you name it 'foo.zip' and unzip it produces 1 file with the collected works of shakespeare in plain text inside. If you rename the .zip to .png, and open it, you see the mona lisa. Same bytes in either case - it's the app that reads them that causes those exact same bytes to mean something completely different.
The exact same principle (it's not the byte itself, it's the software or human eyeballs that decide what it means) applies in reverse as well: If I want to 'write' Unchained Melody to disk, it's the software that decides how to do it.
With that in mind, therefore:
How would I exactly convert this from an unsigned byte to signed bytes before it gets placed into output.write and evaluated by .byteValue()?
That question makes no sense. If I have the number -128 and I want to write it to disk, presumably you just write the bit sequence 10000000 to disk and, yup, that doesn't mean anything unless the user of the computer opens that file again with your app. Or any other app that knows that it is to be interpreted as a signed 2's complement byte.
The code you have already writes 1 byte to disk whose bit sequence is 10000000; you're already doing it, your code is fine as is.
If you are opening it with something and that says 'this file contains +128', and you want that to say '-128' instead, there is nothing you can change in your file writing code. Instead, you need to find different software to open it, or configure that software differently.
So i'm working on a bittorrent project. I need to create a bitfield according to number of piece. So i use BitSet but the problem is, the method toByteArray doesn't return the byte array in the order of byte that i wanted.
Eg:
//number of piece=11
bitSet.set(5,16); //following bittorrent specification
bitSet.toByteArray() -> 0xe0ff (this is the byte that i get)
But what i want is 0xffe0
Thanks in advance.
This is due to big-endian vs little-endian mismatch. BitSet is strictly bit-wise little-endian. BitSet#toByteArray() handles reversing the bit-order, but outputs little-endian bytes. So you'll have to rearrange the bytes yourself to match the desired order. The desired order will depend on the size of the "word" in your output data structure.
If the output is a short, you swap bytes 2 at a time, but if it's a long you'll need to reverse each set of 4 output bytes. It's possible to do this with a ByteBuffer but it's probably more work, unless you're already using ByteBuffers.
Unfortunately, there isn't a BitSet.toShortArray(), which would do what you want.
You might find the Wikipedia article Endianness useful.
So we can talk about the endianness of both the bit and byte order.
When I read the next byte from FileInputStream, for example, I practically get an 8-bit signed integer, but I have no idea what is the bit order with which Java calculates the byte's integer value. Which comes first, the most significant or the least significant bit?
(sign bit, 2^6 ..... 2^0)
Or...
(2^0, ..... 2^6, sign bit)
Endianness only really applies when a unit is broken down into other units. So if you were transmitting a byte over a bit stream, you could observe whether the least significant bit was transmitted first or last. And at that point we could say that the stream was little-endian or big-endian.
But within a byte-addressable machine, i.e., where the byte is the smallest unit of storage, there is no "endianness" within the byte. No bit of the byte is "before" any other bit of the byte.
Note that another term for endianness is "byte order". The order of bytes within larger entities.
It is true we like to number bits (0 to 7, for an 8-bit byte) so we can talk about them, but this really does not define endianness, even though the numbering is often chosen to match the byte order of the machine; this is convention.
With respect to FileInputStream - according to its documentation, that transfers bytes: no part of the byte is sent before any other part, at least not as far as FileInputStream is concerned. If the byte has to be sent bitwise over some interconnect (say, a SATA cable), then the decision about which bit goes first is a matter for the hardware. The higher layer code is dealing in bytes (or even blocks).
in int first bit is the sign, the rest is the value, the last bit is the least significant bit.
This isn't a question as much as it's a sanity check!
If you needed to read 4 bytes into Java as a bitmask in Big endian and those bytes were:
0x00, 0x01, 0xB6, 0x02.
Making that into an int would be: 112130
The binary would be: 00000000000000011010011000000010
The endian of a series of bytes wouldn't affect the bit position, would it?
Thanks
Tony
Endian-ness reflects the ordering of bytes, but not the ordering of the bits within those bytes.
Let's say I want to represent the (two-byte) word 0x9001.
If I just type this out in binary, that would be 1001000000000001.
If I dump the bytes (from lower address to higher) on a big-endian machine, I would see 10010000 00000001.
If I dump the bytes (from lower address to higher) on a little-endian machine, I would see 00000001 10010000.
In general, if the thing you're reading from is giving you whole bytes, then you don't need to worry about the order of bits making up those bytes: it is just the order of the bytes that matters, as you correctly suppose.
The time you might have to worry about the "endianness" of individual bits is where you're actually reading/writing a stream of bits rather than whole bytes (e.g. if you were writing a compression algorithm that operated at the bit level, you'd have to make a decision about what order to write the bits in).
The only thing you have to pay attention is how exactly you "read 4 bytes into Java" - that's where endianness matters and you can mess it up (DataInputStream assumes big endian). Once the value you've read has become the int 112130, you're set.