I have a sample .txt file that I want to compress using Huffman encoding. My problem is that if one character has a size of one byte and the smallest size you can write is a byte, how do I reduce the size of the sample file?
I converted the sample file into Huffman codes and wrote them to a new, empty .txt file, which consists only of 0s and 1s as one huge line of characters. Then I took the new file and used the BitSet class in Java to write to a binary file bit by bit. If the character was 0 or 1 in the new file, I wrote 0 or 1 respectively to the binary file. This process was very slow and it crashed my computer multiple times. I was hoping that someone had a more efficient solution. I have written all my code in Java.
Do not write "0" and "1" characters to the file. Write 0 and 1 bits to the file.
You do this by accumulating eight bits into a byte buffer using the shift (<<) and or (|) operators, and then writing that byte to the file. Repeat. At the end you may have fewer than eight bits in the byte buffer. If so, write that byte to the file, which will have the remaining bits filled with zeros.
For example:
int buf = 0, count = 0;
// for each bit:
buf |= bit << count++;
// check for eight:
if (count == 8) { out.writeByte(buf); buf = count = 0; }
// at the end:
if (count > 0) out.writeByte(buf);
When decoding the Huffman codes, you may run into a problem with those filler zero bits in the last byte. They could be decoded as an extraneous symbol or symbols. In order to deal with this you will need for the decoder to know when to stop, by either sending the number of symbols before the Huffman codes, or by adding a symbol for end-of-stream.
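A minimal sketch of the approach above, assuming the Huffman codes arrive as strings of '0' and '1' characters. The class name BitWriter and its methods are hypothetical, not part of any library. It records the number of bits rather than the number of symbols, which is a variation on the first option: the decoder reads the count first and ignores the filler bits after it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Hypothetical helper: packs bits into bytes, least significant bit first,
// the same order as "buf |= bit << count++".
class BitWriter {
    private final ByteArrayOutputStream packed = new ByteArrayOutputStream();
    private int buf = 0;        // bits accumulated for the current byte
    private int count = 0;      // how many bits are in buf (0..7)
    private long totalBits = 0; // total bits written, stored as a header

    void writeBit(int bit) {
        buf |= (bit & 1) << count++;
        totalBits++;
        if (count == 8) {       // a full byte is ready
            packed.write(buf);
            buf = 0;
            count = 0;
        }
    }

    // Writes one Huffman code given as text, e.g. "110".
    void writeCode(String code) {
        for (int i = 0; i < code.length(); i++) {
            writeBit(code.charAt(i) - '0');
        }
    }

    // Returns an 8-byte big-endian bit count followed by the packed bits.
    // The last byte is padded with zero bits if it is not full.
    byte[] finish() {
        if (count > 0) {
            packed.write(buf);
            buf = 0;
            count = 0;
        }
        byte[] body = packed.toByteArray();
        return ByteBuffer.allocate(8 + body.length)
                .putLong(totalBits)
                .put(body)
                .array();
    }
}
```

The resulting array can be written to disk in one call, for example with java.nio.file.Files.write, which avoids the per-bit file access that made the original attempt slow.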
One way is to use BitSet to set the bits that represent the code as you compute it. Then you can do either BitSet.toByteArray() or BitSet.toLongArray() and write out the information. Both of these store the bits in little endian encoding.
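A rough sketch of the BitSet route, assuming the whole code string fits in memory. The class and method names are made up for illustration. One thing to watch: toByteArray() only returns bytes up to the highest set bit, so trailing zero bits are dropped and the bit count has to be kept separately.

```java
import java.util.BitSet;

// Hypothetical helper built on java.util.BitSet.
class BitSetPacker {
    // Bit i of the stream ends up in byte i / 8 at bit position i % 8.
    static byte[] pack(String bits) {
        BitSet set = new BitSet(bits.length());
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                set.set(i);
            }
        }
        return set.toByteArray();
    }

    // bitCount must be stored alongside the bytes, because trailing
    // zero bits are not represented in the packed array.
    static String unpack(byte[] packed, int bitCount) {
        BitSet set = BitSet.valueOf(packed);
        StringBuilder sb = new StringBuilder(bitCount);
        for (int i = 0; i < bitCount; i++) {
            sb.append(set.get(i) ? '1' : '0');
        }
        return sb.toString();
    }
}
```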
I have changed my question message...
I have two streams with audio in Java. What I want is to combine these two audios into one OutputStream.
I've been searching, and it seems that if both streams have the same audio format and use PCM, you just need to do the following operation with the two byte arrays:
mixAudio[i] = (byte) ((audio1[i] + audio2[i]) >> 1);
However I am writing this into a file and I get a file without any audio.
Does anyone know how to combine two audios when I have the audios in two streams (not two audio files)?
Thank you in advance.
decent quality audio consumes two bytes of data per sample per channel to give the audio curve a bit depth of 16 bits which gives your audio curve 2^16 distinct values when digitizing the analog audio curve ... knowing this you cannot do your adding while the data lives as simply bytes ... so to add together two channels you first need to get your audio out of its bytes and into a two byte integer ... then you need to pluck out of that two byte integer each of those two bytes one by one and stow into your output array
in pseudo code ( this puts into an integer two consecutive bytes of your audio array which represents one sample in your audio curve )
assign into a 16 bit integer value of your most significant byte
left shift this integer by 8 bits something like ( myint = myint << 8 )
bit level add to this integer your 2nd byte which is your least significant byte
Top Tip : after you have written code to populate one integer from two bytes then do the reverse namely convert a multi byte integer into two bytes in some array ... bonus points if you plot these integers so you can visualize your raw audio curve
To perform above you must know your endianness ( are you doing little endian or big endian ) which will determine the order of your bytes ... specifically since we now know each audio sample consumes two bytes (or more say for 24 bit audio ) the bytes myarray[i] and myarray[i + 1] are one audio sample however only after knowing your endianness will you realize which array element to use first when populating the above myint ... if none of this makes sense please invest time and effort to research notion of raw audio in PCM format
I highly encourage you to do all of above in your code at least once to appreciate what is happening inside some audio library which may do this for you
going back to your question instead of simply doing
mixAudio[i] = (byte) ((audio1[i] + audio2[i]) >> 1);
you should be doing something like this (untested, especially regarding endianness ... this assumes most significant byte first and that twoByteAnswer is an int)
twoByteAnswer = (((audio1[i] << 8) | (audio1[i + 1] & 0xFF)) + ((audio2[i] << 8) | (audio2[i + 1] & 0xFF))) >> 1;
now you need to spread out your twoByteAnswer into two bytes of array mixAudio ... something like this (also untested)
mixAudio[i] = (byte) (twoByteAnswer >> 8); // throw away its least sig byte only using its most sig byte
mixAudio[i + 1] = (byte) (twoByteAnswer & 0xFF); // do a bit AND operator mask to keep only its least sig byte
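Putting the steps together, here is a sketch of the whole loop. It assumes 16 bit signed PCM, big endian (most significant byte first), and that both buffers share the same sample rate and channel layout. PcmMixer is a made-up name, and for little endian data the two bytes of each sample would swap roles.

```java
// Hypothetical helper: averages two buffers of 16-bit signed big-endian PCM.
class PcmMixer {
    static byte[] mix(byte[] audio1, byte[] audio2) {
        // Only whole samples present in both buffers are mixed.
        int n = Math.min(audio1.length, audio2.length) & ~1;
        byte[] mixAudio = new byte[n];
        for (int i = 0; i < n; i += 2) {
            // High byte keeps its sign, low byte is masked so it is not sign-extended.
            int sample1 = (audio1[i] << 8) | (audio1[i + 1] & 0xFF);
            int sample2 = (audio2[i] << 8) | (audio2[i + 1] & 0xFF);
            int average = (sample1 + sample2) >> 1; // halving avoids overflow
            mixAudio[i] = (byte) (average >> 8);    // most significant byte
            mixAudio[i + 1] = (byte) average;       // least significant byte
        }
        return mixAudio;
    }
}
```

Note the loop advances two bytes at a time, one sample per iteration, instead of one byte at a time as in the original byte-wise formula.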
I'm working on a Huffman compression and decompression application in Java. So far I have the encoding and decoding working. It converts a big input text to its encoded binary text. This is a String of 1's and 0's. For example:
String originaltext = "Hello I am trying to program a huffman application and..."
String encodedtext = "1100001110001111011010100110100110...." It's a pretty long string.
Now I want to save the string to a file as a binary file to reduce the size. But when I try to do this, the file is far bigger than the original text. Instead I need it to be smaller than the original file. After saving the encodedtext to a file, I need to read the binary file back in and convert it to the encodedtext string so I can decode it with my Huffman tree method.
How can I save the binary string to a binary file that is smaller than the original? And how do I read the file back in and convert the binary data to the encoded string?
Probably you're writing your string of 1 and 0 as a string, which will result in 1 byte per 1 or 0.
You need to convert those 1s and 0s to bytes (i.e. convert each group of eight 1s and 0s into one byte) and write those bytes.
EDIT
See answer by 6502 to this question for some code to convert the 1s and 0s to bytes.
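In case the linked answer is not at hand, here is a small sketch of both directions. It is not the code from that answer. BinaryStringCodec is a hypothetical name, bits are packed most significant bit first, and the number of valid bits is passed separately when decoding because the last byte may be padded.

```java
// Hypothetical helper: converts between a string of '0'/'1' characters and bytes.
class BinaryStringCodec {
    // Packs eight characters per byte, first character in the highest bit.
    static byte[] toBytes(String bits) {
        byte[] out = new byte[(bits.length() + 7) / 8];
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                out[i / 8] |= (byte) (0x80 >> (i % 8));
            }
        }
        return out; // unused bits in the last byte stay zero
    }

    // Rebuilds the first bitCount bits as a string of '0'/'1' characters.
    static String toBitString(byte[] data, int bitCount) {
        StringBuilder sb = new StringBuilder(bitCount);
        for (int i = 0; i < bitCount; i++) {
            int bit = (data[i / 8] >> (7 - i % 8)) & 1;
            sb.append(bit == 1 ? '1' : '0');
        }
        return sb.toString();
    }
}
```

The byte array can then be saved with java.nio.file.Files.write and loaded again with Files.readAllBytes. The bit count needs to be stored too, for instance in a small header at the start of the file.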
So for no particular reason I wanted to know what the largest number you can store in a gigabyte of memory is. I used an arbitrary-precision library to calculate it, but the trouble is outputting this number to a file, since a String can only hold Integer.MAX_VALUE characters.
Apint a = new Apint(2);
a = ApintMath.pow(a, 8589934591l);
a = a.subtract(new Apint(1));
File file = new File("theNumber.txt");
PrintWriter pls = new PrintWriter(file);
a.writeTo(pls, true);
pls.close();
You should convert the number to bytes, in either little-endian or big-endian order, and then save those bytes to the file. For an int that is 4 bytes.
The same method works for very big numbers, e.g. 8 bytes, 16 bytes, and so on.
Update:
Try to use BigInteger class and toByteArray() function when writing bytes to file.
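A small sketch of that suggestion, with made-up class and method names. One caveat I am fairly sure of: the BigInteger documentation only guarantees a range of about ±2^Integer.MAX_VALUE, so the full 2^8589934591 value from the question will not fit. The sketch therefore only shows the mechanism on a smaller number.

```java
import java.io.IOException;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: stores a BigInteger as raw bytes instead of decimal text.
class BigNumberBytes {
    // 2^bits - 1, i.e. the largest unsigned value that fits in that many bits.
    static BigInteger allOnes(int bits) {
        return BigInteger.ONE.shiftLeft(bits).subtract(BigInteger.ONE);
    }

    // Big-endian two's-complement bytes, as documented for toByteArray().
    static byte[] toBytes(BigInteger n) {
        return n.toByteArray();
    }

    static BigInteger fromBytes(byte[] bytes) {
        return new BigInteger(bytes);
    }

    // Writes the raw bytes to a file in one call.
    static void save(Path file, BigInteger n) throws IOException {
        Files.write(file, toBytes(n));
    }
}
```

Because toByteArray() uses two's complement, a positive value whose top bit is set gets an extra leading zero byte for the sign.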
(Untested method; may not work)
Use the mod operator % with a power of 10 to select the right most digits. Write those digits to a file on a line. Then divide by the same power of 10. Now your number is N digits shorter. Repeat writing each group of digits into the file on separate lines.
Now copy the lines in reverse order into another file, either using java or tac if you are on Linux.
You could join the lines together, though I would discourage that, because many programs will hang if you try to load one very long line of text into them but can handle many lines of text.
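Here is a sketch of the digit-group idea using BigInteger as a stand-in for the arbitrary-precision type (I have not checked the equivalent Apint calls). DigitChunks is a hypothetical name. It reverses the groups in memory instead of using tac, which only suits numbers small enough to hold as a list. One detail the steps above leave out: every group except the leading one has to be zero-padded to the full width, or digits are lost.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: splits a non-negative number into fixed-width digit groups.
class DigitChunks {
    static List<String> chunk(BigInteger n, int digitsPerLine) {
        BigInteger base = BigInteger.TEN.pow(digitsPerLine);
        List<String> lines = new ArrayList<>();
        while (n.compareTo(base) >= 0) {
            BigInteger[] quotientAndRemainder = n.divideAndRemainder(base);
            // The remainder holds the rightmost digits; pad so "5" becomes "005".
            lines.add(String.format("%0" + digitsPerLine + "d", quotientAndRemainder[1]));
            n = quotientAndRemainder[0];
        }
        lines.add(n.toString());    // leading group, no padding needed
        Collections.reverse(lines); // most significant group first
        return lines;
    }
}
```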
Alright so I am trying to do a file compress using the Huffman tree.
We got the tree that is working just fine but we are unable to figure out how to write the binary string we get into the file.
So for example our tree returns: '110', it should mean this byte: '00000110' right?
And if it returns: '11111111 11111110', what should that mean? Should we just write it out as bytes?
So the question is how do we convert the binary string we get into bytes so we can write it on the file?
Thanks a lot,
Ara
So for example our tree returns: '110', it should mean this byte:
'00000110' right?
Wrong. You should have a byte buffer of bits into which you write your bits. Write the three bits 110 into the byte. (You will need to decide on a convention for bit ordering in the byte.) You still have five unused bits in the byte, so there it sits. Now you write 10 into the buffer. The byte buffer now has 11010, and three unused bits. So still it sits. Now you try to write 111011 into the byte buffer. The first three bits go into the byte buffer, giving you 11010111. You now have filled the buffer, so only now do you write out your byte to the file. You are left with 011. You clear your byte buffer of bits since you wrote it out, and put in the remaining 011 from your last code. Your byte buffer now has three bits in it, and five bits unused. Continue in this manner.
The buffer does not have to be one byte. 16-bit or 32-bit buffers are common and are more efficient. You write out bytes whenever the bits therein are eight or more, and shift the remaining 0-7 bits to the start of the buffer.
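The walk-through above can be sketched like this, assuming each code is handed over as an integer plus a bit length and that bits fill each byte from the most significant end. WideBitBuffer is a hypothetical name. It uses a 64-bit long as the accumulator so that a code of up to 32 bits always fits next to the leftover 0-7 bits.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper: accumulates variable-length codes and emits whole bytes.
class WideBitBuffer {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private long buffer = 0;  // pending bits live in the low bitCount bits
    private int bitCount = 0; // number of pending bits, 0..7 between calls

    // Appends the low `length` bits of `code` (length between 1 and 32).
    void write(int code, int length) {
        buffer = (buffer << length) | (code & ((1L << length) - 1));
        bitCount += length;
        while (bitCount >= 8) {           // emit every complete byte
            bitCount -= 8;
            out.write((int) (buffer >>> bitCount) & 0xFF);
        }
        buffer &= (1L << bitCount) - 1;   // keep only the leftover bits
    }

    // Flushes the last partial byte, padded on the right with zero bits.
    byte[] finish() {
        if (bitCount > 0) {
            out.write((int) (buffer << (8 - bitCount)) & 0xFF);
            buffer = 0;
            bitCount = 0;
        }
        return out.toByteArray();
    }
}
```

Feeding it the three codes from the example, write(0b110, 3), write(0b10, 2) and write(0b111011, 6), produces the byte 11010111 followed by 01100000, where the second byte is the leftover 011 plus five padding bits.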
The only tricky part is what to do at the end, since you may have unused bits in your last byte. Your Huffman codes should have an end symbol to mark the end of the stream. Then you know when you should stop looking for more Huffman codes. If you do not have an end code, then you need to ensure somehow that either the remaining bits in the byte cannot be a complete Huffman code, or you need to indicate in some other way where the stream of bits ends.
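To illustrate the end-symbol option, here is a sketch of a decoder that stops as soon as it sees the end code. Everything in it is assumed for the example: the class name, the use of '\0' as the end marker, a code table given as a map from bit strings to characters, and most-significant-bit-first packing. A real decoder would normally walk the Huffman tree instead of looking up strings.

```java
import java.util.Map;

// Hypothetical decoder that relies on an end-of-stream symbol.
class EndSymbolDecoder {
    static final char EOS = '\0'; // assumed marker, must not occur in the text

    static String decode(byte[] data, Map<String, Character> codes) {
        StringBuilder text = new StringBuilder();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < data.length * 8; i++) {
            int bit = (data[i / 8] >> (7 - i % 8)) & 1;
            current.append(bit == 1 ? '1' : '0');
            // Huffman codes are prefix-free, so the first match is the symbol.
            Character symbol = codes.get(current.toString());
            if (symbol != null) {
                if (symbol == EOS) {
                    return text.toString(); // padding bits after this are ignored
                }
                text.append(symbol);
                current.setLength(0);
            }
        }
        return text.toString(); // no end symbol found
    }
}
```

With the codes 0 for 'a', 10 for 'b' and 11 for the end symbol, the byte 01011000 decodes to "ab". Without the end symbol, the three padding zeros would each be read as an extra 'a'.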
I have implemented the Huffman Encoding Algorithm in Java using Priority Queues where I traverse the Tree from Root to Leaf and get encoding example as #=000011 based on the number of times the symbol appears in the input. Everything is fine, the tree is being built fine, encoding is just as expected: But the output file I am getting is bigger size than the original file. I am currently appending '0' & '1' to a String on traversing left node and right node of the tree. Probably what I end up with uses all 8 bits for each characters and it does not help in compression. I am guessing there is some conversion of these bits into character values which is required. So that these characters use fewer bits than 8 and hence I get a compressed version of the original file. Could you please let me know how to achieve a compression by manipulating characters and reducing bits in Java? Thanks
You're probably using a StringBuilder and appending "0" or "1", or simply the + operator to concatenate "0" or "1" to the end of your string. Or you're using some sort of OutputStream and writing to it.
What you want to do is to write the actual bits. I'd suggest making a whole byte first before writing. A byte looks like this:
0x05
Which would represent the binary string 0000 0101.
You can make these by making a byte type, adding and shifting:
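// Requires: import java.io.IOException; import java.io.OutputStream;
// Packs a string of '0' and '1' characters into bytes, most significant bit first.
public void writeToFile(String binaryString, OutputStream os) throws IOException {
    int pos = 0;
    while (pos < binaryString.length()) {
        int nextByte = 0;
        for (int i = 0; i < 8; i++) {
            nextByte <<= 1; // make room for the next bit
            if (pos + i < binaryString.length() && binaryString.charAt(pos + i) == '1') {
                nextByte |= 1;
            }
        }
        os.write(nextByte); // a short final group is padded with zero bits on the right
        pos += 8;
    }
}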
Of course, it's inefficient to write one byte at a time to an unbuffered stream. OutputStream also accepts byte arrays (byte[]), so you'd be better off storing the bytes in an array (or even easier, a List) and writing them in bigger chunks, or wrapping the stream in a BufferedOutputStream.
If you are not allowed to use byte writes (why the heck not? ObjectOutputStream supports writing byte arrays!), then you can use Base64 to encode your binary string. But remember that Base64 inflates your data usage by 33%.
An easy way to convert a byte array to Base64 is to use an existing encoder. Since Java 8 the standard library provides one in java.util.Base64 (the older sun.misc.BASE64Encoder is an internal class and is no longer available in current JDKs). After adding the following import:
import java.util.Base64;
You can turn your byte array into a string:
byte[] bytes = getBytesFromHuffmanEncoding();
String encodedString = Base64.getEncoder().encodeToString(bytes);