Writing binary value to file for Huffman Encoding

I am trying to implement compression of files using Huffman encoding. Currently, I am writing the header as the first line of the compressed file and then writing the encoded binary strings (i.e. strings having the binary encoded value).
However, instead of reducing the file size, my file size is increasing as for every character like 'a', I am writing its corresponding binary, for example 01010001 which takes more space.
How can I write it into the file in a way that it reduces the space?
This is my code
public void write( String aWord ) {
counter++;
String content;
byte[] contentInBytes;
//Write header before writing file contents
if ( counter == 1 )
{
//content gets the header in String format from the tree
content = myTree.myHeader;
contentInBytes = content.getBytes();
try {
fileOutputStream.write(contentInBytes);
fileOutputStream.write(System.getProperty("line.separator").getBytes());
} catch (IOException e) {
System.err.println(e);
}
}
//content gets the encoded binary in String format from the tree
content = myTree.writeMe(aWord);
contentInBytes = content.getBytes();
try {
fileOutputStream.write(contentInBytes);
fileOutputStream.write(System.getProperty("line.separator").getBytes());
} catch (IOException e) {
System.err.println(e);
}
}
Sample input file:
abc
aef
aeg
Compressed file:
{'g':"010",'f':"011",'c':"000",'b':"001",'e':"10",'a':"11"}
11001000
1110011
1110010

As I gathered from the comments, you are writing text, but what you really want to achieve is writing binary data. What you currently have is a nice demo of Huffman encoding, but it is impractical for actually compressing data.
To achieve compression, you need to output the Huffman symbols as binary data: where you currently output the string "11" for an 'a', you need to output just the two bits 11.
I presume this is currently coded in myTree.writeMe(). You need to modify that method so it does not return a String, but something more suited to binary output, e.g. byte[].
How to do this depends a bit on the inner workings of your tree class. I presume you are using a StringBuilder internally and simply append the encoded symbol strings while looping over the input. Instead of a StringBuilder you will need a container capable of dealing with single bits. The only suitable class that comes to mind immediately is java.util.BitSet (in practice one will often write a specialized class for this, with an API designed to do it quickly). But for simplicity let's use BitSet for now.
In method writeMe, you will in principle do the following:
BitSet buffer = new BitSet();
int bitIndex = 0;
loop over input symbols {
huff_code = getCodeForSymbol(symbol)
foreach bit in huff_code {
buffer.set(bitIndex++, bit)
}
}
return buffer.toByteArray();
How to do this efficiently depends on how you defined the Huffman code table internally. But the principle is simple: loop over the code, determine whether each position is a one or a zero, and put it into the BitSet at consecutive indices.
if (digits == '1') {
buffer.set(bitIndex);
} else {
buffer.clear(bitIndex);
}
You now have your Huffman-encoded data. But the resulting data will be impossible to decompress properly, since you are currently processing words and you no longer write any indication of where the compressed data actually ends (you currently do this with a line feed). If you encode, for example, three 'a's, the BitSet contains 11 11 11. That's 6 bits, but a byte[] can only hold whole bytes, so the last byte is padded with zero bits up to 8. With BitSet.toByteArray(), bit 0 is the least significant bit of the first byte, so the result is 0b0011_1111; a writer that fills bytes from the most significant bit would produce 0b1111_1100 instead. Note also that BitSet.toByteArray() drops trailing bytes that contain only zero bits.
Those extra, unavoidable padding bits will confuse your decompression. You will need to handle this somehow, either by first encoding the number of symbols in the data, or by using an explicit symbol that signals the end of the data.
This should give you an idea of how to continue. Many details depend on how you implement your tree class and the encoded symbols.
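As a rough illustration of the two points above (packing codes into bits and recording how many symbols were written), here is a minimal sketch. The class name HuffmanBitPacker, the helper sampleCodes() and the 4-byte symbol-count prefix are assumptions made for this example; they are not part of the asker's tree class. The code table is the one from the question's sample header.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not the asker's tree class.
final class HuffmanBitPacker {

    // Code table taken from the sample header in the question.
    static Map<Character, String> sampleCodes() {
        Map<Character, String> codes = new HashMap<>();
        codes.put('a', "11");
        codes.put('b', "001");
        codes.put('c', "000");
        codes.put('e', "10");
        codes.put('f', "011");
        codes.put('g', "010");
        return codes;
    }

    // Returns a 4-byte big-endian symbol count followed by the packed code bits.
    static byte[] pack(String text, Map<Character, String> codes) {
        BitSet bits = new BitSet();
        int bitIndex = 0;
        for (char symbol : text.toCharArray()) {
            String code = codes.get(symbol);
            if (code == null) {
                throw new IllegalArgumentException("No code for symbol: " + symbol);
            }
            for (char digit : code.toCharArray()) {
                bits.set(bitIndex++, digit == '1');
            }
        }
        // toByteArray() omits trailing all-zero bytes, so pad to the exact size.
        byte[] payload = Arrays.copyOf(bits.toByteArray(), (bitIndex + 7) / 8);
        return ByteBuffer.allocate(4 + payload.length)
                .putInt(text.length())
                .put(payload)
                .array();
    }

    public static void main(String[] args) {
        byte[] packed = pack("abc", sampleCodes());
        System.out.println("abc packs into " + packed.length + " bytes including the count");
    }
}
```

A decoder for this layout would read the count first, then walk the tree bit by bit (least significant bit of each byte first, to match BitSet) and stop after that many symbols, ignoring the padding.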

Related

Java - How to handle special characters when compressing bytes (Huffman encoding)?

I am writing a Huffman Compression/Decompression program. I have started writing my compression method and I am stuck. I am trying to read all bytes in the file and then put all of the bytes into a byte array. After putting all bytes into the byte array I create an int[] array that will store all the frequencies of each byte (with the index being the ASCII code).
It does include the extended ASCII table since the size of the int array is 256. However, I encounter issues as soon as I read a special character in my file (i.e. a character with an ASCII value higher than 127). I understand that a byte is signed and will wrap around to a negative value as soon as it crosses the 127 limit (and an array index obviously can't be negative), so I tried to counter this by turning it into an unsigned value when I specify my index for the array (array[myByte&0xFF]).
This kind of worked but it gave me the wrong ASCII value (for example if the correct ASCII value for the character is 134 I instead got 191 or something). The even more annoying part is that I noticed that special characters are split into 2 separate bytes, which I feel will cause problems later (for example when I try to decompress).
How do I make my program compatible with every single type of character (this program is supposed to be able to compress/decompress pictures, mp3's etc).
Maybe I am taking the wrong approach to this, but I don't know what the right approach is. Please give me some tips for structuring this.
Tree:
package CompPck;
import java.util.TreeMap;
abstract class Tree implements Comparable<Tree> {
public final int frequency; // the frequency of this tree
public Tree(int freq) { frequency = freq; }
// compares on the frequency
public int compareTo(Tree tree) {
return frequency - tree.frequency;
}
}
class Leaf extends Tree {
public final int value; // the character this leaf represents
public Leaf(int freq, int val) {
super(freq);
value = val;
}
}
class Node extends Tree {
public final Tree left, right; // subtrees
public Node(Tree l, Tree r) {
super(l.frequency + r.frequency);
left = l;
right = r;
}
}
Build tree method:
public static Tree buildTree(int[] charFreqs) {
PriorityQueue<Tree> trees = new PriorityQueue<Tree>();
for (int i = 0; i < charFreqs.length; i++){
if (charFreqs[i] > 0){
trees.offer(new Leaf(charFreqs[i], i));
}
}
//assert trees.size() > 0;
while (trees.size() > 1) {
Tree a = trees.poll();
Tree b = trees.poll();
trees.offer(new Node(a, b));
}
return trees.poll();
}
Compression method:
public static void compress(File file){
try {
Path path = Paths.get(file.getAbsolutePath());
byte[] content = Files.readAllBytes(path);
TreeMap<Integer, String> treeMap = new TreeMap<Integer, String>();
File nF = new File(file.getName() + "_comp");
nF.createNewFile();
BitFileWriter bfw = new BitFileWriter(nF);
int[] charFreqs = new int[256];
// read each byte and record the frequencies
for (byte b : content){
charFreqs[b&0xFF]++;
System.out.println(b&0xFF);
}
// build tree
Tree tree = buildTree(charFreqs);
// build TreeMap
fillEncodeMap(tree, new StringBuffer(), treeMap);
} catch (IOException e) {
e.printStackTrace();
}
}

Encodings matter
If I take the character "ö" and read it in my file it will now be
represented by 2 different values (191 and 182 or something like that)
when its actual ASCII table value is 148.
That really depends, which kind of encoding was used to create your text file. Encodings determine how text messages are stored.
In UTF-8 the ö is stored as hex [0xc3, 0xb6] or [195, 182]
In ISO/IEC 8859-1 (= "Latin-1") it would be stored as hex [0xf6], or [246]
In Mac OS Central European, it would be hex [0x9a] or [154]
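The byte values listed above can be checked with a few lines of Java. This is only a small sketch; the class name EncodingDemo and the helper unsignedBytes are made up for the example, and it covers just the two charsets that every JVM is required to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class for inspecting how one character is stored.
final class EncodingDemo {

    // Encodes the text and returns each byte as an unsigned value (0..255).
    static int[] unsignedBytes(String text, Charset charset) {
        byte[] raw = text.getBytes(charset);
        int[] result = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            result[i] = raw[i] & 0xFF; // mask away the sign extension
        }
        return result;
    }

    public static void main(String[] args) {
        String oUmlaut = "\u00f6"; // the character ö
        System.out.println(Arrays.toString(unsignedBytes(oUmlaut, StandardCharsets.UTF_8)));      // [195, 182]
        System.out.println(Arrays.toString(unsignedBytes(oUmlaut, StandardCharsets.ISO_8859_1))); // [246]
    }
}
```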
Please note, that the basic ASCII table itself doesn't really describe anything for that kind of character. ASCII only uses 7 bits, and by doing so only maps 128 codes.
Part of the problem is that in layman's terms, "ASCII" is sometimes used to describe extensions of ASCII as well, (e.g. like Latin-1)
History
There's actually a bit of history behind that. Originally ASCII was a very limited set of characters. When those weren't enough, each region started using the 8th bit to add its language-specific characters, leading to all kinds of compatibility issues.
Then the Unicode Consortium made an inventory of all characters in all possible languages (and beyond). That set is called "Unicode". It contains not just 128 or 256 characters, but more than a hundred thousand of them.
From that point on you need more advanced encodings to cover them. UTF-8 is one of the encodings that covers the entire Unicode set, and it does so while being backwards compatible with ASCII.
Each ASCII character is still mapped in the same way, but when one byte isn't enough, UTF-8 sets the 8th bit in every byte of the sequence, and the first byte indicates how many bytes follow. That is the case for the ö character, which takes two bytes.
Tools
If you're using a more advanced text editor like Notepad++, then you can select your encoding from the drop-down menu.
In programming
Having said that, your current java source reads bytes, it's not reading characters. And I would think that it's a plus when it works on byte-level, because then it can support all encodings. Maybe you don't need to work on character level at all.
However, it may matter for your specific algorithm. Let's say you've written an algorithm that is only supposed to handle Latin-1 encoding. Then it's really going to work on "character level" and not on "byte level". In that case, consider reading directly into a String or char[].
Java can do the heavy lifting for you in that case. There are readers in Java that will let you read a text file directly into Strings or a char[]. However, in those cases you should of course specify an encoding when you use them. Internally a single Java char holds 2 bytes (16 bits) of data.
Trying to convert bytes to characters manually is a tricky business, unless you're working with plain old ASCII of course. The moment you see a value above 0x7F (127), which is represented by a negative value in a byte, you're no longer working with simple ASCII. Then consider using something like: new String(bytes, StandardCharsets.UTF_8). There's no need to write a decoding algorithm from scratch.

Resource file format processing in Java

I am trying to implement a processor for a specific resource archive file format in Java. The format has a Header comprised of a three-char description, a dummy byte, plus a byte indicating the number of files.
Then each file has an entry consisting of a dummy byte, a twelve-char string describing the file name, a dummy byte, and an offset declared in a three-byte array.
What would be the proper class for reading this kind of structure? I have tried RandomAccessFile but it does not allow to read arrays of data, e.g. I can only read three chars by calling readChar() three times, etc.
Of course I can extend RandomAccessFile to do what I want but there's got to be a proper out-of-the-box class to do this kind of processing isn't it?
This is my reader for the header in C#:
protected override void ReadHeader()
{
Header = new string(this.BinaryReader.ReadChars(3));
byte dummy = this.BinaryReader.ReadByte();
NFiles = this.BinaryReader.ReadByte();
}

I think you got lucky with your C# code, as it relies on the character encoding to be set somewhere else, and if it didn't match the number of bytes per character in the file, your code would probably have failed.
The safest way to do this in Java would be to strictly read bytes and do the conversion to characters yourself. If you need seek abilities, then indeed RandomAccessFile would be your easiest solution, but it should be pointed out that InputStream allows skipping, so if you don't need actual random access, just the ability to skip some of the files, you could certainly use it.
In either case, you should read the bytes from the file per the file specification, and then convert them to characters based on a known encoding. You should never trust a file that was not written by a Java program to contain any Java data types other than byte, and even if it was written by Java, it may well have been converted to raw bytes while writing.
So your code should be something along the lines of:
String header = "";
int nFiles = 0;
RandomAccessFile raFile = new RandomAccessFile( "filename", "r" );
byte[] buffer = new byte[3];
int numRead = raFile.read( buffer );
header = new String( buffer, StandardCharsets.US_ASCII ); // the Charset overload avoids the checked UnsupportedEncodingException
int numSkipped = raFile.skipBytes(1);
nFiles = raFile.read(); // The byte is read as an integer between 0 and 255
Sanity checks (checking that 3 bytes were actually read, that 1 byte was skipped, and that nFiles is not -1) and exception handling have been skipped for brevity.
It's more or less the same if you use InputStream.
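For the InputStream variant, one possible sketch wraps the stream in a DataInputStream, whose readFully and readUnsignedByte methods perform the sanity checks by throwing EOFException when the data runs out. The class name ArchiveHeader and the helper parse are assumptions for this example, and the tag is assumed to be plain ASCII as in the snippet above.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical value class for the header: three chars, a dummy byte, a file count.
final class ArchiveHeader {
    final String tag;
    final int fileCount;

    ArchiveHeader(String tag, int fileCount) {
        this.tag = tag;
        this.fileCount = fileCount;
    }

    static ArchiveHeader read(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        byte[] tagBytes = new byte[3];
        data.readFully(tagBytes);                 // throws EOFException if fewer than 3 bytes remain
        data.readUnsignedByte();                  // dummy byte, value ignored
        int fileCount = data.readUnsignedByte();  // 0..255, never negative
        return new ArchiveHeader(new String(tagBytes, StandardCharsets.US_ASCII), fileCount);
    }

    // Convenience wrapper for in-memory data.
    static ArchiveHeader parse(byte[] raw) {
        try {
            return read(new ByteArrayInputStream(raw));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ArchiveHeader header = parse(new byte[] {'R', 'E', 'S', 0, 5});
        System.out.println(header.tag + " with " + header.fileCount + " files");
    }
}
```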

I would go with MappedByteBuffer. This will allow you to seek arbitrarily, but will also deal efficiently and transparently with large files that are too large to fit comfortably in RAM.
This is, to my mind, the best way of reading structured binary data like this from a file.
You can then build your own data structure on top of that, to handle the specific file format.
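A minimal sketch of that idea, applied to one file entry from the question (dummy byte, twelve-char name, dummy byte, three-byte offset), might look like the following. The class name ArchiveEntryReader is invented for the example, and two things are assumptions because the question does not state them: that names are ASCII and that the three-byte offset is stored little-endian.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical reader built on top of a (mapped) ByteBuffer.
final class ArchiveEntryReader {

    // Maps the whole file read-only; the mapping stays valid after the channel is closed.
    static ByteBuffer map(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    // Reads one entry at the buffer's current position and returns "name@offset".
    static String describeEntry(ByteBuffer buffer) {
        buffer.get();                              // dummy byte
        byte[] nameBytes = new byte[12];
        buffer.get(nameBytes);                     // twelve-char file name
        buffer.get();                              // dummy byte
        int b0 = buffer.get() & 0xFF;
        int b1 = buffer.get() & 0xFF;
        int b2 = buffer.get() & 0xFF;
        int offset = b0 | (b1 << 8) | (b2 << 16);  // ASSUMPTION: little-endian 24-bit offset
        String name = new String(nameBytes, StandardCharsets.US_ASCII).trim();
        return name + "@" + offset;
    }
}
```

Because describeEntry only depends on ByteBuffer, the same code works on a mapped file (via map) or on an in-memory array wrapped with ByteBuffer.wrap, and buffer.position(n) gives the seek ability.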

Writing Bits to a file using BitSet & FileOutputStream

I've run into a bit of a problem when it comes to writing specific bits to a file. I apologise if this is a duplicate of anything but I could not find a reasonable answer with the searches I ran.
I have a number of difficulties with the following:
Writing a header (Long) bit by bit (converted to a byte array so the
FileOutputStream can utilise it) to the file.
Writing single bits to the file. For example, at one stage I am required to write a single bit set to 0 to the file so my initial thought would be to use a BitSet but Java seems to treat this as a null?
BitSet initialPadding = new BitSet();
initialPadding.set(0, false);
fileOutputStream.write(initialPadding.toByteArray());
1)
I create a FileOutputStream as shown below with the necessary file name:
FileOutputStream fileOutputStream = new FileOutputStream(file.getAbsolutePath());
I am attempting to create an ".amr" file so the first step before I perform any bit manipulation is to write a header to the beginning of the file. This has the following value:
Long defaultHeader = 0x2321414d520aL;
I've tried writing this to the file using the following method but I am pretty sure it does not write the correct result:
fileOutputStream.write(defaultHeader.byteValue());
Am I using the correct streams? Are my conversions completely wrong?
2)
I have a public BitSet fileBitSet; which has bits read in from a ".raw" file as the input. I need to be able to extract certain bits from the BitSet in order to write them to the file later. I do this using the following method:
public int getOctetPayloadHeader(int startPoint) {
int readLength = 0;
octetCMR = fileBitSet.get(0, 3);
octetRES = fileBitSet.get(4, 7);
if (octetRES.get(0, 3).isEmpty()) {
/* Keep constructing the payload header. */
octetFBit = fileBitSet.get(8, 8);
octetMode = fileBitSet.get(9, 12);
octetQuality = fileBitSet.get(13, 13);
octetPadding = fileBitSet.get(14, 15);
... }
What would be the best way to go for writing these bits to a file bearing in mind that I may be required to sometimes write a single bit or 81 bits at a particular offset in the fileBitSet ?

There is only one thing you can write to an OutputStream: bytes. You have to do the composing of your bits into bytes yourself; only you know the rules how the bits are to be put together into bytes.
As for stuff like:
Long defaultHeader = 0x2321414d520aL;
fileOutputStream.write(defaultHeader.byteValue());
You should take a close look at the javadocs for the methods you are using. byteValue() returns a single byte, so of course it's not doing what you expect. Working with streams is well explained in Oracle's tutorials: http://docs.oracle.com/javase/tutorial/essential/io/streams.html
For writing single bits or groups of bits, you will need a custom OutputStream that handles grouping the bits into bytes to be written. That's commonly called a BitStream (there is no such class in the JDK); you have to either write it yourself (which I highly recommend, it's a very good exercise to teach you about bits and bytes) or find one on the web.
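To give an idea of what such a class could look like, here is a minimal sketch. The class BitOutputStream and its methods are written for this example only (as noted, the JDK has no such class), and it assumes bits are packed most-significant-bit first, which may or may not match the format you are targeting. The demo method writes the 48-bit header from the question followed by a single 0 bit.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical bit-level writer; packs bits MSB-first into bytes.
final class BitOutputStream implements AutoCloseable {
    private final OutputStream out;
    private int current; // bits collected for the byte in progress
    private int count;   // how many bits of `current` are filled (0..7)

    BitOutputStream(OutputStream out) {
        this.out = out;
    }

    void writeBit(boolean bit) throws IOException {
        current = (current << 1) | (bit ? 1 : 0);
        if (++count == 8) {
            out.write(current);
            current = 0;
            count = 0;
        }
    }

    // Writes the lowest `width` bits of `value`, highest bit first.
    void writeBits(long value, int width) throws IOException {
        for (int i = width - 1; i >= 0; i--) {
            writeBit(((value >>> i) & 1L) == 1L);
        }
    }

    // Pads the last partial byte with zero bits, then closes the underlying stream.
    @Override
    public void close() throws IOException {
        while (count != 0) {
            writeBit(false);
        }
        out.close();
    }

    static byte[] demo() {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (BitOutputStream bits = new BitOutputStream(sink)) {
            bits.writeBits(0x2321414d520aL, 48); // the six header bytes "#!AMR\n"
            bits.writeBit(false);                // one single 0 bit
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }
}
```

In real use the ByteArrayOutputStream would be replaced by a (buffered) FileOutputStream.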

How do you write any ASCII character to a file in Java?

Basically I'm trying to use a BufferedWriter to write to a file using Java. The problem is, I'm actually doing some compression, so I generate ints between 0 and 255, and I want to write the character whose ASCII value is equal to that int. When I try writing to the file, it writes many ? characters, so when I read the file back in, it reads those as 63, which is clearly not what I want. Any ideas how I can fix this?
Example code:
int a = generateCode(character); //a now has an int between 0 and 255
bw.write((char) a);
a is always between 0 and 255, but it sometimes writes '?'

You are really trying to write / read bytes to / from a file.
When you are processing byte-oriented data (as distinct from character-oriented data), you should be using InputStream and OutputStream classes and not Reader and Writer classes.
In this case, you should use FileInputStream / FileOutputStream, and wrap with a BufferedInputStream / BufferedOutputStream if you are doing byte-at-a-time reads and writes.
Those pesky '?' characters are due to issues in the encoding/decoding process that happens when Java converts between characters and the default text encoding for your platform. The conversion from bytes to characters and back is often "lossy", depending on the encoding scheme used. You can avoid this by using the byte-oriented stream classes.
(And the answers that point out that ASCII is a 7-bit not 8-bit character set are 100% correct. You are really trying to read / write binary octets, not characters.)
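A small sketch of that approach follows. The class ByteCodes and its two helpers are invented for the example, and in-memory streams stand in for FileOutputStream / FileInputStream so the snippet does not touch the file system.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical helpers: write and read values 0..255 as raw bytes, no charset involved.
final class ByteCodes {

    static byte[] writeCodes(int[] codes) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = new BufferedOutputStream(sink)) {
            for (int code : codes) {
                out.write(code); // OutputStream.write(int) stores the low 8 bits
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    static int[] readCodes(byte[] data) {
        int[] codes = new int[data.length];
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data))) {
            for (int i = 0; i < codes.length; i++) {
                codes[i] = in.read(); // InputStream.read() returns 0..255, or -1 at end of stream
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return codes;
    }
}
```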

You need to make up your mind about what you are really doing. Are you trying to write some bytes to a file, or are you trying to write encoded text? These are different concepts in Java: byte I/O is handled by subclasses of InputStream and OutputStream, while character I/O is handled by subclasses of Reader and Writer. If what you really want is to write bytes to a file (which I'm guessing from your mention of compression), use an OutputStream, not a Writer.
Then there's another confusion you have, which is evident from your mention of "ASCII characters from 0-255." There are no ASCII characters above 127. Please take 15 minutes to read this: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" (by Joel Spolsky). Pay particular attention to the parts where he explains the difference between a character set and an encoding, because it's critical for understanding Java I/O. (To review whether you understood, here's what you need to learn: Java Writers are classes that translate character output to byte output by applying a client-specified encoding to the text, and sending the bytes to an OutputStream.)

Java strings are based on 16-bit characters, and Java performs conversions around that assumption when no encoding is clearly specified.
The following sample code writes and reads data directly as bytes, meaning 8-bit numbers, which here happen to have an ASCII meaning associated with them.
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class RWBytes {
    public static void main(String[] args) throws IOException {
        String filename = "MiTestFile.txt";
        byte[] written = {65, 66, 67, 68, 69}; // A, B, C, D, E
        byte[] readBack = new byte[written.length];
        // Write the raw bytes; try-with-resources closes the stream
        try (FileOutputStream fos = new FileOutputStream(filename)) {
            fos.write(written);
        }
        // Read the same bytes back and remember how many were actually read
        int count;
        try (FileInputStream fis = new FileInputStream(filename)) {
            count = fis.read(readBack);
        }
        for (int i = 0; i < count; i++) {
            System.out.println("As the byte value: " + readBack[i]); // the numeric byte value
            System.out.println("Converted to char for printing to the screen: " + (char) readBack[i]);
        }
    }
}
Only part of the 7-bit ASCII range is printable, for example A = 65. Code 10 is the "new line" character, which moves the output down one line when it is "printed". Many other codes control a character-oriented screen; they are invisible and only affect the layout, such as tabs. There are also control characters with other purposes, such as ringing a bell.
The upper half of the 8-bit range, above 127, was defined however each implementer wanted; only the lower half has standard meanings.
For general binary byte handling there are no such concerns: the bytes are just numbers that represent the data. Only when you try to print them to the screen do they take on all kinds of meanings.

How to compress a String in Java?

I use GZIPOutputStream or ZIPOutputStream to compress a String (my string.length() is less than 20), but the compressed result is longer than the original string.
On some site, I found some friends said that this is because my original string is too short, GZIPOutputStream can be used to compress longer strings.
so, can somebody give me a help to compress a String?
My function is like:
String compress(String original) throws Exception {
}
Update:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;
import java.util.zip.*;
//ZipUtil
public class ZipUtil {
public static String compress(String str) throws IOException {
if (str == null || str.length() == 0) {
return str;
}
ByteArrayOutputStream out = new ByteArrayOutputStream();
GZIPOutputStream gzip = new GZIPOutputStream(out);
gzip.write(str.getBytes());
gzip.close();
return out.toString("ISO-8859-1");
}
public static void main(String[] args) throws IOException {
String string = "admin";
System.out.println("after compress:");
System.out.println(ZipUtil.compress(string));
}
}
The result is :

Compression algorithms almost always have some form of space overhead, which means that they are only effective when compressing data which is sufficiently large that the overhead is smaller than the amount of saved space.
Compressing a string which is only 20 characters long is not too easy, and it is not always possible. If you have repetition, Huffman Coding or simple run-length encoding might be able to compress, but probably not by very much.
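The overhead is easy to observe. The sketch below (class name GzipOverhead is made up for the example) gzips a string in memory and reports the size; the gzip container alone adds a 10-byte header and an 8-byte trailer, so a 5-character input such as "admin" cannot come out smaller than it went in.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper that measures the gzipped size of a string.
final class GzipOverhead {

    static int gzippedSize(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.size(); // all data has been flushed once the stream is closed
    }

    public static void main(String[] args) {
        System.out.println("admin: 5 bytes in, " + gzippedSize("admin") + " bytes out");
    }
}
```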

When you create a String, you can think of it as a list of chars. This means that for each character in your String, you need to support all the possible values of char. From the Sun docs:
char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).
If you have a reduced set of characters you want to support, you can write a simple compression algorithm, which is analogous to a binary->decimal->hex radix conversion. You go from 65,536 (or however many characters your target system supports) to 26 (alphabetical) / 36 (alphanumeric) etc.
I've used this trick a few times, for example encoding timestamps as text (target 36 +, source 10) - just make sure you have plenty of unit tests!

If the passwords are more or less "random" you are out of luck, you will not be able to get a significant reduction in size.
But: why do you need to compress the passwords? Maybe what you need is not compression, but some sort of hash value? If you just need to check whether a name matches a given password, you don't need to save the password, but can save the hash of the password. To check whether a typed-in password matches a given name, you can build the hash value the same way and compare it to the saved hash. As a hash (Object.hashCode()) is an int, you will be able to store all 20 password hashes in 80 bytes. Keep in mind that hashCode() is only 32 bits and not collision resistant, so a dedicated password-hashing function is preferable when security matters.

Your friend is correct. Both gzip and ZIP are based on DEFLATE. This is a general purpose algorithm, and is not intended for encoding small strings.
If you need this, a possible solution is a custom encoding and decoding HashMap<String, String>. This can allow you to do a simple one-to-one mapping:
HashMap<String, String> toCompressed, toUncompressed;
String compressed = toCompressed.get(uncompressed);
// ...
String uncompressed = toUncompressed.get(compressed);
Clearly, this requires setup, and is only practical for a small number of strings.

Huffman Coding might help, but only if you have a lot of frequent characters in your small String

The DEFLATE algorithm used by ZIP is a combination of LZ77 (a dictionary coder from the same family as LZW) and Huffman trees. You can use one of these algorithms separately.
The compression is based on 2 factors:
the repetition of substrings in your original string (LZ77/LZW): if there are a lot of repetitions, the compression will be efficient. This kind of algorithm performs well when compressing long plain text, since words are often repeated
the frequency of each character in the string (Huffman): the more unbalanced the distribution of characters is, the more efficient the compression will be
In your case, you should try the LZW algorithm only. In its basic form, the string can be compressed without adding any meta-information, so it is probably better for compressing short strings.
For the Huffman algorithm, the coding tree has to be sent with the compressed text. So, for a small text, the result can be larger than the original text, because of the tree.

Huffman encoding is a sensible option here. Gzip and friends do this, but the way they work is to build a Huffman tree for the input, send that, then send the data encoded with the tree. If the tree is large relative to the data, there may be no saving in size.
However, it is possible to avoid sending a tree: instead, you arrange for the sender and receiver to already have one. It can't be built specifically for every string, but you can have a single global tree used to encode all strings. If you build it from the same language as the input strings (English or whatever), you should still get good compression, although not as good as with a custom tree for every input.

If you know that your strings are mostly ASCII you could convert them to UTF-8.
byte[] bytes = string.getBytes(StandardCharsets.UTF_8);
This may reduce the memory size by about 50%. However, you will get a byte array out and not a string. If you are writing it to a file though, that should not be a problem.
To convert back to a String:
private static final Charset UTF8_CHARSET = StandardCharsets.UTF_8;
...
String s = new String(bytes, UTF8_CHARSET);

You don't see any compression happening for your String because you need at least a couple of hundred bytes to get real compression out of GZIPOutputStream or ZIPOutputStream. Your String is too small. (I don't understand why you require compression for it.)
Check Conclusion from this article:
The article also shows how to compress and decompress data on the fly in order to reduce network traffic and improve the performance of your client/server applications. Compressing data on the fly, however, improves the performance of client/server applications only when the objects being compressed are more than a couple of hundred bytes. You would not be able to observe improvement in performance if the objects being compressed and transferred are simple String objects, for example.

Take a look at the Huffman algorithm.
https://codereview.stackexchange.com/questions/44473/huffman-code-implementation
The idea is that each character is replaced with sequence of bits, depending on their frequency in the text (the more frequent, the smaller the sequence).
You can read your entire text and build a table of codes, for example:
Symbol Code
a 0
s 10
e 110
m 111
The algorithm builds a symbol tree based on the text input. The greater the variety of characters you have, the worse the compression will be.
But depending on your text, it could be effective.
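To make the table above concrete, here is a small sketch that encodes a word with exactly those four codes. The class CodeTableDemo and its methods are made up for the example, and the result is kept as a string of '0' and '1' characters purely for readability; a real compressor would pack those digits into actual bits.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo using the four-symbol code table shown above.
final class CodeTableDemo {

    static Map<Character, String> sampleTable() {
        Map<Character, String> table = new HashMap<>();
        table.put('a', "0");
        table.put('s', "10");
        table.put('e', "110");
        table.put('m', "111");
        return table;
    }

    // Concatenates the code of every character in the text.
    static String encode(String text, Map<Character, String> table) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            String code = table.get(c);
            if (code == null) {
                throw new IllegalArgumentException("No code for symbol: " + c);
            }
            bits.append(code);
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        // "seams" needs 11 bits instead of 5 * 8 = 40 bits
        System.out.println(encode("seams", sampleTable())); // 10110011110
    }
}
```

Because no code is a prefix of another, the bit sequence can be decoded unambiguously by reading it from left to right.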
