I am writing a Huffman Compression/Decompression program. I have started writing my compression method and I am stuck. I am trying to read all bytes in the file and then put all of the bytes into a byte array. After putting all bytes into the byte array I create an int[] array that will store all the frequencies of each byte (with the index being the ASCII code).
It does include the extended ASCII table since the size of the int array is 256. However I encounter issues as soon as I read a special character in my file (i.e. characters with a value higher than 127). I understand that a byte is signed and will wrap around to a negative value as soon as it crosses the 127 limit (and an array index obviously can't be negative), so I tried to counter this by converting it to an unsigned value when I specify my index for the array (array[myByte & 0xFF]).
This kind of worked but it gave me the wrong ASCII value (for example if the correct ASCII value for the character is 134 I instead got 191 or something). The even more annoying part is that I noticed that special characters are split into 2 separate bytes, which I feel will cause problems later (for example when I try to decompress).
How do I make my program compatible with every single type of character (this program is supposed to be able to compress/decompress pictures, MP3s, etc.)?
Maybe I am taking the wrong approach to this, but I don't know what the right approach is. Please give me some tips for structuring this.
Tree:
package CompPck;
import java.util.TreeMap;
abstract class Tree implements Comparable<Tree> {
    public final int frequency; // the frequency of this tree

    public Tree(int freq) { frequency = freq; }

    // compares on the frequency
    public int compareTo(Tree tree) {
        return frequency - tree.frequency;
    }
}

class Leaf extends Tree {
    public final int value; // the character this leaf represents

    public Leaf(int freq, int val) {
        super(freq);
        value = val;
    }
}

class Node extends Tree {
    public final Tree left, right; // subtrees

    public Node(Tree l, Tree r) {
        super(l.frequency + r.frequency);
        left = l;
        right = r;
    }
}
Build tree method:
public static Tree buildTree(int[] charFreqs) {
    PriorityQueue<Tree> trees = new PriorityQueue<Tree>();
    for (int i = 0; i < charFreqs.length; i++) {
        if (charFreqs[i] > 0) {
            trees.offer(new Leaf(charFreqs[i], i));
        }
    }

    //assert trees.size() > 0;
    while (trees.size() > 1) {
        Tree a = trees.poll();
        Tree b = trees.poll();
        trees.offer(new Node(a, b));
    }
    return trees.poll();
}
Compression method:
public static void compress(File file) {
    try {
        Path path = Paths.get(file.getAbsolutePath());
        byte[] content = Files.readAllBytes(path);
        TreeMap<Integer, String> treeMap = new TreeMap<Integer, String>();
        File nF = new File(file.getName() + "_comp");
        nF.createNewFile();
        BitFileWriter bfw = new BitFileWriter(nF);

        int[] charFreqs = new int[256];

        // read each byte and record the frequencies
        for (byte b : content) {
            charFreqs[b & 0xFF]++;
            System.out.println(b & 0xFF);
        }

        // build tree
        Tree tree = buildTree(charFreqs);

        // build TreeMap
        fillEncodeMap(tree, new StringBuffer(), treeMap);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
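(fillEncodeMap is not shown above; a minimal sketch of such a method, using the Tree/Leaf/Node classes from earlier and recording the 0/1 path to each leaf, might look like this.)

public static void fillEncodeMap(Tree tree, StringBuffer prefix, TreeMap<Integer, String> map) {
    if (tree instanceof Leaf) {
        Leaf leaf = (Leaf) tree;
        // a single-symbol input would otherwise get an empty code
        map.put(leaf.value, prefix.length() > 0 ? prefix.toString() : "0");
    } else if (tree instanceof Node) {
        Node node = (Node) tree;
        prefix.append('0');                       // left edge = 0
        fillEncodeMap(node.left, prefix, map);
        prefix.deleteCharAt(prefix.length() - 1);
        prefix.append('1');                       // right edge = 1
        fillEncodeMap(node.right, prefix, map);
        prefix.deleteCharAt(prefix.length() - 1);
    }
}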
Encodings matter
If I take the character "ö" and read it in my file it will now be
represented by 2 different values (191 and 182 or something like that)
when its actual ASCII table value is 148.
That really depends on which encoding was used to create your text file. An encoding determines how text is stored as bytes.
In UTF-8 the ö is stored as hex [0xc3, 0xb6] or [195, 182]
In ISO/IEC 8859-1 (= "Latin-1") it would be stored as hex [0xf6], or [246]
In Mac OS Central European, it would be hex [0x9a] or [154]
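If you want to see this for yourself, a small throwaway snippet like the one below (the class name is arbitrary; it just encodes the string "ö" in memory) prints the byte values for two of those encodings:

import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String s = "ö";
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("UTF-8 byte: %d (0x%02X)%n", b & 0xFF, b & 0xFF);   // 195, 182
        }
        for (byte b : s.getBytes(StandardCharsets.ISO_8859_1)) {
            System.out.printf("Latin-1 byte: %d (0x%02X)%n", b & 0xFF, b & 0xFF); // 246
        }
    }
}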
Please note that the basic ASCII table itself doesn't define anything for that kind of character: ASCII only uses 7 bits, and so only maps 128 codes.
Part of the problem is that, in layman's terms, "ASCII" is sometimes used to describe extensions of ASCII as well (e.g. Latin-1).
History
There's actually a bit of history behind that. Originally ASCII was a very limited set of characters. When those weren't enough, each region started using the 8th bit to add its language-specific characters, leading to all kinds of compatibility issues.
Then a consortium made an inventory of all characters in all possible languages (and beyond). That set is called "Unicode". It contains not just 128 or 256 characters, but thousands of them.
From that point on you would need more advanced encodings to cover them. UTF-8 is one of those encodings that covers the entire Unicode set, and it does so while being kind-of backwards compatible with ASCII.
Each ASCII character is still mapped in the same way, but when one byte isn't enough, it will use the 8th bit to indicate that a 2nd byte will follow, which is the case for the ö character.
Tools
If you're using a more advanced text editor like Notepad++, then you can select your encoding from the drop-down menu.
In programming
Having said that, your current Java source reads bytes; it's not reading characters. And I would think it's a plus that it works on the byte level, because then it can support all encodings. Maybe you don't need to work on the character level at all.
However, perhaps it does matter for your specific algorithm. Let's say you've written an algorithm that is only supposed to handle Latin-1 encoding; then it's really going to work on the "character level" and not on the "byte level". In that case, consider reading directly into a String or char[].
Java can do the heavy lifting for you in that case. There are readers in Java that will let you read a text file directly into Strings/char[]. However, in those cases you should of course specify an encoding when you use them. Internally a single Java char is 2 bytes (a UTF-16 code unit).
Trying to convert bytes to characters manually is a tricky business, unless you're working with plain old ASCII of course. The moment you see a value above 0x7F (127) (which is represented by a negative value in a byte), you're no longer working with simple ASCII. Then consider using something like new String(bytes, StandardCharsets.UTF_8). There's no need to write a decoding algorithm from scratch.
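As a rough sketch of that character-level route (the file name below is just a placeholder), reading with an explicit charset lets Java do the decoding for you:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CharReadDemo {
    public static void main(String[] args) throws IOException {
        // "input.txt" is a placeholder file name
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("input.txt"), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                // 'c' is one decoded character, no matter how many bytes it occupied on disk
                System.out.println((char) c + " -> " + c);
            }
        }
    }
}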
I am integration testing a component. The component allows you to save and fetch strings.
I want to verify that the component is handling UTF-8 characters properly. What is the minimum test that is required to verify this?
I think that doing something like this is a good start:
// This is the ☺ character
String toSave = "\u263A";
int id = 123;
// Saves to Database
myComponent.save( id, toSave );
// Retrieve from Database
String fromComponent = myComponent.retrieve( id );
// Verify they are same
org.junit.Assert.assertEquals( toSave, fromComponent );
One mistake I have made in the past is I have set String toSave = "è". My test passed because the string was saved and retrieved properly to/from the DB. Unfortunately the application was not actually working correctly because the app was using ISO 8859-1 encoding. This meant that è worked but other characters like ☺ did not.
Question restated: What is the minimum test (or tests) to verify that I can persist UTF-8 encoded strings?
A code and/or documentation review is probably your best option here. But you can probe if you want. It seems that a sufficient test is the goal and minimizing it is less important. It is hard to figure out what a sufficient test is, based only on speculation about what the threat would be, but here's my suggestion: cover all codepoints, including U+0000, and proper handling of "combining characters."
The method you want to test has a Java string as a parameter. Java doesn't have "UTF-8 encoded strings": Java's native text datatypes use the UTF-16 encoding of the Unicode character set. This is common for in-memory representations of text; it's used by Java, .NET, JavaScript, VB6, VBA, and so on. UTF-8 is commonly used for streams and storage, so it makes sense that you should ask about it in the context of "saving and fetching". Databases typically offer one or more of UTF-8, 3-byte-limited UTF-8, or UTF-16 (NVARCHAR) datatypes and collations.
The encoding is an implementation detail. If the component accepts a Java string, it should either throw an exception for data it is unwilling to handle or handle it properly.
"Characters" is a rather ill-defined term. Unicode codepoints range from 0x0 to 0x10FFFF—21 bits. Some codepoints are not assigned (aka "defined"), depending on the Unicode Standard revision. Java datatypes can handle any codepoint, but information about them is limited by version. For Java 8, "Character information is based on the Unicode Standard, version 6.2.0.". You can limit the test to "defined" codepoints or go all possible codepoints.
A codepoint is either a base "character" or a "combining character". Also, each codepoint is in exactly one Unicode Category. Two categories are for combining characters. To form a grapheme, a base character is followed by zero or more combining characters. It might be difficult to layout graphemes graphically (see Zalgo text) but for text storage all that it is needed to not mangle the sequence of codepoints (and byte order, if applicable).
So, here is a non-minimal, somewhat comprehensive test:
final Stream<Integer> codepoints = IntStream
        .rangeClosed(Character.MIN_CODE_POINT, Character.MAX_CODE_POINT)
        .filter(cp -> Character.isDefined(cp)) // optional filtering
        .boxed();

final int[] combiningCategories = {
        Character.COMBINING_SPACING_MARK,
        Character.ENCLOSING_MARK
};
Arrays.sort(combiningCategories); // binarySearch below requires a sorted array

final Map<Boolean, List<Integer>> partitionedCodepoints = codepoints
        .collect(Collectors.partitioningBy(cp ->
                Arrays.binarySearch(combiningCategories, Character.getType(cp)) < 0));

final Integer[] baseCodepoints = partitionedCodepoints.get(true)
        .toArray(new Integer[0]);
final Integer[] combiningCodepoints = partitionedCodepoints.get(false)
        .toArray(new Integer[0]);
final int baseLength = baseCodepoints.length;
final int combiningLength = combiningCodepoints.length;

final StringBuilder graphemes = new StringBuilder();
for (int i = 0; i < baseLength; i++) {
    graphemes.append(Character.toChars(baseCodepoints[i]));
    graphemes.append(Character.toChars(combiningCodepoints[i % combiningLength]));
}

final String test = graphemes.toString();
// getBytes gives exactly the encoded bytes (Charset.encode(...).array() may include unused slack)
final byte[] testUTF8 = test.getBytes(StandardCharsets.UTF_8);

// Java 8 counts for when filtering by Character.isDefined
assertEquals(736681, test.length()); // number of UTF-16 code units
assertEquals(3241399, testUTF8.length); // number of UTF-8 code units
If your component is only capable of storing and retrieving strings, then all you need to do is make sure that nothing gets lost in the conversion between the Unicode strings of Java and the UTF-8 strings that the component stores.
That would involve checking with at least one character from each UTF-8 code point length. So, I would suggest checking with:
One character from the US-ASCII set (1-byte code point), then
One character from Greek (2-byte code point), and
One character from Chinese (3-byte code point).
In theory you would also want to check with an emoji (4-byte code point); in Java strings such a code point is represented as a surrogate pair of two chars, so it is worth including one as well.
A useful extra test would be to try a string combining at least one character from each of the above cases, so as to make sure that characters of different code-point lengths can co-exist within the same string.
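For example, reusing the save/retrieve calls from the question (the exact characters and id are only suggestions):

// 1-byte (ASCII), 2-byte (Greek alpha) and 3-byte (CJK) UTF-8 code points in one string
String mixed = "A\u03B1\u4E2D";
myComponent.save(124, mixed);
org.junit.Assert.assertEquals(mixed, myComponent.retrieve(124));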
(If your component does anything more than storing and retrieving strings, like searching for strings, then things can get a bit more complicated, but it seems to me that you specifically avoided asking about that.)
I do believe that black box testing is the only kind of testing that makes sense, so I would not recommend polluting the interface of your component with methods that would expose knowledge of its internals. However, there are two things that you can do to increase the testability of the component without ruining its interface:
Introduce additional functions to the interface that might help with testing without disclosing anything about the internal implementation and without requiring that the testing code must have knowledge of the internal implementation of the component.
Introduce functionality useful for testing in the constructor of your component. The code that constructs the component knows precisely what component it is constructing, so it is intimately familiar with the nature of the component, so it is okay to pass something implementation-specific there.
An example of what you could do with either of the above techniques would be to artificially and severely limit the number of bytes that the internal representation is allowed to occupy, so that you can make sure that a certain string you are planning to store will fit. So, you could limit the internal size to no more than 9 bytes, and then make sure that a Java Unicode string containing three Chinese characters gets properly stored and retrieved.
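A hypothetical sketch of that idea, assuming the component's constructor can be given such a byte limit (that constructor parameter is an assumption, not part of the question's API):

// Hypothetical: construct the component with an artificial 9-byte internal limit.
MyComponent myComponent = new MyComponent(9);

// Three Chinese characters, each 3 bytes in UTF-8, 9 bytes in total.
String threeChineseChars = "\u4E2D\u6587\u5B57";
myComponent.save(42, threeChineseChars);
org.junit.Assert.assertEquals(threeChineseChars, myComponent.retrieve(42));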
String instances use a predefined and unchangeable internal encoding (16-bit words, i.e. UTF-16 code units).
So, returning only a String from your service is probably not enough to do this check.
You should try to return the byte representation of the persisted String (a byte array for example) and compare the content of this array with the "\u263A" String that you would encode in bytes with the UTF-8 charset.
String toSave = "\u263A";
int id = 123;
// Saves to Database
myComponent.save(id, toSave );
// Retrieve from Database
byte[] actualBytes = myComponent.retrieve(id );
// assertion
byte[] expectedBytes = toSave.getBytes(Charset.forName("UTF-8"));
Assert.assertTrue(Arrays.equals(expectedBytes, actualBytes));
My assignment is to compress a DNA sequence. First, encode using a = 00, c = 01, g = 10, t = 11. I have to read the sequence in from a file and convert it to my encoding. I know I have to use the BitSet class in Java, but I'm having issues with how to implement it. How do I ensure my encoding is used and the letters are not converted to actual binary?
This is the prompt: Develop space efficient Java code for two kinds of compressed encodings of this file of data. (N's are to be ignored). Convert lower case to upper case chars. Do the following and answer the questions: Credit will be awarded to both time and space efficient mechanisms. If your code takes too long to run, you need to rethink design.
Encoding 1. Using two bits A:00, C:01, G:10, T:11.
(a) How many total bits are needed to represent the genome sequence ? (b) how many of the total bits are 1's in the encoded sequence?
I know the logic I have to use, but the actual implementation of the BitSet class and the encoding is where I'm having issues.
Welcome to StackOverflow! Please have a look at a certain Forward Genetic simulator that is being developed on GitHub. It contains a BitSetDNASequence class that may be helpful for the creation of your BitMask. Of course it will serve more as a guideline than a 1:1 solution to your problem, but it may well get you up to speed.
I've made an example below of how you can convert the 'C' letter into bits. So for the "CCCC" string it should print "01010101".
import java.util.BitSet;

public class Test {
    public static void main(String[] args) {
        String originalString = "CCCC";
        int bitSetSize = 2 * originalString.length();
        BitSet bitSet = new BitSet(bitSetSize);

        for (int i = 0; i < originalString.length(); i++) {
            if (originalString.charAt(i) == 'C') {
                // put 01 in the bitset
                bitSet.clear(i * 2);
                bitSet.set(i * 2 + 1);
            }
        }

        // print all the bits in the bitset
        for (int i = 0; i < bitSetSize; i++) {
            if (bitSet.get(i))
                System.out.print("1");
            else
                System.out.print("0");
        }
    }
}
I believe all you need to understand from the BitSet in order to do your assignment are the methods: set, clear and get. Hope it helps.
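If it helps, here is a sketch (my own extension of the example above, not part of the original answer) of how the same loop could handle all four letters with the A:00, C:01, G:10, T:11 mapping:

int outIndex = 0; // position of the next 2-bit group in the BitSet
for (int i = 0; i < originalString.length(); i++) {
    int code;
    switch (originalString.charAt(i)) {
        case 'A': code = 0b00; break;
        case 'C': code = 0b01; break;
        case 'G': code = 0b10; break;
        case 'T': code = 0b11; break;
        default:  continue;    // 'N' and anything else is skipped, as the prompt requires
    }
    // the first BitSet index of the pair holds the high bit, the second the low bit
    if ((code & 0b10) != 0) bitSet.set(outIndex * 2);     else bitSet.clear(outIndex * 2);
    if ((code & 0b01) != 0) bitSet.set(outIndex * 2 + 1); else bitSet.clear(outIndex * 2 + 1);
    outIndex++;
}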
You can have a look at BinCodec, which provides binary encoding/decoding procedures to convert DNA and protein sequences back and forth to/from a compact binary representation. It relies on the standard Java BitSet. Also have a look at BinCodedTest, which shows how to use these APIs.
I am trying to implement compression of files using Huffman encoding. Currently, I am writing the header as the first line of the compressed file and then writing the encoded binary strings (i.e. strings having the binary encoded value).
However, instead of reducing the file size, my file size is increasing as for every character like 'a', I am writing its corresponding binary, for example 01010001 which takes more space.
How can I write it into the file in a way that it reduces the space?
This is my code
public void write( String aWord ) {
    counter++;
    String content;
    byte[] contentInBytes;

    //Write header before writing file contents
    if ( counter == 1 )
    {
        //content gets the header in String format from the tree
        content = myTree.myHeader;
        contentInBytes = content.getBytes();
        try {
            fileOutputStream.write(contentInBytes);
            fileOutputStream.write(System.getProperty("line.separator").getBytes());
        } catch (IOException e) {
            System.err.println(e);
        }
    }

    //content gets the encoded binary in String format from the tree
    content = myTree.writeMe(aWord);
    contentInBytes = content.getBytes();
    try {
        fileOutputStream.write(contentInBytes);
        fileOutputStream.write(System.getProperty("line.separator").getBytes());
    } catch (IOException e) {
        System.err.println(e);
    }
}
Sample input file:
abc
aef
aeg
Compressed file:
{'g':"010",'f':"011",'c':"000",'b':"001",'e':"10",'a':"11"}
11001000
1110011
1110010
As I gathered from the comments, you are writing text, but what you really want to achieve is writing binary data. What you currently have is a nice demo for Huffman encoding, but it is impractical for actually compressing data.
To achieve compression, you will need to output the huffman symbols as binary data, where you currently output the string "11" for an 'a', you will need to just output two bits 11.
I presume this is currently coded in myTree.writeMe(), you need to modify the method to not return a String, but something more suited to binary output, e.g. byte[].
It depends a bit on the inner workings of your tree class how to do this. I presume you are using some StringBuilder internally and simply add the encoded symbol strings while looping over the input. Instead of a StringBuilder you will need a container capable of dealing with single bits. The only suitable class that comes to mind immediately is java.util.BitSet (in practice one would often write a specialized class for this, with a specialized API to do this quickly). But for simplicity let's use BitSet for now.
In method writeMe, you will in principle do the following:
BitSet buffer = new BitSet();
int bitIndex = 0;
loop over input symbols {
    huff_code = getCodeForSymbol(symbol)
    foreach bit in huff_code {
        buffer.set(bitIndex++, bit)   // BitSet.set(int, boolean)
    }
}
return buffer.toByteArray();
How to do this efficiently depends on how you internally defined the huffman code table. But the principle is simple: loop over the code, determine whether each place is a one or a zero, and put them into the BitSet at consecutive indices.
if (digit == '1') {
    buffer.set(bitIndex);
} else {
    buffer.clear(bitIndex);
}
You now have your huffman encoded data. But the resulting data will be impossible to decompress properly, since you are currently processing words and you do not write any indication of where the compressed data actually ends (you currently do this with a line feed). If you encoded for example 3 times an 'a', the BitSet would contain 11 11 11. That's 6 bits, but when you convert to a byte[] it gets padded to 8 bits: 0b11_11_11_00.
Those extra, unavoidable bits will confuse your decompression. You will need to handle this somehow, either by encoding first the number of symbols in the data, or by using an explicit symbol signaling end of data.
This should get you an idea how to continue. Many details depend on how you implement your tree class and the encoded symbols.
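To make that concrete, here is a rough sketch of a writeMe variant that returns packed bytes instead of a String. The Map parameter and its name are my assumption, since the question doesn't show how the tree exposes its codes:

// Sketch: pack the code strings ("11", "010", ...) of a word into bits.
// codeTable maps each character to its Huffman code string; its name is an assumption.
public byte[] writeMe(String aWord, java.util.Map<Character, String> codeTable) {
    java.util.BitSet buffer = new java.util.BitSet();
    int bitIndex = 0;
    for (char symbol : aWord.toCharArray()) {
        for (char digit : codeTable.get(symbol).toCharArray()) {
            if (digit == '1') {
                buffer.set(bitIndex);
            } else {
                buffer.clear(bitIndex);
            }
            bitIndex++;
        }
    }
    // toByteArray() drops trailing zero bits, so the caller must also record the
    // bit count (or an end-of-data symbol) to be able to decompress correctly.
    return buffer.toByteArray();
}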
Is this ASCII?
And how can I print it in an understandable form, like chars?
I get this answer from my PortCom.
Here is how I read:
boolean ok = false;
int read = 0;
System.out.println("In Read :");
while (ok == false) {
    int availableBytes = 0;
    try {
        availableBytes = inputStream.available();
        if (availableBytes > 0) {
            read = read + availableBytes;
            int raw = inputStream.read(readBuffer, read - availableBytes, availableBytes);
            System.out.println("Inpustream =" + raw);
            traduction = new String(readBuffer, read - availableBytes, availableBytes);
            System.out.println("2=>" + traduction);
            Response = new String(readBuffer, "UTF-8"); // bytes -> String
        }
    } catch (IOException e) {
    }
    if (availableBytes == 0 && (read == 19 || read == 8)) {
        ok = true;
    }
}
As I read your comments, I am under the impression that you're a little confused as to what a character and ASCII are.
Characters are numbers. Plain dumb numbers. It just so happens that people created standard mappings between numbers and letters. For instance, according to the ASCII character map, 97 is a. The implications of this are that when display software sees 97, it knows that it has to find the glyph for the character a in a given font, and draw it to the screen.
Integer values 0 through 31, when interpreted with the ASCII character map, are so-called control characters and as such have no visual glyph associated with them. They tell software how to behave rather than what to display. For instance, the character #0 is the NUL character and is used to signal the end of a string with the C string library and has little to no practical use in most other languages. Off the top of my head, character #10 is LF, for "line feed" (new line), and it tells the rendering software to move the drawing cursor to the next line, rather than to render a character.
Most ASCII control characters are outdated and are not meant to be sent to text rendering software. As such, implementations decide how they deal with them if they don't know what to do. Many of them do nothing, some print question marks, and some print completely unrelated characters.
ASCII only maps integers from 0 to 127 to glyphs or control characters, which leaves the other 128 possible integers in a byte undefined. Integers above 127 have no associated glyph in the ASCII standard, and only these can be called "not ASCII". So, what you should be asking, really, is "is that text?" rather than "is that ASCII?", because any sequence of integers between 0 and 127 is necessarily ASCII, which however says nothing about whether or not it's human-readable.
And the obvious answer to that question is "no, it's not text". Asking what it is if it's not text is asking us to be psychics, since there's no "universal bug" that maims text. It could be almost anything.
However, since you state that you're reading from a serial link, I'd advise you to check the baud rate and other link settings, because there's no built-in mechanism to detect mismatches from one end to the other, and such a mismatch can mangle data the way it is mangled here.
Use the raw value instead of availableBytes:
traduction = new String(readBuffer, read-availableBytes, raw);
The raw value indicates how many bytes were actually read, as opposed to how many you requested. If you ask for 10 bytes and it reads 5, the remaining 5 will be unknown garbage.
UPDATE
The response is obviously wrong too and for the same reason:
Response = new String(readBuffer, "UTF-8");
You are telling it to convert the entire buffer even though you may have only read 1 byte. If you're a bit unlucky you'll end up with garbage or even exceptions, because not all byte sequences are valid UTF-8.
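A sketch of one way to avoid both problems, keeping the question's inputStream and stop sizes: collect exactly the bytes that were read into a buffer and decode them once, at the end.

// Sketch: accumulate the bytes actually read, decode once at the end.
java.io.ByteArrayOutputStream received = new java.io.ByteArrayOutputStream();
byte[] buffer = new byte[1024];
try {
    while (received.size() != 19 && received.size() != 8) {   // same stop sizes as the question
        int raw = inputStream.read(buffer);                   // blocks until at least 1 byte, or -1
        if (raw == -1) {
            break;                                            // stream closed
        }
        received.write(buffer, 0, raw);                       // keep only the bytes actually read
    }
} catch (java.io.IOException e) {
    e.printStackTrace();
}
String response = new String(received.toByteArray(), java.nio.charset.StandardCharsets.UTF_8);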
I use GZIPOutputStream or ZIPOutputStream to compress a String (my string.length() is less than 20), but the compressed result is longer than the original string.
On some sites, I found people saying that this is because my original string is too short and that GZIPOutputStream is meant for compressing longer strings.
So, can somebody help me compress a String?
My function is like:
String compress(String original) throws Exception {
}
Update:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;
import java.util.zip.*;
//ZipUtil
public class ZipUtil {
    public static String compress(String str) throws IOException {
        if (str == null || str.length() == 0) {
            return str;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        GZIPOutputStream gzip = new GZIPOutputStream(out);
        gzip.write(str.getBytes());
        gzip.close();
        return out.toString("ISO-8859-1");
    }

    public static void main(String[] args) throws IOException {
        String string = "admin";
        System.out.println("after compress:");
        System.out.println(ZipUtil.compress(string));
    }
}
The result is :
Compression algorithms almost always have some form of space overhead, which means that they are only effective when compressing data which is sufficiently large that the overhead is smaller than the amount of saved space.
Compressing a string which is only 20 characters long is not too easy, and it is not always possible. If you have repetition, Huffman Coding or simple run-length encoding might be able to compress, but probably not by very much.
When you create a String, you can think of it as a list of chars; this means that for each character in your String, you need to support all the possible values of char. From the Sun docs:
char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).
If you have a reduced set of characters you want to support, you can write a simple compression algorithm, which is analogous to binary->decimal->hex radix conversion. You go from 65,536 (or however many characters your target system supports) down to 26 (alphabetic) / 36 (alphanumeric), etc.
I've used this trick a few times, for example encoding timestamps as text (target 36 +, source 10) - just make sure you have plenty of unit tests!
If the passwords are more or less "random" you are out of luck; you will not be able to get a significant reduction in size.
But: why do you need to compress the passwords? Maybe what you need is not compression, but some sort of hash value? If you just need to check whether a name matches a given password, you don't need to save the password, but can save the hash of a password. To check if a typed-in password matches a given name, you can build the hash value the same way and compare it to the saved hash. As a hash (Object.hashCode()) is an int, you will be able to store all 20 password hashes in 80 bytes.
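A tiny sketch of that idea (purely illustrative; for real passwords a proper cryptographic hash should be used rather than hashCode()):

// Illustration only: store an int hash per name instead of the password itself.
java.util.Map<String, Integer> nameToHash = new java.util.HashMap<>();
nameToHash.put("someName", "somePassword".hashCode());

// Later, check a typed-in password by hashing it the same way:
boolean matches = nameToHash.get("someName") == "somePassword".hashCode();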
Your friend is correct. Both gzip and ZIP are based on DEFLATE. This is a general purpose algorithm, and is not intended for encoding small strings.
If you need this, a possible solution is a custom encoding and decoding HashMap<String, String>. This can allow you to do a simple one-to-one mapping:
HashMap<String, String> toCompressed, toUncompressed;
String compressed = toCompressed.get(uncompressed);
// ...
String uncompressed = toUncompressed.get(compressed);
Clearly, this requires setup, and is only practical for a small number of strings.
Huffman Coding might help, but only if you have a lot of frequent characters in your small String
The ZIP (DEFLATE) algorithm is a combination of LZ77 and Huffman coding. You can use one of these algorithms separately.
The compression is based on 2 factors:
the repetition of substrings in your original string (LZ77): if there are a lot of repetitions, the compression will be efficient. This algorithm has good performance for compressing long plain text, since words are often repeated
the frequency of each character in the compressed string (Huffman): the more unbalanced the character distribution is, the more efficient the compression will be
In your case, you should try the LZ77 algorithm only. Used on its own, the string can be compressed without adding meta-information: it is probably better for short string compression.
For the Huffman algorithm, the coding tree has to be sent with the compressed text. So, for a small text, the result can be larger than the original text, because of the tree.
Huffman encoding is a sensible option here. Gzip and friends do this, but the way they work is to build a Huffman tree for the input, send that, then send the data encoded with the tree. If the tree is large relative to the data, there may be no net saving in size.
However, it is possible to avoid sending a tree: instead, you arrange for the sender and receiver to already have one. It can't be built specifically for every string, but you can have a single global tree used to encode all strings. If you build it from the same language as the input strings (English or whatever), you should still get good compression, although not as good as with a custom tree for every input.
If you know that your strings are mostly ASCII you could convert them to UTF-8.
byte[] bytes = string.getBytes("UTF-8");
This may reduce the memory size by about 50%. However, you will get a byte array out and not a string. If you are writing it to a file though, that should not be a problem.
To convert back to a String:
private final Charset UTF8_CHARSET = Charset.forName("UTF-8");
...
String s = new String(bytes, UTF8_CHARSET);
You don't see any compression happening for your String, as you need at least a couple of hundred bytes to get real compression using GZIPOutputStream or ZipOutputStream. Your String is too small. (I don't understand why you would require compression for it.)
Check the conclusion from this article:
The article also shows how to compress and decompress data on the fly in order to reduce network traffic and improve the performance of your client/server applications. Compressing data on the fly, however, improves the performance of client/server applications only when the objects being compressed are more than a couple of hundred bytes. You would not be able to observe improvement in performance if the objects being compressed and transferred are simple String objects, for example.
Take a look at the Huffman algorithm.
https://codereview.stackexchange.com/questions/44473/huffman-code-implementation
The idea is that each character is replaced with a sequence of bits, depending on its frequency in the text (the more frequent, the shorter the sequence).
You can read your entire text and build a table of codes, for example:
Symbol Code
a 0
s 10
e 110
m 111
The algorithm builds a symbol tree based on the text input. The more variety of characters you have, the worse the compression will be.
But depending on your text, it could be effective.