Unable to compress file during Huffman Encoding in Java - java

I have implemented the Huffman encoding algorithm in Java using priority queues. I traverse the tree from root to leaf and get codes such as #=000011, based on the number of times each symbol appears in the input. The tree is built correctly and the codes are as expected, but the output file is bigger than the original file. I am currently appending '0' and '1' to a String when traversing the left and right nodes of the tree. Each of those characters probably takes a full 8 bits, so nothing is compressed. I am guessing the bits need to be converted into character values so that each symbol uses fewer than 8 bits and I get a compressed version of the original file. Could you please let me know how to achieve compression by manipulating characters and reducing bits in Java? Thanks

You're probably using a StringBuilder and appending "0" or "1", or simply the + operator to concatenate "0" or "1" to the end of your string. Or you're using some sort of OutputStream and writing to it.
What you want to do is to write the actual bits. I'd suggest making a whole byte first before writing. A byte looks like this:
0x03
Which represents the binary string 0000 0011.
You can make these by making a byte type, adding and shifting:
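import java.io.IOException;
import java.io.OutputStream;

// Packs a string of '0'/'1' characters into bytes, most significant bit first.
public void writeToFile(String binaryString, OutputStream os) throws IOException {
    int pos = 0;
    while (pos < binaryString.length()) {
        int nextByte = 0;
        int bits = 0;
        for (; bits < 8 && pos + bits < binaryString.length(); bits++) {
            nextByte <<= 1; // a bare "nextByte << 1;" discards the result and does not compile
            nextByte |= binaryString.charAt(pos + bits) == '0' ? 0 : 1;
        }
        nextByte <<= 8 - bits; // left-align a final partial byte, padding with zero bits
        os.write(nextByte);    // write(int) stores only the low 8 bits
        pos += 8;
    }
}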
Of course, it's inefficient to write one byte at a time. OutputStream also has write overloads that accept byte arrays (byte[]), so you'd be better off storing the bytes in an array (or even easier, a List) and writing them in bigger chunks, or wrapping the stream in a BufferedOutputStream.
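As a rough sketch of the chunked approach, assuming the encoded bits are still held in a String of '0' and '1' characters, the bytes can be packed into one array and written with a single call. The class and method names here are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;

// Hypothetical helper, not part of the original answer.
class BitStringPacker {

    // Packs a string of '0'/'1' characters into bytes, most significant bit first.
    // A final partial byte is padded on the right with zero bits.
    static byte[] pack(String bits) {
        byte[] out = new byte[(bits.length() + 7) / 8];
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                out[i / 8] |= (byte) (0x80 >>> (i % 8));
            }
        }
        return out;
    }

    // Writes all packed bytes with one write(byte[]) call.
    static void writeTo(String bits, OutputStream os) throws IOException {
        os.write(pack(bits));
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        writeTo("0000001110", sink);
        System.out.println(Arrays.toString(sink.toByteArray()));
    }
}
```

Ten input bits become two bytes here; the decoder still needs some way to know that only ten of the sixteen bits are meaningful.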
If you are not allowed to use byte writes (why the heck not? ObjectOutputStream supports writing byte arrays!), then you can use Base64 to encode your binary string. But remember that Base64 inflates your data usage by 33%.
An easy way to convert a byte array to Base64 is to use an existing encoder. On Java 8 and later, java.util.Base64 is the supported choice; sun.misc.BASE64Encoder is an internal class that is no longer available in current JDKs. After adding the following import:
import java.util.Base64;
You can turn your byte array into a string:
byte[] bytes = getBytesFromHuffmanEncoding();
String encodedString = Base64.getEncoder().encodeToString(bytes);
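To make the size cost concrete, here is a small sketch assuming Java 8 or later, where java.util.Base64 is available. The three sample bytes are an arbitrary stand-in for real Huffman output, and the class name is hypothetical.

```java
import java.util.Base64;

// Hypothetical demo class: shows the Base64 round trip and its size overhead.
class Base64SizeDemo {

    static String toBase64(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    static byte[] fromBase64(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] packed = {0x03, (byte) 0x80, 0x7F}; // stand-in for packed Huffman bits
        String text = toBase64(packed);
        // Every 3 input bytes become 4 output characters.
        System.out.println(packed.length + " bytes -> " + text.length() + " characters");
    }
}
```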

Related

How to write Huffman code to a binary file?

I have a sample .txt file that I want to compress using Huffman encoding. My problem is that if one character has a size of one byte and the smallest size you can write is a byte, how do I reduce the size of the sample file?
I converted the sample file into Huffman codes and wrote them to a new empty .txt file, which just consists of 0s and 1s as one huge line of characters. Then I took the new file and used the BitSet class in Java to write to a binary file bit by bit. If the character was 0 or 1 in the new file, I wrote 0 or 1 respectively to the binary file. This process was very slow and it crashed my computer multiple times. I was hoping that someone had a more efficient solution. I have written all my code in Java.
Do not write "0" and "1" characters to the file. Write 0 and 1 bits to the file.
You do this by accumulating eight bits into a byte buffer using the shift (<<) and or (|) operators, and then writing that byte to the file. Repeat. At the end you may have fewer than eight bits in the byte buffer. If so, write that byte to the file; its remaining bits will be filled with zeros.
E.g. int buf = 0, count = 0;, for each bit: buf |= bit << count++;, check for eight: if (count == 8) { out.writeByte(buf); buf = count = 0; }. At the end, if (count > 0) out.writeByte(buf);.
When decoding the Huffman codes, you may run into a problem with those filler zero bits in the last byte. They could be decoded as an extraneous symbol or symbols. In order to deal with this you will need for the decoder to know when to stop, by either sending the number of symbols before the Huffman codes, or by adding a symbol for end-of-stream.
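One possible way to tell the decoder where to stop is sketched below. It packs bits least significant bit first, as in the snippet above, and prefixes the output with the number of valid bits. Storing a bit count instead of a symbol count is a simplification chosen for this sketch; the class name and the 8-byte header layout are assumptions, not a standard format.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch: an in-memory bit writer that records how many bits are valid.
class CountedBitWriter {
    private final ByteArrayOutputStream body = new ByteArrayOutputStream();
    private int buf = 0;
    private int count = 0;
    private long totalBits = 0;

    void writeBit(int bit) {
        buf |= (bit & 1) << count++;
        totalBits++;
        if (count == 8) {
            flushByte();
        }
    }

    private void flushByte() {
        body.write(buf); // only the low 8 bits are stored
        buf = 0;
        count = 0;
    }

    // Returns an 8-byte big-endian bit count followed by the packed bits.
    byte[] finish() {
        if (count > 0) {
            flushByte(); // unused high bits of the last byte stay zero
        }
        byte[] packed = body.toByteArray();
        return ByteBuffer.allocate(Long.BYTES + packed.length)
                .putLong(totalBits)
                .put(packed)
                .array();
    }

    public static void main(String[] args) {
        CountedBitWriter writer = new CountedBitWriter();
        for (char c : "101".toCharArray()) {
            writer.writeBit(c - '0');
        }
        System.out.println(Arrays.toString(writer.finish()));
    }
}
```

A decoder would read the count first and then consume exactly that many bits, ignoring the filler zeros in the last byte.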
One way is to use BitSet to set the bits that represent the code as you compute it. Then you can do either BitSet.toByteArray() or BitSet.toLongArray() and write out the information. Both of these store the bits in little endian encoding.
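A minimal sketch of the BitSet route, assuming the code bits are available as a String of '0' and '1' characters (the class name is made up). One caveat worth noticing: BitSet.toByteArray() only covers bits up to the highest set bit, so trailing zero bits are dropped and the real bit count has to be stored separately.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical demo of packing code bits with java.util.BitSet.
class BitSetPacking {

    // Bit i of the input string becomes bit i of the BitSet.
    static byte[] pack(String bits) {
        BitSet set = new BitSet(bits.length());
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                set.set(i);
            }
        }
        return set.toByteArray(); // little-endian; trailing zero bits are not included
    }

    public static void main(String[] args) {
        // 16 input bits, but only the first byte contains set bits.
        byte[] packed = pack("0000001100000000");
        System.out.println(packed.length + " byte(s): " + Arrays.toString(packed));
    }
}
```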

How to force Java to use only 2 bytes per character for Unicode characters (eg. 'ł')?

I am working on translating a hashing algorithm from C# to Java, and it requires using the byte array of the string. The problem is that when working with characters like 'ł' and 'ą', Java converts each of these letters into 2 characters, giving me 4 bytes instead of the 2 that I am expecting.
I tried using string.codePointAt() instead of string.charAt(), but it keeps on processing those letters as 2 characters instead of 1. I thought Java uses 16-bit Unicode, the same as C# and VB, so why does it require 4 bytes for these letters when C# and VB were able to convert them to 2 bytes?
C# and VB reads the bytes of 'ł' as: [66, 1] (code below)
bytes = Encoding.Unicode.GetBytes("ł");
Console.WriteLine(string.Join(",", bytes));
Java reads the bytes of 'ł' as: [-59, 0, 26, 32] (code below)
String str = "ł";
byte[] B = str.getBytes(Charset.forName("UTF-16LE"));
System.out.println(Arrays.toString(B));
I even tried using StandardCharsets too, but still same issue.
Is there a way for Java to process these letters as a single UTF-16 character instead of separating it into 2 characters?
PS: I cannot also refactor the algorithm since it is already in use, and it just had to be done in our new Java too.
PPS: I tried normalizing the string but there are still differences, character "æ" is read with [-26,0] when C# outputs [230,0] for the character
Java strings are sequences of UTF-16 code units, and String#codePointAt(int) works on those code units directly, so it is not affected by any charset setting. String#getBytes() with no arguments, however, uses the default character encoding, which depends on the Java implementation and the platform it's on. You have the right idea to use String.getBytes(Charset.forName("UTF-16LE")), but I recommend you use String.getBytes(StandardCharsets.UTF_16LE) instead.
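As a small check of the explicit-charset call, the sketch below writes 'ł' as the Unicode escape \u0142 so that the result does not depend on how the source file itself is encoded. The class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo: encode a string as UTF-16LE with an explicit charset.
class Utf16LeDemo {

    static byte[] utf16le(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        String s = "\u0142"; // 'ł', written as an escape to sidestep source-encoding issues
        System.out.println(Arrays.toString(utf16le(s))); // prints [66, 1]
    }
}
```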
The second issue, with C# returning [230,0] while Java returns [-26, 0]: technically, they are the same, bit-wise. However, C#'s byte array holds unsigned bytes, while Java's array holds signed bytes. Even though both Java and C# give the same byte pattern, if you really want to express a positive value, you could store them in a short array instead:
String str = "æ";
byte[] byteArray = str.getBytes(StandardCharsets.UTF_16LE);
short[] newByteArray = new short[byteArray.length];
for (int i = 0; i < byteArray.length; i++) {
byte c = byteArray[i];
newByteArray[i] = (c >= 0) ? c : (short)(c + 256);
}
System.out.println(Arrays.toString(byteArray));
// => [-26, 0]
System.out.println(Arrays.toString(newByteArray));
// => [230, 0]
FWIW, replacing æ with ł gives me [66, 1] both for the byte array and short array.
Although the code "converts" the array into unsigned "bytes", I would advise against doing this if you can, because the byte array gives the same pattern as C#, and promises the same number size.
I have found the issue: as Ralf Kleberhoff had guessed, I wasn't using the file encoding that the Java compiler expects. My file was using UTF-16LE encoding, so I just passed -encoding "UTF-16" when compiling the file.
javac -encoding "UTF-16" HashBrowns.java
Also, as Samuel Hunter had suggested, I converted the values to positive to make sure that I get the exact same values as I get with C# & VB6.
private int[] convertSignedBytesToUnsignedint(byte[] b)
{
int[] intArr = new int[b.length];
for (int i = 0; i < b.length; i++) {
intArr[i] = b[i] & 0xff;
}
return intArr;
}
I am not sure which code is more optimized, but I wanted to post this here to share what worked in my situation.
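For comparison, a sketch of the same conversion using Byte.toUnsignedInt, which has been in the standard library since Java 8 and is equivalent to masking with 0xff. The class and method names are illustrative only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: view Java's signed bytes as the unsigned values C# would print.
class UnsignedBytes {

    static int[] toUnsigned(byte[] signed) {
        int[] result = new int[signed.length];
        for (int i = 0; i < signed.length; i++) {
            result[i] = Byte.toUnsignedInt(signed[i]); // same value as signed[i] & 0xff
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] bytes = "\u00e6".getBytes(StandardCharsets.UTF_16LE); // 'æ'
        System.out.println(Arrays.toString(bytes));             // prints [-26, 0]
        System.out.println(Arrays.toString(toUnsigned(bytes))); // prints [230, 0]
    }
}
```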

How to convert RSA encrypted numbers into text/characters

I wrote a RSA encryption in Java. I am trying to turn the numbers that it outputs into text or characters. For example if I feed it Hello I get:
23805663430659911910
However, online RSA encryptions return something to the effect of this:
GVom5zCerZ+dmOCE7YAp0F+N3L26L
I would just like to know how to convert my numbers into something similar. The number returned by my system is a BigInteger. This is what I've tried so far:
RSA rsa = new RSA("Hello");
BigInteger cypher_number = rsa.encrypt(); // 23805663430659911910
byte[] cypher_bytes = cypher_number.toByteArray(); // [B@368102c8
String cypher_text = new String(cypher_bytes); // J^��*���
// Now even though cypher_text is J^��*��� I wouldn't care as long as I can turn it back.
byte[] plain_bytes = cypher_text.getBytes(); // [B@6996db8 | Not the same as cypher_bytes but let's keep going.
BigInteger plain_number = new BigInteger(plain_bytes); // 28779359581043512470254837759607478877667261
// plain_number has more than doubled in size compared to cypher_number and won't decrypt properly.
Using bytes is the only way I can think of. Can someone please help me understand what I'm supposed to be doing or if it's even possible?
This is generally a 2-step process:
1. convert the number to a binary encoding;
2. convert the binary encoding to a text-based encoding.
For both steps there are multiple schemes possible.
For binary encoding: the PKCS#1 specifications have always included one that converts the number to a statically sized integer. To be precise, it encodes the number as a statically sized, unsigned, big endian octet string. An octet string is nothing but a byte array.
Now, BigInteger.toByteArray returns a dynamically sized, signed, big endian octet string. So you need to implement the possible resizing and removal of initial 00 byte in a separate method, which I have at my other post here. Fortunately going back to a number is much easier as the Java implementation provides a BigInteger(int sign, byte[] value) constructor that reads in an unsigned number and skips leading zero bytes.
Having a standardized and statically sized octet string can be terribly useful, so I would not go for any other scheme.
This leaves the conversion to and from text. For that you can (indeed) use the java.util.Base64 class, which doesn't need much explaining. The only note that I must make is that it converts to an ASCII byte[] for some of the methods, so you need to use the encodeToString(byte[] src) instead.
Another method would be hexadecimal, but Java's base classes had no hex encoder for byte arrays until java.util.HexFormat arrived in Java 17, so I'd go for base 64 instead.
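The two steps described above could be sketched roughly as follows. The 16-byte size used in main is an arbitrary assumption for the example; in real RSA it would be the byte length of the modulus. The class and method names are hypothetical and this is not the code from the linked post.

```java
import java.math.BigInteger;
import java.util.Base64;

// Hypothetical sketch: number -> fixed-size unsigned big-endian bytes -> Base64 text.
class RsaTextCodec {

    // Encodes a non-negative number as exactly `size` bytes, unsigned and big endian.
    static byte[] toFixedLength(BigInteger value, int size) {
        if (value.signum() < 0) {
            throw new IllegalArgumentException("value must be non-negative");
        }
        byte[] raw = value.toByteArray(); // may carry one leading 0x00 sign byte
        int start = (raw.length > size && raw[0] == 0) ? 1 : 0;
        int len = raw.length - start;
        if (len > size) {
            throw new IllegalArgumentException("value does not fit in " + size + " bytes");
        }
        byte[] out = new byte[size];
        System.arraycopy(raw, start, out, size - len, len); // left-pad with zeros
        return out;
    }

    static String encode(BigInteger value, int size) {
        return Base64.getEncoder().encodeToString(toFixedLength(value, size));
    }

    static BigInteger decode(String text) {
        // Signum 1 makes BigInteger read the bytes as an unsigned magnitude.
        return new BigInteger(1, Base64.getDecoder().decode(text));
    }

    public static void main(String[] args) {
        BigInteger cipher = new BigInteger("23805663430659911910");
        String text = encode(cipher, 16);
        System.out.println(text);
        System.out.println(decode(text).equals(cipher));
    }
}
```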
I have found the answer. In case you've found this looking for the answer, you just need to encode the numbers into Base64.
The following code converts the number into a dynamically sized, signed, big endian encoded integer, and then converts it back into a number using the reverse process.
import java.math.BigInteger;
import java.util.Base64;

// Encode
BigInteger numbers = new BigInteger("5109763");
// encodeToString avoids a detour through the platform default charset
String encoded = Base64.getEncoder().encodeToString(numbers.toByteArray()); // Encoded value
// Decode
byte[] decoded_bytes = Base64.getDecoder().decode(encoded);
BigInteger numbers_again = new BigInteger(decoded_bytes); // Original numbers

I wanted to Convert any length String to fixed 32 Bytes

I want to convert any length of String to byte32 in Java.
Code
String s="9c46267273a4999031c1d0f7e40b2a59233ce59427c4b9678d6c3a4de49b6052e71f6325296c4bddf71ea9e00da4e88c4d4fcbf241859d6aeb41e1714a0e";
//Convert into byte32
From the comments it became clear that you want to reduce the storage space of that string to 32 bytes.
The given string can easily be compressed from the 124 bytes to 62 bytes by doing a hexadecimal conversion.
However, there is no algorithm, and there never will be one, that can losslessly compress arbitrary data to 32 bytes: there are far more possible inputs than there are 32-byte values, so some inputs would have to share an output. Imagine that were possible: it would have been implemented and you would be able to get ZIP files of just 32 bytes for any file you compress.
So, unfortunately, the answer is: it's not possible.
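The hexadecimal conversion mentioned above can be sketched like this, assuming the input contains only hex digits. It uses only Character.digit, so it does not depend on a particular Java version; the class name is made up.

```java
// Hypothetical helper: turn pairs of hex digits into bytes, halving the storage size.
class HexDecoder {

    static byte[] decode(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit in byte " + i);
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "9c46267273a4999031c1d0f7e40b2a59233ce59427c4b9678d6c3a4de49b6052"
                 + "e71f6325296c4bddf71ea9e00da4e88c4d4fcbf241859d6aeb41e1714a0e";
        System.out.println(s.length() + " characters -> " + decode(s).length + " bytes");
    }
}
```

The result is 62 bytes, which is still almost twice the 32-byte target.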
You can not convert any length string to a byte array of length 32.
Java uses UTF-16 as its string encoding, so in order to store 100% of the string, 1:1 as a fixed length byte array, you would at first glance be limited to 16 characters.
If you are willing to live with the limitation of 16 characters, byte[] bytes = s.getBytes(); gives you a variable length byte array, but it's best to specify an explicit encoding, e.g. byte[] array2 = str.getBytes("UTF-16"); (note that plain "UTF-16" prepends a two-byte byte order mark, which counts against the 32 bytes; "UTF-16LE" and "UTF-16BE" do not).
This doesn't completely solve your problem. You will now likely have to check that the byte array doesn't exceed 32 bytes, and come up with strategies for padding and possible null termination (which may eat into your character budget).
Now, if you don't need the entire UTF-16 string space that Java uses for strings by default, you can get away with longer strings, by using other encodings.
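A sketch of the length check and padding described above. UTF-8 is used here as one example of another encoding, and right-padding with zero bytes is just one possible strategy; both are assumptions of this sketch, and any external standard you target may require something different.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: fit a short string into exactly 32 bytes or fail loudly.
class Bytes32 {
    static final int SIZE = 32;

    static byte[] fromString(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        if (encoded.length > SIZE) {
            throw new IllegalArgumentException(
                    "needs " + encoded.length + " bytes, but only " + SIZE + " are available");
        }
        return Arrays.copyOf(encoded, SIZE); // right-pads with zero bytes
    }

    public static void main(String[] args) {
        byte[] padded = fromString("abc");
        System.out.println(padded.length + " bytes: " + Arrays.toString(padded));
    }
}
```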
If this is to be used for any kind of other standard or something (I see references to Ethereum being thrown around), then you will need to follow their standards.
Unless you are writing your own library for dealing with it directly, I highly recommend using a library that already exists, and appears to be well tested, and used.
You can get the string's bytes with the following call (the result is as long as the encoded string, not a fixed 32 bytes):
byte[] bytes = s.getBytes();

clarification on using OutputStream in Java?

I was reading through this article. It has this following snippet
OutputStream output = new FileOutputStream("c:\\data\\output-text.txt");
while(moreData) {
int data = getMoreData();
output.write(data);
}
output.close();
It is mentioned:
OutputStreams are used for writing byte based data, one byte at a time. The write() method of an OutputStream takes an int which contains the byte value of the byte to write.
Let's say I am writing the string Hello World to the file, so each character in the string gets converted to an int by the getMoreData() method. How does it get written: as a character or as a byte in output-text.txt? If it gets written as bytes, what is the advantage of writing bytes if I have to "reconvert" them to characters?
Each character (and almost anything stored on a file) is a byte / bytes. For example:
Lowercase 'a' is written as one byte with decimal value 97.
Number '1' is written as one byte with decimal value 49
There's no more concept of data types once the information is written into a file, everything is just a stream of bytes. What's important is the encoding used to store the information into the file
Have a look at ascii table, which is very useful for beginners learning information encoding.
To illustrate this, create a file containing the text 'hello world'
$ echo 'hello world' > hello.txt
Then output the bytes written to the file using od command:
$ od -td1 hello.txt
0000000 104 101 108 108 111 32 119 111 114 108 100 10
0000014
The above means that at address 0000000 from the start of the file, I see one byte with decimal value 104 (which is character 'h'), then one byte with decimal value 101 (which is character 'e'), and so on.
The article is incomplete, because an OutputStream has overloaded methods for write that take a byte[], a byte[] along with offset and length arguments, or a single int.
In the case of writing a String to a stream when the only interface you have is OutputStream (say you don't know what the underlying implementation is), it would be much better to use output.write(string.getBytes()). Iteratively peeling off a single int at a time and writing it to the file is going to perform horribly compared to a single call to write that passes an array of bytes.
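A minimal sketch of the single-call approach. The charset is named explicitly here (UTF-8 is an assumption of this example) so the bytes do not depend on the platform default, and a ByteArrayOutputStream stands in for the file. The class and method names are made up.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo: write a whole string to an OutputStream with one call.
class StringToStream {

    static void writeString(OutputStream out, String text) throws IOException {
        out.write(text.getBytes(StandardCharsets.UTF_8)); // one call, whole array
    }

    // Convenience wrapper that captures what was written in memory.
    static byte[] toBytesViaStream(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            writeString(sink, text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(toBytesViaStream("Hello World")));
    }
}
```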
Streams operate on bytes and simply read/write raw data.
Readers and writers interpret the underlying bytes as characters using character sets such as UTF-8 or US-ASCII. This means they may take single-byte characters (ASCII) and convert the data into UTF-16 strings.
Streams use bytes; readers/writers use characters and strings.
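To illustrate the difference, the sketch below wraps a byte stream in an OutputStreamWriter, which turns characters into bytes using the given charset. UTF-8 and the in-memory sink are choices made for this example, and the class name is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo: a Writer encodes characters, the stream underneath only sees bytes.
class WriterVsStream {

    static byte[] encodeWithWriter(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray(); // read after the writer has been closed and flushed
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encodeWithWriter("a")));      // one byte
        System.out.println(Arrays.toString(encodeWithWriter("\u0142"))); // 'ł' takes two bytes in UTF-8
    }
}
```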
The java.io.OutputStream class is the superclass of all classes representing an output stream of bytes. When bytes are written to an OutputStream, it may not write them immediately; instead, the write method may put the bytes into a buffer.
There are methods to write as mentioned below:
void write(byte[] b)
This method writes b.length bytes from the specified byte array to this output stream.
void write(byte[] b, int position, int length)
This method writes length bytes from the specified byte array starting at offset position to this output stream.
void write(int b)
This method writes the specified byte to this output stream.
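The three overloads listed above can be exercised with a short sketch like the following, which writes into an in-memory stream so the result is easy to inspect. The class name and sample values are illustrative.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical demo of the three OutputStream.write overloads.
class WriteOverloads {

    static byte[] demo() {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = sink) {
            byte[] data = {104, 105}; // "hi"
            out.write(data);          // whole array: 104, 105
            out.write(data, 1, 1);    // one byte starting at offset 1: 105
            out.write(353);           // only the low 8 bits are written: 353 & 0xff == 97
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo())); // prints [104, 105, 105, 97]
    }
}
```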
