I want to encode every file with a Huffman code.
I have already found the bit length of each symbol's Huffman code.
Is it possible to encode a character into a file in Java: are there any existing classes that read and write to a file bit by bit, rather than with a minimum unit of a char or byte?
You could create a BitSet to store your encoding as you are creating it and simply write the String representation to a file when you are done.
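For example (a minimal sketch; instead of the String form, on Java 7+ you could also write the more compact bitSet.toByteArray()):

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.BitSet;

public class BitSetDemo {
    public static void main(String[] args) throws IOException {
        BitSet bits = new BitSet();
        // Suppose the (made-up) Huffman code of the current symbol is 101:
        bits.set(0);   // bit 0 = 1 (bit 1 stays 0)
        bits.set(2);   // bit 2 = 1
        try (FileOutputStream out = new FileOutputStream("encoded.bin")) {
            out.write(bits.toByteArray());  // packs the bits into bytes
        }
    }
}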
You really don't want to write single bits to a file, believe me. Usually we define a byte buffer, build the "file" in memory and, after all the work is done, write out the complete buffer. Otherwise it would take nearly forever.
If you need a fast bit vector, then have a look at the Colt library. It's pretty convenient if you want to write single bits and don't want to do all the bit-shifting operations on your own.
I'm sure there are Huffman classes out there, but I'm not immediately aware of where they are. If you want to roll your own, two ways to do this spring to mind immediately.
The first is to assemble the bit strings in memory by using mask and shift operators, accumulate the bits into larger data objects (i.e. ints or longs), and then write those out to file with standard streaming.
The second, more ambitious and self-contained idea is to write an implementation of OutputStream that has a method for writing a single bit; this OutputStream class would do the aforementioned buffering/shifting/accumulating itself and pass the results down to a second, wrapped OutputStream.
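A sketch of that second idea (the class and its writeBit method are made up here, not an existing JDK API):

import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class BitOutputStream extends FilterOutputStream {
    private int current = 0;  // bits accumulated so far
    private int count = 0;    // how many bits 'current' holds

    public BitOutputStream(OutputStream out) { super(out); }

    public void writeBit(int bit) throws IOException {
        current = (current << 1) | (bit & 1);
        if (++count == 8) {        // a full byte is ready
            out.write(current);
            current = 0;
            count = 0;
        }
    }

    @Override
    public void close() throws IOException {
        while (count != 0) writeBit(0);  // pad the last byte with zeros
        super.close();
    }
}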
Try writing a bit vector in Java to do the bit representation: it should allow you to set/reset the individual bits in a bit stream.
The bit stream can then hold your Huffman encoding. This is the best approach, and lightning fast too.
Huffman sample analysis here
You can find a working (and fast) implementation here: http://code.google.com/p/kanzi/source/browse/src/kanzi/entropy/HuffmanTree.java
I am writing a DES-like block cipher in Java. The cipher works with 64-bit blocks, and I'm having a tough time deciding how to partition the data so that it's usable. In case you're wondering, the data will be coming from a file, and I'm just going to pad it with zeroes to the nearest multiple of 64. Here's what I've been thinking about.
Store an array of longs.
With an array of longs I can traverse each block in the fewest steps. But will the logical operations, like XOR, work properly? Also, when I have to split the 64-bit block into two 32-bit halves, should I convert to ints or just keep using longs? And then there is the sign to worry about, but I think I could use the Long class to fix that.
Store an array of byte arrays.
This was my initial idea, but I'm seeing the limitations now. I would have to work with 8 elements per array rather than just one with the array of longs. This might not even matter; I don't know.
BitSets.
I saw these and thought they were the answer I've been looking for, but when I started using them I realized that they are not suited to the problem at hand and a lot of the methods don't actually do what I thought they would do.
I'm wondering how someone more experienced would do this. I think longs are the way to go, but I'm wondering if all the arithmetic will work. Am I on the right track or is there a better way?
Use a data structure that fits your need the best.
If you never want to split your values, then use long. If you need to split your data into two halves, then use int.
If you need more control over your data, you should go with byte[]. But because the internal representation is not an issue for you (you are using Java), there is no need to use byte[] internally.
When it comes to communication with other computers (e.g. via a network socket or a file), the byte order may be important. Then it would be better to use a byte[], as it gives you better control over the byte order.
A BitSet is for other use cases and is not well suited to encryption.
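As to the asker's specific worries: XOR and the other bitwise operators work on all 64 bits of a long (the sign bit gets no special treatment), and splitting into 32-bit halves is just shifts and masks. A quick sketch:

public class LongOps {
    public static void main(String[] args) {
        long a = 0x0123456789ABCDEFL;
        long b = 0xFEDCBA9876543210L;
        long x = a ^ b;                // XOR operates on all 64 bits

        int left  = (int) (x >>> 32);  // high half; >>> avoids sign extension
        int right = (int) x;           // low half
        long rejoined = ((long) left << 32) | (right & 0xFFFFFFFFL);  // mask undoes sign extension

        System.out.println(x == rejoined);  // true
    }
}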
You should use the most efficient primitive type for your cipher. So if you primarily use 64-bit instructions, go for long. If you primarily use 32-bit instructions, then int is probably the best type. I'll let you guess the types for 16- and 8-bit operations.
Note that you should not present this interface directly to the outside world. Instead you should use an interface based on byte arrays (just like, e.g., Cipher). You don't want to confront your users with a ton of grief regarding byte order, signed/unsigned values, etc. Besides that, ciphers are usually defined for messages of a specific size in bits or bytes.
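A sketch of that boundary conversion, using ByteBuffer so the byte order stays explicit (the method names here are made up; the input is assumed to be padded to a multiple of 8 bytes already):

import java.nio.ByteBuffer;

public class Blocks {
    static long[] toBlocks(byte[] message) {        // byte[] at the interface...
        ByteBuffer buf = ByteBuffer.wrap(message);  // big-endian by default
        long[] blocks = new long[message.length / 8];
        for (int i = 0; i < blocks.length; i++) blocks[i] = buf.getLong();
        return blocks;                              // ...longs internally
    }

    static byte[] toBytes(long[] blocks) {
        ByteBuffer buf = ByteBuffer.allocate(blocks.length * 8);
        for (long block : blocks) buf.putLong(block);
        return buf.array();
    }
}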
Certainly do not use BitSet. It's a horrible (unbounded) interface with many peculiarities. It is absolutely not fit for this kind of operation.
I'm working through the problems in Programming Pearls, 2nd edition, Column 1. One of the problems involves writing a program that uses only around 1 megabyte of memory to store the contents of a file as a bit array with each bit representing whether or not a 7 digit number is present in the file. Since Java is the language I'm the most familiar with, I've decided to use it even though the author seems to have had C and C++ in mind.
Since I'm pretending memory is limited for the purpose of the problem I'm working on, I'd like to make sure the process of reading the file has no buffering at all.
I thought InputStreamReader would be a good solution, until I read this in the Java documentation:
To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
Ideally, only the bytes that are necessary would be read from the stream -- in other words, I don't want any buffering.
One of the problems involves writing a program that uses only around 1 megabyte of memory to store the contents of a file as a bit array with each bit representing whether or not a 7 digit number is present in the file.
This implies that you need to read the file as bytes (not characters).
Assuming that you do have a genuine requirement to read from a file without buffering, then you should use the FileInputStream class. It does no buffering. It reads (or attempts to read) precisely the number of bytes that you asked for.
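A sketch of that, combined with the bit array for the 7-digit-number problem (10^7 bits is about 1.25 MB; this assumes unsigned decimal numbers separated by non-digit characters):

import java.io.FileInputStream;
import java.io.IOException;

public class BitArrayLoad {
    public static void main(String[] args) throws IOException {
        int[] bitArray = new int[10_000_000 / 32 + 1];  // one bit per possible 7-digit number
        try (FileInputStream in = new FileInputStream("numbers.txt")) {
            int b, value = 0;
            boolean inNumber = false;
            while ((b = in.read()) != -1) {     // exactly one byte per call, no read-ahead
                if (b >= '0' && b <= '9') {
                    value = value * 10 + (b - '0');
                    inNumber = true;
                } else if (inNumber) {          // delimiter reached: record the number
                    bitArray[value / 32] |= 1 << (value % 32);
                    value = 0;
                    inNumber = false;
                }
            }
            if (inNumber) bitArray[value / 32] |= 1 << (value % 32);  // trailing number
        }
    }
}

Note that each read() is a native call, so this is slow; that is the price of genuinely unbuffered reading.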
If you then need to convert those bytes to characters, you could do this by applying the appropriate String constructor to the byte[]. Note that for multibyte character encodings such as UTF-8, you would need to read sufficient bytes to complete each character. Doing that without the possibility of read-ahead is a bit tricky ... and entails "knowledge" of the character encoding you are reading.
(You could avoid that knowledge by using a CharsetDecoder directly. But then you'd need to use the decode method that operates on Buffer objects, and that is a bit complicated too.)
For what it is worth, Java makes a clear distinction between stream-of-byte and stream-of-character I/O. The former is supported by InputStream and OutputStream, and the latter by Reader and Writer. The InputStreamReader class is a Reader that adapts an InputStream. You should not be considering it for an application that wants to read stuff byte-wise.
I am writing my own image compression program in Java. I have entropy-encoded data stored in multiple arrays which I need to write to a file. I am aware of different ways to write to a file, but I would like to know what needs to be taken into account when trying to use the least possible amount of storage. For example: what character set should I use (I just need to write positive and negative numbers)? Would I be able to write less than 1 byte to a file? Should I be using Scanners/BufferedWriters, etc.? Thanks in advance; I can provide more information if needed.
Read the Java tutorial about IO.
You should
not use Writers and character sets, since you want to write binary data
use a buffered stream to avoid too many native calls and make the write fast
not use Scanners, as they're used to read data, and not write data
And no, you won't be able to write less than a byte to a file. The byte is the smallest amount of information that can be stored in a file.
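Putting that advice together, a minimal sketch (encodedData stands in for one of your entropy-coded arrays):

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class WriteEncoded {
    public static void main(String[] args) throws IOException {
        byte[] encodedData = {12, -3, 47, -120};  // stand-in for your real data
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream("encoded.dat")))) {
            out.write(encodedData);  // raw bytes, no character encoding involved
        }
    }
}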
Compression is almost always more expensive than file IO. You shouldn't worry about the speed of your writes unless you know it's a bottleneck.
I am writing my own image compression program in Java. I have entropy-encoded data stored in multiple arrays which I need to write to a file. I am aware of different ways to write to a file, but I would like to know what needs to be taken into account when trying to use the least possible amount of storage.
Write the data in a binary format and it will be the smallest. This is why almost all image formats use binary.
For example: what character set should I use (I just need to write positive and negative numbers)?
Character encoding is for encoding characters, i.e. text. You generally don't use these in binary formats (unless they contain some text, which you are unlikely to need initially).
Would I be able to write less than 1 byte to a file?
Technically a file can use less than the block size on disk, e.g. 512 bytes or 4 KB. You can write any amount smaller than this, but it doesn't use less space, nor would it really matter if it did, because the amount of disk space involved is too small to worry about.
Should I be using Scanners/BufferedWriters, etc.?
No. These are for text.
Instead use DataOutputStream and DataInputStream, as these are for binary data.
what character set should I use
You would need to write your data as bytes, not chars, so forget about character sets.
would I be able to write less than 1 byte to a file
No, this would not be possible. But to produce the bit stream the decoder expects, you might need to pack a byte from, say, 5 bits of one value and 3 bits of the next before writing that byte to the file.
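For example, to pack a 5-bit value and a 3-bit value into one byte (a sketch with made-up values):

import java.io.ByteArrayOutputStream;

public class PackBits {
    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int high = 0b10110;              // a 5-bit value
        int low  = 0b011;                // a 3-bit value
        int packed = (high << 3) | low;  // bit layout: 10110 011
        out.write(packed);               // write(int) stores the low 8 bits
    }
}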
I need to compress strings (written in a known but variable language) of anywhere from 10 to 1000 characters into individual UDP packets.
What compression algorithms available in Java are well suited to this task?
Are there maybe open source Java libraries available to do this?
"It depends".
I would start with just the primary candidates: LZMA ("7-zip"), deflate (direct; zlib: deflate + small wrapper; gzip: deflate + slightly larger wrapper; zip: deflate + even larger wrapper), bzip2 (I doubt this would be that good here, as it works best with a relatively large window), and perhaps even one of the other LZ* branches such as LZS, which has an RFC for IP payload compression, but...
...run some analysis based upon the actual data and the compression/throughput of several different approaches. Java has both GZIPOutputStream ("deflate in a gzip wrapper") and DeflaterOutputStream ("plain deflate", recommended over the gzip or zip "wrappers") as standard, and there are LZMA Java implementations (you just need the compressor, not the container), so these should all be trivial to mock up.
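A quick sketch of plain deflate without any wrapper (the nowrap flag skips the zlib header and checksum, which matters at these sizes; the sample text is made up):

import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class RawDeflate {
    public static void main(String[] args) {
        byte[] input = "the quick brown fox jumps over the lazy dog".getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION, true);  // true = raw deflate
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[input.length + 64];  // ample for payloads this small
        int compressedLength = deflater.deflate(buffer);
        deflater.end();
        System.out.println(input.length + " -> " + compressedLength + " bytes");
    }
}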
If there is regularity between the packets then it is possible this could be utilized -- e.g. by building cache mappings or Huffman tables, or just modifying the "window" of one of the other algorithms -- but packet loss and "de-compressibility" likely need to be accounted for. Going down this route, though, adds far more complexity. More ideas for helping out the compressor may be found at SO: How to find a good/optimal dictionary for zlib 'setDictionary' when processing a given set of data?.
Also, the protocol should likely have a simple zero-compression "fall back", because some data [especially small random data] might not be practically compressible, or might "compress" to a larger size (zlib actually has this guard, but it also has the "wrapper overhead", so very small data would be better encoded separately). The overhead of the "wrapper" for the compressed data -- such as gzip or zip -- also needs to be taken into account at such small sizes. This is especially important to consider for string data of less than ~100 characters.
Happy coding.
Another thing to consider is the encoding used to shove the characters into the output stream. I would first start with UTF-8, but that may not always be ideal.
See SO: Best compression algorithm for short text strings which suggests SMAZ, but I do not know how this algorithm will transfer to unicode / binary.
Also consider that not all deflate (or other format) implementations are created equal. I am not privy to how Java's standard deflate compares to a 3rd party's (say JZlib) in terms of efficiency for small data, but consider Compressing Small Payloads [.NET], which shows rather negative numbers for "the same compression" format. The article also ends nicely:
...it’s usually most beneficial to compress anyway, and determine which payload (the compressed or the uncompressed one) has the smallest size and include a small token to indicate whether decompression is required.
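That token can be as simple as a one-byte flag in front of the payload (a sketch; how you obtain the compressed bytes is up to whichever compressor wins your benchmarks):

// Frame a packet: flag 1 if the compressed form is smaller, else flag 0 with the raw bytes.
static byte[] frame(byte[] raw, byte[] compressed) {
    boolean useCompressed = compressed.length < raw.length;
    byte[] payload = useCompressed ? compressed : raw;
    byte[] packet = new byte[payload.length + 1];
    packet[0] = (byte) (useCompressed ? 1 : 0);  // token read by the receiver
    System.arraycopy(payload, 0, packet, 1, payload.length);
    return packet;
}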
My final conclusion: always test using real-world data and measure the benefits, or you might be in for a little surprise in the end!
Happy coding. For real this time.
The simplest thing to do would be to layer a GZIPOutputStream on top of a ByteArrayOutputStream, as that is built into the JDK, using
ByteArrayOutputStream baos = new ByteArrayOutputStream();
GZIPOutputStream zos = new GZIPOutputStream(baos);
zos.write(someText.getBytes("UTF-8")); // be explicit about the charset
zos.finish(); // flushes the remaining compressed bytes into baos
byte[] udpBuffer = baos.toByteArray();
There may be other algorithms that do a better job, but I'd try this first to see if it fits your needs, as it doesn't require any extra jars and does a pretty good job.
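On the receiving side the counterpart is a GZIPInputStream over the datagram's bytes, continuing the snippet above:

ByteArrayInputStream bais = new ByteArrayInputStream(udpBuffer);
GZIPInputStream zis = new GZIPInputStream(bais);
ByteArrayOutputStream received = new ByteArrayOutputStream();
byte[] buf = new byte[1024];
for (int n; (n = zis.read(buf)) != -1; ) received.write(buf, 0, n);
String text = new String(received.toByteArray(), "UTF-8");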
Most standard compression algorithms don't work so well with small amounts of data. Often there is a header and a checksum, and it takes time for the compression to warm up, i.e. to build a data dictionary based on the data it has seen.
For this reason you can find that
small packets may end up smaller, or the same size, with no compression at all.
a simple application/protocol specific compression is better
you have to provide a prebuilt data dictionary to the compression algorithm and strip out the headers as much as possible.
I usually go with the second option for small data packets; the third is sketched below.
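For the prebuilt-dictionary option, java.util.zip.Deflater supports a preset dictionary directly (a sketch; the dictionary and payload contents are made up, and the receiver's Inflater must be primed with the same dictionary bytes):

import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class DictDeflate {
    public static void main(String[] args) {
        // Made-up dictionary: byte sequences that commonly occur in your packets.
        byte[] dictionary = "temperature=;humidity=;station=".getBytes(StandardCharsets.UTF_8);
        byte[] payload = "temperature=21;humidity=40;station=7".getBytes(StandardCharsets.UTF_8);

        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setDictionary(dictionary);  // call Inflater.setDictionary(...) on the other end
        deflater.setInput(payload);
        deflater.finish();
        byte[] out = new byte[payload.length + 64];
        int n = deflater.deflate(out);
        deflater.end();
        System.out.println(payload.length + " -> " + n + " bytes");
    }
}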
A good compression algorithm for short strings/URLs is this LZW implementation; it is in Java and can easily be ported for client-side GWT:
https://code.google.com/p/lzwj/source/browse/src/main/java/by/dev/madhead/lzwj/compress/LZW.java
Some remarks:
use a 9-bit code word length for small strings (though you may experiment to see which works better). The compression ratio ranges from 1 (for very small strings, where the compressed output is no larger than the original) down to 0.5 (for larger strings).
in the case of client-side GWT, for other code word lengths it was necessary to adjust the input/output processing to work on a per-byte basis, to avoid bugs when buffering the bit sequence into a long, which is emulated in JS.
I'm using it for encoding complex URL parameters in client-side GWT, together with Base64 encoding and AutoBean serialization to JSON.
Update: the Base64 implementation is here: http://www.source-code.biz/base64coder/java
You have to change it to make it URL-safe, i.e. replace the following characters (see the snippet after this list):
'+' -> '-'
'/' -> '~'
'=' -> '_'
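In Java the substitution is a pair of one-liners (encoded being the Base64 output):

String urlSafe = encoded.replace('+', '-').replace('/', '~').replace('=', '_');
String restored = urlSafe.replace('-', '+').replace('~', '/').replace('_', '=');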
What is the best way to change a single byte in a file using Java? I've implemented this in several ways. One uses pure byte-array manipulation, but this is highly sensitive to the amount of memory available and doesn't scale past 50 MB or so (i.e. I can't allocate 100 MB worth of byte[] without getting OutOfMemory errors). I also implemented it another way which works and scales, but it feels quite hacky.
If you're a Java I/O guru, and you had to contend with very large files (200-500 MB), how might you approach this?
Thanks!
I'd use RandomAccessFile, seek to the position I wanted to change and write the change.
If all I wanted to do was change a single byte, I wouldn't bother reading the entire file into memory. I'd use a RandomAccessFile, seek to the byte in question, write it, and close the file.
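A sketch (the file name, offset, and new value are placeholders):

import java.io.IOException;
import java.io.RandomAccessFile;

public class PatchByte {
    public static void main(String[] args) throws IOException {
        long offset = 123_456_789L;  // placeholder position of the byte to change
        try (RandomAccessFile file = new RandomAccessFile("big.dat", "rw")) {
            file.seek(offset);   // no need to read the rest of the file
            file.write(0x42);    // write(int) replaces exactly one byte in place
        }
    }
}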