I am using GZIPOutputStream in my Java program to compress big strings, and finally storing them in a database.
I can see that while compressing English text, I am achieving a compression ratio of 1/4 to 1/10 (depending on the string value). So, for example, if my original English text is 100 KB, then on average the compressed text will be somewhere around 30 KB.
But when I am compressing Unicode characters, the compressed string actually occupies more bytes than the original string. Say, for example, my original Unicode string is 100 KB; then the compressed version comes out to 200 KB.
Unicode string example: "嗨,这是,短信计数测试持续for.Hi这是短"
Can anyone suggest how I can achieve compression for Unicode text as well, and why the compressed version is actually bigger than the original version?
My compression code in Java:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
GZIPOutputStream zos = new GZIPOutputStream(baos);
zos.write(text.getBytes("UTF-8"));
zos.finish();
zos.flush();
byte[] udpBuffer = baos.toByteArray();
Java's GZIPOutputStream uses the Deflate compression algorithm to compress data. Deflate is a combination of LZ77 and Huffman coding. According to Unicode's Compression FAQ:
Q: What's wrong with using standard compression algorithms such as Huffman coding or patent-free variants of LZW?
A: SCSU bridges the gap between an 8-bit based LZW and a 16-bit encoded Unicode text, by removing the extra redundancy that is part of the encoding (sequences of every other byte being the same) and not a redundancy in the content. The output of SCSU should be sent to LZW for block compression where that is desired.
To get the same effect with one of the popular general purpose algorithms, like Huffman or any of the variants of Lempel-Ziv compression, it would have to be retargeted to 16-bit, losing effectiveness due to the larger alphabet size. It's relatively easy to work out the math for the Huffman case to show how many extra bits the compressed text would need just because the alphabet was larger. Similar effects exist for LZW. For a detailed discussion of general text compression issues see the book Text Compression by Bell, Cleary and Witten (Prentice Hall 1990).
I was able to find this set of Java classes for SCSU compression on the Unicode website, which may be useful to you; however, I couldn't find a .jar library that you could easily import into your project, though you can probably package them into one if you like.
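As a quick sanity check on the overhead discussed above: for text this short, GZIP's fixed costs (the header/trailer plus the Deflate code tables) can already outweigh any savings, whatever the script. A minimal sketch that just prints the raw UTF-8 size next to the gzipped size for the example string from the question:

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipSizeCheck {
    public static void main(String[] args) throws Exception {
        String text = "嗨,这是,短信计数测试持续for.Hi这是短";
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream zos = new GZIPOutputStream(baos)) {
            zos.write(raw); // closing the stream finishes the GZIP member
        }
        byte[] compressed = baos.toByteArray();

        // For short, low-redundancy text the compressed output is usually larger:
        // ~18 bytes of GZIP header/trailer plus the Deflate block overhead.
        System.out.println("raw UTF-8 bytes: " + raw.length);
        System.out.println("gzipped bytes:   " + compressed.length);
    }
}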
I don't really know Chinese, but as far as I know, GZIP compression depends on repeating sequences of text, and those repeating sequences are replaced with "references" to an earlier occurrence (this is a very high-level explanation). This means that if you have the word "library" in 20 places in a string, the algorithm will store the word "library" once and then note that it should appear at positions x, y, z... So you might not have a lot of redundancy in your original string, and therefore you cannot save a lot. Instead, you get more overhead than savings.
I'm not really a compression expert, and I don't know the details, but this is the basic principle of the compression.
P.S.
This question might just be a duplicate of: Why gzip compressed buffer size is greater then uncompressed buffer?
Related
So I have been implementing Huffman compression on a bunch of different types of files (.jpg, .txt, .docx), and I've noticed that sometimes the compressed file is almost the same size as the original (e.g. 251,339 KB -> 250,917 KB, without the header!). I'm pretty sure my code is solid, though I'm not sure if this result is right or not. The thing I've noticed is that the character frequencies are very similar: for example, I'll have 10 characters that all have 65 occurrences, then another 10 characters with 66 occurrences, then another 10 with 67, and so on. And because the files are big, the compressed character codes end up having the same size as the original representation, or even larger (9 bits). Is this normal when compressing using Huffman?
When encoding with Huffman, divide your file into smaller chunks under the covers. The idea is that smaller chunks will have more bias than a giant file that averages everything out. For example, one chunk may have lots of 0x00s in it; another chunk may have lots of 0xFFs, and so on. Compressing each chunk with the Huffman algorithm will then pick up those biases to its advantage. Of course, if the chunks are too small then the Huffman code table will be a larger fraction of the compressed chunk and you lose the chunking benefits. In the Deflate case, code tables are on the order of 50-100 bytes.
Of course, as the other respondents commented, if your source files are already compressed (JPEG etc.) you are not going to find any biases or redundancies, however you chunk it.
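A rough way to experiment with this trade-off, using java.util.zip.Deflater as a stand-in for a plain Huffman coder (Deflate's entropy stage is Huffman coding, with the 50-100 byte code tables mentioned above); the 64 KB chunk size is just an arbitrary value to play with:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.Deflater;

public class ChunkingExperiment {
    static int deflatedSize(byte[] data, int off, int len) {
        Deflater d = new Deflater(Deflater.BEST_COMPRESSION, true); // raw deflate, no zlib/gzip wrapper
        d.setInput(data, off, len);
        d.finish();
        byte[] buf = new byte[8192];
        int total = 0;
        while (!d.finished()) {
            total += d.deflate(buf);  // we only need the size, so the buffer is reused
        }
        d.end();
        return total;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        int chunk = 64 * 1024;  // arbitrary chunk size to experiment with

        int whole = deflatedSize(data, 0, data.length);
        int chunked = 0;
        for (int off = 0; off < data.length; off += chunk) {
            chunked += deflatedSize(data, off, Math.min(chunk, data.length - off));
        }
        System.out.println("whole file: " + whole + " bytes");
        System.out.println("per chunk:  " + chunked + " bytes (includes one code table per chunk)");
    }
}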
I visited this website,
https://xcode.darkbyte.ru/
Basically the website takes a text as Input and generates an Image.
It also takes an image as input and decodes it back to text.
I really wish to know what this is called and how it is done.
I'd like to know the algorithm [preferably in Java].
Please help. Thanks in advance.
There are many ways to encode a text (series of bytes) as an image, but the site you quoted does it in a pretty simple and straightforward way. And you can reverse-engineer it easily:
Up to 3 chars are coded as 1 pixel; 4 chars as 2 pixels -- we learn from this that only the R(ed), G(reen) and B(lue) channels of each pixel are used (and not the alpha/transparency channel).
We know PNG supports 8 bits per channel, and each ASCII char is 8 bits wide. Let's test whether the first char (first 8 bits) is stored in the red channel.
Let's try "z..". Since z is relatively high in the ASCII table (122) and . is relatively low (46), we expect to get a reddish 1x1 PNG. And we do.
Let's try ".z.". It should be greenish. And it is.
Similarly, for "..z" we get a bluish pixel.
Now let's see what happens with non-ASCII input. Try entering: ① (Unicode char \u2460). The site HTML-encodes the string into &#9312; and then encodes that ASCII text into the image as before.
Compression. When entering a larger amount of text, we notice the output is shorter than expected. This means the back-end is running some compression algorithm on the raw input before (or after?) encoding it as an image. By noticing that the resolution of the image and its maximum information content (HxWx3x8 bits) are smaller than the input, we can conclude the compression is done before encoding to the image, and not after (thus not relying on PNG compression). We could go further and try to detect which compression algorithm is used by encoding the raw input with the common culprits like Huffman coding, Lempel-Ziv, LZW, DEFLATE, even Brotli, and comparing the output with the bytes from the image pixels. (Note we can't detect it directly by inspecting a magic prefix; chances are the author stripped everything but the raw compressed data.)
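The site's exact pixel layout and compression step are of course unknown, but a minimal sketch of the 3-bytes-per-pixel idea described above (no compression, and the pixels simply laid out in a single row) might look like this:

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.nio.charset.StandardCharsets;

public class TextToImage {

    // Pack each group of up to 3 bytes into one RGB pixel.
    static void encode(String text, File out) throws Exception {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        int pixels = Math.max(1, (bytes.length + 2) / 3);  // 3 bytes per pixel
        BufferedImage img = new BufferedImage(pixels, 1, BufferedImage.TYPE_INT_RGB);
        for (int p = 0; p < pixels; p++) {
            int r = p * 3     < bytes.length ? bytes[p * 3]     & 0xFF : 0;
            int g = p * 3 + 1 < bytes.length ? bytes[p * 3 + 1] & 0xFF : 0;
            int b = p * 3 + 2 < bytes.length ? bytes[p * 3 + 2] & 0xFF : 0;
            img.setRGB(p, 0, (r << 16) | (g << 8) | b);
        }
        ImageIO.write(img, "png", out);
    }

    // Read the pixels back and strip the zero padding.
    static String decode(File in) throws Exception {
        BufferedImage img = ImageIO.read(in);
        byte[] bytes = new byte[img.getWidth() * img.getHeight() * 3];
        int i = 0;
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                bytes[i++] = (byte) ((rgb >> 16) & 0xFF);
                bytes[i++] = (byte) ((rgb >> 8) & 0xFF);
                bytes[i++] = (byte) (rgb & 0xFF);
            }
        }
        while (i > 0 && bytes[i - 1] == 0) i--;  // drop padding (assumes the text itself never ends in NUL bytes)
        return new String(bytes, 0, i, StandardCharsets.UTF_8);
    }
}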
I've done some research to write a GIF (version 87a) encoder, but there are some implementation specifics I can't find, especially about the data blocks.
First, I can only know how many bytes each block will have once I reach either 255 bytes or the end of the image, right? There's no way to tell in advance how many bytes I will need.
Second, as the GIF encoding is little-endian, how can I write the resulting integers of the LZW compression in Java, and how should I align them? I looked at ByteBuffer (I'm coding in Java), but since the integers won't necessarily fit in a byte, it won't work, right? How should I do it then?
At last, between data blocks, should I start a new LZW compression or just continue where I left off in the previous block?
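For reference on the bit-packing point: as far as I understand GIF87a, the variable-width LZW codes are packed least-significant-bit first into one continuous byte stream, and that stream is then simply cut into sub-blocks of at most 255 bytes, so the LZW state carries over between blocks. A rough sketch of such an LSB-first bit packer (the class and method names are made up for illustration):

import java.io.ByteArrayOutputStream;

// Packs variable-width LZW codes LSB-first into bytes, the order GIF expects.
class LsbBitWriter {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private int bitBuffer = 0;  // accumulates pending bits
    private int bitCount = 0;   // how many bits are waiting in bitBuffer

    void writeCode(int code, int codeSize) {
        bitBuffer |= code << bitCount;  // append the new bits above the pending ones
        bitCount += codeSize;
        while (bitCount >= 8) {         // flush whole bytes as they fill up
            out.write(bitBuffer & 0xFF);
            bitBuffer >>>= 8;
            bitCount -= 8;
        }
    }

    byte[] finish() {
        if (bitCount > 0) out.write(bitBuffer & 0xFF);  // pad the last partial byte with zeros
        return out.toByteArray();  // then split this into sub-blocks of at most 255 bytes
    }
}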
I am testing Huffman coding now, and I wanted to know which types of files (like .txt, .jpg, .mp3, etc.) compress well when they undergo Huffman-based compression. I implemented Huffman coding in Java and found that I get about a 40% size reduction for .txt files (ones with ordinary English text) and almost 0%-1% reduction on .jpg, .mp3, and .mp4 files (I haven't tested it on huge files above 1 MB, because my program is super slow). I understand that Huffman coding works best for files with more frequently occurring symbols, but I do not know what kind of symbols are in a video, audio, or image file, hence the question. Since I designed this program (for a school project; I did it on my own and am only asking for a few pointers for my research), I wanted to know where it would work well.
Thanks.
Note: I initially created this project only for .txt files and, to my surprise, it worked on all other types of files as well, hence I wanted to test it and had to ask this question. I found out that for image files you don't encode the symbols themselves, but rather some RGB values? Correct me if I am wrong.
It's all about the amount of redundancy in the file.
In any file each byte occupies 8 bits, allowing 256 distinct symbols per byte. In a text file, a relatively small number of those symbols are actually used, and the distribution of the symbols is not flat (there are more es than qs). Thus the information "density" is more like 5 bits per byte.
JPEG, MP3 and MP4 files are already compressed and have almost no redundancy. All 256 symbols are used, with about equal frequency, so the information "density" is very close to 8 bits per byte. You cannot compress them further.
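If you want to estimate that "density" for your own files, here's a rough sketch that computes the zero-order byte entropy (it ignores correlations between bytes, so it only approximates what a byte-wise Huffman coder could achieve):

import java.nio.file.Files;
import java.nio.file.Paths;

public class EntropyEstimate {
    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        if (data.length == 0) return;

        long[] freq = new long[256];
        for (byte b : data) freq[b & 0xFF]++;

        // Shannon entropy: -sum(p * log2(p)) over the byte values that actually occur.
        double entropy = 0.0;
        for (long f : freq) {
            if (f == 0) continue;
            double p = (double) f / data.length;
            entropy -= p * (Math.log(p) / Math.log(2));
        }
        System.out.printf("~%.2f bits of information per 8-bit byte%n", entropy);
    }
}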
I need to compress strings (written in a known but variable language) of anywhere from 10 to 1000 characters into individual UDP packets.
What compression algorithms available in Java are well suited to this task?
Are there maybe open source Java libraries available to do this?
"It depends".
I would start with just the primary candidates: LZMA ("7-zip"); deflate (direct; zlib: deflate + a small wrapper; gzip: deflate + a slightly larger wrapper; zip: deflate + an even larger wrapper); bzip2 (I doubt it would be that good here, as it works best with a relatively large window); and perhaps even one of the other LZ* branches like LZS, which has an RFC for IP payload compression, but...
...run some analysis based upon the actual data and the compression/throughput of several different approaches. Java has both GZIPOutputStream ("deflate in a gzip wrapper") and DeflaterOutputStream ("plain deflate", recommended over the gzip or zip "wrappers") as standard, and there are LZMA Java implementations (you just need the compressor, not the container), so these should all be trivial to mock up.
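For example, a trivial mock-up comparing the gzip wrapper against plain deflate on a single short placeholder string (the numbers will differ for your real data, which is exactly why it is worth measuring):

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPOutputStream;

public class WrapperOverhead {
    public static void main(String[] args) throws Exception {
        byte[] raw = "the quick brown fox jumps over the lazy dog".getBytes(StandardCharsets.UTF_8);

        // gzip wrapper: deflate plus a 10-byte header and an 8-byte CRC/length trailer
        ByteArrayOutputStream gz = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(gz)) { out.write(raw); }

        // "plain" deflate with no zlib/gzip framing at all
        ByteArrayOutputStream df = new ByteArrayOutputStream();
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION, true); // true = no wrapper
        try (DeflaterOutputStream out = new DeflaterOutputStream(df, deflater)) { out.write(raw); }

        System.out.println("original:    " + raw.length + " bytes");
        System.out.println("gzip:        " + gz.toByteArray().length + " bytes");
        System.out.println("raw deflate: " + df.toByteArray().length + " bytes");
    }
}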
If there is regularity between the packets then it is possible this could be exploited -- e.g. build cache mappings, Huffman tables, or just modify the "window" of one of the other algorithms -- but packet loss and "de-compressibility" likely need to be accounted for. Going down this route, though, adds far more complexity. More ideas for helping out the compressor may be found at SO: How to find a good/optimal dictionary for zlib 'setDictionary' when processing a given set of data?.
Also, the protocol should likely have a simple "fall back" of zero compression, because some [especially small, random] data might not be practically compressible or might "compress" to a larger size (zlib actually has this guard, but also has the "wrapper overhead", so it would be better encoded separately for very small data). The overhead of the "wrapper" for the compressed data -- such as gzip or zip -- also needs to be taken into account for such small sizes. This is especially important to consider for string data of less than ~100 characters.
Happy coding.
Another thing to consider is the encoding used to shove the characters into the output stream. I would first start with UTF-8, but that may not always be ideal.
See SO: Best compression algorithm for short text strings, which suggests SMAZ, but I do not know how well that algorithm transfers to Unicode / binary data.
Also consider that not all deflate (or other format) implementations are created equal. I am not privy to how Java's standard deflate compares to a 3rd-party implementation (say JZlib) in terms of efficiency for small data, but consider Compressing Small Payloads [.NET], which shows rather negative numbers for "the same compression" format. The article also ends nicely:
...it’s usually most beneficial to compress anyway, and determine which payload (the compressed or the uncompressed one) has the smallest size and include a small token to indicate whether decompression is required.
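A minimal sketch of that token idea, assuming a single flag byte in front of each payload (0 = stored as-is, 1 = deflated):

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class MaybeCompress {
    // Returns either [0, raw-bytes...] or [1, deflated-bytes...], whichever is smaller.
    static byte[] pack(String text) throws Exception {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION, true); // raw deflate, no wrapper
        try (DeflaterOutputStream out = new DeflaterOutputStream(baos, deflater)) {
            out.write(raw);
        }
        byte[] compressed = baos.toByteArray();

        byte[] body = compressed.length < raw.length ? compressed : raw;
        byte flag   = compressed.length < raw.length ? (byte) 1 : (byte) 0;

        byte[] packet = new byte[body.length + 1];
        packet[0] = flag;  // the receiver checks this to decide whether to inflate
        System.arraycopy(body, 0, packet, 1, body.length);
        return packet;
    }
}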
My final conclusion: always test using real-world data and measure the benefits, or you might be in for a little surprise in the end!
Happy coding. For real this time.
The simplest thing to do would be to layer a GZIPOutputStream on top of a ByteArrayOutputStream, as that is built into the JDK, using
ByteArrayOutputStream baos = new ByteArrayOutputStream();
GZIPOutputStream zos = new GZIPOutputStream(baos);
zos.write(someText.getBytes("UTF-8")); // use an explicit charset rather than the platform default
zos.finish();
zos.flush();
byte[] udpBuffer = baos.toByteArray();
There may be other algorithms that do a better job, but I'd try this first to see if it fits your needs, as it doesn't require any extra jars and does a pretty good job.
Most standard compression algorithms don't work so well with small amounts of data. Often there is a header and a checksum, and it takes time for the compression to warm up, i.e. it builds a data dictionary based on the data it has seen.
For this reason you can find that:
small packets may be smaller, or the same size, with no compression at all;
a simple application/protocol-specific compression is better;
you have to provide a prebuilt data dictionary to the compression algorithm and strip out the headers as much as possible (see the sketch below).
I usually go with the second option for small data packets.
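For the prebuilt-dictionary option, java.util.zip supports this directly via Deflater.setDictionary and Inflater.setDictionary. A rough sketch, where the dictionary contents are just a made-up example of substrings that recur across packets:

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictionarySketch {
    // Hypothetical shared dictionary: strings that appear often in our packets.
    private static final byte[] DICTIONARY =
            "\"deviceId\":\"timestamp\":\"temperature\":\"humidity\":".getBytes(StandardCharsets.UTF_8);

    static byte[] compress(String text) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setDictionary(DICTIONARY);  // prime the LZ77 window before compressing
        deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[512];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    static String decompress(byte[] packet) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(packet);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[512];
        while (!inflater.finished()) {
            int n = inflater.inflate(buf);
            if (n == 0 && inflater.needsDictionary()) {
                inflater.setDictionary(DICTIONARY);  // the inflater asks for the same dictionary
            } else {
                out.write(buf, 0, n);
            }
        }
        inflater.end();
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}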
A good compression algorithm for short strings/URLs is this LZW implementation; it is in Java and can easily be ported for client-side GWT:
https://code.google.com/p/lzwj/source/browse/src/main/java/by/dev/madhead/lzwj/compress/LZW.java
Some remarks:
Use a 9-bit code word length for small strings (though you may experiment to see which works better). The compression ratio ranges from about 1 (very small strings, where the compressed output is no larger than the original) down to about 0.5 (larger strings).
For client-side GWT with other code word lengths, I had to adjust the input/output processing to work on a per-byte basis, to avoid bugs when buffering the bit sequence into a long, which is emulated for JS.
I'm using it for encoding complex URL parameters in client-side GWT, together with Base64 encoding and AutoBean serialization to JSON.
Update: a Base64 implementation is here: http://www.source-code.biz/base64coder/java
You have to change it to make it URL-safe, i.e. change the following characters (a small sketch of the substitution follows this list):
'+' -> '-'
'/' -> '~'
'=' -> '_'
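A minimal sketch of that substitution (shown here with the JDK's java.util.Base64 rather than the linked Base64Coder class, but the character replacement is the same either way):

import java.util.Base64;

public class UrlSafeCodec {
    // Standard Base64, then the substitutions from the list above.
    static String encode(byte[] data) {
        return Base64.getEncoder().encodeToString(data)
                .replace('+', '-')
                .replace('/', '~')
                .replace('=', '_');
    }

    // Reverse the substitutions, then decode as standard Base64.
    static byte[] decode(String text) {
        String standard = text
                .replace('-', '+')
                .replace('~', '/')
                .replace('_', '=');
        return Base64.getDecoder().decode(standard);
    }
}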