I need to compress lots of long numbers, which are similar to database ids. After compression, they will be sent as part of a request. Other than java.util.zip, is there a better alternative that achieves a higher compression ratio?
Thanks
It is possible to change the byte length of any number by changing its radix. Computers store data in bytes (radix 256) while humans use base 10, so cleartext numbers are not space efficient: each byte uses only 10 of its 256 possible values.
Simple java program to demonstrate:
System.out.println(Long.MAX_VALUE);
String sa = Long.toString(Long.MAX_VALUE, Character.MAX_RADIX);
System.out.println(sa);
Outputs:
9223372036854775807 # 19 bytes
1y2p0ij32e8e7 # 13 bytes
That is a 6-byte reduction (about 30% compression** in bytes). Character.MAX_RADIX equals 36, so a custom toString method that uses a larger radix can shrink the text even further.
Of course, this works only for the textual representation of numbers. The Long.MAX_VALUE used in this example is only 8 bytes long in its binary form, so even the shortened 13-byte text is still over 60% larger than the binary form of the number.
** This method is not really compression. It only exploits the storage inefficiency introduced by writing numbers in human-readable form. Actual compression like zip will usually beat this method on large inputs, although it makes the numbers unreadable by humans. To put it bluntly: you can read numbers aloud in base 10, 16, 36 or even 256. You can't read compressed numbers.
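The custom toString idea mentioned above could look roughly like the following base-62 sketch. The class name Base62 and the digit alphabet are my own assumptions for illustration, not anything from the JDK, and negative values are deliberately left out.

```java
final class Base62 {
    // Assumed alphabet for this sketch: any 62 distinct characters would work.
    private static final String ALPHABET =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    /** Encodes a non-negative long as base-62 text. */
    static String encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        if (value == 0) {
            return "0";
        }
        StringBuilder sb = new StringBuilder();
        while (value > 0) {
            sb.append(ALPHABET.charAt((int) (value % 62))); // least significant digit first
            value /= 62;
        }
        return sb.reverse().toString();
    }

    /** Decodes text produced by encode back into a long. */
    static long decode(String text) {
        long result = 0;
        for (int i = 0; i < text.length(); i++) {
            int digit = ALPHABET.indexOf(text.charAt(i));
            if (digit < 0) {
                throw new IllegalArgumentException("not a base-62 digit: " + text.charAt(i));
            }
            result = result * 62 + digit;
        }
        return result;
    }

    public static void main(String[] args) {
        String s = encode(Long.MAX_VALUE);
        System.out.println(s + " (" + s.length() + " characters)");
    }
}
```

For Long.MAX_VALUE this gives 11 characters instead of 13 in base 36, which is still more than the 8 bytes of the binary form.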
You can compress long numbers using Run Length Encoding: https://en.wikipedia.org/wiki/Run-length_encoding
I'm confused about the LEB128 (Little Endian Base 128) format. In the AOSP source file Leb128.java, the read functions return int for both the signed and the unsigned variant. I know that an int in Java is 4 bytes, i.e. 32 bits. But the maximum length of a LEB128 value in AOSP is 5 bytes, which carry 35 payload bits. So where do the other 3 bits go?
Thanks for your reply.
Each byte of data in LEB only accounts for 7 bits in the actual output - the remaining bit is used to indicate whether or not it's the end.
From Wikipedia:
To encode an unsigned number using unsigned LEB128 first represent the number in binary. Then zero extend the number up to a multiple of 7 bits (such that the most significant 7 bits are not all 0). Break the number up into groups of 7 bits. Output one encoded byte for each 7 bit group, from least significant to most significant group.
The extra bits aren't so much "lost" as "used to indicate whether or not it's the end of the data".
You can't encode arbitrary 32-bit values so that some of them take fewer than 4 bytes without others taking more than 4 bytes.
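To make the 7-bits-per-byte layout concrete, here is a minimal sketch of unsigned LEB128 for a Java int. This is not the AOSP Leb128.java code; the class and method names are made up for the example. A 32-bit value needs at most 5 bytes (4 x 7 = 28 bits, plus 4 more), so the upper 3 payload bits of the fifth byte simply stay unused.

```java
import java.io.ByteArrayOutputStream;

final class Uleb128Sketch {
    /** Encodes the int, treated as an unsigned 32-bit value, as unsigned LEB128. */
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int remaining = value;
        do {
            int group = remaining & 0x7F;   // lowest 7 bits
            remaining >>>= 7;               // unsigned shift, so the loop ends for negative ints too
            if (remaining != 0) {
                group |= 0x80;              // high bit set: more bytes follow
            }
            out.write(group);
        } while (remaining != 0);
        return out.toByteArray();
    }

    /** Decodes an unsigned LEB128 value of at most 5 bytes back into an int. */
    static int decode(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;  // for the 5th byte only the low 4 bits fit into the int
            if ((b & 0x80) == 0) {
                break;                      // high bit clear: this was the last byte
            }
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(encode(127).length);  // small values fit in one byte
        System.out.println(encode(-1).length);   // 0xFFFFFFFF needs all five bytes
    }
}
```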
I used a Java BitSet to encode a message.
My compressed file size comes out to 954kb, but when I do BitSet.cardinality(), I get around 4mb. Can you explain this?
BitSet.cardinality() returns the number of bits set to true in the BitSet. I think you are looking for BitSet.size(). But keep in mind it will return the number of bits, not bytes.
Assuming that approximately half of the bits are set to true after Huffman encoding, your BitSet should have a size of around 4.000.000*2 = 8.000.000 bits, which in turn makes around 1.000.000 bytes. That is rather close to the 954kb you see.
This should explain your observation.
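A small experiment along these lines shows the difference between the counting methods. The helper everySecondBit is hypothetical and only stands in for "about half of the bits are set"; it is not what a Huffman encoder produces.

```java
import java.util.BitSet;

final class BitSetSizes {
    /** Hypothetical stand-in for encoded data: sets every second bit below totalBits. */
    static BitSet everySecondBit(int totalBits) {
        BitSet bits = new BitSet(totalBits);
        for (int i = 0; i < totalBits; i += 2) {
            bits.set(i);
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet bits = everySecondBit(8_000_000);
        System.out.println("cardinality (bits set to true): " + bits.cardinality());
        System.out.println("length (highest set bit + 1):   " + bits.length());
        System.out.println("size (bits of allocated space): " + bits.size());
        System.out.println("bytes from toByteArray():       " + bits.toByteArray().length);
    }
}
```

Here cardinality() reports 4.000.000 while toByteArray() yields 1.000.000 bytes, which matches the ratio in the question.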
I have a use-case for getting distributed unique sequence numbers in integer format. UUID comes out to be the best and simple solution for me.
However, I need to have integers only, so I will convert that big hexadecimal number (UUID) to decimal number. UUID has 128 bits and hence will produce a decimal number of 39 digits.
I can't afford to have 39 digits number due to some strict database constraints. So, I get back to basics and try to convert the number to binary first and then decimal. Now, the standard process of converting a hexadecimal directly to binary is to take each hexadecimal digit and convert it into 4 bits. Each hexadecimal number can be converted to a set of 4 bits. Hence, for 32 hex digits in UUID, we get 128 bits (32*4) .
Now, I am thinking of not following the rule of converting each hexadecimal digit to 4 bits. Instead, I will use just enough bits to represent each digit.
For example , take 12B as one hexadecimal number.
By standard process, conversion to binary comes out to be 0000-0001-0010-1011 (9 bits actually).
By my custom process, it comes out to be 1-10-1011 (7 bits actually).
So, with this method, the number of bits is reduced. If the number of bits is reduced, the converted decimal number will have fewer digits and can fit within my constraints.
Can you please help validate my theory? Does this approach have any problems? Will it cause collisions? Is the method correct, and can I go ahead with it?
Thanks in advance.
Yes, this will cause collisions.
e.g.
0000-0001-0010-1011 -> 1101011
0000-0000-0110-1011 -> 1101011
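The collision is easy to reproduce. The sketch below is my reading of the "custom process" from the question (shortest binary form per hex digit, leading zero digits dropped); the class and method names are invented for the demonstration.

```java
final class VariableWidthCollision {
    /** Concatenates the shortest binary form of each hex digit, dropping leading zero digits. */
    static String customEncode(String hex) {
        StringBuilder sb = new StringBuilder();
        for (char c : hex.toCharArray()) {
            int digit = Character.digit(c, 16);
            if (digit < 0) {
                throw new IllegalArgumentException("not a hex digit: " + c);
            }
            if (sb.length() == 0 && digit == 0) {
                continue; // leading zeros vanish, as in the question's example
            }
            sb.append(Integer.toBinaryString(digit)); // no padding to 4 bits
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(customEncode("012B")); // 1101011
        System.out.println(customEncode("006B")); // 1101011, same output for a different input
    }
}
```

Without fixed-width groups the decoder cannot tell where one digit ends and the next begins, so different inputs map to the same bit string.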
Some time ago I spent a couple of days debugging problems with UUID collisions (the UUIDs had been trimmed). Debugging these things is a nightmare. You won't have a good time.
What you need is to implement your own unique identifier scheme. Depending on your use case, developing such a scheme could be either very easy or very hard. You could, for example, assign each machine a unique number (let's say two bytes) and have each machine assign IDs serially from a 4-byte namespace. In 6 bytes you then have a nice UUID-like scheme (with some constraints).
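A rough sketch of such a 6-byte scheme, assuming the machine numbers are handed out by some external process that is not shown here. All names are hypothetical, and the counter is in-memory only, so a real system would also have to persist it across restarts.

```java
import java.util.concurrent.atomic.AtomicLong;

final class MachineScopedIds {
    private final long machineId;                       // 2 bytes, unique per machine (assumed to be assigned externally)
    private final AtomicLong counter = new AtomicLong();

    MachineScopedIds(int machineId) {
        if (machineId < 0 || machineId > 0xFFFF) {
            throw new IllegalArgumentException("machine id must fit in two bytes");
        }
        this.machineId = machineId;
    }

    /** Returns a 48-bit id: machine number in the upper 16 bits, serial number in the lower 32. */
    long next() {
        long serial = counter.getAndIncrement();
        if (serial > 0xFFFFFFFFL) {
            throw new IllegalStateException("4-byte namespace exhausted");
        }
        return (machineId << 32) | serial;
    }

    public static void main(String[] args) {
        MachineScopedIds ids = new MachineScopedIds(7);
        System.out.println(ids.next());
        System.out.println(ids.next());
    }
}
```

A 48-bit value has at most 15 decimal digits, well below the 39 digits of a full UUID.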
I need to generate a hash value used for uniqueness of many billions of records in Java. Trouble is, I only have 16 numeric digits to play with. In researching this, I have found algorithms for 32-bit hashes, which return Java integers. But this is too small, as it only has a range of +/- 2 billion, and I will have more records than that. I cannot go to a 64-bit hash, as that will give me numeric values back that are too large (+/- 9 quintillion, or 19 digits). Trouble is, I am dealing with a legacy system that is forcing me into a static key length of 16 digits.
Suggestions? I know no hash function will guarantee uniqueness, but I need a good one that will fit into these restrictions.
Thanks
If your generated hash is too large you can just mod it with your keyspace max to make it fit.
myhash = hash64bitvalue % 10^16
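In actual Java the line above needs two adjustments: ^ is XOR rather than exponentiation, and % can return a negative result for a negative hash. A minimal sketch, with HashToKey and toKey as made-up names:

```java
final class HashToKey {
    // 10^16 written out as a literal, because 10 ^ 16 in Java is a bitwise XOR.
    static final long KEY_SPACE = 10_000_000_000_000_000L;

    /** Maps any 64-bit hash into the range 0 .. 10^16 - 1. */
    static long toKey(long hash64) {
        // floorMod always returns a non-negative result for a positive modulus,
        // unlike %, which keeps the sign of the dividend.
        return Math.floorMod(hash64, KEY_SPACE);
    }

    public static void main(String[] args) {
        System.out.println(toKey(-1L));
        System.out.println(toKey(Long.MIN_VALUE));
    }
}
```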
If you are limited to 16 decimal digits, your key space contains 10^16 values.
Even if you find a hash that gives a uniform distribution on your data set, the birthday paradox means you will have a 50% chance of a collision at ~10^8 items of data, which is an order of magnitude less than your billions of records.
This means that you cannot use any kind of hash alone and rely on uniqueness.
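The figure comes from the usual birthday approximation p ≈ 1 - exp(-n(n-1)/(2N)) for n items in a key space of N values, assuming a perfectly uniform hash. A quick way to check it (class and method names are mine):

```java
final class BirthdayBound {
    /** Approximate probability of at least one collision among items uniformly hashed into keySpace values. */
    static double collisionProbability(double items, double keySpace) {
        // -expm1(-x) equals 1 - exp(-x) but is more accurate when x is small.
        return -Math.expm1(-items * (items - 1) / (2 * keySpace));
    }

    public static void main(String[] args) {
        double keySpace = 1e16; // 16 decimal digits
        System.out.println(collisionProbability(1.18e8, keySpace)); // close to 0.5
        System.out.println(collisionProbability(1e9, keySpace));    // practically 1
    }
}
```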
A straightforward solution is to use a global counter instead. If a global counter is infeasible, counters with preallocated ranges can be used. For example, the 6 most significant digits denote a fixed data source index, and the 10 least significant digits contain a monotonic counter maintained by that data source.
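The 6 + 10 digit layout can be sketched as follows. The names are hypothetical, and how each data source persists its own counter is left out.

```java
final class SourcePrefixedKey {
    static final long COUNTER_LIMIT = 10_000_000_000L; // 10^10, i.e. 10 decimal digits

    /** Builds a key of at most 16 digits: 6 digits of source index followed by 10 digits of counter. */
    static long key(int sourceIndex, long counter) {
        if (sourceIndex < 0 || sourceIndex > 999_999) {
            throw new IllegalArgumentException("source index must fit in 6 digits");
        }
        if (counter < 0 || counter >= COUNTER_LIMIT) {
            throw new IllegalArgumentException("counter must fit in 10 digits");
        }
        return sourceIndex * COUNTER_LIMIT + counter; // int * long is computed as long
    }

    public static void main(String[] args) {
        System.out.println(key(42, 7)); // 420000000007
    }
}
```

Keys from different sources can never collide, because the source index occupies its own digits.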
So your restriction is 53 bits?
As I understand it, the position of a bit in a hash code doesn't affect its value (the position and the value of a bit are fully independent of each other). So you could take a 64-bit hash function and use only its last 53 bits. Use a bitwise operation for this, ( hash64 & ((1L << 53) - 1) ), not arithmetic.
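When writing the mask in Java, watch out for operator precedence and the int literal: subtraction binds tighter than the shift, and an int shift distance is taken modulo 32. A small sketch with made-up names:

```java
final class Low53Bits {
    // Long literal and explicit parentheses: 53 one-bits.
    static final long MASK_53 = (1L << 53) - 1;

    /** Keeps the lowest 53 bits of the hash; the result is below 2^53, which is less than 10^16. */
    static long low53(long hash64) {
        return hash64 & MASK_53;
    }

    /** What the unparenthesized int version evaluates to: 1 << 53, with the distance reduced to 21. */
    static long brokenMask() {
        return 1 << 54 - 1;
    }

    public static void main(String[] args) {
        System.out.println(low53(-1L));   // 9007199254740991
        System.out.println(brokenMask()); // 2097152, a single bit and not a mask
    }
}
```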
You don't have to store your hashes in a human-readable form (hex, as you said). Just store the 64-bit long value (generated by a 64-bit hash function) in your database, which takes only 8 bytes, not the 19 bytes that scared you off.
If that isn't a solution, improve the legacy system.
Edit: Wait!
64-bit: 2^64 =
18446744073709551616
16 hex digits: 16^16 =
18446744073709551616
Exact fit! So make a hex representation of your 64-bit hash, and there you are.
If you can store 16 alphanumeric characters, you can use a hexadecimal representation: 16 hex characters encode 16^16 distinct values, and 16^16 is 2^64, so a full 64-bit hash fits.
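If the key column does accept letters, the fixed-width hex form is a one-liner in Java. A small sketch, with the class and method names being mine:

```java
final class HexKey {
    /** Formats a 64-bit hash as exactly 16 lowercase hex characters, zero-padded on the left. */
    static String toHex16(long hash64) {
        return String.format("%016x", hash64); // negative values are shown in two's complement form
    }

    /** Parses a 16-character hex key back into the original 64-bit value. */
    static long fromHex16(String key) {
        return Long.parseUnsignedLong(key, 16);
    }

    public static void main(String[] args) {
        System.out.println(toHex16(255L)); // 00000000000000ff
        System.out.println(toHex16(-1L));  // ffffffffffffffff
    }
}
```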
I'm trying to implement some compression algorithms, and I need to deal with bits in Java.
What I need to do is that when I write the value 1 then the value 2, those numbers are stored in the file as bits, so the file size will be 1 byte instead of 2, as 1 is stored in 1 bit and 2 is stored in 2 bits.
Is it possible? Thanks very much
All the I/O methods have a byte as the lowest granularity. You can write bits, but you have to pack them into bytes by yourself. Maybe a one-byte buffer that you write out to the file once it fills up would be appropriate.
Also note that there is no way to know the length of the file in bits (you do not know if the last byte was "full"). So your application needs to take care of that somehow.
You can also google for "BitOutputStream", of which there are a few, though not in libraries that are very common. Maybe just use one of those.
Finally, the file you will be creating will not be a "Text" file, it will be very much binary (even more so than usual...)
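The one-byte buffer idea could be sketched like this. It is a minimal illustration with invented names, not one of the existing BitOutputStream classes, and it pads the last byte with zero bits, so the reader still needs some way to know how many bits are valid.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

final class BitWriterSketch implements AutoCloseable {
    private final OutputStream out;
    private int buffer;   // bits collected so far, oldest bit in the highest position
    private int bitCount; // number of bits currently held in the buffer (0..7)

    BitWriterSketch(OutputStream out) {
        this.out = out;
    }

    /** Appends a single bit; a full byte is written out as soon as 8 bits are collected. */
    void writeBit(int bit) throws IOException {
        buffer = (buffer << 1) | (bit & 1);
        bitCount++;
        if (bitCount == 8) {
            out.write(buffer);
            buffer = 0;
            bitCount = 0;
        }
    }

    /** Writes the lowest 'width' bits of value, most significant bit first. */
    void writeBits(int value, int width) throws IOException {
        for (int i = width - 1; i >= 0; i--) {
            writeBit(value >>> i);
        }
    }

    /** Flushes a partially filled last byte, padded with zero bits, and closes the stream. */
    @Override
    public void close() throws IOException {
        if (bitCount > 0) {
            out.write(buffer << (8 - bitCount));
        }
        out.close();
    }

    /** Example from the question: the value 1 in 1 bit, then the value 2 in 2 bits. */
    static byte[] demo() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (BitWriterSketch writer = new BitWriterSketch(bytes)) {
            writer.writeBits(1, 1);
            writer.writeBits(2, 2);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] packed = demo();
        System.out.println(packed.length + " byte: " + Integer.toBinaryString(packed[0] & 0xFF));
    }
}
```

The three bits 1, 1, 0 end up in a single byte, 11000000. Replacing the ByteArrayOutputStream with a FileOutputStream would write the same byte to a file.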