How to calculate the size of a BitSet in bytes?

How to calculate the size of a BitSet in bytes? - java

I used a Java BitSet to encode a message.
My compressed file size comes out to 954kb, but when I do BitSet.cardinality(), I get around 4mb. Can you explain this?

BitSet.cardinality() returns the number of bits set to true in the BitSet. I think you are looking for BitSet.size(). But keep in mind it will return the number of bits, not bytes.
Assuming after Huffman encoding you have approximately half of the bits set to true, that means your BitSet should have a size of around 4.000.000*2 = 8.000.000 bits in your BitSet which in turn makes around 1.000.000 bytes which is rather close to the 954kb you see.
This should explain your observation.

Related

Java - How to calculate CRC16 of BitSet

I Have a Java BitSet where i have some data. The length of this BitSet is 545 bits.
Problem: All current known implementations can only work with a byte array, but converting my BitSet to a byte array will change the data because i need to do some padding.
Is there any known implementation which can handly my data without needing to adjust it to whole bytes?

Use .toByteArray() to get the bits into a sequence of bytes. You need to know what CRC-16 definition is required (polynomial, ordering, pre and post processing), and the order in which to process the bits. .toByteArray() will put the first bit in the set in the least-significant bit of the first byte.
Then you can use crcany to generate C code for the CRC-16 you need. The generated code includes a crc16..._rem() routine for updating the CRC with a number of bits. For a BitSet with n bits, you would first compute the CRC on the first n >> 3 bytes. Then use crc16..._rem() to update the CRC using the n & 7 bits in the last byte. It is straightforward to convert the C code to Java.

Java compress lots of long numbers

I need to compress lots of long numbers. Those long numbers are like database ids. After compression, it will be sent as part of the request. Other than java.util.zip, is there any better alternative to achieve higher compression rate?
Thanks

It is possible to change byte length of a any number by changing its radix. As computers use bytes for data (radix 256) and humans use base 10 cleartext numbers are not space efficient as they can be use only 10 values out of 256 possible.
Simple java program to demonstrate:
System.out.println(Long.MAX_VALUE);
String sa = Long.toString(Long.MAX_VALUE, Character.MAX_RADIX);
System.out.println(sa);
Outputs:
9223372036854775807 # 20 bytes
1y2p0ij32e8e7 # 14 bytes
Which is a 6 byte reduction (30% compression** in bytes). As Character.MAX_RADIX equals 36 you can achieve even greater compression by writing custom toString method.
Of course this works only for textual representation of numbers. Long.MAX_VALUE number used in this example is only 8 bytes long in its binary form. So even this 30% reduction in size is actually 75% increase when compared to a binary form of the number.
** This method is not really a compression. This is only exploit of storage inefficiency introduced by writing numbers in human readable form. Actual compression like zip will always beat this method, although it will make numbers unreadable by humans. To put it bluntly: you can read aloud numbers in base 10, 16, 36 or even 256. You can't read compressed numbers.

You can compress long numbers using Run Length Encoding: https://en.wikipedia.org/wiki/Run-length_encoding

Huffman compress file (Got the tree but can't compress)- Java

Alright so I am trying to do a file compress using the Huffman tree.
We got the tree that is working just fine but we are unable to figure out how to write the binary string we get into the file.
So for example our tree returns: '110', it should mean this byte: '00000110' right?
And if the returns: '11111111 11111110' it should mean what? Should we just write it in in byte?
So the question is how do we convert the binary string we get into bytes so we can write it on the file?
Thanks alot,
Ara

So for example our tree returns: '110', it should mean this byte:
'00000110' right?
Wrong. You should have a byte buffer of bits into which you write your bits. Write the three bits 110 into the byte. (You will need to decide on a convention for bit ordering in the byte.) You still have five unused bits in the byte, so there it sits. Now you write 10 into the buffer. The byte buffer now has 11010, and three unused bits. So still it sits. Now you try to write 111011 into the byte buffer. The first three bits go into the byte buffer, giving you 11010111. You now have filled the buffer, so only now do you write out your byte to the file. You are left with 011. You clear your byte buffer of bits since you wrote it out, and put in the remaining 011 from your last code. Your byte buffer now has three bits in it, and five bits unused. Continue in this manner.
The buffer does not have to be one byte. 16-bit or 32-bit buffers are common and are more efficient. You write out bytes whenever the bits therein are eight or more, and shift the remaining 0-7 bits to the start of the buffer.
The only tricky part is what to do at the end, since you may have unused bits in your last byte. Your Huffman codes should have an end symbol to mark the end of the stream. Then you know when you should stop looking for more Huffman codes. If you do not have an end code, then you need to assure somehow that either the remaining bits in the byte cannot be a complete Huffman code, or you need to indicate in some other way where the stream of bits end.

Purpose of byte type in Java

I read this line in the Java tutorial:
byte: The byte data type is an 8-bit signed two's complement integer. It has
a minimum value of -128 and a maximum value of 127 (inclusive). The
byte data type can be useful for saving memory in large arrays, where
the memory savings actually matters. They can also be used in place of
int where their limits help to clarify your code; the fact that a
variable's range is limited can serve as a form of documentation.
I don't clearly understand the bold line. Can somebody explain it for me?

Byte has a (signed) range from -128 to 127, where as int has a (also signed) range of −2,147,483,648 to 2,147,483,647.
What it means is that since the values you're going to use will always be between that range, by using the byte type you're telling anyone reading your code this value will be at most between -128 to 127 always without having to document about it.
Still, proper documentation is always key and you should only use it in the case specified for readability purposes, not as a replacement for documentation.

If you're using a variable which maximum value is 127 you can use byte instead of int so others know without reading any if conditions after, which may check the boundaries, that this variable can only have a value between -128 and 127.
So it's kind of self-documenting code - as mentioned in the text you're citing.
Personally, I do not recommend this kind of "documentation" - only because a variable can only hold a maximum value of 127 doesn't reveal it's really purpose.

Integers in Java are stored in 32 bits; bytes are stored in 8 bits.
Let's say you have an array with one million entries. Yikes! That's huge!
int[] foo = new int[1000000];
Now, for each of these integers in foo, you use 32 bits or 4 bytes of memory. In total, that's 4 million bytes, or 4MB.
Remember that an integer in Java is a whole number between -2,147,483,648 and 2,147,483,647 inclusively. What if your array foo only needs to contain whole numbers between, say, 1 and 100? That's a whole lot of numbers you aren't using, by declaring foo as an int array.
This is when byte becomes helpful. Bytes store whole numbers between -128 and 127 inclusively, which is perfect for what you need! But why choose bytes? Because they use one-fourth of the space of integers. Now your array is wasting less memory:
byte[] foo = new byte[1000000];
Now each entry in foo takes up 8 bits or 1 byte of memory, so in total, foo takes up only 1 million bytes or 1MB of memory.
That's a huge improvement over using int[] - you just saved 3MB of memory.
Clearly, you wouldn't want to use this for arrays that hold numbers that would exceed 127, so another way of reading the bold line you mentioned is, Since bytes are limited in range, this lets developers know that the variable is strictly limited to these bounds. There is no reason for a developer to assume that a number stored as a byte would ever exceed 127 or be less than -128. Using appropriate data types saves space and informs other developers of the limitations imposed on the variable.

I imagine one can use byte for anything dealing with actual bytes.
Also, the parts (red, green and blue) of colors commonly have a range of 0-255 (although byte is technically -128 to 127, but that's the same amount of numbers).
There may also be other uses.
The general opposition I have to using byte (and probably why it isn't seen as often as it can be) is that there's lots of casting needed. For example, whenever you do arithmetic operations on a byte (except X=), it is automatically promoted to int (even byte+byte), so you have to cast it if you want to put it back into a byte.
A very elementary example:
FileInputStream::read returns a byte wrapped in an int (or -1). This can be cast to an byte to make it clearer. I'm not supporting this example as such (because I don't really (at this moment) see the point of doing the below), just saying something similar may make sense.
It could also have returned a byte in the first place (and possibly thrown an exception if end-of-file). This may have been even clearer, but the way it was done does make sense.
FileInputStream file = new FileInputStream("Somefile.txt");
int val;
while ((val = file.read()) != -1)
{
byte b = (byte)val;
// ...
}
If you don't know much about FileInputStream, you may not know what read returns, so you see an int and you may assume the valid range is the entire range of int (-2^31 to 2^31-1), or possibly the range of a char (0-65535) (not a bad assumption for file operations), but then you see the cast to byte and you give that a second thought.
If the return type were to have been byte, you would know the valid range from the start.
Another example:
One of Color's constructors could have been changed from 3 int's to 3 byte's instead, since their range is limited to 0-255.

It means that knowing that a value is explicitly declared as a very small number might help you recall the purpose of it.
Go for real docs when you have to create a documentation for your code, though, relying on datatypes is not documentation.

An int covers the values from 0 to 4294967295 or 2 to the 32nd power. This is a huge range and if you are scoring a test that is out of 100 then you are wasting that extra spacce if all of your numbers are between 0 and 100. It just takes more memory and harddisk space to store ints, and in serious data driven applications this translates to money wasted if you are not using the extra range that ints provide.

byte data types are generally used when you want to handle data in the forms of streams either from file or from network. Reason behind this is because network and files works on the concept of byte.
Example: FileOutStream always takes byte array as input parameter.

Bitmask in big endian

This isn't a question as much as it's a sanity check!
If you needed to read 4 bytes into Java as a bitmask in Big endian and those bytes were:
0x00, 0x01, 0xB6, 0x02.
Making that into an int would be: 112130
The binary would be: 00000000000000011010011000000010
The endian of a series of bytes wouldn't affect the bit position, would it?
Thanks
Tony

Endian-ness reflects the ordering of bytes, but not the ordering of the bits within those bytes.
Let's say I want to represent the (two-byte) word 0x9001.
If I just type this out in binary, that would be 1001000000000001.
If I dump the bytes (from lower address to higher) on a big-endian machine, I would see 10010000 00000001.
If I dump the bytes (from lower address to higher) on a little-endian machine, I would see 00000001 10010000.

In general, if the thing you're reading from is giving you whole bytes, then you don't need to worry about the order of bits making up those bytes: it is just the order of the bytes that matters, as you correctly suppose.
The time you might have to worry about the "endianness" of individual bits is where you're actually reading/writing a stream of bits rather than whole bytes (e.g. if you were writing a compression algorithm that operated at the bit level, you'd have to make a decision about what order to write the bits in).

The only thing you have to pay attention is how exactly you "read 4 bytes into Java" - that's where endianness matters and you can mess it up (DataInputStream assumes big endian). Once the value you've read has become the int 112130, you're set.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.