calculating entropy of binary data (from strings) - java

I'm trying to wrap my head around entropy. I basically understand this:
Technically, entropy is the negative of the sum over all events of the probability of each event times the log of the probability of that event.
If you consider an 8-bit byte, if all 256 values are equally likely, then the byte contains 8 bits of entropy or, equivalently, 8 bits of real information. If some bit patterns are more likely than others, for example the bit pattern for the letter ‘e’, then the byte will contain less than 8 bits of entropy or information.
Running English text is pretty low entropy, at about 2.3 bits per byte. This is why compression algorithms work well on text files.
https://www.quora.com/What-is-entropy-in-terms-of-cryptography
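For a concrete sense of that formula, here is a minimal Java sketch of my own (class and method names are made up for illustration) that estimates the entropy of a string in bits per character by counting symbol frequencies and summing -p * log2(p):

import java.util.HashMap;
import java.util.Map;

// Estimate the entropy of a string in bits per character by counting how
// often each character occurs and applying H = -sum(p * log2(p)).
public class StringEntropy {
    public static void main(String[] args) {
        System.out.println(entropyPerChar("this is some running english text"));
    }

    static double entropyPerChar(String s) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : s.toCharArray()) counts.merge(c, 1, Integer::sum);
        double entropy = 0;
        for (int count : counts.values()) {
            double p = (double) count / s.length();
            entropy -= p * Math.log(p) / Math.log(2); // -p * log2(p)
        }
        return entropy;
    }
}

On short samples this single-character estimate will be rough, and it ignores dependencies between characters, which is why longer-range models give lower figures for English.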
For now let's say I'm using strings of ASCII converted into string arrays of binary, as in
// [0 0 1 1 0 0 1 1]
// [1 1 1 1 0 0 1 1]
// [1 0 1 1 0 0 0 0]
// [0 0 1 1 0 0 1 1]
I've devised 2 methods for calculating entropy (maybe they are both wrong, I don't know). The first is to start with element[1] and compare each of its bits to the bits in the same positions in element[0] and element[2]; for each bit they share in the same location, I add 1 to a running score, and then I divide by 2. I end up with something that is consistent-ish with the 2.3 bits per byte described above, in that the score for most of my text is that it shares roughly 5-6 bits with its neighbors on average.
The other method is to just add up each 'column' of bits and then average them to find a probability of their appearance. For the above I get
//=[2 1 4 4 0 0 3 3] / 4 = [.5 .25 1 1 0 0 .75 .75] as probabilities
For the above, I am less sure how to derive an entropy 'score'.
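One possible way to turn those per-column probabilities into a score (a sketch of my own, under the assumption that each bit column is treated as an independent coin flip, which ignores dependencies between columns) is to compute the binary entropy of each column and sum over the 8 columns:

// Per-column entropy for the 4 example bytes above: treat each bit position
// as a Bernoulli variable with probability p of being 1, compute
// H(p) = -(p*log2(p) + (1-p)*log2(1-p)), and sum over the 8 columns.
public class ColumnEntropy {
    public static void main(String[] args) {
        int[][] rows = {
            {0,0,1,1,0,0,1,1},
            {1,1,1,1,0,0,1,1},
            {1,0,1,1,0,0,0,0},
            {0,0,1,1,0,0,1,1},
        };
        double total = 0;
        for (int col = 0; col < 8; col++) {
            int ones = 0;
            for (int[] row : rows) ones += row[col];
            double p = (double) ones / rows.length;  // .5 .25 1 1 0 0 .75 .75
            total += binaryEntropy(p);
        }
        System.out.printf("entropy per byte: %.3f bits%n", total); // about 3.434
    }

    static double binaryEntropy(double p) {
        if (p == 0 || p == 1) return 0;  // a constant column carries no information
        return -(p * log2(p) + (1 - p) * log2(1 - p));
    }

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }
}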
In any case, I'm curious whether people can help me understand if I am doing this right or, more likely, how I can improve or think about this differently.
Thank you

Related

Given a part of the 15-puzzle board, how do I get all of the states of that partial board using BFS?

the board is like this:
1 2 3 4
5 6 7 0
0 0 0 0
0 0 0 0
The '0' means that cell is empty; we can move a non-zero number into an adjacent '0' cell.
So how do I get all of the states of the board using BFS?
For example, here are two states of the board:
1 2 3 4
0 0 0 0
5 6 7 0
0 0 0 0
1 2 3 0
4 0 0 0
5 0 0 0
6 7 0 0
The reason I ask this question is that I need to process all of the 15-puzzle states using a disjoint pattern database, in order to solve nearly the hardest state of the 15-puzzle in 1 minute:
15 14 13 12
11 10 9 8
7 6 5 4
3 1 2 0
I need to process all of the 15-puzzle states [..] to solve nearly the hardest state of the 15-puzzle in 1 minute
Approach 1 - using a database and storing all states
For the reasons given by Henry, and also supported by [1], solving this problem using a database would require generating the entire state space A_15, storing all of it, and then finding the shortest path (or some path) between a given state and the solved state. This would require a lot of space and a lot of time. See this discussion for an outline of this approach.
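To make this approach concrete, here is a minimal BFS sketch of my own (class and helper names are invented, and it is not the implementation from the linked discussion). It enumerates board states breadth-first from a start board, keying each state by its string form; a hard visit limit stands in for the fact that the full space will not fit in memory:

import java.util.*;

// BFS over 15-puzzle states. Each board is an int[16] in row-major order;
// 0 marks the blank. Children are produced by sliding an adjacent tile
// into the blank. dist records the number of moves from the start board.
public class PuzzleBfs {
    public static void main(String[] args) {
        int[] start = {1,2,3,4, 5,6,7,8, 9,10,11,12, 13,14,15,0};
        Map<String, Integer> dist = new HashMap<>();
        Deque<int[]> queue = new ArrayDeque<>();
        dist.put(key(start), 0);
        queue.add(start);
        int limit = 1_000_000;  // arbitrary cut-off for the demo

        while (!queue.isEmpty() && dist.size() < limit) {
            int[] board = queue.poll();
            int d = dist.get(key(board));
            int blank = indexOf(board, 0);
            for (int next : neighbours(blank)) {
                int[] child = board.clone();
                child[blank] = child[next];  // slide the tile into the blank
                child[next] = 0;
                if (dist.putIfAbsent(key(child), d + 1) == null) {
                    queue.add(child);
                }
            }
        }
        System.out.println("states visited: " + dist.size());
    }

    // Indices of the cells orthogonally adjacent to cell i on the 4x4 grid.
    static List<Integer> neighbours(int i) {
        List<Integer> n = new ArrayList<>();
        int r = i / 4, c = i % 4;
        if (r > 0) n.add(i - 4);
        if (r < 3) n.add(i + 4);
        if (c > 0) n.add(i - 1);
        if (c < 3) n.add(i + 1);
        return n;
    }

    static String key(int[] board) { return Arrays.toString(board); }

    static int indexOf(int[] a, int v) {
        for (int i = 0; i < a.length; i++) if (a[i] == v) return i;
        return -1;
    }
}

A real pattern-database build would track only a subset of the tiles (treating the rest as blanks, as in the partial boards in the question), which is what keeps the stored table small enough to be practical.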
Approach 2 - using a specialized depth-first search algorithm
Here is an implementation of this search strategy that uses the IDA* (iterative deepening A*) algorithm.
Approach 3 - using computational group theory
Yet another way to handle this in a much shorter amount of time is to use GAP (which implements a variant of Schreier-Sims) in order to decompose a given word into a product of generators. There is an example in the docs that shows how to use it to solve the Rubik's cube, and it can be adapted to the 15-puzzle too [2].
[1] Permutation Puzzles - A Mathematical Perspective by Jamie Mulholland - see pages 103 and 104 for the solvability criteria and for the state space being |A_15| ≈ 653 billion
[2] link2 - page 37

What does "bits 6-0" or "bits 10-6 " mean in the javadoc of DataInput?

When reading the javadoc of DataInput, specifically the "Modified UTF-8" section, I come across three tables that say "0 bits 6-0", "1 1 0 bits 10-6", ..., "1 0 bits 5-0".
I'm a Java newbie, so to me it looks like subtraction; I'm not sure, but if that's the case and we add it to the ones and zeros it would only make 7 bits.
As far as I know, a byte is made up of 8 bits.
What do these "0 bits 6-0..." mean?
The javadoc is telling you how each byte is divided.
Consider each byte as a vector of 8 individual elements (bits).
The first block has only one byte, and the corresponding possible bit values.
byte 1
bit number 7 6 5 4 3 2 1 0
bit value 0 ? ? ? ? ? ? ? <-- bits 6 - 0
This means that for characters encoded in one byte, the leading bit will always be 0. These are the characters from \u0001 to \u007F.
The second block has two bytes and gets a bit more complicated
byte 1 byte 2
bit number 15 14 13 12 11 10 9 8 | 7 6 5 4 3 2 1 0
bit value 1 1 0 ? ? ? ? ? | 1 0 ? ? ? ? ? ?
^ ^
| |
bits 10 to 6 of bits 5 to 0 of
the utf-8 codepoint the utf-8 codepoint
These are the characters in the range from \u0080 to \u07FF
So for example, a symbol in this range is µ (micro sign).
In standard UTF-8 the bytes are 11000010 10110101
Take a look at this character and see how it lines up with the bits for two-byte chars. You have
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 1 0 0 0 0 1 0 1 0 1 1 0 1 0 1
Bits 10-6 ------*-*-*----*-* ^-^-^-^-^-^----bits 5-0
You end up with
byte 1 byte 2
bit number 15 14 13 12 11 10 9 8 | 7 6 5 4 3 2 1 0
bit value 1 1 0 - - 0 1 0 | 1 0 1 1 0 1 0 1
Bits 11 and 12 would be 0, but I put a - in there just to show their (in)significance.
Sorry for the ascii art, I hope it helps.
It shows ranges of bits. Bits are numbered: the lowest, least significant bit has index 0, the next bit has index 1, the next bit has index 2, and so on. For example, the number 13 has the binary representation 1101. That means bit number 0 has the value 1, bit number 1 has the value 0, and bits number 2 and 3 have the value 1. So, in the documentation, "0 bits 6-0" means that the highest bit in the byte must be zero, while the seven lower bits, with indexes from 6 down to 0, are occupied by your number.
That table is talking about how characters are represented in the modified UTF-8 encoding. I'll use this part as an example:
The null character '\u0000' and characters in the range '\u0080' to '\u07FF' are represented by a pair of bytes.
Notice that you can use 11 bits to represent those numbers. The largest number you can represent with 11 bits is 2^11 - 1 = 2047 = 7FF in hex. Let's number those 11 bits 0 to 10, so bit-10 is the most significant bit, and bit-0 is the least significant.
The table is telling you that in the modified UTF-8 encoding, the characters corresponding to those numbers are instead represented by 2 bytes.
The first byte starts with the bits 110 (these three bits are fixed), followed by bit-10, bit-9, bit-8, bit-7, bit-6 of the number we are trying to represent.
The second byte starts with the bits 10 (these two bits are fixed), followed by bit-5, bit-4, bit-3, bit-2, bit-1, bit-0 of the number.
In short, "bits 10-6" means "bits 10 to 6", which is not 4, but 5 bits.
I encourage you to compare this with the normal UTF8 encoding to see where the difference is.
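If it helps to see the bit shuffling in code, here is a small sketch of my own (not the JDK's implementation) that builds the two bytes for a character in the \u0080 to \u07FF range, using µ as the example; the layout is the same as standard UTF-8 for this range:

// Split a character in the range \u0080..\u07FF into two (modified) UTF-8
// bytes: 110 followed by bits 10-6, then 10 followed by bits 5-0.
public class TwoByteEncoding {
    public static void main(String[] args) {
        char c = '\u00B5';                               // micro sign, 181
        int byte1 = 0b110_00000 | ((c >> 6) & 0b11111);  // bits 10-6 of c
        int byte2 = 0b10_000000 | (c & 0b111111);        // bits 5-0 of c
        System.out.println(Integer.toBinaryString(byte1) + " "
                + Integer.toBinaryString(byte2));        // 11000010 10110101
    }
}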

How does this code involving xor actually work?

I have a variable that represents the XOR of 2 numbers. For example: int xor = 7 ^ 2;
I am looking into a code that according to comments finds the rightmost bit that is set in XOR:
int rightBitSet = xor & ~(xor - 1);
I can't follow exactly how this piece of code works. I mean, in the case of 7^2 it will indeed set rightBitSet to 0001 (in binary), i.e. 1 (indeed the rightmost bit set).
But if the xor is 7^3 then rightBitSet is set to 0100, i.e. 4, which is also the same value as xor (and is not the rightmost bit set).
The logic of the code is to find a number that represents a differing bit between the numbers that make up xor, and although the comments indicate that it finds the rightmost bit set, it seems to me that the code finds a bit pattern with 1 differing bit in any place.
Am I correct? I am also not sure how the code works. It seems that there is some relationship between a number X and the number X-1 in their binary representations?
What is this relationship?
The effect of subtracting 1 from a binary number is to replace the least significant 1 in it with a 0, and set all the less significant bits to 1. For example:
5 - 1 = 101 - 1 = 100 = 4
4 - 1 = 100 - 1 = 011 = 3
6 - 1 = 110 - 1 = 101 = 5
So consider evaluating x & ~(x - 1). Above x's least significant 1, x - 1 has the same bits as x, so ~(x - 1) has the same bits as ~x there, and x & ~(x - 1) therefore has no 1 bits above that point. At the position of x's least significant 1, x has a 1 by definition, and as we saw above, ~(x - 1) does too; below that point ~(x - 1) has only 0s. Therefore, x & ~(x - 1) has exactly one 1 bit: the least significant set bit of x.
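Here is a small demonstration of my own (not from the original code) that prints x & ~(x - 1) for a few values; note that the JDK's Integer.lowestOneBit(x) computes the same thing, written as x & -x:

// Show that x & ~(x - 1) isolates the lowest set bit of x.
public class LowestSetBit {
    public static void main(String[] args) {
        int[] samples = {7 ^ 2, 7 ^ 3, 12, 40};
        for (int x : samples) {
            int lowest = x & ~(x - 1);
            System.out.printf("x=%2d (%6s)  lowest set bit=%2d (%6s)%n",
                    x, Integer.toBinaryString(x),
                    lowest, Integer.toBinaryString(lowest));
        }
    }
}

For 7 ^ 3 = 4 = 100 the rightmost set bit really is 100, because it is the only set bit, which resolves the apparent contradiction in the question.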

On integer multiplication, overflow, and information loss

I'm reading through Chapter 3 of Joshua Bloch's Effective Java. In Item 8: Always override hashCode when you override equals, the author uses the following combining step in his hashing function:
result = 37 * result + c;
He then explains why 37 was chosen (emphasis added):
The multiplier 37 was chosen because it is an odd prime. If it was even and the multiplication overflowed, information would be lost because multiplication by two is equivalent to shifting. The advantages of using a prime number are less clear, but it is traditional to use primes for this purpose.
My question is why does it matter that the combining factor (37) is odd? Wouldn't multiplication overflow result in a loss of information regardless of whether the factor was odd or even?
Consider what happens when a positive value is repeatedly multiplied by two in a base-2 representation -- all the set bits eventually march off the end, leaving you with zero.
An even multiplier would result in hash codes with less diversity.
Odd numbers, on the other hand, may result in overflow, but without loss of diversity.
The purpose of a hashCode is to have random bits based on the input (especially the lower bits, as these are often used more).
When you multiply by 2, the lowest bit can only be 0, which lacks randomness. If you multiply by an odd number, the lowest bit can be either 0 or 1.
A similar question is: what do you get here?
public static void main(String... args) {
    System.out.println(factorial(66));
}

public static long factorial(int n) {
    long product = 1;
    for (; n > 1; n--)
        product *= n;
    return product;
}
prints
0
Every second number is even, every fourth is a multiple of 4, and so on; by the time you reach 66!, the product has accumulated at least 64 factors of 2, so all 64 bits of the long have overflowed away and the result is 0.
The explanation lies in number theory and the greatest common divisor (GCD) of your multiplier and your modulus.
An example may help. Let's say instead of 32 bits you only have 2 bits to represent a number, so you have 4 numbers (classes): 0, 1, 2 and 3.
An overflow in the CPU is the same as a modulo operation:
Class  x2  mod 4  x2  mod 4
  0     0    0     0    0
  1     2    2     4    0
  2     4    0     0    0
  3     6    2     4    0
After 2 operations you only have 1 possible number (class) left, so you have 'lost' information.
Class  x3  mod 4  x3  mod 4  ...
  0     0    0     0    0
  1     3    3     9    1
  2     6    2     6    2
  3     9    1     3    3
This can go on forever and you still have all 4 classes, so you don't lose information.
The key is that the GCD of your multiplier and your modulus is 1. That holds true for all odd multipliers, because the modulus here is always a power of 2. The multiplier doesn't have to be prime, and it doesn't have to be 37 specifically. Information loss is just one criterion for why 37 was picked; other criteria are the distribution of values, etc.
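A small demonstration of my own along these lines: repeatedly combining with an even multiplier drives an int value to exactly 0, while an odd multiplier such as 37 also overflows but never collapses to 0:

// Repeated multiplication modulo 2^32 (int overflow): an even factor
// eventually shifts every set bit out, an odd factor never does.
public class MultiplierDemo {
    public static void main(String[] args) {
        int even = 1, odd = 1;
        for (int i = 0; i < 40; i++) {
            even *= 2;   // after 32 steps this is exactly 0
            odd *= 37;   // overflows too, but stays odd and non-zero
        }
        System.out.println("x2 forty times:  " + even);  // 0
        System.out.println("x37 forty times: " + odd);   // some odd number
    }
}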
Non-math simple version of why...
Prime numbers are used for hashing to keep diversity.
Perhaps diversity is more important because of the Set and Map implementations. These implementations use the last bits of objects' hash numbers to index internal arrays of entries.
For example, a HashMap with an internal table (array) of entries of size 8 will use the last 3 bits of the hash numbers to address a table entry.
static int indexFor(int h, int length) {
    return h & (length - 1);
}
In fact it's not, but if Integer objects had hash = 4 * number, most of the table elements would be empty while some would contain too many entries. This would lead to extra iterations and comparison operations while searching for a particular entry.
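A quick sketch of my own to illustrate the point (reusing the indexFor helper quoted above): if every hash were a multiple of 4, an 8-slot table could only ever use buckets 0 and 4, leaving the other six empty.

// With hash = 4 * number, h & (length - 1) for length 8 can only be 0 or 4.
public class BucketDemo {
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        for (int number = 0; number < 8; number++) {
            int hash = 4 * number;
            System.out.println("hash=" + hash + " -> bucket " + indexFor(hash, 8));
        }
    }
}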
I guess the main concern of Joshua Bloch was to distribute hash integers as evenly as possible to optimize the performance of collections by distributing objects evenly in Maps and Sets. Prime numbers intuitively seem to be a good factor for distribution.
Prime numbers aren't strictly necessary to ensure diversity; what's necessary is that the factor be relatively prime to the modulus.
Since the modulus for binary arithmetic is always a power of two, any odd number is relatively prime, and would suffice. If you were to take a modulus other than by overflow, though, a prime number would continue to ensure diversity (assuming you didn't choose the same prime...).

Why if (n & -n) == n then n is a power of 2?

Line 294 of java.util.Random source says
if ((n & -n) == n) // i.e., n is a power of 2
// rest of the code
Why is this?
Because in 2's complement, -n is ~n+1.
If n is a power of 2, then it only has one bit set. So ~n has all the bits set except that one. Add 1, and you set the special bit again, ensuring that n & (that thing) is equal to n.
The converse is also true because 0 and negative numbers were ruled out by the previous line in that Java source. If n has more than one bit set, then one of those is the highest such bit. This bit will not be set by the +1 because there's a lower clear bit to "absorb" it:
n: 00001001000
~n: 11110110111
-n: 11110111000 // the first 0 bit "absorbed" the +1
^
|
(n & -n) fails to equal n at this bit.
The description is not entirely accurate because (0 & -0) == 0 but 0 is not a power of two. A better way to say it is
((n & -n) == n) when n is a power of two, or the negative of a power of two, or zero.
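A quick check of that caveat (my own sketch): zero satisfies the test, and so does Integer.MIN_VALUE, whose negation overflows back to itself, while an ordinary non-power-of-two does not.

// (n & -n) == n for n = 0 and n = Integer.MIN_VALUE (-2^31), but not for 12.
public class EdgeCases {
    public static void main(String[] args) {
        int zero = 0;
        int min = Integer.MIN_VALUE;
        System.out.println((zero & -zero) == zero);  // true
        System.out.println((min & -min) == min);     // true: -min overflows to min
        System.out.println((12 & -12) == 12);        // false: 12 & -12 is 4
    }
}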
If n is a power of two, then n in binary is a single 1 followed by zeros.
-n in two's complement is the inverse + 1, so the bits line up thus:
n 0000100...000
-n 1111100...000
n & -n 0000100...000
To see why this works, consider two's complement as inverse + 1, -n == ~n + 1:
n 0000100...000
inverse n 1111011...111
+ 1
two's comp 1111100...000
since you carry the one all the way through when adding one to get the two's complement.
If n were anything other than a power of two† then the result would be missing a bit because the two's complement would not have the highest bit set due to that carry.
† - or zero or a negative of a power of two ... as explained at the top.
You need to look at the values as bitmaps to see why this is true:
1 & 1 = 1
1 & 0 = 0
0 & 1 = 0
0 & 0 = 0
So only if both fields are 1 will a 1 come out.
Now -n takes the 2's complement of n: it inverts every bit (0 becomes 1 and 1 becomes 0) and then adds 1.
7 = 00000111
-7 = NEG(7) + 1 = 11111000 + 1 = 11111001
However
8 = 00001000
-8 = 11110111 + 1 = 11111000
00001000 (8)
11111000 (-8)
--------- &
00001000 = 8.
Only for powers of 2 will (n & -n) be n.
This is because a power of 2 is represented as a single set bit in a long sea of zero's.
The negation will yield the exact opposite: a single zero (in the spot where the 1 used to be) in a sea of 1's. Adding 1 carries through the 1's below that zero, clearing them and setting the bit where the zero was.
The bitwise and (&) then keeps exactly that 1 again.
In two's complement representation, the unique thing about powers of two is that they consist of all 0 bits except for the kth bit, where n = 2^k:
base 2 base 10
000001 = 1
000010 = 2
000100 = 4
...
To get a negative value in two's complement, you flip all the bits and add one. For powers of two, that means you get a bunch of 1s on the left up to and including the 1 bit that was in the positive value, and then a bunch of 0s on the right:
n  base 2  ~n      ~n+1 (-n)  n&-n
1  000001  111110  111111     000001
2  000010  111101  111110     000010
4  000100  111011  111100     000100
8  001000  110111  111000     001000
You can easily see that the result of column 2 & 4 is going to be the same as column 2.
If you look at the other values missing from this chart, you can see why this doesn't hold for anything but the powers of two:
n  base 2  ~n      ~n+1 (-n)  n&-n
1  000001  111110  111111     000001
2  000010  111101  111110     000010
3  000011  111100  111101     000001
4  000100  111011  111100     000100
5  000101  111010  111011     000001
6  000110  111001  111010     000010
7  000111  111000  111001     000001
8  001000  110111  111000     001000
n&-n will (for n > 0) only ever have 1 bit set, and that bit will be the least significant set bit in n. For all numbers that are powers of two, the least significant set bit is the only set bit. For all other numbers, there is more than one bit set, of which only the least significant will be set in the result.
It's a property of powers of 2 and their two's complement.
For example, take 8:
8 = 0b00001000
-8 = 0b11111000
Calculating the two's complement:
Starting: 0b00001000
Flip bits: 0b11110111 (one's complement)
Add one: 0b11111000
AND 8 : 0b00001000
For powers of 2, only one bit is set, so adding one makes the carry propagate all the way up to that bit and set it again. Then when you AND the two numbers, you get the original back.
For numbers that aren't powers of 2, other bits will not get flipped so the AND doesn't yield the original number.
Simply, if n is a power of 2 that means only one bit is set to 1 and the others are 0's:
00000...00001 = 2 ^ 0
00000...00010 = 2 ^ 1
00000...00100 = 2 ^ 2
00000...01000 = 2 ^ 3
00000...10000 = 2 ^ 4
and so on ...
and because -n is the 2's complement of n (which means the only bit that is 1 stays as it is, and the bits to the left of that bit are set to 1, which actually doesn't matter since the result of the AND operator & will be 0 if either of the two bits is zero):
000000...000010000...00000 <<< n
&
111111...111110000...00000 <<< -n
--------------------------
000000...000010000...00000 <<< n
Shown through example:
8 in hex = 0x000008
-8 in hex = 0xFFFFF8
8 & -8 = 0x000008
