The descriptions of bitCount() and bitLength() are rather cryptic:
public int bitCount()
Returns the number of bits in the two's complement representation of this BigInteger that differ from its sign bit. This method is useful when implementing bit-vector style sets atop BigIntegers.
Returns:
number of bits in the two's complement representation of this BigInteger that differ from its sign bit.
public int bitLength()
Returns the number of bits in the minimal two's-complement representation of this BigInteger, excluding a sign bit. For positive BigIntegers, this is equivalent to the number of bits in the ordinary binary representation. (Computes (ceil(log2(this < 0 ? -this : this+1))).)
Returns:
number of bits in the minimal two's-complement representation of this BigInteger, excluding a sign bit.
What is the real difference between these two methods and when should I use which?
I have used bitCount() occasionally to count the number of set bits in a positive integer, but I've only rarely used bitLength(), and usually when I actually meant bitCount(), because the differences between the descriptions are too subtle for me to instantly grok.
A quick demonstration:
public void test() {
    // Requires java.math.BigInteger
    BigInteger b = BigInteger.valueOf(0x12345L);
    System.out.println("b = " + b.toString(2));            // binary representation
    System.out.println("bitCount(b) = " + b.bitCount());   // number of set bits
    System.out.println("bitLength(b) = " + b.bitLength()); // length of the binary representation
}
prints
b = 10010001101000101
bitCount(b) = 7
bitLength(b) = 17
So, for positive integers:
bitCount() returns the number of set bits in the number.
bitLength() returns the number of bits needed to represent the number, i.e. the position of the highest set bit plus one, which is the length of the binary representation of the number (roughly log2).
A third basic function deserves a mention alongside these two:
bitCount() is useful for finding the cardinality of a set of integers;
bitLength() is useful for finding the largest integer that is a member of the set;
getLowestSetBit() is still needed to find the smallest integer that is a member of the set (it is also needed to implement fast iterators over bitsets).
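To make this concrete, here is a minimal sketch (the class and variable names are mine) of using a BigInteger as a bit-vector set, with all three operations:

import java.math.BigInteger;

public class BigIntegerSetDemo {
    public static void main(String[] args) {
        // Build the set {3, 5, 11} as a bit vector: bit i set <=> i is a member.
        BigInteger set = BigInteger.ZERO.setBit(3).setBit(5).setBit(11);

        System.out.println(set.bitCount());        // 3  -> cardinality of the set
        System.out.println(set.bitLength() - 1);   // 11 -> largest member
        System.out.println(set.getLowestSetBit()); // 3  -> smallest member
    }
}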
There are efficient ways to:
reduce a very large bitset to its bitCount() without shifting each stored word (e.g. each 64-bit word) bit by bit in a slow loop. No loop is needed at all: a small, bounded number of arithmetic operations on 64-bit words is enough (with the extra benefits that there are no loop-condition tests, parallelism is possible, and fewer than 64 operations are needed per 64-bit word), so the cost is O(1) time per word; see the sketch after this list.
compute the bitLength(): you only need the number of words used to store the bitset (or the highest used index in the array of words), plus a few arithmetic operations on the single word stored at that index; on a 64-bit word at most about 8 arithmetic operations are sufficient, so the cost is O(1) time.
but for the lowest set bit: you still need to perform a binary search within a word to locate the position below which all bits are zero, and that word sits at an unknown position among the low-order words, which must be scanned for as long as they are all zero; so parallelization is difficult and the cost is O(N) time, where N is the bitLength() of the bitset. I wonder whether the costly tests and branches on that first non-zero word can be avoided using arithmetic only, so that full parallelism can be used to get the answer in O(1) time for that last word.
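As an illustration of the first two points, here is a rough sketch of the classic word-parallel ("SWAR") bit count on a single 64-bit word, next to the JDK helpers that do the same job; the class name and the sample word are mine:

public final class BitWordOps {
    // Word-parallel ("SWAR") population count: a constant number of
    // arithmetic/logical operations, no loop, no branches.
    static int popCount64(long x) {
        x = x - ((x >>> 1) & 0x5555555555555555L);
        x = (x & 0x3333333333333333L) + ((x >>> 2) & 0x3333333333333333L);
        x = (x + (x >>> 4)) & 0x0F0F0F0F0F0F0F0FL;
        return (int) ((x * 0x0101010101010101L) >>> 56);
    }

    public static void main(String[] args) {
        long word = 0b1011_0000_0000_1000L;                        // bits 3, 12, 13, 15 set
        System.out.println(popCount64(word));                      // 4
        System.out.println(Long.bitCount(word));                   // 4 (JDK equivalent, usually a CPU intrinsic)
        System.out.println(64 - Long.numberOfLeadingZeros(word));  // 16 -> "bitLength" of this word
        System.out.println(Long.numberOfTrailingZeros(word));      // 3  -> lowest set bit of this word
    }
}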
In my opinion the third problem calls for a more efficient storage layout for bitsets than a flat array of words: we need a representation using a binary tree instead:
Suppose you want to store 64 bits in a bitset.
This set is equivalent to two subsets A and B of 32 bits each.
But instead of naively storing {A, B}, you can store {A or B, (A or B) xor A, (A or B) xor B}, where "or" and "xor" are bitwise operations (this adds roughly 50% of information, by storing not just the two separate elements but their "sum" and their respective differences from that sum).
You can apply this recursively for 128 bits, 256 bits, and so on; in fact you can avoid the 50% overhead at each step by summing more than two elements. Using the "xor" differences instead of the elements themselves can also accelerate some operations (not shown here), like other compression schemes that are efficient on sparse sets.
This allows faster scanning for zeroes, because you can skip runs of zero bits very quickly, in O(log2(N)) time, and locate only the words that have some non-zero bits: the zero regions are exactly those with (A or B) == 0.
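Here is a much simplified, non-recursive sketch of that summary idea, assuming a fixed capacity of 4096 bits: one extra 64-bit word records which data words are non-zero, so a scan for the lowest set bit can skip all-zero words without touching them. It is not the full {A or B, ...} tree described above, just a flat two-level version of the same trick.

// A two-level bitset sketch (hypothetical class, fixed 4096-bit capacity).
final class TwoLevelBitSet {
    private final long[] words = new long[64]; // 64 words * 64 bits = 4096 bits
    private long summary;                      // bit i set <=> words[i] != 0

    void set(int bit) {                        // bit must be in 0..4095
        int w = bit >>> 6;
        words[w] |= 1L << (bit & 63);
        summary |= 1L << w;
    }

    int lowestSetBit() {                       // -1 if the set is empty
        if (summary == 0) return -1;
        int w = Long.numberOfTrailingZeros(summary);            // skip all-zero words in one step
        return (w << 6) + Long.numberOfTrailingZeros(words[w]); // then resolve within that word
    }
}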
Another common usage of bitsets is to let them represent their complement, but this is not easy when the number of integers the set could have as members is very large (e.g. for a set of 64-bit integers): the bitset should then reserve at least one bit to indicate that it does NOT directly store the integers that are members of the set, but instead stores only the integers that are NOT members.
An efficient tree-like representation of the bitset should then let each node of the binary tree choose whether it stores the members or the non-members, depending on the cardinality of members in each subrange (each subrange representing the subset of all integers between k and k+2^n-1, where k is derived from the node number in the binary tree, each node storing a single word of n bits, one of those bits recording whether the word contains members or non-members).
There is an efficient way to store binary trees in a flat indexed array if the tree is dense enough, i.e. with few words whose bits are all 0 or all 1. If that is not the case (for very "sparse" sets), you need something else using pointers, like a B-tree, where each page of the B-tree can be either a flat "dense" range or an ordered index of subtrees: you store the flat dense ranges in leaf nodes, which can be allocated in a flat array, and you store the other nodes separately in another store that can also be an array. Instead of a pointer from one node to another for a sub-branch of the B-tree, you use an index into that array; the index itself can carry one bit indicating whether it points to another page of branches or to a leaf node.
But the current default implementation of bitsets in the Java collections (java.util.BitSet) does not use these techniques, so BitSets are still not efficient for storing very sparse sets of large integers. You need your own library to reduce the storage requirement while still allowing fast lookups, in O(log2(N)) time, to determine whether an integer is a member of the set represented by such an optimized bitset.
But anyway, the default Java implementation is sufficient if you only need bitCount() and bitLength() and your bitsets represent dense sets of small integers (for a set of 16-bit integers, a naive approach storing 64K bits, i.e. at most 8KB of memory, is generally enough).
For very sparse sets of large integers (say, no more than one set bit in every 128-bit range), it will always be more efficient to store a sorted array of integer values, or a hash table if the set has no more than about one set bit in every 32-bit range; you can still add an extra bit to these structures to store the "complement" flag.
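A minimal sketch of the sorted-array alternative for very sparse sets (the class name is mine): membership costs O(log N) via binary search, and memory is 8 bytes per actual member rather than one bit per possible value.

import java.util.Arrays;

final class SparseLongSet {
    private final long[] members;              // assumed sorted and duplicate-free

    SparseLongSet(long[] sortedMembers) {
        this.members = sortedMembers;
    }

    boolean contains(long value) {
        return Arrays.binarySearch(members, value) >= 0;
    }
}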
But I have not found getLowestSetBit() to be efficient enough: the BigInteger class still cannot support very sparse bitsets without huge memory costs, even though BigInteger can easily represent the "complement" flag as a sign bit via its signum() and subtract() methods, which are efficient.
Very large and very sparse bitsets are needed, for example, for some well-known operations such as searches in very large databases of RDF tuples in a knowledge base, where each tuple is indexed by a very large GUID (represented by 128-bit integers): you need to be able to perform binary operations such as unions, differences, and complements.
Related
I am reading the implementation details of the Java 8 HashMap. Can anyone tell me why the Java HashMap initial array size is 16 specifically? What is so special about 16? And why is it always a power of two? Thanks
The reason powers of 2 appear everywhere is that when numbers are expressed in binary (as they are in circuits), certain math operations on powers of 2 are simpler and faster to perform (just think about how easy math with powers of 10 is in the decimal system we use). For example, multiplication is not a very efficient process in computers: circuits use a method similar to the one you use when multiplying two numbers that each have multiple digits. Multiplying or dividing by a power of 2, on the other hand, only requires the computer to shift bits to the left (for multiplying) or to the right (for dividing).
And as for why 16 for HashMap? 10 is a commonly used default for dynamically growing structures (arbitrarily chosen), and 16 is not far off - but is a power of 2.
You can do modulus very efficiently for a power of 2. n % d = n & (d-1) when d is a power of 2, and modulus is used to determine which index an item maps to in the internal array - which means it occurs very often in a Java HashMap. Modulus requires division, which is also much less efficient than using the bitwise and operator. You can convince yourself of this by reading a book on Digital Logic.
The reason why bitwise and works this way for powers of two is because every power of 2 is expressed as a single bit set to 1. Let's say that bit is t. When you subtract 1 from a power of 2, you set every bit below t to 1, and every bit above t (as well as t) to 0. Bitwise and therefore saves the values of all bits below position t from the number n (as expressed above), and sets the rest to 0.
But how does that help us? Remember that when dividing by a power of 10, you can count the number of zeroes following the 1, and take that number of digits starting from the least significant of the dividend in order to find the remainder. Example: 637989 % 1000 = 989. A similar property applies to binary numbers with only one bit set to 1, and the rest set to 0. Example: 100101 % 001000 = 000101
There's one more thing about choosing the hash & (n - 1) versus modulo and that is negative hashes. hashcode is of type int, which of course can be negative. modulo on a negative number (in Java) is negative also, while & is not.
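A quick check of both points, assuming a table size of 16 purely for illustration:

public class MaskVsModulo {
    public static void main(String[] args) {
        int tableSize = 16;                              // power of two

        int h = 637989;
        System.out.println(h % tableSize);               // 5
        System.out.println(h & (tableSize - 1));         // 5 -> same bucket, no division needed

        int negative = -7;
        System.out.println(negative % tableSize);        // -7 (unusable as an array index)
        System.out.println(negative & (tableSize - 1));  // 9  (always in 0..15)
    }
}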
Another reason is that you want all of the slots in the array to be equally likely to be used. Since hash() is evenly distributed over 32 bits, if the array size didn't divide into the hash space, then there would be a remainder causing lower indexes to have a slightly higher chance of being used. Ideally, not just the hash, but (hash() % array_size) is random and evenly distributed.
But this only really matters for data with a small hash range (like a byte or character).
Other than the difference in methods available, why would someone use a BitSet as opposed to an array of booleans? Is performance better for some operations?
You would do it to save space: a boolean occupies a whole byte, so an array of N booleans would occupy eight times the space of a BitSet with the equivalent number of entries.
Execution speed is another closely related concern: you can produce a union or an intersection of several BitSet objects faster, because these operations can be performed by the CPU as bitwise ANDs and ORs on a whole 64-bit word at a time (java.util.BitSet is backed by a long[]).
In addition to the space savings noted by @dasblinkenlight, a BitSet has the advantage that it will grow as needed. If you do not know beforehand how many bits will be needed, or the high-numbered bits are sparse and rarely used (e.g. you are detecting which Unicode characters are present in a document and you want to allow for the unusual "foreign" ones > 128, but you know that they will be rare), a BitSet will save even more memory.
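A small sketch contrasting the two: set operations work a whole word at a time, and the BitSet grows on demand (all names and numbers here are just for illustration).

import java.util.BitSet;

public class BitSetDemo {
    public static void main(String[] args) {
        BitSet evens = new BitSet();
        BitSet multiplesOfThree = new BitSet();
        for (int i = 0; i < 100; i += 2) evens.set(i);
        for (int i = 0; i < 100; i += 3) multiplesOfThree.set(i);

        BitSet both = (BitSet) evens.clone();
        both.and(multiplesOfThree);               // intersection: multiples of 6 below 100
        System.out.println(both.cardinality());   // 17

        both.set(1_000_000);                      // grows automatically as needed
        System.out.println(both.length());        // 1000001
    }
}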
I am caching a list of Long indexes in my Java program, and it is causing memory to overflow.
So I decided to cache only the start and end indexes of all continuous runs of indexes and rewrite the required ArrayList APIs. Now, what data structure would be best for implementing this start-end index cache? Is it better to go for a TreeMap and keep the start index as key and the end index as value?
If I were you, I would use some variation of bit string storage.
In Java bit strings are implemented by BitSet.
For example, to represent an arbitrary list of unique 32-bit integers, you could store it as a single bit string 4 billion bits long, which would take 4 billion / 8 bits = 512MB of memory. That is a lot, but it is the worst possible case.
But you can be a lot smarter than that. For example, you could store it as a list or binary tree of smaller fixed-size (or dynamically sized) bit strings, say 65536 bits (8KB) or less each. In other words, each leaf object in this tree would have a small header giving its start offset and length (probably a power of 2 for simplicity, but it does not have to be), plus a bit string storing the actual array members. For efficiency, you could optionally compress this bit string using gzip or a similar algorithm; it will make access slower, but could improve memory efficiency by a factor of 10 or more.
If your 20 million index elements are almost consecutive (not very sparse), it should take only around 20 million bits, i.e. about 2.5 MB, to represent them in memory. If you gzip that, it will probably be under 1MB overall.
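A minimal sketch of that chunked idea (the class name, chunk size, and the TreeMap container are my own illustrative choices; indexes are assumed non-negative): only chunks that actually contain members get allocated.

import java.util.BitSet;
import java.util.TreeMap;

final class ChunkedLongBitSet {
    private static final int CHUNK_BITS = 1 << 16;         // 65536 bits ~ 8KB per chunk
    private final TreeMap<Long, BitSet> chunks = new TreeMap<>();

    void add(long index) {
        long chunk = index / CHUNK_BITS;
        chunks.computeIfAbsent(chunk, c -> new BitSet(CHUNK_BITS))
              .set((int) (index % CHUNK_BITS));
    }

    boolean contains(long index) {
        BitSet chunk = chunks.get(index / CHUNK_BITS);
        return chunk != null && chunk.get((int) (index % CHUNK_BITS));
    }
}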
The most compact representation will depend greatly on the distribution of indices in your specific application.
If your indices are densely clustered, the range-based representation suggested by mvp will probably work well (you might look at implementations of run-length encoding for raster graphics, since they're similar problems).
If your indices aren't clustered in dense runs, that encoding will actually increase memory consumption. For sparsely-populated lists, you might look into primitive data structures such as LongArrayList or LongOpenHashSet in FastUtil, or similar structures in Gnu Trove or Colt. In most VMs, each Long object in your ArrayList consumes 20+ bytes, whereas a primitive long consumes only 8. So you can often realize a significant memory savings with type-specific primitive collections instead of the standard Collections framework.
I've been very pleased with FastUtil, but you might find another solution suits you better. A little simulation and memory profiling should help you determine the most effective representation for your own data.
Most BitSet (compressed or uncompressed) implementations are for integers. Here's one for longs: http://www.censhare.com/en/aktuelles/censhare-labs/yet-another-compressed-bitset which works like an ordered primitive long hash set or long to long hash map.
I have to declare in Java: protected final static int[] SIEVE = new int[1 << 32];
But I can't force Java to do that.
The maximum sieve I can get is 2^26, but I need 2^32 to finish my homework. I tried with a mask, but I need to have SIEVE[n] = k where k = min{k : k | n and k > 2}.
EDIT
I need to factor numbers from 2 to 2^63-1 using a sieve, and the sieve must hold the information P[n] = the smallest prime that divides n. I know that with such a sieve I can factorize numbers up to 2^52. But how do I do this exercise while holding on to that content?
EDIT x2 problem solved
You can't. A Java array can have at most 2^31 - 1 elements because the size of an array has to fit in a signed 32-bit integer.
This applies whether you run on a 32 bit or 64 bit JVM.
I suspect that you are missing something in your homework. Is the requirement to be able to find all primes less than 2^32 or something? If that is the case, they expect you to treat each int of the int[] as an array of 32 bits. And you need an array of only 2^27 ints to do that ... if my arithmetic is right.
A BitSet is another good alternative.
A LinkedList<Integer> is a poor alternative. It uses roughly 8 times the memory that an array of the same size would, and the performance of get(int) is going to be horribly slow for a long list ... assuming that you use it in the obvious fashion.
If you want something that can efficiently use as much memory as you can configure your JVM to use, then you should use an int[][] i.e. an array of arrays of integers, with the int[] instances being as large as you can make them.
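A rough sketch of that bit-packing idea (the class name is mine): one int holds 32 composite/prime flags, so a sieve over all numbers below limit needs only about limit/32 ints. Note this finds primes below an int-sized limit; it does not store the smallest prime factor, which needs a separate, larger table, and going all the way to 2^32 would additionally require splitting the flags across an int[][] as suggested above.

final class PackedSieve {
    private final int[] composite;             // bit n set => n is composite

    PackedSieve(int limit) {                   // sieves the numbers 0 .. limit-1
        composite = new int[(limit >>> 5) + 1];
        for (int p = 2; (long) p * p < limit; p++) {
            if (!get(p)) {
                for (long m = (long) p * p; m < limit; m += p) set((int) m);
            }
        }
    }

    private boolean get(int n) { return (composite[n >>> 5] & (1 << (n & 31))) != 0; }
    private void set(int n)    { composite[n >>> 5] |= 1 << (n & 31); }

    boolean isPrime(int n)     { return n >= 2 && !get(n); }   // valid for n < limit
}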
I need to factor numbers from 2 to 2^63-1 using a sieve, and the sieve must hold the information P[n] = the smallest prime that divides n. I know that with such a sieve I can factorize numbers up to 2^52. But how do I do this exercise while holding on to that content?
I'm not really sure I understand you. To factorize a number in the region of 2^64, you only need prime numbers up to 2^32 ... not 2^52. (The square root of 2^64 is 2^32 and a non-prime number must have a prime factor that is less than or equal to its square root.)
It sounds like you are trying to sieve more numbers than you need to.
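To spell out why the square root is enough, here is a sketch of trial division by sieved primes (the class name is mine; the primes array, in ascending order, is an assumed input): whatever is left after dividing out all primes up to sqrt(n) must itself be prime.

import java.util.ArrayList;
import java.util.List;

final class TrialDivision {
    static List<Long> factorize(long n, int[] primes) {    // assumes n >= 2
        List<Long> factors = new ArrayList<>();
        for (int p : primes) {
            if ((long) p * p > n) break;                   // past sqrt of the remaining n: stop
            while (n % p == 0) {                           // divide out this prime completely
                factors.add((long) p);
                n /= p;
            }
        }
        if (n > 1) factors.add(n);                         // the leftover is itself prime
        return factors;
    }
}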
If you really need to store that much data in memory, try using java.util.LinkedList collection instead.
However, there's a fundamental flaw in your algorithm if you need to store 16GB of data in memory.
If you're talking about Sieve of Eratosthenes and you need to store all primes < 2^32 in an array, you still wouldn't need an array of size 2^32. I'd suggest you use java.util.BitSet to find the primes and either iterate and print or store them in a LinkedList as required.
A hash function is important in implementing a hash table. I know that in Java an Object has its own hash code, which might be generated by a weak hash function.
The following is one snippet that is a "supplemental hash function":
static int hash(Object x) {
    int h = x.hashCode();  // start from the object's own (possibly weak) hash code
    h += ~(h << 9);        // the remaining steps mix high-order bits down into the low-order bits
    h ^= (h >>> 14);
    h += (h << 4);
    h ^= (h >>> 10);
    return h;
}
Can anybody help explain the fundamental idea of a hash algorithm? Is it to generate a non-duplicate integer? If so, how do these bitwise operations achieve that?
A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array. The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes. (wikipedia)
Using more "human" language object hash is a short and compact value based on object's properties. That is if you have two objects that vary somehow - you can expect their hash values to be different. Good hash algorithm produces different values for different objects.
What you are usually trying to do with a hash algorithm is convert a large search key into a small nonnegative number, so you can look up an associated record in a table somewhere, and do it more quickly than M log2 N (where M is the cost of a "comparison" and N is the number of items in the "table") typical of a binary search (or tree search).
If you are lucky enough to have a perfect hash, you know that any element of your (known!) key set will be hashed to a unique, different value. Perfect hashes are primarily of interest for things like compilers that need to look up language keywords.
In the real world, you have imperfect hashes, where several keys all hash to the same value. That's OK: you now only have to compare the key to a small set of candidate matches (the ones that hash to that value), rather than a large set (the full table). The small sets are traditionally called "buckets". You use the hash algorithm to select a bucket, then you use some other searchable data structure for the buckets themselves. (If the number of elements in a bucket is known, or safely expected, to be really small, linear search is not unreasonable. Binary search trees are also reasonable.)
The bitwise operations in your example look a lot like a signature analysis shift register, that try to compress a long unique pattern of bits into a short, still-unique pattern.
Basically, the thing you're trying to achieve with a hash function is to give all bits in the hash code a roughly 50% chance of being off or on given a particular item to be hashed. That way, it doesn't matter how many "buckets" your hash table has (or put another way, how many of the bottom bits you take in order to determine the bucket number)-- if every bit is as random as possible, then an item will always be assigned to an essentially random bucket.
Now, in real life, many people use hash functions that aren't that good. They have some randomness in some of the bits, but not all of them. For example, imagine if you have a hash function whose bits 6-7 are biased-- let's say in the typical hash code of an object, they have a 75% chance of being set. In this made up example, if our hash table has 256 buckets (i.e. the bucket number comes from bits 0-7 of the hash code), then we're throwing away the randomness that does exist in bits 8-31, and a smaller portion of the buckets will tend to get filled (i.e. those whose numbers have bits 6 and 7 set).
The supplementary hash function basically tries to spread whatever randomness there is in the hash codes over a larger number of bits. So in our hypothetical example, the idea would be that some of the randomness from bits 8-31 will get mixed in with the lower bits, and dilute the bias of bits 6-7. It still won't be perfect, but better than before.
If you're generating a hash table, then the main thing you want to get across when writing your hash function is to ensure uniformity, not necessarily to create completely unique values.
For example, if you have a hash table of size 10, you don't want a hash function that returns a hash of 3 over and over. Otherwise, that specific bucket will force a search time of O(n). You want a hash function such that it will return, for example: 1, 9, 4, 6, 8... and ensure that none of your buckets are much heavier than the others.
For your projects, I'd recommend that you use a well-known hashing algorithm such as MD5 or even better, SHA and use the first k bits that you need and discard the rest. These are time-tested functions and as a programmer, you'd be smart to use them.
That code is attempting to improve the quality of the hash value by mashing the bits around.
The overall effect is that for a given x.hashCode() you hopefully get a better distribution of hash values across the full range of integers. The performance of certain algorithms will improve if you started with a poor hashcode implementation but then improve hash codes in this way.
For example, hashCode() for a humble Integer in Java just returns the integer value. While this is fine for many purposes, in some cases you want a much better hash code, so putting the hashCode through this kind of function would improve it significantly.
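A quick illustration of that effect, reusing the mixing steps from the snippet above (the keys and the 16-bucket table are made up for the example): keys that are all multiples of 1024 share their low bits, so the raw hash codes all land in bucket 0, while the mixed codes are no longer all stuck in one bucket.

public class SupplementalHashDemo {
    static int hash(int h) {        // same mixing steps as the supplemental hash shown earlier
        h += ~(h << 9);
        h ^= (h >>> 14);
        h += (h << 4);
        h ^= (h >>> 10);
        return h;
    }

    public static void main(String[] args) {
        int buckets = 16;
        for (int key = 0; key < 8 * 1024; key += 1024) {
            int raw = Integer.hashCode(key);               // Integer's hash code is just the value itself
            System.out.println((raw & (buckets - 1)) + " vs " + (hash(raw) & (buckets - 1)));
        }
    }
}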
It could be anything you want as long as you adhere to the general contract described in the doc, which in my own words is:
If you call hashCode on an object 100 (N) times, it must return the same value every time, at least during that program execution (a subsequent program execution may return a different one).
If o1.equals(o2) is true, then o1.hashCode() == o2.hashCode() must be true also.
If o1.equals(o2) is false, then o1.hashCode() == o2.hashCode() may be true, but it helps if it is not.
And that's it.
Depending on the nature of your class, the hashCode() may be very complex or very simple. For instance, the String class, which may have millions of instances, needs a very good hashCode implementation and uses a prime number to reduce the possibility of collisions.
If for your class it makes sense to use a consecutive number, that's OK too; there is no reason to complicate it every time.
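For reference, a typical simple implementation follows the same shape String.hashCode uses: combine the fields with a small prime multiplier. The Point class here is just an illustration.

final class Point {
    private final int x, y;

    Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        int h = 17;
        h = 31 * h + x;   // the prime multiplier spreads each field's contribution across the bits
        h = 31 * h + y;
        return h;
    }
}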