Implementing a very efficient bit structure

Implementing a very efficient bit structure - java

I'm looking for a solution in pesudo code or java or js for the following problem:
We need to implement an efficient bit structure to hold data for N bits (you could think of the bits as booleans as well, on/off).
We need to support the following methods:
init(n)
get(index)
set(index, True/False)
setAll(True/false)
Now I got to a solution with o(1) in all except for init that is o(n). The idea was to create an array where each index saves value for a bit. In order to support the setAll I would also save a timestamp withe the bit vapue to know if to take the value from tge array or from tge last setAll value. The o(n) in init is because we need to go through the array to nullify it, otherwise it will have garbage which can be ANYTHING. Now I was asked to find a solution where the init is also o(1) (we can create an array, but we cant clear the garbage, the garbage might even look like valid data which is wrong and make the solution bad, we need a solution that works 100%).
Update:
This is an algorithmic qiestion and not a language specific one. I encountered it in an interview question. Also using an integer to represent the bit array is not good enough because of memory limits. I was tipped that it has something to do with some kind of smart handling of garbage data in the array without ckeaning it in the init, using some kind of mechanism to not fall because if the garbage data in the array (but I'm not sure how).

Make lazy data structure based on hashmap (while hashmap sometimes might have worse access time than o(1)) with 32-bit values (8,16,64 ints are suitable too) for storage and auxiliary field InitFlag
To clear all, make empty map with InitFlag = 0 (deleting old map is GC's work in Java, isn't it?)
To set all, make empty map with InitFlag = 1
When changing some bit, check whether corresponding int key bitnum/32 exists. If yes, just change bitnum&32 bit, if not and bit value differs from InitFlag - create key with value based on InitFlag (all zeros or all ones) and change needed bit.
When retrieving some bit, check whether corresponding key exists. If yes, extract bit, if not - get InitFlag value
SetAll(0): ifl = 0, map - {}
SetBit(35): ifl = 0, map - {1 : 0x10}
SetBit(32): ifl = 0, map - {1 : 0x12}
ClearBit(32): ifl = 0, map - {1 : 0x10}
ClearBit(1): do nothing, ifl = 0, map - {1 : 0x10}
GetBit(1): key=0 doesn't exist, return ifl=0
GetBit(35): key=1 exists, return map[1]>>3 =1
SetAll(1): ifl = 1, map = {}
SetBit(35): do nothing
ClearBit(35): ifl = 1, map - {1 : 0xFFFFFFF7 = 0b...11110111}
and so on

If this is a college/high-school computer science test or homework assignment question - I suspect they are trying to get you to use BOOLEAN BIT-WISE LOGIC - specifically, saving the bit inside of an int or a long. I suspect (but I'm not a mind-reader - and I could be wrong!) that using "Arrays" is exactly what your teacher would want you to avoid.
For instance - this quote is copied from Google's Search Reults:
long: The long data type is a 64-bit two's complement integer. The
signed long has a minimum value of -263 and a maximum value of 263-1.
In Java SE 8 and later, you can use the long data type to represent an
unsigned 64-bit long, which has a minimum value of 0 and a maximum
value of 264-1
What that means is that a single long variable in Java could store 64 of your bit-wise values:
long storage;
// To get the first bit-value, use logical-or ('|') and get the bit.
boolean result1 = (boolean) storage | 0b00000001; // Gets the first bit in 'storage'
boolean result2 = (boolean) storage | 0b00000010; // Gets the second
boolean result3 = (boolean) storage | 0b00000100; // Gets the third
...
boolean result8 = (boolean) storage | 0b10000000; // Gets the eighth result.
I could write the entire thing for you, but I'm not 100% sure of your actual specifications - if you use a long, you can only store 64 separate binary values. If you want an arbitrary number of values, you would have to use as many 'long' as you need.
Here is a SO posts about binary / boolean values:
Binary representation in Java
Here is a SO post about bit-shifting:
Java - Circular shift using bitwise operations
Again, it would be a job, and I'm not going to write the entire project. However, the get(int index) and set(int index, boolean val) methods would involve bit-wise shifting of the number 1.
int pos = 1;
pos = pos << 5; // This would function as a 'pointer' to the fifth element of the binary number list.
storage | pos; // This retrieves the value stored as position 5.

Related

Efficiently Storing a Short History of Boolean Events for many Components

To preface this - I have no influence on the design of this problem and I can't really give a lot of details about the technical background.
Say I have a lot of components of the same type that regularly get a boolean event - and I need to hold a short history of these boolean events.
A coworker of mine wrote a rather naive implementation using the type Map<Component, CircularFifoQueue<Boolean>>, CircularFifoQueue being data structure from Apache Commons. The code works, but given how generics work in Java and the dimensions used, this is really inefficient as it stores a reference to one of the two singleton boolean objects instead of just one bit.
Generally there are around 100K components and the history is supposed to hold the 5-10 most recent boolean values (might be subject to change but probably won't be larger than 10). This currently means that around 1.5GB of RAM are allocated just for these history maps. Also these changes happen quite frequently so it wouldn't hurt to increase the CPU efficiency if possible.
One obvious change would be to move the history into the Component class to remove the HashMap-induced overhead.
The more complicated question is how to efficiently store the last few boolean values.
One possible way would be to use BitSets, but as those use long[] as their underlying data structure, I doubt it would be the most efficient way to store what is essentially 5 bits.
Another option would be to directly use an integer and shift the value as a way to remove old entries. So basically
int history = 0;
public void set(int length, boolean active){
if(active) {
history |= 1 << length;
} else {
history &= ~(1 << length);
}
// shift one to the right to remove oldest entry
history = history >> 1;
}
Just off the top of my head. This code is untested. I don't know how efficient or if it works, but that is about what I had in mind.
But that would still lead to quite some overhead compared to the optimal case of storing 5 bits of data using 5 bits of memory.
One could achieve some additional saving if the histories of the different components were stored in a contiguous array, but I'm not sure how to handle either one giant contiguous BitSet. Or alternatively a large byte[] where each byte represents one bool-history as explained above.
This is a weirdly specific problem and I'd be really glad about any suggestions.

Setting aside the bit manipulations which I'm sure you'll conquer, please think how efficient is efficient enough.
Every instance of
class Foo {}
allocates 16 bytes. So if you were to introduce
class ComponentHistory {
private final int bits;
}
that's 20 bytes.
If you replace the int with byte, you're still at 20 bytes: byte type is padded to 4 bytes by JVM (at least).
If you define a global array of bits somewhere and refer to it from ComponentHistory, the reference itself is at least 4 bytes.
Basically, you can't win :)
But consider this: if you go with the simplest approach that you have already outlined, that produces simple readable code, your 100K component histories will take up 2MB of RAM - substantial savings from your current level of 1.5GB. Specifically, you've saved 1498MB.
Suppose you indeed invent a cumbersome yet working way of only storing 5 bits per history. You'd then need 500Kb = 60KB to store all histories. With the baseline of 1.5GB, your savings are now 1499.94MB. Savings improve by 0.1%. Does that at all matter? More often than not, I'd prefer to not over-optimize here while sacrificing simplicity.

How many objects are created by using the Integer wrapper class?

Integer i = 3;
i = i + 1;
Integer j = i;
j = i + j;
How many objects are created as a result of the statements in the sample code above and why? Is there any IDE in which we can see how many objects are created (maybe in a debug mode)?

The answer, surprisingly, is zero.
All the Integers from -128 to +127 are pre-computed by the JVM.
Your code creates references to these existing objects.

The strictly correct answer is that the number of Integer objects created is indeterminate. It could be between 0 and 3, or 2561 or even more2, depending on
the Java platform3,
whether this is the first time that this code is executed, and
(potentially) whether other code that relies on boxing of int values runs before it4.
The Integer values for -128 to 127 are not strictly required to be precomputed. In fact, JLS 5.1.7 which specified the Boxing conversion says this:
If the value p being boxed is an integer literal of type int between -128 and 127 inclusive (§3.10.1) ... then let a and b be the results of any two boxing conversions of p. It is always the case that a == b.
Two things to note:
The JLS only requires this for >>literals<<.
The JLS does not mandate eager caching of the values. Lazy caching also satisfies the JLS's behavioral requirements.
Even the javadoc for Integer.valueof(int) does not specify that the results are cached eagerly.
If we examine the Java SE source code for java.lang.Integer from Java 6 through 8, it is clear that the current Java SE implementation strategy is to precompute the values. However, for various reasons (see above) that is still not enough to allow us to give a definite answer to the "how many objects" question.
1 - It could be 256 if execution of the above code triggers class initialization for Integer in a version of Java where the cache is eagerly initialized during class initialization.
2 - It could be even more, if the cache is larger than the JVM spec requires. The cache size can be increased via a JVM option in some versions of Java.
3 - In addition to the platform's general approach to implementing boxing, a compiler could spot that some or all of the computation could be done at compile time or optimized it away entirely.
4 - Such code could trigger either lazy or eager initialization of the integer cache.

First of all: The answer you are looking for is 0, as others already mentioned.
But let's go a bit deeper. As Stephen menthioned it depends on the time you execute it. Because the cache is actually lazy initialized.
If you look at the documentation of java.lang.Integer.IntegerCache:
The cache is initialized on first usage.
This means that if it is the first time you call any Integer you actually create:
256 Integer Objects (or more: see below)
1 Object for the Array to store the Integers
Let's ignore the Objects needed for Store the Class (and Methods / Fields). They are anyway stored in the metaspace.
From the second time on you call them, you create 0 Objects.
Things get more funny once you make the numbers a bit higher. E.g. by the following example:
Integer i = 1500;
Valid options here are: 0, 1 or any number between 1629 to 2147483776 (this time only counting the created Integer-values.
Why? The answer is given in the next sentence of Integer-Cache definition:
The size of the cache may be controlled by the -XX:AutoBoxCacheMax= option.
So you actually can vary the size of the cache which is implemented.
Which means you can reach for above line:
1: new Object if your cache is smaller than 1500
0: new Objects if your cache has been initialized before and contains 1500
1629: new (Integer) - Objects if your cache is set to exactly 1500 and has not been initialized yet. Then Integer-values from -128 to 1500 will be created.
As in the sentence above you reach any amount of integer Objects here up to: Integer.MAX_VALUE + 129, which is the mentioned: 2147483776.
Keep in mind: This is only guaranteed on Oracle / Open JDK (i checked Version 7 and 8)
As you can see the completely correct answer is not so easy to get. But just saying 0 will make people happy.
PS: using the menthoned parameter can make the following statement true: Integer.valueOf(1500) == 1500

The compiler unboxes the Integer objects to ints to do arithmetic with them by calling intValue() on them, and it calls Integer.valueOf to box the int results when they are assigned to Integer variables, so your example is equivalent to:
Integer i = Integer.valueOf(3);
i = Integer.valueOf(i.intValue() + 1);
Integer j = i;
j = Integer.valueOf(i.intValue() + j.intValue());
The assignment j = i; is a completely normal object reference assignment which creates no new objects. It does no boxing or unboxing, and doesn't need to as Integer objects are immutable.
The valueOf method is allowed to cache objects and return the same instance each time for a particular number. It is required to cache ints −128 through +127. For your starting number of i = 3, all the numbers are small and guaranteed to be cached, so the number of objects that need to be created is 0. Strictly speaking, valueOf is allowed to cache instances lazily rather than having them all pre-generated, so the example might still create objects the first time, but if the code is run repeatedly during a program the number of objects created each time on average approaches 0.
What if you start with a larger number whose instances will not be cached (e.g., i = 300)? Then each valueOf call must create one new Integer object, and the total number of objects created each time is 3.
(Or, maybe it's still zero, or maybe it's millions. Remember that compilers and virtual machines are allowed to rewrite code for performance or implementation reasons, so long as its behavior is not otherwise changed. So it could delete the above code entirely if you don't use the result. Or if you try to print j, it could realize that j will always end up with the same constant value after the above snippet, and thus do all the arithmetic at compile time, and print a constant value. The actual amount of work done behind the scenes to run your code is always an implementation detail.)

You can debug the Integer.valueOf(int i) method to find out it by yourself.
This method is called by the autoboxing process by the compiler.

How to get a unique alphanumeric based on a unique integer

My webapplication has a table in the database with an id column which will always be unique for each row. In addition to this I want to have another column called code that will have a 6 digit unique Alphanumeric code with numbers 0-9 and alphabets A-Z. Alphabets and number can be duplicate in a code. i.e. FFQ77J. I understand the uniqueness of this 6 digit alphanumeric code reduces over time as more rows are added but for now I am ok with this.
Requirement (update)
- The code should be at least of length 6
- Each code should be Alphanumeric
So I want to generate this Alphanumeric code.
Question
What is a good way to do this?
Should I generate the code and after the generation, run a query to the database and check if it already exists, and if so then generate a new one? To ensure the uniqueness, does this piece of code need to be synchronized so that only one thread runs it?
Is there something built-in to the database that will let me do this?
For the generation I will be using something like this which I saw in this answer
char[] symbols = new char[36];
char[] buf;
for (int idx = 0; idx < 10; ++idx)
symbols[idx] = (char) ('0' + idx);
for (int idx = 10; idx < 36; ++idx)
symbols[idx] = (char) ('A' + idx - 10);
public String nextString()
{
for (int idx = 0; idx < buf.length; ++idx)
buf[idx] = symbols[random.nextInt(symbols.length)];
return new String(buf);
}

Since it's a requirement for the shortcode to not be guessable, you don't want to tie it to your uniqueID row ID. Otherwise that means your rowID needs to be random, in addition to unique. Starting with a counter 0, and incrementing, makes it pretty obvious when your codes are: 000001, 000002, 000003, and so forth.
For your short code, generate a random 32bit int, omit the sign and convert to base36. Make a call to your database, to ensure it's available.
You haven't explicitly called out scalability, but I think it's important to understand the limitations of your design wrt to scale.
At 2^31 possible 6 char base36 values, you will have collisions at ~65k rows (see Birthday Paradox questions)
From your comment, modify your code:
public String nextString()
{
return Integer.toString(random.nextInt(),36);
}

I would simply do this:
String s = Integer.toString(i, 36).toUpperCase();
Choosing base-36 will use characters 0-9a-z for the digits. To get a string that uses uppercase letters (as per your question) you would need to fold the result to upper case.
If you use an auto increment column for your id, set the next value to at least 60,466,176, which when rendered to base 36 is 100000 - always giving you a 6 digit number.

I would start with 0 for an empty table and do a
SELECT MAX(ID) FROM table
to find the largest id so far. Store it in an AtmoicInteger and convert it using toString
AtomicInteger counter = new AtomicInteger(maxSoFar);
String nextId = Integer.toString(counter.incrementAndGet(), 36);
or for padding. 36 ^^ 6 = 2176782336L
String nextId = Long.toString(2176782336L + counter.incrementAndGet(), 36).substring(1);
This will give you uniqueness and no duplicates to worry about. (it's not random either)

Simply, you can use Integer.toString(int i, int radix). Since you have base 36(26 letters+10 digits) you set the radix to 36 and i to your integer. For example, to use 16501, do:
String identifier=Integer.toString(16501, 36);
You can uppercase it with .toUpperCase()
Now onto your other questions, yes, you should query the database first to ensure it doesn't exist. If depending on the database, it may need to be synchronized, or it may not be as it'll use its own locking system. In any case, you'd need to tell us which database.
On the question of whether there's a builtin, we'd need to know the DB type as well.

To create a random but unique value within a small range here are some ideas I know of:
Create a new random value and try to insert it.
Let a database constraint catch violations. This column should also likely be indexed. The DML may need to be tried several times until a unique ID is found. This will lead to more collisions as time progresses, as noted (see the birthday problem).
Create a "free IDs" table ahead of time and on usage mark the ID as being used (or delete it from the "free IDs" table). This is similar to #1 but shifts when the work is done.
This allows the work of finding "free IDs" to be done at another time, perhaps during a cron job, so that there will not be a contraint violation during the insert keeping the insert itself the "same speed" throughout the usage of said domain. Make sure to use transactions.
Create a 1-to-1/injective "mixer" function such that the output "appears random". The point is this function must be 1-to-1 to inherently avoid duplicates.
This output number would then be "base 36 encoded" (which is also injective); but it would be guaranteed unique as long as the input (say, an auto-increment PK) was unique. This would likely be less random than the other approaches, but should still create a nice-looking non-linear output.
A custom injective function can be created around an 8-bit lookup table fairly trivially - just process a byte at a time and shuffle the map appropriately. I really like this idea, but it can still lead to somewhat predictable output
To find free IDs, approaches #1 and #2 above can use "probing with IN" to minimize the number of SQL statements used. That is, generate a bunch of random values and query for them using IN (keeping in mind what sizes of IN your database likes) and then see which values were free (as having no results).
To create a unique ID not constained to such a small space, a GUID or even hashing (e.g. SHA1) might be useful. However, these only guarantee uniqueness because they have 126/160-bit spaces so that the chance of collision (for different input/time-space) is currently accepted as improbable.
I actually really like the idea of using an injective function. Bearing in mind that it is not good "random" output, consider this pseudo-code:
byte_map = [0..255]
map[0] = shuffle(byte_map, seed[0])
..
map[n] = shuffle(byte_map, seed[1])
output[0] = map[0][input[0]]
..
output[n] = map[n][input[n]]
output_str = base36_encode(output[0] .. output[n])
While a very simple setup, numbers like 0x200012 and 0x200054 will still share common output - e.g. 0x1942fe and 0x1942a9 - although the lines will be changed a bit due to the later application of the base-36 encoding. This could probably be further improved to "make it look more random".

For efficient usage, try caching generated code in a HashSet<String> in your application:
HashSet<String> codes = new HashSet<String>();
This way you don't have to make a db call every time to check whether the generated code is unique or not. All you have to do is:
codes.contains(newCode);
And, yes, you should synchronize your method which updates the cache
public synchronize String getCode ()
{
String newCode = "";
do {
newCode = nextString();
}
while(codes.contains(newCode));
codes.put(newCode);
}

You mentioned in your comments that the relationship between id and code should not be easily guessable. For this you basically need encryption; there are plenty of encryption programs and modules out there that will perform encryption for you, given a secret key that you initially generate. To employ this approach, I would recommend converting your id into ascii (i.e., representing as base-256, and then interpreting each base-256 digit as a character) and then running the encryption, and then converting the encrypted ascii (base-256) into base 36 so you get your alpha-numeric, and then using 6 randomly chosen locations in the base 36 representation to get your code. You can resolve collisions e.g. by just choosing the nearest unused 6-digit alpha-numeric code when a collision occurs, and noting the re-assigned alpha-numeric code for the id in a (code <-> id) table that you will have to maintain anyway since you cannot decrypt directly if you only store 6 base-36 digits of the encrypted id.

Java - BitSet Replacement

In my code I am trying to get if a number exist in the hashmap or not. My code is following:
BitSet arp = new BitSet();
for i = 0 to 10 million
HashMap.get (i)
if number exist
arp.set(i , true)
else
arp.set(i , false)
After that from bitset I get if number i exist or not. However, I found this bitset operation is quite slow (tried with string = string + 0/1 also, more slower). Can anybody help me how to replace this operation with a faster one.

Your code is really difficult to read clearly, but I suspect you're just trying to set bits in the BitSet that are keys from your HashMap?
In that case, your code should just be more or less
BitSet bits = new BitSet(10000000);
for (Integer k : map.keySet()) {
bits.set(k);
}
Even if this wasn't what you meant, as a general rule, BitSet is blazing fast; I suspect it's the rest of your code that's slow.

If you provided your actual relevant code, we could have found some performance errors in the first place. But assuming your code is ok and you profiled your application to make sure that the BitSet operations are actually slow:
If you have enough memory space available, you can always just go for a boolean[] instead of a BitSet.
BitSet internally uses long[] to store the separate bits, so it's very good memory-wise, but can sometimes be a little bit too slow.

Generating sequentially all combination of a finite set using lexicographic order and bitwise arithmetic

Consider all combination of length 3 of the following array of integer {1,2,3}.
I would like to traverse all combination of length 3 using the following algorithm from wikipedia
// find next k-combination
bool next_combination(unsigned long& x) // assume x has form x'01^a10^b in binary
{
unsigned long u = x & -x; // extract rightmost bit 1; u = 0'00^a10^b
unsigned long v = u + x; // set last non-trailing bit 0, and clear to the right; v=x'10^a00^b
if (v==0) // then overflow in v, or x==0
return false; // signal that next k-combination cannot be represented
x = v +(((v^x)/u)>>2); // v^x = 0'11^a10^b, (v^x)/u = 0'0^b1^{a+2}, and x ← x'100^b1^a
return true; // successful completion
}
What should be my starting value for this algorithm for all combination of {1,2,3}?
When I get the output of the algorithm, how do I recover the combination?
I've try the following direct adaptation, but I'm new to bitwise arithmetic and I can't tell if this is correct.
// find next k-combination, Java
int next_combination(int x)
{
int u = x & -x;
int v = u + x;
if (v==0)
return v;
x = v +(((v^x)/u)>>2);
return x;
}

I found a class that exactly solve this problem. See the class CombinationGenerator here
https://bitbucket.org/rayortigas/everyhand-java/src/9e5f1d7bd9ca/src/Combinatorics.java
To recover a combination do
for(Long combination : combinationIterator(10,3))
toCombination(toPermutation(combination);
Thanks everybody for your input.

I have written a class to handle common functions for working with the binomial coefficient, which is the type of problem that your problem falls under. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters. This method makes solving this type of problem quite trivial.
Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle. My paper talks about this. I believe I am the first to discover and publish this technique, but I could be wrong.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes. I believe it might be faster than the link you have found.
Uses Mark Dominus method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to perform the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coeffieicent.
It should not be hard to convert this class to Java.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.