I have to store millions of entries in a database. Each entry is identified by a set of unique integer identifiers. For example, a value may be identified by a set of 10 integer identifiers, each of which is less than 100 million.
In order to reduce the size of the database, I thought of the following encoding using a single 32 bit integer value.
Identifier 1: 0 - 100,000,000
Identifier 2: 100,000,001 - 200,000,000
.
.
.
Identifier 10: 900,000,001 - 1,000,000,000
I am using Java. I can write a simple method to encode/decode. The user code does not have to know that I am encoding/decoding during fetch/store.
What I want to know is: what is the most efficient (fastest) and recommended way to implement such encoding/decoding. A simple implementation will perform a large number of multiplications/subtractions.
Is it possible to use shifts (or bitwise operations) and choose different partition size (the size of each segment still has to be close to 100 million)?
I am open to any suggestions, ideas, or even a totally different scheme. I want to exploit the fact that the integer identifiers are bounded to drastically reduce the storage size without noticeably compromising performance.
Edit: I just wanted to add that I went through some of the answers posted on this forum. A common solution was to split the bits for each identifier. If I use 2 bits for each identifier for a total of 10 identifiers, then my range of identifiers gets severely limited.
It sounds like you want to pack multiple integer values in the range 0...100m into a single 32-bit integer? Unless you are omitting important information that would allow these 0...100m values to be stored more efficiently, there is simply no way to do it.
ceil(log2(100m)) = 27bit, which means you have only 5 "spare bits".
You can make the segment size 27 bits, which gives you 32 segments of 128 M each instead of 42 segments of 100 M.
int value =
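int high = value >>> 27;            // segment index, 0..31 (unsigned shift)
int low = value & ((1 << 27) - 1);  // offset within the segment; int mask, since int & long would not fit an int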
It is worth noting that this calculation is likely to be trivial compared to the cost of using a database.
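The shift-and-mask scheme above can be sketched as a small encode/decode helper. This is only an illustration: the class name `PackedId` and the notion of a "slot" (which of the up to 32 identifiers the value belongs to) are assumptions, not part of the original answer.

```java
// Hypothetical helper: packs an identifier slot (0..31) and a value (0..2^27-1) into one int.
final class PackedId {
    static final int VALUE_BITS = 27;                     // ceil(log2(100,000,000)) = 27
    static final int VALUE_MASK = (1 << VALUE_BITS) - 1;  // low 27 bits set

    static int encode(int slot, int value) {
        if (slot < 0 || slot > 31) {
            throw new IllegalArgumentException("slot out of range: " + slot);
        }
        if (value < 0 || value > VALUE_MASK) {
            throw new IllegalArgumentException("value out of range: " + value);
        }
        return (slot << VALUE_BITS) | value;
    }

    // Unsigned shift so that slots 16..31 (sign bit set) decode correctly.
    static int slot(int packed) {
        return packed >>> VALUE_BITS;
    }

    static int value(int packed) {
        return packed & VALUE_MASK;
    }
}
```

Encoding and decoding are each a single shift plus a mask or an OR, with no multiplication or division.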
It's unclear what you actually want to do, but it sounds like you want an integer value, each bit representing having a particular attribute, and applying a bitmask.
A 32-bit integer can save 32 different attributes, 64-bit 64 etc. To have more, you'll need multiple integer columns.
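A minimal sketch of the bitmask idea, assuming three made-up attributes (`ATTR_A`, `ATTR_B`, `ATTR_C` are hypothetical names used only for illustration):

```java
// Hypothetical attribute flags stored in a single 32-bit int column.
final class AttributeFlags {
    static final int ATTR_A = 1 << 0;   // lowest bit
    static final int ATTR_B = 1 << 1;
    static final int ATTR_C = 1 << 31;  // highest bit (the sign bit) is usable too

    static boolean has(int flags, int attribute) {
        return (flags & attribute) != 0;
    }

    static int set(int flags, int attribute) {
        return flags | attribute;
    }

    static int clear(int flags, int attribute) {
        return flags & ~attribute;
    }
}
```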
If that's not it, I don't know what you mean by "encode".
Related
What is a good algorithm to generate a unique 64-bit ID starting from multiple numeric 64-bit IDs? Example:
Input: [2, 9875, 0, 223568, ...] a list of random 64-bit IDs
Output: a unique 64-bit numeric ID that has to be the same for the given input
I'm looking for a way to avoid ID collisions.
My apologies for the unclear question.
If speed does not matter, what about:
feed all your IDs into the MD5 algorithm and then simply use
a) first 64 bits or
b) last 64 bits or
c) first 64 bits xor last 64 bits
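A minimal sketch of option (c), assuming the IDs are fed to MD5 as big-endian 8-byte values in list order (the class name `Md5Id` and that byte layout are assumptions, not something the answer specifies):

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: derives one 64-bit ID from a list of 64-bit IDs via MD5.
final class Md5Id {
    static long combine(long[] ids) {
        // Serialize every ID as 8 big-endian bytes, in input order.
        ByteBuffer input = ByteBuffer.allocate(ids.length * 8);
        for (long id : ids) {
            input.putLong(id);
        }
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(input.array());
            ByteBuffer out = ByteBuffer.wrap(digest); // MD5 yields 16 bytes
            long first = out.getLong();               // first 64 bits
            long last = out.getLong();                // last 64 bits
            return first ^ last;                      // option (c)
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

The result is deterministic for a given input order, but collisions remain possible because 128 bits are folded into 64.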
If speed matters
What about:
Step 1: reorder the bytes of all 64 bit IDs (in a fixed but different order for each 64 bit ID of your Input.)(This might help a bit if the values are not really randomly distributed)
Step 2: xor all the rearranged 64 bit IDs to get the new 64 bit id.
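The two steps can be sketched as follows. Rotating each ID by a whole number of bytes is only one possible "fixed but different" byte reordering, chosen here for illustration, and the class name `XorCombine` is hypothetical. With this particular choice, positions i and i + 8 share the same rotation.

```java
// Hypothetical helper: byte-rotate each ID by a position-dependent amount, then XOR.
final class XorCombine {
    static long combine(long[] ids) {
        long result = 0L;
        for (int i = 0; i < ids.length; i++) {
            // Step 1: rotating by a multiple of 8 bits reorders the bytes of the ID.
            long reordered = Long.rotateLeft(ids[i], 8 * (i % 8));
            // Step 2: fold the reordered ID into the result.
            result ^= reordered;
        }
        return result;
    }
}
```

Without step 1, two equal IDs in the input would cancel each other out under XOR; with it, the result also depends on position.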
If you have no extra information about the range of your 64 bit input IDs or the distribution of the values, there is no way to avoid collisions in a 'clever'/'best' way. Because whatever you come up with, you will always find a set of inputs which lead to collisions.
This old question will be useful for you if you want more information, but you can try this:
java.util.UUID.randomUUID().hashCode()
If the previous solution didn't work in your case, try this:
private static final AtomicLong COUNTER = new AtomicLong(System.currentTimeMillis()*100000);
Or this algorithm (GitHub link).
I'm looking at BigInteger as a big number (practically) and I'm trying to perform a left shift on the number. So, when I perform a 32 bit left shift on the number (I'm currently using 2), I get the same number again (which is expected for an integer).
Is there any way I can increase the number of bits used to store the number? I know I can use long; however, I want to cross the 64 bit limit. Is there any way I could do that?
It's difficult to say exactly what your problem is without seeing any actual code, but note that BigInteger instances are immutable. If you write aBigInt.shiftLeft(32) the instance referenced by aBigInt is not changed. Instead, a new BigInteger instance with the result of the operation is returned. Try: aBigInt = aBigInt.shiftLeft(32).
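A short sketch of the difference, assuming a starting value of 2 as in the question (the class name `ShiftDemo` is made up for the example):

```java
import java.math.BigInteger;

// Hypothetical demo of BigInteger immutability when shifting.
final class ShiftDemo {
    static BigInteger shifted() {
        BigInteger n = BigInteger.valueOf(2);
        n.shiftLeft(32);         // result is discarded; n is still 2
        n = n.shiftLeft(32);     // n is now 2 * 2^32 = 8589934592
        return n.shiftLeft(64);  // 2^97, well beyond the 64-bit limit
    }
}
```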
Several months ago I implemented a solution to choose unique values from a range between 1 and 65535 (16 bits). This range is used to generate unique Route Target suffixes, which for this customer's massive network (it's a huge ISP) are a very disputed resource, so any released value needs to become immediately available to the end user.
To tackle this requirement I used a BitSet. Allocate a suffix on the RT index with set and deallocate a suffix with clear. The method nextClearBit() can find the next available value. I handle synchronization / concurrency issues manually.
This works pretty well for a small range... The entire index is small (around 10k), it is blazing fast, and it can be easily serialized and stored in a blob field.
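For reference, the BitSet approach described above might look roughly like this. It is a sketch, not the asker's actual code: the class name `SuffixPool` is hypothetical, and plain `synchronized` methods stand in for whatever manual concurrency handling is really used.

```java
import java.util.BitSet;

// Hypothetical pool of Route Target suffixes in the range 1..65535.
final class SuffixPool {
    private static final int MAX = 65535;
    private final BitSet used = new BitSet(MAX + 1);

    // Returns the lowest free suffix and marks it as used.
    synchronized int allocate() {
        int next = used.nextClearBit(1); // start at 1; 0 is not a valid suffix
        if (next > MAX) {
            throw new IllegalStateException("pool exhausted");
        }
        used.set(next);
        return next;
    }

    // Makes a suffix immediately available again.
    synchronized void release(int suffix) {
        used.clear(suffix);
    }
}
```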
The problem is that some new devices can handle unsigned 32-bit RT suffixes (range 1 to 4294967296), which can't be managed with a BitSet (it would, by itself, consume around 512 MB, and it is limited to the signed 32-bit int range). Even with this massive range available, the client still wants freed Route Target suffixes to become available to the end user again, mainly because the lowest ones (up to 65535), which are compatible with old routers, are heavily disputed.
Before I tell the customer that this is impossible and he will have to settle for my reusable index for lower RTs (up to 65535) and use a database sequence for the other ones (which means that when the user frees a Route Target, it will not become available again), would anyone shed some light?
Maybe some kind soul already implemented a high performance number pool for Java (6 if it matters), or I am missing a killer feature of Oracle database (11R2 if it matters)... Just some wishful thinking :).
Thank you very much in advance.
I would combine the following:
your current BitSet solution for 1-65535
PLUS
Oracle-based solution with a sequence for 65536 - 4294967296 which wraps around defined as
CREATE SEQUENCE MyIDSequence
MINVALUE 65536
MAXVALUE 4294967296
START WITH 65536
INCREMENT BY 1
CYCLE
NOCACHE
ORDER;
This sequence gives you ordered values in the specified range and allows for reuse of any values but only after the maximum is reached - which should allow enough time for the values being released... if need be you can keep track of values in use in a table and just increment further if the returned value is already in use - all this can be wrapped nicely into a stored procedure for convenience...
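The "skip values that are still in use" logic could be sketched like this. Note that this is an in-memory Java analogue for illustration only, not the Oracle stored procedure the answer suggests; the class name `CyclingAllocator` and the `HashSet` standing in for the tracking table are assumptions.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory analogue of a cycling sequence plus an "in use" table.
final class CyclingAllocator {
    private final long min;
    private final long max;
    private long next;
    private final Set<Long> inUse = new HashSet<Long>();

    CyclingAllocator(long min, long max) {
        this.min = min;
        this.max = max;
        this.next = min;
    }

    synchronized long allocate() {
        long range = max - min + 1;
        for (long tried = 0; tried < range; tried++) {
            long candidate = next;
            next = (next == max) ? min : next + 1; // wrap around like CYCLE
            if (inUse.add(candidate)) {            // add() returns false if already in use
                return candidate;
            }
        }
        throw new IllegalStateException("no free value in range");
    }

    synchronized void release(long value) {
        inUse.remove(value);
    }
}
```

A released value becomes available again only once the counter wraps around to it, which matches the behaviour described above.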
This project may be of some use.
I need to generate a hash value used for uniqueness of many billions of records in Java. Trouble is, I only have 16 numeric digits to play with. In researching this, I have found algorithms for 32-bit hashes, which return Java integers. But this is too small, as it only has a range of +/- 2 billion, and I will have more records than that. I cannot go to a 64-bit hash, as that will give me numeric values back that are too large (+/- 9 quintillion, or 19 digits). Trouble is, I am dealing with a legacy system that is forcing me into a static key length of 16 digits.
Suggestions? I know no hash function will guarantee uniqueness, but I need a good one that will fit into these restrictions.
Thanks
If your generated hash is too large you can just mod it with your keyspace max to make it fit.
myhash = hash64bitvalue % 10^16
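In Java the line above needs two adjustments: `^` is bitwise XOR rather than exponentiation (so `10^16` evaluates to 26), and `%` can return a negative result for a negative hash. A minimal sketch, assuming Java 8 or later for `Math.floorMod` (the class name `SixteenDigitKey` is made up for the example):

```java
// Hypothetical helper: folds a 64-bit hash into the range 0 .. 10^16 - 1.
final class SixteenDigitKey {
    static final long KEY_SPACE = 10_000_000_000_000_000L; // 10^16

    static long fit(long hash64) {
        // floorMod always returns a non-negative result for a positive modulus.
        return Math.floorMod(hash64, KEY_SPACE);
    }
}
```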
If you are limited to 16 decimal digits, your key space contains 10^16 values.
Even if you find a hash that gives a uniform distribution on your data set, due to the birthday paradox you will have a 50% chance of a collision at ~10^8 items of data, which is an order of magnitude fewer than your billions of records.
This means that you cannot use any kind of hash alone and rely on uniqueness.
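The ~10^8 figure follows from the usual birthday approximation n ≈ sqrt(2 · N · ln 2) for a 50% collision chance in a key space of size N. A quick sketch to check the number (the class and method names are hypothetical):

```java
// Hypothetical helper: approximate item count at which a collision becomes 50% likely.
final class BirthdayBound {
    static double itemsForHalfCollisionChance(double keySpace) {
        // Standard approximation: n ≈ sqrt(2 * N * ln 2)
        return Math.sqrt(2.0 * keySpace * Math.log(2.0));
    }
}
```

For N = 10^16 this evaluates to roughly 1.18 × 10^8.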
A straightforward solution is to use a global counter instead. If a global counter is infeasible, counters with preallocated ranges can be used. For example, the 6 most significant digits denote a fixed data source index, and the 10 least significant digits contain a monotonic counter maintained by that data source.
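The 6 + 10 digit split from this example can be sketched as below. The class name `RangedCounterKey` is hypothetical, and how each data source persists its own counter is left out.

```java
// Hypothetical helper: builds a 16-digit key from a source index and a per-source counter.
final class RangedCounterKey {
    static final int SOURCE_LIMIT = 1_000_000;          // 10^6: six digits for the source
    static final long COUNTER_LIMIT = 10_000_000_000L;  // 10^10: ten digits for the counter

    static long key(int sourceIndex, long counter) {
        if (sourceIndex < 0 || sourceIndex >= SOURCE_LIMIT) {
            throw new IllegalArgumentException("source index out of range");
        }
        if (counter < 0 || counter >= COUNTER_LIMIT) {
            throw new IllegalArgumentException("counter out of range");
        }
        return sourceIndex * COUNTER_LIMIT + counter;    // always below 10^16
    }

    static String format(long key) {
        return String.format("%016d", key);              // zero-padded to 16 digits
    }
}
```

Keys from different sources can never collide, because the leading six digits differ.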
So your restriction is 53 bits?
As I understand it, the position of a bit in a hash code does not affect its value (the bits are independent of each other), so you could take a 64-bit hash function and use only its lowest 53 bits. You must use a bitwise operation for this, hash64 & ((1L << 53) - 1), not arithmetic.
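A minimal sketch of the masking idea, keeping the low 53 bits. In Java the mask needs a `long` literal and explicit parentheses, because `-` binds more tightly than `<<` and an `int` shift cannot reach bit 53. The class name `HashTruncate` is made up for the example.

```java
// Hypothetical helper: keeps only the lowest 53 bits of a 64-bit hash.
final class HashTruncate {
    static final long MASK_53 = (1L << 53) - 1; // 53 one-bits

    static long low53(long hash64) {
        return hash64 & MASK_53;                // result is in 0 .. 2^53 - 1
    }
}
```

Since 2^53 = 9007199254740992 is below 10^16, every result fits into 16 decimal digits.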
You don't have to store your hashes in a human-readable form (hex, as you said). Just store the 64-bit long value (generated by a 64-bit hash function) in your database, which takes only 8 bytes, not the 19 bytes that scared you off.
If that isn't a solution, improve the legacy system.
Edit: Wait!
64-bit: 2^64 = 18446744073709551616
16 hex digits: 16^16 = 18446744073709551616
Exact fit! So make a hex representation of your 64-bit hash, and there you are.
If you can save 16 alphanumeric characters, then you can use a hexadecimal representation and fit 16^16 distinct values (that is, 64 bits) into 16 chars. 16^16 is 2^64.
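Assuming the legacy key field does accept the characters a-f, a sketch of the round trip could look like this (the class name `HexKey` is hypothetical; `Long.parseUnsignedLong` requires Java 8 or later):

```java
// Hypothetical helper: converts a 64-bit hash to and from a fixed 16-character hex key.
final class HexKey {
    static String toKey(long hash64) {
        // %x prints the unsigned two's-complement form; 016 pads with zeros to 16 chars.
        return String.format("%016x", hash64);
    }

    static long fromKey(String key) {
        return Long.parseUnsignedLong(key, 16);
    }
}
```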
I am storing certain entities in my database with integer Ids of size 32 bits thus using the range of -2.14 billion to +2.14 billion.
I have tried giving some meaning to my Ids, due to which the Ids in the positive range have been used up rather quickly. I am planning to use the negative integer range of -2.14 billion to 0.
Wanted to know, if you could see any downsides of using negative integers as ids, though personally I don't see any downsides.
There is an old saying in database design that goes like this: "Intelligent keys are not". You should never design for special meaning in an id when a descriptive attribute is more appropriate.
Given that dumb keys are only compared for equality, sign or lack thereof has no impact.