Create unique numeric ID from multiple numeric IDs - java

What is a good algorithm to generate a unique 64-bit ID starting from multiple numeric 64-bit IDs? Example:
Input: [2, 9875, 0, 223568, ...] a list of random 64-bit IDs
Output: a unique 64-bit numeric ID, that have to be the same for the given input
I'm looking a way to avoid ID collision.
My apologies for the unclear question.

If speed does not matter, what about:
feeding all your ids in the md5-algorithm and than simply use
a) first 64 bits or
b) last 64 bits or
c) first 64 bits xor last 64 bits
If speed matters
What about:
Step 1: reorder the bytes of all 64 bit IDs (in a fixed but different order for each 64 bit ID of your Input.)(This might help a bit if the values are not really randomly distributed)
Step 2: xor all the rearranged 64 bit IDs to get the new 64 bit id.
If you have no extra information about the range of your 64 bit input IDs or the distribution of the values, there is no way to avoid collisions in a 'clever'/'best' way. Because whatever you come up with, you will always find a set of inputs which lead to collisions.

This old question will be useful for you if you want more information, but you can try this:
java.util.UUID.randomUUID().hashCode()
If the previous solution didn't work in your case, try this:
private static final AtomicLong COUNTER = new AtomicLong(System.currentTimeMillis()*100000);
Or this algorithm (GitHub link).

Related

BigInteger > 32 bit left shift issue

I'm looking at BigInteger as a big number (practically) and I'm trying to perform a left shift on the number. So, when I perform a 32 bit left shift on the number (I'm currently using 2), I get the same number again (which is expected for an integer).
Is there any way I can increase the number of bits used to store the number? I know I can use long; however, I want to cross the 64 bit limit. Is there any way I could do that?
It's difficult to say exactly what your problem is without seeing any actual code, but note that BigInteger instances are immutable. If you write aBigInt.shiftLeft(32) the instance referenced by aBigInt is not changed. Instead, a new BigInteger instance with the result of the operation is returned. Try: aBigInt = aBigInt.shiftLeft(32).

64 bits Unique ID that isn't sequence

I need to a unique and 64-bit ID Generator?
I know that UUID is 128 bit but If exist any version of id generator in java get it to me.
It's important that the ids are not sequence.
The least significant half of a UUID in Java is very unique, and that would give you a 64bit number.
UUID.randomUUID().getLeastSignificantBits()
Also see: Likelihood of collision using most significant bits of a UUID in Java
As pointed out in the comments, the number is not guaranteed unique, but duplications are unlikely
you could use a Blowfish encryption over a normal counter, or generate a UUID4 and xor the first and last 16 bytes.

Why does Random.nextLong not generate all possible long values in Java?

The Javadoc of the nextLong() method of the Random class states that
Because class Random uses a seed with only 48 bits, this algorithm will not return all possible long values. (Random javadoc)
The implementation is:
return ((long)next(32) << 32) + next(32);
The way I see it is as follows: to create any possible long, we should generate any possible bit pattern of 64 bits with equal likelihood. Assuming the calls to next(int) give us 32 random bits, then the concatenation of these bits will be a sequence of 64 random bits and hence we generate each 64 bit pattern with equal likelihood. And therefore all possible long values.
I suppose that the person who wrote the javadoc knows better and that my reasoning is flaw somehow. Can anyone explain where my reasoning is incorrect and what kind of longs will be returned then?
Since Random is pseudo-random we know that given the same seed it will return the same values. Taking the docs at their word there are 48 bits of seed. This means there are at most 2^48 unique values that can be printed out. If there were more that would mean that some value that we used before in position < 2^48 gives us a different value this time than it did last time.
If we try to join up two results what do we see?
|a|b|c|d|e|f|...|(2^48)-1|
Above are some values. How many pairs are there? a-b, b-c, c-d,... (2^48)-1-a. There are also 2^48 pairs. We can't fill all values of 2^64 with only the 2^48 pairs.
Pseudo-Random Number Generators are like giant rings of numbers. You start somewhere, and then move around the ring step by step, as you pull numbers out. This means that with a given seed - an initial internal state - all subsequent numbers are predetermined. Therefor, since the internal state is only 48 bits wide, only 2 to the power 48 random numbers are possible. So since the next number is given by the previous number, it is now clear why that implementation of nextLong will not generate all possible long values.
Let's say a perfect pseudo random K-bit generator is one that creates all possible 2^K seed values in 2^K trys. We can't do better, as there are only 2^K states, and every state is completly determined by the previous state and determines itself the next state.
Assume we write the output of the 48-bit generator down in binary. We get 2^48 * 48 bits that way.
And now we can say exactly how many 64-bit sequences we can get by going through the list and noting the next 64 bits (wrapping to the start when needed). It is exactly the number of bits we have: 13510798882111488.
Even if we assume that all those 64-bit sequences are pairwise different (which is not at all obvious), we have a long way to go until 2^64: 18446744073709551616.
I write the numbers again:
18446744073709551616 pairwise different 64 bit sequences we need
13510798882111488 64 bit sequences we can get with a 48 bit seed.
This proves that the javadoc writer was right. Only 1/1844th of all long values can be produced with the random generator

Packing many bounded integers into a large single integer

I have to store millions of entries in a database. Each entry is identified by a set of unique integer identifiers. For example a value may be identified by a set of 10 integer identifiers, each of which are less than 100 million.
In order to reduce the size of the database, I thought of the following encoding using a single 32 bit integer value.
Identifier 1: 0 - 100,000,000
Identifier 2: 100,000,001 - 200,000,000
.
.
.
Identifier 10: 900,000,001 - 1,000,000,000
I am using Java. I can write a simple method to encode/decode. The user code does not have to know that I am encoding/decoding during fetch/store.
What I want to know is: what is the most efficient (fastest) and recommended way to implement such encoding/decoding. A simple implementation will perform a large number of multiplications/subtractions.
Is it possible to use shifts (or bitwise operations) and choose different partition size (the size of each segment still has to be close to 100 million)?
I am open to any suggestions, ideas, or even a totally different scheme. I want to exploit the fact that the integer identifiers are bounded to drastically reduce the storage size without noticeably compromising performance.
Edit: I just wanted to add that I went through some of the answers posted on this forum. A common solution was to split the bits for each identifier. If I use 2 bits for each identifier for a total of 10 identifiers, then my range of identifiers gets severely limited.
It sounds like you want to pack multiple integer values of 0...100m into a single 32bit Integer? Unless you are omitting important information that would allow to store these 0...100m values more efficiently, there is simply no way to do it.
ceil(log2(100m)) = 27bit, which means you have only 5 "spare bits".
You can make the segmentation size 27 bits which gives you 32 * 128 M segements. instead of 42 * 100 M
int value =
int high = value >>> 27;
int low = value & ((1L << 27) -1);
It is worth nothing this calculation is likely to be trivial compared to the cost of using a database.
It's unclear what you actually want to do, but it sounds like you want an integer value, each bit representing having a particular attribute, and applying a bitmask.
A 32-bit integer can save 32 different attributes, 64-bit 64 etc. To have more, you'll need multiple integer columns.
If that's not it, I don't know what you mean by "encode".

Good Hash function? (32-bit too small, 64-bit too large)

I need to generate a hash value used for uniqueness of many billions of records in Java. Trouble is, I only have 16 numeric digits to play with. In researching this, I have found algorithms for 32-bit hash, which return Java integers. But this is too small, as it only has a range of +/ 2 billion, and have will have more records that that. I cannot go to a 64-bit hash, as that will give me numeric values back that are too large (+/ 4 quintillion, or 19 digits). Trouble is, I am dealing with a legacy system that is forcing me into a static key length of 16 digits.
Suggestions? I know no hash function will guarantee uniqueness, but I need a good one that will fit into these restrictions.
Thanks
If your generated hash is too large you can just mod it with your keyspace max to make it fit.
myhash = hash64bitvalue % 10^16
If you are limited to 16 decimal digits, your key space contains 10^16 values.
Even if you find a hash that gives uniform distribution on your data set, due to Birthday Paradox you will have a 50% chance of collision on ~10^8 items of data, which is an order of magnitude less than your billions of records.
This means that you cannot use any kind of hash alone and rely on uniqueness.
A straightforward solution is to use a global counter instead. If global counter is infeasible, counters with preallocated ranges can be used. For example, 6 most significant digits denote fixed data source index, 10 least significant digits contain monotonous counter maintained by that data source.
So your restriction is 53 bit?
For my understanding order number of bit in hashcode doesn't affect its value (order and value of bit are fully independent from each other). So you could get 64-bit hash function and use only last 53 bits from it. And you must use binary operations for this ( hash64 & (1<<54 - 1) ) not arithmetic.
You don't have to store your hashes in a human readable form (hex, as you said). Just store the 64-bit long datatype (generated by a 64-bit hash function) in your database, which is only 8 bytes. And not the 19 bytes of which you were scared off.
If that isn't a solution, improve the legacy system.
Edit: Wait!
64-bit: 264 =
18446744073709551616
16 hex-digits: 1616 =
18446744073709551616
Exact fit! So make a hex representation of your 64-bit hash, and there you are.
If you can save 16 alphanumeric characters then you can use a hexadecimal representation and pack 16^16 bits into 16 chars. 16^16 is 2^64.

Categories

Resources