My web application has a database table with an id column that is always unique for each row. In addition, I want another column called code holding a unique 6-character alphanumeric code made of the digits 0-9 and the letters A-Z. Letters and digits may repeat within a code, e.g. FFQ77J. I understand that the chance of generating a fresh unique code shrinks as more rows are added, but for now I am OK with this.
Requirement (update)
- The code should be at least of length 6
- Each code should be Alphanumeric
So I want to generate this Alphanumeric code.
Question
What is a good way to do this?
Should I generate the code and after the generation, run a query to the database and check if it already exists, and if so then generate a new one? To ensure the uniqueness, does this piece of code need to be synchronized so that only one thread runs it?
Is there something built-in to the database that will let me do this?
For the generation I will be using something like this which I saw in this answer
char[] symbols = new char[36];
char[] buf;
for (int idx = 0; idx < 10; ++idx)
symbols[idx] = (char) ('0' + idx);
for (int idx = 10; idx < 36; ++idx)
symbols[idx] = (char) ('A' + idx - 10);
public String nextString()
{
for (int idx = 0; idx < buf.length; ++idx)
buf[idx] = symbols[random.nextInt(symbols.length)];
return new String(buf);
}
Since it's a requirement for the shortcode to not be guessable, you don't want to tie it to your uniqueID row ID. Otherwise that means your rowID needs to be random, in addition to unique. Starting with a counter 0, and incrementing, makes it pretty obvious when your codes are: 000001, 000002, 000003, and so forth.
For your short code, generate a random 32bit int, omit the sign and convert to base36. Make a call to your database, to ensure it's available.
You haven't explicitly called out scalability, but I think it's important to understand the limitations of your design wrt to scale.
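With 2^31 possible values (each at most 6 base-36 characters), collisions become likely after only tens of thousands of rows: the chance of at least one collision reaches about 50% at roughly 55k rows (see Birthday Paradox questions)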
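To see where such birthday-paradox estimates come from, here is a minimal sketch using the standard approximation p ≈ 1 - exp(-n(n-1)/(2N)). The class and method names are hypothetical, and the numbers it prints are approximations, not exact probabilities.

```java
final class BirthdayEstimate {
    /** Approximate probability of at least one collision among n random draws from a space of the given size. */
    static double collisionProbability(long n, double spaceSize) {
        // Standard approximation: p ≈ 1 - exp(-n(n-1) / (2N))
        return 1.0 - Math.exp(-(double) n * (n - 1) / (2.0 * spaceSize));
    }

    /** Approximate number of draws at which the collision probability reaches 50%. */
    static long drawsForHalfChance(double spaceSize) {
        return Math.round(Math.sqrt(2.0 * spaceSize * Math.log(2)));
    }

    public static void main(String[] args) {
        double space31 = Math.pow(2, 31);  // non-negative 32-bit ints
        double space36 = Math.pow(36, 6);  // all 6-character base-36 codes
        System.out.println(drawsForHalfChance(space31));
        System.out.println(drawsForHalfChance(space36));
        System.out.println(collisionProbability(65_000, space31));
    }
}
```

Both spaces are of similar size, so the full 6-character base-36 space does not buy much over a 31-bit integer.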
From your comment, modify your code:
public String nextString()
{
    // Mask off the sign bit so the result never starts with '-'
    return Integer.toString(random.nextInt() & Integer.MAX_VALUE, 36);
}
I would simply do this:
String s = Integer.toString(i, 36).toUpperCase();
Choosing base-36 will use characters 0-9a-z for the digits. To get a string that uses uppercase letters (as per your question) you would need to fold the result to upper case.
If you use an auto increment column for your id, set the next value to at least 60,466,176, which when rendered to base 36 is 100000 - always giving you a 6 digit number.
I would start with 0 for an empty table and do a
SELECT MAX(ID) FROM table
to find the largest id so far. Store it in an AtomicInteger and convert it using toString
AtomicInteger counter = new AtomicInteger(maxSoFar);
String nextId = Integer.toString(counter.incrementAndGet(), 36);
or, to pad to 6 characters, add 36^6 = 2176782336 and drop the leading digit:
String nextId = Long.toString(2176782336L + counter.incrementAndGet(), 36).substring(1);
This will give you uniqueness and no duplicates to worry about. (it's not random either)
Simply, you can use Integer.toString(int i, int radix). Since you have base 36(26 letters+10 digits) you set the radix to 36 and i to your integer. For example, to use 16501, do:
String identifier=Integer.toString(16501, 36);
You can uppercase it with .toUpperCase()
Now on to your other questions: yes, you should query the database first to ensure the code doesn't exist. Depending on the database, this may need to be synchronized, or it may not, as the database will use its own locking system. In any case, you'd need to tell us which database.
On the question of whether there's a builtin, we'd need to know the DB type as well.
To create a random but unique value within a small range here are some ideas I know of:
1. Create a new random value and try to insert it.
Let a database constraint catch violations. This column should also likely be indexed. The DML may need to be tried several times until a unique ID is found. This will lead to more collisions as time progresses, as noted (see the birthday problem).
2. Create a "free IDs" table ahead of time and on usage mark the ID as being used (or delete it from the "free IDs" table). This is similar to #1 but shifts when the work is done.
This allows the work of finding "free IDs" to be done at another time, perhaps during a cron job, so that there will not be a constraint violation during the insert, keeping the insert itself the "same speed" throughout the usage of said domain. Make sure to use transactions.
3. Create a 1-to-1/injective "mixer" function such that the output "appears random". The point is this function must be 1-to-1 to inherently avoid duplicates.
This output number would then be "base 36 encoded" (which is also injective); but it would be guaranteed unique as long as the input (say, an auto-increment PK) was unique. This would likely be less random than the other approaches, but should still create a nice-looking non-linear output.
A custom injective function can be created around an 8-bit lookup table fairly trivially - just process a byte at a time and shuffle the map appropriately. I really like this idea, but it can still lead to somewhat predictable output
To find free IDs, approaches #1 and #2 above can use "probing with IN" to minimize the number of SQL statements used. That is, generate a bunch of random values and query for them using IN (keeping in mind what sizes of IN your database likes) and then see which values were free (as having no results).
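The "try to insert and let a constraint catch violations" idea can be sketched as below. This is only an illustration: an in-memory concurrent set stands in for a table with a UNIQUE constraint on the code column, and the class name RetryingCodeAllocator and its methods are hypothetical. The point is that the atomic insert decides who wins, so no application-level synchronization is needed.

```java
import java.security.SecureRandom;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class RetryingCodeAllocator {
    private static final char[] SYMBOLS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ".toCharArray();
    private final SecureRandom random = new SecureRandom();
    // Assumption: stands in for a database table with a UNIQUE constraint on the code column.
    private final Set<String> usedCodes = ConcurrentHashMap.newKeySet();

    String randomCode(int length) {
        char[] buf = new char[length];
        for (int i = 0; i < length; i++) {
            buf[i] = SYMBOLS[random.nextInt(SYMBOLS.length)];
        }
        return new String(buf);
    }

    /** Tries up to maxAttempts random codes; add() is atomic, like an INSERT rejected by a unique index. */
    String allocate(int length, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String code = randomCode(length);
            if (usedCodes.add(code)) {
                return code; // the "insert" succeeded, so the code is ours
            }
        }
        throw new IllegalStateException("no free code found after " + maxAttempts + " attempts");
    }
}
```

With a real database, the `usedCodes.add(code)` call would become an INSERT, and a constraint-violation error would take the place of the `false` return value.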
To create a unique ID not constrained to such a small space, a GUID or even hashing (e.g. SHA1) might be useful. However, these only guarantee uniqueness because they have 128/160-bit spaces so that the chance of collision (for different input/time-space) is currently accepted as improbable.
I actually really like the idea of using an injective function. Bearing in mind that it is not good "random" output, consider this pseudo-code:
byte_map = [0..255]
map[0] = shuffle(byte_map, seed[0])
..
map[n] = shuffle(byte_map, seed[n])
output[0] = map[0][input[0]]
..
output[n] = map[n][input[n]]
output_str = base36_encode(output[0] .. output[n])
While a very simple setup, numbers like 0x200012 and 0x200054 will still share common output - e.g. 0x1942fe and 0x1942a9 - although the lines will be changed a bit due to the later application of the base-36 encoding. This could probably be further improved to "make it look more random".
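The pseudo-code above might look like this in Java. It is a sketch under assumptions: inputs are limited to 24 bits (three bytes, as in the 0x200012 example), the tables are shuffled with java.util.Random seeded from caller-supplied seeds, and the class name ByteMapMixer is hypothetical. As noted, inputs that share bytes share output bytes, so this is obfuscation, not security.

```java
import java.util.Random;

final class ByteMapMixer {
    private static final int BYTES = 3; // 24-bit inputs (assumption for this sketch)
    private final int[][] maps = new int[BYTES][];

    ByteMapMixer(long[] seeds) {
        if (seeds.length != BYTES) throw new IllegalArgumentException("need " + BYTES + " seeds");
        for (int i = 0; i < BYTES; i++) {
            maps[i] = shuffledBytes(seeds[i]);
        }
    }

    /** Fisher-Yates shuffle of 0..255; any permutation of the byte values is a bijection. */
    private static int[] shuffledBytes(long seed) {
        int[] table = new int[256];
        for (int v = 0; v < 256; v++) table[v] = v;
        Random rnd = new Random(seed);
        for (int i = 255; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int tmp = table[i];
            table[i] = table[j];
            table[j] = tmp;
        }
        return table;
    }

    /** Maps each input byte through its own table; injective because every table is a permutation. */
    int mix(int input) {
        if (input < 0 || input >= (1 << 24)) throw new IllegalArgumentException("input out of 24-bit range");
        int out = 0;
        for (int i = 0; i < BYTES; i++) {
            int b = (input >>> (8 * i)) & 0xFF;
            out |= maps[i][b] << (8 * i);
        }
        return out;
    }

    /** Base-36 text of the mixed value, left-padded to 6 characters. */
    String code(int input) {
        // Adding 36^6 and dropping the leading '1' pads the text with zeros.
        return Long.toString(2176782336L + mix(input), 36).substring(1).toUpperCase();
    }
}
```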
For efficient usage, try caching generated code in a HashSet<String> in your application:
HashSet<String> codes = new HashSet<String>();
This way you don't have to make a db call every time to check whether the generated code is unique or not. All you have to do is:
codes.contains(newCode);
And, yes, you should synchronize your method which updates the cache
public synchronized String getCode()
{
    String newCode;
    do {
        newCode = nextString();
    }
    while (codes.contains(newCode));
    codes.add(newCode);
    return newCode;
}
You mentioned in your comments that the relationship between id and code should not be easily guessable. For this you basically need encryption; there are plenty of encryption programs and modules out there that will perform encryption for you, given a secret key that you initially generate. To employ this approach, I would recommend converting your id into ascii (i.e., representing as base-256, and then interpreting each base-256 digit as a character) and then running the encryption, and then converting the encrypted ascii (base-256) into base 36 so you get your alpha-numeric, and then using 6 randomly chosen locations in the base 36 representation to get your code. You can resolve collisions e.g. by just choosing the nearest unused 6-digit alpha-numeric code when a collision occurs, and noting the re-assigned alpha-numeric code for the id in a (code <-> id) table that you will have to maintain anyway since you cannot decrypt directly if you only store 6 base-36 digits of the encrypted id.
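A related but different technique, swapped in here for illustration, is to use a keyed permutation of the id space instead of encrypting and then truncating. A small Feistel network over 30 bits maps every id below 2^30 to a distinct 30-bit value, which always fits in 6 base-36 characters, so no collision table is needed and the code can be mapped back to the id. The class name FeistelCodes, the round function and the round keys are all hypothetical, and the round function is a toy mixer, not real cryptography.

```java
final class FeistelCodes {
    private static final int HALF_BITS = 15;
    private static final int HALF_MASK = (1 << HALF_BITS) - 1;
    private final int[] roundKeys;

    FeistelCodes(int[] roundKeys) {
        this.roundKeys = roundKeys.clone();
    }

    /** Toy round function (assumption, not cryptographically strong); Feistel is invertible for any F. */
    private static int round(int half, int key) {
        int x = (half ^ key) * 0x9E3779B1;
        x ^= x >>> 15;
        return x & HALF_MASK;
    }

    /** Permutes ids in [0, 2^30): each round maps (L, R) to (R, L xor F(R)). */
    int permute(int id) {
        if (id < 0 || id >= (1 << 30)) throw new IllegalArgumentException("id out of 30-bit range");
        int left = id >>> HALF_BITS;
        int right = id & HALF_MASK;
        for (int key : roundKeys) {
            int next = left ^ round(right, key);
            left = right;
            right = next;
        }
        return (left << HALF_BITS) | right;
    }

    /** Undoes permute() by running the rounds backwards. */
    int invert(int value) {
        int left = value >>> HALF_BITS;
        int right = value & HALF_MASK;
        for (int i = roundKeys.length - 1; i >= 0; i--) {
            int prev = right ^ round(left, roundKeys[i]);
            right = left;
            left = prev;
        }
        return (left << HALF_BITS) | right;
    }

    String code(int id) {
        // Adding 36^6 and dropping the leading '1' left-pads the base-36 text to 6 characters.
        return Long.toString(2176782336L + permute(id), 36).substring(1).toUpperCase();
    }

    int decode(String code) {
        return invert((int) Long.parseLong(code, 36));
    }
}
```

For codes that must resist a determined attacker, the toy round function would need to be replaced by a proper keyed primitive.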
Related
I need to generate the UID (alphanumeric) for my use case but that should be a maximum of 7 characters long as we want UID to be random but manageable, like a PNR (CYB6KL) for example.
Now if I am not wrong, I can generate a random UID that is small, but uniqueness might be compromised because of collisions (birthday paradox), so for 32 bits, 50% collision probability would be around 77k UID generations.
So in essence, I need a way to generate UIDs that are:
Small (max 7 character)
Random
Unique
Don't require lookups for previous existence.
I will be storing this UID in a database column and it's imperative that the UID is unique. It will NOT be the table's primary key which right now is an autogenerated ID.
I am thinking of something along the lines, but I am not sure about uniqueness.
BigInteger big = new BigInteger(32, new SecureRandom());
return big.toString(32).toUpperCase();
Really appreciate any thoughts that might help on this. Generation must be unique.
Thanks in advance.
You can use a library like hashids for this purpose, which implements a bijective translation that can encode a numeric value into a string code with a custom alphabet. This should do exactly what you want. If you need this to be traversal-secure, you should use some kind of SecureRandom as the source for the underlying numeric value. If not, you could even base this on the auto increment value you already have. The benefit of reusing the primary key is that you can just translate the string code and do a lookup by primary key.
I'm looking for a solution in pseudo code or Java or JS for the following problem:
We need to implement an efficient bit structure to hold data for N bits (you could think of the bits as booleans as well, on/off).
We need to support the following methods:
init(n)
get(index)
set(index, True/False)
setAll(True/false)
Now I got to a solution with O(1) in all except for init, which is O(n). The idea was to create an array where each index saves the value for a bit. In order to support setAll, I would also save a timestamp with the bit value to know whether to take the value from the array or from the last setAll value. The O(n) in init is because we need to go through the array to nullify it, otherwise it will have garbage which can be ANYTHING. Now I was asked to find a solution where init is also O(1) (we can create an array, but we can't clear the garbage; the garbage might even look like valid data, which would make the solution wrong; we need a solution that works 100%).
Update:
This is an algorithmic question and not a language-specific one. I encountered it as an interview question. Also, using an integer to represent the bit array is not good enough because of memory limits. I was tipped that it has something to do with some kind of smart handling of garbage data in the array without cleaning it in init, using some kind of mechanism to not fail because of the garbage data in the array (but I'm not sure how).
Make lazy data structure based on hashmap (while hashmap sometimes might have worse access time than o(1)) with 32-bit values (8,16,64 ints are suitable too) for storage and auxiliary field InitFlag
To clear all, make empty map with InitFlag = 0 (deleting old map is GC's work in Java, isn't it?)
To set all, make empty map with InitFlag = 1
When changing some bit, check whether the corresponding int key bitnum/32 exists. If yes, just change bit bitnum%32; if not and the bit value differs from InitFlag, create the key with a value based on InitFlag (all zeros or all ones) and change the needed bit.
When retrieving some bit, check whether corresponding key exists. If yes, extract bit, if not - get InitFlag value
SetAll(0): ifl = 0, map - {}
SetBit(35): ifl = 0, map - {1 : 0x08}
SetBit(32): ifl = 0, map - {1 : 0x09}
ClearBit(32): ifl = 0, map - {1 : 0x08}
ClearBit(1): do nothing, ifl = 0, map - {1 : 0x10}
GetBit(1): key=0 doesn't exist, return ifl=0
GetBit(35): key=1 exists, return (map[1]>>3)&1 = 1
SetAll(1): ifl = 1, map = {}
SetBit(35): do nothing
ClearBit(35): ifl = 1, map - {1 : 0xFFFFFFF7 = 0b...11110111}
and so on
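The lazy structure traced above could be written roughly as follows. This is a sketch, with a hypothetical class name LazyBitSet, and it inherits the caveat already mentioned: HashMap operations are O(1) on average, not in the worst case.

```java
import java.util.HashMap;
import java.util.Map;

final class LazyBitSet {
    // Only 32-bit words that differ from the default are stored, keyed by index / 32.
    private Map<Integer, Integer> words = new HashMap<>();
    private boolean initFlag; // value of every bit that has no stored word

    void setAll(boolean value) {
        initFlag = value;
        words = new HashMap<>(); // the old map is left to the garbage collector
    }

    boolean get(int index) {
        Integer word = words.get(index / 32);
        if (word == null) {
            return initFlag;
        }
        return ((word >>> (index % 32)) & 1) != 0;
    }

    void set(int index, boolean value) {
        int key = index / 32;
        int mask = 1 << (index % 32);
        Integer word = words.get(key);
        if (word == null) {
            if (value == initFlag) {
                return; // already the default, nothing to record
            }
            word = initFlag ? -1 : 0; // start from all ones or all zeros
        }
        words.put(key, value ? (word | mask) : (word & ~mask));
    }
}
```

Replaying the trace: after setAll(false), set(35, true) and set(32, true), get(35) and get(32) are true and get(1) is false; after setAll(true) and set(35, false), get(35) is false while its neighbours stay true.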
If this is a college/high-school computer science test or homework assignment question - I suspect they are trying to get you to use BOOLEAN BIT-WISE LOGIC - specifically, saving the bit inside of an int or a long. I suspect (but I'm not a mind-reader - and I could be wrong!) that using "Arrays" is exactly what your teacher would want you to avoid.
For instance - this quote is copied from Google's Search Results:
long: The long data type is a 64-bit two's complement integer. The
signed long has a minimum value of -2^63 and a maximum value of 2^63-1.
In Java SE 8 and later, you can use the long data type to represent an
unsigned 64-bit long, which has a minimum value of 0 and a maximum
value of 2^64-1
What that means is that a single long variable in Java could store 64 of your bit-wise values:
long storage = 0L;
// To read a bit, mask it with bitwise AND ('&') and compare the result with zero.
boolean result1 = (storage & 0b00000001L) != 0; // Gets the first bit in 'storage'
boolean result2 = (storage & 0b00000010L) != 0; // Gets the second
boolean result3 = (storage & 0b00000100L) != 0; // Gets the third
...
boolean result8 = (storage & 0b10000000L) != 0; // Gets the eighth result.
I could write the entire thing for you, but I'm not 100% sure of your actual specifications - if you use a long, you can only store 64 separate binary values. If you want an arbitrary number of values, you would have to use as many 'long' as you need.
Here is an SO post about binary / boolean values:
Binary representation in Java
Here is a SO post about bit-shifting:
Java - Circular shift using bitwise operations
Again, it would be a job, and I'm not going to write the entire project. However, the get(int index) and set(int index, boolean val) methods would involve bit-wise shifting of the number 1.
long pos = 1L;
pos = pos << 5; // A mask that selects the bit at index 5 (the sixth bit, counting from zero).
boolean bit = (storage & pos) != 0; // This retrieves the value stored at index 5.
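Put together, the get and set methods mentioned above could be sketched like this, using as many long values as needed so that more than 64 bits fit. The class name LongBitArray is hypothetical. Note that Java zero-fills a new array, so this only illustrates the bit operations; it does not by itself give the O(1) init the question asks for.

```java
final class LongBitArray {
    private final long[] words;

    LongBitArray(int nBits) {
        words = new long[(nBits + 63) / 64]; // one long per 64 bits, rounded up
    }

    boolean get(int index) {
        long mask = 1L << (index & 63);          // bit position inside the word
        return (words[index >>> 6] & mask) != 0; // index >>> 6 selects the word
    }

    void set(int index, boolean value) {
        long mask = 1L << (index & 63);
        if (value) {
            words[index >>> 6] |= mask;  // turn the bit on
        } else {
            words[index >>> 6] &= ~mask; // turn the bit off
        }
    }
}
```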
I was under the impression that the UUID spec required a guaranteed, true, globally unique result, not unique 99.99999999999% of the time, but truly 100% of the time. From the spec:
A UUID is 128 bits long, and can guarantee
uniqueness across space and time.
It looks like Java only supports V3 and V4 of the UUID spec. V4 isn't truly unique. With the V3 implementation using nameUUIDFromBytes, the following results in duplicates, because the computer is too fast (edit: looping to 10 and calling new Date().getTime() will produce duplicates because the computer loops faster than new Date().getTime() can produce a different value on each iteration):
String seed;
for (int i = 0; i < 10; i++) {
seed = "<hostname>" + new Date().getTime();
System.out.println(java.util.UUID.nameUUIDFromBytes(seed.getBytes()));
}
Am I mistaken in assuming that a UUID is 100% unique? Is it only practically unique rather than perfectly so? Is there any way to do this in Java?
There are different methods of UUID generation. The kind you're using is behaving exactly as it should. You're using nameUUIDFromBytes, a "Static factory to retrieve a type 3 (name based) UUID based on the specified byte array."
This generates the same UUID if given the same name. As you've discovered, your loop is passing-in the same name every time, so you get the same UUID.
Have a look at Gabe's advice here: Which UUID version to use?
He recommends you use V4, which as others have pointed out is good enough for any realistic use case.
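A small sketch of the difference, using only java.util.UUID: the same name always yields the same type-3 UUID, while UUID.randomUUID() produces a fresh type-4 value on each call. The class name UuidKinds, the helper nameBased and the sample names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

final class UuidKinds {
    /** Type-3 (name-based) UUID: a deterministic function of the input bytes. */
    static UUID nameBased(String name) {
        return UUID.nameUUIDFromBytes(name.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Same name twice, as happens when the clock has not advanced between iterations.
        UUID first = nameBased("<hostname>1700000000000");
        UUID second = nameBased("<hostname>1700000000000");
        System.out.println(first.equals(second));      // identical input, identical UUID
        System.out.println(first.version());           // name-based UUIDs report version 3

        UUID random = UUID.randomUUID();
        System.out.println(random.version());          // random UUIDs report version 4
        System.out.println(random.equals(UUID.randomUUID()));
    }
}
```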
Because your entropy is limited to your memory, you can never ensure a UUID is "guaranteed, true, globally unique result". However, 99.99999999999% is already pretty good.
If you want to ensure unique values in your database, you could use a simple integer that's incremented to be sure it's unique. If you want to use UUIDs and be really sure they're unique, you just have to check that upon creation. If there's a duplicate, just create another one until it's unique.
Duplicates can happen, but IIRC, part of them is created dependent on your current time, so if you're just creating one every 5 minutes, you should be safe.
As others have pointed out, the type-4 UUID returned by UUID.randomUUID() is likely to be unique enough for any practical application. Cases where it's not are likely to be pathological: for example, rolling back a VM to a live snapshot, without restarting the Java process, so that the random-number generator goes back to an exact prior state.
By contrast, a type-3 or type-5 UUID is only as unique as what you put into it.
A type-1 UUID (time-based) should be very slightly "more" unique, under certain constraints. The Java platform does not include support for generating a type-1 UUID, but I've written code (possibly not published) to call a UUID generating library via JNI. It was 18 lines of C and 11 lines of Java.
If I generate a UUID from a "seed" string as follows, is there any way for someone to re-generate the original string?
UUID uuid = null;
try {
uuid = UUID.nameUUIDFromBytes(("seedString").getBytes("utf8"));
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
System.out.println("UUID: " + uuid.toString());
I would assume it isn't possible, as I believe this person found here: Convert UUID to bytes
However, I see that the same UUID is generated every time from a certain String/bytes, and since it has to be unique, simple "seed" values could just be guessed? For example, UUID of f is 8fa14cdd-754f-31cc-a554-c9e71929cce7 so if I see that I know it was generated from "f".
Since you are getting the UUID by casting bytes to a UUID, and you are always using the same starting bytes to cast from, the uuid would always be the same UUID across multiple runs.
I think you've confused a random seed with the "from bytes" method in the UUID routines. It is more like a cast than a seed initialization. And even if it was like a seed initialization, initializing with a constant seed would only mean that you always walk the "same" pseudo-random path (meaning that after walking it once, you can know the next step(s)).
aug also makes an excellent point, which I'll elaborate a bit on here. A UUID is an identifier, which is assumed to be unique only by virtue of there being so many to choose from; however, if you create a routine that returns the same one(s) repeatedly, it's not going to be unique due to your selection mechanism. The actual mechanism doesn't assure uniqueness; even less so when using a routine guaranteed to return identical values.
As they are not guaranteed to be unique (UUIDs have a fixed number of bits and eventually all combinations can be exhausted), one can imagine that there are more inputs than UUIDs (although there's a lot of UUIDs) so UUID collision is inevitable (even if it would theoretically take more time than the heat death of the universe). From a practical side of things, you probably have little to worry about; but, it could still (minuscule chance) happen.
This also means that one can (in theory) guarantee that some two inputs out there can wind up with the same UUID, and as a result, UUIDs are not generally reversible (however, in specific (limited) cases, perhaps they could be made reversible).
There are an infinite number of strings that may generate a given UUID, so even if somebody guesses the string you used to create a given UUID, they may never be sure.
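Although the mapping cannot be inverted in general, short or predictable names can still be found by simply trying candidates, which is the concern raised in the question. A minimal sketch of such a dictionary check follows; the class name NameUuidGuesser and the candidate list are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;
import java.util.UUID;

final class NameUuidGuesser {
    /** Returns the first candidate whose name-based UUID equals the target, if any. */
    static Optional<String> guess(UUID target, List<String> candidates) {
        for (String candidate : candidates) {
            UUID uuid = UUID.nameUUIDFromBytes(candidate.getBytes(StandardCharsets.UTF_8));
            if (uuid.equals(target)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

Given the UUID quoted in the question and the single letters a to z as candidates, the search finds "f". A long, unpredictable name would not be found this way, because the attacker would have to include it in the candidate list.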
I need to assign a random but unique ID to each row in a MySQL table. The ID should be the same if the row contains the same values.
i.e., if the 1st row contains [hi,hello,bye], the 2nd row contains [gg,hello,bye] and the 3rd row contains [hi,hello,bye], then the 1st and 3rd rows should generate the same ID and the 2nd row should generate a different ID.
Thanks in advance.
An MD5 hash could work. Below is chopped-up, quick-and-dirty code that would need to be updated, but it proves the concept.
System.out.println("row1=" + test1 + ":" + tst1.getHash(test1));
System.out.println("row2=" + test2 + ":" + tst1.getHash(test2));
System.out.println("row3=" + test3 + ":" + tst1.getHash(test3));
private String getHash(String inputStr){
try{
MessageDigest md = MessageDigest.getInstance("MD5");
md.update(inputStr.getBytes());
byte byteData[] = md.digest();
StringBuffer sb = new StringBuffer();
for (int i = 0; i < byteData.length; i++) {
sb.append(Integer.toString((byteData[i] & 0xff) + 0x100, 16).substring(1));
}
return sb.toString();
}
catch(Exception e)
{
e.printStackTrace();
return null;
}
}
row1=hi,hello,bye:cfe40e96aa052a484208c2aefb6f39bb
row2=gg,hello,bye:f652785f0e214507e6aea44ecd3ffb7a
row3=hi,hello,bye:cfe40e96aa052a484208c2aefb6f39bb
SELECT CRC32(CONCAT(column1, column2, column3)) FROM MyTable.
Technically CRC32 is not random (but what is?) -- and it has a small chance of generating collisions (different values mapping to the same integer). But it's a start.
If you really want proof that you don't get collisions, everything boils down to concatenating all fields with a separator not contained in the fields. Of course, this will normally be really long and cumbersome to work with.
What everybody normally does is feed that string into a hash function. While theoretically not unique, a suitable hash function with a large enough result is unlikely to produce a collision during the lifetime of the human race. For example, git uses such a hash (sha1), and Linus Torvalds writes about the chance of an accidental collision:
First off, let me remind people that the inadvertent kind
of collision is really really really damn unlikely, so we'll quite
likely never ever see it in the full history of the universe.
A different thing is a not so accidental collision. At the very first you should make sure that the string you start with isn't the same for different columns. This means:
Make sure all columns are contained
Make sure columns are separated by something not contained in the columns themselves. Use escaping if necessary. For example, if you just concatenate two columns, the values 'abc' + 'def' will give you the same result as 'a' + 'bcdef'
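One way to satisfy both points without choosing a separator is to length-prefix each column before hashing, which is a swapped-in alternative to escaping. The sketch below assumes non-null string columns and uses SHA-256; the class name RowHasher and its methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class RowHasher {
    /** Length-prefixes each column so that column boundaries cannot be confused. */
    static String canonical(String... columns) {
        StringBuilder sb = new StringBuilder();
        for (String column : columns) {
            sb.append(column.length()).append(':').append(column);
        }
        return sb.toString();
    }

    /** Hex-encoded SHA-256 of the canonical form; equal rows always give equal IDs. */
    static String hash(String... columns) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(canonical(columns).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```

Here hash("abc", "def") and hash("a", "bcdef") differ, because the canonical forms are "3:abc3:def" and "1:a5:bcdef".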
If you have to worry about targeted attacks, i.e. someone actually trying to create entries with the same hash, your best bet is to use a cryptographic hash, possibly one used for password hashing which are often designed to be slow, in order to prevent brute force attacks. Of course this might collide with the requirement for most applications to be as fast as possible.
What you need is a hash function of all the values that you care about in a row. It can't be random because, by definition, it must be deterministic -- given the same values, you always get the same ID. If, by "random" you mean "not sequential" most hash functions should satisfy this need.
Theoretically, you cannot guarantee uniqueness as there is always the probability of collisions. That is, different IDs definitely mean that the row values are different but the converse is not always true. Depending on your needs, you might want to implement explicit matching on actual row values whenever matching IDs are encountered. You might also consider using a cryptographic hash function such as SHA-256 and rely on the probabilities being on your side. MD5 and SHA1 are still fine against accidental collisions, but deliberately constructed collisions are known for both, so avoid them if the row values may come from an adversary.