Is unique id generation using UUID really unique?

Is unique id generation using UUID really unique? - java

I want generate unique ID just like auto increment in java . So previously i used current nano seconds but i end up with clash since two data comes with in same nano seconds ..
Does UUID solves the above problem ?
Note :: In my project i can even get 10000 rows of records for each and every minute and I will dump those records along with UIDS in to table .And there may be a situation where i would stop my product and restart it after some time ....So during that situation how could UUID class clarifies the previously generated Uids(which i stored in DB) with the new one going to created(Yet to be dumped in DB) ?

While the UUIDs are not guaranteed to be unique, the probability of a duplicate is extremely low. See Random UUID probability of duplicates.
For your application, it makes sense to use the UUID, but you may want to deal with the extremely rare condition, just in case.

I seriously doubt you get two records in the same nano-second as making the call System.nanoTime() takes over 100 ns. It is more likely your clock doesn't have nano second accuracy.
However, if you restart your server, you can get repeating nanoTime().
One way around this is to use
AtomicLong counter = new AtomicLong(System.currentTimeMillis()*1000);
long id = counter.incrementAndGet();
// something like ctz9yamgu8
String id = Long.toString(counter.incrementAndGet(), 36);
This will start a counter when the application restarts and they will not be overlap between restarts unless you sustain over one million ids per second. (Over the life of the instance)
Note: this only works for on a per instance basis. Multiple servers need to use a different approach.

There seems to be some confusion on this page about the nature of UUID.
Study the Wikipedia page. You will see there are different versions of UUID.
You asked:
Does UUID solves the above problem ?
Yes, UUID values do solve your problem.
A point in space and time
The original Version 1 represents a point in space and time, never to be repeated.
Version 1 does this by using the MAC address of the machine on which it is generated (a point in space). To this it combines the current moment. Add in an arbitrary number that increments when a change in the computer clock is noticed. The clock is not as much of an issue now that computers have built-in batteries and network connections to time servers. By combining these, there is no practical chance of collisions.
Because of concerns over the security and privacy issues involved in tracking and divulging the MAC address and moment, some people may not want to use this version. For example, Java omits generating Version 1 from its UUID class.
FYI, the more powerful database servers such as Postgres can generate UUID values including Version 1. You may choose to generate your UUIDs on the database server rather than in your app.
Random
One commonly used version is Version 4, in which 122 of the 128 bits are generated randomly. If a cryptographically-strong random generator is used, this is quite effective. This version has a much higher chance of collisions than in Version 1. But for most practical scenarios, the random-based UUID is entirely reliable.

Related

How to re-generate deleted sequence numbers in hibernate?

As we know the below hibernate annotation generates a new number each time from the sequence starting from 1. Consider a situation wherein i have a set of records with ids(1-5).Now a record is deleted from the table which had id as 3. If we see number 3 is missing from the sequence 1-5 now because of the operation. I have a requirement for the sequence to re-generate and reassign that number 3 when i will be adding new record in the table. How to do this ?
#Id
#GeneratedValue(strategy = GenerationType.IDENTITY)
private int id;

I don't think this is a great idea. A sequence is just a number incremented of 1 each time. This allows it to be fast but already this is a bottleneck for a distributed database for writes as all the nodes need to synchronize on that number.
If you try to get the first available integer, you need basically to do a full table scan, order the records by id and check the first missing one. That's extremely costly and inefficient for something that shall be as cheap as possible.
You should view the id as a technical ID without functional meaning and thus do not care if there are holes in the sequence or not.
Edit:
I also would add the implications go deeper, even in term of business.
If I get an ID for a article I sell as a merchant and I model its deletion as removing the record or even better put a status "deleted" on it potentially with a date and reason for deletion, I have much easier bookkeeping. Actually, I would prefer the last design: keep the record and have a status that is dynamic and potentially with history. The item could be unavailable for 1 year and be used again if I sell it again.
If on the contrary I silently reuse the ID, then, my system may display an old bill with the data of the new article. Instead of ski boots that I don't sell anymore, it may become a PS5 or 1kg of rice. This is error prone.
This may not apply to all business cases, of course, but its better to consider this kind of usage before going with a design that delete data.

I Agree with Nicolas, but Just to clarify.
You are using an "Identity" and not a "Sequence" there are some differences between them, and how are declared and used (Each database could have their propietary implementation).
A Sequence is an independent object in your database with some properties (like start, end,increment,...) and an identity is a "property" of the column that depends on how the database handles it.
In the case of sequence (and depending on the database in some identities) you could create "cyclic" sequences to repeat the numbers after the cycle ends. But never a sequence or identity scans for "gaps" in the ids. (As Nicolas said is really bad for perfomance)
But depending on how your code will work you could create a cycle in a sequence to prevent having an always increasing value. But Only you are sure that there will not be conflicts when inserting new records.

Sharding counters with 180 properties

Does it sound bad to have 180 unindexed properties(columns) with Integer/Long type per entity in datastore?
I need to count 6 requests per user saving by day for analytics reasons and I'm doing everything based on the sharding counters article and webcast:
https://cloud.google.com/appengine/articles/sharding_counters
So basically it's 6 values per day incrementing every new request, so I'm thinking in having:
1 Kind per Month
6 types of analytics * month days = 180
How much is too much in Google Datastore properties?
Thank you

Probably not a good idea.
Keep in mind that every time you want to update a single property value the entire entity will have to be re-written (i.e. retrieved from the datastore, deserialized, updated, re-serialized and re-sent to the datastore). The bigger the entity, the slower the performance.
IMHO it's better to have multiple smaller entities than a big one in such case. It is possible to split a single big entity into multiple smaller ones, efficiently related to each-other - see re-using an entity's ID for other entities of different kinds - sane idea?
Along the same line I believe it's even possible to find a way to encode the day info and the user ID into unique custom key IDs, for easy access. Something like <userid>_YYMMDD or just <userid>_DD

Are IdGeneratorStrategy.Identity values reused after a jdo has been deleted

I'm using Google App Engine.
If a Long key field is generated by IdGeneratorStrategy.Identity and then the object is deleted from the datastore, is there any chance of the key being used again by a different object of the same class?
papercrane on reddit writes:
The documentation for
GenerationType.IDENTITY says that it
means the persistence provider (the
database) will provide the unique ID.
So it is entirely up to your database
software if it decides to reuse IDs
from deleted records. Without knowing
anything else about your problem I'd
say it is possible, but I can't think
of any good reason for a database
server to keep track of which IDs are
in use and recycle old ones. That
seems like a lot of overhead for very
little benefit.
And Mark Ross on Google Groups writes
on how GAE identities are generated:
Since the datastore in prod is
comprised of multiple back-ends, we
use a sharded counter approach to dole
out IDs so that we don't have to worry
about different back-ends handing out
the same id. So, back-end A may be
working from a pool of IDs ranging
from 0 to 100 and back-end B may be
working from a pool of IDs ranging
from 101 to 200, and so on. If your
inserts hit different datastore
back-ends you'll get IDs that jump
around a bit. You can depend on these
IDs being unique, but not
monotonically increasing.
I now think that it is very unlikely that Identity values are reused but it would still be good to have a clear definitive answer.

App Engine will never reuse IDs for a given kind and parent. In fact, I think you'll be hard pressed to find a database that does - keeping a simple counter is far, far simpler than trying to figure out which IDs are still in use, and with 64 bits, you're not going to run out of IDs.

System.currentTimeInMillis() as column names(time sorted) in a row of NoSQL database

I want to use long timestamp value(may be generated by System.currentTimeInMillis()) as column names in my database. Can System.currentTimeInMillis() method guarantee an always increasing values ?? I have seen people complaining that sometimes it became slower.. !
I am also open to other alternatives that may be considerable for putting as increasing column names. I just want to guarantee uniqueness(until they fall in same millisecond when I can consider them ok..) & increasing sequence ( may be also perhaps smaller in size (less bytes) if anyhow possible!).
Edit: I have a NoSQL database where column names(& hence columns) are sorted in a row as ascending/descending number sequence. Thus I am looking to generate timestamps as column names that could enable me to sort the columns by time.
I am looking to store comments of a blog post in a single row using timestamp values as column names to enable sort by time. I think I wouldnt mind even if 10 ms is the resolution since probablity of someone commenting in the same 1/100 of a sec on the same blog post on my application would be very low.
Edit: Thank you all for your comments and suggestions. Really helpful.. I think I have got a solution to work around the problems of seldom failures of System.currentTimeInMillis(). I could implement like this:-
When a user adds a new comment to a post, the frontend with send an id 'suggestedId' which is one greater than the id of last comment( frontend would know about this from the previous database read). This id would be compared with the id generated using System.nanotime(). if the suggestedId is less than the generatedId then generatedId will be used else suggestedId would be used. So it simply means whatever is greater, use that Id. This guarantees monotonocity
Although not truly perfect but yes sounds good for practical usage!
Would you guys like to share your thoughts upon this? Thanks!!!

The general database design issues have been addressed by other commenters, but just on this point:
Can System.currentTimeInMillis() method guarantee an always increasing values ?? I have seen people complaining that sometimes it became slower.. !
For future reference, the word for this (always-increasing values) is monotonicity. No, System.currentTimeMillis() is not monotonic. Not only can it go more slowly, or speed up (if, say, the System it's running on is using NTP for time correction), but it can arbitrarily change up or down (if the user, or a script, changes the system time).
System.nanoTime() does not formally guarantee monotonicity; however, the Hotspot JVM does if and only if the underlying system supports it (modern Linux kernels on modern hardware certainly do). Sounds better - with the caveat that some processors use power management techniques etc which can screw this up in the presence of multiple cores. So it's better, but still not perfect.

On many systems, System.currentTimeMillis() does not resolve below 10 ms increments. So two different calls can easily return the same value.
I suggest that you keep an auxiliary table with a counter that you can increment to give the next value.
Why do you want this for column names? It seems a very odd sort of data base design.

I am looking to store comments of a blog post in a single row using timestamp values as column names to enable sort by time.
I'm no NoSQL expert, but I'd say it's not a good idea to store comments as columns in one row. Why don't you add a row per comments along with a timestamp you can sort by?
Using a traditional relational database the table could look like this:
comments
--------
id (PK)
blog_id (FK)
created_on (timestamp)
text
Selecting the comments in order would then be in SQL:
SELECT * from comments WHERE blog_id = ? ORDER BY created_on

System.currentTimeMillis() typically has around 10-20ms granularity, but even if it had 1ms granularity, in principle, 1ms is an eternity in computing time and it would be quite plausible, depending on what you're doing, for two calls to end up with the same value. However, I'm guessing that even 20ms is probably not an eternity compared to how frequently people make blog comments.
So, if two people post a comment within the same 20ms (or whatever), just sorting on this value will not define an order for the posts in question. But do you particularly care about this unlikely situation. If you do, then you need to build in a little bit of extra logic (have a counter for the number of messages posted "this millisecond"). I personally wouldn't bother in your use case.
As far as I can understand, you're also storing the data in a fundamentally silly way. Why not just have a "Comments" table with a row per comment and a single time column, which you can sort on as required.

Many databases provide a way to get serial numbers into column. For example see this -- PostgreSQL Autoincrement

Distributed sequence number generation?

I've generally implemented sequence number generation using database sequences in the past.
e.g. Using Postgres SERIAL type http://www.neilconway.org/docs/sequences/
I'm curious though as how to generate sequence numbers for large distributed systems where there is no database. Does anybody have any experience or suggestions of a best practice for achieving sequence number generation in a thread safe manner for multiple clients?

OK, this is a very old question, which I'm first seeing now.
You'll need to differentiate between sequence numbers and unique IDs that are (optionally) loosely sortable by a specific criteria (typically generation time). True sequence numbers imply knowledge of what all other workers have done, and as such require shared state. There is no easy way of doing this in a distributed, high-scale manner. You could look into things like network broadcasts, windowed ranges for each worker, and distributed hash tables for unique worker IDs, but it's a lot of work.
Unique IDs are another matter, there are several good ways of generating unique IDs in a decentralized manner:
a) You could use Twitter's Snowflake ID network service. Snowflake is a:
Networked service, i.e. you make a network call to get a unique ID;
which produces 64 bit unique IDs that are ordered by generation time;
and the service is highly scalable and (potentially) highly available; each instance can generate many thousand IDs per second, and you can run multiple instances on your LAN/WAN;
written in Scala, runs on the JVM.
b) You could generate the unique IDs on the clients themselves, using an approach derived from how UUIDs and Snowflake's IDs are made. There are multiple options, but something along the lines of:
The most significant 40 or so bits: A timestamp; the generation time of the ID. (We're using the most significant bits for the timestamp to make IDs sort-able by generation time.)
The next 14 or so bits: A per-generator counter, which each generator increments by one for each new ID generated. This ensures that IDs generated at the same moment (same timestamps) do not overlap.
The last 10 or so bits: A unique value for each generator. Using this, we don't need to do any synchronization between generators (which is extremely hard), as all generators produce non-overlapping IDs because of this value.
c) You could generate the IDs on the clients, using just a timestamp and random value. This avoids the need to know all generators, and assign each generator a unique value. On the flip side, such IDs are not guaranteed to be globally unique, they're only very highly likely to be unique. (To collide, one or more generators would have to create the same random value at the exact same time.) Something along the lines of:
The most significant 32 bits: Timestamp, the generation time of the ID.
The least significant 32 bits: 32-bits of randomness, generated anew for each ID.
d) The easy way out, use UUIDs / GUIDs.

You could have each node have a unique ID (which you may have anyway) and then prepend that to the sequence number.
For example, node 1 generates sequence 001-00001 001-00002 001-00003 etc. and node 5 generates 005-00001 005-00002
Unique :-)
Alternately if you want some sort of a centralized system, you could consider having your sequence server give out in blocks. This reduces the overhead significantly. For example, instead of requesting a new ID from the central server for each ID that must be assigned, you request IDs in blocks of 10,000 from the central server and then only have to do another network request when you run out.

Now there are more options.
Though this question is "old", I got here, so I think it might be useful to leave the options I know of (so far):
You could try Hazelcast. In it's 1.9 release it includes a Distributed implementation of java.util.concurrent.AtomicLong
You can also use Zookeeper. It provides methods for creating sequence nodes (appended to znode names, though I prefer using version numbers of the nodes). Be careful with this one though: if you don't want missed numbers in your sequence, it may not be what you want.
Cheers

It can be done with Redisson. It implements distributed and scalable version of AtomicLong. Here is example:
Config config = new Config();
config.addAddress("some.server.com:8291");
Redisson redisson = Redisson.create(config);
RAtomicLong atomicLong = redisson.getAtomicLong("anyAtomicLong");
atomicLong.incrementAndGet();

If it really has to be globally sequential, and not simply unique, then I would consider creating a single, simple service for dispensing these numbers.
Distributed systems rely on lots of little services interacting, and for this simple kind of task, do you really need or would you really benefit from some other complex, distributed solution?

There are a few strategies; but none that i know can be really distributed and give a real sequence.
have a central number generator. it doesn't have to be a big database. memcached has a fast atomic counter, in the vast majority of cases it's fast enough for your entire cluster.
separate an integer range for each node (like Steven Schlanskter's answer)
use random numbers or UUIDs
use some piece of data, together with the node's ID, and hash it all (or hmac it)
personally, i'd lean to UUIDs, or memcached if i want to have a mostly-contiguous space.

Why not use a (thread safe) UUID generator?
I should probably expand on this.
UUIDs are guaranteed to be globally unique (if you avoid the ones based on random numbers, where the uniqueness is just highly probable).
Your "distributed" requirement is met, regardless of how many UUID generators you use, by the global uniqueness of each UUID.
Your "thread safe" requirement can be met by choosing "thread safe" UUID generators.
Your "sequence number" requirement is assumed to be met by the guaranteed global uniqueness of each UUID.
Note that many database sequence number implementations (e.g. Oracle) do not guarantee either monotonically increasing, or (even) increasing sequence numbers (on a per "connection" basis). This is because a consecutive batch of sequence numbers gets allocated in "cached" blocks on a per connection basis. This guarantees global uniqueness and maintains adequate speed. But the sequence numbers actually allocated (over time) can be jumbled when there are being allocated by multiple connections!

Distributed ID generation can be archived with Redis and Lua. The implementation available in Github. It produces a distributed and k-sortable unique ids.

I know this is an old question but we were also facing the same need and was unable to find the solution that fulfills our need.
Our requirement was to get a unique sequence (0,1,2,3...n) of ids and hence snowflake did not help.
We created our own system to generate the ids using Redis. Redis is single threaded hence its list/queue mechanism would always give us 1 pop at a time.
What we do is, We create a buffer of ids, Initially, the queue will have 0 to 20 ids that are ready to be dispatched when requested. Multiple clients can request an id and redis will pop 1 id at a time, After every pop from left, we insert BUFFER + currentId to the right, Which keeps the buffer list going. Implementation here

I have written a simple service which can generate semi-unique non-sequential 64 bit long numbers. It can be deployed on multiple machines for redundancy and scalability. It use ZeroMQ for messaging. For more information on how it works look at github page: zUID

Using a database you can reach 1.000+ increments per second with a single core. It is pretty easy. You can use its own database as backend to generate that number (as it should be its own aggregate, in DDD terms).
I had what seems a similar problem. I had several partitions and I wanted to get an offset counter for each one. I implemented something like this:
CREATE DATABASE example;
USE example;
CREATE TABLE offsets (partition INTEGER, offset LONG, PRIMARY KEY (partition));
INSERT offsets VALUES (1,0);
Then executed the following statement:
SELECT #offset := offset from offsets WHERE partition=1 FOR UPDATE;
UPDATE offsets set offset=#offset+1 WHERE partition=1;
If your application allows you, you can allocate a block at once (that was my case).
SELECT #offset := offset from offsets WHERE partition=1 FOR UPDATE;
UPDATE offsets set offset=#offset+100 WHERE partition=1;
If you need further throughput an cannot allocate offsets in advance you can implement your own service using Flink for real time processing. I was able to get around 100K increments per partition.
Hope it helps!

The problem is similar to:
In iscsi world, where each luns/volumes have to be uniquely identifiable by the initiators running on the client side.
The iscsi standard says that the first few bits have to represent the Storage provider/manufacturer information, and the rest monotonically increasing.
Similarly, one can use the initial bits in the distributed system of nodes to represent the nodeID and the rest can be monotonically increasing.

One solution that is decent is to use a long time based generation.
It can be done with the backing of a distributed database.

My two cents for gcloud. Using storage file.
Implemented as cloud function, can easily be converted to a library.
https://github.com/zaky/sequential-counter

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.