I have a question regarding UUID generation.
Typically, when I'm generating a UUID I will use a random or time-based generation method.
HOWEVER, I'm migrating legacy data from MySQL over to a C* datastore, and I need to change the legacy (auto-incrementing) integer IDs to UUIDs. Instead of creating another denormalized table with the legacy integer IDs as the primary key and all the data duplicated, I was wondering what folks thought about padding 0's onto the front of the integer ID to form a UUID. Example below.
*Something important to note is that the legacy IDs' highest values will never top 1 million, so overflow isn't really an issue.
The idea would look like this:
Legacy ID: 123456 ---> UUID: 00000000-0000-0000-0000-000000123456
This would be done using some string concats and the UUID.fromString("00000000-0000-0000-0000-000000123456") method.
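For illustration, here is a rough sketch of that conversion in Java (the class and method names are just made up for this example):

import java.util.UUID;

public class LegacyIdMapper {

    // Left-pad the legacy ID's decimal digits to 32 characters, insert the
    // dashes, and parse the result as a UUID. Digits 0-9 are all valid hex,
    // so the padded string is always parseable.
    public static UUID fromLegacyId(long legacyId) {
        String padded = String.format("%032d", legacyId);
        return UUID.fromString(padded.substring(0, 8) + "-"
                + padded.substring(8, 12) + "-"
                + padded.substring(12, 16) + "-"
                + padded.substring(16, 20) + "-"
                + padded.substring(20));
    }
}

// fromLegacyId(123456L) -> 00000000-0000-0000-0000-000000123456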
Does this seem like a bad pattern to anyone? I'm not a huge fan of the idea; it leaves a bad taste in my mouth, but I don't have a technical reason why, haha.
As far as collisions go, the probability of a collision occurring is still ridiculously low, so I'm not worried about increasing collisions. I suppose it just seems like bad practice to me, that it's "too easy".
We faced the same kind of issue before when migrating from Oracle with ids generated by sequence to Cassandra with generated UUIDs.
We had to design a type that supported both old data coming from Oracle with type long and new data with uuid.
The obvious solution is to use type blob to store the id. A blob can encode either a long or a uuid.
This solution only works for partition keys, because you query them using =. It won't work for clustering columns used with operators like > or <, because those need an ordering on their values.
There was a small objection at that time: using a blob to store the id makes it opaque to the user. For example, in cqlsh, when you're doing a SELECT and you need to provide the id, how would you construct a blob?
Fortunately, the native functions of CQL bigIntAsBlob(), blobAsBigInt(), uuidAsBlob() and blobAsUUID() come in very handy.
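To give an idea of what those conversions represent, here is a small illustrative Java sketch of the same encodings (8 big-endian bytes for a long, 16 for a uuid); the helper names are made up for this example, not taken from any driver API:

import java.nio.ByteBuffer;
import java.util.UUID;

public class IdBlobs {

    // Roughly what bigintAsBlob() yields: the long as 8 big-endian bytes.
    static ByteBuffer longAsBlob(long legacyId) {
        ByteBuffer buf = ByteBuffer.allocate(8);
        buf.putLong(legacyId);
        buf.flip();
        return buf;
    }

    // Roughly what uuidAsBlob() yields: the uuid as 16 bytes,
    // most significant half first.
    static ByteBuffer uuidAsBlob(UUID id) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(id.getMostSignificantBits());
        buf.putLong(id.getLeastSignificantBits());
        buf.flip();
        return buf;
    }
}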
I've decided to go in a different direction from doanduyhai's answer.
In order to maintain data consistency, we decided to fully de-normalize the data and create another table in C* that is keyed on our legacy IDs. When migrating the objects from our legacy store into C*, they are assigned a new randomly generated UUID, which will be their new primary ID going forward. The legacy IDs will be kept around until we decide they are no longer needed. At that point, we can cleanly drop the legacy ID table and be done with them.
This solution allowed for a cleaner break from our legacy ID system in the future, and allowed us to prevent the use of strange custom made UUIDs. I also wasn't a huge fan of having the ID field as a blob type that could have multiple types of data stored in it since, in the future, we plan on only wanting UUIDs to be there.
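As a rough sketch of the migration step (all names here are purely illustrative, not from the original setup), each legacy row gets a fresh random UUID and the old id only survives in the lookup table until it is dropped:

import java.util.UUID;

public class LegacyMigrationSketch {
    public static void main(String[] args) {
        long legacyId = 123456L;           // old auto-increment ID from MySQL
        UUID newId = UUID.randomUUID();    // new primary ID going forward

        // Hypothetical writes: the main table is keyed by newId; a second
        // table keyed by legacyId stores the mapping so old references can
        // still be resolved, and can be dropped cleanly later.
        System.out.println(legacyId + " -> " + newId);
    }
}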
I'm going to design a merchant application. After merchants register with the system they will be able to add their products, discounts, prices, etc. And there are smart mobile apps through which customers can browse each merchant and their products.
So regarding the database (hope to use MySQL) design I have three options.
1. Use one database and a single table structure to maintain the catalog, with a column called merchant_id.
2. Use one database and create the same table structure for each merchant, with a unique prefix in the table names.
3. Use a separate database with the same table structure for each merchant, created when they register with the system. In this case we will maintain a master DB to keep each merchant's DB information.
We are developing a single application to cater to all the merchants' and customers' requests, and there will be a lot of merchants and customers interacting with the system.
Currently we are planning to use Spring MVC and Spring Data JPA.
So I'm struggling to make the correct decision in terms of scalability, maintainability, etc. Your expert advice/recommendations are highly appreciated.
1) Use one database and a single table structure to maintain the catalog, with a column called merchant_id.
This is the easiest route to take.
Pros
Low maintenance. Any changes to the DB make it to one schema / database.
Cons
Does not scale beyond X merchants and N transactions per second on the database.
2) Use one database and create the same table structure for each merchant, with a unique prefix in the table names.
This is a hybrid model of sorts, and writing the SQL and trying to track which prefix belongs to which app can be messy if you do not handle it correctly.
Pros
Can scale a little better
Cons
Maintenance overhead on each table; for example, adding a new column called created to the user table requires you to modify user_111, user_121, etc.
You can possibly mix up queries by attempting to join user_111 with access_121.
3) Use a separate database with the same table structure for each merchant, created when they register with the system. In this case we will maintain a master DB to keep each merchant's DB information.
This provides the most scale but also gives you the most maintenance overhead.
Pros
Can scale each database individually based on the type of customer you have and the traffic they provide.
Cons
High maintenance for each database, because individual parameters are tweaked at the DB level too (SSD / shared buffers / fsync timing with the disk / write caches, etc.).
If you're starting out by designing a system where you will not know what kind of traffic it will attract on day 1, choose #1. Should the traffic be unexpectedly large, you can always scale vertically and place the high-traffic customers on another db later (through a hashing mechanism that puts the customers into db buckets).
If you expect the site traffic to be large enough and already have capacity planned out for the customers, go for #3. You must bear the brunt of the maintenance overhead, but at least you get to scale each database based on the traffic that hits it.
I'm not a fan of #2 since I've seen that approach let down some products that implemented it.
In my opinion option 1 is the way to go. The benefit I see is that you can work over this table with aggregate queries to perform calculations over each merchant, e.g. your admin view wants to see the top 20 merchants with the highest number of products uploaded.
The drawback you might see in option 1 is that this table will be huge. This can be addressed with partitioning techniques and properly chosen indexes.
Option 2 and 3 are not nice because they introduce redundancy in your schema.
Also you can consider that with JPA your entity classes naturally map to tables, but I think table prefixes per merchant would be painful to hack with JPA. This is also a +1 for option 1.
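To make that concrete, here is a minimal sketch of how option 1 could map with JPA (the entity and column names are only illustrative):

import java.math.BigDecimal;
import javax.persistence.*;

// One catalog table shared by all merchants, discriminated by merchant_id.
@Entity
@Table(name = "product",
       indexes = @Index(name = "idx_product_merchant", columnList = "merchant_id"))
public class Product {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    // Every query for a merchant's catalog filters on this column.
    @Column(name = "merchant_id", nullable = false)
    private Long merchantId;

    private String name;
    private BigDecimal price;
}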
What benefits do you see in option 2 and 3? I don't really see any advantage, only drawbacks.
Q1: What is the difference between generating sequence IDs in a database using
A.
CREATE TABLE Person
(
id BIGINT NOT NULL AUTO_INCREMENT,
...
PRIMARY KEY (id)
)
versus
B.
@Entity
public class Person {
    @Id
    @TableGenerator(name="TABLE_GEN", table="SEQUENCE_TABLE", pkColumnName="SEQ_NAME",
        valueColumnName="SEQ_COUNT", pkColumnValue="PERSON_SEQ")
    @GeneratedValue(strategy=GenerationType.TABLE, generator="TABLE_GEN")
    private long id;
    ...
}
My system is highly concurrent. Since my DB is Microsoft SQL Server, I do not think it supports @SequenceGenerator, so I have to stay with @TableGenerator, which is prone to concurrency issues.
Q2. This link here (http://en.wikibooks.org/wiki/Java_Persistence/Identity_and_Sequencing#Advanced_Sequencing) suggests that B might suffer from concurrency issues, but I do not understand the proposed solution. I would greatly appreciate it if someone could explain to me how to avoid concurrency issues with B. Here is a snippet of their solution:
If a large sequence pre-allocation size is used this becomes less of an issue, because the sequence table is rarely accessed.
Q2.1: How large an allocation size are we talking about here? Should I use allocationSize=10 or allocationSize=100?
Some JPA providers use a separate (non-JTA) connection to allocate the sequence ids in, avoiding or limiting this issue. In this case, if you use a JTA data-source connection, it is important to also include a non-JTA data-source connection in your persistence.xml.
Q2.2: I use EclipseLink as my provider; do I have to do what it suggests above?
Q3. If B suffers from concurrency issues, does A suffer the same?
Using a TableGenerator, the next id value is looked up and maintained in a table, basically managed by JPA and not by your database. This may lead to concurrency issues when you have multiple threads accessing your database and trying to figure out what the next value for the id field should be.
The auto_increment type makes your database take care of the next id of your table, i.e. it is determined automatically by the database server when running the insert, which surely is concurrency-safe.
Update:
Is there something that keeps you from using GenerationType.AUTO?
GenerationType.AUTO selects an appropriate way to generate the id for your entity, so in the best case it uses the built-in functionality. However, you need to check the generated SQL and see what exactly happens there; as MSSQL does not offer sequences, I assume it would use GenerationType.IDENTITY.
As said, the auto_increment column takes care of assigning the next id value, i.e. there is no concurrency issue there, even with multiple threads hitting the database in parallel. The challenge is making this feature usable through JPA.
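For instance, a minimal sketch of mapping an auto_increment / IDENTITY column through JPA, so the server assigns the id at insert time:

import javax.persistence.*;

@Entity
public class Person {

    // The database's identity/auto-increment column assigns the value on
    // INSERT, so concurrent inserts are handled entirely by the server.
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private long id;
}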
A: uses IDENTITY id generation, @GeneratedValue(strategy=GenerationType.IDENTITY)
B: uses TABLE id generation
JPA supports three types, IDENTITY, SEQUENCE and TABLE.
There are trade-offs with both.
IDENTITY does not allow preallocation, so it requires an extra SELECT after every INSERT, prevents batch writing, and requires a flush to access the id, which may lead to poor concurrency.
TABLE allows preallocation, but can have concurrency issues with locks on the sequence table.
Technically SEQUENCE id generation is the best, but not all databases support it.
With TABLE sequencing, if you use a preallocation size of 100, then the row in the sequence table is only locked once every 100 inserts, so as long as you don't commonly have 100 inserts at the same time, you will not suffer any loss in concurrency. If your application does a lot of inserts, maybe use 1000 or a larger value.
EclipseLink will use a separate transaction for TABLE sequencing, so any concurrency issues with locks on the sequence table will be reduced. If you are using JTA, then you need to specify a non-jta-datasource to do this, and configure a sequence-connection-pool in your persistence.xml properties.
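For example, the annotation from the question with a larger preallocation size might look like this (only allocationSize is added; the other attribute values are taken from the question):

import javax.persistence.*;

@Entity
public class Person {

    // With allocationSize = 100, the SEQUENCE_TABLE row is only updated
    // (and locked) once per 100 ids; the rest are handed out from memory.
    @Id
    @TableGenerator(name = "TABLE_GEN", table = "SEQUENCE_TABLE",
                    pkColumnName = "SEQ_NAME", valueColumnName = "SEQ_COUNT",
                    pkColumnValue = "PERSON_SEQ", allocationSize = 100)
    @GeneratedValue(strategy = GenerationType.TABLE, generator = "TABLE_GEN")
    private long id;
}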
I'm using Google App Engine.
If a Long key field is generated by IdGeneratorStrategy.Identity and then the object is deleted from the datastore, is there any chance of the key being used again by a different object of the same class?
papercrane on reddit writes:
The documentation for GenerationType.IDENTITY says that it means the persistence provider (the database) will provide the unique ID. So it is entirely up to your database software if it decides to reuse IDs from deleted records. Without knowing anything else about your problem I'd say it is possible, but I can't think of any good reason for a database server to keep track of which IDs are in use and recycle old ones. That seems like a lot of overhead for very little benefit.
And Mark Ross on Google Groups writes on how GAE identities are generated:
Since the datastore in prod is comprised of multiple back-ends, we use a sharded counter approach to dole out IDs so that we don't have to worry about different back-ends handing out the same id. So, back-end A may be working from a pool of IDs ranging from 0 to 100 and back-end B may be working from a pool of IDs ranging from 101 to 200, and so on. If your inserts hit different datastore back-ends you'll get IDs that jump around a bit. You can depend on these IDs being unique, but not monotonically increasing.
I now think that it is very unlikely that Identity values are reused but it would still be good to have a clear definitive answer.
App Engine will never reuse IDs for a given kind and parent. In fact, I think you'll be hard pressed to find a database that does - keeping a simple counter is far, far simpler than trying to figure out which IDs are still in use, and with 64 bits, you're not going to run out of IDs.
There seems to be only second-class support for composite database keys in Java's JPA (via the EmbeddedId or IdClass annotations). And when I read up on composite keys, regardless of language, people keep describing them as a bad thing. But I cannot understand why. Are composite keys still acceptable to use these days? If not, why not?
I've found one person who agrees with me:
http://weblogs.sqlteam.com/jeffs/archive/2007/08/23/composite_primary_keys.aspx
But another who doesn't:
http://weblogs.java.net/blog/bleonard/archive/2006/11/using_composite.html
Is it just me, or are people unable to distinguish where a composite key is appropriate and where it isn't? I see composite primary keys as useful when the table doesn't represent an entity - i.e. when it represents a join table.
A simple example:
Actor { Id, Name, Email }
Movie { Id, Name, Year }
Character { Id, Name }
Role { Actor, Movie, Character }
Here Actor, Movie and Character obviously benefit from having an Id column as the primary key.
But Role is a many-to-many join table. I see no point in creating an id just to identify a row in the database. To me it seems obvious that the primary key is { Actor, Movie, Character }. A surrogate id also seems like a rather limiting feature: if the data in the join table changes all the time, you could find yourself with primary key collisions once the primary key sequence wraps around to 0.
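For reference, here is a rough sketch of how such a composite key could be mapped with JPA's @IdClass (field types are simplified to plain foreign-key values; the names are just illustrative):

import java.io.Serializable;
import java.util.Objects;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.IdClass;

// The key class must be Serializable and define equals/hashCode.
class RoleId implements Serializable {
    long actor;
    long movie;
    long character;

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof RoleId)) return false;
        RoleId other = (RoleId) o;
        return actor == other.actor
                && movie == other.movie
                && character == other.character;
    }

    @Override
    public int hashCode() {
        return Objects.hash(actor, movie, character);
    }
}

@Entity
@IdClass(RoleId.class)
public class Role {
    // Each column of the composite primary key is marked with @Id.
    @Id private long actor;
    @Id private long movie;
    @Id private long character;
}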
So, back to the original question, is it still acceptable practice to use composite primary keys? If not, why not?
In my personal opinion you should avoid composite primary keys, for several reasons:
Future changes: when you design a database you sometimes miss what in the future will become important. A significant example for this is thinking a combination of two or more fields is unique (and thus can become a primary key), whereas in the future you want to allow NULLs or other non-unique values in them. Having a single primary key is a good solid solution against such changes.
Uniformity: If every table has a unique numerical ID, and you also maintain some standard as to its name (e.g. "ID" or "tablename_id"), the code and SQL referring to it is clearer (in my opinion).
There are other reasons, but these are just a few.
The main question I would ask is why not use a separate primary key if you have a unique set of fields? What's the cost? An additional integer index? That's not too bad.
Hope that helps.
I think there's no problem using a composite key.
To me the database is a component of its own that should be treated the same way we treat code: we want clean code that clearly communicates its intent, does one thing and does it well, and doesn't add any unneeded level of complexity, etc.
Same thing with the db: if the PK is composite, that is the reality, so the model should be kept clean and clear. A composite PK is clearer than the mix of auto-increment + unique constraint. When you see an ID column that does nothing, you need to ask what the real PK is, whether there are any other hidden things you should be aware of, etc. A clear PK doesn't leave any doubts.
The db is the base of your app; to me we need the most solid base we can have. On this base we'll build the app (web or not). So I can't see why we should bend the db model to conform to some specificity of one development tool/framework/language. The data directs the application, not the other way around. What if the ORM changes in the future, becomes obsolete, and a better solution appears that imposes another model? We can't play with the db model to fit this or that framework; the model should stay the same, and it should not depend on what tool we're using to access the data...
If the db model changes in the future, it should change because the functionality changed. If we knew today how that functionality would change, we would be modeling it already. And any future change will be dealt with when the time comes; we can't predict, for instance, the impact on existing data, so one extra column doesn't guarantee that it will withstand any future change...
We should design for today's functionality and keep the db model as simple as possible; this way it will be easy to change/evolve in the future.
Religious wars have been, and still are being, waged on this subject.
OO people have this zealous thing about "identity", and will tell you that the only thing that matters is the ability for you to "identify" "real-life objects" inside your programs, and that composite, "real-life" keys will only get you into trouble when trying to achieve that goal.
Data people have this thing about "uniqueness" that is perceived as "zealous" by the OO side, and will tell you that the only thing that matters is that if the business tells you that the combination of (values for) attribute X and attribute Y must be unique, then it is your job to see to it that the database enforces this business rule of uniqueness of the combined X+Y.
How you want your question answered is just a matter of which religion you prefer. My personal religion is the Data one. That religion has proven to be able to survive any hype and trend ever since 1969.
Similar questions have been asked on SO, and there is no consensus ;)
If you develop a web application, you will love single column pk's, as they make your URLs simpler.
For a sequence to wrap you'd need 2 billion records in a single table (32-bit), or about 10^18 with 64-bit PKs.
Btw, your data model does not allow for movie characters with unknown actors.
My general opinion is... no, don't use composite primary keys.
They will typically complicate ORMs if you use them (ORMs sometimes go so far as to call composite primary keys "legacy behaviour"), and generally if you're using multiple keys, one or more of them will tend to be a natural rather than a technical key, which for me is the bigger problem: IMHO you should certainly favour technical primary keys.
More on this in Database Development Mistakes Made by App Developers.
It's a religious thing. I use natural keys and shun surrogates. I have no problem with composite keys either in theory or in practice.
Only the most trivial logical model would involve no composite keys. Call me lazy but I see no need to complicate the data model by introducing surrogates into the physical model on implementation. Sure, I'd consider one on a table if performance issues were found but I take the same approach as for denormalization i.e. as a last resort. Habitually using surrogates amounts to premature optimization, IMO.
In Ruby on Rails, when not explicitly specifying otherwise, your Role table would be kind of like you described (if the columns are actually the IDs from the other tables). Still, in the database you might want to ensure unique combinations by defining a unique index on those three columns, if only to help the database optimize your queries. With that unique index in place and the framework not using any other primary key anyway, there is no need for an additional numeric primary key in your Role table. Having said that, the unique index could very well be defined as a composite primary key instead.
As for future changes: defining a strict database for your first iteration will prevent unexpected data to be persisted, which will make migrations much easier.
So: I would use composite primary keys.
I would only ever use them in join tables. The only way to absolutely ensure that every record identifier is unique and consistent over time is to use a synthetic key.
Composite keys seem OK in theory, which is why they are tempting to use, but practice has shown that they usually indicate that there is a flaw in your data model. Worse still, in many cases they will fail to guarantee uniqueness, given a large enough data set. And data sets always grow over time, so using them may mean that you have planted a bomb in your application which will only explode when the application has been in production use for a while.
I think that people are underplaying ORMs. Every mainstream programming language has a de facto ORM, and has had for years, because they solve the fundamental incompatibility between OO and relational structures. Trying to write any complex, testable OO software against SQL databases without an ORM is very inefficient, at best.
Good ORMs also provide practices and tooling that make it much easier to create and maintain a consistent, high-quality database schema, so on average a team will come out well ahead by working with an ORM. Handcrafting schemas is rather like writing C++... people can do it, but in the real world it is so hard to maintain quality over time that the average product is not good.
I have almost never seen a case where a composite key was a good idea (the exception being a joining table consisting of only two surrogate keys). In the first place you are wasting space in the child tables. You are harming performance in the joins, as integer joins are generally much faster. If you have the composite key as a clustered index (talking SQL Server here), then you are causing the database to be less efficient about storing records and less efficient in building other indexes - all of which use the clustered index.
When the data in the key changes (as it almost inevitably will), then you need to update all related tables as well, causing massive unnecessary updates and wasting processing power on a task that is completely unneeded when the database is designed to use surrogate keys. Primary keys need not only to be unique but to be unchanging. Composite keys often fail the second test.
So you are thinking of using a technique that harms performance, causes poor use of memory and database storage, uses way more space in child records (another waste of resources) and requires painful updating of what may be millions of child records when things change. And which might make it hard to use an ORM? Why would you do that? Because you are too lazy to put a surrogate key on and then define a unique index on the potential composite key? Is there any gain at all to using a composite key? For the lack of 5 minutes of work you are permanently harming your database?
In terms of the domain model, I see nothing wrong with creating a composite primary key when the table doesn't represent an entity - i.e. when it represents a join table (as you mention in your question) - other than that, if it is not monotonically increasing, you will get a certain amount of page splits during insertions.
Some ORMs don't cope well with composite primary keys, so perhaps it is safer to create a surrogate auto-incrementing integer for the primary key, and cover the columns with a non-clustered index.