Should I use composite primary keys or not? - java

There seems to be only second-class support for composite database keys in Java's JPA (via the EmbeddedId or IdClass annotations). And when I read up on composite keys, regardless of language, people keep describing them as a bad thing, but I cannot understand why. Are composite keys still acceptable to use these days? If not, why not?
I've found one person who agrees with me:
http://weblogs.sqlteam.com/jeffs/archive/2007/08/23/composite_primary_keys.aspx
But another who doesn't:
http://weblogs.java.net/blog/bleonard/archive/2006/11/using_composite.html
Is it just me, or are people unable to distinguish where a composite key is appropriate and where it isn't? I see composite primary keys as useful when the table doesn't represent an entity - i.e. when it represents a join table.
A simple example:
Actor { Id, Name, Email }
Movie { Id, Name, Year }
Character { Id, Name }
Role { Actor, Movie, Character }
Here Actor, Movie and Character obviously benefit from having an Id column as the primary key.
But Role is a many-to-many join table. I see no point in creating an id just to identify a row in the database. To me it seems obvious that the primary key is { Actor, Movie, Character }. A surrogate id also seems rather limiting: if the data in the join table changes all the time, you could eventually find yourself with primary key collisions once the primary key sequence wraps around to 0.
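For reference, the kind of JPA mapping I'm talking about looks roughly like this (a sketch only; RoleId is a hypothetical embeddable, and equals()/hashCode() are omitted):

import java.io.Serializable;
import javax.persistence.Embeddable;
import javax.persistence.EmbeddedId;
import javax.persistence.Entity;
import javax.persistence.ManyToOne;
import javax.persistence.MapsId;

@Embeddable
class RoleId implements Serializable {
    private Long actorId;
    private Long movieId;
    private Long characterId;
    // equals() and hashCode() over all three fields are required by JPA
}

@Entity
public class Role {
    @EmbeddedId
    private RoleId id;

    // Each association is tied to the matching part of the embedded id.
    // Character is the entity from the example above, not java.lang.Character.
    @ManyToOne @MapsId("actorId")
    private Actor actor;

    @ManyToOne @MapsId("movieId")
    private Movie movie;

    @ManyToOne @MapsId("characterId")
    private Character character;
}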
So, back to the original question, is it still acceptable practice to use composite primary keys? If not, why not?

In my personal opinion you should avoid composite primary keys due to several reasons:
Future changes: when you design a database you sometimes miss what will become important in the future. A significant example of this is thinking that a combination of two or more fields is unique (and thus can become a primary key), whereas later you want to allow NULLs or other non-unique values in those fields. Having a single-column primary key is a good, solid safeguard against such changes.
Uniformity: If every table has a unique numerical ID, and you also maintain some standard as to its name (e.g. "ID" or "tablename_id"), the code and SQL referring to it is clearer (in my opinion).
There are other reasons, but these are just a few.
The main question I would ask is why not use a separate primary key if you have a unique set of fields? What's the cost? An additional integer index? That's not too bad.
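To illustrate with the question's own Role table (just a sketch, assuming JPA annotations; the column names are mine), the surrogate key becomes the @Id and the old candidate key is kept as a plain unique constraint:

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.Table;
import javax.persistence.UniqueConstraint;

@Entity
@Table(name = "role",
       uniqueConstraints = @UniqueConstraint(columnNames = {"actor_id", "movie_id", "character_id"}))
public class Role {

    // Surrogate primary key: cheap to generate, never changes
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    // The business rule "one role per actor/movie/character" is still
    // enforced, but by the unique constraint rather than by the PK
    @Column(name = "actor_id", nullable = false)
    private Long actorId;

    @Column(name = "movie_id", nullable = false)
    private Long movieId;

    @Column(name = "character_id", nullable = false)
    private Long characterId;
}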
Hope that helps.

I think there's no problem using a composite key.
To me the database is a component of its own that should be treated the same way we treat code: we want clean code that clearly communicates its intent, that does one thing and does it well, and that doesn't add any unneeded level of complexity.
The same goes for the db: if the PK is composite, that is the reality, so the model should be kept clean and clear. A composite PK is clearer than the mix of auto-increment plus unique constraint. When you see an ID column that does nothing, you have to ask what the real PK is and whether there are other hidden things you should be aware of. A clear PK doesn't leave any doubts.
The db is the base of your app, and to me it needs to be the most solid base we can have; on that base we'll build the app (web or not). So I can't see why we should bend the db model to conform to the specifics of one development tool/framework/language. The data directs the application, not the other way around. What if the ORM becomes obsolete in the future and a better solution appears that imposes another model? We can't keep reshaping the db model to fit this or that framework; the model should stay the same and should not depend on what tool we're using to access the data.
If the db model changes in the future, it should change because the functionality changed. If we knew today how that functionality will change, we would be modeling it already. Any future change will be dealt with when the time comes; we can't predict, for instance, the impact on existing data, so one extra column doesn't guarantee that the model will withstand any future change.
We should design for today's functionality and keep the db model as simple as possible; that way it will be easy to change and evolve in the future.

Religious wars have been, and still are, going on on this subject.
OO people have this zealous thing about "identity", and will tell you that the only thing that matters is the ability for you to "identify" "real-life objects" inside your programs, and that composite, "real-life" keys will only get you into trouble when trying to achieve that goal.
Data people have this thing about "uniqueness" that is perceived as "zealous" by the OO side, and will tell you that the only thing that matters is that if the business tells you that the combination of (values for) attribute X and attribute Y must be unique, then it is your job to see to it that the database enforces this business rule of uniqueness of the combined X+Y.
How you want your question answered is just a matter of which religion you prefer. My personal religion is the Data one. That religion has proven to be able to survive any hype and trend ever since 1969.

Similar questions have been asked on SO, and there is no consensus ;)
If you develop a web application, you will love single column pk's, as they make your URLs simpler.
For a sequence to wrap you'd need about 2 billion records in a single table with 32-bit keys, or roughly 9x10^18 with 64-bit pk's.
Btw, your data model does not allow for movie characters with unknown actors.

My general opinion is... no, don't use composite primary keys.
They will typically complicate ORMs if you use them (ORMs sometimes go so far as to call composite primary keys "legacy behaviour"), and generally, if you're using multiple keys, one or more of them will tend to be natural rather than technical keys, which for me is the bigger problem: IMHO you should certainly favour technical primary keys.
More on this in Database Development Mistakes Made by App Developers.

It's a religious thing. I use natural keys and shun surrogates. I have no problem with composite keys either in theory or in practice.
Only the most trivial logical model would involve no composite keys. Call me lazy but I see no need to complicate the data model by introducing surrogates into the physical model on implementation. Sure, I'd consider one on a table if performance issues were found but I take the same approach as for denormalization i.e. as a last resort. Habitually using surrogates amounts to premature optimization, IMO.

In Ruby on Rails, when not explicitly specifying otherwise, your Role table would be pretty much like you described (if the columns are actually the IDs from the other tables). Still, in the database you might want to ensure unique combinations by defining a unique index on those three columns, if only to help the database optimize your queries. With that unique index in place and the framework not using any other primary key anyway, there is no need for an additional numeric primary key in your Role table. Having said that, the unique index could very well be defined as a composite primary key instead.
As for future changes: defining a strict database for your first iteration will prevent unexpected data from being persisted, which will make migrations much easier.
So: I would use composite primary keys.

I would only ever use them in join tables. The only way to absolutely ensure that every record identifier is unique and consistent over time is to use a synthetic key.
Composite keys seem OK in theory, which is why they are tempting to use, but practice has shown that they usually indicate that there is a flaw in your data model. Worse still, in many cases they will fail to guarantee uniqueness, given a large enough data set. And data sets always grow over time, so using them may mean that you have planted a bomb in your application which will only explode when the application has been in production use for a while.
I think that people are underplaying ORMs. Every mainstream programming language has a de facto ORM, and has had for years, because they solve the fundamental incompatibility between OO and relational structures. Trying to write any complex, testable OO software against SQL databases without an ORM is very inefficient, at best.
Good ORMs also provide practices and tooling that make it much easier to create and maintain a consistent, high-quality database schema, so on average a team will come out well ahead by working with an ORM. Handcrafting schemas is rather like writing C++... people can do it, but in the real world it is so hard to maintain quality over time that the average product is not good.

I have almost never seen a case where a composite key was a good idea (the exception being a join table consisting of only two surrogate keys). In the first place you are wasting space in the child tables. You are harming join performance, as integer joins are generally much faster. If you have the composite key as a clustered index (talking SQL Server here), then you are causing the database to be less efficient at storing records and less efficient at building other indexes - all of which use the clustered index.
When the data in the key changes (as it almost inevitably will), you need to update all related tables as well, causing massive unnecessary updates and wasting processing power on a task that is completely unneeded when the database is designed to use surrogate keys. Primary keys need to be not only unique but also unchanging. Composite keys often fail the second test.
So you are thinking of using a technique that harms performance, causes poor use of memory and database storage, uses far more space in child records (another waste of resources), requires painful updating of what may be millions of child records when things change, and which might make it hard to use an ORM? Why would you do that? Because you are too lazy to add a surrogate key and then define a unique index on the potential composite key? Is there any gain at all to using a composite key? For the lack of five minutes of work you would permanently harm your database?

In terms of the domain model, I see nothing wrong with creating a composite primary key when the table doesn't represent an entity - i.e. when it represents a join table (as you mention in your question) - other than that, if the key is not monotonically increasing, you will get a certain number of page splits during insertions.
Some ORM's don't cope well with composite primary keys, so perhaps it is safer to create a surrogate auto-integer for the primary key, and cover the columns with a non-clustered index.

Related

Is there a difference between finding by primary key vs finding by unique column?

If I have an entity with a primary key id and a unique column name, is there any difference between doing a SQL request via findById(long id) or findByName(String name)? Can a search for the primary key be done in O(1) while the other one works in O(n)? Which data structures are used for storing them?
The difference is speed: running a SQL query against an integer will always be faster than running it against a string.
From the perspective of the order of complexity of the operation, the two are equivalent.
As others have pointed out, an integer lookup is generally faster than a string lookup. Here are three reasons:
The index would typically be smaller, because integers are 4 bytes and strings are typically bigger.
Indexes on fixed length keys have some additional efficiencies in the tree structure (no need to "find the end of the string").
In many databases, strings incur additional overhead to handle collations.
That said, another factor is that the primary key is often clustered in many databases. This eliminates the final lookup of the row in data pages -- which might be a noticeable efficiency as well. Note that not all databases support clustered indexes, so this is not true in all cases.
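To make that concrete, here is a minimal sketch (assuming a hypothetical Customer entity with a surrogate id and a unique, indexed name column, plus an open EntityManager): both lookups are index searches, so neither is O(n).

import javax.persistence.EntityManager;

public class CustomerLookups {

    // Lookup by primary key: uses the PK index (clustered in many databases)
    public static Customer findById(EntityManager em, long id) {
        return em.find(Customer.class, id);
    }

    // Lookup by the unique name column: uses the secondary index on "name";
    // still O(log n) for a B-tree, just with more expensive string comparisons
    public static Customer findByName(EntityManager em, String name) {
        return em.createQuery(
                    "SELECT c FROM Customer c WHERE c.name = :name", Customer.class)
                 .setParameter("name", name)
                 .getSingleResult();
    }
}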
If both columns were INTEGER, the answer would be "no". A PRIMARY KEY is effectively a UNIQUE constraint on a column and little more. Additionally, since both usually cause internal indexes to be created, they behave basically the same way.
In your specific case, however, the NAME column is a string. Even though it has a UNIQUE constraint, by virtue of its data type you will incur some performance loss.
As your question is probably dictated by "ease of use" to some extent (for debugging purposes it's certainly easier to remember the "name" than the "id"), the questions you need to ask yourself are:
Will the NAME column always be unique, or could it be changed to something not unique? Should it actually be unique in the first place (maybe it was set up wrong)?
How many rows do you expect in your table? This is important to know: while a small table won't really show any performance issues, a high row count may start to show some.
How many transactions per second do you expect? If it's an internal application or a small amateur project, you can live with querying the NAME column, whereas if you need extreme scalability you should stay away from it.

Integer to UUID conversion using padded 0's

I have a question regarding UUID generation.
Typically, when I'm generating a UUID I will use a random or time based generation method.
HOWEVER, I'm migrating legacy data from MySQL over to a C* datastore and I need to change the legacy (auto-incrementing) integer IDs to UUIDS. Instead of creating another denormalized table with the legacy integer IDs as the primary key and all the data duplicated, I was wondering what folks thought about padding 0's onto the front of the integer ID to form a UUID. Example below.
*Something important to note is that the legacy IDs' highest values will never top 1 million, so overflow isn't really an issue.
The idea would look like this:
Legacy ID: 123456 ---> UUID: 00000000-0000-0000-0000-000000123456
This would be done using some string concatenation and the UUID.fromString("00000000-0000-0000-0000-000000123456") method.
Does this seem like a bad pattern to anyone? I'm not a huge fan of the idea (it leaves a bad taste in my mouth), but I don't have a technical reason for why.
As far as collisions go, the probability of a collision occurring is still ridiculously low, so I'm not worried about increasing collisions. I suppose it just seems like bad practice to me, that it's "too easy".
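For what it's worth, the conversion I have in mind is roughly this (a sketch; it relies on the legacy IDs staying small enough that their decimal digits fit in the last UUID group):

import java.util.UUID;

public class LegacyIdToUuid {

    // Pads the legacy integer ID with zeros to form the UUID string from
    // the example above. Note that the decimal digits are simply reused as
    // hex digits, so converting back means reading the last group as decimal.
    public static UUID fromLegacyId(long legacyId) {
        return UUID.fromString(String.format("00000000-0000-0000-0000-%012d", legacyId));
    }

    public static void main(String[] args) {
        System.out.println(fromLegacyId(123456L)); // 00000000-0000-0000-0000-000000123456
    }
}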
We faced the same kind of issue before when migrating from Oracle with ids generated by sequence to Cassandra with generated UUIDs.
We had to design a type that supports both the old data coming from Oracle as type long and the new data as uuid.
The obvious solution is to use type blob to store the id. A blob can encode either a long or a uuid.
This solution only works for partition keys, because you query them using =. It won't work for clustering columns used with operators like > or <, because those need an ordering on their values.
There was a small objection at the time: using a blob to store the id makes it opaque to the user. For example, in cqlsh, when you're doing a SELECT and need to provide the id, how would you construct the blob?
Fortunately, the native functions of CQL bigIntAsBlob(), blobAsBigInt(), uuidAsBlob() and blobAsUUID() come in very handy.
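For illustration, producing the bytes that such a blob column holds could look roughly like this on the client side (a sketch only; the CQL functions above do the equivalent conversion server-side):

import java.nio.ByteBuffer;
import java.util.UUID;

public class IdBlobCodec {

    // A legacy id becomes an 8-byte blob
    public static ByteBuffer fromLong(long id) {
        ByteBuffer buf = ByteBuffer.allocate(8);
        buf.putLong(id);
        buf.flip();
        return buf;
    }

    // A generated uuid becomes a 16-byte blob
    public static ByteBuffer fromUuid(UUID id) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(id.getMostSignificantBits());
        buf.putLong(id.getLeastSignificantBits());
        buf.flip();
        return buf;
    }
}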
I've decided to go in a different direction from doanduyhai's answer.
In order to maintain data consistency, we decided to fully de-normalize the data and create another table in C* that is keyed on our legacy IDs. When migrating the objects from our legacy system into C*, they are assigned a new randomly generated UUID, which will be their new primary ID from then on. The legacy IDs will be kept around until we decide they are no longer needed; at that point, we can cleanly drop the legacy ID table and be done with them.
This solution allows for a cleaner break from our legacy ID system in the future, and lets us avoid strange custom-made UUIDs. I also wasn't a huge fan of having the ID field as a blob type that could have multiple kinds of data stored in it, since in the future we only want UUIDs to be there.

Should I use a key/value database to store my API logs?

I get a lot of logs from my API. I analyse those logs to get interesting information, like how many users used the API this month or what types of activity they performed.
All of the analysis I do depend on a period. So the timestamp is very important for me.
In fact, I currently use indexes on the timestamp. The problem is that the timestamp is a continuous value.
My question is which database is the more appropriate for my use case?
I have heard about key/value databases; would it make sense to use the timestamp as a key?
Thanks.
This is a two-year-old article from IBM that talks more about SQL implementation, but it is also possibly something to keep in mind when you do a NoSQL implementation:
"Why CURRENT TIMESTAMP produces poor primary keys" - https://www.ibm.com/developerworks/community/blogs/SQLTips4DB2LUW/entry/current_timestamp?lang=en
Of course, your app may be different, and I'm not sure of the granularity of your time-stamping, but it is possible to have two items logged at the same timestamp.
You might be better off creating some other form of unique key for your key-value store, adding some sort of serialization per timestamp: the first item at a timestamp is ".1", the second ".2", and so on, giving you a timestamp.serialid format.
The other thought I have is: are you merging API log files from multiple applications/processes or machines? You might be able to do some sort of elementid.appid.timestamp.serialid to make a unique key.
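A toy sketch of that timestamp-plus-serial idea (single process only, in-memory, and the map would need pruning in a real system; the names are mine):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class LogKeyFactory {

    // One counter per timestamp, so entries logged in the same millisecond
    // still get distinct keys
    private final ConcurrentHashMap<Long, AtomicLong> serials = new ConcurrentHashMap<>();

    public String keyFor(long timestampMillis) {
        long serial = serials.computeIfAbsent(timestampMillis, t -> new AtomicLong())
                             .incrementAndGet();
        return timestampMillis + "." + serial;   // e.g. "1457000000123.1"
    }
}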
It all depends on your use case, so I can't say more for sure. I also wonder what you want to do with your key-value store in terms of reads/analysis after-the-fact, as that might highly alter your NoSQL solution. If you are planning to do a lot of log analysis, then, yes, there's a good reason to put that into a NoSQL database, especially if you want to do something like fast analysis of data, and then push some of the older items back into disk for storage.
As for databases, obviously each vendor will stick up for their product; but choose the best tool for the job. Best to try before you buy, and test things out for your specific setup. I'm from Aerospike, so I'm obviously biased towards it as a Key-Value store: http://www.aerospike.com/
Talked to a Very Smart Guy today, and he also suggested that you might want to use something like "milliseconds since date-time 'x'" as a primary key. Depending on what you are logging, there might still be a chance of collision with that as a primary key.
Therefore, another suggestion would be to take all entries for that primary key (ex: all log entries for that millisecond) and load them into the same record, in a kind of "bucket." You'd need application logic to parse out the multiple log entries under the same primary key, but that's another way to skin the cat.

ejb3: mapping many-to-many relationship jointable with a simple primary key

Often in books, I see that when a many-to-many relationship is translated to a DB schema, the JoinTable gets a compound key consisting of the primary keys of the tables involved in the many-to-many relationship.
I try to avoid compound keys completely. So I usually create a surrogate key even for the JoinTable and allow the database to fill it in via a trigger or whatever feature the database has for incrementing primary keys. It just seems like a much simpler approach.
The only issue I can think of is that there is a chance of duplicate foreign key pairs in the JoinTable. But this can be avoided by a simple query before a row is inserted into the JoinTable.
Since books always use the compound key approach, I wanted to know if there are any negative effects if I use simple one-column surrogate keys for the JoinTable?
In my opinion, using a single-column surrogate primary key here is a bad idea.
First, as you said, a surrogate key alone won't ensure uniqueness of the pair in the database. Sure, you can check this at runtime using a simple query, but that is costly, and if you forget to run your check query just once, you can put your database in an inconsistent state.
Also, I think that using an additional column for the primary key is, in this case, not necessary. You don't need to identify relationships by a separate primary key, since the relationship is already defined by two unique keys: the primary keys of your two tables. You would just have unnecessary data that adds complexity to your data model and that you'll probably never use.
I try to avoid compound keys completely.
Huh? Why? I mean, why avoid them completely? What's the reason behind this?
So I usually create a surrogate key even for the JoinTable and allow the database to fill it in via a trigger or whatever feature the database has for incrementing primary keys. It just seems like a much simpler approach.
No offense but we don't have the same definition of simplicity then. I really don't see where it is simpler.
The only issue I can think of is that there is a chance of duplicate foreign key pairs in the JoinTable. But this can be avoided by a simple query before a row is inserted into the JoinTable.
First of all, don't use a SELECT to check uniqueness (you can have a race condition; a SELECT without locking the whole table won't guarantee anything). Use a UNIQUE constraint, that's what UNIQUE is for.
And let's imagine for one second that a SELECT would have been possible: do you really find that simpler? In general, people try to avoid hitting the database if possible, and they also avoid having to do extra work.
Since books always use the compound key approach, I wanted to know if there are any negative effects if I use simple one-column surrogate keys for the JoinTable?
So you mean something like this:
A          A_B                       B
-------    ---------------------     -------
ID (PK)    ID (PK)                   ID (PK)
           A_ID (FK)
           B_ID (FK)
           UNIQUE(A_ID, B_ID)
Sure, you could do that (and you could even map it with JPA if you use some kind of trigger or identity column for the ID). But I don't see the point:
1. The above design is just not the standard way to map a (m:n) relation; it's not what people are used to finding.
2. A_B is not really an Entity by itself (which is what the model somehow suggests, see #1).
3. The pair (A_ID, B_ID) is a natural candidate for the key, so why not use it (instead of wasting space on a surrogate)?
4. The above design is not simpler; it actually introduces more complexity.
To sum up, I don't see any advantage.
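That said, if you do go with the surrogate-key variant above, the JPA mapping could look roughly like this (just a sketch; A and B stand for the two entities, and the class name is mine):

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.JoinColumn;
import javax.persistence.ManyToOne;
import javax.persistence.Table;
import javax.persistence.UniqueConstraint;

@Entity
@Table(name = "A_B",
       uniqueConstraints = @UniqueConstraint(columnNames = {"A_ID", "B_ID"}))
public class AB {

    // Surrogate key for the join row
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    // The (A_ID, B_ID) pair is demoted from primary key to unique constraint
    @ManyToOne(optional = false)
    @JoinColumn(name = "A_ID")
    private A a;

    @ManyToOne(optional = false)
    @JoinColumn(name = "B_ID")
    private B b;
}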

Multilingual fields in DB tables

I have an application that needs to support a multilingual interface, five languages to be exact. For the main part of the interface the standard ResourceBundle approach can be used to handle this.
However, the database contains numerous tables whose elements contain human readable names, descriptions, abstracts etc. It needs to be possible to enter each of these in all five languages.
While I suppose I could simply have fields on each table like
NameLang1
NameLang2
...
I feel that that leads to a significant amount of largely identical code when writing the beans that represent each table.
From a purely object-oriented point of view the solution is, however, simple. Each class simply has a Text object that contains the relevant text in each of the languages. This is further helpful in that only one of the languages is mandated; the others have fallback rules (e.g. if language 4 is missing, return language 2, which falls back to language 1, which is mandatory).
Unfortunately, mapping this back to a relational database, means that I wind up with a single table that some 10-12 other tables FK to (some tables have more than one FK to it in fact).
This approach seems to work and I've been able to map the data to POJOs with Hibernate. About the only thing you can't do is map from a Text object back to its parent (since you have no way of knowing which table you should link to), but then there is hardly any need to do that.
So, overall this seems to work but it just feels wrong to have multiple tables reference one table like this. Anyone got a better idea?
If it matters I'm using MySQL...
I had to do that once... multilingual text for some tables... I don't know if I found the best solution but what I did was have the table with the language-agnostic info and then a child table with all the multilingual fields. At least one record was required in the child table, for the default language; more languages could be added later.
In Hibernate you can map the info from the child tables as a Map and get the info for the language you want, implementing the fallback on your POJO like you said. You can have different getters for the multilingual fields that internally call the fallback method to get the appropriate child object for the needed language and then just return the required field.
This approach uses more tables (one extra table for every table that needs multilingual info), but the performance is much better, and so is the maintainability, I think...
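A rough sketch of that parent/child mapping, using JPA 2 annotations (the Product entity and the column names are mine; with several multilingual fields per row, the map value would be an embeddable object instead of a String):

import java.util.HashMap;
import java.util.Map;
import javax.persistence.CollectionTable;
import javax.persistence.Column;
import javax.persistence.ElementCollection;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.JoinColumn;
import javax.persistence.MapKeyColumn;

@Entity
public class Product {

    @Id
    @GeneratedValue
    private Long id;

    // Child table: one row per language, keyed by language code
    @ElementCollection
    @CollectionTable(name = "product_i18n", joinColumns = @JoinColumn(name = "product_id"))
    @MapKeyColumn(name = "language_code")
    @Column(name = "name")
    private Map<String, String> names = new HashMap<>();

    // Fallback: requested language first, then the mandatory default language
    public String getName(String language, String defaultLanguage) {
        String name = names.get(language);
        return name != null ? name : names.get(defaultLanguage);
    }
}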
The standard translation approach as used, for example, in gettext is to use a single string to describe the concept and make a call to a translate method which translates to the destination language.
This way you only need to store a single string in the database (the canonical representation) and then make a call in your application to the translate method to get the translated string. No FKs and total flexibility, at the cost of a little runtime performance (and maybe a bit more maintenance trouble, but with some thought there's no need to make maintenance a problem in this scenario).
The approach I've seen in an application with a similar problem is to use a "text id" column to store a reference, and have a single table with all the translations. This also provides some flexibility in reusing the same keys to reduce the number of required translations, which is an expensive part of the project.
It also provides a good separation between the data and the translations, which in my opinion are more of a UI thing.
If it is the case that the strings you require are not that many after all, then you can just load them all in memory once and use some method to provide translations by checking a data structure in memory.
With this approach, your beans won't have getters for each language, but you would use some other translator object:
MyTranslator.translate(myBean.getNameTextId());
Depending on your requirements, it may be best to have a separate label table for each table that needs to be multilingual, e.g. an XYZ table with an xyz_id column, and an XYZ_Label table with xyz_id, language_code, label, other_label, etc.
The advantage of this over having a single huge labels table is that you can define unique constraints on the XYZ_Label table (e.g. the English name for XYZ must be unique), and you can do indexed lookups much more efficiently, since the index will only cover a single table at a time (e.g. if you need to look up XYZ entities by English name).
What about this:
http://rob.purplerockscissors.com/2009/07/24/internationalizing-websites/
...that is what user "Chochos" says in response #2
