Integer to UUID conversion using padded 0's - java

I have a question regarding UUID generation.
Typically, when I'm generating a UUID I will use a random or time based generation method.
HOWEVER, I'm migrating legacy data from MySQL over to a C* datastore and I need to change the legacy (auto-incrementing) integer IDs to UUIDS. Instead of creating another denormalized table with the legacy integer IDs as the primary key and all the data duplicated, I was wondering what folks thought about padding 0's onto the front of the integer ID to form a UUID. Example below.
*Something important to note is that the legacy IDs highest values will never top 1 million, so overflow isn't really an issue.
The idea would look like this:
Legacy ID: 123456 ---> UUID: 00000000-0000-0000-0000-000000123456
This would be done using some string concats and the UUID.fromString("00000000-0000-0000-0000-000000123456" method.
Does this seem like a bad pattern to anyone? I'm not a huge fan of the idea, gives me a bad taste in my mouth, but I don't have a technical reason for why haha.
As far as collisions go, the probability of a collision occurring is still ridiculously low. So I'm not worried about increasing collisions. I suppose it just seems like bad practice to me, that its "too easy".

We faced the same kind of issue before when migrating from Oracle with ids generated by sequence to Cassandra with generated UUIDs.
We had to design a type to both support old data coming from Oracle with type long and new data with uuid.
The obvious solution is to use type blob to store the id. A blob can encode a long or an uuid.
This solution only works for partition key because you query them using =. It won't work for clustering column using operators like > or < because we need an ordering on their value.
There was a small objection at that time, which was using a blob to store the id makes it opaque to user, for example in cqlsh when you're doing a SELECT and you need to provide the id, how would you make a blob ?
Fortunately, the native functions of CQL bigIntAsBlob(), blobAsBigInt(), uuidAsBlob() and blobAsUUID() come in very handy.

I've decided to go in a different direction from doanduyhai's answer.
In order to maintain data consistency, we decided to fully de-normalize the data and create another table in C* that is keyed on our legacy IDs. When migrating the objects from our legacy into C*, they are assigned a new randomly generated UUID, which will be their new primary ID for the future. The legacy IDs will be kept around until such a time that we decide they are no longer needed. Upon that time, we can cleanly drop the legacy ID table and be done with them.
This solution allowed for a cleaner break from our legacy ID system in the future, and allowed us to prevent the use of strange custom made UUIDs. I also wasn't a huge fan of having the ID field as a blob type that could have multiple types of data stored in it since, in the future, we plan on only wanting UUIDs to be there.

Related

jOOQ using one database for generating code and another for executing it

We are changing databases, from one that supports an 8 bit int to one that does not. Our code breaks when Liquibase creates a DB that causes jOOQ to generate "short" variables but our code uses byte/Byte - this breaks code signatures.
Rather than recode, somebody suggested that we continue to use the previous database (HSQLDB) to generate the code and it "should" run with the new database. There are dissenting opinions, I cannot find anything definitive beyond intuition and it seems to be counter to what jOOQ was designed for. Has anyone done this successfully?
There is obviously no absolute yes/no answer to such a question, but there are several solutions/workarounds:
Use the previous database product to generate code
This will work for a short period of time, e.g. right now, but as you move on, it will be an extremely limiting factor for your schema design. You will continue tailoring your DDL and some other design decisions around what HSQLDB can do, and you won't be able to leverage other features of your new database product. This can be especially limiting when migrating data, as ALTER TABLE statements are quite different between dialects.
I would recommend this approach only for a very short period of time, e.g. if you can't thoroughly fix this right away.
Use jOOQ's <forcedType/> mechanism to rewrite your data types
jOOQ's code generator allows for rewriting data types prior to loading the meta data of your schema into the code generator. This way, you can pretend your byte types are TINYINT on your new database product, even if your new database product doesn't support TINYINT.
This is a thorough solution that you may want to implement regardless of what product you're using, as it will give you a way to re-define parts of your schema just for jOOQ's code generator, independently of how you're generating your code.
The feature is documented here:
https://www.jooq.org/doc/latest/manual/code-generation/codegen-advanced/codegen-config-database/codegen-database-forced-types
This is definitely a more long term solution for your case.
Notice, a future jOOQ will be able to use CHECK constraints as input meta data to decide whether to apply such a <forcedType/>. I would imagine that you will place a CHECK (my_smallint BETWEEN -128 AND 127) constraint on every such column, so you could easily recognise which columns to apply the <forcedType/> to: https://github.com/jOOQ/jOOQ/issues/8843
Until that feature is available, you can implement it yourself via programmatic code generator configuration:
https://www.jooq.org/doc/latest/manual/code-generation/codegen-programmatic/
Or, starting with jOOQ 3.12, by using a SQL expression to produce the regular expression that <forcedType/> matches. E.g. in Oracle:
<forcedType>
<name>TINYINT</name>
<sql>
select listagg(owner || '.' || table_name || '.'
|| regexp_replace(search_condition_vc, ' between.*', ''), '|')
from user_constraints
where constraint_type = 'C'
and regexp_like(search_condition_vc, '.* between -128 and 127');
</sql>
</forcedType>
You could use a file based meta data source
jOOQ doesn't have to connect to a live database instance to reverse engineer your schema. You can also pass DDL code to jOOQ, or XML files:
https://www.jooq.org/doc/latest/manual/code-generation/codegen-ddl/
https://www.jooq.org/doc/latest/manual/code-generation/codegen-xml/
This is not really solving your problem directly, but maybe, it might make solving it a bit easier. However, there are other limitations to these approaches, e.g. stored procedures aren't currently (jOOQ 3.12) supported, so I'm just adding this for completeness' sake here, not to suggest you use it right now.

Should I use a key/value database to store my API logs?

I get a lot of logs from my API. I analyse those logs to get interesting information like how many users for the API in this month or what type of activities they do.
All of the analysis I do depend on a period. So the timestamp is very important for me.
In fact, actually I use indexes on the timestamp. The problem is that timestamp is continue.
My question is which database is the more appropriate for my use case?
I heard about key/value databases, is it interesting to use the timestamp as a key?
Thanks.
This is a two-year-old article from IBM that talks more about SQL implementation, but it is also possibly something to keep in mind when you do a NoSQL implementation:
"Why CURRENT TIMESTAMP produces poor primary keys" - https://www.ibm.com/developerworks/community/blogs/SQLTips4DB2LUW/entry/current_timestamp?lang=en
Of course, your app would be different, I'm not sure of the granularity of your time-stamping, but it is possible to have two items logfiled at the same timestamp.
You might be better off creating some other form of unique key algorithm for your key-value store, adding some sort of serialization per timestamp. So the first item at a timestamp is ".1", the second ".2", etc. So you'd have some sort of timestamp.serialid format.
The other thought I have is: are you merging API log files from multiple applications/processes or machines? You might be able to do some sort of elementid.appid.timestamp.serialid to make unique key.
It all depends on your use case, so I can't say more for sure. I also wonder what you want to do with your key-value store in terms of reads/analysis after-the-fact, as that might highly alter your NoSQL solution. If you are planning to do a lot of log analysis, then, yes, there's a good reason to put that into a NoSQL database, especially if you want to do something like fast analysis of data, and then push some of the older items back into disk for storage.
As for databases, obviously each vendor will stick up for their product; but choose the best tool for the job. Best to try before you buy, and test things out for your specific setup. I'm from Aerospike, so I'm obviously biased towards it as a Key-Value store: http://www.aerospike.com/
Talked to a Very Smart Guy today, and he also suggested that you might want to use something like "milliseconds since date-time 'x'" as a primary key. Depending on what you are logging, there might still be a chance of collision with that as a primary key.
Therefore, another suggestion would be to take all entries for that primary key (ex: all log entries for that millisecond) and load them into the same record, in a kind of "bucket." You'd need application logic to parse out the multiple log entries under the same primary key, but that's another way to skin the cat.

Primary Key Type: int vs long

I know some software shops have been burned by using the int type for the primary key of a persistent class. That being said, not all tables grow past 2 billions. As a matter of fact, most don't.
So, do you guys use the long type only for those classes that are mapped to potentially large tables OR for every persistent class just to be consistent? What's the industry concensus?
I'll leave this question open for a while so that you can share with us your success/horror stories.
Long can be advantageous even if the table does not grow super large, yet has a high turnover ie if rows are deleted/inserted frequently. Your auto-generated/sequential unique identifier may increment to a high number while the table remains small.
I generally use Long because the performance benefits are not noticeable in most of my projects, however a bug due to overflow would be very noticeable!
That's not to say that Int is not a better option for other people's scenarios, for example for data crunching or complex query systems. Just be clear of the risks/benefits and how they impact your specific project.
I don't know about "burned". It's not difficult to change from int to long when you need to. The conversion is straight forward in SQL, and then it's just a search and replace in your client code (or make the change in your persistence layer, and then compile and see what breaks.) You're moving from one integer type to another, so you don't have to worry about subtle conversion issues or truncation..
Going from float to double would be a lot harder.
I use Integer for my surrogate keys unless I have a need for them to be something else. It is not necessary to always use a Long if you don't have a need for it.
(I typically use JPA/Hibernate in my projects running against either Oracle 10g or MySQL 5.x databases.)
Because Int will always be faster for Select/Sorts.

What are the various options and their tradeoffs for storing a UUID in a MYSQL table?

I'm planning on using client provided UUID's as the primary key in several tables in a MySQL Database.
I've come across various mechanisms for storing UUID's in a MySQL database but nothing that compares them against each other. These include storage as:
BINARY(16)
CHAR(16)
CHAR(36)
VARCHAR(36)
2 x BIGINT
Are there any better options, how do the options compare against each other in terms of:
storage size?
query overhead? (index issues, joins etc.)
ease of inserting and updating values from client code? (typically Java via JPA)
Are there any differences based on which version of MySQL your running, or the storage engine? We're currently running 5.1 and were planning on using InnoDB. I'd welcome any comments based on practical experience of trying to use UUIDs. Thanks.
I would go with storing it in a Binary(16) column, if you are indeed set on using UUIDs at all. something like 2x bigint would be quite cumbersome to manage. Also, i've heard of people reversing them because the start of the UUIDs on the same machine tend to be the same at the beginning, and the different parts are at the end, so if you reverse them, your indexes will be more efficient.
Of course, my instinct says that you should be using auto increment integers unless you have a really good reason for using the UUID. One good reason is generating unique keys accross different databases. The other option is that you plan to have more records than an INT can store. Although not many applications really need things like this. THere is not only a lot of efficiency lost when not using integers for your keys, and it's also harder to work with them. they are too long to type in, and passing them around in your URLs make the URLs really long. So, go with the UUID if you need it, but try to stay away.
I have used UUIDs for smart client online/offline storage and data synchronization and for databases that I knew would have to be merged at some point. I have always used char(36) or char(32)(no dashes). You get a slight performance gain over varchar and almost all databases support char. I have never tried binary or bigint. One thing to be aware of, is that char will pad with spaces if you do not use 36 or 32 characters. Point being, don't write a unit test that sets the ID of an object to "test" and then try to find it in the database. ;)

Should I use composite primary keys or not?

There seems to only be 2nd class support for composite database keys in Java's JPA (via EmbeddedId or IdClass annotations). And when I read up on composite keys, regardless of language, people keep coming across as them being a bad thing. But I cannot understand why. Are composite keys still acceptable to use these days? If not, why not?
I've found one person who agrees with me:
http://weblogs.sqlteam.com/jeffs/archive/2007/08/23/composite_primary_keys.aspx
But another who doesn't:
http://weblogs.java.net/blog/bleonard/archive/2006/11/using_composite.html
Is it just me, or are people not able to make the distinction of where a composite key is appropriate or not? I see composite primary keys useful when the table doesn't represent an entity - i.e. when it represents a join table.
A simple example:
Actor { Id, Name, Email }
Movie { Id, Name, Year }
Character { Id, Name }
Role { Actor, Movie, Character }
Here Actor, Movie and Character obviously benefit from having an Id column as the primary key.
But Role is a Many-To-Many join table. I see no point in creating an id just to identify a row in the database. To me it seems obvious that the primary key is { Actor, Movie, Character }. It also seems like a rather limiting feature, especially if the data in the join table changes all the time, you could find yourself with primary key collisions once the primary key sequence wraps around to 0.
So, back to the original question, is it still acceptable practice to use composite primary keys? If not, why not?
In my personal opinion you should avoid composite primary keys due to several reasons:
Future changes: when you design a database you sometimes miss what in the future will become important. A significant example for this is thinking a combination of two or more fields is unique (and thus can become a primary key), whereas in the future you want to allow NULLs or other non-unique values in them. Having a single primary key is a good solid solution against such changes.
Uniformity: If every table has a unique numerical ID, and you also maintain some standard as to its name (e.g. "ID" or "tablename_id"), the code and SQL referring to it is clearer (in my opinion).
There are other reasons, but these are just a few.
The main question I would ask is why not use a separate primary key if you have a unique set of fields? What's the cost? An additional integer index? That's not too bad.
Hope that helps.
I think there's no problem using a composite key.
To me the database it's a component on its own, that should be treated the same way we treat code : for instance we want clean code, that communicates clearly its intent, that does one thing and does it well, that doesn't add any uneeded level of complexity, etc.
Same thing with the db, if the PK is composite, this is the reality, so the model should be kept clean and clear. A composite PK it's clearer than the mix auto-increment + constraint. When you see an ID column that does nothing you need to ask what's the real PK, are there any other hidden things that you should be aware of, etc. A clear PK doesn't leave any doubts.
The db is the base of your app, to me we need the most solid base that we can have. On this base we'll build the app ( web or not ). So I can't see why we should bend the db model to conform to some specific in one development tool/framework/language. The data is directing the application, not the other way around. What if the ORM changes in the future and becomes obsolete and a better solution appears that imposes another model ? We can't play with the db model to fit this or that framework, the model should stay the same, it should not depend on what tool we're using to access the data ...
If the db model change in the future, it should change because functionality changed. If we would know today how this functionality will change, we'll be modeling this already. ANd any future change will be dealt with when the time comes, we can't predict for instance the impact on existing data, so one extra column doesn't guarantee that it will withold any future change ...
We should design for today's functionality, and keep the db model the simplest possible, this way it will be easy to change/evolve in the future.
Religious wars have been, and still are, going on on this subject.
OO people have this zealous thing about "identity", and will tell you that the only thing that matters is the ability for you to "identify" "real-life objects" inside your programs, and that composite, "real-life" keys will only get you into trouble when trying to achieve that goal.
Data people have this thing about "uniqueness" that is perceived as "zealous" by the OO side, and will tell you that the only thing that matters is that if the business tells you that the combination of (values for) attribute X and attribute Y must be unique, then it is your job to see to it that the database enforces this business rule of uniqueness of the combined X+Y.
How you want your question answered is just a matter of which religion you prefer. My personal religion is the Data one. That religion has proven to be able to survive any hype and trend ever since 1969.
Similar questions have been asked on SO, and there is no consensus ;)
If you develop a web application, you will love single column pk's, as they make your URLs simpler.
For a sequence to wrap you'd need 2 billion records in a single table (32bit), or 10^18 with 64 bit pk's.
Btw, your data model does not allow for movie characters with unknown actors.
My general opinion is... no. don't use composite primary keys.
They will typically complicate ORMs if you use them (ORMs sometimes go so far as to call composite primary keys "legacy behaviour") and generally if you're using multiple keys, one or more of them will tend to be natural rather than technical keys, which for me is the bigger problem: IMHO you should certainly favour technical primary keys.
More on this in Database Development Mistakes Made by AppDevelopers.
It's a religious thing. I use natural keys and shun surrogates. I have no problem with composite keys either in theory or in practice.
Only the most trivial logical model would involve no composite keys. Call me lazy but I see no need to complicate the data model by introducing surrogates into the physical model on implementation. Sure, I'd consider one on a table if performance issues were found but I take the same approach as for denormalization i.e. as a last resort. Habitually using surrogates amounts to premature optimization, IMO.
In Ruby for Rails, when not explicitly specifying otherwise, your Role table would be kind of like you described (if the columns are actually the IDs from the other tables). Still, in the database you might want to ensure unique combinations by defining a unique index on those three columns, if only to help the database optimizing your queries. With that unique index in place and the framework not using any other primary key anyway, there is no need for a an additional numeric primary key in your Role table. Having said that, the unique index could could very be defined as a composite primary key instead.
As for future changes: defining a strict database for your first iteration will prevent unexpected data to be persisted, which will make migrations much easier.
So: I would use composite primary keys.
I would only ever use them in join tables. The only way to absolutely ensure that every record identifier is unique and consistent over time is to use a synthetic key.
Composite keys seem OK in theory, which is why they are tempting to use, but practice has shown that they usually indicate that there is a flaw in your data model. Worse still, in many cases they will fail to guarantee uniqueness, given a large enough data set. And data sets always grow over time, so using them may mean that you have planted a bomb in your application which will only explode when the application has been in production use for a while.
I think that people are underplaying ORMs. Every mainstream programming language has a defacto ORM, and has had for years, because they solve the fundamental incompatibility between OO and relational structures. Trying to write any complex, testable OO software against SQL databases without an ORM is very inefficient, at best.
Good ORMs also provide practices and tooling that make it much easier to create and maintain consistent high-quality database schema, so on average, a team will come out well ahead by working with an ORM. Handcrafting schema is rather like writing C++ ...people can do it, but in the real world it is so hard to maintain quality over time that the average product is not good.
I have almost never seen a case where a composite key was a good idea (exception, joining table consisting of only two surrogate keys). In the first palce you are wasting space in the child tables. You are harming performance in the joins as integer joins are generally much faster. If you have the composite key as a clustered index (talking SQL Server here), then you are causing the database to be less efficient about storing records and less efficient in building other indexes - all of which use the clusterd index.
When the data in the key changes (As it almost inevitably will) then you need to update all related tables as well casuing massive unecessary updates and wasting processing power on a task that is completely uneeded when the database is designed to use surrogaste keys. Primary keys need not only to be unique but to be unchanging. Composite keys often fail the second test.
So you are thinking of using a technique that harms performance, causes poor use of memory and database storage, uses way more space in child records (another waste of resources) and requires painful updating of what may be millions of child records when things change. And which might make it hard to use an ORM? Why would you do that? Because you are too lazy to put a surrogate key on and then define a unique index on the potential composite key? Is there any gain at all to using a composite index? For the lack of 5 minutes of work you are permanently harming your database?
In terms of the domain model, I see nothing wrong with creating a composite primary key when the table doesn't represent an entity - i.e. when it represents a join table (as you mention in your question), other than if it is not montonically increasing, then you will get a certain amount of page splits during insertions.
Some ORM's don't cope well with composite primary keys, so perhaps it is safer to create a surrogate auto-integer for the primary key, and cover the columns with a non-clustered index.

Categories

Resources