PHP Generating unique id from database - java

In my project, I generate a unique id by taking the largest integer value of the 'serial_key' attribute in a database table and adding 1 to it. The result is used as the unique index when inserting a new record.
But this mechanism failed when I deployed the application on multiple PCs over an intranet or the internet: different machines generated the same 'unique' id at the same instant. I already have plenty of data on the server, so I have to keep the same id pattern, since the ids follow a specific format. Please suggest how to resolve this problem. Thanks.

You can use the hi/lo algorithm to generate unique keys. It can be tuned for performance or for continuous keys. If you implement it in all your clients (I guess Java and PHP), you get unique keys across as many clients as you like (or as many as your database performance allows). You are also not dependent on any particular database, and if you tune it for throughput you won't need many additional database queries.
See this SO-Answer.
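For illustration only, a minimal hi/lo client sketch in Java (the class is hypothetical; nextHi() stands for whatever atomic fetch-and-increment you run against your database):

// Hi/lo sketch: one database round trip reserves a whole block of ids.
public class HiLoGenerator {
    private final int blockSize;  // ids covered by one hi value, e.g. 100
    private long hi = -1;         // last hi value fetched from the database
    private int lo;               // position inside the current block

    public HiLoGenerator(int blockSize) {
        this.blockSize = blockSize;
    }

    public synchronized long nextId() {
        if (hi < 0 || lo >= blockSize) {
            hi = nextHi();        // hypothetical atomic fetch-and-increment in the DB
            lo = 0;
        }
        return hi * blockSize + lo++;
    }

    private long nextHi() {
        throw new UnsupportedOperationException("wire this to your database");
    }
}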

You can solve this by using an AUTO_INCREMENT attribute on your serial_key column. That way you don't have to worry about data collision. This is a common practice for primary keys.
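If you take that route, a rough JDBC sketch of reading back the key MySQL assigned (the table, column and connection details are made up for illustration):

import java.sql.*;

// Sketch: insert a row and read back the AUTO_INCREMENT value the server assigned.
public class InsertExample {
    public static long insertRecord(String payload) throws SQLException {
        String url = "jdbc:mysql://localhost/mydb";  // hypothetical connection details
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO records (payload) VALUES (?)",
                     Statement.RETURN_GENERATED_KEYS)) {
            ps.setString(1, payload);
            ps.executeUpdate();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                keys.next();
                return keys.getLong(1);  // the serial_key MySQL chose
            }
        }
    }
}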

Proper "allocator table" design for portable DB-based key allocation: to be used in preference to Scott Ambler's misguided "hi-lo" idea.
create table KEY_ALLOC (
    SEQ  varchar(32) not null,
    NEXT bigint not null,
    primary key (SEQ)
);
To allocate the next, say, 20 keys (which are then held as a range in the server & used as needed):
select NEXT from KEY_ALLOC where SEQ=?;
update KEY_ALLOC set NEXT=(old value+20) where SEQ=? and NEXT=(old value);
Providing you can commit this transaction (use retries to handle contention), you have allocated 20 keys & can dispense them as needed.
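A minimal JDBC sketch of that select/conditional-update loop, assuming the KEY_ALLOC table above (error handling and transaction demarcation trimmed):

// Allocate the next chunk of keys from KEY_ALLOC; retry if another node wins the race.
static long allocateChunk(Connection con, String seq, int chunkSize) throws SQLException {
    while (true) {
        long next;
        try (PreparedStatement sel = con.prepareStatement(
                "SELECT NEXT FROM KEY_ALLOC WHERE SEQ = ?")) {
            sel.setString(1, seq);
            try (ResultSet rs = sel.executeQuery()) {
                rs.next();
                next = rs.getLong(1);
            }
        }
        try (PreparedStatement upd = con.prepareStatement(
                "UPDATE KEY_ALLOC SET NEXT = ? WHERE SEQ = ? AND NEXT = ?")) {
            upd.setLong(1, next + chunkSize);
            upd.setString(2, seq);
            upd.setLong(3, next);
            if (upd.executeUpdate() == 1) {
                return next;  // this node now owns keys next .. next + chunkSize - 1
            }
            // another node updated the row first; loop and retry
        }
    }
}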
This scheme is 10x faster than allocating from an Oracle sequence, and is 100% portable amongst all databases.
Unlike Scott Ambler's hi-lo idea, it treats the keyspace as a contiguous linear numberline -- from which it efficiently allocates small chunks of configurable size. These keys are human-friendly and large chunks are not wasted. Mr Ambler's idea allocates the high 16- or 32-bits, and requires either ugly composite keys or generating large human-unfriendly key values as the hi-words increment.
Comparison of allocated keys:
Linear_Chunk    Hi_Lo
   100           65536
   101           65537
   102           65538
   ..        server restart
   120          131072
   121          131073
   122          131074
   ..        server restart
   140          196608
Guess which keys are easier to work with for a developer or DB admin.
I actually corresponded with him back in the 90's to suggest this improved scheme to him, but he was too stuck & obstinate to acknowledge the advantages of using a linear number-line.

Related

DynamoDB Scan Query and BatchGet

We have a DynamoDB table whose primary key consists of a Hash key and a Range key.
Hash = date.random_number
Range = timestamp
How do we get items between timestamp X and timestamp Y? Since the hash key has a random_number attached, the query has to be fired once for each random number. Is it possible to give multiple hash values with a single RangeKeyCondition?
What would be most efficient in terms of cost and time?
The random number ranges from 1 to 10.
If I understood correctly, you have a table with the following definition of Primary Keys:
Hash Key : date.random_number
Range Key : timestamp
One thing that you have to keep in mind is that, whether you are using GetItem or Query, you have to be able to calculate the Hash Key in your application in order to successfully retrieve one or more items from your table.
It makes sense to use the random numbers as part of your Hash Key so your records can be evenly distributed across the DynamoDB partitions, however, you have to do it in a way that your application can still calculate those numbers when you need to retrieve the records.
With that in mind, let's create the query needed for the specified requirements. The native AWS DynamoDB operations that you have available to obtain several items from your table are:
Query, BatchGetItem and Scan
In order to use BatchGetItem you would need to know beforehand the entire primary key (Hash Key and Range Key), which is not the case.
The Scan operation will literally go through every record of your table, something that in my opinion is unnecessary for your requirements.
Lastly, the Query operation allows you to retrieve one or more items from a table applying the EQ (equality) operator to the Hash Key and a number of other operators that you can use when you don't have the entire Range Key or would like to match more than one.
The operator options for the Range Key condition are: EQ | LE | LT | GE | GT | BEGINS_WITH | BETWEEN
It seems to me that the most suitable for your requirements is the BETWEEN operator, that being said, let's see how you could build the query with the chosen SDK:
Table table = dynamoDB.getTable(tableName);

String hashKey = "<YOUR_COMPUTED_HASH_KEY>";
String timestampX = "<YOUR_TIMESTAMP_X_VALUE>";
String timestampY = "<YOUR_TIMESTAMP_Y_VALUE>";

RangeKeyCondition rangeKeyCondition =
        new RangeKeyCondition("RangeKeyAttributeName").between(timestampX, timestampY);

ItemCollection<QueryOutcome> items = table.query("HashKeyAttributeName", hashKey,
        rangeKeyCondition,
        null,  // FilterExpression - not used in this example
        null,  // ProjectionExpression - not used in this example
        null,  // ExpressionAttributeNames - not used in this example
        null); // ExpressionAttributeValues - not used in this example
You might want to look at the following post to get more information about DynamoDB Primary Keys:
DynamoDB: When to use what PK type?
QUESTION: My concern is querying multiple times because of the random_number attached to the hash key. Is there a way to combine these queries and hit DynamoDB once?
Your concern is completely understandable; however, the only way to fetch all the records via BatchGetItem is by knowing the entire primary key (HASH + RANGE) of every record you intend to get. Although minimizing the HTTP round trips to the server might seem to be the best solution at first sight, the documentation actually suggests doing exactly what you are doing to avoid hot partitions and uneven use of your provisioned throughput:
Design For Uniform Data Access Across Items In Your Tables
"Because you are randomizing the hash key, the writes to the table on
each day are spread evenly across all of the hash key values; this
will yield better parallelism and higher overall throughput. [...] To
read all of the items for a given day, you would still need to Query
each of the 2014-07-09.N keys (where N is 1 to 200), and your
application would need to merge all of the results. However, you will
avoid having a single "hot" hash key taking all of the workload."
Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
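In other words, you issue one Query per suffix and merge the results client-side. A rough sketch reusing the Document API objects from the snippet above (the date variable, table and attribute names are placeholders):

// Sketch: one Query per random suffix (1..10), results merged in the application.
List<Item> allItems = new ArrayList<Item>();
for (int n = 1; n <= 10; n++) {
    String hashKey = date + "." + n;  // e.g. "2014-07-09.3"
    ItemCollection<QueryOutcome> items = table.query("HashKeyAttributeName", hashKey,
            new RangeKeyCondition("RangeKeyAttributeName").between(timestampX, timestampY));
    for (Item item : items) {
        allItems.add(item);
    }
}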
Here is another interesting point suggesting moderate use of reads against a single partition... if you remove the random number from the hash key in order to get all records in one shot, you are likely to run into this issue, regardless of whether you are using Scan, Query or BatchGetItem:
Guidelines for Query and Scan - Avoid Sudden Bursts of Read Activity
"Note that it is not just the burst of capacity units the Scan uses
that is a problem. It is also because the scan is likely to consume
all of its capacity units from the same partition because the scan
requests read items that are next to each other on the partition. This
means that the request is hitting the same partition, causing all of
its capacity units to be consumed, and throttling other requests to
that partition. If the request to read data had been spread across
multiple partitions, then the operation would not have throttled a
specific partition."
And lastly, because you are working with time series data, it might be helpful to look into some best practices suggested by the documentation as well:
Understand Access Patterns for Time Series Data
For each table that you create, you specify the throughput
requirements. DynamoDB allocates and reserves resources to handle your
throughput requirements with sustained low latency. When you design
your application and tables, you should consider your application's
access pattern to make the most efficient use of your table's
resources.
Suppose you design a table to track customer behavior on your site,
such as URLs that they click. You might design the table with hash and
range type primary key with Customer ID as the hash attribute and
date/time as the range attribute. In this application, customer data
grows indefinitely over time; however, the applications might show
uneven access pattern across all the items in the table where the
latest customer data is more relevant and your application might
access the latest items more frequently and as time passes these items
are less accessed, eventually the older items are rarely accessed. If
this is a known access pattern, you could take it into consideration
when designing your table schema. Instead of storing all items in a
single table, you could use multiple tables to store these items. For
example, you could create tables to store monthly or weekly data. For
the table storing data from the latest month or week, where data
access rate is high, request higher throughput and for tables storing
older data, you could dial down the throughput and save on resources.
You can save on resources by storing "hot" items in one table with
higher throughput settings, and "cold" items in another table with
lower throughput settings. You can remove old items by simply deleting
the tables. You can optionally backup these tables to other storage
options such as Amazon Simple Storage Service (Amazon S3). Deleting an
entire table is significantly more efficient than removing items
one-by-one, which essentially doubles the write throughput as you do
as many delete operations as put operations.
Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html

How can I analyse ~13GB of data?

I have ~300 text files that contain data on trackers, torrents and peers. Each file is organised like this:
tracker.txt
time torrent
time peer
time peer
...
time torrent
...
I have several files per tracker and much of the information is repeated (same information, different time).
I'd like to be able to analyse what I have and report statistics on things like
How many torrents are at each tracker
How many trackers are torrents listed on
How many peers do torrents have
How many torrents do peers have
The sheer quantity of data is making this hard for me to do. Here's what I've tried.
MySQL
I put everything into a database; one table per entity type and tables to hold the relationships (e.g. this torrent is on this tracker).
Adding the information to the database was slow (and I didn't have 13GB of it when I tried this) but analysing the relationships afterwards was a no-go. Every mildly complex query took over 24 hours to complete (if at all).
An example query would be:
SELECT COUNT(DISTINCT torrent)
FROM TorrentAtPeer, Peer
WHERE TorrentAtPeer.peer = Peer.id
GROUP BY Peer.ip;
I tried bumping up the memory allocations in my my.cnf file but it didn't seem to help. I used the my-innodb-heavy-4G.cnf settings file.
EDIT: Adding table details
Here's what I was using:
Peer              Torrent                    Tracker
-----------       -----------------------    ------------------
id   (bigint)     id        (bigint)         id  (bigint)
ip*  (int)        infohash* (varchar(40))    url (varchar(255))
port (int)

TorrentAtPeer          TorrentAtTracker
-----------------      ----------------
id       (bigint)      id       (bigint)
torrent* (bigint)      torrent* (bigint)
peer*    (bigint)      tracker* (bigint)
time     (int)         time     (int)

*indexed field. Navicat reports them as being of normal type and Btree method.
id - Always the primary key
There are no foreign keys. I was confident in my ability to only use IDs that corresponded to existing entities, adding a foreign key check seemed like a needless delay. Is this naive?
Matlab
This seemed like an application that was designed for some heavy lifting but I wasn't able to allocate enough memory to hold all of the data in one go.
I didn't have numerical data, so I was using cell arrays; I moved from those to tries in an effort to reduce the footprint, but I couldn't get it to work.
Java
My most successful attempt so far. I found an implementation of Patricia Tries provided by the people at Limewire. Using this I was able to read in the data and count how many unique entities I had:
13 trackers
1.7mil torrents
32mil peers
I'm still finding it too hard to work out the frequencies of the number of torrents at peers. I'm attempting to do so by building tries like this:
Trie<String, Trie<String, Object>> peers = new Trie<String, Trie<String, Object>>(...);

for (String line : file) {
    if (containsTorrent(line)) {
        infohash = getInfohash(line);
    }
    else if (containsPeer(line)) {
        Trie<String, Object> torrents = peers.get(getPeer(line));
        torrents.put(infohash, null);
    }
}
From what I've been able to do so far, if I can get this peers trie built then I can easily find out how many torrents are at each peer. I ran it all yesterday, and when I came back I noticed that the log file wasn't being written to. I ^Z'd the application and time reported the following:
real 565m41.479s
user 0m0.001s
sys 0m0.019s
This doesn't look right to me; should user and sys be so low? I should mention that I've also increased the JVM's heap size to 7GB (max and start); without that I rather quickly get an out-of-memory error.
I don't mind waiting for several hours/days, but it looks like the thing grinds to a halt after about 10 hours.
I guess my question is, how can I go about analysing this data? Are the things I've tried the right things? Are there things I'm missing? The Java solution seems to be the best so far; is there anything I can do to get it to work?
You state that your MySQL queries took too long. Have you ensured that proper indices are in place to support the kind of request you submitted? In your example, that would be an index for Peer.ip (or even a nested index (Peer.ip,Peer.id)) and an index for TorrentAtPeer.peer.
As I understand your Java results, you have a lot of data but not that many different strings. So you could perhaps save some time by assigning a unique number to each tracker, torrent and peer: use one table for each, with an indexed column holding the string and a numeric primary key as the id. That way, all tables relating these entities would only have to deal with those numbers, which could save a lot of space and make your operations a lot faster.
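In Java that mapping might look like the sketch below (a hypothetical helper, just to show the idea of replacing repeated strings with small integers):

import java.util.HashMap;
import java.util.Map;

// Assign each distinct string (tracker URL, infohash, peer) a compact integer id.
class StringInterner {
    private final Map<String, Integer> ids = new HashMap<String, Integer>();

    int idOf(String s) {
        Integer id = ids.get(s);
        if (id == null) {
            id = ids.size();      // next free id: 0, 1, 2, ...
            ids.put(s, id);
        }
        return id;
    }
}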
I would give MySQL another try but with a different schema:
do not use id-columns here
use natural primary keys here:
Peer: ip, port
Torrent: infohash
Tracker: url
TorrentPeer: peer_ip, torrent_infohash, peer_port, time
TorrentTracker: tracker_url, torrent_infohash, time
use innoDB engine for all tables
This has several advantages:
InnoDB uses clustered indexes for the primary key. This means that all data can be retrieved directly from the index without an additional lookup when you only request data from primary key columns, so InnoDB tables are effectively index-organized tables.
Smaller size, since you do not have to store the surrogate keys. That means speed, because less IO is needed for the same results.
You may be able to do some queries now without using (expensive) joins, because you use natural primary and foreign keys. For example the linking table TorrentAtPeer directly contains the peer ip as foreign key to the peer table. If you need to query the torrents used by peers in a subnetwork you can now do this without using a join, because all relevant data is in the linking table.
If you want the torrent count per peer and you want the peer's ip in the results too then we again have an advantage when using natural primary/foreign keys here.
With your schema you have to join to retrieve the ip:
SELECT Peer.ip, COUNT(DISTINCT torrent)
FROM TorrentAtPeer, Peer
WHERE TorrentAtPeer.peer = Peer.id
GROUP BY Peer.ip;
With natural primary/foreign keys:
SELECT peer_ip, COUNT(DISTINCT torrent)
FROM TorrentAtPeer
GROUP BY peer_ip;
EDIT
Well, the originally posted schema was not the real one. Now the Peer table has a port field. I would suggest using the primary key (ip, port) here and still dropping the id column. This also means that the linking table needs to have a multicolumn foreign key. Adjusted the answer ...
If you could use C++, you should take a look at Boost flyweight.
Using flyweight, you can write your code as if you had strings, but each instance of a string (your tracker name, etc.) uses only the size of a pointer.
Regardless of the language, you should convert the IP address to an int (take a look at this question) to save some more memory.
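For example, in Java a dotted IPv4 address can be packed into a single int roughly like this (sketch, no input validation):

// Pack "a.b.c.d" into a 32-bit int; reverse the shifts to get the octets back.
static int ipToInt(String ip) {
    int value = 0;
    for (String octet : ip.split("\\.")) {
        value = (value << 8) | Integer.parseInt(octet);
    }
    return value;
}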
You most likely have a problem that can be solved by NoSQL and distributed technologies.
i) I would write a distributed system using Hadoop/HBase.
ii) Rent several tens or hundreds of AWS machines, but only for a few seconds (it'll still cost you less than $0.50).
iii) Profit!!!

Ways to generate roughly sequential Ids (32 bits & 64 bits sizes) in a distributed application

What are the good ways to generate unique "roughly" sequential Ids (32 bits & 64 bits sizes) in a distributed application ?
[Side note: My DB does not provide this facility.] Also, I do not want to use 128-bit UUIDs.
EDIT: Thanks all for your response!
As some of you suggest using a MySQL db like Flickr's ticket servers: I worry that using multiple servers (in order to eliminate a single point of failure) may disturb the sequential nature of the generated ids, since some servers may lag behind the others. I am OK with a lag of a few id positions, but I cannot afford huge disturbances in sequentiality.
Building on Chris's idea of having a table to generate the "next" identity: a more reliable way of doing this is to have two servers and a round-robin load balancer. Server 1 distributes odd numbers, Server 2 distributes even numbers. If you lose one server, you get all odds or all evens, but the show still goes on.
Flickr uses something very similar for their ID system
http://code.flickr.com/blog/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/
Then creatively use MySQL's atomic REPLACE syntax as follows:
CREATE TABLE `Tickets64` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `stub` char(1) NOT NULL default '',
  PRIMARY KEY (`id`),
  UNIQUE KEY `stub` (`stub`)
) ENGINE=MyISAM;

REPLACE INTO Tickets64 (stub) VALUES ('a');
SELECT LAST_INSERT_ID();
Where the stub value is sufficient to generate the next identity in an atomic fashion.
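From a Java client that could look roughly like the sketch below (connection handling omitted; this just replays the two statements above):

// Sketch: fetch the next ticket from the Tickets64 table shown above.
static long nextTicket(Connection con) throws SQLException {
    try (Statement st = con.createStatement()) {
        st.executeUpdate("REPLACE INTO Tickets64 (stub) VALUES ('a')");
        try (ResultSet rs = st.executeQuery("SELECT LAST_INSERT_ID()")) {
            rs.next();
            return rs.getLong(1);  // LAST_INSERT_ID() is per-connection, so this is safe
        }
    }
}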
Update with the OP's chronology and sequence requirements
With chronology as a driver your choices change a little: you need atomic state in a single point of failure (SPOF), Chris's idea in SQL for example. This will be a bottleneck, and its state must be durable to prevent duplicate IDs being issued.
To achieve chronology at scale with high-availability in a distributed system, causal sequence algorithms are pretty much the only way to go - there are a few out there:
Lamport timestamps
Vector Clocks
Dependency Sequences
Hierarchical Clocks
The calling pattern is quite different than the SPOF strategy, they require you to track and pass a memento of sequence, timestamp or version - effectively session information for the sequence you are generating. They do however guarantee causal order for any given sequence or item even in a distributed system. In most cases an event PK would be the compound of item identifier + causal sequence id.
Since you have a database, you can create a table with the next value in it. Acquiring this value requires you to select the value and then update the row with the new next value where the row still holds the old value; if the update fails to affect any rows, then some other process or thread acquired the value before you, and you need to retry your attempt to get the next value.
The pseudo code for this would look something like this
do
    select nextvalue into MyValue from SequenceTable
    update SequenceTable set nextvalue = nextvalue + 1 where nextvalue = MyValue
while rowsaffected = 0
In the above, MyValue is the variable that will hold the next value from the database table SequenceTable, and rowsaffected is an indicator of how many rows were affected by the last SQL statement, which in this case is the update statement.

Distributed sequence number generation?

I've generally implemented sequence number generation using database sequences in the past.
e.g. Using Postgres SERIAL type http://www.neilconway.org/docs/sequences/
I'm curious though as how to generate sequence numbers for large distributed systems where there is no database. Does anybody have any experience or suggestions of a best practice for achieving sequence number generation in a thread safe manner for multiple clients?
OK, this is a very old question, which I'm first seeing now.
You'll need to differentiate between sequence numbers and unique IDs that are (optionally) loosely sortable by a specific criterion (typically generation time). True sequence numbers imply knowledge of what all other workers have done, and as such require shared state. There is no easy way of doing this in a distributed, high-scale manner. You could look into things like network broadcasts, windowed ranges for each worker, and distributed hash tables for unique worker IDs, but it's a lot of work.
Unique IDs are another matter, there are several good ways of generating unique IDs in a decentralized manner:
a) You could use Twitter's Snowflake ID network service. Snowflake is a:
Networked service, i.e. you make a network call to get a unique ID;
which produces 64 bit unique IDs that are ordered by generation time;
and the service is highly scalable and (potentially) highly available; each instance can generate many thousand IDs per second, and you can run multiple instances on your LAN/WAN;
written in Scala, runs on the JVM.
b) You could generate the unique IDs on the clients themselves, using an approach derived from how UUIDs and Snowflake's IDs are made (a rough sketch follows after this list). There are multiple options, but something along the lines of:
The most significant 40 or so bits: A timestamp; the generation time of the ID. (We're using the most significant bits for the timestamp to make IDs sort-able by generation time.)
The next 14 or so bits: A per-generator counter, which each generator increments by one for each new ID generated. This ensures that IDs generated at the same moment (same timestamps) do not overlap.
The last 10 or so bits: A unique value for each generator. Using this, we don't need to do any synchronization between generators (which is extremely hard), as all generators produce non-overlapping IDs because of this value.
c) You could generate the IDs on the clients, using just a timestamp and random value. This avoids the need to know all generators, and assign each generator a unique value. On the flip side, such IDs are not guaranteed to be globally unique, they're only very highly likely to be unique. (To collide, one or more generators would have to create the same random value at the exact same time.) Something along the lines of:
The most significant 32 bits: Timestamp, the generation time of the ID.
The least significant 32 bits: 32-bits of randomness, generated anew for each ID.
d) The easy way out, use UUIDs / GUIDs.
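To make option (b) concrete, here is a hedged sketch of packing a 40-bit timestamp, a 14-bit counter and a 10-bit generator id into one long (the bit widths are the "or so" values from the list above, purely illustrative):

// Option (b) sketch: [40-bit timestamp | 14-bit counter | 10-bit generator id].
class ClientIdGenerator {
    private final long generatorId;   // unique per generator, 0..1023
    private long lastTimestamp = -1;
    private long counter = 0;

    ClientIdGenerator(long generatorId) {
        this.generatorId = generatorId & 0x3FF;
    }

    synchronized long nextId() {
        long now = System.currentTimeMillis() & ((1L << 40) - 1);
        if (now == lastTimestamp) {
            counter = (counter + 1) & ((1L << 14) - 1);  // up to 16384 ids per millisecond
        } else {
            counter = 0;
            lastTimestamp = now;
        }
        return (now << 24) | (counter << 10) | generatorId;
    }
}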
You could have each node have a unique ID (which you may have anyway) and then prepend that to the sequence number.
For example, node 1 generates sequence 001-00001 001-00002 001-00003 etc. and node 5 generates 005-00001 005-00002
Unique :-)
Alternately if you want some sort of a centralized system, you could consider having your sequence server give out in blocks. This reduces the overhead significantly. For example, instead of requesting a new ID from the central server for each ID that must be assigned, you request IDs in blocks of 10,000 from the central server and then only have to do another network request when you run out.
Now there are more options.
Though this question is "old", I got here, so I think it might be useful to leave the options I know of (so far):
You could try Hazelcast. Its 1.9 release includes a distributed implementation of java.util.concurrent.AtomicLong.
You can also use Zookeeper. It provides methods for creating sequence nodes (appended to znode names, though I prefer using version numbers of the nodes). Be careful with this one though: if you don't want missed numbers in your sequence, it may not be what you want.
Cheers
It can be done with Redisson. It implements a distributed and scalable version of AtomicLong. Here is an example:
Config config = new Config();
config.addAddress("some.server.com:8291");
Redisson redisson = Redisson.create(config);
RAtomicLong atomicLong = redisson.getAtomicLong("anyAtomicLong");
atomicLong.incrementAndGet();
If it really has to be globally sequential, and not simply unique, then I would consider creating a single, simple service for dispensing these numbers.
Distributed systems rely on lots of little services interacting, and for this simple kind of task, do you really need or would you really benefit from some other complex, distributed solution?
There are a few strategies, but none that I know of can be truly distributed and give a real sequence:
Have a central number generator. It doesn't have to be a big database; memcached has a fast atomic counter, and in the vast majority of cases it's fast enough for your entire cluster.
Reserve a separate integer range for each node (like Steven Schlanskter's answer).
Use random numbers or UUIDs.
Use some piece of data together with the node's ID and hash it all (or HMAC it).
Personally, I'd lean towards UUIDs, or memcached if I want a mostly contiguous space.
Why not use a (thread safe) UUID generator?
I should probably expand on this.
UUIDs are guaranteed to be globally unique (if you avoid the ones based on random numbers, where the uniqueness is just highly probable).
Your "distributed" requirement is met, regardless of how many UUID generators you use, by the global uniqueness of each UUID.
Your "thread safe" requirement can be met by choosing "thread safe" UUID generators.
Your "sequence number" requirement is assumed to be met by the guaranteed global uniqueness of each UUID.
Note that many database sequence number implementations (e.g. Oracle) do not guarantee either monotonically increasing or (even) increasing sequence numbers (on a per-"connection" basis). This is because a consecutive batch of sequence numbers gets allocated in "cached" blocks on a per-connection basis. This guarantees global uniqueness and maintains adequate speed. But the sequence numbers actually allocated (over time) can be jumbled when they are being allocated by multiple connections!
Distributed ID generation can be achieved with Redis and Lua. The implementation is available on GitHub. It produces distributed and k-sortable unique ids.
I know this is an old question, but we were also facing the same need and were unable to find a solution that fulfilled it.
Our requirement was to get a unique sequence (0, 1, 2, 3 ... n) of ids, hence Snowflake did not help.
We created our own system to generate the ids using Redis. Redis is single-threaded, hence its list/queue mechanism always gives us one pop at a time.
What we do is create a buffer of ids. Initially, the queue has ids 0 to 20 ready to be dispatched when requested. Multiple clients can request an id, and Redis pops one id at a time. After every pop from the left, we insert BUFFER + currentId on the right, which keeps the buffer list going. Implementation here.
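A rough sketch of that buffered queue using the Jedis client (key name and buffer size are made up here; the linked implementation is the real one):

import redis.clients.jedis.Jedis;

// Sketch: pop the next id from a Redis list and push id + BUFFER back to keep it full.
class RedisIdQueue {
    private static final int BUFFER = 20;
    private final Jedis jedis = new Jedis("localhost");

    RedisIdQueue() {
        if (jedis.llen("id-queue") == 0) {          // seed the queue once with 0..BUFFER-1
            for (int i = 0; i < BUFFER; i++) {
                jedis.rpush("id-queue", String.valueOf(i));
            }
        }
    }

    long nextId() {
        long id = Long.parseLong(jedis.lpop("id-queue"));  // single-threaded Redis: one pop at a time
        jedis.rpush("id-queue", String.valueOf(id + BUFFER));
        return id;
    }
}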
I have written a simple service which can generate semi-unique, non-sequential 64-bit long numbers. It can be deployed on multiple machines for redundancy and scalability. It uses ZeroMQ for messaging. For more information on how it works, look at the GitHub page: zUID.
Using a database you can reach 1,000+ increments per second with a single core. It is pretty easy. You can use its own database as a backend to generate that number (as it should be its own aggregate, in DDD terms).
I had what seems a similar problem. I had several partitions and I wanted to get an offset counter for each one. I implemented something like this:
CREATE DATABASE example;
USE example;
CREATE TABLE offsets (`partition` INTEGER, `offset` BIGINT, PRIMARY KEY (`partition`));
INSERT INTO offsets VALUES (1, 0);
Then executed the following statement:
SELECT @offset := `offset` FROM offsets WHERE `partition` = 1 FOR UPDATE;
UPDATE offsets SET `offset` = @offset + 1 WHERE `partition` = 1;
If your application allows you, you can allocate a block at once (that was my case).
SELECT @offset := `offset` FROM offsets WHERE `partition` = 1 FOR UPDATE;
UPDATE offsets SET `offset` = @offset + 100 WHERE `partition` = 1;
If you need further throughput and cannot allocate offsets in advance, you can implement your own service using Flink for real-time processing. I was able to get around 100K increments per partition.
Hope it helps!
The problem is similar to one in the iSCSI world, where each LUN/volume has to be uniquely identifiable by the initiators running on the client side.
The iSCSI standard says that the first few bits have to represent the storage provider/manufacturer information, and the rest must be monotonically increasing.
Similarly, one can use the initial bits in a distributed system of nodes to represent the node ID and let the rest be monotonically increasing.
One decent solution is to use time-based generation of long values. It can be done with the backing of a distributed database.
My two cents for gcloud, using a storage file.
Implemented as a cloud function; it can easily be converted to a library.
https://github.com/zaky/sequential-counter

What are the various options and their tradeoffs for storing a UUID in a MYSQL table?

I'm planning on using client-provided UUIDs as the primary key in several tables in a MySQL database.
I've come across various mechanisms for storing UUIDs in a MySQL database, but nothing that compares them against each other. These include storage as:
BINARY(16)
CHAR(16)
CHAR(36)
VARCHAR(36)
2 x BIGINT
Are there any better options, how do the options compare against each other in terms of:
storage size?
query overhead? (index issues, joins etc.)
ease of inserting and updating values from client code? (typically Java via JPA)
Are there any differences based on which version of MySQL you're running, or the storage engine? We're currently running 5.1 and were planning on using InnoDB. I'd welcome any comments based on practical experience of trying to use UUIDs. Thanks.
I would go with storing it in a BINARY(16) column, if you are indeed set on using UUIDs at all. Something like 2 x BIGINT would be quite cumbersome to manage. Also, I've heard of people reversing them, because UUIDs generated on the same machine tend to share the same beginning while the differing parts are at the end, so if you reverse them your indexes will be more efficient.
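On the Java side, packing a java.util.UUID into the 16 bytes a BINARY(16) column expects can be sketched like this (no byte reversal shown):

import java.nio.ByteBuffer;
import java.util.UUID;

// Convert between java.util.UUID and the 16-byte form used with a BINARY(16) column.
static byte[] uuidToBytes(UUID uuid) {
    ByteBuffer buf = ByteBuffer.allocate(16);
    buf.putLong(uuid.getMostSignificantBits());
    buf.putLong(uuid.getLeastSignificantBits());
    return buf.array();
}

static UUID bytesToUuid(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    return new UUID(buf.getLong(), buf.getLong());
}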
Of course, my instinct says that you should be using auto-increment integers unless you have a really good reason for using UUIDs. One good reason is generating unique keys across different databases. Another is that you plan to have more records than an INT can store, although not many applications really need things like that. Not only is a lot of efficiency lost when you don't use integers for your keys, UUIDs are also harder to work with: they are too long to type in, and passing them around in your URLs makes the URLs really long. So, go with UUIDs if you need them, but try to stay away.
I have used UUIDs for smart-client online/offline storage and data synchronization, and for databases that I knew would have to be merged at some point. I have always used char(36) or char(32) (no dashes). You get a slight performance gain over varchar, and almost all databases support char. I have never tried binary or bigint. One thing to be aware of is that char will pad with spaces if you do not use all 36 or 32 characters. Point being, don't write a unit test that sets the ID of an object to "test" and then try to find it in the database. ;)
