How do I set a hibernate sequence manually in mysql? - java

I'm doing some data migration after some data-model refactoring: I'm taking a couple of tables with composite primary keys and combining them into a larger table with its own unique primary key. At this point, I've written some SQL to copy the old table data into the new table and assign a primary key using AUTO_INCREMENT; after the migration is done, I remove the AUTO_INCREMENT from the PK field.
So that's all gravy, but the problem is that I need the Hibernate sequence to know what the next available PK will be. We generally use the TABLE strategy for all of our entities, and I'd like to stay consistent and avoid using AUTO_INCREMENT and the IDENTITY strategy for future objects. I've gotten away with temporarily setting the respective row in the generated "hibernate_sequences" table to the max id of the newly created table, but this is just a band-aid fix. It also results in the next IDs being much larger than the max id. I'm certain this is because I don't understand the HiLo id-assigning mechanism, which is why I'm posting here.
Is there a way to set this up so that the ids will be sequential? Or, where is the code that generates the HiLo value, so that I can calculate what it should be to ensure sequential ids?

If I understood you correctly, the problem is that Hibernate doesn't generate sequential IDs for you. But that is how the hi/lo generator works, and I don't fully understand why you dislike it.
Basically, the hi/lo generator maintains HIGH and LOW values separately. When LOW reaches its limit, it is reset and HIGH is incremented. The resulting key is built by combining the HIGH and LOW values. For example, if the key is a double word and HIGH and LOW are each a word, HIGH can be the left two bytes and LOW the right two bytes.
Jumps in the IDs depend on two factors: the maximum value for LOW, and the event that triggers a change in the value of HIGH.
By default in Hibernate, the maximum value for LOW is Short.MAX_VALUE, and LOW is reset on each generator initialization. The HIGH value is read from the table and incremented on each initialization; it is also incremented when LOW reaches its upper limit. All this means that on each restart of your application, you will have gaps in the IDs.
Looking at the code, it seems that if you use a value < 1 for max_lo, the key will be generated just by incrementing the hi value, which is read from the DB. You would probably like that behaviour :)
Have a look at the source code for org.hibernate.id.MultipleHiLoPerTableGenerator#generate
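For reference, a hedged sketch of wiring that up with annotations. The entity and generator names below are made up, and the "max_lo" parameter name assumes Hibernate 3.x's MultipleHiLoPerTableGenerator; check the javadoc of your version before relying on it:
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import org.hibernate.annotations.GenericGenerator;
import org.hibernate.annotations.Parameter;

@Entity
public class MigratedThing {

    @Id
    @GeneratedValue(generator = "migrated_thing_gen")
    @GenericGenerator(
            name = "migrated_thing_gen",
            strategy = "org.hibernate.id.MultipleHiLoPerTableGenerator",
            // max_lo < 1 should make the generator hand out values straight
            // from the stored hi value, i.e. no jumps on application restart
            parameters = @Parameter(name = "max_lo", value = "0"))
    private Long id;
}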

Using org.hibernate.id.MultipleHiLoPerTableGenerator#generate, I figured out that my batches were of size 50, so for my purposes max id / 50 + 1 gave a usable number to put into the sequence table to keep the ids as close to sequential as possible.
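For anyone doing the same cleanup by hand, here is a rough JDBC sketch of that calculation. The hibernate_sequences table name comes from the question; the column names (sequence_name, sequence_next_hi_value) and the class below are assumptions based on a typical MultipleHiLoPerTableGenerator layout, so verify them against your own schema:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public final class HiValueFixer {

    // Sets the stored hi value to max(id) / 50 + 1, where 50 is the batch
    // size observed above for this generator setup.
    public static void resetHiValue(Connection conn, String table, String sequenceName)
            throws SQLException {
        long maxId;
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT MAX(id) FROM " + table)) {
            rs.next();
            maxId = rs.getLong(1);
        }
        long hi = maxId / 50 + 1;
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE hibernate_sequences SET sequence_next_hi_value = ? WHERE sequence_name = ?")) {
            ps.setLong(1, hi);
            ps.setString(2, sequenceName);
            ps.executeUpdate();
        }
    }
}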

Related

DynamoDB Scan Query and BatchGet

We have a DynamoDB table whose primary key consists of a Hash and a Range:
Hash = date.random_number
Range = timestamp
How can we get the items between timestamps X and Y? Since the hash key has a random_number attached, the query has to be fired once for each possible hash value. Is it possible to give multiple hash values with a single RangeKeyCondition?
What would be most efficient in terms of cost and time?
The random number ranges from 1 to 10.
If I understood correctly, you have a table with the following definition of Primary Keys:
Hash Key : date.random_number
Range Key : timestamp
One thing that you have to keep in mind is that, whether you are using GetItem or Query, you have to be able to calculate the Hash Key in your application in order to successfully retrieve one or more items from your table.
It makes sense to use the random numbers as part of your Hash Key so your records can be evenly distributed across the DynamoDB partitions, however, you have to do it in a way that your application can still calculate those numbers when you need to retrieve the records.
With that in mind, let's create the query needed for the specified requirements. The native AWS DynamoDB operations that you have available to obtain several items from your table are:
Query, BatchGetItem and Scan
In order to use BatchGetItem you would need to know beforehand the entire primary key (Hash Key and Range Key), which is not the case.
The Scan operation will literally go through every record of your table, something that in my opinion is unnecessary for your requirements.
Lastly, the Query operation allows you to retrieve one or more items from a table applying the EQ (equality) operator to the Hash Key and a number of other operators that you can use when you don't have the entire Range Key or would like to match more than one.
The operator options for the Range Key condition are: EQ | LE | LT | GE | GT | BEGINS_WITH | BETWEEN
It seems to me that the most suitable for your requirements is the BETWEEN operator, that being said, let's see how you could build the query with the chosen SDK:
// AWS SDK for Java (v1) Document API; dynamoDB is assumed to be an already
// initialized com.amazonaws.services.dynamodbv2.document.DynamoDB instance.
import com.amazonaws.services.dynamodbv2.document.ItemCollection;
import com.amazonaws.services.dynamodbv2.document.QueryOutcome;
import com.amazonaws.services.dynamodbv2.document.RangeKeyCondition;
import com.amazonaws.services.dynamodbv2.document.Table;

Table table = dynamoDB.getTable(tableName);
String hashKey = "<YOUR_COMPUTED_HASH_KEY>";
String timestampX = "<YOUR_TIMESTAMP_X_VALUE>";
String timestampY = "<YOUR_TIMESTAMP_Y_VALUE>";
RangeKeyCondition rangeKeyCondition =
        new RangeKeyCondition("RangeKeyAttributeName").between(timestampX, timestampY);
ItemCollection<QueryOutcome> items = table.query("HashKeyAttributeName", hashKey,
        rangeKeyCondition,
        null,  // FilterExpression - not used in this example
        null,  // ProjectionExpression - not used in this example
        null,  // ExpressionAttributeNames - not used in this example
        null); // ExpressionAttributeValues - not used in this example
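The returned ItemCollection can then be iterated directly to consume the matching items, for example:
for (com.amazonaws.services.dynamodbv2.document.Item item : items) {
    System.out.println(item.toJSON());
}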
You might want to look at the following post to get more information about DynamoDB Primary Keys:
DynamoDB: When to use what PK type?
QUESTION: My concern is querying multiple times because of the random_number attached to the hash key. Is there a way to combine these queries and hit DynamoDB once?
Your concern is completely understandable; however, the only way to fetch all the records via BatchGetItem is by knowing the entire primary key (HASH + RANGE) of every record you intend to get. Although minimizing the HTTP round trips to the server might seem to be the best solution at first sight, the documentation actually suggests doing exactly what you are doing to avoid hot partitions and uneven use of your provisioned throughput (a sketch of that fan-out-and-merge approach follows the quote below):
Design For Uniform Data Access Across Items In Your Tables
"Because you are randomizing the hash key, the writes to the table on
each day are spread evenly across all of the hash key values; this
will yield better parallelism and higher overall throughput. [...] To
read all of the items for a given day, you would still need to Query
each of the 2014-07-09.N keys (where N is 1 to 200), and your
application would need to merge all of the results. However, you will
avoid having a single "hot" hash key taking all of the workload."
Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
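Since the random suffix only runs from 1 to 10 in the question, the fan-out-and-merge can be a simple loop in the same Document API as above. This is a hedged sketch; the attribute names are still placeholders, and the hash key format is assumed to be the "date.random_number" string described in the question:
import java.util.ArrayList;
import java.util.List;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.ItemCollection;
import com.amazonaws.services.dynamodbv2.document.QueryOutcome;
import com.amazonaws.services.dynamodbv2.document.RangeKeyCondition;
import com.amazonaws.services.dynamodbv2.document.Table;

public final class FanOutQuery {

    // One Query per possible "date.N" hash key (N = 1..10), merged in the application.
    public static List<Item> itemsBetween(DynamoDB dynamoDB, String tableName,
                                          String date, String timestampX, String timestampY) {
        Table table = dynamoDB.getTable(tableName);
        RangeKeyCondition between =
                new RangeKeyCondition("RangeKeyAttributeName").between(timestampX, timestampY);
        List<Item> merged = new ArrayList<Item>();
        for (int n = 1; n <= 10; n++) {
            ItemCollection<QueryOutcome> partial =
                    table.query("HashKeyAttributeName", date + "." + n, between,
                            null, null, null, null);
            for (Item item : partial) {
                merged.add(item);
            }
        }
        return merged;
    }
}
The ten queries could also be issued from parallel threads to cut latency; the consumed read capacity is the same either way.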
Here is another interesting point that suggests moderating reads in a single partition... if you remove the random number from the hash key so you can get all records in one shot, you are likely to run into this issue, regardless of whether you are using Scan, Query or BatchGetItem:
Guidelines for Query and Scan - Avoid Sudden Bursts of Read Activity
"Note that it is not just the burst of capacity units the Scan uses
that is a problem. It is also because the scan is likely to consume
all of its capacity units from the same partition because the scan
requests read items that are next to each other on the partition. This
means that the request is hitting the same partition, causing all of
its capacity units to be consumed, and throttling other requests to
that partition. If the request to read data had been spread across
multiple partitions, then the operation would not have throttled a
specific partition."
And lastly, because you are working with time series data, it might be helpful to look into some best practices suggested by the documentation as well:
Understand Access Patterns for Time Series Data
For each table that you create, you specify the throughput
requirements. DynamoDB allocates and reserves resources to handle your
throughput requirements with sustained low latency. When you design
your application and tables, you should consider your application's
access pattern to make the most efficient use of your table's
resources.
Suppose you design a table to track customer behavior on your site,
such as URLs that they click. You might design the table with hash and
range type primary key with Customer ID as the hash attribute and
date/time as the range attribute. In this application, customer data
grows indefinitely over time; however, the applications might show
uneven access pattern across all the items in the table where the
latest customer data is more relevant and your application might
access the latest items more frequently and as time passes these items
are less accessed, eventually the older items are rarely accessed. If
this is a known access pattern, you could take it into consideration
when designing your table schema. Instead of storing all items in a
single table, you could use multiple tables to store these items. For
example, you could create tables to store monthly or weekly data. For
the table storing data from the latest month or week, where data
access rate is high, request higher throughput and for tables storing
older data, you could dial down the throughput and save on resources.
You can save on resources by storing "hot" items in one table with
higher throughput settings, and "cold" items in another table with
lower throughput settings. You can remove old items by simply deleting
the tables. You can optionally backup these tables to other storage
options such as Amazon Simple Storage Service (Amazon S3). Deleting an
entire table is significantly more efficient than removing items
one-by-one, which essentially doubles the write throughput as you do
as many delete operations as put operations.
Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html

PHP Generating unique id from database

In my project, I am generating a unique id by taking the largest integer value of the 'serial_key' attribute from a database table and adding 1 to it. This generates the unique key used to insert a new record.
But this mechanism fails when I deploy the application on multiple PCs on an intranet or the internet: different machines generate the same "unique" id at the same instant. I also have plenty of existing data on the server, so I have to keep the same id pattern, since it is constructed with a specific format. Please suggest how to resolve this problem. Thanks.
You can use the HiLo algorithm to generate unique keys. It can be tuned for performance or for continuous keys. If you implement it in all your clients (I guess Java and PHP), you would get unique keys across as many clients as you like (or as your database performance allows). You would also not be dependent on any particular database, and if you tune it for throughput you wouldn't need many additional database queries.
See this SO-Answer.
You can solve this by using an AUTO_INCREMENT attribute on your serial_key column. That way you don't have to worry about data collision. This is a common practice for primary keys.
Proper "allocator table" design for portable DB-based key allocation: to be used in preference to Scott Ambler's misguided "hi-lo" idea.
create table KEY_ALLOC (
    SEQ varchar(32) not null,
    NEXT bigint not null,
    primary key (SEQ)
);
To allocate the next, say, 20 keys (which are then held as a range in the server & used as needed):
select NEXT from KEY_ALLOC where SEQ=?;
update KEY_ALLOC set NEXT=(old value+20) where SEQ=? and NEXT=(old value);
Providing you can commit this transaction (use retries to handle contention), you have allocated 20 keys & can dispense them as needed.
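A hedged JDBC sketch of that allocation step, against the KEY_ALLOC table defined above. With auto-commit on, the conditional update alone makes the reservation safe; the loop is the retry-on-contention mentioned:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public final class KeyAllocator {

    private static final int CHUNK = 20;

    // Reserves CHUNK keys and returns the first one; the caller may hand out
    // keys first .. first + CHUNK - 1 from memory without touching the DB again.
    public static long allocateChunk(Connection conn, String seq) throws SQLException {
        while (true) {
            long next;
            try (PreparedStatement select = conn.prepareStatement(
                    "select NEXT from KEY_ALLOC where SEQ = ?")) {
                select.setString(1, seq);
                try (ResultSet rs = select.executeQuery()) {
                    if (!rs.next()) {
                        throw new IllegalStateException("unknown sequence: " + seq);
                    }
                    next = rs.getLong(1);
                }
            }
            try (PreparedStatement update = conn.prepareStatement(
                    "update KEY_ALLOC set NEXT = ? where SEQ = ? and NEXT = ?")) {
                update.setLong(1, next + CHUNK);
                update.setString(2, seq);
                update.setLong(3, next);
                if (update.executeUpdate() == 1) {
                    return next; // this block of keys is now ours
                }
            }
            // another client won the race; loop and try again
        }
    }
}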
This scheme is 10x faster than allocating from an Oracle sequence, and is 100% portable amongst all databases.
Unlike Scott Ambler's hi-lo idea, it treats the keyspace as a contiguous linear numberline -- from which it efficiently allocates small chunks of configurable size. These keys are human-friendly and large chunks are not wasted. Mr Ambler's idea allocates the high 16- or 32-bits, and requires either ugly composite keys or generating large human-unfriendly key values as the hi-words increment.
Comparison of allocated keys:
Linear_Chunk    Hi_Lo
100             65536
101             65537
102             65538
.. server restart
120             131072
121             131073
122             131074
.. server restart
140             196608
Guess which keys are easier to work with for a developer or DB admin.
I actually corresponded with him back in the 90's to suggest this improved scheme to him, but he was too stuck & obstinate to acknowledge the advantages of using a linear number-line.

How do I reuse Oracle sequence gaps in primary key column?

I used an Oracle sequence as the primary key of a table, and used int in the Java application that maps this primary key. Now I've found that my customer has reached the maximum int value in the table; the sequence can keep increasing, but a Java int is no longer able to store it. I don't want to change the Java code from int to long because of the very big cost. I then found that the customer's DB has many big gaps in the ID column. Is there any way I can reuse these missing id numbers?
If I could do this at the DB level, something like re-orging the sequence to add these missing numbers back to it, then I could use the gaps with no Java code change. That would be great.
I will write a function to find the gap ranges. After having these numbers, I'd like to assign them to a pool of sequence values, so that from now on the sequence does not auto-increment but instead hands out the numbers I assigned. In the Java code I can keep calling findNextNumber on the sequence, but the sequence would return the values I assigned to it. That seems impossible, right? Any alternative?
Do you mean, will the sequence ever return a value that is in a "gap" range? I don't think so, unless you drop/re-create it for some reason. I guess you could write a function of some sorts to find the PK gaps in your table, then save those gap ranges to another table, and "roll" your own sequence function using the gap table. Very ugly. Trying to "recover" these gaps just sounds like a desperate attempt to avoid the unavoidable - your java PK data type should have aligned with the DB data type. I had the same problem a long time ago with a VB app that had a class key defined as 16-bit integer, and the sequence exceeded 32K, had to change the variables to a Long. I say, bite the bullet, and make the conversion. A little pain now, will save you a lot of ongoing pain later. Just my opinion.
I would definitely make the change to be able to use longer numbers, but in the meantime you might manage until you can make that change by using a sequence that generates negative numbers. There'd be a performance impact on the maintenance of the PK index, and it would grow disproportionately quickly, though.

Ways to generate roughly sequential Ids (32 bits & 64 bits sizes) in a distributed application

What are good ways to generate unique, "roughly" sequential Ids (32-bit and 64-bit sizes) in a distributed application?
[Side note: my DB does not provide this facility.] Also, I do not want to use 128-bit UUIDs.
EDIT: Thanks all for your response!
As some of you suggest using a MySQL db like Flickr's ticket servers: my worry is that using multiple servers (in order to eliminate a single point of failure) may disturb the sequential nature of the generated Ids, since some servers may lag behind the others. I am OK with a lag of a few Id sequences, but I cannot afford huge disturbances in sequentiality.
Building on #Chris' idea of having a table to generate the "next" identity. A more reliable way of doing this is to have two servers and a round-robin load-balancer. Server 1 distributes odd numbers, Server 2 distributes even numbers. If you lose one server, you get all odds or all evens, but the show still goes on.
Flickr uses something very similar for their ID system
http://code.flickr.com/blog/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/
Then creatively use MySQL's atomic REPLACE syntax as follows:
CREATE TABLE `Tickets64` (
    `id` bigint(20) unsigned NOT NULL auto_increment,
    `stub` char(1) NOT NULL default '',
    PRIMARY KEY (`id`),
    UNIQUE KEY `stub` (`stub`)
) ENGINE=MyISAM
REPLACE INTO Tickets64 (stub) VALUES ('a');
SELECT LAST_INSERT_ID();
Where the stub value is sufficient to generate the next identity in an atomic fashion.
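Driven from Java, that pair of statements could look roughly like the sketch below. LAST_INSERT_ID() is per-connection in MySQL, so both statements must run on the same connection; the class and method names are illustrative:
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public final class TicketServer64 {

    // Returns the next 64-bit id from the Tickets64 table defined above.
    public static long nextId(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.executeUpdate("REPLACE INTO Tickets64 (stub) VALUES ('a')");
            try (ResultSet rs = st.executeQuery("SELECT LAST_INSERT_ID()")) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }
}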
Update with the OP's chronology and sequence requirements
With chronology as a driver, your choices change a little: you need an atomic state in a SPOF (Chris' idea in SQL, for example). This will be a bottleneck, and its state must be durable to prevent duplicate IDs being issued.
To achieve chronology at scale with high-availability in a distributed system, causal sequence algorithms are pretty much the only way to go - there are a few out there:
Lamport timestamps
Vector Clocks
Dependency Sequences
Hierarchical Clocks
The calling pattern is quite different from the SPOF strategy: these algorithms require you to track and pass a memento of sequence, timestamp or version - effectively session information for the sequence you are generating. They do, however, guarantee causal order for any given sequence or item, even in a distributed system. In most cases an event PK would be the compound of item identifier + causal sequence id.
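As a concrete flavour of the simplest of those, here is a minimal Lamport-timestamp sketch (the textbook rule only, not any particular library's API): increment on every local event, and take max(local, received) + 1 when a stamped message arrives.
import java.util.concurrent.atomic.AtomicLong;

// Minimal Lamport clock: it guarantees causal ordering of events, not the
// dense sequential ids a single allocator would give; ties between nodes are
// usually broken by appending a node id to the timestamp.
public final class LamportClock {

    private final AtomicLong time = new AtomicLong();

    // Call for every local event; returns the timestamp to attach to it.
    public long tick() {
        return time.incrementAndGet();
    }

    // Call when a message stamped with 'received' arrives from another node.
    public long onReceive(long received) {
        return time.updateAndGet(local -> Math.max(local, received) + 1);
    }
}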
Since you have a database, you can create a table with the next value in it. Acquiring this value requires you to select the value and then update the row to the new next value, with a condition that the row still holds the old value. If the update fails to affect any rows, then some other process or thread acquired the value before you, and you need to retry your attempt to get the next value.
The pseudo code for this would look something like this
do
    select nextvalue into MyValue from SequenceTable
    update SequenceTable set nextvalue = nextvalue + 1 where nextvalue = MyValue
while rowsaffected = 0
In the above, MyValue is the variable that will hold the nextvalue from the database table SequenceTable, and rowsaffected indicates how many rows were affected by the last SQL statement, which in this case is the update statement.

How large can an appengine task payload be?

I'm using the new experimental taskqueue for Java App Engine and I'm trying to create tasks that aggregate statistics in my datastore. I'm trying to count the number of UNIQUE values within all the entities (of a certain type) in my datastore. More concretely, say an entity of type X has a field A. I want to count the NUMBER of unique values of A in my datastore.
My current approach is to create a task which queries for the first 10 entities of type X, creating a hashtable to store the unique values of A in, then passing this hashtable to the next task as the payload. This next task will count the next 10 entities and so on and so forth until I've gone through all the entities. During the execution of the last task, I'll count the number of keys in my hashtable (that's been passed from task to task all along) to find the total number of unique values of A.
This works for a small number of entities in my datastore. But I'm worried that this hashtable will get too big once I have a lot of unique values. What is the maximum allowable size for the payload of an App Engine task?
Can you suggest any alternative approaches?
Thanks.
According to the docs, the maximum task object size is 100K.
"Can you suggest any alternative approaches?".
Create an entity for each unique value, by constructing a key based on the value and using Model.get_or_insert. Then Query.count up the entities in batches of 1000 (or however many you can count before your request times out - more than 10), using the normal paging tricks.
Or use code similar to that given in the docs for get_or_insert to keep count as you go - App Engine transactions can be run more than once, so a memcached count incremented in the transaction would be unreliable. There may be some trick around that, though, or you could keep the count in the datastore provided that you aren't doing anything too unpleasant with entity parents.
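For a Java App Engine app (as in the question), here is a hedged sketch of the entity-per-unique-value idea above using the low-level datastore API. The kind name "UniqueA" and the class are illustrative, and Model.get_or_insert mentioned above is the Python shorthand for the same get-then-put-in-a-transaction pattern:
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Transaction;

public final class UniqueValueMarker {

    private static final DatastoreService DS = DatastoreServiceFactory.getDatastoreService();

    // Records that a value of field A has been seen. Because the value itself
    // is the key name, writing the same value twice collapses into one entity,
    // so counting "UniqueA" entities later gives the number of distinct values.
    public static void markSeen(String value) {
        Key key = KeyFactory.createKey("UniqueA", value);
        Transaction txn = DS.beginTransaction();
        try {
            try {
                DS.get(txn, key); // already recorded; nothing to do
            } catch (EntityNotFoundException e) {
                DS.put(txn, new Entity(key));
            }
            txn.commit();
        } finally {
            if (txn.isActive()) {
                txn.rollback();
            }
        }
    }
}
Counting the distinct values then becomes a keys-only query over the "UniqueA" kind, paged in whatever batch size fits within the request deadline.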
This may be too late, but perhaps it can be of use. First, any time you have even a remote chance of wanting to walk serially through a set of entities, I suggest using either a date_created or date_modified auto-update field which is indexed. From this point you can create a model with a TextProperty to store your hash table using json.dumps(). All you need to do is pass the last date processed and the model id of the hash-table entity. Do a query with date_created later than the last date, json.loads() the TextProperty, and accumulate the next 10 records. This could get a bit more sophisticated (e.g. handle date_created collisions by utilizing the parameters passed and a slightly different query approach). Add a 1-second countdown to the next task to avoid any issues with updating the hash-table entity too quickly. HTH, -stevep
