JDO - Persisting two entities with same key - java

I'm working on an AppEngine project and I'm using JDO on top of the AppEngine datastore for persistence. I have an entity that uses an encoded string as the key and also uses an application generated keyname (also a string). I did this because my app would frequently scoop data (potentially scooping the same thing) from the wild and attempt to persist them. In an attempt to avoid persisting several entities which essentially contain the same data, I decided to hash some properties about these data so as to get a consistent keyname (not manipulating keys directly because of entity relationships).
The problem now is that whenever I calculate my hash (keyname) and attempt to store the entity, if it already exists in the datastore, the datastore (or JDO or whoever the culprit is) silently overwrites the properties of the entity in the datastore without raising any exception. This has serious effects on the app because it overrides the timeStamps (a field) of the entities (which we use for ordering).
How best can I get around this?

You need to do get-before-set (Check and set or CAS).
CAS is a fundamental tenet of concurrency, and it's a necessary evil of parallel computing.
Gets are much cheaper than sets anyway, so it may actually save you money.
Instead of blind writing to datastore, first retrieve; if the entity doesn't exist, catch the exception and just put the entity. If it does exist, do a deep compare before you save. If nothing has changed, don't persist it (and save that cost). If it has changed, choose your merge strategy however you please. One (slightly ugly) way to maintain dated revisions is to store the previous entity as a field in the updated entity (may not work for many revisions).
But in this case you have to get before set. If you don't expect many duplicates and want to be really chintzy, you can do an exists query first: a keys-only count query on the key you want to use (costs 7x less than a full get). If count() == 0, put(); otherwise getAndMaybePut().
The count query syntax might look slow, but from my benchmarks, it's the fastest (and cheapest) possible way to tell if an entity exists:
public boolean exists(Key key) {
    Query q;
    if (key.getParent() == null) {
        q = new Query(key.getKind());
    } else {
        q = new Query(key.getKind(), key.getParent());
    }
    q.setKeysOnly();
    q.setFilter(new FilterPredicate(
            Entity.KEY_RESERVED_PROPERTY, FilterOperator.EQUAL, key));
    return 1 == DatastoreServiceFactory.getDatastoreService().prepare(q)
            .countEntities(FetchOptions.Builder.withLimit(1));
}
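The get-before-set flow described above can be sketched without any App Engine classes. This is only an illustration: `Scraped`, `GetBeforeSetStore` and the in-memory map are hypothetical stand-ins for the real entity and the datastore, and the merge strategy (keep the original timestamp) is just one possible choice. The get and the put here are two separate steps, so in a real application they would still need to run inside a transaction.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical value type: "payload" is the scraped data, "timestamp" the ordering field.
final class Scraped {
    final String payload;
    final long timestamp;

    Scraped(String payload, long timestamp) {
        this.payload = payload;
        this.timestamp = timestamp;
    }
}

// In-memory stand-in for the datastore, keyed by the hashed key name.
final class GetBeforeSetStore {
    private final Map<String, Scraped> rows = new ConcurrentHashMap<>();

    /** Returns true if something was written, false if the write was skipped. */
    boolean save(String keyName, Scraped incoming) {
        Scraped existing = rows.get(keyName);        // get before set
        if (existing == null) {
            rows.put(keyName, incoming);             // first sighting: plain put
            return true;
        }
        if (Objects.equals(existing.payload, incoming.payload)) {
            return false;                            // unchanged: skip the write
        }
        // Changed: the merge strategy chosen here keeps the original timestamp.
        rows.put(keyName, new Scraped(incoming.payload, existing.timestamp));
        return true;
    }

    Scraped get(String keyName) {
        return rows.get(keyName);
    }
}
```

Saving the same key twice with identical data leaves the stored timestamp untouched, which is the behaviour the question asks for.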

You must do a get() to see if an entity with the same key exists before you put() the new entity. There is no way around doing this.
You can use memcache and local "in-memory" caching to speed up your get() operation. This may only help if you are likely to read the same information multiple times. If not, the memcache query may actually slow down your process.
To ensure that two requests do not overwrite each other you should use a transaction (not possible with a query as suggested by Ajax unless you put all items in a single entity group which may limit your updates to 1 per second)
In pseudo code:
1. Create Key from hashing data
2. Check in-memory cache for key (use a concurrent set of keys, e.g. ConcurrentHashMap.newKeySet()), return if found
3. Check MemcacheService for key, return if found
4. Start transaction
5. Get entity from datastore, return if found
6. Create entity in datastore
7. Commit transaction, return if fails due to concurrent update
8. Put Key in cache (in-memory and memcache)
Step 7 will fail if another request (thread) has already written the same key at the same time.
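A rough, self-contained sketch of the flow above, using only the JDK. The class name `CreateOnce` is hypothetical, the memcache lookup is left out, and a `ConcurrentHashMap` stands in for the datastore: its `putIfAbsent` plays the role of the transactional "get, then create if missing" part, so this shows the shape of the logic rather than real datastore code.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class CreateOnce {
    // Local in-memory cache of keys that are known to exist.
    private final Set<String> seenKeys = ConcurrentHashMap.newKeySet();
    // Stand-in for the datastore (assumption: the real code would use a transaction here).
    private final Map<String, String> datastore = new ConcurrentHashMap<>();

    /** Returns true only for the caller that actually created the entity. */
    boolean createIfAbsent(String key, String data) {
        if (seenKeys.contains(key)) {
            return false;                    // cache hit: already stored, nothing to do
        }
        // Atomic "get, then create if missing"; a concurrent writer makes this return non-null.
        boolean created = datastore.putIfAbsent(key, data) == null;
        seenKeys.add(key);                   // remember the key whether we won or lost the race
        return created;
    }

    String read(String key) {
        return datastore.get(key);
    }
}
```

Only the first writer for a given key gets `true`; later or concurrent callers leave the stored data (and therefore its timestamp) unchanged.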

What I suggest is that, instead of saving the ID as a string, you either use a Long ID for your entity or use the Key datatype, which is auto-generated by App Engine.
@PersistenceCapable
public class Test {
    @PrimaryKey
    @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
    private Long ID;
    // getter and setter
}
This will return a unique value to you every time.

Related

Guaranteed FIFO using JPA (Hibernate implementation) with MySQL

I need to persist a queue of tasks in MySQL. When reading them from DB I have to make sure the order is exactly the same as they have been persisted.
In general I prefer to have the solution DB agnostic (i.e. pure JPA) but adding some flavor of Hibernate and/or MySQL is acceptable as well.
My (probably naive) first version looks like:
em.createNamedQuery("MyQuery", MyTask.class).setFirstResult(0).setMaxResults(count).getResultList();
Where MyQuery doesn't have any "order by" clause i.e. it looks like:
SELECT t FROM MyTasks t
Would such approach guarantee that the incoming results/entities are ordered in the way they have been persisted? What if I enable caching as well?
I was also thinking of adding an extra field to the task entity which is a timestamp in milliseconds (UTC from 1970-01-01) and then order by it in the query but then I might be in a situation where two tasks get generated immediately one after the other and they have the same timestamp.
Any solutions/ideas are welcome!
EDIT:
I just realised that auto increment (at least in MySQL) would throw an exception once it reaches its max value and no more inserts would be possible. This means I shouldn't worry about having the counter reset by the DB and I could explicitly order by an "auto increment" column in my query. Of course I would have another problem to deal with, i.e. what to do in case the volume is so high that the largest possible unsigned integer type in MySQL is not big enough, but this problem is not necessarily coupled with the problem I am dealing with right now.
Focusing on a pure JPA solution: since the entity MyTasks must have a primary key, I suggest using a sequence generator for that key and sorting the query results with an ORDER BY clause on it.
For example:
@Entity
class MyTask {
    @Id @GeneratedValue(strategy=GenerationType.SEQUENCE)
    private Long id;
    // other fields, getters and setters
}
You can also tie it more closely to your database by using @SequenceGenerator to specify a generator defined in the database.
Edit: Did you take a look at the @PrePersist option for setting the timestamp? Maybe you can combine the timestamp field with the sequence-generated id and order by both, in that order, so that timestamp conflicts are resolved by comparing ids (which are unique).
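To make the "timestamp first, id as tie-breaker" ordering concrete, here is a small plain-Java sketch. `TaskRow`, `TaskOrdering` and the field names are hypothetical; in a JPQL query the same idea would presumably be expressed as an ORDER BY on the timestamp field followed by the id field.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical plain-Java mirror of the task entity: generated id plus creation time.
final class TaskRow {
    final long id;
    final long createdAtMillis;

    TaskRow(long id, long createdAtMillis) {
        this.id = id;
        this.createdAtMillis = createdAtMillis;
    }
}

final class TaskOrdering {
    // Timestamp first, id as tie-breaker; ids are unique, so the resulting order is total.
    static final Comparator<TaskRow> FIFO =
            Comparator.comparingLong((TaskRow t) -> t.createdAtMillis)
                      .thenComparingLong(t -> t.id);

    /** Returns a sorted copy and leaves the input list untouched. */
    static List<TaskRow> inFifoOrder(List<TaskRow> rows) {
        List<TaskRow> sorted = new ArrayList<>(rows);
        sorted.sort(FIFO);
        return sorted;
    }
}
```

Two tasks created in the same millisecond are then ordered by their ids instead of arbitrarily.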
Most RDBMS's will store in the order of insertion and given no other instruction will order results that way too. If you don't want to leave it to chance, you have a couple of options.
1) You can generate a reasonably unique ID by using a timestamp and an incrementing fixed-length number,
OR
2) You can just define your table with an autonumbered primary key (which is probably easier).
If the table has a primary key to order by, then by default, most RDBMS's will return things in ascending primary key order... or you can enforce it explicitly in your query.
Neither JPA (with or without a cache) nor the RDBMS guarantees that rows come back in the order they were persisted unless you use an ORDER BY clause. To solve this, add an integer primary key to the entity and order by it when fetching the data, as the other answerers mentioned.

JPA insert transaction concurrency

I have more of a theoretical question:
When does data get inserted into the database: after persist or after commit is called? I ask because I have a problem with unique keys (manually generated): they get duplicated. I think this is due to multiple users inserting data simultaneously into the same table.
UPDATE 1:
I generate keys in my application. Keys example: '123456789123','123456789124','123456789125'...
The key field is of type varchar, because there are a lot of old keys (I can't delete or change them) like 'VP123456','VP15S3456'. Another problem is that after inserting them into one database, these keys have to be inserted into another database. And I don't know what DB sequences and atomic objects are.
UPDATE 2:
These keys are used in finance documents and not as database keys. So they must be unique, but they are not used anywhere in programming as object keys.
I would suggest you create a Singleton that takes care of generating your keys. Make sure you can only get a new id once the singleton has initialized with the latest value from the database.
To safeguard you from incomplete inserts into the two databases I would suggest you try to use XA transactions. This will allow you to have all-or-nothing inserts and updates. So if any of the operations on any of the databases fails, everything will be rolled back. Of course there is a downside of XA transactions; they are quite slow and not all databases and database drivers support it.
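A minimal sketch of the singleton idea, assuming the keys are purely numeric like the examples in the question. `DocumentKeyGenerator` and its method names are hypothetical, and the value passed to `initialize` is assumed to be the highest key already stored in the database. Note that this only guarantees uniqueness within a single JVM.

```java
import java.util.concurrent.atomic.AtomicLong;

final class DocumentKeyGenerator {
    private static final DocumentKeyGenerator INSTANCE = new DocumentKeyGenerator();

    private final AtomicLong counter = new AtomicLong();
    private volatile boolean initialized;

    private DocumentKeyGenerator() {}

    static DocumentKeyGenerator getInstance() {
        return INSTANCE;
    }

    /** Call once at startup with the highest numeric key already in the database. */
    synchronized void initialize(long lastUsedKey) {
        if (initialized) {
            throw new IllegalStateException("already initialized");
        }
        counter.set(lastUsedKey);
        initialized = true;
    }

    /** Hands out the next key; refuses to work until initialize() has been called. */
    String nextKey() {
        if (!initialized) {
            throw new IllegalStateException("not initialized from the database yet");
        }
        return Long.toString(counter.incrementAndGet());
    }
}
```

Because `incrementAndGet` is atomic, concurrent callers never receive the same key, although keys may be skipped if an insert later fails.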
How do you generate these keys? Have you tried using sequences in DB or atomic objects?
I'm asking because it is normal to populate DB concurrently.
EDIT1:
You can write a method that returns new keys based on atomic counter, this way you'll know that anytime you request a new key you receive a unique key. This strategy may and will lead to some keys being discarded but it is a small price to pay, unless it is a requirement that keys in the database are sequential.
private AtomicLong counter; // initialized somewhere else.

public String getKey() {
    return "VP" + counter.incrementAndGet();
}
And here's some help on DB Sequences in Oracle, MySql, etc.

Checking if Entity exists in google app engine datastore.

What is the best/fastest way to check if an Entity exists in a google-app-engine datastore? For now I'm trying to get the entity by key and checking if the get() returns an error.
I don't know the process of getting an Entity on the datastore. Is there a faster way for doing only this check?
What you proposed would indeed be the fastest way to know if your entity exists. The only thing slowing you down is the time it takes to fetch and deserialize your entity. If your entity is large, this can slow you down.
If this action (checking for existence) is a major bottleneck for you and you have large entities, you may want to roll your own system of checking by using two entities: first you would have your existing entity with data, and a second entity that either stores the reference to the real entity, or perhaps an empty entity where the key is just a variation on the original entity key that you can compute. You can check for existence quickly using the second entity, and then fetch the first entity only if the data is necessary.
The better way, I think, would just be to design your keys such that you know there would not be duplicates, or so that your operations are idempotent, so that even if an old entity was overwritten, it wouldn't matter.
com.google.appengine.api has been deprecated in favor of the App Engine GCS client.
Have you considered using a query? Guess-and-check is not a scalable way to find out if an entity exists in a data store. A query can be created to retrieve entities from the datastore that meet a specified set of conditions:
https://developers.google.com/appengine/docs/java/datastore/queries
EDIT:
What about the key-only query? Key-only queries run faster than queries that return complete entities. To return only the keys, use the Query.setKeysOnly() method.
new Query("Kind").addFilter(Entity.KEY_RESERVED_PROPERTY, FilterOperator.EQUAL, key).setKeysOnly();
Source: [1]: http://groups.google.com/group/google-appengine-java/browse_thread/thread/b1d1bb69f0635d46/0e2ba938fad3a543?pli=1
You could fetch using a List<Key> containing only one Key, that method returns a Map<Key, Entity> which you can check if it contains an actual value or null, for example:
Entity e = datastoreService.get(Arrays.asList(key)).get(key);
In general though I think it'd be easier to wrap the get() in a try/catch that returns null if the EntityNotFoundException is caught.
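The try/catch wrapper mentioned above might look roughly like this. Everything here is a stand-in so that the snippet runs on its own: `NotFoundException` plays the role of `EntityNotFoundException`, and `ThrowingStore` is a hypothetical in-memory store whose `get` throws for a missing key, as the datastore's single-key `get` does.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical checked exception, playing the role of EntityNotFoundException.
final class NotFoundException extends Exception {
    NotFoundException(String key) {
        super("no entity for key " + key);
    }
}

// Hypothetical store whose get() throws when the key is missing.
final class ThrowingStore {
    private final Map<String, String> rows = new HashMap<>();

    void put(String key, String value) {
        rows.put(key, value);
    }

    String get(String key) throws NotFoundException {
        String value = rows.get(key);
        if (value == null) {
            throw new NotFoundException(key);
        }
        return value;
    }

    /** Returns the value, or null when the key does not exist. */
    String getOrNull(String key) {
        try {
            return get(key);
        } catch (NotFoundException e) {
            return null;   // a missing entity is an expected outcome, not an error
        }
    }

    boolean exists(String key) {
        return getOrNull(key) != null;
    }
}
```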

using objectify how to get a subset of properties for an object

I have a large object that I store using objectify. I need a list of those objects with only subset of the properties populated. How can this be done?
App Engine stores and retrieves entities as encoded Protocol Buffers. There's no way for the underlying infrastructure to store, update, or retrieve only part of an entity, so there's no point having a library that does this; hence Objectify, like other libraries, doesn't. If you regularly need to access only part of an entity, split those fields into a separate entity.
It's not a good idea to split an entity in two in a noSql database: when you need to read a list of entries, you would be obliged to do n requests to get the second part of the list (n x m if your data is split in more entities). This is naturally due to the fact that there is no possible join in noSql databases.
What could be done is to "cache": duplicate the needed subset in another entity to get the most of performance. It has the disadvantage of being obliged to write twice on a persist of the main entity (if a field of the subset was changed).
What I usually do is write a /** OPTIMIZE xxxx */ comment on the class that needs to read a subset and get back to it when I need more performance.

Hibernate + "ON DUPLICATE KEY" logic

I am looking for a way to save or update records according to the table's unique key (which is composed of several columns).
I want to achieve the same functionality used by INSERT ... ON DUPLICATE KEY UPDATE - meaning to blindly save a record, and have the DB/Hibernate insert a new one, or update the existing one if the unique key already exists.
I know I can use @SQLInsert( sql="INSERT INTO .. ON DUPLICATE KEY UPDATE"), but I was hoping not to write my own SQLs and let Hibernate do the job. (I am assuming it will do a better job - otherwise why use Hibernate?)
Hibernate may throw a ConstraintViolationException when you attempt to insert a row that breaks a constraint (including a unique constraint). If you don't get that exception, you may get some other general Hibernate exception - it depends on the version of Hibernate and the ability of Hibernate to map the MySQL exception to a Hibernate exception in the version and type of database you are using (I haven't tested it on everything).
You will only get the exception after calling flush(), so you should make sure this is also in your try-catch block.
I would be careful of implementing solutions where you check that the row exists first. If multiple sessions are updating the table concurrently you could get a race condition. Two processes read the row at nearly-the-same time to see if it exists; they both detect that it is not there, and then they both try to create a new row. One will fail depending on who wins the race.
A better solution is to attempt the insert first and if it fails, assume it was there already. However, once you have an exception you will have to roll back, so that will limit how you can use this approach.
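The "attempt the insert first, fall back if it already exists" approach can be illustrated with plain Java. This is a sketch with stand-ins only: `UniqueTable` mimics a table with a unique key, and `DuplicateKeyException` is a hypothetical exception in place of Hibernate's ConstraintViolationException. With a real Hibernate session, the catch branch would also have to roll back and continue in a fresh transaction, which this in-memory version does not need.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical unchecked exception standing in for a unique-constraint violation.
final class DuplicateKeyException extends RuntimeException {
    DuplicateKeyException(String key) {
        super("duplicate key " + key);
    }
}

// In-memory stand-in for a table with a unique key.
final class UniqueTable {
    private final Map<String, String> rows = new ConcurrentHashMap<>();

    void insert(String key, String value) {
        // putIfAbsent is atomic, so two racing inserts cannot both succeed.
        if (rows.putIfAbsent(key, value) != null) {
            throw new DuplicateKeyException(key);
        }
    }

    void update(String key, String value) {
        rows.put(key, value);
    }

    String select(String key) {
        return rows.get(key);
    }
}

final class Upsert {
    /** Returns true if a new row was inserted, false if an existing row was updated. */
    static boolean insertOrUpdate(UniqueTable table, String key, String value) {
        try {
            table.insert(key, value);      // optimistic path: assume the row is new
            return true;
        } catch (DuplicateKeyException e) {
            table.update(key, value);      // the row was already there: update instead
            return false;
        }
    }
}
```

Unlike a check-then-insert sequence, there is no window in which two sessions can both conclude that the row is missing.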
This doesn't really sound like a clean approach to me. It would be better to first see if an entity with given key(s) exists. If so, update it and save it, if not create a new one.
EDIT
Or maybe consider if merge() is what you're looking for:
if there is a persistent instance with the same identifier currently associated with the session, copy the state of the given object onto the persistent instance
if there is no persistent instance currently associated with the session, try to load it from the database, or create a new persistent instance
the persistent instance is returned
the given instance does not become associated with the session, it remains detached
Source: http://docs.jboss.org/hibernate/core/3.3/reference/en/html/objectstate.html
You could use saveOrUpdate() from Session class.
