App Engine JDO Transaction on multiple many-to-one - java

I have a simple domain model as follows
Driver - key(string), run-count, unique-track-count
Track - key(string), run-count, unique-driver-count, best-time
Run - key(?), driver-key, track-key, time, boolean-driver-update, boolean-track-updated
I need to be able to update a Run and a Driver in the same transaction, as well as a Run and a Track in the same transaction (obviously to make sure I don't update the statistics twice, or miss an increment).
Now I have tried assigning, as the Run key, a key made up of driver-key/track-key/run-key (string).
This will let me update in one transaction the Run entity and the Driver entity.
But if I try updating the Run and Track entities together, it complains that it cannot transact over multiple entity groups. It says that it has both the Driver and the Track in the transaction and it can't operate on both...
tx.begin();
// getObjectById is on the PersistenceManager (pm), not on the PersistenceManagerFactory
run = pm.getObjectById(Run.class, runKey);
track = pm.getObjectById(Track.class, trackKey);
// This is where it fails: the Run (child of a Driver key) and the Track are in different entity groups
incrementCounters();
updateUpdatedFlags();
tx.commit();
Strangely enough when I do a similar thing to update Run and Driver it works fine.
Any suggestions on how else I can map my domain model to achieve the same functionality?

With Google App Engine, all of the datastore operations in a transaction must be on entities in the same entity group. This is because your data is spread across many machines, and Google App Engine cannot do a transaction across entities stored on different machines.
Entities with owned one-to-one and one-to-many relationships are automatically in the same entity group. So if an entity contains a reference to another entity, or a collection of entities, you can read or write to both in the same transaction. For entities that don't have an owner relationship, you can create an entity with an explicit entity group parent.
You could put all of the objects in the same entity group, but you might get some contention if too many users are trying to modify objects in an entity group at the same time. If every object is in its own entity group, you can't do any meaningful transactions. You want to do something in between.
One solution is to have Track and Run in the same entity group. You could do this by having Track contain a List of Runs (if you do this, then Track might not need run-count, unique-driver-count and best-time; they could be computed when needed). If you do not want Track to have a List of Runs, you can use an unowned one-to-many relationship and specify that the entity group parent of each Run is its Track (see "Creating Entities With Entity Groups" on this page). Either way, if a Run is in the same entity group as its Track, you can do transactions that involve a Track and some or all of its Runs.
For many large systems, instead of using transactions for consistency, changes are made with operations that are idempotent. For instance, if Driver and Run were not in the same entity group, you could update the run-count for a Driver by first doing a query to get the count of all its runs before some date in the past and then, in a transaction, updating the Driver with the new count and the date when it was last computed.
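For illustration, a minimal sketch with the low-level datastore KeyFactory (the ids mirror the question's examples) of giving a Run its Track as entity group parent:
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
// Give each Run its Track as the entity group parent when the Run key is built.
Key trackKey = KeyFactory.createKey("Track", "Monza");
Key runKey   = KeyFactory.createKey(trackKey, "Run", "Michael-Monza-20101010");
// A transaction that loads this Run and its parent Track now touches only one entity group.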
Keep in mind when using dates that machines can have some clock drift, which is why I suggested using a date in the past.
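A rough sketch of that idea with the low-level datastore API; the "createdAt" property and the counter property names are assumptions, not part of the original model:
import java.util.Date;
import com.google.appengine.api.datastore.*;
void recomputeDriverRunCount(Key driverKey) {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    // Count this driver's runs created before a cutoff slightly in the past
    // ("createdAt" is an assumed property the Run entity would need to carry).
    Date cutoff = new Date(System.currentTimeMillis() - 60 * 1000L);
    Query q = new Query("Run");
    q.addFilter("driverKey", Query.FilterOperator.EQUAL, driverKey);
    q.addFilter("createdAt", Query.FilterOperator.LESS_THAN, cutoff);
    int count = ds.prepare(q).countEntities(FetchOptions.Builder.withDefaults());
    // Then, in a transaction touching only the Driver, record the count and the cutoff.
    Transaction txn = ds.beginTransaction();
    try {
        Entity driver = ds.get(txn, driverKey);
        driver.setProperty("runCount", count);
        driver.setProperty("runCountAsOf", cutoff);
        ds.put(txn, driver);
        txn.commit();
    } catch (EntityNotFoundException e) {
        // no such driver; nothing to update
    } finally {
        if (txn.isActive()) txn.rollback();
    }
}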

I think I found a lateral but still clean solution which still makes sense in my domain model.
The domain model changes slightly as follows:
Driver - key(string-id), driver-stats - ex. id="Michael", runs=17
Track - key(string-id), track-stats - ex. id="Monza", bestTime=157
RunData - key(string-id), stat-data - ex. id="Michael-Monza-20101010", time=148
TrackRun - key(Track/string-id), track-stats-updated - ex. id="Monza/Michael-Monza-20101010", track-stats-updated=false
DriverRun - key(Driver/string-id), driver-stats-updated - ex. id="Michael/Michael-Monza-20101010", driver-stats-updated=true
I can now update atomically (i.e. exactly once) the statistics of a Track with the statistics from a Run, either immediately or in my own time (and the same with the Driver / Run statistics).
So basically I have to expand the way I model my problem a little bit, in a non-conventional relational way. What do you think?
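For illustration, a sketch of the Track-side update under this model with the low-level datastore API (the run time stored on TrackRun is an assumption; in the model above it lives on RunData):
import com.google.appengine.api.datastore.*;
void applyRunToTrack() {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    // TrackRun is keyed as a child of its Track, so both live in one entity group.
    Key trackKey    = KeyFactory.createKey("Track", "Monza");
    Key trackRunKey = KeyFactory.createKey(trackKey, "TrackRun", "Michael-Monza-20101010");
    Transaction txn = ds.beginTransaction();
    try {
        Entity track    = ds.get(txn, trackKey);
        Entity trackRun = ds.get(txn, trackRunKey);
        if (Boolean.FALSE.equals(trackRun.getProperty("trackStatsUpdated"))) {
            long runTime = (Long) trackRun.getProperty("time");   // assumed copy of the run time
            long best    = (Long) track.getProperty("bestTime");
            track.setProperty("bestTime", Math.min(best, runTime));
            trackRun.setProperty("trackStatsUpdated", true);
            ds.put(txn, track);
            ds.put(txn, trackRun);
        }
        txn.commit();
    } catch (EntityNotFoundException e) {
        // nothing to do if either entity is missing
    } finally {
        if (txn.isActive()) txn.rollback();
    }
}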

I realize this is late, but...
Have you seen this method for Bank Account transfers?
http://blog.notdot.net/2009/9/Distributed-Transactions-on-App-Engine
It seems to me that you could do something similar by breaking your counter increments into two steps via an IncrementEntity, processing that, and picking up the pieces later if a transaction fails, etc.
From the blog:
In a transaction, deduct the required amount from the paying account, and create a Transfer child entity to record this, specifying the receiving account in the 'target' field, and leaving the 'other' field blank for now.
In a second transaction, add the required amount to the receiving account, and create a Transfer child entity to record this, specifying the paying account in the 'target' field, and the Transfer entity created in step 1 in the 'other' field.
Finally, update the Transfer entity created in step 1, setting the 'other' field to the Transfer we created in step 2.
The blog has code examples in Python, but it should be easy to adapt.
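A very rough Java adaptation of that idea for the counters discussed here; the PendingIncrement/AppliedIncrement entities and all property names are invented for the sketch:
import com.google.appengine.api.datastore.*;
void transferRunCountToTrack(Key runKey, Key trackKey, String runId) {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    // Step 1: in a transaction on the Run's entity group, note that an increment is owed.
    Transaction txn1 = ds.beginTransaction();
    Entity pending = new Entity("PendingIncrement", runId, runKey);   // child of the Run
    pending.setProperty("targetTrack", trackKey);
    ds.put(txn1, pending);
    txn1.commit();
    // Step 2: in a second transaction on the Track's entity group, apply it exactly once.
    // An "AppliedIncrement" child with the same deterministic id makes this step idempotent:
    // if it already exists, an earlier attempt already bumped the counter.
    Transaction txn2 = ds.beginTransaction();
    try {
        Key appliedKey = KeyFactory.createKey(trackKey, "AppliedIncrement", runId);
        try {
            ds.get(txn2, appliedKey);                                 // already applied earlier
        } catch (EntityNotFoundException notYetApplied) {
            Entity track = ds.get(txn2, trackKey);
            track.setProperty("runCount", ((Long) track.getProperty("runCount")) + 1);
            ds.put(txn2, track);
            ds.put(txn2, new Entity("AppliedIncrement", runId, trackKey));
            txn2.commit();
        }
    } catch (EntityNotFoundException e) {
        // Track is missing; leave the PendingIncrement for a later clean-up sweep
    } finally {
        if (txn2.isActive()) txn2.rollback();
    }
    // Step 3 (not shown): mark or delete the PendingIncrement once step 2 has committed.
}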

There's an interesting Google I/O session on this topic: http://www.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html
I guess you could update the Run stats and then fire two tasks to update the Driver and the Track individually.
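A small sketch of that with the Task Queue API; the handler URLs and parameter name are invented:
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;
void enqueueStatUpdates(Key runKey) {
    // After committing the Run itself, enqueue two independent follow-up tasks.
    // Each task handler then runs its own small transaction on a single entity group.
    Queue queue = QueueFactory.getDefaultQueue();
    queue.add(TaskOptions.Builder.withUrl("/tasks/updateDriverStats")
            .param("runKey", KeyFactory.keyToString(runKey)));
    queue.add(TaskOptions.Builder.withUrl("/tasks/updateTrackStats")
            .param("runKey", KeyFactory.keyToString(runKey)));
}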

Related

Load entire tables including relationships into memory with JPA

I have to process a huge amount of data distributed over 20 tables (~5 million records in total) and I need to load them efficiently.
I'm using Wildfly 14 and JPA/Hibernate.
Since in the end, every single record will be used by the business logic (in the same transaction), I decided to pre-load the entire content of the required tables into memory via simply:
em.createQuery("SELECT e FROM Entity e").getResultList().size();
After that, every object should be available in the transaction and thus be available via:
em.find(Entity.class, id);
But this doesn't work somehow and there are still a lot of calls to the DB, especially for the relationships.
How can I efficiently load the whole content of the required tables including
the relationships and make sure I got everything / there will be no further DB calls?
What I already tried:
FetchMode.EAGER: Still too many single selects / object graph too complex
EntityGraphs: Same as FetchMode.EAGER
Join fetch statements: Best results so far, since it simultaneously populates the relationships to the referred entities
2nd Level / Query Cache: Not working, probably the same problem as em.find
One thing to note is that the data is immutable (at least for a specific time) and could also be used in other transactions.
Edit:
My plan is to load and manage the entire data set in a @Singleton bean. But I want to make sure I'm loading it in the most efficient way and be sure the entire data set is loaded. There should be no further queries necessary when the business logic is using the data. After a specific time (EJB timer), I'm going to discard the entire data set and reload the current state from the DB (always whole tables).
Keep in mind that you'll likely need a 64-bit JVM and a large amount of memory. Take a look at the Hibernate 2nd level cache. Some things to check for, since we don't have your code (a short sketch follows this list):
The @Cacheable annotation will clue Hibernate in so that the entity is cacheable
Configure 2nd level caching to use something like ehcache, and set the maximum memory elements to something big enough to fit your working set
Make sure you're not accidentally using multiple sessions in your code.
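A minimal sketch of the first two points, assuming Hibernate with the ehcache region factory (the entity name is made up; cache sizing itself lives in ehcache.xml):
import javax.persistence.Cacheable;
import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;
// persistence.xml (sketch):
//   <property name="hibernate.cache.use_second_level_cache" value="true"/>
//   <property name="hibernate.cache.region.factory_class"
//             value="org.hibernate.cache.ehcache.EhCacheRegionFactory"/>
// Region sizes (maxElementsInMemory etc.) then go into ehcache.xml.
@Entity
@Cacheable                                          // JPA hint: this entity may be cached
@Cache(usage = CacheConcurrencyStrategy.READ_ONLY)  // data is immutable for a while, per the question
public class CatalogRecord {
    @Id
    private long id;
    // ... other fields
}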
If you need to process things in this way, you may want to consider changing your design to not rely on having everything in memory, not using Hibernate/JPA, or not use an app server. This will give you more control of how things are executed. This may even be a better fit for something like Hadoop. Without more information it's hard to say what direction would be best for you.
I understand what you're asking, but JPA/Hibernate isn't going to want to cache that much data for you, or at least I wouldn't expect a guarantee from it. Consider that you described 5 million records. What is the average length per record? 100 bytes gives 500 megabytes of memory, which will just crash your untweaked JVM. It's probably more like 5000 bytes on average, and that's 25 GB of memory. You need to think about what you're asking for.
If you want it cached you should do that yourself or better yet just use the results when you have them. If you want a memory based data access you should look at a technology specifically for that. http://www.ehcache.org/ seems popular but it's up to you and you should be sure you understand your use case first.
If you are trying to be database efficient then you should just understand what you're doing, and design and test carefully.
Basically it should be a pretty easy task to load entire tables with one query per table and link the objects, but JPA works differently, as shown in this example.
The biggest problem is @OneToMany/@ManyToMany relations:
@Entity
public class Employee {
    @Id
    @Column(name="EMP_ID")
    private long id;
    ...
    @OneToMany(mappedBy="owner")
    private List<Phone> phones;
    ...
}
@Entity
public class Phone {
    @Id
    private long id;
    ...
    @ManyToOne
    @JoinColumn(name="OWNER_ID")
    private Employee owner;
    ...
}
FetchType.EAGER
If the relation is defined as FetchType.EAGER and the query is SELECT e FROM Employee e, Hibernate generates the SQL statement SELECT * FROM EMPLOYEE and, right after it, SELECT * FROM PHONE WHERE OWNER_ID=? for every single Employee loaded, commonly known as the 1+n problem.
I could avoid the n+1 problem by using the JPQL-query SELECT e FROM Employee e JOIN FETCH e.phones, which will result in something like SELECT * FROM EMPLOYEE LEFT OUTER JOIN PHONE ON EMP_ID = OWNER_ID.
The problem is, this won't work for a complex data model with ~20 tables involved.
FetchType.LAZY
If defined as FetchType.LAZY, the query SELECT e FROM Employee e will load all Employees with their phones collections as lazy proxies, loading the related Phones only when phones is accessed, which in the end leads to the 1+n problem as well.
To avoid this, the obvious idea is to just load all the Phones into the same session with SELECT p FROM Phone p. But when phones is accessed, Hibernate will still execute SELECT * FROM PHONE WHERE OWNER_ID=?, because Hibernate doesn't know that all the Phones are already in its current session.
Even when using 2nd level cache, the statement will be executed on the DB because Phone is indexed by its primary key in the 2nd level cache and not by OWNER_ID.
Conclusion
There is no mechanism like "just load all data" in Hibernate.
It seems there is no other way than to keep the relationships transient and connect them manually, or even just use plain old JDBC.
EDIT:
I just found a solution which works very well. I defined all relevant @ManyToMany and @OneToMany relations as FetchType.EAGER combined with @Fetch(FetchMode.SUBSELECT), and all @ManyToOne relations with @Fetch(FetchMode.JOIN), which results in an acceptable loading time. Next to adding @javax.persistence.Cacheable(true) to all entities, I added org.hibernate.annotations.Cache to every relevant collection, which enables collection caching in the 2nd level cache. I disabled 2nd level cache timeout eviction and "warm up" the 2nd level cache via a @Singleton EJB combined with @Startup on server start / deploy. Now I have 100% control over the cache: there are no further DB calls until I manually clear it.
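A condensed, hedged sketch of that final setup on the Employee/Phone example from above (the exact cache strategies and the warm-up bean details are assumptions):
import java.util.List;
import javax.persistence.*;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;
import org.hibernate.annotations.Fetch;
import org.hibernate.annotations.FetchMode;
@Entity
@Cacheable
@Cache(usage = CacheConcurrencyStrategy.READ_WRITE)
public class Employee {
    @Id
    @Column(name = "EMP_ID")
    private long id;
    // EAGER + SUBSELECT loads the phones of all loaded employees with one extra
    // query instead of one query per employee.
    @OneToMany(mappedBy = "owner", fetch = FetchType.EAGER)
    @Fetch(FetchMode.SUBSELECT)
    @Cache(usage = CacheConcurrencyStrategy.READ_WRITE)   // enables the collection cache
    private List<Phone> phones;
}
// In a separate file: warm-up bean that pulls everything into the 2nd level cache on deploy.
@javax.ejb.Singleton
@javax.ejb.Startup
public class CacheWarmer {
    @PersistenceContext
    private EntityManager em;
    @javax.annotation.PostConstruct
    void warmUp() {
        em.createQuery("SELECT e FROM Employee e").getResultList();
    }
}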

Is hibernate search remote indexing possible?

We are migrating a whole application, originally developed in Oracle Forms a few years back, to a Java (7) web-based application with Hibernate (4.2.7.Final) and Hibernate Search (4.1.1.Final).
One of the requirements is: while users are using the new migrated version, they will still be able to use the Oracle Forms version, so the Hibernate Search indexes will get out of sync. Is it feasible to implement a servlet so that some PL/SQL can hit a link that updates the local indexes in the application server (AS)?
I thought of implementing some sort of clustering mechanism for Hibernate, but as I read through the documentation I realised that while clustering may be a good option for scalability and performance, it may be a bit overkill just for keeping legacy data in sync.
Does anyone have any idea of how to implement a service, accessible via servlet, to update local AS indexes in a given model entity with a given ID?
I don't know what exactly you mean by the clustering part, but anyway:
It seems like you are facing a similar problem to mine. I am currently working on a Hibernate-Search adaptation for JPA providers (other than Hibernate-ORM, meaning EclipseLink, TopLink, etc.) and I am working on an automatic reindexing feature at the moment. Since JPA doesn't have an event system suitable for reindexing with Hibernate-Search, I came up with the idea of using triggers on the database level to keep track of everything.
For a basic OneToOne relationship it's pretty straightforward, and for other things like relation tables or anything that is not stored in the main table of an entity it gets a bit trickier, but once you have a system for OneToOne relationships it's not that hard to get to the next step. Okay, let's start:
Imagine two entities: Place and Sorcerer in the Lord of the Rings universe. In order to keep things simple let's just say they are in a (quite restrictive :D) 1:1 relationship with each other. Normally you end up with 2 tables named SORCERER and PLACE.
Now you have to create 3 triggers (one for CREATE, one for DELETE and one for UPDATE) on each table (SORCERER and PLACE) that store information about which entity has changed (only the id; for mapping tables there are always multiple ids) and how (CREATE, UPDATE, DELETE) into special UPDATE tables. Let's call these PLACE_UPDATES and SORCERER_UPDATES.
In addition to the ID of the original object that has changed and the event type, these will need an ID field that must be UNIQUE across all UPDATE tables. This is needed because, if you want to feed information from the update tables to the Hibernate-Search index, you have to make sure the events are in the right order or you will break your index. How such a UNIQUE ID can be created on your database should be easy to find on the internet/Stack Overflow.
Okay. Now that you have set up the triggers correctly, you just have to find a way to access all the UPDATES tables in a feasible fashion (I do this by querying multiple tables at once, sorting each query by our UNIQUE id field, and then comparing the first result of each query with the others) and then update the index.
This can be a bit tricky and you have to find the correct ways of dealing with the specific update event but it can be done (that's what I am currently working on).
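To make the polling side concrete, here is a rough JDBC sketch; the column names and the applyToIndex placeholder are assumptions, and the author's own IndexUpdater linked below does this properly:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
// Drain one UPDATE table in event order; the real thing would merge PLACE_UPDATES and
// SORCERER_UPDATES by the shared UNIQUE id before applying anything.
void drainPlaceUpdates(Connection con) throws SQLException {
    String sql = "SELECT update_id, place_id, event_type FROM PLACE_UPDATES ORDER BY update_id";
    PreparedStatement ps = con.prepareStatement(sql);
    ResultSet rs = ps.executeQuery();
    while (rs.next()) {
        long updateId = rs.getLong("update_id");
        long placeId  = rs.getLong("place_id");
        String event  = rs.getString("event_type");      // CREATE / UPDATE / DELETE
        applyToIndex("Place", placeId, event);            // placeholder: push into Hibernate-Search
    }
    rs.close();
    ps.close();
    // After a successful pass, delete (or mark) the processed rows up to the last update_id.
}
void applyToIndex(String entityName, long id, String event) {
    // placeholder: reindex or purge the entity with Hibernate-Search, depending on the event
}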
If you're interested in that part, you can find it here:
https://github.com/Hotware/Hibernate-Search-JPA/blob/master/hibernate-search-db/src/main/java/com/github/hotware/hsearch/db/events/IndexUpdater.java
The link to the whole project is:
https://github.com/Hotware/Hibernate-Search-JPA/
This uses Hibernate-Search 5.0.0.
I hope this was of help (at least a little bit).
And about your remote indexing problem:
The update tables can easily be used as some kind of dump for events until you send them to the remote machine that is to be updated.

Google App Engine update only one property of an entity that has many efficiently java

Looking for an efficient way to update only one property for an entity in GAE.
I know I can do a get by key, set a property and then put. But won't the get be very inefficient, as it will load all properties? I have heard that you can make property-specific queries, but I was worried that once you load an entity with only, say, one or two of its properties and then put it back in the datastore, the properties not loaded in the query will be lost.
Any Advice?
P.S. I'm also not sure about the query method because I heard direct gets are more efficient. Is there any possibility of a query that specifies simply the key and therefore will be just as efficient?
AFAIK, entities are stored in a serialised form, so it makes no difference whether you need one or all properties; they will all be loaded when the entity's serialised form is loaded.
The "property specific queries" are actually called projection queries. They work on indexes only and only recreate "projected" fields you queried by. Since entities are only partially loaded (only projected fields are loaded) they should not be saved back to the Datastore.
Just use a normal query and then multi-put. Yes, direct gets are more efficient (and less costly), but you need to have the key/id of the entity.
If you need to update one property far more than others, you can move it into a separate, simpler entity that you can load and update independently of the main entity. This could be a child entity, or a separate one that shares key characteristics.
E.g.
Email <- main entity
Unread <- child entity of email
When an email is created, create an Unread entity. When it's read, delete the Unread entity. When searching for unread emails, perform a keys-only query on the Unread entities and extract the parent keys to find the Email entities you want.
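A short sketch with the low-level datastore API (kind and property names follow the example):
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import com.google.appengine.api.datastore.*;
Map<Key, Entity> loadUnreadEmails() {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    // Keys-only query over the tiny Unread child entities; no Email properties are read.
    Query q = new Query("Unread").setKeysOnly();
    List<Key> emailKeys = new ArrayList<Key>();
    for (Entity unread : ds.prepare(q).asIterable()) {
        emailKeys.add(unread.getKey().getParent());   // the parent key is the Email key
    }
    // Batch-get only the Email entities that are actually unread.
    return ds.get(emailKeys);
}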

Concurrent updates of same entity

Gentlemen/ladies,
I've got a problem with concurrent updates of the same entity.
Process 1 obtains a collection of objects. This process doesn't use Hibernate to retrieve the data, for the sake of performance, which sounds a bit far-fetched to me. This process also updates some fields of some objects from the collection using Hibernate.
Process 2 obtains an object similar to one of those in the collection (basically the same row in the DB) and updates it somehow. This process uses Hibernate.
Since process 1 and process 2 don't know about each other, they can update the same entity, leaving it in an inconsistent state.
For example:
process 1 obtains collection
process 2 obtains one entity and removes some of its properties along with an entity it was linking to
process 1 comes back, tries to save that entity, and gets an entity-not-found exception
I need to deal with this situation.
So what can be done?
For now I see two ways:
create a layer above the database that keeps track of every entity in the system, effectively prohibiting the creation of multiple instances of the same entity
set up optimistic locks; since some entities are not obtained via Hibernate, I would need to implement this somewhat differently for them
Any ideas would be very helpful
Thanks in advance
Since process 1 and process 2 don't know about each other, they can update the same entity, leaving it in an inconsistent state.
I'd reformulate that: both processes can update the same data. Only Hibernate would know the entities, while the other process seems to access the data via JDBC.
I'd go for option 2 which would involve a version column in your entities.
IIRC, Hibernate would then add a WHERE version = x condition to its update queries and check whether all rows have been updated; if not, an OptimisticLockException is thrown. You can do the same in your JDBC queries, i.e. UPDATE ... SET ... version = x + 1 ... WHERE version = x AND additionalConditions, and check the number of updated rows returned by JDBC.
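A minimal sketch of option 2 under those assumptions: a version column mapped for Hibernate, with the plain-JDBC process honouring the same rule.
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;
@Entity
public class Item {
    @Id
    private long id;
    @Version            // Hibernate checks and increments this column on every update
    private long version;
    private String name;
}
// The non-Hibernate process must follow the same rule, e.g. with plain JDBC:
//   UPDATE item SET name = ?, version = version + 1 WHERE id = ? AND version = ?
// If executeUpdate() returns 0, another process changed the row first: reload and retry.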

How to build a change tracking system - not audit system

I have a requirement in which I need to capture data changes (not auditing) and life cycle states on inventory.
Technology:
Java, Oracle, Hibernate + JPA
For the data changes, we have been given a list of data elements that are to be monitored. If the element changes we are to notify a given 3rd party vendor. What I want to do is make this a generic service that we can provide to any of our current and future 3rd party vendors.
We don't care who made the change or what the new value is just that it changed.
The thought is that the data layer of our application would use annotations on each of the data elements. If a data element changed, it would place a message onto a queue. A message bean would then read the queue and make an entry in a table.
Table to look something like the following:
Table Name: ATL_CHANGE_TRACKER
Key columns
INVENTORY_ID Inventory Id of the vehicle
SALEEVENT_ITEM_ID SaleEvent item of the vehicle
FIELD_CHANGED_ID Id of the field or action that changed; links to the subscription
UPDATE_DTM Indicates the date and time when the change occurred.
For a given inventory, we could have up to 200 entries in this table (monitoring 200 fields across many tables).
Then a daemon for the given 3rd party would read from this table based on the fields it has subscribed to (which could be all of the fields). It would then read whatever tables it needs in order to create the message to be sent to the 3rd party. This decouples the provider of the data from the user of the data.
Identify the list of fields/actions that are available
Table Name: ATL_FIELD_ACTION
Key columns
ID
NAME Name of the field/action - Example Color,Make
REC_CRE_TIME_STAMP
REC_CRE_USER_ID
LAST_UPDATE_USER_ID
LAST_UPDATE_TIME_STAMP
Subscription table: if 3rd-party company xyz is interested in 60 fields, those 60 fields will be mapped to this table.
ATL_FIELD_ACTION_SUBSCRIPTION
Key columns
ATL_FIELD_ACTION_ID ID of the ATL_FIELD_ACTION table
CONSUMER 3rd Party Name
FUNCTION Name of the 3rd Party Transmission that it is used for
STATUS
REC_CRE_TIME_STAMP
REC_CRE_USER_ID
LAST_UPDATE_USER_ID
LAST_UPDATE_TIME_STAMP
The second part is that there will be actions on the life cycle of the inventory which will need to be recorded as well. In this case, when the state of the inventory changes, a message will be placed on the same queue and an entry will be made in the same table.
Again, the daemon will have subscribed to these states and will collect the ones it is interested in.
The goal here is to not have the business tier/data tier care who wants the data - just that it needs to provide it so those interested can get it.
I wonder if anyone has done something like this - any gotchas, off-the-shelf or open-source solutions to do this?
For a high-level discussion on the topic, I would suggest reading this article by Martin Fowler.
It sounds like you have write-once, read-many data, it might produce large volumes of data, and the data is different for different clients. If you ask me, this sounds like a good place to make use of either a NoSQL database or to hack your Oracle database to act as a NoSQL database. See here for a discussion of how someone did this with MySQL.
Otherwise, you may look at creating an "immutable" database table and have Hibernate write new records every time it does an update as described here.
A couple of things.
First, you get to do all of this work yourself. The JPA/Hibernate lifecycle listeners have an event for when an update occurs, but you aren't passed both the "old" object and the "new" object. So you're going to have to keep track of what fields changed using some other method.
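One hedged way to do that "other method" is to snapshot the monitored fields at load time and diff them before the update; the entity, fields and queue helper below are invented for illustration:
import javax.persistence.PostLoad;
import javax.persistence.PreUpdate;
import javax.persistence.Transient;
// @Entity, @Id etc. omitted for brevity.
public class Inventory {
    private String color;
    private String make;
    @Transient private String loadedColor;   // snapshots, never persisted
    @Transient private String loadedMake;
    @PostLoad
    void snapshot() {
        loadedColor = color;
        loadedMake  = make;
    }
    @PreUpdate
    void detectChanges() {
        // Only the fact that a monitored field changed is reported, not the new value.
        if (!java.util.Objects.equals(color, loadedColor)) ChangeQueue.send("INVENTORY", "COLOR");
        if (!java.util.Objects.equals(make,  loadedMake))  ChangeQueue.send("INVENTORY", "MAKE");
    }
    static class ChangeQueue {                 // stub standing in for the real queue sender
        static void send(String entity, String field) { /* post a message to the queue */ }
    }
}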
Second, again with lifecycle listeners, be careful inside of them, as the transaction state is a bit murky. At least on Glassfish/EclipseLink, I've had "strange" problems using either the JPA or JMS from a lifecycle listener. Just weird behavior. We went to a non-transactional queue to capture all of our information that we track from the lifecycle events.
If having the change data committed in its own transaction is acceptable, then there is value in pushing the data onto a faster, internal queue (which can feed a listener that posts it to an MDB). This gets the auditing "out of band" with your transaction and gives you better transaction throughput. But if you need the change information committed within the same transaction, this won't work: for example, you could put something on the queue and then the transaction may be rolled back (for whatever reason), leaving the change on the queue showing it happened when it in fact failed. That's a potential issue with this approach.
But if you're posting a lot of audit information, then this can be a concern.
If the auditing information has a short life span (with respect to the rest of the data), then you should probably make an effort to cull the audit tables, they can get pretty large.
Also, if practical, don't disregard the use of DB triggers for this. They can be quite efficient and effective at this process.
