Event-driven way of (batch-)removing expired data in Java application

In our application, we have a concept called 'overlay', which changes the behaviour for a user account until it expires. Each user account can have zero or one overlays, which are persisted with the user account in the database.
A user can activate an overlay (which requires setting the expiration time) and manually remove it (if it should be removed before its expiry). Also the overlay's expiration time can be updated. When an overlay is removed (manually or due to expiration), an event is triggered which results in more logic being invoked (e.g. firing a webhook).
What would be a good way to automatically remove expired overlays from the database? The application is built with Grails, so it can use Java frameworks such as Quartz. The application runs on several AWS EC2 instances behind a load balancer. The mechanism needs to survive application restarts.
The obvious solution is a periodic batch job that runs every X seconds, scans for all expired overlays, and then deletes them. The downsides are that this
creates peaks in the load (since removing overlays creates events which trigger more behaviour)
leaves expired overlays in the DB before they are removed
Is there a more "event-driven" way, so that overlays are removed right when they expire? We have around 50,000 user accounts.

A possible approach is to check (and possibly delete) upon use - i.e. whenever you're retrieving the possible overlay from the database, check if it's expired and delete it if so.
Technically, you could create a database procedure that checks the overlay's existence and expiry, deletes the record if it has expired, and then returns the correct result to the user. If you hide the table (or column) storing the expiry dates from the user (= your application) and only allow access through that procedure, the result will be:
The expired entries are still in the database for some time
possibly indefinitely, but with at most 1 overlay per user, it's hopefully not too much data
they are there, but the application cannot see them
From the application's viewpoint, there's no possibility to get expired entries
No peaks in resources consumption
Might add some overhead
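
The check-on-use idea can be sketched in application code as well. The following is only a minimal illustration, not the database procedure described above: OverlayStore, Overlay and the onRemoved callback are hypothetical names, and an in-memory map stands in for the overlay table. The current time is passed in explicitly so that the behaviour is easy to test.

```java
import java.time.Instant;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical in-memory stand-in for the overlay table (assumption, not the real schema).
final class OverlayStore {

    static final class Overlay {
        final String userId;
        final Instant expiresAt;

        Overlay(String userId, Instant expiresAt) {
            this.userId = userId;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<String, Overlay> overlays = new ConcurrentHashMap<>();
    private final Consumer<Overlay> onRemoved; // e.g. the code that fires the webhook

    OverlayStore(Consumer<Overlay> onRemoved) {
        this.onRemoved = onRemoved;
    }

    void activate(String userId, Instant expiresAt) {
        overlays.put(userId, new Overlay(userId, expiresAt));
    }

    // Returns the overlay only if it is still valid at 'now'.
    // An expired overlay is deleted on the spot and the removal callback is invoked.
    Optional<Overlay> find(String userId, Instant now) {
        Overlay overlay = overlays.get(userId);
        if (overlay == null) {
            return Optional.empty();
        }
        if (!overlay.expiresAt.isAfter(now)) {
            // remove(key, value) succeeds for exactly one caller, so the callback runs once.
            if (overlays.remove(userId, overlay)) {
                onRemoved.accept(overlay);
            }
            return Optional.empty();
        }
        return Optional.of(overlay);
    }
}
```

Note that with this approach the removal event fires on the first read after expiry, not at the moment of expiry itself.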

Related

Detect concurrent access to a java object

I have a question about concurrent access in java.
The project context is as follows: JSF 2.1; Richfaces 4, hibernate 3.6, spring 3.2, java 7.
I have a table that displays hundreds of thousands of folders (each folder is a Java object); each table row is a dynamic link for viewing or modifying the object (folder).
My question is: how can I prevent more than one user from viewing or editing a folder?
In other words, how do you detect access to a folder (Java object) by more than one user at the same time?
I know that the synchronized keyword on methods makes it possible to prevent concurrent access, but I want to detect concurrent access so that I can warn the other user or users that the folder (Java object) is already open and that they have to wait until it is closed.
Thank you

Synchronization and all the other locking/blocking mechanisms only prevent concurrent access from within the code.
They do not prevent 2 users from editing the same "thing" at the same time. This is something you have to put into your code on your own.
There are two most common ways to achieve this:
Optimistic Locking:
You assume that even if multiple users view the same item, there will not be multiple changes at the same time. Hence, you only check for a concurrent modification when saving. To achieve that:
The object in question needs a timestamp of when it was last modified (by anyone).
Once a user opens an item to view, you store that timestamp in the session context.
If the user clicks "save", you compare the session-context-timestamp with the timestamp of the actual object. If they are the same, changes can be saved. If the Object has been modified in between, you need to display an error, that it has been modified, and changes can't be saved without loading the new base-values prior.
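
The optimistic steps above can be sketched as follows. This is a minimal illustration under stated assumptions: Folder and saveIfUnchanged are hypothetical names, the object lives in memory rather than in Hibernate, and the seenAt argument plays the role of the timestamp kept in the session context.

```java
import java.time.Instant;

// Hypothetical domain object carrying a last-modified timestamp.
final class Folder {
    private String title;
    private Instant lastModified;

    Folder(String title, Instant lastModified) {
        this.title = title;
        this.lastModified = lastModified;
    }

    synchronized String getTitle() {
        return title;
    }

    // The caller stores this value (e.g. in the session) when the user opens the folder.
    synchronized Instant getLastModified() {
        return lastModified;
    }

    // Saves only if nobody has modified the folder since 'seenAt'.
    // Returns false on a concurrent modification; the caller should then show an error
    // and reload the current values before the user can try again.
    synchronized boolean saveIfUnchanged(String newTitle, Instant seenAt, Instant now) {
        if (!lastModified.equals(seenAt)) {
            return false;
        }
        title = newTitle;
        lastModified = now;
        return true;
    }
}
```

If two users open the folder at the same time, the first save succeeds and changes the timestamp, so the second save is rejected.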
Pessimistic Locking:
You assume that whenever a user views an object, there will be a change. To achieve that:
The object in question needs an int flag indicating that it is currently being edited (by anyone; flag = user ID).
Once a second user tries to load that object, you deny access because someone is currently editing it.
The tricky part here is cleaning up "locks" if the editing user just leaves the page without "aborting" through application methods. To handle this, you need to store a lock timestamp, define some timeout value (for example 30 minutes), and do two more checks:
The editing user exceeded the timeout value upon save? They can't save anymore.
(As an extra, you can allow later saving if the object has not been locked by someone else in between; use the lock user ID to figure that out.)
The object is locked, but the lock is older than 30 minutes? The second user can open it, claiming the lock user ID for themselves.

Alternative for session bean

I have a web application based on JSP and Spring MVC where I need to solve this task:
The user must be able to add new instances of the main entity using wizard dialog. The wizard consists of 3 steps:
On the first step there must be a form which allows filling main entity’s fields, including association with the entity related as many-to one (it’s recommended to use drop-down field). The form should contain fields of different types: text, number, date, radio button, etc. Some fields should be required and some are not.
Example: input name, surname, birth date, phone, number of kids, select gender (radiobutton), department (drop-down), etc.
On the second step user fills additional attributes, including association with the entity related as many-to-many with the current one.
Example: associate employee with skills that (s)he has (checkboxes), add some note (textarea).
On the third step all the fields from previous 2 steps should be displayed as read-only fields. The user should confirm saving this data into database. After the user confirms saving, the data should be saved into database, and user should be redirected to the page with the list of objects.
How can I transfer and hold information without using sessions (HTTP session, session scope)?

You need to keep state across multiple server interactions. There are several possibilities; in general, factors such as the size of the state data to be retained influence the decision.
It sounds like you have a few hundred bytes here, so you're not particularly constrained by size; a few megabytes would be more of a challenge.
First possibility: keep it all in the browser in JavaScript variables, with no actual need to send anything to the server. This is typical of a modern dynamic web UI, where the server serves up data rather than pages. It sounds like you're in a multi-page world, so discount this option.
Second, put the data (possibly encrypted) in a cookie. Effectively the browser is keeping the data for you, but it's shared across the pages.
Third, use HTTP session state. Your case does sound very much like a typical candidate for a session. Why do you want to avoid it? Depending upon your server's capabilities, this approach may not give great resilience (if the state is on one server instance, then all requests for a session must be served by the same server). Note that HTTP sessions and EJB session beans are not the same thing; HttpSessions are lighter weight.
Fourth, use a custom session "database": maybe literally a SQL database, maybe something lighter. For larger-scale data entry cases, where a user may take tens of minutes to complete many pages, this may be the best option, because the user's work is saved should they need to break off and resume later. It's more development work and you need to look at housekeeping too, but it's sometimes the best option.
In summary: be very clear why you reject the "obvious" HTTP session technique; in terms of simplicity, it's where I'd start.

Memcache getItemCount() counting expired keys?

I want to get the count of "alive" keys at any given time. Now according to the API documentation getItemCount() is meant to return this.
However it's not. Expired keys are not reducing the value of getItemCount(). Why is this? How can I accurately get a count of all "active" or "alive" keys that have not expired?
Here's my put code;
syncCache.put(uid, cachedUID, Expiration.byDeltaSeconds(3), SetPolicy.SET_ALWAYS);
Now that should expire keys after 3 seconds. It expires them but getItemCount() does not reflect the true count of keys.
UPDATE:
It seems memcache might not be what I should be using, so here's what I'm trying to do.
I wish to write a Google App Engine server/app that works as a "users online" feature for a desktop application. The desktop application makes an HTTP request to the app with a unique ID as a parameter. The app stores this UID along with a timestamp. This is done every 3 minutes.
Every 5 minutes, any entries that have a timestamp outside of that 5-minute window are removed. Then you count how many entries you have and that's how many users are "online".
The expire feature seemed perfect as then I wouldn't even need to worry about timestamps or clearing expired entries.

It might be a problem in the documentation; the Python documentation does not mention anything about "alive" items. The behaviour is also reproducible in Python.
See also this related post How does the lazy expiration mechanism in memcached operate?.

getItemCount() might count expired keys because that's the way memcache, and many other caches, work.
Memcache can help you do what you describe, but not in the way you tried to do it. Consider the opposite situation: you put online users in memcache and then App Engine evicts them because of a lack of free memory. A cache doesn't give you any guarantee that your items will be stored for any particular period. What memcache can do is let you decrease the number of requests to the datastore and the latency.
One way to do it:
Maintain a sorted map of (user-id, last-login-refreshed) entries stored in the datastore (or in memcache if you do not need it to be very precise). On every login/refresh, update the value for that user's key, and from a periodic cron job remove old users from the map. The size of the map is then the number of logged-in users at that moment.
Make sure the map fits in 1 MB, which is the limit for both memcache and the datastore.
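
The map-plus-cron idea could look roughly like the sketch below. It is an in-memory illustration only: OnlineUsers, heartbeat and pruneAndCount are hypothetical names, nothing here uses the App Engine datastore or memcache APIs, and a plain concurrent map is used instead of a sorted one for brevity.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical "users online" tracker: user id -> time of the last heartbeat.
final class OnlineUsers {
    private final Map<String, Instant> lastSeen = new ConcurrentHashMap<>();
    private final Duration window;

    OnlineUsers(Duration window) {
        this.window = window;
    }

    // Called on every login/refresh request from the desktop application.
    void heartbeat(String uid, Instant now) {
        lastSeen.put(uid, now);
    }

    // Called from the periodic job: drops entries older than the window
    // and returns how many users remain, i.e. how many are "online".
    int pruneAndCount(Instant now) {
        Instant cutoff = now.minus(window);
        lastSeen.values().removeIf(seen -> seen.isBefore(cutoff));
        return lastSeen.size();
    }
}
```

With a 5-minute window and a heartbeat every 3 minutes, a user drops out of the count only after missing a heartbeat.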

Concurrency: Only one user editing an item at a time

I've been researching how to fix this issue for some time but can't seem to find a proper solution.
Here's the issue:
I have a Java EE application where many users can login, they are presented with an item list, and they can select and edit any one of those.
All users see the same item list.
As mentioned, they can edit an item but I'd like to restrict the editing function to one user. That is, many users can edit different items simultaneously but only one user can edit a particular item.
When one user is editing an item, a message should appear to any other user trying to edit that item.
I have implemented this by setting a flag on the item, inUse, to true and then checking for that. When the user is done editing the item, either by clicking save or cancel, the flag is set to false.
The problem with this approach is accounting for cases when the user leaves the browser open or closes it.
I tried setting a session timeout but can't seem to make that work, because when the session times out I don't have access to that item. I only have access to the HTTP request session ID.
Perhaps this is the wrong approach, since it seems to be an issue that many applications would have, and a less hacky solution should exist.
I looked into using threads and synchronized methods but don't know how that would work because once the user enters into the edit item screen, the method exits and releases the lock.
I found this solution Only one user allowed to edit content at a time but not sure if that's the way to go in Java.
Is there a more elegant/java solution? If so can you point me in the right direction please? How would you implement this?
Thanks!
The solution:
Although originally I thought optimistic locking was the way to go, I quickly realized that it wouldn't really work for my environment. I decided to go with a combination of pessimistic locking (http://www.agiledata.org/essays/concurrencyControl.html#PessimisticLocking) and timeouts.
When an item is accessed, I set an inUse field to true and the object's last accessed field to the current time.
Every time somebody tries to edit the object, I check the inUse field and whether the lastAccessed time + 5 minutes has passed. So basically, I give the user 5 minutes to edit.

Do it like they do in a database, where a timestamp is used. The timestamp is kept with the record and when a person submits his edit, the edit does not go through unless the timestamp is the same (meaning 'no edits have occurred since I read this record'). Then when the edit does go through, a new timestamp is created.

First of all, in your persistence layer, you really should be doing optimistic locking, using a version/timestamp field.
At a UI level, to handle your use case I would do resource leasing:
Add two fields to your table:
LAST_LEASE_TIME: Time of the last lease
LAST_LEASE_USER: User that leased the record for the last time.
When a user tries to edit your record, first check that the record is not leased, or that the lease has expired (that is, the lease is older than the specified lease time), or that the user is the one who was granted the lease.
From your web browser, periodically renew the lease, for example with an AJAX call.
When the user ends editing the record, explicitly expire the lease.
By doing leasing, you solve the "closed browser" problem: after the lease period expires without any lease renewal, the algorithm automatically "releases" the resource.
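
A minimal sketch of the leasing check, assuming an in-memory object instead of a database row: LeasedRecord, acquire and release are hypothetical names, and the two fields mirror the LAST_LEASE_TIME and LAST_LEASE_USER columns. In a real application the check and update would have to happen atomically in the database, for example inside one transaction.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical record with a time-limited edit lease.
final class LeasedRecord {
    private final Duration leaseTime;
    private String lastLeaseUser;  // mirrors LAST_LEASE_USER
    private Instant lastLeaseTime; // mirrors LAST_LEASE_TIME

    LeasedRecord(Duration leaseTime) {
        this.leaseTime = leaseTime;
    }

    // Grants the lease if the record is free, the lease has expired,
    // or the caller already holds it (in which case this is a renewal).
    synchronized boolean acquire(String user, Instant now) {
        boolean free = lastLeaseUser == null;
        boolean expired = !free && lastLeaseTime.plus(leaseTime).isBefore(now);
        boolean mine = user.equals(lastLeaseUser);
        if (free || expired || mine) {
            lastLeaseUser = user;
            lastLeaseTime = now;
            return true;
        }
        return false;
    }

    // Explicitly expires the lease when the holder finishes editing.
    synchronized void release(String user) {
        if (user.equals(lastLeaseUser)) {
            lastLeaseUser = null;
            lastLeaseTime = null;
        }
    }
}
```

The periodic AJAX call would simply invoke acquire again for the same user; if the browser is closed, the renewals stop and the lease lapses on its own.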

Sounds like you could use session beans. Quote:
In general, you should use a session bean if the following circumstances hold:
At any given time, only one client has access to the bean instance.
The state of the bean is not persistent, existing only for a short period of time (perhaps a few hours).

Martin Fowler describes 4 patterns for such a problem:
1. Online Optimistic Locking
2. Online Pessimistic Locking
3. Offline Optimistic Locking
4. Offline Pessimistic Locking
You should decide which one to use according to your problem.
JPA, JDO and Hibernate provide 1 and 2 out of the box.
Hibernate can handle 3 too, (I'm not sure about JPA and JDO).
None of them handles 4 out of the box, so you have to implement it yourself.

Store data in session, how and when to detect if data is stale

The scenario I have is this.
1. User does a search
2. Handler finds results, stores them in the session
3. User sees the results, decides to click one of them to view
4. After viewing, user clicks "Back to Search"
5. Handler detects it's a back-to-search, skips the search and instead retrieves the results from the session
6. User sees the same results as expected
By #5, a new item that fits the user's search criteria may have been created, so it should be part of the results. But since in #5 I'm just retrieving from the session, it will not be detected.
My question is, should I be doing an extra step of checking? If so, how to check effectively without doing an actual retrieve (which would defeat the purpose)? Maybe do select count(*) .... and compare that with count of resultset in session?

Caching search results in a session is something I strongly advise against. Web apps should strive to have the smallest session state possible. Putting in blanket logic to cache search results (presumably several KB at least) in user session state is really asking for memory problems down the road.
Instead, you should have a singleton search service which manages its own cache. Although this appears similar in strategy to caching inside the session, it has several advantages:
you can re-use common search results among users; depending on the types of searches this could be significant
you can manage cache size in the service layer; something like ehcache is easy to implement and gives you lots of configurability (and protection against out of memory issues)
you can manage cache validity in the service layer; i.e. if the "update item" service has had its save() method triggered, it can tell the search service to invalidate either its entire cache or just the cached results that correspond with the newly updated/created item.
The third point above addresses your main question.
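
As a rough illustration of such a service, the sketch below caches results per query and lets the update path invalidate them. CachingSearchService, the backend function and invalidateAll are hypothetical names, and the plain map used here is unbounded; a cache library such as ehcache would add the size limits mentioned above.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical shared search service that owns its cache (one instance per application).
final class CachingSearchService {
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();
    private final Function<String, List<String>> backend; // runs the real search

    CachingSearchService(Function<String, List<String>> backend) {
        this.backend = backend;
    }

    // Returns cached results for the query, running the real search only on a miss.
    List<String> search(String query) {
        return cache.computeIfAbsent(query, backend);
    }

    // Called by the "update item" service after a save, so stale results are dropped.
    void invalidateAll() {
        cache.clear();
    }
}
```

Because the cache is keyed by query rather than by user, two users running the same search share one cached result.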

It depends on your business needs. If it's imperative that the user have the latest up to date results then you'll have to repull them.
A count wouldn't be 100% because there could be corresponding deletions.
You might be able to compare timestamps or something but I suspect all the complexity involved would just introduce further issues.
Keep it simple and rerun your search.

In order to see if there are new items, you likely will have to rerun your search - even just to get a count.
You are effectively caching the search results. The normal answer is therefore either to expire the results after a set amount of time (eg. the results are only valid for 1 minute) or have a system that when the data is changed, the cache is invalidated, causing the search to have to run again.
Are there likely to be any new results by the time the user gets back there? You could just put a 'refresh' button on the search results pages to cause the search to be run again.

What kind of refresh rate are you expecting for the DB items? Would the search results change drastically even over short intervals? I am not aware of such a scenario, but you might have a different case.
Assuming that you have a scenario where your DB is populated by a separate thread or threads and you have another independent thread to search for results, keep track of the timestamp of the latest item inserted into the DB in your cache.
Now, when the user wants to see the search results again, compare the timestamps, i.e. compare your cache timestamp with that of the last item inserted into the DB. If they do not match, re-query; otherwise show the results from your cache.
If your scenario conforms to my assumption that the DB is not updated too frequently (with respect to a specific search term or criteria), then this could save you from querying the DB too often.
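
The timestamp comparison could be sketched like this. TimestampCheckedCache and its parameters are hypothetical; the sketch assumes the caller can cheaply obtain the timestamp of the latest insert (for example from a small max-timestamp query) and passes it in together with a supplier that runs the full search.

```java
import java.time.Instant;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical cache for one search, valid as long as the DB's latest insert time is unchanged.
final class TimestampCheckedCache {
    private List<String> cachedResults;
    private Instant cachedAsOf; // latest insert time seen when the cache was filled

    // Re-runs the query only if the DB's latest insert time differs from the cached one.
    synchronized List<String> results(Instant latestInsertInDb, Supplier<List<String>> query) {
        if (cachedResults == null || !latestInsertInDb.equals(cachedAsOf)) {
            cachedResults = query.get();
            cachedAsOf = latestInsertInDb;
        }
        return cachedResults;
    }
}
```

As with a row count, an insert timestamp alone does not reveal deletions or updates, so this only detects newly created items.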
