Duplicates because Oracle is too fast or multithread? - java

I have a problem with duplicate records arriving in our database via a Java web service, and I think it's to do with Oracle processing threads.
Using an iPhone app we built, users add bird observations to a new site they visit on holiday. They create three records at "New Site A" (for example). The iPhone packages each of these three records into separate JSON strings containing the same date and location details.
On Upload, the web service iterates through each JSON string.
Iteration/Observation 1. It checks the database to see if the site exists, and if not, creates a new site and adds the observation into a hanging table.
Iteration/Obs 2. The site should now exists in the database, but it isn't found by the database site check in Iteration 1, and a second new site is created.
Iteration/Obs 3. The check for existing site NOW WORKS, and the third observation is attached to one of the existing sites. So the web service and database code does work.
The web service commits at the end of each iteration.
Is the reason for the second iteration not finding the new site in the database due to delays in Oracle commit after it's been called by the Java, so that it's already started processing iteration 2 by the time iteration 1 is truly complete, OR is it possible that Oracle is running each iteration on a separate thread?
One solution we thought about was to use Thread.sleep(1000) in the web service, but I'd rather not penalize the iPhone users.
Thanks for any help you can provide.
Iain

Sounds like a race condition to me. Probably your observation 1 and 2 are arriving very close to each other, so that 1 is still processing when 2 arrives. Oracle is ACID-compliant, meaning your transaction for observation 2 cannot see the changes made in transaction one, unless this one was completed before transaction two started.
If you need a check-then-create functionality, you'd best synchronize this at a single point in your back end.
Also, add a constraint in your DB to avoid the duplication at all costs.

It's not an Oracle problem; Thread.sleep would be a poor solution, especially since you don't know root cause.
Your description is confusing. Are the three JSON strings sent in one HTTP request? Does the order matter, or does processing any of them first set up the new location for the ones that follow?
What's a "hanging table"?
Is this a parent-child relation between location and observation? So the unit of work is to INSERT a new location into the parent table followed by three observations in the child table that refer back to the parent?
I think it's a problem with your queries and how they're written. I can promise you that Oracle is fast enough for this trivial problem. If it can handle NASDAQ transaction rates, it can handle your site.
I'd write your DAO for Observation this way:
public interface ObservationDao {
void saveOrUpdate(Observation observation);
}
Keep all the logic inside the DAO. Test it outside the servlet and put it aside. Once you have it working you can concentrate on the web app.

Related

Java Design Pattern (Orchestration/Workflow)

I need to automate the workflow after an event occurred. I do have experience in CRUD applications but not in Workflow/Batch processing. Need help in designing the system.
Requirement
The workflow involves 5 steps. Each step is a REST call and are dependent on previous step.
EX of Steps: (VerifyIfUserInSystem, CreateUserIfNeeded, EnrollInOpt1, EnrollInOpt2,..)
My thought process is to maintain 2 DB Tables
WORKFLOW_STATUS Table which contains columns like
(foreign key(referring to primary table), Workflow Status: (NEW, INPROGRESS, FINISHED, FAILED), Completed Step: (STEP1, STEP2,..), Processed Time,..)
EVENT_LOG Table to maintain the track of Events/Exceptions for a particular record
(foreign key, STEP, ExceptionLog)
Question
#1. Is this a correct approach to orchestrate the system(which is not that complex)?
#2. As the steps involve REST Calls, I might have to stop the process when a service is not available and resume the process in a later point of time. I am not sure for many retry attempts should be made and how to maintain the no of attempts made before marking it as FAILED. (Guessing create another column in the WORKFLOW_STATUS table called RETRY_ATTEMPT and set some limit before marking it Failed)
#3 Is the EVENT_LOG Table a correct design and what datatype(clob or varchar(2048)) should I be using for exceptionlog? Every step/retry attempts will be inserted as a new record to this table.
#4 How to reset/restart a FAILED entry after a dependent service is back up.
Please direct me to an blogs/videos/resources if available.
Thanks in advance.
Have you considered using a workflow orchestration engine like Netflix's Conductor? docs, Github.
Conductor comes with a lot of the features you are looking for built in.
Here's an example workflow that uses two sequential HTTP requests (where the 2nd requires a response from the first):
Input supplies an IP address (and a Accuweather API key)
{
"ipaddress": "98.11.11.125"
}
HTTP request 1 locates the zipCode of the IP address.
HTTP request 2 uses the ZipCode (and the apikey) to report the weather.
The output from this workflow is:
{
"zipcode": "04043",
"forecast": "rain"
}
Your questions:
I'd use an orchestration tool like Conductor.
Each of these tasks (defined in Conductor) have retry logic built in. How you implement will vary on expected timings etc. Since the 2 APIs I'm calling here are public (and relatively fast), I don't wait very long between retries:
"retryCount": 3,
"retryLogic": "FIXED",
"retryDelaySeconds": 5,
Inside the connection, there are more parameters you can tweak:
"connectionTimeOut": 1600,
"readTimeOut": 1600
There is also exponential retry logic if desired.
The event log is stored in ElasticSearch.
You can build error pathways for all your workflows.
I have this workflow up and running in the Conductor Playground called "Stack_overflow_sequential_http". Create a free account. Run the workflow - click "run workflow, select "Stack_overflow_sequential_http" and use the JSON above to see it in action.
The get_weather connection is a very slow API, so it may fail a few times before succeeding. Copy the workflow, and play with the timeout values to improve the success.
You describe an Enterprise Integration Pattern with enrichments/transformations from REST calls and stateful aggregation of the results over time (consequently meaning many such flows may be in progress at any one time). Apache Camel was designed for exactly these scenarios.
See What exactly is Apache Camel?

Building a sharded list in Google App Engine

I am looking for a good design pattern for sharding a list in Google App Engine. I have read about and implemented sharded counters as described in the Google Docs here but I am now trying to apply the same principle to a list. Below is my problem and possible solution - please can I get your input?
Problem:
A user on my system could receive many messages kind of like a online chat system. I'd like the server to record all incoming messages (they will contain several fields - from, to, etc). However, I know from the docs that updating the same entity group often can result in an exception caused by datastore contention. This could happen when one user receives many messages in a short time thus causing his entity to be written to many times. So what about abstracting out the sharded counter example above:
Define say five entities/entity groups
for each message to be added, pick one entity at random and append the message to it writing it back to the store,
To get list of messages, read all entities in and merge...
Ok some questions on the above:
Most importantly, is this the best way to go about things or is there a more elegant/more efficient design pattern?
What would be a efficient way to filter the list of messages by one of the fields say everything after a certain date?
What if I require a sharded set instead? Should I read in all entities and check if the new item already exists on every write? Or just add it as above and then remove duplicates whenever the next request comes in to read?
why would you want to put all messages in 1 entity group ?
If you don't specify a ancestor, you won't need sharding, but the end user might see some lagging when querying the messages due to eventual consistency.
Depends if that is an acceptable tradeoff.

Is there a good patterns for distributed software and one backend database for this problem?

I'm looking for a high level answer, but here are some specifics in case it helps, I'm deploying a J2EE app to a cluster in WebLogic. There's one Oracle database at the backend.
A normal flow of the app is
- users feed data (to be inserted as rows) to the app
- the app waits for the data to reach a certain size and does a batch insert into the database (only 1 commit)
There's a constraint in the database preventing "duplicate" data insertions. If the app gets a constraint violation, it will have to rollback and re-insert one row at a time, so the duplicate rows can be "renamed" and inserted.
Suppose I had 2 running instances of the app. Each of the instances is about to insert 1000 rows. Even if there is only 1 duplicate, one instance will have to rollback and insert rows one by one.
I can easily see that it would be smarter to re-insert the non-conflicting 999 rows as a batch in this instance, but what if I had 3 running apps and the 999 rows also had a chance of duplicates?
So my question is this: is there a design pattern for this kind of situation?
This is a long question, so please let me know where to clarify. Thank you for your time.
EDIT:
The 1000 rows of data is in memory for each instance, but they cannot see the rows of each other. The only way they know if a row is a duplicate is when it's inserted into the database.
And if the current application design doesn't make sense, feel free to suggest better ways of tackling this problem. I would appreciate it very much.
http://www.oracle-developer.net/display.php?id=329
The simplest would be to avoid parallel processing of the same data. For example, your size or time based event could run only on one node or post a massage to a JMS queue, so only one of the nodes would process it (for instance, by using similar duplicate-check, e.g. based on a timestamp of the message/batch).

how to create a copy of a table in HBase on same cluster? or, how to serve requests using original state while operating on a working state

Is there an efficient way to create a copy of table structure+data in HBase, in the same cluster? Obviously the destination table would have a different name. What I've found so far:
The CopyTable job, which has been described as a tool for copying data between different HBase clusters. I think it would support intra-cluster operation, but have no knowledge on whether it has been designed to handle that scenario efficiently.
Use the export+import jobs. Doing that sounds like a hack but since I'm new to HBase maybe that might be a real solution?
Some of you might be asking why I'm trying to do this. My scenario is that I have millions of objects I need access to, in a "snapshot" state if you will. There is a batch process that runs daily which updates many of these objects. If any step in that batch process fails, I need to be able to "roll back" to the original state. Not only that, during the batch process I need to be able to serve requests to the original state.
Therefore the current flow is that I duplicate the original table to a working copy, continue to serve requests using the original table while I update the working copy. If the batch process completes successfully I notify all my services to use the new table, otherwise I just discard the new table.
This has worked fine using BDB but I'm in a whole new world of really large data now so I might be taking the wrong approach. If anyone has any suggestions of patterns I should be using instead, they are more than welcome. :-)
All data in HBase has a certain timestamp. You can do reads (Gets and Scans) with a parameter indicating that you want to the latest version of the data as of a given timestamp. One thing you could do would be to is to do your reads to server your requests using this parameter pointing to a time before the batch process begins. Once the batch completes, bump your read timestamp up to the current state.
A couple things to be careful of, if you take this approach:
HBase tables are configured to store the most recent N versions of a given cell. If you overwrite the data in the cell with N newer values, then you will lose the older value during the next compaction. (You can also configure them to with a TTL to expire cells, but that doesn't quite sound like it matches your case).
Similarly, if you delete the data as part of your process, then you won't be able to read it after the next compaction.
So, if you don't issue deletes as part of your batch process, and you don't write more versions of the same data that already exists in your table than you've configured it to save, you can keep serving old requests out of the same table that you're updating. This effectively gives you a snapshot.

Way to know table is modified

There are two different processes developed in Java running independently,
If any of the process modifyies the table, can i get any intimation? As the table is modified. My objective is i want a object always in sync with a table in database, if any modification happens on table i want to modify the object.
If table is modified can i get any intimation regarding this ? Do Database provide any facility like this?
We use SQL Server and have certain triggers that fire when a table is modified and call an external binary. The binary we call sends a Tib rendezvous message to notify other applications that the table has been updated.
However, I'm not a huge fan of this solution - Much better to control writing to your table through one "custodian" process and have other applications delegate to that. To enforce this you could change permissions on your table so that only your custodian process can write to the database.
The other advantage of this approach is being able to provide a caching layer within your custodian process to cater for common access patterns. Granted that a DBMS performs caching anyway, but by offering it at the application layer you will have more control / visibility over it.
No, database doesn't provide these services. You have to query it periodically to check for modification. Or use some JMS solution to send notifications from one app to another.
You could add a timestamp column (last_modified) to the tables and check it periodically for updates or sequence numbers (which are incremented on updates similiar in concept to optimistic locking).
You could use jboss cache which provides update mechanisms.
One way, you can do this is: Just enclose your database statement in a method which should return 'true' when successfully accomplished. Maintain the scope of the flag in your code so that whenever you want to check whether the table has been modified or not. Why not you try like this???
If you're willing to take the hack approach, and your database stores tables as files (eg, mySQL), you could always have something that can check the modification time of the files on disk, and look to see if it's changed.
Of course, databases like Oracle where tables are assigned to tablespaces, and tablespaces are what have storage on disk it won't work.
(yes, I know this is a bad approach, that's why I said it's a hack -- but we don't know all of the requirements, and if he needs something quick, without re-writing the whole application, this would technically work for some databases)

Categories

Resources