How to manage transactions over multiple databases

How to manage transactions over multiple databases - java

I have a web application which receives requests to save orders in a database. I want to write to 2 different databases - one Cassandra instance and one PostgreSQL instance. I am using plain Java and JDBC (with apache DBUtis) with a lightweight web application library at the front.
What I am unsure about is how to implement transactionality across the two databases, i.e. if a write to one of the databases fails, then rollback the other write and put an error message in the error log.
Are there any mechanisms in Java to implement this? I know of such a thing as two phase commit, is that what I would be looking for here? Are there any alternatives?

Both Cassandra & PostgreSQL support linearizability and compare-and-set (CAS), so you can implement transactions on the client side.
If you want Serializable Isolation level then you should take a look on the Percolator's transactions. The Percolator's transactions are quite known in the industry and have been used in the Amazon's DynamoDB transaction library, in the CockroachDB database and in the Google's Pecolator system itself. A step-by-step visualization of the Percolator's transactions may help you to understand it.
If you expect contention and can deal with Read Committed isolation level then RAMP transactions by Peter Bailis may suit you. I also created a step-by-step RAMP visualization.
The third approach is to use compensating transactions also known as the saga pattern. It was described in the late 80s in the Sagas paper but became more actual with the raise of distributed systems. Please see the Applying the Saga Pattern talk for inspiration.

Related

Lightweight In-Process Distributed Transaction Manager for Java?

I'm trying to extend Clojure to add durability to refs in a way that allows users to choose which data store instances to persist to. That requires distributed transactions. Are there any really lightweight, in-process distributed transaction managers, supporting XA, for Java? If not, and I have to roll my own, are there any good resources explaining what a distributed transaction coordinator has to support? Specifically, I'm having trouble understanding what the semantics of the 3 parts of an XID are really supposed to be. As an initial implementation, I'm using BDB JE.

I know these two:
Bitronix: This is the one we are using currently, it seems to work OK and it is easy to configure.
Atomikos: We have tried this, but it is a little harder to configure than Bitronix, and it has some hardcoded dependencies to java.util.logging which we did not want. It should more feature-complete than Bitronix as it is an open source version of a commercially supported product.

http://www.atomikos.com should do what you are looking for...

Java Database Connection Pool for multiple databases (Shards)

I was wondering what the best technique for implementing a DB connection pool for a web application which uses shards. From what I can tell most (all?) open-source implementations only support a single database behind. At least, I have not found one that supports shards.
Also, even though I using shards not all of the database will have the same schema as I will have other databases too. I'm not sure if that's important to mention.
The only solution that I can come up with so far is to write a layer that sits on top of multiple and distinct pools. Each distinct pool can be any of the available single database implementations.
Are there already solutions for this? What would be the best technique otherwise?
Thanks in advance,
Stephen.

I don't think there is an open-source implementations that support sharding. Maybe, there isn't a real need since creating a layer on top of multiple database pools is not too hard. It only takes a shard mapping function(e.g. hash function) and a manager class to track multiple pools.
If you are worried that not all db have the same schemas, you can put additional schema tracking config into your manager class, so it knows which shards can serve the schema. That's, you need to track schema to shard info in additional to db pool. This is not really much additional work since you need the shard config anyway to determine how to pull the right shard from the pool (e.g. User id mod 10 = 1 should pull from Shard 1)
Good luck

What is the technology stack that you are using currently? I know that Hibernate has a sharding project, but I have not used it, just listened to some podcasts about it.
More information about it can be found here. Also, the previously mentioned podcast could be found here.
The podcast explains what a few of the issues with sharding in general, some of the hurtles the Hibernate plugin has taken care of, and then explains their anticipated path forward. Hope that helps a little bit!

There is the hibernate shards project you could take a look at.

Is it possible to use more than one persistence unit in a transaction, without it being XA?

Our application needs to use (read-only) a couple different persistence units pointing to different databases (different, commercial vendors as well).
We do not have the budget to enable 2pc on one of them (Sybase). Is there a way to use these in a transaction without it having to be an XA transaction?
We're using Websphere 6.1, Sybase 12.5.3, Oracle 10g, Java EE 5, and JPA with Hibernate Entity Manager.
Update: The oracle PU is updated rarely 1 or 2 per month, the sybase PU is updated very frequently -- many times per day. Isolation is definitely a concern for the latter, consistency between the two is not necessary to enforce.

Careful.
Read-only does not always mean that 2PC does not apply. If you have two databases, and you read both but only update one, you need a transaction to guarantee consistent results. Suppose you have a scenario where you read database A, then use those results to read and update database B. If you fail to use a transaction with database A, then it is possible that while your operation is active, the data you have read from database A can be read and updated by another application. In this case you can get inconsistent data in database B.
If you truly are reading BOTH databases and updating neither, again you may think that a distributed transaction and its accompanying locking is unnecessary. Once again though, maybe not. You may get inconsistent reads in this scenario as well, if other applications are updating the same databases. It depends on your requirements and the other users of the database.
I would suggest reading up on isolation levels to get some insight into the locking that applies, even during read operations, for all durable stores like databases. Transactional locking may be unnecessary; for example it is unnecessary if you are dealing with data that effectively does not change (no writes by any app).
Maybe there is a business solution here - negotiate with your vendor to drop the price of XA enablement, and pay it. With the economy, you may get a deal you can afford. Side note: I am surprised that you can license a database and NOT get transactions. I was not aware that it was possible to license Sybase in that way.

Atomikos TransactionsEssentials is a free, open source JTA/XA with connection pools for JDBC (and JMS).
One of its features is its added support for non-xa datasources. If readonly (your case) it is safe and easy to use our non-xa datasource to include your Sybase into a JTA transaction.
Best
Guy

An alternative to Hibernate or TopLink? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
Is there a viable alternative to Hibernate? Preferably something that doesn't base itself on JPA.
Our problem is that we are building a complex (as in, many objects refer to each other) stateful RIA system. It seems as Hibernate is designed to be used mainly on one-off applications - JSF and the like.
The problem is mainly that of lazy loading. Since there can be several HTTP requests between the initialization and actually loading lazy collections, a session per transaction is out of the question. A long-lived session (one per application) doesn't work well either, because once a transaction hits a snag and throws an exception, the whole session is invalidated, thus the lazy loaded objects break. Then there's all kinds of stuff that just don't work for us (like implicit data persisting of data from outside an initialized transaction).
My poor explanations aside, the bottom line is that Hibernate does magic we don't like. It seems like TopLink isn't any better, it also being written on top of EJB.
So, a stateless persistence layer (or even bright-enough object-oriented database abstraction layer) is what we would need the most.
Any thoughts, or am I asking for something that doesn't exist?
Edit: I'm sorry for my ambiguous terminology, and thank you all for your corrections and insightful answers. Those who corrected me, you are all correct, I meant JPA, not EJB.

If you're after another JPA provider (Hibernate is one of these) then take a look at EclipseLink. It's far more fully-featured than the JPA 1.0 reference implementation of TopLink Essentials. In fact, EclipseLink will be the JPA 2.0 reference implementation shipped with Glassfish V3 Final.
JPA is good because you can use it both inside and outside a container. I've written Swing clients that use JPA to good effect. It doesn't have the same stigma and XML baggage that EJB 2.0/2.1 came with.
If you're after an even lighter weight solution then look no further than ibatis, which I consider to be my persistence technology of choice for the Java platform. It's lightweight, relies on SQL (it's amazing how much time ORM users spend trying to make their ORM produce good SQL) and does 90-95% of what JPA does (including lazy loading of related entities if you want).
Just to correct a couple of points:
JPA is the peristence layer of EJB, not built on EJB;
Any decent JPA provider has a whole lot of caching going on and it can be hard to figure it all out (this would be a good example of "Why is Simplicity So Complex?"). Unless you're doing something you haven't indicatd, exceptions shouldn't be an issue for your managed objects. Runtime exceptions typically rollback transactions (if you use Spring's transaction management and who doesn't do that?). The provider will maintain cached copies of loaded or persisted objects. This can be problematic if you want to update outside of the entity manager (requiring an explicit cache flush or use of EntityManager.refresh()).

As mentioned, JPA <> EJB, they're not even related. EJB 3 happens to leverage JPA, but that's about it. We have a bunch of stuff using JPA that doesn't even come close to running EJB.
Your problem is not the technology, it's your design.
Or, I should say, your design is not an easy fit on pretty much ANY modern framework.
Specifically, you're trying to keep transactions alive over several HTTP requests.
Naturally, most every common idiom is that each request is in itself one or more transactions, rather than each request being a portion of a larger transaction.
There is also obvious confusion when you used the term "stateless" and "transaction" in the same discussion, as transactions are inherently stateful.
Your big issue is simply managing your transactions manually.
If you transaction is occurring over several HTTP requests, AND those HTTP requests happen to be running "very quicky", right after one another, then you shouldn't really be having any real problem, save that you WILL have to ensure that your HTTP requests are using the same DB connection in order to leverage the Databases transaction facility.
That is, in simple terms, you get a connection to the DB, stuff it in the session, and make sure that for the duration of the transaction, all of your HTTP requests go through not only that same session, but in such a way that the actual Connection is still valid. Specifically, I don't believe there is an off the shelf JDBC connection that will actually survive failover or load balancing from one machine to another.
So, simply, if you want to use DB transactions, you need to ensure that your using the same DB Connection.
Now, if your long running transaction has "user interactions" within it, i.e. you start the DB transaction and wait for the user to "do something", then, quite simply, that design is all wrong. You DO NOT want to do that, as long lived transactions, especially in interactive environments, are just simply Bad. Like "Crossing The Streams" Bad. Don't do it. Batch transactions are different, but interactive long lived transactions are Bad.
You want to keep your interactive transactions as short lived as practical.
Now, if you can NOT ensure you will be able to use the same DB connection for your transaction, then, congratulations, you get to implement your own transactions. That means you get to design your system and data flows as if you have no transactional capability on the back end.
That essentially means that you will need to come up with your own mechanism to "commit" your data.
A good way to do this would be where you build up your data incrementally into a single "transaction" document, then feed that document to a "save" routine that does much of the real work. Like, you could store a row in the database, and flag it as "unsaved". You do that with all of your rows, and finally call a routine that runs through all of the data you just stored, and marks it all as "saved" in a single transaction mini-batch process.
Meanwhile, all of your other SQL "ignores" data that is not "saved". Throw in some time stamps and have a reaper process scavenging (if you really want to bother -- it may well be actually cheaper to just leave dead rows in the DB, depends on volume), these dead "unsaved" rows, as these are "uncomitted" transactions.
It's not as bad as it sounds. If you truly want a stateless environment, which is what it sounds like to me, then you'll need to do something like this.
Mind, in all of this the persistence tech really has nothing to do with it. The problem is how you use your transactions, rather than the tech so much.

I think you should have a look at apache cayenne which is a very good alternative to "big" frameworks. With its decent modeler, the learning curve is shorten by a good documentation.

I've looked at SimpleORM last year, and was very impressed by its lightweight no-magic design. Now there seems to be a version 3, but I don't have any experience with that one.

Ebean ORM (http://www.avaje.org)
It is a simpler more intuitive ORM to use.
Uses JPA Annotations for Mapping (#Entity, #OneToMany etc)
Sessionless API - No Hibernate Session or JPA Entity Manager
Lazy loading just works
Partial Object support for greater performance
Automatic Query tuning via "Autofetch"
Spring Integration
Large Query Support
Great support for Batch processing
Background fetching
DDL Generation
You can use raw SQL if you like (as good as Ibatis)
LGPL licence
Rob.

BEA Kodo (formerlly Solarmetric Kodo) is another alternative. It supports JPA, JDO, and EJ3. It is highly configurable and can support agressive pre-fetching, detaching/attaching of objects, etc.
Though, from what you've described, Toplink should be able to handle your problems. Mostly, it sounds like you need to be able to attach/detach objects from the persistence layer as requests start and end.

Just for reference, why the OP's design is his biggest problem: spanning transactions across multiple user requests means you can have as many open transactions at a given time as there are users connected to your app - a transaction keeps the connection busy until it is committed/rolled back. With thousand of simultaneously connected users, this can potentially mean thousands of connections. Most databases don't support this.

Neither Hibernate nor Toplink (EclipseLink) is based on EJB, they are both POJO persistancy frameworks (ORM).
I agree with the previous answer: iBatis is a good alternative to ORM frameworks: full control over sql, with a good caching mechanism.

One other option is Torque, I am not saying it is better than any of the options mentioned above but just that it is another option to look at.
It is getting quite old now but may fit some of your requirements.
Torque

When I was myself looking for a replacement to Hibernate I stumbled upon DataNucleus Access Platform, which is an Apache2-licensed ORM. It isn't just ORM as it provides persistence and retrieval of data also in other datasources than RDBMS, like LDAP, DB4O and XML. I don't have any usage experience, but it looks interesting.

Consider breaking your paradigm completely with something like tox. If you need Java classes you could load the XML result into JDOM.

What JDBC tools do you use for synchronization of data sources?

I'm hoping to find out what tools folks use to synchronize data between databases. I'm looking for a JDBC solution that can be used as a command-line tool.
There used to be a tool called Sync4J that used the SyncML framework but this seems to have fallen by the wayside.

I have heard that the Data Replication Service provided by Db4O is really good. It allows you to use Hibernate to back onto a RDBMS - I don't think it supports JDBC tho (http://www.db4o.com/about/productinformation/drs/Default.aspx?AspxAutoDetectCookieSupport=1)
There is an open source project called Daffodil, but I haven't investigated it at all. (https://daffodilreplicator.dev.java.net/)
The one I am currently considering using is called SymmetricDS (http://symmetricds.sourceforge.net/)
There are others, they each do it slightly differently. Some use triggers, some poll, some use intercepting JDBC drivers. You need to decide what technical limitations you are under to determine which one you really want to use.
Wikipedia provides a nice overview of different techniques (http://en.wikipedia.org/wiki/Multi-master_replication) and also provides a link to another alternative DBReplicator (http://dbreplicator.org/).

If you have a model and DAO layer that exists already for your codebase, you can just create your own sync framework, it isn't hard.
Copy data is as simple as:
read an object from database A
remove database metadata (uuid, etc)
insert into database B
Syncing has some level of knowledge about what has been synced already. You can either do it at runtime by getting a list of uuids from TableInA and TableInB and working out which entries are new, or you can have a table of items that need to be synced (populate with a trigger upon insert/update in TableInA), and run from that. Your tool can be a TimerTask so databases are kept synced at the time granularity that you desire.
However there is probably some tool out there that does it all without any of this implementation faff, and each implementation would be different based on business needs anyway. In addition at the database level there will be replication tools.

True synchronization requires some data that I hope your database schema has (you can read the SyncML doc to see how they proceed). Sync4J won't help you much, it's really high-level and XML oriented. If you don't foresee any conflicts (which means: really easy synchronisation), you could try with a lightweight ETL like Enhydra Octopus.

I'm primarily using Oracle at the moment, and the most full-featured route I've come across is Red Gate's Data Compare:
http://www.red-gate.com/products/oracle-development/data-compare-for-oracle/
This old blog gives a good summary of the solution routes available:
http://www.novell.com/coolsolutions/feature/17995.html
The JDBC-specific offerings I've come across have been very basic. The solution mentioned by Aidos seems the most feature complete if you want to go down the publish-subscribe route:
http://symmetricds.codehaus.org/
Hope this helps.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.