Concurrency of JPA when same DB used by other applications

Concurrency of JPA when same DB used by other applications - java

I am developing a Spring MVC JPA web application. When this application is deployed in live the same DB that my application interacts with will be used by other 2 Dotnet and VB applications at the same time. I am managing the concurrency of my JPA app through version column.
Will there be any issue when running these 3 apps at the same time for the same database? All systems are using the same tables.
I will have to build another application (most probably Spring MVC + JPA) for the same DB in the future. Will there be any issues when running both apps at the same time (in terms of persisting the same tables in both apps etc.)?

Multiple applications accessing the same database tables can be (and usually is) a concurrency nightmare. Just adding a version column to tables does not help because each application may use different concurrency management mechanisms. Common problems encountered when sharing a database across applications (assuming read-write access for all and in no particular order):
Concurrency mismatch: Imagine one application using optimistic locking throughout, another using pessimistic locking and a third using no locking at all (since all are maintained by different teams). Even if there is a central architecture group handing out good application architecture advice to everyone, developers can do whatever they like and end up causing concurrency hell.
Deadlocks: Imagine one application using SERIALIZABLE isolation level for all transactions and performing long-running operations on the database, causing queues, deadlocks and timeouts. Even if other applications do not corrupt data, they may end up seeing too many errors due to deadlocks and timeouts, reducing their usefulness.
Data validity: It is common for developers to use in-memory caches for storing fixed or slow-changing data to save repeated database roundtrips. In a JPA application backed by Hibernate, developers could use a second-level cache. However, if another application updates the database, the cache would be holding stale (and therefore inaccurate) data.
Data integrity: Different applications may use different parts of the data. If all are allowed to update the common data independently, how is referential integrity maintained? What about business rules? Will they have to be duplicated across the applications?
Team communication overhead: Each team has to keep the other teams informed about changes they need to make to the schema so that they do not step on to each others' toes. This may even reduce agility if other teams do not agree with changes required by a team or if priorities cannot be aligned.
Schema errors: What happens if someone deletes, renames, moves or archives tables or columns required by an application?
Access control: If the underlying data is sensitive and requires authentication and authorization, access control checks will have to be duplicated across the applications.
Ownership: Who owns the common tables becomes a challenge as each team may have its own stakeholders, roadmap, constraints and priorities.
Portability: If the underlying database (vendor and type) has to be changed some day, all applications will need to be changed.
Auditing and versioning: If common data needs to be audited and/or versioned (multiple versions of data rows), the code will either have to be duplicated across applications or built into the database (which may not be easy within the database, for example, knowing which application user changed a record). If done within the database, portability is affected again because the syntax may vary from one database vendor to another.
Database-specific optimizations: Some scenarios (for example, reporting) may require native queries or other optimizations that may not be possible or too hard to perform using an ORM technology like JPA. If multiple applications require access to optimized queries, such functionality will either have to be built into the database (stored procedures) or duplicated across the applications (in possibly different ways).
Archival and partitioning: Different applications may have different archival and partitioning requirements. Who makes sure that everyone's needs are met equally?
Standardization: If different teams are allowed to manage the common database objects on their own, how are things such as data dictionary, naming conventions, etc. managed? A table that is used by different teams may have differently (and mostly annoyingly) named columns, constraints, etc.
A better approach is to expose the common database tables using a service. The service can keep things consistent and controlled at all times. A common team handling changes to the service can ensure that unexpected changes are minimized and also communicated to all affected parties in a timely manner.

Related

Google App Engine Multitenancy API

According to the GAE docs on the Multitenancy API:
Multitenancy is the name given to a software architecture in which one instance of an application, running on a remote server, serves many client organizations (also known as tenants).
But isn't this what every web application is? Dozens, hundreds, maybe even thousands of users all logging in to the system, accessing the same software, but from inside the context of their own "user accounts"? Or is Google's Multitenancy API some kind of API for developing generic data abstraction layers that can be used as backends for multiple apps?
I guess I don't get the meaning of a "Google multi-tenant" app, and as such, don't understand the purpose or usefulness of the Multitenancy API. Thanks in advance for any clarity here!

Consider the standard way that multitenancy is implemented: You add a "tenant ID" field to one or more tables, then include that ID in a WHERE clause. And you index that field.
You could take the same approach in App Engine, adding an indexed property to some of your entities to hold a tenant ID, carefully including that ID in GQL WHERE clauses (or a filters). This'll cost you a bit more on writes (for the two indexes on that property), and more if the ID participates in queries that include other filters, as those would require additional composite indexes that include the ID.
Or you user our multitenancy API, which gives you the same effect without the additional costs for index writes. You get slightly simpler code, and less expense.

Multitenancy here doesn't refer to users of your app as such, but 'instances' of your app with 'separate' datastores.
They aren't really separate instances or separate datastores, as those requests might be served by a shared instance and they are definitely talking to the same datastore. However, by using the API you can set up your app so that the data is partitioned into separate namespaces which don't pollute each other.
If you have only one user on your app, then multi-users and multi-tenanting is pretty much the same thing. If you have multiple users, then generally you'll be sharing data between the users. If so, you can use multitenancy to share data within only a certain group of users and partition the rest off in their own tenancy.
As jtahlborn rightly states, each of our GAE apps is already a tenant on the GAE infrastructure. We aren't able to share data between different apps because they are completely partitioned from each other.
As Dave says, we could implement multitenancy ourselves by adding some kind of domain name or partition id to all our data. The API just gives an easier way to do that.

The difference is whose tenants you are talking about. GAE was multi-tenant from day one in that each program(tenant) ran in a common GAE infrastructure. however, initially, your program itself just managed one body of data (when GAE was first released). the GAE "multi-tenancy API" enables your single program to manage its(your) own tenants (so your tenants as opposed to GAE's tenants).
to state it concisely and confusingly: the "multi-tenancy API" allows you to manage your own tenants(users) within a single GAE program, which is in turn hosted as a tenant(program) within the GAE infrastructure.
in theory, of course, you could always have done this from day 1 in GAE, but all the work for managing the data between your tenants would have been handled in your code. the "multi-tenancy API" attempts to remove that pain from the programmer and make it much simpler to segment the data within your program.

Is it possible to access concurrently to the same DB from two different Servlets?

I'm developing a web application, based on a DB.
In my application, I access to this one from two different Servlets, and it's possible these accesses are concurrently.
I need to know if it is permitted, and if not, how can I do it?
Is there some trick to perform queries in a thread-safe way?

It is possible and how to handle will be database responsibility based on DB settings (isolation level settings).
Here are the isolation levels in SQL Server and these may vary based on DB.
1.Read uncommitted (the lowest level where transactions are isolated only enough to ensure that physically corrupt data is not read)
2.Read committed (Database Engine default level)
3.Repeatable read
4.Serializable (the highest level, where transactions are completely isolated from one another)

One of the primary design requirements for databases is concurrent access. Fact is, you are most probably already doing it in any one of your servlets since they can serve several requests in parallel, usng multiple db connections. Using two connections from one app is (almost) exactly the same thing as using two connections from two apps.

Database distribution

What are the possibilities to distribute data selectively?
I explain my question with an example.
Consider a central database that holds all the data. This database is located in a certain geographical location.
Application A needs a subset of the information present in the central database. Also, application A may be located in a geographical location different (and maybe far) from the one where the central database is located.
So, I thought about creating a new database at the same location of application A that would contain a subset of information of the central database.
Which technology/product allow me to deploy such a configuration?
Thanks

Look for database replication. SQL Server can do this for sure, others (Oracle, MySQL, ...) should have it, too.
The idea is that the other location maintains a (subset) copy. Updates are exchanged incrementally. The way to treat conflicts depends on your application.

Most major database software such as MySql and SQL server can do the job, but it
is not a good model. With the growth of the application (traffic and users),
not only will you create a load on the central database server (which might be serving
other applications),but you will also be abusing your network bandwidth to transfer data
between the far away database and the application server.
A better model is to keep your data close to the application server, and use the far away
database for backup and recovery purposes only. You can use an FC\IP SAN (or any other
storage network architecture) as your storage network model, based on your applications' needs.

One big question that you didn't address is if Application A needs read-only access to the data or if it needs to be read-write.
The immediate concept that comes to mind when reading your requirements is sharding. In MySQL, this can be accomplished with partitioning. That being said, before you jump into partitions, make sure you read up on their pros and cons. There are instances where partitioning can slow things down if your indexes are not well chosen, or your partitioning scheme is not well thought out.
If your needs are read-only, then this should be a fairly simple solution. You can use MySQL in a Master-Slave context, and use App A off a slave. If you need read-write, then this becomes much more complex.
Depending on your write needs, you can split your reads to your slave, and your writes to the master, but that significantly adds complexity to your code structure (need to deal with multiple connections to multiple dbs). The advantage of this kind of layout is that you don't need to have complex DB infrastructure.
On the flip side, you can keep your code as is, and use a Master-Master replication in MySQL. Although not officially supported by Oracle, a lot of people have had success in this. A quick Google search will find you a huge list of blogs, howtos, etc. Just keep in mind that your code has to be properly written to support this (ex: you cannot use auto-increment fields for PKs, etc).
If you have cash to spend, then you can look at some of the more commercial offerings. Oracle DB and SQL Server both support this.
You can also use Block Based data replication, such as DRDB (and Mysql DRDB) to handle the replication between your nodes, but the problem you always will encounter is what happens if your link between the two nodes fails.
The biggest issue you will encounter is how to handle conflicting updates in 2 separate DB nodes. If your data is geographically dependent, then this may not be an issue for you.
Long story short, this is not an easy (or inexpensive) problem to resolve.

It's important to address the possibility of conflicts at the design phase anytime you are talking about replicating databases.
Moving on from that, SAP's Sybase Replication Server will allow you to do just that, either with Sybase database's or 3rd party databases.
In Sybase's world this is frequently called a corporate roll-up environment. There may be multiple geographically seperated databases each with a subset of data which they have primary control over. At the HQ, there is a server that contains all the various subsets in one repository. You can choose to replicate whole tables, or replicate based on values in individual rows/columns.
This keeps the databases in a loosely consistent state. Transaction rates, Geographic separation, and the latency that can be inherent to network will impact how quickly updates move from one database to another. If a network connection is temporarily down, Sybase Replication Server will queue up transaction, and send them as soon as the link comes back up, but the reliability and stability of the replication system will be affected by the stability of the network connection.
Again, as others have stated it's not cheap, but it's relatively straight forward to implement and maintain.
Disclaimer: I have worked for Sybase, and am still part of the SAP family of companies.

SaaS / Multi-Tenancy approaches for Java-based (GWT, Spring, Hibernate) web applications

I am currently looking into converting a single-tenant Java based web-app that uses Spring, GWT, Hibernate, Jackrabbit, Hibernate Search / Lucene (among others) into a fully fledged SaaS style app.
I stumbled across an article that highlights the following 7 "things" as important changes to make to a single tenant app to make it an SaaS app:
The application must support multi-tenancy.
The application must have some level of self-service sign-up.
There must be a subscription/billing mechanism in place.
The application must be able to scale efficiently.
There must be functions in place to monitor, configure, and manage the application and tenants.
There must be a mechanism in place to support unique user identification and authentication.
There must be a mechanism in place to support some level of customization for each tenant.
My question is has anyone implemented any of the above 7 things in a SaaS /multi-tenant app using similar technologies to those that I have listed? I am keen to get as much input regarding the best ways to do so before I go down the path that I am currently considering.
As a start I am quite sure that I have a good handle on how to handle multiple tenants at a model level. I am thinking of adding a tenant ID to all of our tables and then using a Hibernate filter (and a Full Text Filter for Hibernate Search) to filter based on the logged on user's tenant ID for all queries.
I do however have some concerns around performance as well especially when our number of tenants grows quite high.
Any suggestions on how to implement such a solution will be greatly appreciated (and I apologise if this question is a bit too open-ended).

I would recommend that you architect your application to support all the 4 types of tenant isolation namely separate database for each tenant, separate schema for each tenant, separate table for each tenant and shared table for all tenants with a tenant ID. This will give you the flexibility to horizontally partition your database as you grow, having multiple databases each having a group of smaller tenants and also the ability to have a separate database for some large tenants. Some of your large tenants could also insist that their data (database) should reside in their premise, while the application can run off the cloud.
Here is an exaustive check list of non-functional and infrastructure level features that you may want to consider while architecting your application (some of them you may not need immediately, but think of a business situation of how you will handle such a need if your competition starts offering it)
tenant level customization of a) UI themes and logos b) forms and grids, c) data model extensions and custom fields, d) notification templates, e) pick up lists and master data
tenant level creation and administration of roles and privileges, field level access permissions, data scope policies
tenant level access control settings for modules and features, so that specific modules and features could be enabled / disabled depending on the subscription package.
Metering and monitoring of tasks / events / transactions and restriction of access control once the purchased quota is exceeded. The ability to meter any new entity in the future if and when your business model changes.
Externalising the business rules and workflows out of your code base and representing them as meta data, so that you can customize them for each tenant group / tenant.
Query builder for creating custom reports that is aware of the tenant as well as custom fields added by specific tenants.
Tenant encapsulation and framework level connection string management such that your developers do not have to worry about tenant IDs while writing queries.
All these are based on our experience in building a general purpose multi-tenant framework that can be used for any domain or application. Unfortunately, you cannot use our framework as it is based on .NET
But the engineering needs of any multi-tenant SaaS product (new or migrated) are the same irrespective of the technology stack that you use.

All of the technologies that you listed are quite common and reasonable for both single- and multi-tenant applications. I'd say supporting the 7 "things" for SaaS is much more of a function of how you use the technologies than which. It sounds like you already have a single-tenant application that works. So there's probably not much reason to deviate from the technology selections there unless something is just not working very well already. Your question is otherwise fairly open-ended though, so it's hard to be too much more specific there.
I do have some feedback on splitting the database (and perhaps other things) by tenant ID though. If you know you might eventually have a lot of tenants (say many thousands or more, particularly if they're small) then what you suggest is perhaps best. If however you'll have a smaller number of tenants (particularly if they're large) you might want to consider a database per tenant, so they each have their own table space. By that I mean a single database installation with multiple instances of the same schema inside of it, one per tenant.
There are a few reasons this can be an advantage. One is performance as you mentioned. Adding a tenant ID to every single table is overhead on disk access, query time and increases code complexity. Every index in the database will need to include the tenant ID as well. You run an additional risk of mixing data between tenants if you're not careful (although a Hibernate filter would help mitigate that). With a database per tenant you could restrict access to only the correct one. Porting your current application will probably be a lot easier too, you basically just need to intercept your request somewhere early to decide the tenant based on the URL and point to the right database. Backups are also easy to do per tenant, particularly useful if you ever intend on allowing them to download a backup.
On the other hand there are reasons not to do this. You'll have a lot of database schemas to deal with and they'll have to be updated independently (which can actually be an advantage if you want to avoid taking all tenants down for a schema change, you can roll them out incrementally). It lets you have special cases that could deviate from treating the platform as a true multi-tenant SaaS deployment that's upgraded all at once, resulting in management of multiple versions in production. Lastly I've heard there is a breaking point with just about every database vendor out there in the number of schema instances they'll support in one installation (supposedly some can go to hundreds of thousands though).
It really depends on your use case of course. You mentioned single-tenant which leads me to believe you don't have too many tenants right now, however you do mention growing to lots of tenants. I'm not sure if you mean hundreds or millions, yet either way I hope this helps some with your considerations. Best of luck!

There is no simple answer. I can describe my own solution. It may serve as an inspiration for the others.
tenant per database (postgres)
one additional database shared between tenants
Spring + MyBatis
Spring Security authentication
Details here: http://blog.trixi.cz/2012/01/multitenancy-using-spring-and-postgresql/

For (1): Hibernate supporting multi-tenant configurations out of the box from version 4.
At the moment of writing supported are DB-per-tenant and schema-per-tenant and keeping all tenants in a same DB using discriminator is not yet supported. We have used this functionality successfully in our application (DB-per-client approach).
For (3): After some investigation done we decided to go with Braintree to implement billing. Another solutions many people recommend: Authorize.net, Stripe, PayPal.
For (4): We have used clustered configuration with Hibernate/Spring and JBoss Cache for 2nd level caching. At these days this became "common" and using PaaS services like Jelastic you can even get it pre-configured out of the box.

What you describe is a full service Saas style application serving multiple tenants. There are a few things you have to decide like how critical is data isolation? If you are building for a medical or financial domain, data isolation is a critical factor.
Well, I cannot help answer all your points, but I would suggest looking at database-per-tenant approach for your application as it provides the highest level of data isolation.
Since you are using the Java, Spring, Hibernate stack, I can help you with a small example application I wrote. It is a working example which you can quickly run in your local laptop. I have shared it here. Do take a look and let me know if it answers some of your questions.

Standard Workflow when working with JPA

I am currently trying to wrap my head around working with JPA. I can't help but feel like I am missing something or doing it the wrong way. It just seems forced so far.
What I think I know so far is that their are couple of ways to work with JPA and tools to support this.
You can do everything in Java using annotations, and let JPA (whatever implementation you decide to use) create your schema and update it when changes are made.
You can use a tool to reverse engineer you database and generate the entity classes for you. When the schema is updated you have to regenerate these classes, or manually update them.
There seems to be drawbacks to both, and benefits to both (as with all things). My question is in an ideal situation what is the standard workflow with JPA? Most schemas will require updates during the maintenance phase and especially during the development phase, so how is this handled?

It's not always a good approach to generate the DB schema from the annotated entities. Although in theory it sounds great - in practice often the generated schema is not optimal and would not satisfy and experienced DBA.
The approach that I follow in my workflow is to create the entities and db schema separately, while still using a pretty intelligent tool for the schema creating - either something like Liquibase, that is database agnostic, supports revisions, rollbacks, etc... or a custom baked migration tool that simply runs heavily optimized db specific sql scripts.
It probably sounds to you less than ideal, but I can assure it gets the jobs done and keep your schema related code consistent since, as grigory pointed out - not everything related to the database can be generated from the entities anyways.
I can, however, be useful to generate the schema from the entities for the test database against which unit and integration tests are being run. Assuming you're using say PostgreSQL is production you might decide to speed things up for the unit tests running some embedded in-memory database like H2 which gets created from the entities before the tests are started and disappears automatically(since it was in-memory) after the tests finish executing. This is a very common practice.

As usual the answer is it depends...
Ideal approach (in ideal world) would probably be your 1st option: maintain everything using JPA annotations and forward engineer database artifacts using utility tool (e.g. use Hibernate Maven plugin).
It depends on the level of support for your database artifacts - not everything either belongs or suitable for annotations. That is why my projects usually use parallel maintenance for both and using unit tests to keep them in sync.
It also depends on resources available. If you have a dedicated DBA who is responsible for your database then delegating maintenance to her would make sense.
Other consideration is how much database development is really done in JPA. Are there also stored procedures or other non-JPA applications that use the same back-end, or maybe you just integrate with other team's database...

If this is an existing application, I would check what you have existing, if the database structure complex as can be seen with the DDL and the DDL shows significant logic is being done on the database itself, then you are better off using plain SQLs and let the DBA maintain your data structures. JPA does not really lend well when the database structures are already complicated and there is no business benefit to use JPA at that point.
What needs to happen is a project to migrate to JPA. There are a few advantages to that:
Business Logic is removed from the database layer (which is harder to scale horizontally) to the application tier.
Java developers are generally cheaper compared to a DBA. Though you still need someone who can do both database thinking and Java thinking to do this properly and that's more rare.
By reducing the database to become as simple datastore, you can break yourself from vendor lock-in.
If done right, you can have a different database for development (can be DB2 Express C which is free) and have a more robust database for your integration and production environments (e.g. DB/2 for zOS). This allows you to be able to have more developers without worrying about licensing costs as much.
As for schemas being generated and such, there are actually four workflows that can occur:
For design, an Object-Relational (rather than an Entity-Relational) diagram serves as a contract between the application team and the database team. The end result is the JPA objects will run in the physical data structure that the DBA sets.
For Java application development, just let each developer have their own database and let them blow it up as much as they want. The JPA code will generate the schemas for you.
For database development, the generated schemas and class diagrams are passed onto review by the DBA to see where performance can be improved upon. Specifically they are there to specify the indices which are not available in the JPA standard since it is not cross database. They are also there to set up the table spaces and all the access controls and schemas for the development, but at least the gist of the structure can be taken away from them and passed onto the application team which gives the application team more flexibility to adapt to changes. What would normally happen is the DBA just includes some generated SQL and then have alters to add additional columns and others that would be used for other purposes outside of the application (the JPA structure needs only what is needed by the application, it does not need to map one-to-one 100% to the database)
For migration, the DBA needs to do a differential analysis between the two schemas. There's a program called dbsolo (not free) that can do it with most databases. However, if things were done in JPA, the structures are simpler since in theory there is no longer any business logic on the database thus reducing the complexity of data migrations due to upgrades.
The net of it is you can't just say you're using JPA without involving the whole delivery team which will have to include the DBA willing to relinquish control and ownership of the structure of the data to the application team, but still be part of the design and reviews.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.