I was wondering what the best technique is for implementing a DB connection pool for a web application which uses shards. From what I can tell, most (all?) open-source implementations only support a single database behind the pool. At least, I have not found one that supports shards.
Also, even though I am using shards, not all of the databases will have the same schema, as I will have other databases too. I'm not sure if that's important to mention.
The only solution I can come up with so far is to write a layer that sits on top of multiple distinct pools. Each distinct pool can be any of the available single-database implementations.
Are there already solutions for this? What would be the best technique otherwise?
Thanks in advance,
Stephen.
I don't think there are any open-source implementations that support sharding. Maybe there isn't a real need, since creating a layer on top of multiple database pools is not too hard. It only takes a shard mapping function (e.g. a hash function) and a manager class to track the multiple pools.
If you are worried that not all DBs have the same schema, you can put additional schema-tracking config into your manager class, so it knows which shards can serve which schema. That is, you need to track schema-to-shard info in addition to the DB pools. This is not really much additional work, since you need the shard config anyway to determine how to pull the right shard from the pool (e.g. user id mod 10 = 1 should pull from shard 1).
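For illustration, a minimal sketch of such a layer; ShardedPoolManager and the modulo mapping are made-up names, not an existing library, and each underlying DataSource can be any single-database pool implementation (DBCP, c3p0, ...):

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.util.List;
    import javax.sql.DataSource;

    // Illustrative manager class that fronts one pool per shard.
    public class ShardedPoolManager {

        private final List<DataSource> shardPools;

        public ShardedPoolManager(List<DataSource> shardPools) {
            this.shardPools = shardPools;
        }

        // Shard mapping function: user id mod shard count
        // (assumes non-negative ids).
        private int shardFor(long userId) {
            return (int) (userId % shardPools.size());
        }

        public Connection connectionFor(long userId) throws SQLException {
            return shardPools.get(shardFor(userId)).getConnection();
        }
    }

The schema-to-shard tracking mentioned above would just be one more map inside the same manager class.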
Good luck
What is the technology stack that you are using currently? I know that Hibernate has a sharding project, but I have not used it, just listened to some podcasts about it.
More information about it can be found here. Also, the previously mentioned podcast can be found here.
The podcast explains a few of the issues with sharding in general, some of the hurdles the Hibernate plugin has taken care of, and their anticipated path forward. Hope that helps a little bit!
There is the Hibernate Shards project you could take a look at.
Within our company it's kind of a standard to create repositories for data which is originally stored in the database, as described for example in https://thinkinginobjects.com/2012/08/26/dont-use-dao-use-repository/.
Our web infrastructure consists of a few independent web applications within Tomcat 7 for printing, product description, product ordering (this one is not persisted in the database!), category description etc.
They are all built on the Servlet 2 API.
So each instance/implementation of a repository holds a specialised kind of data represented by serializable classes, and the instances of these serializable classes are set up/filled by a periodically executed database query (for every result row the setters of the fields are called; it reminds me of domain-oriented entity beans with CMP).
The repositories are initialized in the servlets' init sequences (so every servlet keeps its own set of instances).
Each context has its own connection to the Oracle database (set up by a resource description file on deployment).
All the data is read-only; we never need to write back to the database.
Because we need some of these data types for more than one web application (context), and some even for more than one servlet within the same web context, repositories with an identical data type are instantiated more than once - e.g. four times, twice within the same application.
In the end some of the data is duplicated, and I'm not sure if this is as clever and efficient as it should be. It should be possible to share the same repository object across more than one application (JNDI?), but at the very least it must be possible to share it between several servlets within the same application context.
Besides, I'm puzzled by the idea of using a "self-built" repository instead of something like a well-tested, openly developed cache (Ehcache, JCS, ...), because some of these caches also provide options for distributed caching (so it should also work within the same container).
If certain entries are searched for, the search algorithm iterates over all entries in the repository (see link above). For every search pattern there are specialised functions which are called directly from within the business logic classes using the "entity beans"; there's no specification object or interface.
In the end the application server as a whole does not perform that well, and it uses a hell of a lot of RAM (at least for approximately 10000 DB entries); in my opinion this is most probably correlated with the use of serializable XSD-to-JAXB-generated classes.
Additionally, every time an application is deployed for tests you have to wait at least two minutes until all entries of the database have been loaded into the repositories - and when deploying to live there's a clearly recognizable out-of-service phase on context/servlet start-up.
I tend to think all of this is closely related to the solutions I described above.
Because I haven't got any experience in this field and I'm new in the company, I don't want to be too obtrusive.
Maybe you can help me to evaluate ideas for a better setup:
Is it better for performance and memory to unify all the repositories into one "repository servlet" and request objects from there via HTTP (I don't think so, though it seems quite modular/distributed-system friendly), or should I try to go with JNDI (never did that before) and connect to the repository similarly to a JDBC database?
Wouldn't it be even more sensible, faster and more efficient to use only one single connection pool for the whole Tomcat instance (and reference this connection pool from within each web app's deployment descriptor)? Or might that slow down connections or limit them in any other respect? (A sketch of how such a lookup might work follows these questions.)
I was told that the cache system (Ehcache) didn't work well (at least not with the performance of the self-written solution - though: I can't believe that). I imagine that repositories backed by a distributed (as in: shared across all contexts) cache used in all web applications should not only reduce the memory footprint significantly but also shouldn't be significantly slower. - I believe it would be faster, have shorter start-up times, and wouldn't need to be redeployed as often.
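For the shared-pool question, here is a rough sketch of how the application-side lookup might look, assuming the pool is declared once in Tomcat's server.xml (GlobalNamingResources) and linked into each context with a ResourceLink; the resource name jdbc/SharedOracleDS is made up:

    import java.sql.Connection;
    import javax.naming.InitialContext;
    import javax.sql.DataSource;

    // Hypothetical lookup of a Tomcat-wide pool shared by all contexts.
    public class SharedPoolLookup {
        public static Connection getConnection() throws Exception {
            InitialContext ctx = new InitialContext();
            DataSource ds = (DataSource) ctx.lookup("java:comp/env/jdbc/SharedOracleDS");
            return ds.getConnection();
        }
    }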
I'm very grateful for every tip or hint and your thoughts. Would be marvellous to get a peer review of my ideas based on practical experiences.
So thank you very much in advance!
Is it better to hold a repository for every web application (context), or is it better to share a common instance by JNDI or a similar technique?
Unless someone proves me otherwise, I would say there is no way to do it in a standard way, meaning as defined in the Servlet Spec or in the rest of the Java EE spec canon.
There are technical ways to do it which probably depend on a specific application server implementation, but this cannot be "better" in its universal sense.
If you have two applications that operate on the same data, I wonder whether the partitioning of the applications is useful. Maybe all functionality operating on some kind of data needs to be in the same application?
Within our company it's kind of a standard to create repositories for data which is originally stored in the database, as described for example in https://thinkinginobjects.com/2012/08/26/dont-use-dao-use-repository/.
I looked up Evans on our book shelf. The blog post is quite weird. A repository and a DAO are basically the same thing: each provides CRUD operations for an object or for a tree of objects (Evans says only for the aggregate roots).
The repositories are initialized in the servlets' init sequences (so every servlet keeps its own set of instances). Each context has its own connection to the Oracle database (set up by a resource description file on deployment). [ ... ]
In the end the application server as a whole does not perform that well, and it uses a hell of a lot of RAM
When something performs badly, it's best to do profiling, e.g. with YourKit, or with perf and FlameGraphs if you are on Linux. If your applications need a lot of RAM, analyze the heap, e.g. with Eclipse MAT. There is no way somebody can give you a recommendation or a hint on best practice without seeing a single line of code.
A general answer would include anything about performance tuning for Oracle DBs, JDBC, Java collections and concurrent programming, networking and operating systems.
I was told that the cache system (Ehcache) didn't work well (at least not with the performance of the self-written solution - though: I can't believe that)
I can. EHCache is between 10 and 20 times slower than a simple HashMap. See: cache benchmarks. You only need a map when you do a complete preload and don't have any mutations.
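To illustrate the "you only need a map" point, here is a sketch of a preloaded read-only repository; the Product class and the products table are made up for illustration:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import javax.sql.DataSource;

    // Preload all rows once at startup, then serve lookups from an
    // unmodifiable HashMap; no cache library needed without mutations.
    public final class ProductRepository {

        public static final class Product {
            public final long id;
            public final String name;
            Product(long id, String name) { this.id = id; this.name = name; }
        }

        private final Map<Long, Product> byId;

        public ProductRepository(DataSource ds) throws SQLException {
            Map<Long, Product> map = new HashMap<>();
            try (Connection c = ds.getConnection();
                 Statement st = c.createStatement();
                 ResultSet rs = st.executeQuery("SELECT id, name FROM products")) {
                while (rs.next()) {
                    map.put(rs.getLong("id"),
                            new Product(rs.getLong("id"), rs.getString("name")));
                }
            }
            this.byId = Collections.unmodifiableMap(map);
        }

        public Product findById(long id) {
            return byId.get(id);
        }
    }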
I imagine that repositories backed by a distributed (as in: shared across all contexts) cache used in all web applications should not only reduce the memory footprint significantly but also shouldn't be significantly slower
Distributed caches need to go over the network and add serialization/deserialization overhead. That's probably another factor of 30 slower. And when is the distributed cache updated?
I'm very grateful for every tip or hint and your thoughts.
Wrap up:
Do the normal software engineering homework: do profiling and analysis, and spend the tuning effort in the right places
Ask specific questions on one topic on Stack Overflow and share your code and performance data. Ask about one thing at a time and read https://stackoverflow.com/help/on-topic
You may also come to the conclusion that there is nothing to tune. There are applications out there that need a day to build up an in-memory data structure from persistent data. Maybe it's just a lot of data? If you don't like the downtime, use blue-green deployment. Also use smaller data sets for development and testing
I have a web application which receives requests to save orders in a database. I want to write to 2 different databases - one Cassandra instance and one PostgreSQL instance. I am using plain Java and JDBC (with Apache DbUtils) with a lightweight web application library at the front.
What I am unsure about is how to implement transactionality across the two databases, i.e. if a write to one of the databases fails, then roll back the other write and put an error message in the error log.
Are there any mechanisms in Java to implement this? I know of such a thing as two-phase commit; is that what I would be looking for here? Are there any alternatives?
Both Cassandra & PostgreSQL support linearizability and compare-and-set (CAS), so you can implement transactions on the client side.
If you want the Serializable isolation level then you should take a look at Percolator transactions. Percolator transactions are quite well known in the industry and have been used in Amazon's DynamoDB transaction library, in the CockroachDB database, and in Google's Percolator system itself. A step-by-step visualization of Percolator transactions may help you understand them.
If you expect contention and can live with the Read Committed isolation level then RAMP transactions by Peter Bailis may suit you. I also created a step-by-step RAMP visualization.
The third approach is to use compensating transactions, also known as the saga pattern. It was described in the late 80s in the Sagas paper but became more relevant with the rise of distributed systems. Please see the "Applying the Saga Pattern" talk for inspiration.
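As a rough sketch of the saga approach for your case: the PostgreSQL write goes first and a compensating delete undoes it if the Cassandra write fails. Order and the three persistence methods are hypothetical stand-ins for your DAO code:

    // Hypothetical compensating-transaction (saga) flow for the
    // two-database write; a subclass supplies the actual DAO calls.
    public abstract class OrderSagaWriter {

        public static class Order { /* your order entity */ }

        protected abstract void saveToPostgres(Order order);
        protected abstract void saveToCassandra(Order order);
        protected abstract void deleteFromPostgres(Order order);

        public void saveOrder(Order order) {
            saveToPostgres(order);              // step 1: local ACID write
            try {
                saveToCassandra(order);         // step 2
            } catch (RuntimeException cassandraFailure) {
                try {
                    deleteFromPostgres(order);  // compensating action for step 1
                } catch (RuntimeException compensationFailure) {
                    // both stores are now suspect: write to the error log
                    // for manual repair (real code would use a logger)
                    System.err.println("Compensation failed: " + compensationFailure);
                }
                throw cassandraFailure;         // surface the failure to the caller
            }
        }
    }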
[Background]
- There are two Java applications (A and B), and they can only communicate via an Oracle DB
- A and B share the same database table
- Both A and B store the data in a cache
[Problem]
If A performs a simple transaction (insert/update/delete), the cache in A is updated. The cache in B should then be updated automatically as well!
[Current Status]
Two solutions I found and tried
- Solution1) Using DatabaseChangeListener
- Solution2) Using Socket Programming
[Question]
The solution will be used at my company, and I would like to know if there is anything I can improve in my solutions.
1) What could be the disadvantages if I use DatabaseChangeListener?
2) What could be the disadvantages if I use socket programming? (Maybe it's too low-level for developers to maintain, due to company policy?)
3) I heard there are 3rd-party caches that also support synchronization. Is that correct?
Please let me know if you need more information!
Thank you very much in advance!
[EDIT]
It would be much appreciated if you could leave a comment when you down-vote this. I would like to know how I can improve this question with your feedback! Thank you
Your question appears every now and then with slightly different aspects. One useful answer to that is here: Guava Cache, how to block access while doing removal
About using the DatabaseChangeListener:
Although you are fine with Oracle, I would discourage the use of vendor-specific interfaces. For me, they would be okay as a performance optimization, but I would never use vendor-specific interfaces for basic functionality.
Second, the use of the change listener may still lead to dirty reads.
About "distributed caches" as veritas suggested:
There is a difference between distributed caches and clustered caches. Distributed caches spread (aka distribute) the cached data across different nodes, while clustered caches are caches for clustered applications that keep track of data consistency within the cluster. A distributed cache is usually a clustered cache, but not the other way around. For a general idea of the topic I recommend the Infinispan documentation on clustering as an intro: http://infinispan.org/docs/7.0.x/user_guide/user_guide.html#_clustering
Wrap up:
A clustered cache implementation is the thing you need. However, if you want data consistency, you still need to design your transaction handling carefully.
You can, of course, also do the socket communication yourself and send simple object-invalidate messages to the other applications. The challenging part is the error handling. When was the invalidation successful? Is there a timeout for the other nodes to acknowledge? When do you drop a node, and do you maintain a cluster state at all?
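To make that concrete, here is a deliberately naive sketch of such invalidate messages (made-up message format, and none of the acks/timeouts that make this hard in practice):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.util.concurrent.ConcurrentHashMap;

    // Application A pushes the changed key over TCP after it commits;
    // application B evicts that key from its local cache.
    public class CacheInvalidation {

        // Receiver side (application B): evict whatever key arrives.
        public static void listen(ConcurrentHashMap<String, ?> cache, int port) throws Exception {
            try (ServerSocket server = new ServerSocket(port)) {
                while (true) {
                    try (Socket s = server.accept();
                         BufferedReader in = new BufferedReader(
                                 new InputStreamReader(s.getInputStream()))) {
                        String key;
                        while ((key = in.readLine()) != null) {
                            cache.remove(key); // next read reloads from Oracle
                        }
                    }
                }
            }
        }

        // Sender side (application A): notify B after a successful commit.
        public static void sendInvalidate(String host, int port, String key) throws Exception {
            try (Socket s = new Socket(host, port);
                 PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                out.println(key);
            }
        }
    }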
I would suggest a 3rd-party cache if you have many similar use cases or many tables that need to be updated.
Please read about the Terracotta distributed cache.
It gives exactly what you want.
You can also look at Hazelcast or Memcached.
There are technically two questions here, but they are tightly coupled :)
I'm using Hibernate in a new project. It's a POS project.
It uses Oracle database.
We have decided to use Hibernate because the project is large, and because it provides (the most popular) ORM capabilities.
Spring is, for now, out of the question - the reason being: the project is a Swing client-server application, and Spring adds needless complexity. Also, Spring is supposed to be very hungry for hardware resources.
There is a possibility of throwing away Hibernate and using JDBC instead. Why? A project requirement is precise database interaction. Meaning, we should have complete control over the connections, sessions and transactions (and, yes, going as low as unoptimized queries).
The first question is - what are your opinions on the mentioned requirement?
The second question revolves around Hibernate.
We developed a simple Hibernate pilot project.
Another project requirement is - one database user / one connection per user / one session per user / transactions are flexible (we can end them when we want, like sessions).
Multiple users can log in to the application at the same time.
We achieved something like that. To be precise, we achieved the full described functionality, just without the multiple-users requirement.
Now, looking at the available resources, I came to the conclusion that if we are to have multiple users on the database (on the same schema), we will end up using multiple SessionFactories, implementing a dynamic ConnectionProvider for new user connections. Why?
The users' hashed passwords are in the database, so we need to dynamically add a user to the list of current users.
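For what it's worth, here is a sketch of how the dynamic part might look: cache one SessionFactory per database user, built with that user's credentials. hibernate.connection.username/password are standard Hibernate settings; the class name is made up:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.hibernate.SessionFactory;
    import org.hibernate.cfg.Configuration;

    // Build a factory with a given user's credentials on first login
    // and cache it for subsequent logins of the same database user.
    public class PerUserSessionFactories {

        private final Map<String, SessionFactory> factories = new ConcurrentHashMap<>();

        public SessionFactory factoryFor(String dbUser, String dbPassword) {
            return factories.computeIfAbsent(dbUser, user ->
                new Configuration()
                    .configure() // shared mappings from hibernate.cfg.xml
                    .setProperty("hibernate.connection.username", user)
                    .setProperty("hibernate.connection.password", dbPassword)
                    .buildSessionFactory());
        }
    }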
The second question is - can this be done more easily? It seems weird that Hibernate doesn't support such configurations.
Thank you.
If you're pondering whether to use Hibernate or JDBC, honestly, go for JDBC. If your domain model is not too complex, you don't really get a lot of advantages from using Hibernate. On the other hand, using JDBC will greatly improve performance, as you have better control over your queries, and you get A LOT less memory usage from not having all the Hibernate overhead. Balance this by making as detailed a first sketch of your model as possible. If you're able to sketch it all from the start (no parts that are likely to change wildly throughout the project), and if said model doesn't look too involved, JDBC will be your friend.
About your users and sessions: I think you might be mistaken (though it could just be me), but I don't think you need multiple SessionFactories to have multiple sessions. A SessionFactory is a heavy object to initialize, but once you have one you can get multiple Hibernate Session objects from it, and those are lightweight.
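A tiny sketch of that point, assuming a hibernate.cfg.xml on the classpath:

    import org.hibernate.Session;
    import org.hibernate.SessionFactory;
    import org.hibernate.cfg.Configuration;

    // One heavyweight SessionFactory, many lightweight Sessions.
    public class SessionPerUserDemo {
        public static void main(String[] args) {
            SessionFactory factory = new Configuration().configure().buildSessionFactory();

            // Each logged-in user gets their own cheap Session from the same factory.
            Session sessionForUserA = factory.openSession();
            Session sessionForUserB = factory.openSession();

            // ... work with the sessions ...

            sessionForUserA.close();
            sessionForUserB.close();
            factory.close();
        }
    }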
As a final remark, if you truly stick with an ORM solution (for whatever reason), then if possible choose the EclipseLink JPA 2 implementation. JPA 2 has more features than Hibernate, and the EclipseLink implementation is less buggy than Hibernate.
So, as far as Hibernate goes, I still don't know if the only way to dynamically change database users (change database connections) is to create multiple session factories, but I presume it is.
We have lowered our requirements and decided to use Hibernate with only one user on the database (one connection), and one session per user (multiple sessions / multiple "logical" users). We created a couple of Java classes to wrap that functionality. The resources on how this can be done can be found here.
Why did we use Hibernate in the end? Using JDBC is more precise and more flexible, but the effort of once again mapping ResultSet values into objects is, again, the same manual ORM approach.
For example, if I have a GUI that needs to save a Page, first I have to fetch all the Page's Articles and then, after I save the Page, update all the Articles' FKs to that Page. Notice that I'm speaking in nouns (objects), and I don't see any other way to wrap the Page/Articles except by using global state. This is the one thing I wouldn't like to see in my application, and we are, after all, using Java, an OO language.
When we already have an ORM mapper that can be configured ("forced" would be the more precise word in this particular example) to handle these things itself, why go and program it ourselves?
Also, we decided to use Google Guice - it's much faster, typesafe, and could significantly simplify our development/maintenance/testing.
I'm hoping to find out what tools folks use to synchronize data between databases. I'm looking for a JDBC solution that can be used as a command-line tool.
There used to be a tool called Sync4J that used the SyncML framework but this seems to have fallen by the wayside.
I have heard that the Data Replication Service provided by db4o is really good. It allows you to use Hibernate to back onto an RDBMS - I don't think it supports JDBC though (http://www.db4o.com/about/productinformation/drs/Default.aspx?AspxAutoDetectCookieSupport=1)
There is an open source project called Daffodil, but I haven't investigated it at all. (https://daffodilreplicator.dev.java.net/)
The one I am currently considering using is called SymmetricDS (http://symmetricds.sourceforge.net/)
There are others, they each do it slightly differently. Some use triggers, some poll, some use intercepting JDBC drivers. You need to decide what technical limitations you are under to determine which one you really want to use.
Wikipedia provides a nice overview of different techniques (http://en.wikipedia.org/wiki/Multi-master_replication) and also provides a link to another alternative DBReplicator (http://dbreplicator.org/).
If you already have a model and DAO layer for your codebase, you can just create your own sync framework; it isn't hard.
Copying data is as simple as:
read an object from database A
remove database metadata (uuid, etc)
insert into database B
Syncing requires some knowledge of what has been synced already. You can either do it at runtime by getting a list of UUIDs from TableInA and TableInB and working out which entries are new, or you can have a table of items that need to be synced (populated by a trigger upon insert/update in TableInA) and run from that. Your tool can be a TimerTask so the databases are kept synced at the time granularity you desire.
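A rough sketch of the runtime-diff variant described above; the Dao interface and its methods are hypothetical stand-ins for your existing model/DAO layer:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.Timer;
    import java.util.TimerTask;

    // Hypothetical DAO abstraction over one of the two databases.
    interface Dao {
        List<String> fetchAllUuids();
        Object readByUuid(String uuid);   // read an object from A
        void insertCopy(Object row);      // strip metadata, insert into B
    }

    // Periodically copy entries that exist in A but not yet in B.
    public class SyncTask extends TimerTask {

        private final Dao sourceDao; // database A
        private final Dao targetDao; // database B

        public SyncTask(Dao sourceDao, Dao targetDao) {
            this.sourceDao = sourceDao;
            this.targetDao = targetDao;
        }

        @Override
        public void run() {
            // Work out which entries are new: uuids in A but not in B.
            Set<String> newUuids = new HashSet<>(sourceDao.fetchAllUuids());
            newUuids.removeAll(targetDao.fetchAllUuids());
            for (String uuid : newUuids) {
                targetDao.insertCopy(sourceDao.readByUuid(uuid));
            }
        }

        public static void schedule(Dao a, Dao b, long periodMillis) {
            new Timer(true).schedule(new SyncTask(a, b), 0, periodMillis);
        }
    }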
However, there is probably some tool out there that does it all without any of this implementation faff, and each implementation would differ based on business needs anyway. In addition, at the database level there will be replication tools.
True synchronization requires some data that I hope your database schema has (you can read the SyncML docs to see how they proceed). Sync4J won't help you much; it's really high-level and XML-oriented. If you don't foresee any conflicts (which means: really easy synchronization), you could try a lightweight ETL like Enhydra Octopus.
I'm primarily using Oracle at the moment, and the most full-featured route I've come across is Red Gate's Data Compare:
http://www.red-gate.com/products/oracle-development/data-compare-for-oracle/
This old blog post gives a good summary of the available solution routes:
http://www.novell.com/coolsolutions/feature/17995.html
The JDBC-specific offerings I've come across have been very basic. The solution mentioned by Aidos seems the most feature-complete if you want to go down the publish-subscribe route:
http://symmetricds.codehaus.org/
Hope this helps.