Second level cache for java web app and its alternatives

Second level cache for java web app and its alternatives - java

Between the transitions of the web app I use a Session object to save my objects in.
I've heard there's a program called memcached but there's no compiled version of it on the site,
besides some people think there are real disadvantages of it.
Now I wanna ask you.
What are alternatives, pros and cons of different approaches?
Is memcached painpul for sysadmins to install? Is it difficult to embed it to the existing infrastructure from the perspective of a sysadmin?
What about using a database to hold temporary data between web app transitions?
Is it a normal practice?

What about using a database to hold
temporary data between web app
transitions? Is it a normal practice?
Database have indeed a cache already. A well design application should try to leverage it to reduce the disk IO.
The database cache works at the data level. That's why other caching mechanism can be used to address different levels. At the java level, you can use the 2nd level cache of hibernate, which can cache entities and query result. This can notably reduce the network IO between the app. server and the database.
Then you may want to address horizontal scalability, that is, to add servers to manage the load. In this case, the 2nd level cache need to be distributed across the nodes. This exists (see JBoss cache), but can get slightly complicated to manage.
Distributed cache tend to worker better if they have simpler scheme based on key/value. That's what memcached is, but there are also other similar solutions. The biggest problem with distributed caches is invalidation of outdated entries -- which can itself turn into a performance bottleneck.
Don't think that you can use a distributed cache as-is to make your performance problems vanish. Designing a scalable distributed architecture requires experience and is always a matter of trade-off between what to optimize and not.
To come back to your question: for regular application, there is IMHO no need of a distributed cache. Decent disk IO and network IO lead usually to decent performance.
EDIT
For non-persistent objects, you have several options:
The HttpSession. Objects need to implement Serializable. The exact way the session is managed depends on the container. In a cluster, the session is usually replicated twice, so that if one node crashes you still have one copy. There is then session affinity to route the request to the server that has the session in memory.
Distributed cache. A system like memcached may indeed make sense, but I don't know the details.
Database. You could of course dump any Serializable object in the database in a BLOB. Can be an option if the web servers are not as reliable as the database server.
Again, for regular application, I would try to go as far as possible with the HttpSession.

How about Ehcache? It's an easy to use pure Java solution ready to plug in to Hibernate. As far as I remember it's supported by containers.
It's quite painless in my experience.

http://docs.jboss.org/hibernate/core/3.3/reference/en/html/performance.html#performance-cache
This page should have everything that you need (hopefully !)

Related

Cluster system architecture?

I would like to develop application for ~500 active users (sessions at one time). System would not process any massive calculations. It will be simple read/write to database solution. However, to application would be uploaded about 50mb of data daily per user. (it would be analysed and clean by other application every day when non users will be active). Actually I'm working on design of this application and I've got few questions about that.
Should I consider developing application working in some cluster with load balance or one server will handle this amount of usage?
If yes, is there any guidelines about developing application to work in cluster? Is there any difference than developing single server application?
Should I be worried about database of this application? What problems should I expect when 2 servers will read/write data to single database at same time? Maybe it also should work in cluster?
I would be pleased for any help and/or articles about design this mid size applications.

This depends on you NFR (non functional requirements). Next to load balancing, a cluster provides higher availability.
You'll have to make your back-end state-less so that requests from the same user can end up on another node without the user noticing. This makes it more expensive to build scaling software. So consider your options carefully.
Accessing a database from multiple servers is not different than accessing it from multiple threads.

To answer your first question, I think using an infrastructure provider that lets you easily scale (up or down) your application is always a big plus and can help you save money. My main experience with this kind of providers is with Amazon Web Services (AWS).
I don't know precisely what technology you are planning to use, but a general setup like that on AWS would make sense to me is:
A set of EC2 instances (= virtual servers) running behind an ELB (a load balancer)
An auto scaling group containing the EC2 instances. You can look it up, but an auto scaling group basically lets you automatically add and remove instances depending on various factors (server load, disk I/O, etc.)
The use of RDS for your database. It supports multiple DBMS such as MySQL and Oracle. It also provides you with nice features such as replication, automated backups and monitoring.
The use of CodeDeploy to deploy your application on the servers
(I'm voluntarly using the AWS names so that you can read the documentation if you are interested.)
This would basically let you scale to a lot more than 500 concurrent users if needed, and could save you some money when you are handling less users. Note that auto scaling groups can also be scheduled. For instance : « I want at least 5 instances during the day (max 50), but you can go down to 2 (and still up to 50) between 1am and 4am »
The services I mentionned are quite widely documented, so you can look it up if you'd like some more specific details.
I won't discuss in detail your two other questions because I'm not an expert on the subject, but the database can indeed be a bottleneck since it may involve a lot of I/Os.
Hope this helps :)

Single Java Cache for multiple Application Server

We got multiple Application Server behind a Reverse Proxy. We want a single cache on another host which all Application Servers can easily use, thus the cache has to have some kind of network support. Furthermore the setup should be easy probably supporting docker, but this is not a must. The cache duration is about 1d. The API should be as easy and standardized as possible (JCache?).
In a later stage we want to prepolutate the Cache.
Which options do I have?
Background: In a first step we want to reduce load on the backend systems, which provides mainly SOAP Services. So we want to cache the SOAP response (JAX-WS). The cache hit rate will be probably about 25% in a first stage.
Later we want to use the same cache for JPA as well (we already have in memory caching enabled for each Application Servcer and use a Cache Coordination strategy).
To use even more caching we will need some sort cache categories.

In general: The question is to broad and actually you are asking for a product recommendation. Please take a look at the stackoverflow question guidelines.
About your question:
There is no "single cache" for any purpose. Furthermore, there can be many variants in software and system architecture, with a single cache product, too. The best solution depends not on the application but on the type of data access you want to cache. Some questions that come to my mind:
Do you have a mostly read or a read/write usage pattern?
What is the type of access, point, range, or a full scan? What type of operations you do on the data? What is the object count and typical object size? Are there hot spots? How many application servers you have? Is there a memory limit in the application servers? How costly is it to generate the data in the backend (latency and resource costs)?
One general recommendation: If you only have a few application servers, I would start with local caching in the application servers and ignore the fact that there may be redundant requests on the backend from different application servers. This way you can keep the existing system architecture. Putting in a separate cache server or servers needs a lot of planing and a lot considerations for staging, deployment and operation your application.
One second general recommendation: The cache hit rate will be probably about 25% in a first stage A cache with this hitrate will be pretty useless. It may happen that you don't get any performance gain from the cache at all. There may be reasons to do it anyway, e.g. to improve the application for flash crowds. This needs some more detailed elaboration. Double check you numbers!
I am looking forward for more detailed questions :)

What about using the cache server from Ehcache ?
It provides a RESTful interface and can run on a dedicated server.

GAE/GWT server side data inconsistent / not persisting between instances

I'm writing a game app on GAE with GWT/Java and am having a issues with server-side persistent data.
Players are polling using RPC for active games and game states, all being stores on the server. Sometimes client polling fails to find game instances that I know should exist. This only happens when I deploy to google appspot, locally everything is fine.
I understand this could be to do with how appspot is a clouded service and that it can spawn and use a new instance of my servlet at any point, and the existing data is not persisting between instances.
Single games only last a minute or two and data will change rapidly, (multiple times a second) so what is the best way to ensure that RPC calls to different instances will use the same server-side data?
I have had a look at the DataStore API and it seems to be database like storage which i'm guessing will be way too slow for what I need. Also Memcache can be flushed at any point so that's not useful.
What am I missing here?

You have two issues here: persisting data between requests and polling data from clients.
When you have a distributed servlet environment (such as GAE) you can not make request to one instance, save data to memory and expect that data is available on other instances. This is true for GAE and any other servlet environment where you have multiple servers.
So to you need to save data to some shared storage: Datastore is costly, persistent, reliable and slow. Memcache is fast, free, but non-reliable. Usually we use a combination of both. Some libraries even transparently combine both: NDB, objectify.
On GAE there is also a third option to have semi-persisted shared data: backends. Those are always-on instances, where you control startup/shutdown.
Data polling: if you have multiple clients waiting for updates, it's best not to use polling. Polling will make a lot of unnecessary requests (data did not change on server) and there will still be a minimum delay (since you poll at some interval). Instead of polling you use push via Channel API. There are even GWT libs for it: gwt-gae-channel, gwt-channel-api.

Short answer: You did not design your game to run on App Engine.
You sound like you've already answered your own question. You understand that data is not persisted across instances. The two mechanisms for persisting data on the server side are memcache and the datastore, but you also understand the limitations of these. You need to architect your game around this.
If you're not using memcache or the datastore, how are you persisting your data (my best guess is that you aren't actually persisting it). From the vague details, you have not architected your game to be able to run across multiple instances, which is essential for any app running on App Engine. It's a basic design principle that you don't know which instance any HTTP request will hit. You have to rearchitect to use the datastore + memcache.
If you want to use a single server, you can use backends, which behave like single servers that stick around (if you limit it to one instance). Frankly though, because of the cost, you're better off with Amazon or Rackspace if you go this route. You will also have to deal with scaling on your own - ie if a game is running on a particular server instance, you need to build a way such that playing the game consistently hits that instance.

Remember you can deploy GWT applications without GAE, see this explanation:
https://developers.google.com/web-toolkit/doc/latest/DevGuideServerCommunication#DevGuideRPCDeployment
You may want to ask yourself: Will your application ever NEED multiple server instances or GAE-specific features?
If so, then I agree with Peter Knego's reply regarding memcache etc.
If not, then you might be able to work around your problem by choosing a different hosting option (other than GAE). Particularly one that lets you work with just a single instance. You could then indeed simply manage all your game data in server memory, like I understand you have been doing so far.
If this solution suits your purpose, then all you need to do is find a suitable hosting provider. This may well be a cloud-based PaaS offer, provided that they let you put a hard limit (unlike with GAE) on the number of server instances, and that it goes as low as one. For example, Heroku (currently) lets you do that, as far as I understand, and apparently it's suitable for GWT applications, according to this thread:
https://stackoverflow.com/a/8583493/2237986
Note that the above solution involves a bit of fiddling and I don't know your needs well enough to make a strong recommendation. There may be easier and better solutions for what you're trying to do. In particular, have a look at non-cloud-based hosting options and server architectures that are optimized for highly time-critical, real-time multiplayer gaming.
Hope this helps! Keep us posted on your progress.

Recommendations for an in memory database vs thread safe data structures

TLDR: What are the pros/cons of using an in-memory database vs locks and concurrent data structures?
I am currently working on an application that has many (possibly remote) displays that collect live data from multiple data sources and renders them on screen in real time. One of the other developers have suggested the use of an in memory database instead of doing it the standard way our other systems behaves, which is to use concurrent hashmaps, queues, arrays, and other objects to store the graphical objects and handling them safely with locks if necessary. His argument is that the DB will lessen the need to worry about concurrency since it will handle read/write locks automatically, and also the DB will offer an easier way to structure the data into as many tables as we need instead of having create hashmaps of hashmaps of lists, etc and keeping track of it all.
I do not have much DB experience myself so I am asking fellow SO users what experiences they have had and what are the pros & cons of inserting the DB into the system?

Well a major con would be the mismatch between Java and a DB. That's a big headache if you don't need it. It would also be a lot slower for really simple access. On the other hand, the benefits would be transactions and persistence to the file system in case of a crash. Also, depending on your needs, it allows for querying in a way that might be difficult to do with a regular Java data structure.
For something in between, I would take a look at Neo4j. It is a pure Java graph database. This means that it is easily embeddable, handles concurrency and transactions, scales well, and does not have all of the mismatch problems that relational DBs have.
Updated If your data structure is simple enough - a map of lists, map of maps, something like that, you can probably get away with either the concurrent collections in the JDK or Google Collections, but much beyond that, and you will likely find yourself recreating an in memory database. And if your query constraints are even remotely difficult, you're going to have to implement all of those facilities yourself. And then you'll have to make sure that they work concurrently etc. If this requires any serious complexity or scale(large datasets), I would definitely not roll your own unless you really want to commit to it.
If you do decided to go with an embedded DB there are quite a few choices. You might want to start by considering whether or not you want to go the SQL or the NoSQL route. Unless you see real benefits to go SQL, I think it would also greatly add to the complexity of your app. Hibernate is probably your easiest route with the least actual SQL, but its still kind of a headache. I've done it with Derby without serious issues, but it's still not straightforward. You could try db4o which is an object database that can be embedded and doesn't require mapping. This is a good overview. Like I had said before, if it were me if I would likely try Neo4j, but that could just be me wanting to play with new and shiny things ;) I just see it as being a very transparent library that makes sense. Hibernate/SQL and db4o just seems like too much hand waving to feel lightweight.

You could use something like Space4J and get the benefits of both a collections like interface and an in memory database. In practical use something as basic as a Collection is an in memory database with no index. A List is an in memory database with a single int index. A Map is an in memory database with a single index type T based index and no concurrency unless synchronized or a java.util.concurrency.* implementation.

I was once working for a project which has been using Oracle TimesTen. This was back in early 2006 when Java 5 was just released and java.util.concurrent classes were barely known. The system we have developed had reasonably big scalability and throughput requirements (it was one of the core telco boxes for SMS/MMS messaging).
Briefly speaking, reasoning for TimesTen was fair: "let's outsource our concurrency/scalability problems to somebody else and focus on our business domain" and made perfect sense then. But this was back in 2006. I don't think such a decision would be made today.
Concurrency is hard, but so is handling of in-memory databases. Freeing yourself of concurrency problems you'd have to become an expert of in-memory database world. Fine tuning TimesTen for replication is hard (we had to hire a professional consultant from Oracle to do this). License(s) don't come for free. You also need to worry about additional layer which is not open source and/or might be written in a different language than the one you understand.
But it is really hard to make any judgement without knowing your experience, budget, time requirements, etc. Do a shopping around, spend some time for looking into decent concurrency frameworks (such as http://akkasource.org/) ...and let us know what you have decided ;)

Below are few questions which could facilitate a decision.
Queries - do you need to query/reproject/aggregate your data in different forms?
Transactions - do you ever need to rollback added data?
Persistence - do you only need to present the gathered data or do you also need to store it in some way?
Scalability - will your data always fit in the memory?
Performance - how fast should it be?

It is unclear to me why you feel that an in memory database cannot be thread safe.
Why don't you look at JDO and DataNucleus? They have a lot of different datastores where you get to plug in what your back end persistence provider is at run time as a configuration step. Your application code is dependent on an ORM but that ORM might be plugged into an RDBMS, DB40, NeoDatis, LDAP, etc. If one backend doesn't work for you, then switch to another.

Propagation of Oracle Transactions Between C++ and Java

We have an existing C++ application that we are going to gradually replace with a new Java-based system. Until we have completely reimplemented everything in Java we expect the C++ and Java to have to communicate with each other (RMI, SOAP, messaging, etc - we haven't decided).
Now my manager thinks we'll need the Java and C++ sides to participate in the same Oracle DB transaction. This is related to, but different from the usual distrbuted transaction problem of having a single process co-ordinate 2 transactional resources, such as a DB and a message queue.
I think propagating a transaction across processes is a terrible idea from a performance and stability point-of-view, but I am still going to be asked for a solution.
I am familiar with XA transactions and I've done some work with the JBoss Transaction Manager, but my googling hasn't turned up anything good on propagating an XA transaction between 2 processes.
We are using Spring on the Java side and their documentation explicitly states they do not provide any help with transaction propagation.
We are not planning on using a traditional Java EE server (for example: IBM Websphere), which may have support for propagation (not that I can find any definitive documentation).
Any help or pointers on solutions is greatly appreciated.

There is an example on Laurent Schneider's blog of using the DBMS_XA package inside Oracle to permit multiple sessions to work in the same transaction. So it would be possible to have Java and C++ sessions participating in the same transaction without needing any sort of additional coordinator.
Alternately, you might consider using Workspace Manager. That was originally designed to support extremely long-running transactions (i.e. manipulating lots of spatial data for a proposed development). Essentially, you can create a workspace, which in your case would be roughly equivalent to a named transaction. Both the Java and C++ code could enter that workspace (from separate sessions) and both could manipulate and commit data in that workspace. When the transaction was complete, you could then merge the workspace to the LIVE workspace, which is equivalent to doing a commit in a normal transaction.
On the other hand, I would strongly agree with your initial assessment that coordinating transactions between processes is very likely to be a bad idea from a performance, stability, simplicity, and maintenance standpoint. On the other hand, it may well be a legitimate business requirement depending on how the C++ code is going to be retired (i.e. whether it is possible to replace code in such a way that transactions can be either exclusively Java or exclusively C++)

I have been using Hazlecast Messaging and Distributed memory locks to solve some of these concerns, however using such a tool would require that you redisign your software in those parts where you touch the same data. C++ client docs here Java client here
Oracle also has a similar product called Oracle Coherence that may help you, see locking in the dev guide.
Also the database contains a MQ system called Oracle Streams Advanced queueing ( transactional persistent queues) that might help you in some situations. Oracle AQ integrates well with Oracle triggers.
Additionally there is the Database Change Notification that may help you update caches or notify processes of updates, this can be used together with the Optimistic Offline Lock pattern.
See also Software transactional memory
Apache Zookeeper can also help you with distributed locking.

I believe JBoss Transaction Manager supports 2pc tx propagation across web service calls. You could, I suppose integrate your systems that way, but the performance would stink.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.