Java distributed objects with locality?

Java distributed objects with locality? - java

I am evaluating various Java object distribution libraries (Terracotta, JCS, JBoss, Hazelcast ...) for an application server and I'm having trouble understanding their behavior on various axes.
My requirements for distributed objects are not many -- they boil down to one-to-one and one-to-many messaging. There's more, but for the rest we just use JDBC and I assume I can plop a cache in front of this using any of the available libraries.
I would like a system that distributes objects and exhibits locality properties -- in other words, a server that grabs an object tends to hold onto it without excess communication to other nodes. Hazelcast looks simple (and peer-to-peer is nice) but seems to require objects are distributed evenly across all nodes.
I'd like a way to persist objects, preferably transparently. I plan on using EC2, so I have the option of temporary, free, limited local storage (the disk) and permanent, non-free, unlimited storage (S3). It'd be great not to worry about OutOfMemoryErrors.
I like the simplicity and "magic" of Terracotta but it scares the beejeezus out of me. Also in order to truly scale you have to spend $$$$, otherwise you're communicating with a single hub.
I'm cheap and I want something not only free but mature and with a large userbase.
Thanks for any input.

Terracotta seems like a perfect fit for your situation.
It's simple to setup
it can be configured to be persistent (use an EBS volume for EC2)
it's closely integrated with Ehcache (actually Terracotta bought Ehcache) for great distributed caching performance
the free offering scales pretty well with several clients.
Just start playing around with it. I bet you'll love it. To ease your performance fears, simply run a through put test for message passing. This shouldn't take much more than an afternoon of your time.
I have to admit that I haven't used Terracotta for a year and that I don't know the others you suggested.

Terracotta does fit the bill. I understand your objections, but here's my comments:
1) Terracotta does exhibit locality - and is probably the best system at it compared to those you mentioned. Objects are only brought in to a local JVM where requested. Locking for reads or writes is performed using a leasing mechanism. This means if you exhibit perfect locality in your system then you will incur very little network overhead.
2) Terracotta provides disk persistence out of the box - in the OSS version (you don't have to pay $$$$)
3) Why does it scare you so much? Just use EHCache as a cache, or the Hibernate 2nd Level Plugin. It's incredibly easy to setup and use.
4) Yes, Terracotta FX requires you to pay (for scale-out servers). However I would suggest that if you have a system that is mostly read and exhibits true locality then I don't think you'll have a problem getting the scale you are looking for. With Terracotta 3.2 the performance of the Hibernate 2nd Level Cache is 100,000 ops/s using 8 application servers and one Terracotta server at 100/0 read/write ratio and 12,000 ops/s using the same config at 95/5 read/write ratio.
(I just did a talk for the Bay Area SDForum on these numbers so I happen to have them handy)

Yes Hazelcast will distribute your objects across the cluster. However you can enable near cache if you want to reduce the communication cost.
http://www.hazelcast.com/documentation.jsp#MapNearCache

Btw, it's not clear what you are looking for (messaging is not the same as clustering/distributed objects).
If you are looking for messaging in Java I recommend you have a look at RabbitMQ (it's Erlang based but that doesn't matter).

Related

Java applications on Oracle Exadata

For reasons that are beside the point, a company has bought an Exadata Eighth Rack. Some of the managers thought that this would improve performance of current applications. The problem is that hardly any application makes intensive database work (yes, this is a good moment for looking at facepalm animated gifs). So, at the moment, migrations have proven just little benefit.
The question is obvious. Most of the applications are written in Java, and some of them make intensive use of Solr and Cassandra. For what I know, Exadata is intended for storing data, while Exalogic can hold applications too. Anyway, I'm wondering if there is some way of taking advantage of mentioned infrastructure.

Replace Solr with Oracle Text.
Before I get down-voted: normally I would not recommend replacing existing code built with a popular, open-source program with a seldom-used, proprietary product. But if you want to use a lot of space and CPU on your database servers then Oracle Text can definitely help.
As more generic advice, the primary role of a database is not to store data. A file system can do that. Databases are built to join data. If an application is reading a large amount of data and doing ad hoc joins, those are the jobs you want to move to the database.

Exadata -> Oracle Database extreme performance.
Exalogic -> Fusion Middleware extreme performance. (Java goes here)
Your best move will be refactoring the application to put as much workload as possible on the DB (PL/SQL).
Another thing I could think of, but this would be a radical approach I have never really tried it myself (Yes I work with Exadatas too) maybe you can give it a shot and let us know here...
What about using all those GBs on the Exadata's RAM and start tuning your Java application's latency? I mean with that gruesome amount of Memory you can try and set a real nice amount of heap and avoid Garbage Collection induced latency. Please do let me know here what comes out if you actually try this.

Which protocol do the Java applications use to connect to Oracle?
If it's not IPC (inter process communication, aka BEQUEATH, aka shared memory), but maybe TCP and you have many fast & tiny roundtrips, than this would be your low-hanging fruit - eliminate the network stack.
edit: just realized that exadata cannot run java applications by default (only ODA does) - so it wouldn't be possible to make use of IPC. However, perhaps you're able to test the impact of IPC in one of your applications using the former infrastructure?

Exadata cannot host any customer application. You cannot install anything there. You only can host Oracle database on Exadata.
It means you can use database features like DBFS (file system over Oracle database), Java option (storing and executing java code in database). But you need to check what options you have license for. And internal JVM is used, which cannot be customized or upgraded.
Exadata is database appliance designed to work with large amount of differently accessed data in very effective and manageable way.

JVM heap replication between two machines

What are the basic principles of how two separable computers connected within the same network running the same Java application maintain the same state by syncing their heap between each other?
I believe Terracotta does this task but I have no idea how would some pseudo code look like that would describe its core functions.
I'm just looking for understanding of this technology.

Terracotta DSO works by manipulating the byte code of your classes (and the JDK's classes etc). The instructions on how and when to do this are part of the Terracotta configuration file.
The bytecode modification looks for certain byte codes such as a field read or write or a monitor enter or exit. Whenever those instructions occur, code is added around that location that does the appropriate action in the distributed store. For example when a monitor is obtained due to synchronization, a distributed lock is obtained as well (whether it is a read or write lock is dependent on the configuration). If a field in a shared object is written, the distributed system must verify that a write lock is being held and then send the data value is sent to the clustered server, which stores it on disk or shares it over the network as appropriate.
Note that Terracotta does not share the entire heap, only the graph of objects indicated by the configuration. In general, there would be little point in sharing an entire heap. It is better instead for the application to describe the domain objects needed across the distributed application.
There are many optimizations employed to make the operations above efficient: only field deltas are sent over the wire and in a form much more efficient than Java serialization, many deltas can be bundled and sent in batches, locks are actually "checked out" to a particular client so that if the application data is partitioned across clients, most distributed locks are actually a local operation not involving a network call, etc.

Terracotta can indeed handle that if you tell it to - see the description of its DSO - Distributed Shared Objects.
It sounds cool but I would prefer something like EHcache (can be backed by Terracotta again) which functions on a bit more high level.

One emerging technology that somehow tackles this problem is Distributed Software Transactional Memory. You get strong data consistency guarantees (i.e. 1-copy serializability) and a powerful concurrency control mechanism: transactions.
AFAIK, there is no mature solution out there, but it is promising.

I would recommend that you investigate http://www.jboss.org/infinispan and see if it will fulfill your needs.

Scalability of a single server for running a Java Web application

I want to gain more insight regarding the scale of workload a single-server Java Web application deployed to a single Tomcat instance can handle. In particular, let's pretend that I am developing a Wiki application that has a similar usage pattern like Wikipedia. How many simultaneous requests can my server handle reliably before going out of memory or show signs of excess stress if I deploy it on a machine with the following configuration:
4-Core high-end Intel Xeon CPU
8GB RAM
2 HDDs in RAID-1 (No SSDs, no PCIe based Solid State storages)
RedHat or Centos Linux (64-bit)
Java 6 (64-bit)
MySQL 5.1 / InnoDB
Also let's assume that the MySQL DB is installed on the same machine as Tomcat and that all the Wiki data are stored inside the DB. Furthermore, let's pretend that the Java application is built on top of the following stack:
SpringMVC for the front-end
Hibernate/JPA for persistence
Spring for DI and Security, etc.
If you haven't used the exact configuration but have experience in evaluating the scalability of a similar architecture, I would be very interested in hearing about that as well.
Thanks in advance.
EDIT: I think I have not articulated my question properly. I mark the answer with the most up votes as the best answer and I'll rewrite my question in the community wiki area. In short, I just wanted to learn about your experiences on the scale of workload your Java application has been able to handle on one physical server as well as some description regarding the type and architecture of the application itself.

You will need to use group of tools :
Loadtesting Tool - JMeter can be used.
Monitoring Tool - This tool will be used to monitor various numbers of resources load. There are Lot paid as well as free ones. Jprofiler,visualvm,etc
Collection and reporting tool. (Not used any tool)
With above tools you can find optimal value. I would approach it in following way.
will get to know what should be ratio of pages being accessed. What are background processes and their frequency.
Configure my JMeter accordingly (for ratios) , and monitor performance for load applied ( time to serve page ...can be done in JMeter), monitor other resources using Monitor tool. Also check count of error ratio. (NOTE: you need to decide upon what error ratio is not acceptable.)
Keep increasing Load step by step and keep writting various numbers of interest till server fails completely.
You can decide upon optimal value based on many criterias, Low error rate, Max serving time etc.
JMeter supports lot of ways to apply load.

To be honest, it's almost impossible to say. There's probably about 3 ways (of the top of my head to build such a system) and each would have fairly different performance characteristics. You best bet is to build and test.
Firstly try to get some idea of what the estimated volumes you'll have and the latency constraints that you'll need to meet.
Come up with a basic architecture and implement a thin slice end to end through the system (ideally the most common use case). Use a load testing tool like (Grinder or Apache JMeter) to inject load and start measuring the performance. If the performance is acceptable - be conservative your simple implementation will likely include less functionality and be faster than the full system - continue building the system and testing to make sure you don't introduce a major performance bottleneck. If not come up with a different design.
If your code is reasonable the bottleneck will likely be the database and somewhere in the region 100s of db ops per second. If that is insufficient then you may need to think about caching.

Definitely take a look at Spring Insight for performance monitoring and analysis.

English Wikipedia has 14GB data. A 8GB mem cache would have very high hit/miss ratio, and I think harddisk read would be well within its capacity. Therefore, the app is most likely network bound.
English Wikipedia has about 3000 page views per second. It is possible that tomcat can handle the load by careful tuning, and the network has enough throughput to server the traffic.
So the entire wikipedia site can be hosted on one moderate machine? Probably not. Just an idea.
-
http://stats.wikimedia.org/EN/TablesWikipediaEN.htm
http://stats.wikimedia.org/EN/TablesPageViewsMonthly.htm

Tomcat doesn't allow for spreading over multiple machines. If you really are concerned about scalability, you must consider what to do when your application outgrows a single machine.

What is Terracotta?

What is Terracotta?
What services does it offer?
What problems does it solve?
What other products solve problems similar to those that Terracotta solves?

Find a great article about Terracotta and how it works at InfoQ written directly by Orion Letizi, co-founder and software engineer at Terracotta:
http://www.infoq.com/articles/open-terracotta-intro
It helped me to prepare for a webcast about terracotta and how it can be used for clustering and scaling grails applications and gave me a good overview about Terracotta.

I like to think about Terracottas DSO in terms of advanced parallel architectures: Terracotta turns your message-passing multicomputer into a usual unified memory multiprocessor. Multicomputers are different from multiprocessors in that processors share memory and, therefore, are easier to program because you just write into memory in usual multithreading way. Though, you it means that you need to explicitly synchronize access to the shared data using a lock, system saves you from the need to explicitly message-passing data marshaling and resolves the biggest parallel programming issue -- the cache coherence -- for you. Multiprocessor marshals the data for you when you take/release the lock. It is, therefore, desirable. But, initially you have a bunch of computers -- a multicomputer.
The magic is achieved by injecting some code into your classes at object field/lock access points. To correspond DB world, Terracotta considers all updates done under a lock atomic (transaction). Likewise multiprocessors can have a global storage, Terracotta allows to back up the locally updated data to disk.

What other products solve problems similar to those that Terracotta solves?
Try Hazelcast, It is super simple to use. Peer to peer, highly scalable, fully open source clustering technology for Java. It is simply distributed Map, Queue, MultiMap, ExecutorService. You can use its Map as a distributed cache.

I found an article in JavaWorld about Terracotta at http://www.javaworld.com/javaworld/jw-01-2009/jw-01-osjp-terracotta.html.

Terracotta + Compass = Hibernate + HSQLDB + JMS?

I am currently in need of a high performance java storage mechanism.
This means:
1) I have 10,000+ objects with 1 - Many Relationship.
2) The objects are updated every 5 seconds, with the most recent updates persistent in the case of system failure.
3) The objects need to be queryable in a reasonable time (1-5 seconds). (IE: Give me all of the objects with this timestamp or give me all of the objects within these location boundaries).
4) The objects need to be available across various Glassfish installs.
Currently:
I have been using JMS to distribute the objects, Hibernate as an ORM, and HSQLDB to provide the needed recoverablity.
I am not exactly happy with the performance. Especially the JMS part of this.
After doing some Stack Overflow research, I am wondering if this would be a better solution. Keep in mind that I have no experience with what Terracotta gives me.
I would use Terracotta to distribute objects around the system, and something else need to give the ability to "query" for attributes of those objects.
Does this sound reasonable? Would it meet these performance constraints? What other solutions should I consider?

I know it's not what you asked, but, you may want to start by switching from HSQLDB to H2. H2 is a relatively new, pure Java DB. It is written by the same guy who wrote HSQLDB and he claims the performance is much better. I'm using it for some time now and I'm very happy with it. It should be a very quick transition (add a Jar, change the connection string, create the database) so it's worth a shot.
In general, I believe in trying to get the most of what I have before rewriting the application in a different architecture. Try profiling it to identify the bottleneck first.

At first, Lucene isn't your friend here. (read only)
Terracotta is to scale around at the Logical layer! Your problem seems not to be related to the processing logic. It's more around the Storage/Communication point.
Identify your bottleneck! Benchmark the Storage/Logic/JMS processing time and overhead!
Kill JMS issues with a good JMS framework (eg. ActiveMQ) and a good/tuned configuration.
Maybe a distributed key=>value store is your friend. Try Project Voldemort!
If you like to stay at Hibernate and HSQL, check out the Hibernate 2nd level cache and connection pooling (c3po, container driven...)!

Several Terracotta users have built systems like this in the past, so I can you tell you by proof of existence that it can be done. :)
Compass does have support for clustering with Terracotta so that might help you. I suspect you might get further faster by just being careful with how you create your clustered data structures.
Regarding your requirements and Terracotta:
1) 10k objects is quite small from a Terracotta perspective
2) 5 sec update rate doesn't seem like an issue. Might depend how many nodes there are and whether there is any natural partitioning you can take advantage of. All updates will be persistent.
3) 1-5 second query time seems quite easy. Building your own well-organized data structures for lookup is the tricky part. Obviously you want to avoid scanning all the data.
4) Terracotta currently supports Glassfish v1 and v2.
If you post on the Terracotta forums, you could probably get more Terracotta eyeballs on the problem.

I am currently working on writing the client for a very (very) fast Key/Value distributed hash DB that provides set + list semantics. The DB is C99 and requires GCC and right now I'm battling with good old Java network IO to break my current 30,000 get/sets per/sec barrier. Hope to be done within the week. Drop me a line through my account and I'll get back when its show time.

With such a high update rate, Lucene is almost definitely not what you're looking for, since there is no way to update a document once it's indexed. You'd have to keep all the object versions in the index and select the one with the latest time stamp, which will kill your performance.
I'm no DB expert, but I think you should look into any one of the distributed DB solutions that's been on the news lately. (CouchDB, Cassandra)

Maybe you should take a look to: Prevayler.
Your objects are always in mem.
The "changes" to your objects are persisted.
From time to time you are able to take a snapshot: every object is persisted.

You don't say what vendor you are using for JMS, but I wouldn't surprise me if you have some bottle neck there. I couldn't get more than 100 messages a second from ActiveMq, and whatever I tried in terms of configuration of acknowledgment, queue size, etc we were unable to soak the CPU beyond a few percent.
The solution was to batch many queries into one JMS message. We had a simple class that either sent a batch of messages when it got to 200 queries or reached a timeout (we used 20ms), which gave us a dramatic increase in message throughput.

Guaranteed messaging is going to be much slower than volatile messaging. Given every object is updated every few second, you might consider batching your updates (into say 500 changes or by time say 1-10 ms' worth), sending over volatile messaging, and batching your transactions. In this case you are more likely to be limited by bandwidth. Tuning your use case you may find smaller batch sizes also work efficiently. If bandwidth is critical (say you have a 10 MB connection or slower, then you could use compression over JMS)
You can achieve much higher performance with a custom solution (which also might be simpler) e.g. Hazelcast & JGroups are free (you can add a node(s) which does the database synchronization so your main app doesn't slow down). There are commercial products which handle in the order of half a million durable messages/sec.

Terracotta + jofti = queryable persistent clustered data structures
Search google for terracotta querymap or visit tusharkhairnar.blogspot.com for querymap blog
You may want to integrate timasync as well to update your database. Database is is your system of record use terracotta as caching and database offloading mechanism you can even batch async updates to make it faster so that I'd db contains fairly recent data
Tushar
tusharkhairnar.blogspot.com

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.