Does anyone have experience with using Terracotta with Hibernate Search to satisfy application Queries?
If so:

1. What magnitude of "object updates" can it handle? (How's the performance?)
2. What kind of performance do the queries have?
3. Is it possible to use Terracotta and Hibernate Search without even having a backing database, satisfying all "queries" in memory?
I am Terracotta's CTO. I spent some time last month looking at Hibernate Search. It is not built in a way that can be clustered transparently by Terracotta. Here's why, in a nutshell: Hibernate Search has custom-built JMS replication of Lucene indexes across JVMs.
The basic idea in Search is that talking to local disk under Lucene works really well, whereas fragmenting or partitioning Lucene indexes across the network introduces so much latency as to make Lucene seem bad when it is not Lucene's fault at all. To that end, Hibernate Search doesn't rely on JBossCache or any in-memory partitioning/caching schemes, and instead relies on JMS and each JVM's local disk in order to provide up-to-date indexing across a cluster with simultaneously low latency. The beauty of Hibernate Search is that standard Hibernate queries and more can be launched through Hibernate at these natural-language indexes on each machine.
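For the curious, that JMS replication is driven purely by Hibernate Search configuration. A rough sketch of what a slave node might look like (property names as per the Hibernate Search 3.x docs; the paths and queue name are only placeholders):

```java
// Sketch of a slave node: the master node applies queued index changes to the
// authoritative index copy. Paths and the queue name below are placeholders.
org.hibernate.cfg.Configuration cfg = new org.hibernate.cfg.Configuration();
// Each JVM keeps its own on-disk Lucene index, refreshed from a shared master copy.
cfg.setProperty("hibernate.search.default.directory_provider", "filesystem-slave");
cfg.setProperty("hibernate.search.default.indexBase", "/var/lucene/indexes");
cfg.setProperty("hibernate.search.default.sourceBase", "/mnt/shared/lucene/indexes");
// Index updates are not applied locally; they are posted to a JMS queue instead.
cfg.setProperty("hibernate.search.worker.backend", "jms");
cfg.setProperty("hibernate.search.worker.jms.connection_factory", "java:/ConnectionFactory");
cfg.setProperty("hibernate.search.worker.jms.queue", "queue/hibernatesearch");
```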
At Terracotta it turns out we had a similar idea to Emmanuel's and built a SearchableMap product on top of Compass. Each machine gets its own Compass store, and the store is configured to spill to disk locally. Terracotta is used to create a multi-master writing capability where any JVM can add to the index, and the delta is sent through Terracotta to be replayed/reapplied locally to each disk. It works just like Hibernate Search, but with DSO as the networking protocol in place of JMS, and with Compass interfaces instead of the nice Hibernate ones.
I think we will support Hibernate Search w/ help from JBoss (they would need to factor out the JMS impl as pluggable) by end of the year.
Now to your questions directly:
1. Object updates/sec in Hibernate Search or SearchableMap should be quite high, because both send only deltas. In Hibernate's case it is a function of the JMS provider in use; in Terracotta's it scales just by adding more Terracotta servers to the array.
2. Query performance in both is very fast: local-memory performance in most cases. And if you need to page in from disk, it turns out most OSes do a good job and can respond to queries far faster than any network-based clustering can.
3. It will be, I think, once we get JBoss to factor out their JMS assumptions, etc.
Cheers,
--Ari
Since people on the Hibernate forums keep referring to this post, I feel the need to point out that while Ari's comments were correct at the beginning of 2009, we have been developing and improving a lot since then.
Hibernate Search provides a set of backend channels out of the box, like the already mentioned JMS-based one and a more recent addition using JGroups, but we also made it pretty easy to plug in alternative implementations or override parts of them.
In addition to using a custom backend, since version 4 it's possible to replace the whole strategy: instead of changing only the backend implementation, you can use an IndexManager, which follows a different design and doesn't use a backend at all. At this time we have only two IndexManagers, but we're working on more alternatives; again, the idea is to provide nice implementations for the most common use cases.
It does have an Infinispan-based backend for very quick distribution of the index across different nodes, and it should be straightforward to contribute one based on Terracotta or any other clustering technology. More solutions are coming.
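For a flavour of what that looks like, switching channels or directory providers is just configuration. A rough sketch (property names as in the 4.x documentation; the values here are placeholders, not a recommended setup):

```java
org.hibernate.cfg.Configuration cfg = new org.hibernate.cfg.Configuration();
// Option A: store the index itself in an Infinispan cache shared across the cluster.
cfg.setProperty("hibernate.search.default.directory_provider", "infinispan");
// Option B: keep per-node filesystem indexes and ship index changes over JGroups,
// with one node configured as master to apply them:
// cfg.setProperty("hibernate.search.default.worker.backend", "jgroupsMaster"); // master node
// cfg.setProperty("hibernate.search.default.worker.backend", "jgroupsSlave");  // other nodes
```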
I'm looking for a distributed cache for key-value pairs with these features -
Persistence to disk
Open Source
Java Interface
Fast read/write with minimum memory utilisation
Easy to add more machines to the database (Horizontally Scalable)
What are the databases that fit the bill?
The Redisson framework also provides distributed cache capabilities based on Redis.
There are a lot of options that you can make use of.
Redis - the one you've already mentioned yourself. It's a distinct process: very fast and key-value for sure, but it's not "in memory with your application", meaning you'll always do socket I/O in order to reach the Redis process.
It's not written in Java, but it provides a decent Java driver to work with; moreover, there is a Spring integration.
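If it helps, a minimal sketch with the Jedis client (host, port and key names are just placeholders); note that every call is a socket round-trip to the Redis process:

```java
import redis.clients.jedis.Jedis;

public class RedisSketch {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost", 6379);
        try {
            jedis.set("user:42:name", "Alice");      // network round-trip
            String name = jedis.get("user:42:name"); // another round-trip
            System.out.println(name);
        } finally {
            jedis.close();
        }
    }
}
```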
If you want a java based solution consider the following:
memcached - a distributed cache
Hazelcast - it's a data grid; it's much more than simply a key-value store, but you might be interested in it as well (a quick sketch follows after this list).
Infinispan - folks from JBoss have created this one
EHCache - a popular distributed cache
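As a taste of the Hazelcast option mentioned above, a minimal sketch (map and key names are placeholders); starting the same code on another machine joins it to the cluster automatically:

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

import java.util.Map;

public class HazelcastSketch {
    public static void main(String[] args) {
        // Boots an embedded cluster member; additional members discover each other.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        Map<String, String> users = hz.getMap("users"); // distributed, partitioned map
        users.put("42", "Alice");
        System.out.println(users.get("42"));
        hz.shutdown();
    }
}
```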
Hope this helps
Currently I am gathering information on what database service we should use.
I am still very new to web development, but we think we want a NoSQL database.
We are using Java with Play! 2.
We only need a database for user registration.
Now, I am already familiar with GAE ndb, which is a key-value store like DynamoDB. MongoDB is a document DB.
I am not sure what advantages each solution has.
I also know that DynamoDB runs on SSDs and MongoDB is in-memory.
An advantage of MongoDB would be that Play! for Java already "supports" MongoDB.
Now we don't expect too much database usage, but we would need to scale pretty fast if our app grows.
What alternatives do I have? What pros/cons do they have?
Considering:
Pricing
Scaling
Ease of use
Play! support?
(Disclosure: I'm a founder of MongoHQ, and would obviously prefer you choose us)
The biggest difference from a developer perspective is the querying capability. On DynamoDB, you need the exact key for a given document, or you need to build your keys in such a way that you can use them for range-based queries. In Mongo, you can query on the structure of the document, add secondary indexes, do aggregations, etc.
The advantage of doing it with k/v only is that it forces you to build your application in a way that DynamoDB can scale. The advantage of Mongo's flexible queries against your docs is that you can do much faster development, even if you discount what the Play framework includes. It's always going to be quicker to do new development with something like Mongo because you don't have to make your scaling decisions from the get-go.
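To make the querying difference concrete, a small sketch with the MongoDB Java driver; the database, collection and field names are made up for illustration:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class MongoQuerySketch {
    public static void main(String[] args) {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> users = client.getDatabase("app").getCollection("users");

        // Secondary index on a field that is not the primary key
        users.createIndex(Indexes.ascending("email"));

        // Query on the document's structure instead of an exact key
        Document user = users.find(Filters.eq("email", "alice@example.com")).first();
        System.out.println(user);

        client.close();
    }
}
```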
Implementation-wise, both Mongo and DynamoDB can grow basically unbounded. Dynamo abstracts most of the decisions on storage, RAM and processor power. Mongo requires that you (or someone like us) make decisions on how much RAM to have, what kind of disks to use, how to manage bottlenecks, etc. The operational hurdles are different, but the end result is very similar. We run multiple Mongo DBs on top of very fast SSDs and it works phenomenally well.
Pricing is incredibly difficult to compare, unfortunately. DynamoDB pricing is based on a nominal per GB fee, but you pay for data access. You need to be sure you understand how your costs are going to grow as your database gets more active. I'm not sure I can predict DynamoDB pricing effectively, but I know we've had customers who've been surprised (to say the least) at how expensive Dynamo ended up being for the stuff they wanted to do.
Running Mongo is much more predictable cost-wise. You likely need 1GB of RAM for every 10GB of data, running a redundant setup doubles your price, etc. It's a much easier equation to wrap your head around and you're not in for quite as nasty of a shock if you have a huge amount of traffic one day.
By far the biggest advantage of Mongo (and MongoHQ) is this: you can leave your provider at any time. If you get irked at your Mongo provider, it's only a little painful to migrate away. If you get irked at Amazon, you're going to have to rewrite your app to work with an entirely different engine. This has huge implications for the support you should expect to receive: hosting Mongo is competitive enough that you get very good support from just about any Mongo-specific company you choose (or we'd die).
I addressed scaling a little bit above, but the simplest answer is this: if you define your data model well, either option will scale out just about as far as you can imagine you'd need to go. You are likely to not do this right with Mongo at first, though, since you'll probably be developing quickly. This means that once you can't scale vertically any more (by adding RAM, disk speed, etc to a single server) you will have to be careful about how you choose to shard. The biggest difference between Mongo and Dynamo scaling is when you choose to make your "how do I scale my data?" decisions, not overall scaling ability.
So I'd choose Mongo (duh!). I think you can build a fantastic app on top of DynamoDB, though.
As you said, MongoDB is one step ahead of the other options, because you can use the Morphia plugin to simplify DB interactions (you have JPA support as well). The Play framework provides a CRUD module (admin console) and a Secure module as well (for your overall login system), so I strongly suggest you have a look at them.
Between transitions of the web app, I use a Session object to save my objects in.
I've heard there's a program called memcached, but there's no compiled version of it on the site;
besides, some people think it has real disadvantages.
Now I wanna ask you.
What are alternatives, pros and cons of different approaches?
Is memcached painful for sysadmins to install? Is it difficult to embed into an existing infrastructure, from a sysadmin's perspective?
What about using a database to hold temporary data between web app transitions?
Is it a normal practice?
What about using a database to hold temporary data between web app transitions? Is it a normal practice?
Databases indeed already have a cache. A well-designed application should try to leverage it to reduce disk IO.
The database cache works at the data level. That's why other caching mechanisms can be used to address different levels. At the Java level, you can use Hibernate's 2nd-level cache, which can cache entities and query results. This can notably reduce the network IO between the app server and the database.
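For example, enabling Hibernate's 2nd-level cache is mostly configuration. A rough sketch with Ehcache as the provider (property names per the Hibernate docs; the exact region factory class varies with the Hibernate version):

```java
// Sketch only: the region factory class name differs between Hibernate 3.x and 4.x.
org.hibernate.cfg.Configuration cfg = new org.hibernate.cfg.Configuration();
cfg.setProperty("hibernate.cache.use_second_level_cache", "true");
cfg.setProperty("hibernate.cache.use_query_cache", "true");
cfg.setProperty("hibernate.cache.region.factory_class",
        "org.hibernate.cache.ehcache.EhCacheRegionFactory");
// Entities to cache are then marked with
// @org.hibernate.annotations.Cache(usage = CacheConcurrencyStrategy.READ_WRITE)
```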
Then you may want to address horizontal scalability, that is, adding servers to manage the load. In this case, the 2nd-level cache needs to be distributed across the nodes. This exists (see JBoss Cache), but can get slightly complicated to manage.
Distributed caches tend to work better if they use a simpler scheme based on key/value. That's what memcached is, but there are also other similar solutions. The biggest problem with distributed caches is invalidation of outdated entries, which can itself turn into a performance bottleneck.
Don't think that you can use a distributed cache as-is to make your performance problems vanish. Designing a scalable distributed architecture requires experience and is always a matter of trade-offs between what to optimize and what not to.
To come back to your question: for a regular application, there is IMHO no need for a distributed cache. Decent disk IO and network IO usually lead to decent performance.
EDIT
For non-persistent objects, you have several options:
The HttpSession. Objects need to implement Serializable. The exact way the session is managed depends on the container. In a cluster, the session is usually replicated to at least one other node, so that if one node crashes you still have a copy. Session affinity is then used to route requests to the server that has the session in memory. (A minimal sketch follows after this list.)
Distributed cache. A system like memcached may indeed make sense, but I don't know the details.
Database. You could of course dump any Serializable object in the database in a BLOB. Can be an option if the web servers are not as reliable as the database server.
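A minimal sketch of the HttpSession option above; the Cart class and the attribute name are made up for illustration:

```java
import java.io.Serializable;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpSession;

public class SessionSketch {

    // Placeholder type; anything stored in the session should be Serializable
    // so that the container can replicate it across the cluster.
    public static class Cart implements Serializable {}

    public static void rememberCart(HttpServletRequest request, Cart cart) {
        HttpSession session = request.getSession(true); // create the session if absent
        session.setAttribute("cart", cart);
    }

    public static Cart currentCart(HttpServletRequest request) {
        HttpSession session = request.getSession(false); // do not create a new one
        return session == null ? null : (Cart) session.getAttribute("cart");
    }
}
```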
Again, for regular application, I would try to go as far as possible with the HttpSession.
How about Ehcache? It's an easy-to-use, pure-Java solution, ready to plug into Hibernate. As far as I remember, it's also supported by containers.
It's quite painless in my experience.
http://docs.jboss.org/hibernate/core/3.3/reference/en/html/performance.html#performance-cache
This page should have everything that you need (hopefully!).
I'm developing a mission-critical solution where data integrity is paramount and performance a close second. If data gets stuffed up, it's gonna be cata$trophic.
So, I'm looking for the C/C++ version of JTA (Java Transaction API). Does anyone know of any C or C++ libraries that supports distributed transactions? And yes, I've googled it ... unsuccessfully.
I'd hate to be told that there isn't one and I'd need to implement the protocol specified by Distributed TP: The XA Specification.
Please help!
Edit (responding to kervin): If I need to insert records across multiple database servers and I need to commit them atomically, products like Oracle will have solutions for it. If I've written my own message queue server and I want to commit messages to multiple servers atomically, I'll need something like JTA to make sure that I don't stuff up the atomicity of the transaction.
Encina, DCE-RPC, TUXEDO, possibly CORBA (though I hesitate to suggest using CORBA), MTS (again, hmm).
These are the kind of things you want for distributed transaction processing.
Encina used to have a lot of good documentation for its DCE-based system.
There are hundreds. Seriously.
As far as general areas go, check out Service-Oriented Architecture; most of the new products are coming out in that area, e.g. RogueWave HydraSCA.
I would start with plain Rogue Wave Suite, then see if I needed an Enterprise Service Bus after looking at that design.
That probably depends a lot on your design requirements and budget.
Oracle Tuxedo is the 800-pound gorilla in this space and was actually the basis for much of the XA specification. It provides distributed transaction management and can handle hundreds of thousands of requests per second.
For more information: http://www.oracle.com/tuxedo
Also, if you like SCA (Service Component Architecture), there is an add-on product for Tuxedo called SALT that provides an SCA container for programming in C++, Python, Ruby, and PHP.
I've never used a cache like this before. The problem is that I want to load 500,000+ records out of a database and do some selecting/filtering wicked fast.
I'm thinking about using a cache, and preliminarily found EHCache and OSCache, any opinions?
Judging by their releases page, OSCache has not been actively maintained since 2007. This is not a good thing. EhCache, on the other hand, is under constant development. For that reason alone, I would choose EhCache.
Edit Nov 2013: OSCache, like the rest of OpenSymphony, is dead.
They're both pretty solid projects. If you have pretty basic caching needs, either one of them will probably work as well as the other.
You may also wish to consider doing the filtering in a database query if it's feasible. Often, using a tuned query that returns a smaller result set will give you better performance than loading 500,000 rows into memory and then filtering them.
I've used JCS (http://jakarta.apache.org/jcs/) and it seems solid and easy to use programmatically.
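A minimal sketch of what that looks like, going from memory of the JCS 1.x API; the region name is a placeholder and must be defined in cache.ccf:

```java
import org.apache.jcs.JCS;
import org.apache.jcs.access.exception.CacheException;

public class JcsSketch {
    public static void main(String[] args) throws CacheException {
        JCS cache = JCS.getInstance("default"); // region configured in cache.ccf
        cache.put("42", "Alice");
        String name = (String) cache.get("42");
        System.out.println(name);
    }
}
```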
It sort of depends on your needs. If you're doing the work in memory on one machine, then Ehcache will work perfectly, assuming you have enough RAM, or a fast enough hard disk so that the overflow doesn't cause disk paging/thrashing. If you find you need to scale out, perhaps because this particular operation happens a lot, then you'll probably want clustering. JGroups/TreeCache from JBoss support this, and so does Ehcache (I think); I know it definitely works if you use Ehcache with Terracotta, which is a very slick integration.
This doesn't speak directly to the merits of EHCache versus OSCache, so here's that answer: EHCache seems to have the most momentum (it used to be the default Hibernate cache, is well known, and is under active development, including a new cache server), while OSCache seemed (at least at one point) to have slightly more features, but I think that with the options mentioned above those advantages are moot or superseded.
Ah, the other thing I forgot to mention: whether transactionality of the data is important will also refine the list of valid choices.
Choose a cache that complies with JSR 107, which will make your job easy when you want to migrate from one implementation to another. To be specific on the question, go for Ehcache, which is the more popular and widely used Java caching solution. We use Ehcache extensively and it works for us.
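A minimal sketch against the javax.cache (JSR 107) API; the provider behind it, Ehcache or any other compliant implementation on the classpath, can be swapped without touching this code. Cache and key names are placeholders:

```java
import javax.cache.Cache;
import javax.cache.CacheManager;
import javax.cache.Caching;
import javax.cache.configuration.MutableConfiguration;

public class Jsr107Sketch {
    public static void main(String[] args) {
        // Resolves whatever JSR 107 provider is on the classpath.
        CacheManager manager = Caching.getCachingProvider().getCacheManager();
        MutableConfiguration<String, String> config =
                new MutableConfiguration<String, String>().setTypes(String.class, String.class);
        Cache<String, String> users = manager.createCache("users", config);
        users.put("42", "Alice");
        System.out.println(users.get("42"));
    }
}
```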
Other answers discuss the pros and cons of caches, but I am wondering whether you would actually benefit from a cache at all. It is not quite clear exactly what you plan on doing here, and why a cache would be beneficial: if you have the data set at your disposal, just access it. A cache only helps reuse things between otherwise independent tasks. If that is what you are doing, yes, caching can help. But if it is one big task that can carry its data set along, caching would add no value.
Either way, I recommend using them with Spring Modules.
The cache can be transparent to the application, and cache implementations are trivially easy to swap.
In addition to OSCache and EHCache, Spring Modules also support Gigaspaces and JBoss cache.
As to comparisons....
OSCache is easier to configure
EHCache has more configuration options
They are both rock solid; both support cache mirroring, both work with Terracotta, and both support in-memory and on-disk caching.
I have used OSCache on several Spring projects with spring-modules, using the AOP-based configuration.
Recently I looked at using OSCache + spring-modules on a Spring 3.x project, but found that spring-modules annotation-based caching is not supported (even by the fork).
I recently found out about this project:
http://code.google.com/p/ehcache-spring-annotations/
It supports Spring 3.x with declarative annotation-based caching using Ehcache.
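For a flavour of the declarative style, a rough sketch; I'm writing the annotation from memory of that project's docs, so treat the exact package and attribute names as assumptions:

```java
import com.googlecode.ehcache.annotations.Cacheable; // assumed package name

public class ProductRepository {

    // The method result is cached in the "products" Ehcache cache, keyed on the argument.
    @Cacheable(cacheName = "products")
    public String findNameById(long id) {
        return expensiveLookup(id); // only executed on a cache miss
    }

    private String expensiveLookup(long id) {
        return "product-" + id; // placeholder for a real database query
    }
}
```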
I mainly use EhCache because it used to be the default cache provider for Hibernate. There is a list of caching solutions on Java-Source.net.
I used to have a link that compared the main caching solutions. If I find it I will update this answer.
OSCache is pretty much dead, as it was abandoned a few years ago. You may take a look at Cacheonix; it's been actively developed, and we've just released v2.2.2 with support for caching in the web tier. I'm a committer, so you can reach out if you have any questions.