In my system, which uses Infinispan 6.0.2, I have added some data to a cache and indexed it with Lucene. The search part works well.
But because the caches live on the server, whenever the server breaks down I need to reload the data and index it again. That takes a long time.
Then I found that Infinispan can store the index in a database and load from an existing Lucene index. I think that should fix my problem, but there is little information in the Infinispan user guide and I don't know how to do it. Can someone give me an example?
Infinispan includes a highly scalable distributed Apache Lucene Directory implementation. To create a Directory instance:
import org.apache.lucene.store.Directory;
import org.infinispan.lucene.directory.DirectoryBuilder;
import org.infinispan.Cache;
Cache cache = // create an Infinispan cache, configured as you like
Directory indexDir = DirectoryBuilder.newDirectoryInstance(cache, cache, cache, indexName)
.create();
The indexName is a unique key to identify your index. It takes the same role as the path did on filesystem based indexes: you can create several different indexes giving them different names. When you use the same indexName in another instance connected to the same network (or instantiated on the same machine, useful for testing) they will join, form a cluster and share all content. Using a different indexName allows you to store different indexes in the same set of Caches.
The cache is passed three times in this example, as that is ok for a quick demo, but as the API suggests it’s a good idea to tune each cache separately as they will be used in different ways. More details provided below.
New nodes can be added or removed dynamically, making the service administration very easy and also suited for cloud environments: it’s simple to react to load spikes, as adding more memory and CPU power to the search system is done by just starting more nodes.
Refer to the documentation and API reference for details.
Our team works with a well-known OSGi-based COTS product that runs as a standalone service (it does not interact with multiple instances of itself). The product contains an API which allows developers to build additional functionality into the project. This product stores what can be large JARs (1-5 MB) in ZooKeeper along with other configuration data. The COTS product also includes a lot of open-source software (Tomcat, ZooKeeper, many other Apache products, etc.). Thanks to the product being written in Java, I have a good understanding of the design and source code.
Our instance of the product has at times been failing to start up correctly. According to the vendor, the product is failing either to write to or to read from ZooKeeper correctly, either while it is stopping or while it is starting (the vendor does not yet know for sure). This problem only started to appear when we began adding these large JARs to the product's ./deploy folder.
I do not believe that the node or path cache use cases apply to this product: https://github.com/Netflix/curator/wiki/Recipes
Full disclosure: I currently have only a shallow understanding of ZooKeeper and have been trying without success to find a recipe or use case where one would use ZooKeeper to store large binary JARs. I also recognize that I may be asking the wrong question of this audience.
Is the above scenario a common use case for zookeeper?
ZooKeeper is a consensus store that allows multiple processes to share a common view of a shared resource. It is not a blob store and should not be used as one.
Firstly, ZooKeeper is a poor choice for storing data in a standalone instance. If you have no need for distributed consensus between multiple readers/writers then ZooKeeper is complete overkill.
Secondly, ZooKeeper nodes are designed to hold small data which changes frequently, potentially with many readers watching for changes - the JAR files that you are adding seem not to fit this pattern in that there aren't many readers (the product is a standalone instance) and the JAR files are large.
The default ZooKeeper configuration puts a hard limit of 1MB storage per ZNode, and ideally you store a lot less than that. This can be increased, but it is not advised that you do so. I would strongly recommend that you look into using a proper file store (or even just the file system as your node is standalone) to store these JAR files.
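To make the "proper file store" suggestion concrete, here is a minimal sketch of one way to split the data: the JAR bytes go to the file system, and only a small pointer record (checksum, size, path) is produced for whatever holds the configuration. The class name BlobPointerStore, the record layout and the 64 KB budget are all assumptions made up for illustration; nothing here comes from the product or from the ZooKeeper API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: keeps large artifacts on disk and returns a small pointer record. */
final class BlobPointerStore {
    // Assumed budget for the pointer record, far below the roughly 1 MB default znode limit.
    static final int MAX_POINTER_BYTES = 64 * 1024;

    private final Path blobDir;

    BlobPointerStore(Path blobDir) {
        this.blobDir = blobDir;
    }

    /** Writes the payload to the blob directory and returns the bytes a znode could hold instead. */
    byte[] store(String name, byte[] payload) {
        try {
            Files.createDirectories(blobDir);
            Path target = blobDir.resolve(name).toAbsolutePath();
            Files.write(target, payload);
            // Record layout (an assumption): checksum|size|absolute path
            String pointer = sha256Hex(payload) + "|" + payload.length + "|" + target;
            byte[] pointerBytes = pointer.getBytes(StandardCharsets.UTF_8);
            if (pointerBytes.length > MAX_POINTER_BYTES) {
                throw new IllegalStateException("pointer record is too large");
            }
            return pointerBytes;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the payload back from disk and checks it against the size and checksum in the record. */
    byte[] load(byte[] pointerBytes) {
        String[] parts = new String(pointerBytes, StandardCharsets.UTF_8).split("\\|", 3);
        try {
            byte[] payload = Files.readAllBytes(Paths.get(parts[2]));
            boolean sizeMatches = payload.length == Long.parseLong(parts[1]);
            if (!sizeMatches || !sha256Hex(payload).equals(parts[0])) {
                throw new IllegalStateException("blob does not match its pointer record");
            }
            return payload;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this split a 3 MB JAR turns into a pointer record of a couple of hundred bytes, and the checksum lets the reader detect a truncated or half-written file at startup.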
I'm trying to find the best indexing solution for implementing a search-engine in my clustered webapp, and I cannot find a clear answer to my questions in official documentations.
My Java/Java EE backend will be deployed across several load-balanced instances. The search engine will require near-real-time availability of indexed data (i.e. less than 5 seconds between indexing a document and being able to retrieve it).
Hibernate Search can work in a clustered environment with JGroups, but the documentation also says, about near-real-time, that "as a tradeoff it requires a non-clustered and non-shared index".
Does that mean that NRTIndexManager cannot be used in a JGroups slave/master setup, i.e. that it can only be used with one single node?
Does that mean that with such a setup, the availability of indexed data depends only on the refresh period (the period of the index copy to slave nodes)?
With the standard IndexManager, you only see the latest changes when they are written to the disk and you reopen your IndexSearcher.
By default, Hibernate Search writes to disk and opens a new IndexSearcher for each query so you're sure your searches are always in sync with your database.
The NRTIndexManager is different from the standard one because it allows you to search on the latest changes indexed without an explicit write on disk. It's typically used when you need a high throughput and you can't write everything on the disk right away. So it's not really correlated to the fact that you will see your changes right away or not: it's an optimization when you can allow some index data loss - the latest changes might be lost.
As mentioned in the documentation here http://docs.jboss.org/hibernate/search/5.5/reference/en-US/html_single/#jgroups-backend , you can use the JGroups backend synchronously, with Hibernate Search blocking until all the indexes are in sync. So it can work for your case.
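As a rough sketch of what a synchronous JGroups setup might look like, the helper below just collects the relevant settings into a java.util.Properties object. The class SearchBackendSettings is hypothetical, and the property names are quoted from memory of the 5.5 reference, so treat them as assumptions and check them against the section linked above before relying on them.

```java
import java.util.Properties;

/** Hypothetical helper that gathers the Hibernate Search settings for one cluster node. */
final class SearchBackendSettings {

    /** Builds the settings for a master or a slave node (property names to be verified). */
    static Properties forNode(boolean master) {
        Properties settings = new Properties();
        // One node acts as master and applies the index changes; the others forward to it.
        settings.setProperty("hibernate.search.default.worker.backend",
                master ? "jgroupsMaster" : "jgroupsSlave");
        // Process indexing work synchronously with the transaction commit.
        settings.setProperty("hibernate.search.default.worker.execution", "sync");
        // Assumed name of the option that makes the sender wait for the acknowledgement.
        settings.setProperty("hibernate.search.default.jgroups.block_waiting_ack", "true");
        return settings;
    }
}
```

The same keys could equally go into hibernate.properties or persistence.xml; building them in code is only a convenient way to show the master/slave difference in one place.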
Note that for 5.6 we are currently working on an Elasticsearch backend, which might be of interest to you as it is designed for exactly this kind of case. It's still in beta but it's already in pretty good shape. You might want to take a look at it: http://docs.jboss.org/hibernate/search/5.6/reference/en-US/html/ch11.html .
I am planning to upgrade Solr from a single instance to SolrCloud. Currently I have 5 cores, and each one is configured with a data import handler (DIH). I have deployed a web application alongside solr.war in the Tomcat folder; it triggers full imports and delta-imports periodically according to my project needs.
Now I am planning to create 2 shards for this application, keeping half of the data from my 5 cores in each shard. I do not understand how DIH will work in SolrCloud.
Is it fine if I start full-indexing from both shards?
Or do I need to do full indexing from only one shard?
Architecture will look like below
It all depends on how you create your SolrCloud collection: using composite ID routing or implicit routing. Composite ID routing takes care of spreading the documents across all available shards. You can initiate the import from any SolrCloud node. In the end the cloud environment will contain the imported documents spread across all shards.
If you use implicit routing, you control where each document is indexed.
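The difference between the two routing modes can be sketched in a few lines. This is only an illustration: Solr's composite ID router hashes the document ID into hash ranges assigned to the shards, while the modulo over String.hashCode below is a simplified stand-in, and the class and method names are made up.

```java
import java.util.List;

/** Illustrative only: contrasts hash-based placement with caller-chosen placement. */
final class ShardRoutingSketch {

    /**
     * Hash-based placement: the shard follows from the document ID, so it does not
     * matter which node receives the import. Not Solr's real hash function.
     */
    static int hashRoute(String docId, int shardCount) {
        return Math.floorMod(docId.hashCode(), shardCount);
    }

    /** Explicit placement: the caller names the target shard for each document. */
    static int explicitRoute(String shardName, List<String> shardNames) {
        int index = shardNames.indexOf(shardName);
        if (index < 0) {
            throw new IllegalArgumentException("unknown shard: " + shardName);
        }
        return index;
    }
}
```

With hash-based placement the same ID always lands on the same shard, which is why starting the import from either shard gives the same final distribution.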
You do not have to use the DIH. Alternatively you can write a small app that uses the solr client to populate the index, which gives you more control.
After lots of googling and reading I finally decided to implement DIH as follows. Please let me know your comments if you feel there will be issues with this architecture.
I have a problem with a product that I am currently working on. Essentially, there is some very commonly used (and very seldom updated) information that is retrieved from the database on server startup. We do not want to query the database every time this information is needed, because it is needed very frequently. There is a way to update this information through the application (admin only). When this method is used, the data in the database is updated and the cached data on that single server (1 of 4) is updated. Unfortunately, if a user hits any of the other servers, they will not see the updated information. Restarting the cluster remedies the problem; however, that is not a feasible solution for our production environment. Now that I have explained the situation, I am open to suggestions. Thank you for your time.
For a simple solution, you can go to the cluster in the admin console and ripple start it. That stops and starts the nodes gracefully, one at a time. The only impact is a 25% reduction in capacity while it is working.
IBM WebSphere Application Server has a Dynamic Cache that you can use to store Java objects. The cache can be set up to use replication over a replication domain so it can be shared across a cluster.
Your code would use the DistributedMap interface to interact with the cache. All settings for the dynamic cache can be included with your application or it can be pre-configured. Examples are included in the javadoc link.
(Similar to Java EE Application-scoped variables in a clustered environment (Websphere)?)
That is, I think the standard answer would be a "Distributed Object Store". But a crude alternative (that we use) would be to configure a list of server:port combinations to contact to inform each cluster member to update their own copy of the data.
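The crude alternative could look roughly like the sketch below. Every name in it (ReferenceDataCache, ClusterRefresher, Peer) is hypothetical, and the Peer interface stands in for whatever "contact server:port" means in practice, for example an HTTP call to an admin-only reload endpoint on each cluster member.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical per-server holder of the rarely changing data. */
final class ReferenceDataCache<T> {
    private final Supplier<T> loader; // stands in for the database query
    private final AtomicReference<T> current = new AtomicReference<>();

    ReferenceDataCache(Supplier<T> loader) {
        this.loader = loader;
        reload(); // initial load, as done on server startup
    }

    T get() {
        return current.get();
    }

    /** Re-reads the data from the source and swaps it in atomically. */
    void reload() {
        current.set(loader.get());
    }
}

/** Hypothetical notifier that asks every configured cluster member to reload. */
final class ClusterRefresher {
    /** Abstraction over one server:port entry of the configured list. */
    interface Peer {
        void requestReload();
    }

    private final List<Peer> peers = new CopyOnWriteArrayList<>();

    void register(Peer peer) {
        peers.add(peer);
    }

    /** Notifies all peers and returns how many of them failed. */
    int refreshAll() {
        int failures = 0;
        for (Peer peer : peers) {
            try {
                peer.requestReload();
            } catch (RuntimeException e) {
                failures++; // a real implementation would log and retry
            }
        }
        return failures;
    }
}
```

The admin update path would write to the database first and then call refreshAll(), so that each member reloads from the database rather than receiving the new value over the wire.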
I'm fairly new to the whole web programming stuff and have the following problem:
I have 2 webapps: one is an Axis web service and the other is a Spring application. Both should get a set of data from a library which holds the data in memory. This data is large, so copying the data for each app is not an option.
What I have done so far is develop the library, which loads the data and holds it in a static container. The plan was that both apps instantiate the class containing the container and can then access the data.
Sadly, this doesn't work. I get an exception saying that the objects I want to use are in different classloaders.
My question is: How can I provide such a container provider for both libraries in tomcat 7?
BTW: A database is not an option, because it's too slow.
Edit: I should have been clearer about the data. The data is a Topic Map stored in a topic map engine (see http://www.isotopicmaps.org ). The engine is used to access the data and is therefore the access point to the data. We have our own engine, which holds the data in memory and is faster than a database backend.
I want to have a servlet which handles the configuration and loading of topic maps, and then the two servlets above should be able to read and modify a topic map. That's why I need some sort of shared access point to the engine.
This is what distributed caches, key-value stores, document stores, and NoSQL databases are built for. There are many options, and new ones appear each day. The free and open-source options are likely to meet your needs and provide you with as much support as you will need. The one that is currently my favorite is Membase.
So you want a distributed in-memory cache for a server cluster. You can use among others Terracotta for this. You can find here a nice introduction to Terracotta.
Update: I actually disagree with the argument that a database is "too slow". If it's slow, then the data model and/or data access code is simply badly designed.