SolrCloud & Data Import Handler - Java

I am planning to upgrade Solr from the single-instance option to the cloud option. Currently I have 5 cores, and each one is configured with a data import handler. I have deployed a web application alongside solr.war inside the Tomcat folder, which triggers full imports and delta imports periodically according to my project needs.
Now I am planning to create 2 shards for this application, keeping half of my 5 cores' data in each shard. I am not able to understand how DIH will work in SolrCloud.
Is it fine if I start full indexing from both shards?
Or do I need to do full indexing from only one shard?
The architecture will look like the diagram below.

It all depends on how you create your SolrCloud collection: using compositeId or implicit routing. CompositeId routing takes care of spreading the documents across all available shards; you can initiate the import from any SolrCloud node, and in the end the cloud environment will contain the imported document indices spread across all shards.
If you use implicit routing, you have control over where each document's index is kept.
You do not have to use the DIH. Alternatively, you can write a small app that uses the Solr client to populate the index, which gives you more control.
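For illustration, here is a minimal SolrJ sketch of that alternative; the ZooKeeper address and collection name are placeholders, and the exact CloudSolrClient builder signature varies between SolrJ versions:

import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

// Connect via ZooKeeper so SolrJ can route each document to the right shard
CloudSolrClient client = new CloudSolrClient.Builder(
        Collections.singletonList("zkhost:2181"), Optional.empty()).build();
client.setDefaultCollection("mycollection");

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "doc-1");   // with compositeId routing, the id determines the shard
doc.addField("title", "example");
client.add(doc);
client.commit();
client.close();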

After a lot of googling and reading I finally decided to implement DIH as follows. Please let me know if you see any issues with this architecture.

Related

Couchbase Cluster and Bucket management

I am developing a server-side app using Java and Couchbase. I am trying to understand the pros and cons of handling cluster and bucket management from the Java code versus using the Couchbase admin web console.
For instance, should I handle creating, removing, indexing, and updating buckets in my Java code?
The reason I want to handle as many Couchbase administration functions as possible is that my app is expected to run on-prem, not as a cloud service. I want to avoid requiring our customers to learn how to administer Couchbase.
The main reason to use the management APIs programmatically, rather than the admin console, is exactly as you say: you need to handle initialization and maintenance yourself, especially if the application needs to be deployed elsewhere. Generally speaking, you'll want some sort of database initializer or manager module in your code which handles bootstrapping the correct buckets and indexes if they don't exist.
If all you need to do is handle preparing the DB environment one time for your application, you can also use the command line utilities that come with Couchbase, or send calls to the REST API. A small deployment script would probably be easier than writing code to do the same thing.
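As a rough sketch of that initializer idea using the Couchbase Java SDK 2.x management API (the credentials, bucket name, and quota below are placeholders):

import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.cluster.ClusterManager;
import com.couchbase.client.java.cluster.DefaultBucketSettings;

CouchbaseCluster cluster = CouchbaseCluster.create("127.0.0.1");
ClusterManager manager = cluster.clusterManager("Administrator", "password");

// Bootstrap the bucket only if it does not already exist
if (!manager.hasBucket("myapp")) {
    manager.insertBucket(DefaultBucketSettings.builder()
            .name("myapp")
            .quota(100)   // RAM quota in MB
            .build());
}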

CouchbaseClient vs CouchbaseCluster

I am trying to implement couchbase in my application.
I am confused with
com.couchbase.client.CouchbaseClient
AND
com.couchbase.client.java.CouchbaseCluster.
I tried to google CouchbaseClient vs CouchbaseCluster but didn't find which one is better, or their pros and cons.
I know there are 3 types of Couchbase client: one is vBucket-aware, another is the traditional old client which supports auto clustering via a Moxi server.
Can someone who has already used Couchbase provide me with a link or detailed information about these two Java clients?
I have done some homework on CouchbaseClient and CouchbaseCluster, like inserting, updating, and deleting documents via both.
With CouchbaseClient the stored documents are serialized, and you cannot view or edit them via the Couchbase Admin Console, whereas documents like StringDocument, JsonDocument, and JsonArrayDocument stored via CouchbaseCluster can be viewed and edited in the Couchbase Admin Console.
My requirement is a Couchbase client which is auto-configurable (vBucket-aware): if I add new nodes to a cluster it should auto-detect them, and if any node fails it should detect that without throwing exceptions. Further, if I add a new cluster, I'd like it to auto-detect it and start using it. I don't want to modify the application code for any of these things.
There are now two generations of official Couchbase Java SDKs:
Generation 1 (currently 1.4.x, not sure of the patch version) is derived from an old memcached client, Spymemcached... it is now bugfix-only, and it's the one where you have CouchbaseClient as the primary API.
Generation 2 is a rewrite, layered into a core artifact and a java-client artifact in Maven. The current version is 2.1.3. This is the one where you deal with CouchbaseCluster.
In the old one, you had to instantiate one CouchbaseClient for each bucket you dealt with.
In the new generation, the notions of cluster and bucket are first-class citizens, and you can (and should) reuse the same Cluster instance to open references to different Buckets. The Buckets should also be reused (don't open the same bucket several times); resources are better pooled this way.
Also, the new generation has more coherent APIs, uses RxJava for asynchronous processing, etc. It is cluster-aware and will get updates on the topology of the cluster (new nodes, failing nodes, etc.).
Note that these two generations are different artifacts in Maven (the old one is couchbase-client while the new one is java-client).
There's no way you can get such a notification if you "add a new cluster", but that operation doesn't really make sense to me...
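A minimal sketch of the generation-2 style (java-client 2.x), with placeholder node and bucket names:

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;

// One Cluster instance for the whole application...
Cluster cluster = CouchbaseCluster.create("node1.example.com");
// ...and one reused Bucket reference per bucket
Bucket bucket = cluster.openBucket("default");

bucket.upsert(JsonDocument.create("user::1",
        JsonObject.create().put("name", "alice")));

cluster.disconnect();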

How to load an existing Lucene Index in Infinispan?

In my system with Infinispan 6.0.2, I have added some data to a cache and indexed it with Lucene. It works well for the searching part.
But because the caches live on the server, sometimes when the server breaks down I need to reload the data and index it again. That takes a long time.
Then I found that Infinispan can store the index in a database and load from an existing Lucene index. I think that should fix my problem, but there is little information in the Infinispan user guide and I don't know how to do it. Can someone give me an example?
Infinispan includes a highly scalable distributed Apache Lucene Directory implementation. To create a Directory instance:
import org.apache.lucene.store.Directory;
import org.infinispan.Cache;
import org.infinispan.lucene.directory.DirectoryBuilder;
import org.infinispan.manager.DefaultCacheManager;

// Create (or obtain) an Infinispan cache, configured as you like
Cache<?, ?> cache = new DefaultCacheManager().getCache();

String indexName = "myIndex";
Directory indexDir = DirectoryBuilder.newDirectoryInstance(cache, cache, cache, indexName)
        .create();
The indexName is a unique key to identify your index. It takes the same role as the path did for filesystem-based indexes: you can create several different indexes by giving them different names. When you use the same indexName in another instance connected to the same network (or instantiated on the same machine, which is useful for testing), they will join, form a cluster, and share all content. Using a different indexName allows you to store different indexes in the same set of Caches.
The cache is passed three times in this example, which is fine for a quick demo, but as the API suggests it's a good idea to tune each cache separately, as they will be used in different ways. More details are provided in the Infinispan documentation.
New nodes can be added or removed dynamically, making service administration very easy and also suited for cloud environments: it's simple to react to load spikes, as adding more memory and CPU power to the search system is done by just starting more nodes.
Refer to the documentation and API reference for details.
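To show how this Directory plugs into standard Lucene, here is a hedged sketch of writing to and reading from it; exact constructor signatures depend on your Lucene version (older 3.x/4.x releases also take a Version argument):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;

// Write through the Infinispan-backed Directory just like any other Directory
IndexWriter writer = new IndexWriter(indexDir, new IndexWriterConfig(new StandardAnalyzer()));
Document doc = new Document();
doc.add(new Field("body", "hello infinispan", TextField.TYPE_STORED));
writer.addDocument(doc);
writer.close();

// Any node sharing the same caches and indexName sees the same index
DirectoryReader reader = DirectoryReader.open(indexDir);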

Is there an embeddable Java alternative to Redis?

According to this thread, Jedis is the best thing to use if I want to use Redis from Java.
However, I was wondering if there are any libraries/packages providing similarly efficient set operations to those that already exist in Redis, but that can be directly embedded in a Java application without the need to set up separate servers (much like Jetty can be embedded as a web server).
To be more precise, I would like to be able to do the following efficiently:
There is a large set of M users (M not known in advance).
There is a large set of N items.
We want users to examine items, one user/item at a time, which produces a stored result (in a normal database.)
Each time a user arrives, we want to assign to that user the item with the least number of existing results that the user has not already seen before. This produces an approximate round-robin assignment of the items over all arriving users, when we just care about getting all items looked at approximately the same number of times.
The above happens in a parallelized fashion. When M and N are large, Redis accomplishes the above much more efficiently than SQL queries. Is there some way to do this using an embeddable Java library that is a bit more lightweight than starting a Redis server?
I recognize that it's possible to write a pile of code using Java's concurrency libraries that would roughly approximate this (and to some extent, I have done that), but that's not exactly what I'm looking for here.
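For what it's worth, here is a purely illustrative in-memory sketch of the assignment logic, ignoring persistence and the distribution concerns that make Redis attractive:

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class Assigner {
    // item -> number of stored results so far
    private final Map<String, Integer> resultCounts = new ConcurrentHashMap<>();

    void addItem(String item) {
        resultCounts.putIfAbsent(item, 0);
    }

    // Pick the unseen item with the fewest results (approximate round-robin)
    synchronized String assign(Set<String> seenByUser) {
        return resultCounts.entrySet().stream()
                .filter(e -> !seenByUser.contains(e.getKey()))
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }

    void recordResult(String item) {
        resultCounts.merge(item, 1, Integer::sum);
    }
}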
Have a look at Project Voldemort. It's a distributed key-value store created by LinkedIn, and it supports the ability to be embedded.
The quick start guide has a small example of running the server embedded vs. stand-alone.
import voldemort.server.VoldemortConfig;
import voldemort.server.VoldemortServer;
// Start an embedded Voldemort server inside the current JVM
VoldemortConfig config = VoldemortConfig.loadFromEnvironmentVariable();
VoldemortServer server = new VoldemortServer(config);
server.start();
I don't know much about Redis, so I can't compare them feature for feature. In the project where we used Voldemort, we used its read-only backing store with great results. It allowed us to "precompile" a twice-daily database in our processing data center and "ship it" out to edge data centers. That way each edge data center had a local copy of its dataset.
EDIT: After rereading your question, I wanted to add Guava's Table. This Table data structure may also be something you're looking for, and it is similar to what you get with many NoSQL databases.
Hazelcast provides a number of distributed data structure implementations which can be used as a pure Java alternative to Redis' services. You could then ship a single "jar" with all required dependencies to run your application. You may have to adjust for the slightly different primitives relative to Redis in your own application.
Commercial solutions in this space include Terracotta's Enterprise Ehcache and Oracle Coherence.
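Going back to Hazelcast, here is a sketch of the embedded approach (the structure names are illustrative, and exact packages vary slightly across Hazelcast versions):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import java.util.Map;
import java.util.Set;

// Starts an in-process member; additional JVMs running this code form a cluster
HazelcastInstance hz = Hazelcast.newHazelcastInstance();

Set<String> seen = hz.getSet("seen::user42");   // distributed set, akin to a Redis set
Map<String, Long> counts = hz.getMap("resultCounts");

seen.add("item-1");
counts.merge("item-1", 1L, Long::sum);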
Take a look at LMDB (Lightning Memory-Mapped Database), because I needed exactly the same thing. I deploy a Dropwizard application into a container, and adding Redis or another external dependency is painful. This seems to perform well, and the project has good activity. FYI, though, I have not yet used this in production.
https://github.com/lmdbjava/lmdbjava
Google's Guava library provides friendly versions of the same (and more) set operators that Redis provides.
https://code.google.com/p/guava-libraries/wiki/CollectionUtilitiesExplained
e.g.
Guava                      Redis
Sets.intersection(a, b)    SINTER a b
a.size()                   SCARD a
Sets.difference(a, b)      SDIFF a b
Sets.union(a, b)           SUNION a b
Multisets are a reasonably straightforward proxy for redis sorted-sets as well.
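For instance (a toy example; any Set implementations will do):

import com.google.common.collect.ImmutableSet;
import com.google.common.collect.Sets;
import java.util.Set;

Set<String> a = ImmutableSet.of("x", "y", "z");
Set<String> b = ImmutableSet.of("y", "z", "w");

Set<String> inter = Sets.intersection(a, b);  // like SINTER: {y, z}
Set<String> diff  = Sets.difference(a, b);    // like SDIFF:  {x}
int card = a.size();                          // like SCARD:  3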

How do I put data into the datastore of Google's app engine?

I have a little application written in PHP+MySQL that I want to port to App Engine, but I just can't find a way to port my MySQL data to the datastore.
How am I supposed to save the data into my datastore? Is that even possible? I can only see documentation for persistence of Java objects, does that mean I have to port my database to a bunch of fake objects, one per line?
Edit: I say fake objects because I don't want to use them, they're just a way to get over a shortcoming of the GAE design.
I have a 30 megs table I need to check on every GET, by using objects I would need to create an object for every row, so I'd have a java class of maybe 45 megs with thousands upon thousands of lines like:
Row row23423 = new Row(123, 346, 75, 34, "a cow");
I just can't believe this is the only way.
Here's an idea: what about populating the datastore by POST-ing the objects one by one? I mean, like the posts in a blog. You write a class that generates and persists the data, and then you curl the URL with the data, one by one. Slow, but it might work?
How to upload data with the bulk loader is described here. It's not supported directly in Java yet, but that doesn't have to stop you - just do the following:
Create an app.yaml that looks something like this:
application: myapp
version: upload
runtime: python
api_version: 1
handlers:
- url: /remote_api
  script: $PYTHON_LIB/google/appengine/ext/remote_api/handler.py
  login: admin
Make sure the application name is the same as your Java app's, and the version is not the same as the version you're using for Java. Upload this 'empty' app using appcfg.py.
Now, follow the directions for bulk loading on the page linked to above. When it comes time to run the tool, specify the server address with --server=upload.latest.myapp.appspot.com.
Since multiple versions of the same app share the same datastore - even across runtimes - the data uploaded with the Python version will be accessible to the Java one.
There is documentation on the datastore here.
I can't see anything about a raw data-porting service, but if you can extract the data from your MySQL database into text files, it should be relatively easy to write a script to import it into the App Engine datastore using the persistence frameworks it provides.
Your script would take your raw data, convert it into a (Java) object model, and import those Java objects into the store.
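As an illustrative sketch of such a script using GAE's low-level datastore API (the kind and property names are made up, and the text-file parsing is elided):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;

DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

// One entity per exported MySQL row; the datastore is schemaless,
// so properties are set ad hoc rather than via a giant class of literals
Entity row = new Entity("Row");
row.setProperty("colA", 123);
row.setProperty("colB", "a cow");
ds.put(row);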
Migrating an application to Google's App Engine would, I think, be quite some task. As you have seen, App Engine does not have a relational database; instead it uses BigTable. Migration will likely involve exporting the data to Java objects (serialized in some way) and then inserting them.
You say "fake" objects in your post, but as you will have to use Java objects anyway, I don't think they are fake, unless you plan on using one set of objects for the migration and a new set for the application.
There is no (good) general answer to the question of how to port a relational application to the GAE datastore, because the notion of "data" is incompatible between the two. Relational databases are all about the schema. GAE doesn't even have one. It's a schemaless persistent object datastore with very specific APIs. The environment is great for certain types of apps if you're developing from scratch, but it's pretty tricky to port to.
That said, you can import CSV files, as Nick explains, which you should be able to export from MySQL fairly easily. GAE supports Java and Python "at the same time" using the versions mechanism. So you can set up your data store in Python, and then run against it for your application in Java. (A Java version of the bulk loader is under development.)
