Selective replication in mongodb - java

I have two MongoDB running in two different servers connected via LAN. I want to replicate records from few collections from server 1 to collections in server 2. Is there any way to do it. Below is the pictorial representation of what I want to achieve.
Following are the methods I consider using.
MongoDB replication - But it replicates all collections. Is selective replication possible in MongoDB ??
Oplog watcher APIs - Please suggest some reliable java APIs
Is there any other way to do this ? and what is the best way of doing it ?

MongoDB does not yet support selective replication and it sounds as though you are not actually looking for selective replication but more for selective copying since replication ensures certain rules of using that server.
I am not sure what you mean by a oplog watcher API but it is easy enough to read the oplog over time by just querying it:
> use local
> db.oplog.rs.find()
( http://docs.mongodb.org/manual/reference/local-database/ )
and then storing the latest timestamp of the record you have copied within a script you make.
You can also use a tailable cursor here on the oplog to effectiely listen (pub/sub) to changes and copy them over to your other server.

Related

Hibernate Search cluster and near-real-time search

I'm trying to find the best indexing solution for implementing a search-engine in my clustered webapp, and I cannot find a clear answer to my questions in official documentations.
My Java/Java EE backend will be deployed among several load-balanced instances. The search-engine will require near-real-time availability of indexed data (i.e. less than 5 seconds between the indexation and the retrievability).
Hibernate Search can work in a clustered environment with JGroups but the documentation also says, about near-real-time that as a tradeoff it requires a non-clustered and non-shared index.
Does that mean that NRTIndexManager cannot be used in a JGroups Slave/Master setup ? i.e. can only be used whith one single node ?
Does that mean that with such a setup, the availability of indexed data depends only on the refresh period (period of index copy to slave nodes) ?
With the standard IndexManager, you only see the latest changes when they are written to the disk and you reopen your IndexSearcher.
By default, Hibernate Search writes to disk and opens a new IndexSearcher for each query so you're sure your searches are always in sync with your database.
The NRTIndexManager is different from the standard one because it allows you to search on the latest changes indexed without an explicit write on disk. It's typically used when you need a high throughput and you can't write everything on the disk right away. So it's not really correlated to the fact that you will see your changes right away or not: it's an optimization when you can allow some index data loss - the latest changes might be lost.
As mentioned in the documentation here http://docs.jboss.org/hibernate/search/5.5/reference/en-US/html_single/#jgroups-backend , you can have a sync JGroups with Hibernate Search blocking until all the indexes are in sync. So it can work for your case.
Note that we are currently working for 5.6 on an Elasticsearch backend which might be of some interest to you as it's typically designed for your case. It's still in beta but it's already in pretty good shape. You might want to take a look to it: http://docs.jboss.org/hibernate/search/5.6/reference/en-US/html/ch11.html .

WebSphere propagate changes across all nodes in cluster

I have a problem with a product that I am currently working on. Essentially, There is some very commonly used (and very seldomly updated) information that is retrieved from the database on server start up. We do not want to query the database every time this information is needed because it is very frequent. There is a way to update this information through the application (only by an admin). When this method is used, the data in the database is updated and the cached data in that single server (1 of 4) is updated. Unfortunately, if a user hits any of the other servers they will not see the updated information. Restarting the cluster remedies the problem however, that is not a feasible solution for our production environment. Now that I have explained the situation, I am open to suggestions. Thank you for your time.
For a simple solution, you can go to the cluster in the admin console and ripple start it. That stops/stars the nodes gracefully and one at a time. The only impact is a 25% reduction in capacity while it is working.
IBM WebSphere Application Server has a Dynamic Cache that you can use to store Java objects. The cache can be set up to use replication over a replication domain so it can be shared across a cluster.
Your code would use the DistributedMap interface to interact with the cache. All settings for the dynamic cache can be included with your application or it can be pre-configured. Examples are included in the javadoc link.
(Similar to Java EE Application-scoped variables in a clustered environment (Websphere)?)
That is, I think the standard answer would be a "Distributed Object Store". But a crude alternative (that we use) would be to configure a list of server:port combinations to contact to inform each cluster member to update their own copy of the data.

Choice between REST API or Java API

I have been reading about neo4j last few days. I got very confused about whether I need to use REST API or if can I go with Java APIs.
My need is to create millions of nodes which will have some connection among them. I want to add indexes on few of node attributes for searching. Initially I started with embedded mode of GraphDB with Java API but soon reached OutOfMemory with indexing on few nodes so I thought it would be better if my neo4j is running as service and I connect to it through REST API then it will do all memory management by itself by swapping in/out data to underlying files. Is my assumption right?
Further, I have plans to scale my solution to billion of nodes which I believe wont be possible with single machine's neo4j installation. I also believe Neo4j has the capability of running in distributed mode. For this reason also I thought continuing with REST API implementation is best idea.
Though I couldn't find out any good documentation about how to run Neo4j in distributed environment.
Can I do stuff like batch insertion, etc. using REST APIs as well, which I do with Java APIs with Graph DB running in embedded mode?
Do you know why you are getting your OutOfMemory Exception? This sounds like you are creating all these nodes in the same transaction, which causes it to live in memory. Try committing small chunks at a time, so that Neo4j can write it to Disk. You don't have to manage the memory of Neo4j aside from things like cache.
Distributed mode is in a Master/Slave architecture, so you'll still have a copy of the entire DB on each system. Neo4j is very efficient for disk storage, a Node taking 9 Bytes, Relationship taking 33 Bytes, properties are variable.
There is a Batch REST API, which will group many calls into the same HTTP call, however making REST calls is still a slower then if this were embedded.
There are some disadvantages to using the REST API that you did not mentions, and that's stuff like transactions. If you are going to do atomic operations, where you need to create several nodes, relationships, change properties, and if any step fails not commit any of it, you cannot do this in the REST API.

Is there an embeddable Java alternative to Redis?

According to this thread, Jedis is the best thing to use if I want to use Redis from Java.
However, I was wondering if there are any libraries/packages providing similarly efficient set operations to those that already exist in Redis, but can be directly embedded in a Java application without the need to set up separate servers. (i.e., using Jetty for web server).
To be more precise, I would like to be able to do the following efficiently:
There are a large set of M users (M not known in advance).
There are a large set of N items.
We want users to examine items, one user/item at a time, which produces a stored result (in a normal database.)
Each time a user arrives, we want to assign to that user the item with the least number of existing results that the user has not already seen before. This produces an approximate round-robin assignment of the items over all arriving users, when we just care about getting all items looked at approximately the same number of times.
The above happens in a parallelized fashion. When M and N are large, Redis accomplishes the above much more efficiently than SQL queries. Is there some way to do this using an embeddable Java library that is a bit more lightweight than starting a Redis server?
I recognize that it's possible to write a pile of code using Java's concurrency libraries that would roughly approximate this (and to some extent, I have done that), but that's not exactly what I'm looking for here.
Have a look at project voldemort . It's an distributed key-value store created by Linked-In, and it supports the ability to be embedded.
In the quick start guide is a small example of running the server embedded vs. stand-alone.
VoldemortConfig config = VoldemortConfig.loadFromEnvironmentVariable();
VoldemortServer server = new VoldemortServer(config);
server.start();
I don't know much about Redis, so I can't compare them feature to feature. In the project we used Voldemort, we used it's readonly backing store with great results. It allowed us to "precompile" a bi-daily database in our processing data-center and "ship it" out to edge data-centers. That way each edge data-center had a local copy of it's dataset.
EDIT: After rereading your question, I wanted to add Gauva's Table -- This Table DataStructure may also be something your looking for and is simlar to what you get with many no-sql databases.
Hazelcast provides a number of distributed data structure implementations which can be used as a pure Java alternative to Redis' services. You could then ship a single "jar" with all required dependencies to run your application. You may have to adjust for the slightly different primitives relative to Redis in your own application.
Commercial solutions in this space include Teracotta's Enterprise Ehcache and Oracle Coherence.
Take a look at lmdb (Lightning Memory Database), because I needed exactly the same thing. I deploy a dropwizard application into a container, and adding redis or another external dependancy is painful. This seems to perform well, has good activity. fyi, though, i have not yet used this in production.
https://github.com/lmdbjava/lmdbjava
Google's Guava Library provides friendly versions of the same (and more) Set operators that redis provides.
https://code.google.com/p/guava-libraries/wiki/CollectionUtilitiesExplained
e.g.
Guava Redis
Sets.intersection(a,b) sinter a b
a.count() scard a
Sets.difference(a,b) sdiff a b
Sets.union(a,b) sunion a b
Multisets are a reasonably straightforward proxy for redis sorted-sets as well.

copy data from a mysql database to other mysql database with java

I have developed a small swing desktop application. This app needs data from other database, so for that I've created an small process using java that allows to get the info (using jdbc) from remote db and copy (using jpa) it to the local database, the problem is that this process take a lot of time. is there other way to do it in order to make faster this task ?
Please let me know if I am not clear, I'm not a native speaker.
Thanks
Diego
One good option is to use the Replication feature in MySQL. Please refer to the MySQL manual here for more information.
JPA is less suited here, as object-relational-mapping is costly, and this is bulk data transfer. Here you probably also do not need data base replication.
Maybe backup is a solution: several different approaches listed there.
In general one can also do a mysqldump (on a table for instance) on a cron task compress the dump, and retrieve it.

Categories

Resources