Currently we're using Couchbase and Elasticsearch (2.x), and we replicate data from CB to ES successfully using the elasticsearch-transport-couchbase plugin.
The problems began when upgrading to ES 5.6.4. Until now we used a single index in ES, but since Elasticsearch no longer recommends that approach, we are now trying to create multiple indices in ES (one index per type).
That means that we need a way to replicate data from CB (A single bucket) to ES (multiple indices).
What is the best way to approach this?
Possible solutions:
Continue using the elasticsearch-transport-couchbase plugin, but then we'd have to create a lot (~150) of XDCR replications, one replication per type. I doubt this will scale.
Write our own solution using Spark or Kafka (neither of them is in our technology stack, so implementation might take time; not the most favourable solution).
Any help would be appreciated.
Version 4 of the Couchbase Elasticsearch Connector supports the new "index-per-type" model (and other features, including support for ES 6, secure connections, and replication checkpoint management tools). If you'd like to try it out, your feedback would be invaluable.
Disclaimer: I am a Couchbase employee developing the Elasticsearch connector.
From the Hazelcast official documentation, rolling upgrade is supported starting from version 3.8.
Provided my server version is 3.5, is there a way to create a successful cluster with new boxes running newer versions of Hazelcast?
Naively upgrading to 3.6.* resulted in two different clusters: the old boxes still running 3.5, and a new cluster with the new boxes running 3.6 that obviously had no data, as it was never able to join the existing boxes.
My deployment process is as follows:
create a new set of boxes
remove the existing boxes one by one
repeat with a second batch of boxes
My thoughts have turned towards storing a snapshot on disk or in a DB and remounting the partition / loading from the DB at rollout time, but this might not even be supported, and I'm hopeful there is a better way.
What data structures do you use? For IMaps, ICaches and ILists, you can use Hazelcast Jet: it connects to the old cluster and pumps the data into the new cluster (see the sketch below).
This works if your new cluster is on a 3.x version; 3.x -> 4.x isn't possible this way. Use a Jet 3.x version for it.
See https://docs.hazelcast.org/docs/jet/3.2.2/manual/manual.html#connector-imdg
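For reference, a minimal sketch of such a copy job using the Jet 3.x pipeline API might look like the following. The group names, addresses, and the map name "myMap" are placeholders for this example, and the ClientConfig settings have to match your actual old and new clusters:

import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class ClusterCopy {
    public static void main(String[] args) {
        // Client config pointing at the old cluster; group name and address are placeholders.
        ClientConfig oldCluster = new ClientConfig();
        oldCluster.getGroupConfig().setName("old-cluster");
        oldCluster.getNetworkConfig().addAddress("10.0.0.1:5701");

        // Client config pointing at the new cluster.
        ClientConfig newCluster = new ClientConfig();
        newCluster.getGroupConfig().setName("new-cluster");
        newCluster.getNetworkConfig().addAddress("10.0.1.1:5701");

        // Read every entry of the remote IMap "myMap" from the old cluster
        // and write it into the map of the same name on the new cluster.
        Pipeline p = Pipeline.create();
        p.drawFrom(Sources.remoteMap("myMap", oldCluster))
         .drainTo(Sinks.remoteMap("myMap", newCluster));

        // Run the job on a local Jet instance and wait for it to finish.
        JetInstance jet = Jet.newJetInstance();
        try {
            jet.newJob(p).join();
        } finally {
            Jet.shutdownAll();
        }
    }
}

The reference manual linked above lists the corresponding remote sources and sinks for the other supported structures (caches, lists), so the same pattern applies to them.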
I am trying to implement Couchbase in my application.
I am confused between
com.couchbase.client.CouchbaseClient
and
com.couchbase.client.java.CouchbaseCluster.
I tried to Google CouchbaseClient vs CouchbaseCluster but couldn't find which one is better, or their pros and cons.
I know there are three types of Couchbase clients: one is vBucket-aware, another is the traditional old client which supports auto-clustering via a Moxi server.
Can someone who has already used Couchbase provide me with a link or detailed information about these two Java clients?
I have done some homework on CouchbaseClient and CouchbaseCluster, such as inserting, updating, and deleting documents via both.
With CouchbaseClient the stored documents are serialized, and you cannot view or edit those documents via the Couchbase Admin Console, whereas documents such as StringDocument, JsonDocument, and JsonArrayDocument stored via CouchbaseCluster can be viewed and edited in the Couchbase Admin Console.
My requirement is a Couchbase client that is auto-configurable (vBucket-aware): if I add new nodes to a cluster, it should auto-detect them; if any node fails, it should auto-detect that and not throw exceptions. Further, if I add a new cluster, I'd like it to be auto-detected and used. I don't want to modify the application code for any of these things.
There are now two generations of official Couchbase Java SDKs:
generation 1 (currently 1.4.x, not sure of the patch version) is derived from an old Memcached client, Spymemcached... it is now bugfixes only, and it's the one where you have CouchbaseClient as the primary API.
generation 2 is a rewrite, layered into a core artifact and java-client artifact in Maven. Current version is 2.1.3. This is the one where you deal with CouchbaseCluster.
In the old one, you'd have to instantiate one CouchbaseClient for each bucket you deal with.
In the new generation, the notions of cluster and bucket are first-class citizens, and you can (and should) reuse the same Cluster instance to open references to different Buckets. The Buckets should also be reused (don't open the same bucket several times); resources are better shared that way.
Also, the new generation has more coherent APIs, uses RxJava for asynchronous processing, etc... It is cluster-aware and will get updates of the topology of the cluster (new nodes, failing nodes, etc...).
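To make the difference concrete, here is a minimal sketch of the generation-2 API. The node addresses, the "default" bucket name, and the document contents are made up for the example:

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;

public class SdkGen2Example {
    public static void main(String[] args) {
        // One Cluster instance for the whole application; the node list is only
        // a bootstrap list, the topology is discovered and tracked automatically.
        Cluster cluster = CouchbaseCluster.create("10.0.0.1", "10.0.0.2");

        // Open the bucket once and reuse the reference ("default" is a placeholder name).
        Bucket bucket = cluster.openBucket("default");

        // JsonDocument contents are stored as JSON and remain viewable/editable
        // in the Admin Console.
        JsonObject user = JsonObject.create()
                .put("name", "alice")
                .put("email", "alice@example.com");
        bucket.upsert(JsonDocument.create("user::alice", user));

        JsonDocument found = bucket.get("user::alice");
        System.out.println(found.content());

        cluster.disconnect();
    }
}

The synchronous calls shown here are thin wrappers over the RxJava-based asynchronous API mentioned above.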
Note that these two generations are different artifacts in Maven (the old one is couchbase-client, while the new one is java-client).
There's no way you can get such a notification if you "add a new cluster", but that operation doesn't really make sense to me anyway...
The question I have is whether it is possible to use Elasticsearch on top of a relational database.
1. When I insert or delete a record in the relational database, will it reflect in the elastic search?
2. If I insert a document in the elastic search will it be persisted in the database?
3. Does it use a cache or an in-memory database to facilitate search? If so, what does it use?
There is no direct connection between Elasticsearch and relational databases - ES has its own datastore based on Apache Lucene.
That said, you can, as others have noted, use the Elasticsearch River plugin for JDBC to load data from a relational database into Elasticsearch. Keep in mind there are a number of limitations to this approach:
It's one way only - the JDBC River for ES only reads from the source database; it does not push data from ES into the source database.
Deletes are not handled - if you delete data in your source database after it's been indexed into ES, that deletion will not be reflected in ES.
See: ElasticSearch river JDBC MySQL not deleting records and https://github.com/jprante/elasticsearch-river-jdbc/issues/213
It was not intended as a production, scalable solution for relational database and Elasticsearch integration. Per the JDBC River author's comment from January 2014, it was designed as "a single node (non-scalable) solution" "for demonstration purposes":
http://elasticsearch-users.115913.n3.nabble.com/Strategy-for-keeping-Elasticsearch-updated-with-MySQL-td4047253.html
To answer your questions directly (assuming you use the JDBC River):
New document inserts can be handled by the JDBC River, but deletes of existing data are not.
Data does not flow from Elasticsearch into your relational database. That would need to be custom development work.
Elasticsearch is built on top of Apache Lucene. Lucene in turn depends a great deal on file system caching at the OS level (which is why ES recommends keeping heap size to no more than 50% of total memory, to leave a lot for the file system cache). In addition, the ES/Lucene stack makes use of a number of internal caches (like the Lucene field cache and the filter cache):
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-cache.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-fielddata.html
Internally the filter cache is implemented using a bitset:
http://www.elasticsearch.org/blog/all-about-elasticsearch-filter-bitsets/
1) You should take a look at the ElasticSearch JDBC river here for inserts (I believe deleted rows aren't managed any more; see the developer's comment).
2) Unless you do it manually, it is not natively managed by Elasticsearch.
3) Indeed, Elasticsearch uses caches to improve performance, especially when using filters. Bitsets (arrays of 0/1) are stored.
Came across this question while looking for a similar thing. Thought an update was due.
My Findings:
Elasticsearch has now deprecated Rivers, though the above-mentioned jprante's River lives on...
Another option I found was the Scotas Push Connector which pushes inserts, updates and deletes from an RDBMS to Elasticsearch. Details here: http://www.scotas.com/product-scotas-push-connector.
Example implementation here: http://www.scotas.com/blog/?p=90
I have two MongoDB instances running on two different servers connected via LAN. I want to replicate records from a few collections on server 1 to collections on server 2. Is there any way to do it? Below is a pictorial representation of what I want to achieve.
The following are the methods I am considering:
MongoDB replication - but it replicates all collections. Is selective replication possible in MongoDB?
Oplog watcher APIs - please suggest some reliable Java APIs.
Is there any other way to do this? And what is the best way of doing it?
MongoDB does not yet support selective replication, and it sounds as though you are not actually looking for selective replication but rather for selective copying, since replication imposes certain rules on how that server can be used.
I am not sure what you mean by an oplog watcher API, but it is easy enough to read the oplog over time by just querying it:
> use local
> db.oplog.rs.find()
( http://docs.mongodb.org/manual/reference/local-database/ )
and then storing, within a script you write, the latest timestamp of the record you have copied.
You can also use a tailable cursor on the oplog to effectively listen (pub/sub) for changes and copy them over to your other server.
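As a rough illustration, here is what tailing the oplog could look like with the MongoDB Java driver (3.x API). The host names and the "mydb.orders" namespace are placeholders; a real script would persist the resume timestamp between runs and also handle the "u" (update) and "d" (delete) operations:

import com.mongodb.CursorType;
import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.BsonTimestamp;
import org.bson.Document;

public class OplogCopier {
    public static void main(String[] args) {
        MongoClient source = new MongoClient("server1");
        MongoClient target = new MongoClient("server2");

        // The replication oplog lives in the "local" database of a replica set member.
        MongoCollection<Document> oplog =
                source.getDatabase("local").getCollection("oplog.rs");

        // Resume point; a real script would persist this between runs.
        BsonTimestamp lastSeen = new BsonTimestamp(0, 0);

        // Tailable cursor: blocks and keeps delivering new oplog entries as they appear.
        for (Document entry : oplog.find(Filters.gt("ts", lastSeen))
                                   .cursorType(CursorType.TailableAwait)) {
            String ns = entry.getString("ns");   // e.g. "mydb.orders"
            String op = entry.getString("op");   // "i" = insert, "u" = update, "d" = delete

            // Copy only the collections you care about ("mydb.orders" is a placeholder).
            if ("mydb.orders".equals(ns) && "i".equals(op)) {
                Document doc = (Document) entry.get("o");
                target.getDatabase("mydb").getCollection("orders").insertOne(doc);
            }
            lastSeen = (BsonTimestamp) entry.get("ts");
        }
    }
}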
There are four high-level APIs to access Cassandra, and I do not have time to try them all. So I hoped to find somebody who could help me choose the proper one.
I'll try to write down my findings about them:
Datanucleus-Cassandra-Plugin
pros:
supports JPA 1, JPA 2, JDO 1 - JDO 3; as I read in a review, JDO scales better than Hibernate with JPA
all the pros mentioned for Kundera?
cons:
no experience with JDO up to now (relevant only for me, of course ;)
documentation not found!
kundera
pros:
JPA 1.0 annotations with all advantages (standard conform, no boilerplate code, ...)
promise of the following features in the near future: JPA listeners (@PrePersist, @PostPersist, etc.), relationships (@OneToMany, @ManyToMany, etc.), transactional support (@Transactional)
cons:
early development stage of the plugin?
bugs?
no possibility to fix problems in the JDO / JPA framework?
s7 Pelops
pros:
pure Java API --> finer control over persistence?
cons:
pure Java API --> boilerplate code
Hector 0.7
pros:
mavenized
Spring integration --> dependency injection
pure Java API --> finer control over persistence?
JMX monitoring?
managing nodes seems easy and flexible
cons:
pure Java API (no annotations) --> boilerplate code
Conclusion so far
As I am comfortable with RDBMS, Hibernate, JPA, and Spring, and not so up to date anymore with EJB, my first impression was that going for Kundera would be the right choice. But after reading some posts regarding JDO and DataNucleus, I am not sure anymore. As the learning curve for DataNucleus could be steep (even for experienced JPA developers?), I am not sure whether I should go for it.
My major concern is the status of the plugin, as well as the forum support/help for JDO and the Datanucleus-Cassandra-Plugin, which are not as widespread, as far as I understand.
Is anybody out there who already has experience with some of these frameworks and can give me a hint? Maybe a mixed strategy would make sense as well: in the cases (if they exist) where JDO is not flexible or sufficient enough for my needs, fall back to one of the simpler APIs of Pelops or Hector? Is this possible? Is there an approach, as in JPA, to get an SQL connection and fetch/put data?
After reading on a bit, I found the following additional information:
The Datanucleus-Cassandra-Plugin is based on Pelops, which can also be accessed directly for more flexibility and (maybe) more performance; the direct API should be used for column families with a lot of data, while JDO/JPA access should only be used for "administrative" data, where performance is not so important and the data volume is not overwhelming.
Which still leaves open the question of whether to start with Hector or Pelops:
Pelops, for its later Datanucleus-Cassandra-Plugin extensibility, or
Hector, for its better support for node handling.
I tried most of these solutions and find Hector the best. Even when you have a problem, you can always reach the people who wrote Hector in #cassandra on Freenode, and the code is more mature as far as I'm concerned. In a Cassandra client the most critical part is connection pool management (since all the clients perform mostly the same operations through Thrift, connection pooling is what makes a high-level client shine). On that count I would vote for Hector, since I have been using it in production for over a year now with no visible problems (one reconnect issue was fixed as soon as I discovered it and sent an email about it).
I am still using Cassandra 0.6, though.
The author of the datanucleus plugin, Todd Nine, is working on the next-gen JPA support in Hector now.
The Hector client was the API we chose because of the following features it had (a minimal usage sketch follows the feature list):
Connection Pooling (huge performance gain when sharing a connection to a node)
Complete Custom Configuration using interfaces for most everything.
Auto-discovery of hosts
Custom Load Balancing Policy definitions (LeastActiveBalancingPolicy or RoundRobinBalancingPolicy or implement LoadBalancingPolicy)
Light-weight adapter on top of the Thrift API.
Great examples: See hector-examples
Built in JMX support.
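For illustration, a minimal Hector write/read sketch might look like this (cluster, keyspace, row key, and column family names are placeholders):

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;
import me.prettyprint.hector.api.query.ColumnQuery;
import me.prettyprint.hector.api.query.QueryResult;

public class HectorExample {
    public static void main(String[] args) {
        // The host list is only a seed list; auto-discovery can pick up the rest.
        CassandraHostConfigurator hosts = new CassandraHostConfigurator("127.0.0.1:9160");
        hosts.setAutoDiscoverHosts(true);

        Cluster cluster = HFactory.getOrCreateCluster("TestCluster", hosts);
        Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", cluster);

        // Write one column into the "Users" column family.
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        mutator.insert("user1", "Users", HFactory.createStringColumn("email", "user1@example.com"));

        // Read it back.
        ColumnQuery<String, String, String> query = HFactory.createStringColumnQuery(keyspace);
        query.setColumnFamily("Users").setKey("user1").setName("email");
        QueryResult<HColumn<String, String>> result = query.execute();
        System.out.println(result.get().getValue());
    }
}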
Downside of Hector:
The documentation is not bad, but the Javadocs are lacking a bit. That could easily be fixed with a Git fork / pull request from the user community.
The ORM support was a bit limited, but not urgent for our usage. I couldn't get some of the one-to-many associations to work easily, and there is little description of what kind of Cassandra model is used (super columns or column families for associated collections). There is also a lack of Java examples (maybe there are some; please post if you find any).
Also, I tried using Kundera with very little success: not many examples to use or try, and very little forum support. It appears to be maintained by one person, which makes it even harder to choose a tool like that. Based on the SVN activity, it appears to be migrating to using Hadoop instead, or adding support for it as well.
Kundera 2.0.4 released.
Major Changes in this release:
Cross-datastore persistence (easy to migrate an existing MySQL app over to NoSQL)
Support for relational databases (e.g. MySQL)
Replaced Solandra with Lucene-based indexing.
Support added for bi-directional associations.
Performance improvement fixes.
I would also propose Astyanax; I'm working with it and I'm quite happy. Only the documentation is not really good. (A short usage sketch follows the quoted feature overview below.)
Astyanax API
Astyanax implements a fluent API which guides the caller to narrow or customize the query via a set of well defined interfaces. We've also included some recipes that will be executed efficiently and as close to the low level RPC layer as possible. The client also makes heavy use of generics and overloading to almost eliminate the need to specify serializers.
Some key features of the API include:
Key and column types are defined in a ColumnFamily class which eliminates the need to specify serializers.
Multiple column family key types in the same keyspace.
Annotation-based composite column names.
Automatic pagination.
Parallelized queries that are token aware.
Configurable consistency level per operation.
Configurable retry policy per operation.
Pin operations to specific node.
Async operations with a single timeout using Futures.
Simple annotation based object mapping.
Operation result returns host, latency, attempt count.
Tracer interfaces to log custom events for operation failure and success.
Optimized batch mutation.
Completely hide the clock for the caller, but provide hooks to customize it.
Simple CQL support.
RangeBuilders to simplify constructing simple as well as composite column ranges.
Composite builders to simplify creating composite column names.
Recipes for some common use cases:
CSV importer.
JSON exporter to convert any query result to JSON with a wide range of customizations.
Parallel reverse index search.
Key unique constraint validation.
http://techblog.netflix.com/2012/01/announcing-astyanax.html
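As promised above, here is a rough usage sketch based on the getting-started material (keyspace, column family, and pool names are placeholders, and details may vary by Astyanax version):

import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class AstyanaxExample {
    // The ColumnFamily object carries the key/column serializers, so individual
    // calls don't need them.
    private static final ColumnFamily<String, String> CF_USERS =
            ColumnFamily.newColumnFamily("Users", StringSerializer.get(), StringSerializer.get());

    public static void main(String[] args) throws Exception {
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
                .forCluster("TestCluster")
                .forKeyspace("MyKeyspace")
                .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                        .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE))
                .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("MyPool")
                        .setPort(9160)
                        .setMaxConnsPerHost(3)
                        .setSeeds("127.0.0.1:9160"))
                .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
                .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        Keyspace keyspace = context.getClient();

        // Batched write.
        MutationBatch m = keyspace.prepareMutationBatch();
        m.withRow(CF_USERS, "user1").putColumn("email", "user1@example.com", null);
        m.execute();

        // Read the row back.
        ColumnList<String> row = keyspace.prepareQuery(CF_USERS)
                .getKey("user1")
                .execute()
                .getResult();
        System.out.println(row.getColumnByName("email").getStringValue());

        context.shutdown();
    }
}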
I suggest you give Kundera-2.0.1 a try. It has gone through a major change since its inception, and I see a lot of new features being added and bugs being fixed. Currently it supports JPA 1.0 and Cassandra 0.7.6, but they are planning to add support for Cassandra 0.8 and JPA 2.0 very soon. There is a pretty good example here: https://github.com/impetus-opensource/Kundera/wiki/Getting-Started-in-5-minutes
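Since Kundera exposes the standard JPA API, a minimal sketch looks like plain JPA code. The entity below and the "cassandra_pu" persistence unit name are assumptions for illustration; the actual persistence.xml configuration for Kundera is omitted:

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Id;
import javax.persistence.Persistence;
import javax.persistence.Table;

// A plain JPA entity; Kundera maps it onto a Cassandra column family.
@Entity
@Table(name = "users")
public class User {
    @Id
    private String userId;

    @Column(name = "email")
    private String email;

    // getters/setters omitted for brevity

    public static void main(String[] args) {
        // "cassandra_pu" must match a persistence unit configured for Kundera
        // in persistence.xml (the name and settings here are assumptions).
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("cassandra_pu");
        EntityManager em = emf.createEntityManager();

        User u = new User();
        u.userId = "user1";
        u.email = "user1@example.com";
        em.persist(u);

        User found = em.find(User.class, "user1");
        System.out.println(found.email);

        em.close();
        emf.close();
    }
}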
You can try Achilles, a new Entity Manager I've developed that supports all CQL3 features.
Entity mapping
JPA style operations
Limited support for join
Mapping of clustered entities using compound primary key
Queries (native, typed, slice)
Support for counters
Support for Consistency level
TTL & timestamp
JUnit 4 Rule to start embedded Cassandra server for testing
And more...
There are two implementations: Thrift and CQL.
The Thrift version relies on Hector under the hood.
The CQL version uses the brand-new Java Driver Core from DataStax for all operations.
Quick reference here