How can an Elasticsearch client be notified of a new indexed document? - java

I am using Elasticsearch, and I am building a client (using the Java Client API) to export logs indexed via Logstash.
I would like to be able to be notified (by adding a listener somewhere) when a new document is indexed (= a new log line has been added) instead of querying the last X documents.
Is it possible?

This is what you're looking for: https://github.com/ForgeRock/es-change-feed-plugin
Using this plugin, you can register to a websocket channel to receive indexation/deletion events as they happen. It has some limitations, though.
Back in the day, it was possible to install river plugins to stream documents into ES. The river feature has been removed, but this plugin above is like a "reverse river", where outside clients are notified by ES as documents get indexed.
Very useful, and seemingly up to date with ES 6.x.
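For reference, here is a minimal listener sketch using the standard javax.websocket client API (with an implementation such as Tyrus on the classpath). The host, port and path of the channel are assumptions; check the plugin's README for the actual URL it exposes.

```java
import java.net.URI;
import javax.websocket.ClientEndpoint;
import javax.websocket.ContainerProvider;
import javax.websocket.OnMessage;
import javax.websocket.WebSocketContainer;

// Hedged sketch: every indexation/deletion event pushed by the plugin
// arrives here as a text message (typically a JSON payload).
@ClientEndpoint
public class ChangeFeedListener {

    @OnMessage
    public void onMessage(String message) {
        System.out.println("Change event: " + message);
    }

    public static void main(String[] args) throws Exception {
        WebSocketContainer container = ContainerProvider.getWebSocketContainer();
        // The host/port/path below are placeholders; use the channel URL
        // documented by the es-change-feed-plugin version you installed.
        container.connectToServer(ChangeFeedListener.class,
                URI.create("ws://localhost:9400/ws/_changes"));
        Thread.currentThread().join(); // keep the client alive to receive events
    }
}
```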
UPDATE (April 14th, 2019):
According to what was said at Elastic{ON} Zurich 2019, at some point in the 7.x series, there will be a Changes API that will provide index changes notifications (document creation, update, deletion and more).
UPDATE (July 22nd, 2022):
ES 8.x is out and the Changes API is still nowhere in sight... Good to know, though, that it's still open at least.

Related

Objectify v5 and v6 at the same time in a google app engine java 8 standard project

We want to do a zero downtime migration of a google app engine java 8 standard project to another region.
Unfortunately google does not support this, so it has to be done manually.
One could export the datastore and import it again, but there must be no downtime and the data must always be consistent.
So the idea came up to create the project in the new region, and embed objectify 5 there with all entities (definitions, not data) used in the old project. Any new data goes in the "new datastore" attached to this new project.
All data not found in this new datastore shall be queried (if necessary) using Objectify 6 connected to the "old" project via the Datastore API.
The advantage would be not having to export any data manually at all and only migrating the most important data on the fly, using the mechanism above. (There is a lot of unused garbage we never did housekeeping for, but also some very vital data that must be on the new system.)
Is this a valid approach? I know I'll probably have to integrate Objectify into the code base and change the package names so the two versions don't conflict on the "code side".
If there is a better approach to migrate a project to another region, we're happy to hear.
We searched for hours without a proper result.
Edit: I'm aware that we must instantly stop requests to the old service / disable writes there. We'd solve this by redirecting traffic (HTTP) from the old project to the new one and disabling writes.
This is a valid approach for migration. Traffic on the new project can continue to do reads from the old Datastore and writes to the new one. I would like to add one more point.
However, soon after this switchover you should also plan a data migration from the old Datastore to the new one through mass export and import. The app will then have to be pointed to the new Datastore even for reads: https://cloud.google.com/datastore/docs/export-import-entities
This can be done gracefully by introducing proxy connection logic in Java for connecting with the new Datastore. During the data migration, you add a condition in OFY6 that checks the new Datastore for the entity; if it is not found there, the data is read from the old one. This ensures zero downtime, and in the backend you can silently and safely turn off the old Datastore, assuming you already have its full export.
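As a rough illustration, the read-through fallback could look something like the sketch below. The User entity and the two injected Objectify instances are assumptions; in practice one of them would come from a relocated Objectify package bound to the old project.

```java
import com.googlecode.objectify.Objectify;
import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;

// Hedged sketch of the fallback read described above; not the exact setup, since
// the two Objectify versions would normally live under different package names.
public class UserRepository {

    @Entity
    public static class User {          // placeholder entity registered in both projects
        @Id Long id;
        String name;
    }

    private final Objectify newOfy;     // OFY6 bound to the new project's Datastore
    private final Objectify oldOfy;     // OFY5/6 bound to the old project's Datastore

    public UserRepository(Objectify newOfy, Objectify oldOfy) {
        this.newOfy = newOfy;
        this.oldOfy = oldOfy;
    }

    public User load(long id) {
        // 1. All new writes land in the new project, so try its Datastore first.
        User user = newOfy.load().type(User.class).id(id).now();
        if (user != null) {
            return user;
        }
        // 2. Not migrated yet: fall back to the old project's Datastore.
        user = oldOfy.load().type(User.class).id(id).now();
        if (user != null) {
            // 3. Copy it forward so the next read is served by the new project.
            newOfy.save().entity(user).now();
        }
        return user;
    }
}
```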
Reading from both the old data source and the new data source is a valid way to do migrations.

Performing external operations after Elasticsearch indexing

I'm currently indexing webpages to Elasticsearch. The indexing is done through Java (Spring) and also through Apache Nutch.
I have a situation where I have to call an external API just after indexing or updating a document in Elasticsearch. The API processes a field value from the index and stores the processed result in another field of the same index. I tried making the API call just before indexing, and it affects indexing performance (it takes too much time). I have to call the external API without affecting the indexing or updating of Elasticsearch documents.
Looking for some ideas.
I'm using elasticsearch version 5.6.3.
At the moment ES doesn't support a "notification system" similar to the one that you need (https://discuss.elastic.co/t/notifications-from-elasticsearch-when-documents-are-added/5106/31); this is impractical in most cases due to the distributed nature of ES.
I think the easiest approach would be to push into Kafka/RabbitMQ (a queue) and have your ES indexer run as a worker on that queue; this worker would then be the ideal place to send a message to a different queue indicating that document X is ready for enrichment (adding more metadata). In this case, you don't have to worry about slowing down the indexing speed of your system (you can add more ES indexers). You also don't need to query ES constantly to enrich your documents, because you can send the field (or fields) that are needed, along with the ES id, to the enrichment workers, and they can update that document directly after the call to the external API. Keep in mind that perhaps part of this could be wrapped in a custom ES plugin.
The advantage of this is that you could scale both places (ES indexer/metadata enricher) separately.
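To make the flow concrete, here is a rough sketch of such an enrichment worker, assuming a Kafka topic that carries the ES document id as the record key and the raw field value as the record value. The topic, index/type/field names and callExternalApi() are all assumptions.

```java
import java.util.Collections;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.elasticsearch.client.Client;

// Hedged sketch of an enrichment worker: it consumes "document indexed" events,
// calls the external API, and writes the result back as a partial update so the
// original indexing path is never slowed down.
public class EnrichmentWorker {

    private final KafkaConsumer<String, String> consumer;
    private final Client esClient; // e.g. a TransportClient configured for ES 5.6.x

    public EnrichmentWorker(KafkaConsumer<String, String> consumer, Client esClient) {
        this.consumer = consumer;
        this.esClient = esClient;
        consumer.subscribe(Collections.singletonList("docs-to-enrich")); // assumed topic name
    }

    public void run() {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(1000);
            for (ConsumerRecord<String, String> record : records) {
                String enriched = callExternalApi(record.value());
                // Partial update of a single field; index/type/field names are placeholders.
                esClient.prepareUpdate("webpages", "page", record.key())
                        .setDoc(Collections.singletonMap("enriched_field", (Object) enriched))
                        .get();
            }
            consumer.commitSync();
        }
    }

    private String callExternalApi(String rawValue) {
        // Placeholder for the real external API call.
        return rawValue.trim();
    }
}
```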
Another option could be having some external module that queries ES for a chunk of documents that haven't yet been enriched with the external content; you would then call the external API and update the documents back in ES.
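If you go that route, the "still not enriched" chunk can be fetched with a simple must_not/exists query. A sketch, where the index and field names are again assumptions:

```java
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

// Hedged sketch: fetch a batch of documents whose "enriched_field" is still missing.
public class EnrichmentPoller {

    public void enrichNextBatch(Client esClient) {
        SearchResponse response = esClient.prepareSearch("webpages")
                .setQuery(QueryBuilders.boolQuery()
                        .mustNot(QueryBuilders.existsQuery("enriched_field")))
                .setSize(100)
                .get();

        for (SearchHit hit : response.getHits().getHits()) {
            // Call the external API for the relevant source field of this hit,
            // then write the result back via esClient.prepareUpdate(...).
        }
    }
}
```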
In my case, we used a logstash-kafka-logstash pipeline to write to ES. At the consumer end of Kafka, we invoked an external API to compute the new field, updated it in a POJO and wrote it to ES. It has been running pretty well.
Note: you may also want to check whether the data computation process via the external API can be improved.

Replicating Couchbase to ElasticSearch (w/ multiple indices)

Currently we're using Couchbase and Elasticsearch (2.x) and successfully replicating data from CB to ES using the elasticsearch-transport-couchbase plugin.
The problems began while upgrading to ES 5.6.4. Up until now we used a single index in ES, and since Elasticsearch doesn't recommend this approach anymore, we are now trying to create multiple indices in ES (one index per type).
That means that we need a way to replicate data from CB (A single bucket) to ES (multiple indices).
What is the best way to approach this?
Possible solutions:
Continue using the elasticsearch-transport-couchbase plugin, but then we'd have to create a lot (~150) of XDCR replications, one replication per type. I doubt this will scale...
Write our own solution using Spark or Kafka (neither of them is in our technology stack, so implementation might take time; it's not the most favourable solution).
Any help would be appreciated.
Version 4 of the Couchbase Elasticsearch Connector supports the new "index-per-type" model (and other features, including support for ES 6, secure connections, and replication checkpoint management tools). If you'd like to try it out, your feedback would be invaluable.
Disclaimer: I am a Couchbase employee developing the Elasticsearch connector.

Proper way to migrate documents in couchbase (API 1.4.x -> 2.0.x)

I would like to migrate documents persisted in Couchbase via API 1.4.10 to the new document types provided by API 2.0.5, like JsonDocument. I found that it is possible to add custom transcoders to a Bucket, so when decoding documents I can check the flags and decide exactly which transcoder I should use. But it seems to me that this is not quite a good solution. Are there any other ways to do this properly? Thanks.
Migration can only be done at runtime, upon user request, since there are too many documents; we cannot migrate them all at once in the background.
You don't need to use a custom transcoder to read documents created with the 1.x SDK. Instead, use the LegacyDocument type to read (and write) documents in legacy format.
More importantly, you shouldn't continue running with a mix of legacy and new documents in the database for very long. The LegacyDocument type is provided to facilitate the migration from the old format to the new SDK. The best practice in this case is to deploy an intermediate version of your application which attempts to read documents in one format, then falls back on trying to read them in the other. Legacy to new or vice versa, depending on which type of document is accessed more frequently at first.
Once you have the intermediate version deployed, you should run a background task that will read and convert all documents from the old format to the new. This is pretty straightforward: you just try to read documents as LegacyDocument and, if it succeeds, you store the document right back as a JsonDocument using the CAS value you got earlier. If you can't read the document as legacy, then it's already in the new format. The task should be throttled enough that it doesn't cause a large increase in database load.
After the task finishes, remove the fallback code from the application and just read and write everything as JsonDocument.
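A condensed sketch of that conversion step with the 2.x SDK; convertToJsonObject() is a hypothetical helper, since how you turn the legacy content into JSON depends on how the 1.4.x application stored it.

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.LegacyDocument;
import com.couchbase.client.java.document.json.JsonObject;

// Hedged sketch of the background conversion: read as LegacyDocument, write back
// as JsonDocument with the CAS from the read so concurrent updates are not lost.
public class LegacyMigrator {

    public void migrate(Bucket bucket, String id) {
        LegacyDocument legacy;
        try {
            legacy = bucket.get(id, LegacyDocument.class);
        } catch (Exception e) {
            return; // could not be read as legacy: assume it is already in the new format
        }
        if (legacy == null) {
            return; // document does not exist
        }
        JsonObject content = convertToJsonObject(legacy.content());

        // replace() with the legacy CAS fails if the document changed in the meantime.
        bucket.replace(JsonDocument.create(id, content, legacy.cas()));
    }

    private JsonObject convertToJsonObject(Object legacyContent) {
        // Hypothetical conversion: adapt to your legacy format
        // (JSON string, serialized POJO, ...).
        return JsonObject.create().put("value", String.valueOf(legacyContent));
    }
}
```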
You mention having too many documents - how many is that? We've successfully migrated datasets with multiple billions of documents this way. This, admittedly, took several days to run. If you have a database that's larger than that, or has a very low resident ratio, it might not be practical to attempt to convert all documents.

CouchbaseClient VS CouchbaseCluster

I am trying to implement couchbase in my application.
I am confused with
com.couchbase.client.CouchbaseClient
AND
com.couchbase.client.java.CouchbaseCluster.
I tried to google CouchbaseClient vs CouchbaseCluster but didn't find which one is better, or their pros and cons.
I know we have three types of Couchbase clients: one is vBucket-aware, another is the traditional old client which supports auto clustering via a Moxi server.
Can someone who has already used Couchbase provide me with some links or detailed information about these two Java clients?
I have done some homework on CouchbaseClient and CouchbaseCluster, like inserting, updating and deleting documents via both.
With CouchbaseClient the stored documents are serialized and you cannot view or edit them via the Couchbase Admin Console, whereas documents like StringDocument, JsonDocument and JsonArrayDocument stored via CouchbaseCluster can be viewed and edited in the Couchbase Admin Console.
My requirement is a Couchbase client which is auto-configurable (vBucket-aware): if I add new nodes to a cluster it should auto-detect them, and if any node fails it should auto-detect that and not throw any exception. Further, if I add a new cluster, I'd like it to auto-detect it and start using it. I don't want to modify the application code for all these things.
There are now two generations of official Couchbase Java SDKs:
generation 1 (currently 1.4.x, not sure of the patch version) is derived from an old Memcached client, Spymemcached... it is now bugfixes only, and it's the one where you have CouchbaseClient as the primary API.
generation 2 is a rewrite, layered into a core artifact and java-client artifact in Maven. Current version is 2.1.3. This is the one where you deal with CouchbaseCluster.
In the old one, you'd have to instantiate one CouchbaseClient for each bucket you deal with.
In the new generation, the notions of cluster and bucket are first class citizens and you can (and should) reuse the same Cluster instance to open references to different Buckets. The Buckets should also be reused (don't open the same bucket several times). Resources are better mutualized this way.
Also, the new generation has more coherent APIs, uses RxJava for asynchronous processing, etc... It is cluster-aware and will get updates of the topology of the cluster (new nodes, failing nodes, etc...).
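A minimal sketch of the generation-2 API, to show how the Cluster and Bucket are meant to be created once and reused; the host, bucket and document names are placeholders.

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;

// Hedged sketch of generation-2 SDK usage: create the Cluster and Bucket once
// and reuse them across the whole application.
public class CouchbaseExample {

    public static void main(String[] args) {
        Cluster cluster = CouchbaseCluster.create("127.0.0.1"); // bootstrap node(s)
        Bucket bucket = cluster.openBucket("default");          // reuse this instance

        // Store a JSON document that stays viewable/editable in the Admin Console.
        JsonObject content = JsonObject.create().put("name", "alice").put("age", 30);
        bucket.upsert(JsonDocument.create("user::alice", content));

        JsonDocument read = bucket.get("user::alice");
        System.out.println(read.content());

        cluster.disconnect(); // on application shutdown
    }
}
```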
Note that these two generations are different artifacts in Maven (the old one is couchbase-client while the new one is java-client).
There's no way you can get such a notification if you "add a new cluster", but that operation doesn't really make sense to me anyway...
