Performing external operations after Elasticsearch indexing - java

I'm currently indexing web pages into Elasticsearch. The indexing is done through Java (Spring) and also through Apache Nutch.
I've run into a situation where I have to call an external API just after indexing or updating a document in Elasticsearch. The API processes a field value from the document and stores the processed result in another field of the same document. I tried making the API call just before indexing, but it hurts indexing performance (it takes too much time). I need to call the external API without slowing down the indexing or updating of Elasticsearch documents.
Looking for some ideas.
I'm using Elasticsearch version 5.6.3.

At the moment ES doesn't support a "notification system" like the one you need (https://discuss.elastic.co/t/notifications-from-elasticsearch-when-documents-are-added/5106/31); this is impractical in most cases due to the distributed nature of ES.
I think the easier approach would be to push into a queue (Kafka/RabbitMQ) and have your ES indexer run as a worker on that queue. That worker is then the ideal place to send a message to a second queue indicating that document X is ready for enrichment (adding more metadata). In this case you don't have to worry about slowing down the indexing speed of your system (you can add more ES indexers). You also don't need to query ES constantly to enrich your documents, because you can send the field (or fields) that are needed, along with the ES id, to the enrichment workers, and they update the document directly after the call to the external API. Keep in mind that perhaps part of this could be wrapped in a custom ES plugin.
The advantage of this is that you can scale both parts (ES indexer / metadata enricher) independently.
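To make the second-queue idea concrete, here is a rough sketch of such an enrichment worker, assuming Kafka as the second queue and the ES 5.x transport client; the topic, index, type and field names, and the callExternalApi() helper, are placeholders and not part of your setup:

// Sketch of an enrichment worker: reads (docId, rawFieldValue) messages published
// by the indexer, calls the slow external API, and issues a partial update to ES.
import java.net.InetAddress;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

public class EnrichmentWorker {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "enrichment-workers");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             TransportClient es = new PreBuiltTransportClient(Settings.EMPTY)
                     .addTransportAddress(new InetSocketTransportAddress(
                             InetAddress.getByName("localhost"), 9300))) {

            // The indexer publishes a message here right after the document is indexed.
            consumer.subscribe(Collections.singletonList("to-enrich"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    String docId = record.key();
                    String enriched = callExternalApi(record.value()); // the slow external call

                    // Partial update: only the enriched field is written, indexing is never blocked.
                    es.prepareUpdate("webpages", "page", docId)
                      .setDoc(XContentFactory.jsonBuilder()
                              .startObject().field("enriched_field", enriched).endObject())
                      .get();
                }
            }
        }
    }

    private static String callExternalApi(String rawValue) {
        return rawValue.toUpperCase(); // placeholder for the real external API call
    }
}

Because the worker only issues partial updates for the enriched field, the main indexing path never waits for the external API, and you can add more workers if the enrichment queue backs up.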
Another option would be an external module that periodically queries ES for a batch of documents that haven't been enriched with the external content yet, calls the external API for each of them, and then writes the updated documents back to ES.
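A rough sketch of that catch-up/polling variant, reusing the es transport client and the callExternalApi() placeholder from the sketch above (imports omitted; they come from the same client artifacts). The exists query finds documents whose enriched field hasn't been filled in yet, and you would run this from a scheduled job:

// Catch-up pass over documents that still lack the enriched field.
// Index, type and field names are placeholders.
static void enrichMissing(TransportClient es) throws IOException {
    SearchResponse response = es.prepareSearch("webpages")
            .setTypes("page")
            .setQuery(QueryBuilders.boolQuery()
                    .mustNot(QueryBuilders.existsQuery("enriched_field")))
            .setSize(100)
            .get();

    for (SearchHit hit : response.getHits().getHits()) {
        String raw = (String) hit.getSource().get("raw_field"); // field needed by the external API
        String enriched = callExternalApi(raw);

        es.prepareUpdate("webpages", "page", hit.getId())
          .setDoc(XContentFactory.jsonBuilder()
                  .startObject().field("enriched_field", enriched).endObject())
          .get();
    }
}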

In my case, we used a logstash-kafka-logstash pipeline to write to ES. At the Kafka consumer end, we invoked the external API to compute the new field, set it on a POJO, and wrote the result to ES. It has been running pretty well.
Note: you may also want to check whether the data computation done by the external API can itself be sped up.

Related

Elasticsearch get task ID from index name in Java?

I am performing a query plus reindex on the fly in Java 8 using Elasticsearch 6.2 on AWS. I am interfacing with the ES cluster through a Jest client, again with the Java APIs provided by that client's library. I return the results of the query to the user and use those results to start a reindex operation in the background for later use. The reindex operation can be fairly long-running and take more than a few seconds. I obviously know what my new index name is, but since I'm working in a stateless application on a server, I cannot save the task ID returned from the reindex API to query later; I need to look it up by other means. Doing a little research, I came across this API call in Kibana:
GET /_tasks?actions=*reindex
which will return all tasks that are currently reindexing, or an empty list. From there, I can get the parent task ID and query it for status. This may be a problem, though, as I might have more than one reindex operation happening on the ES cluster at once.
Is there a more intelligent or straightforward approach to my problem?
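One way to make that approach workable even with several concurrent reindexes is to call the tasks API with detailed=true and match each reindex task's description (which names the source and destination indices) against the destination index you just created. A sketch using the Elasticsearch low-level REST client and Jackson rather than Jest, purely for illustration:

// Sketch: list running reindex tasks and return the id of the one whose
// description mentions the given destination index. Host/port are placeholders.
import java.util.Iterator;
import java.util.Map;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class ReindexTaskFinder {

    public static String findReindexTaskId(String destIndex) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Response response = client.performRequest("GET", "/_tasks?detailed=true&actions=*reindex");
            JsonNode root = new ObjectMapper().readTree(EntityUtils.toString(response.getEntity()));

            // Response shape: { "nodes": { "<node>": { "tasks": { "<taskId>": { "description": ... } } } } }
            for (JsonNode node : root.path("nodes")) {
                Iterator<Map.Entry<String, JsonNode>> tasks = node.path("tasks").fields();
                while (tasks.hasNext()) {
                    Map.Entry<String, JsonNode> task = tasks.next();
                    String description = task.getValue().path("description").asText();
                    // The reindex description contains the source and destination index names.
                    if (description.contains(destIndex)) {
                        return task.getKey(); // full task id in "nodeId:taskNumber" form
                    }
                }
            }
            return null; // no reindex into that index currently running
        }
    }
}

Once you have the task id you can poll GET /_tasks/<taskId> until its completed flag is true.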

Proper way to migrate documents in couchbase (API 1.4.x -> 2.0.x)

I would like to migrate documents persisted in Couchbase via API 1.4.10 to the new document types provided by API 2.0.5, such as JsonDocument. I found that it is possible to add custom transcoders to a Bucket, so that when decoding documents I can check the flags and decide which transcoder to use. But this doesn't seem like a very good solution to me. Are there other, more proper ways to do this? Thanks.
Migration can only be done at runtime, upon user request; there are too many documents to migrate them all at once in the background.
You don't need to use a custom transcoder to read documents created with the 1.x SDK. Instead, use the LegacyDocument type to read (and write) documents in legacy format.
More importantly, you shouldn't keep running with a mix of legacy and new documents in the database for very long. The LegacyDocument type is provided to facilitate the migration from the old format to the new SDK. The best practice here is to deploy an intermediate version of your application that attempts to read documents in one format and falls back to the other: legacy to new or vice versa, depending on which type of document is accessed more frequently at first. Once the intermediate version is deployed, run a background task that reads and converts all documents from the old format to the new. This is pretty straightforward: try to read each document as a LegacyDocument and, if that succeeds, store it right back as a JsonDocument using the CAS value you got earlier; if you can't read the document as legacy, it's already in the new format. The task should be throttled enough that it doesn't cause a large increase in database load. After the task finishes, remove the fallback code from the application and read and write everything as JsonDocument.
You mention having too many documents - how many is that? We've successfully migrated datasets with multiple billions of documents this way. This, admittedly, took several days to run. If you have a database that's larger than that, or has a very low resident ratio, it might not be practical to attempt to convert all documents.
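A minimal sketch of the per-document conversion step with the 2.x Java SDK, assuming the legacy documents contain JSON strings and a 2.x SDK version recent enough to have JsonObject.fromJson; the bucket is passed in and error handling is reduced to the bare minimum:

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.LegacyDocument;
import com.couchbase.client.java.document.json.JsonObject;
import com.couchbase.client.java.error.TranscodingException;

public class DocumentMigrator {

    /** Reads a document in legacy (1.4.x) format and writes it back as a JsonDocument. */
    public static void migrate(Bucket bucket, String id) {
        try {
            LegacyDocument legacy = bucket.get(id, LegacyDocument.class);
            if (legacy == null) {
                return; // document does not exist
            }
            // Assumes the 1.4.x SDK stored the value as a JSON string.
            JsonObject content = JsonObject.fromJson(String.valueOf(legacy.content()));

            // Write back in the new format, passing the CAS value so a concurrent
            // update made by the application is not silently overwritten.
            bucket.replace(JsonDocument.create(id, content, legacy.cas()));
        } catch (TranscodingException alreadyMigrated) {
            // Could not be decoded as a legacy document; depending on SDK version this
            // is how a document that is already in the new format surfaces here.
        }
    }
}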

Preprocessing input text before calling ElasticSearch API

I have a Java client that allows indexing documents on a local ElasticSearch server.
I now want to build a simple Web UI that allows users to query the ES index by typing in some text in a form.
My problem is that, before calling ES APIs to issue the query, I want to preprocess the user input by calling some Java code.
What is the easiest and "cleanest" way to achieve this?
Should I create my own APIs so that the UI can access my Java code?
Should I build the UI with JSP so that I can directly call my Java code?
Can I somehow make ElasticSearch execute my Java code before the query is executed? (Perhaps by creating my own ElasticSearch plugin?)
In the end, I opted for the simple solution of using JSON-based RESTful APIs. Time proved this to be quite flexible and effective for my case, so I thought I should share it:
My Java code exposes its ability to query an ElasticSearch index by running an HTTP server and responding to client requests with JSON-formatted ES results. I created the HTTP server with a few lines of code using the JDK's built-in com.sun.net.httpserver.HttpServer. There are more serious/complex HTTP servers out there (such as Tomcat), but this was very quick to adopt and required zero configuration headaches.
My Web UI makes HTTP GET requests to the Java server, receives the JSON-formatted data and consumes it happily. My UI is implemented in PHP, but any web language does the job, as long as you can issue HTTP requests.
This solution works really well in my case, because it leaves me with no dependency on ES plugins. I can do any sort of pre-processing before calling ES, and even post-process the ES output before sending the results back to the UI.
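For reference, a stripped-down version of that setup using the JDK's built-in HTTP server; the port, the ES URL and the preprocess() step are placeholders for whatever your own code does:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

import com.sun.net.httpserver.HttpServer;

public class SearchGateway {

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/search", exchange -> {
            // 1. Read the raw user input from the query string (?q=...).
            String query = exchange.getRequestURI().getRawQuery();
            String userText = URLDecoder.decode(query.substring(query.indexOf('=') + 1), "UTF-8");

            // 2. Preprocess it with your own Java code (placeholder below).
            String processed = preprocess(userText);

            // 3. Forward the processed text to Elasticsearch and relay the JSON response.
            String esUrl = "http://localhost:9200/myindex/_search?q="
                    + URLEncoder.encode(processed, "UTF-8");
            HttpURLConnection es = (HttpURLConnection) new URL(esUrl).openConnection();
            String json;
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(es.getInputStream(), StandardCharsets.UTF_8))) {
                json = reader.lines().collect(Collectors.joining("\n"));
            }

            byte[] body = json.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
    }

    private static String preprocess(String input) {
        return input.trim().toLowerCase(); // placeholder for the real preprocessing logic
    }
}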
Depending on the type of pre-processing, you can create an Elasticsearch plugin as a custom analyser or a custom filter: you essentially extend the appropriate Lucene class(es) and wrap everything into an Elasticsearch plugin. Once the plugin is loaded, you can configure the custom analyser and apply it to the relevant fields. There are a lot of analysers and filters already available in Elasticsearch, so you might want to have a look at those before writing your own.
Elasticsearch plugins: https://www.elastic.co/guide/en/elasticsearch/reference/1.6/modules-plugins.html (a list of known plugins at the end)
Defining custom analysers: https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-analyzers.html
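If the pre-processing is really about how the text is analysed, you can often avoid writing a plugin at all by assembling a custom analyser from the built-in tokenizers and filters in the index settings. A sketch that creates such an index over plain HTTP from Java; the index name and the analyser composition are made up, and afterwards you would reference the analyser from the relevant field mappings:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class CreateIndexWithCustomAnalyzer {

    public static void main(String[] args) throws Exception {
        // Custom analyser assembled from built-in components: standard tokenizer
        // plus lowercase and asciifolding token filters.
        String settings = "{"
                + "\"settings\": {\"analysis\": {\"analyzer\": {"
                + "  \"my_analyzer\": {\"type\": \"custom\", \"tokenizer\": \"standard\","
                + "                    \"filter\": [\"lowercase\", \"asciifolding\"]}}}}"
                + "}";

        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://localhost:9200/myindex").openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(settings.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Create index response: " + conn.getResponseCode());
    }
}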

Choice between REST API or Java API

I have been reading about Neo4j for the last few days. I am quite confused about whether I should use the REST API or whether I can go with the Java APIs.
I need to create millions of nodes with connections among them, and I want to add indexes on a few node attributes for searching. Initially I started with the embedded GraphDB and the Java API, but I soon hit an OutOfMemoryError while indexing only a few nodes, so I thought it might be better to run Neo4j as a service and connect to it through the REST API; then it would do all the memory management by itself, swapping data in and out of the underlying files. Is my assumption right?
Further, I plan to scale my solution to billions of nodes, which I believe won't be possible with a single-machine Neo4j installation. I also believe Neo4j can run in a distributed mode, which is another reason I thought continuing with the REST API implementation was the best idea.
However, I couldn't find any good documentation about how to run Neo4j in a distributed environment.
Can I also do things like batch insertion through the REST API, as I do with the Java API when the graph DB runs in embedded mode?
Do you know why you are getting the OutOfMemoryError? It sounds like you are creating all these nodes in the same transaction, which causes the whole transaction to live in memory. Try committing small chunks at a time so that Neo4j can write them to disk. You don't have to manage Neo4j's memory yourself, aside from things like the cache.
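A sketch of that chunked-commit pattern with the embedded Java API; the node count, label and property are made up, and the factory call shown is the 3.x form (2.x takes a String path and uses DynamicLabel.label):

import java.io.File;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class BatchedInsert {

    private static final int BATCH_SIZE = 10_000;

    public static void main(String[] args) {
        GraphDatabaseService db = new GraphDatabaseFactory()
                .newEmbeddedDatabase(new File("data/graph.db"));

        Transaction tx = db.beginTx();
        try {
            for (int i = 0; i < 1_000_000; i++) {
                Node node = db.createNode(Label.label("Page"));
                node.setProperty("url", "http://example.com/" + i);

                // Commit every BATCH_SIZE creations so the transaction state
                // doesn't accumulate in memory and cause an OutOfMemoryError.
                if ((i + 1) % BATCH_SIZE == 0) {
                    tx.success();
                    tx.close();
                    tx = db.beginTx();
                }
            }
            tx.success();
        } finally {
            tx.close();
            db.shutdown();
        }
    }
}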
Distributed mode uses a master/slave architecture, so you'll still have a copy of the entire DB on each machine. Neo4j is very efficient in terms of disk storage: a node takes 9 bytes, a relationship takes 33 bytes, and properties are variable in size.
There is a batch REST API, which groups many operations into a single HTTP call; however, making REST calls is still slower than running embedded.
There are also some disadvantages to using the REST API that you did not mention, such as transactions. If you need atomic operations, where you create several nodes and relationships and change properties, and nothing should be committed if any step fails, you cannot do this through the REST API.

API calls inside mapreduce job

I would like to ask about the inconveniences of calling an external API while running a MapReduce job. What are the drawbacks?
Some examples: inside the mapper we need to geocode an address and we call a Google Maps API, or we call an external DB to get the elements related to an item, etc.
It's perfectly OK to call an external API as long as there are no DB calls behind that API. In many ways this is preferable to writing your logic all over again. Oftentimes you want your MapReduce jobs to be nothing more than wrappers around logic written in a non-MapReduce context. That makes for more testable code.
However, making external DB calls is STRONGLY discouraged. It will drastically reduce the speed of your MapReduce jobs, since every call is a random access, and having several thousand map/reduce tasks hitting your DB at the same time could bring it to its knees. If you need related elements, it's preferable to have all the elements on HDFS and do the join in MapReduce. If the DB you're talking about is a NoSQL store such as Cassandra or HBase, it will have a batch export feature for dumping the entire table onto HDFS.
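For the geocoding example, the mapper can simply wrap a call to the external service; GeocodingClient below is a hypothetical stand-in for the real API client, and the small in-mapper cache avoids repeating calls for duplicate addresses:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Emits (address, "lat,lon") pairs; each input line is assumed to hold one address. */
public class GeocodeMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Hypothetical thin client around the external geocoding HTTP API.
    private final GeocodingClient geocoder = new GeocodingClient();

    // In-mapper cache so repeated addresses don't trigger repeated external calls.
    private final Map<String, String> cache = new HashMap<>();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String address = value.toString().trim();
        String coordinates = cache.get(address);
        if (coordinates == null) {
            coordinates = geocoder.lookup(address); // the external API call
            cache.put(address, coordinates);
        }
        context.write(new Text(address), new Text(coordinates));
    }

    /** Placeholder; a real implementation would call the external geocoding service. */
    static class GeocodingClient {
        String lookup(String address) throws IOException {
            return "0.0,0.0";
        }
    }
}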
