Need external version support in OpenSearch Update operation - java

I am using OpenSearch to index JSON documents and make them searchable. All documents have an update timestamp field in epoch format. The problem is that I can receive an update request whose document body contains an older update time. My application should skip the update if the incoming document's update time is older than the update time field of the existing document stored in OpenSearch.
To fulfil this requirement, I added an external version to the HTTP request: /test_index/_update/123?version=1674576432910&version_type=external.
But I am getting this error:
Validation Failed: 1: internal versioning can not be used for optimistic concurrency control. Please use if_seq_no and if_primary_term instead
I read about the if_seq_no and if_primary_term fields, but they can't be used to solve my problem. Has anyone else encountered and solved this problem? Or does anyone know of a plugin that I can install to support this? Please share.

Sadly, neither OpenSearch nor Elasticsearch supports external versioning in the update request, and I don't see the feature being added in the near future. You can solve your specific problem using scripting. OpenSearch supports multiple scripting languages, including Painless. You can write a script that compares a specific field (in your case, the update timestamp) and, if the condition is true, goes ahead and updates the fields with the new values.
{
  "script": {
    "lang": "painless",
    "source": "if (params.updateTimestamp > ctx._source.updateTimestamp) {for (entry in params.entrySet()) {ctx._source[entry.getKey()] = entry.getValue();}}"
  }
}
The sample script above silently skips the update if the new document has an older timestamp. You could also throw an exception instead and handle it in your application; that way you can track the number of requests affected by this issue.
You can register a similar script as a stored script and use it in your update request. You can find more details, including a sample HTTP request and Java code, in this article.
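For reference, a complete inline-script update request with parameters could look something like the sketch below (reusing the index and ID from the question; someOtherField is just a placeholder for whatever fields you actually send):

POST /test_index/_update/123
{
  "script": {
    "lang": "painless",
    "source": "if (params.updateTimestamp > ctx._source.updateTimestamp) { for (entry in params.entrySet()) { ctx._source[entry.getKey()] = entry.getValue(); } }",
    "params": {
      "updateTimestamp": 1674576432910,
      "someOtherField": "some new value"
    }
  }
}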

You should use the "if_seq_no" and "if_primary_term" parameters to perform optimistic concurrency control.
To solve your problem, you could first retrieve the existing document from OpenSearch using the document ID and check its update timestamp field. If the existing timestamp is newer than the one in the update request, you can skip the update. Otherwise, you can include the "if_seq_no" and "if_primary_term" parameters in your update request along with the updated document: "if_seq_no" should be set to the sequence number of the existing document and "if_primary_term" to its primary term.
You can use the Update API for this... or the Optimistic Concurrency Control (OCC) mechanism, which is based on a combination of _seq_no and _primary_term fields.
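As a rough sketch of that flow (reusing the index and ID from the question; the returned values and the doc fields are placeholders), first fetch the document, which returns _seq_no and _primary_term along with _source:

GET /test_index/_doc/123

Then, if the incoming timestamp is newer, send the conditional update; it fails with a version conflict (HTTP 409) if the document changed in between, in which case you can re-read and retry:

POST /test_index/_update/123?if_seq_no=5&if_primary_term=1
{
  "doc": {
    "updateTimestamp": 1674576432910
  }
}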

Related

Moving older versions of documents to a history index when saving new ones (elasticsearch)

I want to know if there is a built-in solution for something I need in Elasticsearch.
I want that every time my document is replaced by a newer version of itself (with the same ID), the older version is not deleted but moved to a history index.
In that history index, I don't want replacements of older versions, but an accumulation of them.
Do you know if there is a built-in solution for this, or will I need to program it myself in my API?
Thank you.
As there is no built-in method for your use case, you need to do it yourself in your application. I don't think Elasticsearch is best suited for keeping the history of a document: as soon as you update the document in the history_index you will lose its previous history, and if I understand correctly you want the complete history of a document.
I guess the best option is to use an RDBMS or NoSQL store where you create a new history entry per document version (the document_id of the Elasticsearch document and its version number will help you construct the complete history of your Elasticsearch document).
You can update that DB as soon as you get an update on the Elasticsearch document.
There does not appear to be any built-in functionality for this. The easiest approach might be to copy the old version to the history index with the _reindex API, then write the new version:
POST /_reindex
{
  "source": {
    "index": "your_index",
    "query": {
      "ids": {
        "values": ["<id>"]
      }
    }
  },
  "dest": {
    "index": "your_history_index"
  },
  "script": {
    "source": "ctx.remove(\"_id\")"
  }
}

PUT /your_index/_doc/<id>
{
  ...
}
Note the script ctx.remove("_id") done as part of the _reindex operation, which ensures Elasticsearch will generate a new ID for the document instead of reusing the existing ID. This way, your_history_index will have one copy for each version of the document. Without this script, _reindex would preserve the ID and overwrite older copies.
I assume that the documents contain an identifier that can be used to search your_history_index for all versions of a document, even though _id is reset.
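For example, assuming the documents carry a hypothetical business_id field, all archived versions of one document could then be fetched from the history index with a simple term query:

GET /your_history_index/_search
{
  "query": {
    "term": {
      "business_id": "<id>"
    }
  }
}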

Apache Solr filtering not working but possible to retrieve by id

Background:
We have a 3-node Solr Cloud that was migrated to Docker. It works as expected; however, newly inserted data can only be retrieved by ID. Once we try to use filters, it doesn't show up. Note that old data can still be filtered without any issues.
The database is used via a Spring Boot CRUD-like application.
More background:
The app and Solr were migrated by another person, and I inherited the codebase recently, so I am not familiar with the implementation in much detail and am still digging and debugging.
The nodes were migrated as-is (the data was copied into a docker mount).
What I have so far:
I have checked the logs of all the solr nodes and see the following happening when making the calls to the application:
Filtering:
2019-02-22 14:17:07.525 INFO (qtp15xxxxx-15) [c:content_api s:shard1 r:core_node1 x:content_api_shard1_replica0] o.a.s.c.S.Request
[content_api_shard1_replica0]
webapp=/solr path=/select
params=
{q=*:*&start=0&fq=id-lws-ttf:127103&fq=active-boo-ttf:(true)&fq=(publish-date-tda-ttf:[*+TO+2019-02-22T15:17:07Z]+OR+(*:*+NOT+publish-date-tda-ttf:[*+TO+*]))AND+(expiration-date-tda-ttf:[2019-02-22T15:17:07Z+TO+*]+OR+(*:*+NOT+expiration-date-tda-ttf:[*+TO+*]))&sort=create-date-tda-ttf+desc&rows=10&wt=javabin&version=2}
hits=0 status=0 QTime=37
Get by ID:
2019-02-22 14:16:56.441 INFO (qtp15xxxxxx-16) [c:content_api s:shard1 r:core_node1 x:content_api_shard1_replica0] o.a.s.c.S.Request
[content_api_shard1_replica0]
webapp=/solr path=/get params={ids=https://example.com/app/contents/127103/middle-east&wt=javabin&version=2}
status=0 QTime=0
Disclaimer:
I am an absolute beginner in working with Solr and am going through documentation ATM in order to get better insight into the nuts and bolts.
Assumptions and WIP:
The person who migrated it told me that only the data was copied, not the configuration. I have acquired the old config files (/opt/solr/server/solr/configsets/) and am trying to compare them to the new ones. But the assumption is that the configs were defaults.
The old version was 6.4.2 and the new one is 6.6.5 (not sure that this could be the issue)
Is there something obvious that we are missing here? What is super confusing is the fact that the data can be retrieved by ID AND that the OLD data can be filtered.
Update:
After some research, I have to say that I have ruled out a config issue, because when I inspect the configuration from the admin UI, I see the correct configuration.
Also, another weird behavior is that the data becomes queryable after some time (like more than 5 days). I can see that because I run the query from the UI and order it by descending creation date. From there, I can see my test documents that I was not able to see just days ago.
Relevant commit config part:
<autoCommit>
<maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
<maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>
More config output from the admin endpoint:
config: {
  znodeVersion: 0,
  luceneMatchVersion: "org.apache.lucene.util.Version:6.0.1",
  updateHandler: {
    indexWriter: {
      closeWaitsForMerges: true
    },
    commitWithin: {
      softCommit: true
    },
    autoCommit: {
      maxDocs: -1,
      maxTime: 15000,
      openSearcher: false
    },
    autoSoftCommit: {
      maxDocs: -1,
      maxTime: -1
    }
  },
  query: {
    useFilterForSortedQuery: false,
    queryResultWindowSize: 20,
    queryResultMaxDocsCached: 200,
    enableLazyFieldLoading: true,
    maxBooleanClauses: 1024,
    filterCache: {
      autowarmCount: "0",
      size: "512",
      initialSize: "512",
      class: "solr.FastLRUCache",
      name: "filterCache"
    },
    queryResultCache: {
      autowarmCount: "0",
      size: "512",
      initialSize: "512",
      class: "solr.LRUCache",
      name: "queryResultCache"
    },
    documentCache: {
      autowarmCount: "0",
      size: "512",
      initialSize: "512",
      class: "solr.LRUCache",
      name: "documentCache"
    },
    fieldValueCache: {
      size: "10000",
      showItems: "-1",
      initialSize: "10",
      name: "fieldValueCache"
    }
  },
  ...
According to your examples, you're only retrieving the document when you query the realtime get endpoint, i.e. /get. This endpoint returns documents by ID, even if the document hasn't been committed to the index or a new searcher hasn't been opened yet.
A new searcher has to be created before any changes to the index become visible to the regular search endpoints, since the old searcher will still use the old index files for searching. If a new searcher isn't created, the stale content will still be returned. This matches the behaviour you're seeing, where you're not opening any new searchers, and content becomes visible when the searcher is recycled for other reasons (possibly because of restarts/another explicit commit/merges/optimizes/etc.).
Your example configuration shows that autoSoftCommit is disabled, while the regular autoCommit is set to not open a new searcher (and thus, no new content is shown). I usually recommend keeping autoSoftCommit disabled and instead relying on commitWithin in the URL, as it allows greater configurability for different types of data and lets you ask for a new searcher to be opened within x seconds of the data being added. The default behaviour for commitWithin is that a new searcher will be opened after the commit has happened.
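For example (using the content_api collection from your logs; the document fields are placeholders), an update that asks Solr to open a new searcher within ten seconds would look something like:

curl -X POST 'http://localhost:8983/solr/content_api/update?commitWithin=10000' \
  -H 'Content-Type: application/json' \
  -d '[{"id": "some-doc-id", "title": "example"}]'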
Sounds like you might have switched to a default managed schema on upgrade. Look for schema.xml in your previous install, along with the schemaFactory section in your prior install's solrconfig.xml. More info at https://lucene.apache.org/solr/guide/6_6/schema-factory-definition-in-solrconfig.html#SchemaFactoryDefinitioninSolrConfig-SolrUsesManagedSchemabyDefault
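If that turns out to be the case, the classic schema.xml behaviour can be restored by declaring the schema factory explicitly in solrconfig.xml (see the linked page for details):

<schemaFactory class="ClassicIndexSchemaFactory"/>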

Simple GET request with Facebooks API

I am currently taking a course in app development and I am trying to use Facebook's API for GET requests on certain events. My goal is to get a JSON file containing all comments made on a certain event.
However, some events return only an "id" key with an ID number, such as this:
{
"id": "116445769058883"
}
That happens with this event:
https://www.facebook.com/events/116445769058883/
However, other events, such as https://www.facebook.com/events/1964003870536124/, return only the latest comment for some reason.
I am experimenting with the Facebook Graph API Explorer:
https://developers.facebook.com/tools/explorer/
This is the GET request that I have been using in the explorer:
GET -> /v.10/facebook-id/?fields=comments
Any ideas? It's really tricky to understand the response since both events have the privacy set to OPEN.
Starting from v2.4 of the API, the API is declarative, which means you'll need to specify which fields you want the API to return.
For example, if you want the first name and last name of the user, you make a GET request to /me?fields=first_name,last_name; otherwise you will only get back the default fields, which are id and name.
If you want to see which fields are available for a given endpoint, use the metadata field, e.g. GET /me?metadata=true
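Applied to your case (keeping whichever API version you are on; the sub-field list is just an example), you would request the comments edge together with the sub-fields you need, something like:

GET /116445769058883?fields=comments{message,created_time,from}

If the event has more comments than fit in one response, the paging cursors come back under comments.paging.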

How to enforce valid JSON in Elasticsearch?

Consider the following code:
function add_document() {
  index=$1
  _type=$2
  json="$3"
  curl -s -X PUT "$NODE_ADDRESS/$index/$_type?pretty" -d "$json"
}
add_document users documents '{user_name : "kshitiz"}'
The above function runs just fine and adds the document to the index. The problem, however, is that {user_name : "kshitiz"} isn't valid JSON.
I can validate the JSON in my code before sending it to Elasticsearch, but the problem is that this instance will be shared among a large team. A better solution would be to reject invalid JSON at the node, so that the add operation fails and forces developers to code properly.
How can I enable strict JSON validation in Elasticsearch?
You would have to implement this functionality in a plugin. Have a look at these examples:
Validate data before indexing them
Implementing data validation in ElasticSearch
Use mappings, which are basically the schema for docs in an ES doc type. A mapping will only do type checks on keys and their associated values.
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html#_explicit_mappings
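Note that a mapping will not turn unquoted keys into valid JSON, but on recent Elasticsearch versions you can at least make the mapping strict so that documents containing unexpected fields are rejected. A rough sketch (the index and field names are just examples):

PUT /users
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "user_name": { "type": "keyword" }
    }
  }
}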

Solr thinks an update is an add operation

Solr version is 5.4.1
I posted this to http://localhost:8983/solr/default-collection/update and it treated it like I was adding a whole document, not a partial update:
{
  "id": "0be0daa1-a6ee-46d0-ba05-717a9c6ae283",
  "tags": {
    "add": [ "news article" ]
  }
}
In the logs, I found this:
2016-02-26 14:07:50.831 ERROR (qtp2096057945-17) [c:default-collection s:shard1_1 r:core_node21 x:default-collection] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: [doc=0be0daa1-a6ee-46d0-ba05-717a9c6ae283] missing required field: data_type
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:198)
at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:83)
at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:273)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:207)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:169)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
Does this make any sense? I sent updates like that just fine a day or two ago; now it is acting like the update request is a whole new document.
UPDATE: The answer I selected put me in the right direction. What I had to do to correct it was load all of the existing required fields and add them to the payload myself. It then worked for me. It was not automatic, which suggests that it might be a bug in Solr 5.4.1.
If I'm not wrong, an update request always ends up as an add operation. The difference with a partial update is that Solr first gets the document from the index, then overrides fields according to the request parameters, and finally performs the usual document indexing.
The document is rejected because the required field data_type is missing: it should be defined with stored="true" in schema.xml, or added to the partial document fields every time a partial update occurs. Actually, the same applies to all fields.
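For illustration, the relevant schema.xml declarations might look like this (the field types are assumptions about your schema):

<field name="data_type" type="string" indexed="true" stored="true" required="true"/>
<field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>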
EDIT: This is not true anymore since Solr introduced In-Place Updates.
