ElasticSearch index deletion - java

I'm using the Java API to delete old indexes from ElasticSearch.
Client client = searchConnection.client;
DeleteIndexResponse delete = client.admin().indices().delete(new DeleteIndexRequest("location")).actionGet();
During deletion the cluster goes red for about a minute and stops indexing new data; the reported reason is "missing indices/replicas etc".
How can I tell ElasticSearch that I'm about to delete them, so that it does not go into the red state?

You could use aliases in order to abstract from the real indices underneath. The idea would be to read from an alias and write to an alias instead. That way you can create a new index, swap the write alias to the new index (so that the indexing process is not disrupted) and then delete the old index. Process-wise, it would go like this:
Context: Your current location index has the location_active alias and the indexing process writes to the location_active alias instead of directly to the location index.
Step 1: Create the new location_112015 index
curl -XPUT localhost:9200/location_112015
Step 2: Swap the location_active alias from the "old" location index to the "new" one created in step 1
curl -XPOST 'http://localhost:9200/_aliases' -d '{
"actions" : [
{ "remove" : { "index" : "location", "alias" : "location_active" } },
{ "add" : { "index" : "location_112015", "alias" : "location_active" } }
]
}'
Note that this operation is atomic, so if the indexing process keeps sending new documents to location_active, the swap is transparent to it: no documents are lost and no errors are raised.
Step 3: Remove the old index
curl -XDELETE localhost:9200/location
Step 4: Rinse and repeat as often as needed
Note: these operations can easily be performed with the Java client library as well.
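As a rough illustration of that note, here is a minimal Java sketch that builds the same swap body as step 2 and posts it to the REST endpoint with the JDK's built-in HTTP client (Java 11+) rather than the Elasticsearch Java client. The class name AliasSwapSketch, its method names and the base URL parameter are assumptions made for this example; the index and alias names come from the steps above.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of any Elasticsearch library.
final class AliasSwapSketch {

    // Builds the body for POST /_aliases: remove the alias from the old index
    // and add it to the new one in a single request.
    // Names are inserted as-is, so they must not contain quotes or backslashes.
    static String buildSwapBody(String oldIndex, String newIndex, String alias) {
        return "{\"actions\":["
            + "{\"remove\":{\"index\":\"" + oldIndex + "\",\"alias\":\"" + alias + "\"}},"
            + "{\"add\":{\"index\":\"" + newIndex + "\",\"alias\":\"" + alias + "\"}}"
            + "]}";
    }

    // Sends the swap to a node and returns the HTTP status code.
    // baseUrl (for example http://localhost:9200) is an assumption.
    static int swap(String baseUrl, String oldIndex, String newIndex, String alias)
            throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/_aliases"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(buildSwapBody(oldIndex, newIndex, alias)))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }
}
```

Calling AliasSwapSketch.swap("http://localhost:9200", "location", "location_112015", "location_active") and then deleting the location index would mirror steps 2 and 3.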

Related

Changing alias in ElasticSearch returns 200 and acknowledged but does not change alias

Using Elasticsearch 8.4.3 with Java 17 and a cluster of 3 nodes, all 3 of which are master eligible, we start with the following situation:
index products-2023-01-12-0900 which has an alias current-products
We then start a job that creates a new index products-2023-01-12-1520 and at the end using elastic-rest-client on client side and alias API, we make this call:
At 2023-01-12 16:27:26,893:
POST /_aliases
{"actions":[
{
"remove": {
"alias":"current-products",
"index":"products-*"
}
},
{
"add":{
"alias":"current-products",
"index":"products-2023-01-12-1520"}
}
]}
And we get the following response 26 ms later, with HTTP response code 200:
{"acknowledged":true}
But looking at the result, the old index still has the current-products alias.
I don't understand why this happens, and it does not happen every time (it happened 2 times out of around 10 indexations).
Is it a known bug, or is it regular behaviour?
Edit for #warkolm:
GET /_cat/aliases?v before indexation as of now:
alias index filter routing.index routing.search is_write_index
current-products products-2023-01-13-1510 - - - -
It appears that there might be an issue with the way you are updating the alias. When you perform a POST request to the _aliases endpoint with the "remove" and "add" actions, Elasticsearch will update the alias based on the current state of the indices at the time the request is executed.
However, it is possible that there are other processes or actions that are also modifying the indices or aliases at the same time, and this can cause conflicts or inconsistencies. Additionally, when you use the wildcard character (*) in the "index" field of the "remove" action, it will remove the alias from all indices that match the pattern, which may not be the intended behavior.
Note that the _aliases endpoint is the Indices Aliases API, and the actions sent in a single request are already applied atomically, so there is no other endpoint to switch to. What you can change is the "remove" action: instead of using the wildcard character, explicitly specify the index that you want to remove the alias from.
Here is an example of an alias update that names both indices explicitly:
POST /_aliases
{
"actions": [
{ "remove": { "index": "products-2023-01-12-0900", "alias": "current-products" } },
{ "add": { "index": "products-2023-01-12-1520", "alias": "current-products" } }
]
}
This way, the alias will only be removed from the specific index "products-2023-01-12-0900" and added to the specific index "products-2023-01-12-1520". This can help avoid any conflicts or inconsistencies that may be caused by other processes or actions that are modifying the indices or aliases at the same time.
You are already on 8.4.3; if the problem persists with explicit index names, it may be worth checking the release notes of later versions for related fixes.
In conclusion, this does not look like a known bug. It is more likely the result of several processes modifying the indices or aliases at the same time, and naming the exact indices in the "remove" and "add" actions can help avoid it.
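Since the question shows that a 200 response with "acknowledged": true does not by itself prove the alias ended up where expected, it may help to check after the swap. Below is a minimal Java sketch, assuming the plain-text output of GET /_cat/aliases?v in the column layout shown in the question; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for inspecting the text returned by GET /_cat/aliases?v.
final class AliasCheckSketch {

    // Returns every index listed for the given alias.
    // Assumes whitespace-separated columns with the alias first and the index second.
    // The header row is skipped naturally unless the alias is literally named "alias".
    static List<String> indicesForAlias(String catOutput, String alias) {
        List<String> indices = new ArrayList<>();
        for (String line : catOutput.split("\\R")) {
            String[] columns = line.trim().split("\\s+");
            if (columns.length >= 2 && columns[0].equals(alias)) {
                indices.add(columns[1]);
            }
        }
        return indices;
    }
}
```

If the returned list contains anything other than the newly created index, the swap did not take effect as intended and can be retried or investigated.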

Retrieve data from Elasticsearch using aggregations where the values contains hyphen

I have been working with Elasticsearch for quite some time now, and I recently ran into a problem.
I want to group by a particular field in an Elasticsearch index. The values of that field contain hyphens and other special characters.
SearchResponse res1 = client.prepareSearch("my_index")
    .setTypes("data")
    .setSearchType(SearchType.QUERY_AND_FETCH)
    .setQuery(QueryBuilders.rangeQuery("timestamp").gte(from).lte(to))
    .addAggregation(AggregationBuilders.terms("cat_agg").field("category").size(10))
    .setSize(0)
    .execute()
    .actionGet();
Terms termAgg = res1.getAggregations().get("cat_agg");
for (Bucket item : termAgg.getBuckets()) {
    // one bucket per term, with the number of matching documents
    String cat_number = item.getKey();
    System.out.println(cat_number + " " + item.getDocCount());
}
This is the query I have written in order to get the data grouped by the "category" column in "my_index".
The output I expected after running the code is:
category-1 10
category-2 9
category-3 7
But the output I am getting is:
category 10
1 10
category 9
2 9
category 7
3 7
I have already gone through some similar questions, but couldn't solve my issue with their answers.
That's because your category field has a default string mapping and it is analyzed, hence category-1 gets tokenized as two tokens namely category and 1, which explains the results you're getting.
In order to prevent this, you can update your mapping to include a sub-field category.raw which is going to be not_analyzed with the following command:
curl -XPUT localhost:9200/my_index/data/_mapping -d '{
"properties": {
"category": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}'
After that, you need to re-index your data; your aggregation will then return what you expect.
Just make sure to change the following line in your Java code:
.addAggregation(AggregationBuilders.terms("cat_agg").field("category.raw").size(10))
(note the .raw suffix added to the field name)
When you index "category-1" you will get (by default) two terms, "category", and "1". Therefore when you aggregate you will get back two results for that.
If you want it to be treated as a single term, you need to change the analyzer used on that field at index time: set it to use the keyword analyzer.
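To see why the buckets split, here is a small Java sketch that counts terms the way a terms aggregation counts documents per term. It does this once with a crude stand-in for the default analyzer (lowercase, split on anything that is not a letter or digit) and once keeping the whole value as a single term, as not_analyzed or the keyword analyzer would. The tokenizer is a deliberate simplification and not Lucene's actual implementation; all names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical illustration only; this does not use Elasticsearch or Lucene.
final class TermBucketsSketch {

    // Simplified stand-in for an analyzed string field.
    static List<String> analyzed(String value) {
        return Arrays.asList(value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"));
    }

    // Stand-in for a not_analyzed field: the value is kept as one term.
    static List<String> notAnalyzed(String value) {
        return List.of(value);
    }

    // Counts, for each term, how many field values (documents) contain it.
    static Map<String, Integer> bucket(List<String> values,
                                       Function<String, List<String>> analyzer) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String value : values) {
            // A term counts once per document, even if it occurs several times.
            Set<String> terms = new LinkedHashSet<>(analyzer.apply(value));
            for (String term : terms) {
                if (!term.isEmpty()) {
                    counts.merge(term, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

For the values category-1, category-1 and category-2, bucketing with analyzed yields category=3, 1=2 and 2=1, while bucketing with notAnalyzed yields category-1=2 and category-2=1, which is the shape the question expects.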

Archiving a mongodb collection

I'm using Java Spring and Spring Data for MongoDB.
I have a collection that needs to contain only documents from the last 3 months, but all the documents should be saved in some way (maybe exported to a file?). I'm looking for a solution, but all I can find talks about full DB backups.
What is the best way to keep the collection limited to the last 3 months? (A weekly cron job?)
How should I save the collection archive? I think mongodump is overkill.
Both mongoexport and mongodump support a -q option to specify a query that limits the documents that will be exported. The choice between them depends mostly on the format you'd like the data to be stored in.
Let's assume that you have a collection with a timestamp field. You could run either one of these (filling in the required names and times in the angle brackets):
mongoexport -d <yourdatabase> -c <yourcollection> -q "{ timestamp: { \$gt: <yourtimestamp>}}" -o <yourcollection_export_yourtimestamp>.json
mongodump -d <yourdatabase> -c <yourcollection> -q "{ timestamp: { \$gt: <yourtimestamp>}}"
And then delete the old data.
Alternatively, you could take periodic snapshots via cron with either method on a collection that has a TTL index, so that you don't have to prune it yourself; MongoDB will automatically delete older data:
db.collectionname.createIndex( { "createdAt": 1 }, { expireAfterSeconds: 7862400 } )
This will keep deleting any document older than 91 days, based on a createdAt field in the document.
http://docs.mongodb.org/manual/tutorial/expire-data/
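The expireAfterSeconds value above is 91 days expressed in seconds. Here is a quick Java check of that arithmetic, plus one possible way to derive a three-month cutoff for the export query. The class and method names are hypothetical, and interpreting "3 months" as calendar months in UTC is an assumption.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;

// Hypothetical helper for retention arithmetic; it does not talk to MongoDB.
final class RetentionSketch {

    // 91 days in seconds, matching the TTL index value above.
    static long ttlSeconds() {
        return Duration.ofDays(91).getSeconds();
    }

    // The instant three calendar months before the given moment, evaluated in UTC.
    static Instant cutoff(Instant now) {
        return now.atOffset(ZoneOffset.UTC).minusMonths(3).toInstant();
    }
}
```

A scheduled job could pass RetentionSketch.cutoff(Instant.now()) as the timestamp placeholder in the export query.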
With mongoexport you can back up a single collection instead of the whole database. I would recommend a cron job (like you said) to export the data and keep the database limited to the documents of the last 3 months by removing older documents.
mongoexport -d databasename -c collectionname -o savefilename.json

How to assign constant boost value for single field in elasticsearch?

There are a lot of scoring/boosting options in Elasticsearch, but I haven't found a way to add a constant boost value for a particular field. If such an option exists, what should the mapping look like? Or is there an option to calculate the score for the entire document depending on which field is hit?
Here is the solution: the wrapper query "custom_boost_factor", which multiplies the score of the embedded query, whatever its type:
curl -XPOST 'http://localhost:9200/test/entry/_search?pretty=true' -d '{
"query":{
"custom_boost_factor" :{
"query" : {
"text_phrase_prefix" : {
"title" : "test"
}
},
"boost_factor": 2.0
}
}
}'

Generate random number within a range (0-100k) in a cluster environment

I have to generate a random number within a range (0-100,000) in a cluster environment (many stateless Java-based app servers + MongoDB), so that every user request gets a unique number and keeps it for the next few requests.
As I understand it, I have a few options:
1. Have some number persisted in Mongo and incrementAndGet it - but it's not atomic - bad choice.
2. Use Redis - it's atomic and supports counters.
3. Any idea? Is it safe to use a UUID and set a range for it?
4. Hazelcast?
Any other thoughts?
Thanks
I would leverage the existing MongoDB infrastructure and use the MongoDB findAndModify command to do an atomic increment and get operation.
For the shell the command would look like.
var result = db.ids.findAndModify( {
    query: { _id: "counter" },
    new: true,
    update: { $inc: { counter: 1 } },
    upsert: true
} );
The 'new : true' returns the document after the update. Upsert creates the document if it is missing.
The 10gen supported driver and the Asynchronous Driver both contain helper methods/builders for the find and modify command.
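Because the counter only ever grows, its value still has to be folded into the 0-100,000 range the question asks for. Here is a minimal Java sketch of that step, in which a LongSupplier stands in for the findAndModify call (in the real setup the increment happens inside MongoDB, not in the JVM). The class name and the wrap-around policy are assumptions, and wrapping means a number is handed out again after 100,001 requests.

```java
import java.util.function.LongSupplier;

// Hypothetical helper; the counter source is supplied by the caller.
final class RangeIdSketch {

    // Number of distinct values in 0..100000 inclusive.
    static final long RANGE_SIZE = 100_001L;

    // Maps a monotonically increasing counter value into 0..100000.
    static int toRange(long counterValue) {
        return (int) Math.floorMod(counterValue, RANGE_SIZE);
    }

    // Fetches the next counter value from the (assumed atomic) source and folds it.
    static int next(LongSupplier atomicCounter) {
        return toRange(atomicCounter.getAsLong());
    }
}
```

For local experiments, new AtomicLong()::incrementAndGet can serve as the supplier; in the cluster it would be replaced by a call that runs the findAndModify command and reads the counter field from the returned document.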
