I am having trouble getting the index ids from an index using the java api for elasticsearch.
When indexing a document I can get its id from the IndexResponse object. When indexing I do not specify the id, so I let Elasticsearch generate it. How can I get a listing of the ids for a specific index?
I would then iterate through the ids to submit other requests (i.e. GET, DELETE).
I am using the java api and not spring-data. The version is 1.7 for those interested.
Retrieving all of the IDs from an index is generally a terrible idea, and it only gets worse the larger your index is. If you really need them, consider using a scroll query to achieve what you want.
https://www.elastic.co/guide/en/elasticsearch/guide/master/scroll.html#CO33-1
The guide is written for Elasticsearch 2.x, but it also works for Elasticsearch 5.x if that's what you're using.
Essentially how it works is this:
Create a scroll window of size x (e.g. 1000): the first request returns the first x results without the overhead of scoring, analysis, etc., and Elasticsearch keeps the resources allocated for a time of y. The first response returns not only the first x documents but also a _scroll_id that can be used to fetch the next x documents.
GET http://yourhost:9200/old_index/_search?scroll=1m
{
"query": { "match_all": {}},
"sort" : ["_doc"],
"size": 1000
}
Say the response to the above query is something like...
{
  "_scroll_id": "abcdefghijklmnopqrstuvwxyz",
  "took": 15,
  "timed_out": false,
  "terminated_early": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1027,
    "max_score": null,
    "hits": [
      {
      ...
You would then use the _scroll_id like so to fetch the next x results.
GET http://yourhost:9200/_search/scroll
{
"scroll": "1m",
"scroll_id" : "abcdefghijklmnopqrstuvwxyz"
}
It returns a response similar to the one above. Make sure you take the _scroll_id from each response and use it in the next request. Across all of these responses you can iterate through the hits and pull out the IDs.
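Since you mention you're on the Java API (1.7), here is a rough sketch of the same loop with the 1.x transport client; the client setup is omitted and the index name is a placeholder, so treat this as an illustration rather than drop-in code:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

import java.util.ArrayList;
import java.util.List;

public class IdCollector {

    // Collects all document ids from the given index via the scroll API.
    // Assumes an already-built Client; adjust batch size and keep-alive to taste.
    public static List<String> collectIds(Client client, String index) {
        List<String> ids = new ArrayList<String>();

        SearchResponse response = client.prepareSearch(index)
                .setQuery(QueryBuilders.matchAllQuery())
                .setScroll(new TimeValue(60000)) // keep the scroll context alive for 1 minute
                .setSize(1000)                   // hits per shard per batch
                .execute().actionGet();

        while (true) {
            for (SearchHit hit : response.getHits().getHits()) {
                ids.add(hit.getId());            // the _id you can later GET or DELETE
            }
            if (response.getHits().getHits().length == 0) {
                break;                           // scroll exhausted
            }
            // fetch the next batch using the scroll id from the previous response
            response = client.prepareSearchScroll(response.getScrollId())
                    .setScroll(new TimeValue(60000))
                    .execute().actionGet();
        }
        return ids;
    }
}

On 1.x you can also set SearchType.SCAN to skip sorting entirely, but the plain scroll above is enough for collecting ids.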
Related
I am currently developing a Java application which connects to Elasticsearch over HTTP and fetches all the indices and their corresponding types into a map - key (index) and ArrayList (types). Please help.
One neat trick here is to use a terms aggregation on the _index and _type fields:
GET _search
{
  "size": 0,
  "aggs": {
    "byindex": {
      "terms": {
        "field": "_index",
        "size": 10000
      },
      "aggs": {
        "by_type": {
          "terms": {
            "field": "_type",
            "size": 10
          }
        }
      }
    }
  }
}
Alternatively you can use the aliases api, which returns all aliases and the indices they point to. In the case of indices without an alias, the alias is simply the index name.
Also note that as of v6 having more than one type per index is no longer allowed. Types will be removed entirely in a future version.
You can use your favorite HTTP library to do this in Java, or the official Elasticsearch HTTP client for Java, or one of several alternatives.
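For illustration, here is a minimal sketch that posts the aggregation above with plain java.net.HttpURLConnection and walks the buckets with Jackson to build the index-to-types map; the host/port and the lack of error handling are assumptions, so swap in whatever HTTP client you already use:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexTypeLister {

    // Runs the terms aggregation over HTTP and turns the buckets into
    // a Map of index name -> list of type names.
    public static Map<String, List<String>> indexTypes() throws Exception {
        String body = "{ \"size\": 0, \"aggs\": { \"byindex\": { \"terms\": { \"field\": \"_index\", \"size\": 10000 },"
                + " \"aggs\": { \"by_type\": { \"terms\": { \"field\": \"_type\", \"size\": 10 } } } } } }";

        HttpURLConnection conn = (HttpURLConnection) new URL("http://localhost:9200/_search").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        OutputStream out = conn.getOutputStream();
        out.write(body.getBytes("UTF-8"));
        out.close();

        JsonNode root = new ObjectMapper().readTree(conn.getInputStream());
        Map<String, List<String>> result = new HashMap<String, List<String>>();

        // aggregations.byindex.buckets -> one bucket per index,
        // each containing by_type.buckets -> one bucket per type
        for (JsonNode indexBucket : root.path("aggregations").path("byindex").path("buckets")) {
            List<String> types = new ArrayList<String>();
            for (JsonNode typeBucket : indexBucket.path("by_type").path("buckets")) {
                types.add(typeBucket.path("key").asText());
            }
            result.put(indexBucket.path("key").asText(), types);
        }
        return result;
    }
}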
//CouponCartRule - MongoDB
{
  "id": 1,
  "applicability": {
    "validFrom": "12-MAR-2017T01:00:00Z",
    "validTill": "12-MAR-2019T01:00:00Z"
  },
  "maxUsage": 100,
  "currentUsage": 99,
  "maxBudget": 1000,
  "currentBudget": 990
}
//UniqueCoupon collection - MongoDB
{
  "ruleId": 1,
  "couponCode": "CITIDEALMAR18",
  "currentCouponUsage": 90,
  "validFrom": "12-MAR-2018T01:00:00Z",
  "validTill": "12-APR-2018T01:00:00Z"
}
{
  "ruleId": 1,
  "couponCode": "CITIDEALAPR18",
  "currentCouponUsage": 9,
  "validFrom": "12-JAN-2018T01:00:00Z",
  "validTill": "12-FEB-2018T01:00:00Z"
}
//Order - MongoDB
{
  "id": 112,
  "total": {
    "discountCode": "CITIDEALMAR18",
    "discount": 10,
    "total": 90,
    "grandTotal": 90
  },
  "items": []
}
Problem Description
A CouponCartRule has conditions defined for a coupon code to be applied to an order.
Each shopping cart rule can have n coupons, where specific conditions might be overridden.
This is done to reuse a CouponCartRule while creating different coupons under it; the coupons under a CouponCartRule can grow to a maximum of 10K records.
When an order is successfully placed, the order will have the couponCode and discount applied.
The Order collection is managed by the OM team and it is very large.
We receive an order add / cancel event when an order is created / cancelled.
I have a requirement to check that maxUsage and maxBudget are not breached during checkout.
I am planning to listen to the order add and cancel events and update usage on each.
Steps
Update currentUsage of the UniqueCoupon [using Mongo's $inc operator].
Update the currentUsage & budget stats of the CouponCartRule.
Any suggestions on making the code idempotent in case of downtime? If there is a failure in step 2, the listener would send the message to the DLQ and the count would be updated again, which is not desired.
One option I thought of is to track and record stats at the UniqueCoupon level and later aggregate at the SCR level [the aggregation will eventually become consistent when a subsequent coupon is used].
Since this code is invoked in real time, it needs to be efficient.
I ended up proposing the solution below, which is supported by our infra [RabbitMQ and MongoDB], and it was accepted after design review. Hope this helps someone facing similar issues.
The system receives order add / cancel events from Order Management.
If a coupon code was used in the order, we upsert a record into the DB keyed on the order ID (a sketch of this upsert follows the sample documents below).
Order add event - track couponCode, orderId, discount, couponRuleId, updated, etc. Upsert based on orderId and set the couponUsed flag to "APPLIED".
Order cancel event - upsert based on orderId and set the couponUsed flag to "NOT_APPLIED".
Have a Mongo TTL on "coupon_usage_tracker" of 3 / 6 months.
Trigger a sync on the COUPON_RULE_SUMMARY and COUPON_SUMMARY dashboards based on the last event by passing the required data. The trigger-sync mechanism goes via MQ since we get automated retry on failure via the DLQ setup.
If the trigger sync fails for either COUPON_RULE_SUMMARY or COUPON_SUMMARY, it will be updated with the correct value on the next coupon code usage [eventual consistency].
The config collections will not have mutable state and will hold configuration only. This lets us cache the configs.
//COUPON_USAGE_TRACKER
{
  "ruleId": 948,
  "couponCode": "TENOFF",
  "couponId": 123,
  "order": {
    "created_at": "",
    "updated_at": "",
    "member_id": 221930,
    "member_order_count": 5,
    "publicId": "4asdff-23ewea-232",
    "id": 234123,
    "total": {
      "discount_code": "TENOFF",
      "discount": "10",
      "grand_total": 100,
      "sub_total": 10
    }
  }
}
//COUPON_RULE_SUMMARY
{
  "ruleId": 948,
  "last_update_at": "",
  "currentUsage": 100,
  "currentBudget": 1000
}
//COUPON_SUMMARY
{
  "ruleId": 948,
  "couponId": 948,
  "last_update_at": "",
  "currentUsage": 100,
  "currentBudget": 1000
}
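To make the upsert step concrete, here is a minimal sketch with the MongoDB Java driver, keyed on the order id so that a replayed event (e.g. after a DLQ retry) rewrites the same tracker document instead of double-counting. Collection and field names follow the documents above, while the client setup and method signatures are assumptions:

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class CouponUsageTracker {

    private final MongoCollection<Document> tracker =
            MongoClients.create("mongodb://localhost:27017")
                    .getDatabase("coupons")
                    .getCollection("coupon_usage_tracker");

    // Order add event: upsert keyed on the order id, so replaying the same
    // event just rewrites the same document instead of incrementing twice.
    public void onOrderAdd(long orderId, long ruleId, String couponCode, double discount) {
        tracker.updateOne(
                Filters.eq("order.id", orderId),
                Updates.combine(
                        Updates.set("ruleId", ruleId),
                        Updates.set("couponCode", couponCode),
                        Updates.set("order.total.discount", discount),
                        Updates.set("couponUsed", "APPLIED")),
                new UpdateOptions().upsert(true));
    }

    // Order cancel event: same upsert, flag flipped to NOT_APPLIED.
    public void onOrderCancel(long orderId) {
        tracker.updateOne(
                Filters.eq("order.id", orderId),
                Updates.set("couponUsed", "NOT_APPLIED"),
                new UpdateOptions().upsert(true));
    }
}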
Task at hand: consider the following model for a JSON response; every API response will conform to this shape, though obviously the data will vary. Let's call it ResponseModel:
{
  "isErroneous": false,
  "Message": "",
  "Result": null,
  "Data": {
    "objId": 38,
    "objName": "StackO",
    "objDescription": "StackODesc",
    "objOtherId": 0,
    "objLocationId": 1
  }
}
How can I deserialize this response regardless of what is in Data? Data could contain a single object of various types, e.g. a Dog or a Car. It could also contain a collection of Cars or Dogs, i.e. not just one like above.
It could also contain a Car, the car's engine object, and the car's driver seat object.
In short, the same response envelope is always present but the value of Data can vary wildly. I want to deserialize this to some sort of Result.class for ALL possible scenarios; how do I best approach this? Setting up the ResponseModel.class is easy for everything except the Data field.
Thanks
Here is another example of something which could be returned
{
  "isErroneous": false,
  "Message": "",
  "Result": null,
  "Data": [
    {
      "carId": 1,
      "carName": "car#1",
      "carDescription": "car#1",
      "carOtherId": 1
    },
    {
      "carId": 2,
      "carName": "car#2",
      "carDescription": "car#2",
      "carOtherId": 2
    },
    {
      "carId": 3,
      "carName": "car#3",
      "carDescription": "car#3",
      "carOtherId": 3
    }
  ]
}
As you can see, in the second example we are returning a list of cars, but in the first response we are returning just a single object. Am I trying to abstract this too much? Should I set up custom response(s) for each call of the API, etc.?
I need to write integration tests asserting that the deserialized object is equal to the one I expect every time.
You can try to check whether your JSON input contains "Data": { or "Data": [ and then parse it into the appropriate class accordingly.
But... maybe it's not the best way to do it
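Along those lines, a minimal Jackson sketch that reads the envelope as a tree, checks whether Data is an object or an array, and converts it either way into a list of Cars; the Car POJO here is only an assumption based on your second sample:

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.Collections;
import java.util.List;

public class ResponseParser {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Reads the envelope as a tree first, then converts "Data" depending on
    // whether it is a JSON object or a JSON array.
    public static List<Car> readCars(String json) throws Exception {
        JsonNode root = MAPPER.readTree(json);
        JsonNode data = root.get("Data");

        if (data == null || data.isNull()) {
            return Collections.emptyList();
        }
        if (data.isArray()) {
            return MAPPER.convertValue(data, new TypeReference<List<Car>>() {});
        }
        // single object -> wrap it in a one-element list so callers see one shape
        return Collections.singletonList(MAPPER.convertValue(data, Car.class));
    }

    // Minimal POJO matching the second sample; adjust to your real model.
    public static class Car {
        public int carId;
        public String carName;
        public String carDescription;
        public int carOtherId;
    }
}

If the payload type differs per call, the next step would be a small generic wrapper (e.g. building a JavaType via Jackson's TypeFactory), but the tree check above already handles the single-object-vs-array ambiguity.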
I have been working with Elasticsearch for quite some time now, and I have been facing a problem recently.
I want to group by a particular column in an Elasticsearch index. The values for that particular column have hyphens and other special characters.
SearchResponse res1 = client.prepareSearch("my_index")
        .setTypes("data")
        .setSearchType(SearchType.QUERY_AND_FETCH)
        .setQuery(QueryBuilders.rangeQuery("timestamp").gte(from).lte(to))
        .addAggregation(AggregationBuilders.terms("cat_agg").field("category").size(10))
        .setSize(0)
        .execute()
        .actionGet();

Terms termAgg = res1.getAggregations().get("cat_agg");
for (Bucket item : termAgg.getBuckets()) {
    cat_number = item.getKey();
    System.out.println(cat_number + " " + item.getDocCount());
}
This is the query I have written in order to get the data grouped by the "category" column in "my_index".
The output I expected after running the code is:
category-1 10
category-2 9
category-3 7
But the output I am getting is:
category 10
1 10
category 9
2 9
category 7
3 7
I have already gone through some questions like this one, but couldn't solve my issue with those answers.
That's because your category field has the default string mapping and is analyzed, hence category-1 gets tokenized into two tokens, namely category and 1, which explains the results you're getting.
In order to prevent this, you can update your mapping to include a sub-field category.raw which is going to be not_analyzed with the following command:
curl -XPUT localhost:9200/my_index/data/_mapping -d '{
  "properties": {
    "category": {
      "type": "string",
      "fields": {
        "raw": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}'
After that, you need to re-index your data; your aggregation will then work and return what you expect.
Just make sure to change the following line in your Java code:
.addAggregation(AggregationBuilders.terms("cat_agg").field("category.raw").size(10))
^
|
add .raw here
When you index "category-1" you will get (by default) two terms, "category" and "1". Therefore when you aggregate you will get back two results for that.
If you want it to be considered a single term then you need to change the analyzer used on that field when indexing. Set it to use the keyword analyzer.
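If you prefer the keyword-analyzer route over the raw sub-field above, here is a rough sketch of pushing that mapping from the 1.x Java admin client. The index/type names match your snippet and everything else is an assumption; note that an existing analyzed field usually can't be changed in place, so this is normally applied to a fresh index before reindexing:

import org.elasticsearch.client.Client;

public class MappingUpdater {

    // Maps "category" with the keyword analyzer so "category-1" stays one term.
    public static void applyKeywordMapping(Client client) {
        String mapping = "{"
                + "  \"data\": {"
                + "    \"properties\": {"
                + "      \"category\": { \"type\": \"string\", \"analyzer\": \"keyword\" }"
                + "    }"
                + "  }"
                + "}";

        client.admin().indices()
                .preparePutMapping("my_index")
                .setType("data")
                .setSource(mapping)
                .execute().actionGet();
    }
}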
I am trying to come up with an optimized architecture to store event logging messages on Elasticsearch.
Here are my specs/needs:
Messages are read-only; once entered, they are only queried for reporting.
No free text search. User will use only filters for reporting.
Must be able to do timestamp range queries.
Mainly need to filter by agent and customer interactions (in addition to other fields).
Customers and agents belong to the same location.
So the most frequently executed query will be: get all LogItems given client_id, customer_id, and timestamp range.
Here is what a LogItem looks like:
"_source": {
"agent_id" : 14,
"location_id" : 2,
"customer_id" : 5289,
"timestamp" : 1320366520000, //Java Long millis since epoch
"event_type" : 7,
"screen_id" : 12
}
I need help indexing my data.
I have been reading what is an elasticsearch index? and using elasticsearch to serve events for customers to get an idea of a good indexing architecture, but I need assistance from the pros.
So here are my questions:
The article suggests creating "one index per day". How would I do range queries with that architecture? (e.g. is it possible to query over a range of indices?)
Currently I'm using one big index. If I create one index per location_id, how do I use shards for further organization of my records?
Given the specs above, is there a better architecture you can suggest?
What fields should I filter with vs query with?
EDIT: Here's a sample query run from my app:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "agent_id": 6
          }
        },
        {
          "range": {
            "timestamp": {
              "from": 1380610800000,
              "to": 1381301940000,
              "include_lower": true,
              "include_upper": true
            }
          }
        },
        {
          "terms": {
            "event_type": [4, 7, 11]
          }
        }
      ]
    }
  },
  "filter": {
    "term": {
      "customer_id": 56241
    }
  }
}
You can definitely search on multiple indices. You can use wildcards or a comma-separated list of indices, for instance, but keep in mind that index names are strings, not dates.
Shards are not for organizing your data but for distributing it and eventually scaling out. How you do that is driven by your data and what you do with it. Have a look at this talk: http://vimeo.com/44716955 .
Regarding your question about filters vs queries, have a look at this other question.
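To tie the first point to your sample query, here is a rough sketch of that same bool query run over a wildcard of daily indices with the Java API of that era (filtered query plus term filter); the index naming scheme and the client setup are assumptions:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.QueryBuilders;

public class LogItemSearch {

    // Same bool/range/terms query as above, but run over a wildcard of
    // daily indices, e.g. logs-2013.10.* (naming scheme is an assumption).
    public static SearchResponse search(Client client, long from, long to) {
        return client.prepareSearch("logs-2013.10.*")
                .setQuery(QueryBuilders.filteredQuery(
                        QueryBuilders.boolQuery()
                                .must(QueryBuilders.termQuery("agent_id", 6))
                                .must(QueryBuilders.rangeQuery("timestamp").gte(from).lte(to))
                                .must(QueryBuilders.termsQuery("event_type", 4, 7, 11)),
                        FilterBuilders.termFilter("customer_id", 56241)))
                .setSize(100)
                .execute().actionGet();
    }
}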
Take a good look at logstash (and kibana). They are all about solving this problem. If you decide to roll your own architecture for this, you might copy some of their design.