Retrieve data from Elasticsearch using aggregations where the values contain hyphens - java

I have been working with Elasticsearch for quite some time now, and I have recently run into a problem.
I want to group by a particular column in an Elasticsearch index. The values in that column contain hyphens and other special characters.
SearchResponse res1 = client.prepareSearch("my_index")
        .setTypes("data")
        .setSearchType(SearchType.QUERY_AND_FETCH)
        .setQuery(QueryBuilders.rangeQuery("timestamp").gte(from).lte(to))
        .addAggregation(AggregationBuilders.terms("cat_agg").field("category").size(10))
        .setSize(0)
        .execute()
        .actionGet();

Terms termAgg = res1.getAggregations().get("cat_agg");
for (Bucket item : termAgg.getBuckets()) {
    cat_number = item.getKey();
    System.out.println(cat_number + " " + item.getDocCount());
}
This is the query I have written in order to get the data grouped by the "category" column in "my_index".
The output I expected after running the code is:
category-1 10
category-2 9
category-3 7
But the output I am getting is:
category 10
1 10
category 9
2 9
category 7
3 7
I have already gone through some questions like this one, but couldn't solve my issue with those answers.

That's because your category field has the default string mapping and is analyzed, so category-1 gets tokenized into two tokens, namely category and 1, which explains the results you're getting.
In order to prevent this, you can update your mapping to include a not_analyzed sub-field category.raw with the following command:
curl -XPUT localhost:9200/my_index/data/_mapping -d '{
  "properties": {
    "category": {
      "type": "string",
      "fields": {
        "raw": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}'
After that, you need to re-index your data, and your aggregation will work and return what you expect.
Just make sure to change the following line in your Java code:
.addAggregation(AggregationBuilders.terms("cat_agg").field("category.raw").size(10))   // note the addition of .raw

When you index "category-1" you will get (by default) two terms, "category" and "1". Therefore, when you aggregate, you will get back two results for that.
If you want it to be considered a single term, you need to change the analyzer used on that field when indexing. Set it to use the keyword analyzer.
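For completeness, a rough sketch of how the keyword-analyzer route might be set up from the same 1.x Java admin client before re-indexing; the new index name my_index_v2 is hypothetical and this is only an illustration, not the answer's exact command:

// Hypothetical: create a fresh index whose "category" field uses the keyword
// analyzer so that "category-1" stays a single term. Existing documents still
// have to be re-indexed into it.
String mapping = "{\"data\": {\"properties\": {"
        + "\"category\": {\"type\": \"string\", \"analyzer\": \"keyword\"}}}}";

client.admin().indices()
        .prepareCreate("my_index_v2")      // "my_index_v2" is a placeholder name
        .addMapping("data", mapping)
        .execute()
        .actionGet();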

Related

ElasticSearch / Java - Dynamic Templates aggregation with null values included

I'm having difficulties with aggregations over dynamic templates. I have values stored like this:
[
  {
    "country": "CZ",
    "countryName": {
      "en": "Czech Republic",
      "es": "Republica checa",
      "de": "Tschechische Republik"
    },
    "ownerName": "..."
  },
  {
    "ownerName": "..."
  }
]
The country field is a classic keyword; the mapping for countryName is defined as a dynamic template, because I want to extend it with other languages when I need to.
{
  "dynamic_templates": [
    {
      "countryName_lsi_object_template": {
        "path_match": "countryName.*",
        "mapping": {
          "type": "keyword"
        }
      }
    }
  ]
}
countryName and country are not mandatory parameters: when a document is not assigned to any country, it doesn't have a countryName either. However, I need to do a sorted aggregation over the country names according to a chosen key, and I also need to include buckets with null countries. Is there any way to do that?
Previously, I used TermsValuesSourceBuilder with an order on the "country" field, but I need the data sorted according to a specific language and name, and that can't be done over country codes.
(I'm using Elasticsearch 7.7.1 and Java 8, and recreating the index / changing the data structure is not an option.)
I tried to use the missing bucket option, but the response does not include buckets with "countryName" missing at all.
new TermsValuesSourceBuilder("countryName").field("countryName.en").missingBucket(true);
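For context, a minimal sketch of how that builder might be wrapped in a composite aggregation with the 7.x high-level REST client; the client variable, index name, and aggregation name below are assumptions, not taken from the question:

import java.util.ArrayList;
import java.util.List;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.composite.CompositeAggregationBuilder;
import org.elasticsearch.search.aggregations.bucket.composite.CompositeValuesSourceBuilder;
import org.elasticsearch.search.aggregations.bucket.composite.TermsValuesSourceBuilder;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// assumption: "client" is a RestHighLevelClient and "my_index" holds the documents
List<CompositeValuesSourceBuilder<?>> sources = new ArrayList<>();
sources.add(new TermsValuesSourceBuilder("countryName")
        .field("countryName.en")
        .missingBucket(true));   // request a bucket even when countryName.en is absent

CompositeAggregationBuilder composite = AggregationBuilders.composite("by_country_name", sources);

SearchResponse response = client.search(
        new SearchRequest("my_index")
                .source(new SearchSourceBuilder().size(0).aggregation(composite)),
        RequestOptions.DEFAULT);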

How to access key value in array of objects - java

I'm getting a result from Cloudant DB, and the response type is a Document object.
This is my query:
FindResult queryResult = cloudantConfig.clientBuilder()
        .postFind(findOptions)
        .execute()
        .getResult();
This is my result from cloudant db:
{
  "bookmark": "Tq2MT8lPzkzJBYqLOZaWZOQXZVYllmTm58UHpSamxLukloFUc8BU41GXBQAtfh51",
  "docs": [
    {
      "sports": [
        {
          "name": "CRICKET",
          "player_access": [
            "All"
          ]
        }
      ]
    }
  ]
}
I'd like to access 'name' and 'player_access', but I can only go as far as 'sports'; I can't get to 'name' or 'player_access'. This is how I attempted to obtain 'name':
queryResult.getDocs().get(0).get("sports").get(0).get("name");
With the above I'm getting an error like this: The method get(int) is undefined for the type Object
I'm receiving the values when I only go as far as 'sports'.
This is how I obtain sports:
queryResult.getDocs().get(0).get("sports");
When I sysout the aforementioned sports object, I get the result below.
[{name=CRICKET, player_access=[All]}]
So, how do I gain access to 'name' and 'player_access' here? Can somebody help me with this?
I've dealt with JSON values recently, but ended up just using regex and splitting/matching from there.
You can regex everything from the "name" (up to, but not including, the last comma) and do the same for player_access.
Be aware that this is just a workaround and not the best option, but JSON objects in Java can sometimes be tricky.
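If regex feels too fragile, a cast-based sketch may work instead, assuming the SDK deserializes the nested array into generic List/Map structures at runtime (which the printed output [{name=CRICKET, player_access=[All]}] suggests); the casts below are an assumption, not something taken from the Cloudant docs:

import java.util.List;
import java.util.Map;

// assumption: "sports" comes back as a List of Maps at runtime
@SuppressWarnings("unchecked")
List<Map<String, Object>> sports =
        (List<Map<String, Object>>) queryResult.getDocs().get(0).get("sports");

String name = (String) sports.get(0).get("name");                              // "CRICKET"
List<String> playerAccess = (List<String>) sports.get(0).get("player_access"); // ["All"]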

How to get value with an underscore inside a string from Elasticsearch using QueryBuilder in Java?

I'm using Elasticsearch 3.2.7 and ElasticsearchRepository.search() which takes QueryBuilder as an argument (doc)
I have a BoolQueryBuilder and use it like this:
boolQuery.must(termQuery("myObject.code", value));
var results = searchRepository.search(boolQuery);
The definition of the field code is as follows:
"myObject": {
"properties": {
"code": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
The issue is, when I search with a value that has an underscore inside, for example FOO_BAR, it doesn't return any results. When I search with other values that have either a leading or trailing underscore, it's fine.
I've read that ES may ignore the special character and split the words on it, so an exact-match search is needed. But I also read that the keyword setting guarantees exactly that, so right now I'm confused.
Yes, you are correct: using the keyword field you can achieve an exact match. You need to use the query below:
boolQuery.must(termQuery("myObject.code.keyword", value)); // <-- note the addition of .keyword
var results = searchRepository.search(boolQuery);
You can use the analyze API to see the tokens for your indexed documents and for your search term; basically, the tokens in the index must match the search term's tokens for ES to return a match :)
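For checking the tokens, a small sketch with the high-level REST client's analyze call, assuming a RestHighLevelClient named client and an index named my_index are available (neither appears in the question, so treat this as an illustration only):

import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.indices.AnalyzeRequest;
import org.elasticsearch.client.indices.AnalyzeResponse;

// assumption: "client" is a RestHighLevelClient and "my_index" holds the documents
AnalyzeRequest request = AnalyzeRequest.withField("my_index", "myObject.code", "FOO_BAR");
AnalyzeResponse response = client.indices().analyze(request, RequestOptions.DEFAULT);
// prints the tokens the field's analyzer produces, e.g. whether FOO_BAR gets split
response.getTokens().forEach(token -> System.out.println(token.getTerm()));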

Sort the search result in ascending order of a multivalued field in Solr

I'm using Solr version 6.6.0. I have a schema with title (text_general), description (text_general), and id (integer). When I search for a keyword and list the results in ascending order of the title, my code returns the error can not sort on multivalued field: title.
I have tried to set the sort using the following 3 methods:
SolrQuery query = new SolrQuery();
1. query.setSort("title", SolrQuery.ORDER.asc);
2. query.addSort("title", SolrQuery.ORDER.asc);
3. SortClause ab = new SolrQuery.SortClause("title", SolrQuery.ORDER.asc);
   query.addSort(ab);
but all of these return the same error.
I found a solution by referring to this answer
It says to use min/max functions.
query.setSort(field("pageTitle",min), ORDER.asc);
This is what I'm trying to set as the query, but I don't understand the arguments used here.
This is the maven dependency that I'm using
<dependency>
<groupId>org.apache.solr</groupId>
<artifactId>solr-solrj</artifactId>
<version>6.5.1</version>
</dependency>
Unless title actually is multiValued (can your post have multiple titles?), you should define it as multiValued="false" in your schema. However, there's a second issue: a field of the default type text_general isn't suited for sorting, as it'll generate multiple tokens, one for each word in the title. This is useful for searching, but will give weird and non-intuitive results when sorting.
So instead, define a title_sort field and use a field type with a KeywordTokenizer and LowerCaseFilter attached (if you want case insensitive sort), or if you want case sensitive sort, use the already defined string field type for the title_sort field.
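Once such a title_sort copy exists (the field name here is just an example, not something from the question's schema), applying the sort from SolrJ is a one-liner:

import org.apache.solr.client.solrj.SolrQuery;

SolrQuery query = new SolrQuery("your keyword");
// "title_sort" is the hypothetical single-valued, keyword-tokenized copy of title
query.addSort("title_sort", SolrQuery.ORDER.asc);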
The first thing to check is: do you really need that title field to be multivalued, i.e. do your documents really have multiple titles? If not, you just need to fix the field definition by setting multiValued="false".
That said, sorting on a multivalued field doesn't make sense unless you determine which one of these multiple values should be used to sort on, or how to combine them into one.
Let's say we need to sort a given result set by title (alphabetically), first using a single-valued title field:
# Unsorted
"docs": [
{ "id": "1", "title": "One" },
{ "id": "2", "title": "Two" },
{ "id": "3", "title": "Three" },
]
# Sorted
"docs": [
{ "id": "1", "title": "One" },
{ "id": "3", "title": "Three" },
{ "id": "2", "title": "Two" },
]
# -> ok no problem here
Now, applying the same logic with a multi-valued field is not possible as is; you would necessarily need to determine which title to use in each document to sort them properly:
# Unsorted
"docs": [
{ "id": "1", "title": ["One", "z-One", "a-One"] },
{ "id": "2", "title": ["Two", "z-Two", "a-Two"] },
{ "id": "3", "title": ["Three", "z-Three", "a-Three"] }
]
Fortunately, Solr allows sorting results by the output of a function, meaning you can use any of Solr's function queries to "get" a single value per title field. The answer you referred to is a good example, even though it may not work for you (because title would need docValues enabled, which depends on the field definition, and knowing that max/min functions should be used only with numeric values); just to get the idea:
# here the 2nd argument refers to max(), used precisely to get a single value from title
sort=field(title,max) asc
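Translated to SolrJ, a hedged sketch of that function-based sort could look like the following; it assumes title has docValues enabled, which, as noted above, may not hold for your field definition:

import org.apache.solr.client.solrj.SolrQuery;

SolrQuery query = new SolrQuery("your keyword");
// the whole function expression is passed as the sort item string
query.addSort(SolrQuery.SortClause.asc("field(title,max)"));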

How to get index ids from an index in elasticsearch

I am having trouble getting the index ids from an index using the java api for elasticsearch.
When indexing a document I can get its id from the IndexResponse object. When indexing I do not specify the id, so I let Elasticsearch handle it. How can I get a listing of the ids for a specific index?
I would then iterate through the ids to submit other requests (e.g. GET, DELETE).
I am using the java api and not spring-data. The version is 1.7 for those interested.
Retrieving all of the IDs from an index is generally a terrible idea, and it only gets worse the larger your index is. If you really need it, consider using a scroll query to achieve what you want.
https://www.elastic.co/guide/en/elasticsearch/guide/master/scroll.html#CO33-1
The guide is written for Elasticsearch 2.x, but it works for Elasticsearch 5.x if you're using that.
Essentially how it works is this:
Create a scroll window of size x and return the first x results (1000 in the example below) without the overhead of scoring, analysis, etc. The resources are allocated by Elasticsearch for a time of y. The first response returns not only the first x documents, but also a _scroll_id that can be used to fetch the next x documents.
GET http://yourhost:9200/old_index/_search?scroll=1m
{
  "query": { "match_all": {} },
  "sort": ["_doc"],
  "size": 1000
}
Say the response to the above query is something like...
{
  "_scroll_id": "abcdefghijklmnopqrstuvwxyz",
  "took": 15,
  "timed_out": false,
  "terminated_early": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1027,
    "max_score": null,
    "hits": [
      {
      ...
You would then use the _scroll_id like so to fetch the next x results.
GET http://yourhost:9200/_search/scroll
{
  "scroll": "1m",
  "scroll_id": "abcdefghijklmnopqrstuvwxyz"
}
It returns a response similar to the one above. Make sure you take the _scroll_id from each response and use it in the next request. From all of these responses, you can iterate through the hits and pull out the IDs.
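Since the question is about the 1.7 Java API rather than raw HTTP, here is a hedged sketch of the same scan/scroll loop with the transport client; the index name and sizes are placeholders, and "client" is assumed to be the 1.x Client from the question:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

SearchResponse scroll = client.prepareSearch("my_index")       // placeholder index name
        .setSearchType(SearchType.SCAN)                        // 1.x: skip scoring while scrolling
        .setScroll(new TimeValue(60000))                       // keep the scroll context alive for 1m
        .setQuery(QueryBuilders.matchAllQuery())
        .setSize(1000)                                         // hits per shard per round trip
        .execute().actionGet();

while (true) {
    scroll = client.prepareSearchScroll(scroll.getScrollId())
            .setScroll(new TimeValue(60000))
            .execute().actionGet();
    if (scroll.getHits().getHits().length == 0) {
        break;                                                 // no more documents in the scroll
    }
    for (SearchHit hit : scroll.getHits().getHits()) {
        System.out.println(hit.getId());                       // the document _id
    }
}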
