Search match text in Elasticsearch SpringBoot by using percentage - java

I'm new to Elasticsearch with Spring Boot. I don't know how to search for matching text in Elasticsearch with Spring Boot by using a percentage. For example, I have the text "Hello world". Can I set a percentage of 50% or 70% to match against my text? I already tried the minimumShouldMatch property, but it doesn't seem to work for my case right now.
Can anyone help me, please? Thanks.

You could use a bool query with should clauses: split your search phrase into terms and set minimum_should_match according to your percentage.
Example query
{
  "query": {
    "bool": {
      "should": [
        { "term": { "my_field": "hello" } },
        { "term": { "my_field": "world" } },
        { "term": { "my_field": "i'm" } },
        { "term": { "my_field": "alive" } }
      ],
      "minimum_should_match": 2
    }
  }
}
This will match hello world, hello alive, etc.
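As a side note, minimum_should_match also accepts percentage strings such as "50%" or "70%" directly, and Elasticsearch rounds a positive percentage down to a whole number of clauses. A minimal Java sketch of that computation (my own helpers, using a naive whitespace tokenizer, which may differ from your index analyzer):

```java
import java.util.Arrays;
import java.util.List;

public class MinShouldMatch {

    // Naive whitespace tokenizer -- the real index analyzer may
    // produce different tokens (see the _analyze call below).
    static List<String> terms(String phrase) {
        return Arrays.asList(phrase.toLowerCase().split("\\s+"));
    }

    // Number of should clauses that must match for a given percentage;
    // integer division mirrors Elasticsearch rounding the value down.
    static int minShouldMatch(int clauseCount, int percent) {
        return clauseCount * percent / 100;
    }

    public static void main(String[] args) {
        List<String> t = terms("hello world i'm alive");
        System.out.println(t.size());                     // 4 clauses
        System.out.println(minShouldMatch(t.size(), 50)); // 2
        System.out.println(minShouldMatch(t.size(), 70)); // 2 (2.8 rounded down)
    }
}
```

So with 4 should clauses, both 50% and 70% end up requiring 2 matching terms, which may be why a percentage felt like it "doesn't work" for short texts.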
To split a text into terms, you should use the _analyze API of your index.
Analyze and split terms
POST myindex/_analyze
{
  "field": "my_field",
  "text": "hello world i'm alive"
}
This gives you a result like the following to populate your query, and ensures the terms match the field's analyzer, for example if you use a custom analyzer:
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "world",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "i'm",
"start_offset" : 12,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "alive",
"start_offset" : 16,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 3
}
]
}

Related

Opensearch - get inner aggregations from aggregations using opensearch-java client

There is this OpenSearch query constructed using opensearch-java:
GET eventsearch/_search
{
  "aggregations": {
    "WEB": {
      "aggregations": {
        "eventDate": {
          "date_histogram": {
            "extended_bounds": {
              "max": "2022-12-01T00:00:00Z",
              "min": "2022-01-01T00:00:00Z"
            },
            "field": "eventDate",
            "fixed_interval": "1d",
            "min_doc_count": 0
          }
        }
      },
      "filter": {
        "term": {
          "channel": {
            "value": "WEB",
            "case_insensitive": true
          }
        }
      }
    }
  },
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "eventDate": {
              "from": "2022-01-01T00:00:00Z",
              "to": "2022-12-01T00:00:00Z"
            }
          }
        }
      ],
      "must": [
        { "match_all": {} }
      ]
    }
  },
  "size": 0
}
Running the query, the response is this:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 26,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"WEB" : {
"doc_count" : 25,
"eventDate" : {
"buckets" : [
{
"key_as_string" : "2022-01-01T00:00:00.000Z",
"key" : 1640995200000,
"doc_count" : 0
},
{
"key_as_string" : "2022-01-02T00:00:00.000Z",
"key" : 1641081600000,
"doc_count" : 0
},
{
"key_as_string" : "2022-01-03T00:00:00.000Z",
"key" : 1641168000000,
"doc_count" : 0
},
{
"key_as_string" : "2022-01-04T00:00:00.000Z",
"key" : 1641254400000,
"doc_count" : 0
},
....................
]
}
}
}
}
In Java I need to perform this query and get the results from it.
But after calling opensearchClient.search and then the aggregations() method, I receive the result shown in the attached image.
If I try to get "WEB" from the Map, there is no inner "eventDate" aggregation to fetch.
Is there a way to fetch this inner aggregation using the opensearch-java client? I had no luck with the documentation.
opensearch-java 2.1.0
There is currently no support for this; there is an open bug with merged code, but it has not been released yet.
https://github.com/opensearch-project/opensearch-java/issues/197

ElasticSearch - Searching partial text in String

What is the best way to use ElasticSearch to search exact partial text in String?
In SQL the method would be:
%PARTIAL TEXT%,
%ARTIAL TEX%
In Elasticsearch, the current method being used is:
{
  "query": {
    "match_phrase_prefix": {
      "name": "PARTIAL TEXT"
    }
  }
}
However, it breaks whenever you remove the first and last characters of the string, as shown below (no results found):
{
  "query": {
    "match_phrase_prefix": {
      "name": "ARTIAL TEX"
    }
  }
}
I believe there will be numerous suggestions, such as using an ngram analyzer, on how you can solve this problem. I believe the simplest would be to use fuzziness.
{
  "query": {
    "match": {
      "name": {
        "query": "artial tex",
        "operator": "and",
        "fuzziness": 1
      }
    }
  }
}
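For context on why this works: fuzziness: 1 allows up to one edit (insertion, deletion, or substitution) per term, i.e. a Levenshtein distance of at most 1. A small illustrative sketch of that distance (my own helper, not part of any Elasticsearch client):

```java
public class Fuzziness {

    // Classic dynamic-programming Levenshtein distance: the minimum
    // number of single-character edits turning a into b.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + sub);     // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("artial", "partial")); // 1
        System.out.println(distance("tex", "text"));       // 1
    }
}
```

Each query term ("artial", "tex") is one edit away from the indexed term, which is why the fuzzy match above finds the document even with both the first and last characters removed.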
There are multiple ways to do partial search and each comes with its own tradeoffs.
1. Wildcard
For wildcard, perform the search on the "keyword" field instead of the "text" field.
{
  "query": {
    "wildcard": {
      "name.keyword": "*artial tex*"
    }
  }
}
Wildcards have poor performance; there are better alternatives.
2. Match/Match_phrase/Match_phrase_prefix
If you are searching for whole tokens like "PARTIAL TEXT", you can simply use a match query; all documents containing the tokens "PARTIAL" and "TEXT" will be returned.
If the order of tokens matters, you can use match_phrase.
If you want to search for partial tokens, use match_phrase_prefix. The prefix match is only done on the last token in the search input, e.g. "partial tex".
This is not suitable for your use case, since you want to search anywhere.
3. N-grams
The ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word of the specified length.
N-grams are like a sliding window that moves across the word - a
continuous sequence of characters of the specified length. They are
useful for querying languages that don’t use spaces or that have long
compound words, like German.
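The sliding window described above can be sketched in a few lines of Java (illustrative only; the real tokenizer also honors token_chars and emits offsets and positions):

```java
import java.util.ArrayList;
import java.util.List;

public class NgramDemo {

    // Emit every substring of length minGram..maxGram, sliding the
    // window across the text -- the same tokens the ngram tokenizer
    // produces, ordered by start offset then gram length.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            for (int len = minGram; len <= maxGram && i + len <= text.length(); len++) {
                out.add(text.substring(i, i + len));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 21 tokens: Parti, Partia, Partial, artia, artial, "artial ", ...
        System.out.println(ngrams("Partial text", 5, 7));
    }
}
```

With min_gram 5 and max_gram 7, "Partial text" yields the same 21 tokens shown by the _analyze call below.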
Index settings:
{
  "settings": {
    "max_ngram_diff": "5",
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 5,
          "max_gram": 7
        }
      }
    }
  }
}

POST index29/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Partial text"
}
Tokens Generated:
"tokens" : [
{
"token" : "Parti",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "Partia",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 1
},
{
"token" : "Partial",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 2
},
{
"token" : "artia",
"start_offset" : 1,
"end_offset" : 6,
"type" : "word",
"position" : 3
},
{
"token" : "artial",
"start_offset" : 1,
"end_offset" : 7,
"type" : "word",
"position" : 4
},
{
"token" : "artial ",
"start_offset" : 1,
"end_offset" : 8,
"type" : "word",
"position" : 5
},
{
"token" : "rtial",
"start_offset" : 2,
"end_offset" : 7,
"type" : "word",
"position" : 6
},
{
"token" : "rtial ",
"start_offset" : 2,
"end_offset" : 8,
"type" : "word",
"position" : 7
},
{
"token" : "rtial t",
"start_offset" : 2,
"end_offset" : 9,
"type" : "word",
"position" : 8
},
{
"token" : "tial ",
"start_offset" : 3,
"end_offset" : 8,
"type" : "word",
"position" : 9
},
{
"token" : "tial t",
"start_offset" : 3,
"end_offset" : 9,
"type" : "word",
"position" : 10
},
{
"token" : "tial te",
"start_offset" : 3,
"end_offset" : 10,
"type" : "word",
"position" : 11
},
{
"token" : "ial t",
"start_offset" : 4,
"end_offset" : 9,
"type" : "word",
"position" : 12
},
{
"token" : "ial te",
"start_offset" : 4,
"end_offset" : 10,
"type" : "word",
"position" : 13
},
{
"token" : "ial tex",
"start_offset" : 4,
"end_offset" : 11,
"type" : "word",
"position" : 14
},
{
"token" : "al te",
"start_offset" : 5,
"end_offset" : 10,
"type" : "word",
"position" : 15
},
{
"token" : "al tex",
"start_offset" : 5,
"end_offset" : 11,
"type" : "word",
"position" : 16
},
{
"token" : "al text",
"start_offset" : 5,
"end_offset" : 12,
"type" : "word",
"position" : 17
},
{
"token" : "l tex",
"start_offset" : 6,
"end_offset" : 11,
"type" : "word",
"position" : 18
},
{
"token" : "l text",
"start_offset" : 6,
"end_offset" : 12,
"type" : "word",
"position" : 19
},
{
"token" : " text",
"start_offset" : 7,
"end_offset" : 12,
"type" : "word",
"position" : 20
}
]
You can search on any of the tokens generated. You can also use "token_chars": ["letter", "digit"] to generate tokens that exclude spaces.
Your choice among the options above will depend on your data size and performance requirements. Wildcard is more flexible, but matching is done at run time, so performance is slow; if your data size is small, this is the ideal solution.
With ngrams, tokens are generated at index time. This takes more memory, but search is faster, so for a large data size this is the ideal solution.

Elasticsearch composite group by queries across the documents

We have an Elasticsearch document which has a dimension called city. Each document has only one value for the city field. I have a scenario where I need to query persons based on the city or cities.
Documents in Elasticsearch
{
  "person_id": "1",
  "property_value": 25000,
  "city": "Bangalore"
}
{
  "person_id": "2",
  "property_value": 100000,
  "city": "Bangalore"
}
{
  "person_id": "1",
  "property_value": 15000,
  "city": "Delhi"
}
Note: The aggregation should be performed on property_value, grouped by person_id.
For example:
If I query for Bangalore, it should return the documents with person_id 1 and 2.
If I query for both Delhi and Bangalore, it should return this:
{
  "person_id": "1",
  "property_value": 40000,
  "city": ["Bangalore", "Delhi"]
}
Looking at your data, I've come up with a sample mapping, request query and the response.
Mapping:
PUT my_index_city
{
  "mappings": {
    "properties": {
      "person_id": {
        "type": "keyword"
      },
      "city": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "property_value": {
        "type": "long"
      }
    }
  }
}
Sample Request:
Note that I've made use of a query_string query to filter the documents having Bangalore or Delhi.
For the aggregation, I've made use of a terms aggregation on person_id and a sum aggregation on the property_value field.
POST my_index_city/_search
{
  "size": 0,
  "query": {
    "query_string": {
      "default_field": "city",
      "query": "Bangalore Delhi"
    }
  },
  "aggs": {
    "my_person": {
      "terms": {
        "field": "person_id",
        "size": 10,
        "min_doc_count": 2
      },
      "aggs": {
        "sum_property_value": {
          "sum": {
            "field": "property_value"
          }
        }
      }
    }
  }
}
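The min_doc_count: 2 in the request above is what restricts the buckets to persons that matched at least two documents (one per city here). The same grouping logic in plain Java, using the sample documents from the question (a hypothetical helper, not any Elasticsearch client API):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupByPerson {

    // Each doc is {person_id, property_value, city}. Sum property_value
    // per person, then drop persons with fewer than minDocCount docs --
    // the same effect as the terms aggregation with min_doc_count.
    static Map<String, Long> sums(List<Object[]> docs, int minDocCount) {
        Map<String, Long> sum = new LinkedHashMap<>();
        Map<String, Integer> count = new LinkedHashMap<>();
        for (Object[] d : docs) {
            String person = (String) d[0];
            sum.merge(person, ((Number) d[1]).longValue(), Long::sum);
            count.merge(person, 1, Integer::sum);
        }
        sum.keySet().removeIf(p -> count.get(p) < minDocCount);
        return sum;
    }

    public static void main(String[] args) {
        List<Object[]> docs = List.of(
                new Object[]{"1", 25000, "Bangalore"},
                new Object[]{"2", 100000, "Bangalore"},
                new Object[]{"1", 15000, "Delhi"});
        System.out.println(sums(docs, 2)); // {1=40000}
    }
}
```

Person 2 matched only one document, so only person 1 (with the summed 40000) survives, matching the sample response below.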
Sample Response:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_person" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1",
"doc_count" : 2,
"sum_property_value" : {
"value" : 40000.0
}
}
]
}
}
}
Note: This query only works if each person_id has multiple documents, each with a unique/different city value.
In other words, if a person_id has multiple documents with the same city, the aggregation would not give the right answer.
Updated Answer:
There is no direct way to achieve what you are looking for unless you modify the mapping. What I've done is make use of the nested datatype and ingest all the documents for a person_id as a single document.
Mapping:
PUT my_sample_city_index
{
  "mappings": {
    "properties": {
      "person_id": {
        "type": "keyword"
      },
      "property_details": {
        "type": "nested",          <------ Note this
        "properties": {
          "city": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword"
              }
            }
          },
          "property_value": {
            "type": "long"
          }
        }
      }
    }
  }
}
Sample Documents:
POST my_sample_city_index/_doc/1
{
  "person_id": "1",
  "property_details": [
    {
      "property_value": 25000,
      "city": "Bangalore"
    },
    {
      "property_value": 15000,
      "city": "Delhi"
    }
  ]
}

POST my_sample_city_index/_doc/2
{
  "person_id": "2",
  "property_details": [
    {
      "property_value": 100000,
      "city": "Bangalore"
    }
  ]
}
Aggregation Query:
POST my_sample_city_index/_search
{
  "size": 0,
  "query": {
    "nested": {
      "path": "property_details",
      "query": {
        "query_string": {
          "default_field": "property_details.city",
          "query": "bangalore delhi"
        }
      }
    }
  },
  "aggs": {
    "persons": {
      "terms": {
        "field": "person_id",
        "size": 10
      },
      "aggs": {
        "property_sum": {
          "nested": {              <------ Note this
            "path": "property_details"
          },
          "aggs": {
            "total_sum": {
              "sum": {
                "field": "property_details.property_value"
              }
            }
          }
        }
      }
    }
  }
}
Note that I've applied a terms aggregation on person_id, inside which I've applied a nested aggregation, and on top of that a sum metric aggregation.
This should also work correctly if a person has multiple properties in the same city.
Response:
{
"took" : 31,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"persons" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1",
"doc_count" : 1,
"property_sum" : {
"doc_count" : 2,
"total_sum" : {
"value" : 40000.0
}
}
},
{
"key" : "2",
"doc_count" : 1,
"property_sum" : {
"doc_count" : 1,
"total_sum" : {
"value" : 100000.0
}
}
}
]
}
}
}
Let me know if this helps!

Elastic Search Filter Bucket Values

My use case is as follows: I need to find all the unique colors that appeared in the last 1 year but disappeared in the last 3 months. My documents look like this:
{
  "doc_id": 1,
  "color": "red",
  "timestamp": epoch time here
},
{
  "doc_id": 2,
  "color": "blue",
  "timestamp": epoch time here
}
So, for example, if any document with the attribute color (from now on referred to just as color) blue appeared in the last year but didn't appear in the last 3 months, then we need to include blue in the result. On the other hand, if documents with color red appeared in the last year and also appeared in the last 3 months, then we need to exclude red from the result.
The 1 year in the above example also includes the 3 months when computing. So if all the documents with color blue happened only between May 2018 - Feb 2019, meaning documents with blue occurred in the last year but went missing in the last 3 months (March 2019 - May 2019), then blue should be in the result set. On the other hand, if documents with color red happened between May 2018 - Feb 2019 as well as March 2019 - May 2019, then we need to exclude red from the result set. I couldn't get this with a terms query in Elasticsearch.
I have taken a range from "2019-01-01" to "2019-12-30", with the excluded months being "2019-09-01" to "2019-12-30".
Mapping:
{
  "testindex": {
    "mappings": {
      "properties": {
        "color": {
          "type": "keyword"
        },
        "doc_id": {
          "type": "long"
        },
        "timestamp": {
          "type": "date"
        }
      }
    }
  }
}
Data:
"hits" : [
{
"_index" : "testindex",
"_type" : "_doc",
"_id" : "GPv0zWoB8AL5aj8D_wLG",
"_score" : 1.0,
"_source" : {
"doc_id" : 1,
"color" : "blue",
"timestamp" : "2019-03-30"
}
},
{
"_index" : "testindex",
"_type" : "_doc",
"_id" : "Gfv1zWoB8AL5aj8DJAKU",
"_score" : 1.0,
"_source" : {
"doc_id" : 1,
"color" : "red",
"timestamp" : "2019-12-30"
}
},
{
"_index" : "testindex",
"_type" : "_doc",
"_id" : "Gvv1zWoB8AL5aj8DOwKf",
"_score" : 1.0,
"_source" : {
"doc_id" : 1,
"color" : "red",
"timestamp" : "2019-01-01"
}
}
]
}
Final Query:
GET testindex/_search
{
  "size": 0,
  "query": {
    "range": {
      "timestamp": {
        "gte": "2019-01-01",
        "lte": "2019-12-30"
      }
    }
  },
  "aggs": {
    "colors": {
      "terms": {
        "field": "color"
      },
      "aggs": {
        "excluded_range": {
          "date_range": {
            "field": "timestamp",
            "ranges": [
              {
                "from": "2019-09-01",
                "to": "2019-12-31"
              }
            ]
          }
        },
        "excluded_docs_count": {
          "sum_bucket": {
            "buckets_path": "excluded_range>_count"
          }
        },
        "myfinal": {
          "bucket_selector": {
            "buckets_path": {
              "out_of_range_docs": "excluded_docs_count"
            },
            "script": {
              "inline": "params.out_of_range_docs == 0"
            }
          }
        }
      }
    }
  }
}
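The bucket_selector above keeps only the color buckets whose document count in the excluded window is zero. In plain Java terms that is a set difference, which can be sketched like this (a hypothetical helper, using the sample data from the question):

```java
import java.time.LocalDate;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DisappearedColors {

    // Each doc is {color, isoDate}. Return colors seen anywhere in
    // [from, to] minus colors also seen in [excludedFrom, to] -- the
    // same logic the bucket_selector expresses with out_of_range_docs == 0.
    static Set<String> disappeared(List<String[]> docs, LocalDate from,
                                   LocalDate excludedFrom, LocalDate to) {
        Set<String> inRange = new LinkedHashSet<>();
        Set<String> inExcluded = new LinkedHashSet<>();
        for (String[] d : docs) {
            LocalDate t = LocalDate.parse(d[1]);
            if (!t.isBefore(from) && !t.isAfter(to)) inRange.add(d[0]);
            if (!t.isBefore(excludedFrom) && !t.isAfter(to)) inExcluded.add(d[0]);
        }
        inRange.removeAll(inExcluded); // set difference
        return inRange;
    }

    public static void main(String[] args) {
        List<String[]> docs = List.of(
                new String[]{"blue", "2019-03-30"},
                new String[]{"red", "2019-12-30"},
                new String[]{"red", "2019-01-01"});
        System.out.println(disappeared(docs,
                LocalDate.parse("2019-01-01"),
                LocalDate.parse("2019-09-01"),
                LocalDate.parse("2019-12-30"))); // [blue]
    }
}
```

With the sample data, only blue survives: red has a document on 2019-12-30, inside the excluded window, so its bucket is dropped.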

Get documents and their total count in a single query, including pagination

I'm new to Mongo and use the MongoDB aggregation framework for my queries. I need to retrieve some records which satisfy certain conditions (including pagination + sorting) and also get the total count of records.
Currently, I perform the following steps:
Create the $match operator:
{ "$match" : { "year" : "2012", "author.authorName" : { "$regex" : "au", "$options" : "i" } } }
Add sorting and pagination:
{ "$sort" : { "some_field" : -1 } }, { "$limit" : 10 }, { "$skip" : 0 }
After querying I receive the expected result: 10 documents with all fields.
For pagination I need to know the total count of records which satisfy these conditions, in my case 25.
I use the following query to get the count:
{ "$match" : { "year" : "2012", "author.authorName" : { "$regex" : "au", "$options" : "i" } } },
{ "$group" : { "_id" : "$all", "reviewsCount" : { "$sum" : 1 } } },
{ "$sort" : { "some_field" : -1 } }, { "$limit" : 10 }, { "$skip" : 0 }
But I don't want to perform two separate queries: one for retrieving documents and a second for the total count of records which satisfy certain conditions.
I want to do it in one single query and get the result in the following format:
{
  "result": [
    {
      "my_documets": [
        {
          "_id": ObjectId("512f1f47a411dc06281d98c0"),
          "author": {
            "authorName": "author name1",
            "email": "email1#email.com"
          }
        },
        {
          "_id": ObjectId("512f1f47a411dc06281d98c0"),
          "author": {
            "authorName": "author name2",
            "email": "email2#email.com"
          }
        }, .......
      ],
      "total": 25
    }
  ],
  "ok": 1
}
I tried modifying the group operator: { "$group" : { "_id" : "$all", "author" : "$author", "reviewsCount" : { "$sum" : 1 } } }
But in this case I got: "exception: the group aggregate field 'author' must be defined as an expression inside an object". If I add all fields to _id then reviewsCount is always 1, because all records are different.
Does anybody know how this can be implemented in a single query? Maybe MongoDB has some features or operators for this case? An implementation using two separate queries reduces performance when querying thousands or millions of records. In my application this is a very critical performance issue.
I've been working on this all day and haven't been able to find a solution, so I thought I'd turn to the Stack Overflow community.
Thanks.
You can try using $facet in the aggregation pipeline as follows:
db.name.aggregate([
  { $match: { /* your match criteria */ } },
  { $facet: {
      data: [{ $sort: sort }, { $skip: skip }, { $limit: limit }],
      count: [{ $group: { _id: null, count: { $sum: 1 } } }]
  }}
])
In data you'll get your paginated list, and in count you'll get the total count of matched documents.
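The count branch is what drives the pagination arithmetic on the application side; a minimal sketch of that arithmetic (my own helpers, assuming a 1-based page number):

```java
public class Pagination {

    // $skip value for a 1-based page number.
    static int skip(int page, int pageSize) {
        return (page - 1) * pageSize;
    }

    // Page count derived from the total the $facet "count" branch
    // returns; ceiling division so a partial last page still counts.
    static int totalPages(long total, int pageSize) {
        return (int) ((total + pageSize - 1) / pageSize);
    }

    public static void main(String[] args) {
        System.out.println(skip(3, 10));        // 20
        System.out.println(totalPages(25, 10)); // 3
    }
}
```

So with a total of 25 and a page size of 10, page 3 skips 20 documents and the UI shows 3 pages.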
OK, I have one example, but I think it's a really crazy query. I put it here only for fun, but if this example is faster than 2 queries, please tell us about it in the comments.
For this question I created a collection called "so" and put 25 documents like this into it:
{
  "_id": ObjectId("512fa86cd99d0adda2a744cd"),
  "authorName": "author name1",
  "email": "email1#email.com",
  "c": 1
}
My query uses the aggregation framework:
db.so.aggregate([
{ $group:
{
_id: 1,
collection: { $push : { "_id": "$_id", "authorName": "$authorName", "email": "$email", "c": "$c" } },
count: { $sum: 1 }
}
},
{ $unwind:
"$collection"
},
{ $project:
{ "_id": "$collection._id", "authorName": "$collection.authorName", "email": "$collection.email", "c": "$collection.c", "count": "$count" }
},
{ $match:
{ c: { $lte: 10 } }
},
{ $sort :
{ c: -1 }
},
{ $skip:
2
},
{ $limit:
3
},
{ $group:
{
_id: "$count",
my_documets: {
$push: {"_id": "$_id", "authorName":"$authorName", "email":"$email", "c":"$c" }
}
}
},
{ $project:
{ "_id": 0, "my_documets": "$my_documets", "total": "$_id" }
}
])
Result for this query:
{
"result" : [
{
"my_documets" : [
{
"_id" : ObjectId("512fa900d99d0adda2a744d4"),
"authorName" : "author name8",
"email" : "email8#email.com",
"c" : 8
},
{
"_id" : ObjectId("512fa900d99d0adda2a744d3"),
"authorName" : "author name7",
"email" : "email7#email.com",
"c" : 7
},
{
"_id" : ObjectId("512fa900d99d0adda2a744d2"),
"authorName" : "author name6",
"email" : "email6#email.com",
"c" : 6
}
],
"total" : 25
}
],
"ok" : 1
}
In the end, I think that for a big collection, 2 queries (the first for data, the second for the count) work faster. For example, you can count the total for a collection like this:
db.so.count()
or like this:
db.so.find({},{_id:1}).sort({_id:-1}).count()
I'm not entirely sure about the first example, but in the second example we use only the cursor over the index (note indexOnly below), which means higher speed:
db.so.find({},{_id:1}).sort({_id:-1}).explain()
{
"cursor" : "BtreeCursor _id_ reverse",
"isMultiKey" : false,
"n" : 25,
"nscannedObjects" : 25,
"nscanned" : 25,
"nscannedObjectsAllPlans" : 25,
"nscannedAllPlans" : 25,
"scanAndOrder" : false,
!!!!!>>> "indexOnly" : true, <<<!!!!!
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
...
}
For completeness (full discussion was on the MongoDB Google Groups) here is the aggregation you want:
db.docs.aggregate([
  {
    "$match": {
      "year": "2012"
    }
  },
  {
    "$group": {
      "_id": null,
      "my_documents": {
        "$push": {
          "_id": "$_id",
          "year": "$year",
          "author": "$author"
        }
      },
      "reviewsCount": {
        "$sum": 1
      }
    }
  },
  {
    "$project": {
      "_id": 0,
      "my_documents": 1,
      "total": "$reviewsCount"
    }
  }
])
By the way, you don't need the aggregation framework here - you can just use a regular find. You can get count() from a cursor without having to re-query.
