What is the best way to use ElasticSearch to search exact partial text in String?
In SQL, the equivalent would be a LIKE query:
LIKE '%PARTIAL TEXT%'
LIKE '%ARTIAL TEX%'
In Elasticsearch, the method currently being used is:
{
"query": {
"match_phrase_prefix": {
"name": "PARTIAL TEXT"
}
}
}
However, it breaks when the first and last characters of the string are removed, as shown below (no results found):
{
"query": {
"match_phrase_prefix": {
"name": "ARTIAL TEX"
}
}
}
There will no doubt be numerous suggestions, such as using an ngram analyzer, on how you can solve this problem. I believe the simplest is to use fuzziness.
{
"query": {
"match": {
"name": {
"query": "artial tex",
"operator": "and",
"fuzziness": 1
}
}
}
}
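To see why this works: fuzziness is Levenshtein edit distance applied per token, so "artial" (one deleted character away from "partial") and "tex" (one away from "text") both match with fuzziness 1. A minimal plain-Python sketch of that distance, not using Elasticsearch itself; note this only tolerates a couple of edits per token, so it is not a general substring search:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the number of single-character edits
    (insertions, deletions, substitutions) needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# "artial" is one deletion away from "partial", and "tex" is one
# away from "text", so both tokens match with fuzziness 1.
print(edit_distance("artial", "partial"))  # 1
print(edit_distance("tex", "text"))        # 1
```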
There are multiple ways to do partial search and each comes with its own tradeoffs.
1. Wildcard
For wildcard, search on the "keyword" sub-field instead of the "text" field.
{
"query": {
"wildcard": {
"name.keyword": "*artial tex*"
}
}
}
Wildcards have poor performance; there are better alternatives.
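As a rough analogy (plain Python, not Elasticsearch): the wildcard pattern behaves like shell-style globbing over the whole stored keyword value, with each * matching any run of characters, much like SQL LIKE '%...%'. fnmatchcase is used here because keyword matching is case-sensitive:

```python
from fnmatch import fnmatchcase

names = ["partial text", "impartial texture", "other text"]
# The pattern must cover the entire stored value; * matches any
# run of characters, so this behaves like LIKE '%artial tex%'.
hits = [n for n in names if fnmatchcase(n, "*artial tex*")]
print(hits)  # ['partial text', 'impartial texture']
```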
2. Match/Match_phrase/Match_phrase_prefix
If you are searching for whole tokens like "PARTIAL TEXT", you can simply use a match query: all documents that contain the tokens "PARTIAL" and "TEXT" will be returned.
If the order of tokens matters, use match_phrase.
If you want to search partial tokens, use match_phrase_prefix. The prefix match is only done on the last token of the search input, e.g. "partial tex".
This is not suitable for your use case, since you want to match anywhere in the string.
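To make the last point concrete, here is a rough plain-Python sketch (not Elasticsearch internals) of the match_phrase_prefix rule: every query token but the last must match exactly and in order, and only the last is treated as a prefix:

```python
def phrase_prefix_match(field_tokens, query_tokens):
    """All query tokens but the last must match exactly and in order;
    the last query token only needs to be a prefix of a field token."""
    n = len(query_tokens)
    for start in range(len(field_tokens) - n + 1):
        window = field_tokens[start:start + n]
        if (window[:-1] == query_tokens[:-1]
                and window[-1].startswith(query_tokens[-1])):
            return True
    return False

doc = ["partial", "text"]
print(phrase_prefix_match(doc, ["partial", "tex"]))  # True: prefix on last token
print(phrase_prefix_match(doc, ["artial", "tex"]))   # False: "artial" != "partial"
```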
3. N-grams
The ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word of the specified length.
N-grams are like a sliding window that moves across the word - a
continuous sequence of characters of the specified length. They are
useful for querying languages that don’t use spaces or that have long
compound words, like German.
Index settings:
{
"settings": {
"max_ngram_diff" : "5",
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 5,
"max_gram": 7
}
}
}
}
}
POST index29/_analyze
{
"analyzer": "my_analyzer",
"text": "Partial text"
}
Tokens Generated:
"tokens" : [
{
"token" : "Parti",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "Partia",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 1
},
{
"token" : "Partial",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 2
},
{
"token" : "artia",
"start_offset" : 1,
"end_offset" : 6,
"type" : "word",
"position" : 3
},
{
"token" : "artial",
"start_offset" : 1,
"end_offset" : 7,
"type" : "word",
"position" : 4
},
{
"token" : "artial ",
"start_offset" : 1,
"end_offset" : 8,
"type" : "word",
"position" : 5
},
{
"token" : "rtial",
"start_offset" : 2,
"end_offset" : 7,
"type" : "word",
"position" : 6
},
{
"token" : "rtial ",
"start_offset" : 2,
"end_offset" : 8,
"type" : "word",
"position" : 7
},
{
"token" : "rtial t",
"start_offset" : 2,
"end_offset" : 9,
"type" : "word",
"position" : 8
},
{
"token" : "tial ",
"start_offset" : 3,
"end_offset" : 8,
"type" : "word",
"position" : 9
},
{
"token" : "tial t",
"start_offset" : 3,
"end_offset" : 9,
"type" : "word",
"position" : 10
},
{
"token" : "tial te",
"start_offset" : 3,
"end_offset" : 10,
"type" : "word",
"position" : 11
},
{
"token" : "ial t",
"start_offset" : 4,
"end_offset" : 9,
"type" : "word",
"position" : 12
},
{
"token" : "ial te",
"start_offset" : 4,
"end_offset" : 10,
"type" : "word",
"position" : 13
},
{
"token" : "ial tex",
"start_offset" : 4,
"end_offset" : 11,
"type" : "word",
"position" : 14
},
{
"token" : "al te",
"start_offset" : 5,
"end_offset" : 10,
"type" : "word",
"position" : 15
},
{
"token" : "al tex",
"start_offset" : 5,
"end_offset" : 11,
"type" : "word",
"position" : 16
},
{
"token" : "al text",
"start_offset" : 5,
"end_offset" : 12,
"type" : "word",
"position" : 17
},
{
"token" : "l tex",
"start_offset" : 6,
"end_offset" : 11,
"type" : "word",
"position" : 18
},
{
"token" : "l text",
"start_offset" : 6,
"end_offset" : 12,
"type" : "word",
"position" : 19
},
{
"token" : " text",
"start_offset" : 7,
"end_offset" : 12,
"type" : "word",
"position" : 20
}
]
You can search on any of the generated tokens. You can also use
"token_chars": [
"letter",
"digit"
]
to generate tokens that exclude spaces.
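The token list above can be reproduced with a simple sliding window. A plain-Python sketch of what the ngram tokenizer emits for min_gram 5 and max_gram 7 with no token_chars restriction (so grams may span the space):

```python
def char_ngrams(text, min_gram, max_gram):
    """Emit every substring of text whose length is between
    min_gram and max_gram, sliding one character at a time."""
    out = []
    for start in range(len(text)):
        for n in range(min_gram, max_gram + 1):
            if start + n <= len(text):
                out.append(text[start:start + n])
    return out

tokens = char_ngrams("Partial text", 5, 7)
print(tokens[:3])   # ['Parti', 'Partia', 'Partial']
print(len(tokens))  # 21, matching the 21 tokens listed above
```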
Your choice among the options above will depend on your data size and performance requirements. Wildcard is more flexible, but matching is done at query time, so performance is slow; if your data size is small, it can be an acceptable solution.
With n-grams, tokens are generated at indexing time. This takes more memory, but search is faster, so for large data sizes this is the better solution.
I am trying to send nested JSON in a POST request body to call an akka-http route, but I am getting an exception:
Cannot unmarshall JSON as Campaign
Can someone please help me resolve the issue?
Below is the code:
concat(
post(() -> pathPrefix(PathMatchers.segment(ACCOUNTS_SEGMENT).slash(PathMatchers.segment()), (accountId) ->
path(PathMatchers.segment(JOBS_SEGMENT).slash(PathMatchers.uuidSegment()) , (jobId) -> {
return entity(Jackson.unmarshaller(Campaign.class), campaign -> {
System.out.println("###############"+campaign.getName());
CompletionStage<Done> futureSaved = executeCampaignProcess(jobId.toString(), accountId);
return onSuccess(futureSaved, done ->
complete(StatusCodes.ACCEPTED, ACCEPTED_EXECUTE_CAMPAIGN_REQUEST)
);
});
}
)))
The nested JSON in the request body is as below:
{
"id" : "2d2cee47-40c9-4ebe-80bb-f8a38e6379f9",
"name" : "DDDAmol",
"description" : "",
"type" : "FINITE",
"senderDisplayName" : "",
"senderAddress" : "",
"dialingOrder" : [ "PRIORITY", "RETRY", "REGULAR" ],
"finishType" : "FINISH_AFTER",
"finishTime" : null,
"finishAfter" : 0,
"checkTimeBasedFinishCriteria" : false,
"createdOn" : [ 2023, 1, 31, 8, 12, 40, 32309000 ],
"updatedOn" : [ 2023, 2, 14, 14, 11, 4, 758967300 ],
"lastExecutedOn" : [ 2023, 2, 18, 13, 51, 28, 821000000 ],
"contactList" : "CD_06_SMS_QUEUED_DELAYED",
"rule" : null,
"strategy" : {
"id" : "a41d8895-7a67-4cce-a39b-5a6ee5e8b4a9",
"name" : "Simple",
"type" : "SMS",
"description" : "",
"smsText" : "Hello",
"smsPace" : 40,
"smsPaceTimeUnit" : "SECOND",
"createdOn" : [ 2023, 1, 31, 8, 12, 15, 209992700 ],
"updatedOn" : [ 2023, 1, 31, 8, 12, 15, 209992700 ]
}
}
I am following the link below, but it's not working:
https://doc.akka.io/docs/akka-http/current/common/json-support.html
I'm new to Elasticsearch with Spring Boot. I don't know how to do a percentage-based text match in Elasticsearch with Spring Boot. For example, I have the text "Hello world". Can I set a percentage of 50% or 70% that must match my text? I already tried the minimumShouldMatch property, but it doesn't seem to work for my case.
Can anyone help me please? Thanks.
You could use a bool query with should clauses: split your search phrase into terms and set minimum_should_match according to your percentage.
Example query
{
"query": {
"bool": {
"should": [
{
"term": {
"my_field": "hello"
}
},
{
"term": {
"my_field": "world"
}
},
{
"term": {
"my_field": "i'm"
}
},
{
"term": {
"my_field": "alive"
}
}
],
"minimum_should_match": 2
}
}
}
This will find "hello world", "hello alive", etc.
To split a text into terms, you should use the _analyze endpoint of your index.
Analyze and split terms
POST myindex/_analyze
{
"field": "my_field",
"text": "hello world i'm alive"
}
This gives you a result like the following to populate your query. Make sure the query terms match what the analyzer produces, especially if, for example, you use a custom analyzer:
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "world",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "i'm",
"start_offset" : 12,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "alive",
"start_offset" : 16,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 3
}
]
}
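Note that minimum_should_match also accepts a percentage string (e.g. "70%") directly, and Elasticsearch rounds the computed number of clauses down. A quick sketch of that arithmetic, assuming that rounding rule:

```python
import math

def required_clauses(num_terms: int, percent: int) -> int:
    """Number of should clauses that must match for a given
    minimum_should_match percentage (Elasticsearch rounds down)."""
    return math.floor(num_terms * percent / 100)

# With the 4 terms above ("hello", "world", "i'm", "alive"):
print(required_clauses(4, 50))  # 2 -> any two terms suffice
print(required_clauses(4, 70))  # 2 (2.8 rounded down)
print(required_clauses(4, 75))  # 3
```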
For example, create an example collection classes with the following documents:
db.classes.insert( [
{ _id: 1, title: "Reading is ...", enrollmentlist: [ "giraffe2", "pandabear", "artie" ], days: ["M", "W", "F"] },
{ _id: 2, title: "But Writing ...", enrollmentlist: [ "giraffe1", "artie" ], days: ["T", "F"] }
])
Create another collection members with the following documents:
db.members.insert( [
{ _id: 1, name: "artie", joined: new Date("2016-05-01"), status: "A" },
{ _id: 2, name: "giraffe", joined: new Date("2017-05-01"), status: "D" },
{ _id: 3, name: "giraffe1", joined: new Date("2017-10-01"), status: "A" },
{ _id: 4, name: "panda", joined: new Date("2018-10-11"), status: "A" },
{ _id: 5, name: "pandabear", joined: new Date("2018-12-01"), status: "A" },
{ _id: 6, name: "giraffe2", joined: new Date("2018-12-01"), status: "D" }
])
The following aggregation operation joins documents in the classes collection with the members collection, matching the enrollmentlist field to the name field:
db.classes.aggregate([
{
$lookup:
{
from: "members",
localField: "enrollmentlist",
foreignField: "name",
as: "enrollee_info"
}
}
])
The operation returns the following:
{
"_id" : 1,
"title" : "Reading is ...",
"enrollmentlist" : [ "giraffe2", "pandabear", "artie" ],
"days" : [ "M", "W", "F" ],
"enrollee_info" : [
{ "_id" : 1, "name" : "artie", "joined" : ISODate("2016-05-01T00:00:00Z"), "status" : "A" },
{ "_id" : 5, "name" : "pandabear", "joined" : ISODate("2018-12-01T00:00:00Z"), "status" : "A" },
{ "_id" : 6, "name" : "giraffe2", "joined" : ISODate("2018-12-01T00:00:00Z"), "status" : "D" }
]
}
{
"_id" : 2,
"title" : "But Writing ...",
"enrollmentlist" : [ "giraffe1", "artie" ],
"days" : [ "T", "F" ],
"enrollee_info" : [
{ "_id" : 1, "name" : "artie", "joined" : ISODate("2016-05-01T00:00:00Z"), "status" : "A" },
{ "_id" : 3, "name" : "giraffe1", "joined" : ISODate("2017-10-01T00:00:00Z"), "status" : "A" }
]
}
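In plain terms, $lookup is a left outer join: for an array-valued localField, a foreign document matches if any element of the array equals its foreignField value. A small plain-Python sketch of that matching rule (not the MongoDB driver):

```python
def lookup(local_docs, foreign_docs, local_field, foreign_field, as_field):
    """Mimic $lookup: attach every foreign doc whose foreign_field
    equals (or, for arrays, is contained in) the local doc's field."""
    out = []
    for doc in local_docs:
        local_val = doc.get(local_field)
        joined = [
            f for f in foreign_docs
            if f.get(foreign_field) == local_val
            or (isinstance(local_val, list) and f.get(foreign_field) in local_val)
        ]
        out.append({**doc, as_field: joined})
    return out

classes = [{"_id": 1, "enrollmentlist": ["giraffe2", "pandabear", "artie"]}]
members = [{"_id": 1, "name": "artie"}, {"_id": 2, "name": "giraffe"}]
result = lookup(classes, members, "enrollmentlist", "name", "enrollee_info")
print([m["name"] for m in result[0]["enrollee_info"]])  # ['artie']
```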
How would you write this in Java with MongoDB?
db.ticket.aggregate([
{
$lookup: {
from: "crmorder",
localField: "subOrderId",
foreignField: "currentStatus",
as: "comments"
}
}
])
In the result, the comments field is coming back blank. Why?
You have another collection named crmorder.
The ticket collection must have a subOrderId field, and the crmorder collection must have a currentStatus field.
The join only returns matches where the ticket collection's subOrderId field == the crmorder collection's currentStatus field; if those values never line up, comments will be empty.
Here is a sample snippet:
db.orders.insert([
{ "_id" : 1, "item" : "almonds", "price" : 12, "quantity" : 2 },
{ "_id" : 2, "item" : "pecans", "price" : 20, "quantity" : 1 },
{ "_id" : 3 }
])
db.inventory.insert([
{ "_id" : 1, "sku" : "almonds", description: "product 1", "instock" : 120 },
{ "_id" : 2, "sku" : "bread", description: "product 2", "instock" : 80 },
{ "_id" : 3, "sku" : "cashews", description: "product 3", "instock" : 60 },
{ "_id" : 4, "sku" : "pecans", description: "product 4", "instock" : 70 },
{ "_id" : 5, "sku": null, description: "Incomplete" },
{ "_id" : 6 }
])
db.orders.aggregate([
{
$lookup:
{
from: "inventory",
localField: "item",
foreignField: "sku",
as: "inventory_docs"
}
}
])
My use case is as follows: I need to find all the unique colors that appeared in the last 1 year but disappeared in the last 3 months. My documents look like this:
{
doc_id: 1,
color: "red",
timestamp: epoch time here
},
{
doc_id: 2,
color: "blue",
timestamp: epoch time here
}
So, for example, if any document with attribute color (from now on referred to just as color) blue appeared in the last year but didn't appear in the last 3 months, then we need to include blue in the result. On the other hand, if documents with color red appeared in the last year and also appeared in the last 3 months, then we need to exclude red from the result.
The 1 year above also includes the 3 months when computing. So if all the documents with color blue occurred only between May 2018 and Feb 2019, meaning blue occurred in the last year but went missing in the last 3 months (March 2019 - May 2019), then blue should be in the result set. On the other hand, if documents with color red occurred between May 2018 and Feb 2019 as well as between March 2019 and May 2019, then we need to exclude red from the result set. I couldn't get this with a terms query in Elasticsearch.
For the example below, I have taken a range of "2019-01-01" to "2019-12-30", with the excluded months being "2019-09-01" to "2019-12-30".
Mapping :
{
"testindex" : {
"mappings" : {
"properties" : {
"color" : {
"type" : "keyword"
},
"doc_id" : {
"type" : "long"
},
"timestamp" : {
"type" : "date"
}
}
}
}
}
Data:
"hits" : [
{
"_index" : "testindex",
"_type" : "_doc",
"_id" : "GPv0zWoB8AL5aj8D_wLG",
"_score" : 1.0,
"_source" : {
"doc_id" : 1,
"color" : "blue",
"timestamp" : "2019-03-30"
}
},
{
"_index" : "testindex",
"_type" : "_doc",
"_id" : "Gfv1zWoB8AL5aj8DJAKU",
"_score" : 1.0,
"_source" : {
"doc_id" : 1,
"color" : "red",
"timestamp" : "2019-12-30"
}
},
{
"_index" : "testindex",
"_type" : "_doc",
"_id" : "Gvv1zWoB8AL5aj8DOwKf",
"_score" : 1.0,
"_source" : {
"doc_id" : 1,
"color" : "red",
"timestamp" : "2019-01-01"
}
}
]
Final Query:
GET testindex/_search
{
"size": 0,
"query": {
"range": {
"timestamp": {
"gte": "2019-01-01",
"lte": "2019-12-30"
}
}
},
"aggs": {
"colors": {
"terms": {
"field": "color"
},
"aggs": {
"excluded_range": {
"date_range": {
"field": "timestamp",
"ranges": [
{
"from": "2019-09-01",
"to": "2019-12-31"
}
]
}
},
"excluded_docs_count": {
"sum_bucket": {
"buckets_path": "excluded_range>_count"
}
},
"myfinal": {
"bucket_selector": {
"buckets_path": {
"out_of_range_docs": "excluded_docs_count"
},
"script": {
"inline": "params.out_of_range_docs==0"
}
}
}
}
}
}
}
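The bucket_selector above effectively computes a set difference: keep each color that has at least one document in the full range but zero documents in the excluded sub-range. The same logic in plain Python on the sample data, for clarity (ISO "YYYY-MM-DD" strings compare correctly lexicographically):

```python
docs = [
    {"color": "blue", "timestamp": "2019-03-30"},
    {"color": "red", "timestamp": "2019-12-30"},
    {"color": "red", "timestamp": "2019-01-01"},
]

# Colors seen anywhere in the full range:
in_year = {d["color"] for d in docs
           if "2019-01-01" <= d["timestamp"] <= "2019-12-30"}
# Colors seen in the excluded (most recent) window:
in_last = {d["color"] for d in docs
           if "2019-09-01" <= d["timestamp"] <= "2019-12-31"}

# Colors seen during the year but missing from the excluded window:
print(sorted(in_year - in_last))  # ['blue']
```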