Search for substring in Elastic Search Java

I am working with Elasticsearch and am trying to look for a substring inside a field, for example searching for the string tac in stack overflow. I am using a MultiMatchQuery for this, but it does not work. Here is a snippet of my code (first_name is the field name).
searchString = "*" + searchString.toLowerCase() + "*";
MultiMatchQueryBuilder mqb = new MultiMatchQueryBuilder("irs", first_name);
mqb.type(MultiMatchQueryBuilder.Type.PHRASE);
BoolQueryBuilder searchQuery = boolQuery();
searchQuery.should(mqb);
NativeSearchQueryBuilder queryBuilder = new NativeSearchQueryBuilder();
queryBuilder.withQuery(searchQuery);
NativeSearchQuery query = queryBuilder.build();
When I search for tac it does not return any results. When I search for stack or overflow it does return stack overflow.
So it looks for the exact string. I tried using MultiMatchQueryBuilder.Type.PHRASE_PREFIX but it looks for the phrases starting with the substring. It works with strings like stac or overf but not tac or tack.
Any suggestions on how to fix it?

A match query is analyzed: it applies the same analyzer that was applied at index time. I believe you are using the standard analyzer, which generates the tokens below:
POST http://localhost:9200/_analyze
{
"text": "stack overflow",
"analyzer" : "standard"
}
{
"tokens": [
{
"token": "stack",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "overflow",
"start_offset": 6,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
}
]
}
Hence, searching for tac doesn't match any token in the index. You need to change the analyzer so that the query-time tokens match the index-time tokens.
An n-gram token filter (used in the example below) or an n-gram tokenizer can solve the issue.
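To see why n-grams make tac findable, here is a minimal plain-Java sketch. It does not call Elasticsearch; the class and method names are made up for illustration, and whitespace splitting is only a rough stand-in for the standard tokenizer. It mimics what an ngram filter with min_gram 1 and max_gram 10 would emit for each lowercased token.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: imitates the output of an ngram token filter.
// It is not Elasticsearch code and all names here are hypothetical.
class NGramSketch {

    // Returns every substring of token whose length lies between minGram and maxGram.
    static Set<String> ngrams(String token, int minGram, int maxGram) {
        Set<String> grams = new LinkedHashSet<>();
        for (int start = 0; start < token.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= token.length(); len++) {
                grams.add(token.substring(start, start + len));
            }
        }
        return grams;
    }

    // Lowercases, splits on whitespace, then expands each token into its n-grams.
    static Set<String> indexTerms(String text, int minGram, int maxGram) {
        Set<String> terms = new LinkedHashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            terms.addAll(ngrams(token, minGram, maxGram));
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> terms = indexTerms("stack overflow", 1, 10);
        // The query side uses the standard analyzer, so "tac" stays a single term.
        System.out.println("tac indexed: " + terms.contains("tac"));
        System.out.println("tack indexed: " + terms.contains("tack"));
    }
}
```

Because tac and tack are both substrings of stack, they end up among the indexed terms, so a plain match query can find them without wildcards.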
Example
Index mapping
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
},
"index.max_ngram_diff" : 10
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
Index sample doc
{
"title" : "stack overflow"
}
And search query
{
"query": {
"match": {
"title": "tac"
}
}
}
And search result
"hits": [
{
"_index": "65241835",
"_type": "_doc",
"_id": "1",
"_score": 0.4739784,
"_source": {
"title": "stack overflow"
}
}
]
}

Related

Elasticsearch Multimatch substring not working

So I have a record with following field :
"fullName" : "Virat Kohli"
I have written the following multi_match query that should fetch this record :
GET _search
{
"query": {
"multi_match": {
"query": "*kohli*",
"fields": [
"fullName^1.0",
"team^1.0"
],
"type": "phrase_prefix",
"operator": "OR",
"slop": 0,
"prefix_length": 0,
"max_expansions": 50,
"zero_terms_query": "NONE",
"auto_generate_synonyms_phrase_query": true,
"fuzzy_transpositions": true,
"boost": 1
}
}
}
This works fine.
But when I remove the letter 'k' from query and change it to :
"query": "*ohli*"
It doesn't fetch any record.
Any reason why this is happening? How can I modify the query to get the record returned with the above modification?
First, let me explain why your existing query didn't work, and then the solution.
Problem: you are using the multi_match query with type phrase_prefix. As explained in the documentation, it makes a prefix query on the last search term, and in your case you have only one search term, so Elasticsearch performs the prefix query on that term.
A prefix query works on the exact tokens. You are most likely using the standard analyzer, the default for text fields, so the fullName field holds the tokens virat and kohli. Your search term also produces kohli (note the lowercase k), because the standard analyzer also lowercases tokens. You can check this in the explain API output for your first request, as shown below.
"_explanation": {
"value": 0.2876821,
"description": "max of:",
"details": [
{
"value": 0.2876821,
"description": "weight(fullName:kohli in 0) [PerFieldSimilarity], result of:",
"details": [
{
(note the search term in the weight)
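The prefix behaviour described above can be sketched in plain Java. This is only a model of the lookup, not Elasticsearch code: the term list is what the standard analyzer would be expected to produce for "Virat Kohli", and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Locale;

// Illustrative only: models a prefix lookup against already-analyzed terms.
class PrefixSketch {

    // True if any indexed term starts with the lowercased search term.
    static boolean prefixMatches(List<String> indexedTerms, String searchTerm) {
        String needle = searchTerm.toLowerCase(Locale.ROOT);
        return indexedTerms.stream().anyMatch(term -> term.startsWith(needle));
    }

    public static void main(String[] args) {
        // Assumed index terms for the field value "Virat Kohli".
        List<String> terms = List.of("virat", "kohli");
        System.out.println("kohli: " + prefixMatches(terms, "kohli"));
        System.out.println("ohli: " + prefixMatches(terms, "ohli"));
    }
}
```

No indexed term starts with ohli, which is why dropping the k makes the record disappear even though ohli is a substring of kohli.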
Solution
Since you are trying to use wildcards in your query, the best solution is to use the wildcard query against your field, as shown below, to get results in both cases.
{
"query": {
"wildcard": {
"fullName": {
"value": "*ohli",
"boost": 1.0,
"rewrite": "constant_score"
}
}
}
}
And the search result
"hits": [
{
"_shard": "[match_query][0]",
"_node": "BKVyHFTiSCeq4zzD-ZqMbA",
"_index": "match_query",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"id": 2,
"fullName": "Virat Kohli",
"team": [
"Royal Challengers Bangalore",
"India"
]
},
"_explanation": {
"value": 1.0,
"description": "fullName:*ohli",
"details": []
}
}
]
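As a rough illustration of why the wildcard query finds the record, the plain-Java sketch below translates a wildcard pattern into a regular expression and tests it against the assumed index terms virat and kohli. It is a simplified model with hypothetical names, not how Elasticsearch implements wildcard matching internally.

```java
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: a simplified model of wildcard matching against index terms.
class WildcardSketch {

    // Converts a wildcard pattern (* = any run of characters, ? = exactly one) to a regex.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // True if the whole of at least one term matches the wildcard pattern.
    static boolean anyTermMatches(List<String> terms, String wildcard) {
        Pattern pattern = toRegex(wildcard);
        return terms.stream().anyMatch(term -> pattern.matcher(term).matches());
    }

    public static void main(String[] args) {
        List<String> terms = List.of("virat", "kohli");
        System.out.println("*ohli: " + anyTermMatches(terms, "*ohli"));
        System.out.println("*Ohli: " + anyTermMatches(terms, "*Ohli"));
    }
}
```

The pattern is compared with the lowercased terms as-is, so in this model an uppercase pattern such as *Ohli does not match; lowercasing the user input before building the wildcard value avoids that surprise.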

How to return all fields instead of just id and count after sortByCount operation in Mongodb/Java?

I need to do sortByCount and return all the fields instead of just _id and count.
sortByCount returns:
{ "_id" : "1", "count" : 4 }
{ "_id" : "2", "count" : 3 }
{ "_id" : "3", "count" : 2 }
{ "_id" : "4", "count" : 2 }
{ "_id" : "5", "count" : 1 }
But, I need a complete document like below:
{
"_id": 1,
"title": "The Pillars of Society",
"artist": "Grosz",
"year": 1926,
"tags": ["painting", "satire", "Expressionism", "caricature"]
} {
"_id": 2,
"title": "Melancholy III",
"artist": "Munch",
"year": 1902,
"tags": ["woodcut", "Expressionism"]
} {
"_id": 3,
"title": "Dancer",
"artist": "Miro",
"year": 1925,
"tags": ["oil", "Surrealism", "painting"]
} {
"_id": 4,
"title": "The Great Wave off Kanagawa",
"artist": "Hokusai",
"tags": ["woodblock", "ukiyo-e"]
} {
"_id": 5,
"title": "The Persistence of Memory",
"artist": "Dali",
"year": 1931,
"tags": ["Surrealism", "painting", "oil"]
}
Is there any way to replace the root after sortByCount? In Java, I don't see any push method after sortByCount.
$sortByCount is essentially a $group stage followed by a $sort on the count field produced by that group stage. If you really want the entire documents, you can try this:
db.collection.aggregate([
{
"$group": {
"_id": {
"id": "$_id"
},
"count": {
$sum: 1
},
"reqItems": {
$push: {
"title": "$title",
"artist": "$artist"
}
}
}
},
{
$sort: {
count: -1
}
}
])
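The idea behind the pipeline above (group, count, keep the grouped documents, then sort by count) can be sketched in plain Java without the MongoDB driver. Everything here is an assumption made for illustration: documents are modelled as maps, and the class, the Group type and the grouping field are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: an in-memory analogue of $group with $push followed by $sort on count.
class SortByCountSketch {

    // One output group: the grouping key and the documents that share it.
    static final class Group {
        final Object key;
        final List<Map<String, Object>> items = new ArrayList<>();

        Group(Object key) {
            this.key = key;
        }

        int count() {
            return items.size();
        }
    }

    // Groups documents by the given field and orders the groups by descending count.
    static List<Group> groupAndSortByCount(List<Map<String, Object>> docs, String field) {
        Map<Object, Group> groups = new LinkedHashMap<>();
        for (Map<String, Object> doc : docs) {
            groups.computeIfAbsent(doc.get(field), Group::new).items.add(doc);
        }
        List<Group> result = new ArrayList<>(groups.values());
        // List.sort is stable, so groups with equal counts keep first-seen order.
        result.sort((a, b) -> Integer.compare(b.count(), a.count()));
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = List.of(
                Map.of("_id", 1, "artist", "Grosz"),
                Map.of("_id", 2, "artist", "Munch"),
                Map.of("_id", 3, "artist", "Munch"));
        for (Group group : groupAndSortByCount(docs, "artist")) {
            System.out.println(group.key + " " + group.count() + " " + group.items);
        }
    }
}
```

Each group carries its full documents next to the count, which is the part that a bare $sortByCount discards.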

ElasticSearch not able to search special characters from a word

I have indexed my PDF file in Elasticsearch using the ingest-attachment processor plugin, and I am now searching the file based on the contents of the PDF.
For example, my PDF contains text like this:
Hello I m Karthikeyan. My mail id Karthikeyan#gmail.com, My mob no 4573894833.
When searching through the Java API, a search for Karthikeyan#gmail.com returns the file.
But a search for #gm does not return the file. I expect it to, because the file contains my search keyword #gm.
How can I do this?
I am using an ngram tokenizer with min_gram and max_gram both set to 3.
Below is the Java API code I have tried, but it does not give me the expected results.
QueryStringQueryBuilder attachmentQB = new QueryStringQueryBuilder("#gm");
My mapping details are below.
PUT attach_local
{
"settings": {
"analysis": {
"analyzer": {
"custom_analyzer": {
"type": "custom",
"tokenizer": "my_tokenizer",
"char_filter": [
"html_strip"
],
"filter": [
"lowercase",
"asciifolding"
]
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"attachment": {
"properties": {
"content": {
"type": "text",
"analyzer": "custom_analyzer"
},
"content_length": {
"type": "long"
},
"content_type": {
"type": "text"
},
"language": {
"type": "text"
}
}
},
"resume": {
"type": "text"
}
}
}
}
}
You can see how ES tokenizes your search text using
POST /attach_local/_analyze
{
"analyzer": "custom_analyzer",
"text": "#gm"
}
That will tell you whether the # character is dropped. If it is, that would explain the behavior, since your inverted index contains only trigrams and you are searching for a bigram.
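As a rough check without a cluster, the plain-Java sketch below imitates an ngram tokenizer that keeps only letters and digits and emits grams of a fixed size. It is an approximation with hypothetical names, not the real tokenizer, so the _analyze output remains the authoritative answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: approximates an ngram tokenizer with token_chars [letter, digit]
// and min_gram == max_gram == size.
class TrigramTokenizerSketch {

    // Splits on anything that is not a letter or digit, then emits grams of exactly size characters.
    static List<String> tokenize(String text, int size) {
        List<String> grams = new ArrayList<>();
        for (String run : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            for (int i = 0; i + size <= run.length(); i++) {
                grams.add(run.substring(i, i + size));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Karthikeyan#gmail.com", 3));
        System.out.println(tokenize("#gm", 3));
    }
}
```

In this model the # acts as a separator and gm is shorter than three characters, so the search text produces no terms at all, while a search for #gma would still produce the trigram gma.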

Elastic search cross fields, edge ngram analyzer

I have 999 documents which I am using to experiment with Elasticsearch.
There is a field f4 in my type mapping which is analyzed and has the following analyzer settings:
"myNGramAnalyzer" => [
"type" => "custom",
"char_filter" => ["html_strip"],
"tokenizer" => "standard",
"filter" => ["lowercase","standard","asciifolding","stop","snowball","ngram_filter"]
]
My filter is as below :
"filter" => [
"ngram_filter" => [
"type" => "edgeNGram",
"min_gram" => "2",
"max_gram" => "20"
]
]
The values for field f4 are "Proj1", "Proj2", "Proj3", and so on.
Now when I search using cross_fields for the string "proj1", I expect the document with "Proj1" to be returned at the top of the response with the highest score. But it isn't. The rest of the data is almost identical in content.
Also, I don't understand why it matches all 999 documents.
Following is my search :
{
"index": "myindex",
"type": "mytype",
"body": {
"query": {
"multi_match": {
"query": "proj1",
"type": "cross_fields",
"operator": "and",
"fields": "f*"
}
},
"filter": {
"term": {
"deleted": "0"
}
}
}
}
My search response is :
{
"took": 12,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 999,
"max_score": 1,
"hits": [{
"_index": "myindex",
"_type": "mytype",
"_id": "42",
"_score": 1,
"_source": {
"f1": "396","f2": "125650","f3": "BH.1511AI.001",
"f4": "Proj42",
"f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
}
}, {
"_index": "myindex",
"_type": "mytype",
"_id": "47",
"_score": 1,
"_source": {
"f1": "396","f2": "137946","f3": "BH.152096.001",
"f4": "Proj47",
"f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
}
},
//.......
//.......
//MANY RECORDS IN BETWEEN HERE
//.......
//.......
{
"_index": "myindex",
"_type": "mytype",
"_id": "1",
"_score": 1,
"_source": {
"f1": "396","f2": "142095","f3": "BH.705215.001",
"f4": "Proj1",
"f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
}
//.......
//.......
//MANY RECORDS IN BETWEEN HERE
//.......
//.......
}]
}
}
Am I doing anything wrong or missing something? (Apologies for the lengthy question, but I wanted to give all relevant information while leaving out unrelated code.)
EDITED :
Term vector response
{
"_index": "myindex",
"_type": "mytype",
"_id": "10",
"_version": 1,
"found": true,
"took": 9,
"term_vectors": {
"f4": {
"field_statistics": {
"sum_doc_freq": 5886,
"doc_count": 999,
"sum_ttf": 5886
},
"terms": {
"pr": {
"doc_freq": 999,
"ttf": 999,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
},
"pro": {
"doc_freq": 999,
"ttf": 999,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
},
"proj": {
"doc_freq": 999,
"ttf": 999,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
},
"proj1": {
"doc_freq": 111,
"ttf": 111,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
},
"proj10": {
"doc_freq": 11,
"ttf": 11,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
}
}
}
}
}
EDITED 2
Mappings for field f4
"f4" : {
"type" : "string",
"index_analyzer" : "myNGramAnalyzer",
"search_analyzer" : "standard"
}
I have updated the mapping to use the standard analyzer at query time, which has improved the results, but they are still not what I expected.
Instead of 999 (all documents), it now returns 111 documents like "Proj1", "Proj11", "Proj111"......"Proj1", "Proj181"......... etc.
Still, "Proj1" is somewhere in the middle of the results and not at the top.
There is no index_analyzer (at least not from Elasticsearch version 1.7). For mapping parameters you can use analyzer and search_analyzer.
Try the following steps in order to make it work.
Create myindex with analyzer settings:
PUT /myindex
{
"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 20
}
},
"analyzer": {
"myNGramAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"char_filter": "html_strip",
"filter": [
"lowercase",
"standard",
"asciifolding",
"stop",
"snowball",
"ngram_filter"
]
}
}
}
}
}
Add mappings to mytype (to make it short I just mapped the relevant fields):
PUT /myindex/_mapping/mytype
{
"properties": {
"f1": {
"type": "string"
},
"f4": {
"type": "string",
"analyzer": "myNGramAnalyzer",
"search_analyzer": "standard"
},
"deleted": {
"type": "string"
}
}
}
Index some data:
PUT myindex/mytype/1
{
"f1":"396",
"f4":"Proj12" ,
"deleted": "0"
}
PUT myindex/mytype/2
{
"f1":"42",
"f4":"Proj22" ,
"deleted": "1"
}
Now try your query:
GET myindex/mytype/_search
{
"query": {
"multi_match": {
"query": "proj1",
"type": "cross_fields",
"operator": "and",
"fields": "f*"
}
},
"filter": {
"term": {
"deleted": "0"
}
}
}
It should return document #1. It worked for me with Sense. I am using Elasticsearch 2.X versions.
Hope I have managed to help :)
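To illustrate why the search analyzer matters here, the plain-Java sketch below generates edge n-grams (min 2, max 20) and checks whether an indexed value shares a term with the query terms. It is a simplified model with hypothetical names that treats the query terms as alternatives; it is not Elasticsearch scoring or matching logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: models edge n-gram index terms and a term-overlap match.
class EdgeNGramSketch {

    // Returns the lowercased prefixes of token with lengths from minGram to maxGram.
    static List<String> edgeNgrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        String lower = token.toLowerCase(Locale.ROOT);
        for (int len = minGram; len <= Math.min(maxGram, lower.length()); len++) {
            grams.add(lower.substring(0, len));
        }
        return grams;
    }

    // True if the index terms of indexedValue contain at least one of the query terms.
    static boolean matches(String indexedValue, List<String> queryTerms) {
        List<String> indexed = edgeNgrams(indexedValue, 2, 20);
        return queryTerms.stream().anyMatch(indexed::contains);
    }

    public static void main(String[] args) {
        // Query analyzed with the n-gram analyzer: pr, pro, proj, proj1.
        List<String> ngramQuery = edgeNgrams("proj1", 2, 20);
        // Query analyzed with the standard analyzer: the single term proj1.
        List<String> standardQuery = List.of("proj1");

        System.out.println("Proj42, n-gram query: " + matches("Proj42", ngramQuery));
        System.out.println("Proj42, standard query: " + matches("Proj42", standardQuery));
        System.out.println("Proj181, standard query: " + matches("Proj181", standardQuery));
    }
}
```

With the n-gram analyzer on the query side, the shared prefix pr is enough for any ProjN value to match, which is consistent with the 999 hits in the question. With the standard analyzer only values starting with proj1 match, which is consistent with the 111 hits reported afterwards.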
After spending hours looking for a solution to this, I finally made it work.
I kept everything the same as described in my question, using the n-gram analyzer while indexing data. The only thing I had to change was to query the _all field in a bool query alongside my existing multi-match query.
Now a search for Proj1 returns results in an order such as Proj1, Proj121, Proj11, etc.
Although this is not the exact order I wanted (Proj1, Proj11, Proj121, etc.), it closely resembles it.

Elasticsearch query doesn't produce expected result

I'm having trouble creating a query that searches for documents with a certain search term in the fields title and text, and that also matches a state field against zero or more values, of which at least one must match.
Given the following query:
"bool" : {
"must" : {
"multi_match" : {
"query" : "test",
"fields" : [ "title", "text" ]
}
},
"should" : {
"terms" : {
"state" : [ "NEW" ]
}
},
"minimum_should_match" : "1"
}
Should not the following data be returned as a result?
{
"_shards": {
"failed": 0,
"successful": 5,
"total": 5
},
"hits": {
"hits": [
{
"_id": "JXnEkYFDQp2feATMzp2LTA",
"_index": "tips",
"_score": 1.0,
"_source": {
"state": "NEW",
"text": "This is a test",
"title": "Test"
},
"_type": "tip"
}
],
"max_score": 1.0,
"total": 1
},
"timed_out": false,
"took": 1
}
In my test this is not the case. What am I doing wrong?
The following is the Java code that produces the query above.
SearchRequestBuilder builder = client.prepareSearch("tips").setTypes("tip");
BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
if(searchTermIsNotEmpty(searchTerm)){
MultiMatchQueryBuilder qb = QueryBuilders.multiMatchQuery(
searchTerm,
"title", "text"
);
boolQuery.must(qb);
}
if(filters.size() > 0){
boolQuery.should(QueryBuilders.termsQuery("state",filters));
boolQuery.minimumNumberShouldMatch(1);
}
if(boolQuery.hasClauses()){
builder.setQuery(boolQuery);
}
logger.info(boolQuery.toString());
SearchResponse result = builder.execute().actionGet();
return result.toString();
Any help on this is greatly appreciated!
It seems I found the issue: for some reason I was unable to fetch results when using the filter enum in its original form. I had to convert the enum to a string and lowercase it.
I then added the following query:
boolQuery.must(QueryBuilders.termsQuery("state", getLowerCaseEnumCollection(filters)).minimumMatch(1));
I'm new to Elasticsearch, so I don't know if this is a bug or a feature. I'm just glad I figured it out.
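A likely explanation is that a terms query is not analyzed, while an analyzed state field stores the lowercased token new, so the literal NEW never matches. The answer does not show getLowerCaseEnumCollection, so the plain-Java sketch below is only a guess at what such a helper might look like; the State enum and the class name are hypothetical.

```java
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Illustrative only: one possible shape for the helper named in the answer.
class EnumTermsSketch {

    // Hypothetical enum standing in for the filter type used in the question.
    enum State { NEW, IN_PROGRESS, DONE }

    // Converts each enum constant to its lowercased name, matching lowercased index terms.
    static List<String> getLowerCaseEnumCollection(Collection<? extends Enum<?>> filters) {
        return filters.stream()
                .map(value -> value.name().toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(getLowerCaseEnumCollection(List.of(State.NEW, State.IN_PROGRESS)));
    }
}
```

An alternative design, if exact values such as NEW should be matched as-is, would be to map the state field so that it is not analyzed.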
