I have an archive of university theses and publications indexed (with BM25 similarity) in Lucene (Java version). Some documents are in English and some in Italian, so I have duplicated fields such as pdf and pdf_en, or titolo and titolo_en. When a document is Italian I fill the Italian fields; otherwise I fill the English fields.
I now build a BooleanQuery from three MultiFieldQueryParser instances. This is my code:
String[] fieldsGEN={"url","autori","lingua","settore","pdfurl"};
String[] fieldsITA={"titolo","tipologia","abstract","pdf"};
String[] fieldsENG={"titolo_en","tipologia_en", "abstract_en","pdf_en"};
MultiFieldQueryParser parserGEN = new MultiFieldQueryParser(version, fieldsGEN, analyzerIT);
MultiFieldQueryParser parserITA = new MultiFieldQueryParser(version, fieldsITA, analyzerIT);
MultiFieldQueryParser parserENG = new MultiFieldQueryParser(version, fieldsENG, analyzerENG);
parserGEN.setDefaultOperator(QueryParser.Operator.OR);
parserITA.setDefaultOperator(QueryParser.Operator.OR);
parserENG.setDefaultOperator(QueryParser.Operator.OR);
Query query4 =parserGEN.parse(ricerca.ricerca);
bq.add(query4, Occur.SHOULD);
Query query2 =parserITA.parse(ricerca.ricerca);
bq.add(query2, Occur.SHOULD);
Query query3 =parserENG.parse(ricerca.ricerca);
bq.add(query3, Occur.SHOULD);
If I search for "anna" (the name of an author), the three queries are:
Query: [titolo:anna tipologia:anna abstract:anna pdf:anna]
Query: [titolo_en:anna tipologia_en:anna abstract_en:anna pdf_en:anna]
Query: [url:anna autori:anna lingua:anna settore:anna pdfurl:anna]
and I also get documents whose authors do not include the name anna, although they are ranked last (about 3 of the 21 results, out of 1000 indexed documents). I suppose the term is found in other fields.
Do you think the query is well constructed? Can it be improved, and how? How does a search engine like Google handle multi-field search?
Is there a better way to deal with multi-language fields?
Thanks,
Neptune.
Unless you have both translations for all documents, I would create 2 indexes -- 1 for each language, using the same field names for each index. You would then use a MultiReader with the search queries.
The problem with this approach is words that are spelled the same in both languages but have different meanings in English and Italian. Apart from those words, I think this architecture will be easier to understand, and its results easier to interpret.
Related
I'm using Elasticsearch 6.x with the ingest plugin so that I can query inside documents.
I managed to insert a record with an attached document, and I'm able to query it against various fields.
To query the content of the file, I do this:
boolQuery.filter(new MatchPhrasePrefixQueryBuilder("attachment.content", "St. Anna Church"))
It works, but now I want to query this field with "Church Wall People", which is not a complete phrase: I want back all the documents that contain the words Church, Wall and People.
I'm using the new IntPoint in Lucene 6 and I want to do some range searches.
Using IntPoint.newRangeQuery the search works and the correct documents are returned, however when I'm using QueryParser (classic) or the new StandardQueryParser nothing is returned.
// This works
Query query = IntPoint.newRangeQuery("duration",1,20);
System.out.println(query);
//This doesn't work
QueryParser parser = new QueryParser("name", analyzer);
Query query = parser.parse("duration:[1 TO 20]");
System.out.println(query);
//This doesn't work
StandardQueryParser queryParserHelper = new StandardQueryParser();
Query query = queryParserHelper.parse("timestamp:[1 TO 20]", "timestamp");
System.out.println(query);
// In all 3 cases it prints: timestamp:[1 TO 20]
Is this a bug or am I missing something?
It's not a bug, and I wouldn't say you are missing anything, really. QueryParser doesn't have any support for IntPoint fields, or any other numeric (PointValues) field types. Range queries in QueryParser syntax will always generate a TermRangeQuery, which searches the field's terms in the inverted index by lexicographic order, and that will not work for PointValues fields. Generating these queries with IntPoint.newRangeQuery and similar methods is the correct thing to do.
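To make the ordering mismatch concrete, here is a minimal plain-Java sketch. It uses no Lucene classes, and the class and method names are made up for illustration. It only shows why comparing values as strings gives a different range than comparing them as numbers. With a real IntPoint field the values are not stored as terms in the inverted index at all, so a term-based range query has nothing to match in the first place.

```java
// Plain-Java illustration only; no Lucene classes are involved.
class RangeOrderSketch {

    // Term-style range check: compares the values as strings, character by character.
    static boolean inLexicographicRange(String value, String lower, String upper) {
        return value.compareTo(lower) >= 0 && value.compareTo(upper) <= 0;
    }

    // Numeric range check: compares the values as integers.
    static boolean inNumericRange(int value, int lower, int upper) {
        return value >= lower && value <= upper;
    }

    public static void main(String[] args) {
        // Compare both notions of "between 1 and 20" for a few sample values.
        for (int v : new int[] {5, 15, 100}) {
            System.out.println(v
                    + " lexicographic=" + inLexicographicRange(String.valueOf(v), "1", "20")
                    + " numeric=" + inNumericRange(v, 1, 20));
        }
    }
}
```

In this sketch 5 falls outside the string range even though it is numerically between 1 and 20, while 100 falls inside it, which is why a lexicographic range is not a substitute for a numeric one.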
In my Java web application (JSP + Servlet + Hibernate) users can request books. The request is stored in the database as text. After that I tokenize the text using Apache OpenNLP. Then I need to compare the tokenized text with the books table (which has book ID, book name, author and description) and give the most relevant suggestions to the user. Mostly I need to compare it with the book name and book description columns. Is this possible?
import opennlp.tools.tokenize.SimpleTokenizer;
public class SimpleTokenizerExample {
public static void main(String args[]){
String sentence = "Hello Guys , I like to read horror stories. If you have any horror story books please share with us. Also my favorite author is Stephen King";
//Instantiating SimpleTokenizer class
SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;
//Tokenizing the given sentence
String tokens[] = simpleTokenizer.tokenize(sentence);
//Printing the tokens
for(String token : tokens) {
System.out.println(token);
}
}
}
Apache OpenNLP can do Natural Language Processing, but the task you describe is Information Retrieval. Take a look at http://lucene.apache.org/solr/.
If you really need to use only the database, you can run a query for each token using the SQL LIKE keyword:
SELECT DISTINCT * FROM mytable WHERE description LIKE '%token%';
and rank the rows that match more tokens higher.
How can OpenNLP help you?
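The per-token matching and ranking idea can be sketched in plain Java over in-memory rows. This is only an illustration under assumptions: the class name, the sample book ids and the descriptions are invented, and substring matching stands in for what one LIKE test per token would report. A real application would run the queries against the database, or better, use a search engine.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class TokenMatchRanker {

    // Counts how many distinct tokens occur as substrings of the text, ignoring case.
    static int score(String text, List<String> tokens) {
        String haystack = text.toLowerCase(Locale.ROOT);
        Set<String> distinct = new LinkedHashSet<>();
        for (String token : tokens) {
            distinct.add(token.toLowerCase(Locale.ROOT));
        }
        int hits = 0;
        for (String token : distinct) {
            if (haystack.contains(token)) {
                hits++;
            }
        }
        return hits;
    }

    // Returns the ids of rows with at least one matching token, best score first.
    static List<String> rank(Map<String, String> descriptionsById, List<String> tokens) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, String> row : descriptionsById.entrySet()) {
            if (score(row.getValue(), tokens) > 0) {
                ids.add(row.getKey());
            }
        }
        ids.sort(Comparator.comparingInt(
                (String id) -> score(descriptionsById.get(id), tokens)).reversed());
        return ids;
    }

    // Hypothetical sample rows standing in for the books table.
    static Map<String, String> sampleBooks() {
        Map<String, String> books = new LinkedHashMap<>();
        books.put("b1", "A horror story set in Maine");
        books.put("b2", "A cookbook for beginners");
        books.put("b3", "Collected horror stories by Stephen King");
        return books;
    }

    public static void main(String[] args) {
        System.out.println(rank(sampleBooks(), List.of("horror", "stories", "king")));
    }
}
```

Note that "stories" does not match "story" here, which is the inflection problem that stemming both the stored text and the query is meant to address.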
You can use the OpenNLP Stemmer. In that case you stem the book description and title before adding them to the database columns. You also need to stem the query. This helps with inflections: "car" will match both "cars" and "car".
You can accomplish the same with the OpenNLP Lemmatizer, but you need a trained model, which is not available today for that module.
Just to add to what @wcolen says, some out-of-the-box stemmers exist for various languages in Lucene as well.
Another thing OpenNLP could help with is recognizing book authors' names (e.g. Stephen King) via the NameFinderTool. Your code could then create a phrase query for such entities instead of a plain keyword-based query, so that you won't get results containing only Stephen or only King, but only results containing Stephen King.
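As a rough sketch of that last idea, the plain-Java snippet below turns recognized entity spans into quoted phrases in a query string. It does not call OpenNLP. The spans are passed in as hypothetical half-open [start, end) token index pairs, which is only an assumption about the shape of a name finder's output, and the class and method names are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EntityPhraseQuerySketch {

    // Builds a query string in which every entity span becomes a quoted phrase.
    // Each span is {start, end} and covers tokens[start] up to, but not including, tokens[end].
    // Spans are assumed to be sorted and non-overlapping.
    static String buildQuery(String[] tokens, int[][] spans) {
        List<String> parts = new ArrayList<>();
        int i = 0;
        for (int[] span : spans) {
            // Copy plain keywords that come before the entity.
            while (i < span[0]) {
                parts.add(tokens[i++]);
            }
            // Quote the entity so that a query parser treats it as a phrase.
            String phrase = String.join(" ", Arrays.copyOfRange(tokens, span[0], span[1]));
            parts.add("\"" + phrase + "\"");
            i = span[1];
        }
        // Copy the remaining plain keywords.
        while (i < tokens.length) {
            parts.add(tokens[i++]);
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        String[] tokens = {"horror", "Stephen", "King"};
        int[][] spans = {{1, 3}}; // hypothetical name finder result covering "Stephen King"
        System.out.println(buildQuery(tokens, spans));
    }
}
```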
I have a String address = "456 SOME STREET";
which I have to search for in Lucene. I have indexed it like this:
StringField address = new StringField(Constants.ORGANIZATION_ADDRESS, address,Field.Store.YES);
I am using a PhraseQuery to search for this string, with the code below:
String[] tokens = address.split("\\s+");
PhraseQuery addressQuery = new PhraseQuery(Constants.ORGANIZATION_ADDRESS, tokens);
finalQuery.add(addressQuery, BooleanClause.Occur.MUST);
But it's not giving me any results. I have tried TermQuery as well, but that is not working either. I would really appreciate any help, because I have tried many options now and I am unable to figure out what's wrong.
I have also tried the following.
For indexing:
doc.add(new StringField(Constants.ORGANIZATION_ADDRESS, address,Field.Store.YES));
Search using TermQuery:
fullAddressExact= fullAddressExact.toLowerCase();
TermQuery tq = new TermQuery(new Term(Constants.ORGANIZATION_ADDRESS,fullAddressExact));
finalQuery.add(tq, BooleanClause.Occur.MUST);
Even this doesn't give any results. My intention is to get an exact match.
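You should probably use TextField, not StringField, when indexing the documents.
StringField indexes the string as is, without breaking it into tokens, so in your example the index will contain the single term "456 SOME STREET". Only a TermQuery with exactly this term (or a PrefixQuery) will retrieve it. Note that your TermQuery attempt lowercases the address first, so if the indexed value is upper case, as in "456 SOME STREET", the lowercased term will not match.
TextField is the standard field for indexing text. It splits the text into tokens (using a Tokenizer) and indexes the words separately, so in your example 456, SOME and STREET can each be used to find the document, subject to any normalization, such as lowercasing, that the analyzer applies.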
Read more about it here (a bit old, but relevant).
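The difference between the two field types can be imitated with a small plain-Java sketch. It does not use Lucene. The class and method names are invented, and splitting on whitespace plus lowercasing is only an approximation of what an analyzer such as StandardAnalyzer would do for this particular input.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class FieldIndexingSketch {

    // StringField-like behaviour: the whole value becomes one unmodified term.
    static Set<String> indexAsSingleTerm(String value) {
        Set<String> terms = new HashSet<>();
        terms.add(value);
        return terms;
    }

    // TextField-like behaviour (approximation): split on whitespace and lowercase each token.
    static Set<String> indexAsTokens(String value) {
        Set<String> terms = new HashSet<>();
        for (String token : value.trim().split("\\s+")) {
            terms.add(token.toLowerCase(Locale.ROOT));
        }
        return terms;
    }

    public static void main(String[] args) {
        String address = "456 SOME STREET";
        Set<String> exact = indexAsSingleTerm(address);
        Set<String> tokenized = indexAsTokens(address);

        // An exact-term lookup only succeeds with the identical string, including case.
        System.out.println(exact.contains("456 SOME STREET"));
        System.out.println(exact.contains("456 some street"));
        // The tokenized form can be found through its individual lowercased words.
        System.out.println(tokenized.contains("some"));
        System.out.println(tokenized.contains("456 SOME STREET"));
    }
}
```

The sketch suggests why both attempts in the question return nothing: the phrase query looks for three separate terms that the untokenized field never produced, and the lowercased term query looks for a string that differs in case from the stored one.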
Let's say I have a document like
stringfield:123456
textfield:name website stackoverflow
and I build a query in the following manner:
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
QueryParser luceneQueryParser = new QueryParser(Version.LUCENE_42, "", analyzer);
Query luceneSearchQuery = luceneQueryParser.parse("textfield:\"name website\"");
It returns the document as expected, but if I build my query using the Lucene query API
PhraseQuery firstNameQuery = new PhraseQuery();
firstNameQuery.add(new Term("textfield","name website"));
it does not give me any results. I have to tokenize "name website" myself and add each token to the PhraseQuery.
Is there any default way in the query API to tokenize the text, as the parser does when parsing a string query?
Sure, I can do that myself, but I would rather not reinvent the wheel if it's already implemented.
You are adding the entire query as a single term to your PhraseQuery. You are on the right track, but when tokenized, that text is not a single term but two. That is, your index has the terms name, website, and stackoverflow, but your query has only one term, "name website", which matches none of them.
The correct way to use a PhraseQuery is to add each term to it separately:
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("textfield", "name"));
phrase.add(new Term("textfield", "website"));
When you call:
luceneQueryParser.parse("textfield:\"name website\"");
Lucene will tokenize the string "name website", and get 2 terms.
When you create:
new Term("textfield","name website")
Lucene will not tokenize the string "name website"; it uses the whole string as a single term.
So, as you said, you have to add each token yourself, and when you index the document the field textfield must be indexed and tokenized.
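The effect described in both answers can be reproduced with a small plain-Java sketch of phrase matching over term positions. It does not use Lucene. The names are invented, and the whitespace-and-lowercase tokenizer is only a stand-in for the analyzer, from which a real implementation would obtain the terms.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class PhraseMatchSketch {

    // Stand-in for an analyzer: lowercase the text and split it on whitespace.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // True if the query terms occur consecutively and in order among the document terms.
    static boolean matchesPhrase(List<String> docTerms, List<String> queryTerms) {
        if (queryTerms.isEmpty()) {
            return false;
        }
        for (int start = 0; start + queryTerms.size() <= docTerms.size(); start++) {
            if (docTerms.subList(start, start + queryTerms.size()).equals(queryTerms)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = tokenize("name website stackoverflow");

        // One untokenized term: no indexed term equals "name website".
        System.out.println(matchesPhrase(doc, Arrays.asList("name website")));
        // Two separate terms in order: matches the indexed positions.
        System.out.println(matchesPhrase(doc, tokenize("name website")));
    }
}
```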