How do I search partial words in Lucene when using MultiFieldQueryParser?

public SearchResult search(String queryStr, SortBy sortBy, int maxCount)
        throws ParseException, IOException {
    String[] fields = {Indexer.TITLE_FIELD_NAME, Indexer.REVIEW_FIELD_NAME, "name"};
    QueryParser parser = new MultiFieldQueryParser(fields, analyzer);
    Query query = parser.parse(queryStr);
    Sort sort = null;
    if (sortBy != null) {
        sort = sortBy.sort;
    }
    return searchAfter(null, query, sort, maxCount);
}
The method above gives me results, but only when I search for a whole word; if I search for a partial word, it doesn't work.

By default MultiFieldQueryParser (and QueryParser, which this class inherits from) looks for the whole words you are searching for. However, it can also generate wildcard queries. The word "elephant" can be matched using elep*, elep?ant (? matches a single letter) or ele*nt. You can also use fuzzy queries, like elechant~ (which still matches "elephant" despite the misspelling).
You can read the whole syntax specification here: http://lucene.apache.org/core/7_5_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html (below the class list).
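For example, the wildcard syntax can be passed straight through the parser from the question. A minimal sketch (the field names and analyzer are the ones from the question and assumed to exist):

String[] fields = {Indexer.TITLE_FIELD_NAME, Indexer.REVIEW_FIELD_NAME, "name"};
MultiFieldQueryParser parser = new MultiFieldQueryParser(fields, analyzer);

// "elep*" matches any term starting with "elep" in any of the three fields
Query query = parser.parse("elep*");

// Leading wildcards ("*phant") are rejected by default and must be enabled first
parser.setAllowLeadingWildcard(true);
Query suffixQuery = parser.parse("*phant");

Note that leading-wildcard queries can be slow on large indexes, since every term has to be enumerated.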

Related

Lucene Stopword and nGram

I'm using Lucene and I want to use nGrams with stopwords.
I wrote my own Analyzer in Lucene, modeled on the German stopword analyzer.
public class GermanNGramAnalyzer extends StopwordAnalyzerBase {
    @Override
    protected TokenStreamComponents createComponents(String s) {
        NGramTokenizer tokenizer = new NGramTokenizer(4, 4); // tokenizer for nGrams
        TokenStream result = new StandardFilter(tokenizer);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, this.stopwords);
        result = new SetKeywordMarkerFilter(result, this.exclusionSet);
        result = new GermanNormalizationFilter(result);
        result = new NumberFilter(result);
        return new TokenStreamComponents(tokenizer, result);
    }
    (...)
}
This works but not like I want.
As you can see, we have 4-grams, so it looks like this (blanks are masked as "_"):
Das Haus
das_
as_h
s_ha
_hau
haus
In German "das" is like "the" and should be removed. But of course it won't be removed then "das_", "as_h", "s_ha" doesn't contains "das" at all.
So I want to first a word tokenizer, use stopword and after that merge everything again and use ngram like normal.
Of course I can "manually" remove all stopwords from the string before I throw it into Lucene but I thought it should be possible to do this with Lucene.
Someone has an idea?
One possibility would be, instead of using NGramTokenizer as the tokenizer, to first use StandardTokenizer (or any other suitable tokenizer) and then create the ngrams via an NGramTokenFilter, applied right after the StopFilter.
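A minimal sketch of that chain (untested, and filter constructors differ between Lucene versions, so treat the exact signatures as assumptions):

public class GermanNGramAnalyzer extends StopwordAnalyzerBase {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Tokenize into whole words first, so StopFilter sees "das" as a token
        StandardTokenizer tokenizer = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(tokenizer);
        result = new StopFilter(result, this.stopwords);
        // Only after stopword removal, split the surviving words into 4-grams
        result = new NGramTokenFilter(result, 4, 4);
        return new TokenStreamComponents(tokenizer, result);
    }
}

One caveat: NGramTokenFilter builds ngrams within each token, so the grams no longer span word boundaries as they did with NGramTokenizer. If cross-word grams are required, the stopwords would have to be stripped from the text before tokenization instead.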

How to query lucene for 2 index fields?

I'd like to execute queries with Lucene, but the lookup should not only be based on the input, but also on a second parameter.
Example: imagine the Lucene index contains city names and country codes.
During lookup I already know which country the desired city name should be in.
So I want to query the Lucene index by city name, but tell Lucene to only look at the city names whose country code matches.
Is it possible? If yes, how?
For a single attribute I would just set up the following:
QueryParser queryParser = new QueryParser(matchVersion, field, analyzer);
Query query = queryParser.parse(input);
But how for 2 attributes?
Something like this should work. Untested but you should get the idea:
String countryCode = ....; // known in advance
QueryParser queryParser = new QueryParser(matchVersion, f, a);
Query cityNameQuery = queryParser.parse(inputWithCityName);
Query countryCodeQuery = queryParser.parse("+countrycode:" + countryCode);

BooleanQuery result = new BooleanQuery();
result.add(new BooleanClause(cityNameQuery, BooleanClause.Occur.MUST));
result.add(new BooleanClause(countryCodeQuery, BooleanClause.Occur.MUST));
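Since the country code is known exactly, it could also be matched with a TermQuery instead of going through the parser. A sketch, assuming the countrycode field is indexed untokenized:

Query cityNameQuery = queryParser.parse(inputWithCityName);
Query countryCodeQuery = new TermQuery(new Term("countrycode", countryCode));

BooleanQuery result = new BooleanQuery();
result.add(cityNameQuery, BooleanClause.Occur.MUST);
result.add(countryCodeQuery, BooleanClause.Occur.MUST);

This avoids any surprises from the analyzer re-tokenizing the country code.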

Converting an ArrayList to String

I am currently scraping data from an HTML table using JSoup and storing it in ArrayLists. I would then like to store these values in a SQL database. From what I understand, they must be converted to a String before they can be inserted into the database. How would I go about converting these ArrayLists to a String?
Here is my current code:
Document doc = Jsoup.connect(URL).timeout(5000).get();
for (Element table : doc.select("table:first-of-type")) {
    for (Element row : table.select("tr:gt(0)")) {
        Elements tds = row.select("td");
        List1.add(tds.get(0).text());
        List2.add(tds.get(1).text());
        List3.add(tds.get(2).text());
    }
}
To get all the values you need, you can iterate through the list very easily:
public static void main(String[] args) {
    List<String> List1 = new ArrayList<>();
    for (String singleString : List1) {
        // Do whatever you want with each string stored in this list
    }
}
Well, I am not really sure if there is a special way to achieve what you are asking. You would need to iterate over the list. You could use a StringBuilder instead of String for a bit of optimization, though I am not sure how much of a performance boost you would gain. Anyway, this is how the code would look:
StringBuilder sb = new StringBuilder();
for (Object o : list1) {
    sb.append(o.toString()).append(delimiter); // delimiter could be , or \t or ||
    // You could manipulate the string here as required
}
return sb.toString();
You would then need to split the strings if required (if they need to go into separate fields).
The Java collection framework doesn't provide any direct utility method to convert an ArrayList to a String. But the Spring Framework, famous for dependency injection and its IoC container, also provides an API with common utilities, including methods to convert a Collection to a String. You can convert an ArrayList to a String using Spring's StringUtils class, which provides three such methods, as shown below:
public static String collectionToCommaDelimitedString(Collection<?> coll)
public static String collectionToDelimitedString(Collection<?> coll, String delim)
public static String collectionToDelimitedString(Collection<?> coll, String delim, String prefix, String suffix)
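A quick usage sketch (assuming spring-core is on the classpath; the list contents are made up):

List<String> values = Arrays.asList("alpha", "beta", "gamma");
String csv = StringUtils.collectionToCommaDelimitedString(values);   // "alpha,beta,gamma"
String piped = StringUtils.collectionToDelimitedString(values, "|"); // "alpha|beta|gamma"

On Java 8+, String.join(",", values) achieves the same without the Spring dependency.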

Partially match strings in case of List.contains(String)

I have a List<String>
List<String> list = new ArrayList<String>();
list.add("ABCD");
list.add("EFGH");
list.add("IJ KL");
list.add("M NOP");
list.add("UVW X");
if I do list.contains("EFGH"), it returns true.
Can I get a true in case of list.contains("IJ")? I mean, can I partially match strings to find if they exist in the list?
I have a list of 15000 strings. And I have to check about 10000 strings if they exist in the list. What could be some other (faster) way to do this?
Thanks.
If the suggestion from Roadrunner-EX does not suffice, then I believe you are looking for the Knuth–Morris–Pratt (KMP) algorithm.
Time complexity:
Building the failure table (preprocessing) takes O(k)
The search itself takes O(n)
So the complexity of the overall algorithm is O(n + k), where
n = total length of the text being searched (here, the strings in the list)
k = length of the pattern you are searching for
Normal brute force has a worst-case time complexity of O(nk). Moreover, KMP pays its O(k) preprocessing cost only once per pattern, so repeated searches with the same pattern stay at O(n) each, whereas the brute-force approach costs O(nk) every time.
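For illustration, a minimal textbook KMP implementation (not from the original answer):

static boolean kmpContains(String text, String pattern) {
    if (pattern.isEmpty()) return true;
    // Failure table: length of the longest proper prefix of the pattern
    // that is also a suffix of pattern[0..i]
    int[] fail = new int[pattern.length()];
    int k = 0;
    for (int i = 1; i < pattern.length(); i++) {
        while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) k = fail[k - 1];
        if (pattern.charAt(i) == pattern.charAt(k)) k++;
        fail[i] = k;
    }
    // Scan the text without ever moving backwards in it
    k = 0;
    for (int i = 0; i < text.length(); i++) {
        while (k > 0 && text.charAt(i) != pattern.charAt(k)) k = fail[k - 1];
        if (text.charAt(i) == pattern.charAt(k)) k++;
        if (k == pattern.length()) return true;
    }
    return false;
}

For example, kmpContains("IJ KL", "IJ") returns true. For a one-off check String.contains is usually fine; KMP pays off when the same pattern is tested against many strings, since the failure table is built once.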
Perhaps you want to put each String fragment into a HashSet; by fragment, I mean don't add "IJ KL" as one entry, but rather add "IJ" and "KL" separately. If you need both the list and this search capability, you may need to maintain two collections.
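A sketch of that idea, assuming the fragments are the whitespace-separated words seen in the question:

Set<String> fragments = new HashSet<>();
for (String entry : list) {
    // "IJ KL" is stored as the two fragments "IJ" and "KL"
    fragments.addAll(Arrays.asList(entry.split("\\s+")));
}

boolean found = fragments.contains("IJ"); // true, in O(1) per lookup

Note this only matches whole fragments: "IJ" is found, but an arbitrary substring like "J K" would not be.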
As a second answer, upon rereading your question: you could also extend a List implementation, specialize it for Strings only, and override the contains() method.
public class PartialStringList extends ArrayList<String> {
    @Override
    public boolean contains(Object o) {
        if (!(o instanceof String)) {
            return false;
        }
        String s = (String) o;
        Iterator<String> iter = iterator();
        while (iter.hasNext()) {
            String iStr = iter.next();
            if (iStr.contains(s)) {
                return true;
            }
        }
        return false;
    }
}
Judging by your earlier comments, this is maybe not the speed you're looking for, but is this more similar to what you were asking for?
You could use IterableUtils from Apache Commons Collections.
List<String> list = new ArrayList<String>();
list.add("ABCD");
list.add("EFGH");
list.add("IJ KL");
list.add("M NOP");
list.add("UVW X");
boolean hasString = IterableUtils.contains(list, "IJ", new Equator<String>() {
    @Override
    public boolean equate(String o1, String o2) {
        return o2.contains(o1);
    }

    @Override
    public int hash(String o) {
        return o.hashCode();
    }
});
System.out.println(hasString); // true
You can iterate over the list, and then call contains() on each String.
public boolean listContainsString(List<String> list, String checkStr) {
    Iterator<String> iter = list.iterator();
    while (iter.hasNext()) {
        String s = iter.next();
        if (s.contains(checkStr)) {
            return true;
        }
    }
    return false;
}
Something like that should work, I think.
How about:
java.util.List<String> list = new java.util.ArrayList<String>();
list.add("ABCD");
list.add("EFGH");
list.add("IJ KL");
list.add("M NOP");
list.add("UVW X");
java.util.regex.Pattern p = java.util.regex.Pattern.compile("IJ");
java.util.regex.Matcher m = p.matcher("");
for (String s : list) {
    m.reset(s);
    if (m.find()) System.out.println("Partially Matched");
}
Here's some code that uses a regex to shortcut the inner loop if none of the test Strings are found in the target String.
public static void main(String[] args) throws Exception {
    List<String> haystack = Arrays.asList(new String[] { "ABCD", "EFGH", "IJ KL", "M NOP", "UVW X" });
    List<String> needles = Arrays.asList(new String[] { "IJ", "NOP" });

    // To cut down on iterations, create one big regex to check the whole haystack
    StringBuilder sb = new StringBuilder();
    sb.append(".*(");
    for (String needle : needles) {
        sb.append(needle).append('|');
    }
    sb.replace(sb.length() - 1, sb.length(), ").*");
    String regex = sb.toString();

    for (String target : haystack) {
        if (!target.matches(regex)) {
            System.out.println("Skipping " + target);
            continue;
        }
        for (String needle : needles) {
            if (target.contains(needle)) {
                System.out.println(target + " contains " + needle);
            }
        }
    }
}
Output:
Skipping ABCD
Skipping EFGH
IJ KL contains IJ
M NOP contains NOP
Skipping UVW X
If you really want to get cute, you could use a binary search to identify which segments of the target list match, but it mightn't be worth it. It depends on how likely it is that you'll find a hit: low hit rates will give a good result, while high hit rates will perform not much better than the simple nested-loop version. Consider inverting the loops if some needles hit many targets and others hit none. It's all about aborting a search path ASAP.
Yes, you can! Sort of.
What you are looking for, is often called fuzzy searching or approximate string matching and there are several solutions to this problem.
With the FuzzyWuzzy lib, for example, you can have all your strings assigned a score based on how similar they are to a particular search term. The values appear to be integer percentages of the number of matching characters relative to the search string length.
After invoking FuzzySearch.extractAll, it is up to you to decide what the minimum score would be for a string to be considered a match.
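A rough sketch of that flow with the Java port (me.xdrop.fuzzywuzzy); the method and type names here are from memory, so verify them against the library version you pull in:

List<String> choices = Arrays.asList("ABCD", "EFGH", "IJ KL", "M NOP", "UVW X");
// Score every candidate against the search term (0-100)
for (ExtractedResult r : FuzzySearch.extractAll("IJ", choices)) {
    if (r.getScore() >= 60) { // the cutoff is your call
        System.out.println(r.getString() + " -> " + r.getScore());
    }
}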
There are also other, similar libraries worth checking out, like google-diff-match-patch or the Apache Commons Text Similarity API, and so on.
If you need something really heavy-duty, your best bet would probably be Lucene (as also mentioned by Ryan Shillington)
This is not a direct answer to the given problem, but I guess it will help someone who needs to partially compare both a given string and the elements in a list using Apache Commons Collections.
final Equator<String> equator = new Equator<String>() {
    @Override
    public boolean equate(String o1, String o2) {
        final int i1 = o1.lastIndexOf(":");
        final int i2 = o2.lastIndexOf(":");
        return o1.substring(0, i1).equals(o2.substring(0, i2));
    }

    @Override
    public int hash(String o) {
        final int i1 = o.lastIndexOf(":");
        return o.substring(0, i1).hashCode();
    }
};

final List<String> list = Lists.newArrayList("a1:v1", "a2:v2");
System.out.println(IteratorUtils.matchesAny(list.iterator(), new EqualPredicate<>("a2:v1", equator)));

Using Lucene to count results in categories

I am trying to use Lucene Java 2.3.2 to implement search on a catalog of products. Apart from the regular fields for a product, there is a field called 'Category'. A product can fall in multiple categories. Currently, I use FilteredQuery to search for the same search term with every Category to get the number of results per category.
This results in 20-30 internal search calls per query to display the results. This is slowing down the search considerably. Is there a faster way of achieving the same result using Lucene?
Here's what I did, though it's a bit heavy on memory:
What you need is to create in advance a bunch of BitSets, one for each category, containing the doc id of all the documents in a category. Now, on search time you use a HitCollector and check the doc ids against the BitSets.
Here's the code to create the bit sets:
public BitSet[] getBitSets(IndexSearcher indexSearcher, Category[] categories) {
    BitSet[] bitSets = new BitSet[categories.length];
    for (int i = 0; i < categories.length; i++) {
        Query query = categories[i].getQuery();
        final BitSet bitSet = new BitSet();
        indexSearcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                bitSet.set(doc);
            }
        });
        bitSets[i] = bitSet;
    }
    return bitSets;
}
This is just one way to do this. You could probably use TermDocs instead of running a full search if your categories are simple enough, but this should only run once when you load the index anyway.
Now, when it's time to count categories of search results you do this:
public int[] getCategoryCount(IndexSearcher indexSearcher,
                              Query query,
                              final BitSet[] bitSets) {
    final int[] count = new int[bitSets.length];
    indexSearcher.search(query, new HitCollector() {
        public void collect(int doc, float score) {
            for (int i = 0; i < bitSets.length; i++) {
                if (bitSets[i].get(doc)) count[i]++;
            }
        }
    });
    return count;
}
What you end up with is an array containing the count of every category within the search results. If you also need the search results, you should add a TopDocCollector to your hit collector (yo dawg...). Or, you could just run the search again. 2 searches are better than 30.
I don't have enough reputation to comment (!) but in Matt Quail's answer I'm pretty sure you could replace this:
int numDocs = 0;
td.seek(terms);
while (td.next()) {
    numDocs++;
}
with this:
int numDocs = terms.docFreq();
and then get rid of the td variable altogether. This should make it even faster.
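The counting step inside the loop of that answer (shown below) would then collapse to:

int numDocs = terms.docFreq();
System.out.println(currentTerm.field() + " : " + currentTerm.text() + " --> " + numDocs);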
You may want to consider looking through all the documents that match categories using a TermDocs iterator.
This example code goes through each "Category" term, and then counts the number of documents that match that term.
public static void countDocumentsInCategories(IndexReader reader) throws IOException {
    TermEnum terms = null;
    TermDocs td = null;
    try {
        terms = reader.terms(new Term("Category", ""));
        td = reader.termDocs();
        do {
            Term currentTerm = terms.term();
            if (!currentTerm.field().equals("Category")) {
                break;
            }
            int numDocs = 0;
            td.seek(terms);
            while (td.next()) {
                numDocs++;
            }
            System.out.println(currentTerm.field() + " : " + currentTerm.text() + " --> " + numDocs);
        } while (terms.next());
    } finally {
        if (td != null) td.close();
        if (terms != null) terms.close();
    }
}
This code should run reasonably fast even for large indexes.
Here is some code that tests that method:
public static void main(String[] args) throws Exception {
    RAMDirectory store = new RAMDirectory();
    IndexWriter w = new IndexWriter(store, new StandardAnalyzer());
    addDocument(w, 1, "Apple", "fruit", "computer");
    addDocument(w, 2, "Orange", "fruit", "colour");
    addDocument(w, 3, "Dell", "computer");
    addDocument(w, 4, "Cumquat", "fruit");
    w.close();
    IndexReader r = IndexReader.open(store);
    countDocumentsInCategories(r);
    r.close();
}

private static void addDocument(IndexWriter w, int id, String name, String... categories) throws IOException {
    Document d = new Document();
    d.add(new Field("ID", String.valueOf(id), Field.Store.YES, Field.Index.UN_TOKENIZED));
    d.add(new Field("Name", name, Field.Store.NO, Field.Index.UN_TOKENIZED));
    for (String category : categories) {
        d.add(new Field("Category", category, Field.Store.NO, Field.Index.UN_TOKENIZED));
    }
    w.addDocument(d);
}
Sachin, I believe you want faceted search. It does not come out of the box with Lucene. I suggest you try Solr, which has faceting as a major and convenient feature.
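For reference, faceting in Solr is a single request with facet parameters. A sketch, assuming a core named products and the Category field from the question:

http://localhost:8983/solr/products/select?q=searchTerm&facet=true&facet.field=Category

The response then carries one count per Category value alongside the normal results, which is exactly the 20-30 per-category numbers here, computed in one pass.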
So let me see if I understand the question correctly: Given a query from the user, you want to show how many matches there are for the query in each category. Correct?
Think of it like this: your query is actually originalQuery AND (category1 OR category2 OR ...), except that as well as an overall score, you want a count for each of the categories. Unfortunately the interface for collecting hits in Lucene is very narrow, only giving you an overall score for a query. But you could implement a custom Scorer/Collector.
Have a look at the source for org.apache.lucene.search.DisjunctionSumScorer. You could copy some of that to write a custom scorer that iterates through category matches while your main search is going on. And you could keep a Map<String,Long> to keep track of matches in each category.
