Using Lucene to count results in categories


I am trying to use Lucene Java 2.3.2 to implement search on a catalog of products. Apart from the regular fields for a product, there is a field called 'Category'. A product can fall into multiple categories. Currently, I use FilteredQuery to run the same search term against every Category to get the number of results per category.
This results in 20-30 internal search calls per query to display the results. This is slowing down the search considerably. Is there a faster way of achieving the same result using Lucene?

Here's what I did, though it's a bit heavy on memory:
What you need is to create in advance a bunch of BitSets, one for each category, containing the doc ids of all the documents in that category. Then, at search time, you use a HitCollector and check the doc ids against the BitSets.
Here's the code to create the bit sets:
public BitSet[] getBitSets(IndexSearcher indexSearcher,
                           Category[] categories) throws IOException {
    BitSet[] bitSets = new BitSet[categories.length];
    for (int i = 0; i < categories.length; i++) {
        Query query = categories[i].getQuery();
        // One bit per doc id; set for every document in this category
        final BitSet bitSet = new BitSet();
        indexSearcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                bitSet.set(doc);
            }
        });
        bitSets[i] = bitSet;
    }
    return bitSets;
}
This is just one way to do this. You could probably use TermDocs instead of running a full search if your categories are simple enough, but this should only run once when you load the index anyway.
Now, when it's time to count categories of search results you do this:
public int[] getCategoryCount(IndexSearcher indexSearcher,
                              Query query,
                              final BitSet[] bitSets) throws IOException {
    final int[] count = new int[bitSets.length];
    indexSearcher.search(query, new HitCollector() {
        public void collect(int doc, float score) {
            // A hit counts towards every category whose bit set contains it
            for (int i = 0; i < bitSets.length; i++) {
                if (bitSets[i].get(doc)) count[i]++;
            }
        }
    });
    return count;
}
What you end up with is an array containing the count of every category within the search results. If you also need the search results, you should add a TopDocCollector to your hit collector (yo dawg...). Or, you could just run the search again. 2 searches are better than 30.
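To see the counting step on its own, here is a minimal sketch that leaves Lucene out entirely. It assumes the matching doc ids are already available as an int array; the class CategoryCounter and the method countPerCategory are made-up names for illustration, and in the real code the inner loop is what runs inside HitCollector.collect.

```java
import java.util.BitSet;

// Hypothetical helper, not part of Lucene: mimics what collect() does for each hit.
class CategoryCounter {
    /** For each category bit set, counts how many of the hit doc ids it contains. */
    static int[] countPerCategory(int[] hitDocIds, BitSet[] categoryBitSets) {
        int[] counts = new int[categoryBitSets.length];
        for (int doc : hitDocIds) {
            for (int i = 0; i < categoryBitSets.length; i++) {
                if (categoryBitSets[i].get(doc)) {
                    counts[i]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        BitSet fruit = new BitSet();
        fruit.set(0); fruit.set(1); fruit.set(3);   // docs 0, 1, 3 are in "fruit"
        BitSet computer = new BitSet();
        computer.set(0); computer.set(2);           // docs 0, 2 are in "computer"

        int[] hits = {0, 2, 3};                     // doc ids matched by the user's query
        int[] counts = countPerCategory(hits, new BitSet[] {fruit, computer});
        System.out.println(counts[0] + " " + counts[1]); // prints "2 2"
    }
}
```

The cost per hit is one bit lookup per category, which is why this stays cheap even with 20-30 categories.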

I don't have enough reputation to comment (!) but in Matt Quail's answer I'm pretty sure you could replace this:
int numDocs = 0;
td.seek(terms);
while (td.next()) {
numDocs++;
}
with this:
int numDocs = terms.docFreq();
and then get rid of the td variable altogether. This should make it even faster.

You may want to consider looking through all the documents that match categories using a TermDocs iterator.
This example code goes through each "Category" term, and then counts the number of documents that match that term.
public static void countDocumentsInCategories(IndexReader reader) throws IOException {
    TermEnum terms = null;
    TermDocs td = null;
    try {
        // Position the enumeration at the first term of the "Category" field
        terms = reader.terms(new Term("Category", ""));
        td = reader.termDocs();
        do {
            Term currentTerm = terms.term();
            // Stop when the enumeration is exhausted or has moved on to another field
            if (currentTerm == null || !currentTerm.field().equals("Category")) {
                break;
            }
            int numDocs = 0;
            td.seek(terms);
            while (td.next()) {
                numDocs++;
            }
            System.out.println(currentTerm.field() + " : " + currentTerm.text() + " --> " + numDocs);
        } while (terms.next());
    } finally {
        if (td != null) td.close();
        if (terms != null) terms.close();
    }
}
This code should run reasonably fast even for large indexes.
Here is some code that tests that method:
public static void main(String[] args) throws Exception {
RAMDirectory store = new RAMDirectory();
IndexWriter w = new IndexWriter(store, new StandardAnalyzer());
addDocument(w, 1, "Apple", "fruit", "computer");
addDocument(w, 2, "Orange", "fruit", "colour");
addDocument(w, 3, "Dell", "computer");
addDocument(w, 4, "Cumquat", "fruit");
w.close();
IndexReader r = IndexReader.open(store);
countDocumentsInCategories(r);
r.close();
}
private static void addDocument(IndexWriter w, int id, String name, String... categories) throws IOException {
Document d = new Document();
d.add(new Field("ID", String.valueOf(id), Field.Store.YES, Field.Index.UN_TOKENIZED));
d.add(new Field("Name", name, Field.Store.NO, Field.Index.UN_TOKENIZED));
for (String category : categories) {
d.add(new Field("Category", category, Field.Store.NO, Field.Index.UN_TOKENIZED));
}
w.addDocument(d);
}

Sachin, I believe you want faceted search. It does not come out of the box with Lucene. I suggest you try Solr, which has faceting as a major and convenient feature.

So let me see if I understand the question correctly: Given a query from the user, you want to show how many matches there are for the query in each category. Correct?
Think of it like this: your query is actually originalQuery AND (category1 OR category2 OR ...), except that, as well as an overall score, you want a count for each of the categories. Unfortunately the interface for collecting hits in Lucene is very narrow and only gives you an overall score for a query. But you could implement a custom Scorer/Collector.
Have a look at the source for org.apache.lucene.search.DisjunctionSumScorer. You could copy some of that to write a custom scorer that iterates through category matches while your main search is going on. And you could keep a Map<String,Long> to keep track of matches in each category.

Related

How do I search partial words in Lucene when using MultiFieldQueryParser?

public SearchResult search(String queryStr, SortBy sortBy, int maxCount)
throws ParseException, IOException {
String[] fields = {Indexer.TITLE_FIELD_NAME, Indexer.REVIEW_FIELD_NAME, "name"};
QueryParser parser = new MultiFieldQueryParser(fields, analyzer);
Query query = parser.parse(queryStr);
Sort sort = null;
if (sortBy != null) {
sort = sortBy.sort;
}
return searchAfter(null, query, sort, maxCount);
}
The method above returns results, but only when I search for a whole word. If I search for part of a word, it doesn't work.

By default MultiFieldQueryParser (and QueryParser, which this class inherits from) looks for the whole words you are searching for, but it can also generate wildcard queries. The word "elephant" can be matched by elep*, elep?ant (? matches a single character) or ele*nt. You can also use fuzzy queries, like elechant~.
You can read the whole syntax specification here: http://lucene.apache.org/core/7_5_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html (below the class list).
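To make the wildcard semantics concrete, here is a small sketch that translates such a pattern into a java.util.regex expression and tests it against a word. This only illustrates what * (any run of characters, including none) and ? (exactly one character) mean; it is not how Lucene evaluates wildcard queries, and WildcardDemo and wildcardMatches are made-up names.

```java
import java.util.regex.Pattern;

// Illustration only: Lucene matches wildcards against indexed terms, not via regex.
class WildcardDemo {
    /** Returns true if the whole word matches the wildcard pattern. */
    static boolean wildcardMatches(String wildcard, String word) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");          // zero or more characters
            } else if (c == '?') {
                regex.append('.');           // exactly one character
            } else {
                regex.append(Pattern.quote(String.valueOf(c))); // literal character
            }
        }
        return Pattern.matches(regex.toString(), word);
    }

    public static void main(String[] args) {
        System.out.println(wildcardMatches("elep*", "elephant"));    // true
        System.out.println(wildcardMatches("elep?ant", "elephant")); // true
        System.out.println(wildcardMatches("ele*nt", "elephant"));   // true
        System.out.println(wildcardMatches("eleph", "elephant"));    // false
    }
}
```

So a user who types only part of a word needs a trailing * (for example elep*) for the query to match.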

What is the better approach for solving Restrictions.in with large lists?

It has been established that when you use Hibernate's Restrictions.in(String property, List list), you have to limit the size of list.
This is because the database server might not be able to handle long queries.
Aside from adjusting the configuration of the database server, here are the solutions I found:
SOLUTION 1: Split the list into smaller ones and then add the smaller lists separately into several Restrictions.in
public List<Something> findSomething(List<String> subCdList) {
Criteria criteria = getSession().createCriteria(getEntityClass());
//if size of list is greater than 1000, split it into smaller lists. See List<List<String>> cdList
if(subCdList.size() > 1000) {
List<List<String>> cdList = new ArrayList<List<String>>();
List<String> tempList = new ArrayList<String>();
Integer counter = 0;
for(Integer i = 0; i < subCdList.size(); i++) {
tempList.add(subCdList.get(i));
counter++;
if(counter == 1000) {
counter = 0;
cdList.add(tempList);
tempList = new ArrayList<String>();
}
}
if(tempList.size() > 0) {
cdList.add(tempList);
}
Criterion criterion = null;
//Iterate the list of lists, add the restriction for smaller list
for(List<String> cds : cdList) {
if (criterion == null) {
criterion = Restrictions.in("subCd", cds);
} else {
criterion = Restrictions.or(criterion, Restrictions.in("subCd", cds));
}
}
criteria.add(criterion);
} else {
criteria.add(Restrictions.in("subCd", subCdList));
}
return criteria.list();
}
This is an okay solution since you will only have one select statement. However, I think it's a bad idea to have for loops on the DAO layer because we do not want the connection to be open for a long time.
SOLUTION 2: Use DetachedCriteria. Instead of passing the list, query it on the WHERE clause.
public List<Something> findSomething() {
Criteria criteria = getSession().createCriteria(getEntityClass());
DetachedCriteria detached = DetachedCriteria.forClass(DifferentClass.class);
detached.setProjection(Projections.property("cd"));
criteria.add(Property.forName("subCd").in(detached));
return criteria.list();
}
The problem with this solution is the way DetachedCriteria is used. You usually use it when you want to query another class that is not connected to (has no relationship with) your current class. In this example, Something.class has a property subCd that is a foreign key from DifferentClass. Also, this produces a subquery in the WHERE clause.
When you look at the code:
1. SOLUTION 2 is simpler and concise.
2. But SOLUTION 1 offers a query with only one select.
Please help me decide which one is more efficient.
Thanks.

For Solution 1: instead of writing the for loops inline, you can try the following.
Use a utility method that builds the IN criterion in chunks whenever the number of parameter values exceeds a limit (PARAMETER_LIMIT, set to 800 here to stay below 1000).
class HibernateBuildCriteria {
private static final int PARAMETER_LIMIT = 800;
public static Criterion buildInCriterion(String propertyName, List<?> values) {
Criterion criterion = null;
int listSize = values.size();
for (int i = 0; i < listSize; i += PARAMETER_LIMIT) {
List<?> subList;
if (listSize > i + PARAMETER_LIMIT) {
subList = values.subList(i, (i + PARAMETER_LIMIT));
} else {
subList = values.subList(i, listSize);
}
if (criterion != null) {
criterion = Restrictions.or(criterion, Restrictions.in(propertyName, subList));
} else {
criterion = Restrictions.in(propertyName, subList);
}
}
return criterion;
}
}
Using the method:
criteria.add(HibernateBuildCriteria.buildInCriterion(propertyName, list));
hope this helps.

Solution 1 has one major drawback: you may end up with a lot of different prepared statements, each of which needs to be parsed and needs an execution plan to be calculated and cached. This can be much more expensive than actually executing a query whose statement the database has already cached. Please see this question for more details.
The way I solve this is to use the algorithm Hibernate uses for batch fetching of lazily loaded associated entities. Basically, I use ArrayHelper.getBatchSizes to get the sublists of ids and then I execute a separate query for each sublist.
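As a rough sketch of the "one query per sublist" idea, the helper below splits a list of ids into fixed-size batches using only the JDK. It deliberately does not reproduce ArrayHelper.getBatchSizes (whose sizing strategy is not shown here); IdBatcher and partition are hypothetical names, and fixed-size chunking is a simpler stand-in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: plain fixed-size chunking, not Hibernate's batch-size algorithm.
class IdBatcher {
    /** Splits ids into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> ids, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<List<T>>();
        for (int from = 0; from < ids.size(); from += batchSize) {
            int to = Math.min(from + batchSize, ids.size());
            // Copy the view so each batch is independent of the source list
            batches.add(new ArrayList<T>(ids.subList(from, to)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(partition(ids, 2)); // prints "[[1, 2], [3, 4], [5]]"
    }
}
```

Each batch would then go into its own Restrictions.in query and the results would be concatenated. Note that with fixed-size chunks only the last batch varies in size, so the number of distinct statement shapes stays small.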
Solution 2 is appropriate only if you can project ids in a subquery. But if you can't, then you can't use it. For example, the user of your app edited 20 entities on a screen and now they are saving the changes. You have to read the entities by ids to merge the changes and you cannot express it in a subquery.
However, an alternative approach to solution 2 could be to use temporary tables. For example Hibernate does it sometimes for bulk operations. You can store your ids in the temporary table and then use them in the subquery. I personally consider this to be an unnecessary complication compared to the solution 1 (for this use case of course; Hibernate's reasoning is good for their use case), but it is a valid alternative.

Android - How to combine two ArrayLists

I have got two ArrayLists, created from parsed html. First one contains jobs and is like
Job A
Job B
Job C
and the second one is like
Company A
Company B
Company C
What I need is a combination of Job A and Company A and so on, so I can get results like (an ArrayList would be great too)
Job A : Company A
Job B : Company B
Job C : Company C
I didn't find a clear tutorial or anything similar. Any ideas?

Are you sure you are looking at the correct data structure to achieve this?
Why not use a Map? You can define a key/value relationship going this route.
Map<Company, Job> jobMap = new HashMap<Company, Job>();
jobMap.put("Company A" /* or corresponding list item */, "Job A" /* or corresponding list item */);
You may even do something like this (swap out the strings to fit your implementation):
Map<Company, List<Job>> jobMap...;
List<Job> jobList = new ArrayList<Job>();
jobList.add("Job A");
jobList.add("Job B");
jobList.add("Job C");
jobMap.put("Company A", jobList);
This defines a company as your key and lets you attach multiple jobs to a company.
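Here is a minimal, self-contained sketch of that company-to-jobs map, built from two parallel lists of plain strings. It assumes both lists have the same length and that position i in one corresponds to position i in the other; JobGrouping and groupJobsByCompany are made-up names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper using String keys and values in place of Company/Job classes.
class JobGrouping {
    /** Groups jobs under their company, keeping companies in first-seen order. */
    static Map<String, List<String>> groupJobsByCompany(List<String> companies, List<String> jobs) {
        if (companies.size() != jobs.size()) {
            throw new IllegalArgumentException("Mismatch of jobs and companies");
        }
        Map<String, List<String>> byCompany = new LinkedHashMap<String, List<String>>();
        for (int i = 0; i < companies.size(); i++) {
            String company = companies.get(i);
            List<String> companyJobs = byCompany.get(company);
            if (companyJobs == null) {
                // First job seen for this company: start its list
                companyJobs = new ArrayList<String>();
                byCompany.put(company, companyJobs);
            }
            companyJobs.add(jobs.get(i));
        }
        return byCompany;
    }

    public static void main(String[] args) {
        List<String> companies = Arrays.asList("Company A", "Company B", "Company A");
        List<String> jobs = Arrays.asList("Job A", "Job B", "Job C");
        System.out.println(groupJobsByCompany(companies, jobs));
        // prints "{Company A=[Job A, Job C], Company B=[Job B]}"
    }
}
```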

if (jobs.size() != companies.size()) {
    throw new IllegalArgumentException("Mismatch of jobs and companies");
}
for (int i = 0; i < jobs.size(); i++) {
    combine(jobs.get(i), companies.get(i));
}
There are lots of ways to combine references between two kinds of objects. Here's a flexible example that will let you use one to look up the other. It's overkill if you know which you'd always be using to do the lookup. Using LinkedHashMap also preserves the insertion order. So if you decide to put them in B, C, A order, you can get them out in B, C, A order.
LinkedHashMap<Job, Company> jobToCompany = new LinkedHashMap<>();
LinkedHashMap<Company, Job> companyToJob = new LinkedHashMap<>();
private void combine(Job job, Company company) {
jobToCompany.put(job, company);
companyToJob.put(company, job);
}

If you really want to store the combined values in an ArrayList then the following code will work for you:
List<String> jobs = new ArrayList<>();
List<String> companies = new ArrayList<>();
List<String> mergedList = new ArrayList<>();
//assuming the value are populated for `jobs` and `companies`.
if(jobs.size() == companies.size()) {
int n = jobs.size();
for(int index=0; index<n; index++)
{
mergedList.add(jobs.get(index) + " : " + companies.get(index));
}
} else {
System.out.println("Cannot combine");
//Throw exception or take any action you need.
}
Keep in mind that searching for an item in the merged list is O(n), but I assume you are aware of that before deciding to go with an ArrayList.

If you're not willing to use a Map (not sure why you wouldn't), my approach would be to create another class (let's call it CompanyJob) that contains both a Company and a Job attribute, then simply keep a collection of your CompanyJob instances (an ArrayList would do).
class CompanyJob {
    private Company mCompany;
    private Job mJob;

    public CompanyJob(Company com, Job job) {
        mCompany = com;
        mJob = job;
    }
    // Some logic ...
}
// Then your list
private ArrayList<CompanyJob> yourList = new ArrayList<>();
int i = 0;
for (Company tmpCom : companyList) {
    yourList.add(new CompanyJob(tmpCom, jobList.get(i)));
    i++;
}

You need to create a new one
List<String> third = new ArrayList<String>();
Also need a counter.
int position = 0;
Then iterate through the first list (assuming both lists have the same size).
for (String item : firstList) {
    third.add(item + " : " + secondList.get(position));
position ++;
}
Then the third will have the desired result.
To confirm:
for (String item:third){
//try to print "item" here
}

java - building a new array to organize data based off of common index

This is a simple issue; I am just having trouble finding the right terms to google. I have some Java code that loops over some data, and I end up with two pieces of information: an int id and an int quantity.
However, sometimes the ids are the same, and I want to combine the quantities if they are, rather than having new entries in an array.
In PHP, I would do this as such (assume $products is an array with lots of id/quant data, of course):
$newArray = array();
foreach($products as $id => $quant){
if(array_key_exists($id, $newArray)){
$newArray[ $id ] += $quant;
} else {
$newArray[ $id ] = $quant;
}
}
I'm trying to do this in Java but nothing I find seems to help.

Use HashMap:
1. Get the id
2. See if the id is present in the map
3. if not
insert (id, quantity)
else (i.e. if present)
quantity = hashmap.get(id);
quantity = quantity + new_quantity
hashmap.put(id, quantity);
Helps?
There are many approaches, hashmaps take more memory. You can do this with 2 arrays as well, but then you will spend more time searching through the array.
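The steps above can be sketched in Java as follows. The input format (an array of {id, quantity} pairs) and the names QuantityMerger and mergeQuantities are assumptions made for the example; adapt them to however your loop produces the two ints.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: sums quantities that share the same id.
class QuantityMerger {
    /** Each element of idQuantityPairs is {id, quantity}. */
    static Map<Integer, Integer> mergeQuantities(int[][] idQuantityPairs) {
        Map<Integer, Integer> totals = new HashMap<Integer, Integer>();
        for (int[] pair : idQuantityPairs) {
            int id = pair[0];
            int quantity = pair[1];
            Integer existing = totals.get(id);   // null if the id is not in the map yet
            if (existing == null) {
                totals.put(id, quantity);
            } else {
                totals.put(id, existing + quantity);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        int[][] products = { {1, 5}, {2, 3}, {1, 2} };
        Map<Integer, Integer> totals = mergeQuantities(products);
        System.out.println(totals.get(1)); // prints "7"
        System.out.println(totals.get(2)); // prints "3"
    }
}
```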

Use a map implementation, like HashMap.
More or less; adjust types as necessary:
Map<Integer, Integer> map = new HashMap<Integer, Integer>();
for (Product p : products) {
    if (map.containsKey(p.getId())) {
        map.put(p.getId(), map.get(p.getId()) + p.getQuant());
    } else {
        map.put(p.getId(), p.getQuant());
    }
}
Slightly cleaner to keep the mainline code readable:
// Mainline code
Map<Integer, Integer> map = new HashMap<Integer, Integer>();
for (Product p : products) {
    putOrAddQuant(map, p);
}
// Extracted helper
public void putOrAddQuant(Map<Integer, Integer> map, Product p) {
    if (map.containsKey(p.getId())) {
        map.put(p.getId(), map.get(p.getId()) + p.getQuant());
    } else {
        map.put(p.getId(), p.getQuant());
    }
}

Partially match strings in case of List.contains(String)

I have a List<String>
List<String> list = new ArrayList<String>();
list.add("ABCD");
list.add("EFGH");
list.add("IJ KL");
list.add("M NOP");
list.add("UVW X");
If I do list.contains("EFGH"), it returns true.
Can I get a true in case of list.contains("IJ")? I mean, can I partially match strings to find if they exist in the list?
I have a list of 15000 strings. And I have to check about 10000 strings if they exist in the list. What could be some other (faster) way to do this?
Thanks.

If the suggestion from Roadrunner-EX does not suffice, I believe you are looking for the Knuth–Morris–Pratt (KMP) algorithm.
Time complexity:
Building the table (preprocessing the pattern) is O(k).
Searching a text with that table is O(n).
So the complexity of the overall algorithm is O(n + k).
n = length of the text being searched (here, the characters of the list entries)
k = length of the pattern you are searching for
A naive brute-force search is O(nk) in the worst case.
Moreover, the KMP table depends only on the search string, so it is built once and reused for every entry in the list, whereas brute force can hit its O(nk) worst case on every search.
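For reference, here is a compact sketch of KMP in Java: one method builds the table for a pattern, the other uses it to find the first occurrence in a text. The class and method names (Kmp, buildTable, indexOf) are chosen for this example. For short strings like those in the question, String.contains is likely fast enough, so treat this as an illustration rather than a guaranteed speedup.

```java
// Sketch of Knuth-Morris-Pratt substring search.
class Kmp {
    /**
     * table[i] is the length of the longest proper prefix of pattern[0..i]
     * that is also a suffix of pattern[0..i].
     */
    static int[] buildTable(String pattern) {
        int[] table = new int[pattern.length()];
        int len = 0; // length of the current matched prefix
        for (int i = 1; i < pattern.length(); i++) {
            while (len > 0 && pattern.charAt(i) != pattern.charAt(len)) {
                len = table[len - 1]; // fall back to a shorter prefix
            }
            if (pattern.charAt(i) == pattern.charAt(len)) {
                len++;
            }
            table[i] = len;
        }
        return table;
    }

    /** Returns the index of the first occurrence of pattern in text, or -1. */
    static int indexOf(String text, String pattern, int[] table) {
        if (pattern.isEmpty()) {
            return 0;
        }
        int matched = 0; // number of pattern characters matched so far
        for (int i = 0; i < text.length(); i++) {
            while (matched > 0 && text.charAt(i) != pattern.charAt(matched)) {
                matched = table[matched - 1]; // reuse the partial match, never re-read text
            }
            if (text.charAt(i) == pattern.charAt(matched)) {
                matched++;
            }
            if (matched == pattern.length()) {
                return i - matched + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] table = buildTable("NOP");                   // built once per search string
        System.out.println(indexOf("M NOP", "NOP", table)); // prints "2"
        System.out.println(indexOf("ABCD", "NOP", table));  // prints "-1"
    }
}
```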

Perhaps you want to put each String fragment into a HashSet. By fragment, I mean don't add "IJ KL" but rather add "IJ" and "KL" separately. If you need both the list and this search capability, you may need to maintain two collections.
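A minimal sketch of that idea, assuming fragments are separated by whitespace (the question's data suggests this, but it is an assumption); FragmentIndex and buildFragmentSet are made-up names. Note that this only finds whole fragments such as "IJ", not arbitrary substrings such as "NO".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: indexes whitespace-separated fragments for O(1) average lookups.
class FragmentIndex {
    static Set<String> buildFragmentSet(List<String> strings) {
        Set<String> fragments = new HashSet<String>();
        for (String s : strings) {
            for (String fragment : s.trim().split("\\s+")) {
                if (!fragment.isEmpty()) {
                    fragments.add(fragment);
                }
            }
        }
        return fragments;
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("ABCD", "EFGH", "IJ KL", "M NOP", "UVW X");
        Set<String> fragments = buildFragmentSet(list);
        System.out.println(fragments.contains("IJ"));    // prints "true"
        System.out.println(fragments.contains("IJ KL")); // prints "false"
    }
}
```

With 15000 strings indexed once, each of the 10000 lookups is then a single hash probe instead of a scan of the whole list.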

As a second answer, upon rereading your question: you could also extend ArrayList<String> (a List specialized for Strings) and override the contains() method.
public class PartialStringList extends ArrayList<String>
{
    @Override
    public boolean contains(Object o)
    {
        if (!(o instanceof String))
        {
            return false;
        }
        String s = (String) o;
        // True if any element contains s as a substring
        Iterator<String> iter = iterator();
        while (iter.hasNext())
        {
            String iStr = iter.next();
            if (iStr.contains(s))
            {
                return true;
            }
        }
        return false;
    }
}
Judging by your earlier comments, this is maybe not the speed you're looking for, but is this more similar to what you were asking for?

You could use IterableUtils from Apache Commons Collections.
List<String> list = new ArrayList<String>();
list.add("ABCD");
list.add("EFGH");
list.add("IJ KL");
list.add("M NOP");
list.add("UVW X");
boolean hasString = IterableUtils.contains(list, "IJ", new Equator<String>() {
    @Override
    public boolean equate(String o1, String o2) {
        return o2.contains(o1);
    }
    @Override
    public int hash(String o) {
        return o.hashCode();
    }
});
System.out.println(hasString); // true

You can iterate over the list, and then call contains() on each String.
public boolean listContainsString(List<String> list, String checkStr)
{
    Iterator<String> iter = list.iterator();
    while (iter.hasNext())
    {
        String s = iter.next();
        if (s.contains(checkStr))
        {
            return true;
        }
    }
    return false;
}
Something like that should work, I think.

How about:
java.util.List<String> list = new java.util.ArrayList<String>();
list.add("ABCD");
list.add("EFGH");
list.add("IJ KL");
list.add("M NOP");
list.add("UVW X");
java.util.regex.Pattern p = java.util.regex.Pattern.compile("IJ");
java.util.regex.Matcher m = p.matcher("");
for(String s : list)
{
m.reset(s);
if(m.find()) System.out.println("Partially Matched");
}

Here's some code that uses a regex to shortcut the inner loop if none of the test Strings are found in the target String.
public static void main(String[] args) throws Exception {
List<String> haystack = Arrays.asList(new String[] { "ABCD", "EFGH", "IJ KL", "M NOP", "UVW X" });
List<String> needles = Arrays.asList(new String[] { "IJ", "NOP" });
// To cut down on iterations, create one big regex to check the whole haystack
StringBuilder sb = new StringBuilder();
sb.append(".*(");
for (String needle : needles) {
sb.append(needle).append('|');
}
sb.replace(sb.length() - 1, sb.length(), ").*");
String regex = sb.toString();
for (String target : haystack) {
if (!target.matches(regex)) {
System.out.println("Skipping " + target);
continue;
}
for (String needle : needles) {
if (target.contains(needle)) {
System.out.println(target + " contains " + needle);
}
}
}
}
Output:
Skipping ABCD
Skipping EFGH
IJ KL contains IJ
M NOP contains NOP
Skipping UVW X
If you really want to get cute, you could use a binary search to identify which segments of the target list match, but it mightn't be worth it.
It depends on how likely it is that you'll find a hit. Low hit rates will give a good result. High hit rates will perform not much better than the simple nested loop version. Consider inverting the loops if some needles hit many targets and others hit none.
It's all about aborting a search path ASAP.

Yes, you can! Sort of.
What you are looking for is often called fuzzy searching or approximate string matching, and there are several solutions to this problem.
With the FuzzyWuzzy lib, for example, you can have all your strings assigned a score based on how similar they are to a particular search term. The actual values seem to be integer percentages of the number of characters matching with regards to the search string length.
After invoking FuzzySearch.extractAll, it is up to you to decide what the minimum score would be for a string to be considered a match.
There are also other, similar libraries worth checking out, like google-diff-match-patch or the Apache Commons Text Similarity API, and so on.
If you need something really heavy-duty, your best bet would probably be Lucene (as also mentioned by Ryan Shillington).

This is not a direct answer to the given problem, but I guess it will help someone who wants to partially compare a given value against the elements of a list using Apache Commons Collections.
final Equator<String> equator = new Equator<String>() {
    @Override
    public boolean equate(String o1, String o2) {
        // Compare only the part before the last ':'
        final int i1 = o1.lastIndexOf(":");
        final int i2 = o2.lastIndexOf(":");
        return o1.substring(0, i1).equals(o2.substring(0, i2));
    }
    @Override
    public int hash(String o) {
        final int i1 = o.lastIndexOf(":");
        return o.substring(0, i1).hashCode();
    }
};
final List<String> list = Lists.newArrayList("a1:v1", "a2:v2");
System.out.println(IteratorUtils.matchesAny(list.iterator(), new EqualPredicate<String>("a2:v1", equator)));
