Solr: search excludes bigger phrases - java

For example, I have 3 documents:
1. "dog cat a ball"
2. "dog the cat of balls"
3. "dog the cat, ball and elephant"
So, by querying "dog AND cat AND ball" I want to receive only the first two documents.
The main idea is that I want the results to include only the words I requested.
I'd appreciate any advice.
Thank you.

Well, if you store your TermVector (when creating a Field, before adding the Document to the index, use TermVector.YES), it can be done by overriding a Collector. Here is a simple implementation (that returns only the documents, without scores):
private static class MyCollector extends Collector {
    private IndexReader ir;
    private int numberOfTerms;
    private int docBase;
    private Set<Integer> set = new HashSet<Integer>();

    public MyCollector(IndexReader ir, int numberOfTerms) {
        this.ir = ir;
        this.numberOfTerms = numberOfTerms;
    }

    @Override
    public void setScorer(Scorer scorer) throws IOException { } // we do not use a scorer in this example

    @Override
    public void setNextReader(IndexReader reader, int docBase) {
        this.docBase = docBase; // remember the segment offset so collect() can map to top-level doc IDs
    }

    @Override
    public void collect(int doc) throws IOException {
        int topLevelDoc = docBase + doc; // collect() receives segment-relative doc IDs
        TermFreqVector vector = ir.getTermFreqVector(topLevelDoc, CONTENT_FIELD);
        // CONTENT_FIELD is the name of the field you are searching in...
        if (vector != null) {
            if (vector.getTerms().length == numberOfTerms) {
                set.add(topLevelDoc);
            }
        } else {
            set.add(topLevelDoc); // well, assume it doesn't happen, because you stored your TermVectors.
        }
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true;
    }

    public Set<Integer> getSet() {
        return set;
    }
}
Now, use IndexSearcher#search(Query, Collector).
The idea is: you know how many terms should be in the document if it is to be accepted, so you just verify it and collect only the documents that match this rule. Of course this can be more complex (look for a specific term in the vector, check the order of words in the vector), but this is the general idea.
Actually, if you store your TermVector, you can do almost anything, so just try working with it.
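For completeness, a rough usage sketch assuming a Lucene 3.x setup (which the TermFreqVector API above implies); directory, analyzer and the query string are placeholders:
IndexReader reader = IndexReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
Query query = new QueryParser(Version.LUCENE_36, CONTENT_FIELD, analyzer).parse("dog AND cat AND ball");

MyCollector collector = new MyCollector(reader, 3); // 3 = number of terms the query requires
searcher.search(query, collector);

for (int docId : collector.getSet()) {
    Document hit = searcher.doc(docId); // fetch the stored fields of each accepted document
}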

You may implement a filter factory/tokenizer pair with hashing capabilities.
Use the copyField directive.
Tokenize the terms.
Remove stopwords (as in your example).
Sort the terms in alphanumeric order and save the hash.
Expand the query to also search for the hash, something like:
somestring:"dog AND cat AND ball" AND somehash:"dog AND cat AND ball"
The second part of the search query will be implicitly hashed during query processing.
This will return only exact matches (with a vanishingly small probability of false positives).
P.S. You don't need to store term vectors, which results in a noticeably smaller index.
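If it helps to see the idea outside Solr, here is a minimal sketch of the hashing step only; the stopword list, the normalization and the use of String.hashCode() are illustrative assumptions, not a Solr TokenFilterFactory:
import java.util.*;

public class SortedTermsHash {
    private static final Set<String> STOPWORDS = new HashSet<String>(Arrays.asList("a", "the", "of", "and"));

    public static String hash(String text) {
        List<String> terms = new ArrayList<String>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                terms.add(token); // tokenize and drop stopwords
            }
        }
        Collections.sort(terms); // alphanumeric order makes the hash order-independent
        return Integer.toHexString(String.join(" ", terms).hashCode());
    }

    public static void main(String[] args) {
        System.out.println(hash("dog cat a ball"));        // same term set...
        System.out.println(hash("the ball, cat and dog")); // ...same hash
        System.out.println(hash("dog the cat of balls"));  // different term set, different hash
    }
}
At query time the same normalization is applied to the query string, so only documents whose full (stopword-free) term set equals the query's term set match the hash field.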

Related

Multiple string replacement in a single string generating all possible combinations

I'm trying to replace multiple words in a string with multiple other words. The string is
I have sample {url} with time to {live}
Here the possible values for {url} are
point1
point2
Possible values for {live} are
10
20
The four possible answers are
I have sample point1 with time to 10
I have sample point1 with time to 20
I have sample point2 with time to 10
I have sample point2 with time to 20
The number of placeholders can also increase to three:
I have {sample} {url} with time to {live}
What would be the best data structure and approach to solve this problem?
You can do something like this:
public static void main(String[] args) {
    String inputStr = "I have {sample} {url} with time to {live}";
    Map<String, List<String>> replacers = new HashMap<String, List<String>>() {{
        put("{sample}", Arrays.asList("point1", "point2"));
        put("{live}", Arrays.asList("10", "20"));
        put("{url}", Arrays.asList("url1", "url2", "url3"));
    }};
    for (String variant : stringGenerator(inputStr, replacers)) {
        System.out.println(variant);
    }
}

public static List<String> stringGenerator(String template, Map<String, List<String>> replacers) {
    List<String> out = Arrays.asList(template);
    for (Map.Entry<String, List<String>> replacerEntry : replacers.entrySet()) {
        List<String> tempOut = new ArrayList<>(out.size() * replacerEntry.getValue().size());
        for (String replacerValue : replacerEntry.getValue()) {
            for (String variant : out) {
                tempOut.add(variant.replace(replacerEntry.getKey(), replacerValue));
            }
        }
        out = tempOut;
    }
    return out;
}
You can also try a similar solution with recursion.
You can use a template string and print the combinations using the System.out.format method, like below:
public class Combinations {
    public static void main(String[] args) {
        String template = "I have sample %s with time to %d%n"; // <-- 2-argument case
        String[] points = {"point1", "point2"};
        int[] lives = {10, 20};
        for (String point : points) {
            for (int live : lives) {
                System.out.format(template, point, live);
            }
        }
    }
}
The code solves the 2-argument case, but it can easily be extended to the 3-argument case by substituting the sample word with another %s in the template and using a triple loop.
I'm using the simplest array structures; it is up to you to decide which structure is more suitable for your code.
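As a rough sketch of that extension (the extra samples array is an assumption for illustration):
String template = "I have %s %s with time to %d%n"; // <-- 3-argument case
String[] samples = {"sample1", "sample2"};
String[] points = {"point1", "point2"};
int[] lives = {10, 20};
for (String sample : samples) {
    for (String point : points) {
        for (int live : lives) {
            System.out.format(template, sample, point, live);
        }
    }
}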
Unless you want the hardcoded solution with simple nested loops shown in Dariosicily's answer, you will need to store "replacee-replacements" pairings, for example the string {url} paired with a list of strings point1 and point2. A simple class can do that, like
class StringListPair {
    public final String s;
    public final List<String> l;

    public StringListPair(String s, List<String> l) {
        this.s = s;
        this.l = l;
    }
}
and then a list of replacements can be initialized as
List<StringListPair> mappings = Arrays.asList(
        new StringListPair("{url}", Arrays.asList("point1", "point2")),
        new StringListPair("{live}", Arrays.asList("10", "20", "30")));
(If someone wants to avoid a helper class entirely: these are all strings, so a List<List<String>> can do the job too, with "{url}","point1","point2" lists inside; it is just that then we would have to fight with indexing the inner lists everywhere.)
Then two common approaches pop into my mind: a recursive one, generating all possible combinations in a single run, and a direct-indexing one, numbering all combinations and generating any of them directly upon request. Recursion is simpler to come up with, and it has no significant drawbacks if all the combinations are needed anyway. The direct approach generates a single combination at a time, so if many combinations are not going to be used, it can spare a lot of memory and runtime (for example if someone would need a single randomly selected combination only, out of millions perhaps).
Recursion will be, well, recursive, having a completed combination generated in its deepest level, thus it needs the following:
the list of combinations (because it will be extended deep inside the call-chain)
the mappings
the candidate it is working on at the moment
something to track which label it is supposed to replace at the moment.
Then two things remain: recursion has to stop (when no further labels remain for replacement in the current candidate, it is added to the list), or it has to replace the current label with something, and proceed to the next level.
In code it can look like this:
static void recursive(List<String> result, List<StringListPair> mappings, String sofar, int partindex) {
    if (partindex >= mappings.size()) {
        result.add(sofar);
        return;
    }
    StringListPair p = mappings.get(partindex);
    for (String item : p.l)
        recursive(result, mappings, sofar.replace(p.s, item), partindex + 1);
}
The level is tracked by a simple number, partindex; the current candidate is called sofar (from "so far"). When the index does not refer to an existing element in mappings, the candidate is complete. Otherwise the method loops through the "current" mapping, calling itself with every replacement, well, recursively.
Wrapper function to create and return an actual list:
static List<String> userecursive(List<StringListPair> mappings, String base) {
    List<String> result = new ArrayList<>();
    recursive(result, mappings, base, 0);
    return result;
}
The direct-indexing variant uses some maths. We have 2*3 combinations in the example, numbered 0...5. If we say that these numbers are built from i=0..1 and j=0..2, the expression for that could be index=i+j*2. This can be reversed using modulo and division operations; for example for the last index, index=5: i=5%2=1, j=5//2=2, where % is the modulo operator and // is integer division. The method works for higher "dimensions" too, it just applies the modulo at every step and updates index itself with the division, as the actual code does:
static String direct(List<StringListPair> mappings, String base, int index) {
    for (StringListPair p : mappings) {
        base = base.replace(p.s, p.l.get(index % p.l.size())); // modulo "trick" for the current label
        index /= p.l.size();                                   // integer division throws away the processed label
    }
    return base;
}
Wrapper function (it has a loop to calculate "2*3" at the beginning, and collects combinations in a list):
static List<String> usedirect(List<StringListPair> mappings, String base) {
    int total = 1;
    for (StringListPair p : mappings)
        total *= p.l.size();
    List<String> result = new ArrayList<>();
    for (int i = 0; i < total; i++)
        result.add(direct(mappings, base, i));
    return result;
}
Complete code and demo are on Ideone.

How to override Similarity in a single field in Lucene?

I am using version 4.4 of Apache Lucene.
My system indexes a collection of documents into three different fields: the title, description and author(s) of the documents.
I want a document to get a higher score the more frequently a query term occurs in it. However, when the term is part of the author field, I just want it to act as a "boolean"; that is, to add the same score whether the term appears once or many times. For example, if three authors of a document have the surname "Smith", just one match should be counted.
For this, I have found the following code, which overrides the term frequency:
Similarity sim = new DefaultSimilarity() {
    @Override
    public float tf(float freq) {
        return freq == 0 ? 0 : 1;
    }
};
searcher.setSimilarity(sim);
However, this overrides it for all three fields. How can I override it only for the author field?
You can extend PerFieldSimilarityWrapper, like this:
public class MyCustomSimilarity extends PerFieldSimilarityWrapper {
    @Override
    public Similarity get(String fieldName) {
        if (fieldName.equals("author")) {
            return new CustomAuthorSimilarity();
        } else {
            return new DefaultSimilarity();
        }
    }
}
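CustomAuthorSimilarity can then reuse the tf() override from the question, so that author-field frequency only counts once; a minimal sketch (the class name is just the placeholder used above):
public class CustomAuthorSimilarity extends DefaultSimilarity {
    @Override
    public float tf(float freq) {
        // a term in the author field scores the same whether it appears once or many times
        return freq == 0 ? 0 : 1;
    }
}
Set the wrapper on the IndexSearcher with searcher.setSimilarity(new MyCustomSimilarity()); if you want index-time norms to stay consistent, the same similarity can also be set on the IndexWriterConfig.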

Creating a dictionary: Method to prevent the same word from being added more than once

I need to create a method to determine whether or not the word I'm trying to add to my String[] dictionary has already been added. We were not allowed to use ArrayList for this project, only arrays.
I started out with this
public static boolean dictHasWord(String str) {
    for (int i = 0; i < dictionary.length; i++) {
        if (str.equals(dictionary[i])) {
            return true;
        }
    }
    return false;
}
However, my professor told me not to use this, because it is a linear search, O(n), and is not efficient. What other way could I go about implementing this method?
This is an example of how to search through an array with good readability. I would suggest using this method to search your array.
import java.util.*;

public class Test {
    public static void main(String[] args) {
        String[] list = {"name", "ryan"};
        // returns a boolean here
        System.out.println(Arrays.asList(list).contains("ryan"));
    }
}
If you are allowed to use the Arrays class as part of your assignment, you can sort your array and use a binary search instead, which is not O(n).
public static boolean dictHasWord(String str) {
    // binarySearch returns a negative value (not necessarily -1) when the key is absent
    if (Arrays.binarySearch(dictionary, str) >= 0) {
        return true;
    }
    return false;
}
Just keep in mind you must sort first.
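Hypothetical usage, assuming the dictionary array is fully populated (sorting an array that still contains nulls would fail):
Arrays.sort(dictionary);                                          // sort once up front
boolean present = Arrays.binarySearch(dictionary, "someWord") >= 0; // then each lookup is O(log n)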
EDIT:
Regarding writing your own implementation, here's a sample to get you going. Here are the javadocs for compareTo() as well. Here's another sample (an int-based example) showing the difference between recursive and non-recursive binary search, specifically in Java.
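If the point of the assignment is to write the search yourself rather than call the library, a minimal hand-rolled binary search using compareTo() could look like this (assuming a sorted, fully populated dictionary):
public static boolean dictHasWord(String str) {
    int lo = 0, hi = dictionary.length - 1;
    while (lo <= hi) {
        int mid = (lo + hi) >>> 1;               // unsigned shift avoids overflow on large arrays
        int cmp = dictionary[mid].compareTo(str);
        if (cmp == 0) return true;               // found
        if (cmp < 0) lo = mid + 1;               // target is in the upper half
        else hi = mid - 1;                       // target is in the lower half
    }
    return false;
}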
Although it may be overkill in this case, a hash table would not be O(n).
This uses the fact that every String can be turned into an int via hashCode(), and equal strings will produce the same hash.
Our dictionary can be declared as:
LinkedList<String>[] dictionary;
In other words, several strings may reside in each slot; this is due to possible collisions (different strings producing the same hash).
The simplest solution for addition would be:
public void add(String str)
{
    dictionary[str.hashCode()].add(str);
}
But in order to do this, you would need an array as large as the maximum value hashCode() can take, which is probably too much memory for you. So we can do it a little differently:
public void add(String str)
{
    // Math.floorMod keeps the index non-negative even when hashCode() is negative
    dictionary[Math.floorMod(str.hashCode(), dictionary.length)].add(str);
}
This way we always mod the hash. For best results you should make your dictionary size some prime number, or at least a power of a single prime.
Then when you want to test the existence of the string you do exactly what you had in the original, but you use the specific LinkedList that you get from the hash:
public static boolean dictHasWord(String str)
{
    for (String existing : dictionary[Math.floorMod(str.hashCode(), dictionary.length)])
    {
        if (str.equals(existing)) {
            return true;
        }
    }
    return false;
}
At which point you may ask, "Isn't it O(n)?" The answer is that it is not, since the hash function does not take into consideration the number of elements in the array. The more memory you give your array, the fewer collisions you will have, and the more this approach moves towards O(1).
If somebody finds this answer while searching for a real solution (not a homework assignment): just use a HashMap.
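For reference, outside the assignment's no-collections restriction this is what the built-in hash-based set gives you directly (a HashSet rather than a HashMap is enough when only the words matter):
Set<String> dictionary = new HashSet<String>();
dictionary.add("dog");                        // duplicates are silently ignored
boolean hasWord = dictionary.contains("dog"); // O(1) on average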

Java pattern for parameters of which only one needs to be non-null?

Lately I have often been writing long functions that take several parameters but use only one of them, and whose functionality differs only at a few key points scattered around the function. Splitting the function would create too many small functions without a purpose. Is this good style, or is there a good general refactoring pattern for this? To be more clear, an example:
public void performSearch(DataBase dataBase, List<List<String>> segments) { performSearch(dataBase, null, null, segments); }
public void performSearch(DataBaseCache dataBaseCache, List<List<String>> segments) { performSearch(null, dataBaseCache, null, segments); }
public void performSearch(DataBase dataBase, List<String> keywords) { performSearch(dataBase, null, keywords, null); }
public void performSearch(DataBaseCache dataBaseCache, List<String> keywords) { performSearch(null, dataBaseCache, keywords, null); }

/** Either dataBase or dataBaseCache may be null; dataBaseCache is used if it is non-null, else dataBase is used (slower). */
private void performSearch(DataBase dataBase, DataBaseCache dataBaseCache, List<String> keywords, List<List<String>> segments)
{
    SearchObject search = new SearchObject();
    search.setFast(true);
    ...
    search.setNumberOfResults(25);
    if (dataBaseCache != null) { search.setSource(dataBaseCache); }
    else { search.setSource(dataBase); }
    ... do some stuff ...
    if (segments == null)
    {
        // create segments from keywords
        ....
        segments = ...
    }
}
This style of code works, but I don't like all those null parameters and the possibility of calling such methods wrongly (what if both parameters are null, or both are non-null?), and I don't want to write 4 separate functions either... I know this may be too general, but maybe someone has a general solution to this class of problems :-)
P.S.: I don't like to split up a long function if there is no reason for it other than it being long (i.e. if the subfunctions are only ever called in that order and only by this one function), especially if they are tightly interwoven and would need a large number of parameters passed around between them.
I think this is very bad procedural style. Try to avoid such coding. Since you already have a bulk of such code, it may be very hard to refactor, because each method contains its own logic that is slightly different from the others. By the way, the fact that it is hard is evidence that the style is bad.
I think you should use behavioral patterns like
Chain of responsibilities
Command
Strategy
Template method
that can help you to change your procedural code to object oriented.
Could you use something like this?
public static <T> T firstNonNull(T... parameters) {
    for (T parameter : parameters) {
        if (parameter != null) {
            return parameter;
        }
    }
    throw new IllegalArgumentException("At least one argument must be non null");
}
It does not check whether more than one parameter is non-null, and the parameters must all be of the same type, but you could use it like this:
search.setSource(firstNonNull(dataBaseCache, database));
Expecting nulls is an anti-pattern because it litters your code with NullPointerExceptions waiting to happen. Use the builder pattern to construct the SearchObject. This is the signature you want; I'll let you figure out the implementation:
class SearchBuilder {
    SearchObject search = new SearchObject();
    List<String> keywords = new ArrayList<String>();
    List<List<String>> segments = new ArrayList<List<String>>();

    public SearchBuilder(DataBase dataBase) {}
    public SearchBuilder(DataBaseCache dataBaseCache) {}
    public void addKeyword(String keyword) {}
    public void addSegment(String... segment) {}
    public void performSearch() {}
}
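Hypothetical usage of such a builder, with the method names sketched above:
SearchBuilder builder = new SearchBuilder(dataBaseCache); // or new SearchBuilder(dataBase)
builder.addKeyword("dog");
builder.addKeyword("cat");
builder.addSegment("dog", "cat", "ball");
builder.performSearch();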
I agree with what Alex said. Without knowing the problem in more detail, I would recommend the following structure, based on what was in the example:
public interface SearchEngine {
    public SearchEngineResult findByKeywords(List<String> keywords);
}

public class JDBCSearchEngine implements SearchEngine {
    private DataSource dataSource;

    public JDBCSearchEngine(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public SearchEngineResult findByKeywords(List<String> keywords) {
        // Find from the JDBC DataSource.
        // It might be useful to use a DAO instead of a DataSource if you have database operations other than searching.
    }
}

public class CachingSearchEngine implements SearchEngine {
    private SearchEngine searchEngine;

    public CachingSearchEngine(SearchEngine searchEngine) {
        this.searchEngine = searchEngine;
    }

    public SearchEngineResult findByKeywords(List<String> keywords) {
        // First check the cache
        ...
        // If not found, then fetch from the real search engine
        SearchEngineResult result = searchEngine.findByKeywords(keywords);
        // Then add to the cache
        // Return the result
        return result;
    }
}
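The two implementations then compose as a decorator, so the caller never has to pass nulls; a possible wiring (dataSource is assumed to come from your environment):
SearchEngine engine = new CachingSearchEngine(new JDBCSearchEngine(dataSource));
SearchEngineResult result = engine.findByKeywords(Arrays.asList("dog", "cat", "ball"));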

StringTemplate: increment value when if condition true

I want to find out whether StringTemplate supports incrementing a number.
The situation is:
Input: an array of objects that have isKey() and getName() getters.
Output should be (i=0; IF !obj.getKey() THEN ps.setObject(i++, obj.getName()) ENDIF):
ps.setObject(1,"Name");
ps.setObject(2,"Name");
ps.setObject(3,"Name");
...
Currently I have the following ST: <objs:{<if(it.key)><else>ps.setObject(<i>, <it.name;>);<"\n"><endif>}>
And the output, in case the 1st object is a key, is:
ps.setObject(2,"Name");
ps.setObject(3,"Name");
ps.setObject(4,"Name");
...
The issue: I need to find a way to replace the 'i' with something that will be incremented only when the if condition is true.
Please advise if you have faced this kind of issue!
In general, changing state in response to StringTemplate reading that state is not a good idea, so numbering the non-key fields should happen in your model, before you start the generation.
Add a getter for nonKeyIndex to the class of your model that hosts the name property. Go through all siblings, and number them as you need (i.e. starting from one and skipping the keys in your numbering). Now you can use this ST to produce the desired output:
<objs:{<if(it.key)><else>ps.setObject(<it.nonKeyIndex>, <it.name;>);<"\n"><endif>}>
Sometimes it may not be possible to add methods such as nonKeyIndex to your model classes. In such cases you should wrap your classes in view classes designed specifically to work with StringTemplate, and add the extra properties there:
public class ColumnView {
    private final Column c;
    private int nonKeyIdx;

    public ColumnView(Column c) { this.c = c; }

    public String getName() { return c.getName(); }
    public boolean getKey() { return c.getKey(); }
    public int getNonKeyIndex() { return nonKeyIdx; }
    public void setNonKeyIndex(int i) { nonKeyIdx = i; }
}
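A possible way to assign the numbering before rendering (Column, the columns list and the starting index are assumptions based on the description above):
List<ColumnView> views = new ArrayList<ColumnView>();
int nonKeyIndex = 1;                        // start from one, as required
for (Column c : columns) {
    ColumnView view = new ColumnView(c);
    if (!c.getKey()) {
        view.setNonKeyIndex(nonKeyIndex++); // number only the non-key columns
    }
    views.add(view);
}
// pass "views" to the template as the "objs" attribute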
