How to override Similarity in a single field in Lucene?

I am using version 4.4 of Apache Lucene.
My system indexes a collection of documents into three different fields: the title, description and author(s) of the documents.
I want a document to score higher the more often a query term occurs in it. However, when the term is in the author field, I want it to act as a "boolean": the score contribution should be the same whether the term appears once or several times. For example, if three authors of a document have the surname "Smith", it should count as a single match.
For this, I have found the following code, which overrides the term frequency:
Similarity sim = new DefaultSimilarity() {
@Override
public float tf(float freq) {
return freq == 0 ? 0 : 1;
}
};
searcher.setSimilarity(sim);
However, this overrides it for all three fields. How can I override it for the author field only?

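You can extend PerFieldSimilarityWrapper and return the custom similarity only for the author field, like this:
public class MyCustomSimilarity extends PerFieldSimilarityWrapper {
    @Override
    public Similarity get(String fieldName) {
        // Only the author field ignores term frequency.
        if ("author".equals(fieldName)) {
            return new CustomAuthorSimilarity();
        }
        else {
            return new DefaultSimilarity();
        }
    }
}
// CustomAuthorSimilarity is not defined in this answer; presumably it is the tf override from the question:
class CustomAuthorSimilarity extends DefaultSimilarity {
    @Override
    public float tf(float freq) {
        return freq == 0 ? 0 : 1;
    }
}
Then pass it to the searcher with searcher.setSimilarity(new MyCustomSimilarity());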

Related

Filter ArrayList using complex AND+OR logic

I am hoping to filter an ArrayList of custom model objects down to those matching user-selected values.
The user can filter the different fields of the model object in any manner...
If they select multiple values for the same field (e.g. choosing breakfast and dinner for the "category" field) objects matching any of the selections should be returned. If they simultaneously filter using the "protein" field and choose "chicken" only chicken breakfast and dinner meals should be returned.
I am currently using Guava and Collections2.filter(...), but can't seem to combine the AND/OR logic properly.
Any guidance would be appreciated! :)
Edit: Adding code snippet as an indication that I'm not looking for "moral support"
Collection<FieldAcceptanceLogItem> objectFilter = allLogItems;
for (final Filter filter : mFilters) {
objectFilter = Collections2.filter(objectFilter, new Predicate<FieldAcceptanceLogItem>() {
@Override
public boolean apply(@javax.annotation.Nullable FieldAcceptanceLogItem input) {
if (filter.getCategory().equalsIgnoreCase(getString(R.string.sublocation))) {
return input.getSublocation().equalsIgnoreCase(filter.getTitle());
}
else if (filter.getCategory().equalsIgnoreCase(getString(R.string.technology))) {
return input.getTechnology().equalsIgnoreCase(filter.getTitle());
}
else { //(filter.getCategory().equalsIgnoreCase(getString(R.string.component)))
return input.getComponent().equalsIgnoreCase(filter.getTitle());
}
}
});
}

So it looks like you want the intersection of sublocation, technology, and component. I moved a couple of things around that should highlight what you're trying to tackle:
objectFilter = Collections2.filter(objectFilter, new Predicate<FieldAcceptanceLogItem>() {
    @Override
    public boolean apply(@javax.annotation.Nullable FieldAcceptanceLogItem input) {
        return SublocationFilters.from(mFilters).contains(input.getSublocation())
                && TechnologyFilters.from(mFilters).contains(input.getTechnology())
                && ComponentFilters.from(mFilters).contains(input.getComponent());
    }
});
SublocationFilters.from(...) will give you all of the sublocation filters
TechnologyFilters.from(...) will give you all of the tech filters
ComponentFilters.from(...) will give you all of the component filters
contains(...) is just a convenient method for doing "filter_1 OR filter_2 OR ... filter_n"
If you do want to follow that pattern though, I'd recommend doing something more like this as it is less to write tests for:
new CategoryFilter(mFilters, getString(R.string.component))
.contains(input.getComponent());
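For reference, the same "OR within a field, AND across fields" rule can be sketched without Guava. This is only an illustration under assumptions: Meal, its two fields and the filter helper are made-up stand-ins for the model in the question, matching is exact (not case-insensitive), and it needs Java 8 for the lambdas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class FieldFilterSketch {

    // Hypothetical stand-in for the model object in the question.
    static final class Meal {
        final String category;
        final String protein;

        Meal(String category, String protein) {
            this.category = category;
            this.protein = protein;
        }
    }

    // Returns the items that pass every field's filter.
    // A field with no selected values does not restrict the result.
    static List<Meal> filter(List<Meal> items,
                             Map<Function<Meal, String>, Set<String>> selections) {
        List<Meal> result = new ArrayList<>();
        for (Meal item : items) {
            boolean matchesAll = true;
            for (Map.Entry<Function<Meal, String>, Set<String>> e : selections.entrySet()) {
                Set<String> allowed = e.getValue();
                // OR within a field: any selected value is acceptable.
                if (!allowed.isEmpty() && !allowed.contains(e.getKey().apply(item))) {
                    matchesAll = false; // AND across fields: one miss rejects the item.
                    break;
                }
            }
            if (matchesAll) {
                result.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Meal> meals = Arrays.asList(
                new Meal("breakfast", "chicken"),
                new Meal("lunch", "chicken"),
                new Meal("dinner", "beef"),
                new Meal("dinner", "chicken"));

        Map<Function<Meal, String>, Set<String>> selections = new HashMap<>();
        selections.put(m -> m.category, new HashSet<>(Arrays.asList("breakfast", "dinner")));
        selections.put(m -> m.protein, new HashSet<>(Arrays.asList("chicken")));

        for (Meal m : filter(meals, selections)) {
            System.out.println(m.category + " / " + m.protein);
        }
    }
}
```

With these selections only the chicken breakfast and the chicken dinner should remain, which is the behaviour described in the question.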

Object sorting using java comparator but with fixed value should be in last value in a sorted list

I sort MasterPayee objects alphabetically by payee category code. Now I need the "Other Services" category to come last in the sorted list.
List after sorting is applied:
Financial and Insurance services
Government Sectors
Other Services
Telecommunications and Utilities
Transportation Services
Required list:
Financial and Insurance services
Government Sectors
Telecommunications and Utilities
Transportation Services
Other Services
I need Other Services to be last in the list. The following comparator is used to sort the list:
Collections.sort(masterPayees, getCategoryNameComparatorByMasterPayee());
private Comparator<MasterPayee> getCategoryNameComparatorByMasterPayee() {
Comparator<MasterPayee> categoryNameComparatorByMasterPayee = new Comparator<MasterPayee>() {
public int compare(MasterPayee o1, MasterPayee o2) {
return (((MasterPayee) o1).getPayee_category().toString()
.compareToIgnoreCase(((MasterPayee) o2).getPayee_category().toString()));
}
};
return categoryNameComparatorByMasterPayee;
}
Other Services should be always last in the sorted list

Try this:
Comparator<MasterPayee> categoryNameComparatorByMasterPayee = new Comparator<MasterPayee>(){
public int compare(MasterPayee o1, MasterPayee o2) {
if (((MasterPayee) o1).getPayee_category().toString().equalsIgnoreCase("Other Services") && ((MasterPayee) o1).getPayee_category().toString().equalsIgnoreCase(((MasterPayee) o2).getPayee_category().toString())) {
return 0;
}
else if (((MasterPayee) o1).getPayee_category().toString().equalsIgnoreCase("Other Services")) {
return 1;
}
else if (((MasterPayee) o2).getPayee_category().toString().equalsIgnoreCase("Other Services")) {
return -1;
}
else return (((MasterPayee) o1).getPayee_category().toString().compareToIgnoreCase(((MasterPayee) o2).getPayee_category().toString()));
}
};
It treats an object with "Other Services" always as "larger", thus making it appear at the end.

Create a constant map <Payee, Integer> and in the comparator use the value.
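A minimal sketch of that idea, with some assumptions: the map is keyed by the lower-cased category name rather than by the payee object, only the exception ("Other Services") is listed, and every other category falls back to rank 0 and is ordered alphabetically. RANK, rankOf and BY_RANK_THEN_NAME are names made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class RankedCategorySort {

    // Rank per lower-cased category name; lower ranks sort first.
    // Names missing from the map get rank 0, so only exceptions are listed.
    static final Map<String, Integer> RANK;
    static {
        Map<String, Integer> ranks = new HashMap<>();
        ranks.put("other services", 1);
        RANK = Collections.unmodifiableMap(ranks);
    }

    static int rankOf(String category) {
        Integer rank = RANK.get(category.toLowerCase(Locale.ROOT));
        return rank == null ? 0 : rank;
    }

    // Compares by rank first, then alphabetically, ignoring case.
    static final Comparator<String> BY_RANK_THEN_NAME = new Comparator<String>() {
        @Override
        public int compare(String a, String b) {
            int byRank = Integer.compare(rankOf(a), rankOf(b));
            return byRank != 0 ? byRank : a.compareToIgnoreCase(b);
        }
    };

    public static void main(String[] args) {
        List<String> categories = new ArrayList<>(Arrays.asList(
                "Other Services", "Transportation Services", "Government Sectors",
                "Financial and Insurance services", "Telecommunications and Utilities"));
        Collections.sort(categories, BY_RANK_THEN_NAME);
        for (String category : categories) {
            System.out.println(category);
        }
    }
}
```

For MasterPayee the same comparator would be applied to getPayee_category().toString() of each object.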

You can use Guava's Ordering if you know all the values that may be sorted.
To create the comparator you can specify your values like this:
Ordering<String> ordering1 = Ordering.explicit("Financial and Insurance services","Government Sectors","Telecommunications and Utilities","Transportation Services","Other Services");
You may also pass a List of your values as the argument to Ordering.explicit().

If there is only a limited set of those elements, I would write them as an enum,
with a name for the output text and the ordinal for the sorting. It's cleaner.

Another suggestion, if "Other Services" is always present, remove it from the list, do the sorting, and then add "Other Services" last. That way you can keep the sorting logic simple and add the exception separately.
If not always present, then you can look for it first, and then only add if it was present.
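Roughly, that could look like the sketch below. It works on plain category strings to stay short; OtherLastSort and sortWithOtherLast are hypothetical names, and with MasterPayee objects you would test getPayee_category() instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

class OtherLastSort {

    static final String OTHER = "Other Services";

    // Returns a sorted copy in which every "Other Services" entry comes last.
    static List<String> sortWithOtherLast(List<String> categories) {
        List<String> sorted = new ArrayList<>(categories);
        List<String> others = new ArrayList<>();
        // Pull the special entries out before sorting.
        for (Iterator<String> it = sorted.iterator(); it.hasNext();) {
            String category = it.next();
            if (category.equalsIgnoreCase(OTHER)) {
                others.add(category);
                it.remove();
            }
        }
        Collections.sort(sorted, String.CASE_INSENSITIVE_ORDER);
        sorted.addAll(others); // adds nothing when "Other Services" was absent
        return sorted;
    }

    public static void main(String[] args) {
        List<String> result = sortWithOtherLast(Arrays.asList(
                "Other Services", "Transportation Services", "Government Sectors"));
        for (String category : result) {
            System.out.println(category);
        }
    }
}
```

Because the special entries are collected in a separate list, the same code covers both cases: "Other Services" present or not.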

I think we can handle the logic gracefully by using a ternary expression.
private Comparator<MasterPayee> getCategoryNameComparatorByMasterPayee() {
Comparator<MasterPayee> categoryNameComparatorByMasterPayee = new Comparator<MasterPayee>() {
public int compare(MasterPayee o1, MasterPayee o2) {
String s1 = ((MasterPayee) o1).getPayee_category().toString();
String s2 = ((MasterPayee) o2).getPayee_category().toString();
boolean b1 = s1.equalsIgnoreCase("Other Services");
boolean b2 = s2.equalsIgnoreCase("Other Services");
return b1 ? (b2 ? 0 : 1) : (b2 ? -1 : s1.compareToIgnoreCase(s2));
}
};
return categoryNameComparatorByMasterPayee;
}
This avoids having code which is difficult to read, and therefore difficult to maintain. And if we need to change the logic here, we might only have to make minimal changes.

If the list of strings is fixed and the ordering is based on business logic rather than on the string values, then I recommend using an EnumMap.
enum Industry{
FINANCE, GOVERNMENT, UTILITIES, TRANSPORT, OTHER
}
public class StreamEnumMap {
public static void main(String... strings){
Map<Industry, String> industryMap = new EnumMap<>(Industry.class);
industryMap.put(Industry.FINANCE, "Financial and Insurance services");
industryMap.put(Industry.GOVERNMENT,"Government Sectors");
industryMap.put(Industry.UTILITIES,"Telecommunications and Utilities");
industryMap.put(Industry.OTHER,"Other Services");
industryMap.put(Industry.TRANSPORT, "Transportation Services");
industryMap.values().stream().forEach(System.out::println);
}
}
This produces the results in the below order,
Financial and Insurance services
Government Sectors
Telecommunications and Utilities
Transportation Services
Other Services

Advice on Java program

My java project required that I create an array of objects(items), populate the array of items, and then create a main method that asks a user to enter the item code which spits back the corresponding item.
It took me a while to figure out, but I ended up "cheating" by using a public variable to avoid passing/referencing the object between classes.
Please help me properly pass the object back.
This is the class with most of my methods including insert and the find method.
public class Catalog {
private Item[] itemlist;
private int size;
private int nextInsert;
public Item queriedItem;
public Catalog (int max) {
itemlist = new Item[max];
size = 0;
}
public void insert (Item item) {
itemlist[nextInsert] = item;
++nextInsert;
++size;
}
public Item find (int key) {
queriedItem = null;
for (int posn = 0; posn < size; ++posn) {
if (itemlist[posn].getKey() == key) queriedItem = itemlist[posn];
}{
return queriedItem;
}
}
}
This is my main class:
import java.util.*;
public class Program {
public static void main (String[] args) {
Scanner kbd = new Scanner (System.in);
Catalog store;
int key = 1;
store = new Catalog (8);
store.insert(new Item(10, "food", 2.00));
store.insert(new Item(20, "drink", 1.00));
while (key != 0) {
System.out.printf("Item number (0 to quit) ?%n");
key = kbd.nextInt();
if (key == 0) {
System.out.printf("Exiting program now!");
System.exit(0);
}
store.find(key);
if (store.queriedItem != null) {
store.queriedItem.print();
}
else System.out.printf("No Item found for %d%n", key);
}
}
}
Thanks I appreciate the help!!!!!!

store.find(key) already returns an Item, so use that return value and delete the public field from Catalog:
public Item find (int key) {
Item queriedItem = null;
//....
}
Item searched = store.find(key);
if (searched != null)
searched.print();
else
System.out.printf("No Item found for %d%n", key);

Remove your use of queriedItem entirely and just return the item from find: Replace
store.find(key);
if (store.queriedItem != null){store.queriedItem.print();}else System.out.printf("No Item found for %d%n", key);
With
Item foundItem = store.find(key);
if (foundItem != null) {
foundItem.print();
} else System.out.printf("No Item found for %d%n", key);

Well, here are some suggestions (choose the complexity at your own discretion, but all of them are highly recommended):
Look into Properties files, or XML. You could populate the array with values from a configuration file for greater flexibility.
Use constants for literals in your code (where they are necessary).
Create an ApplicationFactory that initializes the whole application for you. Things like this need to be separated from your domain logic.
Create a UserInputProvider interface so you can easily change the way the input of the user is read without affecting anything else. Implement it with a ConsoleInputProvider class for example.
In general, try using interfaces for everything that's not a pure domain object (here, the only one you have is probably Item).
Try to keep your methods as short as possible. Instead of doing many things in a method, have it invoke other methods (grouping related logic) named appropriately to tell what it is doing.
If you're not allowed to cheat and use List or a Map, devise your own implementation of one, separating data structure and handling from the logic represented by Catalog (i.e. Catalog in turn will delegate to, for example, Map.get or equivalent method of your data structure implementation)
Your main should basically just have ApplicationFactory (or an IoC framework) to build and initialize your application, invoke the UserInputProvider (it should not know the exact implementation it is using) to get user input, validate and convert the data as required, invoke Catalog to find the appropriate Item and then (similarly to the input interface) send the result (the exact data it got, not some string or alike) to some implementation of a SearchResultView interface that decides how to display this result (in this case it will be a console-based implementation, that prints a string representing the Item it got).
Generally, the higher the level of decoupling you can achieve, the better your program will be.
The Single Responsibility Principle states: "every class should have a single responsibility, and that responsibility should be entirely encapsulated by the class". This is also true for methods: they should have one and only one well defined task without any side effects.
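To make the input and output decoupling more concrete, here is a small sketch. The interface names UserInputProvider and SearchResultView come from the suggestions above, but their methods (nextKey, show) and the run helper are assumptions made for the example. The catalog is a plain Map of item names for brevity, which your assignment may not allow, and lambdas require Java 8.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class DecoupledLookupSketch {

    /** Supplies item codes; the caller does not know where they come from. */
    interface UserInputProvider {
        int nextKey();
    }

    /** Decides how a lookup result is displayed. */
    interface SearchResultView {
        void show(int key, String itemNameOrNull);
    }

    /** Reads keys until the 0 sentinel and forwards each lookup result to the view. */
    static void run(UserInputProvider input, Map<Integer, String> catalog, SearchResultView view) {
        for (int key = input.nextKey(); key != 0; key = input.nextKey()) {
            view.show(key, catalog.get(key));
        }
    }

    public static void main(String[] args) {
        Map<Integer, String> catalog = new HashMap<>();
        catalog.put(10, "food");
        catalog.put(20, "drink");

        // Scripted input stands in for the keyboard here;
        // a Scanner-based provider would implement the same interface.
        Deque<Integer> scripted = new ArrayDeque<>(Arrays.asList(10, 99, 0));
        run(scripted::poll, catalog, (key, name) ->
                System.out.println(name != null ? key + ": " + name : "No Item found for " + key));
    }
}
```

Swapping the scripted input for console input, or the console output for something else, then touches neither run nor the catalog.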

StringTemplate: increment value when if condition true

I want to find out whether StringTemplate supports incrementing a number.
The situation is:
input: an array of objects which have isKey() and getName() getters.
output should be (i=0; IF !obj.getKey() THEN ps.setObject(i++,obj.getName)) ENDIF):
ps.setObject(1,"Name");
ps.setObject(2,"Name");
ps.setObject(3,"Name");
...
Currently I have the following ST: <objs:{<if(it.key)><else>ps.setObject(<i>, <it.name;>);<"\n"><endif>}>
And this is the output when the first object is a key:
ps.setObject(2,"Name");
ps.setObject(3,"Name");
ps.setObject(4,"Name");
...
The issue is that I need a way to replace <i> with something that is incremented only when the if condition is true.
Please advise if you have faced this kind of issue!

In general, changing the state in response to ST's getting the state is not a good idea, so numbering non-key fields should happen in your model, before you start with the generation.
Add a getter for nonKeyIndex to the class of your model that hosts the name property. Go through all siblings, and number them as you need (i.e. starting from one and skipping the keys in your numbering). Now you can use this ST to produce the desired output:
<objs:{<if(it.key)><else>ps.setObject(<it.nonKeyIndex>, <it.name;>);<"\n"><endif>}>
Sometimes it may not be possible to add methods such as nonKeyIndex to your model classes. In such cases you should wrap your classes into view classes designed specifically to work with string template, and add the extra properties there:
public class ColumnView {
private final Column c;
private int nonKeyIdx;
public ColumnView(Column c) {this.c = c;}
public String getName() { return c.getName(); }
public boolean getKey() { return c.getKey(); }
public int getNonKeyIndex() { return nonKeyIdx; }
public void setNonKeyIndex(int i) { nonKeyIdx = i; }
}
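The numbering pass itself is not shown above; one way to do it is sketched here. Column is a made-up minimal model class standing in for yours, ColumnView repeats the wrapper from this answer, and number is a hypothetical helper. Key columns simply keep index 0, which the template never prints.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NonKeyNumbering {

    // Hypothetical minimal model class; the real one comes from your code.
    static final class Column {
        private final String name;
        private final boolean key;

        Column(String name, boolean key) {
            this.name = name;
            this.key = key;
        }

        String getName() { return name; }
        boolean getKey() { return key; }
    }

    // Same shape as the ColumnView wrapper in the answer.
    static final class ColumnView {
        private final Column c;
        private int nonKeyIdx;

        ColumnView(Column c) { this.c = c; }

        public String getName() { return c.getName(); }
        public boolean getKey() { return c.getKey(); }
        public int getNonKeyIndex() { return nonKeyIdx; }
        public void setNonKeyIndex(int i) { nonKeyIdx = i; }
    }

    // Wraps every column and numbers the non-key ones 1, 2, 3, ... in order.
    static List<ColumnView> number(List<Column> columns) {
        List<ColumnView> views = new ArrayList<>();
        int next = 1;
        for (Column column : columns) {
            ColumnView view = new ColumnView(column);
            if (!column.getKey()) {
                view.setNonKeyIndex(next++);
            }
            views.add(view);
        }
        return views;
    }

    public static void main(String[] args) {
        List<ColumnView> views = number(Arrays.asList(
                new Column("id", true), new Column("name", false), new Column("email", false)));
        for (ColumnView view : views) {
            if (!view.getKey()) {
                System.out.println("ps.setObject(" + view.getNonKeyIndex() + ", \"" + view.getName() + "\");");
            }
        }
    }
}
```

The list returned by number is what you would pass to the template as objs.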

Solr: search excludes bigger phrases

For example, I have 3 documents.
1. "dog cat a ball"
2. "dog the cat of balls"
3. "dog the cat, ball and elephant"
By querying "dog AND cat AND ball" I want to receive only the first two documents.
The main idea is that the results should include only documents that contain just the words I requested.
I'd appreciate any advice.
Thank you.

Well, if you store your TermVector (when creating a Field, before adding the Document to the index, use TermVector.YES), it can be done by overriding a Collector. Here is a simple implementation (it returns only the documents, without scores):
private static class MyCollector extends Collector {
private IndexReader ir;
private int numberOfTerms;
private Set<Integer> set = new HashSet<Integer>();
public MyCollector(IndexReader ir,int numberOfTerms) {
this.ir = ir;
this.numberOfTerms = numberOfTerms;
}
private int docBase; //offset of the current segment's doc ids within the top-level reader
@Override
public void setScorer(Scorer scorer) throws IOException { } //we do not use a scorer in this example
@Override
public void setNextReader(IndexReader reader, int docBase) {
this.docBase = docBase; //collect() receives segment-relative ids, so remember the offset
}
@Override
public void collect(int doc) throws IOException {
int globalDoc = docBase + doc; //assumes ir is the top-level reader
TermFreqVector vector = ir.getTermFreqVector(globalDoc, CONTENT_FIELD);
//CONTENT_FIELD is the name of the field you are searching in...
if (vector != null) {
if (vector.getTerms().length == numberOfTerms) {
set.add(globalDoc);
}
} else {
set.add(globalDoc); //well, assume it doesn't happen, because you stored your TermVectors.
}
}
@Override
public boolean acceptsDocsOutOfOrder() {
return true;
}
public Set<Integer> getSet() {
return set;
}
};
Now use IndexSearcher#search(Query, Collector).
The idea is: you know how many terms a document must contain to be accepted, so you just verify that and collect only the documents that match this rule. Of course this can be more complex (look for a specific term in the vector, the order of words in the vector), but this is the general idea.
Actually, if you store your TermVector, you can do almost anything, so just try working with it.

You may implement a filter factory/tokenizer pair with hashing capabilities.
Use a copyField directive.
Tokenize the terms.
Remove stopwords (as in your example).
Sort the terms in alphanumeric order and save the hash.
Expand the query to also search for the hash, something like:
somestring:"dog AND cat AND ball" AND somehash:"dog AND cat AND ball"
The second part of the search query will be implicitly hashed during query processing.
This will return only exact matches (with a very small probability of false positives).
P.S. You don't need to store term vectors, which results in a noticeably smaller index.
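The signature step can be sketched in plain Java, outside Solr. This only illustrates the normalisation idea and is not a Solr filter factory: the stop word list is made up for the example, and no stemming is done, so "balls" in document 2 would still need a stemmer in the analysis chain to match "ball".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class TermSetSignature {

    // Illustrative stop word list; a real setup would reuse the analyzer's list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "the", "of", "and"));

    // Lower-cases, tokenizes, drops stop words, then sorts and de-duplicates the terms.
    static String signature(String text) {
        TreeSet<String> terms = new TreeSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return String.join(" ", terms);
    }

    // The value that would be stored in (and queried against) the hash field.
    static int hash(String text) {
        return signature(text).hashCode();
    }

    public static void main(String[] args) {
        String query = "dog AND cat AND ball";
        String[] docs = {
                "dog cat a ball",
                "dog the cat of balls",
                "dog the cat, ball and elephant"};
        for (String doc : docs) {
            System.out.println(doc + " -> " + (hash(doc) == hash(query)));
        }
    }
}
```

Documents whose term set is a superset of the query (such as document 3) get a different signature, which is what excludes them.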
