I have an index built with Lucene, and each document in it has 3 fields, one of which is a numeric field holding my frequency. I search in my index, but before that I want to sort it by the numeric field. Is there any way to sort by that field in Lucene before my search?
Sorting before searching doesn't really make a lot of sense, since Lucene is creating an inverted index for searching against, rather than storing and searching through a sequential set of documents.
However, it sounds like you want to run a search and get results that are already sorted in a specified way.
This is done by passing a Sort to the IndexSearcher.search call, like:
SortField field = new SortField("frequency", SortField.Type.FLOAT);
//Sorting, first, by "frequency", then by relevance score
Sort sort = new Sort(field, SortField.FIELD_SCORE);
searcher.search(query, maxDocs, sort);
The name of the field makes me wonder whether you aren't re-inventing the wheel, though. Lucene already factors term frequency into its relevance scores. If you want to tweak that sort of scoring, it might be a better idea to create a custom Similarity class to calculate scores for you, extending either TFIDFSimilarity or DefaultSimilarity and overriding the tf method in particular.
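For illustration, here is a minimal sketch of that approach, assuming the Lucene 4.x API; the class name FlatTfSimilarity and the flat-tf behavior are just examples, not something from the question:
import org.apache.lucene.search.similarities.DefaultSimilarity;
public class FlatTfSimilarity extends DefaultSimilarity {
    @Override
    public float tf(float freq) {
        // The default implementation returns sqrt(freq); returning a constant
        // for any match removes term frequency from the score entirely.
        return freq > 0 ? 1.0f : 0.0f;
    }
}
You would then set it on both sides, e.g. indexWriterConfig.setSimilarity(new FlatTfSimilarity()) at index time and searcher.setSimilarity(new FlatTfSimilarity()) at search time, so scoring stays consistent.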
Related
I am having an issue with sorting, which I describe below.
Previously, the code was written as
Sort sort = new Sort(new SortField[] {
    SortField.FIELD_SCORE,
    new SortField("field_1", SortField.STRING),
    new SortField("field_2", SortField.STRING),
    new SortField("field_2", SortField.LONG)
});
and this is an example posted in a Stack Overflow answer here about custom sorting,
Sorting search result in Lucene based on a numeric field.
Though the author does not claim this is the correct way to do the sorting, this is also the code my company has used for years.
But when I created a new function that needs to sort on lots of fields, I found through unit testing that it does not actually work as intended.
I need to remove SortField.FIELD_SCORE to make it work correctly, and I think this is what the example described here suggests, if I understood correctly: https://docs.jboss.org/hibernate/search/4.1/reference/en-US/html_single/#d0e5317.
i.e. the main code becomes
Sort sort = new Sort(new SortField[] {
    new SortField("field_1", SortField.STRING),
    new SortField("field_2", SortField.STRING),
    new SortField("field_2", SortField.LONG)
});
So my questions are:
What is the purpose of SortField.FIELD_SCORE? How is the field score calculated?
Why does including SortField.FIELD_SCORE sometimes return the correct order and sometimes not?
What is the purpose of SortField.FIELD_SCORE? How is the field score calculated?
When you search for documents containing a word, each document gets assigned a "score": a float value, generally positive. The higher this value, the better the match. How exactly this is computed is a bit complex, and it gets worse when you have multiple nested queries (e.g. boolean queries, etc.), because then scores get combined with other formulas. Suffice it to say: the score is a number, there's one value for each document, and higher is better.
SortField.FIELD_SCORE will simply sort documents by descending score.
Why does including SortField.FIELD_SCORE sometimes return the correct order and sometimes not?
Hard to say. It depends on lots of things, like your analyzers, the exact query you're running, and even the frequency of the search terms in your documents. Like I said, the formula used to compute the score is complex.
One thing that stands out in your sort, though, is that you're sorting by score and by actual fields. That's unlikely to work well. Scores are generally unique, so unless your documents are very similar (e.g. all text fields are empty for some reason), the top documents will have scores like this: [5.1, 3.4, 2.6, 2.4, 2.2]. Their order is already "complete": you can add as many subsequent sorts as you want; the order will not change, because it is fully defined by the sort by score.
Think of alphabetical order: if I have to sort ["area", "baby"], the second letter of "baby" may be "a", but it doesn't matter, because the first letter is "b" and it's always going to be after the "a" of "area".
So, if you're not interested in a sort by score (and, if you don't know what score is, chances are you indeed are not interested), just stick to sorts by field:
Sort sort = new Sort(new SortField[] {
    new SortField("field_1", SortField.STRING),
    new SortField("field_2", SortField.STRING),
    new SortField("field_2", SortField.LONG)
});
And if you're interested in a sort by score, then just sort by score:
Sort sort = new Sort(new SortField[] {
    SortField.FIELD_SCORE
});
// Or equivalently
Sort sort = Sort.RELEVANCE; // "Relevance" means "sort by score"
Note that Hibernate Search 4.1 (the version for your documentation link) is very old; you should consider upgrading at least to 5.11 (similar API, also old but still maintained), and preferably to 6.0 (different, but more modern API, new and also maintained).
I have a very large list of Strings stored in a NoSQL DB. The incoming query is a string, and I want to check whether this string is in the list or not. For an exact match this is very simple: the NoSQL DB can have the string as the primary key, and I just check whether any record exists with that key. But I need to check for fuzzy matches as well.
One approach is to traverse every String in the list and check the Levenshtein distance of the input String against each of them, but this has O(n) complexity, and the list is very large (10 million entries) and may grow further. This approach would add too much latency to my solution.
Is there a better way to solve this problem?
Fuzzy matching is complicated for the reasons you have discovered. Calculating a distance metric for every combination of search term against database term is impractical for performance reasons.
The solution to this is usually to use an n-gram index. This can either be used standalone to give a result, or as a filter to cut down the size of possible results so that you have fewer distance scores to calculate.
So basically, if you have the word "stack" you break it into n-grams (commonly trigrams), padding at the word boundaries, such as "s", "st", "sta", "tac", "ack", "ck", "k". You index those in your database against the database row. You then do the same for the input and look for the database rows that have the same matching n-grams.
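As a rough illustration of the filtering idea in plain Java (no Lucene; all names here are invented for the sketch):
import java.util.*;
public class TrigramIndex {
    private final Map<String, Set<String>> index = new HashMap<>();
    // Pad the word so boundary characters also produce grams
    private static List<String> trigrams(String word) {
        List<String> grams = new ArrayList<>();
        String padded = "  " + word + "  ";
        for (int i = 0; i + 3 <= padded.length(); i++) {
            grams.add(padded.substring(i, i + 3));
        }
        return grams;
    }
    public void add(String word) {
        for (String g : trigrams(word)) {
            index.computeIfAbsent(g, k -> new HashSet<>()).add(word);
        }
    }
    // Candidates sharing at least minShared trigrams with the query
    public Set<String> candidates(String query, int minShared) {
        Map<String, Integer> counts = new HashMap<>();
        for (String g : trigrams(query)) {
            for (String w : index.getOrDefault(g, Collections.emptySet())) {
                counts.merge(w, 1, Integer::sum);
            }
        }
        Set<String> result = new HashSet<>();
        counts.forEach((w, c) -> { if (c >= minShared) result.add(w); });
        return result;
    }
}
The shortlist that survives the threshold is typically small enough that running Levenshtein on each candidate is cheap.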
This is all complicated, and your best option is to use an existing implementation such as Lucene/Solr which will do the n-gram stuff for you. I haven't used it myself as I work with proprietary solutions, but there is a stackoverflow question that might be related:
Return only results that match enough NGrams with Solr
Some databases seem to implement n-gram matching. Here is a link to a Sybase page that provides some discussion of that:
Sybase n-gram text index
Unfortunately, a full discussion of n-grams would be a long post, and I don't have time. It is probably discussed elsewhere on Stack Overflow and other sites; I suggest googling the term and reading up on it.
First of all, if searching is what you're doing, then you should use a search engine (Elasticsearch is pretty much the default). Search engines are good at this, and you are not re-inventing wheels.
Second, the technique you are looking for is called stemming. Along with the original String, save a normalized string in your DB. Normalize the search query with the same mechanism. That way you will get much better search results. Obviously, this is one of the techniques a search engine uses under the hood.
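If Lucene happens to be on your classpath already, its analysis chain can produce that normalized form. A sketch, assuming a recent Lucene analyzers module where EnglishAnalyzer has a no-arg constructor (it applies lowercasing and Porter stemming, among other steps):
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
static String normalize(String text) throws IOException {
    StringBuilder sb = new StringBuilder();
    try (Analyzer analyzer = new EnglishAnalyzer();
         TokenStream ts = analyzer.tokenStream("f", text)) {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {   // one normalized token at a time
            if (sb.length() > 0) sb.append(' ');
            sb.append(term.toString());
        }
        ts.end();
    }
    return sb.toString();               // e.g. "Teaching teachers" -> "teach teacher"
}
Store normalize(original) alongside the original string, and run the same function over incoming queries.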
Could Solr (or Lucene) be a suitable solution for you?
Lucene supports fuzzy searches based on the Levenshtein distance, or edit distance, algorithm. To do a fuzzy search, use the tilde ("~") symbol at the end of a single-word term. For example, to search for a term similar in spelling to "roam", use the fuzzy search:
roam~
This search will find terms like foam and roams.
Starting with Lucene 1.9, an additional (optional) parameter can specify the required similarity. The value is between 0 and 1; with a value closer to 1, only terms with a higher similarity will be matched. For example:
roam~0.8
https://lucene.apache.org/core/2_9_4/queryparsersyntax.html
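Programmatically, the same fuzzy match can be built without the query parser. A sketch, assuming a recent Lucene where fuzziness is expressed as a maximum edit distance (the 2.9 docs linked above still used the float similarity form), with the field name "word" invented:
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
// roam~ with at most 2 edits; matches terms like "foam" and "roams"
Query query = new FuzzyQuery(new Term("word", "roam"), 2);
TopDocs hits = searcher.search(query, 10);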
In Java, I have an ArrayList with a list of objects. Each object has a date field that is just a long data type. The ArrayList is sorted by the date field. I want to insert a new object into the ArrayList so that it appears in the correct position with regard to its date. The only solution I can see is to iterate through all the items and compare the date field of the object being inserted to the objects being iterated on and then insert it once I reach the correct position. This will be a performance issue if I have to insert a lot of records.
What are some possible ways to improve this performance? Maybe an ArrayList is not the best solution?
I would say that you are correct in making the statement:
Maybe an ArrayList is not the best solution
Personally, I think that a tree structure would be better suited for this, specifically a binary search tree sorted on the object's date field. Once you have the tree created, you can use binary search, which takes O(log n) time.
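For example, the JDK's TreeMap (a red-black tree) already provides this when keyed by the long date field; Item and getDate() are placeholder names:
import java.util.*;
TreeMap<Long, List<Item>> byDate = new TreeMap<>();
// O(log n) insertion; items sharing a date go into the same bucket
byDate.computeIfAbsent(item.getDate(), k -> new ArrayList<>()).add(item);
// Iterating over byDate.values() yields the items in date order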
Whether or not binary search + O(n) insertion is bad for you depends on at least these things:
size of the list,
access pattern (mostly insert or mostly read),
space considerations (ArrayList is far more compact than the alternatives).
Given the existence of these factors and their quite complex interactions you should not switch over to a binary search tree-based solution until you find out how bad your current design is—through measurements. The switch might even make things worse for you.
I would consider using TreeSet and make your item Comparable. Then you get everything out of the box.
If this is not possible I would search for the index via Collections.binarySearch(...).
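A sketch of that route (Item and getDate() are placeholder names): a negative return value from binarySearch encodes the insertion point as -(insertionPoint) - 1.
import java.util.*;
int index = Collections.binarySearch(list, newItem, Comparator.comparingLong(Item::getDate));
if (index < 0) {
    index = -index - 1;       // not present: convert to the insertion point
}
list.add(index, newItem);     // keeps the list sorted by date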
EDIT: Make sure performance is an issue before you start optimizing
First you should sort the ArrayList using:
ArrayList<Integer> arr = new ArrayList<>();
...
Collections.sort(arr);
Then your answer is:
int index = Collections.binarySearch(arr, 5);
I searched many posts but didn't find an answer. I'd like to search for and access values in a collection by value. My object type is DictionaryWord, with two fields: String word and int wordUsage (the number of times the word was used). I was wondering which collection would be the fastest. If I type "wa", I'd like to get e.g. 5 strings that start with those letters. Any list or set would probably be way too slow, as I have 100,000 objects.
I thought about using a HashMap with the String word as its key and the int wordUsage as its value. I could even write my own hash() function that just maps every key to itself - key: "writing", hash value: "writing". Considering there are no duplicates, would that be a good idea, or should I look for something else?
My point is: how and what do I use to search for values that start with the text used in the search condition? For example, typing "tea" I'd find values like "tea", "teacher", "tear", "teaching", etc.
The fastest I can think of is a binary search tree. I found this to be very helpful and it should make it clear why a tree is the best option.
Probably you need a prefix tree. Take a look at the Trie wiki page for further information.
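A minimal trie sketch in plain Java for the kind of lookup you describe ("tea" finding "tea", "teacher", "tear", "teaching"); all names are invented, and ready-made implementations such as PatriciaTrie in Apache Commons Collections exist too:
import java.util.*;
class TrieNode {
    Map<Character, TrieNode> children = new HashMap<>();
    boolean isWord;
}
class Trie {
    private final TrieNode root = new TrieNode();
    void insert(String word) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.isWord = true;
    }
    List<String> withPrefix(String prefix) {
        TrieNode node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return Collections.emptyList(); // no word has this prefix
        }
        List<String> results = new ArrayList<>();
        collect(node, new StringBuilder(prefix), results);
        return results;
    }
    private void collect(TrieNode node, StringBuilder prefix, List<String> out) {
        if (node.isWord) out.add(prefix.toString());
        for (Map.Entry<Character, TrieNode> e : node.children.entrySet()) {
            prefix.append(e.getKey());
            collect(e.getValue(), prefix, out);
            prefix.deleteCharAt(prefix.length() - 1);
        }
    }
}
Lookup cost depends only on the prefix length and the number of matches, not on the 100,000 words stored.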
I am searching WordNet for synonyms for a big list of words. The way I have done it, when a word has more than one synonym, the results are returned in alphabetical order. What I need is to have them ordered by their probability of occurrence, and I would take just the top synonym.
I have used the Prolog WordNet database and Syns2Index to convert it into a Lucene-type index for querying synonyms. Is there a way to get them ordered by their probabilities this way, or should I use another approach?
Speed is not important; this synonym lookup will not be done online.
In case someone stumbles upon this thread, this was the way to go (at least for what I needed):
http://lyle.smu.edu/~tspell/jaws/doc/edu/smu/tspell/wordnet/impl/file/ReferenceSynset.html#getTagCount%28java.lang.String%29
The tagCount method gives the most likely synset group for every word. The problem, again, is that the synset with the highest probability can itself contain several words, but I guess there is no way to avoid that.
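For later readers, a rough sketch of that lookup with the JAWS library from the link; only getTagCount is confirmed by the linked Javadoc, while WordNetDatabase.getFileInstance and getSynsets are my assumption of the surrounding API:
import edu.smu.tspell.wordnet.Synset;
import edu.smu.tspell.wordnet.WordNetDatabase;
import edu.smu.tspell.wordnet.impl.file.ReferenceSynset;
WordNetDatabase db = WordNetDatabase.getFileInstance();
Synset best = null;
int bestCount = -1;
for (Synset s : db.getSynsets("teacher")) {
    // tagCount reflects how often this sense was tagged in the corpus
    int count = ((ReferenceSynset) s).getTagCount("teacher");
    if (count > bestCount) {
        bestCount = count;
        best = s;
    }
}
// best now holds the most likely synset; it may still contain several words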
I think that you should do another step (provided that speed is not important).
From the Lucene index, you should build another dictionary in which each word is mapped to a small object that contains its most probable synonym, that synonym's meaning, and the probability of appearance. I.e., given this code:
class Synonym {
    public String name;
    public double probability;
    public String meaning;
}
Map<String, Synonym> m = new HashMap<String, Synonym>();
... you just have to fill it from the Lucene index.