Hibernate Search / Lucene based Sorting Issue - java

I am having an issue in sorting, which specify below.
Previously, the code is writtern as
Sort sort = new Sort(new SortField[] {
SortField.FIELD_SCORE,
new SortField("field_1", SortField.STRING),
new SortField("field_2", SortField.STRING),
new SortField("field_2", SortField.LONG)
});
and this is an example pasted by the a stackoverflow answer here for custom sorting,
Sorting search result in Lucene based on a numeric field.
Though he does not suggest this is the correct way to do the sorting, this is also the code where my company has been used for years.
But when I create a new function, that needs to do sorting on lots of fields, and by performing unit testing, I found that it does not actually work as intended.
I need to remove SortField.FIELD_SCORE in order to make it works great. And I think this is suggested by the example described here if I did understand correctly, https://docs.jboss.org/hibernate/search/4.1/reference/en-US/html_single/#d0e5317.
i.e. the main code will convert to
Sort sort = new Sort(new SortField[] {
new SortField("field_1", SortField.STRING),
new SortField("field_2", SortField.STRING),
new SortField("field_2", SortField.LONG)
});
So my question is
what is the usage of SortField.FIELD_SCORE? How does the field score be calculated?
Why presenting SortField.FIELD_SCORE sometimes return correct value, sometimes don't?

what is the usage of SortField.FIELD_SCORE? How does the field score be calculated?
When you search for documents containing a word, each document gets assigned a "score": a float value, generally positive. The higher this value, the better the match. How exactly this is computed is a bit complex, and it gets worse when you have multiple nested queries (e.g. boolean queries, etc.), because then scores get combined with other formulas. Suffice it to say: the score is a number, there's one value for each document, and higher is better.
SortField.FIELD_SCORE will simply sort documents by descending score.
Why presenting SortField.FIELD_SCORE sometimes return correct value, sometimes don't?
Hard to say. It depends on lots of things, like your analyzers, the exact query you're running, and even the frequency of the search terms in your documents. Like I said, the formula used to compute the score is complex.
One thing that stand out in your sort, though, is that you're sorting by score and by actual fields. That's unlikely to work well. Scores are generally unique, so unless your documents are very similar (e.g. all text fields are empty for some reason), the top documents will have scores like this: [5.1, 3.4, 2.6, 2.4, 2.2]. Their order is already "complete": you can add as many subsequent sorts as you want, the order will not change because it is fully defined by the sort by score.
Think of alphabetical order: if I have to sort ["area", "baby"], the second letter of "baby" may be "a", but it doesn't matter, because the first letter is "b" and it's always going to be after the "a" of "area".
So, if you're not interested in a sort by score (and, if you don't know what score is, chances are you indeed are not interested), just stick to sorts by field:
Sort sort = new Sort(new SortField[] {
new SortField("field_1", SortField.STRING),
new SortField("field_2", SortField.STRING),
new SortField("field_2", SortField.LONG)
});
And if you're interested in a sort by score, then just sort by score:
Sort sort = new Sort(new SortField[] {
SortField.FIELD_SCORE
});
// Or equivalently
Sort sort = Sort.RELEVANCE; // "Relevance" means "sort by score"
Note that Hibernate Search 4.1 (the version for your documentation link) is very old; you should consider upgrading at least to 5.11 (similar API, also old but still maintained), and preferably to 6.0 (different, but more modern API, new and also maintained).

Related

Efficient Intersection and Union of Lists of Strings

I need to efficiently find the ratio of (intersection size / union size) for pairs of Lists of strings. The lists are small (mostly about 3 to 10 items), but I have a huge number of them (~300K) and have to do this on every pair, so I need this actual computation to be as efficient as possible. The strings themselves are short unicode strings -- averaging around 5-10 unicode characters.
The accepted answer here Efficiently compute Intersection of two Sets in Java? looked extremely helpful but (likely because my sets are small (?)) I haven't gotten much improvement by using the approach suggested in the accepted answer.
Here's what I have so far:
protected double uuEdgeWeight(UVertex u1, UVertex u2) {
Set<String> u1Tokens = new HashSet<String>(u1.getTokenlist());
List<String> u2Tokens = u2.getTokenlist();
int intersection = 0;
int union = u1Tokens.size();
for (String s:u2Tokens) {
if (u1Tokens.contains(s)) {
intersection++;
} else {
union++;
}
}
return ((double) intersection / union);
My question is, is there anything I can do to improve this, given that I'm working with Strings which may be more time consuming to check equality than other data types.
I think because I'm comparing multiple u2's against the same u1, I could get some improvement by doing the cloning of u2 into a HashSet outside of the loop (which isn't shown -- meaning I'd pass in the HashSet instead of the object from which I could pull the list and then clone into a set)
Anything else I can do to squeak out even a small improvement here?
Thanks in advance!
Update
I've updated the numeric specifics of my problem above. Also, due to the nature of the data, most (90%?) of the intersections are going to be empty. My initial attempt at this used the clone the set and then retainAll the items in the other set approach to find the intersection, and then shortcuts out before doing the clone and addAll to find the union. That was about as efficient as the code posted above, presumably because of the trade of between it being a slower algorithm overall versus being able to shortcut out a lot of the time. So, I'm thinking about ways to take advantage of the infrequency of overlapping sets, and would appreciate any suggestions in that regard.
Thanks in advance!
You would get a large improvement by moving the HashSet outside of the loop.
If the HashSet really has only got a few entries in it then you are probably actually just as fast to use an Array - since traversing an array is much simpler/faster. I'm not sure where the threshold would lie but I'd measure both - and be sure that you do the measurements correctly. (i.e. warm up loops before timed loops, etc).
One thing to try might be using a sorted array for the things to compare against. Scan until you go past current and you can immediately abort the search. That will improve processor branch prediction and reduce the number of comparisons a bit.
If you want to optimize for this function (not sure if it actually works in your context) you could assign each unique String an Int value, when the String is added to the UVertex set that Int as a bit in a BitSet.
This function should then become a set.or(otherset) and a set.and(otherset). Depending on the number of unique Strings that could be efficient.

How to sort an index file in lucene

I have an index which made by lucene and each document in it have 3 filed that one of them is a numeric field which is my frequency. I search in my index but before it I want to sort it by the numeric field. is there any way to sort it by lucene before my search?
Sorting before searching doesn't really make a lot of sense, since Lucene is creating an inverted index for searching against, rather than storing and searching through a sequential set of documents.
However, it sounds like you want to run a search and get results that are already sorted in a specified way.
This is done by passing a Sort to the IndexSearcher.search call, like:
SortField field = new SortField("frequency", SortField.Type.FLOAT);
//Sorting, first, by "frequency", then by relevance score
Sort sort = new Sort(field, Sort.FIELD_SCORE);
searcher.search(query, maxDocs, sort);
The name of the field makes me wonder if you aren't re-inventing the wheel though. Lucene already factors term frequency into it's relevance scores. If you want to tweak that sort of scoring, if might be a better idea to create a custom Similarity class to calculate scores for you, extending either TFIDFSimilarity or DefaultSimilarity, and overriding the method tf, particularly.

Java: optimize hashset for large-scale duplicate detection

I am working on a project where I am processing a lot of tweets; the goal is to remove duplicates as I process them. I have the tweet IDs, which come in as strings of the format "166471306949304320"
I have been using a HashSet<String> for this, which works fine for a while. But by the time I get to around 10 million items I am drastically bogged down and eventually get a GC error, presumably from the rehashing. I tried defining a better size/load with
tweetids = new HashSet<String>(220000,0.80F);
and that lets it get a little farther, but is still excruciatingly slow (by around 10 million it is taking 3x as long to process). How can I optimize this? Given that I have an approximate idea of how many items should be in the set by the end (in this case, around 20-22 million) should I create a HashSet that rehashes only two or three times, or would the overhead for such a set incur too many time-penalties? Would things work better if I wasn't using a String, or if I define a different HashCode function (which, in this case of a particular instance of a String, I'm not sure how to do)? This portion of the implementation code is below.
tweetids = new HashSet<String>(220000,0.80F); // in constructor
duplicates = 0;
...
// In loop: For(each tweet)
String twid = (String) tweet_twitter_data.get("id");
// Check that we have not processed this tweet already
if (!(tweetids.add(twid))){
duplicates++;
continue;
}
SOLUTION
Thanks to your recommendations, I solved it. The problem was the amount of memory required for the hash representations; first, HashSet<String> was simply enormous and uncalled for because the String.hashCode() is exorbitant for this scale. Next I tried a Trie, but it crashed at just over 1 million entries; reallocating the arrays was problematic. I used a HashSet<Long> to better effect and almost made it, but speed decayed and it finally crashed on the last leg of the processing (around 19 million). The solution came with departing from the standard library and using Trove. It finished 22 million records a few minutes faster than not checking duplicates at all. Final implementation was simple, and looked like this:
import gnu.trove.set.hash.TLongHashSet;
...
TLongHashSet tweetids; // class variable
...
tweetids = new TLongHashSet(23000000,0.80F); // in constructor
...
// inside for(each record)
String twid = (String) tweet_twitter_data.get("id");
if (!(tweetids.add(Long.parseLong(twid)))) {
duplicates++;
continue;
}
You may want to look beyond the Java collections framework. I've done some memory intensive processing and you will face several problems
The number of buckets for large hashmaps and hash sets is going to
cause a lot of overhead (memory). You can influence this by using
some kind of custom hash function and a modulo of e.g. 50000
Strings are represented using 16 bit characters in Java. You can halve that by using utf-8 encoded byte arrays for most scripts.
HashMaps are in general quite wasteful data structures and HashSets are basically just a thin wrapper around those.
Given that, take a look at trove or guava for alternatives. Also, your ids look like longs. Those are 64 bit, quite a bit smaller than the string representation.
An alternative you might want to consider is using bloom filters (guava has a decent implementation). A bloom filter would tell you if something is definitely not in a set and with reasonable certainty (less than 100%) if something is contained. That combined with some disk based solution (e.g. database, mapdb, mecached, ...) should work reasonably well. You could buffer up incoming new ids, write them in batches, and use the bloom filter to check if you need to look in the database and thus avoid expensive lookups most of the time.
If you are just looking for the existence of Strings, then I would suggest you try using a Trie(also called a Prefix Tree). The total space used by a Trie should be less than a HashSet, and it's quicker for string lookups.
The main disadvantage is that it can be slower when used from a harddisk as it's loading a tree, not a stored linearly structure like a Hash. So make sure that it can be held inside of RAM.
The link I gave is a good list of pros/cons of this approach.
*as an aside, the bloom filters suggested by Jilles Van Gurp are great fast prefilters.
Simple, untried and possibly stupid suggestion: Create a Map of Sets, indexed by the first/last N characters of the tweet ID:
Map<String, Set<String>> sets = new HashMap<String, Set<String>>();
String tweetId = "166471306949304320";
sets.put(tweetId.substr(0, 5), new HashSet<String>());
sets.get(tweetId.substr(0, 5)).add(tweetId);
assert(sets.containsKey(tweetId.substr(0, 5)) && sets.get(tweetId.substr(0, 5)).contains(tweetId));
That easily lets you keep the maximum size of the hashing space(s) below a reasonable value.

storing sets of integers to check if a certain set has already been mentioned

I've come across an interesting problem which I would love to get some input on.
I have a program that generates a set of numbers (based on some predefined conditions). Each set contains up to 6 numbers that do not have to be unique with integers that ranges from 1 to 100).
I would like to somehow store every set that is created so that I can quickly check if a certain set with the exact same numbers (order doesn't matter) has previously been generated.
Speed is a priority in this case as there might be up to 100k sets stored before the program stops (maybe more, but most the time probably less)! Would anyone have any recommendations as to what data structures I should use and how I should approach this problem?
What I have currently is this:
Sort each set before storing it into a HashSet of Strings. The string is simply each number in the sorted set with some separator.
For example, the set {4, 23, 67, 67, 71} would get encoded as the string "4-23-67-67-71" and stored into the HashSet. Then for every new set generated, sort it, encode it and check if it exists in the HashSet.
Thanks!
if you break it into pieces it seems to me that
creating a set (generate 6 numbers, sort, stringify) runs in O(1)
checking if this string exists in the hashset is O(1)
inserting into the hashset is O(1)
you do this n times, which gives you O(n).
this is already optimal as you have to touch every element once anyways :)
you might run into problems depending on the range of your random numbers.
e.g. assume you generate only numbers between one and one, then there's obviously only one possible outcome ("1-1-1-1-1-1") and you'll have only collisions from there on. however, as long as the number of possible sequences is much larger than the number of elements you generate i don't see a problem.
one tip: if you know the number of generated elements beforehand it would be wise to initialize the hashset with the correct number of elements (i.e. new HashSet<String>( 100000 ) );
p.s. now with other answers popping up i'd like to note that while there may be room for improvement on a microscopic level (i.e. using language specific tricks), your overal approach can't be improved.
Create a class SetOfIntegers
Implement a hashCode() method that will generate reasonably unique hash values
Use HashMap to store your elements like put(hashValue,instance)
Use containsKey(hashValue) to check if the same hashValue already present
This way you will avoid sorting and conversion/formatting of your sets.
Just use a java.util.BitSet for each set, adding integers to the set with the set(int bitIndex) method, you don't have to sort anything, and check a HashMap for already existing BitSet before adding a new BitSet to it, it will be really very fast. Don't use sorting of value and toString for that purpose ever if speed is important.

The best way to store and access 120,000 words in java

I'm programming a java application that reads strictly text files (.txt). These files can contain upwards of 120,000 words.
The application needs to store all +120,000 words. It needs to name them word_1, word_2, etc. And it also needs to access these words to perform various methods on them.
The methods all have to do with Strings. For instance, a method will be called to say how many letters are in word_80. Another method will be called to say what specific letters are in word_2200.
In addition, some methods will compare two words. For instance, a method will be called to compare word_80 with word_2200 and needs to return which has more letters. Another method will be called to compare word_80 with word_2200 and needs to return what specific letters both words share.
My question is: Since I'm working almost exclusively with Strings, is it best to store these words in one large ArrayList? Several small ArrayLists? Or should I be using one of the many other storage possibilities, like Vectors, HashSets, LinkedLists?
My two primary concerns are 1.) access speed, and 2.) having the greatest possible number of pre-built methods at my disposal.
Thank you for your help in advance!!
Wow! Thanks everybody for providing such a quick response to my question. All your suggestions have helped me immensely. I’m thinking through and considering all the options provided in your feedback.
Please forgive me for any fuzziness; and let me address your questions:
Q) English?
A) The text files are actually books written in English. The occurrence of a word in a second language would be rare – but not impossible. I’d put the percentage of non-English words in the text files at .0001%
Q) Homework?
A) I’m smilingly looking at my question’s wording now. Yes, it does resemble a school assignment. But no, it’s not homework.
Q) Duplicates?
A) Yes. And probably every five or so words, considering conjunctions, articles, etc.
Q) Access?
A) Both random and sequential. It’s certainly possible a method will locate a word at random. It’s equally possible a method will want to look for a matching word between word_1 and word_120000 sequentially. Which leads to the last question…
Q) Iterate over the whole list?
A) Yes.
Also, I plan on growing this program to perform many other methods on the words. I apologize again for my fuzziness. (Details do make a world of difference, do they not?)
Cheers!
I would store them in one large ArrayList and worry about (possibly unnecessary) optimisations later on.
Being inherently lazy, I don't think it's a good idea to optimise unless there's a demonstrated need. Otherwise, you're just wasting effort that could be better spent elsewhere.
In fact, if you can set an upper bound to your word count and you don't need any of the fancy List operations, I'd opt for a normal (native) array of string objects with an integer holding the actual number. This is likely to be faster than a class-based approach.
This gives you the greatest speed in accessing the individual elements whilst still retaining the ability to do all that wonderful string manipulation.
Note I haven't benchmarked native arrays against ArrayLists. They may be just as fast as native arrays, so you should check this yourself if you have less blind faith in my abilities than I do :-).
If they do turn out to be just as fast (or even close), the added benefits (expandability, for one) may be enough to justify their use.
Just confirming pax assumptions, with a very naive benchmark
public static void main(String[] args)
{
int size = 120000;
String[] arr = new String[size];
ArrayList al = new ArrayList(size);
for (int i = 0; i < size; i++)
{
String put = Integer.toHexString(i).toString();
// System.out.print(put + " ");
al.add(put);
arr[i] = put;
}
Random rand = new Random();
Date start = new Date();
for (int i = 0; i < 10000000; i++)
{
int get = rand.nextInt(size);
String fetch = arr[get];
}
Date end = new Date();
long diff = end.getTime() - start.getTime();
System.out.println("array access took " + diff + " ms");
start = new Date();
for (int i = 0; i < 10000000; i++)
{
int get = rand.nextInt(size);
String fetch = (String) al.get(get);
}
end = new Date();
diff = end.getTime() - start.getTime();
System.out.println("array list access took " + diff + " ms");
}
and the output:
array access took 578 ms
array list access took 907 ms
running it a few times the actual times seem to vary some, but generally array access is between 200 and 400 ms faster, over 10,000,000 iterations.
If you will access these Strings sequentially, the LinkedList would be the best choice.
For random access, ArrayLists have a nice memory usage/access speed tradeof.
My take:
For a non-threaded program, an Arraylist is always fastest and simplest.
For a threaded program, a java.util.concurrent.ConcurrentHashMap<Integer,String> or java.util.concurrent.ConcurrentSkipListMap<Integer,String> is awesome. Perhaps you would later like to allow threads so as to make multiple queries against this huge thing simultaneously.
If you're going for fast traversal as well as compact size, use a DAWG (Directed Acyclic Word Graph.) This data structure takes the idea of a trie and improves upon it by finding and factoring out common suffixes as well as common prefixes.
http://en.wikipedia.org/wiki/Directed_acyclic_word_graph
Use a Hashtable? This will give you your best lookup speed.
ArrayList/Vector if order matters (it appears to, since you are calling the words "word_xxx"), or HashTable/HashMap if it doesn't.
I'll leave the exercise of figuring out why you would want to use an ArrayList vs. a Vector or a HashTable vs. a HashMap up to you since I have a sneaking suspicion this is your homework. Check the Javadocs.
You're not going to get any methods that help you as you've asked for in the examples above from your Collections Framework class, since none of them do String comparison operations. Unless you just want to order them alphabetically or something, in which case you'd use one of the Tree implementations in the Collections framework.
How about a radix tree or Patricia trie?
http://en.wikipedia.org/wiki/Radix_tree
The only advantage of a linked list over an array or array list would be if there are insertions and deletions at arbitrary places. I don't think this is the case here: You read in the document and build the list in order.
I THINK that when the original poster talked about finding "word_2200", he meant simply the 2200th word in the document, and not that there are arbitrary labels associated with each word. If so, then all he needs is indexed access to all the words. Hence, an array or array list. If there really is something more complex, if one word might be labeled "word_2200" and the next word is labeled "foobar_42" or some such, then yes, he'd need a more complex structure.
Hey, do you want to give us a clue WHY you want to do any of this? I'm hard pressed to remember the last time I said to myself, "Hey, I wonder if the 1,237th word in this document I'm reading is longer or shorter than the 842nd word?"
Depends on what the problem is - speed or memory.
If it's memory, the minimum solution is to write a function getWord(n) which scans the whole file each time it runs, and extracts word n.
Now - that's not a very good solution. A better solution is to decide how much memory you want to use: lets say 1000 items. Scan the file for words once when the app starts, and store a series of bookmarks containing the word number and the position in the file where it is located - do this in such a way that the bookmarks are more-or-less evenly spaced through the file.
Then, open the file for random access. The function getWord(n) now looks at the bookmarks to find the biggest word # <= n (please use a binary search), does a seek to get to the indicated location, and scans the file, counting the words, to find the requested word.
An even quicker solution, using rather more memnory, is to build some sort of cache for the blocks - on the basis that getWord() requests usually come through in clusters. You can rig things up so that if someone asks for word # X, and its not in the bookmarks, then you seek for it and put it in the bookmarks, saving memory by consolidating whichever bookmark was least recently used.
And so on. It depends, really, on what the problem is - on what kind of patterns of retreival are likely.
I don't understand why so many people are suggesting Arraylist, or the like, since you don't mention ever having to iterate over the whole list. Further, it seems you want to access them as key/value pairs ("word_348"="pedantic").
For the fastest access, I would use a TreeMap, which will do binary searches to find your keys. Its only downside is that it's unsynchronized, but that's not a problem for your application.
http://java.sun.com/javase/6/docs/api/java/util/TreeMap.html

Categories

Resources