Which is the best choice for indexing a Boolean value in Lucene? - java

Indexing a Boolean value (true/false) in Lucene (no need to store it).
I want to use less disk space and get higher search performance.
doc.add(new Field("boolean","true",Field.Store.NO,Field.Index.NOT_ANALYZED_NO_NORMS));
//or
doc.add(new Field("boolean","1",Field.Store.NO,Field.Index.NOT_ANALYZED_NO_NORMS));
//or
doc.add(new NumericField("boolean",Integer.MAX_VALUE,Field.Store.NO,true).setIntValue(1));
Which should I choose? Or is there a better way?
Thanks a lot.

An interesting question!
I don't think the third option (NumericField) is a good choice for a boolean field. I can't think of any use case for this.
The Lucene search index (leaving to one side stored data, which you aren't using anyway) is stored as an inverted index: each distinct term is stored just once, together with a list of the documents containing it, so the few bytes' difference between the terms "true" and "1" is negligible for both disk usage and lookup speed.
That leaves your first and second options as (theoretically) identical.
If I were faced with this, I think I would choose option one ("true" and "false" terms), if readability influences the final decision.
Your choice of NOT_ANALYZED_NO_NORMS looks good, I think.

Lucene jumps through an elaborate set of hoops to make NumericField searchable by NumericRangeQuery, so definitely avoid it in all cases where your values don't represent quantities. For example, even if you index an integer, but only as a unique ID, you would still want to use a plain String field. Using "true"/"false" is the most natural way to index a boolean, while using "1"/"0" gives just a slight advantage by avoiding the possibility of a case mismatch or typo. I'd say this advantage is not worth much; go for true/false.
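For what it's worth, here is a minimal, self-contained sketch of option one in action, assuming the same Lucene 3.x API that the Field.Index constants above come from. A TermQuery matches the exact indexed term, so no analyzer or query parser is involved at search time:

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class BooleanFieldDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_35, new KeywordAnalyzer()));

        // Index the boolean as a plain, unanalyzed term with norms disabled.
        Document doc = new Document();
        doc.add(new Field("boolean", "true",
                Field.Store.NO, Field.Index.NOT_ANALYZED_NO_NORMS));
        writer.addDocument(doc);
        writer.close();

        // Search with an exact term match; no analysis, no query parsing.
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
        TopDocs hits = searcher.search(
                new TermQuery(new Term("boolean", "true")), 10);
        System.out.println("hits: " + hits.totalHits);
        searcher.close();
    }
}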

Use Solr (a flavour of Lucene) - it indexes all basic Java types natively.
I've used it and it rocks.

Related

Subtle difference when searching multi value fields in Solr

I have a very simple question but I don't understand exactly why it happens and what the difference is.
Take a simple Solr search on a multi value field:
field_name:ABC AND DEF
field_name:(ABC AND DEF)
They return quite different results. I understand the brackets are for grouping but I don't understand the difference. It seems quite subtle.
Many thanks.
The first query isn't doing what you think it's doing.
field_name:ABC AND DEF
This is parsed as:
field_name:ABC AND <default search field>:DEF
This is different from your second example, which is parsed as:
field_name:ABC AND field_name:DEF
In the first example the second part of your query is made against whatever field is defined as the default search field in your index (or in the query itself, if you've set df).
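To see the difference concretely, you can feed both strings to Lucene's classic query parser (which Solr's default "lucene" parser is built on) and print the parsed result. A small sketch, assuming Lucene 3.x and a default field named "text":

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.util.Version;

public class FieldScopeDemo {
    public static void main(String[] args) throws Exception {
        // "text" stands in for whatever your default search field is.
        QueryParser parser = new QueryParser(Version.LUCENE_35, "text",
                new StandardAnalyzer(Version.LUCENE_35));

        // Unparenthesized: only ABC is scoped to field_name.
        System.out.println(parser.parse("field_name:ABC AND DEF"));
        // prints: +field_name:abc +text:def

        // Parenthesized: both terms are scoped to field_name.
        System.out.println(parser.parse("field_name:(ABC AND DEF)"));
        // prints: +field_name:abc +field_name:def
    }
}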

create and query an n-gram index with lucene

I would like to build an index containing n-grams of each line from my input file, which looks like this:
Segeln bei den Olympischen Sommerspielen
Erdmond
Olympische Spiele
Turnen bei den Olympischen Sommerspielen
Tennis bei den Olympischen Sommerspielen
Geschichte der Astronomie
I need the n-grams because I would like to search the index while assuming that there are many typing errors in the search term. For example, I would like to find "Geschichte der Astronomie" if I search with the term "schichte astrologie". It would be even better if it could give me a list of the best possible matches, let's say the best 10 matches, no matter how bad they may be.
I hope you can point me in the right direction if there is a better way to achieve this than with n-grams, or give me a hint on how to create the index and how to query it. I would be very happy to have an example that helps me understand how to do it.
I currently use Lucene 4.3.1. I would prefer to implement it in Java and not build the index on the command line.
There are a lot of different ways to approach this problem, and Lucene has a lot of tools to help with them. N-grams are probably not the best approach in this situation, to my mind. Some alternatives:
Stemmers reduce terms to their root, based on linguistic rules (e.g. matching "fishing", "fished", and "fish"). (I don't claim to know how GermanStemmer handles the "ge" prefix, but that would be a good example of something a stemmer might deal with.)
A Synonym Filter can handle specific known synonyms you want to recognize (e.g. "astrology" = "astronomy").
Fuzzy queries can be used to obtain matches with low edit distances.
Among other possibilities.
As far as implementing n-grams goes, NGramTokenizer would be the correct tokenizer for that.
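For the typo-tolerance requirement specifically, here is a small sketch of the fuzzy-query option, assuming Lucene 4.3 and that each input line was indexed (analyzed and lowercased) into a field named "line". FuzzyQuery in Lucene 4 allows at most 2 edits, which happens to cover both of your examples ("schichte" -> "geschichte", "astrologie" -> "astronomie"):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;

public class FuzzyLineQuery {
    // Build a query in which each search word may match an indexed
    // word that is within 2 edits of it.
    public static Query build(String fieldName, String searchTerms) {
        BooleanQuery query = new BooleanQuery();
        for (String word : searchTerms.toLowerCase().split("\\s+")) {
            query.add(new FuzzyQuery(new Term(fieldName, word), 2),
                      BooleanClause.Occur.SHOULD);
        }
        return query;
    }
}

Passing the result of build("line", "schichte astrologie") to searcher.search(query, 10) returns the 10 best-scoring lines; any line matching at least one of the fuzzy terms is a candidate, however weak the match.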

Lucene: Searching multiple fields with default operator = AND

To allow users to search across multiple fields with Lucene 3.5, I currently create a QueryParser for each field to be searched and add each parsed query to a DisjunctionMaxQuery. This works great when using OR as the default operator, but I now want to change the default operator to AND to get more accurate (and fewer) results.
The problem is, queryParser.setDefaultOperator(QueryParser.AND_OPERATOR) misses many documents, since it requires all terms to appear together in at least one field.
For example, consider the following data for a document: title field = "Programming Languages", body field = "Java, C++, PHP". If a user were to search for Java Programming, this particular document would not be included in the results, since neither the title nor the body field contains all the terms in the query, although combined they do. I would want this document returned for the above query, but not for the query HTML Programming.
I've considered a catchall field but I have a few problems with it. First, users frequently include per field terms in their queries (author:bill) which is not possible with a catchall field. Also, I highlight certain fields with FastVectorHighlighter which requires them to be indexed and stored. So by adding a catchall field I would have to index most of the same data twice which is time and space consuming.
Any ideas?
Guess I should have done a little more research. Turns out MultiFieldQueryParser provides the exact functionality I was looking for. For whatever reason I was creating a QueryParser for each field I wanted to search like this:
String[] fields = {"title", "body", "subject", "author"};
QueryParser[] parsers = new QueryParser[fields.length];
for (int i = 0; i < parsers.length; i++) {
    parsers[i] = new QueryParser(Version.LUCENE_35, fields[i], analyzer);
    parsers[i].setDefaultOperator(QueryParser.AND_OPERATOR);
}
This would result in a query like this:
(+title:java +title:programming) | (+body:java +body:programming)
...which is not what I was looking for. Now I create a single MultiFieldQueryParser like this:
MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_35, new String[]{"title", "body", "subject"}, analyzer);
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
This gives me the query I was looking for:
+(title:java body:java) +(title:programming body:programming)
Thanks to @seeta and @femtoRgon for the help!
Perhaps what you need is a combination of Boolean queries that capture the different combinations of fields and terms. In your given example, the query could be -
(title:Java AND body:programming) OR (title:programming AND body:Java).
I don't know if there's an existing Query class that generates this automatically for you, but I think that's what should be the ultimate query that's run on the index.
If you want to be able to search multiple fields with the same set of terms, then the query from your comment:
((title:java title:programming) | (body:java body:programming))~0.2
may not be the best implementation.
You're effectively getting either the score from the title, or the score from the body for the combined set of terms. The case where you hit java in the title and programming in the body would be given approx. equal weight to a hit on java in the body and no hit on programming.
I think a better structured query would be:
(title:java body:java)~0.2 (title:programming body:programming)~0.2
This makes more sense to me, since you want the dismax queries to limit score growing on multiple queries of the same term (in different fields), but you do want scoring to grow for hits on different terms, I believe.
If that sort of query structure gets you better score results, limiting results to a certain minimum score (a percentage of the max score returned, rather than a simple hard-coded value) may be adequate to prevent too-weak results from being seen.
I also still wouldn't count out indexing an "all" (catchall) field. It's an implementation I've used before: indexing BOTH the specific fields and the catchall field, thus allowing both general querying and specific single-field queries. Index storage tends to be pretty lean for unstored terms, and it will generally help performance if you find yourself having to create big, complicated queries to make up for not having it.
If you really want to be sure that it takes minimal storage, you can even turn off TermVectors for that field:
new Field(name, value, Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.NO);
Although I don't know how much of a difference that would really make.
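To illustrate the dual-field setup, here is a sketch of indexing both the specific fields (stored, with positions-and-offsets term vectors, which is what FastVectorHighlighter requires) and a lean catchall field. The field name "all" and the sample values are placeholders:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

class CatchallIndexing {
    static Document buildDoc(String title, String body) {
        Document doc = new Document();
        // Specific fields: stored, with term vectors for highlighting.
        doc.add(new Field("title", title, Field.Store.YES,
                Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
        doc.add(new Field("body", body, Field.Store.YES,
                Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
        // Catchall: unstored, no term vectors, so the duplicated terms
        // stay relatively cheap in the index.
        doc.add(new Field("all", title + " " + body, Field.Store.NO,
                Field.Index.ANALYZED, Field.TermVector.NO));
        return doc;
    }
}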

Java's String.replace() vs. String.replaceFirst() vs. homebrew

I have a class that is doing a lot of text processing. For each string, which is anywhere from 100 to 2000 characters long, I am performing 30 different string replacements.
Example:
String modified;
for (int i = 0; i < num_strings; i++) {
    modified = runReplacements(strs[i]);
    // do stuff
}

public String runReplacements(String str) {
    str = str.replace("foo", "bar");
    str = str.replace("baz", "beef");
    // ...
    return str;
}
'foo', 'baz', and all other "targets" are only expected to appear once and are string literals (no need for an actual regex).
As you can imagine, I am concerned about performance :)
Given this,
replaceFirst() seems a bad choice because it won't use Pattern.LITERAL and will do extra processing that isn't required.
replace() seems a bad choice because it will traverse the entire string looking for multiple instances to be replaced.
Additionally, since my replacement texts are the same every time, it seems to make sense for me to write my own code; otherwise String.replaceFirst() or String.replace() will be doing a Pattern.compile every single time in the background. Thinking that I should write my own code, this is my thought:
Perform a Pattern.compile() only once for each literal replacement desired (no need to recompile every single time) (i.e. p1 - p30)
Then do the following for each pX: pX.matcher(str).replaceFirst(Matcher.quoteReplacement("desiredReplacement"));
This way I abandon ship on the first replacement (instead of traversing the entire string), and I am using literal vs. regex, and I am not doing a re-compile every single iteration.
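A sketch of that plan for one of the thirty replacements (the target/replacement pair is just illustrative):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralReplacer {
    // Compiled once; Pattern.LITERAL keeps "foo" from being parsed as a regex.
    private static final Pattern P1 = Pattern.compile("foo", Pattern.LITERAL);

    static String replaceFoo(String str) {
        // replaceFirst stops after the first match instead of scanning the
        // rest of the string; quoteReplacement protects '$' and '\' in the
        // replacement text.
        return P1.matcher(str).replaceFirst(Matcher.quoteReplacement("bar"));
    }
}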
So, which is the best for performance?
So, which is the best for performance?
Measure it! ;-)
ETA: Since a two word answer sounds irretrievably snarky, I'll elaborate slightly. "Measure it and tell us..." since there may be some general rule of thumb about the performance of the various approaches you cite (good ones, all) but I'm not aware of it. And as a couple of the comments on this answer have mentioned, even so, the different approaches have a high likelihood of being swamped by the application environment. So, measure it in vivo and focus on this if it's a real issue. (And let us know how it goes...)
First, run and profile your entire application with a simple match/replace. This may show you that:
your application already runs fast enough, or
your application is spending most of its time doing something else, so optimizing the match/replace code is not worthwhile.
Assuming that you've determined that match/replace is a bottleneck, write yourself a little benchmarking application that allows you to test the performance and correctness of your candidate algorithms on representative input data. It's also a good idea to include "edge case" input data that is likely to cause problems; e.g. for the substitutions in your example, input data containing the sequence "bazoo" could be an edge case. On the performance side, make sure that you avoid the traps of Java micro-benchmarking; e.g. JVM warmup effects.
Next implement some simple alternatives and try them out. Is one of them good enough? Done!
In addition to your ideas, you could try concatenating the search terms into a single regex (e.g. "(foo|baz)"), use Matcher.find(int) to find each occurrence, use a HashMap to look up the replacement strings, and use a StringBuilder to build the output String from input string substrings and replacements. (OK, this is not entirely trivial, and it depends on Pattern/Matcher handling alternations efficiently ... which I'm not sure is the case. But that's why you should compare the candidates carefully.)
In the (IMO unlikely) event that a simple alternative doesn't cut it, this wikipedia page has some leads which may help you to implement your own efficient match/replacer.
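For what it's worth, a rough sketch of the single-pass, combined-regex idea from above (the replacement pairs are illustrative, and, as noted, whether the alternation is efficient is exactly what your benchmark should check):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiReplacer {
    private static final Map<String, String> REPLACEMENTS =
            new LinkedHashMap<String, String>();
    static {
        REPLACEMENTS.put("foo", "bar");
        REPLACEMENTS.put("baz", "beef");
    }

    // One alternation over all (quoted) literal targets, compiled once.
    private static final Pattern TARGETS = buildPattern();

    private static Pattern buildPattern() {
        StringBuilder sb = new StringBuilder();
        for (String key : REPLACEMENTS.keySet()) {
            if (sb.length() > 0) sb.append('|');
            sb.append(Pattern.quote(key));
        }
        return Pattern.compile(sb.toString());
    }

    // Single left-to-right pass: copy unmatched gaps, swap in replacements.
    static String runReplacements(String input) {
        Matcher m = TARGETS.matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            out.append(REPLACEMENTS.get(m.group()));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }
}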
Isn't it frustrating when you ask a question and get a bunch of advice telling you to do a whole lot of work and figure it out for yourself?!
I say use replaceAll();
(I have no idea if it is, indeed, the most efficient, I just don't want you to feel like you wasted your money on this question and got nothing.)
[edit]
PS. After that, you might want to measure it.
[edit 2]
PPS. (and tell us what you found)

how to perform word clustering using k-means algorithm in java

Please help me perform word clustering using the k-means algorithm in Java. From the set of documents, I get each word and its frequency count, but then I don't know how to start the clustering. I have already searched Google, but found nothing to go on. Please tell me the steps to perform word clustering. It's very much needed now. Thanks in advance.
"Programming Collective Intelligence" by Toby Segaran has a wonderful chapter on how to do this. The examples are in Python, but they should be easy to port to Java.
In clustering, the most important thing is to build a method which checks how "close" two things are. E.g. if you are interested in strings of similar length, this could be something like:
int calculateDistance(String s1, String s2) {
    return Math.abs(s1.length() - s2.length());
}
Then, I'm not so sure, but it can go like this:
1. Choose (can be randomly) the first k strings as cluster centers.
2. Iterate over all strings, and assign each one to its "nearest" center.
Then it can be something like choosing the "middle" of every cluster as its new center, and starting again. I don't remember it 100%, but I think it is a good way to start.
And remember that the most important thing is the calculateDistance() method!
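To make the steps above concrete, here is a minimal, self-contained sketch that follows the outline literally. Since two strings have no obvious "mean", each cluster's new center is chosen from among its members (strictly speaking this is k-medoids rather than k-means), and the toy length-based distance above is reused; a real implementation would plug in a distance based on the word-frequency vectors:

import java.util.Random;

public class WordKMeans {

    // The toy distance from above; swap in something smarter for real use.
    static int calculateDistance(String s1, String s2) {
        return Math.abs(s1.length() - s2.length());
    }

    // Returns, for each word, the index (0..k-1) of its assigned cluster.
    static int[] cluster(String[] words, int k, int iterations) {
        Random rnd = new Random();
        String[] centers = new String[k];
        for (int i = 0; i < k; i++) {
            // Step 1: pick initial centers at random (a real implementation
            // would avoid picking the same word twice).
            centers[i] = words[rnd.nextInt(words.length)];
        }
        int[] assignment = new int[words.length];
        for (int iter = 0; iter < iterations; iter++) {
            // Step 2: assign every word to its nearest center.
            for (int w = 0; w < words.length; w++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (calculateDistance(words[w], centers[c])
                            < calculateDistance(words[w], centers[best])) {
                        best = c;
                    }
                }
                assignment[w] = best;
            }
            // Step 3: re-center each cluster on the member with the smallest
            // total distance to the rest of the cluster ("the middle of it").
            for (int c = 0; c < k; c++) {
                int bestCost = Integer.MAX_VALUE;
                for (int w = 0; w < words.length; w++) {
                    if (assignment[w] != c) continue;
                    int cost = 0;
                    for (int v = 0; v < words.length; v++) {
                        if (assignment[v] == c) {
                            cost += calculateDistance(words[w], words[v]);
                        }
                    }
                    if (cost < bestCost) {
                        bestCost = cost;
                        centers[c] = words[w];
                    }
                }
            }
        }
        return assignment;
    }
}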
