What is the difference between the - and NOT operators in Lucene?

The Lucene query syntax documentation says the following:
The NOT operator excludes documents that contain the term after NOT.
...
The "-" or prohibit operator excludes documents that contain
the term after the "-" symbol
I think the difference is that the - operator can be used alone, which is not the case for NOT. Is that it?

There is a very subtle difference. Take a look at this long thread on "Getting a Better Understanding of Lucene's Search Operators" which should hopefully answer your question.

Quick answer:
There is no difference between the behavior of the - (prohibit) operator and the NOT operator. The documentation does not make this especially clear, I think.
NOT is a synonym for -, here.
This can be demonstrated with some tests, summarized below.
See also the extract at the end of this answer for a summary which does a great job of distilling various points about the Lucene classic query parser.
Probably the most important point to take away is that the AND, OR, and NOT operators are not the same as "traditional" boolean operators. They are subtly different. This is because Lucene's classic query parser is only partially reliant on boolean operations - specifically, whether a document should receive a score or not. Beyond that, operators can be used in distinctly "non-boolean" ways, to affect how documents are scored relative to each other.
This makes sense, given Lucene's purpose of showing results in order of relevance.
Test inputs:
I am using:
Lucene 8.9.0
the StandardAnalyzer
a TextField named "body"
the classic query parser
the default query parser operator (A B means "A or B")
the following 6 test documents:
apples
oranges
apples oranges
bananas
apples bananas
oranges bananas
See here for the official "classic query parser" syntax documentation.
First test case: A -B
My paraphrase: "documents which contain A but cannot contain B"
The following query strings...
apples -oranges
apples NOT oranges
apples OR -oranges
apples OR NOT oranges
...are all parsed to the same query, using org.apache.lucene.queryparser.classic.QueryParser. That query is:
body:apples -body:oranges
They therefore all generate the same hits:
doc = 0; score = 0.3648143
field = apples
doc = 4; score = 0.2772589
field = apples bananas
Second test case: -X
The following query strings...
-apples
NOT apples
-anything
NOT anything
...are all parsed to the following 2 queries:
-body:apples
-body:anything
These queries always generate no hits, regardless of the data in the source documents.
This may be counterintuitive - especially -anything.
In the first case, the single clause -body:apples forces all documents containing apples to be given a score of zero. But now there are no more clauses in the query - and therefore there is no additional information which can be used to calculate any scores for the remaining documents. They therefore all stay at their initial state of "unscored". Therefore, no documents can be returned.
In the second case -body:anything, the overall logic is the same. After removing all the documents containing anything from scoring consideration (even if that means removing no documents at all), there is still no more information in the query which can be used for scoring purposes.
Third test case: A AND -B
The following query strings...
apples AND -oranges
apples AND NOT oranges
...are both parsed to the same query:
+body:apples -body:oranges
This is very similar to the first test case - and actually returns the same hits with the same score. This specific case is not significant when investigating the differences between - and NOT, since it gives the same results as test case 1.
Digression: A more interesting test case would be A B versus +A B, where there is a difference in results and scoring (+A forces A to be required). But that is outside the scope of this question.
More Background
Looking at the e-mail thread referred to in another answer, here is a copy of the most relevant section, reproduced here for reference:
begin copied section
In a nutshell...
Lucene's QueryParser class does not parse boolean expressions -- it
might look like it, but it does not.
Lucene's BooleanQuery clause does not model Boolean Queries ... it
models aggregate queries.
the most native way to represent the options available in a lucene
"BooleanQuery" as a string is with the +/- prefixes, where...
+foo ... means foo is a required clause and docs must match it
-foo ... means foo is prohibited clause and docs must not match it
foo ... means foo is an optional clause and docs that match it will
get score benefits for doing so.
in an attempt to make things easier for people who have
simple needs, QueryParser "fakes" that it parses boolean expressions
by interpreting A AND B as +A +B; A OR B as A B and NOT A as
-A
if you change the default operator on QueryParser to be AND then
things get more complicated, mainly because then QueryParser treats
A B the same as +A +B
you should avoid thinking in terms of AND, OR, and NOT ... think in
terms of OPTIONAL, REQUIRED, and PROHIBITED ... your life will be much
easier: documentation will make more sense, conversations on the email
list will be more synergistastic, wine will be sweeter, and food will
taste better.
end copied section
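The optional / required / prohibited model described in the copied section can be sketched in plain Java, without Lucene. This is only a toy model of which documents match (it does no scoring and no real query parsing), and the class and method names are made up for illustration. It uses the six test documents from this answer and assumes the rule that a document needs at least one positive clause (required or optional) to match before prohibited clauses are applied:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model of required (+), optional (bare) and prohibited (-) clauses.
// It is NOT Lucene code; it only mimics which documents would match.
class ClauseModel {
    // The six test documents, with ids 0..5 in the order listed above.
    static final String[] DOCS = {
        "apples", "oranges", "apples oranges", "bananas", "apples bananas", "oranges bananas"
    };

    // Accepts only whitespace-separated clauses of the form term, +term or -term.
    static List<Integer> search(String query) {
        List<String> required = new ArrayList<>();
        List<String> optional = new ArrayList<>();
        List<String> prohibited = new ArrayList<>();
        for (String clause : query.trim().split("\\s+")) {
            if (clause.startsWith("+")) {
                required.add(clause.substring(1));
            } else if (clause.startsWith("-")) {
                prohibited.add(clause.substring(1));
            } else {
                optional.add(clause);
            }
        }
        List<Integer> hits = new ArrayList<>();
        for (int id = 0; id < DOCS.length; id++) {
            List<String> terms = Arrays.asList(DOCS[id].split(" "));
            boolean hasAnyOptional = optional.stream().anyMatch(terms::contains);
            boolean hasProhibited = prohibited.stream().anyMatch(terms::contains);
            // Without required clauses, at least one optional clause must match.
            // A query made only of prohibited clauses therefore selects nothing.
            boolean positiveMatch = required.isEmpty() ? hasAnyOptional : terms.containsAll(required);
            if (positiveMatch && !hasProhibited) {
                hits.add(id);
            }
        }
        return hits;
    }
}
```

With this model, ClauseModel.search("apples -oranges") and ClauseModel.search("+apples -oranges") both select documents 0 and 4, while ClauseModel.search("-apples") and ClauseModel.search("-anything") select nothing, which mirrors the three test cases above.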

A long time back I read this somewhere, and it is similar to your guess:
The NOT operator cannot be used with just one term. For example, the following search will return no results:
NOT "jakarta apache"
whereas the "-" or prohibit operator excludes documents that contain the term after the "-" symbol.
Hope this is useful.

Related

Is there any way to write parsing logic using json?

I have a map in java Map<String,Object> dataMap whose content looks like this -
{country=Australia, animal=Elephant, age=18}
While parsing the map, various conditional statements may be used, like:
if(dataMap.get("country").contains("stra"))
OR
if(dataMap.get("animal") || 100 ==0)
OR
Some other operation inside if
I want to create a config file that contains all the rules for what the data inside the Map should look like. In simple words, I want to define, all in the config file, the conditions that the values corresponding to the keys country, animal, and age should satisfy and the operations that should be performed on them, so that the if/else chains and extra code can be removed. The config file will be used for parsing the map.
Can someone tell me how such a config file can be written, and how can it be used inside Java?
Sample examples and code references will be of help.
I am thinking of creating a json file for this purpose
Example -
Boolean b = true;
List<String> conditions = new ArrayList<>();
if(dataMap.get("animal").toString().contains("pha")){
    conditions.add("condition1 satisfied");
    if(((Integer.parseInt(dataMap.get("age").toString()) || 100) ==0)){
        conditions.add("condition2 satisfied");
        if(dataMap.get("country").equals("Australia")){
            conditions.add("condition3 satisfied");
        }
        else{
            b=false;
        }
    }
    else{
        b=false;
    }
}
else{
    b=false;
}
Now suppose I want to define the conditions in a config file for each map value like the operation ( equals, OR, contains) and the test values, instead of using if else's. Then the config file can be used for parsing the java map

Just to manage expectations: Doing this in JSON is a horrible, horrible idea.
To give you some idea of what you're trying to make:
Grammars like this are best visualized as a tree structure. The 'nodes' in this tree are:
'atomics' (100 is an atom, so is "animal", so is dataMap).
'operations' (+ is an operation, so is or / ||).
potentially, 'actions', though you can encode those as operations.
Java works like this, so do almost all programming languages, and so does a relatively simple 'mathematical expression engine', such as something that can evaluate e.g. the string "(1 + 2) * 3 + 5 * 10" into 59.
In java, dataMap.get("animal") || 100 ==0 is parsed into this tree:
              OR operation
             /            \
   INVOKE get[1]          equality
     /      \             /     \
dataMap   "animal"   INT(100)   INT(0)
where [1] is stored as INVOKEVIRTUAL java.util.Map :: get(Object) with as 'receiver' an IDENT node, which is an atomic, with value dataMap, and an args list node which contains 1 element, the string literal atomic "animal", to be very precise.
Once you see this tree you see how the notion of precedence works: your engine will need to be capable of representing both (1 + 2) * 3 and 1 + (2 * 3), so doing this without trees is not really possible unless you delve into bizarre syntax where the lexical ordering matches the processing ordering (if you want that, look at how reverse Polish notation calculators work, or at a stack-based language such as Forth. I don't think you'll like what you find there).
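To make the tree idea concrete, here is a minimal sketch of an expression tree in Java. The names (ExprTree, Node, num, add, mul) are invented for this illustration. There is no parser here: the trees are built by hand, which is exactly the part a real engine would have to derive from the input text.

```java
import java.util.function.IntBinaryOperator;

// Hypothetical minimal expression tree: atomics (numbers) and operations (add, mul).
class ExprTree {
    // Every node can evaluate itself to an int.
    interface Node {
        int eval();
    }

    // An atomic node holding a literal value.
    static Node num(int value) {
        return () -> value;
    }

    // An operation node: evaluates both children, then combines the results.
    static Node op(IntBinaryOperator f, Node left, Node right) {
        return () -> f.applyAsInt(left.eval(), right.eval());
    }

    static Node add(Node left, Node right) {
        return op((a, b) -> a + b, left, right);
    }

    static Node mul(Node left, Node right) {
        return op((a, b) -> a * b, left, right);
    }
}
```

The shape of the tree carries the grouping: ExprTree.mul(ExprTree.add(ExprTree.num(1), ExprTree.num(2)), ExprTree.num(3)) evaluates to 9, while ExprTree.add(ExprTree.num(1), ExprTree.mul(ExprTree.num(2), ExprTree.num(3))) evaluates to 7. The tree for (1 + 2) * 3 + 5 * 10 is an add node whose children are those two kinds of mul nodes, and it evaluates to 59.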
You're already making language design decisions here. Apparently, you think the language should adopt a 'truthy'/'falsy' concept, where dataMap.get("animal") which presumably returns an animal object, is to be considered as 'true' (as you're using it in a boolean operation) if, presumably, it isn't null or whatnot.
So, you're designing an entire programming language here. Why handicap yourself by enforcing that it is written in, of all things, JSON, which is epically unsuitable for the job? Go whole hog and write an entire language. It'll take 2 to 3 years, of course. Doing it in JSON isn't going to knock more than a week off that total, and it produces something so incredibly annoying to write that nobody would ever do it, buying you nothing.
The language will also naturally trend towards turing completeness. Once a language is turing complete, it becomes mathematically impossible to answer such questions as: "Is this code ever going to actually finish running or will it loop forever?" (see 'halting problem'), you have no idea how much memory or CPU power it takes, and other issues that then result in security needs. These are solvable problems (sandboxing, for example), but it's all very complicated.
The JVM is, what, 2000 person-years worth of experience and effort?
If you've got 2000 years to write all this, by all means. The point is: there is no 'simple' way here. Either it's a woefully incomplete thing that never feels like you can actually do what you'd want to do (which is express arbitrary ideas in a manner that feels natural enough, can be parsed by your system, and still makes sense when you read it back), or it's as complex as any language would be.
Why not just ... use a language? Let folks write not JSON but write full blown java, or js, or python, or ruby, or lua, or anything else that already exists, is open source, seems well designed?

Unexpected results from Metaphone algorithm

I am using phonetic matching for different words in Java. I used Soundex, but it is too crude. I switched to Metaphone and found it was better. However, when I tested it rigorously, I found some weird behaviour. I want to ask whether that is the way Metaphone works or whether I am using it the wrong way. Take the following example:
Metaphone meta = new Metaphone();
if (meta.isMetaphoneEqual("cricket","criket")) System.out.println("Match 1");
if (meta.isMetaphoneEqual("cricket","criketgame")) System.out.println("Match 2");
This prints:
Match 1
Match 2
Now "cricket" does sound like "criket", but how come "cricket" and "criketgame" are considered the same? If someone could explain this, it would be of great help.

Your usage is slightly off. Printing the default maximum code length and the encoded strings shows that the maximum is 4, which truncates the code for the longer "criketgame":
System.out.println(meta.getMaxCodeLen());
System.out.println(meta.encode("cricket"));
System.out.println(meta.encode("criket"));
System.out.println(meta.encode("criketgame"));
Output (note "criketgame" is truncated from "KRKTKM" to "KRKT", which matches "cricket"):
4
KRKT
KRKT
KRKT
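The effect can be reproduced without the library. The sketch below models only the length cap, not the Metaphone encoding itself: it takes the two codes reported in this answer ("KRKT" and "KRKTKM") as given and compares them after cutting both to a maximum length. The class and method names are made up for this illustration.

```java
// Hypothetical helper that models only the maximum-code-length cap.
// It does not compute Metaphone codes; the codes are passed in as plain strings.
class CodeTruncation {
    // Cuts a phonetic code to at most maxCodeLen characters.
    static String truncate(String code, int maxCodeLen) {
        return code.length() <= maxCodeLen ? code : code.substring(0, maxCodeLen);
    }

    // Two codes are treated as equal if they are identical after truncation.
    static boolean equalAfterTruncation(String a, String b, int maxCodeLen) {
        return truncate(a, maxCodeLen).equals(truncate(b, maxCodeLen));
    }
}
```

With a cap of 4, CodeTruncation.equalAfterTruncation("KRKT", "KRKTKM", 4) is true, because both codes collapse to "KRKT". With a cap of 8 the same call is false, since the trailing "KM" survives.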
Solution: Set the maximum code length to something appropriate for your application and the expected input. For example:
meta.setMaxCodeLen(8);
System.out.println(meta.encode("cricket"));
System.out.println(meta.encode("criket"));
System.out.println(meta.encode("criketgame"));
Now outputs:
KRKT
KRKT
KRKTKM
And now your original test gives the expected results:
Metaphone meta = new Metaphone();
meta.setMaxCodeLen(8);
System.out.println(meta.isMetaphoneEqual("cricket","criket"));
System.out.println(meta.isMetaphoneEqual("cricket","criketgame"));
Printing:
true
false
As an aside, you may also want to experiment with DoubleMetaphone, which is an improved version of the algorithm.
By the way, note the caveat from the documentation regarding thread-safety:
The instance field maxCodeLen is mutable but is not volatile, and accesses are not synchronized. If an instance of the class is shared between threads, the caller needs to ensure that suitable synchronization is used to ensure safe publication of the value between threads, and must not invoke setMaxCodeLen(int) after initial setup.

Simple physical quantity measurement unit parser for Java

I want to be able to parse expressions representing physical quantities like
g/l
m/s^2
m/s/kg
m/(s*kg)
kg*m*s
°F/(lb*s^2)
and so on. In the simplest way possible. Is it possible to do so using something like Pyparsing (if such a thing exists for Java), or should I use more complex tools like Java CUP?
EDIT: To answer MrD's question, the goal is to convert between quantities, for example converting g to kg (this one is simple...), or maybe °F/(kg*s^2) to K/(lb*h^2), supposing h stands for hour and lb for pounds.

This is harder than it looks (I have done a fair amount of work here). The main problem is that there is no standard (I have worked with NIST on units, and although they have finally created a markup language, few people use it). So it's really a form of natural language processing and has to deal with:
ambiguity (what does "M" mean - meters or mega)
inconsistent punctuation
abbreviations
symbols (e.g. "mu" for micro)
unclear semantics (e.g. is kg/m/s the same as kg/(m*s)?)
If you are just creating a toy system then you should create a BNF for the system and make sure that all examples adhere to it. This will use common punctuation ("/", "*", "(", ")", "^"). Character fields can be of variable length ("m", "kg", "lb"). Algebra on these strings ("kg" -> 1000 "g") has problems, as kg is a fundamental unit.
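For the toy-system route, a hand-written recursive-descent parser is probably enough. The sketch below is one possible grammar, not a standard: it assumes that * and / are left-associative (so m/s/kg means m/(s*kg)), that exponents are integers, and that a unit symbol is any run of characters other than the five punctuation marks. The class name UnitParser is made up. It returns each unit symbol with its net exponent and does no unit conversion.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy parser for unit expressions such as "m/s^2" or "m/(s*kg)".
// Assumed grammar:
//   expr   := factor (('*' | '/') factor)*
//   factor := atom ('^' integer)?
//   atom   := unit | '(' expr ')'
class UnitParser {
    private final String src;
    private int pos = 0;

    private UnitParser(String text) {
        this.src = text.replaceAll("\\s+", "");
    }

    // Returns a map from unit symbol to its net exponent.
    static Map<String, Integer> parse(String text) {
        UnitParser p = new UnitParser(text);
        Map<String, Integer> result = p.expr();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("Unexpected '" + p.peek() + "' at " + p.pos);
        }
        return result;
    }

    private char peek() {
        return src.charAt(pos);
    }

    private Map<String, Integer> expr() {
        Map<String, Integer> acc = factor();
        while (pos < src.length() && (peek() == '*' || peek() == '/')) {
            // Division is multiplication by the negated exponents.
            int sign = src.charAt(pos++) == '*' ? 1 : -1;
            for (Map.Entry<String, Integer> e : factor().entrySet()) {
                acc.merge(e.getKey(), sign * e.getValue(), Integer::sum);
            }
        }
        acc.values().removeIf(v -> v == 0); // drop units that cancel out, e.g. m/m
        return acc;
    }

    private Map<String, Integer> factor() {
        Map<String, Integer> base = atom();
        if (pos < src.length() && peek() == '^') {
            pos++;
            int start = pos;
            if (pos < src.length() && peek() == '-') pos++;
            while (pos < src.length() && Character.isDigit(peek())) pos++;
            // parseInt rejects an empty or sign-only exponent with NumberFormatException.
            int exponent = Integer.parseInt(src.substring(start, pos));
            base.replaceAll((unit, power) -> power * exponent);
        }
        return base;
    }

    private Map<String, Integer> atom() {
        if (pos < src.length() && peek() == '(') {
            pos++;
            Map<String, Integer> inner = expr();
            if (pos >= src.length() || peek() != ')') {
                throw new IllegalArgumentException("Missing ')' at " + pos);
            }
            pos++;
            return inner;
        }
        int start = pos;
        while (pos < src.length() && "*/^()".indexOf(peek()) < 0) pos++;
        if (start == pos) {
            throw new IllegalArgumentException("Unit expected at " + pos);
        }
        Map<String, Integer> unit = new TreeMap<>();
        unit.put(src.substring(start, pos), 1);
        return unit;
    }
}
```

Under these assumptions UnitParser.parse("m/s^2") gives m with exponent 1 and s with exponent -2, and "m/s/kg" parses to the same map as "m/(s*kg)". Whether that left-associative reading is the one your data intends is exactly the "unclear semantics" problem listed above.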
If you are doing it seriously then ANTLR (as @Yaugen suggests) is useful, but be aware that units in the wild will not follow a regular grammar due to the inconsistencies above.
If you are REALLY serious (i.e. prepared to put in a solid month), I'd be interested to know. :-)
My current approach (which is outside the scope of your question) is to collect a large number of examples from the literature automatically and create a number of heuristics.

Lucene: Searching multiple fields with default operator = AND

To allow users to search across multiple fields with Lucene 3.5 I currently create and add a QueryParser for each field to be searched to a DisjunctionMaxQuery. This works great when using OR as the default operator but I now want to change the default operator to AND to get more accurate (and fewer) results.
The problem is that queryParser.setDefaultOperator(QueryParser.AND_OPERATOR) misses many documents, since all terms must be in at least 1 field.
For example, consider the following data for a document: title field = "Programming Languages", body field = "Java, C++, PHP". If a user were to search for Java Programming, this particular document would not be included in the results, since neither the title nor the body field contains all terms in the query, although combined they do. I would want this document returned for the above query but not for the query HTML Programming.
I've considered a catchall field but I have a few problems with it. First, users frequently include per field terms in their queries (author:bill) which is not possible with a catchall field. Also, I highlight certain fields with FastVectorHighlighter which requires them to be indexed and stored. So by adding a catchall field I would have to index most of the same data twice which is time and space consuming.
Any ideas?

Guess I should have done a little more research. Turns out MultiFieldQueryParser provides the exact functionality I was looking for. For whatever reason I was creating a QueryParser for each field I wanted to search like this:
String[] fields = {"title", "body", "subject", "author"};
QueryParser[] parsers = new QueryParser[fields.length];
for(int i = 0; i < parsers.length; i++)
{
parsers[i] = new QueryParser(Version.LUCENE_35, fields[i], analyzer);
parsers[i].setDefaultOperator(QueryParser.AND_OPERATOR);
}
This would result in a query like this:
(+title:java +title:programming) | (+body:java +body:programming)
...which is not what I was looking for. Now I create a single MultiFieldQueryParser like this:
MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_35, new String[]{"title", "body", "subject"}, analyzer);
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
This gives me the query I was looking for:
+(title:java body:java) +(title:programming body:programming)
Thanks to @seeta and @femtoRgon for the help!
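The difference between the two query shapes can be checked with a small plain-Java model that does not use Lucene at all. It only answers "does this document match?" and ignores scoring. The class and method names are invented for this sketch, and the tokenizer is a crude stand-in for a real analyzer (lower-casing and splitting on anything that is not a letter or digit).

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical matching model for the two query shapes shown above. Not Lucene code.
class MultiFieldModel {
    // Crude stand-in for an analyzer: lower-case, split on non-alphanumerics.
    static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Shape (+f1:a +f1:b) | (+f2:a +f2:b): a single field must contain every term.
    static boolean allTermsInOneField(Map<String, String> doc, String query) {
        Set<String> terms = tokens(query);
        return doc.values().stream().anyMatch(value -> tokens(value).containsAll(terms));
    }

    // Shape +(f1:a f2:a) +(f1:b f2:b): every term must appear in at least one field.
    static boolean eachTermInAnyField(Map<String, String> doc, String query) {
        return tokens(query).stream().allMatch(
            term -> doc.values().stream().anyMatch(value -> tokens(value).contains(term)));
    }
}
```

For the example document (title "Programming Languages", body "Java, C++, PHP"), allTermsInOneField rejects the query Java Programming, while eachTermInAnyField accepts it and still rejects HTML Programming, which is the behaviour asked for in the question.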

Perhaps what you need is a combination of Boolean queries that capture the different combinations of fields and terms. In your given example, the query could be -
(title:Java AND body:programming) OR (title:programming AND body:Java).
I don't know if there's an existing Query class that generates this automatically for you, but I think that's what should be the ultimate query that's run on the index.

If you want to be able to search multiple fields with the same set of terms, then the query from your comment:
((title:java title:programming) | (body:java body:programming))~0.2
may not be the best implementation.
You're effectively getting either the score from the title, or the score from the body for the combined set of terms. The case where you hit java in the title and programming in the body would be given approx. equal weight to a hit on java in the body and no hit on programming.
I think a better structured query would be:
(title:java body:java)~0.2 (title:programming body:programming)~0.2
This makes more sense to me, since you want the dismax queries to limit score growing on multiple queries of the same term (in different fields), but you do want scoring to grow for hits on different terms, I believe.
If that sort of query structure gets you better score results, limiting results to a certain minimum score (a percentage of the max score returned, rather than a simple hard-coded value) may be adequate to prevent too-weak results from being seen.
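A relative cutoff like that is simple to apply after the search has run. The sketch below works on a plain list of scores sorted from best to worst rather than on Lucene result objects, and the class name, method name, and the 0.5 fraction in the usage note are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-filter: keep only hits scoring at least a fraction of the best score.
class ScoreCutoff {
    // scoresDescending must be sorted from highest to lowest score.
    static List<Float> keepAboveFraction(List<Float> scoresDescending, float fraction) {
        List<Float> kept = new ArrayList<>();
        if (scoresDescending.isEmpty()) {
            return kept;
        }
        float threshold = scoresDescending.get(0) * fraction; // relative to the top hit
        for (float score : scoresDescending) {
            if (score >= threshold) {
                kept.add(score);
            }
        }
        return kept;
    }
}
```

For scores 2.0, 1.0 and 0.3 with a fraction of 0.5, the threshold is 1.0 and only the first two hits are kept.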
I also still wouldn't count out indexing an all field. It's an implementation I've used before, while indexing BOTH the specific field and the catchall field, thus allowing both general querying and specific single-field queries. Index storage tends to be pretty lean for unstored terms, and it will generally help performance, if you find yourself having to create big, complicated queries to make up for not having it.
If you really want to be sure that it takes minimal storage, you can even turn off TermVectors for that field:
new Field(name, value, Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.NO);
Although I don't know how much of a difference that would really make.

Detecting equivalent expressions

I'm currently working on a Java application where I need to implement a system for building BPF expressions. I also need to implement a mechanism for detecting equivalent BPF expressions.
Building the expression is not too hard. I can build a syntax tree using the Interpreter design pattern and implement the toString for getting the BPF syntax.
However, detecting if two expressions are equivalent is much harder. A simple example would be the following:
A: src port 1024 and dst port 1024
B: dst port 1024 and src port 1024
In order to detect that A and B are equivalent I probably need to transform each expression into a "normalized" form before comparing them. This would be easy for above example, however, when working with a combination of nested AND, OR and NOT operations it's getting harder.
Does anyone know how I should best approach this problem?

One way to compare boolean expressions may be to convert both to the disjunctive normal form (DNF), and compare the DNF. Here, the variables would be Berkeley Packet Filter tokens, and the same token (e.g. port 80) appearing anywhere in either of the two expressions would need to be assigned the same variable name.
There is an interesting-looking applet at http://www.izyt.com/BooleanLogic/applet.php - sadly I can't give it a try right now due to Java problems in my browser.

I'm pretty sure detecting equivalent expressions is either an NP-hard or NP-complete problem, even for boolean-only expressions. This means that to do it perfectly, the optimal way is basically to build complete tables of all possible combinations of inputs and the results, then compare the tables.
Maybe BPF expressions are limited in some way that changes that? I don't know, so I'm assuming not.
If your problems are small, that may not be a problem. I do exactly that as part of a decision-tree designing algorithm.
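For small problems the table comparison is only a few lines. In this sketch each distinct BPF token (for example "src port 1024") is assumed to have been mapped to an index in a boolean array, and each expression is represented as a predicate over that array. The class and method names are made up, and the approach only scales to a modest number of variables, since it tries all 2^n assignments.

```java
import java.util.function.Predicate;

// Hypothetical brute-force equivalence check over all truth assignments.
class TruthTableEquivalence {
    // variableCount is the number of distinct tokens; keep it small (2^n rows are tested).
    static boolean equivalent(int variableCount,
                              Predicate<boolean[]> a,
                              Predicate<boolean[]> b) {
        for (int bits = 0; bits < (1 << variableCount); bits++) {
            boolean[] values = new boolean[variableCount];
            for (int i = 0; i < variableCount; i++) {
                values[i] = ((bits >> i) & 1) == 1; // bit i decides variable i
            }
            if (a.test(values) != b.test(values)) {
                return false; // found a row where the two tables differ
            }
        }
        return true;
    }
}
```

With v[0] standing for "src port 1024" and v[1] for "dst port 1024", the expressions v -> v[0] && v[1] and v -> v[1] && v[0] compare as equivalent, and so do v -> !(v[0] || v[1]) and v -> !v[0] && !v[1].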
Alternatively, don't try to be perfect. Allow some false negatives (cases which are equivalent, but which you won't detect).
A simple approach may be to do a variant of the normal expression-evaluation, but evaluating an alternative representation of the expression rather than the result. Impose an ordering on commutative operators. Apply some obvious simplifications during the evaluation. Replace a rich operator set with a minimal set of primitive operators - e.g. using De Morgan's laws to eliminate OR operators.
This alternative representation forms a canonical representation for all members of a set of equivalent expressions. It should be an equivalence class in the sense that you always find the same canonical form for any member of that set. But that's only the set-theory/abstract-algebra sense of an equivalence class - it doesn't mean that all equivalent expressions are in the same equivalence class.
For efficient dictionary lookups, you can use hashes or comparisons based on that canonical representation.
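A minimal sketch of such a canonical representation follows. It implements only one of the steps mentioned above, ordering the operands of commutative operators, and does no simplification or operator rewriting, so it will produce false negatives. The class and method names are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical canonicalizer: evaluates an expression to a canonical string
// in which the operands of "and" and "or" are sorted.
class Canonicalizer {
    interface Expr {
        String canonical();
    }

    // A leaf such as "src port 1024".
    static Expr token(String text) {
        return () -> text;
    }

    static Expr not(Expr operand) {
        return () -> "not(" + operand.canonical() + ")";
    }

    static Expr and(Expr... operands) {
        return () -> commutative("and", operands);
    }

    static Expr or(Expr... operands) {
        return () -> commutative("or", operands);
    }

    // Sorting the operands' canonical strings makes operand order irrelevant.
    private static String commutative(String operator, Expr[] operands) {
        List<String> parts = new ArrayList<>();
        for (Expr operand : operands) {
            parts.add(operand.canonical());
        }
        Collections.sort(parts);
        return operator + "(" + String.join(",", parts) + ")";
    }
}
```

Expressions A and B from the question both canonicalize to and(dst port 1024,src port 1024), so comparing or hashing the strings detects them as equal. On the other hand, not(or(a, b)) and and(not(a), not(b)) get different strings even though they are equivalent, which is the kind of false negative accepted above.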

I'd definitely go with syntax normalization. That is, like aix suggested, transform the booleans using DNF and reorder the abstract syntax tree such that the lexically smallest arguments are on the left-hand side. Normalize all comparisons to < and <=. Then, two equivalent expressions should have equivalent syntax trees.
