What does Lucene's ScoreDoc.score mean? - java

I am performing a boolean query with multiple terms. I only want to process results with a score above a particular threshold. My problem is, I don't understand how this value is calculated. I understand that high numbers mean it's a good match and low numbers mean it's a bad match, but there doesn't seem to be any upper bound?
Is it possible to normalize the scores over the range [0,1]?

Here is a page describing how scores are calculated in Lucene:
http://lucene.apache.org/java/3_0_0/scoring.html
The short answer is that the absolute value of each document's score doesn't really mean anything outside the context of a given search result set. In other words, there isn't really a good way of translating the scores to a human definition of relevance, even if you do normalize the scores.
That being said, you can easily normalize the scores by dividing each hit's score by the maximum score. So if the first hit's score is 2.5, then divide every hit's score by 2.5, and you'll get a number between 0 and 1.
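For example, here is a minimal sketch of that normalization against the Lucene 3.x API linked above (the 100-hit limit and the threshold parameter are illustrative, not part of your question):
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// Process only the hits whose score, normalized by the top hit's score,
// clears a caller-supplied threshold in [0, 1].
static void processAboveThreshold(IndexSearcher searcher, Query query, float threshold)
        throws Exception {
    TopDocs topDocs = searcher.search(query, 100);
    float maxScore = topDocs.getMaxScore();      // best score in this result set
    for (ScoreDoc hit : topDocs.scoreDocs) {
        float normalized = hit.score / maxScore; // now in [0, 1]
        if (normalized >= threshold) {
            // e.g. searcher.doc(hit.doc) to load and process the document
        }
    }
}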

Related

Subset sum problem with continuous subset using recursion

I am trying to think how to solve the subset sum problem with an extra constraint: the subset of the array needs to be continuous (the indices need to be consecutive). I am trying to solve it using recursion in Java.
I know the solution for the non-constrained problem: Each element can be in the subset (and thus I perform a recursive call with sum = sum - arr[index]) or not be in it (and thus I perform a recursive call with sum = sum).
I am thinking about maybe adding another parameter for knowing whether or not the previous index is part of the subset, but I don't know what to do next.
You are on the right track.
Think of it this way:
For every entry you have to decide: do you want to start a new sum at this point, or skip it and reconsider the next entry?
a + b + c + d contains the sum of b + c + d. Do you want to recompute those sums?
Maybe a bottom-up approach would be better.
The O(n) solution that you asked for:
This solution requires keeping track of three numbers: the start and end indices, and the total sum of the span between them.
Starting from element 0 (or from the end of the list if you want), increase the end index until the total sum is greater than or equal to the desired value. If it is equal, you've found a subset sum. If it is greater, move the start index up one and subtract the value at the previous start index. Then, if the resulting total is still greater than the desired value, keep shrinking the window until the sum is less than the desired value; in the other case (where the sum is less), move the end index forward until the sum is greater than or equal to the desired value. If no match is found, repeat.
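In Java, that window-sliding idea looks roughly like this (a sketch assuming non-negative values, which is what keeps the window sums monotonic):
// Returns true if some contiguous run arr[start..end] sums to target.
static boolean hasContiguousSubsetSum(int[] arr, int target) {
    int start = 0;
    int sum = 0;                                 // sum of arr[start .. end]
    for (int end = 0; end < arr.length; end++) {
        sum += arr[end];                         // grow the window to the right
        while (sum > target && start < end) {
            sum -= arr[start++];                 // shrink it from the left
        }
        if (sum == target) {
            return true;                         // arr[start .. end] hits the target
        }
    }
    return false;
}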
So, caveats:
Is this "fairly obvious"? Maybe, maybe not. I was making assumptions about order of magnitude similarity when I said both "fairly obvious" and o(n) in my comments
Is this actually o(n)? It depends a lot on how similar (in terms of order of magnitude (digits in the number)) the numbers in the list are. The closer all the numbers are to each other, the fewer steps you'll need to make on the end index to test if a subset exists. On the other hand, if you have a couple of very big numbers (like in the thousands) surrounded by hundreds of pretty small numbers (1's and 2's and 3's) the solution I've presented will get closers to O(n^2)
This solution only works based on your restriction that the subset values are continuous

How to calculate percentage format prediction confidence of face recognition using opencv?

I am doing two-face comparison work using the OpenCV FaceRecognizer of LBP type. My question is how to calculate the percentage-format prediction confidence. Given the following code (JavaCV):
int n[] = new int[1];
double p[] = new double[1];
personRecognizer.predict(mat, n, p);
double confidence = p[0];
But the confidence is a double value; how should I convert it into a percentage (%) value of probability?
Is there an existing formula?
Sorry if I didn't state my question in a clear way. Ok, here is the scenario:
I want to compare two face images and get the likeness of the two faces. For example, input John's pic and his classmate Tom's pic, and let's say the likeness is 30%; then input John's pic and his brother Jack's pic, and the likeness comes out as 80%.
These two likeness factors show that Jack looks more like his brother John than Tom does... so the likeness factor in percentage format is what I want; the higher the value, the more alike the two input faces are.
Currently I do this by computing the confidence value of the input using the OpenCV function FaceRecognizer.predict, but the confidence value actually stands for the distance between the inputs in their feature vector space. So how can I scale the distance (confidence) into the likeness percentage format?
You are digging too deep with your question. Well, according to the OpenCV documentation:
predict()
Predicts a label and associated confidence (e.g. distance) for a given
input image
I am not sure what you are looking for here, but the question is not easy to answer. Intra-person face variation (variation of the same person) is vast, and inter-person face variation (faces from different persons) can be more compact (e.g. when both faces are frontal while the second intra-person facial image is a profile), so this is a whole research topic in itself.
Probably you should have a ground truth (i.e. some faces with labels already known) and deduce from this set the percentage you want by associating the distances with the labels. Though this is also often inaccurate, as distance would not coincide with your perception of similarity (as mentioned before, inter-person faces can vary a lot).
Edit:
First of all, there is no universal human perception of face similarity. On the other hand, most people would recognize a face that belongs to the same person in various poses and postures. The word most here is important. As you push the limits, human perception will start to diverge, e.g. when asked to recognize a face over the years and the time span becomes quite large (child vs. adolescent vs. old person).
Are you asking to compute the similarity of noses/eyes etc.? If so, I think the best way is to find a set of noses/eyes belonging to the same persons, train over this, and then check your performance on a different set from different persons.
The usual approach, as far as I know, is to train and test using pairs of images comprising positive and negative samples. A positive sample is a pair of images belonging to the same person, while a negative one is an image pair belonging to two different ones.
I am not sure what you are asking exactly so maybe you can check out this link.
Hope it helped.
Edit 2:
Well, since you want to convert the distance you are getting into a similarity expressed as a percentage, you can somehow invert the distance to get the similarity. There are some problems arising here though:
There is an explicit value for an absolute match, dis = 0, or equivalently sim = 100%, but there is no explicit value for a total mismatch: dis = infinity, so sim = 0%. The similarity scale, on the other hand, has explicit boundaries of 0% - 100%.
Since the extreme values include 0 and infinity, there must be a smarter conversion than a simple inversion.
You can easily assign 1.0 (or 100%) similarity to an absolute match, but what you are going to take as a total mismatch is not clear. You can pick an arbitrarily high distance to stand for 0.0 (since there is no big difference between, e.g., a distance of 10000 and one of 11000, I guess) and consider all distance values higher than this as 0.0 as well.
To find what that value should be, I would suggest comparing two quite distinct images and using the distance between them as the 0.0 point.
Let's suppose that this value is disMax = 250.0; and simMax = 100.0;
then a simple approach could be:
double sim = simMax - simMax/disMax*dis;
which gives a 100.0 similarity for 0 distance and 0.0 for 250 distance. Values larger than 250 would give negative similarity values which should be considered 0.0.
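Wrapped up as a small helper with that clamping included (disMax = 250.0 is just the arbitrary calibration value chosen above):
// Linear map from distance to similarity percentage; distances beyond
// disMax are clamped to 0.0 rather than going negative.
static double distanceToSimilarity(double dis, double disMax) {
    double simMax = 100.0;
    double sim = simMax - simMax / disMax * dis;
    return Math.max(0.0, sim);
}
So distanceToSimilarity(0.0, 250.0) gives 100.0 and distanceToSimilarity(250.0, 250.0) gives 0.0, matching the boundaries above.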

looking for a single index to show the similarity in point matching algorithm

I read the Point set registration page and would like to implement it for my simple line matching. However, I only have very basic maths knowledge and cannot really understand the equations on the page.
Assume I am able to extract points from 2 images, search for the nearest pairs by brute-force looping, and get a list of pairs with their corresponding distances.
What is the next step to calculate a single index by utilizing the above data obtained?
The idea I currently have is to simply average all the distances, but I believe there are many better approaches. Or should I capture more data for the calculation?
Your instincts are almost correct.
Generally, the metric is the sum of squared distances, with the goal of finding the least-squares fit (minimizing the sum of all the individual squared distances). Essentially this minimizes the standard deviation (strictly speaking it minimizes the variance, but with the same end effect).
So take all your corresponding pairs and calculate the squared distance between them (a fast calculation, no sqrt involved; faster than calculating actual distances), then add them up: the lower the sum, the better. If your point sets differ in count, you may wish to divide by the count to get a proper variance value.
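As a sketch in Java, assuming points are stored as {x, y} arrays and pairs are matched up by index (an assumption about your data layout):
// Sum of squared distances between corresponding points, divided by the
// pair count to get a variance-like value; lower means a better match.
static double registrationError(double[][] a, double[][] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
        double dx = a[i][0] - b[i][0];
        double dy = a[i][1] - b[i][1];
        sum += dx * dx + dy * dy;   // squared distance, no sqrt needed
    }
    return sum / a.length;
}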
This metric applies to pretty much any registration algorithm.
By the way, if you already have a point correspondence and you know there is no scaling/skewing, you might also be interested in Horn's method, which is a closed-form (non-iterative) algorithm that just spits out the least-squares fit directly. It's very efficient.
(P.S. For a very simple explanation of why the variance is a better indicator than the mean distance, check out this page).

What are some ways to store and recover numbers in this situation?

So I'm going to be running a simulator that plays craps.
My assignment requires me to run the sim 10,000,000 times.
None of that is an issue; I have the sim made, I know how to run it, and I know how to create the required variables.
What I'm unsure of, is how I should go about storing the results of each game?
What I need to find in the end is:
Average # of Rolls Per Game
Max # of Rolls in a game
number of games that needed more than 30 rolls
number of wins
number of losses
probability of a win
longest sequence of wins and longest sequence of losses
All easy enough, I'm just not sure how to store 10,000,000 numbers and then access them easily.
For example the first:
Average number of rolls
should I create an ArrayList that has 10,000,000 items in it, add one item at the end of each game, and then add them all up and divide by 10,000,000?
I realize this should work, I'm just wondering if there is another way, or perhaps a better (more efficient) way.
New part to this question:
Can I return more than one value from a method? Currently the simulation runs 10,000,000 times and returns a win or loss from each time. But I also need it to return the number of rolls from each game... Otherwise I can't figure out the values for avg rolls and highest number of rolls and number of games over 30 rolls.
Any ideas here?
You don't need to maintain array for any of the statistics you want.
For average number of rolls per game, just keep a variable, say cumulativeNumberOfRolls; after every game, just output the number of rolls in that game and add it to this variable. When all simulations are done, just divide this value by total number of simulations (10,000,000).
For max. number of rolls, again keep a single variable, say maxRolls; after every game, output the number of rolls in that game and compare that with this variable. If the number of rolls in this game is greater, then just update maxRolls with the new value. Try the same approach - of having a single variable and updating it after every game - to get the value for games that required more than 30 rolls, number of wins and number of losses. If you face problems, we can discuss them in comments.
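Here is a sketch of that bookkeeping; playOneGame() and GameResult are hypothetical stand-ins for your existing simulator:
// Hypothetical per-game result; returning one small object like this also
// answers the "can I return more than one value from a method?" question.
record GameResult(int rolls, boolean won) {}

static void runSimulations() {
    final int SIMULATIONS = 10_000_000;
    long cumulativeNumberOfRolls = 0;
    int maxRolls = 0, gamesOver30 = 0, wins = 0, losses = 0;
    for (int i = 0; i < SIMULATIONS; i++) {
        GameResult g = playOneGame();            // your existing simulation
        cumulativeNumberOfRolls += g.rolls();
        maxRolls = Math.max(maxRolls, g.rolls());
        if (g.rolls() > 30) gamesOver30++;
        if (g.won()) wins++; else losses++;
    }
    double avgRolls = (double) cumulativeNumberOfRolls / SIMULATIONS;
    double winProbability = (double) wins / SIMULATIONS;
    System.out.printf("avg %.3f, max %d, >30 rolls: %d, P(win) %.4f%n",
            avgRolls, maxRolls, gamesOver30, winProbability);
}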
For longest sequence of wins and losses, you would need to maintain a bunch of variables:
longest win sequence overall
longest loss sequence overall
current sequence count
current sequence type (indicates if current sequence is a win sequence or loss sequence)
Here's the overview of the approach.
After every game, compare the result of the game with the current sequence type. If they are same, for instance result of current game is win and the current sequence type is also a win, then just update the current sequence count and move on to the next game. If they are different, you need to consider two scenarios and do slightly different things for them. I'll explain for one - the result of current game is loss and the current sequence type is win. In this scenario, compare current sequence count with longest win sequence overall and if it (current sequence count) is greater then just update the longest win sequence overall. After this, change the current sequence type to loss and set the current sequence count to 1.
Extend the above approach for the second scenario - the result of the current game is win and the current sequence type is loss. If you have clarifications, feel free to post back in comments.
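The same streak logic as a small self-contained tracker (names are illustrative): call record(won) after every game and finish() once after the last one.
// Tracks the longest run of wins and of losses, one game at a time.
class StreakTracker {
    int longestWin = 0, longestLoss = 0;
    private int current = 0;
    private Boolean currentIsWin = null;    // null until the first game

    void record(boolean won) {
        if (currentIsWin != null && won == currentIsWin) {
            current++;                      // streak continues
        } else {
            finish();                       // close out the previous streak
            currentIsWin = won;
            current = 1;
        }
    }

    void finish() {                         // call once more after the last game
        if (currentIsWin == null) return;
        if (currentIsWin) longestWin = Math.max(longestWin, current);
        else longestLoss = Math.max(longestLoss, current);
    }
}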
You could just calculate the statistics as you go, without storing them. For instance, if you have an "average" field in your class, then after each simulation average = ((number of rolls this game) + (total rolls so far)) / (number of games so far). The same could be done for the other statistics.
Well, you've got a fixed number of runs, so you might as well use an array rather than an ArrayList (faster). It seems to me that you actually only need two arrays in total: one listing the outcome of each game (maybe true/false for win/lose), and one with the number of rolls in that game. You fill these up as you run the simulations; then you get to do a bunch of simple math involving one array or the other to get your stats. That seems like the best way to go about it to me; I don't think you're going to get much more efficient without a lot of undue effort.

8 puzzle: Solvability and shortest solution

I have built an 8 puzzle solver using breadth-first search. I now want to modify the code to use heuristics. I would be grateful if someone could answer the following two questions:
Solvability
How do we decide whether an 8 puzzle is solvable, given a starting state and a goal state?
This is what Wikipedia says:
The invariant is the parity of the permutation of all 16 squares plus
the parity of the taxicab distance (number of rows plus number of
columns) of the empty square from the lower right corner.
Unfortunately, I couldn't understand what that meant. It was a bit complicated to understand. Can someone explain it in a simpler language?
Shortest Solution
Given a heuristic, is it guaranteed that the A* algorithm finds the shortest solution? To be more specific, will the first node in the open list always have a depth (the number of movements made so far) that is the minimum of the depths of all the nodes present in the open list?
Should the heuristic satisfy some condition for the above statement to be true?
Edit : How is it that an admissible heuristic will always provide the optimal solution? And how do we test whether a heuristic is admissible?
I would be using the heuristics listed here
Manhattan Distance
Linear Conflict
Pattern Database
Misplaced Tiles
Nilsson's Sequence Score
N-MaxSwap X-Y
Tiles out of row and column
For clarification from Eyal Schneider:
I'll refer only to the solvability issue. Some background in permutations is needed.
A permutation is a reordering of an ordered set. For example, 2134 is a reordering of the list 1234, where 1 and 2 swap places. A permutation has a parity property; it refers to the parity of the number of inversions. For example, in the following permutation you can see that exactly 3 inversions exist (23,24,34):
1234
1432
That means that the permutation has an odd parity. The following permutation has an even parity (12, 34):
1234
2143
Naturally, the identity permutation (which keeps the items order) has an even parity.
Any state in the 15 puzzle (or 8 puzzle) can be regarded as a permutation of the final state, if we look at it as a concatenation of the rows, starting from the first row. Note that every legal move changes the parity of the permutation (because we swap two elements, and the number of inversions involving items in between them must be even). Therefore, if you know that the empty square has to travel an even number of steps to reach its final state, then the permutation must also be even. Otherwise, you'll end with an odd permutation of the final state, which is necessarily different from it. Same with odd number of steps for the empty square.
According to the Wikipedia link you provided, the criterion above is both necessary and sufficient for a given puzzle to be solvable.
The A* algorithm is guaranteed to find the shortest solution (or one of them, if several are equally short), provided your heuristic never overestimates the real cost (in your case, the real number of moves needed to reach the solution).
But off the top of my head I cannot come up with a good heuristic for your problem; that needs some thinking.
The real art in using A* is to find a heuristic that never overestimates the real cost, yet undershoots it as little as possible, to speed up the search.
First ideas for such a heuristic:
A quite bad but valid heuristic that popped into my mind is the Manhattan distance of the empty field to its final destination.
The sum of the Manhattan distances of each field to its final destination, divided by the maximal number of fields that can change position within one move. (I think this is quite a good heuristic.)
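As a sketch, here is a common form of that second heuristic for the 8 puzzle (board stored as a 1D array of length 9 with 0 for the blank; note that summing only the tiles' distances, without the blank, is already admissible with no division, because each move relocates exactly one tile):
// Sum of Manhattan distances of each tile (not the blank) to its goal cell.
static int manhattanHeuristic(int[] state, int[] goal) {
    int[] goalPos = new int[9];
    for (int i = 0; i < 9; i++) goalPos[goal[i]] = i;
    int sum = 0;
    for (int i = 0; i < 9; i++) {
        if (state[i] == 0) continue;              // skip the blank
        int g = goalPos[state[i]];
        sum += Math.abs(i / 3 - g / 3) + Math.abs(i % 3 - g % 3);
    }
    return sum;                                   // never overestimates moves
}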
For anyone coming along, I will attempt to explain how the OP got the value pairs, as well as how he determines the highlighted ones (i.e. inversions), as it took me several hours to figure it out. First, the pairs.
Take the goal state and imagine it as a 1D array (call it state A, for example):
[1,2,3,8,0,4,7,5]. Each value in that array has its own column in the table (going all the way down), which gives the first value of each pair.
Then move over one value to the right in the array (i + 1) and go all the way down again for the second pair value. For example (state A): the first column's second values run down [2,3,8,0,4,7,5]; the second column's run down [3,8,0,4,7,5], etc.
Okay, now for the inversions. For each pair of values, find their index locations in the start state. If the left index > right index, then it's an inversion (highlighted). The first four pairs of state A are: (1,2), (1,3), (1,8), (1,0).
1 is at index 3, 2 is at index 0: 3 > 0, so inversion.
1 is at index 3, 3 is at index 2: 3 > 2, so inversion.
1 is at index 3, 8 is at index 1: 3 > 1, so inversion.
1 is at index 3, 0 is at index 7: 3 < 7, so no inversion.
Do this for each pair and tally up the total inversions.
If both are even or both are odd (the Manhattan distance of the blank spot and the total inversions), then it's solvable. Hope this helps!
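Put together as code, a sketch of this check (3x3 board as a 1D array with 0 for the blank, pairs drawn from the goal sequence exactly as above):
static boolean isSolvable(int[] start, int[] goal) {
    // Index of each tile value (0 = blank) in the start state.
    int[] posInStart = new int[9];
    for (int i = 0; i < 9; i++) posInStart[start[i]] = i;

    // Count inversions: pairs from the goal sequence whose indices in the
    // start state are out of order (left index > right index).
    int inversions = 0;
    for (int i = 0; i < 9; i++)
        for (int j = i + 1; j < 9; j++)
            if (posInStart[goal[i]] > posInStart[goal[j]]) inversions++;

    // Manhattan (taxicab) distance of the blank between start and goal.
    int s = posInStart[0];
    int g = 0;
    while (goal[g] != 0) g++;
    int blankDist = Math.abs(s / 3 - g / 3) + Math.abs(s % 3 - g % 3);

    // Solvable iff both parities match (both even or both odd).
    return inversions % 2 == blankDist % 2;
}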
