Appengine Full Text Search - Precision of numeric fields in the Search API - java

While testing the Search API locally (Java SDK 1.9.6), I'm getting unexpected results when doing equality and range checks against small numbers.
For example, if I index three documents with the following fields:
numeric: 0.0011
numeric: 0.0022
numeric: 0.0033
I get the following results for the following queries:
numeric: 0.0033 -> []
numeric= 0.0033 -> []
numeric>= 0.0033 -> []
numeric < 0.0033 -> [document1, document2, document3]
numeric < 0.0022 -> [document1, document2]
numeric < 0.0021 -> [document1, document2]
numeric < 0.002 -> [document1]
I assume something in the implementation indexes or runs queries against numbers at a granularity other than exact. Should I expect these results to be reflected in the real App Engine environments? What precision can I rely on?
The main challenge I am trying to solve is the ability to store numbers which fall outside of SearchApiLimits.MINIMUM_NUMBER_VALUE and SearchApiLimits.MAXIMUM_NUMBER_VALUE and still operate on them. At the moment, shifting them by moving the decimal place is the only option I have been able to come up with. Are there any alternatives that allow good control over how much precision is lost in the translation, first to a double (the type in the Java API) and then in whatever is happening under the hood?

For the first part of the question: I have not been able to reproduce your results with the newest version of the API. If you can reproduce them in version 1.9.7, post some code and I'll take another look. The dev server and the production server should behave the same; if they do not, it is a bug.
One way around the limits is to store the values as atoms rather than as numbers. But then you can only test for equality; the <, >, <=, ... operators won't be available.
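For the decimal-shifting workaround mentioned in the question, the precision lost on the way through double can at least be measured before indexing. The following is a minimal sketch, not part of the Search API: the class and method names are hypothetical, and ASSUMED_MAX is a stand-in that should be replaced with SearchApiLimits.MAXIMUM_NUMBER_VALUE from your SDK.

```java
import java.math.BigDecimal;

class ScaledNumberField {
    // Assumed stand-in for SearchApiLimits.MAXIMUM_NUMBER_VALUE; verify against your SDK.
    static final double ASSUMED_MAX = 2147483647.0;

    /** Shifts the decimal point left by `shift` places and converts to the double the API takes. */
    static double toIndexable(BigDecimal value, int shift) {
        double scaled = value.movePointLeft(shift).doubleValue();
        if (Math.abs(scaled) > ASSUMED_MAX) {
            throw new IllegalArgumentException("still out of range after shifting: " + scaled);
        }
        return scaled;
    }

    /** Absolute error introduced by the BigDecimal -> double -> BigDecimal round trip. */
    static BigDecimal roundTripError(BigDecimal value, int shift) {
        double scaled = toIndexable(value, shift);
        // new BigDecimal(double) is the exact binary value, so the difference is the true loss.
        BigDecimal restored = new BigDecimal(scaled).movePointRight(shift);
        return restored.subtract(value).abs();
    }

    public static void main(String[] args) {
        BigDecimal big = new BigDecimal("123456789012.345");
        System.out.println("indexed as: " + toIndexable(big, 3));
        System.out.println("lost:       " + roundTripError(big, 3).toPlainString());
    }
}
```

This only quantifies the loss in the conversion to double. Whatever the index does with the value afterwards is outside what this sketch can see.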

It should be noted that the local FTS is not the same code as the App Engine server. I think it's Lucene or something similar.

Related

java.lang.ArithmeticException: Division is undefined

I have a simple operation going on in my program:
exposureNoDecimals = BigDecimal.valueOf(curEffSpreadPremium)
        .multiply(BigDecimal.valueOf(100))
        .divide(wsRate, 0, java.math.RoundingMode.HALF_UP)
        .longValue();
exposureNoDecimals - long
curEffSpreadPremium - long
wsRate - BigDecimal
However I am getting
"java.lang.ArithmeticException: Division is undefined"
at java.math.BigDecimal.longScaledDivide(BigDecimal.java:3105)
at java.math.BigDecimal.divide(BigDecimal.java:2409)
at java.math.BigDecimal.divide(BigDecimal.java:2396)
at java.math.BigDecimal.divide(BigDecimal.java:2361)
The problem is that the issue is reproducible in production but not on my machine (I can't debug there or see the inputs).
What can be the issue here? Any suggestions/ideas?
Take a look at the source code for BigDecimal.
An ArithmeticException is only thrown with the message "Division undefined" when you attempt to divide zero by zero.
I'm not going to suggest a fix, because the >>correct<< fix will depend on what this calculation is supposed to be doing, and why the divisor / dividend happen to be zero. Putting in some zero checks might be a solution, but it could also be a "band-aid solution" that hides the problem rather than fixing it. It could come back to bite you later on.
"The problem is that the issue is reproducible in production but not on my machine (I can't debug there or see the inputs)."
As noted in various comments, there are different versions of BigDecimal depending on the Java version and (apparently) vendor. One of the differences between (some) versions is that the exception messages differ.
If you really want to track down this reproducibility issue, you are going to have to look at the source code for BigDecimal in production and on your machine. (Unfortunately, a stack trace involving Java SE classes is often difficult to diagnose without precise Java vendor and version information. It is not helpful in this case for that reason.)
According to the source code of BigDecimal, java.lang.ArithmeticException: Division undefined (without the is) is only thrown when you divide zero by zero.
Looks like in your case curEffSpreadPremium and wsRate both are zero.
So you need to guard the line with zero-checks.
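A guard along those lines might look like the sketch below. It is only one possible policy (fail fast and put the inputs in the message, since they cannot be inspected in production); the class name and the choice of IllegalArgumentException are assumptions, and whether a zero wsRate should instead yield some default is a business decision the answers above deliberately leave open.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class ExposureCalc {
    /** Same calculation as in the question, with the divisor checked first. */
    static long exposureNoDecimals(long curEffSpreadPremium, BigDecimal wsRate) {
        if (wsRate == null || wsRate.signum() == 0) {
            // Include the inputs so a production log shows what actually arrived.
            throw new IllegalArgumentException("wsRate must be non-zero: curEffSpreadPremium="
                    + curEffSpreadPremium + ", wsRate=" + wsRate);
        }
        return BigDecimal.valueOf(curEffSpreadPremium)
                .multiply(BigDecimal.valueOf(100))
                .divide(wsRate, 0, RoundingMode.HALF_UP)
                .longValue();
    }

    public static void main(String[] args) {
        System.out.println(exposureNoDecimals(5, new BigDecimal("2")));
        try {
            // Unguarded zero-by-zero division; the message text varies by JDK vendor and version.
            BigDecimal.ZERO.divide(BigDecimal.ZERO, 0, RoundingMode.HALF_UP);
        } catch (ArithmeticException e) {
            System.out.println("unguarded: " + e.getMessage());
        }
    }
}
```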

How to calculate similarity between Chamber of Commerce numbers?

I am working on an engine that does OCR post-processing, and currently I have a set of organizations in the database, including Chamber of Commerce Numbers.
Also from the OCR output I have a list of possible Chamber of Commerce (COC) numbers.
What would be the best way to search for the most similar one? Currently I am using Levenshtein distance, but the result range is simply too big, and on big databases I really doubt its feasibility. It's implemented in Java, and the database is MySQL.
Side note: A Chamber of Commerce number in the Netherlands is defined to be an 8-digit number for every company. An earlier version of this system used another 4 digits (0000, 0001, etc.) to indicate an establishment of an organization; nowadays entirely new COC numbers are given out for those.
Example of COCNumbers:
30209227
02045251
04087614
01155720
20081288
020179310000
09053023
09103292
30039925
13041611
01133910
09063023
34182B01
27124701
List of possible COCNumbers determined by post-processing:
102537177
000450093333
465111338098
NL90223l30416l
NLfl0737D447B01
12juni2013
IBANNL32ABNA0242244777
lncassantNL90223l30416l10000
KvK13041611
BtwNLfl0737D447B01
A few extra notes:
The post-processing picks up words and word groups from the invoice, and those word groups are concatenated into one string. (A word group is, as it says, a group of words, usually separated by spaces.)
The condition that the post-processing uses for it to be a COC number is the following: The length should be 8 or more, half of the content should be numbers and it should be alphanumerical.
The amount of possible COCNumbers determined by post-processing is relatively small.
The database itself can grow very big, up to tens of thousands of records.
How would I proceed to find the best match in general? (In this case (13041611, KvK13041611) is the best (and moreover correct) match)
Doing this matching exclusively in MySQL is probably a bad idea for a simple reason: MySQL (before version 8.0, which added REGEXP_REPLACE) has no native way to use a regular expression to modify a string.
You're going to need to use some sort of scoring algorithm to get this right, in my experience (which comes from ISBNs and other book-identifying data).
This is procedural -- you probably need to do it in Java (or some other procedural programming language).
1. Is the candidate string found in the table exactly? If yes, score 1.0.
2. Is the candidate string "kvk" (case-insensitive) prepended to a number that's found in the table exactly? If so, score 1.0.
3. Is the candidate string the correct length, and does it match after changing lower case L into 1 and upper case O into 0? If so, score 0.9.
4. Is the candidate string the correct length after trimming all alphabetic characters from either the beginning or the end, and does it match? If so, score 0.8.
5. Do both steps 3 and 4, and if you get a match score 0.7.
6. Trim alpha characters from both the beginning and end, and if you get a match score 0.6.
7. Do steps 3 and 6, and if you get a match score 0.55.
The highest scoring match wins.
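The scoring steps above can be sketched in Java roughly as follows. This is an illustration under assumptions, not a tuned implementation: the class and helper names are hypothetical, the in-memory set stands in for the database lookup, and the "correct length" checks are folded into the exact-match test (a string that equals a known number necessarily has the right length).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class CocMatcher {
    /** A matched COC number together with the heuristic score that found it. */
    static final class Match {
        final String number;
        final double score;
        Match(String number, double score) { this.number = number; this.score = score; }
    }

    // Step 3 repairs: lower-case L read for 1, upper-case O read for 0.
    static String fixOcr(String s) { return s.replace('l', '1').replace('O', '0'); }
    static String stripLeadingAlpha(String s) { return s.replaceFirst("^[A-Za-z]+", ""); }
    static String stripTrailingAlpha(String s) { return s.replaceFirst("[A-Za-z]+$", ""); }

    /** Returns the highest-scoring match for one OCR candidate, or null if no step matches. */
    static Match bestMatch(String candidate, Set<String> known) {
        String lead = stripLeadingAlpha(candidate);
        String trail = stripTrailingAlpha(candidate);
        String both = stripTrailingAlpha(lead);
        // Step 2: a case-insensitive "kvk" prefix in front of the number.
        String kvk = candidate.regionMatches(true, 0, "kvk", 0, 3) ? candidate.substring(3) : null;
        // Variants in descending score order, so the first hit is the best one.
        String[] variants = { candidate, kvk, fixOcr(candidate), lead, trail,
                              fixOcr(lead), fixOcr(trail), both, fixOcr(both) };
        double[] scores = { 1.0, 1.0, 0.9, 0.8, 0.8, 0.7, 0.7, 0.6, 0.55 };
        for (int i = 0; i < variants.length; i++) {
            if (variants[i] != null && known.contains(variants[i])) {
                return new Match(variants[i], scores[i]);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Set<String> known = new HashSet<>(Arrays.asList("13041611", "09053023", "30209227"));
        for (String c : new String[] { "KvK13041611", "Btw3O2O9227", "12juni2013" }) {
            Match m = bestMatch(c, known);
            System.out.println(c + " -> " + (m == null ? "no match" : m.number + " @ " + m.score));
        }
    }
}
```

Because the number of candidates per invoice is small, each variant can be an indexed equality lookup against the database instead of a distance computation over every row.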
Take a visual look at the ones that don't match after this set of steps and see if you can discern another pattern of OCR junk or concatenated junk. Perhaps your OCR is seeing "g" where the input is "8", or other possible issues.
You may be able to try using Levenshtein's distance to process these remaining items if you match substrings of equal length. They may also be few enough in number that you can correct your data manually and proceed.
Another possibility: you may be able to use Amazon Mechanical Turk to purchase crowdsourced labor to resolve some difficult cases.

Detecting equivalent expressions

I'm currently working on a Java application where I need to implement a system for building BPF expressions. I also need to implement mechanism for detecting equivalent BPF expressions.
Building the expression is not too hard. I can build a syntax tree using the Interpreter design pattern and implement the toString for getting the BPF syntax.
However, detecting if two expressions are equivalent is much harder. A simple example would be the following:
A: src port 1024 and dst port 1024
B: dst port 1024 and src port 1024
In order to detect that A and B are equivalent I probably need to transform each expression into a "normalized" form before comparing them. This would be easy for above example, however, when working with a combination of nested AND, OR and NOT operations it's getting harder.
Does anyone know how I should best approach this problem?
One way to compare boolean expressions may be to convert both to the disjunctive normal form (DNF), and compare the DNF. Here, the variables would be Berkeley Packet Filter tokens, and the same token (e.g. port 80) appearing anywhere in either of the two expressions would need to be assigned the same variable name.
There is an interesting-looking applet at http://www.izyt.com/BooleanLogic/applet.php - sadly I can't give it a try right now due to Java problems in my browser.
I'm pretty sure detecting equivalent expressions is either an NP-hard or NP-complete problem, even for boolean-only expressions. That means that to do it perfectly, the general approach is basically to build complete tables of all possible combinations of inputs and the results, then compare the tables.
Maybe BPF expressions are limited in some way that changes that? I don't know, so I'm assuming not.
If your problems are small, that may not be a problem. I do exactly that as part of a decision-tree designing algorithm.
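For small problems, the table comparison can be sketched in a few lines of Java. This is a minimal illustration with hypothetical names, and it assumes every BPF token can be treated as an independent boolean variable, which may not hold for real filters where one primitive implies another.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TruthTableEquivalence {
    /** A boolean expression evaluated against an assignment of token -> truth value. */
    interface Expr { boolean eval(Map<String, Boolean> env); }

    static Expr var(String name) { return env -> env.get(name); }
    static Expr not(Expr e) { return env -> !e.eval(env); }
    static Expr and(Expr l, Expr r) { return env -> l.eval(env) && r.eval(env); }
    static Expr or(Expr l, Expr r) { return env -> l.eval(env) || r.eval(env); }

    /** Compares both expressions on all 2^n assignments of the given tokens. */
    static boolean equivalent(Expr a, Expr b, List<String> tokens) {
        int n = tokens.size();
        for (long mask = 0; mask < (1L << n); mask++) {
            Map<String, Boolean> env = new HashMap<>();
            for (int i = 0; i < n; i++) {
                env.put(tokens.get(i), (mask & (1L << i)) != 0);
            }
            if (a.eval(env) != b.eval(env)) {
                return false; // found a row where the two tables differ
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Expr src = var("src port 1024");
        Expr dst = var("dst port 1024");
        List<String> tokens = Arrays.asList("src port 1024", "dst port 1024");
        System.out.println(equivalent(and(src, dst), and(dst, src), tokens));
    }
}
```

The cost doubles with every additional distinct token, so this is only practical while the number of tokens per expression stays small.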
Alternatively, don't try to be perfect. Allow some false negatives (cases which are equivalent, but which you won't detect).
A simple approach may be to do a variant of the normal expression evaluation, but evaluating an alternative representation of the expression rather than the result. Impose an ordering on commutative operators. Apply some obvious simplifications during the evaluation. Replace a rich operator set with a minimal set of primitive operators, e.g. using De Morgan's laws to eliminate OR operators.
This alternative representation forms a canonical representation for all members of a set of equivalent expressions. It should be an equivalence class in the sense that you always find the same canonical form for any member of that set. But that's only the set-theory/abstract-algebra sense of an equivalence class - it doesn't mean that all equivalent expressions are in the same equivalence class.
For efficient dictionary lookups, you can use hashes or comparisons based on that canonical representation.
I'd definitely go with syntax normalization. That is, like aix suggested, transform the booleans using DNF and reorder the abstract syntax tree such that the lexically smallest arguments are on the left-hand side. Normalize all comparisons to < and <=. Then, two equivalent expressions should have equivalent syntax trees.
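A cheap form of this normalization is sketched below: nested and/or are flattened, operands are sorted and de-duplicated, and double negations are dropped. It deliberately stops short of a full DNF conversion, so it accepts false negatives (for example, it will not see De Morgan equivalences). The node representation and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class CanonicalForm {
    /** Syntax-tree node: a leaf token, a negation, or an n-ary "and"/"or". */
    static final class Node {
        final String op;       // "leaf", "not", "and" or "or"
        final String token;    // only set for leaves
        final List<Node> args; // operands of not/and/or
        Node(String op, String token, List<Node> args) {
            this.op = op; this.token = token; this.args = args;
        }
    }

    static Node leaf(String t) { return new Node("leaf", t, Collections.<Node>emptyList()); }
    static Node not(Node n) { return new Node("not", null, Arrays.asList(n)); }
    static Node and(Node... ns) { return new Node("and", null, Arrays.asList(ns)); }
    static Node or(Node... ns) { return new Node("or", null, Arrays.asList(ns)); }

    /** Canonical text form; equal strings imply equivalent expressions (not vice versa). */
    static String canon(Node n) {
        switch (n.op) {
            case "leaf":
                return n.token;
            case "not": {
                Node c = n.args.get(0);
                // not(not(x)) collapses to x
                return c.op.equals("not") ? canon(c.args.get(0)) : "not(" + canon(c) + ")";
            }
            default: {
                // TreeSet sorts the operands lexically and removes duplicates (x and x == x).
                Set<String> parts = new TreeSet<>();
                flatten(n, n.op, parts);
                return parts.size() == 1
                        ? parts.iterator().next()
                        : n.op + "(" + String.join(",", parts) + ")";
            }
        }
    }

    /** Collects operands of directly nested nodes that share the same operator. */
    static void flatten(Node n, String op, Set<String> out) {
        if (n.op.equals(op)) {
            for (Node a : n.args) flatten(a, op, out);
        } else {
            out.add(canon(n));
        }
    }
}
```

The resulting string can serve directly as a hash key for dictionary lookups.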

Best way to test CRC logic?

How can I verify two CRC implementations will generate the same checksums?
I'm looking for an exhaustive implementation evaluating methodology specific to CRC.
You can separate the problem into edge cases and random samples.
Edge cases. There are two variables to the CRC input: the number of bytes and the value of each byte. So create arrays of length 0, 1, and MAX_BYTES, with values ranging from 0 to MAX_BYTE_VALUE. The edge-case suite is something you'll most likely want to keep within a JUnit suite.
Random samples. Using the ranges above, run CRC on randomly generated arrays of bytes in a loop. The longer you let the loop run, the more you exhaust the inputs. If you are low on computing power, consider deploying the test to EC2.
Create several unit tests with the same input that will compare the output of both implementations against each other.
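As a rough sketch of such a comparison harness: below, the JDK's java.util.zip.CRC32 stands in for one implementation and a bit-by-bit CRC-32 for the other. Both are placeholders for whatever two implementations you actually need to compare, and the class name, method names, and parameters are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.zip.CRC32;

class CrcCompare {
    /** First implementation: the JDK's CRC-32. */
    static long crcA(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /** Second implementation: bit-by-bit CRC-32 with the reflected polynomial 0xEDB88320. */
    static long crcB(byte[] data) {
        int crc = 0xFFFFFFFF;
        for (byte b : data) {
            crc ^= (b & 0xFF);
            for (int k = 0; k < 8; k++) {
                crc = (crc & 1) != 0 ? (crc >>> 1) ^ 0xEDB88320 : crc >>> 1;
            }
        }
        return (~crc) & 0xFFFFFFFFL;
    }

    /** Returns the first input on which the two disagree, or null if all inputs agree. */
    static byte[] firstMismatch(int maxBytes, int randomRounds, long seed) {
        // Edge cases: lengths 0, 1 and maxBytes, filled with the extreme byte values.
        for (int len : new int[] { 0, 1, maxBytes }) {
            for (int fill : new int[] { 0x00, 0xFF }) {
                byte[] data = new byte[len];
                Arrays.fill(data, (byte) fill);
                if (crcA(data) != crcB(data)) return data;
            }
        }
        // Random samples; a fixed seed keeps any failure reproducible.
        Random rnd = new Random(seed);
        for (int i = 0; i < randomRounds; i++) {
            byte[] data = new byte[rnd.nextInt(maxBytes + 1)];
            rnd.nextBytes(data);
            if (crcA(data) != crcB(data)) return data;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstMismatch(4096, 1000, 42L) == null ? "all agree" : "mismatch found");
    }
}
```

Returning the offending input rather than a boolean makes a failure easy to turn into a permanent regression test.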
One nice property of CRCs is that for a given set of parameters (polynomial, reflection, initial state, etc.) you will get a constant value when you recompute the CRC over the original dataset + the original CRC. These constants are documented for common CRCs but you can just blindly generate them using two different random data sets and check that they are the same:
implementation 1: crc(rand_data_1 + crc(rand_data_1)) -> constant_1
implementation 2: crc(rand_data_2 + crc(rand_data_2)) -> constant_2
assert constant_1 == constant_2
You can use the same method within an implementation to get a warm fuzzy feeling about its correctness. If your implementation works with arbitrary polynomials, you can have the unittest exhaustively check every possible polynomial using this method without needing to know what the constants are.
This technique is powerful but it would also be wise to add an independent test that verifies the result based on known input for the pathological case where your CRC implementations both produce bad results that happen to get by the constant equivalence check.
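The constant check can be sketched as follows, again with java.util.zip.CRC32 standing in for the implementation under test. One assumption to note: for this reflected CRC-32 the checksum must be appended least-significant byte first; other parameter sets may need the opposite byte order. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.zip.CRC32;

class CrcResidueCheck {
    static long crc(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data, 0, data.length);
        return c.getValue();
    }

    /** CRC over (data + crc(data)), with the 4 CRC bytes appended little-endian. */
    static long residue(byte[] data) {
        long c = crc(data);
        byte[] extended = Arrays.copyOf(data, data.length + 4);
        for (int i = 0; i < 4; i++) {
            extended[data.length + i] = (byte) (c >>> (8 * i));
        }
        return crc(extended);
    }

    static byte[] randomData(long seed, int length) {
        byte[] data = new byte[length];
        new Random(seed).nextBytes(data);
        return data;
    }

    public static void main(String[] args) {
        long constant1 = residue(randomData(1L, 100));
        long constant2 = residue(randomData(2L, 5000));
        System.out.println(constant1 == constant2 ? "constants match" : "constants differ");
    }
}
```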
First, if it is a standard CRC implementation, you should be able to find known values somewhere on the net.
Second, you could generate some number of payloads, run each CRC implementation on them, and check that the CRC values match.
By writing a unit test for each implementation that takes the same input and verifies the result against the expected output.

Estimating a probability given other probabilities from a prior

I have a bunch of data coming in (calls to an automated callcenter) about whether or not a person buys a particular product, 1 for buy, 0 for not buy.
I want to use this data to create an estimated probability that a person will buy a particular product, but the problem is that I may need to do it with relatively little historical data about how many people bought/didn't buy that product.
A friend recommended that with Bayesian probability you can "help" your probability estimate by coming up with a "prior probability distribution", essentially this is information about what you expect to see, prior to taking into account the actual data.
So what I'd like to do is create a method that has something like this signature (Java):
double estimateProbability(double[] priorProbabilities, int buyCount, int noBuyCount);
priorProbabilities is an array of probabilities I've seen for previous products, which this method would use to create a prior distribution for this probability. buyCount and noBuyCount are the actual data specific to this product, from which I want to estimate the probability of the user buying, given the data and the prior. This is returned from the method as a double.
I don't need a mathematically perfect solution, just something that will do better than a uniform or flat prior (ie. probability = buyCount / (buyCount+noBuyCount)). Since I'm far more familiar with source code than mathematical notation, I'd appreciate it if people could use code in their explanation.
Here's the Bayesian computation and one example/test:
def estimateProbability(priorProbs, buyCount, noBuyCount):
    # first, estimate the prob that the actual buy/nobuy counts would be observed
    # given each of the priors (times a constant that's the same in each case and
    # not worth the effort of computing;-)
    condProbs = [p**buyCount * (1.0-p)**noBuyCount for p in priorProbs]
    # the normalization factor for the above-mentioned neglected constant
    # can most easily be computed just once
    normalize = 1.0 / sum(condProbs)
    # so here's the probability for each of the prior (starting from a uniform
    # metaprior)
    priorMeta = [normalize * cp for cp in condProbs]
    # so the result is the sum of prior probs weighed by prior metaprobs
    return sum(pm * pp for pm, pp in zip(priorMeta, priorProbs))

def example(numProspects=4):
    # the a priori prob of buying was either 0.3 or 0.7, how does it change
    # depending on how 4 prospects bought or didn't?
    for bought in range(0, numProspects+1):
        result = estimateProbability([0.3, 0.7], bought, numProspects-bought)
        print('b=%d, p=%.2f' % (bought, result))

example()
output is:
b=0, p=0.31
b=1, p=0.36
b=2, p=0.50
b=3, p=0.64
b=4, p=0.69
which agrees with my by-hand computation for this simple case. Note that the probability of buying, by definition, will always be between the lowest and the highest among the set of prior probabilities; if that's not what you want you might want to introduce a little fudge by introducing two "pseudo-products", one that nobody will ever buy (p=0.0), one that anybody will always buy (p=1.0) -- this gives more weight to actual observations, scarce as they may be, and less to statistics about past products. If we do that here, we get:
b=0, p=0.06
b=1, p=0.36
b=2, p=0.50
b=3, p=0.64
b=4, p=0.94
Intermediate levels of fudging (to account for the unlikely but not impossible chance that this new product may be worse than any one ever previously sold, or better than any of them) can easily be envisioned (give lower weight to the artificial 0.0 and 1.0 probabilities, by adding a vector priorWeights to estimateProbability's arguments).
This kind of thing is a substantial part of what I do all day, now that I work developing applications in Business Intelligence, but I just can't get enough of it...!-)
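Since the question asked for a Java method, here is a direct port of the computation above using the signature from the question. The class name and the zero-likelihood guard are additions for illustration and not part of the original answer.

```java
class BuyProbability {
    /**
     * Weighs each prior probability by how well it explains the observed counts
     * (uniform metaprior), then returns the weighted average of the priors.
     */
    static double estimateProbability(double[] priorProbabilities, int buyCount, int noBuyCount) {
        double weightSum = 0.0;
        double weightedPriors = 0.0;
        for (double p : priorProbabilities) {
            // Likelihood of the observed counts given p, up to a constant shared by all priors.
            double likelihood = Math.pow(p, buyCount) * Math.pow(1.0 - p, noBuyCount);
            weightSum += likelihood;
            weightedPriors += likelihood * p;
        }
        if (weightSum == 0.0) {
            // Added guard: no prior can explain the data (or the likelihoods underflowed).
            throw new IllegalArgumentException("no prior is compatible with the observed counts");
        }
        return weightedPriors / weightSum;
    }

    public static void main(String[] args) {
        int numProspects = 4;
        for (int bought = 0; bought <= numProspects; bought++) {
            double result = estimateProbability(new double[] { 0.3, 0.7 }, bought, numProspects - bought);
            System.out.printf("b=%d, p=%.2f%n", bought, result);
        }
    }
}
```

Adding 0.0 and 1.0 to priorProbabilities reproduces the "pseudo-products" variant without any code change.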
A really simple way of doing this without any difficult math is to increase buyCount and noBuyCount artificially by adding virtual customers that either bought or didn't buy the product. You can tune how much you believe in each particular prior probability in terms of how many virtual customers you think it is worth.
In pseudocode:
def estimateProbability(priorProbs, buyCount, noBuyCount, faithInPrior=None):
    # priorProbs, buyCount and noBuyCount are parallel lists, one entry per product
    if faithInPrior is None:
        faithInPrior = [10 for _ in buyCount]
    adjustedBuyCount = [b + p*f for b, p, f in
                        zip(buyCount, priorProbs, faithInPrior)]
    adjustedNoBuyCount = [n + (1-p)*f for n, p, f in
                          zip(noBuyCount, priorProbs, faithInPrior)]
    return [b/(b+n) for b, n in zip(adjustedBuyCount, adjustedNoBuyCount)]
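For a single product, the same virtual-customer idea fits in a few lines of Java. This sketch makes two assumptions that are not in the answer above: the prior for a new product is taken as the mean of the probabilities seen for previous products, and faithInPrior (the number of virtual customers) must be positive.

```java
class PseudoCountEstimate {
    /** Blends observed counts with `faithInPrior` virtual customers who buy at `priorProbability`. */
    static double estimateProbability(double priorProbability, double faithInPrior,
                                      int buyCount, int noBuyCount) {
        if (faithInPrior <= 0.0) {
            throw new IllegalArgumentException("faithInPrior must be positive");
        }
        double adjustedBuy = buyCount + priorProbability * faithInPrior;
        double adjustedNoBuy = noBuyCount + (1.0 - priorProbability) * faithInPrior;
        return adjustedBuy / (adjustedBuy + adjustedNoBuy);
    }

    /** Assumption: use the mean of previously seen probabilities as the prior. */
    static double estimateProbability(double[] priorProbabilities, double faithInPrior,
                                      int buyCount, int noBuyCount) {
        double sum = 0.0;
        for (double p : priorProbabilities) sum += p;
        return estimateProbability(sum / priorProbabilities.length, faithInPrior, buyCount, noBuyCount);
    }

    public static void main(String[] args) {
        // With no data the estimate equals the prior; with more data it moves toward the observed rate.
        System.out.println(estimateProbability(new double[] { 0.2, 0.4 }, 10, 0, 0));
        System.out.println(estimateProbability(new double[] { 0.2, 0.4 }, 10, 10, 0));
    }
}
```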
Sounds like what you're trying to do is Association Rule Learning. I don't have time right now to provide you with any code, but I will point you in the direction of WEKA which is a fantastic open source data mining toolkit for Java. You should find plenty of interesting things there that will help you solve your problem.
As I see it, the best you could do is use the uniform distribution, unless you have some clue regarding the distribution. Or are you talking about making a relationship between this product and products previously bought by the same person, in the Amazon fashion of "people who buy this product also buy..."?
