Lucene geo-distance sorting performance - java

I have a task to sort search results not only by the relevance of the string fields of indexed documents, but also by the distance from a given geographical point to a point associated with each indexed document. It should be mentioned that only the top ten or so matched documents need to be included in the result set. Also, sorting by precise distance is not important; only rough "distance levels" from the given point matter.
Technically I have implemented the task successfully. The geographical part was implemented as a CustomScoreQuery-derived class:
private static class DistanceQuery extends CustomScoreQuery {

    public DistanceQuery(final Query _subQuery, final SpatialStrategy _strategy, final Point _bp) {
        super(_subQuery, new FunctionQuery(_strategy.makeDistanceValueSource(_bp)));
    }

    @Override
    protected CustomScoreProvider getCustomScoreProvider(AtomicReaderContext _context) throws IOException {
        return new CustomScoreProvider(_context) {
            @Override
            public float customScore(int _doc, float _subQueryScore, float _valSrcScore) throws IOException {
                // The spatial strategy's makeDistanceValueSource creates a ValueSource whose score varies
                // from almost 0 for nearby points to 2.7-2.8 for distant points, so I arbitrarily chose 2
                // as the normalization factor and increase subQueryScore by at most that amount.
                logger.debug("customScore for document {}: [subQuery={}, valScore={}]",
                        this.context.reader().document(_doc).getField(IndexedField.id.name()).numericValue().toString(),
                        _subQueryScore, _valSrcScore);
                return (_valSrcScore > 2 || _valSrcScore < 0) ? _subQueryScore : _subQueryScore + (2 - _valSrcScore);
            }
        };
    }
}
and I wrap a given "textual" query with this geospatial "enhancement".
Generally speaking, the chosen strategy gives me pretty reasonable results. As one may see, the final score only slightly exceeds the initial query score (by 2 at most). And with typical result scores of a dozen or more, this geospatial addition works just as a way to "post-sort" otherwise similar documents.
With a few hundred or thousand test documents in the index, the performance of the wrapped query was also good enough: about 10-50 milliseconds per search, which is only 2-5 times slower than the unwrapped query.
But when I switched from a test to a real-world DB and the number of documents in the index rose from a thousand to approximately 10 million, with more to come (an estimated hundred million in the near future), the situation changed dramatically. Actually, I can't get any search results anymore because the JVM runs out of memory and CPU. Currently the search cannot finish even with -Xmx6g or more.
Certainly I could buy better hardware for the task, but the problem is more likely to be solved by choosing a more appropriate sorting strategy.
One solution is to completely avoid the geo-sorting provided by Lucene and manually sort the top N items of the result set when their relevance scores are similar. I'm going to choose this way if nothing else helps.
But my question is whether more adequate solutions exist. Maybe I can somehow split the result items into equivalence classes (with the same or similar enough scores) and apply geospatial sorting only to the first few classes? Please suggest.
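For reference, here is a minimal sketch of the manual fallback described above, assuming the plain textual query is run first, only the top N hits are fetched, and each document stores its coordinates in stored numeric fields. The field names ("lat", "lon"), the distanceKm() helper, queryLat/queryLon and the 0.5-point score bucketing are all hypothetical:
TopDocs top = searcher.search(textualQuery, 10);
ScoreDoc[] hits = top.scoreDocs;

// Group hits into "equivalence classes" by score rounded to 0.5 steps (descending),
// then order each class by distance to the query point (ascending).
Arrays.sort(hits, Comparator
        .comparingInt((ScoreDoc sd) -> -Math.round(sd.score * 2))
        .thenComparingDouble(sd -> {
            try {
                Document d = searcher.doc(sd.doc);
                return distanceKm(queryLat, queryLon,
                        d.getField("lat").numericValue().doubleValue(),
                        d.getField("lon").numericValue().doubleValue());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }));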

Look at how elasticsearch implements this in the function_score query. You can probably reuse a few things from what they do. If I remember correctly, they can optionally use faster but less accurate distance calculation algorithms as well. You probably want to do something similar.

I'm using another CustomScoreProvider for DistanceQuery:
public class DistanceQueryScoreProvider extends CustomScoreProvider {

    private final double x;
    private final double y;

    public DistanceQueryScoreProvider(LeafReaderContext context, double x, double y) {
        super(context);
        this.x = x;
        this.y = y;
    }

    @Override
    public float customScore(int doc, float subQueryScore, float valSrcScore) throws IOException {
        Document d = context.reader().document(doc);
        double geomX = d.getField(Consts.GEOM_X_FIELD).numericValue().doubleValue();
        double geomY = d.getField(Consts.GEOM_Y_FIELD).numericValue().doubleValue();
        // Equirectangular ("plane") approximation: roughly 110.25 km per degree.
        // Math.cos expects radians, so the latitude is converted before scaling the longitude delta.
        double deglen = 110.25;
        double deltaX = geomY - y;
        double deltaY = (geomX - x) * Math.cos(Math.toRadians(y));
        // Negate the distance so that nearer documents get a higher score.
        return (float) -(deglen * Math.sqrt(deltaX * deltaX + deltaY * deltaY));
    }
}
The Elasticsearch implementation of the plane distance function (from "Sorting by Distance") was slower than the customScore function above. The customScore function was implemented based on the article "Geographic distance can be simple and fast".
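For completeness, a minimal sketch of how such a provider might be wired into a CustomScoreQuery; the wrapper class below and its constructor arguments are hypothetical, assuming a Lucene version where CustomScoreProvider takes a LeafReaderContext:
public class DistanceQuery extends CustomScoreQuery {

    private final double x;
    private final double y;

    public DistanceQuery(Query subQuery, double x, double y) {
        super(subQuery);
        this.x = x;
        this.y = y;
    }

    @Override
    protected CustomScoreProvider getCustomScoreProvider(LeafReaderContext context) throws IOException {
        // One provider per index segment; the reference point is fixed for the whole query.
        return new DistanceQueryScoreProvider(context, x, y);
    }
}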
user3159253, maybe you have your answer for this thread?

Related

Designing most efficient algorithm for the problem with dynamic programming approach

Assume that you are in an exam and you have 120 minutes, but you can't solve all of the questions because time is limited. Each question has a number of points it awards and a time needed to complete it (see the arrays in the code below).
So we need to design the most efficient algorithm, using a dynamic programming approach, to calculate the highest score you can obtain in the available time.
Here is my code:
static int maxPoints(int points[], int time[], int n) {
    if (n <= 0) {
        return 0;
    } else {
        return Math.max(points[n - 1] + maxPoints(points, time, (n - 2)),
                time[n - 1] + maxPoints(points, time, (n - 1)));
    }
}

public static void main(String[] args) {
    int n = 10;
    int points[] = {4, 9, 5, 12, 14, 6, 12, 20, 7, 10};
    int time[] = {1, 15, 2, 3, 20, 120};
    System.out.println();
}
But I couldn't find the correct algorithm; can you help me with this problem?
In your question, each question has a weight (the amount of time it needs) and a value (the points it awards). There is a constraint on the total time (or weight) and you need to maximise the points (or value).
This becomes analogous to the 0-1 Knapsack Problem, which can easily be solved using dynamic programming.
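A minimal sketch of that reduction, assuming the 120-minute budget plays the role of the knapsack capacity, each question's time is its weight and its points are its value (the example data below is just illustrative):
// 0-1 knapsack over the time budget: dp[t] = best score achievable using at most t minutes.
static int maxPoints(int[] points, int[] time, int totalTime) {
    int[] dp = new int[totalTime + 1];
    for (int q = 0; q < points.length; q++) {
        // Iterate the budget downwards so each question is used at most once.
        for (int t = totalTime; t >= time[q]; t--) {
            dp[t] = Math.max(dp[t], dp[t - time[q]] + points[q]);
        }
    }
    return dp[totalTime];
}

// Example usage (points[i] and time[i] must describe the same question):
// int best = maxPoints(new int[]{4, 9, 5, 12}, new int[]{10, 40, 20, 50}, 120);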

Neural networks and large data sets

I have a basic framework for a neural network to recognize numeric digits, but I'm having some problems with training it. My back-propagation works for small data sets, but when I have more than 50 data points, the return value starts converging to 0. And when I have data sets in the thousands, I get NaNs for costs and returns.
Basic structure: 3 layers: 784 : 15 : 1
784 is the number of pixels per sample, 15 neurons in the hidden layer, and one output neuron that returns a value from 0 to 1 (multiplying it by 10 gives the digit).
public class NetworkManager {
    int inputSize;
    int hiddenSize;
    int outputSize;
    public Matrix W1;
    public Matrix W2;

    public NetworkManager(int input, int hidden, int output) {
        inputSize = input;
        hiddenSize = hidden;
        outputSize = output;
        W1 = new Matrix(inputSize, hiddenSize);
        W2 = new Matrix(hiddenSize, output);
    }

    Matrix z2, z3;
    Matrix a2;

    // Forward pass: input -> hidden (sigmoid) -> output (sigmoid).
    public Matrix forward(Matrix X) {
        z2 = X.dot(W1);
        a2 = sigmoid(z2);
        z3 = a2.dot(W2);
        Matrix yHat = sigmoid(z3);
        return yHat;
    }

    // Sum of squared errors over the whole batch.
    public double costFunction(Matrix X, Matrix y) {
        Matrix yHat = forward(X);
        Matrix cost = yHat.sub(y);
        cost = cost.mult(cost);
        double returnValue = 0;
        int i = 0;
        while (i < cost.m.length) {
            returnValue += cost.m[i][0];
            i++;
        }
        return returnValue;
    }

    Matrix yHat;

    // Back-propagation: gradients of the cost with respect to W1 and W2.
    public Matrix[] costFunctionPrime(Matrix X, Matrix y) {
        yHat = forward(X);
        Matrix delta3 = (yHat.sub(y)).mult(sigmoidPrime(z3));
        Matrix dJdW2 = a2.t().dot(delta3);
        Matrix delta2 = (delta3.dot(W2.t())).mult(sigmoidPrime(z2));
        Matrix dJdW1 = X.t().dot(delta2);
        return new Matrix[]{dJdW1, dJdW2};
    }
}
That's the code for the network framework. I pass double arrays of length 784 into the forward method.
int t = 0;
while (t < 10000) {
    dJdW = Nn.costFunctionPrime(X, y);
    Nn.W1 = Nn.W1.sub(dJdW[0].scalar(3));
    Nn.W2 = Nn.W2.sub(dJdW[1].scalar(3));
    t++;
}
I call this to adjust the weights. With small sets, the cost converges to 0 pretty well, but larger sets don't (the cost associated with 100 characters always converges to 13). And if the set is too large, the first adjustment works (and the costs go down), but after the second all I get is NaN.
Why does this implementation fail with larger data sets (specifically training), and how can I fix it? I tried a similar structure with 10 outputs instead of 1, where each would return a value near 0 or 1 acting like a boolean, but the same thing was happening.
I'm also doing this in Java, by the way, and I'm wondering if that has something to do with the problem. I was wondering if it was a problem with running out of space, but I haven't been getting any heap space messages. Is there a problem with how I'm back-propagating or is something else happening?
EDIT: I think I know what's happening. I think my back-propagation function is getting caught in local minima. Sometimes the training succeeds and sometimes it fails for large data sets. Because I'm starting with random weights, I get random initial costs. What I've noticed is that when the cost initially exceeds a certain amount (it depends on the number of data sets involved), the costs converge to a clean number (sometimes 27, other times 17.4) and the outputs converge to 0 (which makes sense).
I was warned about relative minima in the cost function when I began, and I'm beginning to realize why. So now the question becomes: how do I go about my gradient descent so that I'll actually find the global minimum? I'm working in Java, by the way.
This seems like a problem with weight initialization.
As far as I can see, you never initialize the weights to any specific value. Therefore the network diverges. You should at least use random initialization.
The fact that your backprop works on a small dataset is not really a good reason to assume there is no problem. If you're suspicious about it, you can try your BP on the XOR problem.
Do your units have biases?
I once discussed this with a guy who was doing exactly the same thing: handwritten digit recognition with 15 units in the hidden layer. I also saw a network that did this task well. Its topology was:
Input: 784
First hidden: 500
Second hidden: 500
Third hidden: 2000
Output: 10
You have a set of images, and you nonlinearly transform the 784 pixels of each image into 15 numbers from the <0, 1> interval, and you do this for all images in your set. You hope that you can correctly separate the digits based on these 15 numbers. From my point of view, 15 hidden units is too few for such a task, assuming you have a dataset with thousands of examples. Please try, for example, 500 hidden units.
The learning rate also influences backprop and can cause problems with convergence.
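For illustration, a minimal sketch of random weight initialization in the style of the question's own code (the Matrix type and its public m field are taken from the question; the 1/sqrt(fanIn) scaling is just a common heuristic):
// Fill a weight matrix with small random values centred on zero.
// Scaling by 1/sqrt(fanIn) helps keep the sigmoid inputs in their sensitive range.
static void randomInit(Matrix w, int fanIn) {
    java.util.Random rng = new java.util.Random();
    for (int i = 0; i < w.m.length; i++) {
        for (int j = 0; j < w.m[i].length; j++) {
            w.m[i][j] = (rng.nextDouble() * 2 - 1) / Math.sqrt(fanIn);
        }
    }
}

// e.g. in the NetworkManager constructor:
// randomInit(W1, inputSize);
// randomInit(W2, hiddenSize);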

Calculating math in GPS format

I'm making an Android application to calculate math in GPS format.
Example:
Given
N 48°44.(30x4) E 019°08.[(13x31)+16]
the App calculates it, and result is:
N 48°44.120 E 019°08.419
Is it possible to do this?
I searched for plugins and solutions, but they all only handle plain math strings such as "14 + 6".
I am assuming you are working in Java as it is tagged in your question.
You could create a new public class for your GPS coordinates, and store the actual value of the coordinate in the lowest division, which according to your example appears to be minutes or seconds. This allows you to store the value as an int or a double with whatever precision you wish. You could then create a set of private and public methods to complete your mathematical operations and others to display your values in the appropriate fashion:
public class GPSCoordinate {
    private double verticalcoord;
    private double horizontalcoord;

    //Constructors
    GPSCoordinate() {
        setVertical(0);
        setHorizontal(0);
    }

    GPSCoordinate(double vert, double horiz) {
        setVertical(vert);
        setHorizontal(horiz);
    }

    //Display methods
    public String verticalString() {
        return ((int) verticalcoord / 60) + "°" + (verticalcoord - ((int) verticalcoord / 60) * 60);
    }

    public String horizontalString() {
        return ((int) horizontalcoord / 60) + "°" + (horizontalcoord - ((int) horizontalcoord / 60) * 60);
    }

    //Setting Methods
    public void setVertical(double x) {
        this.verticalcoord = x;
    }

    public void setHorizontal(double x) {
        this.horizontalcoord = x;
    }

    //Math Methods
    public void addMinutesVertical(double x) {
        this.verticalcoord += x;
    }
}
This will allow you to initiate an instance in your main code with a given GPS coordinate, and then you can call your math functions on it.
GPSCoordinate coord1 = new GPSCoordinate(567.23, 245);
coord1.addMinutesVertical(50);
coord1.otherMathFunction(50 * 30);
You will, of course, need to refine the above to make it fit your project. If this isn't helpful, please provide more specifics and I'll see if I can think of anything else that might fit what you're looking for.
Can't you just take a substring of the whole thing and search for the expression in the brackets? Then it's just a matter of a simple calculation, if I understood the question correctly. The GPS data doesn't look like an ordinary expression, so you can't apply a math evaluator to it directly.
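If the expressions stay as simple as in the example, a rough sketch of that substring-and-evaluate idea might look like this; it only handles '+' and 'x' (with 'x' binding tighter), ignores nesting, and zero-pads the result to three digits, so anything more general would need a real expression parser:
// Pull the bracketed part out of a coordinate token such as "44.(30x4)" or
// "08.[(13x31)+16]", evaluate the simple arithmetic inside, and splice the
// result back in. Only '+' and 'x' are handled; parentheses are stripped.
static String resolveToken(String token) {
    java.util.regex.Matcher m = java.util.regex.Pattern
            .compile("[\\[(](.+)[\\])]$").matcher(token);
    if (!m.find()) {
        return token;                        // nothing to evaluate
    }
    String expr = m.group(1).replaceAll("[\\[\\]()\\s]", "");
    int sum = 0;
    for (String addend : expr.split("\\+")) {
        int product = 1;
        for (String factor : addend.split("x")) {
            product *= Integer.parseInt(factor);
        }
        sum += product;
    }
    return token.substring(0, m.start()) + String.format("%03d", sum);
}

// resolveToken("44.(30x4)")       -> "44.120"
// resolveToken("08.[(13x31)+16]") -> "08.419"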

Echo/delay algorithm just causes noise/static?

There have been other questions and answers on this site suggesting that, to create an echo or delay effect, you need only add one audio sample to a stored audio sample from the past. As such, I have the following Java class:
public class DelayAMod extends AudioMod {
    private int delay = 500;
    private float decay = 0.1f;
    private boolean feedback = false;
    private int delaySamples;
    private short[] samples;
    private int rrPointer;

    @Override
    public void init() {
        this.setDelay(this.delay);
        this.samples = new short[44100];
        this.rrPointer = 0;
    }

    public void setDecay(final float decay) {
        this.decay = Math.max(0.0f, Math.min(decay, 0.99f));
    }

    public void setDelay(final int msDelay) {
        this.delay = msDelay;
        // Samples per millisecond at 44.1 kHz times the delay in milliseconds.
        this.delaySamples = 44100 * this.delay / 1000;
        System.out.println("Delay samples:" + this.delaySamples);
    }

    @Override
    public short process(short sample) {
        System.out.println("Got:" + sample);
        if (this.feedback) {
            //Delay should feed back into the loop:
            sample = (this.samples[this.rrPointer] = this.apply(sample));
        } else {
            //No feedback - store base data, then add echo:
            this.samples[this.rrPointer] = sample;
            sample = this.apply(sample);
        }
        ++this.rrPointer;
        if (this.rrPointer >= this.samples.length) {
            this.rrPointer = 0;
        }
        System.out.println("Returning:" + sample);
        return sample;
    }

    private short apply(short sample) {
        int loc = this.rrPointer - this.delaySamples;
        if (loc < 0) {
            loc += this.samples.length;
        }
        System.out.println("Found:" + this.samples[loc] + " at " + loc);
        System.out.println("Adding:" + (this.samples[loc] * this.decay));
        // Clamp to the 16-bit range after mixing in the decayed echo sample.
        return (short) Math.max(Short.MIN_VALUE, Math.min(sample + (int) (this.samples[loc] * this.decay), (int) Short.MAX_VALUE));
    }
}
It accepts one 16-bit sample at a time from an input stream, finds an earlier sample, and adds them together accordingly. However, the output is just horrible noisy static, especially when the decay is raised to a level that would actually cause any appreciable result. Reducing the decay to 0.01 barely allows the original audio to come through, but there's certainly no echo at that point.
Basic troubleshooting facts:
The audio stream sounds fine if this processing is skipped.
The audio stream sounds fine if decay is 0 (nothing to add).
The stored samples are indeed stored and accessed in the proper order and the proper locations.
The stored samples are being decayed and added to the input samples properly.
All numbers from the call of process() to return sample are precisely what I would expect from this algorithm, and remain so even outside this class.
The problem seems to arise from simply adding signed shorts together, and the resulting waveform is an absolute catastrophe. I've seen this specific method implemented in a variety of places - C#, C++, even on microcontrollers - so why is it failing so hard here?
EDIT: It seems I've been going about this entirely wrong. I don't know if it's FFmpeg/avconv, or some other factor, but I am not working with a normal PCM signal here. Through graphing of the waveform, as well as a failed attempt at a tone generator and the resulting analysis, I have determined that this is some version of differential pulse-code modulation; pitch is determined by change from one sample to the next, and halving the intended "volume" multiplier on a pure sine wave actually lowers the pitch and leaves volume the same. (Messing with the volume multiplier on a non-sine sequence creates the same static as this echo algorithm.) As this and other DSP algorithms are intended to work on linear pulse-code modulation, I'm going to need some way to get the proper audio stream first.
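In case the stream really is a simple delta encoding (each value being the difference from the previous sample), a running sum would recover something closer to linear PCM before the echo is applied. This is only a guess at the format, not a confirmed decoder:
// Hypothetical decoder for a plain delta-encoded stream: accumulate the
// differences to reconstruct a linear PCM sample, clamped to the 16-bit range.
// This assumes simple delta coding, which may not match the actual format.
private int accumulator = 0;

public short deltaToLinear(short delta) {
    accumulator = Math.max(Short.MIN_VALUE, Math.min(accumulator + delta, Short.MAX_VALUE));
    return (short) accumulator;
}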
It should definitely work unless you have significant clipping.
For example, this is a text file with two columns. The leftmost column is the 16-bit input. The second column is the sum of the first and a version delayed by 4001 samples. The sample rate is 22 kHz.
Each sample in the second column is the result of summing x[k] and x[k-4001] (e.g. y[5000] = x[5000] + x[999] = -13840 + 9181 = -4659). You can clearly hear the echo signal when playing the samples in the second column.
Try this signal with your code and see if you get identical results.
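As a reference point, the answer's description corresponds to something like the following on a plain linear PCM buffer (the 4001-sample delay is taken from the answer's example; the sum is clamped to avoid 16-bit wrap-around):
// y[k] = x[k] + x[k - 4001] on a linear PCM buffer, clamped to the short range.
static short[] addDelayedCopy(short[] x, int delaySamples) {
    short[] y = new short[x.length];
    for (int k = 0; k < x.length; k++) {
        int sum = x[k];
        if (k >= delaySamples) {
            sum += x[k - delaySamples];
        }
        y[k] = (short) Math.max(Short.MIN_VALUE, Math.min(sum, Short.MAX_VALUE));
    }
    return y;
}

// e.g. short[] withEcho = addDelayedCopy(samples, 4001);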

Using of getSpectrum() in Libgdx library

I know the first thing you are thinking is "look for it in the documentation"; however, the documentation is not clear about it.
I use the library to get the FFT and I followed this short guide:
http://www.digiphd.com/android-java-reconstruction-fast-fourier-transform-real-signal-libgdx-fft/
The problem arises when it uses:
fft.forward(array);
fft_cpx=fft.getSpectrum();
tmpi = fft.getImaginaryPart();
tmpr = fft.getRealPart();
Both "fft_cpx", "tmpi", "tmpr" are float vectors. While "tmpi" and "tmpr" are used for calculate the magnitude, "fft_cpx" is not used anymore.
I thought that getSpectrum() was the union of getReal and getImmaginary but the values are all different.
Maybe, the results from getSpectrum are complex values, but what is their representation?
I tried without fft_cpx=fft.getSpectrum(); and it seems to work correctly, but I'd like to know if it is actually necessary and what is the difference between getSpectrum(), getReal() and getImmaginary().
The documentation is at:
http://libgdx-android.com/docs/api/com/badlogic/gdx/audio/analysis/FFT.html
public float[] getSpectrum()
Returns: the spectrum of the last FourierTransform.forward() call.
public float[] getRealPart()
Returns: the real part of the last FourierTransform.forward() call.
public float[] getImaginaryPart()
Returns: the imaginary part of the last FourierTransform.forward() call.
Thanks!
getSpectrum() returns the absolute values (magnitudes) of the complex numbers.
It is calculated like this:
for (int i = 0; i < spectrum.length; i++) {
    spectrum[i] = (float) Math.sqrt(real[i] * real[i] + imag[i] * imag[i]);
}
