Reducing the amount of RAM a BFS algorithm takes - Java

I have written a 2x2x2 Rubik's cube solver that uses a breadth-first search algorithm to solve a cube position entered by the user. The program does solve the cube. However, when I enter a hard position, one whose solution lies deep in the search, I run out of heap space. My computer only has 4 GB of RAM, and I give 3 GB of it to the program when I run it. What could I do to reduce the amount of RAM the search takes, possibly by changing a few aspects of the BFS?
static private void solve(Cube c) {
    Set<Cube> cubesFound = new HashSet<Cube>();
    cubesFound.add(c);
    Stack<Cube> s = new Stack<Cube>();
    s.push(c);
    Set<Stack<Cube>> initialPaths = new HashSet<Stack<Cube>>();
    initialPaths.add(s);
    solve(initialPaths, cubesFound);
}
static private void solve(Set<Stack<Cube>> livePaths, Set<Cube> cubesFoundSoFar) {
    System.out.println("livePaths size:" + livePaths.size());
    int numDupes = 0;
    Set<Stack<Cube>> newLivePaths = new HashSet<Stack<Cube>>();
    for (Stack<Cube> currentPath : livePaths) {
        Set<Cube> nextStates = currentPath.peek().getNextStates();
        for (Cube next : nextStates) {
            if (currentPath.size() > 1 && next.isSolved()) {
                currentPath.push(next);
                System.out.println("Path length:" + currentPath.size());
                System.out.println("Path:" + currentPath);
                System.exit(0);
            } else if (!cubesFoundSoFar.contains(next)) {
                Stack<Cube> newCurrentPath = new Stack<Cube>();
                newCurrentPath.addAll(currentPath);
                newCurrentPath.push(next);
                newLivePaths.add(newCurrentPath);
                cubesFoundSoFar.add(next);
            } else {
                numDupes += 1;
            }
        }
    }
    String storeStates = "positions.txt";
    try {
        PrintWriter outputStream = new PrintWriter(storeStates);
        outputStream.println(cubesFoundSoFar);
        outputStream.println(storeStates);
        outputStream.close();
    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    System.out.println("Duplicates found " + numDupes + ".");
    solve(newLivePaths, cubesFoundSoFar);
}
That is the BFS, but I fear it's not all the information needed to understand what is going on, so here is a link to the file with all the code: https://github.com/HaginCodes/2x2-Cube-Solver/blob/master/src/com/haginonyango/pocketsolver/Cube.java

By definition, "best first search" keeps the entire search frontier of possible paths through the space.
That can be exponentially large. With the 3x3x3 Rubik's cube, I think there are 12 possible moves at each point, so a 10-move solution arguably requires exploring up to 12^10 combinations, which is well over a billion (10^9). With this many states, you'll want to minimize the size of each state to minimize total storage. (Also, you actually print all the states with outputStream.println(cubesFoundSoFar); isn't that a vast amount of output?)
With 2x2x2, you only have 8 possible moves at each point. I don't know the solution lengths here for random problems. If it is still length 10, you get 8^10, which is still pretty big.
Now, many move sequences lead to the same cube configuration. To recognize this you need to check that a generated move does not re-generate a position already visited. You seem to be doing that (good!) and tracking the number of hits; I'd expect that hit count to be pretty high, as many paths should lead to the same configuration.
What you don't show is how you score each move sequence to guide the search. Which node does it expand next? This is where "best" comes into play. If you don't have any guidance (e.g., all enumerated move sequences have the same value), you truly will wander over a huge space, because all move sequences are equally good. What guides your solver to a solution? You need something like a priority queue over nodes, with the priority determined by the score, and that I don't see in the code you presented here.
I don't know what a great heuristic is for scoring the quality of a move sequence, but you can start by scoring a move sequence with the number of moves it takes to arrive there. The next "best" sequence to try is then the one with the shortest path. That will probably help immensely.
(A simple enhancement that might work is to count the number of colors on a face; 3 colors hints [is this true?] that it might take 3 - 1 = 2 moves minimum to remove the wrong colors. Then the score might be #moves + #facecolors - 1, to estimate the number of moves to a solution; clearly you want the shortest sequence of moves. This might be a so-called "admissible" heuristic score.)
You'll also have to adjust your scheme for detecting duplicate move sequences. When you find an already encountered state, that state will now presumably have attached to it the score (move count) it took to reach it. When you get a hit, you've found another way to get to that same state, but the score for the new path may be smaller than what is recorded in the state. In this case, you need to revise the score of the discovered duplicate state with the smaller new score. This way a path/state with score 20 may in fact be discovered to have a score of 10, which means it is suddenly improved.
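To make the "expand the best node next" idea concrete, here is a minimal sketch of a priority-queue-driven search. It assumes the question's Cube.getNextStates() and Cube.isSolved(), plus a heuristic() placeholder you would supply; it illustrates the structure and is not the poster's code.

import java.util.*;

// Sketch only: assumes a Cube class with getNextStates(), isSolved(), equals()/hashCode().
class BestFirstSolver {

    private static class ScoredPath implements Comparable<ScoredPath> {
        final List<Cube> path;   // states visited so far; the last element is the current state
        final int score;         // path length so far + heuristic estimate of moves remaining

        ScoredPath(List<Cube> path, int score) { this.path = path; this.score = score; }
        public int compareTo(ScoredPath other) { return Integer.compare(score, other.score); }
    }

    static List<Cube> solve(Cube start) {
        PriorityQueue<ScoredPath> frontier = new PriorityQueue<>();
        Map<Cube, Integer> bestCost = new HashMap<>();   // cheapest known cost to reach each state
        frontier.add(new ScoredPath(Collections.singletonList(start), heuristic(start)));
        bestCost.put(start, 0);

        while (!frontier.isEmpty()) {
            ScoredPath current = frontier.poll();
            Cube state = current.path.get(current.path.size() - 1);
            if (state.isSolved()) return current.path;

            int costSoFar = current.path.size() - 1;
            for (Cube next : state.getNextStates()) {
                int newCost = costSoFar + 1;
                Integer known = bestCost.get(next);
                if (known == null || newCost < known) {  // also handles "found a cheaper duplicate"
                    bestCost.put(next, newCost);
                    List<Cube> newPath = new ArrayList<>(current.path);
                    newPath.add(next);
                    frontier.add(new ScoredPath(newPath, newCost + heuristic(next)));
                }
            }
        }
        return null; // no solution found
    }

    // Placeholder heuristic: 0 gives plain shortest-path search; something like
    // (#colors on the most-scrambled face - 1) would guide the search more strongly.
    static int heuristic(Cube c) {
        return 0;
    }
}

With heuristic() returning 0 this degenerates to a plain shortest-path search; plugging in an estimate like the face-color idea above turns it into an A*-style search that reaches a solution while keeping fewer long, unpromising paths in memory.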

Related

Distinguishing voiced/unvoiced speech using zero-crossing rate

The zero-crossing rate is the rate of sign changes along a signal, i.e., the rate at which the signal changes from positive to negative or back.
The zero-crossing rate Zn can be used to:
1. Distinguish voiced from unvoiced speech.
2. Separate unvoiced speech from static background noise.
It is a simple (yet effective) way to distinguish between voiced and unvoiced speech regions:
• Voiced region: lower zero-crossing rate
• Unvoiced region: higher zero-crossing rate
Here is the code I am using:
public double evaluate() {
    int numZC = 0;
    int size = signals.length;
    for (int i = 0; i < size - 1; i++) {
        if ((signals[i] >= 0 && signals[i + 1] < 0) || (signals[i] < 0 && signals[i + 1] >= 0)) {
            numZC++;
        }
    }
    return numZC / lengthInSecond;
}
My questions are:
1. My goal in using zero crossing is to eliminate the unvoiced part of the signal, and this code gives back the zero-crossing rate. So how will I do that?
2. How will I know what counts as a "low" zero-crossing rate and what counts as a "high" zero-crossing rate?
The fundamental problem is that while you've found a way to calculate the zero crossing rate of a block of samples, you can't use that to distinguish sounds within that block because it only gives you one number that describes your entire block.
One potential solution is to divide your big block into small blocks, and then work on those. If you do that, you will soon find that your small blocks, which you made arbitrarily, don't fit into neat categories of voiced and unvoiced, and simply removing one block or setting a block's volume to zero will leave you with "choppy" sounds or even harsh clicking sounds, and won't divide the parts of speech as cleanly as you like.
This may be a worthwhile point to start with, because it's closer to your existing code, but it won't work out in the long run, unless you are just looking to do something rough (in which case, this might be good enough!).
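If you do want to start with the block-based approach, here is a rough sketch of computing one zero-crossing rate per small block. The frameSize and the per-frame rate formula are tuning choices of mine, not anything taken from the question's code.

// Sketch: per-frame zero-crossing rates for a signal sampled at sampleRate Hz.
// frameSize (in samples) is a tuning parameter, e.g. 10-30 ms worth of samples.
static double[] frameZeroCrossingRates(double[] signal, int frameSize, double sampleRate) {
    int numFrames = signal.length / frameSize;
    double[] rates = new double[numFrames];
    for (int f = 0; f < numFrames; f++) {
        int start = f * frameSize;
        int crossings = 0;
        for (int i = start; i < start + frameSize - 1; i++) {
            boolean crossed = (signal[i] >= 0 && signal[i + 1] < 0)
                           || (signal[i] < 0 && signal[i + 1] >= 0);
            if (crossed) crossings++;
        }
        // crossings per second within this frame
        rates[f] = crossings * sampleRate / frameSize;
    }
    return rates;
}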
To resolve this, you may want to consider calculating an "instantaneous zero crossing rate"1 that updates the Zr for each sample.
My goal in using zero crossing is to eliminate the unvoiced part of the signal, and this code gives back the zero-crossing rate. So how will I do that?
It's not clear what you want. What do you mean by "eliminate"? Do you want silence, or do you want to skip those sections? For silence, simply replace the unwanted sections with zeros. To skip, simply remove those samples. Of course, you will still end up with clicks and pops, but I assume you know how to get rid of those. If not, maybe you can read up on linear interpolation. Keep in mind that you will almost certainly have to apply some heuristics like "don't remove any sections that are smaller than n samples".
How will I know what counts as a "low" zero-crossing rate and what counts as a "high" zero-crossing rate?
I would guess a good threshold will be roughly around 400 Hz, but speech is not my specialty. Moreover, it will vary a bit by speaker and possibly by language and other factors. I suggest you make some samples and see for yourself.
1 This name is a bit misleading, and you could say "there's no such thing as an instantaneous zero-crossing rate". I'm not here to argue that; rather, I want to use the phrase because it expresses what I mean, and I hope you understand it. Suffice it to say you should do your best to update Zr as often as you can, e.g. something like this:
int lastSign = 0;
int lastCrossing = 0;   // samples seen since the last zero crossing
float zr = 0;           // current estimate of the zero-crossing rate (crossings per sample)

float nextZeroCrossing(float newSample) {
    int thisSign = newSample > 0 ? 1 : -1;
    if (thisSign != lastSign) {
        lastSign = thisSign;
        // Zero crossing has happened: update our estimate of Zr from the gap length.
        zr = 1.0f / (lastCrossing + 1);
        lastCrossing = 0;
    } else {
        // Zero crossing has not happened: keep returning the existing Zr.
        ++lastCrossing;
    }
    return zr;
}
You may want to "smooth" the output of nextZeroCrossing(), as it will tend to jump around a lot. A simple exponential or moving average filter will work great.
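For instance, a minimal exponential-smoothing wrapper might look like this (ALPHA is a tuning parameter of my own choosing, not anything prescribed above):

float smoothedZr = 0;
final float ALPHA = 0.05f;   // smaller alpha = heavier smoothing

float smoothedZeroCrossingRate(float newSample) {
    float zr = nextZeroCrossing(newSample);
    smoothedZr = ALPHA * zr + (1 - ALPHA) * smoothedZr;
    return smoothedZr;
}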

Calculating the optimal move in checkers

I created a checkers game and I would like the computer to calculate the optimal move.
Here is what I've done so far:
public BoardS calcNextMove(BoardS bs)
{
    ArrayList<BoardS> options = calcPossibleOptions(bs);
    int max = -1;
    int temp;
    int bestMove = 0;
    for (int k = 0; k < options.size(); k++)
    {
        temp = calculateNextMove2(options.get(k));
        if (max < temp)
        {
            max = temp;
            bestMove = k;
        }
    }
    return options.get(bestMove);
}
public int calculateNextMove2(BoardS bs)
{
    int res = soWhoWon(bs);
    if (res == 2) // pc won (which is good, so we return 1)
        return 1;
    if (res == 1)
        return 0;
    ArrayList<BoardS> options = calcPossibleOptions(bs);
    int sum = 0;
    for (int k = 0; k < options.size(); k++)
    {
        sum += calculateNextMove2(options.get(k));
    }
    return sum;
}
I keep getting
Exception in thread "AWT-EventQueue-0" java.lang.StackOverflowError
calcPossibleOptions works well; it's a function that returns a list of all the possible options.
BoardS is a class that represents a game board.
I guess I have to make it more efficient. How?
The recursion in calculateNextMove2() will run too deep, and this is why you get a StackOverflowError. If you expect the search to run the game to the end (unless it is already relatively close to an actual win), it may take a very long time indeed. With chess, for example (which I have more experience with), an engine will go maybe 20 moves deep before you go with the best move it has found so far. If you ran it from the start of a chess game, it could run for hundreds of years on current technology and still not really be able to say what the winning first move is. Maybe just try 10 or 20 moves deep? That would still beat most humans, and would still probably qualify as the "best move". As with chess, you will have the difficulty of assessing a position as good or bad for this to work (often calculated as a combination of material advantage and positional advantage); that is the tricky part. Note: the Chinook project has done what you're looking to achieve; it has "defeated" the game of checkers (no such thing exists yet for chess; EDIT: Magnus Carlsen has "defeated" the game for all intents and purposes :D).
see: Alpha-beta pruning
(it may help somewhat)
Also here is an (oldish) paper I saw on the subject:
Graham Owen 1997 Search Algorithms and Game Playing
(it may be useful)
Also note that returning the first move that results in a "win" is naive, because it may be a totally improbable line of moves, one where the opponent gains an advantage unless they play into that very particular losing line, e.g. playing for Fool's Mate or Scholar's Mate in chess (quick but very punishable).
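To make that concrete, here is a minimal sketch of a depth-limited minimax with alpha-beta pruning. It reuses the question's calcPossibleOptions() and soWhoWon(), but evaluate() is a hypothetical scoring function (e.g. piece-count difference) that you would have to write; this is an illustration, not the original poster's code.

// Sketch of depth-limited minimax with alpha-beta pruning.
// Assumes the question's calcPossibleOptions(bs) and soWhoWon(bs), plus a
// hypothetical evaluate(bs) that scores a position for the computer.
public int minimax(BoardS bs, int depth, int alpha, int beta, boolean computerToMove) {
    int winner = soWhoWon(bs);
    if (winner == 2) return  1000;          // computer won
    if (winner == 1) return -1000;          // human won
    if (depth == 0) return evaluate(bs);    // stop searching, estimate the position

    ArrayList<BoardS> options = calcPossibleOptions(bs);
    if (computerToMove) {
        int best = Integer.MIN_VALUE;
        for (BoardS next : options) {
            best = Math.max(best, minimax(next, depth - 1, alpha, beta, false));
            alpha = Math.max(alpha, best);
            if (beta <= alpha) break;       // prune: the opponent will avoid this branch
        }
        return best;
    } else {
        int best = Integer.MAX_VALUE;
        for (BoardS next : options) {
            best = Math.min(best, minimax(next, depth - 1, alpha, beta, true));
            beta = Math.min(beta, best);
            if (beta <= alpha) break;
        }
        return best;
    }
}

calcNextMove would then call something like minimax(options.get(k), 10, Integer.MIN_VALUE, Integer.MAX_VALUE, false) for each option and keep the option with the highest score, instead of recursing to the end of the game.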
You're going to need an alternative to minimax, most likely even with alpha-beta pruning, as the board is too large and produces too many possible moves and counter-moves. This leads to the stack overflow.
I was fairly surprised myself to see that building the full decision tree for Tic-Tac-Toe didn't overflow, but I'm afraid I don't know enough about AI planning, or other algorithms for these problems, to help you beyond that.

Neural Network Back-Propagation Algorithm Gets Stuck on XOR Training Pattern

Overview
So I'm trying to get a grasp on the mechanics of neural networks. I still don't totally grasp the math behind it, but I think I understand how to implement it. I currently have a neural net that can learn the AND, OR, and NOR training patterns. However, I can't seem to get it to implement the XOR pattern. My feed-forward neural network consists of 2 inputs, 3 hidden neurons, and 1 output. The weights and biases are randomly set between -0.5 and 0.5, and outputs are generated with the sigmoidal activation function.
Algorithm
So far, I'm guessing I made a mistake in my training algorithm, which is described below:
1. For each neuron in the output layer, provide an error value that is desiredOutput - actualOutput (then go to step 3).
2. For each neuron in a hidden or input layer (working backwards), provide an error value that is the sum of all forward connection weights * the errorGradient of the neuron at the other end of the connection (then go to step 3).
3. For each neuron, using the error value provided, generate an error gradient that equals output * (1 - output) * error (then go to step 4).
4. For each neuron, adjust the bias to equal current bias + LEARNING_RATE * errorGradient. Then adjust each backward connection's weight to equal current weight + LEARNING_RATE * output of the neuron at the other end of the connection * this neuron's errorGradient.
I'm training my neural net online, so this runs after each training sample.
Code
This is the main code that runs the neural network:
private void simulate(double maximumError) {
    int errorRepeatCount = 0;
    double prevError = 0;
    double error; // summed squares of errors
    int trialCount = 0;
    do {
        error = 0;
        // loop through each training set
        for (int index = 0; index < Parameters.INPUT_TRAINING_SET.length; index++) {
            double[] currentInput = Parameters.INPUT_TRAINING_SET[index];
            double[] expectedOutput = Parameters.OUTPUT_TRAINING_SET[index];
            double[] output = getOutput(currentInput);
            train(expectedOutput);
            // Subtract the expected and actual outputs, get the average of the differences, and square it.
            error += Math.pow(getAverage(subtractArray(output, expectedOutput)), 2);
        }
    } while (error > maximumError);
}
Now the train() function:
public void train(double[] expected) {
    layers.outputLayer().calculateErrors(expected);
    for (int i = Parameters.NUM_HIDDEN_LAYERS; i >= 0; i--) {
        layers.allLayers[i].calculateErrors();
    }
}
Output layer calculateErrors() function:
public void calculateErrors(double[] expectedOutput) {
    for (int i = 0; i < numNeurons; i++) {
        Neuron neuron = neurons[i];
        double error = expectedOutput[i] - neuron.getOutput();
        neuron.train(error);
    }
}
Normal (Hidden & Input) layer calculateErrors() function:
public void calculateErrors() {
    for (int i = 0; i < neurons.length; i++) {
        Neuron neuron = neurons[i];
        double error = 0;
        for (Connection connection : neuron.forwardConnections) {
            error += connection.output.errorGradient * connection.weight;
        }
        neuron.train(error);
    }
}
Full Neuron class:
package neuralNet.layers.neurons;

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import neuralNet.Parameters;
import neuralNet.layers.NeuronLayer;

public class Neuron {
    private double output, bias;
    public List<Connection> forwardConnections = new ArrayList<Connection>();  // Forward = layer closer to input -> layer closer to output
    public List<Connection> backwardConnections = new ArrayList<Connection>(); // Backward = layer closer to output -> layer closer to input
    public double errorGradient;

    public Neuron() {
        Random random = new Random();
        bias = random.nextDouble() - 0.5;
    }

    public void addConnections(NeuronLayer prevLayer) {
        // This is true for input layers. They create their connections differently. (See InputLayer class.)
        if (prevLayer == null) return;
        for (Neuron neuron : prevLayer.neurons) {
            Connection.createConnection(neuron, this);
        }
    }

    public void calcOutput() {
        output = bias;
        for (Connection connection : backwardConnections) {
            connection.input.calcOutput();
            output += connection.input.getOutput() * connection.weight;
        }
        output = sigmoid(output);
    }

    private double sigmoid(double output) {
        return 1 / (1 + Math.exp(-1 * output));
    }

    public double getOutput() {
        return output;
    }

    public void train(double error) {
        this.errorGradient = output * (1 - output) * error;
        bias += Parameters.LEARNING_RATE * errorGradient;
        for (Connection connection : backwardConnections) {
            // For clarification: connection.input refers to a neuron that outputs to this neuron.
            connection.weight += Parameters.LEARNING_RATE * connection.input.getOutput() * errorGradient;
        }
    }
}
Results
When I'm training for AND, OR, or NOR the network can usually converge within about 1000 epochs, however when I train with XOR, the outputs become fixed and it never converges. So, what am I doing wrong? Any ideas?
Edit
Following the advice of others, I started over and implemented my neural network without classes...and it works. I'm still not sure where my problem lies in the above code, but it's in there somewhere.
This is surprising, because you are using a network that is (barely) big enough to learn XOR. Your algorithm looks right, so I don't really know what is going on. It might help to know how you generate your training data: are you just repeating the samples (1,0,1), (1,1,0), (0,1,1), (0,0,0) or something like that over and over? Perhaps the problem is that stochastic gradient descent is causing you to jump around near the minimum without stabilizing. You could try some things to fix this: perhaps randomly sample from your training examples instead of repeating them (if that is what you are doing). Or, alternatively, you could modify your learning algorithm.
Currently you have something equivalent to:
weight(epoch) = weight(epoch - 1) + deltaWeight(epoch)
deltaWeight(epoch) = mu * errorGradient(epoch)
where mu is the learning rate
One option is to very slowly decrease the value of mu.
An alternative would be to change your definition of deltaWeight to include a "momentum"
deltaWeight(epoch) = mu * errorGradient(epoch) + alpha * deltaWeight(epoch -1)
where alpha is the momentum parameter (between 0 and 1).
Visually, you can think of gradient descent as trying to find the minimum point of a curved surface by placing an object on that surface, and then, step by step, moving that object small amounts in whichever direction slopes down based on where it is currently located. The problem is that you don't really do gradient descent: instead you do stochastic gradient descent, where the direction to move in is chosen by sampling from a set of training vectors and moving in whatever direction the sample makes look like down. On average over the entire training data, stochastic gradient descent should work, but it isn't guaranteed to, because you can get into a situation where you jump back and forth, never making progress. Slowly decreasing the learning rate means you take smaller and smaller steps each time, so you cannot get stuck in an infinite cycle.
On the other hand, momentum makes the algorithm into something akin to rolling a rubber ball. As the ball rolls, it tends to go in the downward direction, but it also tends to keep going in the direction it was going before, and if it is ever on a stretch where the downward slope points the same way for a while, it will speed up. The ball will therefore jump over some local minima, and it will be more resilient against stepping back and forth over the target, because doing so means working against the force of momentum.
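As a concrete illustration, the momentum rule could be folded into the question's train() method roughly like this. This is a sketch only: Parameters.MOMENTUM, previousBiasDelta, and connection.previousWeightDelta are assumed additions, not fields of the original code.

// Assumed extra state (not in the original code):
//   a previousBiasDelta field on Neuron, a previousWeightDelta field on Connection,
//   and Parameters.MOMENTUM in [0, 1).
public void train(double error) {
    this.errorGradient = output * (1 - output) * error;

    double biasDelta = Parameters.LEARNING_RATE * errorGradient
                     + Parameters.MOMENTUM * previousBiasDelta;
    bias += biasDelta;
    previousBiasDelta = biasDelta;

    for (Connection connection : backwardConnections) {
        double weightDelta = Parameters.LEARNING_RATE * connection.input.getOutput() * errorGradient
                           + Parameters.MOMENTUM * connection.previousWeightDelta;
        connection.weight += weightDelta;
        connection.previousWeightDelta = weightDelta;
    }
}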
Having some code and thinking about this some more, it is pretty clear that your problem is in training the early layers. The functions you have successfully learned are all linearly separable, so it would make sense that only a single layer is being properly learned. I agree with LiKao about implementation strategies in general, although your approach should work. My suggestion for how to debug this is to figure out what the progression of the weights on the connections between the input layer and the output layer looks like.
You should post the rest of your implementation of Neuron.
I faced the same problem a short time ago. Finally I found the solution: how to write code solving XOR with the MLP algorithm.
The XOR problem seems to be an easy problem to learn, but it isn't for the MLP, because it's not linearly separable. So even if your MLP is OK (I mean there is no bug in your code), you have to find the right parameters to be able to learn the XOR problem.
Two hidden and one output neuron is fine. The two main things you have to set:
• Although you have only 4 training samples, you have to run the training for a couple of thousand epochs.
• If you use sigmoid hidden layers but a linear output, the network will converge faster.
Here is the detailed description and sample code: http://freeconnection.blogspot.hu/2012/09/solving-xor-with-mlp.html
Small hint: if the output of your NN seems to drift toward 0.5, then everything's OK!
The algorithm using just the learning rate and bias is just too simple to quickly learn XOR. You can either increase the number of epochs or change the algorithm.
My recommendation is to use momentum:
• 1000 epochs
• learningRate = 0.3
• momentum = 0.8
• weights drawn from [0, 1]
• bias drawn from [-0.5, 0.5]
And some crucial pseudocode (assuming back and forward propagation works):
for every edge:
    previous_edge_weight_change = -1 * learningRate * edge_source_neuron_value * edge_target_neuron_delta
                                  + previous_edge_weight_change * momentum
    edge_weight += previous_edge_weight_change

for every neuron:
    previous_neuron_bias_change = -1 * learningRate * neuron_delta + previous_neuron_bias_change * momentum
    bias += previous_neuron_bias_change
I would suggest you generate a grid (say from [-5,-5] to [5,5] with a step like 0.5), train your MLP on XOR, and apply it to the grid. Plotted in color, you would see some kind of frontier.
If you do that at each iteration, you'll see the evolution of the frontier and can monitor the learning.
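A minimal sketch of that grid idea (predict(double[]) stands in for however your network's forward pass is invoked, and printing characters substitutes for a color plot):

// Evaluate the network on a coarse grid and print a rough picture of the decision frontier.
for (double y = 5.0; y >= -5.0; y -= 0.5) {
    StringBuilder row = new StringBuilder();
    for (double x = -5.0; x <= 5.0; x += 0.5) {
        double out = predict(new double[] { x, y });  // hypothetical forward pass
        row.append(out > 0.5 ? '#' : '.');
    }
    System.out.println(row);
}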
It's been a while since I last implemented a neural network myself, but I think your mistake is in these lines:
bias += Parameters.LEARNING_RATE * errorGradient;
and
connection.weight += Parameters.LEARNING_RATE * connection.input.getOutput() * errorGradient;
The first of these lines should not be there at all. Bias is best modeled as the input of a neuron which is fixed at 1. This will serve to make your code a lot simpler and cleaner, because you will not have to treat the bias in any special way.
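One way to picture that (a sketch only; biasNeuron and hiddenAndOutputNeurons are illustrative names, not part of the question's code) is to wire every neuron to a shared neuron whose output is always 1, so that its connection weight plays the role of the bias and is updated by the same rule as every other weight:

// A "neuron" whose output is the constant 1; the weight of each connection leaving it acts as a bias.
Neuron biasNeuron = new Neuron() {
    @Override
    public void calcOutput() { /* nothing to compute, the output is constant */ }
    @Override
    public double getOutput() { return 1.0; }
};
for (Neuron neuron : hiddenAndOutputNeurons) {
    Connection.createConnection(biasNeuron, neuron);   // same Connection class as in the question
}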
The other point is that I think the sign is wrong in both of these expressions. Think about it like this:
Your gradient points in the direction of steepest ascent, so if you go in that direction, your error will get larger.
What you are doing here is adding something to the weights when the error is already positive, i.e. you are making it more positive. If it is negative, you are subtracting something, i.e. you are making it more negative.
Unless I am missing something about your definition of error or the gradient calculation you should change these lines to:
bias -= Parameters.LEARNING_RATE * errorGradient;
and
connection.weight -= Parameters.LEARNING_RATE * connection.input.getOutput() * errorGradient;
I had a similar mistake in one of my early implementations, and it led to exactly the same behaviour, i.e. it resulted in a network that learned in simple cases, but not any more once the training data became more complex.
LiKao's comment to simplify my implementation and get rid of the object-oriented aspects solved my problem. The flaw in the algorithm as described above is unknown; however, I now have a working neural network that is a lot smaller.
Feel free to continue to provide insight on the problem with my previous implementation, as others may have the same problem in the future.
I'm a bit rusty on neural networks, but I think there was a problem implementing XOR with one perceptron: basically, a neuron is able to separate two groups of solutions with a straight line, but one straight line is not sufficient for the XOR problem...
Here should be the answer!
I couldn't see anything wrong with the code, but I was having a similar problem with my network not converging for XOR, so figured I'd post my working configuration.
• 3 input neurons (one of them being a fixed bias of 1.0)
• 3 hidden neurons
• 1 output neuron
• Weights randomly chosen between -0.5 and 0.5
• Sigmoid activation function
• Learning rate = 0.2
• Momentum = 0.4
• Epochs = 50,000
• Converged 10/10 times
One of the mistakes I was making was not connecting the bias input to the output neuron; for the same configuration this meant it only converged 2 out of 10 times, with the other eight times failing because the input (1, 1) would output 0.5.
Another mistake was not doing enough epochs. If I only did 1000, then the outputs tended to be around 0.5 for every test case. With epochs >= 8000 (so 2000 per test case), it started to look like it might be working (but only if using momentum).
When doing 50,000 epochs it did not matter whether momentum was used or not.
Another thing I tried was to not apply the sigmoid function to the output neuron's output (which I think was what an earlier post had suggested), but this wrecked the network, because the output * (1 - output) part of the error equation could now be negative, meaning weights were updated in a way that caused the error to increase.

Splitting up visual blocks of text in java

I have a block of text I'm trying to interpret in Java (or with grep/awk/etc.) that looks like the following:
Somewhat differently, plaques of the rN8 and rN9 mutants and human coronavirus OC43 as well as the more divergent
were of fully wild-type size, indicating that the suppressor mu- SARS-CoV, human coronavirus HKU1, and bat coronaviruses
tations, in isolation, were not noticeably deleterious to the HKU4, HKU5, and HKU9 (Fig. 6B). Thus, not only do mem-
--
able effect on the viral phenotype. A potentially related obser- sented for the existence of an interaction between nsp9
vation is that the mutation A2U, which is also neutral by itself, nsp8 (56). A hexadecameric complex of SARS-CoV nsp8 and
is lethal in combination with the AACAAG insertion (data not nsp7 has been found to bind to double-stranded RNA. The
What I'd like to do is split it into two parts: left and right. I'm having trouble coming up with a regex or any other method that would split a block of text that is obviously split visually, but not obviously so to a programming language. The lengths of the lines are variable.
I've considered looking for the first block and then finding the second by looking for multiple spaces, but I'm not sure that that's a robust solution. Any ideas, snippets, pseudo code, links, etc?
Text Source
The text has been run through pdftotext as follows: pdftotext -layout MyPdf.pdf
Blur the text and come up with an array of the character density per column of text. Then look for gaps and split there.
String blurredText = text.replaceAll("(?<=\\S) (?=\\S)", ".");  // close up single spaces inside columns
String[] blurredLines = blurredText.split("\r\n?|\n");

int maxRowLength = 0;
for (String blurredLine : blurredLines) {
    maxRowLength = Math.max(maxRowLength, blurredLine.length());
}

int[] columnCounts = new int[maxRowLength];
for (String blurredLine : blurredLines) {
    for (int i = 0, n = blurredLine.length(); i < n; ++i) {
        if (blurredLine.charAt(i) != ' ') { ++columnCounts[i]; }
    }
}

// Look for runs of zeros of at least length 3.
// Alternatively, you might look for the n longest runs of zeros.
// Alternatively, you might look for runs of length min(columnCounts) to ignore
// horizontal rules.
int minBreakLen = 3; // A tuning parameter.
List<Integer> breaks = new ArrayList<Integer>();
for (int i = 0; i < maxRowLength - minBreakLen; ++i) {
    if (columnCounts[i] != 0) { continue; }
    int runLength = 1;
    while (i + runLength < maxRowLength && 0 == columnCounts[i + runLength]) {
        ++runLength;
    }
    if (runLength >= minBreakLen) {
        breaks.add(i);
    }
    i += runLength - 1;
}
System.out.println(breaks);
I doubt there is any robust solution to this. I would go for some sort of heuristic approach.
Off the top of my head, I would calculate a histogram of the column index of the first character of each word, and split on the column with the highest score (the idea being to find lots of words that are all aligned horizontally). I might also choose to weight this based on the number of preceding spaces.
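A rough sketch of that histogram idea (the choice to skip the first few columns and to split at the single best-scoring column is my own assumption, not part of the suggestion above):

// Count how often each column index is the start of a word, then treat the
// strongest column (ignoring the leftmost region, where the left block starts) as the gutter.
static int findSplitColumn(String[] lines) {
    int maxLen = 0;
    for (String line : lines) maxLen = Math.max(maxLen, line.length());

    int[] wordStartCounts = new int[maxLen];
    for (String line : lines) {
        for (int i = 1; i < line.length(); i++) {
            if (line.charAt(i) != ' ' && line.charAt(i - 1) == ' ') {
                wordStartCounts[i]++;
            }
        }
    }

    int bestColumn = -1, bestCount = 0;
    for (int i = 10; i < maxLen; i++) {   // skip the first few columns: assumed left-block region
        if (wordStartCounts[i] > bestCount) {
            bestCount = wordStartCounts[i];
            bestColumn = i;
        }
    }
    return bestColumn;   // split each line at this column to get left and right blocks
}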
I work in this general area. I am surprised that a double-column bioscience text of recent times (SARS, etc.) would be rendered in double-column monospace as the original - it would be typeset in proportional font or in HTML. So I suspect your text came from some other format (such as PDF). If so then you should try to get that format. PDF is horrible to parse, but PDF flattened to monospace is probably worse.
If you possibly can, find someone who has worked in the area and see what they have done. If you have multiple documents (e.g. from different journals or reports), then your problem is worse. Yes, I could write an algorithm to solve the example you have posted, but my guess is it will break on the next set of documents. You will end up customising this for each different source (I and others have had to do this).
UPDATE: Thanks. As it's PDF, I would start by asking around. We collaborate with the group at Penn State (who have also done Citeseer). I also have colleagues at Cambridge who have spent months on a PDF reader.
If you want to do it yourself - and it will take time - then I'd start with PDFBox. I've done quite a lot with this and I think it's better for this than pdf2text or pdftotext. I can't remember whether it has a double-column option - I think so.
UPDATE: Here is a recent answer describing several ways of tackling double-column PDF:
http://metaoptimize.com/qa/questions/3943/methods-for-extracting-two-column-text-from-a-pdf
I'd certainly see what other people have done.
FWIW I spend a lot of time trying to convince people that scientists should not create their output in PDF, because it destroys machine parsing, as you and I have found.
UPDATE: You get the PDFs from your PI (== Principal Investigator?), in which case you'll get lots of different sources, which makes it worse.
What is the real problem you are trying to solve? I may be able to help

Problem with terminated paths in simple recursive algorithm

First of all: this is not a homework assignment, it's for a hobby project of mine.
Background:
For my Java puzzle game I use a very simple recursive algorithm to check whether certain spaces on the 'map' have become isolated after a piece is placed. Isolated in this case means: spaces in which no pieces can be placed.
Current Algorithm:
public int isolatedSpace(Tile currentTile, int currentSpace) {
    if (currentTile != null) {
        if (currentTile.isOpen()) {
            currentTile.flag(); // mark as visited
            currentSpace += 1;
            currentSpace = isolatedSpace(currentTile.rightNeighbor(), currentSpace);
            currentSpace = isolatedSpace(currentTile.underNeighbor(), currentSpace);
            currentSpace = isolatedSpace(currentTile.leftNeighbor(), currentSpace);
            currentSpace = isolatedSpace(currentTile.upperNeighbor(), currentSpace);
            if (currentSpace < 3) { currentTile.markAsIsolated(); } // <-- the problem
        }
    }
    return currentSpace;
}
This piece of code returns the size of the empty space that the starting tile is part of. That part of the code works as intended. But I came across a problem regarding the marking of the tiles, and that is what makes the title of this question relevant ;)
The problem:
The problem is that certain tiles are never 'revisited' (they return a value and terminate, so never get a return value themselves from a later incarnation to update the size of the empty space). These 'forgotten' tiles can be part of a large space but still marked as isolated because they were visited at the beginning of the process when currentSpace had a low value.
Question:
How can I improve this code so it sets the correct value on the tiles without too much overhead? I can think of ugly solutions, like revisiting all flagged tiles and, if they have the proper value, checking whether the neighbours have the same value and updating if not, etc. But I'm sure there are brilliant people here on Stack Overflow with much better ideas ;)
Update:
I've made some changes.
public int isolatedSpace(Tile currentTile, int currentSpace, LinkedList<Tile> visitedTiles) {
    if (currentTile != null) {
        if (currentTile.isOpen()) {
            // do the same as before, but also remember the visited tile
            visitedTiles.add(currentTile);
        }
    }
    return currentSpace;
}
And the marktiles function (only called when the returned space size is smaller than a given value):
void marktiles(LinkedList<Tile> visitedTiles) {
    for (Tile t : visitedTiles) {
        t.markAsIsolated();
    }
}
This approach is in line with the answer of Rex Kerr, at least if I understood his idea.
This isn't a general solution, but you only mark spaces as isolated if they occur in a region of two or fewer spaces. Can't you simplify this test to: a space is isolated iff either (a) it has no open neighbours, or (b) it has precisely one open neighbour and that neighbour has no other open neighbours?
You need a two-step process: gathering info about whether a space is isolated, and then marking it as isolated separately. So you'll need to first count up all the spaces (using one recursive function) and then mark all connected spaces if the criterion passes (using a different recursive function).
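A minimal sketch of that two-pass idea, under assumptions about the Tile API: isFlagged() and isMarkedAsIsolated() are hypothetical accessors (the question only shows flag(), isOpen(), markAsIsolated() and the neighbour methods), and countSpace/markRegion are illustrative names.

// Pass 1: flood-fill from the starting tile, flagging tiles and counting the region size.
int countSpace(Tile tile) {
    if (tile == null || !tile.isOpen()) return 0;  // isOpen() is assumed to become false once flagged, as in the question
    tile.flag();
    return 1
        + countSpace(tile.rightNeighbor())
        + countSpace(tile.underNeighbor())
        + countSpace(tile.leftNeighbor())
        + countSpace(tile.upperNeighbor());
}

// Pass 2: only if the counted region is small enough, walk the flagged region again and mark every tile.
void markRegion(Tile tile) {
    if (tile == null || !tile.isFlagged() || tile.isMarkedAsIsolated()) return;
    tile.markAsIsolated();
    markRegion(tile.rightNeighbor());
    markRegion(tile.underNeighbor());
    markRegion(tile.leftNeighbor());
    markRegion(tile.upperNeighbor());
}

// Usage sketch:
//   int size = countSpace(startTile);
//   if (size < 3) markRegion(startTile);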
