First of all: this is not a homework assignment, it's for a hobby project of mine.
Background:
For my Java puzzle game I use a very simple recursive algorithm to check whether certain spaces on the 'map' have become isolated after a piece is placed. Isolated in this case means: a space in which no pieces can be placed.
Current Algorithm:
public int isolatedSpace(Tile currentTile, int currentSpace) {
    if (currentTile != null) {
        if (currentTile.isOpen()) {
            currentTile.flag(); // mark as visited
            currentSpace += 1;
            currentSpace = isolatedSpace(currentTile.rightNeighbor(), currentSpace);
            currentSpace = isolatedSpace(currentTile.underNeighbor(), currentSpace);
            currentSpace = isolatedSpace(currentTile.leftNeighbor(), currentSpace);
            currentSpace = isolatedSpace(currentTile.upperNeighbor(), currentSpace);
            if (currentSpace < 3) { currentTile.markAsIsolated(); } // <-- the problem
        }
    }
    return currentSpace;
}
This piece of code returns the size of the empty space that the starting tile is part of. That part of the code works as intended. But I came across a problem with the marking of the tiles, and that is what makes the title of this question relevant ;)
The problem:
The problem is that certain tiles are never 'revisited' (they return a value and terminate, so they never get a return value from a later incarnation to update the size of the empty space). These 'forgotten' tiles can be part of a large space but still be marked as isolated, because they were visited at the beginning of the process, when currentSpace still had a low value.
Question:
How can I improve this code so that it sets the correct value on the tiles without too much overhead? I can think of ugly solutions, like revisiting all flagged tiles, checking whether their neighbors have the same value, and updating them if not. But I'm sure there are brilliant people here on Stack Overflow with much better ideas ;)
Update:
I've made some changes.
public int isolatedSpace(Tile currentTile, int currentSpace, LinkedList<Tile> visitedTiles) {
    if (currentTile != null) {
        if (currentTile.isOpen()) {
            // do the same as before
            visitedTiles.add(currentTile);
        }
    }
    return currentSpace;
}
And the markTiles function (only called when the returned space size is smaller than a given value):
private void markTiles(List<Tile> visitedTiles) {
    for (Tile t : visitedTiles) {
        t.markAsIsolated();
    }
}
This approach is in line with the answer of Rex Kerr, at least if I understood his idea.
This isn't a general solution, but since you only mark spaces as isolated if they occur in a region of two or fewer spaces, can't you simplify the test to: "a space is isolated iff either (a) it has no open neighbours, or (b) it has precisely one open neighbour and that neighbour has no other open neighbours"?
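For example, a rough sketch of that local test (openNeighbors() is a helper assumed here, built from the neighbour accessors in the question; it is not part of the original Tile API):

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collect the open tiles adjacent to t.
static List<Tile> openNeighbors(Tile t) {
    List<Tile> open = new ArrayList<>();
    for (Tile n : new Tile[] { t.rightNeighbor(), t.underNeighbor(),
                               t.leftNeighbor(), t.upperNeighbor() }) {
        if (n != null && n.isOpen()) {
            open.add(n);
        }
    }
    return open;
}

static boolean isIsolated(Tile t) {
    List<Tile> open = openNeighbors(t);
    if (open.isEmpty()) {
        return true; // (a) no open neighbours
    }
    if (open.size() == 1) {
        // (b) one open neighbour, whose only open neighbour is t itself
        return openNeighbors(open.get(0)).size() == 1;
    }
    return false;
}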
You need a two-step process: gathering info about whether a space is isolated, and then marking it as isolated separately. So you'll need to first count up all the spaces (using one recursive function) and then mark all connected spaces if the criterion passes (using a different recursive function).
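A minimal sketch of that two-step idea, reusing the Tile accessors from the question; here the counting pass collects the region into a list (much like the LinkedList in the update above), and the marking pass runs over that list only if the criterion passes:

import java.util.ArrayList;
import java.util.List;

// Step 1: count the connected open region, remembering its tiles.
static int countSpace(Tile t, List<Tile> region) {
    if (t == null || !t.isOpen() || region.contains(t)) {
        return 0;
    }
    region.add(t); // doubles as the "visited" flag
    return 1 + countSpace(t.rightNeighbor(), region)
             + countSpace(t.underNeighbor(), region)
             + countSpace(t.leftNeighbor(), region)
             + countSpace(t.upperNeighbor(), region);
}

// Step 2: mark the whole region, but only when it is small enough.
static void markIfIsolated(Tile start, int threshold) {
    List<Tile> region = new ArrayList<>();
    if (countSpace(start, region) < threshold) {
        for (Tile t : region) {
            t.markAsIsolated();
        }
    }
}

With many tiles you'd want a HashSet instead of the contains() check on a list, but the shape of the two passes is the same.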
Related
We have a system that processes a flat file and (with only a couple of validations) inserts it into a database.
This code:
// there can be 8 million lines
for (String line : lines) {
    // obj is parsed from the current line (parsing code omitted here)
    if (!Class.isBranchNoValid(validBranchNoArr, obj.branchNo)) {
        continue;
    }
    list.add(line);
}
definition of isBranchNoValid:
//the array length ranges from 2 to 5 only
public static boolean isBranchNoValid(String[] validBranchNoArr, String branchNo) {
for (int i = 0; i < validBranchNoArr.length; i++) {
if (validBranchNoArr[i].equals(branchNo)) {
return true;
}
}
return false;
}
The validation is at line level (we have to filter out, i.e. skip, any line whose branchNo is not in the array). Earlier, this filtering wasn't in place.
Now, severe performance degradation is troubling us.
I understand (maybe I am wrong) that this repeated function call causes a lot of stack creation, resulting in very frequent GC invocation.
I can't figure out a way (if it is even possible) to perform this filter without this high performance cost (a small difference is fine).
This is certainly not a stack problem: your function is not recursive, so nothing is kept on the stack between calls; after each call the local variables are discarded, since they are no longer needed.
You could put the valid numbers in a Set and use that for some optimization, but in your case I am not sure it will bring much benefit at all, since you have at most 5 elements.
So there are several possible bottlenecks in your scenario:
reading the lines of the file
parsing each line to construct the object to insert into the database
checking the applicability of the object (i.e. the branch-no filter)
inserting into the DB
Generally, you'd say IO is the slowest, so 1. and 4. You're saying nothing except 3. changed, right? That is weird.
Anyway, if you want to optimize that, I wouldn't be passing the array around 8 million times, and I wouldn't iterate it every time either. Since your valid branches are known, create a HashSet from it - it has O(1) access.
Set<String> validBranches = Arrays.stream(branches)
.collect(Collectors.toCollection(HashSet::new));
Then, iterate the lines
for (String line : lines) {
YourObject obj = parse(line);
if (validBranches.contains(obj.branchNo)) {
writeToDb(obj);
}
}
or, in the stream version
Files.lines(yourPath)
.map(this::parse)
.filter(o -> validBranches.contains(o.branchNo))
.forEach(this::writeToDb);
I'd also check whether it isn't more efficient to first collect a batch of objects and then write them to the DB in one go. Also, it's possible that handling the lines in parallel gains some speed, in case the parsing is time-intensive.
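For the batching, a minimal JDBC sketch (the table and column names are made up, and parse/YourObject/validBranches are the same assumptions as above):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Set;

// Sketch: filter, then insert in batches of 1000 instead of row by row.
// Table/column names are hypothetical; adjust to your schema.
static void importLines(Iterable<String> lines, Set<String> validBranches,
                        String url, String user, String pass) throws SQLException {
    try (Connection con = DriverManager.getConnection(url, user, pass);
         PreparedStatement ps = con.prepareStatement(
                 "INSERT INTO records (branch_no, payload) VALUES (?, ?)")) {
        int pending = 0;
        for (String line : lines) {
            YourObject obj = parse(line);
            if (!validBranches.contains(obj.branchNo)) {
                continue;
            }
            ps.setString(1, obj.branchNo);
            ps.setString(2, line);
            ps.addBatch();
            if (++pending % 1000 == 0) {
                ps.executeBatch(); // flush every 1000 rows
            }
        }
        ps.executeBatch(); // flush the remainder
    }
}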
I have an array list with some names inside it (first and last names). What I have to do is go through each "first name" and see how many times a character (which the user specifies) shows up at the end of every first name in the array list, and then print out the number of times that character showed up.
public int countFirstName(char c) {
int i = 0;
for (Name n : list) {
if (n.getFirstName().length() - 1 == c) {
i++;
}
}
return i;
}
That is the code I have. The problem is that the counter (i) doesn't add 1 even if there is a character that matches the end of the first name.
You're comparing the index of last character in the string to the required character, instead of the last character itself, which you can access with charAt:
String firstName = n.getFirstName();
if (firstName.charAt(firstName.length() - 1) == c) {
i++;
}
When you're setting out learning to code, there is great value in using pencil and paper, or in describing your algorithm ahead of time in the language you think in. Most people learning a foreign language start out by assembling a sentence in their native language, translating it to the foreign one, then speaking the translation. Few, if any, learners of a foreign language are able to think in it natively.
Coding is no different: all your life you've been speaking English and thinking in it. Now you're aiming to learn a different pattern of thinking, a different syntax, different keywords. This task will go a lot easier if you:
work out in high level natural language what you want to do first
write down the steps in clear and simple language, like a recipe
don't try to do too much at once
Had I been a tutor marking your program, I'd have been looking for something like this:
//method to count the number of list entries ending with a particular character
public int countFirstNamesEndingWith(char lookFor) {
//declare a variable to hold the count
int cnt = 0;
//iterate the list
for (Name n : list) {
//get the first name
String fn = n.getFirstName();
//get the last char of it
char lc = fn.charAt(fn.length() - 1);
//compare
if (lc == lookFor) {
cnt++;
}
}
return cnt;
}
Taking the bullet points in turn:
The comments serve as a high-level description of what must be done. We write them all first, before writing even a single line of code. My course penalised uncommented code, and writing the comments first was a handy way of getting the requirement out of the way (they're a chore, right? Not always, but...). It is also really easy to write a logic algorithm in high-level language first and then translate the steps into the language you're learning. I definitely think that if you'd taken this approach you wouldn't have made the error you did, as it would have been clear that the code you wrote didn't implement the algorithm you'd described earlier.
Don't try to do too much in one line. Yes, I'm sure plenty of coders think it looks cool, or tricky, or shows off what impressive coding smarts they have, to pack a good 10-line algorithm into a single line of code that uses some obscure language features. But one day it's highly likely that someone else will have to come along and maintain that code, improve it or change part of what it does - at that moment it's no longer cool, and it was never really a smart thing to do.
Aominee, in their comment, actually gives us something like an example of this:
return (int)list.stream().filter(e -> e.charAt(e.length()-1) == c).count();
It's a one-line implementation of a solution to your problem. Cool, huh? Well, it has a bug* (for a start), but that's not the main thrust of my argument. At a more basic level: have you got any idea what it's doing? Can you look at it and in 2 seconds tell me how it works?
It's quite an advanced language feature, and it's tricky for sure, but it might be a very poor solution because it's hard to understand, hard to maintain as a result, and does a lot while looking like a little - it only really makes sense if you're well versed in the language. This one line bundles up a facility that loops over your list and a feature that effectively has a tiny sub-method called for every item in the list, whose job is to calculate whether the name ends with the sought char.
It's a brilliant feature, a cute example, and it surely has its place in production Java, but its place is probably not here, in your learning exercise.
Similarly, I'd go as far as to say that this line of yours:
if (n.getFirstName().length() - 1 == c) {
is approaching "doing too much" - I say this because it's where your logic broke down; you didn't write enough code to effectively implement the algorithm. You'd actually have to write even more code to implement it this way:
if (n.getFirstName().charAt(n.getFirstName().length() - 1) == c) {
This is a right eyeful to load into your brain and understand. The accepted answer broke it down a bit by first getting the name into a temporary variable. That's a sensible optimisation. I broke it down another step by getting the last char into a temp variable too. In a production system I probably wouldn't go that far, but this is your learning phase - try to minimise the number of operations each of your lines does. It will aid your understanding of your own code a great deal.
If you do ever get a penchant for writing as much code as possible in as few chars as possible, look at some of the code golf games here on the Stack Exchange network; the game is to abuse as many language features as possible to make really short, tricky code... pretty much every winner stands as a testament to condensed code that should never, ever be put into a production system maintained by normal coders who value their sanity.
*the bug is it doesn't get the first name out of the Name object
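For completeness, a corrected version of that one-liner would first pull the first name out of each Name (still dense - which is rather the point):

return (int) list.stream()
        .map(Name::getFirstName)
        .filter(fn -> !fn.isEmpty() && fn.charAt(fn.length() - 1) == c)
        .count();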
I have written a 2x2x2 Rubik's cube solver that uses a breadth-first search algorithm to solve a cube position entered by the user. The program does solve the cube. However, I encounter a problem when I enter a hard position that would be found deep in the search: I run out of heap space. My computer only has 4 GB of RAM, and when I run the program I give 3 GB to it. I was wondering what I could do to reduce the amount of RAM the search takes, possibly by changing a few aspects of the BFS.
static private void solve(Cube c) {
Set<Cube> cubesFound = new HashSet<Cube>();
cubesFound.add(c);
Stack<Cube> s = new Stack<Cube>();
s.push(c);
Set<Stack<Cube>> initialPaths = new HashSet<Stack<Cube>>();
initialPaths.add(s);
solve(initialPaths, cubesFound);
}
static private void solve(Set<Stack<Cube>> livePaths, Set<Cube> cubesFoundSoFar) {
System.out.println("livePaths size:" + livePaths.size());
int numDupes = 0;
Set<Stack<Cube>> newLivePaths = new HashSet<Stack<Cube>>();
for(Stack<Cube> currentPath : livePaths) {
Set<Cube> nextStates = currentPath.peek().getNextStates();
for (Cube next : nextStates) {
if (currentPath.size() > 1 && next.isSolved()) {
currentPath.push(next);
System.out.println("Path length:" + currentPath.size());
System.out.println("Path:" + currentPath);
System.exit(0);
} else if (!cubesFoundSoFar.contains(next)) {
Stack<Cube> newCurrentPath = new Stack<Cube>();
newCurrentPath.addAll(currentPath);
newCurrentPath.push(next);
newLivePaths.add(newCurrentPath);
cubesFoundSoFar.add(next);
} else {
numDupes += 1;
}
}
}
String storeStates = "positions.txt";
try {
PrintWriter outputStream = new PrintWriter(storeStates);
outputStream.println(cubesFoundSoFar);
outputStream.println(storeStates);
outputStream.close();
} catch (FileNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
System.out.println("Duplicates found " + numDupes + ".");
solve(newLivePaths, cubesFoundSoFar);
}
That is the BFS, but I fear it's not all the information needed to understand what is going on, so here is the link to the file with all the code: https://github.com/HaginCodes/2x2-Cube-Solver/blob/master/src/com/haginonyango/pocketsolver/Cube.java
By definition, "Best first search" keeps the entire search frontier of possible paths through the space.
That can be exponentially large. With the 3x3x3 Rubik's cube, I think there are 12? possible moves at each point, so a 10-move sequence to solve arguably requires 12^10 combinations, which is well over a billion (10^9). With this many states, you'll want to minimize the size of the state to minimize total storage. (Uh, you actually print all the states? "outputStream.println(cubesFoundSoFar);" Isn't that a vast amount of output?)
With 2x2x2, you only have 8 possible moves at each point. I don't know solution lengths here for random problems. If it is still length 10, you get 8^10 which is still pretty big.
Now, many move sequences lead to the same cube configuration. To recognize this you need to check that a generated move does not re-generate a position already visited. You seem to be doing that (good!) and tracking the number of hits; I'd expect that hit count to be pretty high as many paths should lead to the same configuration.
What you don't show is how you score each move sequence to guide the search. Which node does it expand next? This is where best comes into play. If you don't have any guidance (e.g., all enumerated move sequences have the same value), you truly will wander over a huge space, because all move sequences are equally good. What guides your solver to a solution? You need something like a priority queue over nodes, with the priority determined by the score, and that I don't see in the code you presented here.
I don't know what a great heuristic for scoring the quality of a move sequence would be, but you can start by scoring a move sequence with the number of moves it takes to arrive there. The next "best move" to try is then the one at the end of the shortest path so far. That will probably help immensely.
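A sketch of what that could look like with Java's PriorityQueue, scoring each path by its move count (Cube, getNextStates() and isSolved() are from your code; the Path wrapper is hypothetical):

import java.util.Comparator;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

// Hypothetical wrapper: a reached cube, its score, and a parent link so
// the solution can be reconstructed without storing a whole Stack per path.
class Path {
    final Cube cube;
    final int moves;
    final Path parent;
    Path(Cube cube, int moves, Path parent) {
        this.cube = cube;
        this.moves = moves;
        this.parent = parent;
    }
}

static Path solve(Cube start) {
    PriorityQueue<Path> frontier =
            new PriorityQueue<>(Comparator.comparingInt((Path p) -> p.moves));
    Set<Cube> seen = new HashSet<>();
    frontier.add(new Path(start, 0, null));
    seen.add(start);
    while (!frontier.isEmpty()) {
        Path p = frontier.poll(); // always expand the best-scored path
        if (p.cube.isSolved()) {
            return p; // walk the parent links for the move sequence
        }
        for (Cube next : p.cube.getNextStates()) {
            if (seen.add(next)) { // skip duplicates, as you already do
                frontier.add(new Path(next, p.moves + 1, p));
            }
        }
    }
    return null; // exhausted the space without finding a solution
}

Storing one parent link per state, instead of copying a whole Stack for every live path, should also cut your memory use substantially.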
(A simple enhancement that might work is to count the number of colors on a face; 3 colors hints [is this true?] that it might take 3-1 --> 2 moves minimum to remove the wrong colors. Then the score might be #moves + #facecolors - 1, as an estimate of the number of moves to a solution; clearly you want the shortest sequence of moves. This might be a so-called "admissible" heuristic score.)
You'll also have to adjust your scheme for detecting duplicate move sequences. When you find an already-encountered state, that state will presumably have attached to it the score (move count) it took to reach it. When you get a hit, you've found another way to get to that same state... but the score for the new path may be smaller than what is recorded in the state. In this case, you need to revise the score of the rediscovered duplicate state to the smaller new score. This way a path/state with score 20 may in fact be discovered to have a score of 10, which means it has suddenly improved.
So I realize this text format may not be very unusual. However, I've been trying many ideas to read it correctly into the objects needed, and I know there has to be a better way. Here is what the file looks like:
S S n
B 1 E 2
B N n
C 2 F 3
C N n
D 2 GA 4
D N n
GA 1
E N n
B 1 F 3 H 6
F N n
I 3 GA 3 C 1
A N g
H B b
U 2 GB 2 F 1
I N n
GA 2 GB 2
GB N g
So the first line of each pair gives the name of the node (here S), whether it's a starting node (S/N), and then whether it's a goal node (g/n).
The second line lists the children of node S and their distance weights.
For example, node S has child node B with a distance of 1, and child node E with a distance of 2.
I'm working with two object types: Nodes and Edges.
Example below (constructors omitted).
Does anyone have tips on how to read this type of input efficiently?
public class Edge {
    public String start;
    public String end;
    public int weight;
}

public class Node {
    public String name;
    public boolean start = false;
    public boolean goal = false;
    public ArrayList<Edge> adjecentNodes = new ArrayList<Edge>();
}
Actually, your question is almost too broad and unspecific, but I am in the mood to give you some starting points. Please understand, though, that you could easily fill several hours of computer-science lectures on this topic, and that is not going to happen here.
First, you have to clarify some requirements (for yourself; or together with the folks working on this project):
Do you really need to focus on efficient reading/building of your graph? I rather doubt that: building the graph happens once; but the computations that you probably do later on may run for a much longer time. So one primary focus should be on designing that object/class model that allows you to efficiently solve the problems that you want to solve on that graph! For example: it might be beneficial to already sort edges by distance/weight when creating the graph. Or maybe not. Depends on later use cases! And even when you are talking about huge files that need efficient processing ... that still means: you are talking about huge graphs; so all the more reason to find a good model for that graph.
Your description of the file is not clear. For example: is this an (un)directed graph? Meaning: can you travel on any edge in both directions? And sorry, I didn't get what a "goal" node is supposed to be. (I guess you have directed edges that go one way only, as that would explain those rows in the example where nodes do not have any children.) Of course, sometimes requirements only become clear the moment you start writing real code. But this here is really about concepts/data/algorithms, so the earlier you answer all such questions, the better for you.
Secondly, a suggestion in which order to do things:
As said, clarify all your requirements. Spend some serious time just thinking about the properties of the graphs you are dealing with; and what problems you later have to solve on them.
Start coding; ideally you use TDD/unit testing here, because all of the things you are going to do can be nicely sliced into small work packages, and each one can be tested with unit tests. Do not write all your code first, only to run your first experiments after 2 days! The first thing you code: your Node/Edge classes ... because you want to play around with things like: what arguments do my constructors need? How can I make my classes immutable (data is pushed in by constructors only)? Do I want distance to be a property of my Edge, or can I just go with Node objects (and represent edges as Map<Node, Integer> - meaning each node just knows its neighbors and the distance to get there)?
Then, when you are convinced that Node/Edge fit your problem, you start writing code that takes strings and builds Nodes/Edges out of those strings.
You also want to spend some time on writing good dump methods; ideally you call graph.dump() ... and that produces a string matching your input format (this makes a nice test later on: reading + dumping should result in identical files!).
Then, when "building from strings" works, you write the few lines of "file parsing code" that use some BufferedReader, FileReader, Scanner, whatever mechanism to dissect your input file into strings ... which you then feed into the methods you created in step 3.
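A sketch of that parsing layer, under the format assumptions from the question (node line = name, start flag, goal flag; children line = name/weight pairs); buildNode and addEdge stand for whatever builder methods you wrote in the previous step:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Sketch: dissect the file into strings and feed them to the builders.
// buildNode/addEdge are hypothetical, from the "building from strings" step.
static void readGraph(String path) throws IOException {
    try (BufferedReader in = new BufferedReader(new FileReader(path))) {
        String nodeLine;
        while ((nodeLine = in.readLine()) != null) {
            String[] n = nodeLine.trim().split("\\s+");
            Node node = buildNode(n[0],              // name
                    n[1].equalsIgnoreCase("s"),      // start flag
                    n[2].equalsIgnoreCase("g"));     // goal flag
            String childLine = in.readLine();
            if (childLine == null) {
                break; // last node has no children line
            }
            String[] c = childLine.trim().split("\\s+");
            for (int i = 0; i + 1 < c.length; i += 2) {
                addEdge(node, c[i], Integer.parseInt(c[i + 1])); // child, weight
            }
        }
    }
}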
And, seriously: if this is for school/learning:
Try to talk to your peers often. Have them look at what you are doing. Not too early, but also not too late.
Really, seriously: consider throwing away stuff; and starting from scratch again. For each of the steps above, or after going through the whole sequence. It is an incredible experience to do that; because typically, you come up with new, different, interesting ideas each time you do that.
And finally, some specific hints:
It is tempting to use "public" (writable) fields like start/end/... in your classes. But consider not doing that: try to hide as much of the internals of your classes as possible. That will make it easier (or even possible!) later on to change one part of your program without needing to change anything else, too.
Example:
class Edge {
    private final int distance;
    private final Node target;

    public Edge(int distance, Node target) {
        this.distance = distance;
        this.target = target;
    }
    ...
}
This creates an immutable object - you can't change its core internal properties after the object was created. That is very often helpful.
Then: override methods like toString(), equals(), hashCode() in your classes; and use them. For example, toString() can be used to create a nice, human-readable dump of a node.
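For instance, if two Node objects with the same name are meant to denote the same node, a minimal sketch inside your Node class:

@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof Node)) return false;
    return name.equals(((Node) o).name); // identity is the node name
}

@Override
public int hashCode() {
    return name.hashCode(); // must agree with equals
}

@Override
public String toString() {
    return name + (goal ? " (goal)" : "");
}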
Finally: if you liked all of that, consider remembering my user id; and when you reach enough reputation to "upvote", come back and upvote ;-)
I have a block of text I'm trying to interpret in java (or with grep/awk/etc) looking like the following:
Somewhat differently, plaques of the rN8 and rN9 mutants and human coronavirus OC43 as well as the more divergent
were of fully wild-type size, indicating that the suppressor mu- SARS-CoV, human coronavirus HKU1, and bat coronaviruses
tations, in isolation, were not noticeably deleterious to the HKU4, HKU5, and HKU9 (Fig. 6B). Thus, not only do mem-
--
able effect on the viral phenotype. A potentially related obser- sented for the existence of an interaction between nsp9
vation is that the mutation A2U, which is also neutral by itself, nsp8 (56). A hexadecameric complex of SARS-CoV nsp8 and
is lethal in combination with the AACAAG insertion (data not nsp7 has been found to bind to double-stranded RNA. The
And what I'd like to do is split it into two parts: left and right. I'm having trouble coming up with a regex or any other method that would split a block of text that is obviously split visually, but not obviously so to a programming language. The lengths of the lines are variable.
I've considered looking for the first block and then finding the second by looking for multiple spaces, but I'm not sure that that's a robust solution. Any ideas, snippets, pseudo code, links, etc?
Text Source
The text has been run through pdftotext as follows: pdftotext -layout MyPdf.pdf
Blur the text and come up with an array of the character density per column of text. Then look for gaps and split there.
String blurredText = text.replaceAll("(?<=\\S) (?=\\S)", ".");
String[] blurredLines = blurredText.split("\r\n?|\n");
int maxRowLength = 0;
for (String blurredLine : blurredLines) {
maxRowLength = Math.max(maxRowLength, blurredLine.length());
}
int[] columnCounts = new int[maxRowLength];
for (String blurredLine : blurredLines) {
for (int i = 0, n = blurredLine.length(); i < n; ++i) {
if (blurredLine.charAt(i) != ' ') { ++columnCounts[i]; }
}
}
// Look for runs of zero of at least length 3.
// Alternatively, you might look for the n longest runs of zeros.
// Alternatively, you might look for runs of length min(columnCounts) to ignore
// horizontal rules.
int minBreakLen = 3; // A tuning parameter.
List<Integer> breaks = new ArrayList<Integer>();
for (int i = 0; i < maxRowLength - minBreakLen; ++i) {
if (columnCounts[i] != 0) { continue; }
int runLength = 1;
while (i + runLength < maxRowLength && 0 == columnCounts[i + runLength]) {
++runLength;
}
if (runLength >= minBreakLen) {
breaks.add(i);
}
i += runLength - 1;
}
System.out.println(breaks);
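With the break columns in hand, splitting the original lines is just substring arithmetic; a short sketch that assumes a single two-column layout and takes the first detected break:

// Sketch: cut each original line at the first detected gap.
if (!breaks.isEmpty()) {
    int cut = breaks.get(0);
    for (String line : text.split("\r\n?|\n")) {
        String left = line.length() > cut ? line.substring(0, cut) : line;
        String right = line.length() > cut ? line.substring(cut).trim() : "";
        System.out.println(left.trim() + " | " + right);
    }
}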
I doubt there is any robust solution to this. I would go for some sort of heuristic approach.
Off the top of my head, I would calculate a histogram of the column index of the first character of each word, and split on the column with the highest score (the idea being to find lots of words that are all aligned horizontally). I might also choose to weight this based on the number of preceding spaces.
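A sketch of that idea (it picks the interior column where the most words start as the split point; the weighting by preceding spaces is left out):

// Sketch: histogram of word-start columns; the winner is the likely
// start of the right-hand column.
String[] rows = text.split("\r\n?|\n");
int width = 0;
for (String row : rows) {
    width = Math.max(width, row.length());
}
int[] wordStarts = new int[width];
for (String row : rows) {
    for (int i = 1; i < row.length(); ++i) {
        if (row.charAt(i) != ' ' && row.charAt(i - 1) == ' ') {
            ++wordStarts[i]; // a word starts at column i
        }
    }
}
int splitCol = 1;
for (int i = 2; i < width; ++i) {
    if (wordStarts[i] > wordStarts[splitCol]) {
        splitCol = i;
    }
}
// splitCol now holds the most common word-start column.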
I work in this general area. I am surprised that a double-column bioscience text of recent times (SARS, etc.) would be rendered in double-column monospace as the original - it would be typeset in a proportional font or in HTML. So I suspect your text came from some other format (such as PDF). If so, then you should try to get that format. PDF is horrible to parse, but PDF flattened to monospace is probably worse.
If you possibly can, find someone who has worked in the area and see what they have done. If you have multiple documents (e.g. from different journals or reports) then your problem is worse. Yes, I could write an algorithm to solve the example you have posted, but my guess is it will break on the next set of documents. You will end up customising this for each different source (I and others have had to do this).
UPDATE: Thanks. As it's PDF, I would start by asking around. We collaborate with the group at Penn State (who have also done Citeseer). I also have colleagues at Cambridge who have spent months on a PDF reader.
If you want to do it yourself - and it will take time - then I'd start with PDFBox. I've done quite a lot with it, and I think it's better for this than pdf2text or pdftotext. I can't remember whether it has a double-column option - I think so.
UPDATE: Here is a recent answer on several ways of tackling double-column PDF:
http://metaoptimize.com/qa/questions/3943/methods-for-extracting-two-column-text-from-a-pdf
I'd certainly see what other people have done.
FWIW, I spend a lot of time trying to convince people that scientists should not create their output in PDF, because it destroys machine parsing - as you and I have found.
UPDATE: You get the PDFs from your PI (== Principal Investigator?). In that case you'll get lots of different sources, which makes it worse.
What is the real problem you are trying to solve? I may be able to help