I needed some help with creating custom trees given an arithmetic expression. Say, for example, you input this arithmetic expression:
(5+2)*7
The result tree should look like:
*
/ \
+ 7
/ \
5 2
I have some custom classes to represent the different types of nodes, i.e. PlusOp, LeafInt, etc. I don't need to evaluate the expression, just create the tree, so I can perform other functions on it later.
Additionally, the negative operator '-' can only have one child, and to represent '5-2', you must input it as 5 + (-2).
Some validation of the expression would be required to ensure that each type of operator has the correct number of arguments/children and that each opening bracket is matched by a closing bracket.
Also, I should probably mention my friend has already written code which converts the input string into a stack of tokens, if that's going to be helpful for this.
I'd appreciate any help at all. Thanks :)
(I read that you can write a grammar and use antlr/JavaCC, etc. to create the parse tree, but I'm not familiar with these tools or with writing grammars, so if that's your solution, I'd be grateful if you could provide some helpful tutorials/links for them.)
Assuming this is some kind of homework and you want to do it yourself..
I did this once, you need a stack
So what you do for the example is:
parse   what to do?          Stack looks like
(       push it              (
5       push 5               (, 5
+       push +               (, 5, +
2       push 2               (, 5, +, 2
)       evaluate until (     7
*       push *               7, *
7       push 7               7, *, 7
eof     evaluate until top   49
The symbols like "5" or "+" can just be stored as strings or simple objects, or you could store the + as a +() object without setting the values and set them when you are evaluating.
I assume this also requires an order of precedence, so I'll describe how that works.
in the case of: 5 + 2 * 7
You push 5, push +, and push 2; the next operator has higher precedence, so you push it as well, then push 7. When you encounter a ), the end of the input, or an operator with lower or equal precedence, you start evaluating the stack back to the previous ( or the beginning of the input.
Because your stack now contains 5 + 2 * 7, when you evaluate it you pop the 2 * 7 first and push the resulting *(2,7) node onto the stack, then once more you evaluate the top three things on the stack (5 + *node) so the tree comes out correct.
If it was ordered the other way: 5 * 2 + 7, you would push until you got to a stack with "5 * 2", then you would hit the lower precedence +, which means evaluate what you've got now. You'd evaluate the 5 * 2 into a *node and push it, then you'd continue by pushing the + and 7 so you had *node + 7, at which point you'd evaluate that.
This means you have a "highest current precedence" variable that stores a 1 when you push a +/-, a 2 when you push a * or / and a 3 for "^". This way you can just test the variable to see if your next operator's precedence is <= your current precedence.
If ")" is considered priority 4, you can treat it like the other operators, except that it also removes the matching "("; a lower-priority operator would not.
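A minimal Java sketch of the stack approach described above. It is a variation, not the exact procedure from the answer: it keeps operators and finished nodes on two separate stacks. `ExprNode` and `TreeBuilder` are hypothetical names (the asker's PlusOp/LeafInt classes would take the place of `ExprNode`), only binary + - * / and brackets are handled, and the tokens are assumed to arrive as a list of strings.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Hypothetical node type standing in for classes such as PlusOp or LeafInt. */
final class ExprNode {
    final String symbol;
    final ExprNode left, right; // both null for a leaf

    ExprNode(String symbol, ExprNode left, ExprNode right) {
        this.symbol = symbol;
        this.left = left;
        this.right = right;
    }

    @Override public String toString() {
        return left == null ? symbol : symbol + "(" + left + "," + right + ")";
    }
}

final class TreeBuilder {
    private static int precedence(String op) {
        switch (op) {
            case "+": case "-": return 1;
            case "*": case "/": return 2;
            default: throw new IllegalArgumentException("Unknown operator: " + op);
        }
    }

    /** Pops one operator and two operands, then pushes the combined node. */
    private static void reduce(Deque<String> ops, Deque<ExprNode> nodes) {
        String op = ops.pop();
        if (nodes.size() < 2) throw new IllegalArgumentException("Missing operand for " + op);
        ExprNode right = nodes.pop();
        ExprNode left = nodes.pop();
        nodes.push(new ExprNode(op, left, right));
    }

    static ExprNode build(List<String> tokens) {
        Deque<String> ops = new ArrayDeque<>();
        Deque<ExprNode> nodes = new ArrayDeque<>();
        for (String t : tokens) {
            if (t.equals("(")) {
                ops.push(t);
            } else if (t.equals(")")) {
                while (!ops.isEmpty() && !ops.peek().equals("(")) reduce(ops, nodes);
                if (ops.isEmpty()) throw new IllegalArgumentException("Unmatched )");
                ops.pop(); // discard the matching "("
            } else if (t.matches("\\d+")) {
                nodes.push(new ExprNode(t, null, null));
            } else {
                // Reduce while the operator on the stack binds at least as tightly.
                while (!ops.isEmpty() && !ops.peek().equals("(")
                        && precedence(ops.peek()) >= precedence(t)) {
                    reduce(ops, nodes);
                }
                ops.push(t);
            }
        }
        while (!ops.isEmpty()) {
            if (ops.peek().equals("(")) throw new IllegalArgumentException("Unmatched (");
            reduce(ops, nodes);
        }
        if (nodes.size() != 1) throw new IllegalArgumentException("Malformed expression");
        return nodes.pop();
    }
}
```

For the tokens of (5+2)*7 this returns a tree whose toString() is *(+(5,2),7). The exceptions cover the bracket and operand-count validation the question asks about, at least in this simplified setting.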
I wanted to respond to Bill K.'s answer, but I lack the reputation to add a comment there (that's really where this answer belongs). You can think of this as an addendum to Bill K.'s answer, because his was a little incomplete. The missing consideration is operator associativity; namely, how to parse expressions like:
49 / 7 / 7
Depending on whether division is left or right associative, the answer is:
49 / (7 / 7) => 49 / 1 => 49
or
(49 / 7) / 7 => 7 / 7 => 1
Typically, division and subtraction are considered to be left associative (i.e. case two, above), while exponentiation is right associative. Thus, when you run into a series of operators with equal precedence, you want to parse them in order if they are left associative or in reverse order if right associative. This just determines whether you are pushing or popping to the stack, so it doesn't overcomplicate the given algorithm, it just adds cases for when successive operators are of equal precedence (i.e. evaluate stack if left associative, push onto stack if right associative).
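To make the associativity rule concrete, here is a small Java sketch. The class and method names (`AssocDemo`, `shouldReduce`, `parenthesise`) are made up for illustration, and it assumes space-separated input without brackets. The only place associativity matters is the tie-break in `shouldReduce`.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class AssocDemo {
    static int precedence(char op) {
        switch (op) {
            case '+': case '-': return 1;
            case '*': case '/': return 2;
            case '^': return 3;
            default: throw new IllegalArgumentException("Unknown operator: " + op);
        }
    }

    static boolean isRightAssociative(char op) {
        return op == '^';
    }

    /** True if the operator on the stack must be evaluated before the incoming one is pushed. */
    static boolean shouldReduce(char onStack, char incoming) {
        int p1 = precedence(onStack);
        int p2 = precedence(incoming);
        // Equal precedence: evaluate first if left associative, keep pushing if right associative.
        return p1 > p2 || (p1 == p2 && !isRightAssociative(incoming));
    }

    /** Fully parenthesises a space-separated expression that contains no brackets. */
    static String parenthesise(String expr) {
        Deque<String> operands = new ArrayDeque<>();
        Deque<Character> ops = new ArrayDeque<>();
        for (String t : expr.trim().split("\\s+")) {
            if (Character.isDigit(t.charAt(0))) {
                operands.push(t);
                continue;
            }
            char op = t.charAt(0);
            while (!ops.isEmpty() && shouldReduce(ops.peek(), op)) reduce(operands, ops);
            ops.push(op);
        }
        while (!ops.isEmpty()) reduce(operands, ops);
        return operands.pop();
    }

    private static void reduce(Deque<String> operands, Deque<Character> ops) {
        String right = operands.pop();
        String left = operands.pop();
        operands.push("(" + left + " " + ops.pop() + " " + right + ")");
    }
}
```

With this, parenthesise("49 / 7 / 7") gives ((49 / 7) / 7), while parenthesise("2 ^ 3 ^ 2") gives (2 ^ (3 ^ 2)).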
The "Five minute introduction to ANTLR" includes an arithmetic grammar example. It's worth checking out, especially since antlr is open source (BSD license).
Several options for you:
1. Re-use an existing expression parser. That would work if you are flexible on syntax and semantics. A good one that I recommend is the unified expression language built into Java (initially for use in JSP and JSF files).
2. Write your own parser from scratch. There is a well-defined way to write a parser that takes into account operator precedence, etc. Describing exactly how that's done is outside the scope of this answer. If you go this route, find yourself a good book on compiler design. Language parsing theory is going to be covered in the first few chapters. Typically, expression parsing is one of the examples.
3. Use JavaCC or ANTLR to generate lexer and parser. I prefer JavaCC, but to each their own. Just google "javacc samples" or "antlr samples". You will find plenty.
Between 2 and 3, I highly recommend 3 even if you have to learn new technology. There is a reason that parser generators have been created.
Also note that creating a parser that can handle malformed input (not just fail with a parse exception) is significantly more complicated than writing a parser that only accepts valid input. You basically have to write a grammar that spells out the various common syntax errors.
Update: Here is an example of an expression language parser that I wrote using JavaCC. The syntax is loosely based on the unified expression language. It should give you a pretty good idea of what you are up against.
Contents of org.eclipse.sapphire/plugins/org.eclipse.sapphire.modeling/src/org/eclipse/sapphire/modeling/el/parser/internal/ExpressionLanguageParser.jj
the given expression (5+2)*7 we can take as infix
Infix : (5+2)*7
Prefix : *+527
From the above we know the preorder and inorder traversals of the tree, and we can easily construct the tree from them.
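As a rough illustration in Java: when every operator is binary and every operand is a single digit (both are assumptions made here), the prefix string alone is enough to rebuild the tree recursively. `PrefixTree` is a hypothetical class, not one of the asker's node classes.

```java
final class PrefixTree {
    final char symbol;
    final PrefixTree left, right; // both null for a digit leaf

    private PrefixTree(char symbol, PrefixTree left, PrefixTree right) {
        this.symbol = symbol;
        this.left = left;
        this.right = right;
    }

    /** Builds the tree from a prefix string such as "*+527". */
    static PrefixTree parse(String prefix) {
        int[] pos = {0}; // shared cursor into the string
        PrefixTree root = parse(prefix, pos);
        if (pos[0] != prefix.length()) throw new IllegalArgumentException("Trailing input");
        return root;
    }

    private static PrefixTree parse(String s, int[] pos) {
        if (pos[0] >= s.length()) throw new IllegalArgumentException("Unexpected end of input");
        char c = s.charAt(pos[0]++);
        if (Character.isDigit(c)) return new PrefixTree(c, null, null);
        // An operator is followed by its left subtree, then its right subtree.
        PrefixTree left = parse(s, pos);
        PrefixTree right = parse(s, pos);
        return new PrefixTree(c, left, right);
    }

    /** In-order rendering with explicit brackets. */
    String toInfix() {
        return left == null
                ? String.valueOf(symbol)
                : "(" + left.toInfix() + symbol + right.toInfix() + ")";
    }
}
```

PrefixTree.parse("*+527").toInfix() yields ((5+2)*7). Note that this still needs the prefix form as input, so it does not replace parsing the infix expression in the first place.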
Thanks,
I'm building a project whose main objective is to find a given number (if possible, otherwise closest possible) using 6 given numbers and main operators (+, -, *, /). Idea is to randomly generate expressions, using the numbers given and the operators, in reverse polish (postfix) notation, because I found it the easiest to generate and compute later. Those expressions are Individuals in Population of my Genetic Algorithm. Those expressions have the form of an ArrayList of Strings in Java, where Strings are both the operators and operands (the numbers given).
The main question here is, what would be the best method to crossover these individuals (postfix expressions actually)? Right now I'm thinking about crossing expressions that are made out of all the six operands that are given (and 5 operators with them). Later I'll probably also cross the expressions that would be made out of less operands (5, 4, 3, 2 and also only 1), but I guess that I should figure this out first, as the most complex case (if you think it might be a better idea to start differently, I'm open to any suggestions). So, the thing is that every expression is made from all the operands given, and also the child expression should have all the operands included, too. I understand that this requires some sort of ordered crossover (often used in problems like TSP), and I read a lot about it (for example here where multiple methods are described), but I didn't quite figure out which one would be best in my case (I'm also aware that in Genetic Algorithms there is a lot of 'trial and error' process, but I'm talking about something else here).
What is bothering me are the operators. If I had only a list of operands, it wouldn't be a problem to cross two such lists, for example by taking a random subarray of half the elements from parent 1 and filling the rest with the remaining elements from parent 2, keeping their original order. But here, if I take, say, the first half of an expression from the first parent, I would definitely have to fill the child expression with the remaining operands, but what should I do with the operators? Take them from parent 2 like the remaining operands (but then I would have to watch out, because a postfix expression is only valid if at every point there is at least one more operand than operators, and checking that all the time might be time consuming, or not?), or maybe I could generate random operators for the rest of the child expression (but that wouldn't be a pure crossover then, would it)?
When talking about crossover, there is also mutation, but I guess I have that worked out. I can take an expression and perform a mutation where I'll just switch 2 operands, or take an expression and randomly change 1 or more operators. For that, I have some ideas, but the crossover is what really bothers me.
So, that pretty much sums up my problem. Like I said, the main question is how to crossover, but if you have any other suggestions or questions about the program (maybe an easier representation of expressions, other than a list of strings, which may be easier/faster to crossover, maybe something I didn't mention here, maybe even a whole new approach to the problem?), I'd love to hear them. I didn't give any code here because I don't think it's needed to answer this question, but if you think it would help, I'll definitely edit it in. One more time, the main question is how to crossover, this specific part of the problem (an idea or pseudocode is expected, although the code itself would be great, too :D), but if you think that I need to change something more, or you know some other solution to my whole problem, feel free to say.
Thanks in advance!
There are two approaches that come to mind:
Approach #1
Encode each genome as a fixed length expression where odd indices are numbers and even indices are the operators. For mutation, you could slightly change the numbers and/or change the operators.
Pros:
Very simple to code
Cons:
Would have to create an infix parser
Fixed length expressions
Approach #2
Encode each genome as a syntax tree. For instance, 4 + 3 / 2 - 1 is equivalent to Add(4, Subtract(Divide(3, 2), 1)) which looks like:
    _____+_____
   |           |
   4       ____-____
          |         |
        __/__       1
       |     |
       3     2
Then when crossing over, pick a random node from each tree and swap them. For mutation, you could add, remove, and/or modify random nodes.
Pros:
Might find better results
Variable length expressions
Cons:
Adds time complexity
Adds programming complexity
Here is an example of the second approach:
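A minimal Java sketch of the subtree swap, under some assumptions: `GpNode` and `SubtreeCrossover` are hypothetical names, every operator is binary, and the parents are copied so that they stay intact. Be aware that a plain subtree swap does not preserve the "each given number used exactly once" constraint from the question; that would need a repair step or a restricted choice of swap points.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical mutable syntax-tree node: a number leaf or a binary operator. */
final class GpNode {
    String symbol;
    GpNode left, right; // both null for a leaf

    GpNode(String symbol, GpNode left, GpNode right) {
        this.symbol = symbol;
        this.left = left;
        this.right = right;
    }

    GpNode copy() {
        return new GpNode(symbol,
                left == null ? null : left.copy(),
                right == null ? null : right.copy());
    }

    /** Adds this node and all of its descendants to the list. */
    void collect(List<GpNode> out) {
        out.add(this);
        if (left != null) {
            left.collect(out);
            right.collect(out);
        }
    }

    @Override public String toString() {
        return left == null ? symbol : symbol + "(" + left + "," + right + ")";
    }
}

final class SubtreeCrossover {
    /** Returns two children made by swapping one random subtree between copies of the parents. */
    static GpNode[] crossover(GpNode parentA, GpNode parentB, Random rng) {
        GpNode childA = parentA.copy();
        GpNode childB = parentB.copy();
        List<GpNode> nodesA = new ArrayList<>();
        List<GpNode> nodesB = new ArrayList<>();
        childA.collect(nodesA);
        childB.collect(nodesB);
        GpNode x = nodesA.get(rng.nextInt(nodesA.size()));
        GpNode y = nodesB.get(rng.nextInt(nodesB.size()));
        // Swap the node contents in place, so no parent pointers are needed.
        String symbol = x.symbol;
        GpNode left = x.left;
        GpNode right = x.right;
        x.symbol = y.symbol; x.left = y.left; x.right = y.right;
        y.symbol = symbol;   y.left = left;   y.right = right;
        return new GpNode[] {childA, childB};
    }
}
```

Passing a seeded Random makes runs reproducible, which helps when debugging the genetic algorithm.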
I have an ANTLR expression parser which can evaluate expressions of the form ( A & ( B | C ) ) using the generated visitor. A, B and C can each take either of the two values true or false. However, I am faced with the challenge of finding all combinations of A, B and C for which the expression is true. I tried to solve this with the following method.
Evaluate the expression for the 3 variables taking true and false each
This comes to 8 combinations since 2 ^ 3 is 8
I evaluate giving values like 000, 001, 010 ....... 111 to the variables and evaluate using the visitor
Though this works, the method becomes compute-intensive as the number of variables increases: an expression with 20 variables requires 1048576 evaluations. How can I reduce this complexity so that I get all the assignments for which the expression is true? I believe this falls under the Boolean satisfiability problem.
It does. If you are limited to 20-30 variables, you can simply brute force a trial of all the combinations. If it takes 100ns per try (that's 500 machine instructions), it will run in about 100 seconds for 30 variables. That's faster than you.
If you want to solve much bigger equations than that, you need to build a real constraint solver.
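For the brute-force route, one common way to walk through all assignments is to count an integer from 0 to 2^n - 1 and read each bit as one variable. The sketch below does this in Java for the formula A & (B | C) from the question. The class name `TruthTable` and the bit layout (A is the highest bit) are choices made for this example only.

```java
import java.util.ArrayList;
import java.util.List;

final class TruthTable {
    /** The formula under test, hand-coded as a native expression: A & (B | C). */
    static boolean formula(int bits) {
        boolean a = (bits & 4) != 0;
        boolean b = (bits & 2) != 0;
        boolean c = (bits & 1) != 0;
        return a & (b | c);
    }

    /** Returns every satisfying assignment as a bit string, e.g. "101" for A=1, B=0, C=1. */
    static List<String> satisfying(int variableCount) {
        List<String> result = new ArrayList<>();
        for (int bits = 0; bits < (1 << variableCount); bits++) {
            if (formula(bits)) {
                String binary = Integer.toBinaryString(bits);
                // Left-pad with zeros so every row has variableCount digits.
                result.add(String.format("%" + variableCount + "s", binary).replace(' ', '0'));
            }
        }
        return result;
    }
}
```

TruthTable.satisfying(3) returns [101, 110, 111]. The int counter limits this particular sketch to 30 variables; a long would be needed beyond that.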
EDIT due to OP remark about trying to go parallel to speed up a Java program that brute forces the answer:
I don't know how you represent your boolean formula. For brute force, you don't want to interpret a formula tree or do something else which is slow.
A key trick is to make evaluation of the boolean formula fast. For most programming languages, you should be able to code the formula to test as a native expression in that language by hand, wrap it in N nested loops and compile the whole thing, e.g., in Java:
boolean A = false, B, C;
do {
    B = false;
    do {
        C = false;
        do {
            if (A & (B | C)) { System.out.printf(" %b %b %b%n", A, B, C); }
            C = !C;
        } while (C);   // stop once C has flipped back to false
        B = !B;
    } while (B);
    A = !A;
} while (A);
Compiled (or even JITted by Java), I'd expect the inner loop to take 1-2 machine instructions per boolean operation, touching only registers or a single cache line, plus 2 for the loop.
For 20 variables that's around 42 machine instructions, even better than my rough estimate in the first paragraph.
If one insists, one could convert the outermost loops (3 or 4) into parallel threads, but if all you want are the print statements I don't see how that will actually matter in terms of utility.
If you have many of these formulas, it is easy to write a code generator to produce this from whatever representation you have of the formula (e.g., ANTLR parse tree).
One of my midterm review questions asks to parse this tree in different ways - pre/postfix etc. It asks these two ways as well though: In "Infix, Java precedence rules" and in "Infix, left-to-right precedence"
What is the difference between Java precedence rule and plain left-to-right infix rule? I thought if it was as Java precedence, something like "newline" may be needed like the actual java code but I really don't see what's really asked here. Thanks for your help in advance
Another question. How would you regard d and e nodes?
If it was postfix, (d e) f h * - would be appropriate for that portion of tree?
I think left-to-right precedence simply means that you just apply all the infix operators from left to right, so that
2 * 3 + 4 * 5
is interpreted as
((2 * 3) + 4) * 5 = 50
In Java and every other programming language I know of except APL, however, * is given higher precedence than + or -, which means the expression is interpreted as
(2 * 3) + (4 * 5) = 26
(Java has a lot more operators, so the order of precedence is pretty complicated. But if you're only going to see +, -, *, and /, all you need to know is that * and / have higher precedence; and that for operators with the same precedence, they're evaluated left to right.)
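To see the two rules side by side, here is a tiny Java sketch. `LeftToRight.eval` is a made-up helper that applies each operator strictly in the order it appears; it assumes space-separated integer tokens and no brackets.

```java
final class LeftToRight {
    /** Evaluates space-separated tokens strictly left to right, ignoring precedence. */
    static int eval(String expr) {
        String[] t = expr.trim().split("\\s+");
        int acc = Integer.parseInt(t[0]);
        for (int i = 1; i + 1 < t.length; i += 2) {
            int rhs = Integer.parseInt(t[i + 1]);
            switch (t[i]) {
                case "+": acc += rhs; break;
                case "-": acc -= rhs; break;
                case "*": acc *= rhs; break;
                case "/": acc /= rhs; break;
                default: throw new IllegalArgumentException("Unknown operator: " + t[i]);
            }
        }
        return acc;
    }
}
```

LeftToRight.eval("2 * 3 + 4 * 5") returns 50, whereas the Java expression 2 * 3 + 4 * 5 itself evaluates to 26.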
I'm guessing that the assignment is asking you how the tree would be represented using the two different precedence rules. Of course, you could put parentheses around everything, which means the precedence rules wouldn't apply at all:
(foo (a, (b + c) * ((d ? e) - (f * h)), (j * k))) - g
[The ? is there because there seems to be a box missing from the diagram.] So you're probably supposed to write it in a way without unnecessary parentheses, which means you need to know the precedence rules.
To answer your last question about d and e: you should ask your instructor, because I'm guessing it means "misprint". Unless they've come up with some new kinds of syntax tree diagrams since I studied this, it looks like a box is missing.
I have successfully implemented the shunting-yard algorithm in Java. The algorithm itself was simple; however, I am having trouble with the tokenizer. Currently the algorithm works with everything I want except one thing: how can I tell the difference between subtraction (-) and negation (-)?
such as 4-3 is subtraction
but -4+3 is negative
I now know how to find out when it should be a negative and when it should be a minus, but where in the algorithm should it be placed? If you treat it like a function, it won't always work. For example, in
3 + 4 * 2 / -( 1 - 5 ) ^ 2 ^ 3
when 1-5 becomes -4, it will be negated to 4 before it gets squared and cubed,
just like in
3 + 4 * 2 / cos( 1 - 5 ) ^ 2 ^ 3, where you would take the cosine before squaring and cubing.
But in real math you wouldn't do that with a -, because what you're really saying is
3 + 4 * 2 / -(( 1 - 5 ) ^ 2 ^ 3) in order to have the right value.
It sounds like you're doing a lex-then-parse style parser, where you're going to need a simple state machine in the lexer in order to get separate tokens for unary and binary minus. (In a PEG parser, this isn't something you have to worry about.)
In JavaCC, you would have a DEFAULT state, where you would consider the - character to be UNARY_MINUS. When you tokenized the end of a primary expression (either a closing paren, or an integer, based on the examples you gave), then you would switch to the INFIX state where - would be considered to be INFIX_MINUS. Once you encountered any infix operator, you would return to the DEFAULT state.
If you're rolling your own, it might be a bit simpler than that. Look at this Python code for a clever way of doing it. Basically, when you encounter a -, you just check to see if the previous token was an infix operator. That example uses the string "-u" to represent the unary minus token, which is convenient for an informal tokenization. As best I can tell, the Python example fails to handle the case where a - follows an open paren or comes at the beginning of the input. Those should be considered unary as well.
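Here is a rough Java version of that check. This is my own sketch, not a port of the linked Python code. It also treats a - at the start of the input or after an open paren as unary. `MinusTokenizer` is a hypothetical name, and only integers, the operators + - * / ^ and parentheses are recognised.

```java
import java.util.ArrayList;
import java.util.List;

final class MinusTokenizer {
    /** Splits the input into tokens, emitting "-u" for a unary minus. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            if (Character.isDigit(c)) {
                int start = i;
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                tokens.add(input.substring(start, i));
                continue;
            }
            if ("+-*/^()".indexOf(c) < 0) {
                throw new IllegalArgumentException("Unexpected character: " + c);
            }
            String prev = tokens.isEmpty() ? null : tokens.get(tokens.size() - 1);
            // A minus is binary only when it follows an operand or a closing paren.
            boolean afterOperand = prev != null
                    && (prev.equals(")") || Character.isDigit(prev.charAt(0)));
            tokens.add(c == '-' && !afterOperand ? "-u" : String.valueOf(c));
            i++;
        }
        return tokens;
    }
}
```

For example, tokenize("-4+3") gives [-u, 4, +, 3], while tokenize("4-3") gives [4, -, 3].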
In order for unary minus to be handled correctly in the shunting-yard algorithm itself, it needs to have higher precedence than any of the infix operators, and it needs to be marked as right-associative. (Make sure you handle right-associativity. You may have left it out since the rest of your operators are left-associative.) This is clear enough in the Python code (although I would use some kind of struct rather than two separate maps).
When it comes time to evaluate, you will need to handle unary operators a little differently, since you only need to pop one number off the stack, rather than two. Depending on what your implementation looks like, it may be easier to just go through the list and replace every occurrence of "-u" with [-1, "*"].
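A minimal sketch of the evaluation step in Java, assuming the RPN output is a list of strings in which "-u" marks the unary minus (as in the tokenization described earlier). `RpnEval` is a hypothetical name, and there is no error handling for malformed input.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

final class RpnEval {
    static double eval(List<String> rpn) {
        Deque<Double> stack = new ArrayDeque<>();
        for (String t : rpn) {
            if (t.equals("-u")) {
                // Unary operator: pop a single operand.
                stack.push(-stack.pop());
            } else if (t.length() == 1 && "+-*/^".indexOf(t.charAt(0)) >= 0) {
                // Binary operator: the right operand is on top of the stack.
                double right = stack.pop();
                double left = stack.pop();
                switch (t.charAt(0)) {
                    case '+': stack.push(left + right); break;
                    case '-': stack.push(left - right); break;
                    case '*': stack.push(left * right); break;
                    case '/': stack.push(left / right); break;
                    default:  stack.push(Math.pow(left, right)); break; // '^'
                }
            } else {
                stack.push(Double.parseDouble(t));
            }
        }
        return stack.pop();
    }
}
```

For the RPN form of -4+3, that is [4, -u, 3, +], this returns -1.0.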
If you can follow Python at all, you should be able to see everything I'm talking about in the example I linked to. I find the code to be a bit easier to read than the C version that someone else mentioned. Also, if you're curious, I did a little write-up a while back about using shunting-yard in Ruby, but I handled unary operators as a separate nonterminal, so they are not shown.
The answers to this question might be helpful.
In particular, one of those answers references a solution in C that handles unary minus.
Basically, you have to recognize a unary minus based on the appearance of the minus sign in positions where a binary operator can't be, and make a different token for it, as it has different precedence.
Dijkstra's original paper doesn't explain too clearly how he dealt with this, but the unary minus was listed as a separate operator.
This isn't in Java, but here is a library I wrote to specifically solve this problem after searching and not finding any clear answers.
This does all you want and more:
https://marginalhacks.com/Hacks/libExpr.rb/
It is a ruby library (as well as a testbench to check it) that runs a modified shunting yard algorithm that also supports unary ('-a') and ternary ('a?b:c') ops. It also does RPN, Prefix and AST (abstract syntax trees) - your choice, and can evaluate the expression, including the ability to yield to a block (a lambda of sorts) that can handle any variable evaluation. Only AST does the full set of operations, including the ability to handle short-circuit operations (such as '||' and '?:' and so on), but RPN does support unary. It also has a flexible precedence model that includes presets for precedence as done by C expressions or by Ruby expressions (not the same). The testbench itself is interesting as it can create random expressions which it can then eval() and also run through libExpr to compare results.
It's fairly documented/commented, so it shouldn't be too hard to convert the ideas to Java or some other language.
The basic idea as far as unary operators is that you can recognize them based on the previous token. If the previous token is either an operator or a left-paren, then the "unary-possible" operators (+ and -) are just unary and can be pushed with only one operand. It's important that your RPN stack distinguishes between the unary operator and the binary operator so it knows what to do on evaluation.
In your lexer, you can implement this pseudo-logic:
if (symbol == '-') {
    if (previousToken is a number
        OR previousToken is an identifier
        OR previousToken is a function
        OR previousToken is a right parenthesis) {
        currentToken = SUBTRACT;
    } else {
        currentToken = NEGATION;
    }
}
You can set up negation to have a precedence higher than multiply and divide, but lower than exponentiation. You can also set it up to be right associative (just like '^').
Then you just need to integrate the precedence and associativity into the algorithm as described on Wikipedia's page.
If the token is an operator, o1, then: while there is an operator
token, o2, at the top of the stack, and either o1 is left-associative
and its precedence is less than or equal to that of o2, or o1 has
precedence less than that of o2, pop o2 off the stack, onto the output
queue; push o1 onto the stack.
I ended up implementing this corresponding code:
} else if (nextToken instanceof Operator) {
final Operator o1 = (Operator) nextToken;
while (!stack.isEmpty() && stack.peek() instanceof Operator) {
final Operator o2 = (Operator) stack.peek();
if ((o1.associativity == Associativity.LEFT && o1.precedence <= o2.precedence)
|| (o1.associativity == Associativity.RIGHT && o1.precedence < o2.precedence)) {
popStackTopToOutput();
} else {
break;
}
}
stack.push(nextToken);
}
Austin Taylor is quite right that you only need to pop off one number for a unary operator:
if (token is operator negate) {
operand = pop;
push operand * -1;
}
Example project:
https://github.com/Digipom/Calculator-for-Android/
Further reading:
http://en.wikipedia.org/wiki/Shunting-yard_algorithm
http://sankuru.biz/blog/1-parsing-object-oriented-expressions-with-dijkstras-shunting-yard-algorithm
I know it's an old post, but maybe someone will find it useful.
I implemented this algorithm before, starting with a tokenizer built on the StreamTokenizer class, and it works fine. In Java's StreamTokenizer, some characters have a specific meaning. For example: ( is an operator, sin is a word, ...
For your question, there is a method called StreamTokenizer.ordinaryChar(..), which specifies that the character argument is "ordinary" in this tokenizer. It removes any special significance the character has as a comment character, word component, string delimiter, white space, or number character. Source here
So you can define - as an ordinary character, which means it won't be considered the sign of a number. For example, for the expression 2-3 you will get [2,-,3], but if you don't specify it as ordinary, you will get [2,-3].
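A short Java sketch of this, using only the standard java.io.StreamTokenizer API. The wrapper class `MinusAsOrdinary` and its `tokens` method are made up for the demonstration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class MinusAsOrdinary {
    /** Tokenizes expr, optionally declaring '-' an ordinary character first. */
    static List<String> tokens(String expr, boolean minusIsOrdinary) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(expr));
        if (minusIsOrdinary) {
            st.ordinaryChar('-'); // '-' no longer starts a number
        }
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    out.add(String.valueOf(st.nval)); // numbers are parsed as double
                } else if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                } else {
                    out.add(String.valueOf((char) st.ttype)); // single ordinary character
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

tokens("2-3", false) returns [2.0, -3.0], and tokens("2-3", true) returns [2.0, -, 3.0]. Note that / is a comment character by default, so a calculator would probably want ordinaryChar('/') as well.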
Is there any way to interpret "normal" mathematical notation into Reverse Polish Notation(RPN)..?
eg
1) 2 + 3*4 - 1 = 234*+1-
2) 5 (4-8) = 548-
You can assume that the BODMAS rule is followed and that inner brackets have to be calculated first, etc. I mean that normal maths is to be applied here. The answer should be in postfix notation.
Thanks
Yes; the shunting yard algorithm defines how to do this.
Each time you read a number, put it onto the output queue. Each time you read an operator, put it on the operator stack. These two structures form the basis of the algorithm.
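A compact Java sketch of the algorithm, under some assumptions: only non-negative integers, the left-associative operators + - * / and parentheses are supported, and `ShuntingYard.toRpn` is a name chosen for this example. Compared with the short description above, it includes the step where operators of higher or equal precedence are popped to the output before a new operator is pushed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ShuntingYard {
    private static int precedence(String op) {
        return (op.equals("*") || op.equals("/")) ? 2 : 1;
    }

    /** Converts an infix expression to a list of RPN tokens. */
    static List<String> toRpn(String infix) {
        List<String> output = new ArrayList<>();
        Deque<String> ops = new ArrayDeque<>();
        // Characters that match neither alternative (such as spaces) are skipped.
        Matcher m = Pattern.compile("\\d+|[-+*/()]").matcher(infix);
        while (m.find()) {
            String t = m.group();
            if (Character.isDigit(t.charAt(0))) {
                output.add(t);
            } else if (t.equals("(")) {
                ops.push(t);
            } else if (t.equals(")")) {
                while (!ops.isEmpty() && !ops.peek().equals("(")) output.add(ops.pop());
                if (ops.isEmpty()) throw new IllegalArgumentException("Unmatched )");
                ops.pop(); // discard the "("
            } else {
                // Pop operators that bind at least as tightly (all are left-associative here).
                while (!ops.isEmpty() && !ops.peek().equals("(")
                        && precedence(ops.peek()) >= precedence(t)) {
                    output.add(ops.pop());
                }
                ops.push(t);
            }
        }
        while (!ops.isEmpty()) {
            String op = ops.pop();
            if (op.equals("(")) throw new IllegalArgumentException("Unmatched (");
            output.add(op);
        }
        return output;
    }
}
```

String.join("", ShuntingYard.toRpn("2 + 3*4 - 1")) gives 234*+1-, matching the first example in the question.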
So-called "normal" is rigorously called infix notation. There are also prefix and postfix notations, the latter being RPN.
The typical rearrangement of notation is done by constructing a parse tree and traversing specifically for the arrangement needed.
Here are some descriptions of how to do it: a b