I'm building an ANTLR parser for a small query language. The query language is by definition ambiguous, and we need all possible interpretations (ASTs) to process the query.
Example:
query : CLASSIFIED_TOKEN UNCLASSIFIED_TOKEN
| ANY_TOKEN UNCLASSIFIED_TOKEN
;
In this case, if input matches both rules, I need to get 2 ASTs with both interpretations. ANTLR will return the first matched AST.
Do you know a simple way to get all possible ASTs for the same grammar? I'm thinking about running the parser multiple times, "turning off" already matched rules between iterations, but this seems dirty. Is there a better approach? Maybe another lexer/parser tool with Java support can do this?
Thanks
If I were you, I'd remove the ambiguities. You can often do that by using contextual information to determine which grammar rules actually trigger. For instance, in
C* X;
in C (not your language, but this is just to make a point), you can't tell if this is just a pointless multiplication (legal to write in C), or a declaration of a variable X of type "pointer to C". So, there are two valid (ambiguous) parses. But if you know that C is a type declaration (from some context, perhaps an earlier code declaration), you can hack the parser to kill off the inappropriate choices and end up with just the one "correct" parse, no ambiguities.
If you really don't have the context, then you likely need a GLR parser, which will happily generate both parses in your final tree. I don't know of any available for Java.
Our DMS Software Reengineering Toolkit [not a Java-based product] has GLR parsing support, and we use that all the time to parse difficult languages with ambiguities. The way we handle the C example above is to produce both parses, because the GLR parser is happy to do this, and then, if we have additional information (such as a symbol table), post-process the tree to remove the inappropriate parses.
DMS is designed to support the customized analysis and transformation of arbitrary languages, such as your query language, and makes it easy to define the grammar. Once you have a context-free grammar (ambiguities or not), DMS can parse code and you can decide what to do later.
I doubt you're going to get ANTLR to return multiple parse trees without wholesale rewriting of the code.
I believe you're going to have to partition the ambiguities, each into its own unambiguous grammar and run the parse multiple times. If the total number of ambiguous productions is large you could have an unmanageable set of distinct grammars. For example, for three binary ambiguities (two choices) you'll end up with 8 distinct grammars, though there might be slightly fewer if one ambiguous branch eliminates one or more of the other ambiguities.
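To get a feel for how quickly the number of grammar variants grows, here is a minimal sketch that enumerates every combination of choices. The class name `GrammarVariants` and the idea of describing each ambiguity by its number of alternatives are my own assumptions for illustration; nothing here is part of ANTLR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not an ANTLR API): lists which alternative to keep
// for each ambiguous rule, one int[] per unambiguous grammar variant.
final class GrammarVariants {

    /** choiceCounts[i] is the number of alternatives of ambiguity i. */
    static List<int[]> enumerate(int[] choiceCounts) {
        List<int[]> result = new ArrayList<>();
        fill(choiceCounts, 0, new int[choiceCounts.length], result);
        return result;
    }

    // Depth-first walk: fix a choice for ambiguity `index`, then recurse.
    private static void fill(int[] counts, int index, int[] current, List<int[]> out) {
        if (index == counts.length) {
            out.add(current.clone());
            return;
        }
        for (int choice = 0; choice < counts[index]; choice++) {
            current[index] = choice;
            fill(counts, index + 1, current, out);
        }
    }

    public static void main(String[] args) {
        // Three binary ambiguities give 2 * 2 * 2 = 8 variants.
        for (int[] variant : enumerate(new int[] {2, 2, 2})) {
            System.out.println(Arrays.toString(variant));
        }
    }
}
```

Each printed array would correspond to one grammar to generate and one parse to run; variants where one choice makes another ambiguity unreachable could be dropped afterwards.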
Good luck
I want to write a compiler as a personal project and am in the process of reading and understanding parsers (LL(k), LR(k), SLR etc.)
All of these parsers are based on some grammar which comes from the user, and this grammar is generally written in a text file (e.g., in ANTLR it comes in a .g4 file, which is a text file IMO). If I want my parser to create its parsing tables from such a grammar file, what is the best way to parse it and represent the productions in code?
EDIT:
For example, let's say I have this grammar:
S -> 'a'|'b'|'('S')'|T
T -> '*'S
I was thinking of parsing this given grammar and storing it as an ArrayList<ArrayList<String>>. This way every item in the ArrayList will be a collection of productions from the same non-terminal:
// with this type of a representation, I can assign an id to each production
//For example, production S -> 'a' has id 01 or T -> '*'S has an id of 10 and so on
{
{"S", "'a'", "'b'", "'('S')'", "T"},
{"T", "'*'S"}
}
I am not sure about representing the grammar as an AST, because then I don't know how to assign ids to each production. But the above representation of a grammar seems a pretty naive design to me, and I suspect there is some standard way of doing this which will be easier to work with.
I tried using the Javalang module available in Python to get the AST of Java source code, but it requires an entire class to generate the AST. Passing a block of code like an 'if' statement throws an error. Is there any other way of doing it?
PS: I am preferably looking for a Python module to do the task.
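One possible alternative to nested string lists, sketched below under some assumptions: each production becomes a small object carrying its own id, its left-hand side, and its right-hand side as a list of symbols. The names `GrammarModel` and `Production` are invented for this example, it needs Java 16+ for records, and it assumes symbols are separated by whitespace (so `'(' S ')'` rather than `'('S')'`) and that `|` never appears as a quoted terminal.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative grammar model; not taken from any parser generator.
final class GrammarModel {

    /** One production: sequential id, left-hand nonterminal, right-hand symbols. */
    record Production(int id, String lhs, List<String> rhs) {}

    private final List<Production> productions = new ArrayList<>();
    private final Map<String, List<Production>> byLhs = new LinkedHashMap<>();

    /** Parses lines such as "S -> 'a' | '(' S ')' | T". */
    static GrammarModel parse(List<String> lines) {
        GrammarModel g = new GrammarModel();
        for (String line : lines) {
            String[] sides = line.split("->", 2);
            String lhs = sides[0].trim();
            for (String alternative : sides[1].split("\\|")) {
                List<String> rhs = List.of(alternative.trim().split("\\s+"));
                // The id is simply the position in the overall production list.
                Production p = new Production(g.productions.size(), lhs, rhs);
                g.productions.add(p);
                g.byLhs.computeIfAbsent(lhs, k -> new ArrayList<>()).add(p);
            }
        }
        return g;
    }

    /** All productions in id order. */
    List<Production> all() { return productions; }

    /** The productions whose left-hand side is the given nonterminal. */
    List<Production> of(String nonterminal) {
        return byLhs.getOrDefault(nonterminal, List.of());
    }

    public static void main(String[] args) {
        GrammarModel g = parse(List.of("S -> 'a' | 'b' | '(' S ')' | T", "T -> '*' S"));
        g.all().forEach(System.out::println);
    }
}
```

With this shape, table construction can refer to a production by its integer id, and looking up all alternatives of a nonterminal is a single map access.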
Thanks
Javalang can parse snippets of Java code:
>>> import javalang
>>> tokens = javalang.tokenizer.tokenize('System.out.println("Hello " + "world");')
>>> parser = javalang.parser.Parser(tokens)
>>> parser.parse_expression()
MethodInvocation
OP is interested in a non-Python answer.
Our DMS Software Reengineering Toolkit with its Java Front End can accomplish this.
DMS is a general-purpose tool for parsing/analyzing/transforming code, parameterized by language definitions (including grammars). Given a language definition, DMS can easily be invoked on a source file/stream representing the goal symbol for a grammar by calling the Parse method offered by the language parameter, and DMS will build a tree for the parsed string. Special support is provided for parsing source files/streams for arbitrary nonterminals as defined by the language grammar; DMS will build an AST whose root is that nonterminal, parsing the source according to the subgrammar defined by that nonterminal.
Once you have the AST, DMS provides lots of support for visiting the AST, inspecting/modifying nodes, and carrying out source-to-source transformations on the AST using surface-syntax rewrite rules. Finally, you can prettyprint the modified AST and get back valid source code. (If you have only parsed a code fragment for a nonterminal, what you get back is valid code for that nonterminal.)
If OP is willing to compare complete files instead of snippets, our Smart Differencer might be useful out of the box. SmartDifferencer builds ASTs of its two input files, finds the smallest set of conceptual edits (insert, delete, move, copy, rename) over structured code elements that explains the differences, and reports that difference.
I want to be able to parse expressions representing physical quantities like
g/l
m/s^2
m/s/kg
m/(s*kg)
kg*m*s
°F/(lb*s^2)
and so on. In the simplest way possible. Is it possible to do so using something like Pyparsing (if such a thing exists for Java), or should I use more complex tools like Java CUP?
EDIT: To answer MrD's question, the goal is to convert between quantities, so for example convert g to kg (this one is simple...), or maybe °F/(kg*s^2) to K/(lb*h^2), supposing h stands for hour and lb for pounds.
This is harder than it looks. (I have done a fair amount of work here.) The main problem is that there is no standard (I have worked with NIST on units, and although they have finally created a markup language, few people use it). So it's really a form of natural language processing and has to deal with:
ambiguity (what does "M" mean - meters or mega)
inconsistent punctuation
abbreviations
symbols (e.g. "mu" for micro)
unclear semantics (e.g. is kg/m/s the same as kg/(m*s)?)
If you are just creating a toy system, then you should create a BNF for the system and make sure that all examples adhere to it. This will use common punctuation ("/", "*", "(", ")", "^"). Character fields can be of variable length ("m", "kg", "lb"). Algebra on these strings ("kg" -> 1000 "g") has problems, as kg is a fundamental unit.
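For the toy-system route, a hand-written recursive-descent parser is probably enough. The sketch below is only an illustration under assumptions of mine: the grammar in the comments, the class name `UnitParser`, and the decision to treat `a/b/c` as `a/(b*c)` (left-associative) are not a standard. It reduces an expression to a map from unit symbol to exponent and does not try to interpret prefixes such as the "k" in "kg".

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative parser for a toy unit grammar; the grammar itself is an assumption.
final class UnitParser {
    private final String src;
    private int pos = 0;

    private UnitParser(String src) { this.src = src.replace(" ", ""); }

    /** Returns the exponent of each unit symbol in the expression. */
    static Map<String, Integer> parse(String text) {
        UnitParser p = new UnitParser(text);
        Map<String, Integer> result = p.expr();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("Unexpected '" + p.peek() + "' at " + p.pos);
        }
        result.values().removeIf(e -> e == 0); // units that cancelled out
        return result;
    }

    // expr := factor (('*' | '/') factor)*   left-associative
    private Map<String, Integer> expr() {
        Map<String, Integer> acc = factor();
        while (pos < src.length() && (peek() == '*' || peek() == '/')) {
            int sign = src.charAt(pos++) == '*' ? 1 : -1;
            factor().forEach((unit, exp) -> acc.merge(unit, sign * exp, Integer::sum));
        }
        return acc;
    }

    // factor := atom ('^' integer)?
    private Map<String, Integer> factor() {
        Map<String, Integer> base = atom();
        if (pos < src.length() && peek() == '^') {
            pos++;
            int start = pos;
            if (pos < src.length() && peek() == '-') pos++;
            while (pos < src.length() && Character.isDigit(peek())) pos++;
            int power = Integer.parseInt(src.substring(start, pos));
            base.replaceAll((unit, exp) -> exp * power);
        }
        return base;
    }

    // atom := '(' expr ')' | unit symbol (any run of non-operator characters)
    private Map<String, Integer> atom() {
        if (pos < src.length() && peek() == '(') {
            pos++;
            Map<String, Integer> inner = expr();
            if (pos >= src.length() || peek() != ')') {
                throw new IllegalArgumentException("')' expected at " + pos);
            }
            pos++;
            return inner;
        }
        int start = pos;
        while (pos < src.length() && "*/^()".indexOf(peek()) < 0) pos++;
        if (start == pos) throw new IllegalArgumentException("Unit expected at " + pos);
        Map<String, Integer> single = new TreeMap<>();
        single.put(src.substring(start, pos), 1);
        return single;
    }

    private char peek() { return src.charAt(pos); }

    public static void main(String[] args) {
        System.out.println(parse("m/s^2"));
        System.out.println(parse("m/s/kg").equals(parse("m/(s*kg)")));
    }
}
```

Two expressions are then dimensionally comparable when their maps have the same keys after each symbol has been mapped to a base unit; that mapping, and the conversion factors, would live in a separate table.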
If you are doing it seriously then ANTLR (#Yaugen) is useful, but be aware that units in the wild will not follow a regular grammar due to the inconsistencies above.
If you are REALLY serious (i.e. prepared to put in a solid month), I'd be interested to know. :-)
My current approach (which is outside the scope of your question) is to collect a large number of examples from the literature automatically and create a number of heuristics.
I need to make a lot of operations using BigDecimal, and I found having to express
Double a = b - c * d; //natural way
as
BigDecimal a = b.subtract(c.multiply(d)); //BigDecimal way
is not only ugly, but a source of mistakes and communication problems between me and business analysts. They were perfectly able to read code with Doubles, but now they can't.
Of course, a perfect solution would be Java support for operator overloading, but since this is not going to happen, I'm looking for an Eclipse plugin or even an external tool that makes an automatic conversion from the "natural way" to the "BigDecimal way".
I'm not trying to preprocess source code or dynamic translation or any complex thing, I just want something I can input text and get text, and keep the "natural way" as a comment in source code.
P.S.: I've found this incredibly smart hack, but I don't want to start doing bytecode manipulation. Maybe I can use that to create a Natural2BigDecimal translator, but I don't want to reinvent the wheel if someone has already built such a tool.
I don't want to switch to Scala/Groovy/JavaScript, and I also can't: company rules forbid anything but Java in server-side code.
"I'm not trying to preprocess source code ... I just want something I can input [bigDecimal arithmetic expression] text".
Half of solving a problem is recognizing the problem for what it is. You exactly want something to preprocess your BigDecimal expressions to produce legal Java.
You have only two basic choices:
A stand-alone "domain specific language" and DSL compiler that accepts "standard" expressions and converts them directly to Java code. (This is one kind of preprocessor). This leaves you with the problem of keeping all the expression fragments around, and somehow knowing where to put them in the Java code.
A tool that reads the Java source text, finds such expressions, and converts them to BigDecimal in the text. I'd suggest something that lets you code the expressions outside the actual code and inserts the translation.
Perhaps (stolen from another answer):
// BigDecimal a = b - c * d;
BigDecimal a = b.subtract( c.multiply( d ) );
with the meaning "compile the big decimal expression in the comment into its Java equivalent, and replace the following statement with that translation".
To implement the second idea, you need a program transformation system, which can apply source-to-source rewriting rules to transform the code (generation is a special case of transformation). This is just a preprocessor that is organized to be customizable to your needs.
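The expression-to-text translation on its own is small. Here is a rough sketch of it, assuming only identifiers (optionally dotted), unsigned numeric literals, parentheses and the four binary operators; the class name `BigDecimalTranslator` is made up, and the choice of `MathContext.DECIMAL128` for division is an assumption you would replace with whatever rounding your business rules require. It does no name or type checking, so it cannot tell whether the operands really are BigDecimal.

```java
// Illustrative text-to-text translator; it knows nothing about Java types.
final class BigDecimalTranslator {
    private final String src;
    private int pos = 0;

    private BigDecimalTranslator(String src) { this.src = src.replaceAll("\\s+", ""); }

    /** Translates an infix expression into chained BigDecimal method calls. */
    static String translate(String expression) {
        BigDecimalTranslator t = new BigDecimalTranslator(expression);
        String out = t.sum();
        if (t.pos != t.src.length()) {
            throw new IllegalArgumentException("Unexpected input at " + t.pos);
        }
        return out;
    }

    // sum := product (('+' | '-') product)*
    private String sum() {
        String left = product();
        while (pos < src.length() && (peek() == '+' || peek() == '-')) {
            String method = src.charAt(pos++) == '+' ? "add" : "subtract";
            left = left + "." + method + "(" + product() + ")";
        }
        return left;
    }

    // product := operand (('*' | '/') operand)*
    private String product() {
        String left = operand();
        while (pos < src.length() && (peek() == '*' || peek() == '/')) {
            boolean multiply = src.charAt(pos++) == '*';
            String right = operand();
            // Assumption: divide with an explicit MathContext so that
            // non-terminating quotients do not throw ArithmeticException.
            left = multiply
                    ? left + ".multiply(" + right + ")"
                    : left + ".divide(" + right + ", MathContext.DECIMAL128)";
        }
        return left;
    }

    // operand := '(' sum ')' | identifier | unsigned number
    private String operand() {
        if (pos < src.length() && peek() == '(') {
            pos++;
            String inner = sum();
            if (pos >= src.length() || src.charAt(pos++) != ')') {
                throw new IllegalArgumentException("')' expected");
            }
            return inner;
        }
        int start = pos;
        while (pos < src.length()
                && (Character.isLetterOrDigit(peek()) || peek() == '_' || peek() == '.')) {
            pos++;
        }
        if (start == pos) throw new IllegalArgumentException("Operand expected at " + pos);
        String token = src.substring(start, pos);
        // Numeric literals are wrapped; identifiers are assumed to be BigDecimal already.
        return Character.isDigit(token.charAt(0)) ? "new BigDecimal(\"" + token + "\")" : token;
    }

    private char peek() { return src.charAt(pos); }

    public static void main(String[] args) {
        System.out.println(translate("b - c * d")); // b.subtract(c.multiply(d))
    }
}
```

A tool built around this would still have to find the marker comments in the source and splice the generated statement in after them, which is where a real Java parser starts to pay off.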
Our DMS Software Reengineering Toolkit with its Java Front End could do this. You need a full Java parser to do that transformation part; you'll want name and type resolution so that you can parse/check the proposed expression for sanity.
While I agree that the as-is Java notation is ugly, and your proposal would make it prettier, my personal opinion is this isn't worth the effort. You end up with a dependency on a complex tool (yes, DMS is complex: manipulating code isn't easy) for a rather marginal gain.
If you and your team wrote thousands of these formulas, or the writers of such formulas were Java-naive, it might make sense. In that case, I'd go further, and simply insist you write the standard expression format where you need it. You could customize the Java Front End to detect when the operand types were of decimal type, and do the rewriting for you. Then you simply run this preprocessor before every Java compilation step.
I agree, it's very cumbersome! I use proper documentation (comments before each equation) as the best "solution" to this.
// a = b - c * d;
BigDecimal a = b.subtract( c.multiply( d ) );
You might go the route of an expression evaluator. There is a decent (albeit paid) one at http://www.singularsys.com/jep. ANTLR has a rudimentary grammar that also does expression evaluation (though I am not sure how it would perform) at http://www.antlr.org/wiki/display/ANTLR3/Expression+evaluator.
Neither would give you the compile-time safety you would have with true operators. You could also write the various algorithm-based classes in something like Scala, which does support operator overloading out of the box and would interoperate seamlessly with your other Java classes.
For an application I want to parse a String with arithmetic expressions and variables. Just imagine this string:
((A + B) * C) / (D - (E * F))
So I have placeholders here and no actual integer/double values. I am searching for a library which allows me to get the first placeholder, put (via a database query for example) a value into the placeholder and proceed with the next placeholder.
So what I essentially want to do is to allow users to write a string in their domain language without knowing the actual values of the variables. So the application would provide numeric values depending on some "contextual logic" and would output the result of the calculation.
I googled and did not find any suitable library. I found ANTLR, but I think it would be very "heavyweight" for my use case. Any suggestions?
You are right that ANTLR is a bit of an overkill. However parsing arithmetic expressions in infix notation isn't that hard, see:
Operator-precedence parser
Shunting-yard algorithm
Algorithms for Parsing Arithmetic Expressions
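To give an idea of the size of such a solution, here is a minimal shunting-yard style evaluator that asks a callback for each placeholder's value as it is encountered, from left to right. The class name `PlaceholderEvaluator` and the use of `ToDoubleFunction<String>` as the lookup hook are assumptions of this sketch; it handles only variables, parentheses and the four binary operators, with no numeric literals or unary minus.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.ToDoubleFunction;

// Illustrative evaluator: two stacks, one for values and one for pending operators.
final class PlaceholderEvaluator {

    /** Evaluates an infix expression; resolver supplies each variable's value. */
    static double evaluate(String expr, ToDoubleFunction<String> resolver) {
        Deque<Double> values = new ArrayDeque<>();
        Deque<Character> ops = new ArrayDeque<>();
        int i = 0;
        while (i < expr.length()) {
            char c = expr.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < expr.length() && Character.isLetterOrDigit(expr.charAt(i))) i++;
                // The lookup (e.g. a database query) happens here.
                values.push(resolver.applyAsDouble(expr.substring(start, i)));
            } else if (c == '(') {
                ops.push(c);
                i++;
            } else if (c == ')') {
                while (!ops.isEmpty() && ops.peek() != '(') apply(values, ops.pop());
                if (ops.isEmpty()) throw new IllegalArgumentException("Unmatched ')' at " + i);
                ops.pop(); // discard the matching '('
                i++;
            } else if ("+-*/".indexOf(c) >= 0) {
                // Apply pending operators of equal or higher precedence first.
                while (!ops.isEmpty() && ops.peek() != '('
                        && precedence(ops.peek()) >= precedence(c)) {
                    apply(values, ops.pop());
                }
                ops.push(c);
                i++;
            } else {
                throw new IllegalArgumentException("Unexpected '" + c + "' at " + i);
            }
        }
        while (!ops.isEmpty()) apply(values, ops.pop());
        return values.pop();
    }

    private static int precedence(char op) {
        return (op == '*' || op == '/') ? 2 : 1;
    }

    private static void apply(Deque<Double> values, char op) {
        double right = values.pop();
        double left = values.pop();
        switch (op) {
            case '+': values.push(left + right); break;
            case '-': values.push(left - right); break;
            case '*': values.push(left * right); break;
            case '/': values.push(left / right); break;
            default: throw new IllegalArgumentException("Unmatched '" + op + "'");
        }
    }

    public static void main(String[] args) {
        java.util.Map<String, Double> db =
                java.util.Map.of("A", 1.0, "B", 2.0, "C", 4.0, "D", 10.0, "E", 2.0, "F", 2.0);
        System.out.println(evaluate("((A + B) * C) / (D - (E * F))", db::get)); // 2.0
    }
}
```

The map in `main` stands in for whatever "contextual logic" supplies the values; any function from name to number can be passed instead.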
Also, you should consider using a scripting language such as Groovy or JRuby. JDK 6 onwards also provides built-in JavaScript support through javax.script (note that the bundled Nashorn engine was removed in JDK 15). See my answer here: Creating meta language with Java.
If all you want to do is simple expressions, and you know the grammar for those expressions in advance, you don't even need a library; you can code this trivially in pure Java.
See this answer for a detailed version of how:
Is there an alternative for flex/bison that is usable on 8-bit embedded systems?
If the users are defining their own expression language, and it is always in the form of a few monadic or binary operators whose precedence they can specify, you can bend the above answer by parameterizing the parser with a list of operators at several levels of precedence.
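As a rough illustration of that parameterization, the sketch below replaces one hand-written method per precedence level with a single routine driven by a list of operator sets. Everything here is an assumption for the example: the class name `TableDrivenParser`, single-character binary operators only (no monadic ones), left associativity throughout, and a fully parenthesised string as output in place of a real tree.

```java
import java.util.List;

// Illustrative table-driven parser; levels.get(0) binds loosest.
final class TableDrivenParser {
    private final List<String> levels; // each string lists the operators of one level
    private final String src;
    private int pos = 0;

    private TableDrivenParser(List<String> levels, String src) {
        this.levels = levels;
        this.src = src.replaceAll("\\s+", "");
    }

    /** Returns a fully parenthesised form showing how operators were grouped. */
    static String parse(List<String> levels, String text) {
        TableDrivenParser p = new TableDrivenParser(levels, text);
        String tree = p.level(0);
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("Unexpected input at " + p.pos);
        }
        return tree;
    }

    // Generic rule: level(n) := level(n+1) (op_n level(n+1))*
    private String level(int n) {
        if (n == levels.size()) return operand();
        String left = level(n + 1);
        while (pos < src.length() && levels.get(n).indexOf(src.charAt(pos)) >= 0) {
            char op = src.charAt(pos++);
            left = "(" + left + " " + op + " " + level(n + 1) + ")";
        }
        return left;
    }

    // operand := '(' level(0) ')' | name or number
    private String operand() {
        if (pos < src.length() && src.charAt(pos) == '(') {
            pos++;
            String inner = level(0);
            if (pos >= src.length() || src.charAt(pos++) != ')') {
                throw new IllegalArgumentException("')' expected");
            }
            return inner;
        }
        int start = pos;
        while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
        if (start == pos) throw new IllegalArgumentException("Operand expected at " + pos);
        return src.substring(start, pos);
    }

    public static void main(String[] args) {
        // Conventional precedence: + and - bind looser than * and /.
        System.out.println(parse(List.of("+-", "*/"), "a + b * c - d"));
    }
}
```

Swapping the order of the strings in the list changes the grouping without touching the parser, which is the point of the parameterization.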
If the language can be more sophisticated, you might want to investigate metacompilers.