Simple java recursive descent parsing library with placeholders

Simple java recursive descent parsing library with placeholders - java

For an application I want to parse a String with arithmetic expressions and variables. Just imagine this string:
((A + B) * C) / (D - (E * F))
So I have placeholders here and no actual integer/double values. I am searching for a library which allows me to get the first placeholder, put (via a database query for example) a value into the placeholder and proceed with the next placeholder.
So what I essentially want to do is to allow users to write a string in their domain language without knowing the actual values of the variables. So the application would provide numeric values depending on some "contextual logic" and would output the result of the calculation.
I googled and did not find any suitable library. I found ANTLR, but I think it would be very "heavyweight" for my usecase. Any suggestions?

You are right that ANTLR is a bit of an overkill. However parsing arithmetic expressions in infix notation isn't that hard, see:
Operator-precedence parser
Shunting-yard algorithm
Algorithms for Parsing Arithmetic Expressions
Also you should consider using some scripting languages like Groovy or JRuby. Also JDK 6 onwards provides built-in JavaScript support. See my answer here: Creating meta language with Java.

If all you want to do is simple expressions, and you know the grammar for those expressions in advance, you don't even need a library; you can code this trivially in pure Java.
See this answer for a detailed version of how:
Is there an alternative for flex/bison that is usable on 8-bit embedded systems?
If the users are defining thier own expression language, if it is always in the form of a few monadic or binary operators, and they can specify the precedence, you can bend the above answer by parameterizing the parser with a list of operators at several levels of precedence.
If the language can be more sophisticated, you might want to investigate metacompilers.

Related

Formal notation for code translations by compiler

I'm designing a DSL which translates to Java source code. Are their notations which are commonly used to specify the semantics/translation of a compiler?
Example:
DSL:
a = b = c = 4
Translates into:
Integer temp0 = 4;
Integer a = temp0;
Integer b = temp0;
Integer c = temp0;
Thanks in advance,
Jeroen

Pattern matching languages can be used to formalise small tree transforms. For an example of such a DSL take a look at Nanopass framework. A more generic approach is to think of the tree transforms as a form of term rewriting.
Such transforms are formal enough, e.g., they can be certified, as in CompCert.

There are formal languages to define semantics; you can see such languages and definitions in almost any technical paper in conference proceedings on programming languages. Texts on the topic are available: https://mitpress.mit.edu/.../semantics-programming-languages You need to have some willingness to read concise mathematical notations.
As a practical matter, these semantics are not used to drive translations/compilers; this is still a research topic. See http://Fwww.andrew.cmu.edu%2Fuser%2Fasubrama%2Fdissertation.pdf To read these you typically need to have spent some time reading introductory texts such as the above.
There has been more practical work on defining translations; the most practical are program transformation systems.
With such tools, one can write, using the notations of source language (e.g., your DSL), and the notation of the target language (e.g., Java or assembler or whatever), transformation rules
of the form:
replace source_language_fragment by target_language_fragment if condition
These tools are driven by grammar for the source and target languages, and interpret the transformation rules from their readable form into AST to AST rewrites. To fully translate a complex DSL to another language typically requires hundreds of rules, but a key point is they are much more easily read than procedural code typical of hand-written translators.
Trying to follow OP's example, assuming one has grammars for the OP's "MyDSL" and for "Java" as a target, and using our DMS Software Reengineeering Toolkit's style of transformation rules:
source domain dsl;
target domain Java;
rule translate_single_assignment(t: dsl_IDENTIFIER, e: dsl_expression):
" \t = \e " -- MyDSL syntax
-> -- read as "rewrites to"
" int \JavaIdentifier\(\t\)=\e;
".
rule translate_multi_assignment(t1: dsl_IDENTIFIER, t2: dsl_IDENTIFIER, e: dsl_expression):
" \t1 = \t2 = \e " -- MyDSL syntax
-> -- read as "rewrites to"
" \>\dsl \t2 = \e \statement
int \t1;
\t1=\t2;
".
You need two rules: one for the base case of a simple assignment t=e; and one to handle the multiple assignment case. The multiple assignment case peels off the outermost assignment,
and generates code for it, and inserts the remainder of the multiple assignment back in in its original DSL form, to be reprocessed by one of the two rules.
You can see another example of this used for refactoring (source_language == target_language) at https://stackoverflow.com/questions/22094428/programmatic-refactoring-of-java-source-files/22100670#22100670

ANTLR: Multiple ASTs using the same ambiguous grammar?

I'm building an ANTLR parser for a small query language. The query language is by definition ambiguous, and we need all possible interpretations (ASTs) to process the query.
Example:
query : CLASSIFIED_TOKEN UNCLASSIFIED_TOKEN
| ANY_TOKEN UNCLASSIFIED_TOKEN
;
In this case, if input matches both rules, I need to get 2 ASTs with both interpretations. ANTLR will return the first matched AST.
Do you know a simple way to get all possible ASTs for the same grammar? I'm thinking about running parser multiple times, "turning off" already matched rules between iterations; this seems dirty. Is there a better idea? Maybe other lex/parser tool with java support that can do this?
Thanks

If I were you, I'd remove the ambiguities. You can often do that by using contextual information to determine which grammar rules actually trigger. For instance, in
C* X;
in C (not your language, but this is just to make a point), you can't tell if this is just a pointless multiplication (legal to write in C), or a declaration of a variable X of type "pointer to C". So, there are two valid (ambiguous) parses. But if you know that C is a type declaration (from some context, perhaps an earlier code declaration), you can hack the parser to kill off the inappropriate choices and end up with just the one "correct" parse, no ambiguities.
If you really don't have the context, then you likely need a GLR parser, which happily generate both parses in your final tree. I don't know of any available for Java.
Our DMS Software Reengineering Toolkit [not a Java-based product] has GLR parsing support, and we use that all the time to parse difficult languages with ambiguities. The way we handle the C example above is to produce both parses, because the GLR parser is happy to do this, and then if we have additional information (such as symbol table table), post-process the tree to remove the inappropriate parses.
DMS is designed to support the customized analysis and transformation of arbitrary languages, such as your query language, and makes it easy to define the grammar. Once you have a context-free grammar (ambiguities or not), DMS can parse code and you can decide what to do later.

I doubt you're going to get ANTLR to return multiple parse trees without wholesale rewriting of the code.
I believe you're going to have to partition the ambiguities, each into its own unambiguous grammar and run the parse multiple times. If the total number of ambiguous productions is large you could have an unmanageable set of distinct grammars. For example, for three binary ambiguities (two choices) you'll end up with 8 distinct grammars, though there might be slightly fewer if one ambiguous branch eliminates one or more of the other ambiguities.
Good luck

BigDecimal notation eclipse plugin or nice external tool

I need to make a lot of operations using BigDecimal, and I found having to express
Double a = b - c * d; //natural way
as
BigDecimal a = b.subtract(c.multiply(d))//BigDecimal way
is not only ugly, but a source of mistakes and communication problems between me and business analysts. They were perfectly able to read code with Doubles, but now they can't.
Of course a perfect solution will be java support for operator overloading, but since this not going to happen, I'm looking for an eclipse plugin or even an external tool that make an automatic conversion from "natural way" to "bigdecimal way".
I'm not trying to preprocess source code or dynamic translation or any complex thing, I just want something I can input text and get text, and keep the "natural way" as a comment in source code.
P.S.: I've found this incredible smart hack but I don't want to start doing bytecode manipulation. Maybe I can use that to create a Natural2BigDecimal translator, but I don't want to reinvent the wheel if someone has already done such a tool.
I don't want to switch to Scala/Groovy/JavaScript and I also can't, company rules forbid anything but java in server side code.

"I'm not trying to preprocess source code ... I just want something I can input [bigDecimal arithmetic expression] text".
Half of solving a problem is recognizing the problem for what it is. You exactly want something to preprocess your BigDecimal expressions to produce legal Java.
You have only two basic choices:
A stand-alone "domain specific language" and DSL compiler that accepts "standard" expressions and converts them directly to Java code. (This is one kind of preprocessor). This leaves you with the problem of keeping all the expression fragments around, and somehow knowing where to put them in the Java code.
A tool that reads the Java source text, finds such expressions, and converts them to BigDecimal in the text. I'd suggest something that let you code the expressions outside the actual code and inserted the translation.
Perhaps (stolen from another answer):
// BigDecimal a = b - c * d;
BigDecimal a = b.subtract( c.multiply( d ) );
with the meaning "compile the big decimal expression in the comment into its java equivalent, and replace the following statement with that translation.
To implement the second idea, you need a program transformation system, which can apply source-to-source rewriting rules to transforms (generate as a special case of transform) the code. This is just a preprocessor that is organized to be customizable to your needs.
Our DMS Software Reengineering Toolkit with its Java Front End could do this. You need a full Java parser to do that transformation part; you'll want name and type resolution so that you can parse/check the proposed expression for sanity.
While I agree that the as-is Java notation is ugly, and your proposal would make it prettier, my personal opinion is this isn't worth the effort. You end up with a dependency on a complex tool (yes, DMS is complex: manipulating code isn't easy) for a rather marginal gain.
If you and your team wrote thousands of these formulas, or the writers of such formulas were Java-naive it might make sense. In that case,
I'd go further, and simply insist you write the standard expression format where you need it. You could customize the Java Front End to detect when the operand types were of decimal type, and do the rewriting for you. Then you simply run this preprocessor before every Java compilation step.

I agree, it's very cumbersome! I use proper documentation (comments before each equation) as the best "solution" to this.
// a = b - c * d;
BigDecimal a = b.subtract( c.multiply( d ) )

You might go the route of an expression evaluator. There is a decent (albeit paid) one at http://www.singularsys.com/jep. Antlr has a rudimentary grammar that also does expression evaluation (tho I am not sure how it would perform) at http://www.antlr.org/wiki/display/ANTLR3/Expression+evaluator.
Neither would give you the compile-time safety you would have with true operators. You could also write the various algorithm-based classes in something like Scala, which does support operator overloading out of the box and would interoperate seamlessly with your other Java classes.

Detecting equivalent expressions

I'm currently working on a Java application where I need to implement a system for building BPF expressions. I also need to implement mechanism for detecting equivalent BPF expressions.
Building the expression is not too hard. I can build a syntax tree using the Interpreter design pattern and implement the toString for getting the BPF syntax.
However, detecting if two expressions are equivalent is much harder. A simple example would be the following:
A: src port 1024 and dst port 1024
B: dst port 1024 and src port 1024
In order to detect that A and B are equivalent I probably need to transform each expression into a "normalized" form before comparing them. This would be easy for above example, however, when working with a combination of nested AND, OR and NOT operations it's getting harder.
Does anyone know how I should best approach this problem?

One way to compare boolean expressions may be to convert both to the disjunctive normal form (DNF), and compare the DNF. Here, the variables would be Berkeley Packet Filter tokens, and the same token (e.g. port 80) appearing anywhere in either of the two expressions would need to be assigned the same variable name.
There is an interesting-looking applet at http://www.izyt.com/BooleanLogic/applet.php - sadly I can't give it a try right now due to Java problems in my browser.

I'm pretty sure detecting equivalent expressions is either an np-hard or np-complete problem, even for boolean-only expressions. Meaning that to do it perfectly, the optimal way is basically to build complete tables of all possible combinations of inputs and the results, then compare the tables.
Maybe BPF expressions are limited in some way that changes that? I don't know, so I'm assuming not.
If your problems are small, that may not be a problem. I do exactly that as part of a decision-tree designing algorithm.
Alternatively, don't try to be perfect. Allow some false negatives (cases which are equivalent, but which you won't detect).
A simple approach may be to do a variant of the normal expression-evaluation, but evaluating an alternative representation of the expression rather than the result. Impose an ordering on commutative operators. Apply some obvious simplifications during the evaluation. Replace a rich operator set with a minimal set of primitive operators - e.g. using de-morgans to eliminate OR operators.
This alternative representation forms a canonical representation for all members of a set of equivalent expressions. It should be an equivalence class in the sense that you always find the same canonical form for any member of that set. But that's only the set-theory/abstract-algebra sense of an equivalence class - it doesn't mean that all equivalent expressions are in the same equivalence class.
For efficient dictionary lookups, you can use hashes or comparisons based on that canonical representation.

I'd definitely go with syntax normalization. That is, like aix suggested, transform the booleans using DNF and reorder the abstract syntax tree such that the lexically smallest arguments are on the left-hand side. Normalize all comparisons to < and <=. Then, two equivalent expressions should have equivalent syntax trees.

Is there a Java equivalent of Python's printf hash replacement?

Specifically I am converting a python script into a java helper method. Here is a snippet (slightly modified for simplicity).
# hash of values
vals = {}
vals['a'] = 'a'
vals['b'] = 'b'
vals['1'] = 1
output = sys.stdout
file = open(filename).read()
print >>output, file % vals,
So in the file there are %(a), %(b), %(1) etc that I want substituted with the hash keys. I perused the API but couldn't find anything. Did I miss it or does something like this not exist in the Java API?

You can't do this directly without some additional templating library. I recommend StringTemplate. Very lightweight, easy to use, and very optimized and robust.

I doubt you'll find a pure Java solution that'll do exactly what you want out of the box.
With this in mind, the best answer depends on the complexity and variety of Python formatting strings that appear in your file:
If they're simple and not varied, the easiest way might be to code something up yourself.
If the opposite is true, one way to get the result you want with little work is by embedding Jython into your Java program. This will enable you to use Python's string formatting operator (%) directly. What's more, you'll be able to give it a Java Map as if it were a Python dictionary (vals in your code).

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.