ANTLR4 dynamic token type - java

The language that I am lexing requires the ability to hot-swap keywords depending on runtime configuration.
It's relatively simple to do this, so long as you are OK with embedding target-specific code in your grammar (Java; see footnote 1):
lexer grammar LanguageLexer;
tokens {
    If, Else, While // etc
}
@header {
import java.util.Map;
}
@members {
// Maps keyword text to the token type it should be lexed as.
private Map<String, Integer> keywords;
public LanguageLexer(CharStream input, Map<String, Integer> keywords) {
    this(input);
    this.keywords = keywords;
}
}
WS: [ \n\t\r]+ -> skip;
ID: [a-zA-Z]+ { if(keywords.containsKey(getText())) setType(keywords.get(getText())); };
However, I would like to remove all target-specific code from my .g4 file, as my .g4s will be used across multiple target languages for separate projects.
In a Parser, you can use a Listener to remove embedded actions and decouple the grammar from application-specific code. However, if there is a way to do this at the Lexer level (see footnote 2), I have yet to find it (hence this question).
The way to accomplish this seems to be to wrap the TokenStream pulled from the Lexer. This wrapping TokenStream would read Tokens as they were provided, and apply the transformation currently in an embedded action to any ID tokens present.
This (in theory) would not be difficult to implement; however, this feels like functionality that should be possible with just the already defined ANTLR symbols. So, the question is: is it possible to conditionally change the type of tokens passing through a TokenStream within the existing ANTLR system? If not, what is the lowest-friction way of accomplishing that task? An example using the Java library would be preferred, as that is the one I am most familiar with.
And as a sub-question: if I end up creating a TokenTransformationStream for my required targets, would it be worth suggesting adding it to the existing libraries? (I can create symbols for all current supplied targets.)
1 Yes, this will crash if you construct a Lexer with the regular constructor. In a real application, it might be worth fixing that, but for this example, it doesn't matter.
2 I feel this is an appropriate task for the lexer level for a couple reasons. The main reason is that it seems common practice to pass keywords as keyword tokens always, and then, if necessary, allow them as identifiers at the parser level (such as context-sensitive keywords). Also, other questions asking simply how to achieve this effect suggest a method basically equivalent to the above provided embedded actions solution.
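The wrapping idea described above could be sketched roughly as follows. This is a stand-alone sketch under stated assumptions, not ANTLR code: Tok, TokSource and KeywordRemappingSource are hypothetical stand-ins (Tok for a token, TokSource for the role ANTLR's TokenSource interface plays), and the token type numbers are made up. With the real runtime, the same shape would presumably be a class implementing TokenSource that delegates to the generated lexer and is handed to CommonTokenStream; that wiring is not shown here.

```java
import java.util.Map;

/** Hypothetical stand-in for a token; only type and text matter here. */
final class Tok {
    final int type;
    final String text;

    Tok(int type, String text) {
        this.type = type;
        this.text = text;
    }
}

/** Hypothetical stand-in for a token source; returns null at end of input. */
interface TokSource {
    Tok nextToken();
}

/** Wraps another source and retypes ID tokens whose text is a configured keyword. */
final class KeywordRemappingSource implements TokSource {
    private final TokSource delegate;
    private final int idType;
    private final Map<String, Integer> keywords;

    KeywordRemappingSource(TokSource delegate, int idType, Map<String, Integer> keywords) {
        this.delegate = delegate;
        this.idType = idType;
        this.keywords = keywords;
    }

    @Override
    public Tok nextToken() {
        Tok token = delegate.nextToken();
        if (token == null || token.type != idType) {
            return token; // not an identifier: pass through untouched
        }
        Integer keywordType = keywords.get(token.text);
        // Identifiers that are not configured keywords keep their ID type.
        return keywordType == null ? token : new Tok(keywordType, token.text);
    }
}
```

Because the keyword map is supplied at construction time, the grammar itself would stay free of target code; each target language would need its own small wrapper of this kind.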

This may not turn out to be the answer to the question, but it's simply too long for a comment.
I suggested lexer modes in the comments because I was focused on the "hot-swap keywords" part. I don't know why you need to change the token type, but if you use lexer modes you may not need to care about it.
The only catch is that there need to be some keywords that trigger the change of lexer mode. Basically, each lexer mode would be a sub-lexer grammar (of sorts).
RUNTIME_CFG_1 : 'runtime_cfg_1' -> mode(m_CGF_1);
...
mode m_CGF_1;
KEYWORD1 : 'key1';
...
If some keywords are the same across modes, you can also use the lexer command type()* to explicitly set the type of the token.
*By lexer command I mean one of those like mode, skip, etc.

Related

ANTLR 4 and StringTemplate 4 - using tree walker with templates

Disclaimer: I never used Java before last month, and I had never heard of ANTLR or StringTemplate before then either. For my internship this summer I was given a project using tools that nobody else at the company has ever used. Everyone "has faith in me" that I will "figure it out." Hence the huge gaps in my understanding. I love this project and I've learned a ton, so don't take this as complaining. I just want to make it work.
Right now I'm working on a pretty printer proof of concept for an old domain-specific language. My ANTLR grammar is producing nice parse trees, and I'm able to output simple StringTemplate examples like the ones in the introduction.
Say I have a simple template in my .stg file:
module(type, name, content) ::= "<type> MODULE <name>; <content>; END MODULE."
In Java I'm able to use add() to set the values for each of the template arguments:
STGroup g = new STGroupFile("example.stg");
ST st = g.getInstanceOf("module");
st.add("type", "MAIN");
st.add("name", "test");
st.add("content", "abc");
System.out.println(st.render());
// prints "MAIN MODULE test; abc; END MODULE."
How do I get ANTLR and ST to read in a text file and produce pretty-printed output?
MAIN MODULE test;
abc;
END MODULE.
Should become
MAIN MODULE test; abc; END MODULE.
For example. (That's not how I plan to format all the output, don't worry. It'll pretty print much prettier than that.)
In this answer I learned that ANTLR 4 generates walkers automatically. Assuming my ANTLR grammar is correct/well-written, how do I match up the ANTLR rules/tokens to my template arguments to generate output from an input text file?
If I missed it in the documentation somewhere, let me know. There are far fewer examples for ANTLR 4 and ST 4 than for the previous versions.
Given a parser rule
r : a b c ;
the generated parse-tree will contain a node rContext with child nodes aContext, bContext, cContext, each potentially having further child nodes, for each instance in the input stream where the rule is matched.
The walk will produce the series of listener (or visitor) calls
enterR
enterA
....
exitA
enterB
....
exitB
enterC
....
exitC
exitR
Each call contains a reference to the instance context within the parse-tree, giving access to the actual values that could be passed to ST in prefix/suffix order relative to intervening child nodes.
Where simple prefix/suffix access ordering alone is not sufficient (or undesirably complex), use one or more prior parse-tree walks to analyze the more complex nodes and annotate the node instances with the analysis products. In the final output walk, reference the analysis products for the values to pass to ST.
Depending on actual circumstances, it would not be unusual for the analysis of a node to collect values from its children, pass the lot to a template for detail expansion, formatting, etc, and store the result as a node annotation string pending output in the final output walk.
Update
To annotate parse-tree nodes, you can use ParseTreeProperty.
Where the annotation set becomes more than 'trivial', a typical option is to associate a node-type specific 'decorator' class instance with a parse-tree node/context instance largely as a better data container. Of course, the node-type specific methods can then be embedded into their corresponding decorator classes to keep concerns nicely separated.
The listener methods become something like this:
public void exitNodeB(NodeBContext ctx) {
super.exitNodeB(ctx);
NodeBDescriptor descriptor = (NodeBDescriptor) getDescriptor(ctx);
if (analysisPhase) {
descriptor.process(); // node-type specific analysis
} else {
descriptor.output(); // node-type specific output generation
}
}
The specifics of when to analyze (on enter, exit, or both) and when to output will be dependent on the particular application. Implement to suit your purposes.
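The annotate-then-output idea can be sketched without the ANTLR runtime. In this minimal sketch, Node and PrettyPrinter are hypothetical stand-ins: Node plays the part of a parse-tree context, and an IdentityHashMap plays the part of a ParseTreeProperty by attaching a string to each node instance. The format string mirrors the module template from the question rather than calling StringTemplate.

```java
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a parse-tree node (rule name, matched text, children). */
final class Node {
    final String rule;
    final String text;
    final List<Node> children;

    Node(String rule, String text, Node... children) {
        this.rule = rule;
        this.text = text;
        this.children = Arrays.asList(children);
    }
}

final class PrettyPrinter {
    // Keyed by node identity, so each annotation belongs to exactly one node instance.
    private final Map<Node, String> rendered = new IdentityHashMap<>();

    /** Post-order walk: every child is annotated before its parent's exit step runs. */
    void walk(Node node) {
        for (Node child : node.children) {
            walk(child);
        }
        exit(node);
    }

    /** Plays the role of an exitXxx listener method. */
    private void exit(Node node) {
        if (node.rule.equals("module")) {
            // Collect the children's annotations and expand them in one "template".
            String type = rendered.get(node.children.get(0));
            String name = rendered.get(node.children.get(1));
            String content = rendered.get(node.children.get(2));
            rendered.put(node, String.format("%s MODULE %s; %s; END MODULE.", type, name, content));
        } else {
            rendered.put(node, node.text); // leaf: the matched text is its own rendering
        }
    }

    String get(Node node) {
        return rendered.get(node);
    }
}
```

For a module node whose children carry the texts MAIN, test and abc, walking the tree and then calling get on the root yields the one-line form shown in the question.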

Why can't you create a 1 statement function without curly braces?

So, in most programming languages, if you are using a loop or an if, you can omit the curly braces if there is only a single statement in it. For example:
if (true)
//Single statement;
for (int i = 0; i < 10; i++)
//Single Statement
while (true)
//Single statement
However, it doesn't work for functions, example:
void myFunction()
//Single Statement
So, my question, why doesn't it work for functions?
C++ needs it to disambiguate some constructs:
void Foo::bar() const int i = 5;
Now does the const belong to bar or i ?
Because language grammar forbids you to do that.
The Java grammar defines a method as following:
MethodDeclaration:
MethodHeader MethodBody
Methodbody as:
MethodBody:
Block
;
Which means either a Block (see below) or a single semicolon
Block:
{ BlockStatements_opt }
And a block as zero or more statements within curly brackets.
However an if is defined as:
IfThenStatement:
if ( Expression ) Statement
Where no block is needed after the closing ) and therefore a single line is ok.
Why they chose to define it that way? One can only guess.
Grammar can be found here: http://docs.oracle.com/javase/specs/jls/se7/html/index.html
This is not a universal rule: in some languages you can (Python? Yes, I know that's a really contrived example :)) ), in others you cannot.
You could just as well extend your question to classes and namespaces. For example, why not:
namespace Example
class Foo : public Bar
public: std::string myMethod()
return "Oh noes!";
right? At each level, that's just a single item, so why not skip the braces everywhere?
The answer is at the same time simple and complex.
In simple terms, it's about readability. Remember that you can lay out your code as you like, since whitespace is usually discarded by the compiler:
namespace Example class Foo : public Bar public: std::string myMethod() return "Oh noes!";
Well, that starts looking unreadable. Notice that if you add the braces back
namespace Example { class Foo : public Bar { public: std::string myMethod() {return "Oh noes!";}}}
then it, strangely, becomes somewhat comprehensible.
The actual problem is not readability (who cares anyway? I'm joking, of course) but the latter: comprehension. Not only must you be able to comprehend the code; the compiler must too. And for the compiler there is no such thing as "oh, this looks like a function". The compiler must be absolutely sure that it is a function. It must also be completely sure about where it starts, where it ends, and so on. And it must do that without relying much on whitespace, since C-family languages allow you to add it in any quantity you like.
So, let's look again at the packed-up no-braces example
namespace Example class Foo : public Bar public : std::string myMethod() return "Oh noes!";
                            ^                   ^    ^^
I've marked some problematic symbols. Assuming you could define a grammar that handles it, please note how the meaning of the ":" character changes. In one place it denotes inheritance, in another it follows the access modifier for a method, and in a third it is part of a namespace qualifier. OK, the third one could be discarded if you were smart and noticed it's actually the '::' symbol, not just a ':' character.
Also, meaning of keywords can change:
namespace Example class Foo : public Bar public : std::string myMethod() return "Oh noes!";
                              ^^^^^^     ^^^^^^
In the first place, it defines the access modifier for the inherited base class; in the second place, it defines the access modifier for a method. What's more, in the first place it's not meant to be followed by a ":" and in the second place it's required to be followed by it!
So many rules, exceptions and corner cases, and we covered just 2 simple things: public and ':'. Now, imagine you are to specify the grammar for the whole language. You describe everything the way you'd like to have it. But when you gather all the rules together, at some point they may start to overlap and collide with each other. After adding the Nth rule, it may happen that your 'compiler' is unable to tell whether 'public' marks inheritance or starts a method:
namespace Example class Foo : public ::Bar public : std::string myMethod() return "Oh noes!";
                              ^^^^^^^^     ^^^^^^^^
Note that I only changed Bar to ::Bar. I only added a namespace qualifier, and now our rule of "public is followed by a colon" is trashed. As I have now added a rule that "base class names may have namespace qualifiers", I must also add more rules to cover yet more corner cases, to remove the ambiguity of the meaning of "public" and ":" in this place.
To cut a long story short: the more rules, the more problems you have. The "compiler" grows, gets slower, and eats more resources. This results in an inability to handle large code files, or in frustration when the user must wait oh-so-long for that module to compile.
But what's worse for the user is that the more complex or ambiguous the grammar, the worse the error messages are. No one wants to use a compiler that is unable to parse some code and also unable to tell you what's wrong with it.
Remember what happens in C++ when you forget a ';' in a .h file? Or when you forget a '}'? The compiler reports an error 30 or 300 lines further on. This is because the ';' and '{}' can be omitted in many places, and for those 30 or 300 lines the compiler simply does not yet know that something is wrong! Were the braces required everywhere, the point of error could be pinpointed faster.
The other way, making them optional at namespace, class, or function level, would remove the basic block-start/block-end markers and, at least:
could make the grammar ambiguous (and hence force you to add more rules)
could hurt detecting (and reporting!) errors
neither of which anyone really wants.
The C++ grammar is so complex that it might actually not be possible to omit the braces in those places at all. For Java or plain C, I think it could be possible to make a grammar/compiler that would not require them, but it would still hurt error reporting a lot, especially in C, which allows #include and macros. In early Java the impact might be smaller, as the grammar is relatively simple compared to, e.g., current C++.
Probably the simplest, fastest, easiest to implement, and probably easiest to learn grammar would .. require braces (or some other delimiters) just about everywhere. Check LISP for example. But then a large part of your work would consist of constantly writing the same required markers, which many language users simply do not like (e.g. I get nauseous when I need to work on some old code in VisualBasic with its "if then end if", yuck).
Now, if you look at a brace-less language like Python, how do they solve it? They denote the block starts/block ends by .. indentation. In this language you must indent your code properly. If you don't indent it correctly, it will not compile at all, or the loops/functions/etc. will silently get their code messed up, because the compiler will not know which part belongs to which scope. No free lunch here again.
Basically a method (function) is a collection of statements that are grouped together to perform an operation. We group the statements for reuse: if you know that a set of instructions will be used often, you create it as a separate function.
If you can perform the task in a single line of code, then why do you need to write a function?
Because the grammar of the language doesn't allow you to.
Here is the grammar for a function in C taken from the ISO/IEC 9899-1999 specification:
6.9.1 Function definitions
Syntax
1 function-definition:
declaration-specifiers declarator declaration-list_opt compound-statement
The compound-statement part is the body of a function, and a compound statement is declared as
compound-statement:
{ block-item-list_opt }
i.e. it starts and ends with braces.
An if, while or similar body can have a statement as its body.
(6.8.5) iteration-statement:
while ( expression ) statement
A statement can be one of several constructs.
statement:
labeled-statement
compound-statement
expression-statement
selection-statement
iteration-statement
jump-statement
of which only compound-statement requires the braces.
In C++ you need a compound statement to make a function body, and a compound statement is surrounded by curly braces. That does not mean the braces have to come immediately after the function header; the following will compile just fine:
int foobar()
try {
return 1;
}
catch (...){return 0;}
You can't strictly say there are no one-statement functions in C#. Anonymous methods could be counted as such. Without single-statement bodies we could not have lambda expressions in C#, and C# 3.0 as we know it wouldn't exist.
There is no reason to add that extra parsing code in the compiler because the functionality is really useless, how many one line methods have you written that are not accessors or mutators? This has been dealt with in C# via properties but not yet in Java.
So the reason is, it's unlikely to be used considering most developers discourage leaving out optional bracket blocks anyway.

parsing a Python expression from Java

I've got a bit of an interesting challenge
To the point:
I want to allow a user to enter an expression in a text field, and have that string treated as a python expression. There are a number of local variables I would like to make available to this expression.
I do have a solution, though it will be cumbersome to implement. I was thinking of keeping a Python class source file, with a function that has a single %s in it. When the user enters his expression, we simply do a string format, and then call Jython's interpreter, to spit out something we can execute. There would have to be a number of variable declaration statements in front of that expression to make sure the variables we want to expose to the user are available to his expression.
So the user would be presented with a text field, he would enter
x1 + (3.5*x2) ** x3
and we would do our interpreting process to come up with an open delegate object. We then punch the values into this object from a map, and call execute, to get the result of the expression.
Any objections to using Jython, or should I be doing something other than modifying source code? I would like to think that there is some kind of mutable object akin to C#'s Expression object, where we could do something like
PythonExpression expr = new PythonExpression(userSuppliedText)
expr.setDefaultNamespace();
expr.loadLibraries("numPy", /*other libraries?*/);
//comes from somewhere else in the flow, but effectively we get
Map<String, Double> symbolValuesByName = new HashMap<>(){{
put("x1", 3.0);
put("x2", 20.0);
put("x3", 2.0);
}};
expr.loadSymbols(symbolValuesByName);
Runnable exprDelegate = expr.compile();
//sometime later
exprDelegate.run();
but, I'm hoping for a lot, and it looks like Jython is as good as it gets. Still, modifying source files and then passing them to an interpreter seems really heavy-handed.
Does that sound like a good approach? Do you guys have any other libraries you'd suggest?
Update: NumPy does not work with Jython
I should've discovered this one on my own.
So now my question shifts: Is there any way that from a single JVM process instance (meaning, without ever having to fork) I can compile and run some Python code?
If you simply want to parse the expressions, you ought to be able to put something together with a Java parser generator.
If you want to parse, error check and evaluate the expressions, then you will need a substantial subset of the functionality of a full Python interpreter.
I'm not aware of a subset implementation.
If such a subset implementation exists, it is unclear that it would be any easier to embed / call than to use a full Python interpreter ... like Jython.
If the powers that be dictate that "thou shalt use python", then they need to pay for the extra work it is going to cause you ... and the next guy who is going to need to maintain a hybrid system across changes in requirements, and updates to the Java and Python / Jython ecosystems. Factor it into the project estimates.
The other approach would be to parse the full Python expression grammar, but limit what your evaluator can handle ... based on what is actually required, and what is implementable in your project's time-frame. Limit the types supported and the operations on the types. Limit the built-in functions supported. Etcetera.
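To give a feel for how small such a limited evaluator can be, here is a minimal sketch in plain Java. MiniExpr is a hypothetical name, and the supported subset (floating-point numbers, variable names, + - * / **, unary minus, parentheses) is an assumption chosen to cover the example expression from the question, not the full Python expression grammar. Anything outside that subset is rejected with an exception. Division by zero follows Java's floating-point rules here rather than raising as Python would.

```java
import java.util.Map;

/** Evaluates a small arithmetic subset of Python: numbers, names, + - * / **, parentheses. */
final class MiniExpr {
    private final String src;
    private final Map<String, Double> vars;
    private int pos;

    private MiniExpr(String src, Map<String, Double> vars) {
        this.src = src;
        this.vars = vars;
    }

    static double eval(String src, Map<String, Double> vars) {
        MiniExpr parser = new MiniExpr(src, vars);
        double value = parser.sum();
        parser.skipSpaces();
        if (parser.pos != src.length()) {
            throw new IllegalArgumentException("Unexpected input at offset " + parser.pos);
        }
        return value;
    }

    // sum := term (('+' | '-') term)*
    private double sum() {
        double value = term();
        while (true) {
            if (accept("+")) value += term();
            else if (accept("-")) value -= term();
            else return value;
        }
    }

    // term := unary (('*' | '/') unary)*
    private double term() {
        double value = unary();
        while (true) {
            if (accept("*")) value *= unary();
            else if (accept("/")) value /= unary();
            else return value;
        }
    }

    // unary := '-' unary | power
    private double unary() {
        return accept("-") ? -unary() : power();
    }

    // power := atom ('**' unary)?  (right-associative, binds tighter than a leading minus)
    private double power() {
        double base = atom();
        return accept("**") ? Math.pow(base, unary()) : base;
    }

    // atom := '(' sum ')' | number | name
    private double atom() {
        skipSpaces();
        if (accept("(")) {
            double value = sum();
            if (!accept(")")) throw new IllegalArgumentException("Expected ')' at offset " + pos);
            return value;
        }
        int start = pos;
        if (pos < src.length() && isNumberChar(src.charAt(pos))) {
            while (pos < src.length() && isNumberChar(src.charAt(pos))) pos++;
            return Double.parseDouble(src.substring(start, pos));
        }
        while (pos < src.length() && (Character.isLetterOrDigit(src.charAt(pos)) || src.charAt(pos) == '_')) pos++;
        Double value = vars.get(src.substring(start, pos));
        if (value == null) throw new IllegalArgumentException("Unknown name at offset " + start);
        return value;
    }

    private static boolean isNumberChar(char c) {
        return Character.isDigit(c) || c == '.';
    }

    private boolean accept(String symbol) {
        skipSpaces();
        if (src.startsWith(symbol, pos)) {
            pos += symbol.length();
            return true;
        }
        return false;
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }
}
```

With the values from the question (x1 = 3.0, x2 = 20.0, x3 = 2.0), MiniExpr.eval("x1 + (3.5*x2) ** x3", values) returns 4903.0. Supporting function calls or other types would mean extending atom(), which is where the "limit what is supported" decision lives.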
Assuming that you go down the Java calling Jython route, there is a lot of material on how to implement it here: http://www.jython.org/jythonbook/en/1.0/JythonAndJavaIntegration.html

Best sandboxed expression language for JVM

I want an expression language that runs on the JVM and includes support for
math expressions, including operator priority
string expressions, like substring, etc
supports named functions
this allows me to decorate and control exactly who and what functions can be executed.
read/write variables that are "typeless" / allow type conversion in a controlled manner.
does not allow arbitrary java scriptlets.
it should not be possible to include constructs like new Someclass()
cannot execute arbitrary static or otherwise method
does not allow any OGNL like expressions.
I only want the functions I map to be available.
support for control constructs like if this then that is for the moment optional.
must be embeddable.
This previous stackoverflow question is similar, but:
does not really answer "how" or "what" as does the above,
allows java object expressions, throwing an exception from a SecurityManager to stop method execution, which is nasty and wrong.
java object like expressions should be an error at parse time.
jexel seems to be the closest possible match, but the license is a bit horrible (GPL/Commercial).
If you only want the scripts to output text, then Apache Velocity fits your constraints quite well. It runs in an environment where it only has access to the objects you give it, but can do things like basic math.
The Apache license is a bit friendlier than GPL too.

Java source refactoring of 7000 references

I need to change the signature of a method used all over the codebase.
Specifically, the method void log(String) will take two additional arguments (Class c, String methodName), which need to be provided by the caller, depending on the method where it is called. I can't simply pass null or similar.
To give an idea of the scope, Eclipse found 7000 references to that method, so if I change it the whole project will go down. It will take weeks for me to fix it manually.
As far as I can tell, Eclipse's refactoring plugin is not up to the task, but I really want to automate it.
So, how can I get the job done?
Great, I can copy a previous answer of mine and I just need to edit a tiny little bit:
I think what you need to do is use a source code parser like javaparser to do this.
For every java source file, parse it to a CompilationUnit, create a Visitor, probably using ModifierVisitor as base class, and override (at least) visit(MethodCallExpr, arg). Then write the changed CompilationUnit to a new File and do a diff afterwards.
I would advise against changing the original source file, but creating a shadow file tree may be a good idea (e.g. old file: src/main/java/com/mycompany/MyClass.java, new file: src/main/refactored/com/mycompany/MyClass.java; that way you can diff the entire directories).
Eclipse is able to do that using Refactor -> Change Method signature and provide default values for the new parameters.
For the class parameter the default value should be this.getClass(), but you are right in your comment: I don't know how to do it for the method name parameter.
IntelliJ IDEA shouldn't have any trouble with this.
I'm not a Java expert, but something like this could work. It's not a perfect solution (it may even be a very bad solution), but it could get you started:
Change the method signature with IntelliJ's refactoring tools, and specify default values for the 2 new parameters:
c: this.getClass()
methodName: Thread.currentThread().getStackTrace()[1].getMethodName()
or better yet, simply specify null as the default values.
I think that there are several steps to dealing with this, as it is not just a technical issue but a 'situation':
Decline to do it in short order due to the risk.
Point out the issues caused by not using standard frameworks but reinventing the wheel (as Paul says).
Insist on using Log4j or equivalent if making the change.
Use Eclipse refactoring in sensible chunks to make the changes and deal with the varying defaults.
I have used Eclipse refactoring on quite large changes for fixing old smelly code - nowadays it is fairly robust.
Maybe I'm being naive, but why can't you just overload the method name?
void thing(paramA) {
thing(paramA, THE_DEFAULT_B, THE_DEFAULT_C)
}
void thing(paramA, paramB, paramC) {
// new method
}
Do you really need to change the calling code and the method signature? What I'm getting at is it looks like the added parameters are meant to give you the calling class and method to add to your log data. If the only requirement is just adding the calling class/method to the log data then Thread.currentThread().getStackTrace() should work. Once you have the StackTraceElement[] you can get the class name and method name for the caller.
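A minimal sketch of that idea, assuming the log method can look up its own caller: CallerAwareLog and OrderService are hypothetical names, and format returns the line instead of printing it so the result is easy to inspect. The code searches for its own frame rather than hard-coding an index, since the number of leading frames in the array is not guaranteed.

```java
/** Works out the calling class and method itself, so call sites can stay log(String). */
final class CallerAwareLog {
    /** Returns the formatted log line for the given text. */
    static String format(String text) {
        StackTraceElement[] stack = Thread.currentThread().getStackTrace();
        for (int i = 0; i < stack.length - 1; i++) {
            boolean isThisMethod = stack[i].getClassName().equals(CallerAwareLog.class.getName())
                    && stack[i].getMethodName().equals("format");
            if (isThisMethod) {
                StackTraceElement caller = stack[i + 1]; // the frame that called format()
                return caller.getClassName() + "." + caller.getMethodName() + ": " + text;
            }
        }
        return "<unknown caller>: " + text;
    }
}

/** Hypothetical call site; note that it passes only the message. */
final class OrderService {
    String placeOrder() {
        return CallerAwareLog.format("placing order");
    }
}
```

The trade-off is a stack walk on every log call, which may or may not matter depending on how hot the logging paths are.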
If the lines you need replaced fall into a small number of categories, then what you need is Perl:
find -name '*.java' | xargs perl -pi -e 's/log\(([^,)]*?)\)/log(\1, "foo", "bar")/g'
I'm guessing that it wouldn't be too hard to hack together a script which would put the classname (derived from the filename) in as the second argument. Getting the method name in as the third argument is left as an exercise to the reader.
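The same textual approach, extended to put the class name in, might look like this in Java. It is a rough sketch with the same weaknesses as the one-liner: LogCallRewriter is a hypothetical name, the "FIXME" placeholder stands in for the method name that still has to be filled in, and calls whose argument contains a comma or a nested call are either skipped or mangled, which is why the parser-based answers are safer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogCallRewriter {
    // Same shape as the Perl pattern: log( <one non-empty argument without ',' or ')'> )
    private static final Pattern LOG_CALL = Pattern.compile("\\blog\\(([^,)]+?)\\)");

    /** Derives the class name from a source path, e.g. ".../MyClass.java" gives "MyClass". */
    static String classNameFromPath(String path) {
        String file = path.substring(path.lastIndexOf('/') + 1);
        return file.endsWith(".java") ? file.substring(0, file.length() - ".java".length()) : file;
    }

    /** Appends the class literal and a method-name placeholder to single-argument log calls. */
    static String rewrite(String source, String className) {
        String replacement = "log($1, " + Matcher.quoteReplacement(className) + ".class, \"FIXME\")";
        return LOG_CALL.matcher(source).replaceAll(replacement);
    }
}
```

Calls that already have several arguments, and identifiers that merely end in "log", are left alone by the word boundary and the no-comma restriction.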
Try refactoring with IntelliJ. It has a feature called SSR (Structural Search and Replace). You can refer to classes, method names, etc. for context. (seanizer's answer is more promising, I upvoted it)
I agree with Seanizer's answer that you want a tool that can parse Java. That's necessary but not sufficient; what you really want is a tool that can carry out a reliable mass-change.
To do this, you want a tool that can parse Java, can pattern match against the parsed code, install the replacement call, and spit out the answer without destroying the rest of the source code.
Our DMS Software Reengineering Toolkit can do all of this for a variety of languages, including Java. It parses complete java systems of source, builds abstract syntax trees (for the entire set of code).
DMS can apply pattern-directed, source-to-source transformations to achieve the desired change.
To achieve the OP's effect, he would apply the following program transformation:
rule replace_legacy_log(s:STRING): expression -> expression
" log(\s) " -> " log( \s, \class\(\), \method\(\) ) "
What this rule says is, find a call to log which has a single string argument, and replace it with a call to log with two more arguments determined by auxiliary functions class and method.
These functions determine the containing method name and containing class name for the AST node root where the rule finds a match.
The rule is written in "source form", but actually matches against the AST and replaces found ASTs with the modified AST.
To get back the modified source, you ask DMS to simply prettyprint (to make a nice layout) or fidelity print (if you want the layout of the old code preserved). DMS preserves comments, number radixes, etc.
If the existing application has more than one definition of the "log" function, you'll need to add a qualifier:
... if IsDesiredLog().
where IsDesiredLog uses DMS's symbol table and inheritance information to determine if the specific log refers to the definition of interest.
In fact, your problem is not one that a click'n'play engine will solve by replacing all occurrences of
log("some weird message");
by
log(this.getClass(), new Exception().getStackTrace()[1].getMethodName());
since that is unlikely to work in all cases (static methods, for example).
I would suggest taking a look at Spoon. This tool parses and transforms source code, letting you carry out the change in a slow (it is obviously code-based) but controlled way.
However, you could also consider replacing your current method with one that explores the stack trace to get the information or, even better, internally use log4j and a log formatter that displays the correct information.
I would search and replace log( with log(#class, #methodname,
Then write a little script in any language (even java) to find the class name and the method names and to replace the #class and #method tokens...
Good luck
If the class and method name are required for "where did this log come from?" type data, then another option is to print out a stack trace in your log method. E.g.
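import java.io.PrintWriter;
import java.io.StringWriter;

public void log(String text)
{
    StringWriter sw = new StringWriter();
    PrintWriter pw = new PrintWriter(sw, true);
    // Capture the current call stack as text; it includes the caller of log().
    new Throwable().printStackTrace(pw);
    pw.flush();
    sw.flush();
    String stackTraceAsLog = sw.toString();
    //do something with text and stackTraceAsLog
}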
