I'm working on a fairly complex DSL that I want to compile down into a few high-level languages. The whole process has been a learning experience. The compiler is written in Java.
I was wondering if anyone knew of any best practices for the design of the code generator portion. I currently have everything parsed into an abstract syntax tree.
I was thinking of using a template system, but I haven't researched that direction very far yet, as I would like to hear some wisdom from Stack Overflow first.
Thanks!
When I was doing this back in my programming languages class, we ended up using emitters based on the visitor pattern. It worked pretty well: it makes retargeting to new output languages fairly easy, as long as your AST maps closely to what you're emitting.
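As a minimal sketch of that approach (all node and emitter names here are invented for illustration, not taken from any framework):

```java
// Sketch of a visitor-based emitter; the AST types are hypothetical
// stand-ins for whatever your parser produces.
interface AstVisitor {
    void visit(NumberLiteral node);
    void visit(BinaryOp node);
}

abstract class AstNode {
    abstract void accept(AstVisitor v);
}

class NumberLiteral extends AstNode {
    final int value;
    NumberLiteral(int value) { this.value = value; }
    void accept(AstVisitor v) { v.visit(this); }
}

class BinaryOp extends AstNode {
    final AstNode left, right;
    final String op;
    BinaryOp(AstNode left, String op, AstNode right) {
        this.left = left; this.op = op; this.right = right;
    }
    void accept(AstVisitor v) { v.visit(this); }
}

// One emitter per target language; retargeting means writing a new visitor.
class JavaScriptEmitter implements AstVisitor {
    final StringBuilder out = new StringBuilder();
    public void visit(NumberLiteral node) { out.append(node.value); }
    public void visit(BinaryOp node) {
        out.append('(');
        node.left.accept(this);
        out.append(' ').append(node.op).append(' ');
        node.right.accept(this);
        out.append(')');
    }
}
```

Feeding new BinaryOp(new NumberLiteral(1), "+", new NumberLiteral(2)) through a JavaScriptEmitter leaves (1 + 2) in out; a second output language is just a second visitor over the same nodes.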
What you really want is a program transformation system: a tool that maps syntax structures in one language (your DSL) into syntax patterns in other languages. Such a tool can carry out arbitrary transformations during the code generation process (tree rewrites generalize string rewrites, which are Post systems, which are fully Turing-capable), which means that what you generate, and how sophisticated your generation process is, is limited only by your ambition, not by the properties of a "code generator framework".
Sophisticated program transformation systems combine various types of scoping, flow analysis and/or custom analyzers to enable the transformations. This doesn't add any theoretical power, but it adds a lot of practical power: most real languages (even DSLs) have namespaces, control and data flow, a need for type inference, and so on.
Our DMS Software Reengineering Toolkit is this type of transformation system. It has been used to analyze/transform both conventional languages and DSLs, for simple and complex languages, and for small, large and even huge software systems.
Regarding the OP's comments about "turning the AST into other languages": DMS accomplishes that by writing transformations that map surface syntax for the DSL (implemented behind the scenes using the DSL's AST) to surface syntax for the target language (implemented using target-language ASTs). The resulting target-language AST is then prettyprinted automatically by DMS to provide actual source code in the target language that corresponds to the target AST.
If you are already using ANTLR and have your AST ready you might want to take a look at StringTemplate:
http://www.antlr.org/wiki/display/ST/StringTemplate+Documentation
Also Section 9.6 of The Definitive ANTLR Reference: Building Domain-Specific Languages explains this:
http://www.pragprog.com/titles/tpantlr/the-definitive-antlr-reference
The free code samples are available at http://media.pragprog.com/titles/tpantlr/code/tpantlr-code.tgz. In the subfolder code\templates\generator\2pass\ you'll find an example converting mathematical expressions to Java bytecode.
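For a taste of what template-driven emission looks like, here is a minimal sketch against the StringTemplate 4 API (org.stringtemplate.v4); the template text and attribute names are invented, and the v3 API documented at the link above is similar in spirit:

```java
import org.stringtemplate.v4.ST;

public class TemplateDemo {
    public static void main(String[] args) {
        // The template carries the target language's shape; the emitter
        // only fills in attributes, keeping generation logic and output
        // syntax separate.
        ST decl = new ST("<type> <name> = <value>;");
        decl.add("type", "int");
        decl.add("name", "x");
        decl.add("value", 42);
        System.out.println(decl.render()); // prints: int x = 42;
    }
}
```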
Related
I am working on a project that involves a new language implementation, and I have been assigned the task of constructing a translator for the language. The translator should be built in Java, and it should translate a subset of the new language into C. I have a few questions about that:
1. How should I proceed with this?
2. Which phase should I emphasize most? Should it be the code generation phase of the compiler?
3. Do I need another grammar for the target language?
Thanks in advance.
I'd investigate ANTLR, if you're not already at least aware of it. From http://www.antlr.org/about.html (emphasis mine):
ANTLR, ANother Tool for Language Recognition, is a language tool that
provides a framework for constructing recognizers, compilers, and
translators from grammatical descriptions containing actions in a
variety of target languages. ANTLR automates the construction of
language recognizers. From a formal grammar, ANTLR generates a program
that determines whether sentences conform to that language. In other
words, it's a program that writes other programs. By adding code
snippets to the grammar, the recognizer becomes a translator or
interpreter. ANTLR provides excellent support for intermediate-form
tree construction, tree walking, translation and provides
sophisticated automatic error recovery and reporting.
As an added bonus, ANTLR is written in, and easily callable from, Java.
Additional details are available at http://en.wikipedia.org/wiki/ANTLR.
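As a hedged sketch, driving an ANTLR v3-generated parser from Java looks roughly like this; ExprLexer, ExprParser, and the prog start rule are hypothetical names that ANTLR would derive from a grammar file called Expr.g:

```java
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CommonTokenStream;

public class TranslatorDriver {
    public static void main(String[] args) throws Exception {
        // The lexer and parser classes are generated by ANTLR from the
        // grammar; translation actions or tree construction hang off the rules.
        ANTLRStringStream input = new ANTLRStringStream("1 + 2 * 3");
        ExprLexer lexer = new ExprLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        ExprParser parser = new ExprParser(tokens);
        parser.prog(); // invoke the grammar's start rule
    }
}
```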
I am building an AST by hand to use with my application. I currently have a lot of data in my program's memory, using a standard OO approach, that I will use to form the AST.
I was wondering if by chance there are already any frameworks / code generators that could help me with this task.
I am not looking for a compiler compiler. I don't want to define a grammar and have the code generator produce a parser for it. I intend to instantiate the nodes of the tree myself; I am only looking for a faster and cheaper way to build the .java files themselves (a plus would be having options for the nodes' attributes, optional beginVisit()/endVisit() methods, etc.).
I would highly recommend that you take a look at Eclipse's Java Development Tools. It includes a very robust AST framework.
My understanding is that with this API you have access to all attributes of the various types of AST nodes, and you can also create visitors with visit() and endVisit() methods.
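A minimal sketch of what that looks like (assuming the org.eclipse.jdt.core bundle is on the classpath; the source string and JLS level are placeholders):

```java
import org.eclipse.jdt.core.dom.AST;
import org.eclipse.jdt.core.dom.ASTParser;
import org.eclipse.jdt.core.dom.ASTVisitor;
import org.eclipse.jdt.core.dom.CompilationUnit;
import org.eclipse.jdt.core.dom.MethodDeclaration;

public class JdtVisitDemo {
    public static void main(String[] args) {
        String source = "class Foo { void bar() {} }";

        // Parse Java source into JDT's DOM-style AST.
        ASTParser parser = ASTParser.newParser(AST.JLS3); // pick the JLS level you target
        parser.setSource(source.toCharArray());
        parser.setKind(ASTParser.K_COMPILATION_UNIT);
        CompilationUnit unit = (CompilationUnit) parser.createAST(null);

        // Returning true from visit() descends into the node's children;
        // endVisit() fires after the children have been visited.
        unit.accept(new ASTVisitor() {
            @Override
            public boolean visit(MethodDeclaration node) {
                System.out.println("method: " + node.getName());
                return true;
            }
            @Override
            public void endVisit(MethodDeclaration node) {
                // post-order hook
            }
        });
    }
}
```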
This seems to be the answer to the question:
http://www.jetbrains.com/mps/
The major goal of MPS is to allow extending languages. This is because
every existing language already has a strict language syntax defined,
which limits its flexibility.
The problem in extending language syntax is mainly the textual
presentation of code. This is especially true if we want to use
different language extensions, where each one may have its own syntax.
This naturally leads to the idea of non-textual presentation of
program code. A major benefit of this approach is that it eliminates
the need for code parsing. Our solution is to have code always
maintained in an Abstract Syntax Tree (AST), which consists of nodes
with properties, children and references, and fully describes the
program code.
At the same time, MPS offers an efficient way to keep writing code in
a text-like manner.
In creating a language, you define the rules for code editing and
rendering. You can also specify the language type-system and
constraints. This allows MPS to verify program code on the fly, and
thus makes programming with the new language easy and less
error-prone.
MPS uses a generative approach. You can also define generators for
your language to transform code in the custom language to compilable
code in some conventional language. Currently, MPS is particularly
good for, but is not limited to, generating Java code. You can also
generate XML, HTML, JavaScript, and more.
I am developing an assistant for typing database commands for DBAs, because these commands have many parameters and an assistant would help a lot with their job. For this assistant, I need the grammar of the commands, but database vendors (Oracle, DB2) do not provide that information in any format; the only thing available is the documentation.
One example of a DB2 command is: http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.admin.cmd.doc/doc/r0001933.html
For this reason, I am trying to analyze the grammar diagrams, or railroad diagrams (http://en.wikipedia.org/wiki/Syntax_diagram), but I have not found anything in Java that could help me. I would like a reverse-engineering tool that takes the ASCII (textual) representation of the grammar and creates a graph in Java. Then, with the graph in Java, the assistant could propose options for the currently typed command.
An example of the assistant: http://www.youtube.com/watch?v=5sBoUHJupvs
If you have information about how to analyze grammar diagrams with Java (not generate them), I would appreciate it.
The closest tool I've seen is the Grammar Recovery System by Ralf Lammel. It depends on railroad diagrams being accessible as text strings, which is generally not how they are found. You appear to be lucky in the DB2 case; Ralf's work points in the right direction.
Considering that such diagrams are usually rendered as just a set of pixels (PLSQL's are like this in the PDF files provided as documentation), you have several sets of problems: recognizing graphical entities from pixels, assembling them into actual representations of the railroad diagrams, and then using those as the basis of your assistant.
I think this is a long, hard, impractical approach. If you got it to work, you'd discover the diagrams are slightly wrong in many places (read Ralf's paper or find out the hard way), and therefore unusable for a tool that is supposed to produce the "right" stuff to help your DBAs.
Of course, you are objecting to the other long, hard, "impractical" approach: reading the documentation and producing grammars that match, and then validating those grammars against the real world. Yes, this is a tough slog too, but it actually does produce useful results. You need to find vendors that have done this and will make it available to you.
ANTLR.org offers a variety of grammars. Have you checked there?
My company offers grammars and tools for processing them. We have done this for PLSQL and SQL2011 but not yet DB2.
Given a grammar, you now need to use it to provide "advice" to your users. Your users aren't going to type in a complete "program"; they want to generate fragments (e.g., SELECT statements). Now you need a parser that will process grammar fragments and at least say "legal" or "not". Most won't do that. Our DMS Software Reengineering Toolkit will do that.
To provide advice, you need to be able to walk the grammar (much as you considered for railroad diagrams) to compute "what is legal next". That's actually pretty hard (and in fact it is roughly equivalent to what an LR/GLR parser generator does when building its tables). Our DMS engine does this during syntax-error repair by traversing its GLR parse tables (since that work is already encoded in the tables!). That's not easy to do, as it is a peculiar variant of the GLR parsing algorithm. You might do better with an Earley parser, which keeps all possible parses around as a set of choices; you could simply inspect each one.
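To make "what is legal next" concrete, here is a deliberately naive Java sketch; everything in it is invented for illustration. The diagram is held as a token-labeled directed graph, and the suggestion set is just the labels on the edges leaving the state reached after consuming what the user typed. Real command grammars need alternation, repetition and recursion, which is where the GLR/Earley machinery above comes in.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy sketch: a railroad diagram held as a token-labeled directed graph.
class DiagramNode {
    final Map<String, DiagramNode> edges = new LinkedHashMap<String, DiagramNode>();
    void edge(String token, DiagramNode target) { edges.put(token, target); }
}

public class CommandAssistant {
    // Walk the graph along what the user typed; the labels on the outgoing
    // edges of the state we end in are the legal continuations.
    static Set<String> suggest(DiagramNode state, List<String> typed) {
        for (String token : typed) {
            state = state.edges.get(token);
            if (state == null) return Collections.emptySet(); // prefix is not legal
        }
        return state.edges.keySet();
    }

    public static void main(String[] args) {
        DiagramNode start = new DiagramNode(), n1 = new DiagramNode(),
                    n2 = new DiagramNode(), n3 = new DiagramNode(),
                    end = new DiagramNode();
        // Toy fragment, not real DB2 syntax: BACKUP DATABASE <db-name> (ONLINE | OFFLINE)
        start.edge("BACKUP", n1);
        n1.edge("DATABASE", n2);
        n2.edge("<db-name>", n3);
        n3.edge("ONLINE", end);
        n3.edge("OFFLINE", end);

        System.out.println(suggest(start, Arrays.asList("BACKUP", "DATABASE", "<db-name>")));
        // prints [ONLINE, OFFLINE]
    }
}
```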
But this looks like quite a lot of work, and I think you'll be surprised by the amount of machinery you need.
The best work in this area is Harmonia, which produces incremental editors for code. Our DMS engine's parser is based on earlier work done by this project, because we are interested in the incrementality aspect.
You can try using ANTLR http://www.antlr.org/
It will not be able to understand an ASCII representation of the grammar, but it is powerful enough to do anything else you need, if you don't mind spending the time to learn the software.
I'm interested in metaprogramming (i.e. programs that help programmers do tedious programming tasks). I'm looking for a tool which has the following properties:
usable both at compile time and runtime;
inspects program structure;
can add new classes, methods or fields and make them visible to Java compiler;
can change behavior of methods;
Java-based (well, Java is most popular programming language according to some rankings);
good integration with IDEs and build tools like Ant, Gradle or Maven;
actively maintained project;
easy to use and extend;
There are some solutions for this, like:
reflection
AspectJ
Annotation Processing Tool
bytecode manipulation (CGLIB, Javassist, java.lang.instrument)
Eclipse JDT
Project Lombok
Groovy, JRuby, Scala
But unfortunately none of them meets all the criteria above. Is there any complete metaprogramming solution for Java?
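To make the criteria concrete, here is roughly what the bytecode-manipulation route looks like with Javassist (the class and method names are invented for illustration). The generated method exists only at runtime, so javac never sees it, which is why that option fails the "visible to Java compiler" criterion:

```java
import javassist.ClassPool;
import javassist.CtClass;
import javassist.CtMethod;
import javassist.CtNewMethod;

public class JavassistDemo {
    public static void main(String[] args) throws Exception {
        ClassPool pool = ClassPool.getDefault();

        // Fabricate a brand-new class and add a method to it at runtime.
        CtClass cc = pool.makeClass("demo.Generated");
        CtMethod m = CtNewMethod.make("public int answer() { return 42; }", cc);
        cc.addMethod(m);

        Class<?> clazz = cc.toClass();
        Object instance = clazz.newInstance();
        System.out.println(clazz.getMethod("answer").invoke(instance)); // 42
    }
}
```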
There's JackPot, which is Java-based but I don't think gets a lot of current attention. It has ASTs and symbol tables, AFAIK. You can probably extend it; I doubt anybody will stop (or help) you.
There are the Java-based compiler APIs for the Sun, er, Oracle Java compiler. They're likely actively maintained, but I don't think you can modify source code and regenerate it. It certainly has symbol tables; dunno about trees. Probably pretty hard to extend; you have to keep up with the compiler guys, not the other way around.
There is ANTLR, which has a Java implementation and a Java parser that will build ASTs. I don't think it has full symbol tables, so doing serious code analysis/revision is likely to be hard. ANTLR is certainly actively maintained, and nobody will object to you enhancing the Java grammar with symbol tables. Just know that will take you about 6 months for Java 1.6 if that's all you do. (That's how long it took our internal [smart] guy to do it for DMS, starting with symbol table support for 1.4).
Not in Java, and not easily integrated into IDEs, but capable of carrying out massive analysis and transformation on Java code, is our DMS Software Reengineering Toolkit with its Java Front End.
DMS is generic compiler machinery: parsing, AST building, symbol-table machinery, and flow-analysis machinery, with the additional bonuses of source-to-source transformations and generic prettyprinting of ASTs back to legal text, including retention of comments. It offers a set of APIs supporting these services, and additional tools for defining grammars and language-dependent flow analyzers.
The Java Front End gives DMS the crucial details (using those APIs) to allow it to process Java: a grammar/parser, full symbol-table construction for Java 1.4-1.6 (with 1.7 due momentarily), as well as some control- and data-flow analysis (to be extended over time, because this stuff is so useful).
By using the services provided by DMS and the Java Front End, one can reasonably contemplate building arbitrary Java analysis and transformation tools. (This makes the tool a "complete" metaprogramming tool, in that it can inspect or change any language structure, as opposed to, say, template metaprogramming or reflection.) We believe this to be much more effective than ad hoc tools, because you don't have to build the infrastructure, the infrastructure provided is robust and handles cases you don't have the energy to implement, and it is designed to support such tasks. YMMV.
DMS and its Java Front End have been used to construct a variety of Java tools: test coverage, profilers, dead-code elimination, clone detection at scale, JavaDoc with hyperlinked source code, fast XML parsers/generators, etc.
Yes, it's actively maintained; it has undergone continuous enhancement since the first version in 1998.
There's a Java metaprogramming framework that is part of Tapestry IOC, called Plastic. It munges class bytecodes using custom classloaders. I haven't tried it yet, but it looks like it gives a simple interface that still enables the programmer to make powerful metaprogramming changes.
Check out the Meta Programming System:
http://www.jetbrains.com/mps/
It has great IDE support and is used quite frequently by the smart folks at JetBrains.
Check out Spring Roo.
I need to translate programs written in a domain-specific language into an XML representation. These programs are in the form of simple text files. What approach would you suggest? What API should I use to:
Parse the text files written in this language.
Write XML based on the tokens and token streams I obtain.
My criteria favor rapid and easy development rather than memory or computing-time efficiency.
Many Thanks
Ketan
The less trivial part of the job is step #1, parsing the domain-specific language (DSL) text, rather than step #2, pushing the result into some XML form.
Hopefully you already have a parser for the DSL (obviously the language must have been put to use somewhere...), and you may be able to "hook" your export/conversion logic into it. If that is not possible, you'll need to write a new parser.
Depending on the complexity of the DSL, you may be able to write, longhand, a simple parser based on a few loops and switch cases.
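As a hedged illustration of that longhand approach, here is a minimal loop-and-switch parser for an invented "name = value;" format; a real DSL would want a proper tokenizer, but the shape is the same:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Longhand parser for a toy "name = value; name = value;" format.
public class TinyParser {
    public static Map<String, String> parse(String text) {
        Map<String, String> bindings = new LinkedHashMap<String, String>();
        StringBuilder name = new StringBuilder(), value = new StringBuilder();
        boolean inValue = false;
        for (char c : text.toCharArray()) {
            switch (c) {
                case '=':                 // switch from the name side to the value side
                    inValue = true;
                    break;
                case ';':                 // end of one binding
                    bindings.put(name.toString().trim(), value.toString().trim());
                    name.setLength(0);
                    value.setLength(0);
                    inValue = false;
                    break;
                default:
                    (inValue ? value : name).append(c);
            }
        }
        return bindings;
    }

    public static void main(String[] args) {
        System.out.println(parse("host = example.com; port = 5432;"));
        // prints {host=example.com, port=5432}
    }
}
```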
For more complicated languages, ANTLR is often a good choice. In a nutshell, one formalizes the grammar of the DSL in Backus-Naur Form (BNF; actually EBNF here, i.e., the Extended family), and ANTLR produces a parser written in a target language of choice (including Java). The learning curve with ANTLR is a factor to consider, but in the context of a moderately to extremely sophisticated language it is a well-worth investment. ANTLR is similar to, but in my opinion a better tool than, GNU Bison; the latter would do the trick as well, and can also target Java if so desired.
If you are familiar with other languages, in particular Python, there are many other tools that can be put to use for more or less ad-hoc parsers; I've also used PyParsing and gladly recommend it.
XStream is the best XML serializer/deserializer for Java EVAR. If you can turn your DSL into Java classes, this is a great library to use.
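If you go that route, a minimal sketch (the Command class and its fields are invented; assumes XStream and its default XML parser are on the classpath):

```java
import com.thoughtworks.xstream.XStream;

public class XStreamDemo {
    // Hypothetical class representing one parsed DSL statement.
    public static class Command {
        String name;
        int priority;
        Command(String name, int priority) { this.name = name; this.priority = priority; }
    }

    public static void main(String[] args) {
        XStream xstream = new XStream();
        xstream.alias("command", Command.class); // emit <command> instead of the full class name
        System.out.println(xstream.toXML(new Command("backup", 3)));
        // <command>
        //   <name>backup</name>
        //   <priority>3</priority>
        // </command>
    }
}
```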