parsing and translating from text to xml - java

I need to translate programs written in a domain-specific language into an XML representation. These programs are in the form of simple text files. What approach would you suggest? What API should I use to:
Parse the text files written in this language.
Write xml based on the token and token streams I obtain.
My main criterion is rapid, easy development rather than memory or computing-time efficiency.
Many Thanks
Ketan

The less trivial part of the job is with step #1, parsing the Domain Specific Language (DSL) text, rather than #2, pushing this to some XML language.
Hopefully you already have a parser for the DSL (obviously the language must have been put to use somewhere...), and you may be able to hook your export/conversion logic into it. If that is not possible, you'll need to write a new parser.
Depending on the complexity of the DSL, you may be able to write, longhand, a simple parser based on a few loops and switch cases.
For more complicated languages, ANTLR is often a good choice. In a nutshell, you formalize the grammar of the DSL in Backus–Naur Form (BNF, or actually EBNF here, i.e. the extended family) and ANTLR produces a parser written in a target language of your choice (including Java). The learning curve with ANTLR is a factor to consider, but in the context of a moderately to extremely sophisticated language it is a worthwhile investment. ANTLR is similar to, but in my opinion a better tool than, GNU Bison; the latter would do the trick as well, and can also target Java if so desired.
If you are familiar with other languages, in particular Python, there are many other tools that can be put to use for more or less ad-hoc parsers; I've also used PyParsing and gladly recommend it.
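To make the "few loops and switch cases" option concrete, here is a minimal sketch that parses a toy assignment DSL and emits XML using only the standard library. The DSL syntax (`set name = value`) and the element names are invented for the example, not taken from the question:

```java
import java.util.ArrayList;
import java.util.List;

public class DslToXml {
    // Tokenize a toy "set <name> = <value>" DSL, one statement per line.
    static List<String[]> parse(String source) {
        List<String[]> statements = new ArrayList<>();
        for (String line : source.split("\n")) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens.length == 4 && tokens[0].equals("set") && tokens[2].equals("=")) {
                statements.add(new String[] { tokens[1], tokens[3] });
            } else if (!line.trim().isEmpty()) {
                throw new IllegalArgumentException("Cannot parse: " + line);
            }
        }
        return statements;
    }

    // Emit one <assign> element per parsed statement.
    static String toXml(List<String[]> statements) {
        StringBuilder xml = new StringBuilder("<program>\n");
        for (String[] s : statements) {
            xml.append("  <assign name=\"").append(s[0])
               .append("\" value=\"").append(s[1]).append("\"/>\n");
        }
        return xml.append("</program>").toString();
    }

    public static void main(String[] args) {
        System.out.println(toXml(parse("set x = 5\nset y = 7")));
    }
}
```

For anything beyond this toy shape (nesting, operators, escaping inside attribute values) you would want a real tokenizer and an XML API rather than string concatenation, which is where ANTLR and friends come in.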

XStream is the best XML serializer/deserializer for Java EVAR. If you can turn your DSL into Java classes, this is a great library to use.


What's the best way to write an OCaml parser in Scala/Java?

So I started to write a parser for OCaml in Scala with the Scala CombinatorParser,
but I get the feeling that this is not the right tool for the job.
Especially getting the precedences and associativity of operators and non-closed constructions right can be challenging.
So my question is: what's the best way to write such a real-world parser, like one for OCaml?
I looked into parser generators like ANTLR, but there are numerous options and I have no idea which one would actually make the job easier.
You can have a look at the JavaCC generator. I find it quite useful for building DSL parsers. I guess it's a good candidate for parsing "real" languages too.
The OCaml parser is implemented with pretty straightforward lex+yacc. Therefore, the easiest way is to port the rules using an equivalent lex+yacc toolset in your language.
I do not mean that converting the OCaml parsing rules to LL(k) (e.g. for combinator libraries like Parsec) is completely impossible. Actually, it is not very difficult if you write an automatic conversion tool: see my blog entry about it, http://camlspotter.blogspot.sg/2011/05/planck-small-parser-combinator-library.html. But by hand it is almost impossible to do correctly in a short time.
-edit-
On second thought, the easiest way, if you are not a Scala/Java purist, is to use the original OCaml parser and write some OCaml code to output its AST in something easy to parse from any other language, for example S-expressions.
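To illustrate the consuming side of that suggestion: reading S-expressions back in Java takes very little code, which is exactly why the format is attractive as an interchange. A minimal sketch (class and method names are mine, and it handles only whitespace-separated atoms, no string literals):

```java
import java.util.ArrayList;
import java.util.List;

public class SExp {
    // Parse one S-expression into nested Lists of String atoms,
    // e.g. "(add (lit 1) (lit 2))" -> [add, [lit, 1], [lit, 2]].
    static Object parse(String input) {
        String[] tokens = input.replace("(", " ( ").replace(")", " ) ")
                               .trim().split("\\s+");
        return parse(tokens, new int[] { 0 });
    }

    private static Object parse(String[] tokens, int[] pos) {
        String token = tokens[pos[0]++];
        if (token.equals("(")) {
            List<Object> list = new ArrayList<>();
            while (!tokens[pos[0]].equals(")")) list.add(parse(tokens, pos));
            pos[0]++;                      // consume the closing ")"
            return list;
        }
        return token;                      // a bare atom
    }

    public static void main(String[] args) {
        System.out.println(parse("(add (lit 1) (lit 2))"));
    }
}
```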
You may want to check out ANTLR. For small DSLs I found it very usable. I assume it can handle complex languages as well.

AST builders for Java?

I am building an AST by hand to use in my application. I currently have a lot of data in my program's memory, using a standard OO approach, that I will use to form the AST.
I was wondering if by chance there are already any frameworks / code generators that could help me with this task.
I am not looking for a compiler-compiler. I don't want to define a grammar and have a code generator produce a parser for it. I intend to instantiate the nodes of the tree myself; I am only looking for a faster and cheaper way to build the .java files themselves (a plus would be options for the nodes' attributes, optional beginVisit()/endVisit() methods, etc.).
I would highly recommend that you take a look at Eclipse's Java Development Tools. It includes a very robust AST framework.
My understanding is that with using this API, you would have access to all attributes of the various types of AST Nodes and you can also create visitors with beginVisit() and endVisit() methods.
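If you do end up hand-rolling the nodes instead, the beginVisit()/endVisit() scheme the question mentions is very little code on its own. A minimal sketch, with hypothetical names (these are not JDT's classes):

```java
import java.util.Arrays;
import java.util.List;

// Minimal hand-built AST with a begin/end visitor, not tied to any framework.
interface Visitor {
    void beginVisit(Node node);
    void endVisit(Node node);
}

class Node {
    final String label;
    final List<Node> children;

    Node(String label, Node... children) {
        this.label = label;
        this.children = Arrays.asList(children);
    }

    void accept(Visitor v) {
        v.beginVisit(this);
        for (Node child : children) child.accept(v);
        v.endVisit(this);
    }
}

public class AstDemo {
    // Render the tree as nested parentheses using a begin/end visitor.
    static String render(Node tree) {
        StringBuilder out = new StringBuilder();
        tree.accept(new Visitor() {
            public void beginVisit(Node n) { out.append("(").append(n.label); }
            public void endVisit(Node n)   { out.append(")"); }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        Node tree = new Node("block", new Node("assign", new Node("x"), new Node("5")));
        System.out.println(render(tree)); // prints (block(assign(x)(5)))
    }
}
```

The question's real ask, generating the .java node classes themselves, is what JDT (or a small template script over a node-type list) buys you; the traversal machinery above is the easy part.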
This seems to be the answer to the question:
http://www.jetbrains.com/mps/
The major goal of MPS is to allow extending languages. This is because
every existing language already has a strict language syntax defined,
which limits its flexibility.
The problem in extending language syntax is mainly the textual
presentation of code. This is especially true if we want to use
different language extensions, where each one may have its own syntax.
This naturally leads to the idea of non-textual presentation of
program code. A major benefit of this approach is that it eliminates
the need for code parsing. Our solution is to have code always
maintained in an Abstract Syntax Tree (AST), which consists of nodes
with properties, children and references, and fully describes the
program code.
At the same time, MPS offers an efficient way to keep writing code in
a text-like manner.
In creating a language, you define the rules for code editing and
rendering. You can also specify the language type-system and
constraints. This allows MPS to verify program code on the fly, and
thus makes programming with the new language easy and less
error-prone.
MPS uses a generative approach. You can also define generators for
your language to transform code in the custom language to compilable
code in some conventional language. Currently, MPS is particularly
good for, but is not limited to, generating Java code. You can also
generate XML, HTML, JavaScript, and more.

Java to HTML Parser / State Machine

I wish to create an app that translates input Java code into HTML-formatted Java code.
For example:
public class ReadWithScanner
Would become
<span class="public">public</span> <span class="class">class</span> ReadWithScanner
However, it gets quite complicated when it comes to parameters and regular expressions. Now I have a bit of time on my hands, and I wish to write my own code parser.
How would I start this? And are there any tutorials or online content to help me not only write this, but understand it?
Thanks
For help with the complexity of parsing, you'll need to rely on the Java Language Specification.
As I recall, Java is an LL(k) language (see here, for instance). However, the Java language, despite all attempts to keep it "compact", is still quite large and complex, and the grammar is spread out over the entire document. This is not a project for the faint of heart. You might consider using a Java parsing tool (like Java-front).
What you need to do is use ANTLR. It already has grammars for parsing Java; you then just need to supply your own templates to output whatever you want from the abstract syntax tree you generate with ANTLR.
If you need a resource for learning about parsers, I can recommend Basics of Compiler Design, which is available as a free download.
It covers more than just parsers, but if you read the first few chapters, you should have a good basic understanding of both lexers and parsers.
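For the narrow keyword-highlighting case in the question, a full parser is not strictly required: a lexer that classifies word tokens already produces the spans shown. A rough sketch (the keyword set is abbreviated and the class name is mine; a real version would also have to skip string literals and comments so keywords inside them are not wrapped):

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordHighlighter {
    // Abbreviated keyword set; the full list is in the Java Language Specification.
    private static final Set<String> KEYWORDS =
            Set.of("public", "private", "class", "static", "void", "return");
    private static final Pattern WORD = Pattern.compile("\\b[A-Za-z_][A-Za-z0-9_]*\\b");

    // Wrap each keyword in a <span> whose CSS class is the keyword itself.
    static String highlight(String source) {
        Matcher m = WORD.matcher(source);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String word = m.group();
            m.appendReplacement(out, KEYWORDS.contains(word)
                    ? "<span class=\"" + word + "\">" + word + "</span>"
                    : Matcher.quoteReplacement(word));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("public class ReadWithScanner"));
    }
}
```

Running this on the question's example line produces exactly the `<span class="public">...` output shown above.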
I think you need a lexical analyzer.
I have used the Flex lexical analyzer before; it is not too complicated to use.
If you need to parse the analyzed text, you can use the Bisonc++ parser generator:
bisoncpp.sourceforge.net/
(C++ knowledge and a Linux environment are needed.)

Yacc equivalent for Java

I'm working on a compiler design project in Java. Lexical analysis is done (using JFlex) and I'm wondering which yacc-like tool would be best (most efficient, easiest to use, etc.) for syntactic analysis, and why.
If you specifically want YACC-like behavior (table-driven), the only one I know is CUP.
In the Java world, it seems that more people lean toward recursive descent parsers like ANTLR or JavaCC.
And efficiency is seldom a reason to pick a parser generator.
In the past, I've used ANTLR for both the lexer and parser, and the JFlex homepage says it can interoperate with ANTLR. I wouldn't say that ANTLR's online documentation is that great; I ended up investing in 'The Definitive ANTLR Reference', which helped considerably.
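To make the "recursive descent" option concrete: the usual hand-written shape is one method per precedence level, each deferring to the next tighter level, which is also how generated LL parsers handle precedence. A minimal expression-parser sketch (the toy grammar and class name are made up for illustration):

```java
public class ExprParser {
    private final String[] tokens;
    private int pos = 0;

    ExprParser(String input) { this.tokens = input.trim().split("\\s+"); }

    // expr := term (('+'|'-') term)*   -- lowest precedence, left-associative
    int expr() {
        int value = term();
        while (pos < tokens.length && (tokens[pos].equals("+") || tokens[pos].equals("-"))) {
            value = tokens[pos++].equals("+") ? value + term() : value - term();
        }
        return value;
    }

    // term := factor ('*' factor)*    -- binds tighter than '+'/'-'
    int term() {
        int value = factor();
        while (pos < tokens.length && tokens[pos].equals("*")) {
            pos++;
            value = value * factor();
        }
        return value;
    }

    // factor := NUMBER
    int factor() { return Integer.parseInt(tokens[pos++]); }

    public static void main(String[] args) {
        System.out.println(new ExprParser("1 + 2 * 3").expr()); // prints 7
    }
}
```

A table-driven tool like CUP encodes the same precedence declaratively in the grammar instead of in the call structure.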
GNU Bison has a Java interface,
http://www.gnu.org/software/bison/manual/html_node/Java-Bison-Interface.html
You can use it to generate Java code.
There is also jacc.
Jacc is about as close to yacc as you can get, but it is implemented in pure Java and generates a Java parser.
It interfaces well with JFlex:
http://web.cecs.pdx.edu/~mpj/jacc/
Another option would be the GOLD Parser.
Unlike many of the alternatives, the GOLD parser generates the parsing tables from the grammar and places them in a binary, non-executable file. Each supported language then has an engine which reads the binary tables and parses your source file.
I've not used the Java implementation specifically, but have used the Delphi engine with fairly good results.

Best design for generating code from an AST?

I'm working on a pretty complex DSL that I want to compile down into a few high-level languages. The whole process has been a learning experience. The compiler is written in Java.
I was wondering if anyone knew a best practice for the design of the code generator portion. I currently have everything parsed into an abstract syntax tree.
I was thinking of using a template system, but I haven't researched that direction too far yet as I would like to hear some wisdom first from stack overflow.
Thanks!
When I was doing this back in my programming languages class, we ended up using emitters based on the visitor pattern. It worked pretty well: it makes retargeting to new output languages fairly easy, as long as your AST matches what you're printing reasonably closely.
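The shape of that emitter-per-target design can be sketched in a few lines: the AST classes stay fixed, and you swap in a different visitor per output language. A minimal illustration (all names are hypothetical, and a real AST would have many node types):

```java
// Minimal emitter-as-visitor sketch: one emitter implementation per target language.
interface Emitter {
    String assign(String name, String value);
}

class AssignNode {
    final String name, value;

    AssignNode(String name, String value) {
        this.name = name;
        this.value = value;
    }

    // Double dispatch: the node hands its data to whichever emitter is supplied.
    String accept(Emitter e) { return e.assign(name, value); }
}

public class CodegenDemo {
    public static void main(String[] args) {
        AssignNode node = new AssignNode("x", "5");
        Emitter javaEmitter   = (name, value) -> "int " + name + " = " + value + ";";
        Emitter pythonEmitter = (name, value) -> name + " = " + value;
        System.out.println(node.accept(javaEmitter));   // prints int x = 5;
        System.out.println(node.accept(pythonEmitter)); // prints x = 5
    }
}
```

Template engines such as StringTemplate, mentioned below in this thread, play the role of the lambda bodies here, keeping the target-language text out of the Java code.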
What you really want is a program transformation system, that is, a tool that maps syntax structures in one language (your DSL) into syntax patterns in other languages. Such a tool can carry out arbitrary transformations (tree rewrites generalize string rewrites, which are Post systems, which are fully Turing-capable) during the code generation process, which means that what you generate, and how sophisticated your generation process is, is determined only by your ambition, not by "code generator framework" properties.
Sophisticated program transformation systems combine various types of scoping, flow analysis and/or custom analyzers to enable the transformations. This doesn't add any theoretical power, but it adds a lot of practical power: most real languages (even DSLs) have namespaces, control and data flow, need type inference, etc.
Our DMS Software Reengineering Toolkit is this type of transformation system. It has been used to analyze/transform both conventional languages and DSLs, for simple and complex languages, and for small, large and even huge software systems.
Regarding the OP's comments about "turning the AST into other languages": that is accomplished in DMS by writing transformations that map surface syntax for the DSL (implemented behind the scenes using the DSL's AST) to surface syntax for the target language (implemented using target-language ASTs). The resulting target-language AST is then pretty-printed automatically by DMS to produce actual source code in the target language corresponding to the target AST.
If you are already using ANTLR and have your AST ready you might want to take a look at StringTemplate:
http://www.antlr.org/wiki/display/ST/StringTemplate+Documentation
Also Section 9.6 of The Definitive ANTLR Reference: Building Domain-Specific Languages explains this:
http://www.pragprog.com/titles/tpantlr/the-definitive-antlr-reference
The free code samples are available at http://media.pragprog.com/titles/tpantlr/code/tpantlr-code.tgz. In the subfolder code\templates\generator\2pass\ you'll find an example converting mathematical expressions to java bytecode.
