Purely as a self-learning exercise, I'm trying to write a Java parser in Perl using the Parse::RecDescent module. I may later re-implement the parser using other tools like ANTLR, bison, etc.
But how would I ensure that my parser is indeed generating the correct parse, per the Java Language Specification? That is, its correct handling of dangling elses, operator associativity and precedence, etc.
One method would be to compare my parser against a known, bug-free parser by having both parsers generate ASTs for a very large number of test Java programs, and then comparing the two sets of ASTs.
If this is indeed the only method, where could I find a large suite of test Java programs thoroughly covering the entire Java Language Specification?
I have looked at JavaParser but it doesn't seem to have an exhaustive test dataset.
The other method would, of course, be writing tens of thousands of test Java programs by hand myself, which would be very impractical for me, not only time-wise but also in ensuring their exhaustiveness!
To decide if you have the right answer, you ideally have to compare to some kind of standard. This is hard for computer languages.
Comparing ASTs is going to be hard, because there are no standards for them. Each parser that builds ASTs builds an AST whose structure is designed by the person who coded the parser.
That means if you build an AST-producing parser and you get somebody else's AST-producing parser, you'll discover that the AST nodes you have chosen don't match the other AST. Now you'll have to build a mapping from your AST to the other one (and how will you know the mapping is valid?). You can try to make your parser generate the AST of another parser, but what you will discover is that the AST you produce is influenced by the parsing technology you use.
We have a similar problem with the Java front end my company produces (see bio if you want to know more). What we settle for is testing that the answer is self-consistent and then we do a lot of long-term experiential testing on big pieces of code.
Our solution is to:

1. Build a parser using the strongest parsing technology we can get (GLR). This means we can recognize certain constructs not easily recognized by other parsing technologies (LL, LR, ...), and thus produce AST nodes that other parsers would have a hard time producing. See the comments below for an example where this matters. Even so, we produce AST nodes in a way that completely avoids our having to hand-code AST node construction, as demanded by most other parsing technologies; hand-coding tends to produce somewhat different ASTs.

2. Parse a LOT of Java code (producing ASTs) to make sure we don't have parsing errors. [The JDK is a pretty good-sized example, and it is easy to get.]

3. Use our tools to take an AST and regenerate (prettyprint) source code, complete with comments but perhaps with a somewhat different layout. We verify that parsed-then-prettyprinted code also parses. We then re-prettyprint the parsed prettyprinted version; this should be identical to the first prettyprinted version, since we always produce the same layout. This test is a good indication that our AST design and implementation doesn't lose anything about the source code. (A minimal sketch of this round-trip check appears after this list.)

4. Build symbol tables, resolve the meaning of names, and verify that legal Java programs type-check according to our front end. That doesn't tell you anything about the nature of the AST except that it is good enough (which, in fact, is good enough!). Because the type-checking task is very complex (go check your local Java standard), it is also pretty fragile: if you don't have everything right, the type checking will likely fail when applied across a large body of code. Again, the JDK is a pretty good test of this. Note: a Java parser without name and type resolution isn't very useful in practice.

5. Produce JavaDoc-like cross references that include hyperlinked source code from the above results. This makes it easy to manually check a bit of code to see that name resolution (and therefore AST construction) is sane.

6. Live with the results, applying the front end to various program analyses and transformations of code. We find the occasional problem and fix it.
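A minimal sketch of the round-trip check from step 3, using the JavaParser library as a stand-in prettyprinter (this is illustrative only; our actual tooling is DMS-based). It treats the prettyprinter's output as a fixpoint: printing, reparsing, and printing again must yield identical text.

    import com.github.javaparser.StaticJavaParser;
    import com.github.javaparser.ast.CompilationUnit;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Parse -> prettyprint -> reparse -> prettyprint; the two printed
    // versions must match if the printer always produces the same layout.
    public class RoundTripCheck {
        public static void main(String[] args) throws Exception {
            String original = new String(Files.readAllBytes(Paths.get(args[0])));
            CompilationUnit first = StaticJavaParser.parse(original);
            String printed = first.toString();                         // prettyprint pass 1
            CompilationUnit second = StaticJavaParser.parse(printed);  // must reparse cleanly
            String reprinted = second.toString();                      // prettyprint pass 2
            if (!printed.equals(reprinted)) {
                throw new AssertionError("prettyprint is not stable for " + args[0]);
            }
            System.out.println("round-trip OK: " + args[0]);
        }
    }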
It is tough to get this right; you have to get close and keep the testing pressure on continuously, especially since Java the language keeps moving. (We're at Java 8, and Java 9 is being threatened.) Bottom line: it is a lot of work to build such a parser and check its sanity.
We'd love to have an independent set of tests, but we haven't seen one in the wild. And I would expect that those tests, if they exist (I assume Oracle and IBM have them), really don't test parsing and name resolution directly, but rather test that some bit of code compiles and runs, producing a known result. Since we aren't building a compiler, we wouldn't be able to run such tests if we had them. We would be able to do the name resolution and type consistency checks, and that would be helpful.
[We actually do this for a number of language front ends. You think Java is hard? Try this with C++.]
I agree with what Ira Baxter wrote.
In short, it is easy to verify that your parser can parse all valid source code: just throw a lot of code at it. We did that to test JavaParser; we just did not put the hundreds of megabytes of source code in the repository.
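For a feel of that kind of smoke test, here is a minimal sketch (assuming the JavaParser library; point it at any large codebase, such as the JDK sources):

    import com.github.javaparser.StaticJavaParser;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    // Try to parse every .java file under a directory; report the rejects.
    public class ParseCorpus {
        public static void main(String[] args) throws Exception {
            try (Stream<Path> files = Files.walk(Paths.get(args[0]))) {
                files.filter(p -> p.toString().endsWith(".java")).forEach(p -> {
                    try {
                        StaticJavaParser.parse(p);   // throws on a parse failure
                    } catch (Exception e) {
                        System.out.println("FAILED: " + p);
                    }
                });
            }
        }
    }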
It is hard to verify that the AST you produce has the right shape, also because there are many different ways of parsing the same code and building different but equally valid ASTs. Some shapes will be easier to work with than others, but there is no single "truth" to adhere to.
If your tool produces an AST different from the one produced by another parser, it does not necessarily mean that either of the two parsers is wrong; it could just be a design choice.
Related
I'm trying to find information (documentation, advice, etc) on how certain IDE templates (e.g. in Eclipse, IntelliJ, and NetBeans) are instantiated internally by IDEs, and I'm having some trouble.
I'm hoping, perhaps optimistically, that I can automatically generate multiple (at least two) distinct samples of each pattern from templates written in the associated grammars.
Every pattern-parameter (including cursors) must be filled, and samples for the same pattern should only have non-pattern-parameter content in common.
At this stage, they need to be syntactically valid so that they can be parsed, but do not need to be fully semantically valid/compilable snippets.
If anyone knows how any of these IDEs work internally, and can tell me if/how I might be able to do this (or can point me towards sufficient documentation), I would greatly appreciate it.
Background/Context
I'm trying to create a research dataset for a pattern mining task - specifically, for mining code templates. I've been looking into it for some time and, as far as I'm aware, there isn't a suitable precedent dataset, so I have to make one.
Rather than painstakingly defining every feature of every pattern myself, I'm writing tools to partially automate the process. Specifically, automating the tasks of deriving candidate patterns from samples, and of filtering out any candidates not observed in the actual corpus. The tools are input-language-agnostic, but I am initially targeting Java ASTs via the Eclipse JDT.
My thinking is that well-established patterns such as idioms and IDE code templates, from sufficiently reputable sources, are rational and intuitive pattern candidates with which I can, at least, evaluate recall. I can, and will, define some target-sample sets manually. However, I would prefer to generate them automatically, so that I can collect more complicated templates en masse (e.g. those published by IDE community members).
Thanks in advance,
Marcos C-S
I want to parse Java/COBOL/VB etc. code to collect information like variable names, methods, etc.
I am using a JavaCC grammar, but my problem is that if any exception occurs, the parser fails.
Apart from Java, I use JavaCC grammars for COBOL, VB, etc.
I do not want the parser to fail, so I am trying to read the Java code line by line to get the desired result.
Is there a better way to do the parsing without throwing exceptions?
Thanks in advance.
Parsers (and therefore the machinery provided by parser generators) must have some means of handling invalid source files. Thus each parser (and parser generator) has chosen some method to manage syntax errors.
An easy solution provided by most is simply to throw an exception when such an error is encountered. The user code invoking the parser has to catch this exception and live with an aborted parse. A more sophisticated solution is to have the parser report a syntax error, but recover from the error and continue parsing; any such recovery must handle AST building, with some kind of marker on the nodes at the point of error, if you also hope to get a usable tree. Most parser generators will offer some kind of syntax recovery, but leave you on your own to handle AST building in the face of such errors. Such parser recovery logic is relatively hard to build, and you likely can't do it yourself without becoming an expert in parser error recovery and making custom changes to the particular parser generator's support code.
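To make the recovery idea concrete, here is a toy sketch of panic-mode recovery in a hand-written parser (not JavaCC; purely illustrative): on a syntax error, it records a marker node and skips ahead to a synchronization token, here a semicolon.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Toy panic-mode recovery: parse "statements" separated by semicolons;
    // on error, emit an error marker node and skip to the next semicolon.
    class RecoveringParser {
        private final List<String> tokens;
        private int pos = 0;
        private final List<String> nodes = new ArrayList<>();   // stand-in for AST nodes

        RecoveringParser(List<String> tokens) { this.tokens = tokens; }

        List<String> parse() {
            while (pos < tokens.size()) {
                try {
                    nodes.add(parseStatement());
                } catch (RuntimeException syntaxError) {
                    nodes.add("ERROR@" + pos);   // marker node at the point of error
                    while (pos < tokens.size() && !tokens.get(pos).equals(";")) pos++;
                }
                if (pos < tokens.size() && tokens.get(pos).equals(";")) pos++;  // sync token
            }
            return nodes;
        }

        // Accepts only "id = id" as a statement, to keep the sketch tiny.
        private String parseStatement() {
            String lhs = expectIdent();
            expect("=");
            String rhs = expectIdent();
            return "assign(" + lhs + "," + rhs + ")";
        }

        private String expectIdent() {
            if (pos < tokens.size() && tokens.get(pos).matches("\\w+")) return tokens.get(pos++);
            throw new RuntimeException("expected identifier at " + pos);
        }

        private void expect(String t) {
            if (pos < tokens.size() && tokens.get(pos).equals(t)) { pos++; return; }
            throw new RuntimeException("expected '" + t + "' at " + pos);
        }

        public static void main(String[] args) {
            List<String> toks = Arrays.asList("a", "=", "b", ";", "oops", "!", ";", "c", "=", "d");
            System.out.println(new RecoveringParser(toks).parse());
            // -> [assign(a,b), ERROR@5, assign(c,d)]
        }
    }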
I'm not specifically familiar with JavaCC (or most of the other parser generators out there), so I don't know if it does this. Obviously, check the docs. If the error handling support you want isn't there, move on to another generator that has it.
I suspect your real problem will actually be getting grammars appropriate for your task. Nobody has "Java" or "COBOL"; they have a specific dialect, e.g., Java 1.5 or IBM Enterprise COBOL or VB6. In my long experience, these dialects differ from the imagined base language more than you expect. You can hope that such grammars as you can obtain will work (including error recovery) to parse the various dialects of each in spite of such differences, but generally you'll get large numbers of errors from code in one dialect that doesn't match the grammar you have. (What will you do with the card numbers out past column 72 in your IBM Enterprise COBOL code that has EBCDIC source files?) So you really want a tool that has lots of parsers that handle the various dialects, and that should drive your choice IMHO.
I think ANTLR has a lot of language definitions (more than JavaCC), so it kind of qualifies. However, many of the grammars at that site are experimental or unfinished (some are pretty good), so it is somewhat pot luck.
Our DMS Software Reengineering Toolkit has a lot of grammars, and we consider it our task to make these production quality. We aren't perfect either, but our grammars tend to have been tested on large bodies of code and have support for various dialects. Error recovery is built in, and you do get a tree back (with error nodes in the AST) if the number of errors is less than a specified threshold. DMS also handles nasty issues such as character encodings (we handle a wide variety, including 80-column EBCDIC with card numbers in column 72 for IBM COBOL and JCL). DMS may not be what you want; for instance, it is not Java-based. But we try to make up for that by providing a huge amount of machinery to support post-parsing tasks, such as what you want to do. This machinery includes support for building symbol tables, extracting control and data flows, matching patterns and applying source-to-source transformations, etc. But that's a tradeoff for you to make.
You could attempt to modify the grammar, but why not just work from an AST such as that available from Eclipse?
In the end this will likely be more reliable than most grammars you'll find on the net.
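For example, the Eclipse JDT core (usable outside the IDE as a plain library) parses with error recovery turned on, so you still get an AST to walk for names even when the source has syntax errors. A minimal sketch, assuming the org.eclipse.jdt.core library:

    import org.eclipse.jdt.core.dom.AST;
    import org.eclipse.jdt.core.dom.ASTParser;
    import org.eclipse.jdt.core.dom.ASTVisitor;
    import org.eclipse.jdt.core.dom.CompilationUnit;
    import org.eclipse.jdt.core.dom.MethodDeclaration;
    import org.eclipse.jdt.core.dom.VariableDeclarationFragment;

    // JDT parses with recovery, so names can still be collected from bad code.
    public class CollectNames {
        public static void main(String[] args) {
            String source = "class A { int x; void broken( { } void ok() { int y; } }";
            ASTParser parser = ASTParser.newParser(AST.JLS8);
            parser.setKind(ASTParser.K_COMPILATION_UNIT);
            parser.setStatementsRecovery(true);     // keep going past bad statements
            parser.setSource(source.toCharArray());
            CompilationUnit unit = (CompilationUnit) parser.createAST(null);
            unit.accept(new ASTVisitor() {
                @Override public boolean visit(MethodDeclaration node) {
                    System.out.println("method:   " + node.getName());
                    return true;
                }
                @Override public boolean visit(VariableDeclarationFragment node) {
                    System.out.println("variable: " + node.getName());
                    return true;
                }
            });
        }
    }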
I've had great success with SableCC for Java. It's surprisingly easy to use and has a Java 1.5 grammar available.
I am developing an assistant for typing database commands for DBAs; because these commands have many parameters, an assistant will help a lot with their job. For this assistant, I need the grammar of the commands, but database vendors (Oracle, DB2) do not provide that information in any format; the only thing available is the documentation.
One example of a DB2 command is: http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.admin.cmd.doc/doc/r0001933.html
For this reason, I am trying to analyze the grammar diagrams, or railroad diagrams (http://en.wikipedia.org/wiki/Syntax_diagram), but I have not found anything in Java that could help me. I would like a reverse-engineering tool that takes the ASCII (textual) representation of the grammar and creates a graph in Java. Then, with the graph in Java, the assistant could propose options for the command currently being typed.
One example of such an assistant: http://www.youtube.com/watch?v=5sBoUHJupvs
If you have information about how to analyze grammar diagrams with Java (not generate them), I will appreciate that information.
The closest tool I've seen is the Grammar Recovery System by Ralf Lammel. It depends on the accessibility of railroad diagrams as text strings, which is generally not how they are found. You appear to be lucky in the DB2 case; Ralf's work points in the right direction.
Considering that such diagrams are usually rendered as just a set of pixels (PLSQL's are like this in the PDF files provided as documentation), you have several sets of problems: recognizing graphical entities from pixels, assembling them into actual representations of the railroad diagrams, and then using the result in your assistant.
I think this is a long, hard, impractical approach. If you got it to work, you'd discover the diagrams are slightly wrong in many places (read Ralf's paper or find out the hard way), and therefore unusable for a tool that is supposed to produce the "right" stuff to help your DBAs.
Of course, you are objecting to the other long, hard, "impractical" approach of reading the documentation and producing grammars that match, and then validating those grammars against the real world. Yes, this is a tough slog too, but it actually does produce useful results. You need to find vendors that have done this and will make it available to you.
ANTLR.org offers a variety of grammars. Have you checked there?
My company offers grammars and tools for processing them. We have done this for PLSQL and SQL2011 but not yet DB2.
Given a grammar, you now need to use it to provide "advice" to your users. Your users aren't going to type in a complete "program"; they want to generate fragments (e.g., SELECT statements). Now you need a parser that will process grammar fragments and at least say "legal" or "not". Most won't do that. Our DMS Software Reengineering Toolkit will do that.
To provide advice, you need to be able to walk the grammar (much as you considered for railroad diagrams) to compute "what is legal next". That's actually pretty hard (and in fact it is roughly equivalent to what an LR/GLR parser generator does when building its tables). Our DMS engine does that during syntax-error repair by traversing its GLR parse tables (since that work is already encoded in the tables!). That's not easy to do, as it is a peculiar variant of the GLR parsing algorithm. You might do better with an Earley parser, which keeps around all possible parses as a set of choices; you could simply inspect each one.
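As a toy illustration of computing "what is legal next" (nothing like GLR table traversal; just a hand-rolled recursive-descent parser for a tiny SELECT-style command that records what it expected when the typed prefix ran out):

    import java.util.Arrays;
    import java.util.List;
    import java.util.Set;
    import java.util.TreeSet;

    // Toy grammar:  command := SELECT ident (',' ident)* FROM ident
    // Tokens in the typed prefix are whitespace-separated for simplicity.
    class NextTokenAdvisor {
        private final List<String> tokens;
        private int pos = 0;
        private final Set<String> suggestions = new TreeSet<>();

        NextTokenAdvisor(String typedPrefix) {
            tokens = Arrays.asList(typedPrefix.trim().split("\\s+"));
        }

        // Consume the literal if present; if the prefix ended here, the
        // literal is a legal continuation, so record it as a suggestion.
        private boolean expect(String literal) {
            if (pos >= tokens.size()) { suggestions.add("'" + literal + "'"); return false; }
            if (tokens.get(pos).equalsIgnoreCase(literal)) { pos++; return true; }
            return false;   // genuine mismatch; a real tool would report an error here
        }

        private boolean expectIdent() {
            if (pos >= tokens.size()) { suggestions.add("<identifier>"); return false; }
            if (tokens.get(pos).matches("[A-Za-z_]\\w*")) { pos++; return true; }
            return false;
        }

        Set<String> advise() {
            if (!expect("SELECT")) return suggestions;
            if (!expectIdent()) return suggestions;
            while (pos < tokens.size() && tokens.get(pos).equals(",")) {
                pos++;
                if (!expectIdent()) return suggestions;
            }
            if (pos >= tokens.size()) suggestions.add("','");  // another column is legal too
            if (!expect("FROM")) return suggestions;
            expectIdent();
            return suggestions;   // empty set means the command is already complete
        }

        public static void main(String[] args) {
            System.out.println(new NextTokenAdvisor("SELECT a").advise());    // [',', 'FROM']
            System.out.println(new NextTokenAdvisor("SELECT a ,").advise());  // [<identifier>]
        }
    }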
But this looks like quite a lot of work, and I think you'll be surprised by the amount of machinery you need.
The best work in this area is Harmonia, which produces incremental editors for code. Our DMS engine's parser is based on earlier work done by this project, because we are interested in the incrementality aspect.
You can try using ANTLR http://www.antlr.org/
It will not be able to understand an ASCII representation of the grammar, but it is powerful enough to do anything else you need, if you don't mind spending the time to learn the software.
Basically, I do lots of one-off code generation, large-scale refactorings, etc. etc. in Java.
My tool language of choice is Python, but I'll take whatever solutions you can offer.
Here is a simplified illustration of what I would like, in pseudocode:
Generating an implementation for an interface:

    search within my project:
        for each Interface as iName:
            write class(name=iName+"Impl", implements=iName)
            search within the body of iName:
                for each Method as mName:
                    write method(name=mName, body="// TODO implement this...")
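For concreteness, here is roughly what that pseudocode looks like against a real parser API; a sketch assuming the JavaParser library (the "Impl" naming convention and empty TODO bodies are illustrative):

    import com.github.javaparser.StaticJavaParser;
    import com.github.javaparser.ast.CompilationUnit;
    import com.github.javaparser.ast.Modifier;
    import com.github.javaparser.ast.body.ClassOrInterfaceDeclaration;
    import com.github.javaparser.ast.body.MethodDeclaration;
    import com.github.javaparser.ast.stmt.BlockStmt;
    import java.nio.file.Paths;

    // For each interface in a source file, emit a FooImpl class with method stubs.
    public class StubGenerator {
        public static void main(String[] args) throws Exception {
            CompilationUnit source = StaticJavaParser.parse(Paths.get(args[0]));
            for (ClassOrInterfaceDeclaration iface :
                     source.findAll(ClassOrInterfaceDeclaration.class)) {
                if (!iface.isInterface()) continue;
                CompilationUnit out = new CompilationUnit();
                ClassOrInterfaceDeclaration impl =
                    out.addClass(iface.getNameAsString() + "Impl");
                impl.addImplementedType(iface.getNameAsString());
                for (MethodDeclaration m : iface.getMethods()) {
                    MethodDeclaration stub =
                        impl.addMethod(m.getNameAsString(), Modifier.Keyword.PUBLIC);
                    stub.setType(m.getType().clone());
                    m.getParameters().forEach(
                        p -> stub.addParameter(p.getType().clone(), p.getNameAsString()));
                    stub.setBody(new BlockStmt());   // empty body standing in for the TODO
                }
                System.out.println(out);
            }
        }
    }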
Basically, the tool I'm searching for would allow me to:
- parse files according to their Java structure ("search for interfaces")
- search for words contextualized by language elements and types ("variables of type SomeClass", "doStuff() method calls on SomeClass instances")
- run searches with structural context ("within the body of the current result")
- easily replace or generate code (with helpers to generate, as above, or functions for replacing: "rename the interface to Foo", "insert the line Blah.Blah()", etc.)
The point is, I don't want to spend a lot of time writing these things, as they are usually throwaway. But sometimes I need something just a little smarter than what grep offers. It wouldn't be too hard to write up a simplistic version of this, but if I'm going to use something like this at all, I'd expect it to be robust.
Any suggestions of a tool/library that will help me accomplish this?
Edit to add some clarification
Python is definitely not necessary; I'll take whatever works. I merely suggest it in case there are choices.
This is to be used in combination with IDE refactoring; sometimes it just doesn't do everything I want.
In instances where I'm using it for code generation (as above), it's to augment the output of other code generators; e.g. a library we use outputs a tonne of interfaces, and we need to make standard implementations of each one to mesh them with our codebase.
First, I am not aware of any tools or libraries implemented in Python that are specifically designed for refactoring Java code, and a Google search did not give me any leads.
Second, I would posit that writing such a decent tool or library for refactoring Java in Python would be a large task. You would have to implement a Java compiler front-end (lexer/parser, AST builder and type analyser) in Python, then figure out how to integrate this with a program editor. I'm not surprised that nobody has done this ... given that mature alternatives already exist.
Thirdly, a refactoring tool that works without a full analysis of the source code (one that uses pattern matching, for example) will be incapable of doing complex refactorings, and is likely to make mistakes in edge cases that the implementor did not think of. I expect that is the level at which the OP is currently operating ...
Given that bleak outlook, what are the alternatives:
One alternative is to use one of the existing Java IDEs (e.g. NetBeans, Eclipse, IDEA, etc.) as a refactoring tool. The OP won't be able to extend the capabilities of such a tool in Python code, but the chances are that he won't really need to. I expect that at least one of these IDEs does 95% of what he needs, and (if he is realistic) that should be good enough. Especially when you consider that IDEs have lots of incidental features that help make refactoring easier; e.g. structured editing, undo/redo, incremental compilation, intelligent code completion, intelligent searching, type and call hierarchy views, and so on.
(Aside ... if existing IDEs are not good enough (@WizardOfOdds - only the OP can make that call!!), it would make more sense to try to extend the refactoring capability of an existing IDE than to start again in a different implementation language.)
Depending on what he is actually doing, model-driven code generation may be another alternative. For instance, if the refactoring is happening because he is frequently creating and recreating his object model(s), then an alternative is to code the models in some modeling language and generate his code from those models. My tool of choice when doing this kind of thing is Eclipse EMF and related technologies. The EMF technologies include generation of editors, XML serialization, persistence, queries, model to model transformation and so on. I have used EMF to implement and roll out projects with object models consisting of 50 to 100 distinct classes with complex relationships and validation requirements. EMF's support for merging source code edits when you regenerate from an updated model is a key feature.
If you are coding in Java, I strongly recommend that you use NetBeans IDE. It has this kind of refactoring support built in. Eclipse also supports this kind of thing (although I prefer NetBeans). Both projects are open source, so if you want to see how they perform this refactoring, you can look at their source code.
Java has its fair share of criticism these days, but in the area of tooling it isn't justified.
We are spoiled for choice; Eclipse, NetBeans, and IntelliJ are the big three IDEs. All of them offer excellent levels of searching and refactoring. Eclipse has the edge on NetBeans, I think, and IntelliJ is often ahead of Eclipse.
You can also use static analysis tools such as FindBugs, Checkstyle, etc. to find issues, i.e. excessively long methods and classes, overly complex code.
If you really want to leverage your Python skills, take a look at Jython. It's a Python interpreter written in Java.
I'm working on a pretty complex DSL that I want to compile down into a few high level languages. The whole process has been a learning experience. The compiler is written in java.
I was wondering if anyone knew a best practice for the design of the code generator portion. I currently have everything parsed into an abstract syntax tree.
I was thinking of using a template system, but I haven't researched that direction too far yet as I would like to hear some wisdom first from stack overflow.
Thanks!
When I was doing this back in my programming languages class, we ended up using emitters based on following the visitor pattern. It worked pretty well - makes retargeting it to new output languages pretty easy, as long as your AST matches what you're printing fairly well.
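For example, the skeleton of such an emitter might look like the following toy sketch (my reconstruction, not the class project's code): a two-node expression AST, with one visitor per target language doing the printing.

    // Toy visitor-based emitter: one visitor per target language.
    interface Node { <R> R accept(Visitor<R> v); }
    interface Visitor<R> { R visitNum(Num n); R visitAdd(Add a); }

    final class Num implements Node {
        final int value;
        Num(int value) { this.value = value; }
        public <R> R accept(Visitor<R> v) { return v.visitNum(this); }
    }

    final class Add implements Node {
        final Node left, right;
        Add(Node left, Node right) { this.left = left; this.right = right; }
        public <R> R accept(Visitor<R> v) { return v.visitAdd(this); }
    }

    // Retargeting to a new output language means writing one new visitor.
    class JavaEmitter implements Visitor<String> {
        public String visitNum(Num n) { return Integer.toString(n.value); }
        public String visitAdd(Add a) {
            return "(" + a.left.accept(this) + " + " + a.right.accept(this) + ")";
        }
    }

    class EmitterDemo {
        public static void main(String[] args) {
            Node ast = new Add(new Num(1), new Add(new Num(2), new Num(3)));
            System.out.println(ast.accept(new JavaEmitter()));   // -> (1 + (2 + 3))
        }
    }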
What you really want is a program transformation system: a tool that maps syntax structures in one language (your DSL) into syntax patterns in other languages. Such a tool can carry out arbitrary transformations (tree rewrites generalize string rewrites, which are Post systems, which are fully Turing capable) during the code generation process, which means that what you generate, and how sophisticated your generation process is, are determined only by your ambition, not by "code generator framework" properties.
Sophisticated program transformation systems combine various types of scoping, flow analysis and/or custom analyzers to enable the transformations. This doesn't add any theoretical power, but it adds a lot of practical power: most real languages (even DSLs) have namespaces, control and data flow, need type inference, etc.
Our DMS Software Reengineering Toolkit is this type of transformation system. It has been used to analyze/transform both conventional languages and DSLs, for simple and complex languages, and for small, large and even huge software systems.
Related to the OP's comments about "turning the AST into other languages": that is accomplished in DMS by writing transformations that map surface syntax for the DSL (implemented behind the scenes by his DSL's AST) to surface syntax for the target language (implemented using target-language ASTs). The resulting target-language AST is then prettyprinted automatically by DMS to provide actual source code in the target language that corresponds to the target AST.
If you are already using ANTLR and have your AST ready you might want to take a look at StringTemplate:
http://www.antlr.org/wiki/display/ST/StringTemplate+Documentation
Also Section 9.6 of The Definitive ANTLR Reference: Building Domain-Specific Languages explains this:
http://www.pragprog.com/titles/tpantlr/the-definitive-antlr-reference
The free code samples are available at http://media.pragprog.com/titles/tpantlr/code/tpantlr-code.tgz. In the subfolder code\templates\generator\2pass\ you'll find an example converting mathematical expressions to java bytecode.
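For a flavor of the template approach, here is a minimal StringTemplate 4 sketch (the template text and attribute names are made up for illustration; ST4 uses < and > as its default delimiters):

    import org.stringtemplate.v4.ST;

    // Render a Java method stub from a template plus attributes pulled
    // from your AST; calling add() repeatedly aggregates into a list.
    public class TemplateDemo {
        public static void main(String[] args) {
            ST method = new ST(
                "public <returnType> <name>(<params; separator=\", \">) {\n" +
                "    // TODO implement\n" +
                "}\n");
            method.add("returnType", "int");
            method.add("name", "add");
            method.add("params", "int a");
            method.add("params", "int b");
            System.out.println(method.render());
            // -> public int add(int a, int b) { // TODO implement }
        }
    }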