I am developing an assistant that helps DBAs type database commands; these commands have many parameters, so an assistant would help a lot with their job. For this assistant I need the grammar of the commands, but database vendors (Oracle, DB2) do not provide that information in any machine-readable format; the only thing available is the documentation.
One example of a DB2 command is: http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.admin.cmd.doc/doc/r0001933.html
For this reason, I am trying to analyze the grammar diagrams, also called railroad diagrams (http://en.wikipedia.org/wiki/Syntax_diagram), but I have not found anything in Java that could help me. I would like a reverse-engineering tool that takes the ASCII (textual) representation of the grammar and creates a graph in Java. Then, with that graph, the assistant could propose options for the command currently being typed.
One example of the assistant: http://www.youtube.com/watch?v=5sBoUHJupvs
If you have information about how to analyze (not generate) grammar diagrams with Java, I would appreciate it.
The closest tool I've seen is the Grammar Recovery System by Ralf Lammel. It depends on the railroad diagrams being accessible as text strings, which is generally not how they are found. You appear to be lucky in the DB2 case, so Ralf's work points in the right direction.
Considering that such diagrams are usually rendered as just a set of pixels (PLSQL's are like this in the PDF files provided for documentation), you have several sets of problems: recognizing graphical entities from pixels, assembling them into actual representations of the railroad diagrams, and then using those representations in your assistant.
I think this is a long, hard, impractical approach. If you got it to work, you'd discover the diagrams are slightly wrong in many places (read Ralf's paper or find out the hard way), and therefore unusable for a tool that is supposed to produce the "right" stuff to help your DBAs.
Of course, you are objecting to the other long, hard, "impractical" approach of reading the documentation, producing grammars that match, and then validating those grammars against the real world. Yes, this is a tough slog too, but it actually does produce useful results. You need to find vendors that have done this and will make it available to you.
ANTLR.org offers a variety of grammars. Have you checked there?
My company offers grammars and tools for processing them. We have done this for PLSQL and SQL2011 but not yet DB2.
Given a grammar, you now need to use it to provide "advice" to your users. Your users aren't going to type in a complete "program"; they want to generate fragments (e.g., SELECT statements). Now you need a parser that will process grammar fragments and at least say "legal" or "not". Most won't do that. Our DMS Software Reengineering Toolkit will do that.
To provide advice, you need to be able to walk the grammar (much as you considered for the railroad diagrams) to compute "what is legal next". That's actually pretty hard (and in fact it is roughly equivalent to what an LR/GLR parser generator does when building its tables). Our DMS engine does this during syntax error repair by traversing its GLR parse tables (since that work is already encoded in the tables!). That's not easy to do, as it is a peculiar variant of the GLR parsing algorithm. You might do better with an Earley parser, which keeps around all possible parses as a set of choices; you could simply inspect each one.
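To make the "walk the grammar" idea concrete, here is a minimal sketch in Java of the simplest possible version: the railroad diagram modelled as a directed graph whose edges are labelled with tokens, and a lookup of the legal next tokens from the node reached by the input typed so far. All class and field names are hypothetical, and a real command grammar would also need loops, optional branches and sub-diagrams.

```java
import java.util.*;

class GrammarNode {
    // outgoing edges: token label -> node reached after that token
    final Map<String, GrammarNode> edges = new LinkedHashMap<>();

    GrammarNode on(String token, GrammarNode next) {
        edges.put(token, next);
        return this;
    }
}

class CommandAssistant {
    private final GrammarNode start;

    CommandAssistant(GrammarNode start) { this.start = start; }

    // Follow the tokens typed so far; return the tokens that may legally
    // come next, or an empty set if the prefix is not in the grammar.
    Set<String> legalNextTokens(List<String> typedSoFar) {
        GrammarNode current = start;
        for (String token : typedSoFar) {
            current = current.edges.get(token);
            if (current == null) return Collections.emptySet();
        }
        return current.edges.keySet();
    }
}
```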
But this looks like quite a lot of work, and I think you'll be surprised by the amount of machinery you need.
The best work in this area is Harmonia, which produces incremental editors for code. Our DMS engine's parser is based on earlier work done by this project, because we are interested in the incrementality aspect.
You can try using ANTLR http://www.antlr.org/
It will not be able to understand an ASCII representation of the grammar, but it is powerful enough to do anything else you need, if you don't mind spending the time to learn the software.
I'm trying to find information (documentation, advice, etc) on how certain IDE templates (e.g. in Eclipse, IntelliJ, and NetBeans) are instantiated internally by IDEs, and I'm having some trouble.
I'm hoping, perhaps optimistically, that I can automatically generate multiple (at least two) distinct samples of each pattern from templates written in the associated grammars.
Every pattern-parameter (including cursors) must be filled, and samples for the same pattern should only have non-pattern-parameter content in common.
At this stage, they need to be syntactically valid so that they can be parsed, but do not need to be fully semantically valid/compilable snippets.
If anyone knows how any of these IDEs work internally, and can tell me if/how I might be able to do this (or can point me towards sufficient documentation), I would greatly appreciate it.
Background/Context
I'm trying to create a research dataset for a pattern mining task - specifically, for mining code templates. I've been looking into it for some time and, as far as I'm aware, there isn't a suitable precedent dataset, so I have to make one.
Rather than painstakingly defining every feature of every pattern myself, I'm writing tools to partially automate the process: specifically, deriving candidate patterns from samples, and filtering out any candidates not observed in the actual corpus. The tools are input-language-agnostic, but I am initially targeting Java ASTs via the Eclipse JDT.
My thinking is that well-established patterns such as idioms and IDE code templates, from sufficiently reputable sources, are rational and intuitive pattern candidates with which I can, at least, evaluate recall. I can, and will, define some target-sample sets manually. However, I would prefer to generate them automatically, so that I can collect more complicated templates en masse (e.g. those published by IDE community members).
Thanks in advance,
Marcos C-S
Purely as a self-learning exercise, I'm trying to write a Java parser in Perl using the Parse::RecDescent module. I may later re-implement the parser using other tools like ANTLR, bison, etc.
But how would I ensure that my parser is indeed generating the correct parse, per the Java Language Specification? By that I mean correct handling of dangling elses, operator associativity and precedence, etc.
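For instance (just to illustrate what I mean by dangling elses), the specification fixes the parse of code like the following, where the else must bind to the innermost if; a parser that attaches it to the outer if would build a differently shaped, wrong tree:

```java
class DanglingElseExample {
    static int choose(boolean a, boolean b) {
        if (a)
            if (b) return 1;
            else return 2;   // per the JLS, this else belongs to "if (b)", not "if (a)"
        return 3;
    }
}
```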
One method would be to compare my parser against a known, bug-free parser by having both parsers generate ASTs for a very large number of test Java programs, and then comparing the two sets of ASTs.
If this is indeed the only method, where could I find a large suite of test Java programs thoroughly covering the entire Java Language Specification?
I have looked at JavaParser but it doesn't seem to have an exhaustive test dataset.
The other method would, of course, be writing tens of thousands of test Java programs by hand myself, which would be very impractical, not only time-wise but also in ensuring their exhaustiveness!
To decide if you have the right answer, you ideally have to compare to some kind of standard. This is hard for computer languages.
Comparing ASTs is going to be hard, because there are no standards for such. Each parser that builds ASTs, builds an AST whose structure is designed by the person that coded the parser.
That means if you build an AST-producing parser and you get somebody else's AST-producing parser, you'll discover that the AST nodes you have chosen don't match the other AST. Now you'll have to build a mapping from your AST to the other one (and how will you know the mapping is valid?). You can try to make your parser generate the same AST as another parser, but what you will discover is that the AST you produce is influenced by the parsing technology you use.
We have a similar problem with the Java front end my company produces (see bio if you want to know more). What we settle for is testing that the answer is self-consistent and then we do a lot of long-term experiential testing on big pieces of code.
Our solution is to:
Build a parser, using the strongest parsing technology we can get (GLR). This means we can recognize certain constructs not easily recognized by other parsing technologies (LL, LR, ...), and thus produce AST nodes that other parsers would have a hard time producing. (See the comments below for an example where this matters.) Even so, we produce AST nodes in a way that completely avoids our having to hand-code AST node construction as demanded by most other parsing technologies; hand-coding tends to produce somewhat different ASTs.
Parse a LOT of Java code (producing ASTs) to make sure we don't have parsing errors. [The JDK is a pretty good size example and it is easy to get]
Our tools can take an AST and regenerate (prettyprint) source code, complete with comments but with a perhaps somewhat different layout. We verify that parsed-then-prettyprinted code also parses. We then re-prettyprint the parsed prettyprinted version; this should be identical to the first prettyprinted version, since we always produce the same layout. This test is a good indication that our AST design and implementation doesn't lose anything about the source code. (A sketch of this check appears after these steps.)
Build symbol tables, resolve the meaning of names, and verify that the legal Java programs type-check according to our front end. That doesn't tell you anything about the nature of the AST except that it is good enough (which, in fact, is good enough!). Because the type-checking task is very complex (go check your local Java standard), it is also pretty fragile; if you don't have everything right, the type checking will likely fail when applied across a large body of code. Again, the JDK is a pretty good test of this. Note: a Java parser without name and type resolution isn't very useful in practice.
Produce JavaDoc-like cross references that include hyperlinked source code from the above results. This makes it easy to manually check a bit of code to see that name resolution (and therefore AST construction) is sane.
Live with the results, applying the front end to various program analyses and transformations of code. We find the occasional problem and fix it.
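A hedged sketch of the prettyprint round-trip check mentioned in the steps above; the FrontEnd interface, its parse() and prettyPrint() methods, and the Ast type are hypothetical stand-ins for whatever front end is being tested.

```java
interface Ast {}

interface FrontEnd {
    Ast parse(String source);
    String prettyPrint(Ast tree);
}

class RoundTripCheck {
    static void check(FrontEnd frontEnd, String original) {
        String printed1 = frontEnd.prettyPrint(frontEnd.parse(original));

        // The prettyprinted output must itself parse, and re-prettyprinting it
        // must reach a fixed point, or the AST/prettyprinter loses information.
        String printed2 = frontEnd.prettyPrint(frontEnd.parse(printed1));
        if (!printed1.equals(printed2)) {
            throw new AssertionError("prettyprinter or AST loses information");
        }
    }
}
```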
It is tough to get this right; you have to get close and keep the testing pressure on continuously, especially since Java the language keeps moving.
(We're at Java 8, and Java 9 is being threatened.) Bottom line: it is a lot of work to build such a parser and check its sanity.
We'd love to have an independent set of tests, but we haven't seen one in the wild. And I would expect that those tests, if they exist (I assume Oracle and IBM have them), don't really test parsing and name resolution directly, but rather test that some bit of code compiles and runs producing a known result. Since we aren't building a compiler, we wouldn't be able to run such tests if we had them. We would be able to do the name resolution and type consistency checks, and that would be helpful.
[We actually do this for a number of language front ends. You think Java is hard, try this with C++]
I agree with what #ira-baxter wrote.
In short, it is easy to verify that your parser can parse all valid source code (just throw a lot of code at it: we did that to test JavaParser; we just did not put the hundreds of megabytes of source code in the repository).
It is hard to verify that the AST you produce has the right shape, partly because there are many different ways of parsing the same code and building different but equally valid ASTs. Some shapes will be easier to work with than others, but there is no "truth" to adhere to.
If your tool produces an AST different from the one produced by another parser, it does not necessarily mean that either of the two parsers is wrong; it could just be a design choice.
I am using Java (but am open to solutions in other languages as well). I am looking at open source predictive modeling solutions for guessing what GUI/application features a user is interested in (I will have the specific user behavior data on the GUI/application). Instead of just looking at most used actions etc, should I possibly look at incorporating SVM or decision trees? I am looking at weka, mahout and jahmm - is there any other resource I can look at (specifically for GUI behavior - which hopefully returns results fast enough even if accuracy is reduced). Since I am not extremely knowledgeable about this field, please inquire about any information I may have left out to better ascertain a working solution. Thanks!
It's incredibly difficult to say, given that we don't know what data you're using (I don't know of existing software to do this, but it may very well exist). With respect to support vector machines, they are binary or one-versus-all classifiers, so I don't think they would be applicable here, if I understand your intentions correctly.
If you're unfamiliar with machine learning, Weka may be a good place for you to start. If you have supervised data, then you can feed all of your feature vectors with associated classification data into Weka and use cross-validation to see what type of technique suits you best. Additionally, you can use Weka to see if certain features are more important than others and do manual dimensionality reduction. Or of course, you can use one of Weka's dimensionality reduction techniques, but it may be difficult to decide which one if you don't know the assumptions that they make or how your data is related (this also applies to whatever prediction technique you try/use). Although, if you have enough time, you can just play around and manually just see what works best.
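As a minimal sketch of the Weka route, assuming you have supervised data exported as an ARFF file of feature vectors with the class label as the last attribute (the filename and the choice of a J48 decision tree here are just placeholders, nothing GUI-specific):

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class GuiActionPrediction {
    public static void main(String[] args) throws Exception {
        // Load the feature vectors; the last attribute is the class to predict
        Instances data = DataSource.read("gui-actions.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // A C4.5 decision tree as a first supervised baseline
        J48 tree = new J48();

        // 10-fold cross-validation to see how well these features predict the class
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```

Swapping in another classifier is a one-line change, which makes it easy to compare techniques before worrying about speed.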
I want to parse Java/COBOL/VB etc. code to collect information like variable names, methods, etc.
I am using a JavaCC grammar, but my problem is that if any exception occurs, the parser fails.
Apart from Java, I use JavaCC grammars for COBOL, VB, etc.
I do not want the parser to fail, so I am trying to read the Java code line by line to get the desired result.
Is there a better way to do the parsing without the parser failing on an exception?
Thanks in advance.
Parsers (and therefore the machinery provided by parser generators) must have some means to handle invalid source files. Thus each parser (and parser generator) has chosen some method to manage syntax errors.
An easy solution provided by most is simply to throw an exception when such an error is encountered. The user code invoking the parser has to catch this exception and live with an aborted parse. A more sophisticated solution is to have the parser report a syntax error, but recover from the error and continue parsing; any such recovery must handle AST building with some kind of marker on the nodes at the point of error if you also hope to get a usable tree. Most parser generators will offer some kind of syntax recovery, but leave you on your own to handle AST building in the face of such errors. Such parser recovery logic is relatively hard to build, and you likely can't do it yourself without becoming an expert in parser error recovery and making custom changes to the particular parser generator's support code.
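As a minimal sketch of the "catch the exception" approach with a JavaCC-generated parser (the parser class name JavaGrammar and its top-level production CompilationUnit are hypothetical; they depend on your .jj file), the idea is simply to report the error and move on to the next file rather than letting the whole run abort:

```java
import java.io.FileReader;

public class BatchParse {
    public static void main(String[] args) throws Exception {
        for (String path : args) {
            try (FileReader reader = new FileReader(path)) {
                JavaGrammar parser = new JavaGrammar(reader);
                parser.CompilationUnit();           // top-level production of the grammar
                System.out.println("parsed OK: " + path);
            } catch (ParseException e) {
                // JavaCC-generated parsers throw ParseException on syntax errors;
                // report it and skip this file instead of aborting the batch.
                System.err.println("syntax error in " + path + ": " + e.getMessage());
            }
        }
    }
}
```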
I'm not specifically familiar with JavaCC (or most of the other parser generators out there), so I don't know if it does this. Obviously, check the docs. If the error handling support you want isn't there, move on to another generator that has it.
I suspect your real problem will actually be getting grammars appropriate for your task. Nobody has "Java" or "COBOL"; they have a specific dialect, e.g., Java 1.5 or IBM Enterprise COBOL or VB6. In my long experience, these dialects differ from the imagined base language more than you expect. You can hope that such grammars as you can obtain will work (including error recovery) and enable you to parse the various dialects in spite of such differences, but generally you'll get large numbers of errors from code in one dialect that doesn't match the grammar you are using. (What will you do with the card numbers out past column 72 in your IBM Enterprise COBOL code that has EBCDIC source files?) So you really want a tool that has lots of parsers that handle the various dialects, and that should drive your choice IMHO.
I think ANTLR has a lot of language definitions (more than JavaCC), so it kind of qualifies. However, many of the grammars at that site are experimental or unfinished (some are pretty good), so it's kind of pot luck.
Our DMS Software Reengineering Toolkit has a lot of grammars, and we consider it our task to make these production quality. We aren't perfect either, but our grammars tend to have been tested on large bodies of code and have support for various dialects. Error recovery is built in, and you do get a tree back (with error nodes in the AST) if the number of errors is less than a specified threshold. DMS also handles nasty issues such as character encodings (we handle a wide variety, including 80-column EBCDIC with card numbers in column 72 for IBM COBOL and JCL). DMS may not be what you want; for instance, it is not Java based. But we try to make up for that by providing a huge amount of machinery to support post-parsing tasks, such as what you want to do. This machinery includes support for building symbol tables, extracting control and data flows, matching patterns and applying source-to-source transformations, etc. But that's a tradeoff for you to make.
You could attempt to modify the grammar, but why not just work from an AST such as that available from Eclipse?
In the end this will likely be more reliable than most grammars you'll find on the net.
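If working from the Eclipse AST is acceptable, a minimal sketch using the Eclipse JDT's ASTParser to collect method and variable names might look like this (it assumes org.eclipse.jdt.core is on the classpath; the source string is just a placeholder):

```java
import org.eclipse.jdt.core.dom.AST;
import org.eclipse.jdt.core.dom.ASTParser;
import org.eclipse.jdt.core.dom.ASTVisitor;
import org.eclipse.jdt.core.dom.CompilationUnit;
import org.eclipse.jdt.core.dom.MethodDeclaration;
import org.eclipse.jdt.core.dom.VariableDeclarationFragment;

public class JdtNameCollector {
    public static void main(String[] args) {
        String source = "class Demo { int count; void run() { int i = 0; } }";

        ASTParser parser = ASTParser.newParser(AST.JLS8);
        parser.setSource(source.toCharArray());
        CompilationUnit unit = (CompilationUnit) parser.createAST(null);

        // Walk the tree and pick out the names we care about
        unit.accept(new ASTVisitor() {
            @Override
            public boolean visit(MethodDeclaration node) {
                System.out.println("method: " + node.getName());
                return true;
            }
            @Override
            public boolean visit(VariableDeclarationFragment node) {
                System.out.println("variable: " + node.getName());
                return true;
            }
        });
    }
}
```

Of course this only covers Java; there is no JDT-style AST for COBOL or VB, which is the catch with this approach.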
I've had great success with SableCC for Java. It's surprisingly easy to use and has a Java 1.5 grammar available.
I'm working on a pretty complex DSL that I want to compile down into a few high level languages. The whole process has been a learning experience. The compiler is written in java.
I was wondering if anyone knew a best practice for the design of the code generator portion. I currently have everything parsed into an abstract syntax tree.
I was thinking of using a template system, but I haven't researched that direction too far yet as I would like to hear some wisdom first from stack overflow.
Thanks!
When I was doing this back in my programming languages class, we ended up using emitters based on following the visitor pattern. It worked pretty well - makes retargeting it to new output languages pretty easy, as long as your AST matches what you're printing fairly well.
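As a rough illustration of the emitter idea (not your AST, of course; the node classes here are made up), each target language gets its own visitor and the AST stays untouched:

```java
interface NodeVisitor {
    void visit(NumberLiteral node);
    void visit(BinaryExpr node);
}

abstract class Node {
    abstract void accept(NodeVisitor visitor);
}

class NumberLiteral extends Node {
    final int value;
    NumberLiteral(int value) { this.value = value; }
    void accept(NodeVisitor visitor) { visitor.visit(this); }
}

class BinaryExpr extends Node {
    final Node left, right;
    final String op;
    BinaryExpr(Node left, String op, Node right) {
        this.left = left; this.op = op; this.right = right;
    }
    void accept(NodeVisitor visitor) { visitor.visit(this); }
}

// One emitter per target language; retargeting means writing a new visitor.
class JavaEmitter implements NodeVisitor {
    private final StringBuilder out = new StringBuilder();

    public void visit(NumberLiteral node) { out.append(node.value); }

    public void visit(BinaryExpr node) {
        out.append('(');
        node.left.accept(this);
        out.append(' ').append(node.op).append(' ');
        node.right.accept(this);
        out.append(')');
    }

    String result() { return out.toString(); }
}
```

This works well when each DSL construct maps fairly directly onto the target language; more complicated mappings need the heavier machinery discussed in other answers.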
What you really want is a program transformation system that maps syntax structures in one language (your DSL) into syntax patterns in other languages. Such a tool can carry out arbitrary transformations during code generation (tree rewrites generalize string rewrites, which are Post systems, which are fully Turing-capable), which means that what you generate and how sophisticated your generation process is are determined only by your ambition, not by "code generator framework" properties.
Sophisticated program transformation systems combine various types of scoping, flow analysis and/or custom analyzers to enable the transformations. This doesn't add any theoretical power, but it adds a lot of practical power: most real languages (even DSLs) have namespaces and control and data flow, need type inference, etc.
Our DMS Software Reengineering Toolkit is this type of transformation system. It has been used to analyze/transform both conventional languages and DSLs, for simple and complex languages, and for small, large and even huge software systems.
Regarding the OP's comments about "turning the AST into other languages": DMS accomplishes that by writing transformations that map surface syntax for the DSL (implemented behind the scenes using the DSL's AST) to surface syntax for the target language (implemented using target-language ASTs). The resulting target-language AST is then prettyprinted automatically by DMS to provide actual source code in the target language that corresponds to the target AST.
If you are already using ANTLR and have your AST ready you might want to take a look at StringTemplate:
http://www.antlr.org/wiki/display/ST/StringTemplate+Documentation
Also Section 9.6 of The Definitive ANTLR Reference: Building Domain-Specific Languages explains this:
http://www.pragprog.com/titles/tpantlr/the-definitive-antlr-reference
The free code samples are available at http://media.pragprog.com/titles/tpantlr/code/tpantlr-code.tgz. In the subfolder code\templates\generator\2pass\ you'll find an example converting mathematical expressions to Java bytecode.
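To give a flavour of the template route, here is a tiny sketch using the current StringTemplate 4 API (the linked wiki documents the older v3 syntax, which differs slightly; the template and attribute names below are made up, and in practice the attribute values would come from your AST):

```java
import org.stringtemplate.v4.ST;

public class EmitWithTemplate {
    public static void main(String[] args) {
        // A target-language skeleton with holes for pieces taken from the AST
        ST method = new ST(
            "public <type> <name>() {\n" +
            "    return <value>;\n" +
            "}\n");
        method.add("type", "int");
        method.add("name", "answer");
        method.add("value", "42");
        System.out.println(method.render());
    }
}
```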