I am trying to write a lexical analyzer in Java. The program must implement tokenization. I have beginner-level knowledge of compiler programming. I know that many lexer generators exist on the internet, and I can use them to test my own lexical analyzer's output, but I need to write my own. Can anyone point me to good references, articles, or ideas to get started with my coding?
"Compilers Principles, Techniques and Tools" by Aho Sethi and Ullman has a chapter on lexical analysers. It includes a lot of the theory on regular expressions and finite automata that are core to this problem domain.
I would try taking a look at the source code for some of the better ones out there. I have used SableCC in the past. If you go to this page describing how to set up your environment, there is a link to the source code for it. ANTLR is also a very commonly used one. Here is the source code for it.
Also, The Dragon Book is really good.
As suggested by SK-logic, I am adding Modern Compiler Implementation as another option.
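To give an idea of what "tokenization" means in code, here is a minimal hand-written lexer sketch. It is only an illustration: the class name `SimpleLexer`, the three token types, and the tiny symbol set are my own assumptions, not something taken from the books above. Each branch of the loop plays the role of one state of a small finite automaton.

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleLexer {
    // Hypothetical token categories chosen for this sketch only.
    public enum Type { IDENTIFIER, NUMBER, SYMBOL }

    public static final class Token {
        public final Type type;
        public final String text;

        Token(Type type, String text) {
            this.type = type;
            this.text = text;
        }

        @Override
        public String toString() {
            return type + "(" + text + ")";
        }
    }

    /** Splits the input into tokens, skipping whitespace. */
    public static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            int start = i;
            if (Character.isLetter(c) || c == '_') {
                // Identifier: a letter or underscore followed by letters, digits, underscores.
                while (i < input.length()
                        && (Character.isLetterOrDigit(input.charAt(i)) || input.charAt(i) == '_')) {
                    i++;
                }
                tokens.add(new Token(Type.IDENTIFIER, input.substring(start, i)));
            } else if (Character.isDigit(c)) {
                // Number: one or more digits (integers only in this sketch).
                while (i < input.length() && Character.isDigit(input.charAt(i))) {
                    i++;
                }
                tokens.add(new Token(Type.NUMBER, input.substring(start, i)));
            } else if ("+-*/=();".indexOf(c) >= 0) {
                // Single-character symbols.
                i++;
                tokens.add(new Token(Type.SYMBOL, String.valueOf(c)));
            } else {
                throw new IllegalArgumentException("Unexpected character '" + c + "' at index " + i);
            }
        }
        return tokens;
    }
}
```

Calling `SimpleLexer.tokenize("x = 42 + y1;")` yields six tokens: an identifier, a symbol, a number, a symbol, an identifier and a symbol. A real lexer would also need keywords, string literals, comments and position tracking.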
Related
I'm writing a bachelor's thesis on "Analysis of the source code in Java applications". There are a few points that the written part must cover. One of them is "a brief description of the grammar and writing Java." Since this is a bachelor's thesis, the sources of information must be verifiable: books, the official Java site, etc. Unfortunately, I cannot find this information on the Java website (maybe I just haven't looked carefully enough). If possible, I would find it easier to use online resources than books.
Can anyone advise me where I can find this information in a verified form? Of course, some of our courses at school taught the syntax and semantics of Java, but that does not seem like an "official source".
Thank you all.
Use the official Java Specification from Oracle.
http://docs.oracle.com/javase/specs/jls/se7/html/jls-2.html
I think that Code Conventions for the Java Programming Language is a good place to start.
For static code analysis in Java you can find some automated tools.
Static Code Analysis Tools
Have you tried Google Scholar?
Scholar restricts its results to scholarly literature.
For instance, you might like this article.
I am developing an assistant to type database commands for DBAs, because these commands have many parameters, and an assistant will help a lot with their job. For this assistant, I need the grammar of the commands, but database vendors (Oracle, DB2) do not provide that information in any format, the only thing is the documentation.
One example of a DB2 command is: http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.admin.cmd.doc/doc/r0001933.html
For this reason, I am trying to analyze the grammar diagram or railroad diagram (http://en.wikipedia.org/wiki/Syntax_diagram), but I have not found anything in Java that could help me. I would like some re-engineering (reverse) tool that takes the ASCII (textual representation) of the grammar, and creates a graph in Java. Then, with the graph in Java, the assistant could propose options of the current typed command.
An example of the assistant: http://www.youtube.com/watch?v=5sBoUHJupvs
If you have information about how to analyze (not generate) grammar diagrams with Java, I would appreciate it.
The closest tool I've seen is Grammar Recovery System by Ralf Lammel. It depends on accessibility of railroad diagrams as text strings. That's generally not how they are found. You appear to be lucky in the DB2 case; Ralf's work points in the right direction.
Considering that such diagrams are usually rendered as just a set of pixels (PLSQL's are like this in the PDF files provided for documentation), you have several sets of problems: recognizing graphical entities from pixels, assembling them into actual representations of the railroad diagrams, and then using those in your assistant.
I think this is a long, hard, impractical approach. If you got it to work, you'd discover the diagrams are slightly wrong in many places (read Ralf's paper or find out the hard way), and therefore unusable for a tool that is supposed to produce the "right" stuff to help your DBAs.
Of course, you are objecting to the other long, hard, "impractical" approach of reading the documentation and producing grammars that match, and then validating those grammars against the real world. Yes, this is a tough slog too, but it actually does produce useful results. You need to find vendors that have done this and will make it available to you.
ANTLR.org offers a variety of grammars. Have you checked there?
My company offers grammars and tools for processing them. We have done this for PLSQL and SQL2011 but not yet DB2.
Given a grammar, you now need to use it to provide "advice" to your users. Your users aren't going to type in a complete "program"; they want to generate fragments (e.g., SELECT statements). Now you need a parser that will process grammar fragments and at least say "legal" or "not". Most won't do that. Our DMS Software Reengineering Toolkit will do that.
To provide advice, you need to be able to walk the grammar (much as you considered for railroad diagrams) to compute "what is legal next". That's actually pretty hard (and in fact it is roughly equivalent to what an LR/GLR parser generator does when building tables). Our DMS engine does that during syntax error repair by traversing its GLR parse tables (since that work is already encoded in the tables!). That's not easy to do, as it is a peculiar variant of the GLR parsing algorithm. You might do better with an Earley parser, which keeps around all possible parses as a set of choices; you could simply inspect each one.
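To make "what is legal next" concrete, here is a deliberately naive sketch that treats a single railroad diagram as a directed graph of keywords and walks it. This is not the GLR or Earley approach described above, and it ignores nested sub-diagrams (non-terminals), which is where the real difficulty lies. The class `SyntaxGraph`, its methods, and the toy command used to exercise it are all hypothetical; the command is not real DB2 syntax.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class SyntaxGraph {
    // node id -> ids of the nodes that may follow it in the diagram
    private final Map<String, List<String>> edges = new HashMap<>();
    // node id -> keyword shown in that box of the diagram
    private final Map<String, String> labels = new HashMap<>();

    public void node(String id, String keyword) {
        labels.put(id, keyword);
    }

    public void edge(String from, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    /**
     * Follows the typed keywords from the (unlabelled) start node and returns
     * the keywords that could legally come next, sorted alphabetically.
     * A set of current nodes is kept so that branching diagrams work.
     */
    public Set<String> suggest(String startId, String... typed) {
        Set<String> current = new HashSet<>(Collections.singleton(startId));
        for (String word : typed) {
            Set<String> next = new HashSet<>();
            for (String id : current) {
                for (String succ : edges.getOrDefault(id, Collections.emptyList())) {
                    if (word.equalsIgnoreCase(labels.get(succ))) {
                        next.add(succ);
                    }
                }
            }
            current = next; // becomes empty if the typed word was not legal here
        }
        Set<String> suggestions = new TreeSet<>();
        for (String id : current) {
            for (String succ : edges.getOrDefault(id, Collections.emptyList())) {
                suggestions.add(labels.get(succ));
            }
        }
        return suggestions;
    }
}
```

With a made-up diagram start -> BACKUP -> DATABASE -> (ONLINE ->) TO, calling `suggest("start", "BACKUP", "DATABASE")` returns both ONLINE and TO, because the ONLINE box is optional. User-supplied values such as names and paths would need placeholder nodes that match any word.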
But this looks like quite a lot of work, and I think you'll be surprised by the amount of machinery you need.
The best work in this area is Harmonia, which produces incremental editors for code. Our DMS engine's parser is based on earlier work done by this project, because we are interested in the incrementality aspect.
You can try using ANTLR http://www.antlr.org/
It will not be able to understand an ASCII representation of the grammar, but it is powerful enough to do anything else you need, if you don't mind spending the time to learn the software.
We inherited some legacy code that has a whole lot of code copy/pasted across projects. Is there a way to find these duplicates? PMD can only do a single project.
Summary
There is also CloneDetective, Simian and Simscan. This paper from the International Conference on Software Engineering 2009 compares them, and PMD's CPD.
In detail
One tool that can handle several languages is CloneDetective (based on ConQuat, Continuous Quality Assessment Toolkit): ABAP, ADA, Java, C#, C/C++, Visual Basic, Cobol, PL1.
Another tool is Simian, the Similarity Analyser, which identifies duplication in Java, C#, C, C++, COBOL, Ruby, JSP, ASP, HTML, XML, Visual Basic, Groovy source code and even plain text files. It runs on JVM and .NET.
Actually, if you look at .NET, there are a lot of copy paste detection tools...
SimScan, the SimilarityScanner, is an Eclipse/IDEA/JBuilder plugin that finds duplicated or similar fragments of code in large Java source code bases. I don't know it, and have no idea what "similar fragments" means. It sounds like it might also just look at single projects in isolation, but the IntelliJ screenshots look nifty.
This paper from the International Conference on Software Engineering 2009 compares CloneDetective, PMD's CPD, Simian and Simscan.
Just as PMD's copy & paste finder is actually called CPD for "copy paste detector", using that term as the terminus technicus for googling helps. Another term often used is "clone detection".
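To illustrate what "clone detection" boils down to in its simplest form, here is a naive sketch that compares windows of whitespace-normalized lines across any number of files. This is my own toy approach, not the algorithm used by CPD, Simian or any other tool named here; the class `NaiveCloneFinder` and its parameters are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NaiveCloneFinder {

    /**
     * Takes file name -> file content and a window size in lines.
     * Returns each window of normalized lines that occurs more than once,
     * mapped to the "file:line" locations where it starts.
     */
    public static Map<String, List<String>> findClones(Map<String, String> sources, int window) {
        Map<String, List<String>> seen = new LinkedHashMap<>();
        for (Map.Entry<String, String> source : sources.entrySet()) {
            List<String> lines = new ArrayList<>();
            List<Integer> lineNumbers = new ArrayList<>();
            String[] raw = source.getValue().split("\n");
            for (int i = 0; i < raw.length; i++) {
                // Normalize by dropping all whitespace; blank lines are skipped.
                String normalized = raw[i].replaceAll("\\s+", "");
                if (!normalized.isEmpty()) {
                    lines.add(normalized);
                    lineNumbers.add(i + 1);
                }
            }
            // Slide a fixed-size window over the normalized lines.
            for (int i = 0; i + window <= lines.size(); i++) {
                String key = String.join("\n", lines.subList(i, i + window));
                seen.computeIfAbsent(key, k -> new ArrayList<>())
                    .add(source.getKey() + ":" + lineNumbers.get(i));
            }
        }
        // Keep only windows that were seen at least twice.
        seen.values().removeIf(locations -> locations.size() < 2);
        return seen;
    }
}
```

Because the input is just a map of file contents, files from several projects can be fed in together. Unlike token-based or syntax-tree-based detectors, this sketch is fooled by renamed variables, comments and reformatted statements.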
You could try using the command line version of PMD CPD:
http://pmd.sourceforge.net/cpd.html
You should be able to specify multiple source trees to check.
Simian, which is the other prominent copy/paste detector, has similar command line capabilities.
See our Java CloneDR, a tool for finding duplicated code across large sets of code.
CloneDR finds exact and near-miss clones using the structure of the code (abstract syntax trees) as a guide, so it isn't confused by whitespace or comment changes. For detected clones, it shows you the clone instances, and a parameterized generalization that you can use as the basis of replacement abstraction (in Java, that's pretty much done by making a method; other languages have other techniques).
Another poster references a technical paper comparing clone detectors. If you examine the paper, reference number [1] is to CloneDR. The authors of that paper do not compare their detector against CloneDR, as their detector only uses tokens, not the more sophisticated method CloneDR has that uses language structure.
CloneDR works for a variety of languages: Java, C#, C++, COBOL, JavaScript, PHP, many others.
To handle multiple projects, you just tell CloneDR the set of files in all the projects.
If you can put those projects into one Eclipse workspace, Codepro Analytix will happily consume all of them together: https://code.google.com/javadevtools/codepro/doc/index.html
Sonar is pretty neat for this kind of thing. I really like all the indicators you can have...
If you are looking for an Eclipse plugin, check out UCDetector: Unnecessary Code Detector
I wish to create an app that translates input Java code into HTML-formatted Java code.
For example:
public class ReadWithScanner
Would become
<span class="public">public</span> <span class="class">class</span> ReadWithScanner
However, it gets quite complicated when it comes to parameters and regular expressions. Now I have a bit of time on my hands, and I wish to write my own code parser.
How would I start this? And are there any tutorials or online content that would help me not only write this, but also understand it?
Thanks
For help with the complexity of parsing, you'll need to rely on the Java Language Specification.
As I seem to recall, Java is an LL(k) language (see here, for instance). However, the Java language, despite all attempts to keep it "compact", is still quite large and complex. The grammar is spread out over the entire document. This is not a project for the faint of heart. You might consider using a Java parsing tool (like Java-front).
What you need to do is use ANTLR. It already has grammars for parsing Java, so you just need to supply your own templates to output whatever you want from the Abstract Syntax Tree you generate with ANTLR.
If you need a resource for learning about parsers, I can recommend Basics of Compiler Design, which is available as a free download.
It covers more than just parsers, but if you read the first few chapters, you should have a good basic understanding of both lexers and parsers.
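For plain keyword highlighting, a lexer alone already gets you quite far. Below is a minimal sketch, assuming a hypothetical class `JavaToHtml`, a deliberately incomplete keyword list, and the `<span class="keyword">` convention from the question. It tokenizes with one regular expression so that keywords inside string literals are left alone. Comments and character literals are not handled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JavaToHtml {
    // Incomplete keyword list, for illustration only.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "public", "private", "class", "static", "void", "int", "return", "if", "else", "new"));

    // Alternatives in order: string literal | identifier or keyword | any other single character.
    private static final Pattern TOKEN = Pattern.compile(
            "\"(?:\\\\.|[^\"\\\\])*\"|[A-Za-z_$][A-Za-z0-9_$]*|[\\s\\S]");

    /** Wraps each keyword in a span whose CSS class is the keyword itself. */
    public static String highlight(String code) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(code);
        while (m.find()) {
            String token = m.group();
            if (KEYWORDS.contains(token)) {
                out.append("<span class=\"").append(token).append("\">")
                   .append(token).append("</span>");
            } else {
                out.append(escape(token));
            }
        }
        return out.toString();
    }

    // Escapes the characters that would otherwise be read as HTML markup.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

`JavaToHtml.highlight("public class ReadWithScanner")` produces the markup shown in the question. Anything that depends on structure (telling a parameter from a local variable, for example) needs a real parser, as the other answers say.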
I think you need a lexical analyzer.
I have used the Flex lexical analyzer before. It is not too complicated to use.
If you need to parse the analyzed text, you can use Bisonc++:
bisoncpp.sourceforge.net/
(C++ knowledge and a Linux environment are needed.)
I would like to create a source code analyser for Java projects (like FindBugs and other static analysis programs) that would be able to detect certain method calls.
I would prefer to do it using Python, but any advice would be great!
I'm going to start by studying the FindBugs source code, but if anyone could explain the underlying concepts to me and tell me whether it's easily doable, I would be really grateful.
Thank you for your time.
Olivier.
Read the book Language Implementation Patterns. It is very accessible, and it will help you gauge the effort required to achieve what you want.
http://www.pragprog.com/titles/tpdsl/language-implementation-patterns
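As a rough idea of the underlying concept (parse the source into a tree, then visit the nodes you care about), here is a minimal sketch in Java using the compiler tree API that ships with the JDK (`com.sun.source`). This is not how FindBugs works, since FindBugs analyses bytecode; the class `MethodCallFinder` and its method are hypothetical. It needs a full JDK, not just a JRE, and it matches calls by simple name only, without resolving types.

```java
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.ExpressionTree;
import com.sun.source.tree.IdentifierTree;
import com.sun.source.tree.MemberSelectTree;
import com.sun.source.tree.MethodInvocationTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.TreeScanner;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

public class MethodCallFinder {

    /** Returns the source text of every call whose simple method name equals methodName. */
    public static List<String> findCalls(String source, String methodName) {
        // Wrap the source string so the compiler can read it like a file.
        JavaFileObject file = new SimpleJavaFileObject(
                URI.create("string:///Input.java"), JavaFileObject.Kind.SOURCE) {
            @Override
            public CharSequence getCharContent(boolean ignoreEncodingErrors) {
                return source;
            }
        };
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler(); // null on a bare JRE
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, null, null, null, Collections.singletonList(file));
        List<String> found = new ArrayList<>();
        try {
            // Parse only: no type checking or code generation is performed.
            for (CompilationUnitTree unit : task.parse()) {
                new TreeScanner<Void, Void>() {
                    @Override
                    public Void visitMethodInvocation(MethodInvocationTree call, Void unused) {
                        ExpressionTree select = call.getMethodSelect();
                        String name = null;
                        if (select instanceof MemberSelectTree) {
                            // Qualified call such as System.exit(1)
                            name = ((MemberSelectTree) select).getIdentifier().toString();
                        } else if (select instanceof IdentifierTree) {
                            // Unqualified call such as foo()
                            name = ((IdentifierTree) select).getName().toString();
                        }
                        if (methodName.equals(name)) {
                            found.add(call.toString());
                        }
                        // Keep descending so nested calls are visited too.
                        return super.visitMethodInvocation(call, unused);
                    }
                }.scan(unit, null);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }
}
```

For example, `MethodCallFinder.findCalls(src, "exit")` lists the calls named `exit` in `src`. Telling `System.exit` apart from some other `exit` method would require the type-attribution phase of the compiler, or a bytecode-level analysis like the one FindBugs performs.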