Java tool to find - copy/ paste code across projects


We inherited some legacy code that has a whole lot of code copy/pasted across projects. Is there a way to find these duplicates? PMD can handle a single project.

Summary
There is also CloneDetective, Simian and Simscan. This paper from the International Conference on Software Engineering 2009 compares them, and PMD's CPD.
In detail
One tool that can handle several languages is CloneDetective (based on ConQAT, the Continuous Quality Assessment Toolkit): ABAP, ADA, Java, C#, C/C++, Visual Basic, Cobol, PL1.
Another tool is Simian, the Similarity Analyser, which identifies duplication in Java, C#, C, C++, COBOL, Ruby, JSP, ASP, HTML, XML, Visual Basic, Groovy source code and even plain text files. It runs on JVM and .NET.
Actually, if you look at .NET, there are a lot of copy paste detection tools...
SimScan, the SimilarityScanner, is an Eclipse/IDEA/JBuilder plugin that finds duplicated or similar fragments of code in large Java source code bases. I don't know it, and have no idea what "similar fragments" means. It sounds like it might also look only within single projects in isolation, but the IntelliJ screenshots look nifty.
This paper from the International Conference on Software Engineering 2009 compares CloneDetective, PMD's CPD, Simian and Simscan.
Just as PMD's copy & paste finder is actually called CPD for "copy paste detector", using that term as the terminus technicus for googling helps. Another term often used is "clone detection".

You could try using the command line version of PMD CPD:
http://pmd.sourceforge.net/cpd.html
You should be able to specify multiple source trees to check.
Simian, the other prominent copy/paste detector, has similar command-line capabilities.
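To make the "multiple source trees" idea concrete, here is a rough sketch of what a cross-project duplicate search does in principle. This is not how CPD or Simian work internally (CPD is token-based); it simply slides a window over the non-blank, trimmed lines of every .java file under the given roots and reports windows that occur more than once. The class name, the window size of 6 and the assumption of UTF-8 sources are all my own choices for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CrossProjectDuplicates {

    /** Number of consecutive non-blank lines that must match (arbitrary choice). */
    public static final int WINDOW = 6;

    /** Indexes every .java file under all roots, so matches can span projects. */
    public static Map<String, List<String>> index(List<Path> roots) throws IOException {
        Map<String, List<String>> seen = new LinkedHashMap<>();
        for (Path root : roots) {
            List<Path> files;
            try (Stream<Path> walk = Files.walk(root)) {
                files = walk.filter(p -> p.toString().endsWith(".java"))
                            .sorted()
                            .collect(Collectors.toList());
            }
            for (Path file : files) {
                // Assumption: sources are UTF-8; readAllLines throws otherwise.
                addWindows(file.toString(), Files.readAllLines(file), seen);
            }
        }
        return seen;
    }

    /** Records each window of WINDOW trimmed, non-blank lines with its "file:line" location. */
    public static void addWindows(String name, List<String> lines, Map<String, List<String>> seen) {
        List<String> text = new ArrayList<>();
        List<Integer> lineNo = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String trimmed = lines.get(i).trim();
            if (!trimmed.isEmpty()) {
                text.add(trimmed);
                lineNo.add(i + 1); // remember the original 1-based line number
            }
        }
        for (int i = 0; i + WINDOW <= text.size(); i++) {
            String key = String.join("\n", text.subList(i, i + WINDOW));
            seen.computeIfAbsent(key, k -> new ArrayList<>()).add(name + ":" + lineNo.get(i));
        }
    }

    /** Keeps only the windows that were seen in more than one place. */
    public static Map<String, List<String>> duplicates(Map<String, List<String>> seen) {
        Map<String, List<String>> dups = new LinkedHashMap<>();
        seen.forEach((key, locations) -> {
            if (locations.size() > 1) {
                dups.put(key, locations);
            }
        });
        return dups;
    }

    public static void main(String[] args) throws IOException {
        List<Path> roots = new ArrayList<>();
        for (String arg : args) {
            roots.add(Paths.get(arg));
        }
        duplicates(index(roots)).values()
                .forEach(locations -> System.out.println("duplicate block at " + locations));
    }
}
```

You would run it with one argument per project source directory. Because it compares text, renamed variables defeat it, and overlapping windows are reported separately, which is exactly why a real detector is the better choice for anything beyond a quick look.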

See our Java CloneDR, a tool for finding duplicated code across large sets of code.
CloneDR finds exact and near-miss clones using the structure of the code (abstract syntax trees) as a guide, so it isn't confused by whitespace or comment changes. For detected clones, it shows you the clone instances, and a parameterized generalization that you can use as the basis of replacement abstraction (in Java, that's pretty much done by making a method; other languages have other techniques).
Another poster references a technical paper comparing clone detectors. If you examine the paper, reference number [1] is to CloneDR. The authors of that paper do not compare their detector against CloneDR, as their detector only uses tokens, not the more sophisticated method CloneDR has that uses language structure.
CloneDR works for a variety of languages: Java, C#, C++, COBOL, JavaScript, PHP, many others.
To handle multiple projects, you just tell CloneDR the set of files in all the projects.

If you can put those projects into one Eclipse workspace, Codepro Analytix will happily consume all of them together: https://code.google.com/javadevtools/codepro/doc/index.html

Sonar is pretty neat to do this kind of thing. I really like all the indicators you can have...

If you are looking for an Eclipse plugin, check out UCDetector: Unnecessary Code Detector

Related

Android how to read obfuscated Java code after getting through reverse engineering

I got the Java classes from an APK using tools like dex2jar and JD-GUI. As everybody knows, Java bytecode can be converted back to Java classes, so it is usually optimized and obfuscated by some tool (ProGuard, in the case of Android) to protect it from others. So what I got is obfuscated code, and I want to make it error-free, readable and understandable so that I can modify it further for my own purposes (for personal use only; I don't mean to violate any copyrights). Any help, i.e. advice, tools or reference material that brings this obfuscated code closer to what the developer wrote, or makes it error-free and understandable, will help me a lot. Currently my focus is on reversing the obfuscation techniques used by ProGuard. For example, when I tried reverse engineering my own projects I found that:
int resource values can be replaced with their IDs by matching them against the R file that is generated by reverse engineering.
The if/else conditions are mostly converted to while(true) with some continues and breaks.
Inner classes are mostly broken up into separate files.
So any other techniques, and any material that describes how to properly reverse the ones mentioned above, would be very helpful.

There isn't a magical tool that will refactor obfuscated code into a buildable project. Most likely, you won't be able to decompile and de-obfuscate an APK to be clean and maintainable code. This is a good thing.
There are tools which are better than dex2jar and jd-gui. One of them is apk-deguard, which claims to reverse the process of obfuscation. From their about page:
DeGuard
DeGuard (http://www.apk-deguard.com) is a novel system for statistical
deobfuscation of Android APKs, developed at the Software Reliability
Lab, ETH Zurich, the same group which developed the widely used JSNice
system. Similarly to JSNice, DeGuard is based on powerful
probabilistic graphical models learned from thousands of open source
programs. Using these models, DeGuard recovers important information
in Android APKs, including method and class names as well as
third-party libraries. DeGuard can reveal string decoders and classes
that handle sensitive data in Android malware.
You should use Enjarify, which is owned by Google, instead of dex2jar. Also, apktool is good for decompiling an APK's resources, which is not handled by dex2jar and enjarify.
Other tools include jadx, procyon, fernflower, show-java, smali/baksmali.
You will need a good IDE for refactoring. JEB looks like a good tool for refactoring. This is a paid tool mostly used by Android security researchers.

This should help:
DeObfuscator

Reverse engineering is a difficult task (I would say a subtle art) and mostly hit and miss, especially with obfuscated code. What you can do is focus on one particular function whose purpose seems pretty obvious and start from there, renaming and refactoring classes. A good IDE may also help you a lot (my personal recommendation: NetBeans).

Ideas for creating a lexical analyzer program using Java

I am trying to create a lexical analyzer program using Java. The program must implement tokenization. I have beginner-level knowledge of compiler programming. I know there are a lot of lexer generators available on the internet, and I can use them to test my own lexical analyzer's output, but I need to write my own lexical analyzer. Can anyone please give me some good references, articles or ideas to start my coding?

"Compilers Principles, Techniques and Tools" by Aho Sethi and Ullman has a chapter on lexical analysers. It includes a lot of the theory on regular expressions and finite automata that are core to this problem domain.

I would try taking a look at the source code for some of the better ones out there. I have used SableCC in the past. If you go to this page describing how to set up your environment, there is a link to the source code for it. ANTLR is also a really commonly used one. Here is the source code for it.
Also, the Dragon Book is really good.
As suggested by SK-logic, I am adding Modern Compiler Implementation as another option.
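Since the question is about writing the lexer by hand, here is a minimal sketch of the usual starting point: a loop that looks at the current character, decides which kind of token begins there, and consumes characters until that token ends. The token types, the tiny expression language and the class name are assumptions made up for this example; a real lexer would add keywords, string literals, comments and line/column tracking.

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleLexer {

    public enum Type { NUMBER, IDENT, OP, LPAREN, RPAREN }

    public static final class Token {
        public final Type type;
        public final String text;

        Token(Type type, String text) {
            this.type = type;
            this.text = text;
        }

        @Override
        public String toString() {
            return type + "(" + text + ")";
        }
    }

    /** Splits the input into tokens; whitespace separates tokens and is dropped. */
    public static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isDigit(c)) {
                // NUMBER: one or more digits
                int start = i;
                while (i < input.length() && Character.isDigit(input.charAt(i))) {
                    i++;
                }
                tokens.add(new Token(Type.NUMBER, input.substring(start, i)));
            } else if (Character.isLetter(c) || c == '_') {
                // IDENT: letter or underscore, then letters, digits or underscores
                int start = i;
                while (i < input.length()
                        && (Character.isLetterOrDigit(input.charAt(i)) || input.charAt(i) == '_')) {
                    i++;
                }
                tokens.add(new Token(Type.IDENT, input.substring(start, i)));
            } else if ("+-*/=".indexOf(c) >= 0) {
                tokens.add(new Token(Type.OP, String.valueOf(c)));
                i++;
            } else if (c == '(') {
                tokens.add(new Token(Type.LPAREN, "("));
                i++;
            } else if (c == ')') {
                tokens.add(new Token(Type.RPAREN, ")"));
                i++;
            } else {
                throw new IllegalArgumentException(
                        "Unexpected character '" + c + "' at position " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("total = (price + 42) * qty"));
    }
}
```

Each branch of the if/else chain corresponds to one path through the finite automaton that the books above describe, so once this works it is a small step to compare its output with a generated lexer.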

What is the difference between Acceleo and Xpand?

I have a DSL which is based on a custom metamodel, which in turn is based on EMF/Ecore. I am trying to figure out which solution to choose, and I can't find any decent comparisons anywhere.
Does anyone have any reasons why I should choose one over the other?
What I know so far is that Acceleo uses an OMG-standardized language, but it seems harder to use than Xpand.

First of all, I wonder why you consider Acceleo more difficult to learn than Xpand. While both languages have differences (blocks and delimiters, for example), they have quite a similar structure. I won't detail all the elements of both languages but, for example, I don't see much of a difference between something like:
«FOREACH myAttributes AS a»«a.name»«ENDFOREACH»
and
[for (a: Attribute|myAttributes)][a.name/][/for]
Both are template-based languages and as such they have much the same structure. The main differences between Acceleo and Xpand are that Acceleo is based on the OMG standards MOFM2T and OCL, and the tooling.
I am not very familiar with the Xpand tooling, but you can find more about it on their wiki. Acceleo, on the other hand, contains an editor with syntax highlighting, code completion, error detection, refactoring and more. It also contains a debugger, a profiler, and Ant and Maven support. You can also easily deploy your generators as Eclipse plugins for other users, or use them outside Eclipse in a regular Java application. You can find more information on Acceleo here. You can see most of the features of Acceleo in videos on the Obeo Network (registration required).
Finally, the latest activity on Xpand occurred a year ago, while Acceleo is actively developed. You can even follow the Acceleo development on GitHub if you want.
Stephane Begaudeau
Disclaimer: I am one of the members of the Acceleo dev team.

I am a dabbler, not an expert.
My impression is that if you need little more than a templating language, then Xpand is the way to go. Otherwise, pick Acceleo - but as you say, the learning curve is very steep.
When do you need more than a templating language? For me, they seem to run out of gas when the structure (not content) of the output is dependent on multiple independent pieces of the input. If you don't want to get into Acceleo, but have one of these cases, consider inventing an auto-generated "shim" language that gets you partway from input language to output language, perhaps with a lot of redundancy in it to avoid lookups at template-generation time.

I've been using the old Acceleo 2.x on a full-scale project and have done some tests with the new one.
The language is pretty easy to use, but with the new version it's a little more difficult to bind some Java code to your template when the script language is not enough.
I was a very big fan of 2.x, but with 3.x I had lots of trouble making it work. You have to write Java code to handle Eclipse resources, for instance. I totally gave up when updating to Juno: my Acceleo projects didn't work anymore and I didn't manage to fix them in two days. I hope they will make it easier to use out of the box.

Basically, the main difference is that Acceleo is an implementation of the MOF Model to Text Transformation Language, which is the OMG (Object Management Group) standard for defining model-to-text transformations. It is therefore a standard language designed by the same group who designed MOF, UML, SysML and MDA in general. Xpand is a language which I guess existed before the standard, but it is now different from it.
If you start from scratch then start with Acceleo.

In my case, I use a custom meta-model (derived from UML2) with custom stereotypes and stereotype properties. I tried both the Acceleo and Xpand template languages. Indeed, they are pretty similar in terms of structure and capabilities.
However, I can see one big difference (which makes Xpand much better in this use case): you can use your custom stereotypes in your Xpand templates.
Xpand engine brilliantly chooses the "best matching template/rule" for every stereotype (taking into account inheritance between stereotypes as well).
Furthermore, it is very easy to obtain stereotype properties.
These two "features" make the templates very elegant, compact and readable.
For example:
«DEFINE myTemplate FOR MyUmlProfile::MyStereoType»
MyValue: «this.myStereotypeProperty» or simply: «myStereotypeProperty»
«ENDDEFINE»
In Acceleo, I found it clumsy to achieve the same (longer statements, more code) and my templates ended up lengthy and complex. The positive thing about Acceleo, however, was that it worked conveniently from IBM RSA (applied directly to RSA (emx) models). It has code highlighting and auto-complete working nicely.
Xpand only worked if I exported my RSA models to ".uml" (~XML) format. It doesn't offer code highlighting or auto-complete (or at least I didn't figure out how).
Considering all pros and cons, I still vote for Xpand (in my use case).

Intelligent search and generation of Java code, preferably using Python?

Basically, I do lots of one-off code generation, large-scale refactorings, etc. etc. in Java.
My tool language of choice is Python, but I'll take whatever solutions you can offer.
Here is a simplified illustration of what I would like, in pseudocode.
Generating an implementation for an interface
    search within my project:
        for each Interface as iName:
            write class(name=iName+"Impl", implements=iName)
            search within the body of iName:
                for each Method as mName:
                    write method(name=mName, body="// TODO implement this...")
Basically, the tool I'm searching for would allow me to:
parse files according to their Java structure ("search for interfaces")
search for words contextualized by language elements and types ("variables of type SomeClass", "doStuff() method calls on SomeClass instances")
to run searches with structural context ("within the body of the current result")
easily replace or generate code (with helpers to generate, as above, or functions for replacing, "rename the interface to Foo", "insert the line Blah.Blah()", etc.)
The point is, I don't want to spend a lot of time writing these things, as they are usually throwaway. But sometimes I need something just a little smarter than what grep offers. It wouldn't be too hard to write up a simplistic version of this, but if I'm going to use something like this at all, I'd expect it to be robust.
Any suggestions of a tool/library that will help me accomplish this?
Edit to add some clarification
Python is definitely not necessary; I'll take whatever is out there. I merely suggest it in case there are choices.
This is to be used in combination with IDE refactoring; sometimes it just doesn't do everything I want.
In instances where I'm using for code generation (as above), it's for augmenting the output of other code generators. e.g. a library we use outputs a tonne of interfaces, and we need to make standard implementations of each one to mesh it to our codebase.

First, I am not aware of any tools or libraries implemented in Python that are specifically designed for refactoring Java code, and a Google search did not give me any leads.
Second, I would posit that writing a decent tool or library for refactoring Java in Python would be a large task. You would have to implement a Java compiler front-end (lexer/parser, AST builder and type analyser) in Python, then figure out how to integrate this with a program editor. I'm not surprised that nobody has done this ... given that mature alternatives already exist.
Thirdly, refactoring that works without a full analysis of the source code (using pattern matching, for example) will be incapable of complex refactorings, and is likely to make mistakes in edge cases that the implementor did not think of. I expect that is the level at which the OP is currently operating ...
Given that bleak outlook, what are the alternatives:
One alternative is to use one of the existing Java IDEs (e.g. NetBeans, Eclipse, IDEA. etc) as a refactoring tool. The OP won't be able to extend the capabilities of such a tool in Python code, but the chances are that he won't really need to. I expect that at least one of these IDEs does 95% of what he needs, and (if he is realistic) that should be good enough. Especially when you consider that IDEs have lots of incidental features that help make refactoring easier; e.g. structured editing, undo/redo, incremental compilation, intelligent code completion, intelligent searching, type and call hierarchy views, and so on.
(Aside ... if existing IDEs are not good enough (#WizardOfOdds - only the OP can make that call!!), it would make more sense to try to extend the refactoring capability of an existing IDE than start again in a different implementation language.)
Depending on what he is actually doing, model-driven code generation may be another alternative. For instance, if the refactoring is happening because he is frequently creating and recreating his object model(s), then an alternative is to code the models in some modeling language and generate his code from those models. My tool of choice when doing this kind of thing is Eclipse EMF and related technologies. The EMF technologies include generation of editors, XML serialization, persistence, queries, model to model transformation and so on. I have used EMF to implement and roll out projects with object models consisting of 50 to 100 distinct classes with complex relationships and validation requirements. EMF's support for merging source code edits when you regenerate from an updated model is a key feature.

If you are coding in Java, I strongly recommend that you use the NetBeans IDE. It has this kind of refactoring support built in. Eclipse also supports this kind of thing (although I prefer NetBeans). Both projects are open source, so if you want to see how they perform this refactoring, you can look at their source code.

Java has its fair share of criticism these days, but in the area of tooling it isn't justified.
We are spoiled for choice: Eclipse, NetBeans and IntelliJ are the big three IDEs. All of them offer excellent levels of searching and refactoring. Eclipse has the edge on NetBeans, I think, and IntelliJ is often ahead of Eclipse.
You can also use static analysis tools such as FindBugs, Checkstyle, etc. to find issues, e.g. excessively long methods and classes, or overly complex code.
If you really want to leverage your Python skills, take a look at Jython. It's a Python interpreter written in Java.
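For the specific "generate an Impl class for every interface" case in the question, a throwaway generator does not need a full parser if the interfaces are already compiled: plain reflection is enough. The sketch below is only an illustration of that idea, not a feature of any of the tools named above. The class name and output layout are my own, generic type parameters are erased to raw types, and parameter names come out as arg0, arg1, ... unless the interfaces were compiled with -parameters.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Arrays;
import java.util.Comparator;
import java.util.stream.Collectors;

public class ImplSkeletonGenerator {

    /** Returns Java source for a class named <Interface>Impl with a stub for each abstract method. */
    public static String generate(Class<?> iface) {
        if (!iface.isInterface()) {
            throw new IllegalArgumentException(iface.getName() + " is not an interface");
        }
        StringBuilder out = new StringBuilder();
        out.append("public class ").append(iface.getSimpleName()).append("Impl implements ")
           .append(iface.getCanonicalName()).append(" {\n");

        Method[] methods = iface.getMethods();
        // getMethods() guarantees no particular order, so sort for stable output.
        Arrays.sort(methods, Comparator.comparing(
                (Method m) -> m.getName() + Arrays.toString(m.getParameterTypes())));

        for (Method m : methods) {
            if (!Modifier.isAbstract(m.getModifiers())) {
                continue; // default and static methods already have a body
            }
            String params = Arrays.stream(m.getParameters())
                    .map(p -> p.getType().getCanonicalName() + " " + p.getName())
                    .collect(Collectors.joining(", "));
            out.append("    @Override\n")
               .append("    public ").append(m.getReturnType().getCanonicalName()).append(' ')
               .append(m.getName()).append('(').append(params).append(") {\n")
               .append("        // TODO implement this...\n")
               .append("        throw new UnsupportedOperationException();\n")
               .append("    }\n");
        }
        out.append("}\n");
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(generate(Runnable.class));
    }
}
```

Looping this over the interfaces a code generator produced covers the "augment the output of other generators" use case. For searches that need real structural context (types of variables, call sites), an IDE or a proper Java parser is still the safer route, for the reasons given in the answers above.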

Generating Class Diagram

Hi all, I am at the end of the release of my project. In order to keep us working, our manager asked us to generate class diagrams for the code we have written. It is a medium-sized project with 3500 Java files, so I think we need to generate the class diagrams. First I need to know how reverse engineering works here. I also looked for some tools on Google (Green, Violet) but I am not sure whether they are of any help. Please suggest how I should proceed. A good beginner tutorial would also be appreciated.

I strongly recommend BOUML. Its Java reverse support is absolutely ROCK SOLID.
BOUML has many other advantages:
it is extremely fast (fastest UML tool ever created, check out benchmarks),
has rock solid C++, Java, PHP and others import support,
it is multiplatform (Linux, Windows, other OSes),
has a great SVG export support, which is important, because viewing large graphs in vector format, which scales fast in e.g. Firefox, is very convenient (you can quickly switch between "birds eye" view and class detail view),
it is full featured, impressively intensively developed (look at development history, it's hard to believe that such fast progress is possible).
supports plugins, has modular architecture (this allows user contributions, looks like BOUML community is forming up)

The tool you want to use is Doxygen. It's similar to Javadoc, but works across multiple languages. It figures out the dependencies, and can call Graphviz to render the class diagrams. Here's an example of a few Java classes run through Doxygen.

This is more a toolchain than a tool and I haven't tried it out myself, but it may be a starting point: using UMLGraph, Ant and Graphviz. Explained step by step in this article.

I've used Visual Paradigm for UML for what you want to do and it was quite good.
See here for details.
Just go Tools -> Instant reverse and select your packages.

You may be able to reverse engineer class diagrams with the open source modelling tool ArgoUML http://argouml.tigris.org/

ObjectAid is pretty nice. You can drag classes into a diagram and arrange them the way you want.

Visual Paradigm for UML Standard Edition (or better) will reverse engineer Java files into class diagrams.

I guess if your boss just wants to keep you busy until the next project starts then there's no harm in it, but you will find pretty quickly that creating a class diagram with 3500 classes will tell you exactly NOTHING about your system. In fact, you don't really want a diagram with more than about 10 classes on it. So once you have reversed all the code into your modelling tool, you will want to start organizing and arranging to find the meaning. Create a new diagram, drop a single important class onto it and bring in all the classes that are directly related to that class. Repeat for maybe the 300 most significant classes. Don't worry, it isn't as horrible as it sounds, maybe a week's work.
For the record, my modelling tool of choice is Enterprise Architect by Sparx Systems. It will reverse java sources or .jar files. There is a free 30 day trial edition.

There are some tools available that will help you generate these diagrams. These cost money.
Otherwise you could try to parse your Java files yourself. This could be as simple as writing a small parser that reads the Java files, writes the name of each class and all of its import statements to a file, and generates a class diagram from there. Graphviz can help you with that.
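A rough sketch of that do-it-yourself approach, assuming regular expressions are good enough for a first picture: it takes the first top-level type name in each file, collects the import statements, and prints one Graphviz edge per import. The class name and the regexes are my own simplifications; they ignore nested types, same-package references and anything unusual in the source layout.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ImportGraph {

    // An import line, optionally static, optionally ending in a wildcard.
    private static final Pattern IMPORT = Pattern.compile(
            "^\\s*import\\s+(?:static\\s+)?([\\w.*]+)\\s*;", Pattern.MULTILINE);

    // A type declaration at the start of a line, so words inside comments are skipped.
    private static final Pattern TYPE = Pattern.compile(
            "^\\s*(?:(?:public|abstract|final|strictfp)\\s+)*(?:class|interface|enum)\\s+(\\w+)",
            Pattern.MULTILINE);

    /** Returns one DOT edge per import of the first type declared in the source text. */
    public static List<String> edges(String source) {
        List<String> edges = new ArrayList<>();
        Matcher type = TYPE.matcher(source);
        if (!type.find()) {
            return edges; // e.g. package-info.java
        }
        String name = type.group(1);
        Matcher imp = IMPORT.matcher(source);
        while (imp.find()) {
            edges.add("  \"" + name + "\" -> \"" + imp.group(1) + "\";");
        }
        return edges;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(p -> p.toString().endsWith(".java")).collect(Collectors.toList());
        }
        System.out.println("digraph imports {");
        for (Path file : files) {
            String source = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            edges(source).forEach(System.out::println);
        }
        System.out.println("}");
    }
}
```

Redirect the output to a file such as imports.dot and render it with Graphviz, for example dot -Tsvg imports.dot -o imports.svg. With 3500 files the full graph will be unreadable, so in practice you would run it on one package at a time.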

I've been using Enterprise Architect for a number of years. A JBoss developer suggested it to me. It works very well for all types of UML modeling including the reverse engineering of class models (Java, C# and others). The basic version is currently $120 per seat, but it has most of the capabilities of much more expensive tools and it is much easier to learn. I particularly like its ability to generate HTML and RTF documentation.
It is very easy to synchronize code between the tool and your source code. Even bi-directional if you want.
Your PM may also like the activity and sequence diagrams that it can create. I also frequently use the deployment diagrams. It's very helpful to have all of this in one tool.
