How can I check if two Java classes are semantically identical?

I need to merge two similar huge projects (1000+ classes). The second one is a fork of the first one, and it contains some country-specific behavior. The two projects diverge a lot, because svn versioning was handled very poorly.
It often happens that two classes are semantically identical. Their source codes only differ in terms of warnings, import statements, the order of some methods or variables, code formatting, comments, etc.
Is there a way to automatically check if two classes are semantically identical?

You should consider using a program analysis framework like Soot. Soot has excellent APIs for analyzing code that are well suited to your purpose. For example, to check whether two classes are "semantically identical", you can check (1) whether both classes have the same (or similar) fields, and (2) whether both classes have the same (or similar) methods.
Fields are represented as SootField in Soot. You will have all the necessary information in a SootField object that you want to use for comparison. To check the semantic similarity of two methods you can check whether their control flow graphs (CFGs) are similar or not (Details are in Section 5.7 of this guide).
Tips on how you can use Soot:
If your source dir is srcDir, Java Home is javaHome and the list of classes is classNames, then you can use the following code snippet to programmatically load your classes in Soot toolset.
String sootClassPath = srcDir + ":"
        + javaHome + "/jre/lib/rt.jar:"
        + javaHome + "/jre/lib/jce.jar";

Options.v().set_output_format(Options.output_format_jimple);
Options.v().set_src_prec(Options.src_prec_java);
for (String className : classNames) { // "className" is like a.b.Myclass
    Options.v().classes().add(className);
}
Options.v().set_keep_line_number(true);
Options.v().set_allow_phantom_refs(true);
Scene.v().setSootClassPath(sootClassPath);
Scene.v().loadBasicClasses();
When your classes are loaded, you can access a class like below:
SootClass sClass = Scene.v().loadClassAndSupport(className); // "className" is like a.b.Myclass
Now you can access the fields and methods of sClass like below:
Chain<SootField> fieldList = sClass.getFields(); // import soot.util.Chain;
List<SootMethod> methods = sClass.getMethods();
You can iterate over the CFG of a method to collect its list of instructions, like below:
if (method.isConcrete()) {
    List<Unit> instructionList = new ArrayList<>();
    Body b = method.retrieveActiveBody();
    DirectedGraph<Unit> g = new ExceptionalUnitGraph(b);
    Iterator<Unit> gitr = g.iterator();
    while (gitr.hasNext()) {
        instructionList.add(gitr.next());
    }
}

Maybe first convert the two projects' code into UML diagrams using a tool like Architexa.
This may help identify each class's real function in the context of the overall system.
Suspected equivalent classes can then be compared in detail.


ANTLR 4 and StringTemplate 4 - using tree walker with templates

Disclaimer: I never used Java before last month, and I had never heard of ANTLR or StringTemplate before then either. For my internship this summer I was given a project using tools that nobody else at the company has ever used. Everyone "has faith in me" that I will "figure it out." Hence the huge gaps in my understanding. I love this project and I've learned a ton, so don't take this as complaining. I just want to make it work.
Right now I'm working on a pretty printer proof of concept for an old domain-specific language. My ANTLR grammar is producing nice parse trees, and I'm able to output simple StringTemplate examples like the ones in the introduction.
Say I have a simple template in my .stg file:
module(type, name, content) ::= "<type> MODULE <name>; <content>; END MODULE."
In Java I'm able to use add() to set the values for each of the template arguments:
STGroup g = new STGroupFile("example.stg");
ST st = g.getInstanceOf("module");
st.add("type", "MAIN");
st.add("name", "test");
st.add("content", "abc");
System.out.println(st.render());
// prints "MAIN MODULE test; abc; END MODULE."
How do I get ANTLR and ST to read in a text file and produce pretty-printed output?
MAIN MODULE test;
abc;
END MODULE.
Should become
MAIN MODULE test; abc; END MODULE.
For example. (That's not how I plan to format all the output, don't worry. It'll pretty print much prettier than that.)
In this answer I learned that ANTLR 4 generates walkers automatically. Assuming my ANTLR grammar is correct/well-written, how do I match up the ANTLR rules/tokens to my template arguments to generate output from an input text file?
If I missed it in the documentation somewhere, let me know. There are far fewer examples of ANTLR 4 and ST 4 than of the previous versions.
Given a parser rule
r : a b c ;
the generated parse-tree will contain a node rContext with child nodes aContext, bContext, cContext, each potentially having further child nodes, for each instance in the input stream where the rule is matched.
The walk will produce the series of listener (or visitor) calls
enterR
enterA
....
exitA
enterB
....
exitB
enterC
....
exitC
exitR
Each call receives a reference to the node's context instance within the parse-tree, giving access to the actual values, which can be passed to ST in prefix/suffix order relative to intervening child nodes.
Where simple prefix/suffix access ordering alone is not sufficient (or undesirably complex), use one or more prior parse-tree walks to analyze the more complex nodes and annotate the node instances with the analysis products. In the final output walk, reference the analysis products for the values to pass to ST.
Depending on actual circumstances, it would not be unusual for the analysis of a node to collect values from its children, pass the lot to a template for detail expansion, formatting, etc., and store the result as a node annotation string pending output in the final output walk.
Update
To annotate parse-tree nodes, you can use ParseTreeProperty.
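If it helps to see what ParseTreeProperty does without the ANTLR dependency: it is essentially an identity map from node instances to values. A stdlib stand-in (TreeAnnotations is an invented name for this sketch) might look like:

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Stand-in for ANTLR's ParseTreeProperty: annotations are attached to a
// specific node *instance* (identity, not equals), so two distinct nodes
// never share an annotation even if they compare equal.
class TreeAnnotations<V> {
    private final Map<Object, V> annotations = new IdentityHashMap<Object, V>();

    void put(Object node, V value) {
        annotations.put(node, value);
    }

    V get(Object node) {
        return annotations.get(node);
    }
}
```

An analysis walk calls put(ctx, product) in its exit methods; the final output walk calls get(ctx) to retrieve the product.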
Where the annotation set becomes more than 'trivial', a typical option is to associate a node-type specific 'decorator' class instance with a parse-tree node/context instance largely as a better data container. Of course, the node-type specific methods can then be embedded into their corresponding decorator classes to keep concerns nicely separated.
The listener methods become something like this:
public void exitNodeB(NodeBContext ctx) {
    super.exitNodeB(ctx);
    NodeBDescriptor descriptor = (NodeBDescriptor) getDescriptor(ctx);
    if (analysisPhase) {
        descriptor.process(); // node-type specific analysis
    } else {
        descriptor.output(); // node-type specific output generation
    }
}
The specifics of when to analyze (on enter, exit, or both) and when to output will be dependent on the particular application. Implement to suit your purposes.

Why doesn't Java compiler shorten names by default? (both for performance and obfuscation)

I cannot understand why the Java compiler does not shorten the names of variables, parameters, and methods by replacing them with unique IDs.
For instance, given the class
public class VeryVeryVeryVeryVeryLongClass {
    private int veryVeryVeryVeryVeryLongInt = 3;

    public void veryVeryVeryVeryVeryLongMethod(int veryVeryVeryVeryVeryLongParamName) {
        this.veryVeryVeryVeryVeryLongInt = veryVeryVeryVeryVeryLongParamName;
    }
}
the compiled class file contains all of these very long names.
Wouldn't simple unique IDs speed up parsing, and also provide a first level of obfuscation?
You assume that obfuscation is always desired, but it isn't:
Reflection would break, and with it JavaBeans and the many frameworks that rely on it
Stack traces would become completely unreadable
If you tried to code against a compiled JAR, you'd end up with code like String name = p.a1() instead of String name = p.getName()
Obfuscation is normally the very last step taken, when you're delivering the finished app, and even then it's not used particularly often except when the target platform has severe memory constraints.
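The reflection point is worth making concrete: method names survive compilation precisely so that runtime lookups keep working. A self-contained sketch (Person and readProperty are invented names; the JavaBeans convention "name" → getName() is what frameworks actually rely on):

```java
import java.lang.reflect.Method;

// An ordinary bean; its method names are stored verbatim in the .class file.
class Person {
    private final String name;

    Person(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }
}

class ReflectionDemo {
    // JavaBeans-style property access: "name" -> getName(). If the compiler
    // had renamed getName() to something like a1(), this lookup would fail.
    static String readProperty(Object bean, String property) throws Exception {
        String getter = "get" + Character.toUpperCase(property.charAt(0))
                + property.substring(1);
        Method m = bean.getClass().getMethod(getter);
        return (String) m.invoke(bean);
    }
}
```

This is exactly the kind of by-name lookup that serialization frameworks and dependency-injection containers perform, which is why renaming members by default would break them.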
When you use a class, you refer to its methods by their name. Therefore, the compiler needs to preserve those names.
In any event, I don't see why the compiler should aim to obfuscate anything. Rather, it should aim to do exactly the opposite: be as transparent as possible.
The JVM does use numeric IDs internally.
Class files cannot be obfuscated like that because Java is dynamically linked: names of members must be publicly readable or other classes cannot use your code.
Wouldn't simple unique IDs speed the parsing?
No. It would add a mapping that would probably slow it down.
and also provide a first obfuscation
Yes, but who wants the compiler to do obfuscation by default? Not me.
Your suggestion has no merit.

What is a good approach/tool to embed a mini-language in Java app?

I am trying to solve the following problem:
I have a Java app (not written by me) whose purpose is to take 2 report files (just comma separated table output), and compare each row and each column of the files to each other cell by cell - basically, for regression testing.
I would like to enhance it with the following functionality:
Let's say I made a software change that causes all the values in column C1 to increase by 100%.
When comparing specific column "C1", the software currently will report that 100% of values in C1 changed.
What I want to do is to be able to configure the regression tester to NOT simply compare "is C1 in row1 the same in both files", but instead apply a pre-configured comparison rule/formula to the field, e.g. "Make sure that C1 value in file#2 is 2 times bigger than C1 value in file #1". This way, not only will I suppress 100% of bogus regression errors that say every row's column C1 doesn't match, but ALSO catch any real errors where column C1 is not exactly 100% bigger using new software.
When coding this sort of functionality in Perl previously, the solution was very simple - simply code custom per-column comparator configuration into a Perl hash stored in the config file, with the hash keys being columns and hash values being Perl subroutines for comparing 2 values any complicated way I wanted.
Obviously, this approach will NOT work with Java, since I cannot write custom comparator logic in plain Java and have the differ evaluate/compile/execute those comparators at runtime.
This means that I need to come up with some domain specific interpretable language that my differ would interpret/evaluate/execute.
Since I'm not very familiar with Java ecosystem and libraries, I'm asking SO:
What would be a good solution for implementing this DSL for configurable comparator logic?
My requirements are:
The solution must be "free as in beer"
The solution must be "shrink wrapped". E.g. an existing library, that I can simply drop into my code, add the config file, and have it work.
Something which requires me to write my own BNF grammar and provides generic grammar parser for which I must write my own interpreter is NOT acceptable.
The solution must be reasonably flexible as far as data crunching and syntactically rich. E.g.:
you should be able to at the very least pass in - and reference/address from within the DSL - an entire row of data as a hash
it should have reasonably complete syntax; at the very least do basic string operations (concatenate, substring, ideally some level of regex matching and substitutions); basic arithmetic including ability to do abs(val1 - val2) > tolerance on floating point #s; and basic flow control/logic such as conditionals and ideally loops.
The solution should be reasonably fast, and must be scalable. E.g. comparing 100x100 size files should not take 10 minutes with 10-20 custom columns.
If it matters, the target environment is Java 1.6
There are multiple dynamic JVM programming languages that can be easily integrated in Java applications without much effort. I think it would be worth looking into Groovy and/or Scala.
Another possible option would be creating your own DSL using Xtext or Xtend.
Whenever dynamic features in Java come up, I think of Janino, a charming runtime in-memory compiler. In your case it gives you something similar to an eval(...) for plain Java; see: http://docs.codehaus.org/display/JANINO/Basic#Basic-expressionevaluator
The point here is that you don't have a DSL for your test configuration but you can use plain Java syntax to write custom expressions in your test configuration.
The only requirement that won't be fulfilled by the solution proposed below is that you can address a whole row from within the config file. The solution assumes that you write a Java test class that iterates through your test data value-by-value (or better, pair-by-pair) and uses your configured expressions to compare single values. So the dynamic part is the expressions; the static part is the iteration over your test data.
However the code needed is very small and simple as shown below:
Config File (Java properties syntax, key is column name, value is test expression):
# custom expression for column C1
C1 = a == 2 * b
# custom expression for column C4 (b squared; note that ^ is XOR in Java, not power)
C4 = a == b * b
# custom expression for column C47
C47 = Math.abs(a - b) < 1
Sketch for test code:
// read config file into Properties
Properties expressions = new Properties();
expressions.load(...); // e.g. a FileInputStream for the config file

// Compile expressions; this could also be done lazily
HashMap<String, ExpressionEvaluator> evals = new HashMap<String, ExpressionEvaluator>();
for (String column : expressions.stringPropertyNames()) {
    String expression = expressions.getProperty(column);
    ExpressionEvaluator eval = new ExpressionEvaluator(
        expression,                           // expression
        boolean.class,                        // expressionType
        new String[] { "a", "b" },            // parameterNames
        new Class[] { int.class, int.class }  // parameterTypes, depends on your data
    );
    evals.put(column, eval);
}
// Now for every value pair (a, b) check whether a custom expression is
// defined for the corresponding column, and if yes execute it:
Boolean correct = (Boolean) theEvalForTheColumn.evaluate(
    new Object[] { a, b } // parameterValues
);
if (!correct) throw new Exception("Wrong values...");
As said on the Janino pages, performance of compiled expressions is pretty good (they are real Java byte code); only the compilation will slow down the process. So it might be a problem if you have many custom expressions, but it should scale well with an increasing number of values. hth.
No embedded language is needed. Define your comparator as an interface.
You can load classes that implement the interface at runtime using Class.forName(name),
where name can be specified by command-line arguments or any other convenient means.
Your comparator class would look something like

class SpecialColumn3 implements ColumnCompare {
    public boolean compare(String a, String b) { ... }
}
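Spelling that out a little more (the class names and the doubling rule below are invented for illustration; the approach assumes the rule class is on the classpath under the name given in the config):

```java
// The interface the differ codes against.
interface ColumnCompare {
    boolean compare(String a, String b);
}

// One concrete rule: "the value in file #2 must be double the value in file #1".
class DoubledColumn implements ColumnCompare {
    public boolean compare(String a, String b) {
        return Double.parseDouble(b) == 2 * Double.parseDouble(a);
    }
}

class RuleLoader {
    // className would come from a config file or the command line,
    // e.g. "DoubledColumn".
    static ColumnCompare load(String className) throws Exception {
        return (ColumnCompare) Class.forName(className)
                .getDeclaredConstructor().newInstance();
    }
}
```

The differ only ever sees the ColumnCompare interface; which concrete rule runs for which column is decided entirely by configuration.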

Java Code Use Checker

I am working on a library where we want to determine how much of our library is being used, i.e., we want to know how many methods in our library are public but never called.
Goal:
Static Analysis
Determine how many lines of code call each public method in package A in the current project. If the number of calls is zero, the method should be reported as such.
I believe you are looking for this Eclipse plugin --> UCDetector
From the documentation (pay notice to second bullet point)
Unnecessary (dead) code
Code where the visibility could be changed to protected, default or private
Methods or fields, which can be final
On a larger scale, if you want to do object-level static analysis, look at this tool from IBM --> Structural Analysis for Java. It is really helpful for object analysis of libraries, APIs, etc.
Not exactly what you are looking for, but:
Something similar can be done with code coverage tools (like Cobertura). They do not statically inspect the source code, but instrument the bytecode to gather metrics at runtime. Of course, you need to drive the application in a way that exercises all usage patterns, and you might miss the rarer code paths.
On the static analysis front, maybe these tools can help you (the Apache project uses them to check for API compatibility for new releases, seems like that task is somewhat related to what you are trying to do):
Clirr is a tool that checks Java libraries for binary and source compatibility with older releases. Basically you give it two sets of jar files and Clirr dumps out a list of changes in the public api.
JDiff is a Javadoc doclet which generates an HTML report of all the packages, classes, constructors, methods, and fields which have been removed, added or changed in any way, including their documentation, when two APIs are compared.
Client use of reflective calls is one hole in static analysis to consider, as there's no way to know for sure that a particular method isn't being called via some bizarre reflection scheme. So maybe a combination of runtime and static analysis is best.
I don't think you are able to measure how "often" a class or a function is needed.
There are some simple questions:
What defines whether a usage statistic of your game library is "normal" or an "outlier"? Is it wrong to kill yourself in the game too often? You would then use the "killScreen" class more frequently than a good gamer would.
What defines "much"? Time or usage count? POJOs consume little time, but are used pretty frequently.
Conclusion:
I don't know what you are trying to accomplish.
If you want to display your code dependencies, there are other tools for doing this. If you're trying to measure your code execution, there are profiler or benchmarks for Java. If you are a statistic geek, you'll be happy with RapidMiner ;)
Good luck with that!
I would suggest JDepend. It shows you the dependencies between packages and classes, and is excellent for finding cyclic dependencies!
http://clarkware.com/software/JDepend.html
(it has an Eclipse plugin: http://andrei.gmxhome.de/jdepend4eclipse/)
and also PMD for other metrics
http://pmd.sourceforge.net/
IntelliJ has a tool to detect methods, fields, and classes which can have more restricted modifiers. It also has a quick fix to apply these changes, which can save you a lot of work. If you don't want to pay for it, you can get the 30-day eval license, which is more than enough time to change your code; it's not something you should need to do very often.
BTW: IntelliJ has about 650 code inspections to improve code quality, about half with automatic fixes, so I suggest spending a couple of days using it to refactor/tidy up your code.
Please take a look at Dead Code Detector. It claims to do just what you are looking for: finding unused code using static analysis.
Here are a few lists of Java code coverage tools. I haven't used any of these personally, but they might get you started:
http://java-source.net/open-source/code-coverage
http://www.codecoveragetools.com/index.php/coverage-process/code-coverage-tools-java.html
Proguard may be an option too (http://proguard.sourceforge.net/):
"Some uses of ProGuard are:
...
Listing dead code, so it can be removed from the source code.
... "
See also http://proguard.sourceforge.net/manual/examples.html#deadcode
You could write your own utility for that (within an hour of reading this) using the ASM bytecode analysis library (http://asm.ow2.org). You'll need to implement a ClassVisitor and a MethodVisitor, and use a ClassReader to parse the class files in your library.
Your ClassVisitor's visitMethod(..) will be called for each declared method.
Your MethodVisitor's visitMethodInsn(..) will be called for each called method.
Maintain a Map to do the counting; the keys represent the methods. Here's some code:
class MyClassVisitor extends ClassVisitor {
    private final Map<String, Integer> map;
    private String className;

    MyClassVisitor(Map<String, Integer> map) {
        super(Opcodes.ASM9);
        this.map = map;
    }

    @Override
    public void visit(int version, int access, String name, String signature,
                      String superName, String[] interfaces) {
        this.className = name;
    }

    @Override
    public MethodVisitor visitMethod(int access, String name, String desc,
                                     String signature, String[] exceptions) {
        String key = className + "." + name + "#" + desc;
        if (!map.containsKey(key)) {
            map.put(key, 0); // declared method, no calls counted yet
        }
        return new MyMethodVisitor(map);
    }
}

class MyMethodVisitor extends MethodVisitor {
    private final Map<String, Integer> map;

    MyMethodVisitor(Map<String, Integer> map) {
        super(Opcodes.ASM9);
        this.map = map;
    }

    @Override
    public void visitMethodInsn(int opcode, String owner, String name,
                                String desc, boolean isInterface) {
        String key = owner + "." + name + "#" + desc;
        if (!map.containsKey(key)) {
            map.put(key, 0);
        }
        map.put(key, map.get(key) + 1); // one more call site for this method
    }
}
Basically that's it. You're starting the show with something like this:
Map<String, Integer> map = new HashMap<String, Integer>();
for (File classFile : libraryClassFiles) { // all .class files in your library
    InputStream input = new FileInputStream(classFile);
    new ClassReader(input).accept(new MyClassVisitor(map), 0);
    input.close();
}
for (Map.Entry<String, Integer> entry : map.entrySet()) {
    if (entry.getValue() == 0) {
        System.out.println("Unused method: " + entry.getKey());
    }
}
Enjoy!

Can I add and remove elements of enumeration at runtime in Java

Is it possible to add and remove elements from an enum in Java at runtime?
For example, could I read in the labels and constructor arguments of an enum from a file?
#saua, it's just a question of whether it can be done out of interest really. I was hoping there'd be some neat way of altering the running bytecode, maybe using BCEL or something. I've also followed up with this question because I realised I wasn't totally sure when an enum should be used.
I'm pretty convinced that the right answer would be to use a collection that ensured uniqueness instead of an enum if I want to be able to alter the contents safely at runtime.
No, enums are supposed to be a complete static enumeration.
At compile time, you might want to generate your enum .java file from another source file of some sort. You could even create a .class file like this.
In some cases you might want a set of standard values but allow extension. The usual way to do this is to have an interface for the type and an enum that implements that interface for the standard values. Of course, you lose the ability to switch when you only have a reference to the interface.
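A sketch of that idiom, with invented names:

```java
// The interface that client code references.
interface Hat {
    String label();
}

// The fixed, standard values: an enum implementing the interface.
enum StandardHat implements Hat {
    BASEBALL, BRIMLESS;

    public String label() {
        return name();
    }
}

// An extension beyond the standard set: just another implementation.
class Sombrero implements Hat {
    public String label() {
        return "SOMBRERO";
    }
}
```

Code holding a Hat reference accepts both kinds; code holding a StandardHat can still switch over its constants.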
Behind the curtain, enums are POJOs with a private constructor and a bunch of public static final values of the enum's type (see here for an example). In fact, up until Java 5 it was considered best practice to build your own enumeration this way; Java 5 introduced the enum keyword as a shorthand. See the source for Enum<T> to learn more.
So it should be no problem to write your own "TypeSafeEnum" with a public static final array of constants that are read by the constructor or passed to it.
Also, do yourself a favor and override equals, hashCode and toString, and if possible create a values method.
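For illustration, a minimal hand-rolled type-safe enum in that pre-Java-5 style might look like this (Suit and its constants are invented for the example):

```java
// Pre-Java-5 style type-safe enum: private constructor, public static
// final constants, plus values()/valueOf() helpers.
final class Suit {
    public static final Suit HEARTS = new Suit("HEARTS");
    public static final Suit SPADES = new Suit("SPADES");

    private static final Suit[] VALUES = { HEARTS, SPADES };

    private final String name;

    private Suit(String name) {
        this.name = name;
    }

    public static Suit[] values() {
        return VALUES.clone(); // defensive copy, like a real enum
    }

    public static Suit valueOf(String name) {
        for (Suit s : VALUES) {
            if (s.name.equals(name)) {
                return s;
            }
        }
        throw new IllegalArgumentException("No such Suit: " + name);
    }

    public String toString() {
        return name;
    }
    // equals/hashCode are inherited from Object; identity semantics are
    // correct here because each constant is a canonical singleton.
}
```

Making the constants array non-final and mutable is what would open the door to runtime additions, at the cost of losing the compile-time guarantees of a real enum.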
The question is how to use such a dynamic enumeration... you can't read the value "PI=3.14" from a file to create enum MathConstants and then go ahead and use MathConstants.PI wherever you want...
I needed to do something like this (for unit testing purposes), and I came across this - the EnumBuster:
http://www.javaspecialists.eu/archive/Issue161.html
It allows enum values to be added, removed and restored.
Edit: I've only just started using this, and found that some slight changes are needed for Java 1.5, which I'm currently stuck with:
Add the Arrays.copyOf static helper methods (e.g. take these 1.6 versions: http://www.docjar.com/html/api/java/util/Arrays.java.html)
Change EnumBuster.undoStack to a Stack<Memento>
In undo(), change undoStack.poll() to undoStack.isEmpty() ? null : undoStack.pop();
The string VALUES_FIELD needs to be "ENUM$VALUES" for the java 1.5 enums I've tried so far
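For the first bullet, a 1.5-compatible back-port of Arrays.copyOf might look like this (a sketch, not the exact JDK source; Arrays15 is an invented class name):

```java
import java.lang.reflect.Array;

class Arrays15 {
    // Equivalent of Java 6's Arrays.copyOf(T[], int), usable on Java 1.5.
    // Array.newInstance preserves the runtime component type of the array.
    @SuppressWarnings("unchecked")
    static <T> T[] copyOf(T[] original, int newLength) {
        T[] copy = (T[]) Array.newInstance(
                original.getClass().getComponentType(), newLength);
        System.arraycopy(original, 0, copy, 0,
                Math.min(original.length, newLength));
        return copy;
    }
}
```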
I faced this problem on the formative project of my young career.
The approach I took was to save the values and the names of the enumeration externally, and the end goal was to be able to write code that looked as close to a language enum as possible.
I wanted my solution to look like this:
enum HatType
{
BASEBALL,
BRIMLESS,
INDIANA_JONES
}
HatType mine = HatType.BASEBALL;
// prints "BASEBALL"
System.out.println(mine.toString());
// prints true
System.out.println(mine.equals(HatType.BASEBALL));
And I ended up with something like this:
// in a file somewhere:
// 1 --> BASEBALL
// 2 --> BRIMLESS
// 3 --> INDIANA_JONES
HatDynamicEnum hats = HatEnumRepository.retrieve();
HatEnumValue mine = hats.valueOf("BASEBALL");
// prints "BASEBALL"
System.out.println(mine.toString());
// prints true
System.out.println(mine.equals(hats.valueOf("BASEBALL")));
Since my requirements were that it had to be possible to add members to the enum at run-time, I also implemented that functionality:
hats.addEnum("BATTING_PRACTICE");
HatEnumRepository.storeEnum(hats);
hats = HatEnumRepository.retrieve();
HatEnumValue justArrived = hats.valueOf("BATTING_PRACTICE");
// file now reads:
// 1 --> BASEBALL
// 2 --> BRIMLESS
// 3 --> INDIANA_JONES
// 4 --> BATTING_PRACTICE
I dubbed it the Dynamic Enumeration "pattern"; you can read about the original design and its revised edition.
The difference between the two is that the revised edition was designed after I really started to grok OO and DDD. The first one I designed when I was still writing nominally procedural DDD, under time pressure no less.
You can load a Java class from source at runtime. (Using JCI, BeanShell or JavaCompiler)
This would allow you to change the Enum values as you wish.
Note: this wouldn't change any classes which referred to these enums so this might not be very useful in reality.
A working example in widespread use is in modded Minecraft. See the EnumHelper.addEnum() methods on GitHub.
However, note that in rare situations practical experience has shown that adding Enum members can lead to some issues with the JVM optimiser. The exact issues may vary with different JVMs. But broadly it seems the optimiser may assume that some internal fields of an Enum, specifically the size of the Enum's .values() array, will not change. See issue discussion. The recommended solution there is not to make .values() a hotspot for the optimiser. So if adding to an Enum's members at runtime, it should be done once and once only when the application is initialised, and then the result of .values() should be cached to avoid making it a hotspot.
The way the optimiser works and the way it detects hotspots is obscure and may vary between different JVMs and different builds of the JVM. If you don't want to take the risk of this type of issue in production code, then don't change Enums at runtime.
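Following that advice, the cache-once pattern might look like this (Direction is an invented enum; the point is that values() is called exactly once at startup):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

enum Direction { NORTH, SOUTH, EAST, WEST }

class Directions {
    // values() allocates a fresh array on every call; cache the result
    // once so the call site never becomes a hotspot for the optimiser.
    static final List<Direction> ALL =
            Collections.unmodifiableList(Arrays.asList(Direction.values()));
}
```

Any runtime additions to the enum would have to happen before this class is initialized, since the cache is filled exactly once.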
You could try to assign properties to the enum you're trying to create and statically construct it by using a loaded properties file. Big hack, but it works :)
