The legacy project I am working on includes an external library in the form of a set of binary jar files. We decided that, for analysis and potential patching, we want to receive the sources of this library, use them to build new binaries, and after sufficiently detailed and long regression testing, switch to these binaries.
Assume that we have already retrieved and built the sources (I am actually in the planning phase). Before real testing, I would like to perform some "compatibility checks" to exclude the possibility that the sources represent something dramatically different from what is in the "old" binaries.
Using the javap tool I was able to extract the version of the JDK used for compilation (at least I believe that is what it is). It says the binaries were built with class file major version 46, minor version 0. According to this article, that maps to JDK 1.2.
Assume that the same JDK would be used for compiling the sources.
The question is:
Is there a reliable and reasonably efficient method of verifying whether two sets of binaries were built from the same sources? I would like to know whether all method signatures and class definitions are identical, and whether most or ideally all method implementations are identical/similar.
The library is pretty big, so I think that detailed analysis of the decompiled binaries may not be an option.
I suggest a multi-stage process:
Apply the previously suggested Jardiff or similar to see if there are any API differences. If possible, pick a tool that has an option for reporting private methods etc. In practice, any substantial implementation change in Java is likely to change some methods and classes, even if the public API is unchanged.
If you have an API match, compile a few randomly selected files with the indicated compiler, decompile the result and the original class files, and compare the results. If they match, apply the same process to larger and larger bodies of code until you either find a mismatch, or have checked everything.
Diffs of decompiled code are more likely to give you clues about the nature of the differences, and are easier to filter for non-significant differences, than the actual class files.
If you get a mismatch, analyze it. It may be due to something you do not care about. If so, try to construct a script that will delete that form of difference and resume the compile-and-compare process. If you get widespread mismatches, experiment with compiler parameters such as optimization. If adjustments to the compiler parameters eliminate the differences, continue with the bulk comparison. The objective in this phase is to find a combination of compiler parameters and decompiled code filters that produce a match on the sample files, and apply them to bulk comparison of the library.
If you cannot get a reasonably close match in the decompiled code, you probably do not have the right source code. Even so, if you have an API match it may be worth building your system and running your tests using the result of the compilation. If your tests run at least as well with the version you built from source, continue work using it.
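Before even step 1, a cheap first pass is to compare the class inventories of the two jars. Here is a minimal sketch, using only java.util.jar from the JDK (the jar paths are placeholders I made up), that lists the classes present in one jar but not the other:

import java.util.Enumeration;
import java.util.Set;
import java.util.TreeSet;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class JarClassDiff
{
    // Collect the names of all .class entries in a jar.
    static Set<String> classNames( String path ) throws Exception
    {
        Set<String> names = new TreeSet<String>();
        JarFile jar = new JarFile( path );
        for( Enumeration<JarEntry> e = jar.entries(); e.hasMoreElements(); )
        {
            String name = e.nextElement().getName();
            if( name.endsWith( ".class" ) )
                names.add( name );
        }
        jar.close();
        return names;
    }

    public static void main( String[] args ) throws Exception
    {
        Set<String> oldJar = classNames( "old-library.jar" );     // placeholder paths
        Set<String> newJar = classNames( "rebuilt-library.jar" );
        Set<String> onlyOld = new TreeSet<String>( oldJar );
        onlyOld.removeAll( newJar );
        Set<String> onlyNew = new TreeSet<String>( newJar );
        onlyNew.removeAll( oldJar );
        System.out.println( "Only in old jar: " + onlyOld );
        System.out.println( "Only in new jar: " + onlyNew );
    }
}

Anything reported here means the API comparison in step 1 would fail anyway, so it is worth running first.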
There are a variety of JAR comparison tools out there. One that used to be pretty good is Jardiff. I haven't used it in a while, but I'm sure it's still available. There are also some commercial offerings in the same space that could fit your needs.
Jardiff that Perception mentioned is a good start; however, there is no way to be 100% sure, even in theory. This is because the same source can be compiled with different compilers, different compiler configurations, and different optimization levels. So there is no reliable way to compare binary code (bytecode) beyond class and method signatures.
What do you mean by a "similar implementation" of a method? Let's suppose that a clever compiler drops an else branch because it figures out that the condition can never be true. Are the two similar? Yes and no... :-)
The best way to go, IMHO, is setting up very good regression test cases that check every key feature of your library. This might be a horror, but in the long term it might be cheaper than hunting for bugs. It all depends on your future plans for this project. Not a trivial decision.
For method signatures, use a tool like jardiff.
For similarity of implementation, you have to fall back to a rough guess. Comparing the bytecode at the opcode level may be compiler-dependent and lead to a large number of false negatives. If this is the case, you could fall back to comparing the methods of a class using the LineNumberTable.
It gives you a list of line numbers for each method (as long as the class file has been compiled with the debug flag, which is often missing in very old or commercial libraries).
If two class files are compiled from the same source code, then at least the line numbers of each method should match exactly.
You can use a library such as Apache BCEL to retrieve the LineNumberTable:
import org.apache.bcel.classfile.ClassParser;
import org.apache.bcel.classfile.JavaClass;
import org.apache.bcel.classfile.LineNumber;
import org.apache.bcel.classfile.LineNumberTable;
import org.apache.bcel.classfile.Method;

JavaClass fooClazz = new ClassParser( "Foo.class" ).parse();
for( Method m : fooClazz.getMethods() )
{
    // Skip methods compiled without debug info (no LineNumberTable attribute).
    LineNumberTable lnt = m.getLineNumberTable();
    if( lnt == null )
        continue;
    LineNumber[] tab = lnt.getLineNumberTable();
    for( LineNumber ln : tab )
    {
        System.out.println( ln.getLineNumber() );
    }
}
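Building on that, here is a hedged sketch (the two file paths are placeholders) that compares the line-number tables of methods with matching names and signatures across two builds of the same class. Per the reasoning above, a mismatch suggests the sources differ:

import java.util.Arrays;
import org.apache.bcel.classfile.ClassParser;
import org.apache.bcel.classfile.JavaClass;
import org.apache.bcel.classfile.LineNumber;
import org.apache.bcel.classfile.LineNumberTable;
import org.apache.bcel.classfile.Method;

public class LineTableCompare
{
    // Extract a method's raw line numbers, or an empty array if no debug info.
    static int[] lines( Method m )
    {
        LineNumberTable lnt = m.getLineNumberTable();
        if( lnt == null )
            return new int[0];
        LineNumber[] tab = lnt.getLineNumberTable();
        int[] result = new int[tab.length];
        for( int i = 0; i < tab.length; i++ )
            result[i] = tab[i].getLineNumber();
        return result;
    }

    public static void main( String[] args ) throws Exception
    {
        JavaClass oldClazz = new ClassParser( "old/Foo.class" ).parse();     // placeholder
        JavaClass newClazz = new ClassParser( "rebuilt/Foo.class" ).parse(); // placeholder
        for( Method mo : oldClazz.getMethods() )
            for( Method mn : newClazz.getMethods() )
                if( mo.getName().equals( mn.getName() )
                        && mo.getSignature().equals( mn.getSignature() )
                        && !Arrays.equals( lines( mo ), lines( mn ) ) )
                    System.out.println( "Mismatch: " + mo.getName() + mo.getSignature() );
    }
}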
Related
Do certain Java compilers prefer a different layout of a Java file before it is compiled into a class file for the JVM?
What I mean is: does writing your main class first and all your other classes in the following lines result in a faster compile time or not?
Does the compiler take longer because it has not yet encountered the information it needs when it processes the main class?
If I recall correctly, Java doesn't use explicit pointers either, so I don't see that being an issue.
In other words, if you write your other classes before the main one, does this speed up compile time?
If any such difference exists, it would be so negligible that you wouldn't notice it.
In other words - you should focus on organizing the classes in a way that would make sense and would be easy for you to maintain, not on helping the compiler.
It is pretty simple: you specify the order of classes.
In other words: you give a list of file or directory names to the compiler, and the compiler processes those in the order given, walking through each file. Sometimes it has to make a forward reference to resolve types that are used before they are defined.
I guess: when you ask the compiler to process a complete directory, it will simply read the files in the order that the file system presents them (like alphabetical).
Finally: this is definitely an implementation detail of the compiler (or even of the build tool that generates the commands running the compiler). So a different tool, or tool version, might lead to different results. So again: don't waste your time trying to "optimize" for this.
Background:
I'm working with (for me) a reasonably large codebase (e.g. I've only got a few of the related projects checked out at the moment, and it's > 11000 classes).
The build is Ant, tests are JUnit, CI is Jenkins.
Running all tests before checkin is not an option, it takes Jenkins hours. Even for some of the individual apps it can be 45 minutes.
There are some tests that reference individual methods only via reflection, and in some cases don't even directly reference the class of the tested methods, as they interrogate an aggregator class and rely on the patterns of pass-through methods in use here. As it's a big codebase with > 10 developers, and I'm not in charge, this is something I cannot change for now.
What I want is the ability to print out, before check-in, a list of all test classes that are two degrees away (Kevin-Bacon-wise) from any class in the git diff list. This way I can run them all and cut down on angry emails from Jenkins when something I missed eventually gets run and has an error.
The easiest way I can think of to achieve this is to code it myself with a Ruby script or similar, which allows me to account for some of the patterns we're using, but to do it I need to be able to query "which classes reference class X?"
I could parse .java or (easier) .class files to get this info, but I'd rather not :) Is there a way I can make Javac export it in a simple format as it compiles?
Is there a way I can make Javac export it in a simple format as it compiles?
AFAIK, no.
However, there are other ways to get a list of the dependencies:
How do I get a list of Java class dependencies for a main class?
(Note however that you are unlikely to get a static tool to extract dependencies resulting from Class.forName(), etcetera. Also note that you cannot infer the complete set of dependencies from bytecode files because of the way that "compile time constants" are handled.)
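For the "which classes reference class X?" query itself, here is a minimal sketch using Apache BCEL (my library choice, not something the linked question prescribes; the directory and target class name are placeholders). It recursively scans compiled classes and prints those whose constant pool names the target class. Per the note above, compile-time constants and Class.forName() lookups will not show up:

import java.io.File;
import org.apache.bcel.classfile.ClassParser;
import org.apache.bcel.classfile.Constant;
import org.apache.bcel.classfile.ConstantClass;
import org.apache.bcel.classfile.ConstantPool;
import org.apache.bcel.classfile.JavaClass;

public class WhoReferences
{
    public static void main( String[] args ) throws Exception
    {
        // Target class name in internal form, e.g. com/example/Foo (placeholder).
        scan( new File( "build/classes" ), "com/example/Foo" );
    }

    static void scan( File dir, String target ) throws Exception
    {
        File[] files = dir.listFiles();
        if( files == null )
            return;
        for( File f : files )
        {
            if( f.isDirectory() )
            {
                scan( f, target );
            }
            else if( f.getName().endsWith( ".class" ) )
            {
                JavaClass clazz = new ClassParser( f.getPath() ).parse();
                ConstantPool cp = clazz.getConstantPool();
                for( Constant c : cp.getConstantPool() )
                {
                    // CONSTANT_Class entries name every class the file refers to.
                    if( c instanceof ConstantClass
                            && target.equals( ((ConstantClass) c).getConstantValue( cp ) ) )
                    {
                        System.out.println( clazz.getClassName() );
                        break;
                    }
                }
            }
        }
    }
}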
It strikes me that there are a few problems here:
It sounds to me like your build, and indeed your project structure is monolithic. If you could restructure the code base into large-scale modules that build separately (according to their dependencies), and version controlled separately, then you only need to do a full build and run all unit tests when there is a change high up ... in a module that everything else depends on. (Can I suggest the "Maven" word. It really helps for a large codebase, and 11,000 classes is large.)
It sounds like you may be suffering from the "branches are hard" problem of classic VCS systems.
It sounds like you may need a beefier CI system. If you've got more cores and the build framework is right, you should be able to get faster CI builds. (And if you modularize so that you rebuild less ...)
I think it might be easier to address your slow build/test cycle that way rather than via extra (possibly bespoke) tooling to do dependency analysis.
But I recognize that it may not be up to you to make those decisions.
I need to identify all methods invoked by a class (whether declared in the class or not) using its source code. That is, given the class's source code as input, I want to get the methods invoked by the class as output. In fact I need a class/method that operates like a Java lexical analyzer.
Is there any way to identify all invoked methods?
Of course I tried to use Runtime.traceMethodCalls() to solve the problem, but there was no output. I've read that I need to run Java debugging with java -g, but unfortunately when I try to run java -g it produces an error. Now what should I do? Is there any approach?
1) In the general case, no. Reflection will always allow the code to make method calls that you won't be able to analyze without actually running the code.
2) Tracing the method calls won't give you the full picture either, since a method is not in any way guaranteed (or even likely) to make all the calls it can every time you call it.
Your best bet is some kind of "best effort" code analysis. You may want to try enlisting the compiler's help with that. For example, compile the code and analyze the generated class file for all emitted external symbols. It won't guarantee catching every call (see #1), but it will get you close in most cases.
You can utilize one of the open source static analyzers for Java as a starting point. Checkstyle allows you to build your own modules. Soot has a pretty flexible API and a good example of call analysis. FindBugs might also allow you to write a custom module. AFAIK all three are embeddable in the form of a JAR, so you can incorporate whatever you come up with into your own custom program.
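To illustrate the compiler-assisted approach mentioned above: each statically compiled call site leaves a Methodref (or InterfaceMethodref) entry in the class's constant pool. A hedged sketch with Apache BCEL (my library choice; "Foo.class" is a placeholder) that lists them:

import org.apache.bcel.classfile.ClassParser;
import org.apache.bcel.classfile.Constant;
import org.apache.bcel.classfile.ConstantInterfaceMethodref;
import org.apache.bcel.classfile.ConstantMethodref;
import org.apache.bcel.classfile.ConstantPool;
import org.apache.bcel.classfile.JavaClass;

public class InvokedMethods
{
    public static void main( String[] args ) throws Exception
    {
        JavaClass clazz = new ClassParser( "Foo.class" ).parse(); // placeholder
        ConstantPool cp = clazz.getConstantPool();
        for( Constant c : cp.getConstantPool() )
        {
            // Methodref / InterfaceMethodref entries cover the statically
            // compiled call sites; reflective calls will not appear (see #1).
            if( c instanceof ConstantMethodref || c instanceof ConstantInterfaceMethodref )
                System.out.println( cp.constantToString( c ) );
        }
    }
}

As point #1 says, reflective calls leave no such entries, so this remains a best-effort analysis.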
From your question it is hard to determine exactly what problem you're trying to solve.
But in case:
If you want to analyze source code to see which parts of it are redundant and may be removed, then you could use an IDE (Eclipse, IntelliJ IDEA Community Edition, etc.). IDEs have features to search for usages of a method, and also functionality to analyze code and highlight unused methods as warnings/errors.
If you want to see where some method is called during runtime, then you could use a profiling tool to collect information on those method invocations. Depending on the tool, you could also see where those methods were called from. But bear in mind that when you execute the program, there is no guarantee that the method you are interested in is called from every possible place.
If you are developing an automated tool for displaying call graphs of methods, then you need to parse the source and start working with code entities. One way would be to implement your own compiler and go on from there, but an easier way would be to reuse an open-sourced parser/compiler/analyzer and build your tool around it.
I've used IntelliJ IDEA CE, which has such functionality and may be downloaded with source: http://www.jetbrains.org/display/IJOS/Home
There is also the well-known Eclipse, which has its sources available.
Both of these products have enormous code bases, so isolating the interesting part would be difficult. But it would still be easier than writing your own Java compiler and verifying that it works for every corner case.
For analyzing the bytecode, as mentioned above, you could take a look at JBoss Bytecode. It is aimed more at testing, but may also be helpful for analyzing code.
You may plug into the compiler.
Have a look at the source of Project Lombok, for instance.
There is no general mechanism, so they have one mechanism for javac and one for Eclipse's compiler.
http://projectlombok.org/
Is there a tool to deobfuscate obfuscated Java code?
The code was extracted from a compiled class, but it is obfuscated and unreadable.
First step would be to learn with which tool it was obfuscated. Maybe there's already a "deobfuscator" around for the particular obfuscator.
On the other hand, you can also just run an IDE and use its refactoring powers. Rename the class, method, and variable names to something sensible. Use your human logical thinking powers to figure out what the code actually represents, and name things accordingly. The picture would slowly but surely grow.
Good luck.
Did you try to make the code less obscure with Java Deobfuscator (aka JDO), a kind of smart decompiler?
Currently JDO does the following:
renames obfuscated methods, variables, constants and class names to be unique and more indicative of their type
propagates changes throughout the entire source tree (beta)
has an easy to use GUI
allows you to specify the name for a field, method and class (new feature!)
Currently JDO does not do the following (but it might one day):
modify method bytecode in any way
Not to gravedig, but I wrote a tool that works on most commercial obfuscators:
https://github.com/Contra/JMD
I used Java Deobfuscator (aka JDO) but it has a few bugs. It can't work with case sensitive file names.
So I've changed the source and uploaded a patch for that in sourceforge.
The patch: Download
Most likely only human mindpower to make sense of it. Get the best decompiler available and ponder on its output.
Maybe it will work on Unix/Linux/MacOS?
If so, you could move one step of your process to a VM, where you unpack the code before you rename the over-long names. What is the file name length limit on Windows?
It may not be best practice, but are there ways of removing unused classes from a third party's jar files? Something that looks at the way my classes use the library, does some kind of coverage analysis, and then spits out another jar with all of the untouched classes removed.
Obviously there are issues with this. Specifically, the usage scenario I put it through may not use all classes all the time.
But neglecting these problems, can it be done in principle?
There is a way.
The JarJar project does this, AFAIR. The first goal of the JarJar project is to allow you to embed third-party libraries in your own jar, changing the package structure if necessary. In doing so, it can strip out the classes that are not needed.
Check it out at http://code.google.com/p/jarjar/.
Here is a link about shrinking jars: http://sixlegs.com/blog/java/jarjar-keep.html
There is a tool in Ant called a classfileset. You specify the list of root classes that you know you need, and then the classfileset recursively analyzes their code to find all dependencies.
Alternatively, you could develop a good test suite that exercises all of the functions that you need, then run your tests under a test coverage tool. The tool will tell you which classes (and statement in them) were actually utilized. This could give you an even smaller set of code than what you'd find with static analysis.
I use ProGuard for this. As well as being an excellent obfuscator, it has a code shrinking phase which can combine multiple JARs and then strip out any unused classes or class members. It does an excellent job at shrinking.
At a previous job, I used a Java obfuscator that, as well as obfuscating the code, also removed classes and methods that weren't being used. If you were doing Class.forName or any other type of reflection, you needed to tell the obfuscator, because it couldn't tell by inspecting the code which classes or methods would be called by reflection.
The problem, of course, is that you don't know if other parts of the third party library are doing any reflection, and so removing an "unused" class might cause things to break in an obscure case that you haven't tested.
A jar is just a zip file, so I guess you can. If you could get to the source, that would be cleaner. Maybe try disassembling the classes?
Adding to this question: can that improve performance? Since the classes not used would not be JIT compiled, would it improve startup time, or does Java automatically detect that while compiling to bytecode and not even deal with the code that is not used?
This would be an interesting project (has anyone done it already?)
I presume you'd give the tool your jar(s) as a starting point, and the library jar to clean up. It could use reflection to determine which classes your jar(s) reference directly, and which are used indirectly down the call tree (this is not trivial at all, but doable). If it encounters any reflection code in either of those places, it should give a very loud warning.