While looking through the Java source code, I found some unusual files, mostly related to ByteBuffers in the java.nio package. They have very messy source code and are labelled "This file was mechanically generated: Do not edit!".
These files also contain large runs of blank lines (some even in the middle of javadocs!), presumably to keep the line numbers from changing. I have also seen a few Java decompilers, such as procyon-decompiler, which have an option to keep line numbers, but I doubt that is what happened here, because putting blank lines before the final closing brace would change nothing.
Here are a few of these files (I couldn't find any links to them online and didn't pastebin them because I don't want to break any copyright, but you can find them in the src.zip archive at the root of your JDK installation directory):
java.nio.ByteBuffer
java.nio.DirectByteBufferR
java.nio.Bits
java.nio.BufferOverflowException
I'd be curious to know:
Which tool generated these files?
Why does the tool keep the line numbers the same? Is it to make debugging (stacktraces) easier?
Why would a tool be used to generate them, while all other classes are programmed by humans?
Why would the tool put blank lines seemingly at random inside parentheses, before the final closing brace, or even inside javadocs?
I probably cannot answer all of the questions, but here is some background:
In the Makefile at http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/9b8c96f96a0f/make/java/nio/Makefile, different Java source files are generated from the same template file through a preprocessor:
...
$(BUF_GEN)/CharBuffer.java: $(X_BUF_TEMPLATE) $(GEN_BUFFER_SH)
	$(prep-target)
	@$(RM) $@.temp
	TYPE=char SRC=$< DST=$@.temp $(GEN_BUFFER_CMD)
	$(MV) $@.temp $@

$(BUF_GEN)/ShortBuffer.java: $(X_BUF_TEMPLATE) $(GEN_BUFFER_SH)
	$(prep-target)
	@$(RM) $@.temp
	TYPE=short SRC=$< DST=$@.temp $(GEN_BUFFER_CMD)
	$(MV) $@.temp $@
...
$(X_BUF_TEMPLATE) refers to X-Buffer.java.template, which is the source for typed buffers like CharBuffer, ShortBuffer and some more.
Note: The URLs might change in the future. Also, sorry for referring to Java 7 - in Java 8 the build system was modified, and I have not found the corresponding Makefiles so far.
Which tool generated these files?
GEN_BUFFER_SH / GEN_BUFFER_CMD finally refers to genBuffer.sh, so the script which creates these files is http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/9b8c96f96a0f/make/java/nio/genBuffer.sh.
Why would a tool be used to generate them, while all other classes are programmed by humans?
I don't have an authoritative answer for this specific case, but code generation tools are typically used
if you need to create a lot of similar classes/methods which differ only in some detail, but one subtle enough that you cannot use established mechanisms like generics or method parameters (probably the case here, since the buffers are generated for primitive types, which cannot be used with generics - see the sketch after this list)
if you need to create complex algorithms from a much simpler representation (like generating parsers from a grammar).
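As a toy illustration of the first point (this is not the real genBuffer.sh, which is a shell script driving the build's preprocessor, and the $Type$/$type$ placeholders only mimic the real template syntax):

import java.nio.file.Files;
import java.nio.file.Path;

public class GenBuffer {
    public static void main(String[] args) throws Exception {
        // Read the template once, then stamp out one source file per primitive type.
        String template = Files.readString(Path.of("X-Buffer.java.template"));
        for (String type : new String[] {"char", "short", "int", "long"}) {
            String boxed = Character.toUpperCase(type.charAt(0)) + type.substring(1);
            String source = template.replace("$Type$", boxed).replace("$type$", type);
            Files.writeString(Path.of(boxed + "Buffer.java"), source);
        }
    }
}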
Why does the tool keep the line numbers the same? Is it to make debugging (stacktraces) easier?
I am guessing: yes, it's to retain the line numbers in stack traces so that they match the template files. Other tools, like the C preprocessor, work similarly.
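To make that concrete, here is a schematic illustration (the directive syntax only mimics the template style and is not guaranteed to be exact): a block in the template that is excluded for the type being generated is replaced by blank lines instead of being removed, so every following line keeps its template line number. Presumably this is also why blank lines show up inside javadocs - type-specific documentation blocks get blanked out the same way.

// X-Buffer.java.template (schematic)
public abstract class $Type$Buffer extends Buffer {
#if[char]
    // methods that only make sense for CharBuffer ...
#end[char]
    public abstract $type$ get();

// generated ShortBuffer.java - the guarded block became blank lines,
// so get() sits on the same line number as in the template:
public abstract class ShortBuffer extends Buffer {



    public abstract short get();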
Related
In order to create a valid .class file, every method has to have a full internal name and type descriptors associated with it. When procedurally creating these, is there some sort of lookup table one can use (outside of Java, where a ClassLoader can be used) to get these type descriptors from a method name? For example, how would one go from Scanner.hasNextByte to boolean java.util.Scanner.hasNextByte(int) / boolean java.util.Scanner.hasNextByte() (or even from java.util.Scanner.hasNextByte to boolean java.util.Scanner.hasNextByte(int) / boolean java.util.Scanner.hasNextByte())? The above example has overloading in it, which is another problem a human- but mostly computer-readable declarations file would hopefully address.
I've found many sources of human-readable documentation like https://docs.oracle.com/javase/8/docs/api/index.html containing uses of each method, hyperlinks to other places, etc. but never a simple text file or collection of files containing just declarations in any format. If there's no such file(s) don't worry about it, I can try and scrape some annoying HTML files, but if there is it would save a lot of time. Thanks!
The short answer is No.
There isn't a "header file" containing the class and method signatures for the Java class libraries. The Java tool chain has no need for such a thing. Nor do 3rd-party Java compilers, or compilers for other languages that rely on the Java SE class libraries.
AFAIK, there isn't a 3rd-party tool that builds such a file or an equivalent database or in-memory data structures.
You could create one though.
You could choose an existing Java parsing library and use it to build parse trees for all of the source files in the class library, then emit the information that you need.
You could potentially create a custom Javadoc "doclet" plugin to emit the information.
Having said that, I don't understand why you would need such a mapping. Surely your IDE does this already ... and exposes the information via some internal API. And if this is not for an IDE plugin, what is it for?
You commented:
I'm making a compiler for a JVM-based programming language ....
Ah ... so your compiler should do what other compilers do. Get the information from the ".class" file. You can either load the class using a standard or custom class loader, or you can use a library like asm or bcel or javassist ... which can read a ".class" file without loading it.
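For instance, loading the class through the standard class loader and using reflection already resolves the Scanner.hasNextByte overloads from the example above (a minimal sketch; ASM, BCEL or Javassist would give you the same information without loading the class):

import java.lang.reflect.Method;

public class PrintSignatures {
    public static void main(String[] args) throws Exception {
        Class<?> c = Class.forName("java.util.Scanner");
        for (Method m : c.getDeclaredMethods()) {
            if (m.getName().equals("hasNextByte")) {
                // Prints each overload with its full signature, e.g.
                //   public boolean java.util.Scanner.hasNextByte()
                //   public boolean java.util.Scanner.hasNextByte(int)
                System.out.println(m);
            }
        }
    }
}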
(I haven't checked, but I think the standard javac compiler uses an internal API to do this.)
Note that your proposed approaches won't work for interfacing with 3rd-party Java libraries where the source code is not available and/or the javadoc is not scrapable.
What about building it from the source files for the standard library?
The Oracle Java 8 API web pages you referenced were created by running Javadoc over the source files of the Java standard library.
If you use an IDE with a debugger, there is a good chance you already have much of the standard library source code downloaded. After all, if you set a break point, and then follow the program step-by-step with "Step into", you can trace the execution of the program into standard library methods. The source files would be part of the JDK.
However, some parts of the standard library source might not be available, due to licensing restrictions.
I wrote a Java program, compiled it, and then decompiled it, getting a Java file back. The decompiler gave back a Java file with exactly the same indentation as my original one. That is what decompilers do, so it is not a problem in itself, but the thing that burned this question into my mind is the indentation.
Since the decompiled Java file has the exact same indentation as the original, I conclude that class files store indentation data too.
Now the questions are as follows:
Is my conclusion correct?
If yes:
Why does a class file need to know about the indentation?
Indentation is meant for readability, and no human reads class files, so why can't class files just not store it and save some space?
Are class files just an encrypted form of the Java source file which decompilers can decrypt to give the original back?
Compiled class files do not contain exact whitespace information of the source file.
There are two possible reasons that I can think of that your decompiled class files look exactly like your input:
you happen to have written your original input with a style matching what the decompiler uses
you didn't actually look at the decompiled source, but at the original file somehow (for example because the decompiler didn't write the output where you thought it did)
They do contain line numbers associated with the code that's executed to help with debugging.
But try adding a few random spaces into some of your expressions and you'll see that the decompiler can't reconstruct those.
And no, class files are not a simple "encryption" of the source files; plenty of information is lost when compiling (and some information taken from referenced classes might actually be added, such as the specific signatures of methods that are called).
You can even turn off debugging information during compilation to strip even more information (like local variable names, for example).
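For example, with the standard javax.tools API (a minimal sketch, equivalent to running javac -g:none on the command line; "Foo.java" is a placeholder):

import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public class CompileWithoutDebugInfo {
    public static void main(String[] args) {
        // Requires a JDK (returns null on a plain JRE).
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        // -g:none omits line numbers, the source file name and local variable
        // names, leaving a decompiler even less to work with.
        int status = javac.run(null, null, null, "-g:none", "Foo.java");
        System.out.println(status == 0 ? "compiled" : "failed");
    }
}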
Since the decompiled Java file has the exact same indentation as the original, I conclude that class files store indentation data too
Why would you conclude that based on a single observation?
Why did you not at least consider the possibility of it being a coincidence? Most Java code follows fairly standard style conventions. It's entirely possible that the decompiler and you have very similar ideas about how to lay out the same Java code.
Anyway, you could have easily tested your hypothesis. Just write a Java class with some extremely non-standard indentation - perhaps no whitespace or linebreaks at all - and see what the result is then.
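For example, a class like this compiles just fine, and you will not get this shape back from a decompiler:

public class Weird{public static void main(String[]a){int x=1;if(x==1)
{System.out.println(
"hello");}}}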
You will find that your hypothesis is incorrect.
The legacy project I am working on includes an external library in the form of a set of binary JAR files. We decided that, for analysis and potential patching, we want to obtain the sources of this library, use them to build new binaries, and, after detailed and long enough regression testing, switch to those binaries.
Assume that we have already retrieved and built the sources (I am actually in the planning phase). Before real testing, I would like to perform some "compatibility checks" to exclude the possibility that the sources represent something dramatically different from what is in the "old" binaries.
Using the javap tool, I was able to extract the version of the JDK used for compilation (at least I believe it is the JDK version). It says the binaries were built with class file major version 46 and minor version 0. According to this article, that maps to JDK 1.2.
Assume that the same JDK would be used for sources compilation.
The question is:
Is there a reliable and reasonably efficient method of verifying whether both of these binaries are built from the same sources? I would like to know if all method signatures and class definitions are identical, and if most or maybe all of the method implementations are identical/similar.
The library is pretty big, so I think that detailed analysis of decompiled binaries may be not an option.
I suggest a multi-stage process:
Apply the previously suggested Jardiff or similar to see if there are any API differences. If possible, pick a tool that has an option for reporting private methods etc. In practice, any substantial implementation change in Java is likely to change some methods and classes, even if the public API is unchanged.
If you have an API match, compile a few randomly selected files with the indicated compiler, decompile the result and the original class files, and compare the results. If they match, apply the same process to larger and larger bodies of code until you either find a mismatch, or have checked everything.
Diffs of decompiled code are more likely to give you clues about the nature of the differences, and are easier to filter for non-significant differences, than the actual class files.
If you get a mismatch, analyze it. It may be due to something you do not care about. If so, try to construct a script that will delete that form of difference and resume the compile-and-compare process. If you get widespread mismatches, experiment with compiler parameters such as optimization. If adjustments to the compiler parameters eliminate the differences, continue with the bulk comparison. The objective in this phase is to find a combination of compiler parameters and decompiled code filters that produces a match on the sample files, and apply them to bulk comparison of the library (a first-pass filter for that bulk comparison is sketched after this list).
If you cannot get a reasonably close match in the decompiled code, you probably do not have the right source code. Even so, if you have an API match it may be worth building your system and running your tests using the result of the compilation. If your tests run at least as well with the version you built from source, continue work using it.
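As a cheap first pass for that bulk comparison, you can flag the class files whose bytes differ at all before spending any time decompiling (a sketch; the directory names are placeholders). Identical bytes mean identical compiler output; differing bytes may still come from the same source, so those are the files to decompile and diff:

import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ClassFileDigestDiff {
    // SHA-256 of a file's bytes.
    static String digest(Path p) throws Exception {
        byte[] hash = MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(p));
        StringBuilder sb = new StringBuilder();
        for (byte b : hash) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Path oldDir = Path.of("old-classes"), newDir = Path.of("new-classes"); // placeholders
        List<Path> classFiles;
        try (Stream<Path> s = Files.walk(oldDir)) {
            classFiles = s.filter(p -> p.toString().endsWith(".class")).collect(Collectors.toList());
        }
        for (Path p : classFiles) {
            Path other = newDir.resolve(oldDir.relativize(p));
            if (!Files.exists(other) || !digest(p).equals(digest(other))) {
                System.out.println("differs: " + oldDir.relativize(p)); // candidate for decompile-and-diff
            }
        }
    }
}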
There are a variety of JAR comparison tools out there. One that used to be pretty good is Jardiff. I haven't used it in a while, but I'm sure it's still available. There are also some commercial offerings in the same space that could fit your needs.
Jardiff, which Perception mentioned, is a good start, but there is no way to be theoretically 100% sure. This is because the same source can be compiled with different compilers and different compiler configurations and optimization levels. So there is no way to compare binary code (bytecode) beyond class and method signatures.
What do you mean by a "similar implementation" of a method? Let's suppose that a clever compiler drops an else branch because it figures out that its condition can never be true. Are the two similar? Yes and no... :-)
The best way to go, IMHO, is to set up very good regression test cases that check every key feature of your libraries. This might be a horror to do, but in the long term it might be cheaper than hunting for bugs. It all depends on your future plans for this project. It is not a trivial decision.
For method signatures, use a tool like jardiff.
For similarity of implementation, you have to fall back to a wild guess. Comparing the bytecode on opcode-level may be compiler-dependent and lead to a large number of false negatives. If this is the case, you could fall back to compare the methods of a class using the LineNumberTable.
It gives you a list of line numbers for each method (as long as the class file has been compiled with the debug flag, which is often missing in very old or commercial libraries).
If two class files are compiled from the same source code, then at least the line numbers of each method should match exactly.
You can use a library such as Apache BCEL to retrieve the LineNumberTable:
import org.apache.bcel.classfile.ClassParser;
import org.apache.bcel.classfile.JavaClass;
import org.apache.bcel.classfile.LineNumber;
import org.apache.bcel.classfile.LineNumberTable;
import org.apache.bcel.classfile.Method;

JavaClass fooClazz = new ClassParser("Foo.class").parse();
for (Method m : fooClazz.getMethods()) {
    LineNumberTable lnt = m.getLineNumberTable();
    if (lnt == null) {
        continue; // compiled without line number debug info
    }
    for (LineNumber ln : lnt.getLineNumberTable()) {
        System.out.println(ln.getLineNumber());
    }
}
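Building on that, a sketch of the actual comparison: map each method (name plus descriptor) to its line numbers and diff the two maps (the class file names are placeholders):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import org.apache.bcel.classfile.ClassParser;
import org.apache.bcel.classfile.JavaClass;
import org.apache.bcel.classfile.LineNumber;
import org.apache.bcel.classfile.LineNumberTable;
import org.apache.bcel.classfile.Method;

public class LineNumberCompare {
    // Maps "name + descriptor" of each method to its source line numbers.
    static Map<String, List<Integer>> lineMap(String classFile) throws java.io.IOException {
        JavaClass clazz = new ClassParser(classFile).parse();
        Map<String, List<Integer>> map = new HashMap<>();
        for (Method m : clazz.getMethods()) {
            LineNumberTable lnt = m.getLineNumberTable();
            if (lnt == null) continue; // compiled without line number debug info
            List<Integer> lines = new ArrayList<>();
            for (LineNumber ln : lnt.getLineNumberTable()) {
                lines.add(ln.getLineNumber());
            }
            map.put(m.getName() + m.getSignature(), lines);
        }
        return map;
    }

    public static void main(String[] args) throws Exception {
        Map<String, List<Integer>> a = lineMap("old/Foo.class");
        Map<String, List<Integer>> b = lineMap("new/Foo.class");
        // Note: methods present only in the second file are not reported by this simple loop.
        for (String method : a.keySet()) {
            if (!Objects.equals(a.get(method), b.get(method))) {
                System.out.println("line numbers differ for " + method);
            }
        }
    }
}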
I am writing an eclipse plugin which needs to be able to determine which lines of a file have changed compared to a different version of the same file.
Is there an existing class or library which I can use for this task?
The closest I have found is org.eclipse.compare.internal.merge.DocumentMerger. This can be used to find the information I need but is in an internal package so is not suitable for me to use. I could copy/paste the source of this class and adapt it to my requirements. However, I am hoping there is an existing library to handle textual comparisons.
For textual comparisons, try the google-diff-match-patch library. (I don't know whether Eclipse already has something similar built-in.)
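For example, with the Java port of the library (a minimal sketch; the package and class names follow the original port, so verify them against the version you actually pull in):

import java.util.LinkedList;
import name.fraser.neil.plaintext.diff_match_patch;

public class DiffExample {
    public static void main(String[] args) {
        diff_match_patch dmp = new diff_match_patch();
        LinkedList<diff_match_patch.Diff> diffs =
                dmp.diff_main("line one\nline two\n", "line one\nline 2\n");
        dmp.diff_cleanupSemantic(diffs); // merge tiny edits into readable chunks
        for (diff_match_patch.Diff d : diffs) {
            // Each Diff is an EQUAL, INSERT or DELETE run of text.
            System.out.println(d.operation + ": " + d.text.replace("\n", "\\n"));
        }
    }
}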
Is there a tool to deobfuscate obfuscated Java code?
The code was extracted from a compiled class, but it is obfuscated and unreadable.
First step would be to learn with which tool it was obfuscated. Maybe there's already a "deobfuscator" around for the particular obfuscator.
On the other hand, you can also just load the code into an IDE and use its refactoring powers. Rename the classes, methods and variables to something sensible. Use your human logical thinking powers to figure out what the code actually represents and name things accordingly. The picture will slowly but surely emerge.
Good luck.
Did you try to make the code less obscure with Java Deobfuscator (aka JDO), a kind of smart decompiler?
Currently JDO does the following:
renames obfuscated methods, variables, constants and class names to be unique and more indicative of their type
propagates changes throughout the entire source tree (beta)
has an easy to use GUI
allows you to specify the name for a field, method and class (new feature!)
Currently JDO does not do the following (but it might one day):
modify method bytecode in any way
Not to gravedig, but I wrote a tool that works on most commercial obfuscators:
https://github.com/Contra/JMD
I used Java Deobfuscator (aka JDO), but it has a few bugs: it can't work with case-sensitive file names.
So I've changed the source and uploaded a patch for that on SourceForge.
Most likely only human mindpower will make sense of it. Get the best decompiler available and ponder its output.
Maybe it will work on Unix/Linux/macOS? If so, you could move one step of your process to a VM, where you unpack the code before you rename the over-long names. What is the file name length limit on Windows?