I wrote a Java program, compiled it, and then decompiled the class file to get a Java file back. The decompiler returned my original Java file exactly. That is what decompilers do, so it is not a problem in itself; what raised the question in my mind is that the indentation was exactly the same as in my original file.
As the decompiled Java file has exactly the same indentation as the original, I conclude that class files store indentation data too.
Now the questions are as follows:
Is my conclusion correct?
If yes:
Why does a class file need to know about the indentation?
Indentation is meant for readability and no human reads class files, so why can't class file just not store it and save some space?
Are class files just an encrypted Java source file that decompilers can decrypt to get the original back?
Compiled class files do not contain exact whitespace information of the source file.
There are two possible reasons that I can think of that your decompiled class files look exactly like your input:
you happen to have written your original input with a style matching what the decompiler uses
you didn't actually look at the decompiled source, but at the original file somehow (for example because the decompiler didn't write the output where you thought it did)
Class files do contain line numbers associated with the code that's executed, to help with debugging.
But try adding a few random spaces into some of your expressions and you'll see that the decompiler can't reconstruct those.
And no, class files are not a simple "encryption" of the source files. Plenty of information is lost when compiling (and some information taken from referenced classes might actually be added, such as the specific signatures of methods that are called).
You can even turn off debugging information during compilation to strip even more information (like local variable names, for example).
"As the decompiled Java file has exactly the same indentation as the original, I conclude that class files store indentation data too."
Why would you conclude that based on a single observation?
Why did you not at least assume the possibility of it being a coincidence? Most Java code style follows fairly standard conventions. It's distinctly possible that the decompiler and you have very similar ideas about how to lay out the same Java code.
Anyway, you could have easily tested your hypothesis. Just write a Java class with some extremely non-standard indentation - perhaps no whitespace or linebreaks at all - and see what the result is then.
You will find that your hypothesis is incorrect.
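The experiment suggested above can also be automated. The following is a minimal sketch, assuming it runs on a JDK (so that ToolProvider.getSystemJavaCompiler() returns a compiler); the names WhitespaceCheck and Demo are made up for illustration. It compiles two differently formatted versions of the same class with -g:none, which leaves out debug information such as line numbers, and compares the resulting class file bytes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

class WhitespaceCheck {

    // Compiles the source of a class named Demo without debug info
    // and returns the bytes of the resulting Demo.class.
    static byte[] compile(String source) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system compiler found; run this on a JDK");
        }
        try {
            Path dir = Files.createTempDirectory("wscheck"); // not cleaned up in this sketch
            Path src = dir.resolve("Demo.java");
            Files.write(src, source.getBytes(StandardCharsets.UTF_8));
            int status = compiler.run(null, null, null,
                    "-g:none", "-d", dir.toString(), src.toString());
            if (status != 0) {
                throw new IllegalStateException("Compilation failed");
            }
            return Files.readAllBytes(dir.resolve("Demo.class"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True if both sources compile to byte-for-byte identical class files.
    static boolean sameClassFile(String a, String b) {
        return Arrays.equals(compile(a), compile(b));
    }

    public static void main(String[] args) {
        String tidy = "class Demo {\n    int add(int a, int b) {\n        return a + b;\n    }\n}\n";
        String messy = "class Demo{int add(int a,int b){return a   +   b;}}";
        System.out.println("Identical class files: " + sameClassFile(tidy, messy));
    }
}
```

If the two byte arrays are equal, the formatting cannot have been stored in the class file. Without -g:none the comparison would likely fail for these two inputs, because the statements sit on different lines and the recorded line numbers would differ.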
Do certain Java compilers prefer a different layout of a Java file before it is compiled into a class file for the JVM?
What I mean is: does writing your main class first and then all your other classes on the following lines result in a faster compile time, or not?
Does the compiler take longer because, while processing the main class, it has not yet encountered the definitions it needs?
If I recall correctly, Java doesn't use explicit pointers either so I don't see that being an issue.
In other words, if you write your Classes outside of main first does this speed up compile time?
If any such difference exists, it would be so negligible you won't notice it.
In other words - you should focus on organizing the classes in a way that would make sense and would be easy for you to maintain, not on helping the compiler.
It is pretty simple: you specify the order of classes.
In other words: you give a list of file or directory names to the compiler. The compiler then processes those in the order given and walks through each file. Sometimes it has to handle forward references, where a type is used before the compiler has seen its definition.
I guess: when you ask the compiler to go for a complete directory, it will simply read the files in the order that the file system uses (like alphabetical).
Finally: this is definitely an implementation detail of the compiler (or even the build tool that generates the commands running the compiler). So a different tool, or tool version might lead to different results. So again: don't waste your time to "optimize" for this.
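To illustrate the point about forward references, here is a small sketch with made-up class names (OrderDemo and Later). The class Later is used before its declaration appears in the file, and the compiler resolves the reference regardless of the order.

```java
class OrderDemo {

    // Uses Later before its declaration appears in the file.
    static int answer() {
        return new Later().value();
    }

    public static void main(String[] args) {
        System.out.println(answer());
    }
}

// Declared after its first use; the compiler resolves the reference anyway.
class Later {
    int value() {
        return 42;
    }
}
```

Swapping the two declarations gives a program with the same behavior, which is why the layout can be chosen purely for readability.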
I have this model object representing a Java source file.
It has a constructor like so:
private SourceFile(File file)
I want this constructor to actually make sure (as much as it can) that the File it's being given is actually a Java source.
I have a batch operation that takes a lot of text files. Some of them are Java sources, and I want a good way to differentiate them (other than by file extension).
So has anyone been in this situation before, and can you recommend a good way to check plausibility (not validity; for a validity check I'd need to compile it)?
I'd do two things:
Check that the file ends in .java.
Check that the file declares a class that has the same name as the file (see here).
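The two checks above could be sketched roughly as follows. This is a heuristic only, not a parser: the class and method names are invented for the example, the file name is assumed to be a simple name without directories, and the regular expression will also match declarations that appear inside comments or string literals. Files such as package-info.java, which declare no type, would be rejected.

```java
import java.util.regex.Pattern;

class SourceNameCheck {

    // Returns true if fileName ends in .java and content appears to declare
    // a class, interface, enum or record with the same name as the file.
    static boolean looksLikeJavaSource(String fileName, String content) {
        String suffix = ".java";
        if (!fileName.endsWith(suffix) || fileName.length() == suffix.length()) {
            return false;
        }
        String typeName = fileName.substring(0, fileName.length() - suffix.length());
        Pattern declaration = Pattern.compile(
                "\\b(class|interface|enum|record)\\s+" + Pattern.quote(typeName) + "\\b");
        return declaration.matcher(content).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeJavaSource("Foo.java", "final class Foo { }"));
        System.out.println(looksLikeJavaSource("Foo.java", "final class Bar { }"));
    }
}
```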
It depends on how accurate you want to be. If you want 100% accuracy, you have to compile the file. If you would be happy with something low, you can check that it contains only printable characters. A reasonable level may be achieved with a keyword check. And so on...
Use javaparser; the wiki at the given link explains how to use it. Alternatively, since Java 1.6 the JDK has a built-in compiler API through which you can access the results of the Java parser.
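As a rough sketch of the compiler API route: the code below asks the JDK's own parser to parse the text and treats the absence of syntax errors as "plausibly Java". It assumes a JDK (not a bare JRE) and relies on com.sun.source.util.JavacTask, which is specific to the JDK's javac; the class names JavaPlausibility and StringSource and the placeholder file name Candidate.java are made up for the example. Only parsing is performed, so unresolved types or a mismatched class name are not reported.

```java
import com.sun.source.util.JavacTask;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.Collections;
import javax.tools.Diagnostic;
import javax.tools.DiagnosticCollector;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

class JavaPlausibility {

    // Presents a string to the compiler as if it were a .java file,
    // so the real file extension does not matter.
    static final class StringSource extends SimpleJavaFileObject {
        private final String text;

        StringSource(String text) {
            super(URI.create("string:///Candidate.java"), JavaFileObject.Kind.SOURCE);
            this.text = text;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return text;
        }
    }

    // Returns true if the text parses as Java without syntax errors.
    static boolean parsesAsJava(String text) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system compiler found; run this on a JDK");
        }
        DiagnosticCollector<JavaFileObject> diagnostics = new DiagnosticCollector<>();
        JavacTask task = (JavacTask) compiler.getTask(null, null, diagnostics, null, null,
                Collections.singletonList(new StringSource(text)));
        try {
            task.parse(); // syntax analysis only, no name resolution or code generation
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (Diagnostic<? extends JavaFileObject> d : diagnostics.getDiagnostics()) {
            if (d.getKind() == Diagnostic.Kind.ERROR) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(parsesAsJava("class A { void f() { int x = 1; } }"));
        System.out.println(parsesAsJava("Dear diary, today I wrote no code at all."));
    }
}
```

Note that an empty file also parses without errors, so this check is probably best combined with a cheaper test such as requiring at least one type declaration.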
As I was looking through the Java source code, I found some unusual files, mostly related to ByteBuffers in the java.nio package, which had very messy source code and were labelled "This file was mechanically generated: Do not edit!".
These files also contained large runs of blank lines (some even in the middle of javadocs (!!?)), presumably to prevent the line numbers from changing. I have also seen a few Java decompilers, such as procyon-decompiler, which have an option to keep line numbers, but I doubt that's the explanation here, because putting blank lines before the final closing brace changes nothing.
Here are a few of these files (I couldn't find any links to them online and didn't pastebin them because I don't want to break any copyright, but you can find them in the src.zip folder at the root of your JDK installation folder):
java.nio.ByteBuffer
java.nio.DirectByteBufferR
java.nio.Bits
java.nio.BufferOverflowException
I'd be curious to know:
Which tool generated these files?
Why does the tool keep the line numbers the same? Is it to make debugging (stacktraces) easier?
Why would a tool be used to generate them, while all other classes are programmed by humans?
Why would the tool put blank lines randomly inside parentheses, before the final closing brace, or even in javadocs?
I can probably not answer all of the questions, but some background is:
In the Makefile at http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/9b8c96f96a0f/make/java/nio/Makefile, they are generating different java source files from the same template file through some preprocessor:
...
$(BUF_GEN)/CharBuffer.java: $(X_BUF_TEMPLATE) $(GEN_BUFFER_SH)
	$(prep-target)
	@$(RM) $@.temp
	TYPE=char SRC=$< DST=$@.temp $(GEN_BUFFER_CMD)
	$(MV) $@.temp $@
$(BUF_GEN)/ShortBuffer.java: $(X_BUF_TEMPLATE) $(GEN_BUFFER_SH)
	$(prep-target)
	@$(RM) $@.temp
	TYPE=short SRC=$< DST=$@.temp $(GEN_BUFFER_CMD)
	$(MV) $@.temp $@
...
$(X_BUF_TEMPLATE) refers to X-Buffer.java.template, which is the source for typed buffers like CharBuffer, ShortBuffer and some more.
Note: The URLs might change in the future. Also sorry for referring to Java 7 - in Java 8 they have modified the build system, I did not find the corresponding Makefiles so far.
Which tool generated these files?
GEN_BUFFER_SH / GEN_BUFFER_CMD finally refers to genBuffer.sh, so the script which creates these files is http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/9b8c96f96a0f/make/java/nio/genBuffer.sh.
Why would a tool be used to generate them, while all other classes are programmed by humans?
I don't have an authoritative answer for this specific case, but code generation tools are usually used
if you need to create a lot of similar classes/methods which differ only in some detail, where the difference is such that you cannot use established mechanisms like generics or method parameters (probably the case here, since the buffers are generated for primitive types, which cannot be used with generics)
if you need to create complex algorithms from a much simpler representation (like generating parsers from a grammar).
Why does the tool keep the line numbers the same? Is it to make debugging (stacktraces) easier?
I am guessing: yes, it's to retain the line numbers in stack traces so that they match the template files. Other tools, like the C preprocessor, work similarly.
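To make the idea concrete, here is a toy template expander that keeps the line count of its output equal to that of the template by emitting blank lines for directives and for excluded sections. It is an illustration under stated assumptions, not the actual genBuffer.sh logic: the directive syntax (#if[key] ... #end[key]) and the $type$ placeholder are assumed for this sketch, the class name MiniTemplate is invented, and nested sections are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MiniTemplate {

    // Expands a template for one primitive type. Every input line yields
    // exactly one output line, so line numbers in the generated file
    // match line numbers in the template.
    static List<String> expand(List<String> template, String type) {
        List<String> out = new ArrayList<>();
        String skippingUntil = null; // key of the section currently being skipped
        for (String line : template) {
            String t = line.trim();
            if (t.startsWith("#if[") && t.endsWith("]")) {
                String key = t.substring(4, t.length() - 1);
                if (skippingUntil == null && !key.equals(type)) {
                    skippingUntil = key;
                }
                out.add(""); // directive line becomes a blank line
            } else if (t.startsWith("#end[") && t.endsWith("]")) {
                String key = t.substring(5, t.length() - 1);
                if (key.equals(skippingUntil)) {
                    skippingUntil = null;
                }
                out.add("");
            } else if (skippingUntil != null) {
                out.add(""); // excluded content becomes a blank line
            } else {
                out.add(line.replace("$type$", type));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> template = Arrays.asList(
                "class $type$Box {",
                "#if[char]",
                "    // only for char",
                "#end[char]",
                "    $type$ value;",
                "}");
        for (String line : expand(template, "short")) {
            System.out.println(line);
        }
    }
}
```

A generator built this way naturally leaves blank lines in odd places, including inside javadoc comments, wherever the template had a directive or a section meant for another type.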
Is there a simple-to-use Java library that can take a String and return a set of Strings which are the keywords/keyphrases?
It doesn't have to be particularly clever, just use stop words and stemming to match keywords.
I am looking at the KEA package http://code.google.com/p/kea-algorithm/ but I can't figure out how to use their code.
Ideally something simple which has a little example documentation would be good. In the meantime I will set about writing this myself!
EDIT: When I say I can't figure out how to use their code, I mean I can't see a simple way. The individual classes by themselves have useful methods that will do much of the work.
This is a fairly old question and probably the OP has already solved his problem, but putting it here for others who may stumble upon the question looking for how to use KEA.
For KEA, you will need a training set - some of your documents will need to have keywords already set. The training data consists of a directory of documents (.txt files) and corresponding keywords files (.key files), with one keyword per line. You train KEA on this set, then use the model to extract keywords on the rest of your documents, which are in another directory of .txt files. KEA will write out corresponding .key files in this directory.
For more information, take a look at one or more of the following:
1) The KEA source distribution has a TestKEA.java class which shows how to extract keywords from a small test corpus. The README has details on the directory format required.
2) This blog post has (somewhat terse, IMO) instructions on how to use KEA.
http://kea-pranay.blogspot.com/2010/02/kea-key-extraction-algorithm.html
3) My blog post, which I wrote up last weekend while trying to learn how to generate keywords from a corpus I had (which was already manually annotated with keywords). It has Python code to pre-process the data into the form KEA expects, Scala code (KEA provides a Java API) to train and run the extractor, and Python code to analyze and visualize the generated keywords.
http://sujitpal.blogspot.com/2014/08/keyword-extraction-with-kea.html
You might try the Porter Stemming algorithm: the Java version is at http://tartarus.org/~martin/PorterStemmer/java.txt and the main page is at http://tartarus.org/~martin/PorterStemmer/. It's old, but doesn't do a bad job.
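For the simple case described in the question (stop words plus stemming), a hand-rolled version might look like the sketch below. The stemmer here is a deliberately naive plural-stripping rule, not the Porter algorithm, and the stop word list is a tiny placeholder; the class name SimpleKeywords is invented. In practice the stem method could be swapped for a call into the Porter stemmer linked above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

class SimpleKeywords {

    // Tiny placeholder list; a real list would be much longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "the", "and", "are", "for", "with", "that", "this", "from", "has", "have"));

    // Naive stemming: strips a plural ending. Not the Porter algorithm.
    static String stem(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2); // classes -> class
        }
        if (word.endsWith("s") && !word.endsWith("ss") && word.length() > 3) {
            return word.substring(0, word.length() - 1); // files -> file
        }
        return word;
    }

    // Lower-cases the text, splits on non-letters, drops short tokens and
    // stop words, and returns the stemmed remainder in order of first use.
    static Set<String> keywords(String text) {
        Set<String> result = new LinkedHashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.length() < 3 || STOP_WORDS.contains(token)) {
                continue;
            }
            result.add(stem(token));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(keywords("The decompiler reads classes and the class files"));
        // prints [decompiler, read, class, file]
    }
}
```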
Is there a tool to deobfuscate obfuscated Java code?
The code was extracted from a compiled class, but it is obfuscated and unreadable.
First step would be to learn with which tool it was obfuscated. Maybe there's already a "deobfuscator" around for the particular obfuscator.
On the other hand, you can also just run an IDE and use its refactoring powers. Rename the classes, methods and variables to something sensible. Use your human logical thinking powers to figure out what the code actually represents and name things accordingly. The picture will slowly but surely emerge.
Good luck.
Did you try to make the code less obscure with Java Deobfuscator (aka JDO), a kind of smart decompiler?
Currently JDO does the following:
renames obfuscated methods, variables, constants and class names to be unique and more indicative of their type
propagates changes throughout the entire source tree (beta)
has an easy to use GUI
allows you to specify the name for a field, method and class (new feature!)
Currently JDO does not do the following (but it might one day)
modify method bytecode in any way
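To give a feel for what "unique and more indicative of their type" can mean, here is a small sketch of type-based renaming. It is not JDO's actual algorithm or API; the class name TypeBasedRenamer, the naming scheme (type name plus "Field" plus a counter) and the input format (a map from obfuscated field name to declared type) are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class TypeBasedRenamer {

    // Maps each obfuscated field name to a new name derived from its type,
    // with a per-type counter to keep the names unique.
    static Map<String, String> rename(Map<String, String> fieldTypes) {
        Map<String, Integer> counters = new HashMap<>();
        Map<String, String> renamed = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : fieldTypes.entrySet()) {
            String type = entry.getValue();
            // Drop the package and make array types usable in an identifier.
            String simple = type.substring(type.lastIndexOf('.') + 1).replace("[]", "Array");
            String base = Character.toLowerCase(simple.charAt(0)) + simple.substring(1);
            int n = counters.merge(base, 1, Integer::sum);
            renamed.put(entry.getKey(), base + "Field" + n);
        }
        return renamed;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("a", "int");
        fields.put("b", "java.lang.String");
        fields.put("c", "int");
        System.out.println(rename(fields));
        // prints {a=intField1, b=stringField1, c=intField2}
    }
}
```

Names like these carry no real meaning, but they are at least distinct and readable, which makes the manual renaming pass in an IDE considerably easier.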
Not to gravedig but I wrote a tool that works on most commercial obfuscators
https://github.com/Contra/JMD
I used Java Deobfuscator (aka JDO), but it has a few bugs. It can't work with case-sensitive file names.
So I've changed the source and uploaded a patch for that on SourceForge.
The patch, Download
Most likely only human brainpower can make sense of it. Get the best decompiler available and ponder its output.
Maybe it will work on Unix/Linux/macOS?
If so, you could move one step of your process to a VM, where you unpack the code before you rename the overly long names. How long is the file name limit on Windows?