Comparing two .jars with different obfuscation - java

I need to compare two jar files that have many of the same classes but with different names.
Let's say you are looking for a class that contains this:
public class AStar {
private int verbose = 0;
private int maxSteps = -1;
private int numSearchSteps;
public ISearchNode bestNodeAfterSearch;
etc..., but it's obfuscated into
public class ard {
private int fas = 0;
private int asd = -1;
private int ags;
public ars arser;
and you have to compare the first file against hundreds of others to find this one.
My guess was a byte code comparison, but I can't find a tool for it or a method to compare all files against each other in the two jars.

I've done this in the past, but the problem is that generally a lot of manual work is also required to determine the type of information that is preserved, and which libraries to compare it with.
For example, in one case, I found that the obfuscated Jar had added a method to a library class which threw off the comparison until I found and accounted for it. Another common problem is that obfuscators will remove unused methods and interfaces and sometimes add obfuscator-specific methods.
In order to get good results, you can't just consider individual classes. You need to match up inheritance hierarchies, interfaces, and cross references between the classes in order to unambiguously match most classes, and even then it isn't always successful.
Luckily, they almost never reorder or change the signatures of the fields and methods. Otherwise it would be extremely difficult to collect enough information to unambiguously match up the classes. As it is, there are often classes with the exact same set of methods and inheritance (for example two classes that implement the same interface). If you're lucky, you'll be able to infer it by matching references from a third class, but this isn't always possible.
Anyway, I can send you my code if you want. It's designed for the recognition of open source libraries included in an obfuscated app, but it could probably be adapted to match two obfuscated apps as well.

You should be able to pull this off with ASM. It has pretty good documentation and quite a few samples.
You build an internal model from the types and values, and then compare and spit out the identical classes.
If it was you who obfuscated it, you should be able to get the mappings though...

In the general case, determining whether two arbitrary programs do the same thing for all inputs is undecidable (the halting problem reduces to it).
For the following, I'll assume the obfuscation doesn't mess with the class structure: it will only rename fields, methods and classes and possibly obfuscate bytecode.
Let's say you're looking for an obfuscated class that's equivalent to some class C. Here are some searches you could perform, in increasing order of difficulty:
Find all classes with the exact same number of fields and methods as C has.
For each obfuscated class, compute the set of field types it contains (but, for simplicity, don't include types that point to other obfuscated classes). All classes where this set of field types is not a subset of the field types of C can be filtered out.
Do the same for method signatures.
You could go further but it could get pretty complicated.
In the end, what works best depends on what specific things the obfuscator does and does not try to hide.
ASM is a good library for parsing and processing .class files.
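The field-type filter from the list above can be tried without any bytecode library, provided both jars can be loaded as classes. This is only a minimal sketch using plain reflection. The names FieldTypeFilter and JDK_OR_PRIMITIVE and the demo classes are made up for illustration, and the predicate assumes that only primitives and JDK types keep their names after obfuscation (array types are ignored for brevity).

```java
import java.lang.reflect.Field;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

final class FieldTypeFilter {
    // Assumption: only primitives and JDK types survive obfuscation unrenamed.
    static final Predicate<Class<?>> JDK_OR_PRIMITIVE =
            t -> t.isPrimitive() || t.getName().startsWith("java.");

    // Collects the names of all field types that the predicate considers stable.
    static Set<String> stableFieldTypes(Class<?> c, Predicate<Class<?>> isStable) {
        Set<String> types = new TreeSet<>();
        for (Field f : c.getDeclaredFields()) {
            if (!f.isSynthetic() && isStable.test(f.getType())) {
                types.add(f.getType().getName());
            }
        }
        return types;
    }

    // A candidate stays in the running only if its stable field types
    // are a subset of the original's stable field types.
    static boolean couldMatch(Class<?> original, Class<?> candidate,
                              Predicate<Class<?>> isStable) {
        return stableFieldTypes(original, isStable)
                .containsAll(stableFieldTypes(candidate, isStable));
    }

    // Hypothetical stand-ins for the classes in the question.
    interface ISearchNode {}
    interface Ars {}

    static class AStar {
        private int verbose = 0;
        private int maxSteps = -1;
        private int numSearchSteps;
        public ISearchNode bestNodeAfterSearch;
    }

    static class Ard {
        private int fas = 0;
        private int asd = -1;
        private int ags;
        public Ars arser;
    }

    static class Unrelated {
        private String name;
        private long id;
    }
}
```

With these demo classes, Ard passes the filter against AStar while Unrelated is dropped. The filter only narrows the candidate list; it does not prove a match.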

If the obfuscation changed only variable names, and not variable order or any of the compiler-generated bytecode, you should be able to do this with ASM or Javassist or other bytecode library. In fact, the list below can be done using regular Java reflection.
Two class files would be candidates for equality if:
They have the same number of methods
There is a 1-to-1 mapping between the parameter signatures of the methods in class A and class B
The matching methods also match in terms of flags (private/public, static, abstract, etc.)
That would be a pretty good match. Beyond that, you might have to get into the details of the byte code. The byte code should be similar, but references to the constant pool might be scrambled. You would have to decipher those. For example, one class might ldc #12 and the other might ldc #34; if it turns out that #12 in class A is the same as #34 in class B, they match (at least for that).
If the obfuscator rewires the order of parameters on private methods, it might be really hard to detect a match easily. Still, maybe all you need to do is to narrow it down to a reasonable number of candidates, so applying the list above to public and protected methods might be all you need.
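The three checks above (same method count, 1-to-1 signature mapping, same flags) can be folded into one comparable value per class. The following is a rough sketch with plain reflection, not a finished tool. MethodShape and the demo classes are hypothetical, the return type is included as an extra that goes slightly beyond the list, and any type outside java.* is replaced by "?" on the assumption that it may have been renamed.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class MethodShape {
    // Maps a type to a name that should survive renaming: JDK and primitive
    // types keep their name, everything else becomes "?".
    private static String typeKey(Class<?> t) {
        StringBuilder dims = new StringBuilder();
        Class<?> base = t;
        while (base.isArray()) {
            base = base.getComponentType();
            dims.append("[]");
        }
        boolean stable = base.isPrimitive() || base.getName().startsWith("java.");
        return (stable ? base.getName() : "?") + dims;
    }

    // One entry per declared method: flags, return type, parameter types.
    // Sorting makes the result independent of declaration order and names.
    static List<String> fingerprint(Class<?> c) {
        List<String> shapes = new ArrayList<>();
        for (Method m : c.getDeclaredMethods()) {
            if (m.isSynthetic()) continue; // compiler-generated, may differ
            int flags = m.getModifiers() & Modifier.methodModifiers();
            StringBuilder sb = new StringBuilder(Modifier.toString(flags));
            sb.append(' ').append(typeKey(m.getReturnType())).append('(');
            for (Class<?> p : m.getParameterTypes()) {
                sb.append(typeKey(p)).append(',');
            }
            shapes.add(sb.append(')').toString());
        }
        Collections.sort(shapes);
        return shapes;
    }

    static boolean sameShape(Class<?> a, Class<?> b) {
        return fingerprint(a).equals(fingerprint(b));
    }

    // Hypothetical demo classes.
    static class AStar {
        public int search(String start, int maxSteps) { return 0; }
        private static void log(String message) {}
    }

    static class Ard {
        public int a(String x, int y) { return 0; }
        private static void b(String s) {}
    }

    static class Other {
        public int a(String x) { return 0; }
    }
}
```

Two classes with equal fingerprints are only candidates for a match. As noted above, classes implementing the same interface can easily share a fingerprint.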

I use Beyond Compare to compare jar files:
http://www.scootersoftware.com/
You may have some luck using their additional file formats to compare .class files (decompiled)
http://www.scootersoftware.com/download.php?zz=kb_moreformats_win

What are 'real' and 'synthetic' Method parameters in Java?

Looking into j.l.r.Executable class I've found a method called hasRealParameterData() and from its name and code context I assume that it tells whether a particular method has 'real' or 'synthetic' params.
If I take e.g. method Object.wait(long, int) and call hasRealParameterData() it turns out that it returns false which is confusing to me, as the method is declared in Object class along with its params.
From this I've got a couple of questions:
What are 'real' and 'synthetic' Method parameters, and why does Java believe that the params of Object.wait(long, int) are not 'real'?
How can I define a method with 'real' params?
Preamble - don't do this.
As I mentioned in the comments as well: This is a package private method. That means:
[A] It can change at any time, and code built based on assuming it is there will need continuous monitoring; any new java release means you may have to change things. You probably also need a framework if you want your code to be capable of running on multiple different VM versions. Maybe it'll never meaningfully change, but you have no guarantee so you're on the hook to investigate each and every JVM version released from here on out.
[B] It's undocumented by design. It may return weird things.
[C] The java module system restriction stuff is getting tighter every release; calling this method is hard, and will become harder over time.
Whatever made you think this method is the solution to some problem you're having - unlikely. If it does what you want at all, there are probably significantly better solutions available. I strongly advise you take one step backwards and ask a question about the problem you're trying to solve, instead of asking questions about this particular solution you've come up with.
Having gotten that out of the way...
Two different meanings
The problem here is that 'synthetic' means two utterly unrelated things, 'real' is used as the opposite of each, and the docs use the meanings interchangeably. The 4 terms here are:
SYNTHETIC, the JVM flag. This term is in the JLS.
'real', a slang term used to indicate anything that is not marked with the JVM SYNTHETIC flag. This term is, as far as I know, not official. There isn't an official term other than simply 'not SYNTHETIC'.
Synthetic, as in, the parameter name (and other data not guaranteed to be available in class files) are synthesised.
Real, as in, not the previous bullet point's synthetic. The parameter is fully formed solely on the basis of what the class file contains.
The 'real' in hasRealParameterData is referring to the 4th bullet, not the second. But, all 4 bullet point meanings are used in various comments in the Executable.java source file!
The official meaning - the SYNTHETIC flag
The JVM has the notion of the synthetic flag.
This means it wasn't in the source code but javac had to make this element in order to make stuff work. This is done to paper over mismatches between java-the-language and java-the-VM-definition, as in, differences between .java and .class. Trivial example: At least until the nestmates concept, the notion of 'an inner class' simply does not exist at the class file level. There is simply no such thing. Instead, javac fakes it: It turns:
class Outer {
private static int foo() {
return 5;
}
class Inner {
void example() {
Outer.foo();
}
}
}
Into 2 seemingly unrelated classes, one named Outer, and one named Outer$Inner, literally like that. You can trivially observe this: Compile the above file and look at that - 2 class files, not one.
This leaves one problem: The JLS claims that inner classes get to call private members from their outer class. However, at the JVMS (class file) level, we turned these 2 classes into separate things, and thus, Outer$Inner cannot call foo. Now what? Well, javac generates a 'bridger' method. It basically compiles this instead:
class Outer {
private static int foo() {
return 5;
}
/* synthetic */ static int foo$() {
return foo();
}
}
class Outer$Inner {
private /* synthetic */ Outer enclosingInstance;
void example() {
Outer.foo$();
}
}
javac can generate fields and extra overload methods. For example, if you write class MyClass implements List<String> {}, you will write e.g. add(String x), but add(Object x) still needs to exist to cater to erasure. That method is generated by javac, and will be marked with the SYNTHETIC modifier.
One effect of the SYNTHETIC modifier is that javac acts as if these methods do not exist. If you attempt to actually write Outer.foo$() in java code, it won't compile, javac will act as if the method does not exist. Even though it does. If you use bytebuddy or a hex editor to clear that flag in the class file, then javac will compile that code just fine.
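The flag is visible through ordinary reflection, so the erasure case is easy to observe. Here is a small sketch that uses Comparable instead of List to keep it short; SyntheticDemo and Name are made-up names.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

final class SyntheticDemo {
    // Source code declares only compareTo(Name); javac adds compareTo(Object).
    static class Name implements Comparable<Name> {
        final String value;

        Name(String value) {
            this.value = value;
        }

        @Override
        public int compareTo(Name other) {
            return value.compareTo(other.value);
        }
    }

    // Returns the declared methods that carry the SYNTHETIC flag.
    static List<Method> syntheticMethods(Class<?> c) {
        List<Method> result = new ArrayList<>();
        for (Method m : c.getDeclaredMethods()) {
            if (m.isSynthetic()) {
                result.add(m);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (Method m : syntheticMethods(Name.class)) {
            System.out.println(m + " bridge=" + m.isBridge());
        }
    }
}
```

The generated compareTo(Object) shows up in getDeclaredMethods() even though it appears nowhere in the source.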
generating parameter names
Weirdly, perhaps, in the original v1.0 Java Language Spec, parameter types were, obviously, a required part of a method's signature and are naturally encoded in class files. You can write this code: Integer.class.getMethods();, loop through until you find the static parseInt method, and then ask the j.l.r.Method instance about its parameter type, which will dutifully report: the first param's type is String. You can even ask it for its annotations.
But weirdly enough as per JLS 1.0 you cannot ask for its name - simply because it is not there, there was no actual need to know it, it does take up space, java wanted to be installed on tiny devices (I'm just guessing at the reasons here), so the info is not there. You can add it - as debug info, via the -g parameter, because having the names of things is convenient.
However, this was later deemed too annoying, and since Java 8 javac can store the param names in the class file itself (in the MethodParameters attribute). You have to ask for it with the -parameters option; -g ('include debug symbol info') does not do it.
Which leaves one final question: Java 17 can still load classes produced by javac 1.1, as well as modern classes compiled without -parameters. So what is it supposed to do when you ask for the name of the first param of such a method? The name simply isn't there in the part of the class file that reflection reads. Tools that parse the class file themselves can fall back to the debug symbol table, but if that isn't there either, you're just out of luck.
What the JVM does is make that name arg0, arg1, etc. You may have seen this in decompiler outputs.
THAT is what the hasRealParameterData() method is referring to as 'real' - arg0 is 'synthesized', and in contrast, foo (the actual name of the param) is 'real'. It also explains Object.wait(long, int): the JDK's own classes evidently carry no parameter names, so reflection has to synthesize them.
So how would one have a method that has 'real' data in that sense (the 4th bullet)? Compile it with javac -parameters. Without that option, or after an obfuscator has stripped the names, you'll get non-real, as per hasRealParameterData().
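There is no need to touch the package-private method to see this distinction. The public Parameter.isNamePresent() reports, as far as I can tell from the same underlying data, whether a name came from the class file or was synthesized. A minimal sketch; ParamNames is a made-up class name.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Parameter;
import java.util.ArrayList;
import java.util.List;

final class ParamNames {
    // Lists each parameter name together with where the name came from.
    static List<String> describe(Method m) {
        List<String> out = new ArrayList<>();
        for (Parameter p : m.getParameters()) {
            String origin = p.isNamePresent() ? " (from class file)" : " (synthesized)";
            out.add(p.getName() + origin);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        Method wait = Object.class.getMethod("wait", long.class, int.class);
        System.out.println(describe(wait));
    }
}
```

Running describe against one of your own methods, compiled once with and once without javac's -parameters option, shows the difference directly.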

Reduce visibility of classes and methods

TL;DR: Given bytecode, how can I find out what classes and what methods get used in a given method?
In my code, I'd like to programmatically find all classes and methods having too generous access qualifiers. This should be done based on an analysis of inheritance, static usage and also hints I provide (e.g., using some home-brew annotation like @KeepPublic). As a special case, unused classes and methods will get found.
I just did something similar though much simpler, namely adding the final keyword to all classes where it makes sense (i.e., it's allowed and the class won't get proxied by e.g., Hibernate). I did it in the form of a test, which knows about classes to be ignored (e.g., entities) and complains about all needlessly non-final classes.
For all classes of mine, I want to find all methods and classes it uses. Concerning classes, there's this answer using ASM's Remapper. Concerning methods, I've found an answer proposing instrumentation, which isn't what I want just now. I'm also not looking for a tool like ucdetector which works with Eclipse AST. How can I inspect method bodies based on bytecode? I'd like to do it myself so I can programmatically eliminate unwanted warnings (which are plentiful with ucdetector when using Lombok).
Looking at the usage on a per-method basis, i.e. by analyzing all instructions, has some pitfalls. Besides method invocations, there might be method references, which will be encoded using an invokedynamic instruction, having a handle to the target method in its bsm arguments. If the byte code hasn’t been generated from ordinary Java code (or stems from a future version), you have to be prepared to possibly encounter ldc instructions pointing to a handle which would yield a MethodHandle at runtime.
Since you already mentioned “analysis of inheritance”, I just want to point out the corner cases, i.e. for
package foo;
class A {
public void method() {}
}
class B extends A implements bar.If {
}
package bar;
public interface If {
void method();
}
it’s easy to overlook that A.method() has to stay public.
If you stay conservative, i.e. you assume that B instances may end up as targets of If.method() invocations elsewhere in your application whenever you can’t prove otherwise, you won’t find much to optimize. I think that you need at least inlining of bridge methods and the synthetic inner/outer class accessors to identify unused members across inheritance relationships.
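The corner case can at least be detected mechanically. The sketch below uses plain reflection and only looks at the directly declared interfaces of one class. It flags methods that a class inherits from a superclass which does not itself implement the interface. InterfacePins is a hypothetical name, and the example types are squeezed into one file instead of the two packages used above.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashSet;
import java.util.Set;

final class InterfacePins {
    // Simplified single-file version of the foo/bar example.
    interface If {
        void method();
    }

    static class A {
        public void method() {}
    }

    static class B extends A implements If {}

    // Finds methods that implement one of c's interfaces but are declared in a
    // superclass unaware of that interface. Those must stay public.
    static Set<Method> inheritedImplementations(Class<?> c) {
        Set<Method> pinned = new LinkedHashSet<>();
        for (Class<?> iface : c.getInterfaces()) {
            for (Method declared : iface.getMethods()) {
                try {
                    Method impl = c.getMethod(declared.getName(),
                                              declared.getParameterTypes());
                    Class<?> owner = impl.getDeclaringClass();
                    if (!owner.isInterface() && !iface.isAssignableFrom(owner)) {
                        pinned.add(impl);
                    }
                } catch (NoSuchMethodException e) {
                    // e.g. a static interface method, which c does not inherit
                }
            }
        }
        return pinned;
    }
}
```

For B this reports A.method(), which is exactly the member that a per-class analysis of A would wrongly consider safe to narrow.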
When it comes to class references, there are even more possibilities that make a per-instruction analysis error prone. They may not only occur as owner of member access instructions, but also for new, checkcast, instanceof and array specific instructions, annotations, exception handlers and, even worse, within signatures which may occur at member references, annotations, local variable debugging hints, etc. The ldc instruction may refer to classes, producing a Class instance, which is actually used in ordinary Java code, e.g. for class literals, but as said, there’s also the theoretical possibility to produce MethodHandles which may refer to an owner class, but also have a signature bearing parameter types and a return type, or to produce a MethodType representing a signature.
You are better off analyzing the constant pool, however, that’s not offered by ASM. To be precise, a ClassReader has methods to access the pool, but they are actually not intended to be used by client code (as their documentation states). Even there, you have to be aware of pitfalls. Basically, the contents of a CONSTANT_Utf8_info bears a class or signature reference if a CONSTANT_Class_info resp. the descriptor index of a CONSTANT_NameAndType_info or a CONSTANT_MethodType_info points to it. However, declared members of a class have direct references to CONSTANT_Utf8_info pool entries to describe their signatures, see Methods and Fields. Likewise, annotations don’t follow the pattern and have direct references to CONSTANT_Utf8_info entries of the pool assigning a type or signature semantic to it, see enum_const_value and class_info_index…
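Reading the pool does not require a library; the class file layout is simple enough to walk with a DataInputStream. The sketch below collects only the CONSTANT_Class_info entries, so it deliberately misses every class that is mentioned solely inside a descriptor or annotation, which is exactly the pitfall described above. PoolClassRefs is a made-up name.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Set;
import java.util.TreeSet;

final class PoolClassRefs {
    // Returns the internal names (e.g. java/util/List, or an array
    // descriptor) of all CONSTANT_Class_info entries in a class file.
    static Set<String> classRefs(InputStream classFile) throws IOException {
        DataInputStream in = new DataInputStream(classFile);
        if (in.readInt() != 0xCAFEBABE) {
            throw new IOException("not a class file");
        }
        in.readUnsignedShort(); // minor version
        in.readUnsignedShort(); // major version
        int count = in.readUnsignedShort();
        String[] utf8 = new String[count];
        int[] classNameIndex = new int[count];
        for (int i = 1; i < count; i++) {
            int tag = in.readUnsignedByte();
            switch (tag) {
                case 1: // Utf8: u2 length + modified UTF-8 bytes
                    utf8[i] = in.readUTF();
                    break;
                case 7: // Class: index of the Utf8 entry holding the name
                    classNameIndex[i] = in.readUnsignedShort();
                    break;
                case 8: case 16: case 19: case 20: // String, MethodType, Module, Package
                    in.readUnsignedShort();
                    break;
                case 15: // MethodHandle: u1 kind + u2 index
                    in.readUnsignedByte();
                    in.readUnsignedShort();
                    break;
                case 3: case 4: case 9: case 10: case 11: case 12: case 17: case 18:
                    in.readInt(); // four bytes of payload
                    break;
                case 5: case 6: // Long, Double: eight bytes, occupy two slots
                    in.readLong();
                    i++;
                    break;
                default:
                    throw new IOException("unknown constant pool tag " + tag);
            }
        }
        Set<String> result = new TreeSet<>();
        for (int index : classNameIndex) {
            if (index != 0) {
                result.add(utf8[index]);
            }
        }
        return result;
    }
}
```

A possible call is PoolClassRefs.classRefs(Object.class.getResourceAsStream("/java/util/ArrayList.class")). To cover signatures as well, the descriptor strings of NameAndType, MethodType, field, and method entries would have to be parsed in addition.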

Java Obscured Obfuscation

Similar Questions: Here and Here
I guess the situation is pretty uncommon to begin with, and so I admit it is probably too localized for SO.
The Problem
public class bqf implements azj
{
...
public static float b = 0.0F;
...
public void b(...)
{
...
/* b, in both references below,
* is meant to be a class (in the
* default package)
*
* It is being obscured by field
* b on the right side of the
* expression.
*/
b var13 = b.a(var9, var2, new br());
...
}
}
The error is: cannot invoke a(aji, String, br) on primitive type float.
Compromisable limitations:
Field b cannot be renamed.
Class b cannot be renamed or refactored.
Why
I am modifying an obfuscated program. For irrelevant[?], unknown (to me), and uncompromisable reasons the modification must be done via patching the original jar with .class files. Hence, renaming the public field b or class b would require modifying much of the program. Because all of the classes are in the default package, refactoring class b would require me to modify every class which references b (much of the program). Nevertheless there is a substantial amount of modification I do intend on doing, and it is a pain to do it at the bytecode level; just not enough to warrant renaming/refactoring.
Possible Solutions
The most obvious one is to rename/refactor. There are thousands of classes, and every single one is in the default package. It seems like every java program I want to modify has that sort of obfuscation. : (
Anyways sometimes I do take the time to just go about manually renaming/refactoring the program. But when there are too many errors (I once did 18,000), this is not a viable option.
The second obvious option is to do it in bytecode (via ASM). Sometimes this is ok, when the modifications are small or simple enough. Unfortunately doing bytecode modifications on only the files which I can't compile through java (which is most of them, but this is what I usually try to do) is painfully slow by comparison.
Sometimes I can extend class b, and use that in my modified class. Obviously this won't always work, for example when b is an enum. Unfortunately this means a lot of extra classes.
It may be possible to create a class with static wrapper methods to avoid obscurity. I just thought of this.
A tool which remaps all of the names (not deobfuscate, just unique names), then unmaps them after you make modifications. That would be sweet. I should make one if it doesn't exist.
The problem would also be solved with a way to force the java compiler to require the keyword "this".
b.a(var9, var2, new br());
can easily be rewritten using reflection:
Class.forName("b").getMethod("a", argTypes...).invoke(null, var9, var2, new br());
The problem would also be solved with a way to force the java compiler to require the keyword "this".
I don't see how this would help you for a static member. The compiler would have to require us to qualify everything, i.e. basically disallow simple names altogether except for locals.
Write a helper method elsewhere that invokes b.a(). You can then call that.
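A sketch of that helper idea, with stand-in bodies: the classes b and bqf mirror the names in the question but their contents here are invented, and BCalls is a hypothetical helper name. The point is only that inside BCalls no field named b is in scope, so the simple name b resolves to the class.

```java
// Stand-in for the obfuscated class b (the real body is unknown).
class b {
    static String a(String arg) {
        return "b.a(" + arg + ")";
    }
}

// Helper with no member named b, so b unambiguously means the class here.
final class BCalls {
    private BCalls() {}

    static String a(String arg) {
        return b.a(arg);
    }
}

// Stand-in for the class being patched: its float field obscures class b.
class bqf {
    public static float b = 0.0F;

    String patched(String arg) {
        // Writing b.a(arg) here would not compile, because b is the float field.
        return BCalls.a(arg);
    }
}
```

One such forwarding method is needed per obscured call, which is less intrusive than reflection and keeps compile-time type checking.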
Note: In Java the convention is that the class would be named B and not b (which goes for bqf and azj too), and if that had been followed the problem would not have shown.
The real, long-term cure is not to put classes in the default package.

Can I always use the Reflection API if the code is going to be obfuscated?

I found that there seem to be 2 general solutions:
don't obfuscate what is referred to through the reflection API [Retroguard, Jobfuscate]
replace Strings in reflection API invocations with the obfuscated name.
Those solutions work only for calls within the same project - client code (in another project) may not use the reflection API to access non-public API methods.
In the case of 2 it also only works when the Reflection API is used with Strings known at compile-time (private methods testing?). In those cases dp4j also offers a solution injecting the reflection code after obfuscation.
Reading Proguard FAQ I wondered if 2 otherwise always worked when it says:
ProGuard automatically handles
constructs like
Class.forName("SomeClass") and
SomeClass.class. The referenced
classes are preserved in the shrinking
phase, and the string arguments are
properly replaced in the obfuscation
phase.
With variable string arguments, it's generally not possible to determine
their possible values.
Q: what does the statement in bold mean? Any examples?
With variable string arguments, it's generally not possible to determine their possible values.
public Class loadIt(String clsName) throws ClassNotFoundException {
return Class.forName(clsName);
}
Basically, if you pass a non-constant string to Class.forName, there's generally no way for ProGuard or any obfuscation tool to figure out what class you are talking about, and thus it can't automatically adjust the code for you.
The Zelix KlassMaster Java obfuscator can automatically handle all Reflection API calls. It has a function called AutoReflection which uses an "encrypted old name" to "obfuscated name" lookup table.
However, it again can only work for calls within the same obfuscated project.
See http://www.zelix.com/klassmaster/docs/tutorials/autoReflectionTutorial.html.
It means that this:
String className;
if (Math.random() <= 0.5) className = "ca.simpatico.Foo";
else className = "ca.simpatico.Bar";
Class cl = Class.forName(className);
Won't work after obfuscation. ProGuard doesn't do a deep enough dataflow analysis to see that the class name which gets loaded came from those two string literals.
Really, your only plausible option is to decide which classes, interfaces, and methods should be accessible through reflection, and then not obfuscate those. You're effectively defining a strange kind of API to clients - one which will only be accessed reflectively.
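Where the set of reflectively loaded classes is known in advance, one alternative to Class.forName with a variable string is a lookup table built from class literals, since the quoted FAQ says constructs like SomeClass.class are handled. This is a different technique from the ones discussed above, shown only as a sketch. ReflectiveApi, Foo, Bar and the keys are all hypothetical, and whether it fits depends on your obfuscator and its configuration.

```java
import java.util.Map;

final class ReflectiveApi {
    // Hypothetical classes that clients may request by key.
    static class Foo {}
    static class Bar {}

    // Keys are arbitrary strings chosen by you, not class names, so renaming
    // the classes does not invalidate them. The class literals are ordinary
    // references that an obfuscator can rewrite along with the classes.
    private static final Map<String, Class<?>> BY_KEY =
            Map.of("foo", Foo.class, "bar", Bar.class);

    static Class<?> load(String key) {
        Class<?> c = BY_KEY.get(key);
        if (c == null) {
            throw new IllegalArgumentException("unknown key: " + key);
        }
        return c;
    }
}
```

A caller then writes ReflectiveApi.load(Math.random() <= 0.5 ? "foo" : "bar") instead of computing a class name.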

Static String constants VS enum in Java 5+

I've read that question & answers:
What is the best way to implement constants in Java?
And came to the decision that an enum is a better way to implement a set of constants.
Also, I've read an example on the Sun web site of how to add behaviour to an enum (see the link in the previously mentioned post).
So there's no problem in adding a constructor with a String key to the enum to hold a bunch of String values.
The single problem here is that we need to add ".nameOfProperty" to get access to the String value.
So everywhere in the code we need to refer to the constant value not only by its name (EnumName.MY_CONSTANT), but like this (EnumName.MY_CONSTANT.propertyName).
Am I right here? What do you think of it?
Yes, the naming may seem a bit longer. But not as much as one could imagine...
Because the enum class already gives some context ("What is the set of constants that this belongs to?"), the instance name is usually shorter than the constant name (strong typing already discriminates from similarly named instances in other enums).
Also, you can use static imports to further reduce the length. You shouldn't use them everywhere, to avoid confusion, but I feel that code that is strongly linked to the enum can be fine with it.
In switches on the enum, you don't use the class name. (Switches are not even possible on Strings pre Java 7.)
In the enum class itself, you use the short names.
Because enums have methods, much low-level code that makes heavy use of the constants could migrate from business code to the enum class itself (as either an instance or a static method). As we saw, migrating code to the enum reduces the use of long names even further.
Constants are often treated in groups, such as an if that tests for equality with one of six constants, or four others etc. Enums are equipped with EnumSets with a contains method (or similarly an instance method that returns the appropriate group), that allow you to treat a group as a group (as a secondary advantage, note that these two implementations of the grouping are extraordinarily fast - O(1) - and low on memory!).
With all these points, I found that the actual code is much, much shorter!
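To make the points about switches, methods on the enum and EnumSet groups concrete, here is a small sketch. Level, its keys and LevelDemo are invented for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

enum Level {
    DEBUG("debug"), INFO("info"), WARN("warn"), ERROR("error");

    // A group of constants, handled as a group.
    private static final Set<Level> PROBLEMS = EnumSet.of(WARN, ERROR);

    private final String key;

    Level(String key) {
        this.key = key;
    }

    String key() {
        return key;
    }

    // Logic that would otherwise be a chain of equality tests in business code.
    boolean isProblem() {
        return PROBLEMS.contains(this);
    }
}

final class LevelDemo {
    static String describe(Level level) {
        // Inside a switch the constants are written without the enum name.
        switch (level) {
            case DEBUG:
            case INFO:
                return "fine";
            default:
                return level.key() + (level.isProblem() ? "!" : "");
        }
    }
}
```

Client code only ever writes Level.WARN.isProblem() or LevelDemo.describe(level); the String key stays an internal detail of the enum.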
With regard to the question about constants - enums should represent constants that are all the same type. If you are doing arbitrary constants this is the wrong way to go, for reasons all described in that other question.
If all you want are String constants, then with regard to verbose code you are right. However, you could override the toString() method to return the value of the property. If all you want to do is concatenate the String to other Strings then this will save you some extra verbosity in your code.
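A minimal sketch of the toString() idea; the enum Message and its texts are made up.

```java
enum Message {
    GREETING("Hello"), FAREWELL("Goodbye");

    private final String text;

    Message(String text) {
        this.text = text;
    }

    // String concatenation calls toString(), so no ".text" is needed there.
    @Override
    public String toString() {
        return text;
    }
}
```

With this, Message.GREETING + ", world" builds the text directly, while Message.GREETING.name() still gives the constant's identifier when that is what you need.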
However, have you considered using Properties files or some other means of internationalisation? Often when defining sets of Strings it is for user interface messages, and extracting these to a separate file might save you a lot of future work, and makes translation much easier.
