Randomized testing in Java

I write a lot of unit tests. Often, you need to write carefully considered test cases by hand, a form of whitebox testing. If you are lucky enough to work for a company with separate quality assurance engineers, perhaps someone else writes test cases for you (kind of a mix between white and black box testing).
Many times, however, randomized testing would find many bugs and would serve as a great complement to hand-written cases.
For example, I might have a self-contained class and be able to express the invariants and broad-stroke behavior of the class simply (such as "this method never throws an exception" or "this method always returns a positive value"). I would like a test framework that just bashes on my class and checks the invariants.
A similar case: I often have a class which implements similar functionality to another class (but does it with different performance characteristics or with some added functionality). I would like to A vs B test the two classes in a randomized way. For example, if I were implementing TreeMap, I could use HashMap as a comparable implementation (modulo a few differences due to the sorted behavior of TreeMap) and check most of the basic functionality in a randomized way. Similarly, someone implementing LinkedList could use ArrayList as a comparable implementation, and vice versa.
I've written some basic stuff to do this in the past, but it is painstaking to set up all the boilerplate to:
Create objects with random initial state
Apply random mutations
Create mappings between "like" objects for A vs B testing
Define invariants and rules such as "when will exceptions be thrown"
I still do it from time to time, but I want to reduce my effort level. For example, are there frameworks that remove or simplify the required boilerplate?
Otherwise, what techniques are used to do randomized testing in Java?
This is related to, but not the same as, fuzz testing. Fuzz testing seems to focus on random inputs to a single entity, in the hope of triggering bad behavior, often with an adaptive input model based on dynamic coverage observations. That covers a lot of the above, but doesn't cover things like A vs B testing when comparable implementations exist, or invariant checking. In any case, I'm also interested in decent fuzz testing libraries for Java.

I think what you're trying to find is a library for Property Based Testing in Java (see types of randomized testing). In short: instead of testing the value of the result, you test a property of it. E.g. instead of checking that 2 + 2 is 4, you check properties like:
random1 + 0 = random1
random1 + random2 >= random1
...
Take a look at this article that explains Property Based Testing in detail.
Another option that you mention is to check against a Test Oracle - something that knows the true answer (e.g. an old, bullet-proof algorithm). So you pass a random value to both the old and the new algorithm and check that the results are equal.
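For the oracle flavor, here is a minimal sketch in plain JUnit 4, using the TreeMap-vs-HashMap idea from the question; the seed, operation mix and iteration count are arbitrary choices, not recommendations:
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class TreeMapVsHashMapOracleTest {
    @Test
    public void randomOperationsAgreeWithOracle() {
        Random random = new Random(42);              // fixed seed keeps the run repeatable
        Map<Integer, String> candidate = new TreeMap<>();
        Map<Integer, String> oracle = new HashMap<>();
        for (int i = 0; i < 10_000; i++) {
            int key = random.nextInt(100);           // small key space forces overwrites and hits
            switch (random.nextInt(3)) {
                case 0:
                    assertEquals(oracle.put(key, "v" + i), candidate.put(key, "v" + i));
                    break;
                case 1:
                    assertEquals(oracle.remove(key), candidate.remove(key));
                    break;
                default:
                    assertEquals(oracle.get(key), candidate.get(key));
            }
            assertEquals(oracle.size(), candidate.size());
        }
        assertEquals(oracle, candidate);             // Map.equals ignores iteration order
    }
}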
Couple of Java libraries:
JUnit QuickCheck - a specialized lib for Property Based Testing. It allows you to define properties and passes random values to them to check. As of June 2016 it's pretty young, so you may want to check out ScalaCheck, since it's possible to write Scala tests for Java code.
Datagen - random values generator for Java in case standard randomizers are not enough. Disclaimer: I'm the author.
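For illustration, a small junit-quickcheck sketch of the addition properties above, written against the JUnit 4 runner API of recent versions of the library; the "random1 + random2 >= random1" property is replaced here by commutativity, which holds even under overflow:
import com.pholser.junit.quickcheck.Property;
import com.pholser.junit.quickcheck.runner.JUnitQuickcheck;
import org.junit.runner.RunWith;
import static org.junit.Assert.assertEquals;

@RunWith(JUnitQuickcheck.class)
public class AdditionProperties {
    @Property
    public void zeroIsTheIdentity(int x) {
        assertEquals(x, x + 0);
    }

    @Property
    public void additionIsCommutative(int x, int y) {
        assertEquals(x + y, y + x);   // true for all ints, overflow included
    }
}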

Related

AssertJ testing on a collection: which is better, 'extracting' or Java 8 forEach?

I'm new to AssertJ and am using it to unit-test the code I've written, and I was wondering how to assert on a list.
Let's assume we have a list of Consumer entities. Each entity has its own Phone and its own ServiceProvider, which in turn has its own Name and EntityName.
Now we want to assert that each entity returned from a repository gets the right data, so we want to test that each item in the list has an equal Phone.
ConsumerEntity savedConsumer1 = new ConsumerEntity(phone, name, serviceProvider);
List<ConsumerEntity> consumerListFromRepository = repository.findAllByPhone(phone);
Now I want to test that the data given from Repository is correct,
I can use this:
assertThat(consumerListFromRepository)
.extracting(ConsumerEntity::getPhone)
.containsOnly(savedConsumer1.getPhone());
Or I can do this with forEach (java 8):
consumerListFromRepository.forEach(consumerEntity ->
assertThat(consumerEntity.getPhone()).isEqualTo(savedConsumer1.getPhone()));
1. Which one is faster/simpler/more readable? I would go for the forEach, for fewer lines of code, but it is also less readable.
2. Is there any other way to do it as a one-liner like the forEach, but with assertThat, so that it is readable and simple and without the need to call isEqualTo each time? Something like:
assertThat(list).forEach...
3. Which one is faster? Extracting or forEach?
Thanks!
I'm not sure that "faster" is a primary concern here. It's likely that any performance difference is immaterial; either the underlying implementations are ~equivalent in terms of non-functionals or - since the context here is a unit test - the consumerListFromRepository is trivially small thereby limiting the scope for any material performance differences.
I think your main concerns here should be
Making it as easy as possible for other developers to:
Understand/reason about your test case
Edit/refactor your test case
Ensuring that your approach to asserting is consistent with other test cases in your code base
Judging which of your two approaches best ticks this box is somewhat subjective but I think the following considerations are relevant:
The Java 8 forEach construct is well understood and the isEqualTo matcher is explicit and easily understood
The AssertJ extracting helper paired with containsOnly is less common than Java 8's forEach construct, but this pairing reads logically and is easily understood
So, IMHO both approaches are valid. If your code base consistently uses AssertJ then I'd suggest using the extracting helper paired with the containsOnly matcher for consistency. Otherwise, use whichever of them reads best to you :)
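Regarding question 2: if your AssertJ version has it, allSatisfy gives you the forEach readability inside a single assertThat chain. This is a sketch against the hypothetical ConsumerEntity from the question; note that allSatisfy passes vacuously on an empty list, hence the isNotEmpty guard:
assertThat(consumerListFromRepository)
    .isNotEmpty()
    .allSatisfy(consumer ->
        assertThat(consumer.getPhone()).isEqualTo(savedConsumer1.getPhone()));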

Mocking a bit stream reader with Mockito

I am currently debugging a rather complicated algorithm that fixes errors in a bit stream. The BitReader interface is quite simple, and the main reading method is like this:
/**
 * Reads bits from the stream.
 * @param length number of bits to read (<= 64)
 * @return read bits in the least significant bits
 */
long read(int length) throws IOException;
The objective is to test whether BitStreamFixer actually fixes the stream (in a way that is too hard to describe here). Basically I need to provide “broken” inputs to it and test whether its output is as correct as it can be (some inputs can't be fixed completely), like this:
BitStreamFixer fixer = new BitStreamFixer(input);
int word1 = fixer.readWord();
int word2 = fixer.readWord();
// possibly a loop here
assertEquals(VALID_WORD1, word1);
assertEquals(VALID_WORD2, word2);
// maybe a loop here too
Now, the BitStreamFixer class accepts an instance of BitReader. When unit testing the fixer, I obviously need one such instance. But where do I get one? I have two obvious options: either give it a real implementation of BitReader or mock it.
The former option is not really appealing because it would create a dependency on another object which has nothing to do with the class being tested. Moreover, it's not that easy, because existing BitReader implementations read from input streams, so I'd need either a file or a specially prepared byte array, which is a tedious thing to do.
The latter option looks better and fits the usual unit testing approach. However, since I'm not even supposed to know what arguments the fixer will give to read, mocking it is not easy. I'd have to go with the when(bitReader.read(anyInt())).thenAnswer(...) approach, implementing a custom answer with a lot of bit-fiddling logic to spoon-feed the object under test with proper bits in chunks of whatever size it asks for. Considering that the bit streams I'm working with have a rather complicated higher-level structure, that's not easy. And introducing logic into unit tests doesn't smell good either.
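For concreteness, a rough sketch of what that Answer might look like (Mockito 2 style; the bit pattern and the most-significant-bit-first convention are made up for illustration):
import static org.mockito.ArgumentMatchers.anyInt;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

// Hypothetical "broken" stream, spelled out bit by bit for readability.
StringBuilder bits = new StringBuilder(
        "1000000010000000"    // 0x8080
      + "000000000000"        // invalid 12-bit sequence
      + "1000000010000000");  // 0x8080

BitReader bitReader = mock(BitReader.class);
when(bitReader.read(anyInt())).thenAnswer(invocation -> {
    int length = invocation.getArgument(0);
    String chunk = bits.substring(0, length);   // assumes the fixer never over-reads
    bits.delete(0, length);
    return Long.parseLong(chunk, 2);
});

BitStreamFixer fixer = new BitStreamFixer(bitReader);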
What do you think, is there any other option? Or maybe one of these can be improved in a way I fail to notice?
Write, test, and use a clear reusable test helper.
In a general sense, in unit testing, you're supposed to establish confidence in a system by watching it successfully interact with systems that you DO have confidence in. Of course you also want the system to be fast, deterministic, and easy to read/modify, but ultimately those come secondary to the assertion that your system works.
You've listed two options:
Use a mock BitReader, where you have enough confidence in predicting your system's interactions that you can set up the entire "when A then B" conversation. Mocking can be pretty easy when you have a small API surface of independent methods, like an RPC layer, but mocking can be very difficult when you have a stateful object with unpredictable method calls. Mocking is further useful to deterministically stub nondeterministic systems, like external servers or pseudorandom sources, or systems that don't exist yet; none of those is the case for you.
Because your read method can take a wide variety of parameters, each of which is valid and changes your system's state, it's probably not a smart idea to use mocking here. Unless the order of calls that BitStreamFixer makes to BitReader is deterministic enough to be made part of its contract, a mock BitReader will likely result in a brittle test: one that breaks when the implementation changes even if the system is perfectly functional. You'll want to avoid that.
Note that mocking should never yield "complicated logic", only complicated set-up. You're using mocks to avoid using real logic in your tests.
Use a real BitReader, which sounds like it will be painful and opaque to construct. This is probably the most realistic solution, though, especially if you've already finished writing and testing it.
You worry about "introducing new dependencies", but if your BitReader implementation exists and is fast, deterministic, and well-tested, then you shouldn't feel any worse about using it than using a real ArrayList or ByteArrayInputStream in your test. It sounds like the only real problem here is that creating the byte array would make it hard to maintain your test, which is a valid consideration.
In the comments, though, the real answer comes through: Build the BitWriter you're missing.
@Test
public void shouldFixBrokenStream() {
  BitReader bitReader = new StreamBitReader(BitWriter.create()
      .pushBits(16, 0x8080)
      .pushBits(12, 0x000) // invalid 12-bit sequence
      .pushBits(16, 0x8080)
      .asByteArrayInputStream());
  BitStreamFixer fixer = new BitStreamFixer(bitReader);
  assertEquals(0x80808080, fixer.read(32));
}
/** Of course, you could skip the StreamBitReader and just make an in-memory BitReader yourself. */
@Test
public void shouldFixBrokenStream_bitReader() {
  BitReader bitReader = new InMemoryBitReader();
  bitReader.pushBits(16, 0x8080);
  bitReader.pushBits(12, 0x000); // invalid 12-bit sequence
  bitReader.pushBits(16, 0x8080);
  BitStreamFixer fixer = new BitStreamFixer(bitReader);
  assertEquals(0x80808080, fixer.read(32));
}
This is more readable than constructing an opaque bitstream offline and copy-pasting it into your test (particularly if well-commented), less brittle than mocks, and much more testable itself than an anonymous inner class (or Answer-based version of the same). It is also likely that you can use a system like that across multiple test cases, and possibly even multiple tests.

How to implement a complex algorithm with TDD

I have a fairly complex algorithm I would like to implement (in Java) with TDD. The algorithm, for translation of natural languages, is called stack decoding.
When I tried to do that, I was able to write and fix some simple test cases (empty translation, one word, etc.), but I am not able to move in the direction of the algorithm I want. I mean that I can't figure out how to implement the bulk of the algorithm, described below, in baby steps.
This is the pseudo code of the algorithm:
1: place empty hypothesis into stack 0
2: for all stacks 0...n − 1 do
3: for all hypotheses in stack do
4: for all translation options do
5: if applicable then
6: create new hypothesis
7: place in stack
8: recombine with existing hypothesis if possible
9: prune stack if too big
10: end if
11: end for
12: end for
13: end for
Am I missing any way I can do the baby steps, or should I just get some coverage and do the major implementation?
TL;DR
Recast your task as testing an implementation of an API.
As the algorithm is expressed in terms of concepts (hypothesis, translation option) that are themselves quite complicated, program it bottom-up by first developing (using TDD) classes that encapsulate those concepts.
As the algorithm is not language specific, but abstracts language specific operations (all translation options, if applicable), encapsulate the language specific parts in a separate object, use dependency injection for that object, and test using simple fake implementations of that object.
Use your knowledge of the algorithm to suggest good test cases.
Test the API
By focusing on the implementation (the algorithm), you are making a mistake. Instead, first imagine you had a magic class which did the job that the algorithm performs. What would its API be? What would its inputs be, what would its outputs be? And what would the required connections between the inputs and outputs be. You want to encapsulate your algorithm in this class, and recast your problem as generating this class.
In this case, it seems the input is a sentence that has been tokenized (split into words), and the output is a tokenized sentence which has been translated into a different language. So I guess the API is something like this:
interface Translator {
    /**
     * Translate a tokenized sentence from one language to another.
     *
     * @param original The sentence to translate, split into words,
     *        in the language of the {@linkplain #getTranslatesFrom() locale this
     *        translates from}.
     * @return The translated sentence, split into words,
     *         in the language of the {@linkplain #getTranslatesTo() locale this
     *         translates to}. Not null; containing no null or empty elements.
     *         An empty list indicates that the translator was unable to translate the
     *         given sentence.
     * @throws NullPointerException
     *         If {@code original} is null, or contains a null element.
     * @throws IllegalArgumentException
     *         If {@code original} is empty or has any empty elements.
     */
    public List<String> translate(List<String> original);

    public Locale getTranslatesFrom();

    public Locale getTranslatesTo();
}
That is, an example of the Strategy design pattern. So your task becomes not "how do I use TDD to implement this algorithm" but rather "how do I use TDD to implement a particular case of the Strategy design pattern".
Next you need to think up a sequence of test cases, from easiest to hardest, that use this API. That is, a set of original sentence values to pass to the translate method. For each of those inputs, you must give a set of constraints on the output. The translation must satisfy those constraints. Note that you already have some constraints on the output:
Not null; not empty; containing no null or empty elements.
You will want some example sentences for which you know for sure what the algorithm should output. I suspect you will find there are very few such sentences. Arrange these tests from easiest to pass to hardest to pass. This becomes your TODO list while you are implementing the Translator class.
Implement Bottom Up
You will find that making your code pass more than a couple of these cases is very hard. So how can you thoroughly test your code?
Look again at the algorithm. It is complicated enough that the translate method does not do all the work directly itself. It will delegate to other classes for much of the work:
place empty hypothesis into stack 0
Do you need a Hypothesis class? A HypothesisStack class?
for all translation options do
Do you need a TranslationOption class?
if applicable then
Is there a method TranslationOption.isApplicable(...)?
recombine with existing hypothesis if possible
Is there a Hypothesis.combine(Hypothesis) method? A Hypothesis.canCombineWith(Hypothesis) method?
prune stack if too big
Is there a HypothesisStack.prune() method?
Your implementation will likely need additional classes. You can implement each of those individually using TDD. Your few tests of the Translator class will end up being integration tests. The other classes will be easier to test than the Translator because they will have precisely defined, narrow definitions of what they ought to do.
Therefore, defer implementing Translator until you have implemented those classes it delegates to. That is, I recommend that you write your code bottom up rather than top down. Writing code that implements the algorithm you have given becomes the last step. At that stage you have classes that you can use to write the implementation using Java code that looks very like the pseudo code of your algorithm. That is, the body of your translate method will be only about 13 lines long.
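To make that concrete, here are hypothetical skeletons of the concept classes the pseudo code suggests; the names and signatures are guesses to TDD against, not a prescribed design:
import java.util.ArrayList;
import java.util.List;

class Hypothesis {
    boolean canCombineWith(Hypothesis other) { /* ... */ return false; }
    Hypothesis combine(Hypothesis other) { /* ... */ return this; }
}

class TranslationOption {
    boolean isApplicable(Hypothesis h) { /* ... */ return false; }
    Hypothesis extend(Hypothesis h) { /* ... */ return h; }
}

class HypothesisStack {
    private final List<Hypothesis> hypotheses = new ArrayList<>();

    void place(Hypothesis h) { hypotheses.add(h); }
    void prune(int maxSize) { /* drop the weakest hypotheses beyond maxSize */ }
    List<Hypothesis> contents() { return hypotheses; }
}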
Push Real-Life Complications into an Associated Object
Your translation algorithm is general purpose; it can be used to translate between any pair of languages. I guess that what makes it applicable to translating a particular pair of languages is the "for all translation options" and "if applicable then" parts. I guess the latter could be implemented by a TranslationOption.isApplicable(Hypothesis) method. So what makes the algorithm specific to a particular language is generating the translation options. Abstract that to a factory object that the class delegates to. Something like this:
interface TranslationOptionGenerator {
    Collection<TranslationOption> getOptionsFor(Hypothesis h, List<String> original);
}
Now, so far you have probably been thinking in terms of translating between real languages, with all their nasty complexities. But you do not need that complexity to test your algorithm. You can test it using a fake pair of languages, which are much simpler than real languages. Or (equivalently) using a TranslationOptionGenerator that is not as rich as a practical one. Use dependency injection to associate the Translator with a TranslationOptionGenerator.
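A sketch of what that injection might look like, using plain constructor injection and a made-up class name:
import java.util.List;
import java.util.Locale;

public final class StackDecodingTranslator implements Translator {
    private final TranslationOptionGenerator optionGenerator;

    public StackDecodingTranslator(TranslationOptionGenerator optionGenerator) {
        this.optionGenerator = optionGenerator;   // language-specific behaviour lives behind this
    }

    @Override
    public List<String> translate(List<String> original) {
        // the stack decoding loop goes here, asking optionGenerator for the options
        throw new UnsupportedOperationException("not implemented yet");
    }

    @Override
    public Locale getTranslatesFrom() { return Locale.ENGLISH; }   // placeholder

    @Override
    public Locale getTranslatesTo() { return Locale.ENGLISH; }     // placeholder
}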
Use Simple Fake Implementations instead of Realistic Associates
Now consider some of the most extreme simple cases of a TranslationOptionGenerator that the algorithm must deal with:
There are no options, even for the empty hypothesis
There is only one option
There are only two options.
An English to Harappan translator. Nobody knows how to translate to Harappan, so this translator indicates it is unable to translate every time.
The identity translator: a translator that translates English to English.
A Limited English to newborn translator: it can translate only the sentences "I am hungry", "I need changing" and "I am tired", which it translates to "Waaaa!".
A simple British to US English translator: most words are the same, but a few have different spellings.
You can use this to generate test cases that do not require some of the loops, or do not require the isApplicable test to be made.
Add those test cases to your TODO list. You will have to write simple fake TranslationOptionGenerator objects for use by those test cases, as sketched below.
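Two such fakes might look like this; the TranslationOption.translating factory is hypothetical, standing in for however options get built in your real code:
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// The identity translator: every word can only be "translated" to itself.
class IdentityOptionGenerator implements TranslationOptionGenerator {
    @Override
    public Collection<TranslationOption> getOptionsFor(Hypothesis h, List<String> original) {
        List<TranslationOption> options = new ArrayList<>();
        for (String word : original) {
            options.add(TranslationOption.translating(word, word));  // hypothetical factory
        }
        return options;
    }
}

// The English-to-Harappan translator: no option is ever available.
class NoOptionsGenerator implements TranslationOptionGenerator {
    @Override
    public Collection<TranslationOption> getOptionsFor(Hypothesis h, List<String> original) {
        return Collections.emptyList();
    }
}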
The key here is to understand that "baby steps" do not necessarily mean writing just a small amount of production code at a time. You can just as well write a lot of it, provided you do it to pass a relatively small & simple test.
Some people think that TDD can only be applied one way, namely by writing unit tests. This is not true. TDD does not dictate the kind of tests you should write. It's perfectly valid to do TDD with integration tests that exercise a lot of code. Each test, however, should be focused in a single well-defined scenario. This scenario is the "baby step" that really matters, not the number of classes that may get exercised by the test.
Personally, I develop a complex Java library exclusively with integration-level tests, rigorously following the TDD process. Quite often, I create a very small and simple-looking integration test, which ends up taking a lot of programming effort to make pass, requiring changes to several existing classes and/or the creation of new ones. This has been working well for me for the past 5+ years, to the point I now have over 1300 such tests.
TL;DR: Start with nothing and don't write any code that a test does not call for. Then you can't help but TDD the solution
When building something via TDD, you should start with nothing then implement test cases until it does the thing you want. That's why they call it red green refactoring.
Your first test would be to see if the internal object has 0 hypotheses (an empty implementation would return null [red]). Then you initialize the hypothesis list [green].
Next you would write a test that the hypothesis is checked (it's just created [red]). Implement the "if applicable" logic and apply it to the one hypothesis [green].
You write a test that when a hypothesis is applicable, then create a new hypothesis (check if there's > 1 hypothesis for a hypothesis that's applicable [red]). Implement the create hypothesis logic and stick it in the if body. [green]
Conversely, if a hypothesis is not applicable, then do nothing [green].
Just kinda follow that logic to individually test the algorithm over time. It's much easier to do with an incomplete class than a complete one.

Generate random but static test data

When designing test cases, I want to be able to use data that is random but static.
If I use data that is not random, then I will use trivial examples that are representative of the data I expect, rather than the data I have guarded against in my code. For example, if my code is expecting a string with a max length of 15 characters, then I would rather specify these constraints and have the data generated for me within those constraints, than use some arbitrary example which may, due to my expectations, fall within a stricter set of constraints.
If I use data that is not static, then my tests won't be repeatable. It is no good using a string that changes every time the test is run if the test then fails occasionally. It would be much better to use a consistent string and then specify more constraints upon how that string is generated (and obviously make the same checks in my code) if and when a bug is found.
Is this a good strategy for test data?
If so, then I know how to achieve both of these goals independently. For static, but non-random data I just enter something arbitrary e.g. foo. For something random but not static, I just use apache random utils e.g. randomString(5). How can I get both?
Note that when data must be unique, it would also be handy to have some way to specify that two pieces of generated data must be distinct. Randomness does this most of the time but cannot be relied upon, obviously, without having unreliable tests!
TL;DR: How can I specify the type of data I want to generate, without having randomised generated data?
Use a Random with a constant seed. You can use the Random(long seed) constructor for that.
The RandomStringUtils.random() method can accept a Random source, which you could have created with a constant seed as described.
Using a constant seed is very useful for making experiments reproducible - and using them is a very good practice, IMO.
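Putting the two together, a small sketch with commons-lang3; the 7-argument RandomStringUtils.random overload is the one that accepts an explicit Random source, and the seed value is arbitrary:
import java.util.Random;
import org.apache.commons.lang3.RandomStringUtils;

public class SeededData {
    // Same seed on every run => the same "random" data on every run.
    private static final Random RANDOM = new Random(20160101L);

    public static String randomAlphanumeric(int length) {
        return RandomStringUtils.random(length, 0, 0, true, true, null, RANDOM);
    }
}
One caveat: a shared Random makes the generated values depend on the order in which tests run, so creating (or re-seeding) the Random in each test's setup keeps the tests independent of each other.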
Don't do it. It gives you a headache, makes your tests unreadable, and gives you no benefit. You already see the problems: specification of constraints. So let's look at the imaginary benefits. You worry that data you provide manually is more constrained than random data would be. But you want to use the same data every time (same seed). So how do you know that the random data is better than your manually provided data? How do you know that you chose the seed properly? If you are not sure that your test data is good enough, then:
simplify your code (extract methods/classes, avoid ifs, avoid nulls, be more immutable and functional)
look at edge cases and include them in your tests
look at generated data and check if some of them differs from what you were thinking of and add those data to your tests
use mutation testing
whenever a bug is discovered during development, UAT or production, add that data to your tests
do truly random (not repetitive), long-running tests. Every generated input that breaks the tests should be logged and added to your deterministic unit tests.
By pretending to use random data you just lie to yourself. The data is not random, you don't control it, and it makes you stop thinking about the edge cases of your code. So don't do it; face the truth, make your tests readable, and check more conditions.
What you are describing is property based testing - the best known example being Haskell's quickcheck.
http://www.haskell.org/haskellwiki/Introduction_to_QuickCheck1
There have been a number of java ports such as
https://bitbucket.org/blob79/quickcheck
https://github.com/kjw/supercheck
https://github.com/pholser/junit-quickcheck
The Quickcheck philosophy emphasises the use of random data, but most (all?) of the java ports allow you to set a fixed seed so that the generated values are repeatable.
I've never got round to actually trying this approach, but I would hope it would make your tests more readable (rather than less readable as piotrek suggests), by separating the values from the tests.
If knowledge of the values is important to understand the test/SUT behavior then it is the wrong approach.
Instancio is a data generation library for unit tests that does what you are looking for. E.g. if you need a random string of a certain length:
Foo foo = Instancio.of(Foo.class)
.generate(field("fooString"), gen -> gen.string().length(10))
.create();
To generate random but predictable values, you can supply a seed:
Foo foo = Instancio.of(Foo.class)
.generate(field("fooString"), gen -> gen.string().length(10))
.withSeed(123)
.create();
Or if you use JUnit 5:
@ExtendWith(InstancioExtension.class)
class ExampleTest {

    @Seed(1234)
    @Test
    void example() {
        Foo foo = Instancio.of(Foo.class)
            .generate(field("fooString"), gen -> gen.string().length(10))
            .create();
        // ...
    }
}
If you need predictable data always, you can configure a global seed value through a properties file. This way you won't need to specify it in the code.
https://github.com/instancio/instancio/

How to test for equality of complex object graphs?

Say I have a unit test that wants to compare two complex objects for equality. The objects contain many other deeply nested objects. All of the objects' classes have correctly defined equals() methods.
This isn't difficult:
@Test
public void objectEquality() {
    Object o1 = ...
    Object o2 = ...
    assertEquals(o1, o2);
}
Trouble is, if the objects are not equal, all you get is a fail, with no indication of which part of the object graph didn't match. Debugging this can be painful and frustrating.
My current approach is to make sure everything implements toString(), and then compare for equality like this:
assertEquals(o1.toString(), o2.toString());
This makes it easier to track down test failures, since IDEs like Eclipse have a special visual comparator for displaying string differences in failed tests. Essentially, the object graphs are represented textually, so you can see where the difference is. As long as toString() is well written, it works great.
It's all a bit clumsy, though. Sometimes you want to design toString() for other purposes, like logging; maybe you only want to render some of the object's fields rather than all of them; or maybe toString() isn't defined at all, and so on.
I'm looking for ideas for a better way of comparing complex object graphs. Any thoughts?
The Atlassian Developer Blog had a few articles on this very same subject, and how the Hamcrest library can make debugging this kind of test failure very very simple:
How Hamcrest Can Save Your Soul (part 1)
Hamcrest saves your soul - Now with less suffering! (part 2)
Basically, for an assertion like this:
assertThat(lukesFirstLightsaber, is(equalTo(maceWindusLightsaber)));
Hamcrest will give you back the output like this (in which only the fields that are different are shown):
Expected: is {singleBladed is true, color is PURPLE, hilt is {...}}
but: is {color is GREEN}
What you could do is render each object to XML using XStream, and then use XMLUnit to perform a comparison on the XML. If they differ, then you'll get the contextual information (in the form of an XPath, IIRC) telling you where the objects differ.
e.g. from the XMLUnit doc:
Comparing test xml to control xml [different]
Expected element tag name 'uuid' but was 'localId' -
comparing <uuid...> at /msg[1]/uuid[1] to <localId...> at /msg[1]/localId[1]
Note the XPath indicating the location of the differing elements.
Probably not fast, but that may not be an issue for unit tests.
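A sketch of the round trip, using the XStream and XMLUnit 1.x APIs as I remember them; the two maps merely stand in for your real object graphs:
import com.thoughtworks.xstream.XStream;
import java.util.Collections;
import org.custommonkey.xmlunit.XMLAssert;
import org.custommonkey.xmlunit.XMLUnit;
import org.junit.Test;

public class ObjectGraphXmlDiffTest {
    @Test
    public void objectGraphsAreEqual() throws Exception {
        XMLUnit.setIgnoreWhitespace(true);
        XStream xstream = new XStream();

        Object expected = Collections.singletonMap("answer", 42);   // stand-in graphs
        Object actual = Collections.singletonMap("answer", 42);

        // On failure, XMLUnit reports the XPath of the first difference, as in the output above.
        XMLAssert.assertXMLEqual(xstream.toXML(expected), xstream.toXML(actual));
    }
}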
Because of the way I tend to design complex objects, I have a very easy solution here.
When designing a complex object for which I need to write an equals method (and therefore a hashCode method), I tend to write a string renderer, and use the String class equals and hashCode methods.
The renderer, of course, is not toString: it doesn't really have to be easy for humans to read, and includes all and only the values I need to compare, and by habit I put them in the order which controls the way I'd want them to sort; none of which is necessarily true of the toString method.
Naturally, I cache this rendered string (and the hashCode value as well). It's normally private, but leaving the cached string package-private would let you see it from your unit tests.
Incidentally, this isn't always what I end up with in delivered systems, of course - if performance testing shows that this method is too slow, I'm prepared to replace it, but that's a rare case. So far, it's only happened once, in a system in which mutable objects were being rapidly changed and frequently compared.
The reason I do this is that writing a good hashCode isn't trivial, and requires testing(*), while making use of the one in String avoids the testing.
(* Consider that step 3 in Josh Bloch's recipe for writing a good hashCode method is to test it to make sure that "equal" objects have equal hashCode values, and making sure that all possible variations are covered isn't trivial in itself. More subtle and even harder to test well is distribution.)
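A hypothetical sketch of that pattern, with a made-up value class; the cached render string is package-private so a test can print it when a comparison fails:
public final class Order {
    private final String customerId;
    private final long amountCents;

    // Not toString(): all and only the values that define equality, in the order I'd want to sort.
    final String rendered;

    public Order(String customerId, long amountCents) {
        this.customerId = customerId;
        this.amountCents = amountCents;
        this.rendered = customerId + '|' + amountCents;
    }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof Order && rendered.equals(((Order) obj).rendered);
    }

    @Override
    public int hashCode() {
        return rendered.hashCode();
    }
}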
The code for this problem exists at http://code.google.com/p/deep-equals/
Use DeepEquals.deepEquals(a, b) to compare two Java objects for semantic equality. This will compare the objects using any custom equals() methods they may have (if they have an equals() method implemented other than Object.equals()). If not, this method will then proceed to compare the objects field by field, recursively. As each field is encountered, it will attempt to use the derived equals() if it exists, otherwise it will continue to recurse further.
This method will work on a cyclic Object graph like this: A->B->C->A. It has cycle detection so ANY two objects can be compared, and it will never enter into an endless loop.
Use DeepEquals.hashCode(obj) to compute a hashCode() for any object. Like deepEquals(), it will attempt to call the hashCode() method if a custom hashCode() method (below Object.hashCode()) is implemented, otherwise it will compute the hashCode field by field, recursively (Deep). Also like deepEquals(), this method will handle Object graphs with cycles. For example, A->B->C->A. In this case, hashCode(A) == hashCode(B) == hashCode(C). DeepEquals.deepHashCode() has cycle detection and therefore will work on ANY object graph.
Unit tests should have a well-defined, single thing they test. This means that in the end you should have a well-defined, single thing that can differ between those two objects. If there are too many things that can differ, I would suggest splitting this test into several smaller tests.
I followed the same track you are on. I also had additional troubles:
we can't modify classes (for equals or toString) that we don't own (JDK), arrays etc.
equality is sometimes different in various contexts
For example, tracking entity equality might rely on database ids when available (the "same row" concept) or rely on the equality of some fields (the business key) for unsaved objects. For JUnit assertions, you might want equality of all fields.
So I ended up creating objects that run through a graph, doing their job as they go.
There is typically a superclass Crawling object:
crawl through all properties of the objects; stop at:
enums,
framework classes (if applicable),
at unloaded proxies or distant connections,
at objects already visited (to avoid looping)
at Many-To-One relationship, if they indicate a parent (usually not included in the equals semantic)
...
configurable so that it can stop at some point (stop completely, or stop crawling inside the current property):
when mustStopCurrent() or mustStopCompletely() methods return true,
when encountering some annotations on a getter or a class,
when the current (class, getter) belong to a list of exceptions
...
From that Crawling superclass, subclasses are made for many needs:
For creating a debug string (calling toString as needed, with special cases for Collections and arrays that don't have a nice toString; handling a size limit, and much more).
For creating several Equalizers (as said before, for Entities using ids, for all fields, or solely based on equals ;). These equalizers often need special cases also (for example for classes outside your control).
Back to the question: these Equalizers could remember the path to the differing values, which would be very useful in your JUnit case for understanding the difference.
For creating Orderers. For example, saving entities needs to be done in a specific order, and efficiency will dictate that saving the same classes together will give a huge boost.
For collecting a set of objects that can be found at various levels in the graph. Looping on the result of the Collector is then very easy.
As a complement, I must say that, except for entities where performance is a real concern, I did choose that technology to implement toString(), hashCode(), equals() and compareTo() on my entities.
For example, if a business key on one or more fields is defined in Hibernate via a @UniqueConstraint on the class, let's pretend that all my entities have a getIdent() property implemented in a common superclass.
My entities superclass has a default implementation of these 4 methods that relies on this knowledge, for example (nulls need to be taken care of):
toString() prints "myClass(key1=value1, key2=value2)"
hashCode() is "value1.hashCode() ^ value2.hashCode()"
equals() is "value1.equals(other.value1) && value2.equals(other.value2)"
compareTo() combines the comparison of the class, value1 and value2.
For entities where performance is of concern, I simply override these methods to not use reflection. I can check in regression JUnit tests that the two implementations behave identically.
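A simplified, reflection-free sketch of such a superclass; it assumes getIdent() exposes the business-key values in a fixed order, and it glosses over the null handling mentioned above:
import java.util.List;

public abstract class BusinessKeyEntity implements Comparable<BusinessKeyEntity> {

    /** The business-key values, in a fixed, significant order. */
    protected abstract List<Object> getIdent();

    @Override
    public String toString() {
        return getClass().getSimpleName() + getIdent();
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (obj == null || getClass() != obj.getClass()) return false;
        return getIdent().equals(((BusinessKeyEntity) obj).getIdent());
    }

    @Override
    public int hashCode() {
        return getIdent().hashCode();
    }

    @Override
    public int compareTo(BusinessKeyEntity other) {
        int byClass = getClass().getName().compareTo(other.getClass().getName());
        // crude: compares the rendered key values; the real version compares field by field
        return byClass != 0 ? byClass : getIdent().toString().compareTo(other.getIdent().toString());
    }
}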
We use a library called junitx to test the equals contract on all of our "common" objects:
http://www.extreme-java.de/junitx/
The only way I can think of to test the different parts of your equals() method is to break down the information into something more granular. If you are testing a deeply-nested tree of objects, what you are doing is not truly a unit test. You need to test the equals() contract on each individual object in the graph with a separate test case for that type of object. You can use stub objects with a simplistic equals() implementation for the class-typed fields on the object under test.
HTH
I would not use the toString() because as you say, it is usually more useful for creating a nice representation of the object for display or logging purposes.
It sounds to me that your "unit" test is not isolating the unit under test. If, for example, your object graph is A-->B-->C and you are testing A, your unit test for A should not care that the equals() method in C is working. Your unit test for C would make sure it works.
So I would test the following in the test for A's equals() method:
- compare two A objects that have identical B's, in both directions, e.g. a1.equals(a2) and a2.equals(a1).
- compare two A objects that have different B's, in both directions
By doing it this way, with a JUnit assert for each comparison, you will know where the failure is.
Obviously if your class has more children that are part of determining equality, you would need to test many more combinations. What I'm trying to get at though is that your unit test should not care about the behavior of anything beyond the classes it has direct contact with. In my example, that means, you would assume C.equals() works correctly.
One wrinkle may be if you are comparing collections. In that case I would use a utility for comparing collections, such as commons-collections CollectionUtils.isEqualCollection(). Of course, only for collections in your unit under test.
If you're willing to have your tests written in scala you could use matchete. It is a collection of matchers that can be used with JUnit and provide amongst other things the ability to compare objects graphs:
case class Person(name: String, age: Int, address: Address)
case class Address(street: String)
Person("john",12, Address("rue de la paix")) must_== Person("john",12,Address("rue du bourg"))
Will produce the following error message
org.junit.ComparisonFailure: Person(john,12,Address(street)) is not equal to Person(john,12,Address(different street))
Got : address.street = 'rue de la paix'
Expected : address.street = 'rue du bourg'
As you can see here I've been using case classes, which are recognized by matchete in order to dive into the object graph.
This is done through a type-class called Diffable. I'm not going to discuss type-classes here, so let's just say that it is the cornerstone of this mechanism, which compares two instances of a given type. Types that are not case classes (so basically all types in Java) get a default Diffable that uses equals. This isn't very useful, unless you provide a Diffable for your particular type:
// your java object
public class Person {
    public String name;
    public Address address;
}

// your scala test code
implicit val personDiffable : Diffable[Person] = Diffable.forFields(_.name, _.address)

// there you go, you can now compare two Persons exactly the way you did it
// with the case classes
So we've seen that matchete works well with a java code base. As a matter of fact I've been using matchete at my last job on a large Java project.
Disclaimer : i'm the matchete author :)
