Generate random but static test data


When designing test cases, I want to be able to use data that is random but static.
If I use data that is not random, then I will use trivial examples that are representative of the data I expect, rather than the data I have guarded against in my code. For example, if my code is expecting a string with a max length of 15 characters then I would rather specify these constraints and have the data generated for me, within those constraints, rather than some arbitrary example which may be, due to my expectations, within a more strict set of constraints.
If I use data that is not static, then my tests won't be repeatable. It is no good using a string that changes every time the test is run if the test then fails only occasionally. It would be much better to use a consistent string and then, if and when a bug is found, specify more constraints on how that string is generated (and obviously make the same checks in my code).
Is this a good strategy for test data?
If so, then I know how to achieve both of these goals independently. For static, but non-random data I just enter something arbitrary e.g. foo. For something random but not static, I just use apache random utils e.g. randomString(5). How can I get both?
Note that when data must be unique, it would also be handy to have some way to specify that two pieces of generated data must be distinct. Randomness does this most of the time but cannot be relied upon, obviously, without having unreliable tests!
TL;DR: How can I specify the type of data I want to generate, without having randomised generated data?

Use a random with a constant seed. You can use the Random(long seed) constructor for it.
The RandomStringUtils.random() method can accept a Random source, which you could have created with a constant seed as described.
Using a constant seed is very useful for making experiments reproducible, and doing so is a very good practice, IMO.
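As a rough illustration of the constant-seed idea, here is a minimal sketch using only java.util.Random. The class name SeededTestData, its methods, and the lowercase-only alphabet are assumptions made for this example, not part of any library. It also shows one possible way to get values that are guaranteed distinct, which the question asks about.

```java
import java.util.LinkedHashSet;
import java.util.Random;
import java.util.Set;

/** Hypothetical helper: random-looking test data that is identical on every run. */
final class SeededTestData {
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    private final Random random;

    SeededTestData(long seed) {
        // A constant seed makes the whole sequence of generated values repeatable.
        this.random = new Random(seed);
    }

    /** Returns a lowercase string whose length lies in [minLength, maxLength]. */
    String string(int minLength, int maxLength) {
        int length = minLength + random.nextInt(maxLength - minLength + 1);
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    /**
     * Returns count pairwise-distinct strings by discarding duplicates.
     * Caution: loops forever if fewer than count distinct strings exist.
     */
    Set<String> distinctStrings(int count, int minLength, int maxLength) {
        Set<String> result = new LinkedHashSet<>();
        while (result.size() < count) {
            result.add(string(minLength, maxLength));
        }
        return result;
    }

    public static void main(String[] args) {
        SeededTestData first = new SeededTestData(42L);
        SeededTestData second = new SeededTestData(42L);
        // Same seed and same call sequence: both lines print the same value.
        System.out.println(first.string(1, 15));
        System.out.println(second.string(1, 15));
    }
}
```

Note that the generated values depend on the order of calls, so adding a call in the middle of a test shifts every value generated after it.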

Don't do it. It gives you a headache, makes your tests unreadable and gives you no benefit. You already see the problems: specification of constraints. So let's go to the imaginary benefits. You worry that the data you provide manually is more constrained than random data would be. But you want to use the same data every time (same seed). So how do you know that the random data are better than your manually provided data? How do you know that you chose the seed properly? If you are not sure whether your test data are good enough, then:
Simplify your code (extract methods/classes, avoid ifs, avoid nulls, be more immutable and functional)
Look at edge cases and include them in your tests
Look at generated data, check whether some of it differs from what you were thinking of, and add that data to your tests
Use mutation testing
Whenever a bug is discovered during development, UAT or production, add that data to your tests
Do truly random (not repetitive), long-running tests. Every generated value that breaks the tests should be logged and added to your deterministic unit tests.
By pretending to use random data you just lie to yourself. The data is not random, you don't control it, and it makes you stop thinking about the edge cases of your code. So don't do it, face the truth, make your tests readable and check more conditions.
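The last suggestion in the list (truly random, long-running tests whose failures are logged) could look roughly like the sketch below. Everything here is hypothetical: acceptsValue stands in for the code under test and contains a deliberately planted bug, and the class and method names are invented for the example.

```java
import java.util.Optional;
import java.util.Random;
import java.util.function.Predicate;

/** Hypothetical soak-test loop: fresh seed per run, failures are reported. */
final class RandomSoakTest {

    /** Stand-in for the code under test: should accept any value in [0, 1000). */
    static boolean acceptsValue(int value) {
        return value != 777; // deliberately planted bug for the demonstration
    }

    /**
     * Checks the property against a number of random inputs drawn from a
     * generator with the given seed; returns the first failing input, if any.
     */
    static Optional<Integer> firstFailure(long seed, int iterations,
                                          Predicate<Integer> property) {
        Random random = new Random(seed);
        for (int i = 0; i < iterations; i++) {
            int input = random.nextInt(1000);
            if (!property.test(input)) {
                return Optional.of(input);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        long seed = System.nanoTime(); // different on every run
        Optional<Integer> failure =
                firstFailure(seed, 100_000, RandomSoakTest::acceptsValue);
        // Log both the seed and the input so the failure can be replayed and
        // the input promoted to a deterministic unit test.
        failure.ifPresent(input ->
                System.out.println("seed=" + seed + " failing input=" + input));
    }
}
```

The deterministic unit tests stay readable because they only ever contain concrete inputs that once failed; the random loop is just a way of discovering them.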

What you are describing is property based testing - the best known example being Haskell's quickcheck.
http://www.haskell.org/haskellwiki/Introduction_to_QuickCheck1
There have been a number of java ports such as
https://bitbucket.org/blob79/quickcheck
https://github.com/kjw/supercheck
https://github.com/pholser/junit-quickcheck
The Quickcheck philosophy emphasises the use of random data, but most (all?) of the java ports allow you to set a fixed seed so that the generated values are repeatable.
I've never got round to actually trying this approach, but I would hope it would make your tests more readable (rather than less readable as piotrek suggests), by separating the values from the tests.
If knowledge of the values is important to understand the test/SUT behavior then it is the wrong approach.

Instancio is a data generation library for unit tests that does what you are looking for. E.g. if you need a random string of a certain length:
Foo foo = Instancio.of(Foo.class)
.generate(field("fooString"), gen -> gen.string().length(10))
.create();
To generate random but predictable values, you can supply a seed:
Foo foo = Instancio.of(Foo.class)
.generate(field("fooString"), gen -> gen.string().length(10))
.withSeed(123)
.create();
Or if you use JUnit 5:
@ExtendWith(InstancioExtension.class)
class ExampleTest {
@Seed(1234)
@Test
void example() {
Foo foo = Instancio.of(Foo.class)
.generate(field("fooString"), gen -> gen.string().length(10))
.create();
// ...
}
}
If you need predictable data always, you can configure a global seed value through a properties file. This way you won't need to specify it in the code.
https://github.com/instancio/instancio/

Related

AssertJ testing on a collection: what is better: 'extracting' or Java8 forEach

I'm new to AssertJ and using it to unit-test my written code and was thinking how to assert a list.
Lets assume we have a list of Consumers Entities. each Entity has it own Phone, own ServiceProvider which has it own Name and EntityName.
Now we want to assert that each Entity from a repository gets the right data, so we want to test that each item on list has equal Phone.
ConsumerEntity savedConsumer1 = Consumer(phone, name, serviceProvider)
List<ConsumerEntity> consumerListFromRepository = repository.findAllByPhone(phone)
Now I want to test that the data given from Repository is correct,
I can use this:
assertThat(consumerListFromRepository)
.extracting(ConsumerEntity::getPhone)
.containsOnly(savedConsumer1.getPhone());
Or I can do this with forEach (java 8):
consumerListFromRepository.forEach(consumerEntity ->
assertThat(consumerEntity.getPhone()).isEqualTo(savedConsumer1.getPhone()));
1. Which one is faster/simpler/more readable? I would go for forEach because it takes fewer lines of code, but it is less readable as well.
2. Is there any other way to do it as a one-liner like the forEach but with assertThat, so that it is readable and simple, without the need to call isEqualTo each time? Something like:
assertThat(list).forEach........
3. Which one is faster? Extracting or forEach?
Thanks!

I'm not sure that "faster" is a primary concern here. It's likely that any performance difference is immaterial; either the underlying implementations are ~equivalent in terms of non-functionals or - since the context here is a unit test - the consumerListFromRepository is trivially small thereby limiting the scope for any material performance differences.
I think your main concerns here should be
Making it as easy as possible for other developers to:
Understand/reason about your test case
Edit/refactor your test case
Ensuring that your approach to asserting is consistent with other test cases in your code base
Judging which of your two approaches best ticks this box is somewhat subjective but I think the following considerations are relevant:
The Java 8 forEach construct is well understood and the isEqualTo matcher is explicit and easily understood
The AssertJ extracting helper paired with containsOnly is less common than Java 8's forEach construct, but this pairing reads logically and is easily understood
So, IMHO both approaches are valid. If your code base consistently uses AssertJ then I'd suggest using the extracting helper paired with the containsOnly matcher for consistency. Otherwise, use whichever of them reads best to you :)

Randomized testing in Java

I write a lot of unit tests. Often, you need to write carefully considered test cases by hand, a form of whitebox testing. If you are lucky enough to work for a company with separate quality assurance engineers, perhaps someone else writes test cases for you (kind of a mix between white and black box testing).
Many times, however, randomized testing would find many bugs and would serve as a great complement to hand-written cases.
For example, I might have a self-contained class and be able to express the invariants and broad-stroke behavior of the class simply (such as "this method never throws an exception" or "this method always returns a positive value"). I would like a test framework that just bashes on my class and checks the invariants.
A similar case: I often have a class which implements similar functionality to another class (but does it with different performance characteristics or with some added functionality). I would like to A vs B test the two classes in a randomized way. For example, if I was implementing TreeMap, I could use HashMap as a comparable implementation (modulo a few differences due to the sorted behavior of TreeMap) and check most of the basic functionality in a randomized way. Similarly, someone implementing LinkedList could use ArrayList as a comparable implementation and vice versa.
I've written some basic stuff to do this in the past, but it is painstaking to set up all the boilerplate to:
Create objects with random initial state
Apply random mutations
Create mappings between "like" objects for A vs B testing
Define invariants and rules such as "when will exceptions be thrown"
I still do it from time to time, but I want to reduce my effort level. For example, are there frameworks that remove or simplify the required boilerplate?
Otherwise, what techniques are used to do randomized testing in Java?
This is related to, but not the same as, fuzz testing. Fuzz testing seems to focus on random inputs to a single entity, in the hope of triggering bad behavior, often with an adaptive input model based on dynamic coverage observations. That covers a lot of the above, but doesn't cover things like A vs B testing when comparable implementations exist, or invariant checking. In any case, I'm also interested in decent fuzz testing libraries for Java.
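For reference, the hand-written A vs B boilerplate described above might look something like this minimal sketch. It assumes both implementations can be driven through the java.util.Map interface; the class name MapDifferentialTest and the choice of operations are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Random;
import java.util.TreeMap;

/** Hypothetical A-vs-B harness: apply the same random operations to two maps. */
final class MapDifferentialTest {

    /**
     * Applies random put/remove/get operations to both maps and returns the
     * index of the first step where they disagree, or -1 if they never do.
     */
    static int firstDivergence(Map<Integer, Integer> a, Map<Integer, Integer> b,
                               long seed, int steps) {
        Random random = new Random(seed);
        for (int step = 0; step < steps; step++) {
            int key = random.nextInt(50); // small key space forces key reuse
            int value = random.nextInt();
            Integer fromA;
            Integer fromB;
            switch (random.nextInt(3)) {
                case 0:
                    fromA = a.put(key, value);
                    fromB = b.put(key, value);
                    break;
                case 1:
                    fromA = a.remove(key);
                    fromB = b.remove(key);
                    break;
                default:
                    fromA = a.get(key);
                    fromB = b.get(key);
                    break;
            }
            // Invariants: same return value and same contents after each step.
            if (!Objects.equals(fromA, fromB) || !a.equals(b)) {
                return step;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int step = firstDivergence(new TreeMap<>(), new HashMap<>(), 2016L, 10_000);
        System.out.println(step < 0 ? "no divergence" : "diverged at step " + step);
    }
}
```

Returning the step index together with a known seed makes a divergence replayable, which is the part that tends to be missing from ad hoc randomized tests.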

I think what you're trying to find is a library for Property Based Testing in Java (see types of randomized testing). In short: instead of testing the value of the result, you test a property of it. E.g. instead of checking that 2+2 is 4, you check properties like:
random1 + 0 = random1
random1 + random2 >= random1
...
Take a look at this article that explains Property Based Testing in detail.
Another option that you mention is to check with your Test Oracle - something that knows the true answer (e.g. old bullet-proof algorithm). So you pass a random variable both to old and new algorithm and you check that the results are equal.
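To make the addition properties above concrete without committing to any particular library, here is a hand-rolled sketch. The class AdditionProperties and its holds method are hypothetical; a property-based testing library would normally generate the inputs and report counterexamples for you. Note the assumption that operands are non-negative and small, because the second property is false for negative numbers and for sums that overflow int.

```java
import java.util.Random;
import java.util.function.IntBinaryOperator;

/** Hand-rolled property check for an "add" implementation. */
final class AdditionProperties {

    /** Returns true if add satisfies both properties for all random trials. */
    static boolean holds(IntBinaryOperator add, long seed, int trials) {
        Random random = new Random(seed);
        for (int i = 0; i < trials; i++) {
            // Bounded, non-negative operands: see the assumption stated above.
            int random1 = random.nextInt(1_000_000);
            int random2 = random.nextInt(1_000_000);
            boolean identity = add.applyAsInt(random1, 0) == random1;
            boolean monotone = add.applyAsInt(random1, random2) >= random1;
            if (!identity || !monotone) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(holds(Integer::sum, 1L, 1_000));
    }
}
```

Passing the implementation in as a function keeps the properties reusable: the same check can be pointed at an old and a new algorithm, which is close to the test-oracle idea.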
Couple of Java libraries:
JUnit QuickCheck - a specialized lib for Property Based Testing. Allows you to define the properties and passes random values for these properties to check. So far (06/2016) it's pretty young, so you may want to check out ScalaCheck since it's possible to write Scala tests for Java code.
Datagen - random values generator for Java in case standard randomizers are not enough. Disclaimer: I'm the author.

Mocking a bit stream reader with Mockito

I am currently debugging a rather complicated algorithm that fixes errors in a bit stream. A BitReader interface is quite simple, and the main reading method is like this:
/**
 * Reads bits from the stream.
 * @param length number of bits to read (<= 64)
 * @return read bits in the least significant bits
 */
long read(int length) throws IOException;
The objective is to test whether BitStreamFixer actually fixes the stream (in a way that is too hard to describe here). Basically I need to provide “broken” inputs to it and test whether its output is as correct as it can be (some inputs can't be fixed completely), like this:
BitStreamFixer fixer = new BitStreamFixer(input);
int word1 = fixer.readWord();
int word2 = fixer.readWord();
// possibly a loop here
assertEquals(VALID_WORD1, word1);
assertEquals(VALID_WORD2, word2);
// maybe a loop here too
Now, the BitStreamFixer class accepts an instance of BitReader. When unit testing the fixer, I obviously need one such instance. But where do I get one? I have two obvious options: either give it a real implementation of BitReader or mock it.
The former option is not really appealing because it would create a dependency on another object which has nothing to do with the class being tested. Moreover, it's not that easy because existing BitReader implementations read from input streams, so I'll need either a file or a somehow prepared byte array, which is a tedious thing to do.
The latter option looks better and fits the usual unit testing approach. However, since I'm not even supposed to know what arguments the fixer will give to read, mocking it is not easy. I'll have to go with when(bitReader.read(anyInt())).thenAnswer(...) approach, implementing a custom answer that will create a lot of bit-fiddling logic to spoon-feed the object under test with proper bits in chunks of whatever size it asks for. Considering that bit streams I'm working with have rather complicated higher-level structure, it's not easy. And introducing logic in unit tests also doesn't smell good.
What do you think, is there any other option? Or maybe one of these can be improved in a way I fail to notice?

Write, test, and use a clear reusable test helper.
In a general sense, in unit testing, you're supposed to establish confidence in a system by watching it successfully interact with systems that you DO have confidence in. Of course you also want the test to be fast, deterministic, and easy to read/modify, but ultimately those come secondary to the assertion that your system works.
You've listed two options:
Use a mock BitReader, where you have enough confidence in predicting your system's interactions that you can set up the entire "when A then B" conversation. Mocking can be pretty easy when you have a small API surface of independent methods, like an RPC layer, but mocking can be very difficult when you have a stateful object with unpredictable method calls. Mocking is further useful to deterministically stub nondeterministic systems, like external servers or pseudorandom sources, or systems that don't exist yet; none of those is the case for you.
Because your read method can take a wide variety of parameters, each of which is valid and changes your system's state, then it's probably not a smart idea to use mocking here. Unless the order of calls that BitStreamFixer makes to BitReader is deterministic enough to make part of its contract, a mock BitReader will likely result in a brittle test: one that breaks when the implementation changes even if the system is perfectly functional. You'll want to avoid that.
Note that mocking should never yield "complicated logic", only complicated set-up. You're using mocks to avoid using real logic in your tests.
Use a real BitReader, which sounds like it will be painful and opaque to construct. This is probably the most realistic solution, though, especially if you've already finished writing and testing it.
You worry about "introducing new dependencies", but if your BitReader implementation exists and is fast, deterministic, and well-tested, then you shouldn't feel any worse about using it than using a real ArrayList or ByteArrayInputStream in your test. It sounds like the only real problem here is that creating the byte array would make it hard to maintain your test, which is a valid consideration.
In the comments, though, the real answer comes through: Build the BitWriter you're missing.
@Test public void shouldFixBrokenStream() {
BitReader bitReader = new StreamBitReader(BitWriter.create()
.pushBits(16, 0x8080)
.pushBits(12, 0x000) // invalid 12-bit sequence
.pushBits(16, 0x8080)
.asByteArrayInputStream());
BitStreamFixer fixer = new BitStreamFixer(bitReader);
assertEquals(0x80808080, fixer.read(32));
}
/** Of course, you could skip the BitReader yourself, and just make a new one. */
@Test public void shouldFixBrokenStream_bitReader() {
BitReader bitReader = new InMemoryBitReader();
bitReader.pushBits(16, 0x8080);
bitReader.pushBits(12, 0x000); // invalid 12-bit sequence
bitReader.pushBits(16, 0x8080);
BitStreamFixer fixer = new BitStreamFixer(bitReader);
assertEquals(0x80808080, fixer.read(32));
}
This is more readable than constructing an opaque bitstream offline and copy-pasting it into your test (particularly if well-commented), less brittle than mocks, and much more testable itself than an anonymous inner class (or Answer-based version of the same). It is also likely that you can use a system like that across multiple test cases, and possibly even multiple tests.
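The BitWriter used in the test above does not exist yet, so here is one possible sketch of it. The method names pushBits and asByteArrayInputStream follow the test code; everything else is an assumption, in particular that bits are written most significant bit first and that a trailing partial byte is padded with zeros. Adjust both to whatever your BitReader actually expects.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Hypothetical test helper: assembles a bit stream, most significant bit first. */
final class BitWriter {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private int currentByte = 0;       // bits collected for the byte in progress
    private int bitsInCurrentByte = 0; // how many bits currentByte holds

    static BitWriter create() {
        return new BitWriter();
    }

    /** Appends the given number of least significant bits of value (0 to 64). */
    BitWriter pushBits(int length, long value) {
        if (length < 0 || length > 64) {
            throw new IllegalArgumentException("length must be in [0, 64]: " + length);
        }
        for (int i = length - 1; i >= 0; i--) {
            int bit = (int) ((value >>> i) & 1L);
            currentByte = (currentByte << 1) | bit;
            if (++bitsInCurrentByte == 8) {
                out.write(currentByte);
                currentByte = 0;
                bitsInCurrentByte = 0;
            }
        }
        return this;
    }

    /** Returns the bytes written so far, zero-padding a trailing partial byte. */
    byte[] toByteArray() {
        byte[] complete = out.toByteArray();
        if (bitsInCurrentByte == 0) {
            return complete;
        }
        byte[] padded = Arrays.copyOf(complete, complete.length + 1);
        padded[complete.length] = (byte) (currentByte << (8 - bitsInCurrentByte));
        return padded;
    }

    ByteArrayInputStream asByteArrayInputStream() {
        return new ByteArrayInputStream(toByteArray());
    }
}
```

Because the helper is a small pure class, it can have its own unit tests, which is what makes it safer than bit-fiddling logic buried in a mock's Answer.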

Don't generate the *Count method in java protobuf

According to protobuf documentation
Repeated fields have some extra methods – a Count method
so something like this:
// repeated .tutorial.Person.PhoneNumber phone = 4;
public List<PhoneNumber> getPhoneList();
public int getPhoneCount();
public PhoneNumber getPhone(int index);
Is it possible to suppress the generation of getPhoneCount? I don't want it in the resulting java class. Is it possible to not generate it?
EDIT: To make clear what my problem is, we have .proto file with something like this
message Bar {
...
optional int32 entries_count = 123;
...
repeated Foo entries = 456;
...
}
Because of that, both entries_count and entries try to generate the method getEntriesCount(), which is obviously not possible. So the methods are generated instead as getEntriesCount123() and getEntriesCount456(), which is not exactly user friendly. So I would like to suppress generation of one of them, since they are supposed to return the same value anyway.
Sadly I'm not really sure how feasible is changing the format, too many things around may depend on it :/

No, there's no way of doing this.
If you look at the generator code (primitive fields, message fields, enum fields etc) you can see that the ...Count() methods (both interface and implementation) are written unconditionally.
Options:
Live with the existing generation code
Use your own fork of protoc
Create a pull request for the main project
I'd strongly recommend option 1. With option 2 you'll be forever having to do work to keep it up to date, and I'd be quite surprised if you managed to get option 3 accepted into the codebase... the bar for adding an extra option is pretty high.
Basically, you should remove your entries_count field. It's an obvious place where data can get out of sync - and the real value is always available to clients anyway, in all platforms I'm aware of. If you want it to mean something other than just "the number of values in entries" (e.g. some estimated total count, where you've only got some sample) then you should rename it to be more specific, at which point your existing problem will go away at the same time.

Refactoring large data object

What are some common strategies for refactoring large "state-only" objects?
I am working on a specific soft-real-time decision support system which does online modeling/simulation of the national airspace. This piece of software consumes a number of live data feeds, and produces a once-per-minute estimate of the "state" of a large number of entities in the airspace. The problem breaks down neatly until we hit what is currently the lowest-level entity.
Our mathematical model estimates/predicts upwards of 50 parameters for a timeline of several hours into the past and future for each of these entities, roughly once per minute. Currently, these records are encoded as a single Java class with a lot of fields (some get collapsed into an ArrayList). Our model is evolving, and the dependencies among the fields are not yet set in stone, so each instance wanders through a convoluted model, accumulating settings as it goes along.
Currently we have something like the following, which uses a builder pattern approach to build up the contents of the record and enforce the known dependencies (as a check against programmer error as we evolve the model). Once the estimate is done, we convert the below into an immutable form using a .build() type method.
final class OneMinuteEstimate {
enum EstimateState { INFANT, HEADER, INDEPENDENT, ... };
EstimateState state = EstimateState.INFANT;
// "header" stuff
DateTime estimatedAtTime = null;
DateTime stamp = null;
EntityId id = null;
// independent fields
int status1 = -1;
...
// dependent/complex fields...
... goes on for 40+ more fields...
void setHeaderFields(...)
{
if (!EstimateState.INFANT.equals(state)) {
throw new IllegalStateException("Must be in INFANT state to set header");
}
...
}
}
Once a very large number of these estimates are complete, they are assembled into timelines where aggregate patterns/trends are analyzed. We have looked at using an embedded database but have struggled with performance issues; we'd rather get this sorted out in terms of data modeling and then incrementally move portions of the soft-real-time code into an embedded data store.
Once the "time sensitive" pieces of this are done, the products are flushed to flat files and a database.
Problems:
It's a giant class, with way too many fields.
There is very little behavior encoded in the class; it's mostly a holder for data fields.
Maintaining the build() method is extremely cumbersome.
It feels clumsy to manually maintain a "state machine" abstraction merely for the purpose of ensuring that a large number of dependent modeling components are properly populating a data object, but it has saved us a lot of frustration as the model evolves.
There is a lot of duplication, particularly when the records described above are aggregated into very similar "rollups" which amount to rolling sums/averages or other statistical products of the above structure in time series.
While some of the fields could be clumped together, they are all logically "peers" of one another, and any breakdown we've tried has resulted in having behavior/logic artificially split and needing to reach two levels deep in indirection.
Out of the box ideas entertained, but this is something we need to evolve incrementally. Before anyone else says it, I'll note that one could suggest that our mathematical model is insufficiently crisp if the data representation for that model is this hard to get ahold of. Fair point, and we're working that, but I think that's a side-effect of an R&D environment with a lot of contributors, and a lot of concurrent hypotheses in play.
(Not that it matters, but this is implemented in Java. We use HSQLDB or Postgres for output products. We don't use any persistence framework, partly out of a lack of familiarity, partly because we have enough performance trouble with just the database alone and hand-coded storage routines... we're skeptical of moving towards additional abstraction.)

I had much of the same problem you did.
At least I think I did, sounds like I did. Representation was different, but at 10,000 feet, sounds pretty much the same. Crapload of discrete, "arbitrary" variables and a bunch of ad hoc relationships among them (essentially business driven), subject to change at a moment's notice.
You also have another issue, which you sorta mentioned, and that was the performance requirement. Sounds like faster is better, and likely a slow perfect solution would be tossed out for the fast lousy one, simply because the slower one can't meet a baseline performance requirement, no matter how good it is.
To put it simply, what I did was I designed a simple domain specific rule language for my system.
The entire point of the DSL was to implicitly express relationships and package them up in to modules.
Very crude, contrived example:
D = 7
C = A + B
B = A / 5
A = 10
RULE 1: IF (C < 10) ALERT "C is less than 10"
RULE 2: IF (C > 5) ALERT "C is greater than 5"
RULE 3: IF (D > 10) ALERT "D is greater than 10"
MODULE 1: RULE 1
MODULE 2: RULE 3
MODULE 3: RULE 1, RULE 2
First, this is not representative of my syntax.
But you can see from the Modules that there are three simple rules.
The key though, is that it's obvious from this that Rule 1 depends on C, which depends on A and B, and B depends on A. Those relationships are implied.
So, for that module, all of those dependencies "come with it". You can see if I generated code for Module 1 it might look something like:
public void module_1() {
int a = 10;
int b = a / 5;
int c = a + b;
if (c < 10) {
alert("C is less than 10");
}
}
Whereas if I created Module 2, all I would get is:
public void module_2() {
int d = 7;
if (d > 10) {
alert("D is greater than 10.");
}
}
In Module 3 you see the "free" reuse:
public void module_3() {
int a = 10;
int b = a / 5;
int c = a + b;
if (c < 10) {
alert("C is less than 10");
}
if (c > 5) {
alert("C is greater than 5");
}
}
So, even though I have one "soup" of rules, the Modules root the base of the dependencies, and thus filter out the stuff it doesn't care about. Grab a module, shake the tree and keep what's left hanging.
My system used the DSL to generate source code, but you can easily have it create a mini runtime interpreter as well.
Simple topological sorting handled the dependency graph for me.
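As an illustration of that step, the sketch below orders the variables from the contrived rule base so that each one is computed after its inputs. It is not the original system's code: the class RuleDependencies, the map layout, and the depth-first approach are assumptions chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch: depth-first topological ordering of variable dependencies. */
final class RuleDependencies {

    /** Variable -> variables it is computed from, mirroring the example rules. */
    static Map<String, List<String>> exampleRules() {
        Map<String, List<String>> dependsOn = new HashMap<>();
        dependsOn.put("A", Collections.<String>emptyList()); // A = 10
        dependsOn.put("B", Arrays.asList("A"));              // B = A / 5
        dependsOn.put("C", Arrays.asList("A", "B"));         // C = A + B
        dependsOn.put("D", Collections.<String>emptyList()); // D = 7
        return dependsOn;
    }

    /** Returns root and everything it needs, inputs before their consumers. */
    static List<String> evaluationOrder(String root, Map<String, List<String>> dependsOn) {
        List<String> order = new ArrayList<>();
        visit(root, dependsOn, new HashSet<String>(), new HashSet<String>(), order);
        return order;
    }

    private static void visit(String variable, Map<String, List<String>> dependsOn,
                              Set<String> done, Set<String> inProgress,
                              List<String> order) {
        if (done.contains(variable)) {
            return;
        }
        if (!inProgress.add(variable)) {
            throw new IllegalStateException("Cyclic dependency at " + variable);
        }
        for (String input : dependsOn.getOrDefault(variable, Collections.<String>emptyList())) {
            visit(input, dependsOn, done, inProgress, order);
        }
        inProgress.remove(variable);
        done.add(variable);
        order.add(variable); // emitted only after all of its inputs
    }

    public static void main(String[] args) {
        // Rule 1 tests C, so Module 1 needs A, B and C but never D.
        System.out.println(evaluationOrder("C", exampleRules())); // prints [A, B, C]
    }
}
```

Starting the traversal from the variables a module's rules mention is what filters out the rest of the "soup": anything not reached is simply never emitted.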
So, the nice thing about this is that while there was inevitable duplication in the final, generated logic, at least across modules, there wasn't any duplication in the rule base. What you as a developer/knowledge worker maintain is the rule base.
What is also nice is that you can change an equation, and not worry so much about the side effects. For example, if I change it to C = A / 2, then, suddenly, B drops out completely. But the rule for IF (C < 10) doesn't change at all.
With a few simple tools, you can show the entire dependency graph, you can find orphaned variables (like B), etc.
By generating source code, it's going to run as fast as you want.
In my case, it was interesting to see a rule drop a single variable and see 500 lines of source code vanish from the resulting module. That's 500 lines I didn't have to crawl through by hand and remove during maintenance and development. All I had to do was change a single rule in my rule base and let "magic" happen.
I was even able to do some simple peephole optimization and eliminate variables.
It's not that hard to do. Your rule language can be XML, or a simple expression parser. No reason to go full boat Yacc or ANTLR on it if you don't want to. I'll put a plug in for S-Expressions, no grammar needed, brain dead parsing.
Spreadsheets also make a great input tool, actually. Just be strict on the formatting. Kind of sucks for merging in SVN (so, Don't Do That), but end users love it.
You may well be able to get away with an actual rule based system. My system wasn't dynamic at runtime, and didn't really need sophisticated goal seeking and inference, so I didn't need the overhead of such a system. But if one works for you out of the box, then happy day.
Oh, and for an implementation note, for those who don't believe you can hit the 64K code limit in a Java method, well I can assure you it can be done :).

Splitting a Large Data Object is very similar to Normalizing a Large Relational Table (first and second normal form). Follow the rules to reach at least second normal form and you may have a good decomposition of the original class.

From experience working on R&D projects with soft real-time performance constraints (and sometimes monster fat classes), I would suggest NOT using OR mappers. In such situations, you'll be better off "touching the metal" and working directly with JDBC result sets. This is my suggestion for apps with soft real-time constraints and massive amounts of data items per package. More importantly, if the number of distinct classes (not class instances, but class definitions) that need to be persisted is large, and you also have memory constraints in your specs, you will also want to avoid ORMs like Hibernate.
Going back to your original question:
What you seem to have is a typical problem of 1) mapping multiple data items into an OO model where 2) those data items do not exhibit a good way of grouping or segregation (and any attempt at grouping tends simply not to feel right). Sometimes the domain model does not lend itself to such aggregation, and coming up with an artificial way of doing so typically ends up in compromises that don't satisfy all design requirements and desires.
To make matters worse, an OO model typically requires/expects you to have all the items present in a class as the class's fields. Such a class is typically without behavior, so it is just a struct-like construct, aka data envelope or data shuttle.
But such situations beg the following questions:
Does your application need to read/write all 40, 50+ data items at once, always?
*Must all data items be always present?*
I do not know the specifics of your problem domain, but in general I've found that we rarely ever need to deal with all data items at once. This is where a relational model shines because you don't have to query all rows from a table at once. You only pull those you need as projections of the table/view in question.
In a situation where we have a potentially large number of data items, but on average the number of data items being passed down the wire is less than the maximum, you'd be better off using a Properties pattern.
Instead of defining a monster envelope class holding all items :
// java pseudocode
class envelope
{
field1, field2, field3... field_n;
...
setFields(m1,m2,m3,...m_n){field1=m1; .... };
...
}
Define a dictionary (based on a map for example):
// java pseudocode
public enum EnvelopeField {field1, field2, field3,... field_n}
interface Envelope //package visible
{
// typical map-based read fields.
Object get(EnvelopeField field);
boolean isEmpty();
// new methods similar to existing ones in java.lang.Map, but
// more semantically aligned with envelopes and fields.
Iterator<EnvelopeField> fields();
boolean hasField(EnvelopeField field);
}
// a "marker" interface
// code that only needs to read envelopes must operate on
// these interfaces.
public interface ReadOnlyEnvelope extends Envelope {}
// the read-write version of envelope, notice that
// it inherits from Envelope, but not from ReadOnlyEnvelope.
// this is done to make it difficult (but not impossible
// unfortunately) to "cast-up" a read only envelope into a
// mutable one.
public interface MutableEnvelope extends Envelope
{
Object put(EnvelopeField field, Object value);
// to "cast-down" or "narrow" into a read only version type that
// cannot directly be "cast-up" back into a mutable.
ReadOnlyEnvelope readOnly();
}
// the standard interface for map-based envelopes.
public interface MapBasedEnvelope extends
Map<EnvelopeField,java.lang.Object>,
MutableEnvelope
{
}
// package visible, not public
class EnvelopeImpl extends HashMap<EnvelopeField,java.lang.Object>
implements MapBasedEnvelope, ReadOnlyEnvelope
{
// get, put, isEmpty are automatically inherited from HashMap
...
public Iterator<EnvelopeField> fields(){ return this.keySet().iterator(); }
public boolean hasField(EnvelopeField field){ return this.containsKey(field); }
// the typecast is redundant, but it makes the intention obvious in code.
public ReadOnlyEnvelope readOnly(){ return (ReadOnlyEnvelope)this; }
}
public final class EnvelopeFactory
{
// "new" is a reserved word in Java, so the factory method needs another name.
static public MapBasedEnvelope newEnvelope(){ return new EnvelopeImpl(); }
}
No need to set up read-only internal flags. All you need to do is downcast your envelope instances as Envelope instances (that only provide getters).
Code that expects to read should operate on read-only envelopes and code that expects to change fields should operate on mutable envelopes. Creation of the actual instances would be compartmentalized in factories.
That is, you use the compiler to enforce things to be read-only (or allow things to be mutable) by establishing some code conventions, rules governing what interfaces to use where and how.
You can layer your code so that sections that need to write are separate from code that only needs to read. Once that's done, simple code reviews (or even grep) can identify code that is using the wrong interface.
Problems:
Non-public Parent Interface:
Envelope is not declared as a public interface to prevent erroneous/malicious code from casting a read-only envelope down to a base envelope and then back to a mutable envelope. The intended flow is from mutable to read-only only - it is not intended to be bi-directional.
The problem here is that extension of Envelope is restricted to the package that contains it. Whether that is a problem will depend on the particular domain and intended usage.
Factories:
The problem is that factories can (and most likely will) be very complex. Again, the nature of the beast.
Validation:
Another problem introduced with this approach is that now you have to worry about code that expects field X to be present. Having the original monster envelope class partially frees you from that worry because, at least syntactically, all fields are there...
... whether the fields are set or not, that was another matter that still remains with this new model I'm proposing.
So if you have client code that expects to see field X, the client code has to throw some type of exception if the field is not present (or compute or read a sensible default somehow). In such cases, you will have to
Identify patterns of field presence. Clients that expect field X to be present might be grouped separately (layered apart) from clients that expect some other field to be present.
Associate custom validators (proxies to read-only envelope interfaces) that either throw exceptions or compute default values for missing fields according to some rules (rules provided programmatically, with an interpreter, or with a rules engine.)
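A validator of the kind described in the second point could be sketched as a wrapper around a read-only view. The sketch is deliberately self-contained, so it uses its own small hypothetical types (EstimateField, ReadableRecord, MapRecord, ValidatingRecord) rather than the envelope interfaces above; the field names are invented.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical field names for the example. */
enum EstimateField { ALTITUDE, SPEED, HEADING }

/** Read-only view of a sparse record. */
interface ReadableRecord {
    Object get(EstimateField field);
    boolean hasField(EstimateField field);
}

/** Simple map-backed record holding only the fields that were set. */
final class MapRecord implements ReadableRecord {
    private final Map<EstimateField, Object> values = new EnumMap<>(EstimateField.class);

    MapRecord with(EstimateField field, Object value) {
        values.put(field, value);
        return this;
    }

    @Override public Object get(EstimateField field) { return values.get(field); }
    @Override public boolean hasField(EstimateField field) { return values.containsKey(field); }
}

/** Proxy that rejects records missing a required field and supplies defaults. */
final class ValidatingRecord implements ReadableRecord {
    private final ReadableRecord delegate;
    private final Map<EstimateField, Object> defaults;

    ValidatingRecord(ReadableRecord delegate, Set<EstimateField> required,
                     Map<EstimateField, Object> defaults) {
        for (EstimateField field : required) {
            if (!delegate.hasField(field)) {
                throw new IllegalArgumentException("Missing required field " + field);
            }
        }
        this.delegate = delegate;
        this.defaults = new HashMap<>(defaults); // defensive copy
    }

    @Override public Object get(EstimateField field) {
        // Prefer the real value; fall back to the configured default (or null).
        return delegate.hasField(field) ? delegate.get(field) : defaults.get(field);
    }

    @Override public boolean hasField(EstimateField field) {
        return delegate.hasField(field) || defaults.containsKey(field);
    }
}
```

Each group of clients would get its own validator configuration, so the "which fields must be present" knowledge lives in one place per client group instead of being scattered through the reading code.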
Lack of Typing:
This might be debatable, but people used to working with static typing might feel uneasy about losing its benefits by going to a loosely typed map-based approach. The counter-argument is that most of the web works on a loose typing approach, even on the Java side (JSTL, EL).
Problems aside, the larger the maximum number of possible fields and the lower the average number of fields present at any given time, the more effective this approach will be with respect to performance. It adds additional code complexity, but that's the nature of the beast.
That complexity doesn't go away, and will be present either in your class model or in your validation code. Serialization and transfer down the wire are much more efficient, though, especially if you expect massive numbers of individual data transfers.
Hope it helps.

Actually this looks like a frequent problem that game developers face, bloated classes holding numerous variables and methods because of a deep inheritance tree etc.
There's this blog post about how and why to select composition over inheritance, maybe it would help.

One way you may be able to intelligently break up a large data class is to look at patterns of access by client classes. For example, if a set of classes only accesses fields 1-20 and another set of classes only accesses fields 25-30, maybe those groups of fields belong in separate classes.
