Dealing with non-trivial command and event payloads in Axon - java

Whenever I take a look at Axon Bank I start wondering whether I should follow a set of design rules for events and commands.
In Axon Bank both events and commands exclusively consist of primitives. In my applications I tend to avoid primitive usage as much as possible, mainly to build an expressive domain and to have type safety wherever I can get it.
Axon itself comes with some DDD references, but no matter which documents I browse, not a single example makes use of compound objects as part of event/command payloads.
This confuses me, because there is built-in support for full-blown XML and JSON serialization that is capable of more than just handling some key-value pairs.
I understand that especially events tend to be small and simple structures since they only reflect incremental state changes but there will always be some kind of gap between a complex domain model and an event (entry).
In my domain I could have a bunch of classes like OverdraftLimit, CurrentBalance, Deposit and AccountIdentifier.
Now there are two possible ways to design events and commands:
1. Primitives and extensive converting
Treat Events as raw data with a nice label on it
Convert raw data to powerful objects as soon as it "enters" the application
When creating events simply strip them down again.
public class BankAccountCreatedEvent {
    private final String accountIdentifier;
    private final int overdraftLimit;
    // ...
}
And somewhere else:
public void on(BankAccountCreatedEvent event) {
    this.accountIdentifier = AccountIdentifier.fromString(event.getAccountIdentifier());
    this.overdraftLimit = new OverdraftLimit(event.getOverdraftLimit());
}
Pros:
Simple command/event API that does not have any weird dependencies
Makes distribution easier
Upcasters will only be needed if the actual event structure changes and therefore can be anticipated easily.
Cons:
A huge conversion layer needs to be written and maintained
Decoupling events/commands and the rest of the domain model for mainly technical reasons introduces a new, artificial, contextual gap
2. Expressive Payloads
Use sophisticated types directly as attributes
public class BankAccountCreatedEvent {
    private final BankAccountIdentifier bankAccountIdentifier;
    private final OverdraftLimit overdraftLimit;
    // ...
}
Pros:
Less to write, easier to read
Keep together what naturally belongs together
Cons:
Domain logic influences event structure indirectly, upcasting will be needed more frequently and will be less predictable.
I need a second opinion. Is there a recommended way?

The primary thing to keep in mind is that the serialized form of the Event is your formal contract. How you represent that in Java classes is up to each application, in the end. If you configure your serializer to ignore unknown fields, you can leave fields you don't care about out, for example.
Personally, I don't mind primitives in Events. However, I do understand the value of using explicit Value Objects for certain fields, as they allow you to express the "mathematics" involved with each of them. In the case of identifiers, they prevent a "mix-up" where an identifier is used to accidentally attempt to identify another type of object.
In the end, it doesn't matter that much. With a few simple Jackson annotations, you can translate these Value Objects to a simple value in JSON. Check out @JsonValue, for example.
public class BankAccountCreatedEvent {
    private final BankAccountIdentifier bankAccountIdentifier;
    private final OverdraftLimit overdraftLimit;
    // ...
}
would map to:
{
    "bankAccountIdentifier": "abcdef1234",
    "overdraftLimit": 1000
}
That is, provided the BankAccountIdentifier and OverdraftLimit classes both have an @JsonValue-annotated method that returns their 'simple' value.
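For illustration, here is a minimal sketch of what such a value object could look like. The class and field are taken from the example above; the @JsonCreator factory is an assumption about how deserialization back into the value object might be handled.
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonValue;

public final class OverdraftLimit {

    private final int value;

    public OverdraftLimit(int value) {
        this.value = value;
    }

    // Serialized as the plain number, e.g. 1000, instead of a nested object.
    @JsonValue
    public int getValue() {
        return value;
    }

    // Lets Jackson rebuild the value object from that plain number again.
    @JsonCreator
    public static OverdraftLimit of(int value) {
        return new OverdraftLimit(value);
    }
}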


Any reason to initialize Entity properties with synchronized Collections?

For JPA-Entities in a project I work on, properties of type List or Map are always initialized to the synchronized implementations Vector and Hashtable.
(The unsynchronized ArrayList and HashMap are the standard implementations in Java, unless synchronization is really needed.)
Does anyone know a reason why synchronized Collections would be needed? We use EclipseLink.
When I asked about it, nobody knew why it was done like that. It seems it was always done like this. Maybe this was needed for an old version of EclipseLink?
I'm asking for two reasons:
I would prefer to use the standard implementations ArrayList and HashMap like anywhere else. If that's safe.
There's no matching synchronized Set implementation in the JDK. At least not a serializable one as EclipseLink expects.
Example Entity:
@Entity
public class Person {
    ...

    @ManyToMany(cascade=CascadeType.ALL)
    @JoinTable( ... )
    private List<Role> accessRoles;

    @ElementCollection
    @CollectionTable( ... )
    @MapKeyColumn(name="KEY")
    @Column(name="VALUE")
    private Map<String, String> attrs;

    public Person() {
        // Why Vector/Hashtable instead of ArrayList/HashMap?
        accessRoles = new Vector<Role>();
        attrs = new Hashtable<String, String>();
    }

    public List<Role> getAccessRoles() {
        return accessRoles;
    }

    public void setAccessRoles(List<Role> accessRoles) {
        this.accessRoles = accessRoles;
    }

    public Map<String, String> getAttrs() {
        return attrs;
    }

    public void setAttrs(Map<String, String> attrs) {
        this.attrs = attrs;
    }
}
There's usually no need for a Vector and an ArrayList is more commonly used. So if your current codebase is full of Vectors, this is a bit of a code smell and it is wise to make sure your team members know what the difference is. See also What are the differences between ArrayList and Vector? and Why is Java Vector class considered obsolete or deprecated?
That does not mean you should do the Big Cleanup and replace all the Vectors in your existing code with ArrayLists.
Your code uses Lists and you won't notice a single difference when programming.
The only advantage to be expected is increased performance.
It is hard to tell if none of your code depends on the synchronization provided by the Vectors.
So, unless you are currently suffering performance issues, or are explicitly (re)designing the synchronization of your entire codebase, you risk introducing hard to fix concurrency bugs without any benefits.
Also, be aware that performance suffers most significantly from the use of Vectors when multiple threads access your collections concurrently. So if you are suffering from performance loss and decide to replace the Vectors for that reason, you'll need to be very careful to keep access sufficiently synchronized.
EDIT: You ask about EclipseLink JPA specifically.
It'd be rather surprising if they demanded you use Vectors and Hashtables since that would mean they ask you to rely on obsolete data structures.
In their examples, they use ArrayLists and HashMaps so from that we may conclude that this is indeed not the case.
Diving a bit more specifically into the source code, we can see that their CollectionContainerPolicy uses the Collection interface and does not care about the implementation of your collections. It does, however, surprisingly have special cases for when your internal collection class is Vector. See for instance buildContainerFromVector. And its default container class is Vector, though you can alter that.
See also the documentation for the Container policy.
The most intrusive moment where EclipseLink and your Lists meet is when you're lazy loading collections. EclipseLink will replace the collection with its own IndirectList which internally uses a Vector. See What collections does jpa return? So in those cases, EclipseLink will give you a Vector anyways(!) and it does not even matter what collection you specify in the collection's initialization.
So EclipseLink indeed has a preference for using Vectors, and using Vectors with EclipseLink means less copying of object references from one collection to the other.
Much of the internals of EclipseLink date back to a time when Vector and Hashtable were the standard collection types in Java. EclipseLink was TopLink back then, which originated from a persistence framework for Smalltalk - so much of EclipseLink's code is actually older than Java itself, so to speak.
For many years I have worked with TopLink, and their standard mappings for collection properties have always used Vector and Hashtable.
To me, the only reasonable explanation for Vector and Hashtable still appearing in EclipseLink is that it has been working like this for a long time and - because it is working - hitherto no one has gotten around to changing it.
For myself, I wouldn't ever use Vector or Hashtable again. If I need a synchronized collection, I'd rather use the Collections.synchronizedList(), Collections.synchronizedMap(), etc. wrappers.
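For reference, a quick sketch of what that looks like with the standard java.util.Collections wrappers (the field names mirror the entity from the question; whether you need the wrappers at all depends on whether the collections are really shared between threads):
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SynchronizedCollectionsExample {
    public static void main(String[] args) {
        // Thread-safe views over the ordinary implementations.
        List<String> accessRoles = Collections.synchronizedList(new ArrayList<>());
        Map<String, String> attrs = Collections.synchronizedMap(new HashMap<>());
        Set<String> tags = Collections.synchronizedSet(new HashSet<>());

        accessRoles.add("admin");
        attrs.put("lang", "en");
        tags.add("example");
        System.out.println(accessRoles + " " + attrs + " " + tags);
    }
}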
Just my 2 ct.
Going through the code base of eclipselink, it looks like the usage of vector is inherited from older code base and is much like the Vector class itself - legacy.
Somehow the intent was to use Vector to allow multiple threads to act safely on the relationships which are loaded lazily - "indirection" in eclipselink parlance.
(More on the concepts here- the different types of indirection discussed being ValueHolder indirection, Transparent Indirection, Proxy indirection etc.)
However, typically the entities and their relationships are not shared among multiple threads in the usual use-cases. Each thread gets its own copy of an entity and its relationships if they are accessed in its own unit of work.
In the case of ValueHolder indirection - one of the implementations of ValueHolderInterface is ValueHolder, which is typically initialized with a Vector. The relevant part of the code is below, along with the code comment as is. The comments are interesting as well.
IndirectList.java
..........................
.........................
/**
* INTERNAL:
* Return the valueHolder.
* This method used to be synchronized, which caused deadlock.
*/
public ValueHolderInterface getValueHolder() {
    // PERF: lazy initialize value holder and vector as are normally set after creation.
    if (valueHolder == null) {
        synchronized(this) {
            if (valueHolder == null) {
                valueHolder = new ValueHolder(new Vector(this.initialCapacity, this.capacityIncrement));
            }
        }
    }
    return valueHolder;
}
...................
..................
Also, there were a few issues reported due to the usage of Vector, as mentioned here and here.
You don't need synchronized collections for JPA; synchronization should only be driven by your business logic, which I suppose doesn't need it here - you would know.
So basically the suggestion is not to synchronize, and that will increase performance.
As @flup answered with some interesting references, I could only make some additional presumptions:
The team that developed it and/or wrote the specifications were simply unaware of the Collections API.
The team wanted to use the code in a highly concurrent environment (either in your web application, like passing some entities to some other threads, or in another desktop application, as JPA is not limited to web applications only). Also note that IndirectSet is not thread-safe, meaning that if the team wanted to write some thread-safe code, they should have taken some additional measures (if they use Sets)!

Builder pattern with error-checking: is it possible/advisable?

I am using the Builder pattern to make it easier to create objects. However, the standard builder pattern examples do not include error-checking, which is needed in my code. For example, the accessibility and demandMean arrays in the Simulator object should have the same length. A brief framework of the code is shown below:
public class Simulator {
    double[] accessibility;
    double[] demandMean;

    // Constructor omitted for brevity

    public static class Builder {
        private double[] _accessibility;
        private double[] _demandMean;

        public Builder accessibility(double[] accessibility) {
            _accessibility = accessibility.clone();
            return this;
        }

        public Builder demandMean(double[] demandMean) {
            _demandMean = demandMean.clone();
            return this;
        }

        // build() method omitted for brevity
    }
}
As another example, in a promotion optimization problem, there are various promotional vehicles (e.g. flyers, displays) and promotion modes, which are a set of promotional vehicles (e.g. none, flyer only, display only, flyer and display). When I create the Problem, I have to define the set of vehicles available, and check that the promotion modes use a subset of these vehicles and not some other unavailable vehicles, as well as that the promotion modes are not identical (e.g. there aren't two promo modes that are both "flyer only"). A brief framework of the code is shown below:
public class Problem {
    Set<Vehicle> vehicles;
    Set<PromoMode> promoModes;

    public static class Builder {
        Set<Vehicle> _vehicles;
        Set<PromoMode> _promoModes;
    }
}

public class PromoMode {
    Set<Vehicle> vehiclesUsed;
}
My questions are the following:
Is there a standard approach to address such a situation?
Should the error checking be done in the constructor or in the builder when the build() method is called?
Why is this the "right" approach?
When you need invariants to hold while creating an object then stop construction if any parameter violates the invariants. This is also a fail-fast approach.
The builder pattern helps with creating an object when you have a large number of parameters.
That does not mean that you don't do error checking.
Just throw an appropriate RuntimeException as soon as a parameter violates the object's invariants.
You should use the constructor, since that follows the Single Responsibility Principle better. It is not the responsibility of the Builder to check invariants. Its only real job is to collect the data needed to build the object.
Also, if you decide to change the class later to have public constructors, you don't have to move that code.
You definitely shouldn't check invariants in setter methods. This has several benefits:
* You only need to do checking ONCE
* In cases such as your code, you CAN'T check your invariants earlier, since you're adding your two arrays at different times. You don't know what order your users are going to add them, so you don't know which method should run the check.
Unless a setter in your builder does some intense calculations (which is rarely the case - generally, if there's some sort of calculation required, it should happen in the constructor anyway), it doesn't help very much to 'fail early', especially since fluent Builders like yours use only one line of code to build the object anyway, so any try block would surround that whole line either way.
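To make that concrete, here is a minimal sketch of what the omitted build() step could look like when the invariant is checked once, at build time. The Simulator constructor it calls is an assumption (the question omits it), and the exception types and messages are just one reasonable choice:
public static class Builder {
    private double[] _accessibility;
    private double[] _demandMean;

    // accessibility(...) and demandMean(...) setters as in the question ...

    public Simulator build() {
        // Fail fast: validate the invariant once, when all parameters are known.
        if (_accessibility == null || _demandMean == null) {
            throw new IllegalStateException("accessibility and demandMean must both be set");
        }
        if (_accessibility.length != _demandMean.length) {
            throw new IllegalStateException("accessibility and demandMean must have the same length");
        }
        return new Simulator(_accessibility, _demandMean); // assumed private constructor
    }
}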
The "right" approach really depends on the situation - if it is invalid to construct the arrays with different sizes, i'd say it's better to do the handling in the construction, the sooner an invalid state is caught the better.
Now, if you for instance can change the arrays and put in a different one - then it might be better to do it when calling them.

Database information -> object: how should it be done?

My application will upon request retrieve information from a database and produce an object from that information. I'm currently considering two different techniques (but I'm open to others as well!) to complete this task:
Method one:
class Book {
    private int id;
    private String author;
    private String title;

    public Book(int id) throws SQLException {
        ResultSet book = getBookFromDatabaseById(id);
        this.id = book.getInt("id");
        this.author = book.getString("author");
        // ...
    }
}
Method two:
public class Book {
    private HashMap<String, Object> propertyContainer;

    public Book(int id) {
        this.propertyContainer = getBookFromDatabaseById(id);
    }

    public Object getProperty(String propertyKey) {
        return this.propertyContainer.get(propertyKey);
    }
}
With method one, I believe that it's easier to control, limit and possibly access properties; adding new properties, however, becomes smoother with method two.
What's the proper way to do this?
I think this problem has been solved in many ways: ORM, DAO, row and table mapper, lots of others. There's no need to redo it again.
One issue you have to think hard about is coupling and cyclic dependencies between packages. You might think you're doing something clever by telling a model object how to persist itself, but one consequence of this design choice is coupling between model objects and the persistence tier. You can't use model objects without persistence if you do this. They really become one big, unwieldy package. There's no layering.
Another choice is to have model objects remain oblivious to whether or not they're persisted. It's a one way dependence that way: persistence knows about model objects, but not the other way around.
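As a rough sketch of that one-way dependence, a plain-JDBC DAO could look something like this (the three-argument Book constructor and the table/column names are assumptions for illustration):
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

public class BookDao {

    private final DataSource dataSource;

    public BookDao(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Persistence knows about Book; Book knows nothing about persistence.
    public Book findById(int id) throws SQLException {
        String sql = "SELECT id, author, title FROM book WHERE id = ?";
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    return null; // or throw, or return an Optional
                }
                return new Book(rs.getInt("id"), rs.getString("author"), rs.getString("title"));
            }
        }
    }
}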
Google for those other solutions. There's no need to beat that dead horse again.
The first method will provide you with type safety for the associated accessors, so you will know what type of object you are getting back and don't have to cast to the type you are expecting (this becomes more important when providing anything other than primitives).
For that reason (plus that it will make the resulting code simpler and easier to read) I would pick the first one. In any large application you will also be able to quickly, easily and neatly get parameter values back in the code for debugging etc. within the object itself.
If anyone else is going to be working on this code as well (or you're planning on working on it after you forget about it), the first one will also help, as you know the parameters etc. The second one will only give you this with extensive Javadoc.
The first one is the classical way. The second one is really tricky for nothing.

Forcing devs to explicitly define keys for configuration data

We are working in a project with multiple developers and currently the retrieval of values from a configuration file is somewhat "wild west":
Everybody uses some string to retrieve a value from the Config object
Those keys are spread across multiple classes and packages
Sometimes they are not even declared as constants
Naming of the keys is inconsistent and the config file (.properties) looks messy
I would like to sort that out and force everyone to explicitly define their configuration keys. Ideally in one place to streamline how config keys actually look.
I was thinking of using an enum as a key and turning my retrieval method
getConfigValue(String key)
into something like
getConfigValue(ConfigKey key)
NOTE: I am using this approach since the Preferences API seems a bit overkill to me plus I would actually like to have the configuration in a simple file.
What are the cons of this approach?
First off, FWIW, I think it's a good idea. But you did specifically ask what the "cons" are, so:
The biggest "con" is that it ties any class that needs to use configuration data to the ConfigKey class. Adding a config key used to mean adding a string to the code you were working on; now it means adding to the enum and to the code you were working on. This is (marginally) more work.
You're probably not markedly increasing inter-dependence otherwise, since I assume the class that getConfigValue is part of is the one on which you'd define the enum.
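A minimal sketch of that enum-based approach, with names that are only placeholders, could look like this:
import java.util.Properties;

public class Config {

    // Every key lives in one place; each constant carries the literal
    // property name that appears in the .properties file.
    public enum ConfigKey {
        DATABASE_URL("database.url"),
        CACHE_SIZE("cache.size");

        private final String propertyName;

        ConfigKey(String propertyName) {
            this.propertyName = propertyName;
        }
    }

    private final Properties properties;

    public Config(Properties properties) {
        this.properties = properties;
    }

    public String getConfigValue(ConfigKey key) {
        return properties.getProperty(key.propertyName);
    }
}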
The other downside to consolidation is if you have multiple projects on different parts of the same code base. When you develop, you have to deal with delivery dependencies, which can be a PITA.
Say Project A and Project B are scheduled to get released in that order. Suddenly political forces change in the 9th hour and you have to deliver B before A. Do you repackage the config to deal with it? Can your QA cycles deal with repackaging, or does it force a reset in their timeline?
Typical release issues, but just one more thing you have to manage.
From your question, it is clear that you intend to write a wrapper class for the raw Java Properties API, with the intention that your wrapper class provides a better API. I think that is a good approach, but I'd like to suggest some things that I think will improve your wrapper API.
My first suggested improvement is that an operation that retrieves a configuration value should take two parameters rather than one, and be implemented as shown in the following pseudocode:
class Configuration {
    public String getString(String namespace, String localName) {
        return properties.getProperty(namespace + "." + localName);
    }
}
You can then encourage each developer to define a string constant value to denote the namespace for whatever class/module/component they are developing. As long as each developer (somehow) chooses a different string constant for their namespace, you will avoid accidental name clashes and promote a somewhat organised collection of property names.
My second suggested improvement is that your wrapper class should provide type-safe access to property values. For example, provide getString(), but also provide methods with names such as getInt(), getBoolean(), getDouble() and getStringList(). The int/boolean/double variants should retrieve the property value as a string, attempt to parse it into the appropriate type, and throw a descriptive error message if that fails. The getStringList() method should retrieve the property value as a string and then split it into a list of strings based on using, say, a comma as a separator. Doing this will provide a consistent way for developers to get a list value.
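Continuing the pseudocode above, such type-safe accessors might look roughly like this (the error message wording is an assumption, and java.util.Arrays and java.util.List are assumed to be imported):
private String getRequiredString(String namespace, String localName) {
    String raw = properties.getProperty(namespace + "." + localName);
    if (raw == null) {
        throw new IllegalArgumentException("Missing property '" + namespace + "." + localName + "'");
    }
    return raw.trim();
}

public int getInt(String namespace, String localName) {
    String raw = getRequiredString(namespace, localName);
    try {
        return Integer.parseInt(raw);
    } catch (NumberFormatException e) {
        throw new IllegalArgumentException(
            "Property '" + namespace + "." + localName + "' should be an integer but was '" + raw + "'", e);
    }
}

public List<String> getStringList(String namespace, String localName) {
    // Comma-separated values, so every developer gets list handling for free.
    return Arrays.asList(getRequiredString(namespace, localName).split("\\s*,\\s*"));
}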
My third suggested improvement is that your wrapper class should provide some additional methods such as:
int getDurationMilliseconds(String namespace, String localName);
int getDurationSeconds(String namespace, String localName);
int getMemorySizeBytes(String namespace, String localName);
int getMemorySizeKB(String namespace, String localName);
int getMemorySizeMB(String namespace, String localName);
Here are some examples of their intended use:
cacheSize = cfg.getMemorySizeBytes(MY_NAMESPACE, "cache_size");
timeout = cfg.getDurationMilliseconds(MY_NAMESPACE, "cache_timeout");
The getMemorySizeBytes() method should convert string values such as "2048 bytes" or "32MB" into the appropriate number of bytes, and getMemorySizeKB() does something similar but returns the specified size in terms of KB rather than bytes. Likewise, the getDuration<units>() methods should be able to handle string values like "500 milliseconds", "2.5 minutes", "3 hours" and "infinite" (which is converted into, say, -1).
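One simplified way such a conversion could be implemented, reusing the getRequiredString helper sketched above and supporting only a handful of units:
public int getDurationMilliseconds(String namespace, String localName) {
    String raw = getRequiredString(namespace, localName).toLowerCase();
    if (raw.equals("infinite")) {
        return -1; // the agreed-upon marker for "no limit"
    }
    String[] parts = raw.split("\\s+");
    double amount = Double.parseDouble(parts[0]);
    String unit = parts.length > 1 ? parts[1] : "milliseconds";
    if (unit.startsWith("millisecond")) return (int) amount;
    if (unit.startsWith("second"))      return (int) (amount * 1000);
    if (unit.startsWith("minute"))      return (int) (amount * 60 * 1000);
    if (unit.startsWith("hour"))        return (int) (amount * 60 * 60 * 1000);
    throw new IllegalArgumentException(
        "Property '" + namespace + "." + localName + "' has an unknown duration unit '" + unit + "'");
}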
Some people may think that the above suggestions have nothing to do with the question that was asked. Actually, they do, but in a sneaky sort of way. The above suggestions will result in a configuration API that developers will find to be much easier to use than the "raw" Java Properties API. They will use it to obtain that ease-of-use benefit. But using the API will have the side effect of forcing the developers to adopt a namespace convention, which will help to solve the problem that you are interested in addressing.
Or to look at it another way, the main con of the approach described in the question is that it offers a win-lose situation: you win (by imposing a property-naming convention on developers), but developers lose because they swap the familiar Java Properties API for another API that doesn't offer them any benefits. In contrast, the improvements I have suggested are intended to provide a win-win situation.

Refactoring large data object

What are some common strategies for refactoring large "state-only" objects?
I am working on a specific soft-real-time decision support system which does online modeling/simulation of the national airspace. This piece of software consumes a number of live data feeds, and produces a once-per-minute estimate of the "state" of a large number of entities in the airspace. The problem breaks down neatly until we hit what is currently the lowest-level entity.
Our mathematical model estimates/predicts upwards of 50 parameters for a timeline of several hours into the past and future for each of these entities, roughly once per minute. Currently, these records are encoded as a single Java class with a lot of fields (some get collapsed into an ArrayList). Our model is evolving, and the dependencies among the fields are not yet set in stone, so each instance wanders through a convoluted model, accumulating settings as it goes along.
Currently we have something like the following, which uses a builder pattern approach to build up the contents of the record and enforce what the known dependencies are (as a check against programmer error as we evolve the model). Once the estimate is done, we convert the below into an immutable form using a .build() type method.
final class OneMinuteEstimate {

    enum EstimateState { INFANT, HEADER, INDEPENDENT, ... };
    EstimateState state = EstimateState.INFANT;

    // "header" stuff
    DateTime estimatedAtTime = null;
    DateTime stamp = null;
    EntityId id = null;

    // independent fields
    int status1 = -1;
    ...

    // dependent/complex fields...
    ... goes on for 40+ more fields...

    void setHeaderFields(...)
    {
        if (!EstimateState.INFANT.equals(state)) {
            throw new IllegalStateException("Must be in INFANT state to set header");
        }
        ...
    }
}
Once a very large number of these estimates are complete, they are assembled into timelines where aggregate patterns/trends are analyzed. We have looked at using an embedded database but have struggled with performance issues; we'd rather get this sorted out in terms of data modeling and then incrementally move portions of the soft-real-time code into an embedded data store.
Once the "time sensitive" pieces of this are done, the products are flushed to flat files and a database.
Problems:
It's a giant class, with way too many fields.
There is very little behavior encoded in the class; it's mostly a holder for data fields.
Maintaining the build() method is extremely cumbersome.
It feels clumsy to manually maintain a "state machine" abstraction merely for the purpose of ensuring that a large number of dependent modeling components are properly populating a data object, but it has saved us a lot of frustration as the model evolves.
There is a lot of duplication, particularly when the records described above are aggregated into very similar "rollups" which amount to rolling sums/averages or other statistical products of the above structure in time series.
While some of the fields could be clumped together, they are all logically "peers" of one another, and any breakdown we've tried has resulted in having behavior/logic artificially split and needing to reach two levels deep in indirection.
Out-of-the-box ideas are entertained, but this is something we need to evolve incrementally. Before anyone else says it, I'll note that one could suggest that our mathematical model is insufficiently crisp if the data representation for that model is this hard to get ahold of. Fair point, and we're working on that, but I think that's a side-effect of an R&D environment with a lot of contributors and a lot of concurrent hypotheses in play.
(Not that it matters, but this is implemented in Java. We use HSQLDB or Postgres for output products. We don't use any persistence framework, partly out of a lack of familiarity, partly because we have enough performance trouble with just the database alone and hand-coded storage routines... we're skeptical of moving towards additional abstraction.)
I had much of the same problem you did.
At least I think I did, sounds like I did. Representation was different, but at 10,000 feet, sounds pretty much the same. Crapload of discrete, "arbitrary" variables and a bunch of ad hoc relationships among them (essentially business driven), subject to change at a moment's notice.
You also have another issue, which you sorta mentioned, and that was the performance requirement. Sounds like faster is better, and likely a slow perfect solution would be tossed out for the fast lousy one, simply because the slower one can't meet a baseline performance requirement, no matter how good it is.
To put it simply, what I did was I designed a simple domain specific rule language for my system.
The entire point of the DSL was to implicitly express relationships and package them up into modules.
Very crude, contrived example:
D = 7
C = A + B
B = A / 5
A = 10
RULE 1: IF (C < 10) ALERT "C is less than 10"
RULE 2: IF (C > 5) ALERT "C is greater than 5"
RULE 3: IF (D > 10) ALERT "D is greater than 10"
MODULE 1: RULE 1
MODULE 2: RULE 3
MODULE 3: RULE 1, RULE 2
First, this is not representative of my syntax.
But you can see from the Modules, that it is 3, simple rules.
The key though, is that it's obvious from this that Rule 1 depends on C, which depends on A and B, and B depends on A. Those relationships are implied.
So, for that module, all of those dependencies "come with it". You can see if I generated code for Module 1 it might look something like:
public void module_1() {
    int a = 10;
    int b = a / 5;
    int c = a + b;
    if (c < 10) {
        alert("C is less than 10");
    }
}
Whereas if I created Module 2, all I would get is:
public void module_2() {
    int d = 7;
    if (d > 10) {
        alert("D is greater than 10.");
    }
}
In Module 3 you see the "free" reuse:
public void module_3() {
    int a = 10;
    int b = a / 5;
    int c = a + b;
    if (c < 10) {
        alert("C is less than 10");
    }
    if (c > 5) {
        alert("C is greater than 5");
    }
}
So, even though I have one "soup" of rules, the Modules root the base of the dependencies, and thus filter out the stuff it doesn't care about. Grab a module, shake the tree and keep what's left hanging.
My system used the DSL to generate source code, but you can easily have it create a mini runtime interpreter as well.
Simple topological sorting handled the dependency graph for me.
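For context, that dependency handling does not need to be elaborate. A plain depth-first topological sort over a hypothetical dependency map (variable -> the variables it reads from) is enough:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RuleDependencySort {

    // Returns the variables in evaluation order: dependencies first, target last.
    public static List<String> evaluationOrder(String target, Map<String, List<String>> dependencies) {
        List<String> order = new ArrayList<>();
        visit(target, dependencies, new HashSet<>(), order);
        return order;
    }

    private static void visit(String var, Map<String, List<String>> deps,
                              Set<String> visited, List<String> order) {
        if (!visited.add(var)) {
            return; // already handled (also keeps cycles from recursing forever)
        }
        for (String dep : deps.getOrDefault(var, Collections.emptyList())) {
            visit(dep, deps, visited, order);
        }
        order.add(var);
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new HashMap<>();
        deps.put("C", Arrays.asList("A", "B"));
        deps.put("B", Arrays.asList("A"));
        System.out.println(evaluationOrder("C", deps)); // prints [A, B, C]
    }
}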
So, the nice thing about this is that while there was inevitable duplication in the final, generated logic, at least across modules, there wasn't any duplication in the rule base. What you as a developer/knowledge worker maintain is the rule base.
What is also nice is that you can change an equation and not worry so much about the side effects. For example, if I change C to C = A / 2, then, suddenly, B drops out completely. But the rule for IF (C < 10) doesn't change at all.
With a few simple tools, you can show the entire dependency graph, you can find orphaned variables (like B), etc.
By generating source code, it's going to run as fast as you want.
In my case, it was interesting to see a rule drop a single variable and see 500 lines of source code vanish from the resulting module. That's 500 lines I didn't have to crawl through by hand and remove during maintenance and development. All I had to do was change a single rule in my rule base and let "magic" happen.
I was even able to do some simple peephole optimization and eliminate variables.
It's not that hard to do. Your rule language can be XML, or a simple expression parser. No reason to go full boat Yacc or ANTLR on it if you don't want to. I'll put a plug in for S-Expressions, no grammar needed, brain dead parsing.
Spreadsheets also make a great input tool, actually. Just be strict on the formatting. Kind of sucks for merging in SVN (so, Don't Do That), but end users love it.
You may well be able to get away with an actual rule based system. My system wasn't dynamic at runtime, and didn't really need sophisticated goal seeking and inference, so I didn't need the overhead of such a system. But if one works for you out of the box, then happy day.
Oh, and for an implementation note, for those who don't believe you can hit the 64K code limit in a Java method, well I can assure you it can be done :).
Splitting a Large Data Object is very similar to Normalizing a Large Relational Table (first and second normal form). Follow the rules to reach at least second normal form and you may have a good decomposition of the original class.
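As a sketch of what the first step of such a decomposition could look like, using the field groupings hinted at in the question's comments (the grouping itself is an assumption):
// "Header" fields pulled out of OneMinuteEstimate into their own small type,
// analogous to splitting one wide table into narrower, normalized ones.
final class EstimateHeader {
    final DateTime estimatedAtTime;
    final DateTime stamp;
    final EntityId id;

    EstimateHeader(DateTime estimatedAtTime, DateTime stamp, EntityId id) {
        this.estimatedAtTime = estimatedAtTime;
        this.stamp = stamp;
        this.id = id;
    }
}

final class OneMinuteEstimate {
    final EstimateHeader header;
    // the independent and dependent field groups would follow the same pattern

    OneMinuteEstimate(EstimateHeader header) {
        this.header = header;
    }
}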
From experience working also with R&D stuff with soft real-time performance constraints (and sometimes monster fat classes), I would suggest NOT to use OR mappers. In such situations, you'll be better off "touching the metal" and working directly with JDBC result sets. This is my suggestion for apps with soft real-time constraints and massive amounts of data items per package. More importantly, if the number of distinct classes (not class instances, but class definitions) that need to be persisted is large, and you also have memory constraints in your specs, you will also want to avoid ORMs like Hibernate.
Going back to your original question:
What you seem to have is a typical problem of 1) mapping multiple data items into an OO model and 2) such multiple data items not exhibiting a good way of grouping or segregation (and any attempt at grouping tends simply not to feel right). Sometimes the domain model does not lend itself to such aggregation, and coming up with an artificial way of doing so typically ends up in compromises that don't satisfy all design requirements and desires.
To make matters worse, an OO model typically requires/expects you to have all the items present in a class as the class' fields. Such a class is typically without behavior, so it is just a struct-like construct, aka data envelope or data shuttle.
But such situations beg the following questions:
Does your application need to read/write all 40, 50+ data items at once, always?
*Must all data items be always present?*
I do not know the specifics of your problem domain, but in general I've found that we rarely ever need to deal with all data items at once. This is where a relational model shines, because you don't have to query all rows from a table at once. You only pull those you need as projections of the table/view in question.
In a situation where we have a potentially large number of data items, but on average the number of data items being passed down the wire is less than the maximum, you'd be better off using a Properties pattern.
Instead of defining a monster envelope class holding all items :
// java pseudocode
class envelope
{
field1, field2, field3... field_n;
...
setFields(m1,m2,m3,...m_n){field1=m1; .... };
...
}
Define a dictionary (based on a map for example):
// java pseudocode
public enum EnvelopeField {field1, field2, field3,... field_n};
interface Envelope //package visible
{
// typical map-based read fields.
Object get(EnvelopeField field);
boolean isEmpty();
// new methods similar to existing ones in java.lang.Map, but
// more semantically aligned with envelopes and fields.
Iterator<EnvelopeField> fields();
boolean hasField(EnvelopeField field);
}
// a "marker" interface
// code that only needs to read envelopes must operate on
// these interfaces.
public interface ReadOnlyEnvelope extends Envelope {}
// the read-write version of envelope, notice that
// it inherits from Envelope, but not from ReadOnlyEnvelope.
// this is done to make it difficult (but not impossible
// unfortunately) to "cast-up" a read only envelope into a
// mutable one.
public interface MutableEnvelope extends Envelope
{
Object put(EnvelopeField field, Object value);
// to "cast-down" or "narrow" into a read only version type that
// cannot directly be "cast-up" back into a mutable.
ReadOnlyEnvelope readOnly();
}
// the standard interface for map-based envelopes.
public interface MapBasedEnvelope extends
    Map<EnvelopeField, java.lang.Object>,
    MutableEnvelope
{
}
// package visible, not public
class EnvelopeImpl extends HashMap<EnvelopeField,java.lang.Object>
implements MapBasedEnvelope, ReadOnlyEnvelope
{
// get, put, isEmpty are automatically inherited from HashMap
...
public Iterator<EnvelopeField> fields(){ return this.keySet().iterator(); }
public boolean hasField(EnvelopeField field){ return this.containsKey(field); }
// the typecast is redundant, but it makes the intention obvious in code.
public ReadOnlyEnvelope readOnly(){ return (ReadOnlyEnvelope)this; }
}
public final class EnvelopeFactory
{
    public static MapBasedEnvelope newInstance() { return new EnvelopeImpl(); }
}
No need to set up read-only internal flags. All you need to do is downcast your envelope instances as Envelope instances (that only provide getters).
Code that expects to read should operate on read-only envelopes and code that expects to change fields should operate on mutable envelopes. Creation of the actual instances would be compartmentalized in factories.
That is, you use the compiler to enforce things to be read-only (or allow things to be mutable) by establishing some code conventions, rules governing what interfaces to use where and how.
You can layer your code into sections that need to write separate from code that only needs to read. Once that's done, simple code reviews (or even grep) can identify code that is using the wrong interface.
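A short usage sketch, based on the interfaces above (the factory method name matches the one in the code above):
// Writer side: builds and fills the envelope via the mutable interface.
MapBasedEnvelope envelope = EnvelopeFactory.newInstance();
envelope.put(EnvelopeField.field1, 42);

// Reader side: only ever handed the narrowed, getter-only view.
ReadOnlyEnvelope view = envelope.readOnly();
Object value = view.get(EnvelopeField.field1);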
Problems:
Non-public Parent Interface:
Envelope is not declared as a public interface to prevent erroneous/malicious code from casting a read-only envelope down to a base envelope and then back to a mutable envelope. The intended flow is from mutable to read-only only - it is not intended to be bi-directional.
The problem here is that extension of Envelope is restricted to the package that contains it. Whether that is a problem will depend on the particular domain and intended usage.
Factories:
The problem is that factories can (and most likely will) be very complex. Again, the nature of the beast.
Validation:
Another problem introduced with this approach is that now you have to worry about code that expects field X to be present. Having the original monster envelope class partially frees you from that worry because, at least syntactically, all fields are there...
... whether the fields are set or not, that was another matter that still remains with this new model I'm proposing.
So if you have client code that expects to see field X, the client code has to throw some type of exception if the field is not present (or compute or read a sensible default somehow). In such cases, you will have to:
1. Identify patterns of field presence. Clients that expect field X to be present might be grouped separately (layered apart) from clients that expect some other field to be present.
2. Associate custom validators (proxies to read-only envelope interfaces) that either throw exceptions or compute default values for missing fields according to some rules (rules provided programmatically, with an interpreter, or with a rules engine).
Lack of Typing:
This might be debatable, but people used to working with static typing might feel uneasy about losing the benefits of static typing by going to a loosely typed, map-based approach. The counter-argument to this is that most of the web works on a loose typing approach, even on the Java side (JSTL, EL).
Problems aside, the larger the maximum number of possible fields and the lower the average number of fields present at any given time, the more effective this approach will be with respect to performance. It adds additional code complexity, but that's the nature of the beast.
That complexity doesn't go away; it will be present either in your class model or in your validation code. Serialization and transferring down the wire is much more efficient, though, especially if you expect massive numbers of individual data transfers.
Hope it helps.
Actually, this looks like a frequent problem that game developers face: bloated classes holding numerous variables and methods because of a deep inheritance tree, etc.
There's this blog post about how and why to select composition over inheritance, maybe it would help.
One way you may be able to intelligently break up a large data class is to look at patterns of access by client classes. For example, if a set of classes only accesses fields 1-20 and another set of classes only accesses fields 25-30, maybe those groups of fields belong in separate classes.
