Application design for processing data prior to database insertion - Java

I have a large collection of data in an Excel file (and CSV files). The data needs to be placed into a database (MySQL). However, before it goes into the database it needs to be processed. For example, if column 1 is less than column 3, add 4 to column 2. There are quite a few rules that must be followed before the information is persisted.
What would be a good design to follow to accomplish this task? (using Java)
Additional notes
The process needs to be automated, in the sense that I don't have to manually go in and alter the data. We're talking about thousands of lines of data with 15 columns of information per line.
Currently, I have a sort of chain-of-responsibility design set up: one class (Java) for each rule. When one rule is done, it calls the following rule.
More Info
Typically there are about 5000 rows per data sheet. Speed isn't a huge concern because this large input doesn't happen often.
I've considered Drools; however, I wasn't sure the task was complicated enough for Drools.
Example rules:
All currency (data in specific columns) must not contain currency symbols.
Category names must be uniform (e.g. book case = bookcase)
Entry dates cannot be future dates
Text input can only contain [A-Z 0-9 \s]
etc..
Additionally, if any column of information is invalid it needs to be reported when processing is complete (or maybe stop processing).
My current solution works. However, I think there is room for improvement, so I'm looking for ideas as to how it can be improved and/or how other people have handled similar situations.
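For reference, the per-rule setup described above might look roughly like the sketch below (Row, the rule names and the error-collecting style are all invented for illustration; a flat list of rules is a common simplification of each rule calling the next):

import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// one class per rule, each reporting problems into a shared list
interface Rule {
    void apply(Row row, List<String> errors); // Row: hypothetical wrapper around one CSV line
}

class NoFutureDates implements Rule {
    public void apply(Row row, List<String> errors) {
        if (row.getEntryDate().isAfter(LocalDate.now())) {
            errors.add("Future entry date in row " + row.getNumber());
        }
    }
}

class StripCurrencySymbols implements Rule {
    public void apply(Row row, List<String> errors) {
        for (int col : row.currencyColumns()) {
            row.set(col, row.get(col).replaceAll("[^0-9.]", "")); // keep digits and the decimal point
        }
    }
}

// driver (inside whatever method loads the sheet): run every rule over every
// row, then report the collected errors once processing is complete
List<String> errors = new ArrayList<>();
for (Row row : rows) {
    for (Rule rule : rules) {
        rule.apply(row, errors);
    }
}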

If I didn't care to do this in one step (as Oli mentions), I'd probably use a pipes-and-filters design. Since your rules are relatively simple, I'd probably write a couple of delegate-based classes. For instance (C# code, but Java should be pretty similar... perhaps someone could translate?):
interface IFilter {
    IEnumerable<string> Filter(IEnumerable<string> file);
}

class PredicateFilter : IFilter {
    private readonly Predicate<string> predicate;
    public PredicateFilter(Predicate<string> predicate) { this.predicate = predicate; }
    public IEnumerable<string> Filter(IEnumerable<string> file) {
        foreach (string s in file) {
            // keep only the lines that pass the predicate
            if (predicate(s)) {
                yield return s;
            }
        }
    }
}

class ActionFilter : IFilter {
    private readonly Action<string> action;
    public ActionFilter(Action<string> action) { this.action = action; }
    public IEnumerable<string> Filter(IEnumerable<string> file) {
        foreach (string s in file) {
            // perform a side effect (e.g. reporting) and pass the line through
            action(s);
            yield return s;
        }
    }
}

class ReplaceFilter : IFilter {
    private readonly Func<string, string> replace;
    public ReplaceFilter(Func<string, string> replace) { this.replace = replace; }
    public IEnumerable<string> Filter(IEnumerable<string> file) {
        foreach (string s in file) {
            // transform each line
            yield return replace(s);
        }
    }
}
From there, you could either use the delegate filters directly or subclass them for the specifics. Then register them with a Pipeline that passes the data through each filter in turn.
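Taking up the invitation to translate, a rough Java equivalent might use java.util.function and streams (a sketch, not a drop-in implementation):

import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Stream;

interface Filter {
    Stream<String> filter(Stream<String> lines);
}

class PredicateFilter implements Filter {
    private final Predicate<String> predicate;
    PredicateFilter(Predicate<String> predicate) { this.predicate = predicate; }
    public Stream<String> filter(Stream<String> lines) { return lines.filter(predicate); }
}

class ActionFilter implements Filter {
    private final Consumer<String> action;
    ActionFilter(Consumer<String> action) { this.action = action; }
    public Stream<String> filter(Stream<String> lines) { return lines.peek(action); }
}

class ReplaceFilter implements Filter {
    private final Function<String, String> replace;
    ReplaceFilter(Function<String, String> replace) { this.replace = replace; }
    public Stream<String> filter(Stream<String> lines) { return lines.map(replace); }
}

A pipeline is then just a fold over the filters:

List<Filter> pipeline = Arrays.asList(
        new ReplaceFilter(String::trim),
        new PredicateFilter(s -> !s.isEmpty()));
Stream<String> out = lines;
for (Filter f : pipeline) {
    out = f.filter(out);
}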

I think your method is OK, especially if you use the same interface on every processor.
You could also look at something called Drools, currently JBoss Rules. I used it some time ago for a rule-heavy part of my app, and what I liked about it is that the business logic can be expressed in, for instance, a spreadsheet or a DSL, which then gets compiled to Java (at run time, and I think there's also a compile-time option). It makes rules a bit more succinct and thus readable. It's also very easy to learn (two days or so).
Here's a link to the open-source JBoss Rules. At jboss.com you can undoubtedly purchase an officially maintained version if that's more to your company's taste.

Just create a function to enforce each rule, and call every applicable function for each value. I don't see how this requires any exotic architecture.

A class for each rule? Really? Perhaps I'm not understanding the quantity or complexity of these rules, but I would (semi-pseudo-code):
public class ALine {
    private int col1;
    private int col2;
    private int coln;
    // ...
    public ALine(String line) {
        // read the row into the private variables
        // ...
        this.process();
        this.insert();
    }
    public void process() {
        // apply all your rules here, working with the local variables
    }
    public void insert() {
        // write to the DB
    }
}

for (String line : csvLines) {
    new ALine(line);
}

Your methodology of using a class for each rule does sound a bit heavyweight, but it has the advantage of being easy to modify and extend should new rules come along.
As for loading the data, bulk loading is the way to go. I have read some information suggesting it may be as much as three orders of magnitude faster than loading with insert statements. You can find some information on it here

Bulk load the data into a temp table, then use SQL to apply your rules.
Use the temp table as the basis for the insert into the real table.
Drop the temp table.
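A minimal sketch of that flow over JDBC, assuming MySQL with LOAD DATA LOCAL INFILE enabled on the connection (table and column names are invented; the two UPDATEs stand in for the question's rules):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

void loadAndClean(String jdbcUrl, String user, String password) throws SQLException {
    try (Connection con = DriverManager.getConnection(jdbcUrl, user, password);
         Statement st = con.createStatement()) {
        st.execute("CREATE TEMPORARY TABLE staging LIKE inventory");
        st.execute("LOAD DATA LOCAL INFILE 'data.csv' INTO TABLE staging "
                 + "FIELDS TERMINATED BY ',' IGNORE 1 LINES");
        // example rule: strip currency symbols from a money column
        st.executeUpdate("UPDATE staging SET price = REPLACE(price, '$', '')");
        // example rule from the question: if col1 < col3, add 4 to col2
        st.executeUpdate("UPDATE staging SET col2 = col2 + 4 WHERE col1 < col3");
        st.executeUpdate("INSERT INTO inventory SELECT * FROM staging");
        st.execute("DROP TEMPORARY TABLE staging");
    }
}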

You can see that all the different answers come from their authors' own experience and perspective.
Since we don't know much about the complexity and number of rows in your system, we tend to give advice based on what we have done before.
If you want to narrow it down to one or two solutions for your implementation, try giving more details.
Good luck

It may not be what you want to hear, and it isn't the "fun way" by any means, but there is a much easier way to do this.
As long as your data is evaluated line by line, you can set up another worksheet in your Excel file and use spreadsheet-style functions to do the necessary transforms, referencing the data from the raw data sheet. For more complex functions you can use the VBA embedded in Excel to write custom operations.
I've used this approach many times and it works really well; it's just not very sexy.

Related

Open Closed and Interface Segregation

I am writing code which basically reads a text file (in tabular format) and checks whether the file contains the expected data types. For that I have written the following class.
The sample file would be something like this.
name age
abc 20
xyz vf
aaa 22
And I have a JSON file which says which column should contain what:
{
    "filename": "test.txt",
    "cols": {
        "name": "string",
        "age": "int"
    }
}
The JSON file tells me the expected data type for each column, so I know what to expect.
The following code works without any issue. However, it seems that it violates the open-closed and interface segregation principles.
public class DataValidation {

    public boolean isInt(String value) {
        try {
            Integer.parseInt(value);
            return true;
        } catch (NumberFormatException ne) {
            return false;
        }
    }

    public boolean isFloat(String value) {
        try {
            Float.parseFloat(value);
            return true;
        } catch (NumberFormatException ne) {
            return false;
        }
    }
}
And so I am thinking of refactoring the code as shown below. However, I would like to know what advantage I would gain, and whether there is a better approach.
public interface DataValidation {
    boolean validate(String value);
}

public class IntValidator implements DataValidation {
    public boolean validate(String value) {
        try {
            Integer.parseInt(value);
            return true;
        } catch (NumberFormatException ne) {
            return false;
        }
    }
}
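One advantage shows up when the validators are wired to the type names from the JSON schema through a registry; a hedged sketch (FloatValidator is the analogous implementation, not shown above):

import java.util.HashMap;
import java.util.Map;

public class ValidatorRegistry {
    private final Map<String, DataValidation> validators = new HashMap<>();

    public ValidatorRegistry() {
        validators.put("int", new IntValidator());
        validators.put("float", new FloatValidator());
        // a new column type means registering one new class here --
        // no existing code has to change
    }

    // true when the cell value matches the type named in the JSON schema
    public boolean check(String expectedType, String cellValue) {
        DataValidation v = validators.get(expectedType);
        return v != null && v.validate(cellValue);
    }
}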
Basic definition of the Open-Closed Principle (OCP) (Meyer 1988):
the open/closed principle states "software entities (classes, modules, functions, etc.) should be open for extension, but closed for modification"; that is, such an entity can allow its behavior to be extended without modifying its source code. See Reference
BUT: on the other hand, Uncle Bob in this reference provides some clarifications about the meaning of OCP (which I use in the following).
First of all: in my opinion, your class (DataValidation) does not conflict with the Open-Closed Principle.
Your class JUST checks the primitive data types (as you answered to my question in a comment). There are just 8 primitive data types in Java, and this number will not change in the future. So if you put all 8 methods in one class, you won't face any extensions or modifications of data types in the future.
On the other hand, OCP is about adding new source code without any changes to old code. So even if Java added a new data type, you could add the method easily without modifications to other parts of the code.
Therefore, I think that your class is not BIG enough to violate the Open-Closed Principle.
Secondly: on using the Interface Segregation Principle (ISP):
To use ISP, we need some prerequisites. We should have some dependencies between parts of our system (or class). We should need dependency management to manage some parts of the system, and we consciously decide what each part of the system can depend on. Please read this reference in depth.
I think that your class is just a checker class and does not have any state (attributes or fields). So there is no reason to use ISP.
To sum up: using object-oriented principles and heuristics (like SOLID) should help us reduce COMPLEXITY. In your project, there is no need to use them.
To offer a solution for your problem:
You could use an enum DataType { BOOLEAN, CHAR, ... } for the primitive data types and only one method like DataType getDataType(String s) to get the type of a given String as an enum. But your approach (the DataValidation class) is good enough too.
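A minimal sketch of that enum idea (names are illustrative, and only a few types are shown):

public enum DataType {
    INT, FLOAT, BOOLEAN, STRING; // STRING as the fallback; extend as needed

    // classify a raw cell value; the first parse that succeeds wins
    public static DataType getDataType(String s) {
        try { Integer.parseInt(s); return INT; } catch (NumberFormatException ignored) { }
        try { Float.parseFloat(s); return FLOAT; } catch (NumberFormatException ignored) { }
        if ("true".equalsIgnoreCase(s) || "false".equalsIgnoreCase(s)) return BOOLEAN;
        return STRING;
    }
}

So DataType.getDataType("22") yields INT, which can then be compared against the type declared for that column in the JSON file.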
Gholamali-Irani provides a great answer, but I want to add some of my own thoughts on your topic:
First of all, almost every best practice, paradigm, etc. tries to increase the degree of maintainability, testability and extensibility. Do you really need them here? How big is the probability of adding some custom, complex type? If it is very low, then maybe your first variant is good enough for your task (not for the general task of verification, just for yours).
Secondly, much depends on how you use it. You didn't show how exactly you use all these methods/classes/interfaces. "Servant" code might be very good, it can be the cleanest code in the world, but who cares if it's used incorrectly or is very hard to use?

Is this the right naming convention?

Assume we have a method which calls another:
public void readXMLFile() {
    // read each line and parse it to return a node
    Node node = parse(line);
}

private Node parse(String line) {
    return null; // stub
}
Now, is it good practice to use a more comprehensive function name like "readXMLFileAndParse"?
Pros:
It provides more comprehensive information to the caller about what the function is supposed to do.
Otherwise the client may wonder: if it only reads, where is the "parse" utility?
In other words, I see a clear advantage in a function name that covers all the activities nested within it. Is this the right thing to do, i.e. is it considered good practice?
It's a guideline that every method should have only one job (single responsibility).
However, this causes naming problems when a method returns the result of a combination of sub-methods.
Therefore you should name it to describe its primary function: parsing a file. Reading the file is part of that, but it's not vital to the end user since it's implied.
Then again, you have to think about what this exactly entails: nobody parses a file just to parse it. Do you retrieve data? Do you write data?
You should describe your actions on that file, but not as literally as 'readFile' or 'parseFile'.
retrieveCustomers, if you're reading customers, would be a lot more descriptive.
public List<Customer> retrieveCustomers() {
    List<Customer> customers = new ArrayList<>();
    // loop over the lines, calling the parser for each
    return customers;
}

private Customer parseCustomer(String line) {
    return null; // stub
}
If you'd share what exactly it is you're trying to parse, that would help a lot.
I think it depends on the complexity of your class. Since the method is private, no one, in theory, should care. Name it descriptively enough so you can read your own code six months from now, and stop there.
Public methods, on the other hand, should be well named and well documented. Extra descriptiveness there can't hurt.

How do I get a nice coding style and easy maintenance when using translation maps in a Java Enterprise app?

Imagine that you are editing a big back office Enterprise Java app, where other people might poke around years from now. That means you have to keep the code clean and easy to understand, performance might not be the #1 priority.
There is a module that needs to
Extract data from objects
Map data parameters, for example SE -> Sweden [this only applies and is used in this module, for now]
Send these new parameters to somewhere (for example via email/xml)
For a small set of data I'd use a small HashMap, but the custom table of data that has to be transformed has grown to 3 HashMaps, some with ~100 elements. I have them in a file called Translater.java,
where I have a method:
public String getCountryCode(String country) {
    return countryCodes.get(country);
}
which is initialized with
countryCodes = new HashMap<String, String>() {{
    put("Andorra", "AD");
    put("Afghanistan", "AF");
    ...
}};
It looks ugly! But my choices seem to be:
1. Make a database table in a new database, which would add another layer of obfuscation when a coder just wants to see what maps to what. It is also never necessary to change this data, and if it were, that is better done as a code change since the DB is not under source control! (We use Hibernate.)
2. Store this static data as a config file; the application uses a database table for configuration options, so this would add to the maintenance.
3. Use the config database table to store this. That would work, but it could also make the rest of the configuration options harder to find, since the other types of data in the configuration table are relatively small and cohesive.
Try a simple enum for this; it is very effective and easy to maintain.
Example:
public enum Country {
    ANDORRA("AD"),
    AFGHANISTAN("AF"),
    ...;

    private String code;

    private Country(String code) {
        this.code = code;
    }

    public static String findCountryCode(String country) {
        return valueOf(country.toUpperCase()).getCode();
    }

    public String getCode() {
        return code;
    }
}

public class CountryTest {
    @Test
    public void testGetCode() throws Exception {
        assertThat(Country.findCountryCode("Andorra")).isEqualTo("AD");
    }
}
Edit: I'm not fully sure from your question which way the mapping should go, or if you need to be able to do lookups both ways. The following assumes that you are looking up country code as the value by the key of country name.
In my experience, number 3 is the best option. In a lot of system architectures you would have to redeploy the application if you needed to change hard-coded mappings.
I have seen from your comments on the first answer that your mappings are only likely to change once every 3 years or so. However, you can't guarantee that; requirements can change, and so too can international relations.
Your reservation about number 3 was:
that would work but could also make the rest of the configuration options harder to find since the other types of data in the configuration table are relatively small and cohesive.
The solution to this point is to have a well defined and clear naming convention for keys in the configuration database. You could, for instance, use multiple levels of prefixes in the key name to narrow down the intended scope/place of use of the configuration values. For example:
general.translation.countrycodes.Andorra

Refactoring large data object

What are some common strategies for refactoring large "state-only" objects?
I am working on a specific soft-real-time decision support system which does online modeling/simulation of the national airspace. This piece of software consumes a number of live data feeds, and produces a once-per-minute estimate of the "state" of a large number of entities in the airspace. The problem breaks down neatly until we hit what is currently the lowest-level entity.
Our mathematical model estimates/predicts upwards of 50 parameters for a timeline of several hours into the past and future for each of these entities, roughly once per minute. Currently, these records are encoded as a single Java class with a lot of fields (some get collapsed into an ArrayList). Our model is evolving, and the dependencies among the fields are not yet set in stone, so each instance wanders through a convoluted model, accumulating settings as it goes along.
Currently we have something like the following, which uses a builder-pattern approach to build up the contents of the record and enforce the known dependencies (as a check against programmer error as we evolve the model). Once the estimate is done, we convert the below into an immutable form using a .build()-type method.
final class OneMinuteEstimate {
    enum EstimateState { INFANT, HEADER, INDEPENDENT, ... };
    EstimateState state = EstimateState.INFANT;

    // "header" stuff
    DateTime estimatedAtTime = null;
    DateTime stamp = null;
    EntityId id = null;

    // independent fields
    int status1 = -1;
    ...

    // dependent/complex fields...
    ... goes on for 40+ more fields...

    void setHeaderFields(...)
    {
        if (!EstimateState.INFANT.equals(state)) {
            throw new IllegalStateException("Must be in INFANT state to set header");
        }
        ...
    }
}
Once a very large number of these estimates are complete, they are assembled into timelines where aggregate patterns/trends are analyzed. We have looked at using an embedded database but have struggled with performance issues; we'd rather get this sorted out in terms of data modeling and then incrementally move portions of the soft-real-time code into an embedded data store.
Once the "time sensitive" pieces of this are done, the products are flushed to flat files and a database.
Problems:
It's a giant class, with way too many fields.
There is very little behavior encoded in the class; it's mostly a holder for data fields.
Maintaining the build() method is extremely cumbersome.
It feels clumsy to manually maintain a "state machine" abstraction merely for the purpose of ensuring that a large number of dependent modeling components are properly populating a data object, but it has saved us a lot of frustration as the model evolves.
There is a lot of duplication, particularly when the records described above are aggregated into very similar "rollups" which amount to rolling sums/averages or other statistical products of the above structure in time series.
While some of the fields could be clumped together, they are all logically "peers" of one another, and any breakdown we've tried has resulted in having behavior/logic artificially split and needing to reach two levels deep in indirection.
Out-of-the-box ideas are entertained, but this is something we need to evolve incrementally. Before anyone else says it, I'll note that one could suggest that our mathematical model is insufficiently crisp if the data representation for that model is this hard to get hold of. Fair point, and we're working on that, but I think that's a side effect of an R&D environment with a lot of contributors and a lot of concurrent hypotheses in play.
(Not that it matters, but this is implemented in Java. We use HSQLDB or Postgres for output products. We don't use any persistence framework, partly out of a lack of familiarity, partly because we have enough performance trouble with just the database alone and hand-coded storage routines... we're skeptical of moving towards additional abstraction.)
I had much the same problem you did.
At least I think I did; it sounds like I did. The representation was different, but at 10,000 feet it sounds pretty much the same: a crapload of discrete, "arbitrary" variables and a bunch of ad hoc relationships among them (essentially business driven), subject to change at a moment's notice.
You also have another issue, which you sort of mentioned: the performance requirement. It sounds like faster is better, and a slow perfect solution would likely be tossed out for the fast lousy one, simply because the slower one can't meet the baseline performance requirement, no matter how good it is.
To put it simply, what I did was I designed a simple domain specific rule language for my system.
The entire point of the DSL was to implicitly express relationships and package them up in to modules.
Very crude, contrived example:
D = 7
C = A + B
B = A / 5
A = 10
RULE 1: IF (C < 10) ALERT "C is less than 10"
RULE 2: IF (C > 5) ALERT "C is greater than 5"
RULE 3: IF (D > 10) ALERT "D is greater than 10"
MODULE 1: RULE 1
MODULE 2: RULE 3
MODULE 3: RULE 1, RULE 2
First, this is not representative of my syntax.
But you can see from the modules that it comes down to 3 simple rules.
The key, though, is that it's obvious from this that Rule 1 depends on C, which depends on A and B, and that B depends on A. Those relationships are implied.
So, for that module, all of those dependencies "come with it". You can see that if I generated code for Module 1 it might look something like:
public void module_1() {
    int a = 10;
    int b = a / 5;
    int c = a + b;
    if (c < 10) {
        alert("C is less than 10");
    }
}
Whereas if I created Module 2, all I would get is:
public void module_2() {
    int d = 7;
    if (d > 10) {
        alert("D is greater than 10.");
    }
}
In Module 3 you see the "free" reuse:
public void module_3() {
    int a = 10;
    int b = a / 5;
    int c = a + b;
    if (c < 10) {
        alert("C is less than 10");
    }
    if (c > 5) {
        alert("C is greater than 5");
    }
}
So, even though I have one "soup" of rules, the modules root the base of the dependencies and thus filter out the stuff they don't care about. Grab a module, shake the tree and keep what's left hanging.
My system used the DSL to generate source code, but you can easily have it create a mini runtime interpreter as well.
Simple topological sorting handled the dependency graph for me.
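For reference, the ordering step can be this small (a sketch of Kahn's algorithm; the map-based graph representation is mine, not the author's):

import java.util.*;

// deps maps each variable to the set of variables it directly depends on;
// the returned list is a valid evaluation order for code generation.
static List<String> evaluationOrder(Map<String, Set<String>> deps) {
    Map<String, Integer> inDegree = new HashMap<>();
    Map<String, List<String>> dependents = new HashMap<>();
    for (Map.Entry<String, Set<String>> e : deps.entrySet()) {
        inDegree.putIfAbsent(e.getKey(), 0);
        for (String d : e.getValue()) {
            inDegree.putIfAbsent(d, 0);
            inDegree.merge(e.getKey(), 1, Integer::sum);
            dependents.computeIfAbsent(d, k -> new ArrayList<>()).add(e.getKey());
        }
    }
    Deque<String> ready = new ArrayDeque<>();
    for (Map.Entry<String, Integer> e : inDegree.entrySet()) {
        if (e.getValue() == 0) ready.add(e.getKey());
    }
    List<String> order = new ArrayList<>();
    while (!ready.isEmpty()) {
        String v = ready.remove();
        order.add(v);
        for (String dep : dependents.getOrDefault(v, Collections.emptyList())) {
            if (inDegree.merge(dep, -1, Integer::sum) == 0) ready.add(dep);
        }
    }
    if (order.size() != inDegree.size()) {
        throw new IllegalStateException("cycle in the rule dependencies");
    }
    return order; // e.g. [A, D, B, C] for the example rules above
}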
So, the nice thing about this is that while there was inevitable duplication in the final, generated logic, at least across modules, there wasn't any duplication in the rule base. What you as a developer/knowledge worker maintain is the rule base.
What is also nice is that you can change an equation and not worry so much about the side effects. For example, if I change C to C = A / 2, then suddenly B drops out completely. But the rule IF (C < 10) doesn't change at all.
With a few simple tools, you can show the entire dependency graph, you can find orphaned variables (like B), etc.
By generating source code, it's going to run as fast as you want.
In my case, it was interesting to see a rule drop a single variable and see 500 lines of source code vanish from the resulting module. That's 500 lines I didn't have to crawl through by hand and remove during maintenance and development. All I had to do was change a single rule in my rule base and let "magic" happen.
I was even able to do some simple peephole optimization and eliminate variables.
It's not that hard to do. Your rule language can be XML, or a simple expression parser. No reason to go full-boat Yacc or ANTLR on it if you don't want to. I'll put in a plug for S-expressions: no grammar needed, brain-dead parsing.
Spreadsheets also make a great input tool, actually. Just be strict on the formatting. They kind of suck for merging in SVN (so don't do that), but end users love them.
You may well be able to get away with an actual rule based system. My system wasn't dynamic at runtime, and didn't really need sophisticated goal seeking and inference, so I didn't need the overhead of such a system. But if one works for you out of the box, then happy day.
Oh, and for an implementation note, for those who don't believe you can hit the 64K code limit in a Java method, well I can assure you it can be done :).
Splitting a large data object is very similar to normalizing a large relational table (first and second normal forms). Follow the rules to reach at least second normal form and you may end up with a good decomposition of the original class.
From experience working with R&D software with soft real-time performance constraints (and sometimes monster fat classes), I would suggest NOT using OR mappers. In such situations you'll be better off "touching the metal" and working directly with JDBC result sets. This is my suggestion for apps with soft real-time constraints and massive numbers of data items per package. More importantly, if the number of distinct classes (not class instances, but class definitions) that need to be persisted is large, and you also have memory constraints in your specs, you will also want to avoid ORMs like Hibernate.
Going back to your original question:
What you seem to have is the typical problem of 1) mapping multiple data items into an OO model and 2) having multiple data items that do not exhibit a good way of being grouped or segregated (and any attempt at grouping tends simply not to feel right). Sometimes the domain model does not lend itself to such aggregation, and coming up with an artificial way of doing so typically ends in compromises that don't satisfy all design requirements and desires.
To make matters worse, an OO model typically requires/expects you to have all the items present in a class as the class's fields. Such a class is typically without behavior, so it is just a struct-like construct, a.k.a. a data envelope or data shuttle.
But such situations beg the following questions:
Does your application need to read/write all 40-50+ data items at once, always?
Must all data items always be present?
I do not know the specifics of your problem domain, but in general I've found that we rarely ever need to deal with all data items at once. This is where a relational model shines, because you don't have to query all rows from a table at once. You only pull those you need as projections of the table/view in question.
In a situation where we have a potentially large number of data items, but on average the number of data items being passed down the wire is less than the maximum, you'd be better off using a Properties pattern.
Instead of defining a monster envelope class holding all items:
// java pseudocode
class Envelope
{
    field1, field2, field3... field_n;
    ...
    setFields(m1, m2, m3, ... m_n){ field1 = m1; ... };
    ...
}
Define a dictionary (based on a map for example):
// java pseudocode
public enum EnvelopeField { field1, field2, field3, ... field_n }

interface Envelope // package visible
{
    // typical map-based read methods.
    Object get(EnvelopeField field);
    boolean isEmpty();

    // new methods similar to existing ones in java.util.Map, but
    // more semantically aligned with envelopes and fields.
    Iterator<EnvelopeField> fields();
    boolean hasField(EnvelopeField field);
}

// a "marker" interface -- code that only needs to read envelopes
// must operate on these interfaces.
public interface ReadOnlyEnvelope extends Envelope {}

// the read-write version of envelope; notice that it inherits from
// Envelope, but not from ReadOnlyEnvelope. This is done to make it
// difficult (but not impossible, unfortunately) to "cast up" a
// read-only envelope into a mutable one.
public interface MutableEnvelope extends Envelope
{
    Object put(EnvelopeField field, Object value);

    // to "cast down" or "narrow" into a read-only version whose type
    // cannot directly be "cast up" back into a mutable one.
    ReadOnlyEnvelope readOnly();
}

// the standard interface for map-based envelopes.
public interface MapBasedEnvelope extends
        Map<EnvelopeField, Object>,
        MutableEnvelope
{
}

// package visible, not public
class EnvelopeImpl extends HashMap<EnvelopeField, Object>
        implements MapBasedEnvelope, ReadOnlyEnvelope
{
    // get, put and isEmpty are inherited from HashMap
    ...
    public Iterator<EnvelopeField> fields(){ return this.keySet().iterator(); }
    public boolean hasField(EnvelopeField field){ return this.containsKey(field); }

    // the typecast is redundant, but it makes the intention obvious in code.
    public ReadOnlyEnvelope readOnly(){ return (ReadOnlyEnvelope)this; }
}

public final class EnvelopeFactory
{
    public static MapBasedEnvelope create(){ return new EnvelopeImpl(); }
}
No need to set up read-only internal flags. All you need to do is narrow your envelope instances to the read-only interface (which only provides getters).
Code that expects to read should operate on read-only envelopes and code that expects to change fields should operate on mutable envelopes. Creation of the actual instances would be compartmentalized in factories.
That is, you use the compiler to enforce things to be read-only (or allow things to be mutable) by establishing some code conventions, rules governing what interfaces to use where and how.
You can layer your code into sections that need to write separately from code that only needs to read. Once that's done, simple code reviews (or even grep) can identify code that is using the wrong interface.
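A hedged usage sketch of the interfaces above (field1 is one of the enum constants, and create() is the factory method from the listing):

MapBasedEnvelope estimate = EnvelopeFactory.create();
estimate.put(EnvelopeField.field1, 42);            // writer side uses the mutable view
ReadOnlyEnvelope snapshot = estimate.readOnly();   // narrow before handing downstream
if (snapshot.hasField(EnvelopeField.field1)) {
    Object v = snapshot.get(EnvelopeField.field1); // reader side sees getters only
}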
Problems:
Non-public Parent Interface:
Envelope is not declared as a public interface in order to prevent erroneous/malicious code from casting a read-only envelope down to a base envelope and then back up to a mutable one. The intended flow is from mutable to read-only only; it is not meant to be bi-directional.
The problem here is that extension of Envelope is restricted to the package that contains it. Whether that is a problem will depend on the particular domain and intended usage.
Factories:
The problem is that factories can (and most likely will) be very complex. Again, the nature of the beast.
Validation:
Another problem introduced with this approach is that now you have to worry about code that expects field X to be present. Having the original monster envelope class partially frees you from that worry because, at least syntactically, all fields are there...
... whether the fields are set or not, that was another matter that still remains with this new model I'm proposing.
So if you have client code that expects to see field X, the client code has to throw some type of exception if the field is not present (or compute or read a sensible default somehow). In such cases, you will have to:
Identify patterns of field presence. Clients that expect field X to be present might be grouped separately (layered apart) from clients that expect some other field to be present.
Associate custom validators (proxies to read-only envelope interfaces) that either throw exceptions or compute default values for missing fields according to some rules (rules provided programmatically, with an interpreter, or with a rules engine.)
Lack of Typing:
This might be debatable, but people used to working with static typing might feel uneasy about losing its benefits with a loosely typed, map-based approach. The counter-argument is that most of the web works on a loose-typing approach, even on the Java side (JSTL, EL).
Problems aside, the larger the maximum number of possible fields and the lower the average number of fields present at any given time, the more effective this approach will be with respect to performance. It adds additional code complexity, but that's the nature of the beast.
That complexity doesn't go away; it will be present either in your class model or in your validation code. Serialization and transferring data down the wire is much more efficient, though, especially if you expect massive numbers of individual data transfers.
Hope it helps.
Actually, this looks like a frequent problem that game developers face: bloated classes holding numerous variables and methods because of a deep inheritance tree, etc.
There's a blog post about how and why to choose composition over inheritance; maybe it would help.
One way you may be able to intelligently break up a large data class is to look at patterns of access by client classes. For example, if a set of classes only accesses fields 1-20 and another set of classes only accesses fields 25-30, maybe those groups of fields belong in separate classes.
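For instance, a minimal sketch of that split, with field names invented for illustration:

// grouping by access pattern: trajectory code only ever touches the
// kinematic fields, while reporting code only touches the metadata
class Kinematics {
    double altitude;
    double groundSpeed;
    double heading;
    // ... the rest of fields 1-20
}

class ReportingInfo {
    String dataSource;
    long receivedAtMillis;
    // ... the rest of fields 25-30
}

class OneMinuteEstimate {
    final Kinematics kinematics = new Kinematics();
    final ReportingInfo reporting = new ReportingInfo();
    // clients depend only on the group they actually use
}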

Refactoring advice and tools

I have some code that consists of a lot (several hundred LOC) of ugly conditionals, e.g.:
SomeClass someClass = null;
if ("foo".equals(fooBar)) {
    // do something, possibly involving more if-else statements,
    // and possibly modify the someClass variable among others...
} else if ("bar".equals(fooBar)) {
    // same as above but with some slight variations
} else if ("baz".equals(fooBar)) {
    // and yet again as above
// ... lots more else-ifs ...
} else {
    // and if nothing matches it is probably an error...
    // so there is some error handling here
}
// some code that acts on someClass
generateOutput(someClass);
Now I had the idea of refactoring this kind of code along the following lines:
abstract class CheckPerform<S, T, Q> {

    private CheckPerform<S, T, Q> next;

    CheckPerform(CheckPerform<S, T, Q> next) {
        this.next = next;
    }

    protected abstract T perform(S arg);

    protected abstract boolean check(Q toCheck);

    public T checkPerform(S arg, Q toCheck) {
        if (check(toCheck)) {
            return perform(arg);
        }
        // if this CheckPerform is the last in the chain, stop here...
        return next == null ? null : next.checkPerform(arg, toCheck);
    }
}
And for each if-statement generate a subclass of CheckPerform, e.g.:
class CheckPerformFoo extends CheckPerform<SomeInput, SomeClass, String> {

    CheckPerformFoo(CheckPerform<SomeInput, SomeClass, String> next) {
        super(next);
    }

    protected boolean check(String toCheck) {
        // same check as in the "foo" if-statement above
        return "foo".equals(toCheck);
    }

    protected SomeClass perform(SomeInput arg) {
        // perform the same actions as in the "foo" if-statement
        // and return a SomeClass instance (in the same state as
        // in the "foo" if-statement)
        return null; // placeholder for the real result
    }
}
I could then inject the different CheckPerforms into each other so that the checks are made in the same order and the corresponding actions taken. In the original class I would then only need to inject one CheckPerform object. Is this a valid approach to this type of problem? The number of classes in my project is likely to explode, but at least I will get more modular and testable code. Should I do this some other way?
Since these if-else-if-...-else statements are what I would call a recurring theme of the code base, I would like to do this refactoring as automagically as possible. So what tools could I use to automate it?
a) Some customizable refactoring feature hidden somewhere in an IDE that I have missed (either in Eclipse or IDEA preferably)
b) Some external tool that can parse Java code and give me fine grained control of transformations
c) Should I hack it myself using Scala?
d) Should I manually go over each class and do the refactoring using the features I am familiar with in my IDE?
Ideally the output of the refactoring should also include some basic test code template that I can run (preferably also test cases for the original code that can be run on both new and old as a kind of regression test... but that I leave for later).
Thanks for any input and suggestions!
What you have described is the Chain of Responsibility pattern, and it sounds like it could be a good choice for your refactor. There could be some downsides to it, though:
Readability: Because you are going to be injecting the order of the CheckPerformers using Spring or some such, it is difficult to see what the code will actually do at first glance.
Maintenance: If someone after you wants to add a new condition, as well as adding a whole new class they also have to edit some Spring config. Choosing the correct place to add their new CheckPerformer could be difficult and error-prone.
Many classes: Depending on how many conditions you have and how much code is repeated within those conditions, you could end up with a lot of new classes. Even though the long list of if-elses isn't very pretty, the logic is in one place, which again aids readability.
To answer the more general part of your question, I don't know of any tools for automatic refactoring beyond basic IDE support, but if you want to know what to look for to refactor, have a look at the Refactoring catalog. The specifics of your question are covered by Replace Conditional with Polymorphism and Replace Conditional with Visitor.
To me the easiest approach would involve a Map<String, Action>, i.e. mapping the various strings to the specific actions to perform. This way the lookup is simpler and more performant than the manual comparisons in your CheckPerform* classes, getting rid of much duplicated code.
The actions can be implemented similar to your design, as subclasses of a common interface, but it may be easier and more compact to use an enum with overridden method(s). You may see an example of this in an earlier answer of mine.
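A hedged sketch of that map-based dispatch, reusing SomeInput/SomeClass from the question (the Action name and the lambda bodies are placeholders):

import java.util.HashMap;
import java.util.Map;

interface Action {
    SomeClass perform(SomeInput arg);
}

// build the table once; each entry replaces one branch of the if-else chain
Map<String, Action> actions = new HashMap<>();
actions.put("foo", arg -> { /* body of the old "foo" branch */ return null; });
actions.put("bar", arg -> { /* body of the old "bar" branch */ return null; });

// the lookup replaces the whole conditional
Action action = actions.get(fooBar);
if (action == null) {
    throw new IllegalArgumentException("Unexpected value: " + fooBar); // the old else branch
}
SomeClass someClass = action.perform(input);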
Unfortunately I don't know of any automatic refactoring which could help you much with this. When I did somewhat similar refactorings earlier, I wrote unit tests and did the refactoring step by step, manually, using automated support at the level of Move Method et al. Of course, since the unit tests were pretty similar to each other in structure, I could reuse part of the code there.
Update
@Sebastien pointed out in his comment that I missed the possible sub-ifs within the bigger if blocks. One can indeed use a hierarchy of maps to resolve this. However, if the hierarchy starts to become really complex with a lot of duplicated functionality, a further improvement might be to implement a DSL and move the whole mapping out of the code into a config file or DB. In its simplest form it might look something like
foo -> com.foo.bar.SomeClass.someMethod
    biz -> com.foo.bar.SomeOtherClass.someOtherMethod
    baz -> com.foo.bar.YetAnotherClass.someMethod
bar -> com.foo.bar.SomeOtherClass.someMethod
    biz -> com.foo.bar.DifferentClass.aMethod
    baz -> com.foo.bar.AndAnotherClass.anotherMethod
where the indented lines configure the sub-conditions for each bigger case.
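In code, the hierarchy of maps might take a shape like this (Handler and the hard-coded wiring are placeholders; the real mappings could be loaded from the config sketched above):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

interface Handler {
    void handle(SomeInput arg);
}

Map<String, Map<String, Handler>> dispatch = new HashMap<>();
dispatch.computeIfAbsent("foo", k -> new HashMap<>())
        .put("biz", arg -> { /* the old foo/biz branch */ });
dispatch.computeIfAbsent("foo", k -> new HashMap<>())
        .put("baz", arg -> { /* the old foo/baz branch */ });

// the two-level lookup replaces the nested if-else blocks
Handler h = dispatch.getOrDefault(outer, Collections.emptyMap()).get(inner);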
