Summary
We have recently changed our String-based ID schema in a complex retrieval engine and observed a severe performance drop. In essence, we changed the IDs from XXX-00000001 to X384840564 (see below for details on the ID schema) and suffer from almost doubled runtimes. Choosing a different string hash function solved the problem, but we still lack a good explanation. Thus, our questions are:
Why do we see such a strong performance drop when changing from
the old to the new ID schema?
Why does our solution of using the “parent hash” actually work?
To approach the problem, we hereafter provide (a) detailed information about the ID schemata and hash functions used, (b) a minimal working example in Java that highlights the performance defect, and (c) our performance results and observations.
(Despite the lengthy description, we have already massively reduced the code example to 4 performance critical lines – see phase 2 in the listing.)
(a) Old and new ID schema; hash functions
Our ID objects consist of a parent ID object (string of 16 characters in [A-Z0-9]) and a child ID string. The same parent ID string is on average used by 1–10 child IDs. The old child IDs had a three-letter prefix, a dash, and a zero-padded running index number of length 8, for example, XXX-00000001 (12 characters in total; X may be any letter [A-Z]). The new child IDs have one letter and 9 non-consecutive digits, for example, X384840564 (10 characters in total, X may be any letter [A-Z]). An obvious difference is that the old child ID strings are often recurring (i.e., the string ABC-00000002 occurs with multiple different parent IDs, as the running index typically starts with 1), while the new child IDs with their arbitrary digit combinations typically occur only a few times or even only with a single parent ID.
Since the ID objects are put into HashSets and HashMaps, the choice of a hash function seems crucial. Currently, the system uses the standard string hash for the parent IDs. For the child IDs, we used to XOR the string hashes of parent and child ID – called XOR hash henceforth. In theory, this should distribute different child IDs quite well. As a variant, we experimented with using only the string hash of the parent ID as the hash code of the child ID – called parent hash henceforth. That is, all child IDs sharing the same parent ID share the same hash. In theory, the parent hash could be suboptimal, as all children sharing the same parent ID end up in the same bucket, while the XOR hash should yield a better data distribution.
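To make the two hash variants concrete, here is a minimal sketch of how a bucket index is derived from a hash code. It assumes the spreading step `h ^ (h >>> 16)` and the power-of-two masking that OpenJDK's HashMap has used since Java 8. The class name, method names, and table size are illustrative and not part of our system.

```java
// Sketch only: mimics how OpenJDK 8+ HashMap turns hashCode() into a bucket index.
class BucketIndexSketch {
    // Assumed to match HashMap's internal spreading of the high bits.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Bucket index for a table whose length is a power of two.
    static int bucket(int hashCode, int tableLength) {
        return spread(hashCode) & (tableLength - 1);
    }

    public static void main(String[] args) {
        int tableLength = 1 << 23; // hypothetical table size
        String parent = "8IBKMAO2T1ORICNZ";
        String oldChild = "XXX-00000002";
        String newChild = "X580104176";

        int parentBucket = bucket(parent.hashCode(), tableLength);
        int xorOld = bucket(parent.hashCode() ^ oldChild.hashCode(), tableLength);
        int xorNew = bucket(parent.hashCode() ^ newChild.hashCode(), tableLength);

        // Parent hash: the child lands in exactly the parent's bucket.
        System.out.println("parent bucket:         " + parentBucket);
        // XOR hash: the child's bucket is the parent's bucket XOR a mask
        // that depends only on the child string.
        System.out.println("XOR bucket, old child: " + xorOld
                + " (mask " + (xorOld ^ parentBucket) + ")");
        System.out.println("XOR bucket, new child: " + xorNew
                + " (mask " + (xorNew ^ parentBucket) + ")");
    }
}
```

Because the spreading step is linear over XOR, the mask equals bucket(child.hashCode(), tableLength). Recurring child strings therefore share a mask, while child strings that occur only once each get their own. Whether this accounts for the timings is a hypothesis that the sketch does not prove.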
(b) Minimal working example
Please refer to the following listing (explanation below):
package mwe;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
public class Main {
private static final Random RANDOM = new Random(42);
private static final String DIGITS = "0123456789";
private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" + DIGITS;
private static final int NUM_IDS = 5_000_000;
private static final int MAX_CHILDREN = 5;
private static final int REPETITIONS = 5;
private static final boolean RUN_ID_OLD = true; // e.g., 8IBKMAO2T1ORICNZ__XXX-00000002
private static final boolean RUN_ID_NEW = false; // e.g., 6TEG9R5JP1KHJN55__X580104176
private static final boolean USE_PARENT_HASH = false;
private static final boolean SHUFFLE_SET = false;
private abstract static class BaseID {
protected int hash;
public abstract BaseID getParentID();
@Override
public int hashCode() {
return this.hash;
}
}
private static class ParentID extends BaseID {
private final String id;
public ParentID(final String id) {
this.id = id;
this.hash = id.hashCode();
}
@Override
public BaseID getParentID() {
return null;
}
@Override
public boolean equals(final Object obj) {
if (this == obj) {
return true;
}
if (obj instanceof ParentID) {
final ParentID o = (ParentID) obj;
return this.id.equals(o.id);
}
return false;
}
@Override
public String toString() {
return this.id;
}
}
private static class ChildID extends BaseID {
private final String id;
private final BaseID parent;
public ChildID(final String id, final BaseID parent) {
this.id = id;
this.parent = parent;
// Initialize the hash code of the child ID:
if (USE_PARENT_HASH) {
// Only use the parent hash (i.e., all children have the same hash).
this.hash = parent.hashCode();
} else {
// XOR parent and child hash.
this.hash = parent.hashCode() ^ id.hashCode();
}
}
@Override
public BaseID getParentID() {
return this.parent;
}
@Override
public boolean equals(final Object obj) {
if (this == obj) {
return true;
}
if (obj == null || this.hash != obj.hashCode()) { // equals(null) must return false, not throw
return false;
}
if (obj instanceof ChildID) {
final ChildID o = (ChildID) obj;
final BaseID oParent = o.getParentID();
if (this.parent == null && oParent != null) {
return false;
}
if (this.parent != null && oParent == null) {
return false;
}
if (this.parent == null || !this.parent.equals(oParent)) {
return false;
}
return this.id.equals(o.id);
}
return false;
}
@Override
public String toString() {
return this.parent.toString() + "__" + this.id;
}
}
public static void run(final int repetitions, final boolean useVariant2IDs) throws IOException {
for (int i = 0; i < repetitions; i++) {
System.gc(); // Force memory reset for the next repetition.
// -- PHASE 1: CREATE DATA --------------------------------------------------------------------------------
// Fill a set of several millions random IDs. Each ID is a child ID with a reference to its parent ID.
// Each parent ID has between 1 and MAX_CHILDREN children.
Set<BaseID> ids = new HashSet<>(NUM_IDS);
for (int parentIDIdx = 0; parentIDIdx < NUM_IDS; parentIDIdx++) {
// Generate parent ID: 16 random characters.
final StringBuilder parentID = new StringBuilder();
for (int k = 0; k < 16; k++) {
parentID.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
}
// Generate between 1 and MAX_CHILDREN child IDs.
final int childIDCount = RANDOM.nextInt(MAX_CHILDREN) + 1;
for (int childIDIdx = 0; childIDIdx < childIDCount; childIDIdx++) {
final StringBuilder childID = new StringBuilder();
if (useVariant2IDs) {
// Variant 2: Child ID = letter X plus 9 random digits.
childID.append("X");
for (int k = 0; k < 9; k++) {
childID.append(DIGITS.charAt(RANDOM.nextInt(DIGITS.length())));
}
} else {
// Variant 1: Child ID = XXX- plus zero-padded index of length 8.
childID.append("XXX-").append(String.format("%08d", childIDIdx + 1));
}
final BaseID id = new ChildID(childID.toString(), new ParentID(parentID.toString()));
ids.add(id);
}
}
System.out.print(ids.iterator().next().toString());
System.out.flush();
if (SHUFFLE_SET) {
final List<BaseID> list = new ArrayList<>(ids);
Collections.shuffle(list);
ids = new LinkedHashSet<>(list);
}
System.gc(); // Clean up temp data before starting the timer.
// -- PHASE 2: INDEX DATA ---------------------------------------------------------------------------------
// Iterate over the ID set and fill a map indexed by parent IDs. The map values are irrelevant here, so
// use empty objects.
final long timer = System.currentTimeMillis();
final HashMap<BaseID, Object> map = new HashMap<>();
for (final BaseID id : ids) {
map.put(id.getParentID(), new Object());
}
System.out.println("\t" + (System.currentTimeMillis() - timer));
// Ensure that map and IDs are not GC:ed before the timer stops.
if (map.get(new ParentID("_do_not_gc")) == null) {
map.put(new ParentID("_do_not_gc"), new Object());
}
ids.add(new ParentID("_do_not_gc"));
}
}
public static void main(final String[] args) throws IOException {
if (RUN_ID_OLD) {
run(REPETITIONS, false);
}
if (RUN_ID_NEW) {
run(REPETITIONS, true);
}
}
}
In essence, the program first generates a HashSet of IDs and then indexes these IDs by their parent ID in a HashMap. In detail:
The first phase (PHASE 1) generates 5 million parent IDs, each with 1 to MAX_CHILDREN child IDs (5 in the listing), using either the old (e.g., XXX-00000001) or the new ID schema (e.g., X384840564) and one of the two hash functions. The generated child IDs are collected in a HashSet. We explicitly create a new parent ID object for each child ID to match the behavior of the original system. For experimentation, the IDs can optionally be shuffled into a LinkedHashSet to distort the hash-based ordering (cf. boolean SHUFFLE_SET).
The second phase (PHASE 2) simulates the performance-critical path. It reads all IDs (child IDs with their parents) from the HashSet and puts them into a HashMap with the parent IDs as keys (i.e., aggregate IDs by parent).
Note: The actual retrieval system has a more complex logic, such as reading IDs from multiple sets and merging child IDs as the map entry’s values, but it turned out that none of these steps was responsible for the strong performance gap in question.
The remaining lines try to control for the GC, such that the data structures are not GC:ed too early. We’ve tried different alternatives for controlling the GC, but the results seemed pretty stable overall.
When running the program, the constants RUN_ID_OLD and RUN_ID_NEW activate the old and the new ID schema, respectively (best activate only one at a time). USE_PARENT_HASH switches between the XOR hash (false) and the parent hash (true). SHUFFLE_SET distorts the item order in the ID set. All other constants can remain as they are.
(c) Results
All results here are based on a typical Windows desktop with OpenJDK 11. We also tested Oracle JDK 8 and a Linux machine, and observed similar effects in all cases. For the following figure, we tested each configuration in independent runs, where each run repeats the timing 5 times. To reduce the influence of outliers, we report the median of the repetitions. Note, however, that the timings of the repetitions do not differ much. The performance is measured in milliseconds.
Observations:
Using XOR hash yields a substantial performance drop in the
HashSet setting when switching to the new ID schema. This hash
function seems suboptimal, but we lack a good explanation.
Using the parent hash function speeds up the process regardless of the ID
schema. We speculate that the internal HashSet order is beneficial,
since the resulting HashMap will build up the same order (because
ID.hash = ID.parent.hash). Interestingly, this effect can also be
observed if the HashSet is split into, say, 50 parts, each holding a
random partition of the full HashSet. This leaves us puzzled.
The entire process seems to be heavily dependent on the reading
order in the for loop of the second phase (i.e., the internal order of the
HashSet). If we distort the order via the shuffled LinkedHashSet, we
end up in a worst-case scenario, regardless of the ID schema.
In a separate experiment, we also diagnosed the number of
collisions when filling the HashMap, but could not find obvious
differences when changing the ID schema.
Who can shed more light on explaining these results?
Update
The image below shows some profiling results (using VisualVM) for the non-shuffled runs. Indent indicates nested calls. All percentage values are relative to the phase 2 timing (100%).
An obvious difference seems to be HashMap.putVal's self time. There was no obvious difference for treeifying buckets.
Related
I have two ArrayLists with 1000 objects each. I need to remove all elements from list 1 that are also in list 2. Currently I am running two nested loops, which results in 1000 x 1000 operations in the worst case.
List<DataClass> dbRows = object1.get("dbData");
List<DataClass> modifiedData = new ArrayList<>(object1.get("dbData")); // copy, so removing does not disturb the iteration over dbRows
List<DataClass> dbRowsForLog = object2.get("dbData");
for (DataClass newDbRows : dbRows) {
boolean found=false;
for (DataClass oldDbRows : dbRowsForLog) {
if (newDbRows.equals(oldDbRows)) {
found=true;
modifiedData.remove(oldDbRows);
break;
}
}
}
public class DataClass{
private int categoryPosition;
private int subCategoryPosition;
private Timestamp lastUpdateTime;
private String lastModifiedUser;
// + so many other variables
public boolean equals(Object o) {
if (this == o) {
return true;
}
if (o == null || getClass() != o.getClass()) {
return false;
}
DataClass dataClassRow = (DataClass) o;
return categoryPosition == dataClassRow.categoryPosition
&& subCategoryPosition == dataClassRow.subCategoryPosition
&& lastUpdateTime.compareTo(dataClassRow.lastUpdateTime) == 0
&& stringComparator(lastModifiedUser, dataClassRow.lastModifiedUser);
}
public String toString(){
return "DataClass[categoryPosition="+categoryPosition+",subCategoryPosition="+subCategoryPosition
+",lastUpdateTime="+lastUpdateTime+",lastModifiedUser="+lastModifiedUser+"]";
}
public static boolean stringComparator(String str1, String str2){
return (str1 == null ? str2 == null : str1.equals(str2));
}
public int hashCode() {
int hash = 7;
hash = 31 * hash + (int) categoryPosition;
hash = 31 * hash + (int) subCategoryPosition;
hash = 31 * hash + (lastModifiedUser == null ? 0 : lastModifiedUser.hashCode());
return hash;
}
}
The best workaround I could think of is to create two sets of strings by calling the toString() method of DataClass and to compare the strings. That would take 1000 (building set 1) + 1000 (building set 2) + 1000 (searching in the set) = 3000 operations. I am stuck on Java 7. Is there a better way to do this? Thanks.
Let Java's built-in collection classes handle most of the optimization for you by taking advantage of a HashSet. The expected complexity of its contains method is O(1), provided hashCode distributes the elements reasonably well. I would highly recommend looking up how it achieves this because it's very interesting.
List<DataClass> a = object1.get("dbData");
HashSet<DataClass> b = new HashSet<>(object2.get("dbData"));
a.removeAll(b);
return a;
And it's all done for you.
EDIT: caveat
In order for this to work, DataClass needs to override Object::hashCode consistently with equals (equal objects must return the same hash code). Otherwise, you can't use any of the hash-based collection algorithms.
EDIT 2: implementing hashCode
An object's hash code does not need to change every time an instance variable changes. The hash code only needs to reflect the instance variables that determine equality.
For example, imagine each object had a unique field private final UUID id. In this case, you could determine if two objects were the same by simply testing the id value. Fields like lastUpdateTime and lastModifiedUser would provide information about the object, but two instances with the same id would refer to the same object, even if the lastUpdateTime and lastModifiedUser of each were different.
The point is that if you really want to optimize this, include as few fields as possible in the hash computation. From your example, it seems like categoryPosition and subCategoryPosition might be enough.
Whatever fields you choose to include, the simplest way to compute a hash code from them is to use Objects::hash rather than running the numbers yourself.
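As an illustration of the Objects::hash suggestion, here is a minimal sketch. It assumes that categoryPosition and subCategoryPosition are enough for the hash, as guessed above. The class DataKey is a hypothetical, trimmed-down stand-in for DataClass, not the real class.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical, trimmed-down stand-in for DataClass.
class DataKey {
    private final int categoryPosition;
    private final int subCategoryPosition;
    private final String lastModifiedUser;

    DataKey(int categoryPosition, int subCategoryPosition, String lastModifiedUser) {
        this.categoryPosition = categoryPosition;
        this.subCategoryPosition = subCategoryPosition;
        this.lastModifiedUser = lastModifiedUser;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        DataKey other = (DataKey) o;
        return categoryPosition == other.categoryPosition
                && subCategoryPosition == other.subCategoryPosition
                && Objects.equals(lastModifiedUser, other.lastModifiedUser);
    }

    // Hashes a subset of the fields compared in equals(). Equal objects
    // still hash alike, so the equals/hashCode contract holds.
    @Override
    public int hashCode() {
        return Objects.hash(categoryPosition, subCategoryPosition);
    }

    public static void main(String[] args) {
        Set<DataKey> seen = new HashSet<>();
        seen.add(new DataKey(1, 2, "alice"));
        System.out.println(seen.contains(new DataKey(1, 2, "alice"))); // true
        System.out.println(seen.contains(new DataKey(1, 2, "bob")));   // false
    }
}
```

Objects with the same two positions but different users share a bucket and are told apart by equals, which is fine as long as such cases are rare.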
This is a set difference A - B (retain only the elements of set A that are not in set B).
If using a Set is fine, we can do it as shown below. An ArrayList would work in place of the Set as well, but then every remove/retain check has to scan the entire other list.
Set<DataClass> a = new HashSet<>(object1.get("dbData"));
Set<DataClass> b = new HashSet<>(object2.get("dbData"));
a.removeAll(b);
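If a sorted order is needed, use a TreeSet (which requires DataClass to implement Comparable, or a Comparator to be supplied); to keep the insertion order, use a LinkedHashSet.
Try to return a Set directly from object1.get("dbData") and object2.get("dbData"); that avoids creating one more intermediate collection.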
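A minimal sketch of the set difference that keeps the original element order, assuming insertion order (not sorted order) is what is needed. The class and method names are hypothetical, and Integer elements stand in for DataClass.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class SetDifferenceSketch {
    // Returns the elements of a that are not in b, in a's original order.
    static <T> Set<T> difference(List<T> a, List<T> b) {
        Set<T> result = new LinkedHashSet<>(a); // keeps insertion order, drops duplicates
        result.removeAll(new HashSet<>(b));     // hash lookups instead of list scans
        return result;
    }

    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(5, 3, 9, 1);
        List<Integer> b = Arrays.asList(3, 1, 7);
        System.out.println(difference(a, b)); // prints [5, 9]
    }
}
```

This relies on the element type having consistent equals and hashCode implementations, just like the HashSet variants above.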
Context
Hi, I'm working on an assignment for school that asks us to implement a hash table in Java. There are no requirements that collisions be kept to a minimum, but low collision rate and speed seem to be the two most sought-after qualities in all the reading (some more) that I've done.
Problem
I'd like some guidance on how to map the output of a hash function to a smaller range, without having >20% of my keys collide (yikes).
In all of the algorithms that I've explored, keys are mapped to the entire range of an unsigned 32 bit integer (or in many cases, 64, even 128 bit). I'm not finding much about this on here, Wikipedia, or in any of the hash-related articles / discussions I've come across.
In terms of the specifics of my implementation, I'm working in Java (mandate of my school), which is problematic since there are no unsigned types to work with. To get around this, I've been using the 64-bit long integer type, then using a bit mask to map back down to 32 bits. Instead of simply truncating, I XOR the top 32 bits with the bottom 32, then perform a bitwise AND to mask out any upper bits that might result in a negative value when I cast it down to a 32 bit integer. After all that, a separate function compresses the resulting hash value down to fit into the bounds of the hash table's inner array.
It ends up looking like:
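int hash( String key ) {
long h = 0;
for( int i = 0; i < key.length(); i++ ) {
//do some stuff with each character in the key
}
h = h ^ ( h >>> 32 ); // fold the top 32 bits into the bottom 32
return (int) ( h & 2147483647 ); // clear the sign bit, then narrow to int
}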
Where the inner-loop depends on the hash function (I've implemented a few: polynomial hashing, FNV1, SuperFastHash, and a custom one tailored to the input data).
They basically all perform horribly. I have yet to see fewer than 20% of the keys collide. Even before I compress the hash values down to array indices, none of my hash functions gets me fewer than 10k collisions. My inputs are two text files, each ~220,000 lines. One is English words, the other is random strings of varying length.
My lecture notes recommend the following, for compressing the hashed keys:
(hashed key) % P
Where P is the largest prime < the size of the inner array.
Is this an accepted method of compressing hash values? I have a feeling it isn't, but since performance is so poor even before compression, I have a feeling it's not the primary culprit, either.
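For reference, the compression step from the lecture notes can be sketched as follows. This is an illustration under assumptions: the polynomial hash uses the multiplier 31 (the same shape as String.hashCode), the sample keys and capacity are made up, and all names are hypothetical.

```java
class CompressionSketch {
    // Simple polynomial hash kept in an int; overflow wraps around,
    // which is harmless for hashing.
    static int polynomialHash(String key) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = 31 * h + key.charAt(i);
        }
        return h;
    }

    // Compression: clear the sign bit, then reduce modulo the capacity.
    static int compress(int hash, int capacity) {
        return (hash & 0x7fffffff) % capacity;
    }

    // Counts keys that land in a slot that is already occupied.
    static int countCollisions(String[] keys, int capacity) {
        boolean[] used = new boolean[capacity];
        int collisions = 0;
        for (String key : keys) {
            int idx = compress(polynomialHash(key), capacity);
            if (used[idx]) {
                collisions++;
            } else {
                used[idx] = true;
            }
        }
        return collisions;
    }

    public static void main(String[] args) {
        String[] keys = {"apple", "banana", "cherry", "date", "elderberry"};
        int capacity = 11; // a prime, as the lecture notes recommend
        System.out.println("collisions: " + countCollisions(keys, capacity));
    }
}
```

Some collisions are unavoidable after compression: with n keys thrown uniformly into m slots, about m(1 - e^(-n/m)) slots end up occupied, so for n close to m roughly 37% of the keys hit an occupied slot even with an ideal hash function. Before compression the picture is different: about 220,000 keys spread uniformly over 2^31 values should collide only around a dozen times, so 10k collisions at that stage points at the hash functions or at duplicate input lines.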
I don't know whether I have understood your concrete problem correctly, but I will try to help with hash performance and collisions.
Hash-based collections decide in which bucket to store a key-value pair based on the key's hash value. Inside each bucket there is a structure (a linked chain of entries in Hashtable and HashMap) in which the pair is stored.
If the hash value is usually the same, the bucket will usually be the same too, so the performance degrades a lot. Let's see an example.
Consider this class:
package hashTest;
import java.util.Hashtable;
public class HashTest {
public static void main (String[] args) {
Hashtable<MyKey, String> hm = new Hashtable<>();
long ini = System.currentTimeMillis();
for (int i=0; i<100000; i++) {
MyKey a = new HashTest().new MyKey(String.valueOf(i));
hm.put(a, String.valueOf(i));
}
System.out.println(hm.size());
long fin = System.currentTimeMillis();
System.out.println("tiempo: " + (fin-ini) + " mls");
}
private class MyKey {
private String str;
public MyKey(String i) {
str = i;
}
public String getStr() {
return str;
}
@Override
public int hashCode() {
return 0;
}
@Override
public boolean equals(Object o) {
if (o instanceof MyKey) {
MyKey aux = (MyKey) o;
if (this.str.equals(aux.getStr())) {
return true;
}
}
return false;
}
}
}
Note that hashCode in class MyKey always returns 0. This is permitted by the hashCode contract (http://docs.oracle.com/javase/7/docs/api/java/lang/Object.html#hashCode()). If we run that program, this is the result:
100000
tiempo: 62866 mls
That is very poor performance. Now we change the hashCode method of MyKey:
package hashTest;
import java.util.Hashtable;
public class HashTest {
public static void main (String[] args) {
Hashtable<MyKey, String> hm = new Hashtable<>();
long ini = System.currentTimeMillis();
for (int i=0; i<100000; i++) {
MyKey a = new HashTest().new MyKey(String.valueOf(i));
hm.put(a, String.valueOf(i));
}
System.out.println(hm.size());
long fin = System.currentTimeMillis();
System.out.println("tiempo: " + (fin-ini) + " mls");
}
private class MyKey {
private String str;
public MyKey(String i) {
str = i;
}
public String getStr() {
return str;
}
@Override
public int hashCode() {
return str.hashCode() * 31;
}
@Override
public boolean equals(Object o) {
if (o instanceof MyKey) {
MyKey aux = (MyKey) o;
if (this.str.equals(aux.getStr())) {
return true;
}
}
return false;
}
}
}
Note that only hashCode in MyKey has changed. Now when we run the code, the result is:
100000
tiempo: 47 mls
The performance is dramatically better after this minor change. It is very common practice to compute the hash code from the same members that the equals method uses to decide whether two objects are the same (in this case only str), often multiplied by a prime number (in this case 31).
I hope that this little example can point you toward a solution for your problem.
I would like to know what makes the difference and what I should be aware of when I'm writing code.
Used the same parameters and methods put(), get() when testing
without printing
Used System.nanoTime() to measure the runtime
I tried it with the int keys 1-10 and 10 values, so every single hash returns a unique index, which is the most favorable scenario
My HashSet implementation, which is based on this, is almost as fast as the JDK's.
Here's my simple implementation:
public MyHashMap(int s) {
this.TABLE_SIZE=s;
table = new HashEntry[s];
}
class HashEntry {
int key;
String value;
public HashEntry(int k, String v) {
this.key=k;
this.value=v;
}
public int getKey() {
return key;
}
}
int TABLE_SIZE;
HashEntry[] table;
public void put(int key, String value) {
int hash = key % TABLE_SIZE;
while(table[hash] != null && table[hash].getKey() != key)
hash = (hash +1) % TABLE_SIZE;
table[hash] = new HashEntry(key, value);
}
public String get(int key) {
int hash = key % TABLE_SIZE;
while(table[hash] != null && table[hash].key != key)
hash = (hash+1) % TABLE_SIZE;
if(table[hash] == null)
return null;
else
return table[hash].value;
}
Here's the benchmark:
public static void main(String[] args) {
long start = System.nanoTime();
MyHashMap map = new MyHashMap(11);
map.put(1,"A");
map.put(2,"B");
map.put(3,"C");
map.put(4,"D");
map.put(5,"E");
map.put(6,"F");
map.put(7,"G");
map.put(8,"H");
map.put(9,"I");
map.put(10,"J");
map.get(1);
map.get(2);
map.get(3);
map.get(4);
map.get(5);
map.get(6);
map.get(7);
map.get(8);
map.get(9);
map.get(10);
long end = System.nanoTime();
System.out.println(end-start+" ns");
}
If you read the documentation of the HashMap class, you see that it implements a hash table based on the hashCode of the keys. This is dramatically more efficient than a brute-force search if the map contains a non-trivial number of entries, assuming a reasonable key distribution amongst the "buckets" that it sorts the entries into.
That said, benchmarking on the JVM is non-trivial and easy to get wrong. If you're seeing big differences with small numbers of entries, it could easily be a benchmarking error rather than the code.
When it comes to performance, never assume anything.
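To illustrate what a less error-prone measurement might look like, here is a minimal sketch with a warm-up phase and repeated timed runs. The entry count and iteration counts are arbitrary assumptions and the class name is hypothetical; a dedicated harness such as JMH is the more reliable choice for real measurements.

```java
import java.util.HashMap;
import java.util.Map;

class WarmupBenchmarkSketch {
    // One measured unit of work: fill a map and read everything back.
    static int workload(int entries) {
        Map<Integer, String> map = new HashMap<>();
        for (int i = 0; i < entries; i++) {
            map.put(i, "v" + i);
        }
        int hits = 0;
        for (int i = 0; i < entries; i++) {
            if (map.get(i) != null) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int entries = 100_000;
        long sink = 0; // consume the results so the JIT cannot discard the work

        // Warm-up: give the JIT a chance to compile the hot code first.
        for (int i = 0; i < 20; i++) {
            sink += workload(entries);
        }

        // Timed runs: report the best of several repetitions.
        long best = Long.MAX_VALUE;
        for (int i = 0; i < 10; i++) {
            long start = System.nanoTime();
            sink += workload(entries);
            best = Math.min(best, System.nanoTime() - start);
        }
        System.out.println("best of 10: " + best + " ns (sink=" + sink + ")");
    }
}
```

Timing ten puts and ten gets once, as in the question, mostly measures class loading and interpreter start-up rather than the map implementation.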
Your assumption was "My HashSet implementation which is based on this is almost as fast as the JDK's". No, obviously it is not.
That is the tricky part of performance work: doubt everything unless you have measured it with great accuracy. Worse, you did measure, and the measurement told you that your implementation is slower; but instead of checking your source, and the source of the thing you are measuring against, you decided that the measuring process must be wrong ...
We did a big code change changing a node id, which used to be represented by an int, to be now represented by a NodeId object. The challenging task now is to identify all the places that are using object == to change them to .equals(). The same for the != operator.
Is there any script or anything that exists or can be written that can identify the places more accurately than a manual eyeballing?
Your help is appreciated!
Thanks a lot.
Unfortunately, I think you are out of luck. NetBeans does not support "Find Usages" for binary operators, and a quick web search does not reveal any other tools which might. Ironically, if you wanted to replace NodeId.equals() with ==, "Find Usages" would be the right tool.
Have you considered making NodeId immutable and using a Factory Pattern to make every equivalent instance of NodeId unique? Then you would not need to replace == and !=. This is also faster than using equals(), but makes object creation take longer. This may seem like an overly complicated solution to a simple problem, but consider the trouble that a single missed == or != could be to track down later on. In contrast, you can probably write the factory code in less time than it will take you to manually search for == and != and it should be easy to unit test.
Here is a simple example where the equivalence of different instances of NodeId depends on an int. Modify the HashMap key to fit your use case. If you don't need multiple factories, you can use a static factory method with a static map rather than a class.
import java.util.HashMap;
import java.util.Map;
public class NodeId {
private final int id;
private NodeId(int id) {
this.id = id;
}
@Override
public boolean equals(Object obj) {
if (obj instanceof NodeId) return id == ((NodeId)obj).id;
else return false;
}
@Override
public int hashCode() {
int hash = 3;
hash = 29 * hash + this.id;
return hash;
}
public static class Factory {
Map<Integer, NodeId> assigned = new HashMap<>();
public NodeId getInstance(int id) {
NodeId nodeId = assigned.get(id);
if (nodeId == null) {
nodeId = new NodeId(id);
assigned.put(id, nodeId);
}
return nodeId;
}
}
}
A simple test class demonstrates usage.
public class Test {
public static void main(String[] args) {
NodeId.Factory factory = new NodeId.Factory();
NodeId a = factory.getInstance(42);
NodeId b = factory.getInstance(42);
NodeId c = factory.getInstance(3);
System.out.println(a == b); // Should print "true"
System.out.println(a == c); // Should print "false"
}
}
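The static-factory variant mentioned before the listing could look roughly like the following sketch. The class name InternedNodeId and the method names of and value are hypothetical, and the synchronized keyword is an assumption about needing thread safety.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical variant of NodeId that uses a static factory method.
final class InternedNodeId {
    private static final Map<Integer, InternedNodeId> ASSIGNED = new HashMap<>();

    private final int id;

    private InternedNodeId(int id) {
        this.id = id;
    }

    // synchronized so that concurrent callers cannot create two instances for one id
    static synchronized InternedNodeId of(int id) {
        InternedNodeId nodeId = ASSIGNED.get(id);
        if (nodeId == null) {
            nodeId = new InternedNodeId(id);
            ASSIGNED.put(id, nodeId);
        }
        return nodeId;
    }

    int value() {
        return id;
    }

    public static void main(String[] args) {
        InternedNodeId a = InternedNodeId.of(42);
        InternedNodeId b = InternedNodeId.of(42);
        InternedNodeId c = InternedNodeId.of(3);
        System.out.println(a == b); // prints "true"
        System.out.println(a == c); // prints "false"
    }
}
```

Since each id maps to exactly one instance, the inherited identity-based equals and hashCode are sufficient here. Note that the static map keeps every instance alive for the lifetime of the class.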
I have a group of Strings which represent product sizes, most of which are duplicated in meaning but not in name (i.e., the size Large has at least 14 different possible spellings, each of which needs to be preserved). I need to sort these based on the size they represent. Any possible Small value should come before any possible Medium value, etc.
The only way I see this being possible is to implement a specific Comparator which contains different Sets grouping each size on the base size it represents. Then I can implement the -1,0,1 relationship by determining which Set that particular size falls into.
Is there a more robust way to accomplish this? Specifically I'm worried about 2 weeks from now when someone comes up with yet another way to spell Large.
edit: to be clear, it's not the actual comparator I have a question about, it's the setup with the sets containing each group. Is this a normal way to handle this situation? How do I future-proof it so that each new size addition doesn't require a full recompile / deploy?
A custom comparator is the solution. I do not understand why you worry that this is not robust enough.
A simple approach would be to load the size aliases from a ResourceBundle. Some example code (put all the files in the same package):
An interface to encapsulate the size property
public interface Sized {
public String getSize();
}
A product class
public class Product implements Sized {
private final String size;
public Product(String size) {
this.size = size;
}
public String getSize() {
return size;
}
@Override
public String toString() {
return size;
}
}
A comparator that does the magic:
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.ResourceBundle;
public class SizedComparator implements Comparator<Sized> {
// maps size aliases to canonical sizes
private static final Map<String, String> sizes = new HashMap<String, String>();
static {
// create the lookup map from a resourcebundle
ResourceBundle sizesBundle = ResourceBundle
.getBundle(SizedComparator.class.getName());
for (String canonicalSize : sizesBundle.keySet()) {
String[] aliases = sizesBundle.getString(canonicalSize).split(",");
for (String alias : aliases) {
sizes.put(alias, canonicalSize);
}
}
}
@Override
public int compare(Sized s1, Sized s2) {
int result;
String c1 = getCanonicalSize(s1);
String c2 = getCanonicalSize(s2);
if (c1 == null && c2 == null) {
result = 0;
} else if (c1 == null) {
result = -1;
} else if (c2 == null) {
result = 1;
} else {
result = c1.compareTo(c2);
}
return result;
}
private String getCanonicalSize(Sized s1) {
String result = null;
if (s1 != null && s1.getSize() != null) {
result = sizes.get(s1.getSize());
}
return result;
}
}
SizedComparator.properties:
1 = Small,tiny
2 = medium,Average
3 = Large,big,HUGE
A unit test (just for the happy flow):
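import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;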
public class FieldSortTest {
private static final String SMALL = "tiny";
private static final String LARGE = "Large";
private static final String MEDIUM = "medium";
private Comparator<Sized> instance;
@Before
public void setup() {
instance = new SizedComparator();
}
@Test
public void testHappy() {
List<Product> products = new ArrayList<Product>();
products.add(new Product(MEDIUM));
products.add(new Product(LARGE));
products.add(new Product(SMALL));
Collections.sort(products, instance);
Assert.assertSame(SMALL, products.get(0).getSize());
Assert.assertSame(MEDIUM, products.get(1).getSize());
Assert.assertSame(LARGE, products.get(2).getSize());
}
}
Note that ResourceBundles are cached automatically. You can reload the ResourceBundle programmatically with:
ResourceBundle.clearCache();
(since Java 1.6). Alternatively you could use some Spring magic to create an auto-reloading message resource.
If reading from a rickety properties file is not cool enough you could quite easily keep your size aliases in a database too.
To impose an arbitrary ordering on a collection of strings (or objects in general), the standard means to do this is to implement a Comparator as you suggest.
Apart from the 'manual' solution you suggest, you could consider comparing the relative edit distance of strings to canonical examples. This will be more flexible in the sense that it will work on alternatives you haven't thought of. But in terms of the work involved, it might be overkill for your application.
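The edit-distance idea could be sketched as follows. This is only an illustration: the canonical size list, the class name, and the method names are assumptions, and the distance used is plain Levenshtein distance.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class NearestSizeSketch {
    // Assumed canonical spellings, in the desired sort order.
    static final List<String> CANONICAL = Arrays.asList("small", "medium", "large");

    // Levenshtein distance computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Index of the closest canonical size; usable as a sort key.
    static int rank(String size) {
        String s = size.toLowerCase(Locale.ROOT);
        int best = 0;
        for (int i = 1; i < CANONICAL.size(); i++) {
            if (editDistance(s, CANONICAL.get(i)) < editDistance(s, CANONICAL.get(best))) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> sizes = Arrays.asList("Lrge", "medum", "Smal");
        Collections.sort(sizes, new Comparator<String>() {
            @Override
            public int compare(String x, String y) {
                return Integer.compare(rank(x), rank(y));
            }
        });
        System.out.println(sizes); // prints [Smal, medum, Lrge]
    }
}
```

This only catches misspellings. Synonyms such as "big" or "tiny" are not close to any canonical spelling, so an alias table would still be needed for those.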