I'm looking for a hash function to hash Strings. For my purposes (identifying changed objects during an import) it should have the following properties:
fast
can be used incremental, i.e. I can use it like this:
Hasher h = new Hasher();
h.add("somestring");
h.add("another part");
h.add("eveno more");
Long hash = h.create();
without compromising the other properties or keeping the strings in memory during the complete process.
Secure against collisions. If I compare two hash values from different strings 1 million times per day for the rest of my life, the risk that I get a collision should be neglectable.
It does not have to be secure against malicious attempts to create collisions.
What algorithm can I use? An algorithm with an existent free implementation in Java is preferred.
Clarification
The hash doesn't have to be a long. A String for example would be just fine.
The data to be hashed will come from a file or a db, with many 10MB or up to a few GB of data, that will get distributed into different Hashes. So keeping the complete Strings in memory is not really an option.
Hashs are a sensible topic and it is hard to recommend any such hash based upon your question. You might want to ask this question on https://security.stackexchange.com/ to get expert opinions on the usability of hashs in certain usecases.
What I understood so far is that most hashs are implemented incrementally in the very core; the execution-timing on the other hand is not that easy to predict.
I present you two Hasher implementations which rely on "an existent free implementation in Java". Both implementations are constructed in a way that you can arbitrarily split your Strings before calling add() and get the same result as long as you do not change the order of the characters in them:
import java.math.BigInteger;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
/**
* Created for https://stackoverflow.com/q/26928529/1266906.
*/
public class Hashs {
public static class JavaHasher {
private int hashCode;
public JavaHasher() {
hashCode = 0;
}
public void add(String value) {
hashCode = 31 * hashCode + value.hashCode();
}
public int create() {
return hashCode;
}
}
public static class ShaHasher {
public static final Charset UTF_8 = Charset.forName("UTF-8");
private final MessageDigest messageDigest;
public ShaHasher() throws NoSuchAlgorithmException {
messageDigest = MessageDigest.getInstance("SHA-256");
}
public void add(String value) {
messageDigest.update(value.getBytes(UTF_8));
}
public byte[] create() {
return messageDigest.digest();
}
}
public static void main(String[] args) {
javaHash();
try {
shaHash();
} catch (NoSuchAlgorithmException e) {
e.printStackTrace(); // TODO: implement catch
}
}
private static void javaHash() {
JavaHasher h = new JavaHasher();
h.add("somestring");
h.add("another part");
h.add("eveno more");
int hash = h.create();
System.out.println(hash);
}
private static void shaHash() throws NoSuchAlgorithmException {
ShaHasher h = new ShaHasher();
h.add("somestring");
h.add("another part");
h.add("eveno more");
byte[] hash = h.create();
System.out.println(Arrays.toString(hash));
System.out.println(new BigInteger(1, hash));
}
}
Here obviously "SHA-256" could be replaced with other common hash-algorithms; Java ships quite a few of them.
Now you called out for a Long as return-value which would imply you are looking for a 64bit-Hash. If this really was on purpose have a look at the answers to What is a good 64bit hash function in Java for textual strings?. The accepted answer is a slight variant of the JavaHasher as String.hashCode() does basically the same calculation, but with lower overflow-boundary:
public static class Java64Hasher {
private long hashCode;
public Java64Hasher() {
hashCode = 1125899906842597L;
}
public void add(CharSequence value) {
final int len = value.length();
for(int i = 0; i < len; i++) {
hashCode = 31*hashCode + value.charAt(i);
}
}
public long create() {
return hashCode;
}
}
Unto your points:
fast
With SHA-256 being slower than the other two I still would call all three presented approaches fast.
can be used incremental without compromising the other properties or keeping the strings in memory during the complete process.
I can not guarantee that property for the ShaHasher as I understand it is block-based and I lack the source code.Still I would suggest that at most one block, the hash and some internal states are kept. The other two obviously only store the partial hash between calls to add()
Secure against collisions. If I compare two hash values from different strings 1 million times per day for the rest of my life, the risk that I get a collision should be neglectable.
For every hash there are collisions. Given a good distribution the bit-size of the hash is the main factor on how often a collision happens. The JavaHasher is used in e.g. HashMaps and seems to be "collision-free" enough to distribute similar keys far apart each other. As for any deeper analysis: do your own tests or ask your local security engineer - sorry.
I hope this gives a good starting point, details are probably mainly opinion-based.
Not intended as an answer, just to demonstrate that hash collisions are much more likely than human intuition tends to assume.
The following tiny program generates 2^31 distinct strings and checks if any of their hashes collide. It does this by keeping a tracking bit per possible hash value (so you need >512MB heap to run it), to mark each hash value as "used" as they are encountered. It takes several minutes to complete.
public class TestStringHashCollisions {
public static void main(String[] argv) {
long collisions = 0;
long testcount = 0;
StringBuilder b = new StringBuilder(64);
for (int i=0; i>=0; ++i) {
// construct distinct string
b.setLength(0);
b.append("www.");
b.append(Integer.toBinaryString(i));
b.append(".com");
// check for hash collision
String s = b.toString();
++testcount;
if (isColliding(s.hashCode()))
++collisions;
// progress printing
if ((i & 0xFFFFFF) == 0) {
System.out.println("Tested: " + testcount + ", Collisions: " + collisions);
}
}
System.out.println("Tested: " + testcount + ", Collisions: " + collisions);
System.out.println("Collision ratio: " + (collisions / (double) testcount));
}
// storage for 2^32 bits in 2^27 ints
static int[] bitSet = new int[1 << 27];
// test if hash code has appeared before, mark hash as "used"
static boolean isColliding(int hash) {
int index = hash >>> 5;
int bitMask = 1 << (hash & 31);
if ((bitSet[index] & bitMask) != 0)
return true;
bitSet[index] |= bitMask;
return false;
}
}
You can adjust the string generation part easily to test different patterns.
Related
Summary
We have recently changed our String-based ID schema in a complex retrieval engine and observed a severe performance drop. In essence, we changed the IDs from XXX-00000001 to X384840564 (see below for details on the ID schema) and suffer from almost doubled runtimes. Choosing a different string hash function solved the problem, but we still lack a good explanation. Thus, our questions are:
Why do we see such a strong performance drop when changing from
the old to the new ID schema?
Why does our solution of using the “parent hash” actually work?
To approach the problem, we hereafter provide (a) detailed information about the ID schemata and hash functions used, (b) a minimal working example in Java that highlights the performance defect, and (c) our performance results and observations.
(Despite the lengthy description, we have already massively reduced the code example to 4 performance critical lines – see phase 2 in the listing.)
(a) Old and new ID schema; hash functions
Our ID objects consist of a parent ID object (string of 16 characters in [A-Z0-9]) and a child ID string. The same parent ID string is on average used by 1–10 child IDs. The old child IDs had a three-letter prefix, a dash, and a zero-padded running index number of length 8, for example, XXX-00000001 (12 characters in total; X may be any letter [A-Z]). The new child IDs have one letter and 9 non-consecutive digits, for example, X384840564 (10 characters in total, X may be any letter [A-Z]). An obvious difference is that the old child ID strings are often recurring (i.e., the string ABC-00000002 occurs with multiple different parent IDs, as the running index typically starts with 1), while the new child IDs with their arbitrary digit combinations typically occur only a few times or even only with a single parent ID.
Since the ID objects are put into HashSets and HashMaps, the choice of a hash function seems crucial. Currently, the system uses the standard string hash for the parent IDs. For the child IDs, we used to XOR the string hashes of parent and child ID – called XOR hash henceforth. In theory, this should distribute different child IDs quite well. As a variant, we experimented with using only the string hash of the parent ID as the hash code of the child ID – called parent hash henceforth. That is, all child IDs sharing the same parent ID share the same hash. In theory, the parent hash could be suboptimal, as all children sharing the same parent ID end up in the same bucket, while the XOR hash should yield a better data distribution.
(b) Minimal working example
Please refer to the following listing (explanation below):
package mwe;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
public class Main {
private static final Random RANDOM = new Random(42);
private static final String DIGITS = "0123456789";
private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" + DIGITS;
private static final int NUM_IDS = 5_000_000;
private static final int MAX_CHILDREN = 5;
private static final int REPETITIONS = 5;
private static final boolean RUN_ID_OLD = true; // e.g., 8IBKMAO2T1ORICNZ__XXX-00000002
private static final boolean RUN_ID_NEW = false; // e.g., 6TEG9R5JP1KHJN55__X580104176
private static final boolean USE_PARENT_HASH = false;
private static final boolean SHUFFLE_SET = false;
private abstract static class BaseID {
protected int hash;
public abstract BaseID getParentID();
#Override
public int hashCode() {
return this.hash;
}
}
private static class ParentID extends BaseID {
private final String id;
public ParentID(final String id) {
this.id = id;
this.hash = id.hashCode();
}
#Override
public BaseID getParentID() {
return null;
}
#Override
public boolean equals(final Object obj) {
if (this == obj) {
return true;
}
if (obj instanceof ParentID) {
final ParentID o = (ParentID) obj;
return this.id.equals(o.id);
}
return false;
}
#Override
public String toString() {
return this.id;
}
}
private static class ChildID extends BaseID {
private final String id;
private final BaseID parent;
public ChildID(final String id, final BaseID parent) {
this.id = id;
this.parent = parent;
// Initialize the hash code of the child ID:
if (USE_PARENT_HASH) {
// Only use the parent hash (i.e., all children have the same hash).
this.hash = parent.hashCode();
} else {
// XOR parent and child hash.
this.hash = parent.hashCode() ^ id.hashCode();
}
}
#Override
public BaseID getParentID() {
return this.parent;
}
#Override
public boolean equals(final Object obj) {
if (this == obj) {
return true;
}
if (this.hash != obj.hashCode()) {
return false;
}
if (obj instanceof ChildID) {
final ChildID o = (ChildID) obj;
final BaseID oParent = o.getParentID();
if (this.parent == null && oParent != null) {
return false;
}
if (this.parent != null && oParent == null) {
return false;
}
if (this.parent == null || !this.parent.equals(oParent)) {
return false;
}
return this.id.equals(o.id);
}
return false;
}
#Override
public String toString() {
return this.parent.toString() + "__" + this.id;
}
}
public static void run(final int repetitions, final boolean useVariant2IDs) throws IOException {
for (int i = 0; i < repetitions; i++) {
System.gc(); // Force memory reset for the next repetition.
// -- PHASE 1: CREATE DATA --------------------------------------------------------------------------------
// Fill a set of several millions random IDs. Each ID is a child ID with a reference to its parent ID.
// Each parent ID has between 1 and MAX_CHILDREN children.
Set<BaseID> ids = new HashSet<>(NUM_IDS);
for (int parentIDIdx = 0; parentIDIdx < NUM_IDS; parentIDIdx++) {
// Generate parent ID: 16 random characters.
final StringBuilder parentID = new StringBuilder();
for (int k = 0; k < 16; k++) {
parentID.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
}
// Generate between 1 and MAX_CHILDREN child IDs.
final int childIDCount = RANDOM.nextInt(MAX_CHILDREN) + 1;
for (int childIDIdx = 0; childIDIdx < childIDCount; childIDIdx++) {
final StringBuilder childID = new StringBuilder();
if (useVariant2IDs) {
// Variant 2: Child ID = letter X plus 9 random digits.
childID.append("X");
for (int k = 0; k < 9; k++) {
childID.append(DIGITS.charAt(RANDOM.nextInt(DIGITS.length())));
}
} else {
// Variant 1: Child ID = XXX- plus zero-padded index of length 8.
childID.append("XXX-").append(String.format("%08d", childIDIdx + 1));
}
final BaseID id = new ChildID(childID.toString(), new ParentID(parentID.toString()));
ids.add(id);
}
}
System.out.print(ids.iterator().next().toString());
System.out.flush();
if (SHUFFLE_SET) {
final List<BaseID> list = new ArrayList<>(ids);
Collections.shuffle(list);
ids = new LinkedHashSet<>(list);
}
System.gc(); // Clean up temp data before starting the timer.
// -- PHASE 2: INDEX DATA ---------------------------------------------------------------------------------
// Iterate over the ID set and fill a map indexed by parent IDs. The map values are irrelevant here, so
// use empty objects.
final long timer = System.currentTimeMillis();
final HashMap<BaseID, Object> map = new HashMap<>();
for (final BaseID id : ids) {
map.put(id.getParentID(), new Object());
}
System.out.println("\t" + (System.currentTimeMillis() - timer));
// Ensure that map and IDs are not GC:ed before the timer stops.
if (map.get(new ParentID("_do_not_gc")) == null) {
map.put(new ParentID("_do_not_gc"), new Object());
}
ids.add(new ParentID("_do_not_gc"));
}
}
public static void main(final String[] args) throws IOException {
if (RUN_ID_OLD) {
run(REPETITIONS, false);
}
if (RUN_ID_NEW) {
run(REPETITIONS, true);
}
}
}
In essence, the program first generates a HashSet of IDs and then indexes these IDs by their parent ID in a HashMap. In detail:
The first phase (PHASE 1) generates 5 million parent IDs, each with 1 to 10 child IDs using either the old (e.g., XXX-00000001) or the new ID schema (e.g., X384840564) and one of the two hash functions. The generated child IDs are collected in a HashSet. We explicitly create new parent ID objects for each child ID to match the functionality of the original system. For experimentation, the IDs can optionally be shuffled in a LinkedHashSet to distort the hash-based ordering (cf. boolean SHUFFLE_SET).
The second phase (PHASE 2) simulates the performance-critical path. It reads all IDs (child IDs with their parents) from the HashSet and puts them into a HashMap with the parent IDs as keys (i.e., aggregate IDs by parent).
Note: The actual retrieval system has a more complex logic, such as reading IDs from multiple sets and merging child IDs as the map entry’s values, but it turned out that none of these steps was responsible for the strong performance gap in question.
The remaining lines try to control for the GC, such that the data structures are not GC:ed too early. We’ve tried different alternatives for controlling the GC, but the results seemed pretty stable overall.
When running the program, the constants RUN_ID_OLD and RUN_ID_NEW activate the old and the new ID schema, respectively (best activate only one at a time). USE_PARENT_HASH switches between the XOR hash (false) and the parent hash (true). SHUFFLE_SET distorts the item order in the ID set. All other constants can remain as they are.
(c) Results
All results here are based on a typical Windows desktop with OpenJDK 11. We also tested Oracle JDK 8 and a Linux machine, but observed similar effects in all cases. For the following figure, we tested each configuration in independent runs, whereas each run repeats the timing 5 times. To avoid outliers, we report the median of the repetitions. Note, however, that the timings of the repetitions do not differ much. The performance is measured in milliseconds.
Observations:
Using XOR hash yields a substantial performance drop in the
HashSet setting when switching to the new ID schema. This hash
function seems suboptimal, but we lack a good explanation.
Using the parent hash function speeds up the process regardless of the ID
schema. We speculate that the internal HashSet order is beneficial,
since the resulting HashMap will build up the same order (because
ID.hash = ID.parent.hash). Interestingly, this effect can also be
observed if the HashSet is split into, say, 50 parts, each holding a
random partition of the full HashSet. This leaves us puzzled.
The entire process seems to be heavily dependent of the reading
order in the for loop of the second phase (i.e., the internal order of the
HashSet). If we distort the order in the shuffled LinkHashSet, we
end up in a worst-case scenario, regardless of the ID schema.
In a separate experiment, we also diagnosed the number of
collisions when filling the HashMap, but could not find obvious
differences when changing the ID schema.
Who can shed more light on explaining these results?
Update
The image below shows some profiling results (using VisualVM) for the non-shuffled runs. Indent indicates nested calls. All percentage values are relative to the phase 2 timing (100%).
An obious difference seems to be HashMap.putVal's self time. There was no obvious difference for treeifying buckets.
Context
Hi, I'm working on an assignment for school that asks us to implement a hash table in Java. There are no requirements that collisions be kept to a minimum, but low collision rate and speed seem to be the two most sought-after qualities in all the reading (some more) that I've done.
Problem
I'd like some guidance on how to map the output of a hash function to a smaller range, without having >20% of my keys collide (yikes).
In all of the algorithms that I've explored, keys are mapped to the entire range of an unsigned 32 bit integer (or in many cases, 64, even 128 bit). I'm not finding much about this on here, Wikipedia, or in any of the hash-related articles / discussions I've come across.
In terms of the specifics of my implementation, I'm working in Java (mandate of my school), which is problematic since there are no unsigned types to work with. To get around this, I've been using the 64-bit long integer type, then using a bit mask to map back down to 32 bits. Instead of simply truncating, I XOR the top 32 bits with the bottom 32, then perform a bitwise AND to mask out any upper bits that might result in a negative value when I cast it down to a 32 bit integer. After all that, a separate function compresses the resulting hash value down to fit into the bounds of the hash table's inner array.
It ends up looking like:
int hash( String key ) {
long h;
for( int i = 0; i < key.length(); i++ )
//do some stuff with each character in the key
h = h ^ ( h << 32 );
return h & 2147483647;
}
Where the inner-loop depends on the hash function (I've implemented a few: polynomial hashing, FNV1, SuperFastHash, and a custom one tailored to the input data).
They basically all perform horribly. I have yet to see <20% keys collide. Even before I compress the hash values down to array indices, none of my hash functions will get me less thank 10k collisions. My inputs are two text files, each ~220,000 lines. One is English words, the other is random strings of varying length.
My lecture notes recommend the following, for compressing the hashed keys:
(hashed key) % P
Where P is the largest prime < the size of the inner array.
Is this an accepted method of compressing hash values? I have a feeling it isn't, but since performance is so poor even before compression, I have a feeling it's not the primary culprit, either.
I don´t know if I understand well your concrete problem, but I will try to help in hash performance and collisions.
The hash based objects will determine in which bucket they will store the key-value pair based on hash value. Inside each bucket there is a structure (In HashMap case a LinkedList) in where the pair is stored.
If the hash value is usually the same, the bucket will be usually the same so the performance will degrade a lot, let´s see an example:
Consider this class
package hashTest;
import java.util.Hashtable;
public class HashTest {
public static void main (String[] args) {
Hashtable<MyKey, String> hm = new Hashtable<>();
long ini = System.currentTimeMillis();
for (int i=0; i<100000; i++) {
MyKey a = new HashTest().new MyKey(String.valueOf(i));
hm.put(a, String.valueOf(i));
}
System.out.println(hm.size());
long fin = System.currentTimeMillis();
System.out.println("tiempo: " + (fin-ini) + " mls");
}
private class MyKey {
private String str;
public MyKey(String i) {
str = i;
}
public String getStr() {
return str;
}
#Override
public int hashCode() {
return 0;
}
#Override
public boolean equals(Object o) {
if (o instanceof MyKey) {
MyKey aux = (MyKey) o;
if (this.str.equals(aux.getStr())) {
return true;
}
}
return false;
}
}
}
Note that hashCode in class MyKey returns always '0' as hash. It is ok with the hashcode definition (http://docs.oracle.com/javase/7/docs/api/java/lang/Object.html#hashCode()). If we run that program, this is the result
100000
tiempo: 62866 mls
Is a very poor performance, now we are going to change the MyKey hashcode code:
package hashTest;
import java.util.Hashtable;
public class HashTest {
public static void main (String[] args) {
Hashtable<MyKey, String> hm = new Hashtable<>();
long ini = System.currentTimeMillis();
for (int i=0; i<100000; i++) {
MyKey a = new HashTest().new MyKey(String.valueOf(i));
hm.put(a, String.valueOf(i));
}
System.out.println(hm.size());
long fin = System.currentTimeMillis();
System.out.println("tiempo: " + (fin-ini) + " mls");
}
private class MyKey {
private String str;
public MyKey(String i) {
str = i;
}
public String getStr() {
return str;
}
#Override
public int hashCode() {
return str.hashCode() * 31;
}
#Override
public boolean equals(Object o) {
if (o instanceof MyKey) {
MyKey aux = (MyKey) o;
if (this.str.equals(aux.getStr())) {
return true;
}
}
return false;
}
}
}
Note that only hashcode in MyKey has changed, now when we run the code te result is
100000
tiempo: 47 mls
There is an incredible better performance now with a minor change. Is a very common practice return the hashcode multiplied by a prime number (in this case 31), using the same hashcode members that you use inside equals method in order to determine if two objects are the same (in this case only str).
I hope that this little example can you point out a solution for your problem.
I would like to know what makes the difference, what should i aware of when im writing code.
Used the same parameters and methods put(), get() when testing
without printing
Used System.NanoTime() to test runtime
I tried it with 1-10 int keys with 10 values, so every single hash returns unique index, which is the most optimal scenario
My HashSet implementation which is based on this is almost as fast as the JDK's
Here's my simple implementation:
public MyHashMap(int s) {
this.TABLE_SIZE=s;
table = new HashEntry[s];
}
class HashEntry {
int key;
String value;
public HashEntry(int k, String v) {
this.key=k;
this.value=v;
}
public int getKey() {
return key;
}
}
int TABLE_SIZE;
HashEntry[] table;
public void put(int key, String value) {
int hash = key % TABLE_SIZE;
while(table[hash] != null && table[hash].getKey() != key)
hash = (hash +1) % TABLE_SIZE;
table[hash] = new HashEntry(key, value);
}
public String get(int key) {
int hash = key % TABLE_SIZE;
while(table[hash] != null && table[hash].key != key)
hash = (hash+1) % TABLE_SIZE;
if(table[hash] == null)
return null;
else
return table[hash].value;
}
Here's the benchmark:
public static void main(String[] args) {
long start = System.nanoTime();
MyHashMap map = new MyHashMap(11);
map.put(1,"A");
map.put(2,"B");
map.put(3,"C");
map.put(4,"D");
map.put(5,"E");
map.put(6,"F");
map.put(7,"G");
map.put(8,"H");
map.put(9,"I");
map.put(10,"J");
map.get(1);
map.get(2);
map.get(3);
map.get(4);
map.get(5);
map.get(6);
map.get(7);
map.get(8);
map.get(9);
map.get(10);
long end = System.nanoTime();
System.out.println(end-start+" ns");
}
If you read the documentation of the HashMap class, you see that it implements a hash table implementation based on the hashCode of the keys. This is dramatically more efficient than a brute-force search if the map contains a non-trivial number of entries, assuming reasonable key distribution amongst the "buckets" that it sorts the entries into.
That said, benchmarking the JVM is non-trivial and easy to get wrong, if you're seeing big differences with small numbers of entries, it could easily be a benchmarking error rather than the code.
When it is up to performance, never assume something.
Your assumption was "My HashSet implementation which is based on this is almost as fast as the JDK's". No, obviously it is not.
That is the tricky part when doing performance work: doubt everything unless you have measured with great accuracy. Worse, you even measured, and the measurement told you that your implementation is slower; and instead of checking your source, and the source of the thing you are measuring against; you decided that the measuring process must be wrong ...
i need to compare two objects in java, it should be tested if their attributes have the same values. Instead of simply compare all the attributes i was thinking about using hash functions. Therefore i have written the following code
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Vector;
public class Test {
private static Vector<String> vecA, vecB;
public static void main(String args[]) {
vecA = new Vector<String>();
vecB = new Vector<String>();
vecA.add("hallo");
vecA.add("blödes Beispiel");
vecA.add("Einer geht noch");
vecB.add("hallo");
vecB.add("blödes Beispiel");
vecB.add("Einer geht noch");
System.out.println("HashCode() VecA: " + vecA.hashCode());
System.out.println("HashCode() VecB: " + vecB.hashCode());
System.out.println("md5 VecA: " + md5(vecA));
System.out.println("md5 VecB: " + md5(vecB));
vecA.add("ungleich");
System.out.println("HashCode() VecA: " + vecA.hashCode());
System.out.println("HashCode() VecB: " + vecB.hashCode());
System.out.println("md5 VecA: " + md5(vecA));
System.out.println("md5 VecB: " + md5(vecB));
}
private static String md5(Vector<String> v){
try {
MessageDigest algorithm = MessageDigest.getInstance("MD5");
algorithm.reset();
algorithm.update(vecA.toString().getBytes());
byte messageDigest[] = algorithm.digest();
StringBuffer hexString = new StringBuffer();
for (int i = 0; i < messageDigest.length; i++) {
String hex = Integer.toHexString(0xFF & messageDigest[i]);
if (hex.length() == 1)
hexString.append('0');
hexString.append(hex);
}
return hexString.toString();
} catch (NoSuchAlgorithmException nsae) {}
return null;
}
}
The md5 function is simply copied by some website. This leads to the following output
HashCode() VecA: -356464767
HashCode() VecB: -356464767
md5 VecA: 6805716958249f5b7f177fc95408713e
md5 VecB: 6805716958249f5b7f177fc95408713e
HashCode() VecA: 1477685990
HashCode() VecB: -356464767
md5 VecA: c76297ce297d5308359ca06f26fb97ca
md5 VecB: c76297ce297d5308359ca06f26fb97ca
I am confused that adding an element in vecA seems to change the md5 Code of vecB and so their hashes are still the same. Whats the reason and is their any advantage in using java.security.MessageDigest or simply hashCode() in that case? How about the performance of hash functions vs. comparisons of all attributes?
You're building the md5 hash for vecA in both cases:
algorithm.update(vecA.toString().getBytes());
should probably be
algorithm.update(v.toString().getBytes());
Whats the reason and is their any advantage in using java.security.MessageDigest or simply hashCode() in that case?
One advantage of not using hashCode would be that if that method is not overridden, two instances of the same class and with the same attribute values would still return different hashes, due to the default implementation of hashCode.
How about the performance of hash functions vs. comparisons of all attributes?
If you just compare the attributes once, there might not be any noticable performance difference (that, however, depends on how you compare the attributes) but in case of repeatedly comparing a number of attributes vs. calculating a hash once and repeatedly comparing the hashes, the latter might be faster.
Edit:
Here's an example to clarify the answer to the first quote.
Consider the following simple piece of code:
static class A {
int x;
public A( int i) {
x = i;
}
}
static class B {
int x;
public B( int i) {
x = i;
}
public int hashCode() {
final int prime = 31;
return prime * x;
}
public boolean equals( Object obj ) {
//by contract you should always override equals and hashCode together
//also note that some checks are omitted for simplicity's sake (obj might be null etc.)
return getClass().equals( obj.getClass()) && x == ((B)obj).x;
}
}
As you can see, A doesn't override hashCode while B does. Thus you'll get the following result:
System.out.println(new A( 500 ).hashCode() == new A(500).hashCode()); //false
System.out.println(new B( 500 ).hashCode() == new B(500).hashCode()); //true
Note that x is the same in both cases but for A#hashCode() the object identity is used, not the value of x as does B#hashCode()
There is a problem in your md5 method
algorithm.update(vecA.toString().getBytes());
Probably should be
algorithm.update(v.toString().getBytes());
In comments to "How to implement List, Set, and Map in null free design?", Steven Sudit and I got into a discussion about using a callback, with handlers for "found" and "not found" situations, vs. a tryGet() method, taking an out parameter and returning a boolean indicating whether the out parameter had been populated. Steven maintained that the callback approach was more complex and almost certain to be slower; I maintained that the complexity was no greater and the performance at worst the same.
But code speaks louder than words, so I thought I'd implement both and see what I got. The original question was fairly theoretical with regard to language ("And for argument sake, let's say this language don't even have null") -- I've used Java here because that's what I've got handy. Java doesn't have out parameters, but it doesn't have first-class functions either, so style-wise, it should suck equally for both approaches.
(Digression: As far as complexity goes: I like the callback design because it inherently forces the user of the API to handle both cases, whereas the tryGet() design requires callers to perform their own boilerplate conditional check, which they could forget or get wrong. But having now implemented both, I can see why the tryGet() design looks simpler, at least in the short term.)
First, the callback example:
class CallbackMap<K, V> {
private final Map<K, V> backingMap;
public CallbackMap(Map<K, V> backingMap) {
this.backingMap = backingMap;
}
void lookup(K key, Callback<K, V> handler) {
V val = backingMap.get(key);
if (val == null) {
handler.handleMissing(key);
} else {
handler.handleFound(key, val);
}
}
}
interface Callback<K, V> {
void handleFound(K key, V value);
void handleMissing(K key);
}
class CallbackExample {
private final Map<String, String> map;
private final List<String> found;
private final List<String> missing;
private Callback<String, String> handler;
public CallbackExample(Map<String, String> map) {
this.map = map;
found = new ArrayList<String>(map.size());
missing = new ArrayList<String>(map.size());
handler = new Callback<String, String>() {
public void handleFound(String key, String value) {
found.add(key + ": " + value);
}
public void handleMissing(String key) {
missing.add(key);
}
};
}
void test() {
CallbackMap<String, String> cbMap = new CallbackMap<String, String>(map);
for (int i = 0, count = map.size(); i < count; i++) {
String key = "key" + i;
cbMap.lookup(key, handler);
}
System.out.println(found.size() + " found");
System.out.println(missing.size() + " missing");
}
}
Now, the tryGet() example -- as best I understand the pattern (and I might well be wrong):
class TryGetMap<K, V> {
private final Map<K, V> backingMap;
public TryGetMap(Map<K, V> backingMap) {
this.backingMap = backingMap;
}
boolean tryGet(K key, OutParameter<V> valueParam) {
V val = backingMap.get(key);
if (val == null) {
return false;
}
valueParam.value = val;
return true;
}
}
class OutParameter<V> {
V value;
}
class TryGetExample {
private final Map<String, String> map;
private final List<String> found;
private final List<String> missing;
private final OutParameter<String> out = new OutParameter<String>();
public TryGetExample(Map<String, String> map) {
this.map = map;
found = new ArrayList<String>(map.size());
missing = new ArrayList<String>(map.size());
}
void test() {
TryGetMap<String, String> tgMap = new TryGetMap<String, String>(map);
for (int i = 0, count = map.size(); i < count; i++) {
String key = "key" + i;
if (tgMap.tryGet(key, out)) {
found.add(key + ": " + out.value);
} else {
missing.add(key);
}
}
System.out.println(found.size() + " found");
System.out.println(missing.size() + " missing");
}
}
And finally, the performance test code:
public static void main(String[] args) {
int size = 200000;
Map<String, String> map = new HashMap<String, String>();
for (int i = 0; i < size; i++) {
String val = (i % 5 == 0) ? null : "value" + i;
map.put("key" + i, val);
}
long totalCallback = 0;
long totalTryGet = 0;
int iterations = 20;
for (int i = 0; i < iterations; i++) {
{
TryGetExample tryGet = new TryGetExample(map);
long tryGetStart = System.currentTimeMillis();
tryGet.test();
totalTryGet += (System.currentTimeMillis() - tryGetStart);
}
System.gc();
{
CallbackExample callback = new CallbackExample(map);
long callbackStart = System.currentTimeMillis();
callback.test();
totalCallback += (System.currentTimeMillis() - callbackStart);
}
System.gc();
}
System.out.println("Avg. callback: " + (totalCallback / iterations));
System.out.println("Avg. tryGet(): " + (totalTryGet / iterations));
}
On my first attempt, I got 50% worse performance for callback than for tryGet(), which really surprised me. But, on a hunch, I added some garbage collection, and the performance penalty vanished.
This fits with my instinct, which is that we're basically talking about taking the same number of method calls, conditional checks, etc. and rearranging them. But then, I wrote the code, so I might well have written a suboptimal or subconsicously penalized tryGet() implementation. Thoughts?
Updated: Per comment from Michael Aaron Safyan, fixed TryGetExample to reuse OutParameter.
I would say that neither design makes sense in practice, regardless of the performance. I would argue that both mechanisms are overly complicated and, more importantly, don't take into account actual usage.
Actual Usage
If a user looks up a value in a map and it isn't there, most likely the user wants one of the following:
To insert some value with that key into the map
To get back some default value
To be informed that the value isn't there
Thus I would argue that a better, null-free API would be:
has(key) which indicates if the key is present (if one only wishes to check for the key's existence).
get(key) which reports the value if the key is present; otherwise, throws NoSuchElementException.
get(key,defaultval) which reports the value for the key, or defaultval if the key isn't present.
setdefault(key,defaultval) which inserts (key,defaultval) if key isn't present, and returns the value associated with key (which is defaultval if there is no previous mapping, otherwise prev mapping).
The only way to get back null is if you explicity ask for it as in get(key,null). This API is incredibly simple, and yet is able to handle the most common map-related tasks (in most use cases that I have encountered).
I should also add that in Java, has() would be called containsKey() while setdefault() would be called putIfAbsent(). Because get() signals an object's absence via a NoSuchElementException, it is then possible to associate a key with null and treat it as a legitimate association.... if get() returns null, it means the key has been associated with the value null, not that the key is absent (although you can define your API to disallow a value of null if you so choose, in which case you would throw an IllegalArgumentException from the functions that are used to add associations if the value given is null). Another advantage to this API, is that setdefault() only needs to perform the lookup procedure once instead of twice, which would be the case if you used if( ! dict.has(key) ){ dict.set(key,val); }. Another advantage is that you do not surprise developers who write something like dict.get(key).doSomething() who assume that get() will always return a non-null object (because they have never inserted a null value into the dictionary)... instead, they get a NoSuchElementException if there is no value for that key, which is more consistent with the rest of the error checking in Java and which is also a much easier to understand and debug than NullPointerException.
Answer To Question
To answer original question, yes, you are unfairly penalizing the tryGet version.... in your callback based mechanism you construct the callback object only once and use it in all subsequent calls; whereas in your tryGet example, you construct your out parameter object in every single iteration. Try taking the line:
OutParameter out = new OutParameter();
Take the line above out of the for-loop and see if that improves the performance of the tryGet example. In other words, place the line above the for-loop, and re-use the out parameter in each iteration.
David, thanks for taking the time to write this up. I'm a C# programmer, so my Java skills are a bit vague these days. Because of this, I decided to port your code over and test it myself. I found some interesting differences and similarities, which are pretty much worth the price of admission as far as I'm concerned. Among the major differences are:
I didn't have to implement TryGet because it's built into Dictionary.
In order to use the native TryGet, instead of inserting nulls to simulate misses, I simply omitted those values. This still means that v = map[k] would have set v to null, so I think it's a proper porting. In hindsight, I could have inserted the nulls and changed (_map.TryGetValue(key, out value)) to (_map.TryGetValue(key, out value) && value != null)), but I'm glad I didn't.
I want to be exceedingly fair. So, to keep the code as compact and maintainable as possible, I used lambda calculus notation, which let me define the callbacks painlessly. This hides much of the complexity of setting up anonymous delegates, and allows me to use closures seamlessly. Ironically, the implementation of Lookup uses TryGet internally.
Instead of declaring a new type of Dictionary, I used an extension method to graft Lookup onto the standard dictionary, much simplifying the code.
With apologies for the less-than-professional quality of the code, here it is:
using System;
using System.Collections.Generic;
using System.Linq;
namespace ConsoleApplication1
{
static class CallbackDictionary
{
public static void Lookup<K, V>(this Dictionary<K, V> map, K key, Action<K, V> found, Action<K> missed)
{
V v;
if (map.TryGetValue(key, out v))
found(key, v);
else
missed(key);
}
}
class TryGetExample
{
private Dictionary<string, string> _map;
private List<string> _found;
private List<string> _missing;
public TryGetExample(Dictionary<string, string> map)
{
_map = map;
_found = new List<string>(_map.Count);
_missing = new List<string>(_map.Count);
}
public void TestTryGet()
{
for (int i = 0; i < _map.Count; i++)
{
string key = "key" + i;
string value;
if (_map.TryGetValue(key, out value))
_found.Add(key + ": " + value);
else
_missing.Add(key);
}
Console.WriteLine(_found.Count() + " found");
Console.WriteLine(_missing.Count() + " missing");
}
public void TestCallback()
{
for (int i = 0; i < _map.Count; i++)
_map.Lookup("key" + i, (k, v) => _found.Add(k + ": " + v), k => _missing.Add(k));
Console.WriteLine(_found.Count() + " found");
Console.WriteLine(_missing.Count() + " missing");
}
}
class Program
{
static void Main(string[] args)
{
int size = 2000000;
var map = new Dictionary<string, string>(size);
for (int i = 0; i < size; i++)
if (i % 5 != 0)
map.Add("key" + i, "value" + i);
long totalCallback = 0;
long totalTryGet = 0;
int iterations = 20;
TryGetExample tryGet;
for (int i = 0; i < iterations; i++)
{
tryGet = new TryGetExample(map);
long tryGetStart = DateTime.UtcNow.Ticks;
tryGet.TestTryGet();
totalTryGet += (DateTime.UtcNow.Ticks - tryGetStart);
GC.Collect();
tryGet = new TryGetExample(map);
long callbackStart = DateTime.UtcNow.Ticks;
tryGet.TestCallback();
totalCallback += (DateTime.UtcNow.Ticks - callbackStart);
GC.Collect();
}
Console.WriteLine("Avg. callback: " + (totalCallback / iterations));
Console.WriteLine("Avg. tryGet(): " + (totalTryGet / iterations));
}
}
}
My performance expectations, as I said in the article that inspired this one, would be that neither one is much faster or slower than the other. After all, most of the work is in the searching and adding, not in the simple logic that structures it. In fact, it varied a bit among runs, but I was unable to detect any consistent advantage.
Part of the problem is that I used a low-precision timer and the test was short, so I increased the count by 10x to 2000000 and that helped. Now callbacks are about 3% slower, which I do not consider significant. On my fairly slow machine, callbacks took 17773437 while tryget took 17234375.
Now, as for code complexity, it's a bit unfair because TryGet is native, so let's just ignore the fact that I had to add a callback interface. At the calling spot, lambda notation did a great job of hiding the complexity. If anything, it's actually shorter than the if/then/else used in the TryGet version, although I suppose I could have used a ternary operator to make it equally compact.
On the whole, I found the C# to be more elegant, and only some of that is due to my bias as a C# programmer. Mainly, I didn't have to define and implement interfaces, which cut down on the plumbing overhead. I also used pretty standard .NET conventions, which seem to be a bit more streamlined than the sort of style favored in Java.