Problem:
I need to compare 2 hash table implementations (well basically HashMap with another one) and make a reasonable conclusion.
I am not interested in 100% accuracy but just being in the right direction in my estimation.
I am interested in the difference not only per operation but mainly on the hashtable as a "whole".
I don't have a strict requirement on speed so if the other implementation is reasonably slower I can accept it but I do expect/require that the memory usage be better (since one of the hashtables is backed by primitive table).
What I did so far:
Originally I created my own custom "benchmark" with loops and many calls to hint for gc to get a feeling of the difference but I am reading online that using a standard tool is more reliable/appropriate.
Example of my approach (MapInterface is just a wrapper so I can switch among implementations.):
int[] keys = new int[10000000];
String[] values = new String[10000000];
for(int i = 0; i < keys.length; ++i) {
keys[i] = i;
values[i] = "" + i;
}
if(operation.equals("put", keys, values)) {
runPutOperation(map);
}
public static long[] runOperation(MapInterface map, Integer[] keys, String[] values) {
long min = Long.MAX_VALUE;
long max = Long.MIN_VALUE;
long run = 0;
for(int i = 0; i < 10; ++i) {
long start = System.currentTimeMillis();
for(int i = 0; i < keys.length; ++i) {
map.put(keys[i], values[i]);
}
long total = System.currentTimeMillis() - start;
System.out.println(total/1000d + " seconds");
if(total < min) {
min = time;
}
if(total > max) {
max = time;
}
run += time;
map = null;
map = createNewHashMap();
hintsToGC();
}
return new long[] {min, max, run};
}
public void hintsToGC() {
for(int i = 0; i < 20; ++i) {
System.out.print(". ");
System.gc();
try {
Thread.sleep(100);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
private HashMapInterface<String> createNewHashMap() {
if(jdk) {
return new JDKHashMapWrapper<String>();
}
else {
return new AlternativeHashMapWrapper<String>();
}
}
public class JDKHashMapWrapper implements HashMapInterface<String> {
HashMap<Integer, String> hashMap;
JDKHashMapWrapper() {
hashMap = new HashMap<Integer, String>();
}
public String put(Integer key, String value) {
return hashMap.put(key, value);
}
//etc
}
(I want to test put, get, contains and the memory utilization)
Can I be sure by using my approach that I can get reasonable measurements?
If not what would be the most appropriate tool to use and how?
Update:
- I also test with random numbers (also ~10M random numbers) using SecureRandom.
- When the hash table resizes I print the logical size of the hash table/size of the actual table to get the load factor
Update:
For my specific case, where I am interested also in integers what can of pitfalls are there with my approach?
UPDATE after #dimo414 comments:
Well at a minimum the hashtable as a "whole" isn't meaningful
I mean how the hashtable behaves under various loads both at runtime and in memory consumption.
Every data structure is a tradeoff of different methods
I agree. My trade-off is an acceptable access penalty for memory improvement
You need to identify what features you're interested in verifying
1) put(key, value);
2) get(key, value);
3) containsKey(key);
4) all the above when having many entries in the hash table
Some key consideration for using Hash tables is the size of the "buckets" allocation, the collision resolution strategy, and the shape of your data. Essentially, a Hash table takes the key supplied by the application and then hashes it to a value less than or equal to the number of allocated buckets. When two key values hash to the same bucket, the implementation has to resolve the collision and return the right value. For example, one could have a sorted linked list for each bucket and that list is searched.
If your data happens to have a lot of collisions, then your performance will suffer, because the Hash table implementation will spend too much time resolving the collision. On the other hand, if you have a very large number of buckets, you solve the collision problem at the expense of memory. Also, Java's built-in HashMap implementation will "rehash" if the number of entries gets larger than a certain amount - I imagine this is an expensive operation that is worth avoiding.
Since your key data is the positive integers from 1 to 10M, your test data looks good. I would also ensure that the different hash tables implementations were initialized to the same bucket size for a given test, otherwise it's not a fair comparison. Finally, I would vary the bucket size over a pretty significant range and rerun the tests to see how the implementations changed their behavior.
As I understand you are interested in both operations execution time and memory consumption of the maps in the test.
I will start with memory consumption as this seams not to be answered at all. What I propose is to use a small library called Classmexer. I personally used it when I need to get the 100% correct memory consumption of any object. It has the java agent approach (because it's using the Instrumentation API), which means that you need to add it as the parameter to the JVM executing your tests:
-javaagent: [PATH_TO]/classmexer.jar
The usage of the Classmexer is very simple. At any point of time you can get the memory consumption in bytes by executing:
MemoryUtil.deepMemoryUsageOf(mapIamInterestedIn, VisibilityFilter.ALL)
Note that with visibility filter you can specify if the memory calculation should be done for the object (our map) plus all other reachable object through references. That's what VisibilityFilter.ALL is for. However, this would mean that the size you get back includes all objects you used for keys and values. Thus if you have 100 Integer/String entries the reported size will include those as well.
For the timing aspect I would propose JMH tool, as this tool is made for micro bench-marking. There are plenty examples online, for example this article has map testing examples that can guide you pretty good.
Note that I should be careful when do you call the Classmexer's Memory Util as it will interfere with the time results if you call it during the time measuring. Furthermore, I am sure that there are many other tools similar to Classmexer, but I like it because it small and simple.
I was just doing something similar to this, and I ended up using the built in profiler in the Netbeans IDE. You can get really detailed info on both CPU and memory usage. I had originally written all my code in Eclipse, but Netbeans has an import feature for bringing in Eclipse projects and it set it all up no problem, if that is possibly your situation too.
For timing, you might also look at the StopWatch class in Apache Commons. It's a much more intuitive way of tracking time on targeted operations, e.g.:
StopWatch myMapTimer = new StopWatch();
HashMap<Integer, Integer> hashMap = new HashMap<>();
myMapTimer.start();
for (int i = 0; i < numElements; i++)
hashMap.put(i, i);
myMapTimer.stop();
System.out.println(myMapTimer.getTime()); // time will be in milliseconds
Related
My program needs to generate unique labels that consist of a tag, date and time. Something like this:
"myTag__2019_09_05__07_51"
If one tries to generate two labels with the same tag in the same minute, one will receive equal labels, what I cannot allow. I think of adding as an additional suffix result of System.nanoTime() to make sure that each label will be unique (I cannot access all labels previously generated to look for duplicates):
"myTag__2019_09_05__07_51__{System.nanoTime() result}"
Can I trust that each invocation of System.nanoTime() this will produce a different value? I tested it like this on my laptop:
assertNotEquals(System.nanoTime(), System.nanoTime())
and this works. I wonder if I have a guarantee that it will always work.
TLDR; If you only use a single thread on a popular VM on a modern operating system, it may work in practice. But many serious applications use multiple threads and multiple instances of the application, and there won't be any guarantee in that case.
The only guarantee given in the Javadoc for System.nanoTime() is that the resolution of the clock is at least as good as System.currentTimeMillis() - so if you are writing cross-platform code, there is clearly no expectation that the results of nanoTime are unique, as you can call nanoTime() many times per millisecond.
On my OS (Java 11, MacOS) I always get at least one nanosecond difference between successive calls on the same thread (and that is after Integer.MAX_VALUE looks at successive return values); it's possible that there is something in the implementation that guarantees it.
However it is simple to generate duplicate results if you use multiple Threads and have more than 1 physical CPU. Here's code that will show you:
public class UniqueNano {
private static volatile long a = -1, b = -2;
public static void main(String[] args) {
long max = 1_000_000;
new Thread(() -> {
for (int i = 0; i < max; i++) { a = System.nanoTime(); }
}).start();
new Thread(() -> {
for (int i = 0; i < max; i++) { b = System.nanoTime(); }
}).start();
for (int i = 0; i < max; i++) {
if (a == b) {
System.out.println("nanoTime not unique");
}
}
}
}
Also, when you scale your application to multiple machines, you will potentially have the same problem.
Relying on System.nanoTime() to get unique values is not a good idea.
I am evaluating an SDK and I need to cross compare ~15000 iris images stored in a gallery folder and generate the similarity scores as a 15000 x 15000 matrix.
So I pre-processed all the images and stored the processed blobs in an ArrayList. Then I'm using multiple threads with 2 'for' loops in the run method to call the 'compare' method (from SDK) and pass the index of an ArrayList as parameters to compare those respective blobs and save the integer return values in an excel sheet using Apache poi library. The performance is very inefficient (each comparison takes ~40ms) and the whole task takes a lot of time (~100 days estimated with 8 cores running at 100%) to do all the 225,000,000 comparisons. Please help me understand this bottle neck.
Multithreading code
int processors = Runtime.getRuntime().availableProcessors();
ExecutorService executor = Executors.newFixedThreadPool(processors);
for(int i =0; i<processors; i++) {
//each thread compares 1875 images with 15000 images
Runnable task = new Thread(bloblist,i*1875,i*1875+1874);
executor.execute(task);
}
executor.shutdown();
Run Method
public void run(){
for(int i = startIndex; i<= lastIndex; i++) {
for(int j=0;j<15000;j++){
compare.compareIris(bloblist.get(i),bloblist.get(j));
score= compare.getScore();
//save result to Excel using Apache POI
...
...
}
}
}
Please suggest me a time-efficient architecture to accomplish this task. Shall I store the blobs in a NoSQL DB or is there any alternate way to do this?
I'd consider adding some simple profiling to your code as a first step. Profiling libraries are great, but can be a little intimidating. All you really need to get started is:
public void run(){
long sumCompare = 0;
long sumSave = 0
for(int i = startIndex; i<= lastIndex; i++) {
for(int j=0;j<15000;j++){
final long compareStart = System.currentTimeMillis();
compare.compareIris(bloblist.get(i),bloblist.get(j));
score= compare.getScore();
final long compareEnd = System.currentTimeMillis();
compareSum += (compareEnd - compareStart);
//save result to Excel using Apache POI
...
...
final long saveEnd = System.currentTimeMillis();
saveSum += (saveEnd - compareEnd);
}
}
System.out.println(String.format("Compare: %d; Save: %d", sumCompare, sumSave);
}
Maybe run this over a 100x100 grid instead to get a rough idea of where the bulk of your runtime is.
If it's the save step, I'd strongly recommend using a database as an intermediate step between computing the score and exporting it to a spreadsheet. A NoSQL database would work, although I'd also encourage you to look at something like SQLite, just for simplicity's sake. (Many NoSQL databases are designed to offer advantages across a cluster of database nodes while working with very large datasets; if you're storing write-only data on one node, SQL may be your best bet.)
If the bottleneck is the compute step, it will be more difficult to improve performance. If the blobs don't all fit comfortably in RAM along with whatever RAM the comparisons consume, you may be paying the price of swapping this data onto and off-of disk. You may see a small improvement by having each thread take work "off of a queue", rather than starting with a pre-assigned block:
final int processors = Runtime.getRuntime().availableProcessors();
final ExecutorService executor = Executors.newFixedThreadPool(processors);
final AtomicLong nextCompare = new AtomicLong(0);
for(int i =0; i<processors; i++) {
Runnable task = new Thread(bloblist, nextCompare);
executor.execute(task);
}
executor.shutdown();
public void run(){
while (true) {
final long taskNum = nextCompare.getAndIncrement();
if (taskNum >= 15000 * 15000) {
return;
}
final long i = Math.floor(taskNum/15000);
final long j = taskNum % 15000;
compare.compareIris(bloblist.get(i),bloblist.get(j));
score = compare.getScore();
// Save score, etc.)
}
}
This will result in all threads working on blobs stored relatively close together in memory. In this way, no thread is evicting data from the cache that another thread will require in the near future. You are, however, paying the price of locking the AtomicLong; if memory thrashing wasn't your issue, this will likely be a bit slower.
This question is specifically geared towards the Java language, but I would not mind feedback about this being a general concept if so. I would like to know which operation might be faster, or if there is no difference between assigning a variable a value and performing tests for values. For this issue we could have a large series of Boolean values that will have many requests for changes. I would like to know if testing for the need to change a value would be considered a waste when weighed against the speed of simply changing the value during every request.
public static void main(String[] args){
Boolean array[] = new Boolean[veryLargeValue];
for(int i = 0; i < array.length; i++) {
array[i] = randomTrueFalseAssignment;
}
for(int i = 400; i < array.length - 400; i++) {
testAndChange(array, i);
}
for(int i = 400; i < array.length - 400; i++) {
justChange(array, i);
}
}
This could be the testAndChange method
public static void testAndChange(Boolean[] pArray, int ind) {
if(pArray)
pArray[ind] = false;
}
This could be the justChange method
public static void justChange(Boolean[] pArray, int ind) {
pArray[ind] = false;
}
If we were to end up with the very rare case that every value within the range supplied to the methods were false, would there be a point where one method would eventually become slower than the other? Is there a best practice for issues similar to this?
Edit: I wanted to add this to help clarify this question a bit more. I realize that the data type can be factored into the answer as larger or more efficient datatypes can be utilized. I am more focused on the task itself. Is the task of a test "if(aConditionalTest)" is slower, faster, or indeterminable without additional informaiton (such as data type) than the task of an assignment "x=avalue".
As #TrippKinetics points out, there is a semantical difference between the two methods. Because you use Boolean instead of boolean, it is possible that one of the values is a null reference. In that case the first method (with the if-statement) will throw an exception while the second, simply assigns values to all the elements in the array.
Assuming you use boolean[] instead of Boolean[]. Optimization is an undecidable problem. There are very rare cases where adding an if-statement could result in better performance. For instance most processors use cache and the if-statement can result in the fact that the executed code is stored exactly on two cache-pages where without an if on more resulting in cache faults. Perhaps you think you will save an assignment instruction but at the cost of a fetch instruction and a conditional instruction (which breaks the CPU pipeline). Assigning has more or less the same cost as fetching a value.
In general however, one can assume that adding an if statement is useless and will nearly always result in slower code. So you can quite safely state that the if statement will slow down your code always.
More specifically on your question, there are faster ways to set a range to false. For instance using bitvectors like:
long[] data = new long[(veryLargeValue+0x3f)>>0x06];//a long has 64 bits
//assign random values
int low = 400>>0x06;
int high = (veryLargeValue-400)>>0x06;
data[low] &= 0xffffffffffffffff<<(0x3f-(400&0x3f));
for(int i = low+0x01; i < high; i++) {
data[i] = 0x00;
}
data[high] &= 0xffffffffffffffff>>(veryLargeValue-400)&0x3f));
The advantage is that a processor can perform operations on 32- or 64-bits at once. Since a boolean is one bit, by storing bits into a long or int, operations are done in parallel.
The 2 following versions of the same function (which basically tries to recover a password by brute force) do not give same performance:
Version 1:
private static final char[] CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();
private static final int N_CHARS = CHARS.length;
private static final int MAX_LENGTH = 8;
private static char[] recoverPassword()
{
char word[];
int refi, i, indexes[];
for (int length = 1; length <= MAX_LENGTH; length++)
{
refi = length - 1;
word = new char[length];
indexes = new int[length];
indexes[length - 1] = 1;
while(true)
{
i = length - 1;
while ((++indexes[i]) == N_CHARS)
{
word[i] = CHARS[indexes[i] = 0];
if (--i < 0)
break;
}
if (i < 0)
break;
word[i] = CHARS[indexes[i]];
if (isValid(word))
return word;
}
}
return null;
}
Version 2:
private static final char[] CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();
private static final int N_CHARS = CHARS.length;
private static final int MAX_LENGTH = 8;
private static char[] recoverPassword()
{
char word[];
int refi, i, indexes[];
for (int length = 1; length <= MAX_LENGTH; length++)
{
refi = length - 1;
word = new char[length];
indexes = new int[length];
indexes[length - 1] = 1;
while(true)
{
i = refi;
while ((++indexes[i]) == N_CHARS)
{
word[i] = CHARS[indexes[i] = 0];
if (--i < 0)
break;
}
if (i < 0)
break;
word[i] = CHARS[indexes[i]];
if (isValid(word))
return word;
}
}
return null;
}
I would expect version 2 to be faster, as it does (and that is the only difference):
i = refi;
...as compare to version 1:
i = length -1;
However, it's the opposite: version 1 is faster by over 3%!
Does someone knows why? Is that due to some optimization done by the compiler?
Thank you all for your answers that far.
Just to add that the goal is actually not to optimize this piece of code (which is already quite optimized), but more to understand, from a compiler / CPU / architecture perspective, what could explain such performance difference.
Your answers have been very helpful, thanks again!
Key
It is difficult to check this in a micro-benchmark because you cannot say for sure how the code has been optimised without reading the machine code generated, even then the CPU can do plenty of tricks to optimise it future eg. it turns the x86 code in RISC style instructions to actually execute.
A computation takes as little as one cycle and the CPU can perform up to three of them at once. An access to L1 cache takes 4 cycles and for L2, L3, main memory it takes 11, 40-75, 200 cycles.
Storing values to avoid a simple calculation is actually slower in many cases. BTW using division and modulus is quite expensive and caching this value can be worth it when micro-tuning your code.
The correct answer should be retrievable by a deassembler (i mean .class -> .java converter),
but my guess is that the compiler might have decided to get rid of iref altogether and decided to store length - 1 an auxiliary register.
I'm more of a c++ guy, but I would start by trying:
const int refi = length - 1;
inside the for loop. Also you should probably use
indexes[ refi ] = 1;
Comparing running times of codes does not give exact or quarantine results
First of all, it is not the way comparing performances like this. A running time analysis is needed here. Both 2 codes have same loop structure and their running time are the same. You may have different running times when you run codes. However, they mostly differ with cache hits, I/O times, thread & process schedules. There is no quarantine that code is always completed in a exact time.
However, there is still differences in your code, to understand the difference you should look into your CPU architecture. I can explain it according to x86 architecture basically.
What happens behind the scenes?
i = refi;
CPU takes refi and i to its registers from ram. there is 2 access to ram if the values in not in the cache. and value of i will be written to the ram. However, it always takes different times according to thread & process schedules. Furrhermore, if the values are in virtual memory it wil take longer time.
i = length -1;
CPU also access i and length from ram or cache. there is same number of accesses. In addition, there is a subtraction here which means extra CPU cycles. That is why you think this one take longer time to complete. It is expected, but the issues that i mentioned above explain why this take longer time.
Summation
As i explain this is not the way of comparing performance. I think, there is no real difference between these codes. There are lots of optimizations inside CPU and also in compiler. You can see optimized codes if you decompile .class files.
My advice is it is better to minimize BigO running time analysis. If you find better algorithms it is the best way of optimizing codes. In case you still have bottlenecks in your code, you may try micro-benchmarking.
See also
Analysis of algorithms
Big O notation
Microprocessor
Compiler optimization
CPU Scheduling
To start with, you can't really compare the performance by just running your program - micro benchmarking in Java is complicated.
Also, a subtraction on modern CPUs can take as little as a third of a clock cycle on average. On a 3GHz CPU, that is 0.1 nanoseconds. And nothing tells you that the subtraction actually happens as the compiler might have modified the code.
So:
You should try to check the generated assembly code.
If you really care about the performance, create an appropriate micro-benchmark.
I have a program which fetches records from database (using Hibernate) and fills them in a Vector. There was an issue regarding the performance of the operation and I did a test with the Vector replaced by a HashSet. With 300000 records, the speed gain is immense - 45 mins to 2 mins!
So my question is, what is causing this huge difference? Is it just the point that all methods in Vector are synchronized or the point that internally Vector uses an array whereas HashSet does not? Or something else?
The code is running in a single thread.
EDIT:
The code is only inserting the values in the Vector (and in the other case, HashSet).
If it's trying to use the Vector as a set, and checking for the existence of a record before adding it, then filling the vector becomes an O(n^2) operation, compared with O(n) for HashSet. It would also become an O(n^2) operation if you insert each element at the start of the vector instead of at the end.
If you're just using collection.add(item) then I wouldn't expect to see that sort of difference - synchronization isn't that slow.
If you can try to test it with different numbers of records, you could see how each version grows as n increases - that would make it easier to work out what's going on.
EDIT: If you're just using Vector.add then it sounds like something else could be going on - e.g. your database was behaving differently between your different test runs. Here's a little test application:
import java.util.*;
public class Test {
public static void main(String[] args) {
long start = System.currentTimeMillis();
Vector<String> vector = new Vector<String>();
for (int i = 0; i < 300000; i++) {
vector.add("dummy value");
}
long end = System.currentTimeMillis();
System.out.println("Time taken: " + (end - start) + "ms");
}
}
Output:
Time taken: 38ms
Now obviously this isn't going to be very accurate - System.currentTimeMillis isn't the best way of getting accurate timing - but it's clearly not taking 45 minutes. In other words, you should look elsewhere for the problem, if you really are just calling Vector.add(item).
Now, changing the code above to use
vector.add(0, "dummy value"); // Insert item at the beginning
makes an enormous difference - it takes 42 seconds instead of 38ms. That's clearly a lot worse - but it's still a long way from being 45 minutes - and I doubt that my desktop is 60 times as fast as yours.
If you are inserting them at the middle or beginning instead of at the end, then the Vector needs to move them all along. Every insert. The hashmap, on the other hand, doesn't really care or have to do anything.
Vector is outdated and should not be used anymore. Profile with ArrayList or LinkedList (depends on how you use the list) and you will see the difference (sync vs unsync).
Why are you using Vector in a single threaded application at all?
Vector is synchronized by default; HashSet is not. That's my guess. Obtaining a monitor for access takes time.
I don't know if there are reads in your test, but Vector and HashSet are both O(1) if get() is used to access Vector entries.
Under normal circumstances, it is totally implausible that inserting 300,000 records into a Vector will take 43 minutes longer than inserting the same records into a HashSet.
However, I think there is a possible explanation of what might be going on.
First, the records coming out of the database must have a very high proportion of duplicates. Or at least, they must be duplicates according to the semantics of the equals/hashcode methods of your record class.
Next, I think you must be pushing very close to filling up the heap.
So the reason that the HashSet solution is so much faster is that it is most of the records are being replaced by the set.add operation. By contrast the Vector solution is keeping all of the records, and the JVM is spending most of its time trying to squeeze that last 0.05% of memory by running the GC over, and over and over.
One way to test this theory is to run the Vector version of the application with a much bigger heap.
Irrespective, the best way to investigate this kind of problem is to run the application using a profiler, and see where all the CPU time is going.
import java.util.*;
public class Test {
public static void main(String[] args) {
long start = System.currentTimeMillis();
Vector<String> vector = new Vector<String>();
for (int i = 0; i < 300000; i++) {
if(vector.contains(i)) {
vector.add("dummy value");
}
}
long end = System.currentTimeMillis();
System.out.println("Time taken: " + (end - start) + "ms");
}
}
If you check for duplicate element before insert the element in the vector, it will take more time depend upon the size of vector. best way is to use the HashSet for high performance, because Hashset will not allow duplicate and no need to check for duplicate element before inserting.
According to Dr Heinz Kabutz, he said this in one of his newsletters.
The old Vector class implements serialization in a naive way. They simply do the default serialization, which writes the entire Object[] as-is into the stream. Thus if we insert a bunch of elements into the List, then clear it, the difference between Vector and ArrayList is enormous.
import java.util.*;
import java.io.*;
public class VectorWritingSize {
public static void main(String[] args) throws IOException {
test(new LinkedList<String>());
test(new ArrayList<String>());
test(new Vector<String>());
}
public static void test(List<String> list) throws IOException {
insertJunk(list);
for (int i = 0; i < 10; i++) {
list.add("hello world");
}
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ObjectOutputStream out = new ObjectOutputStream(baos);
out.writeObject(list);
out.close();
System.out.println(list.getClass().getSimpleName() +
" used " + baos.toByteArray().length + " bytes");
}
private static void insertJunk(List<String> list) {
for(int i = 0; i<1000 * 1000; i++) {
list.add("junk");
}
list.clear();
}
}
When we run this code, we get the following output:
LinkedList used 107 bytes
ArrayList used 117 bytes
Vector used 1310926 bytes
Vector can use a staggering amount of bytes when being serialized. The lesson here? Don't ever use Vector as Lists in objects that are Serializable. The potential for disaster is too great.