In an application where one thread is responsible for continuously updating a map and the main thread periodically reads the map, is it sufficient to use a ConcurrentHashMap? Or should I explicitly lock operations in synchronized blocks? Any explanation would be great.
Update
I have a getter and a setter for the map (encapsulated in a custom type) which can be used simultaneously by both threads. Is a ConcurrentHashMap still a good solution? Or maybe I should synchronize the getter/setter (or perhaps declare the instance variable to be volatile)? Just want to make sure that this extra detail doesn't change the solution.
As long as you perform each operation in one method call to the ConcurrentHashMap, you don't need additional locking. Unfortunately, if you need to perform a number of method calls atomically, you have to use locking, in which case the ConcurrentHashMap doesn't help and you may as well use a plain HashMap.
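To illustrate the distinction, here is a minimal sketch (assuming Java 8+; the class and method names are made up for the example): a single call on the ConcurrentHashMap is atomic on its own, while a compound action spanning two calls needs an external lock anyway, at which point a plain HashMap guarded by that lock is just as good.
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AtomicityExample {

    private final ConcurrentHashMap<String, Integer> concurrent = new ConcurrentHashMap<>();

    private final Map<String, Integer> plain = new HashMap<>();
    private final Object lock = new Object();

    // Atomic without extra locking: a single method call on the concurrent map.
    void incrementConcurrent(String key) {
        concurrent.merge(key, 1, Integer::sum);
    }

    // Spans two calls, so it needs external locking; a plain HashMap is fine inside the lock.
    void moveValue(String from, String to) {
        synchronized (lock) {
            Integer value = plain.remove(from);
            if (value != null) {
                plain.put(to, value);
            }
        }
    }
}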
@James' suggestion got me thinking as to whether tuning down unneeded concurrency makes a ConcurrentHashMap faster. It should reduce memory, but you would need thousands of these maps to make much difference. So I wrote this test, and it doesn't appear obvious that you would always need to tune the concurrency level.
warmup: Average access time 36 ns.
warmup2: Average access time 28 ns.
1 concurrency: Average access time 25 ns.
2 concurrency: Average access time 25 ns.
4 concurrency: Average access time 25 ns.
8 concurrency: Average access time 25 ns.
16 concurrency: Average access time 24 ns.
32 concurrency: Average access time 25 ns.
64 concurrency: Average access time 26 ns.
128 concurrency: Average access time 26 ns.
256 concurrency: Average access time 26 ns.
512 concurrency: Average access time 27 ns.
1024 concurrency: Average access time 28 ns.
Code
public static void main(String[] args) {
    test("warmup", new ConcurrentHashMap<Integer, Integer>());
    test("warmup2", new ConcurrentHashMap<Integer, Integer>());
    for (int i = 1; i <= 1024; i += i)
        test(i + " concurrency", new ConcurrentHashMap<Integer, Integer>(16, 0.75f, i));
}
private static void test(String description, ConcurrentHashMap<Integer, Integer> map) {
    Integer[] ints = new Integer[2000];
    for (int i = 0; i < ints.length; i++)
        ints[i] = i;
    long start = System.nanoTime();
    for (int i = 0; i < 20 * 1000 * 1000; i += ints.length) {
        for (Integer j : ints) {
            map.put(j, 1);
            map.get(j);
        }
    }
    long time = System.nanoTime() - start;
    System.out.println(description + ": Average access time " + (time / 20 / 1000 / 1000 / 2) + " ns.");
}
As @bestsss points out, a larger concurrency level can be slower because it has poorer caching characteristics.
EDIT: Further to @bestsss' concern about whether loops get optimised away when there are no method calls, here are three loops, all the same except that they iterate a different number of times. They print:
10M: Time per loop 661 ps.
100K: Time per loop 26490 ps.
1M: Time per loop 19718 ps.
10M: Time per loop 4 ps.
100K: Time per loop 17 ps.
1M: Time per loop 0 ps.
{
    int loops = 10 * 1000 * 1000;
    long product = 1;
    long start = System.nanoTime();
    for (int i = 0; i < loops; i++)
        product *= i;
    long time = System.nanoTime() - start;
    System.out.println("10M: Time per loop " + 1000 * time / loops + " ps.");
}
{
    int loops = 100 * 1000;
    long product = 1;
    long start = System.nanoTime();
    for (int i = 0; i < loops; i++)
        product *= i;
    long time = System.nanoTime() - start;
    System.out.println("100K: Time per loop " + 1000 * time / loops + " ps.");
}
{
    int loops = 1000 * 1000;
    long product = 1;
    long start = System.nanoTime();
    for (int i = 0; i < loops; i++)
        product *= i;
    long time = System.nanoTime() - start;
    System.out.println("1M: Time per loop " + 1000 * time / loops + " ps.");
}
// code for three loops repeated
That is sufficient, as the purpose of ConcurrentHashMap is to allow lockless get / put operations, but make sure you are using it with the correct concurrency level. From the docs:
Ideally, you should choose a value to accommodate as many threads as will ever concurrently modify the table. Using a significantly higher value than you need can waste space and time, and a significantly lower value can lead to thread contention. But overestimates and underestimates within an order of magnitude do not usually have much noticeable impact. A value of one is appropriate when it is known that only one thread will modify and all others will only read. Also, resizing this or any other kind of hash table is a relatively slow operation, so, when possible, it is a good idea to provide estimates of expected table sizes in constructors.
See http://download.oracle.com/javase/6/docs/api/java/util/concurrent/ConcurrentHashMap.html.
EDIT:
The wrapped getter/setter makes no difference, so long as the map is still being read and written by multiple threads. You could lock the whole map instead, but that defeats the purpose of using a ConcurrentHashMap.
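For the getter/setter mentioned in the question's update, a minimal sketch of such a wrapper (the type and field names are hypothetical) could look like this; if the setter can swap in a whole new map, a volatile field ensures readers see the new reference, while the ConcurrentHashMap itself takes care of concurrent access to its entries:
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MapHolder {

    // volatile only matters if setMap() can replace the whole map;
    // if the reference never changes, a final field is sufficient.
    private volatile Map<String, Long> map = new ConcurrentHashMap<>();

    public Map<String, Long> getMap() {
        return map;
    }

    public void setMap(Map<String, Long> newMap) {
        map = newMap;
    }
}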
A ConcurrentHashMap is a good solution for a situation involving lots of write operations and fewer read operations. The downside is that there is no guarantee about which writes a reader will see at any particular moment. So if you require the reader to see the most up-to-date version of the map, it is not a good solution.
From the Java 6 API documentation:
Retrieval operations (including get) generally do not block, so may overlap with update operations (including put and remove). Retrievals reflect the results of the most recently completed update operations holding upon their onset. For aggregate operations such as putAll and clear, concurrent retrievals may reflect insertion or removal of only some entries.
If that is not acceptable for your project, your best solution is really a fully synchronous lock. Solutions for many write operations with few read operations, as far as I know, compromise up-to-date reads in order to achieve faster, non-blocked writing. If you do go with this solution, the Collections.synchronizedMap(...) method creates a fully-synchronized, single reader/writer wrapper for any map object. Easier than writing your own.
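As a rough sketch of that alternative (the key and value types are chosen just for the example), Collections.synchronizedMap wraps every call in a single lock; note that iteration still requires synchronizing on the wrapper manually, as its Javadoc points out:
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SynchronizedMapExample {

    private final Map<Integer, Long> map = Collections.synchronizedMap(new HashMap<>());

    public static void main(String[] args) {
        SynchronizedMapExample example = new SynchronizedMapExample();
        example.map.put(1, 42L);               // every call locks the whole map
        synchronized (example.map) {           // iteration still needs manual locking
            for (Map.Entry<Integer, Long> entry : example.map.entrySet()) {
                System.out.println(entry);
            }
        }
    }
}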
You'd be better off using a ConcurrentHashMap, as its implementation doesn't normally block reads. If you synchronize externally, you'll end up blocking most reads, because you don't have access to the internal knowledge of the implementation needed to avoid doing so.
If there is only one writer, it should be safe to just use a ConcurrentHashMap. If you feel the need to synchronize, there are other map implementations that do the synchronization for you and will be faster than writing the synchronization manually.
Yes... and to optimize it better, you should set the concurrency level to 1.
From Javadoc:
The allowed concurrency among update operations is guided by the optional concurrencyLevel constructor argument (default 16), which is used as a hint for internal sizing. .... A value of one is appropriate when it is known that only one thread will modify and all others will only read.
The solution works because of memory consistency effects for ConcurrentMaps: As with other concurrent collections, actions in a thread prior to placing an object into a ConcurrentMap as a key or value happen-before actions subsequent to the access or removal of that object from the ConcurrentMap in another thread.
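Putting those two quotes together, a minimal sketch for the single-writer case in the question might look like this (the initial capacity and the key/value types are illustrative guesses, not taken from the question):
import java.util.concurrent.ConcurrentHashMap;

public class SingleWriterMapExample {

    // concurrencyLevel = 1 because only one thread ever modifies the map;
    // the initial capacity is just an estimate of the expected table size.
    static final ConcurrentHashMap<Integer, Long> MAP = new ConcurrentHashMap<>(1024, 0.75f, 1);

    public static void main(String[] args) {
        MAP.put(1, System.currentTimeMillis());   // what the writer thread would do
        System.out.println(MAP.get(1));           // a reader sees the value via the map's happens-before guarantee
    }
}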
Related
I was playing around with loops in Java when I saw that the iteration speed keeps increasing.
Kind of seemed interesting.
Any ideas why?
Code:
import org.junit.jupiter.api.Test;

public class RandomStuffTest {

    public static long iterationsPerSecond = 0;

    @Test
    void testIterationSpeed() {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    System.out.println("Iterations per second: " + iterationsPerSecond);
                    iterationsPerSecond = 0;
                    Thread.sleep(1000);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        t.setDaemon(true);
        t.start();
        while (true) {
            for (long i = 0; i < Long.MAX_VALUE; i++) {
                iterationsPerSecond++;
            }
        }
    }
}
Output:
Iterations per second: 6111
Iterations per second: 2199824206
Iterations per second: 4539572003
Iterations per second: 6919540856
Iterations per second: 9442209284
Iterations per second: 11899448226
Iterations per second: 14313220638
Iterations per second: 16827637088
Iterations per second: 19322118707
Iterations per second: 21807781722
Iterations per second: 24256315314
Iterations per second: 26641505580
Another thing that I noticed:
The CPU usage was around 20% all the time and not really increasing...
Maybe because I was running the code as a test using JUnit?
The problem is the Java Memory Model (JMM).
Every thread is allowed to keep (but does not have to keep) a local copy of each field. Whenever it writes or reads that field, it is free to just update its local copy and sync it up with other threads' copies much, much later.
Said differently, the JVM is free to re-order instructions, do things in parallel, and otherwise apply whatever weird stuff it wants to optimize your code, as long as certain guarantees are never broken.
One guarantee that is easy to understand: The JVM is free to reorder or parallelize 2 sequential instructions, but it must never be possible to write code that can observe this except through timing.
In other words, int x = 0; x = 5; System.out.println(x); must necessarily print 5 and never 0.
You can establish such relationships between 2 threads as well but this involves the use of volatile and/or synchronized and/or something that does this internally (most things in the java.util.concurrent package).
You didn't, so this result is meaningless. Most likely, the instruction iterationsPerSecond = 0 is having no effect; the code iterationsPerSecond++ reads 9442209284, increments by one, and writes it back - and that field got written to 0 someplace in the middle of all that, which thus accomplished nothing whatsoever.
If you want to test this properly, try a volatile variable, or better yet an AtomicLong.
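Here is a minimal sketch of that fix using an AtomicLong (it mirrors the structure of the original test but is illustrative rather than a drop-in replacement for the JUnit method):
import java.util.concurrent.atomic.AtomicLong;

public class IterationCounter {

    static final AtomicLong iterationsPerSecond = new AtomicLong();

    public static void main(String[] args) {
        Thread reporter = new Thread(() -> {
            try {
                while (true) {
                    // getAndSet atomically reads the last second's count and resets it
                    System.out.println("Iterations per second: " + iterationsPerSecond.getAndSet(0));
                    Thread.sleep(1000);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reporter.setDaemon(true);
        reporter.start();

        while (true) {                             // runs until the process is killed, like the original
            iterationsPerSecond.incrementAndGet();
        }
    }
}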
As already indicated, the code is broken due to a data race.
The JIT can do some funny stuff with your code because of the data race:
while (true) {
    for (long i = 0; i < Long.MAX_VALUE; i++) {
        iterationsPerSecond++;
    }
}
Since it doesn't know that another thread is also messing with iterationsPerSecond, the compiler could fold the for loop, because it can calculate the loop's outcome:
while (true) {
    iterationsPerSecond = Long.MAX_VALUE;
}
And it could even decide to pull the write out of the loop, since the same value is written every time (loop-invariant code motion):
iterationsPerSecond = Long.MAX_VALUE;
while (true) {
}
It could even decide to throw away the store, because it doesn't know there are any readers. Effectively it is a dead store, so the JIT can apply dead-code elimination:
while (true){
}
An atomic or a volatile would solve the problem, because a happens-before edge is established. Using a volatile field or AtomicLong.get/set is equally expensive: they impose the same compiler restrictions and the same fences at the hardware level.
If you want to run microbenchmarks, I would suggest checking out JMH. It will protect you against a lot of trivial mistakes.
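For reference, a minimal JMH skeleton could look like the following (this assumes the jmh-core and jmh-generator-annprocess dependencies are on the classpath; the class and method names are made up):
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@State(Scope.Thread)
public class CounterBenchmark {

    private long counter;

    @Benchmark
    public long increment() {
        return ++counter;   // returning the value keeps the JIT from treating it as dead code
    }

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(CounterBenchmark.class.getSimpleName())
                .forks(1)
                .build();
        new Runner(options).run();
    }
}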
Simple question, which I've been wondering. Of the following two versions of code, which is better optimized? Assume that the time value resulting from the System.currentTimeMillis() call only needs to be pretty accurate, so caching should only be considered from a performance point of view.
This (with value caching):
long time = System.currentTimeMillis();
for (long timestamp : times) {
if (time - timestamp > 600000L) {
// Do something
}
}
Or this (no caching):
for (long timestamp : times) {
if (System.currentTimeMillis() - timestamp > 600000L) {
// Do something
}
}
I'm assuming System.currentTimeMillis() is already a very optimized and lightweight method call, but let's assume I'll be calling it many, many times in a short period.
How many values must the "times" collection/array contain to justify caching the return value of System.currentTimeMillis() in its own variable?
Is this better to do from a CPU or memory optimization point of view?
A long is basically free. A JVM with a JIT compiler can keep it in a register, and since it's a loop invariant it can even optimize your per-iteration condition to -timestamp > 600000L - time or timestamp < time - 600000L, i.e. the condition becomes a trivial compare between the loop variable and a loop-invariant constant held in a register.
So yes it's obviously more efficient to hoist a function call out of a loop and keep the result in a variable, especially when the optimizer can't do that for you, and especially when the result is a primitive type, not an Object.
Assuming your code is running on a JVM that JITs x86 machine code, System.currentTimeMillis() will probably include at least an rdtsc instruction and some scaling of that result [1]. So the cheapest it can possibly be (on Skylake for example) is a micro-coded 20-uop instruction with a throughput of one per 25 clock cycles (http://agner.org/optimize/).
If your // Do something is simple, like just a few memory accesses that usually hit in cache, or some simpler calculation, or anything else that out-of-order execution can do a good job with, that could be most of the cost of your loop. Unless each loop iteration typically takes multiple microseconds (i.e. time for thousands of instructions on a 4GHz superscalar CPU), hoisting System.currentTimeMillis() out of the loop can probably make a measurable difference. Whether that difference is small or huge will depend on how simple your loop body is.
If you can prove that hoisting it out of your loop won't cause correctness problems, then go for it.
Even with it inside your loop, your thread could still sleep for an unbounded length of time between calling it and doing the work for that iteration. But hoisting it out of the loop makes it more likely that you could actually observe this kind of effect in practice; running more iterations "too late".
Footnote 1: On modern x86, the time-stamp counter runs at a fixed rate, so it's useful as a low-overhead timesource, and less useful for cycle-accurate micro-benchmarking. (Use performance counters for that, or disable turbo / power saving so core clock = reference clock.)
IDK if a JVM would actually go to the trouble of implementing its own time function, though. It might just use an OS-provided time function. On Linux, gettimeofday and clock_gettime are implemented in user-space (with code + scale factor data exported by the kernel into user-space memory, in the VDSO region). So glibc's wrapper just calls that, instead of making an actual syscall.
So clock_gettime can be very cheap compared to an actual system call that switches to kernel mode and back. That can take at least 1800 clock cycles on Skylake, on a kernel with Spectre + Meltdown mitigation enabled.
So yes, it's hopefully safe to assume System.currentTimeMillis() is "very optimized and lightweight", but even rdtsc itself is expensive compared to some loop bodies.
In your case, method calls should always be hoisted out of loops.
System.currentTimeMillis() simply reads a value from OS memory, so it is very cheap (a few nanoseconds), as opposed to System.nanoTime(), which involves a call to hardware, and therefore can be orders of magnitude slower.
I've a requirement to capture the execution time of some code in iterations. I've decided to use a Map<Integer,Long> for capturing this data where Integer(key) is the iteration number and Long(value) is the time consumed by that iteration in milliseconds.
I've written the below java code to compute the time taken for each iteration. I want to ensure that the time taken by all iterations is zero before invoking actual code. Surprisingly, the below code behaves differently for every execution.
Sometimes, I get the desired output(zero millisecond for all iterations), but at times I do get positive and even negative values for some random iterations.
I've tried replacing System.currentTimeMillis(); with the code below:
new java.util.Date().getTime();
System.nanoTime();
org.apache.commons.lang.time.StopWatch
but still no luck.
Any suggestions as to why some iterations take additional time and how to eliminate it?
package com.stackoverflow.programmer;

import java.util.HashMap;
import java.util.Map;

public class TestTimeConsumption {

    public static void main(String[] args) {
        Integer totalIterations = 100000;
        Integer nonZeroMilliSecondsCounter = 0;
        Map<Integer, Long> timeTakenMap = new HashMap<>();
        for (Integer iteration = 1; iteration <= totalIterations; iteration++) {
            timeTakenMap.put(iteration, getTimeConsumed(iteration));
            if (timeTakenMap.get(iteration) != 0) {
                nonZeroMilliSecondsCounter++;
                System.out.format("Iteration %6d has taken %d millisecond(s).\n", iteration,
                        timeTakenMap.get(iteration));
            }
        }
        System.out.format("Total non zero entries : %d", nonZeroMilliSecondsCounter);
    }

    private static Long getTimeConsumed(Integer iteration) {
        long startTime = System.currentTimeMillis();
        // Execute code for which execution time needs to be captured
        long endTime = System.currentTimeMillis();
        return (endTime - startTime);
    }
}
Here's the sample output from 5 different executions of the same code:
Execution #1 (NOT OK)
Iteration 42970 has taken 1 millisecond(s).
Total non zero entries : 1
Execution #2 (OK)
Total non zero entries : 0
Execution #3 (OK)
Total non zero entries : 0
Execution #4 (NOT OK)
Iteration 65769 has taken -1 millisecond(s).
Total non zero entries : 1
Execution #5 (NOT OK)
Iteration 424 has taken 1 millisecond(s).
Iteration 33053 has taken 1 millisecond(s).
Iteration 76755 has taken -1 millisecond(s).
Total non zero entries : 3
I am looking for a Java based solution that ensures that all
iterations consume zero milliseconds consistently. I prefer to
accomplish this using pure Java code without using a profiler.
Note: I was also able to accomplish this through C code.
Your HashMap performance may be dropping if it is resizing. The default capacity is 16, which you are exceeding. If you know the expected capacity up front, create the HashMap with the appropriate size, taking into account the default load factor of 0.75.
If you rerun the iterations without creating a new map, and the Integer key does not start again from zero, you will need to size the map to account for the total of all possible iterations:
int capacity = (int) ((100000/0.75)+1);
Map<Integer, Long> timeTakenMap = new HashMap<>(capacity);
As you are starting to learn here, writing microbenchmarks in Java is not as easy as one would first assume. Everybody gets bitten at some point, even the hardened performance experts who have been doing it for years.
A lot is going on within the JVM and the OS that skews the results, such as GC, hotspot on the fly optimisations, recompilations, clock corrections, thread contention/scheduling, memory contention and cache misses. To name just a few. And sadly these skews are not consistent, and they can very easily dominate a microbenchmark.
To answer your immediate question of why the timings can sometimes go negative: it is because currentTimeMillis is designed to capture wall-clock time and not elapsed time. No wall clock on a computer is perfectly accurate, and there are times when the clock will be adjusted... very possibly backwards. More detail on Java's clocks can be found in the Oracle blog post Inside the Oracle Hotspot VM clocks.
Further details and support of nanoTime verses currentTimeMillis can be read here.
Before continuing with your own benchmark, I strongly recommend that you read "How do I write a correct micro-benchmark in Java?". The quick synopsis is to 1) warm up the JVM before taking results, 2) jump through hoops to avoid dead-code elimination, 3) ensure that nothing else is running on the same machine, but accept that there will be thread scheduling going on (you may even want to pin threads to cores, depending on how far you want to take this), and 4) use a framework specifically designed for microbenchmarking, such as JMH; for quick, lightweight spikes, JUnitMosaic gives good results.
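As a hand-rolled illustration of points 1 and 2 (the iteration counts and the work() method are arbitrary stand-ins for real measured code), warm the JIT up for a few rounds and keep a result the optimizer cannot prove dead:
public class WarmupExample {

    public static void main(String[] args) {
        long sink = 0;                                      // consumed at the end to defeat dead-code elimination
        for (int round = 0; round < 5; round++) {           // treat the early rounds as warm-up; trust only later ones
            long start = System.nanoTime();
            for (int i = 0; i < 1_000_000; i++) {
                sink += work(i);
            }
            long perCall = (System.nanoTime() - start) / 1_000_000;
            System.out.println("round " + round + ": " + perCall + " ns/call");
        }
        System.out.println("(ignore) " + sink);
    }

    static long work(int i) {
        return Integer.toBinaryString(i).length();          // stand-in for the code being measured
    }
}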
I'm not sure if I understand your question.
You're trying to execute a certain set of statements S, and expect the execution time to be zero. You then test this premise by executing it a number of times and verifying the result.
That is a strange expectation to have: anything consumes some time, and possibly more than you expect. Hence, although the test could succeed, that would not prove that no time has been used, since your program is save_time(); execute(S); compare_time(). Even if execute(S) is nothing, your timing is discrete, and as such it is possible that the 'tick' of your wall clock happens to fall right between save_time and compare_time, leading to some time visibly having passed.
As such, I'd expect your C program to behave exactly the same. Have you run that multiple times? What happens when you increase the iterations to over millions? If it still does not occur, then apparently your C compiler has optimized the code in such a way that no time is measured, and apparently, Java doesn't.
Or am I understanding you wrong?
You hint at it yourself... System.currentTimeMillis() is the way to go in this case.
There is no guarantee that incrementing the integer object i corresponds to a millisecond or a cycle time on any system...
You should take System.currentTimeMillis() and calculate the elapsed time.
Example:
public static void main(String[] args) {
    long lapsedTime = System.currentTimeMillis();
    doFoo();
    lapsedTime -= System.currentTimeMillis();
    System.out.println("Time:" + -lapsedTime);
}
I am also not sure I understand exactly: you're trying to execute certain code and to capture the execution time for each iteration.
If I understand correctly, I would suggest using System.nanoTime() instead of System.currentTimeMillis(), because if your block of statements does very little work you will always get zero milliseconds.
A simple example could be:
public static void main(String[] args) {
    long lapsedTime = System.nanoTime();
    // do your stuff here.
    lapsedTime -= System.nanoTime();
    System.out.println("Time Taken" + -lapsedTime);
}
System.nanoTime() and System.currentTimeMillis() are not fundamentally different; it is just a question of how accurate a result you need. With millisecond resolution you may get zero if your set of statements does very little work in each iteration.
I have just implemented a threaded version of merge sort. ThreadedMerge.java: http://pastebin.com/5ZEvU6BV
Since merge sort is a divide-and-conquer algorithm, I create a thread for each half of the array. But the number of available threads in the Java VM is limited, so I check that before creating threads:
if (num <= nrOfProcessors) {
    num += 2;
    // create more threads
} else {
    // continue without threading
}
However the threaded sorting takes about ~ 6000 ms while the non-threaded version is much faster with just ~ 2500 ms.
Non-Threaded: http://pastebin.com/7FdhZ4Fw
Why is the threaded version slower and how do I solve that problem?
Update: I now use an AtomicInteger for thread counting and declared a static field for Runtime.getRuntime().availableProcessors(). The sorting takes about ~1400 ms now.
However, creating just one thread in the mergeSort method and letting the current thread do the rest gives no significant performance increase. Why?
Besides, when I call join on a thread and afterwards decrement the number of used threads with
num.set(num.intValue() - 1);
the sorting takes about ~200 ms longer. Here is the updated version of my algorithm: http://pastebin.com/NTZq5zQp. Why does this line of code make it even worse?
First off, your accesses to num are not thread-safe (check http://download.oracle.com/javase/6/docs/api/java/util/concurrent/atomic/AtomicInteger.html).
You create as many threads as there are cores, but you block half of them with the join call:
num += 1;
ThreadedMerge tm1 = new ThreadedMerge(array, startIndex, startIndex + halfLength);
tm1.start();
sortedRightPart = mergeSort(array, startIndex + halfLength, endIndex);
try {
    tm1.join();
    num -= 1;
    sortedLeftPart = tm1.list;
} catch (InterruptedException e) {
}
This doesn't block the calling thread but uses it to sort the right part while the created thread does the other part; when that thread finishes, the slot it took up can be used by another thread.
Hmm, you should not create a thread for every single step (threads are expensive and there are lightweight alternatives).
Ideally, you should only create 4 threads if there are 4 CPUs.
So let's say you have 4 CPUs: you create one thread at the first level of recursion (now you have 2 threads), and at the second level each branch creates one more. This gives you 4.
The reason you only create one thread and not two is that you can use the thread you are currently running on, like:
Thread t = new Thread(...);
t.start();
// Do half of the job here
t.join(); // Wait for the other half to complete.
If you have, let's say, 5 CPUs (not a power of two), then just create 8 threads.
One simple way to do this in practice is to fall back to the un-threaded version you already made once you reach the appropriate recursion level. That way you avoid cluttering the merge method with if statements, etc. A sketch of this approach follows.
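A minimal, self-contained sketch of that scheme (it is not the asker's ThreadedMerge class; it sorts an int[] in place, splits the thread budget in half at each level, and falls back to the sequential path once the budget is used up):
import java.util.Arrays;
import java.util.Random;

public class DepthLimitedMergeSort {

    // Sort a[lo, hi) using at most 'threads' threads, reusing the calling thread
    // for one half so only one extra thread is created per level of recursion.
    static void parallelMergeSort(int[] a, int lo, int hi, int threads) {
        if (hi - lo < 2) return;
        int mid = (lo + hi) >>> 1;
        if (threads <= 1) {
            parallelMergeSort(a, lo, mid, 1);               // budget exhausted: plain recursion
            parallelMergeSort(a, mid, hi, 1);
        } else {
            Thread left = new Thread(() -> parallelMergeSort(a, lo, mid, threads / 2));
            left.start();
            parallelMergeSort(a, mid, hi, threads - threads / 2);   // current thread sorts the right half
            try {
                left.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        merge(a, lo, mid, hi);
    }

    // Standard merge of the two adjacent sorted ranges [lo, mid) and [mid, hi).
    static void merge(int[] a, int lo, int mid, int hi) {
        int[] left = Arrays.copyOfRange(a, lo, mid);
        int i = 0, j = mid, k = lo;
        while (i < left.length && j < hi)
            a[k++] = left[i] <= a[j] ? left[i++] : a[j++];
        while (i < left.length)
            a[k++] = left[i++];
    }

    public static void main(String[] args) {
        int[] data = new Random(42).ints(1_000_000).toArray();
        parallelMergeSort(data, 0, data.length, Runtime.getRuntime().availableProcessors());
        System.out.println("sorted: " + isSorted(data));
    }

    static boolean isSorted(int[] a) {
        for (int i = 1; i < a.length; i++)
            if (a[i - 1] > a[i]) return false;
        return true;
    }
}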
The call to Runtime.availableProcessors() appears to be taking up a fair amount of extra time. You only need to call it once, so just move it outside of the method and define it as a static, e.g.:
static int nrOfProcessors = Runtime.getRuntime().availableProcessors();
I'm dealing with multithreading in Java and, as someone pointed out to me, I noticed that threads warm up, that is, they get faster as they are repeatedly executed. I would like to understand why this happens and whether it is related to Java itself or whether it is a common behavior of every multithreaded program.
The code (by Peter Lawrey) that exemplifies it is the following:
for (int i = 0; i < 20; i++) {
    ExecutorService es = Executors.newFixedThreadPool(1);
    final double[] d = new double[4 * 1024];
    Arrays.fill(d, 1);
    final double[] d2 = new double[4 * 1024];
    es.submit(new Runnable() {
        @Override
        public void run() {
            // nothing.
        }
    }).get();
    long start = System.nanoTime();
    es.submit(new Runnable() {
        @Override
        public void run() {
            synchronized (d) {
                System.arraycopy(d, 0, d2, 0, d.length);
            }
        }
    });
    es.shutdown();
    es.awaitTermination(10, TimeUnit.SECONDS);
    // get at the values in d2.
    for (double x : d2) ;
    long time = System.nanoTime() - start;
    System.out.printf("Time to pass %,d doubles to another thread and back was %,d ns.%n", d.length, time);
}
Results:
Time to pass 4,096 doubles to another thread and back was 1,098,045 ns.
Time to pass 4,096 doubles to another thread and back was 171,949 ns.
... deleted ...
Time to pass 4,096 doubles to another thread and back was 50,566 ns.
Time to pass 4,096 doubles to another thread and back was 49,937 ns.
I.e. it gets faster and stabilises around 50 ns. Why is that?
If I run this code (20 repetitions), then execute something else (let's say post-processing of the previous results and preparation for another multithreading round), and later execute the same Runnable on the same thread pool for another 20 repetitions, will it already be warmed up, in any case?
In my program, I execute the Runnable in just one thread (actually one per processing core I have; it's a CPU-intensive program), alternating with some other serial processing many times. It doesn't seem to get faster as the program goes. Maybe I could find a way to warm it up…
It isn't the threads that are warming up so much as the JVM.
The JVM has what's called JIT (Just In Time) compiling. As the program is running, it analyzes what's happening in the program and optimizes it on the fly. It does this by taking the byte code that the JVM runs and converting it to native code that runs faster. It can do this in a way that is optimal for your current situation, as it does this by analyzing the actual runtime behavior. This can (not always) result in great optimization. Even more so than some programs that are compiled to native code without such knowledge.
You can read a bit more at http://en.wikipedia.org/wiki/Just-in-time_compilation
You could get a similar effect on any program as code is loaded into the CPU caches, but I believe this will be a smaller difference.
The only reasons I see that a thread execution can end up being faster are:
The memory manager can reuse already allocated object space (e.g., to let heap allocations fill up the available memory until the max memory is reached - the Xmx property)
The working set is available in the hardware cache
Repeated operations might create patterns that the compiler can more easily reorder to optimize execution