CopyOnWriteArraySet is too slow


The following program takes around 7 to 8 minutes to run. I can't see what I am doing wrong that would make it take so long.
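import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CopyOnWriteArraySet;

public class Test {
    public static void main(String[] args) {
        // Fill an array with one million distinct boxed integers.
        final Integer[] a = new Integer[1000000];
        for (int i = 0; i < a.length; i++) {
            a[i] = i;
        }
        final List<Integer> source = Arrays.asList(a);
        // The copy constructor is the slow step.
        final Set<Integer> set = new CopyOnWriteArraySet<Integer>(source);
    }
}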
Can someone help me understand why this program is so slow?
My machine is a Core i7 with 4 GB of RAM.

I tested this, and with a List of 1,000,000 elements passed to the constructor it does indeed take a long time (7 minutes).
This is a known issue, filed against OpenJDK on 2013-01-09:
JDK-8005953 - CopyOnWriteArraySet copy constructor is unusable for large collections
The problem is caused by the CopyOnWriteArrayList#addAllAbsent() method, which the CopyOnWriteArraySet constructor invokes. For every element it adds, the method scans the elements already present, so copying n distinct elements costs on the order of n² comparisons.
Extract from the issue:
CopyOnWriteArraySet's copy constructor is too slow for large
collections. It takes over 10 minutes on a developer laptop with just
1 million entries in the collection to be copied...
The issue was closed with the resolution "Won't Fix", and the last comment reads:
addAllAbsent can be made faster for larger input, but it would impact
the performance for small sizes. And it's documented that
CopyOnWriteXXX classes are better suited for collections of small
sizes.
The CopyOnWriteArraySet javadoc does make this point:
It is best suited for applications in which set sizes generally stay
small, read-only operations vastly outnumber mutative operations, and
you need to prevent interference among threads during traversal.
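If what is actually needed is a large thread-safe set, one option that the issue does not mention is a hash-based concurrent set. The sketch below uses ConcurrentHashMap.newKeySet (available since Java 8). The class name LargeConcurrentSetSketch and the build helper are made up for illustration. Its iterators are weakly consistent rather than snapshot-based, so it is not a drop-in replacement for CopyOnWriteArraySet in every case.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class LargeConcurrentSetSketch {

    // Builds a thread-safe set holding the integers 0..n-1 with a hash-based
    // concurrent set instead of CopyOnWriteArraySet (illustrative helper).
    static Set<Integer> build(int n) {
        Set<Integer> set = ConcurrentHashMap.newKeySet(n);
        for (int i = 0; i < n; i++) {
            // Hash lookup per insert; no array copy and no linear scan.
            set.add(i);
        }
        return set;
    }

    public static void main(String[] args) {
        Set<Integer> set = build(1_000_000);
        System.out.println(set.size());
    }
}
```

With one million elements this should finish in well under a second on typical hardware, because each insert is a hash lookup instead of a scan over all existing elements.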

Related

How to prevent heap space error when using large parallel Java 8 stream

How do I effectively parallel my computation of pi (just as an example)?
This works (and takes about 15 seconds on my machine):
Stream.iterate(1d, d->-(d+2*(Math.abs(d)/d))).limit(999999999L).mapToDouble(d->4.0d/d).sum()
But all of the following parallel variants run into an OutOfMemoryError:
DoubleStream.iterate(1d, d->-(d+2*(Math.abs(d)/d))).parallel().limit(999999999L).map(d->4.0d/d).sum();
DoubleStream.iterate(1d, d->-(d+2*(Math.abs(d)/d))).limit(999999999L).parallel().map(d->4.0d/d).sum();
DoubleStream.iterate(1d, d->-(d+2*(Math.abs(d)/d))).limit(999999999L).map(d->4.0d/d).parallel().sum();
So, what do I need to do to get parallel processing of this (large) stream?
I already checked if autoboxing is causing the memory consumption, but it is not. This works also:
DoubleStream.iterate(1, d->-(d+Math.abs(2*d)/d)).boxed().limit(999999999L).mapToDouble(d->4/d).sum()

The problem is that you are using constructs which are hard to parallelize.
First, Stream.iterate(…) creates a sequence of numbers where each calculation depends on the previous value, hence, it offers no room for parallel computation. Even worse, it creates an infinite stream which will be handled by the implementation like a stream with unknown size. For splitting the stream, the values have to be collected into arrays before they can be handed over to other computation threads.
Second, providing a limit(…) doesn’t improve the situation, it makes the situation even worse. Applying a limit removes the size information which the implementation just had gathered for the array fragments. The reason is that the stream is ordered, thus a thread processing an array fragment doesn’t know whether it can process all elements as that depends on the information how many previous elements other threads are processing. This is documented:
“… it can be quite expensive on ordered parallel pipelines, especially for large values of maxSize, since limit(n) is constrained to return not just any n elements, but the first n elements in the encounter order.”
That’s a pity as we perfectly know that the combination of an infinite sequence returned by iterate with a limit(…) actually has an exactly known size. But the implementation doesn’t know. And the API doesn’t provide a way to create an efficient combination of the two. But we can do it ourselves:
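import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.DoubleConsumer;
import java.util.function.DoubleUnaryOperator;
import java.util.stream.DoubleStream;
import java.util.stream.StreamSupport;

static DoubleStream iterate(double seed, DoubleUnaryOperator f, long limit) {
    return StreamSupport.doubleStream(new Spliterators.AbstractDoubleSpliterator(limit,
            Spliterator.ORDERED|Spliterator.SIZED|Spliterator.IMMUTABLE|Spliterator.NONNULL) {
        long remaining = limit;
        double value = seed;
        @Override
        public boolean tryAdvance(DoubleConsumer action) {
            if (remaining == 0) return false;
            double d = value;
            // Compute the next value only if another element will be emitted.
            if (--remaining > 0) value = f.applyAsDouble(d);
            action.accept(d);
            return true;
        }
    }, false);
}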
Once we have such an iterate-with-limit method, we can use it like this:
iterate(1d, d -> -(d+2*(Math.abs(d)/d)), 999999999L).parallel().map(d->4.0d/d).sum()
This still doesn’t benefit much from parallel execution due to the sequential nature of the source, but it works. On my four-core machine it achieved a gain of roughly 20%.
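A different technique from the custom spliterator above is to avoid the sequential source altogether and compute each term from its index. The sketch below assumes the goal is simply the sum of the series 4/1 - 4/3 + 4/5 - ..., which is what the iterate-based pipeline produces. It uses LongStream.range, a sized source that splits evenly across threads. The class and method names are made up for illustration.

```java
import java.util.stream.LongStream;

public class ParallelLeibnizSketch {

    // Sums the first `terms` terms of 4/1 - 4/3 + 4/5 - ... in parallel.
    // Term k is computed from k alone, so no term depends on the previous one.
    static double pi(long terms) {
        return LongStream.range(0, terms)
                .parallel()
                .mapToDouble(k -> (k % 2 == 0 ? 4.0 : -4.0) / (2 * k + 1))
                .sum();
    }

    public static void main(String[] args) {
        System.out.println(pi(100_000_000L));
    }
}
```

Because the range knows its size and needs no limit(…), the pipeline has no reason to buffer elements before handing them to worker threads.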

This may be related to the default ForkJoinPool used by the parallel() method. One approach is to run the pipeline in a custom ForkJoinPool whose parallelism is capped explicitly, for example at the number of available processors:
ForkJoinPool forkJoinPool = new ForkJoinPool(Runtime.getRuntime().availableProcessors());
forkJoinPool.submit(() -> DoubleStream.iterate(1d, d->-(d+2*(Math.abs(d)/d))).parallel().limit(999999999L).map(d->4.0d/d).sum());

How is LongAccumulator implemented, so that it is more efficient?

I understand that Java 8 has introduced new synchronization tools such as LongAccumulator (in the java.util.concurrent.atomic package).
The documentation says that LongAccumulator is more efficient when the variable is updated frequently from several threads.
How is it implemented to be more efficient?

That's a very good question, because it shows a very important characteristic of concurrent programming with shared memory. Before going into details, I have to take a step back. Take a look at the following class:
import java.util.concurrent.atomic.AtomicLong;

class Accumulator {
    private final AtomicLong value = new AtomicLong(0);

    public void accumulate(long value) {
        this.value.addAndGet(value);
    }

    public long get() {
        return this.value.get();
    }
}
If you create one instance of this class and invoke the method accumulate(1) from one thread in a loop, then the execution will be really fast. However, if you invoke the method on the same instance from two threads, the execution will be about two orders of magnitude slower.
You have to take a look at the memory architecture to understand what happens. Most systems nowadays have non-uniform memory access. In particular, each core has its own L1 cache, which is typically organized into 64-byte cache lines. If a core executes an atomic increment operation on a memory location, it first has to get exclusive access to the corresponding cache line. That is expensive if it does not have exclusive access yet, because it requires coordination with all other cores.
There's a simple and counter-intuitive trick to solve this problem. Take a look at the following class:
class Accumulator {
    private final AtomicLong[] values = {
        new AtomicLong(0),
        new AtomicLong(0),
        new AtomicLong(0),
        new AtomicLong(0),
    };

    public void accumulate(long value) {
        // getMagicValue() is deliberately left undefined here; see the notes below.
        int index = getMagicValue();
        this.values[index % values.length].addAndGet(value);
    }

    public long get() {
        long result = 0;
        for (AtomicLong value : values) {
            result += value.get();
        }
        return result;
    }
}
At first glance, this class seems to be more expensive due to the additional operations. However, it might be several times faster than the first class, because there is a higher probability that the executing core already has exclusive access to the required cache line.
To make this really fast, you have to consider a few more things:
- The different atomic counters should be located on different cache lines. Otherwise you replace one problem with another, namely false sharing. In Java you can use a long[8 * 4] for that purpose, and only use the indexes 0, 8, 16 and 24.
- The number of counters has to be chosen wisely. If there are too few counters, there are still too many cache switches. If there are too many counters, you waste space in the L1 caches.
- The method getMagicValue should return a value with an affinity to the core id.
To sum up, LongAccumulator is more efficient for some use cases because it uses redundant memory for frequent write operations, in order to reduce the number of times that cache lines have to be exchanged between cores. On the other hand, read operations are slightly more expensive, because they have to create a consistent result.
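For reference, using the real class looks roughly like the sketch below: several threads call accumulate, and get() combines the internal cells into one result at the end. The class name, the helper method and the thread and iteration counts are arbitrary choices for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAccumulator;

public class LongAccumulatorSketch {

    // Has `threads` workers each add 1 to a shared accumulator
    // `incrementsPerThread` times, then returns the combined total.
    static long countWithThreads(int threads, int incrementsPerThread) {
        LongAccumulator total = new LongAccumulator(Long::sum, 0L);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    total.accumulate(1);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        // get() folds all internal cells together with the supplied function.
        return total.get();
    }

    public static void main(String[] args) {
        System.out.println(countWithThreads(4, 1_000_000));
    }
}
```

Reading the total only after all writers have finished keeps the slightly more expensive get() off the hot path.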

Judging by the source at
http://codenav.org/code.html?project=/jdk/1.8.0-ea&path=/Source%20Packages/java.util.concurrent.atomic/LongAccumulator.java
it looks like a spin lock.

Java optimization to prevent heapspace out of memory

I have a problem in one particular situation: my program gets an out-of-memory error for heap space.
Let's assume we have two ArrayLists. The first one contains many T objects; the second one contains W objects that are created from the T objects of the first list.
We loop through it in this way (after the loop, the list is no longer needed):
public void funct(ArrayList<T> list)
{
    ArrayList<W> list2 = new ArrayList<W>();
    for (int i = 0 ; i < list.size() ; i++)
    {
        W temp = new W();
        temp.set(list.get(i));
        temp.saveToDB();
        list2.add(temp);
    }
    // more code! from this point on the `list` is useless
}
My code is pretty similar to this, but when list contains a very large number of objects I often get the heap space out-of-memory error (during the for loop). I'd like to solve this problem.
I do not know very well how the GC works in Java, but surely there are many possible optimizations in the previous example.
Since the list is not used anymore after the for loop, my first idea was to change the for loop to a while loop and empty the list as we go through it:
public void funct(ArrayList<T> list)
{
    ArrayList<W> list2 = new ArrayList<W>();
    while (list.size() > 0)
    {
        W temp = new W();
        temp.set(list.remove(0));
        temp.saveToDB();
        list2.add(temp);
    }
    // more code! from this point on the `list` is useless
}
Is this modification useful?
How can I optimize the above code further, and how can I prevent the heap space out-of-memory error? (Increasing the -Xmx and -Xms values is not an option.)

You can try setting both -XX:MaxNewSize and -XX:NewSize to 40% of your -Xmx value.
These parameters will speed up garbage collection, because your object creation rate is high.

It really depends on many things. How big are the W and T objects?
One optimization you could surely do is ArrayList list2 = new ArrayList(list.size());
This way the ArrayList does not need to grow its backing array many times.
That will not make much difference, though. The real problem is probably the size and number of your W and T objects. Have you thought of using different data structures to manage a smaller portion of the objects at a time?

If you did some memory profiling, you would discover that the largest source of heap exhaustion is the W instances, which you retain by adding them to list2. The ArrayList itself adds a very small overhead per object contained (just 4 bytes if properly pre-sized, worst case 8 bytes), so even if you retain list, this cannot matter much.
You will not be able to reduce the heap pressure unless you change your approach so that you no longer retain every W instance created in the loop.
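A minimal sketch of that idea follows, assuming the code after the loop does not need all W objects at once, which the question does not confirm. Each item is converted, handed to a save callback that stands in for saveToDB(), and then dropped. The class name, the convertAndSave helper and the demo values are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

public class NoRetentionSketch {

    // Converts each source item, passes the result to `save`, and keeps no
    // reference to it afterwards, so it becomes eligible for collection.
    static <T, W> int convertAndSave(Iterable<T> source,
                                     Function<T, W> convert,
                                     Consumer<W> save) {
        int processed = 0;
        for (T item : source) {
            W converted = convert.apply(item);
            save.accept(converted);
            processed++;
        }
        return processed;
    }

    // Tiny demonstration with Strings standing in for T and Integers for W.
    static int demo() {
        List<String> source = Arrays.asList("1", "2", "3");
        Function<String, Integer> convert = Integer::valueOf;
        Consumer<Integer> save = w -> { /* stand-in for saveToDB() */ };
        return convertAndSave(source, convert, save);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Only one converted object is reachable from the loop at any time, so peak heap use no longer grows with the number of W instances.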

You continue to reference all the items from the original list:
temp.set(list.get(i)) // you probably store the passed reference somewhere
If the T object is large and you don't need all of its fields, try to use a projection of it:
temp.set( extractWhatINeed( list.get(i) ) )
This will involve creating a new class with fewer fields than T (the return type of the extract method).
Once you no longer reference the original items, they become eligible for GC (as soon as the list itself is no longer referenced).

Multicore Java Program with Native Code

I am using a native C++ library inside a Java program. The Java program is written to make use of many-core systems, but it does not scale: the best speed is with around 6 cores, i.e., adding more cores slows it down. My tests show that the call to the native code itself causes the problem, so I want to make sure that different threads access different instances of the native library, and therefore remove any hidden (memory) dependency between the parallel tasks.
In other words, instead of the static block
static {
System.loadLibrary("theNativeLib");
}
I want multiple instances of the library to be loaded, for each thread dynamically. The main question is if that is possible at all. And then how to do it!
Notes:
- I have implementations in Java 7 fork/join as well as Scala/Akka, so help with either platform is appreciated.
- The parallel tasks are completely independent. In fact, each task may create a couple of new tasks and then terminate; there is no further dependency.
Here is the test program in fork/join style, in which processNatively is basically a bunch of native calls:
class Repeater extends RecursiveTask<Long> {
    final int n;
    final processor mol;

    public Repeater(final int m, final processor o) {
        n = m;
        mol = o;
    }

    @Override
    protected Long compute() {
        processNatively(mol);
        final List<RecursiveTask<Long>> tasks = new ArrayList<>();
        for (int i = n; i < 9; i++) {
            tasks.add(new Repeater(n + 1, mol));
        }
        long count = 1;
        for (final RecursiveTask<Long> task : invokeAll(tasks)) {
            count += task.join();
        }
        return count;
    }
}

private final static ForkJoinPool forkJoinPool = new ForkJoinPool();

public void repeat(processor mol)
{
    final long middle = System.currentTimeMillis();
    final long count = forkJoinPool.invoke(new Repeater(0, mol));
    System.out.println("Count is " + count);
    final long after = System.currentTimeMillis();
    System.out.println("Time elapsed: " + (after - middle));
}
Putting it differently:
If I have N threads that use a native library, what happens if each of them calls System.loadLibrary("theNativeLib"); dynamically, instead of calling it once in a static block? Will they share the library anyway? If yes, how can I fool JVM into seeing it as N different libraries loaded independently? (The value of N is not known statically)

The javadoc for System.loadLibrary states that it's the same as calling Runtime.getRuntime().loadLibrary(name). The javadoc for this loadLibrary (http://docs.oracle.com/javase/7/docs/api/java/lang/System.html#loadLibrary(java.lang.String) ) states that "If this method is called more than once with the same library name, the second and subsequent calls are ignored.", so it seems you can't load the same library more than once. In terms of fooling the JVM into thinking there are multiple instances, I can't help you there.

You need to ensure you don't have a bottleneck on any shared resource. For example, if you have 6 hyper-threaded cores, you may find that 12 threads is optimal, or you might find that 6 threads is optimal (so that each thread has a dedicated core).
If you have a heavy floating point routine, it is likely that hyperthreading will be slower rather than faster.
If you are using all the cache, trying to use more can slow your system down. If you are using the limit of CPU to main memory bandwidth, attempting to use more bandwidth can slow your machine.
But then, how can I refer to the different instances? I mean the loaded classes will have the same names and packages, right? What happens in general if you load two dynamic libraries containing classes with the same names and packages?
There is only one instance; you cannot load a DLL more than once. If you want a different data set for each thread, you need to construct it outside the library and pass it in, so that each thread works on different data.
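One way to arrange that on the Java side is to give each thread its own working buffer and pass it into every call. The sketch below is an assumption-laden illustration: the process method is a pure-Java stand-in for a native method (so the example runs without a library), and the class name, buffer layout and arithmetic are hypothetical.

```java
import java.util.stream.IntStream;

public class PerThreadStateSketch {

    // Each thread lazily gets its own scratch buffer; buffers are never shared.
    private static final ThreadLocal<long[]> SCRATCH =
            ThreadLocal.withInitial(() -> new long[2]);

    // Hypothetical stand-in for a native call such as
    //   private static native long process(long[] scratch, int input);
    // It only touches the buffer that the caller passes in.
    static long process(long[] scratch, int input) {
        scratch[0] = input;
        scratch[1] = (long) input * input;
        return scratch[0] + scratch[1];
    }

    // Runs process() for 1..n on parallel worker threads, each using the
    // buffer that belongs to the thread executing the call.
    static long sumInParallel(int n) {
        return IntStream.rangeClosed(1, n)
                .parallel()
                .mapToLong(i -> process(SCRATCH.get(), i))
                .sum();
    }

    public static void main(String[] args) {
        System.out.println(sumInParallel(100));
    }
}
```

This removes sharing only for state that the caller controls. Global state inside the native library itself would still be shared by all threads.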

Optimizing the creation of objects inside loops

Which of the following would be more optimal on a Java 6 HotSpot VM?
final Map<Foo,Bar> map = new HashMap<Foo,Bar>(someNotSoLargeNumber);
for (int i = 0; i < someLargeNumber; i++)
{
    doSomethingWithMap(map);
    map.clear();
}
or
final int someNotSoLargeNumber = ...;
for (int i = 0; i < someLargeNumber; i++)
{
    final Map<Foo,Bar> map = new HashMap<Foo,Bar>(someNotSoLargeNumber);
    doSomethingWithMap(map);
}
I think they're both as clear to the intent, so I don't think style/added complexity is an issue here.
Intuitively it looks like the first one would be better as there's only one 'new'. However, given that no reference to the map is held onto, would HotSpot be able to determine that a map of the same size (Entry[someNotSoLargeNumber] internally) is being created for each loop and then use the same block of memory (i.e. not do a lot of memory allocation, just zeroing that might be quicker than calling clear() for each loop)?
An acceptable answer would be a link to a document describing the different types of optimisations the HotSpot VM can actually do, and how to write code to assist HotSpot (rather than naive attempts at optimising the code by hand).

Don't spend your time on such micro-optimizations unless your profiler says you should. In particular, Sun claims that modern garbage collectors do very well with short-lived objects and that new() becomes cheaper and cheaper:
Garbage collection and performance on DeveloperWorks

That's a pretty tight loop over a "fairly large number", so generally I would say move the instantiation outside of the loop. But, overall, my guess is you aren't going to notice much of a difference as I am willing to bet that your doSomethingWithMap will take up the majority of time to allow the GC to catch up.
