Multicore Java Program with Native Code

I am using a native C++ library inside a Java program. The Java program is written to make use of many-core systems, but it does not scale: the best speed is with around 6 cores, i.e., adding more cores slows it down. My tests show that the call to the native code itself causes the problem, so I want to make sure that different threads access different instances of the native library, and therefore remove any hidden (memory) dependency between the parallel tasks.
In other words, instead of the static block
static {
System.loadLibrary("theNativeLib");
}
I want multiple instances of the library to be loaded, for each thread dynamically. The main question is if that is possible at all. And then how to do it!
Notes:
- I have implementations in Java 7 fork/join as well as Scala/akka. So any help in each platform is appreciated.
- The parallel tasks are completely independent. In fact, each task may create a couple of new tasks and then terminates; no further dependency!
Here is the test program in fork/join style, in which processNatively is basically a bunch of native calls:
class Repeater extends RecursiveTask<Long> {
    final int n;
    final processor mol;

    public Repeater(final int m, final processor o) {
        n = m;
        mol = o;
    }

    @Override
    protected Long compute() {
        processNatively(mol);
        final List<RecursiveTask<Long>> tasks = new ArrayList<>();
        for (int i = n; i < 9; i++) {
            tasks.add(new Repeater(n + 1, mol));
        }
        long count = 1;
        for (final RecursiveTask<Long> task : invokeAll(tasks)) {
            count += task.join();
        }
        return count;
    }
}
private final static ForkJoinPool forkJoinPool = new ForkJoinPool();

public void repeat(processor mol) {
    final long middle = System.currentTimeMillis();
    final long count = forkJoinPool.invoke(new Repeater(0, mol));
    System.out.println("Count is " + count);
    final long after = System.currentTimeMillis();
    System.out.println("Time elapsed: " + (after - middle));
}
Putting it differently:
If I have N threads that use a native library, what happens if each of them calls System.loadLibrary("theNativeLib"); dynamically, instead of calling it once in a static block? Will they share the library anyway? If yes, how can I fool the JVM into seeing it as N different libraries loaded independently? (The value of N is not known statically.)

The javadoc for System.loadLibrary states that it is equivalent to calling Runtime.getRuntime().loadLibrary(name). The javadoc for loadLibrary (http://docs.oracle.com/javase/7/docs/api/java/lang/System.html#loadLibrary(java.lang.String) ) states that "If this method is called more than once with the same library name, the second and subsequent calls are ignored." So it seems you can't load the same library more than once. In terms of fooling the JVM into thinking there are multiple instances, I can't help you there.

You need to ensure you don't have a bottleneck on any shared resources. E.g., say you have 6 hyper-threaded cores: you may find that 12 threads is optimal, or you might find that 6 threads is optimal (with each thread having a dedicated core).
If you have a heavy floating-point routine, it is likely that hyper-threading will be slower rather than faster.
If you are using all the cache, trying to use more can slow your system down. If you are at the limit of CPU-to-main-memory bandwidth, attempting to use more bandwidth can slow your machine down.
But then, how can I refer to the different instances? I mean the loaded classes will have the same names and packages, right? What happens in general if you load two dynamic libraries containing classes with the same names and packages?
There is only one instance; you cannot load a DLL more than once. If you want a different data set for each thread, you need to construct it externally to the library and pass it in, so that each thread works on different data.
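One way to follow that advice on the Java side is a ThreadLocal holding per-thread state, so parallel tasks never touch each other's data. Below is a minimal sketch; the `Processor` class is a hypothetical stand-in for whatever per-thread context you would pass into the native calls, not part of any real library:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PerThreadContext {
    // Hypothetical stand-in for the per-thread state you would hand to the native code.
    static class Processor { /* native handle, buffers, ... */ }

    // Each thread lazily gets its own instance; nothing mutable is shared between threads.
    static final ThreadLocal<Processor> CONTEXT = ThreadLocal.withInitial(Processor::new);

    static int distinctInstances(int nThreads) {
        Set<Processor> seen = ConcurrentHashMap.newKeySet();
        CountDownLatch allStarted = new CountDownLatch(nThreads);
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            for (int i = 0; i < nThreads; i++) {
                pool.submit(() -> {
                    allStarted.countDown();
                    // Hold the thread until all workers exist, so each task runs on its own thread.
                    try { allStarted.await(); } catch (InterruptedException ignored) { }
                    seen.add(CONTEXT.get());
                });
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctInstances(4)); // 4: one Processor per worker thread
    }
}
```

This does not give you multiple copies of the native library, but it does remove sharing at the Java level; the native code must still be able to operate on caller-supplied state rather than library-global state.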

Related

Sequential stream is faster than parallel stream if number of iterations is increased

I measure performance with the example code at the end.
If I call the checkPerformanceResult method with the parameter numberOfTimes set to 100, the parallel stream outperforms the sequential stream significantly (sequential = 346, parallel = 78).
If I set the parameter to 1000, the sequential stream outperforms the parallel stream significantly (sequential = 3239, parallel = 9337).
I did a lot of runs and the result is the same.
Can someone explain this behaviour to me, and what is going on under the hood here?
public class ParallelStreamExample {
    public static long checkPerformanceResult(Supplier<Integer> s, int numberOfTimes) {
        long startTime = System.currentTimeMillis();
        for (int i = 0; i < numberOfTimes; i++) {
            s.get();
        }
        long endTime = System.currentTimeMillis();
        return endTime - startTime;
    }

    public static int sumSequentialStreamThread() {
        IntStream.rangeClosed(1, 10000000).sum();
        return 0;
    }

    public static int sumParallelStreamThread() {
        IntStream.rangeClosed(1, 10000000)
                 .parallel().sum();
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(checkPerformanceResult(ParallelStreamExample::sumSequentialStreamThread, 1000));
        System.out.println("break");
        System.out.println(checkPerformanceResult(ParallelStreamExample::sumParallelStreamThread, 1000));
    }
}
Using threads doesn't always make the code run faster.
When working with threads there is always the overhead of managing each one (the OS assigning CPU time to each thread, keeping track of where each thread resumes after a context switch, etc.).
In this specific case
Each thread created in sumParallelStreamThread does very simple in-memory operations (calling a function that returns a number).
So the difference between sumSequentialStreamThread and sumParallelStreamThread is that in sumParallelStreamThread each simple operation carries the overhead of creating and scheduling threads (assuming there isn't any thread optimization happening in the background).
sumSequentialStreamThread does the same thing without the overhead of managing all the threads; that's why it runs faster.
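The overhead argument above suggests the balance tips the other way once the per-element work dominates the coordination cost. A small sketch (the heavy() loop is an artificial stand-in for real per-element work, not anything from the question):

```java
import java.util.stream.IntStream;

public class WorkloadWeight {
    // Artificially heavy per-element work: a tight numeric loop.
    static long heavy(int n) {
        long x = n;
        for (int i = 0; i < 10_000; i++) x = (x * 31 + i) % 1_000_003;
        return x;
    }

    static long run(boolean parallel) {
        IntStream s = IntStream.rangeClosed(1, 10_000);
        if (parallel) s = s.parallel();
        return s.mapToLong(WorkloadWeight::heavy).sum();
    }

    public static void main(String[] args) {
        // Both variants compute the same result. With heavy per-element work the
        // parallel version usually wins; for a trivial sum, as in the question,
        // the splitting and scheduling overhead can dominate instead.
        System.out.println(run(false) == run(true)); // true
    }
}
```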
When to use threads
The most common use case for working with threads is when you need to perform a bunch of I/O tasks.
What is considered an I/O task?
It depends on several factors; you can find a debate on it here.
But I think people will generally agree that making an HTTP request to somewhere or executing a database query can be considered an I/O operation.
Why is it more suitable?
Because I/O operations usually involve some period of waiting for a response.
For example, when querying a database, the thread performing the query will wait for the database to return the response (even if it's less than half a second). While this thread is waiting, a different thread can perform other actions, and that is where we can gain performance.
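That waiting-overlap effect is easy to demonstrate by simulating an I/O call with a sleep (a sketch; `fakeQuery` is a hypothetical stand-in for a database or HTTP call):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class IoOverlap {
    // Simulates an I/O call: the thread mostly just waits for a "response".
    static String fakeQuery(int id) {
        try { Thread.sleep(100); } catch (InterruptedException ignored) { }
        return "row-" + id;
    }

    static long measureMillis(int nThreads, int nTasks) {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        long start = System.nanoTime();
        List<Future<String>> results = new ArrayList<>();
        for (int i = 0; i < nTasks; i++) {
            final int id = i;
            results.add(pool.submit(() -> fakeQuery(id)));
        }
        try {
            for (Future<String> f : results) f.get(); // wait for all "queries"
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        pool.shutdown();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("1 thread:  " + measureMillis(1, 4) + " ms"); // roughly 400 ms
        System.out.println("4 threads: " + measureMillis(4, 4) + " ms"); // roughly 100 ms
    }
}
```

The four waits overlap on four threads, so the wall-clock time drops to roughly that of a single call.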
I find that running tasks that involve only RAM and CPU operations in multiple threads usually makes the code run slower than with one thread.
Benchmark discussion
Regarding the benchmark remarks I see in the comments: I am not sure whether they are correct or not, but in this type of situation I would double-check my benchmark against a profiling tool (or just use one to begin with), like JProfiler or YourKit; they are usually very accurate.

CopyOnWriteArraySet is too slow

When I ran the following program, it took around 7 to 8 minutes to execute. I am really not sure where I went wrong, given that this program takes so much time to execute.
public class Test {
    public static void main(String[] args) {
        final Integer[] a = new Integer[1000000];
        for (int i = 0; i < a.length; i++) {
            a[i] = i;
        }
        final List<Integer> source = Arrays.asList(a);
        final Set<Integer> set = new CopyOnWriteArraySet<Integer>(source);
    }
}
Can someone help me understand why this program is so slow?
My machine is a Core i7 with 4 GB RAM.
I have tested it, and indeed with a List of 1,000,000 elements provided to the constructor, it takes a long time (7 minutes).
This is a known OpenJDK issue, reported on 2013-01-09:
JDK-8005953 - CopyOnWriteArraySet copy constructor is unusable for large collections
The problem is caused by the CopyOnWriteArrayList#addAllAbsent() method invoked by the CopyOnWriteArraySet constructor.
Extract of the issue :
CopyOnWriteArraySet's copy constructor is too slow for large
collections. It takes over 10 minutes on a developer laptop with just
1 million entries in the collection to be copied...
The resolution status reads: Won't Fix.
And the last message reads:
addAllAbsent can be made faster for larger input, but it would impact
the performance for small sizes. And it's documented that
CopyOnWriteXXX classes are better suited for collections of small
sizes.
The CopyOnWriteArraySet javadoc indeed makes this point:
It is best suited for applications in which set sizes generally stay
small, read-only operations vastly outnumber mutative operations, and
you need to prevent interference among threads during traversal.
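If you need a concurrent Set for a collection this large, a ConcurrentHashMap-backed set avoids the quadratic construction entirely. A sketch, assuming Java 8 (on older JDKs, `Collections.newSetFromMap(new ConcurrentHashMap<...>())` gives the same effect):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class LargeConcurrentSet {
    public static Set<Integer> build(int n) {
        // Backed by ConcurrentHashMap: inserting n elements is roughly O(n),
        // unlike CopyOnWriteArraySet's addAllAbsent-based constructor, which
        // scans the backing array for every element added.
        Set<Integer> set = ConcurrentHashMap.newKeySet(n);
        for (int i = 0; i < n; i++) {
            set.add(i);
        }
        return set;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        Set<Integer> set = build(1_000_000);
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(set.size() + " elements in ~" + ms + " ms");
    }
}
```

The trade-off: you lose CopyOnWriteArraySet's snapshot-style iteration, which is the feature that class exists for in the first place.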

ForkJoinPool vs Plain Recursion

After reading about ForkJoinPool, I tried an experiment to test how fast ForkJoinPool actually is, compared to plain recursion.
I calculated the number of files in a folder recursively, and to my surprise, plain recursion performed way better than ForkJoinPool.
Here's my code.
Recursive Task
class DirectoryTask extends RecursiveTask<Long> {
    private final Directory directory;

    DirectoryTask(Directory directory) {
        this.directory = directory;
    }

    @Override
    protected Long compute() {
        List<RecursiveTask<Long>> forks = new ArrayList<>();
        List<Directory> directories = directory.getDirectories();
        for (Directory directory : directories) {
            DirectoryTask directoryTask = new DirectoryTask(directory);
            forks.add(directoryTask);
            directoryTask.fork();
        }
        Long count = directory.getDoumentCount();
        for (RecursiveTask<Long> task : forks) {
            count += task.join();
        }
        return count;
    }
}
Plain Recursion
private static Long getFileCount(Directory directory) {
    Long recursiveCount = 0L;
    List<Directory> directories = directory.getDirectories();
    if (null != directories) {
        for (Directory d : directories) {
            recursiveCount += getFileCount(d);
        }
    }
    return recursiveCount + directory.getDoumentCount();
}
Directory Object
class Directory {
    private final List<Directory> directories;
    private final Long doumentCount;

    Directory(List<Directory> directories, Long doumentCount) {
        this.directories = directories;
        this.doumentCount = doumentCount;
    }

    List<Directory> getDirectories() {
        return directories;
    }

    Long getDoumentCount() {
        return doumentCount;
    }

    static Directory fromFolder(File file) {
        List<Directory> children = new ArrayList<>();
        Long documentCount = 0L;
        if (!file.isDirectory()) {
            throw new IllegalArgumentException("Only directories are allowed");
        }
        String[] files = file.list();
        if (null != files) {
            for (String path : files) {
                File f = new File(file.getPath() + File.separator + path);
                if (f.isHidden()) {
                    continue;
                }
                if (f.isDirectory()) {
                    children.add(Directory.fromFolder(f));
                } else {
                    documentCount++;
                }
            }
        }
        return new Directory(children, documentCount);
    }
}
Results
Plain Recursion: 3ms
ForkJoinPool: 25ms
Where's the mistake here?
I am just trying to understand whether there is a particular threshold, below which plain recursion is faster than a ForkJoinPool.
Nothing in life comes for free. If you had to move one beer crate from your car to your apartment - what is quicker: carrying it there manually, or first going to the shed, to get the wheelbarrow to use that to move that one crate?
Creating thread objects is a "native" operation that goes down into the underlying Operating System to acquire resources there. That can be a rather costly operation.
Meaning: just throwing "more threads" at a problem doesn't automatically speed things up. To the contrary. When your task is mainly CPU-intensive, there might be small gain from doing things in parallel. When you are doing lots of IO, then having multiple threads allows you to do "less" waiting overall; thus improving your throughput.
In other words: Fork/Join requires considerable activity before it does the real job. Using it for computations that only require a few ms is simply overkill. Thus: you would be looking to "fork/join" operations that work on larger data sets.
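The standard way to apply that advice is a sequential threshold: recurse in parallel only while the slice of work is large, and fall back to a plain loop below the cutoff. A sketch on an array sum (the threshold value is illustrative, not tuned):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ThresholdSum extends RecursiveTask<Long> {
    static final int THRESHOLD = 10_000; // below this, plain looping beats forking
    final long[] data;
    final int lo, hi;

    ThresholdSum(long[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) { // small slice: a plain loop is cheapest
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1; // large slice: split and run halves in parallel
        ThresholdSum left = new ThresholdSum(data, lo, mid);
        ThresholdSum right = new ThresholdSum(data, mid, hi);
        left.fork();                // run left half asynchronously
        return right.compute() + left.join(); // compute right half in this thread
    }

    public static long sum(long[] data) {
        return new ForkJoinPool().invoke(new ThresholdSum(data, 0, data.length));
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(sum(data)); // 500000500000
    }
}
```

With the whole problem below the threshold, this degenerates to exactly the plain recursive solution, which is the behaviour the question observed.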
For further reading, you might look at parallel streams. Those are using the fork/join framework under the covers; and surprise, it is a misconception to expect arbitrary parallelStream to be "faster" than ordinary streams, too.
There are multiple aspects to this:
Is there a difference between serial (e.g. plain recursion) and parallel (e.g. forkjoin) solutions to the same problem?
What is the scope for parallelizing file system access?
What are the traps for measuring performance?
Answer to #1. Yes there is a difference. Parallelism is not good for a problem that is too small. With a parallel solution, you need to account for the overheads of:
creating and managing threads
passing information from the parent thread to the child threads
returning results from the child threads to the parent thread
synchronized access to shared data structures,
waiting for the slowest / last finishing child thread to finish.
How these play out in practice depend on a variety of things ... including the size of the problem, and the opportunities for parallelism.
The answer to #2 is (probably) not as much as you would think. A typical file system is stored on a disk drive that has physical characteristics such as disk rotation and disk head seeking. These typically become the bottleneck, and unless you have a high-end storage system, there is not much scope for parallelism.
The answer to #3 is that there are lots of traps. And those traps can result in very misleading (i.e. invalid) performance results .... if you don't allow for them. One of the biggest traps is that JVMs take time to "warm up"; i.e. load classes, do JIT compilation, resize the heap, and so on.
Another trap that applies to benchmarks that do file system I/O is that a typical OS will do things like caching disk blocks and file / directory metadata. So the second time you access a file or directory it is likely to be faster than the first time.
Having said this, if you have a well-designed, high performance file system (e.g. inodes held on SSDs) and a well designed application, and enough cores, it is possible to achieve extraordinary rates of file system scanning through parallelism. (For example, checking the modification timestamps on half a billion files in under 12 hours ....)

How is LongAccumulator implemented, so that it is more efficient?

I understand that the new Java (8) has introduced new synchronization tools such as LongAccumulator (under the atomic package).
In the documentation it says that the LongAccumulator is more efficient when the variable update from several threads is frequent.
I wonder how is it implemented to be more efficient?
That's a very good question, because it shows a very important characteristic of concurrent programming with shared memory. Before going into details, I have to make a step back. Take a look at the following class:
class Accumulator {
    private final AtomicLong value = new AtomicLong(0);

    public void accumulate(long value) {
        this.value.addAndGet(value);
    }

    public long get() {
        return this.value.get();
    }
}
If you create one instance of this class and invoke the method accumulate(1) from one thread in a loop, the execution will be really fast. However, if you invoke the method on the same instance from two threads, the execution will be about two orders of magnitude slower.
You have to take a look at the memory architecture to understand what happens. Most systems nowadays have non-uniform memory access. In particular, each core has its own L1 cache, which is typically structured into cache lines of 64 bytes. If a core executes an atomic increment operation on a memory location, it first has to get exclusive access to the corresponding cache line. That's expensive if it does not have exclusive access yet, due to the required coordination with all other cores.
There's a simple and counter-intuitive trick to solve this problem. Take a look at the following class:
class Accumulator {
    private final AtomicLong[] values = {
        new AtomicLong(0),
        new AtomicLong(0),
        new AtomicLong(0),
        new AtomicLong(0),
    };

    public void accumulate(long value) {
        int index = getMagicValue();
        this.values[index % values.length].addAndGet(value);
    }

    public long get() {
        long result = 0;
        for (AtomicLong value : values) {
            result += value.get();
        }
        return result;
    }
}
At first glance, this class seems to be more expensive due to the additional operations. However, it might be several times faster than the first class, because it has a higher probability, that the executing core already has exclusive access to the required cache line.
To make this really fast, you have to consider a few more things:
The different atomic counters should be located on different cache lines. Otherwise you replace one problem with another, namely false sharing. In Java you can use a long[8 * 4] for that purpose, and only use the indexes 0, 8, 16 and 24.
The number of counters has to be chosen wisely. If there are too few counters, there are still too many cache-line transfers. If there are too many counters, you waste space in the L1 caches.
The method getMagicValue should return a value with an affinity to the core ID.
To sum up, LongAccumulator is more efficient for some use cases because it uses redundant memory for frequently used write operations, in order to reduce the number of times that cache lines have to be exchanged between cores. On the other hand, read operations are slightly more expensive, because they have to create a consistent result.
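For completeness, here is what using the real LongAccumulator looks like; you supply the combining function (which should be associative and commutative, since updates may be applied in any order across the striped cells) and an identity value:

```java
import java.util.concurrent.atomic.LongAccumulator;
import java.util.stream.LongStream;

public class AccumulatorDemo {
    static long[] run() {
        // Sum and max over concurrent updates; contended writes land on
        // striped internal cells instead of one hot memory word.
        LongAccumulator sum = new LongAccumulator(Long::sum, 0);
        LongAccumulator max = new LongAccumulator(Long::max, Long.MIN_VALUE);

        LongStream.rangeClosed(1, 100_000).parallel().forEach(i -> {
            sum.accumulate(i);
            max.accumulate(i);
        });
        // get() folds all cells into one consistent result.
        return new long[] { sum.get(), max.get() };
    }

    public static void main(String[] args) {
        long[] r = run();
        System.out.println(r[0]); // 5000050000
        System.out.println(r[1]); // 100000
    }
}
```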
Judging by this source:
http://codenav.org/code.html?project=/jdk/1.8.0-ea&path=/Source%20Packages/java.util.concurrent.atomic/LongAccumulator.java
it looks like a spin lock.

Java Performance measurement

I am doing some Java performance comparison between my classes, and wondering if there is some sort of Java Performance Framework to make writing performance measurement code easier?
I.e., what I am doing now is trying to measure what effect having a method as "synchronized", as in PseudoRandomUsingSynch.nextInt(), has compared to using an AtomicInteger as my "synchronizer".
So I am trying to measure how long it takes to generate random integers using 3 threads accessing a synchronized method, looping for, say, 10000 times.
I am sure there is a much better way of doing this. Can you please enlighten me? :)
public static void main(String[] args) throws InterruptedException, ExecutionException {
    PseudoRandomUsingSynch rand1 = new PseudoRandomUsingSynch((int) System.currentTimeMillis());
    int n = 3;
    ExecutorService execService = Executors.newFixedThreadPool(n);
    long timeBefore = System.currentTimeMillis();
    for (int idx = 0; idx < 100000; ++idx) {
        Future<Integer> future = execService.submit(rand1);
        Future<Integer> future1 = execService.submit(rand1);
        Future<Integer> future2 = execService.submit(rand1);
        int random1 = future.get();
        int random2 = future1.get();
        int random3 = future2.get();
    }
    long timeAfter = System.currentTimeMillis();
    long elapsed = timeAfter - timeBefore;
    System.out.println("elapsed:" + elapsed);
}
the class
public class PseudoRandomUsingSynch implements Callable<Integer> {
    private int seed;

    public PseudoRandomUsingSynch(int s) { seed = s; }

    public synchronized int nextInt(int n) {
        byte[] s = DonsUtil.intToByteArray(seed);
        SecureRandom secureRandom = new SecureRandom(s);
        return (secureRandom.nextInt() % n);
    }

    @Override
    public Integer call() throws Exception {
        return nextInt((int) System.currentTimeMillis());
    }
}
Regards
Ignoring the question of whether a microbenchmark is useful in your case (Stephen C' s points are very valid), I would point out:
Firstly, don't listen to people who say "it's not that hard". Yes, microbenchmarking on a virtual machine with JIT compilation is difficult. It's actually really difficult to get meaningful and useful figures out of a microbenchmark, and anyone who claims it's not hard is either a supergenius or doing it wrong. :)
Secondly, yes, there are a few such frameworks around. One worth looking at (though it's in a very early pre-release stage) is Caliper, by Kevin Bourrillion and Jesse Wilson of Google. It looks really impressive from a few early looks at it.
More micro-benchmarking advice - micro benchmarks rarely tell you what you really need to know ... which is how fast a real application is going to run.
In your case, I imagine you are trying to figure out if your application will perform better using an Atomic object than using synchronized ... or vice versa. And the real answer is that it most likely depends on factors that a micro-benchmark cannot measure. Things like the probability of contention, how long locks are held, the number of threads and processors, and the amount of extra algorithmic work needed to make atomic update a viable solution.
EDIT - in response to this question.
So is there a way I can measure all of these - probability of contention, duration locks are held, etc.?
In theory yes. Once you have implemented the entire application, it is possible to instrument it to measure these things. But that doesn't give you your answer either, because there isn't a predictive model you can plug these numbers into to give the answer. And besides, you've already implemented the application by then.
But my point was not that measuring these factors allows you to predict performance. (It doesn't!) Rather, it was that a micro-benchmark does not allow you to predict performance either.
In reality, the best approach is to implement the application according to your intuition, and then use profiling as the basis for figuring out where the real performance problems are.
OpenJDK guys have developed a benchmarking tool called JMH:
http://openjdk.java.net/projects/code-tools/jmh/
This provides quite an easy-to-set-up framework, and there are a couple of samples showing how to use it.
http://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/
Nothing can prevent you from writing the benchmark wrong, but they did a great job at eliminating the non-obvious mistakes (such as false sharing between threads, preventing dead code elimination etc).
These guys designed a good JVM measurement methodology so you won't fool yourself with bogus numbers, and then published it as a Python script so you can re-use their smarts -
Statistically Rigorous Java Performance Evaluation (pdf paper)
You probably want to move the loop into the task. As it is, you just start all the threads and almost immediately you're back to single-threaded execution.
Usual microbenchmarking advice: allow for some warm-up. As well as the average, the deviation is interesting. Use System.nanoTime instead of System.currentTimeMillis.
Specific to this problem is how much the threads contend. With a large number of contending threads, CAS loops can perform wasted work. Creating a SecureRandom is probably expensive, as might be System.currentTimeMillis to a lesser extent. I believe SecureRandom should already be thread-safe, if used correctly.
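Following that advice, a minimal hand-rolled harness might look like the sketch below (warm-up rounds, nanoTime, best-of-N, and a crude guard against dead-code elimination). This is only an illustration of the principles; a real harness such as JMH handles many more pitfalls:

```java
import java.util.function.LongSupplier;

public class MiniBench {
    // Runs the task repeatedly: first rounds warm the JIT up, then we time each run.
    public static long bestOfMillis(LongSupplier task, int warmups, int runs) {
        long sink = 0;
        for (int i = 0; i < warmups; i++) {
            sink += task.getAsLong(); // warm-up; result kept so it cannot be eliminated
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();
            sink += task.getAsLong();
            best = Math.min(best, System.nanoTime() - t0);
        }
        if (sink == 42) System.out.println(sink); // keep the accumulated result "live"
        return best / 1_000_000;
    }

    public static void main(String[] args) {
        long ms = bestOfMillis(() -> {
            long s = 0;
            for (int i = 1; i <= 10_000_000; i++) s += i;
            return s;
        }, 5, 10);
        System.out.println("best run: " + ms + " ms");
    }
}
```

Reporting the best (or a distribution) of several timed runs, after warm-up, avoids the most common mistake in the question's code: timing a single cold run with currentTimeMillis.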
In short, you are searching for a "Java unit performance testing tool"?
Use JUnitPerf.
Update: in case it's not clear yet: it also supports concurrent (multithreaded) testing. Here's an extract from the "LoadTest" chapter of the aforementioned link, which includes a code sample:
For example, to create a load test of 10 concurrent users, with each user running the ExampleTestCase.testOneSecondResponse() method for 20 iterations, and with a 1 second delay between the addition of users, use:
int users = 10;
int iterations = 20;
Timer timer = new ConstantTimer(1000);
Test testCase = new ExampleTestCase("testOneSecondResponse");
Test repeatedTest = new RepeatedTest(testCase, iterations);
Test loadTest = new LoadTest(repeatedTest, users, timer);
