I have a simple recursive method, a depth first search. On each call, it checks if it's in a leaf, otherwise it expands the current node and calls itself on the children.
I'm trying to make it parallel, but I notice the following strange (for me) problem.
I measure execution time with System.currentTimeMillis().
When I break the search into a number of subsearches and add the total execution time, I get a bigger number than the sequential search. I only measure execution time, no communication or sync, etc. I would expect to get the same time when I add the times of the subtasks. This happens even if I just run one task after the other, so without threads. If I just break the search into some subtasks and run the subtasks one after the other, I get a bigger time.
If I add the number of method calls for the subtasks, I get the same number as the sequential search. So, basically, in both cases I do the same number of method calls, but I get different times.
I'm guessing there's some overhead on initial method calls or something else caused by a JVM mechanism. Any ideas what could it be?
For example, one sequential search takes around 3300 ms. If I break it into 13 tasks, it takes a total time of 3500ms.
My method looks like this:
private static final int dfs(State state) {
method_calls++;
if(state.isLeaf()){
return 1;
}
State[] children = state.expand();
int result = 0;
for (int i = 0; i < children.length; i++) {
result += dfs(children[i]);
}
return result;
}
Whenever I call it, I do it like this:
for(int i = 0; i < num_tasks; i++){
long start = System.currentTimeMillis();
dfs(tasks[i]);
totalTime += (System.currentTimeMillis() - start);
}
Problem is totalTime increases with num_tasks and I would expect to stay the same because the method_calls variable stays the same.
You should average out the numbers over longer runs. Secondly the precision of currentTimeMillis may not be sufficient, you can try using System.nanoTime().
As in all the programming languages, whenever you call a procedure or a method, you have to push the environment, initialize the new one, execute the programs instructions, return the value on the stack and finally reset the previous environment. It cost a bit! Create a thread cost also more!
I suppose that if you enlarge the researching tree you will have benefit by the parallelization.
Adding system clock time for several threads seems a weird idea. Either you are interested in the time until processing is complete, in which case adding doesn't make sense, or in cpu usage, in which case you should only count when the thread is actually scheduled to execute.
What probably happens is that at least part of the time, more threads are ready to execute than the system has cpu cores, and the scheduler puts one of your threads to sleep, which causes it to take longer to complete. It makes sense that this effect is exacerbated the more threads you use. (Even if your program uses less threads than you have cores, other programs (such as your development environment, ...) might).
If you are interested in CPU usage, you might wish to query ThreadMXBean.getCurrentThreadCpuTime
I'd expect to see Threads used. Something like this:
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class Puzzle {
static volatile long totalTime = 0;
private static int method_calls = 0;
/**
* #param args
*/
public static void main(String[] args) {
final int num_tasks = 13;
final State[] tasks = new State[num_tasks];
ExecutorService threadPool = Executors.newFixedThreadPool(5);
for(int i = 0; i < num_tasks; i++){
threadPool.submit(new DfsRunner(tasks[i]));
}
try {
threadPool.shutdown();
threadPool.awaitTermination(1, TimeUnit.SECONDS);
} catch (InterruptedException e) {
System.out.println("Interrupted");
}
System.out.println(method_calls + " Methods in " + totalTime + "msecs");
}
static final int dfs(State state) {
method_calls++;
if(state.isLeaf()){
return 1;
}
State[] children = state.expand();
int result = 0;
for (int i = 0; i < children.length; i++) {
result += dfs(children[i]);
}
return result;
}
}
With the runnable bit like this:
public class DfsRunner implements Runnable {
private State state;
public DfsRunner(State state) {
super();
this.state = state;
}
#Override
public void run() {
long start = System.currentTimeMillis();
Puzzle.dfs(state);
Puzzle.totalTime += (System.currentTimeMillis() - start);
}
}
Related
My program needs to generate unique labels that consist of a tag, date and time. Something like this:
"myTag__2019_09_05__07_51"
If one tries to generate two labels with the same tag in the same minute, one will receive equal labels, what I cannot allow. I think of adding as an additional suffix result of System.nanoTime() to make sure that each label will be unique (I cannot access all labels previously generated to look for duplicates):
"myTag__2019_09_05__07_51__{System.nanoTime() result}"
Can I trust that each invocation of System.nanoTime() this will produce a different value? I tested it like this on my laptop:
assertNotEquals(System.nanoTime(), System.nanoTime())
and this works. I wonder if I have a guarantee that it will always work.
TLDR; If you only use a single thread on a popular VM on a modern operating system, it may work in practice. But many serious applications use multiple threads and multiple instances of the application, and there won't be any guarantee in that case.
The only guarantee given in the Javadoc for System.nanoTime() is that the resolution of the clock is at least as good as System.currentTimeMillis() - so if you are writing cross-platform code, there is clearly no expectation that the results of nanoTime are unique, as you can call nanoTime() many times per millisecond.
On my OS (Java 11, MacOS) I always get at least one nanosecond difference between successive calls on the same thread (and that is after Integer.MAX_VALUE looks at successive return values); it's possible that there is something in the implementation that guarantees it.
However it is simple to generate duplicate results if you use multiple Threads and have more than 1 physical CPU. Here's code that will show you:
public class UniqueNano {
private static volatile long a = -1, b = -2;
public static void main(String[] args) {
long max = 1_000_000;
new Thread(() -> {
for (int i = 0; i < max; i++) { a = System.nanoTime(); }
}).start();
new Thread(() -> {
for (int i = 0; i < max; i++) { b = System.nanoTime(); }
}).start();
for (int i = 0; i < max; i++) {
if (a == b) {
System.out.println("nanoTime not unique");
}
}
}
}
Also, when you scale your application to multiple machines, you will potentially have the same problem.
Relying on System.nanoTime() to get unique values is not a good idea.
I have two versions of a program with the same purpose: to calculate how many prime numbers there are between 0 and n.
The first version uses concurrency, a Callable class "does the math" and the results are retrieved though a Future array. There are as many created threads as processors in my computer (4).
The second version is implemented via RMI. All four servers are registered in the local host. The servers are working in paralell as well, obviously.
I would expect the second version to be slower than the first, because I guess the network would involve latency and the other version would just run the program concurrently.
However, the RMI version is around twice faster than the paralel version... Why is this happening?!
I didn't paste any code because it'd be huge, but ask for it in case you need it and I'll see what I can do...
EDIT: adding the code. I commented the sections where unnecessary code was to be posted.
Paralell version
public class taskPrimes implements Callable
{
private final long x;
private final long y;
private Long total = new Long(0);
public taskPrimes(long x, long y)
{
this.x = x;
this.y = y;
}
public static boolean isPrime(long n)
{
if (n<=1) return false ;
for (long i=2; i<=Math.sqrt(n); i++)
if (n%i == 0) return false;
return true;
}
public Long call()
{
for (long i=linf; i<=lsup;i++)
if (isPrime(i)) total++;
return total;
}
}
public class paralelPrimes
{
public static void main(String[] args) throws Exception
{
// here some variables...
int nTasks = Runtime.getRuntime().availableProcessors();
ArrayList<Future<Long>> partial = new ArrayList<Future<Long>>();
ThreadPoolExecutor ept = new ThreadPoolExecutor();
for(int i=0; i<nTasks; i++)
{
partial.add(ept.submit(new taskPrimes(x, y))); // x and y are the limits of the range
// sliding window here
}
for(Future<Long> iterator:partial)
try { total += iterator.get(); } catch (Exception e) {}
}
}
RMI version
Server
public class serverPrimes
extends UnicastRemoteObject
implements interfacePrimes
{
public serverPrimes() throws RemoteException {}
#Override
public int primes(int x, int y) throws RemoteException
{
int total = 0;
for(int i=x; i<=y; i++)
if(isPrime(i)) total++;
return total;
}
#Override
public boolean isPrime(int n) throws RemoteException
{
if (n<=1) return false;
for (int i=2; i<=Math.sqrt(n); i++)
if (n%i == 0) return false ;
return true;
}
public static void main(String[] args) throws Exception
{
interfacePrimes RemoteObject1 = new serverPrimes();
interfacePrimes RemoteObject2 = new serverPrimes();
interfacePrimes RemoteObject3 = new serverPrimes();
interfacePrimes RemoteObject4 = new serverPrimes();
Naming.bind("Server1", RemoteObject1);
Naming.bind("Server2", RemoteObject2);
Naming.bind("Server3", RemoteObject3);
Naming.bind("Server4", RemoteObject4);
}
}
Client
public class clientPrimes implements Runnable
{
private int x;
private int y;
private interfacePrimes RemoteObjectReference;
private static AtomicInteger total = new AtomicInteger();
public clientPrimes(int x, int y, interfacePrimes RemoteObjectReference)
{
this.x = x;
this.y = y;
this.RemoteObjectReference = RemoteObjectReference;
}
#Override
public void run()
{
try
{
total.addAndGet(RemoteObjectReference.primes(x, y));
}
catch (RemoteException e) {}
}
public static void main(String[] args) throws Exception
{
// some variables here...
int nServers = 4;
ExecutorService e = Executors.newFixedThreadPool(nServers);
double t = System.nanoTime();
for (int i=1; i<=nServers; i++)
{
e.submit(new clientPrimes(xVentana, yVentana, (interfacePrimes)Naming.lookup("//localhost/Server"+i)));
// sliding window here
}
e.shutdown();
while(!e.isTerminated());
t = System.nanoTime()-t;
}
}
One interesting thing to consider is that, by default, the jvm runs in client mode. This means that threads won't span over the cores in the most agressive way. Trying to run the program with -server option can influence the result although, as mentioned, the algorithm design is crucial the concurrent version may have bottlenecks. There is little chance that, given the problem, there is a bottleneck in your algorithm, but it sure needs to be considered.
The rmi version truly runs in parallel because each object runs on a different machine, since this tends to be a processing problem more than a communication problem then the latency plays a non important part.
[UPDATE]
Now that I saw your code lets get into some more details.
You are relying on the ThreadExecutorPool and Future to perform the thread control and synchronization for you, this means (by the documentation) that your running objects will be allocated on an existing thread and once your object finishes its computation the thread will be returned to that pool, on the other hand the Future will check periodically if the computation has finished so it can collect the values.
This scenario would be best fit for some computation that is performed periodically in a way that the ThreadPool could increase performance by having the threads pre-allocated (having the overhead of thread creation only on the first time the threads aren't there).
Your implementation is correct, but it is more centered on the programmer convinience (there is nothing wrong with this, I am always defending this point of view) than on system performance.
The RMI version performs differently due (mainly) of 2 things:
1 - you said you are running on the same machine, most OS will recognize localhost, 127.0.0.1 or even the real self ip address as being its self address and perform some optimizations on the communication, so little overhead from the network here.
2 - the RMI system will create a separate thread for each server object you created (as I mentioned before) and these servers will starting computing as soon as they get called.
Things you should try to experiment:
1 - Try to run your RMI version truly on a network, if you can configure it for 10Mbps would be better to see communication overhead (although, since it is a one shot communication it may have little impact to notice, you could chance you client application to call for the calculation multiple times, and then you see the lattency in the way)
2 - Try to change you parallel implementation to use Threads directly with no Future (you could use Thread.join to monitor execution end) and then use the -server option on the machine (although, sometimes, the JVM performs a check to see if the machine configuration can be truly said to be a server and will decline to move to that profile). The main problem is that if your threads doesn't get to use all the computer cores you won't see any performance improvent. Also try to perform the calculations many time to overcome the overhead of thread creation.
Hope that helps to elucidate the situation :)
Cheers
It depends on how your Algorithms are designed for parallel and concurrent solutions. There is no a criteria where parallel must be better than concurrent or viceversa. By example if your concurrent solution has many synchronized blocks it can drop your performance, in the other case maybe the communication in your parallel algorithm is minimum then there is no overhead on network.
If you can get a copy o the book of Peter Pacheco it can clear some ideas:http://www.cs.usfca.edu/~peter/ipp/
Given the details you provided, it will mostly depend on how large a range you're using, and how efficiently you distribute the work to the servers.
For instance, I'll bet that for a small range N you will probably have no speedup from distributing via RMI. In this case, the RMI (network) overhead will likely outweigh the benefit of distributing over multiple servers. When N becomes large, and with an efficient distribution algorithm, this overhead will become more and more negligible with regards to the actual computation time.
For example, assuming homogenous servers, a relatively efficient distribution could be to tell each server to compute the primes for all the numbers n such that n % P = i, where n <= N, P is the number of servers, i is an index in the range [0, P-1] assigned to each server, and % is the modulo operation.
I have some exercises, and one of them refers to concurrency. This theme is new for me, however I spent 6 hours and finally solve my problem. But my knowledge of corresponding API is poor, so I need advice: is my solution correct or may be there is more appropriate way.
So, I have to implement next interface:
public interface PerformanceTester {
/**
* Runs a performance test of the given task.
* #param task which task to do performance tests on
* #param executionCount how many times the task should be executed in total
* #param threadPoolSize how many threads to use
*/
public PerformanceTestResult runPerformanceTest(
Runnable task,
int executionCount,
int threadPoolSize) throws InterruptedException;
}
where PerformanceTestResult contains total time (how long the whole performance test took in total), minimum time (how long the shortest single execution took) and maximum time (how long the longest single execution took).
So, I learned many new things today - about thread pools, types Executors, ExecutorService, Future, CompletionService etc.
If I had Callable task, I could make next:
Return current time in the end of call() procedure.
Create some data structure (some Map may be) to store start time and Future object, that retuned by fixedThreadPool.submit(task) (do this executionCount times, in loop);
After execution I could just subtract start time from end time for every Future.
(Is this right way in case of Callable task?)
But! I have only Runnable task, so I continued looking. I even create FutureListener implements Callable<Long>, that have to return time, when Future.isDone(), but is seams little crazy for my (I have to double threads count).
So, eventually I noticed CompletionService type with interesting method take(), that Retrieves and removes the Future representing the next completed task, waiting if none are yet present., and very nice example of using ExecutorCompletionService. And there is my solution.
public class PerformanceTesterImpl implements PerformanceTester {
#Override
public PerformanceTestResult runPerformanceTest(Runnable task,
int executionCount, int threadPoolSize) throws InterruptedException {
long totalTime = 0;
long[] times = new long[executionCount];
ExecutorService pool = Executors.newFixedThreadPool(threadPoolSize);
//create list of executionCount tasks
ArrayList<Runnable> solvers = new ArrayList<Runnable>();
for (int i = 0; i < executionCount; i++) {
solvers.add(task);
}
CompletionService<Long> ecs = new ExecutorCompletionService<Long>(pool);
//submit tasks and save time of execution start
for (Runnable s : solvers)
ecs.submit(s, System.currentTimeMillis());
//take Futures one by one in order of completing
for (int i = 0; i < executionCount; ++i) {
long r = 0;
try {
//this is saved time of execution start
r = ecs.take().get();
} catch (ExecutionException e) {
e.printStackTrace();
return null;
}
//put into array difference between current time and start time
times[i] = System.currentTimeMillis() - r;
//calculate sum in array
totalTime += times[i];
}
pool.shutdown();
//sort array to define min and max
Arrays.sort(times);
PerformanceTestResult performanceTestResult = new PerformanceTestResult(
totalTime, times[0], times[executionCount - 1]);
return performanceTestResult;
}
}
So, what can you say? Thanks for replies.
I would use System.nanoTime() for higher resolution timings. You might want to ignroe the first 10,000 tests to ensure the JVM has warmed up.
I wouldn't bother creating a List of Runnable and add this to the Executor. I would instead just add them to the executor.
Using Runnable is not a problem as you get a Future<?> back.
Note: Timing how long the task spends in the queue can make a big difference to the timing. Instead of taking the time from when the task was created you can have the task time itself and return a Long for the time in nano-seconds. How the timing is done should reflect the use case you have in mind.
A simple way to convert a Runnable task into one which times itself.
finla Runnable run = ...
ecs.submit(new Callable<Long>() {
public Long call() {
long start = System.nanoTime();
run.run();
return System.nanoTime() - start;
}
});
There are many intricacies when writing performance tests in the JVM. You probably aren't worried about them as this is an exercise, but if you are this question might have more information:
How do I write a correct micro-benchmark in Java?
That said, there don't seem to be any glaring bugs in your code. You might want to ask this on the lower traffic code-review site if you want a full review of your code:
http://codereview.stackexchange.com
I've got a little bit of work that is easily parallelizable, and I want to use Java threads to split up the work across my four core machine. It's a genetic algorithm applied to the traveling salesman problem. It doesn't sound easily parallelizable, but the first loop is very easily so. The second part where I talk about the actual evolution may or may not be, but I want to know if I'm getting slow down because of the way I'm implementing threading, or if its the algorithm itself.
Also, if anyone has better ideas on how I should be implementing what I'm trying to do, that would be very much appreciated.
In main(), I have this:
final ArrayBlockingQueue<Runnable> queue = new ArrayBlockingQueue<Runnable>(numThreads*numIter);
ThreadPoolExecutor tpool = new ThreadPoolExecutor(numThreads, numThreads, 10, TimeUnit.SECONDS, queue);
barrier = new CyclicBarrier(numThreads);
k.init(tpool);
I have a loop that is done inside of init() and looks like this:
for (int i = 0; i < numCities; i++) {
x[i] = rand.nextInt(width);
y[i] = rand.nextInt(height);
}
That I changed to this:
int errorCities = 0, stepCities = 0;
stepCities = numCities/numThreads;
errorCities = numCities - stepCities*numThreads;
// Split up work, assign to threads
for (int i = 1; i <= numThreads; i++) {
int startCities = (i-1)*stepCities;
int endCities = startCities + stepCities;
// This is a bit messy...
if(i <= numThreads) endCities += errorCities;
tpool.execute(new citySetupThread(startCities, endCities));
}
And here is citySetupThread() class:
public class citySetupThread implements Runnable {
int start, end;
public citySetupThread(int s, int e) {
start = s;
end = e;
}
public void run() {
for (int j = start; j < end; j++) {
x[j] = ThreadLocalRandom.current().nextInt(0, width);
y[j] = ThreadLocalRandom.current().nextInt(0, height);
}
try {
barrier.await();
} catch (InterruptedException ie) {
return;
} catch (BrokenBarrierException bbe) {
return;
}
}
}
The above code is run once in the program, so it was sort of a test case for my threading constructs (this is my first experience with Java threads). I implemented the same sort of thing in a real critical section, specifically the evolution part of the genetic algorithm, whose class is as follows:
public class evolveThread implements Runnable {
int start, end;
public evolveThread(int s, int e) {
start = s;
end = e;
}
public void run() {
// Get midpoint
int n = population.length/2, m;
for (m = start; m > end; m--) {
int i, j;
i = ThreadLocalRandom.current().nextInt(0, n);
do {
j = ThreadLocalRandom.current().nextInt(0, n);
} while(i == j);
population[m].crossover(population[i], population[j]);
population[m].mutate(numCities);
}
try {
barrier.await();
} catch (InterruptedException ie) {
return;
} catch (BrokenBarrierException bbe) {
return;
}
}
}
Which exists in a function evolve() that is called in init() like so:
for (int p = 0; p < numIter; p++) evolve(p, tpool);
Yes I know that's not terribly good design, but for other reasons I'm stuck with it. Inside of evolve is the relevant parts, shown here:
// Threaded inner loop
int startEvolve = popSize - 1,
endEvolve = (popSize - 1) - (popSize - 1)/numThreads;
// Split up work, assign to threads
for (int i = 0; i < numThreads; i++) {
endEvolve = (popSize - 1) - (popSize - 1)*(i + 1)/numThreads + 1;
tpool.execute(new evolveThread(startEvolve, endEvolve));
startEvolve = endEvolve;
}
// Wait for our comrades
try {
barrier.await();
} catch (InterruptedException ie) {
return;
} catch (BrokenBarrierException bbe) {
return;
}
population[1].crossover(population[0], population[1]);
population[1].mutate(numCities);
population[0].mutate(numCities);
// Pick out the strongest
Arrays.sort(population, population[0]);
current = population[0];
generation++;
What I really want to know is this:
What role does the "queue" have? Am I right to create a queue for as many jobs as I think will be executed for all threads in the pool? If the size isn't sufficiently large, I get RejectedExecutionException's. I just decided to do numThreads*numIterations because that's how many jobs there would be (for the actual evolution method that I mentioned earlier). It's weird though.. I shouldn't have to do this if the barrier.await()'s were working, which leads me to...
Am I using the barrier.await() correctly? Currently I have it in two places: inside the run() method for the Runnable object, and after the for loop that executes all the jobs. I would've thought only one would be required, but I get errors if I remove one or the other.
I'm suspicious of contention for the threads, as that is the only thing I can glean from the absurd slowdown (which does scale with the input parameters). I want to know if it is anything to do with how I'm implementing the thread pool and barriers. If not, then I'll have to look inside the crossover() and mutate() methods, I suppose.
First, I think you may have a bug with how you intended to use the CyclicBarrier. Currently you are initializing it with the number of executor threads as the number of parties. You have an additional party, however; the main thread. So I think you need to do:
barrier = new CyclicBarrier(numThreads + 1);
I think this should work, but personally I find it an odd use of the barrier.
When using a worker-queue thread-pool model I find it easier to use a Semaphore or Java's Future model.
For a semaphore:
class MyRunnable implements Runnable {
private final Semaphore sem;
public MyRunnable(Semaphore sem) {
this.sem = sem;
}
public void run() {
// do work
// signal complete
sem.release()
}
}
Then in your main thread:
Semaphore sem = new Semaphore(0);
for (int i = 0; i < numJobs; ++i) {
threadPool.execute(new MyRunnable(sem));
}
sem.acquire(numJobs);
Its really doing the same thing as the barrier, but I find it easier to think about the worker tasks "signaling" that they are done instead of "sync'ing up" with the main thread again.
For example, if you look at the example code in the CyclicBarrier JavaDoc the call to barrier.await() is inside the loop inside the worker. So it is really synching up the multiple long running worker threads and the main thread is not participating in the barrier. Calling barrier.await() at the end of the worker outside the loop is more signaling completion.
As you increase the number of tasks, you increase the overhead using each task adds. This means you want to minimise the number of tasks i.e. the same as the number of cpus you have. For some tasks using double the number of cpus can be better when the work load is not even.
BTW: You don't need a barrier in each task, you can wait for the future of each task to complete by calling get() on each one.
I am trying out the executor service in Java, and wrote the following code to run Fibonacci (yes, the massively recursive version, just to stress out the executor service).
Surprisingly, it will run faster if I set the nThreads to 1. It might be related to the fact that the size of each "task" submitted to the executor service is really small. But still it must be the same number also if I set nThreads to 1.
To see if the access to the shared Atomic variables can cause this issue, I commented out the three lines with the comment "see text", and looked at the system monitor to see how long the execution takes. But the results are the same.
Any idea why this is happening?
BTW, I wanted to compare it with the similar implementation with Fork/Join. It turns out to be way slower than the F/J implementation.
public class MainSimpler {
static int N=35;
static AtomicInteger result = new AtomicInteger(0), pendingTasks = new AtomicInteger(1);
static ExecutorService executor;
public static void main(String[] args) {
int nThreads=2;
System.out.println("Number of threads = "+nThreads);
executor = Executors.newFixedThreadPool(nThreads);
Executable.inQueue = new AtomicInteger(nThreads);
long before = System.currentTimeMillis();
System.out.println("Fibonacci "+N+" is ... ");
executor.submit(new FibSimpler(N));
waitToFinish();
System.out.println(result.get());
long after = System.currentTimeMillis();
System.out.println("Duration: " + (after - before) + " milliseconds\n");
}
private static void waitToFinish() {
while (0 < pendingTasks.get()){
try {
Thread.sleep(1000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
executor.shutdown();
}
}
class FibSimpler implements Runnable {
int N;
FibSimpler (int n) { N=n; }
#Override
public void run() {
compute();
MainSimpler.pendingTasks.decrementAndGet(); // see text
}
void compute() {
int n = N;
if (n <= 1) {
MainSimpler.result.addAndGet(n); // see text
return;
}
MainSimpler.executor.submit(new FibSimpler(n-1));
MainSimpler.pendingTasks.incrementAndGet(); // see text
N = n-2;
compute(); // similar to the F/J counterpart
}
}
Runtime (approximately):
1 thread : 11 seconds
2 threads: 19 seconds
4 threads: 19 seconds
Update:
I notice that even if I use one thread inside the executor service, the whole program will use all four cores of my machine (each core around 80% usage on average). This could explain why using more threads inside the executor service slows down the whole process, but now, why does this program use 4 cores if only one thread is active inside the executor service??
It might be related to the fact that the size of each "task" submitted
to the executor service is really small.
This is certainly the case and as a result you are mainly measuring the overhead of context switching. When n == 1, there is no context switching and thus the performance is better.
But still it must be the same number also if I set nThreads to 1.
I'm guessing you meant 'to higher than 1' here.
You are running into the problem of heavy lock contention. When you have multiple threads, the lock on the result is contended all the time. Threads have to wait for each other before they can update the result and that slows them down. When there is only a single thread, the JVM probably detects that and performs lock elision, meaning it doesn't actually perform any locking at all.
You may get better performance if you don't divide the problem into N tasks, but rather divide it into N/nThreads tasks, which can be handled simultaneously by the threads (assuming you choose nThreads to be at most the number of physical cores/threads available). Each thread then does its own work, calculating its own total and only adding that to a grand total when the thread is done. Even then, for fib(35) I expect the costs of thread management to outweigh the benefits. Perhaps try fib(1000).