I would like to run several tasks in parallel until a certain amount of time has passed. Let us suppose those threads are CPU-heavy and/or may block indefinitely. After the timeout, the threads should be interrupted immediately, and the main thread should continue execution regardless of unfinished or still running tasks.
I've seen a lot of questions asking this, and the answers were always similar, often along the lines of "create thread pool for tasks, start it, join it on timeout"
The problem is between the "start" and "join" parts. As soon as the pool is allowed to run, it may grab CPU and the timeout will not even start until I get it back.
I have tried ExecutorService.invokeAll and found that it does not fully meet the requirements. Example:
long dt = System.nanoTime ();
ExecutorService pool = Executors.newFixedThreadPool (4);
List <Callable <String>> list = new ArrayList <> ();
for (int i = 0; i < 10; i++) {
list.add (new Callable <String> () {
@Override
public String call () throws Exception {
while (true) {
}
}
});
}
System.out.println ("Start at " + (System.nanoTime () - dt) / 1000000 + "ms");
try {
pool.invokeAll (list, 3000, TimeUnit.MILLISECONDS);
}
catch (InterruptedException e) {
}
System.out.println ("End at " + (System.nanoTime () - dt) / 1000000 + "ms");
Start at 1ms
End at 3028ms
This 27 ms delay may not seem too bad, but an infinite loop is rather easy to break out of; the actual program easily experiences ten times that. My expectation is that a timeout request is met with very high accuracy even under heavy load (I'm thinking along the lines of a hardware interrupt, which should always work).
This is a major pain in my particular program, as it needs to heed certain timeouts rather accurately (for instance, around 100 ms, if possible better). However, starting the pool often takes as long as 400 ms until I get control back, pushing past the deadline.
I'm a bit confused why this problem is almost never mentioned. Most of the answers I have seen definitely suffer from this. I suppose it may be acceptable usually, but in my case it's not.
Is there a clean and tested way to go ahead with this issue?
Edited to add:
My program is involved with GC, even though not on a large scale. For testing purposes, I rewrote the above example and found that the results given are very inconsistent, but on average noticeably worse than before.
long dt = System.nanoTime ();
ExecutorService pool = Executors.newFixedThreadPool (40);
List <Callable <String>> list = new ArrayList <> ();
for (int i = 0; i < 10; i++) {
list.add (new Callable <String> () {
@Override
public String call () throws Exception {
String s = "";
while (true) {
s += Long.toString (System.nanoTime ());
if (s.length () > 1000000) {
s = "";
}
}
}
});
}
System.out.println ("Start at " + (System.nanoTime () - dt) / 1000000 + "ms");
try {
pool.invokeAll (list, 1000, TimeUnit.MILLISECONDS);
}
catch (InterruptedException e) {
}
System.out.println ("End at " + (System.nanoTime () - dt) / 1000000 + "ms");
Start at 1ms
End at 1189ms
invokeAll should work just fine. However, it is vital that you write your tasks to properly respond to interrupts. When catching InterruptedException, they should exit immediately. If your code is catching IOException, each such catch-block should be preceded with something like:
} catch (InterruptedIOException e) {
logger.log(Level.FINE, "Interrupted; exiting", e);
return;
}
If you are using Channels, you will want to handle ClosedByInterruptException the same way.
If you perform time-consuming operations that don't catch the above exceptions, you need to check Thread.interrupted() periodically. Obviously, checking more often is better, though there will be a point of diminishing returns. (Meaning, checking it after every single statement in your task probably isn't useful.)
if (Thread.interrupted()) {
logger.fine("Interrupted; exiting");
return;
}
In your example code, your Callable is not checking the interrupt status at all, so my guess is that it never exits. An interrupt does not actually stop a thread; it just signals the thread that it should terminate itself on its own terms.
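To illustrate the point about responding to interrupts, here is a minimal sketch of a CPU-bound task that polls the interrupt flag, run through invokeAll with a timeout. The class and method names (InterruptibleTasks, busyTask, runWithTimeout) are made up for this example, and the pool size and timeout are arbitrary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class InterruptibleTasks {

    /** A CPU-bound task that polls the interrupt flag on every iteration. */
    static Callable<Long> busyTask() {
        return () -> {
            long iterations = 0;
            while (!Thread.currentThread().isInterrupted()) {
                iterations++;
            }
            return iterations; // reached only after the thread was interrupted
        };
    }

    /** Returns true if every task was cancelled and all worker threads exited. */
    static boolean runWithTimeout(int taskCount, long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Callable<Long>> tasks = new ArrayList<>();
        for (int i = 0; i < taskCount; i++) {
            tasks.add(busyTask());
        }
        try {
            // invokeAll cancels (and interrupts) every task still unfinished at the timeout.
            List<Future<Long>> futures =
                    pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);
            pool.shutdown();
            // The workers can only exit because the tasks honor the interrupt.
            boolean terminated = pool.awaitTermination(2, TimeUnit.SECONDS);
            return terminated && futures.stream().allMatch(Future::isCancelled);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("All tasks stopped: " + runWithTimeout(10, 200));
    }
}
```

With the original while (true) { } body instead, the worker threads would keep spinning after the timeout and awaitTermination would time out.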
Using the VM option -XX:+PrintGCDetails, I found that the GC runs more rarely, but with a far larger time delay than expected. That delay just so happens to coincide with the spikes I experienced.
A mundane and sad explanation for the observed behavior.
Related
I have a question. I am learning about Java 8's CompletableFuture. I wrote a dummy example with one method that uses CompletableFuture.runAsync to run a simple for loop from 0 to 10 while, in parallel, a for loop from 0 to 5 runs. In a second method I run the same kind of for loop from 0 to 20. The runAsync method takes longer than the other method. Is that normal?
Shouldn't the asynchronous method last the same or less than the other method?
Here is the code.
public class Sample{
public static void main(String x[]) throws InterruptedException {
runAsync();
System.out.println("========== SECOND TESTS ==========");
runSync();
}
static void runAsync() throws InterruptedException {
long startTimeOne = System.currentTimeMillis();
CompletableFuture<Void> cf = CompletableFuture.runAsync(() -> {
for (int i = 0; i < 10L; i++) {
System.out.println(" Async One");
}
});
for (int i = 0; i < 5; i++) {
System.out.println("two");
}
System.out.println("It is ready One? (1) " + cf.isDone());
System.out.println("It is ready One? (2)" + cf.isDone());
System.out.println("It is ready One? (3)" + cf.isDone());
System.out.println("It is ready One? (4)" + cf.isDone());
System.out.println("It is ready One? (5)" + cf.isDone());
System.out.println("It is ready One? (6)" + cf.isDone());
long estimatedTimeOne = System.currentTimeMillis() - startTimeOne;
System.out.println("Total time async: " + estimatedTimeOne);
}
static void runSync() {
long startTimeTwo = System.currentTimeMillis();
for (int i = 0; i < 20; i++) {
System.out.println("No async");
}
long estimatedTimeTwo = System.currentTimeMillis() - startTimeTwo;
System.out.println("Total time no async: " + estimatedTimeTwo);
}
}
The plain for loop takes 1 millisecond and the runAsync version takes 54 milliseconds.
First, you are violating basic rules mentioned in How do I write a correct micro-benchmark in Java?
Most notably, you’re running both approaches within the same runtime and allow them to affect each other.
Besides that, you are getting an output that is a sequence of messages, which is showing the fundamental problem of your operation: you can not print concurrently. The output system itself has to ensure that the printing will end up showing a sequential behavior.
When you are performing actions that can’t run in parallel through a concurrent framework, you can’t gain performance, you can only add thread communication overhead.
Besides that, the operations are not even the same:
the action you are passing to runAsync uses 10L as end boundary, in other words, is performing a long comparison where all other loops use int
"It is ready One? (6)" + cf.isDone() is performing two operations that do not appear in the sequential variant. First, polling the status of the CompletableFuture, which must be done with inter-thread semantics. Second, it bears string concatenation. Both are potentially expensive operations
The async variant is printing 21 messages whereas the sequential is printing 20. Even the total amount of characters to print is roughly 50% more in the async operation
These points may serve as examples of how easily you can do things wrong in a manual benchmark. But they do not affect the outcome significantly, due to the fundamental aspect mentioned before them. You can’t gain a performance advantage of doing the printing asynchronously at all.
Note that the output is quite consistent in your specific case. Since the common Fork/Join thread pool has not been used before your asynchronous operation, it needs to start a new thread when you submit your job, which takes so long that the subsequent local loop printing "two" completes before the asynchronous operation even starts. The next operation, polling cf.isDone() and performing string concatenation, on the other hand, is so slow, that the asynchronous operation completes entirely before these six print statements complete.
When you change the code to
CompletableFuture<Void> cf = CompletableFuture.runAsync(() -> {
for (int i = 0; i < 10; i++) {
System.out.println("Async One");
}
});
for(int i = 0; i < 10; i++) {
System.out.println("two");
}
cf.join();
you still can’t get a performance advantage, but the performance difference will be much smaller. When you add a statement like
ForkJoinPool.commonPool().execute(() -> System.out.println());
at the beginning of the main method, to ensure that the thread pool does not need to get initialized within the measured method, the perceived overhead may even reduce further.
Further, you may swap the order of the runAsync(); and runSync(); method invocations in the main method, to see how first-time execution effects influence the result when you run the two methods within the same JVM.
This all is not enough to make it a reliable benchmark but should help to understand the things that will go wrong when not understanding the pitfalls of doing a micro-benchmark.
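For contrast with the printing example, here is a rough sketch of work that can actually run in parallel: a CPU-bound sum split between an asynchronous task and the calling thread. The class AsyncSplitSum and its methods are hypothetical, and this is not a benchmark; it only shows the shape of runAsync-style code where both halves can proceed without contending for a shared output stream.

```java
import java.util.concurrent.CompletableFuture;

final class AsyncSplitSum {

    /** Sums the integers in [from, to) sequentially. */
    static long sumRange(long from, long to) {
        long sum = 0;
        for (long i = from; i < to; i++) {
            sum += i;
        }
        return sum;
    }

    /** Sums [0, n) with one half on the common pool and the other on the caller's thread. */
    static long parallelSum(long n) {
        long mid = n / 2;
        CompletableFuture<Long> firstHalf =
                CompletableFuture.supplyAsync(() -> sumRange(0, mid));
        long secondHalf = sumRange(mid, n); // runs while the async half is in progress
        return firstHalf.join() + secondHalf; // join waits for the async result
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(1_000_000));
    }
}
```

Whether this is faster than a single loop still depends on the workload size and the pool start-up cost discussed above.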
I am using Java's concurrency library ExecutorService to run my tasks. The threshold for writing to the database is 200 QPS, however, this program can only reach 20 QPS with 15 threads. I tried 5, 10, 20, 30 threads, and they were even slower than 15 threads. Here is the code:
ExecutorService executor = Executors.newFixedThreadPool(15);
List<Callable<Object>> todos = new ArrayList<>();
for (final int id : ids) {
todos.add(Executors.callable(() -> {
try {
TestObject test = testServiceClient.callRemoteService();
SaveToDatabase();
} catch (Exception ex) {}
}));
}
try {
executor.invokeAll(todos);
} catch (InterruptedException ex) {}
executor.shutdown();
1) I checked the CPU usage of the Linux server on which this program is running, and the usage was 90% and 60% (it has 4 CPUs). The memory usage was only 20%. So the CPU and memory were still fine. The database server's CPU usage was low (around 20%). What could prevent the speed from reaching 200 QPS? Maybe this service call: testServiceClient.callRemoteService()? I checked the server configuration for that call and it allows a high number of calls per second.
2) If the count of id in ids is more than 50000, is it a good idea to use invokeAll? Should we split it to smaller batches, such as 5000 each batch?
There is nothing in this code that prevents this query rate, except that creating and destroying a thread pool repeatedly is very expensive. I suggest using the Streams API, which is not only simpler but also reuses a built-in thread pool.
int[] ids = ....
IntStream.of(ids).parallel()
.forEach(id -> testServiceClient.callRemoteService(id));
Here is a benchmark using a trivial service. The main overhead is the latency in creating the connection.
public static void main(String[] args) throws IOException {
ServerSocket ss = new ServerSocket(0);
Thread service = new Thread(() -> {
try {
for (; ; ) {
try (Socket s = ss.accept()) {
s.getOutputStream().write(s.getInputStream().read());
}
}
} catch (Throwable t) {
t.printStackTrace();
}
});
service.setDaemon(true);
service.start();
for (int t = 0; t < 5; t++) {
long start = System.nanoTime();
int[] ids = new int[5000];
IntStream.of(ids).parallel().forEach(id -> {
try {
Socket s = new Socket("localhost", ss.getLocalPort());
s.getOutputStream().write(id);
s.getInputStream().read();
} catch (IOException e) {
e.printStackTrace();
}
});
long time = System.nanoTime() - start;
System.out.println("Throughput " + (int) (ids.length * 1e9 / time) + " connects/sec");
}
}
prints
Throughput 12491 connects/sec
Throughput 13138 connects/sec
Throughput 15148 connects/sec
Throughput 14602 connects/sec
Throughput 15807 connects/sec
Using an ExecutorService would be better, as @grzegorz-piwowarek mentions.
ExecutorService es = Executors.newFixedThreadPool(8);
for (int t = 0; t < 5; t++) {
long start = System.nanoTime();
int[] ids = new int[5000];
List<Future> futures = new ArrayList<>(ids.length);
for (int id : ids) {
futures.add(es.submit(() -> {
try {
Socket s = new Socket("localhost", ss.getLocalPort());
s.getOutputStream().write(id);
s.getInputStream().read();
} catch (IOException e) {
e.printStackTrace();
}
}));
}
for (Future future : futures) {
future.get();
}
long time = System.nanoTime() - start;
System.out.println("Throughput " + (int) (ids.length * 1e9 / time) + " connects/sec");
}
es.shutdown();
In this case produces much the same results.
Why do you restrict yourself to such a low number of threads?
You're missing performance opportunities this way. It seems that your tasks are really not CPU-bound. The network operations (remote service + database query) may take up the majority of time for each task to finish. During these times, where a single task/thread needs to wait for some event (network,...), another thread can use the CPU. The more threads you make available to the system, the more threads may be waiting for their network I/O to complete while still having some threads use the CPU at the same time.
I suggest you drastically ramp up the number of threads for the executor. As you say that both remote servers are rather under-utilized, I assume the host your program runs at is the bottleneck at the moment. Try to increase (double?) the number of threads until either your CPU utilization approaches 100% or memory or the remote side become the bottleneck.
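As a rough illustration of why more threads help when tasks mostly wait, the sketch below simulates the remote call with Thread.sleep and times the same batch on pools of different sizes. Everything here (IoBoundPool, simulatedRemoteCall, the 50 ms latency) is an assumption made for the example, not a measurement of your service.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class IoBoundPool {

    /** Stand-in for a remote call: the thread waits without using the CPU. */
    static int simulatedRemoteCall(int id, long latencyMillis) {
        try {
            Thread.sleep(latencyMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return id;
    }

    /** Runs taskCount simulated calls on a pool of the given size; returns elapsed millis. */
    static long timeBatch(int threads, int taskCount, long latencyMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Callable<Integer>> tasks = new ArrayList<>();
        for (int i = 0; i < taskCount; i++) {
            final int id = i;
            tasks.add(() -> simulatedRemoteCall(id, latencyMillis));
        }
        long start = System.nanoTime();
        try {
            pool.invokeAll(tasks); // blocks until every task has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("2 threads:  " + timeBatch(2, 20, 50) + " ms");
        System.out.println("20 threads: " + timeBatch(20, 20, 50) + " ms");
    }
}
```

With 2 threads the 20 calls run in 10 rounds of waiting; with 20 threads they can all wait at the same time.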
By the way, you shutdown the executor, but do you actually wait for the tasks to terminate? How do you measure the "QPS"?
One more thing comes to my mind: How are DB connections handled? I.e. how are SaveToDatabase()s synchronized? Do all threads share (and compete for) a single connection? Or, worse, will each thread create a new connection to the DB, do its thing, and then close the connection again? This may be a serious bottleneck because establishing a TCP connection and doing the authentication handshake may take up as much time as running a simple SQL statement.
If the count of id in ids is more than 50000, is it a good idea to use
invokeAll? Should we split it to smaller batches, such as 5000 each
batch?
As @Vaclav Stengl already wrote, the Executors have internal queues in which they enqueue and from which they process the tasks. So no need to worry about that one. You can also just call submit for each single task as soon as you have created it. This allows the first tasks to already start executing while you're still creating/preparing later tasks, which makes sense especially when each task creation takes comparatively long, but won't hurt in all other cases. Think about invokeAll as a convenience method for cases where you already have a collection of tasks. If you create the tasks successively yourself and you already have access to the ExecutorService to run them on, just submit() them as soon as possible.
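A minimal sketch of the submit-as-you-go approach, including the wait for termination asked about earlier. The names SubmitAsYouGo, processAll and handle are placeholders invented for this example; handle stands in for the remote call plus the database write.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class SubmitAsYouGo {

    /** Hypothetical placeholder for the remote call and the database write. */
    static void handle(int id) {
        // intentionally empty in this sketch
    }

    /** Submits one task per id and returns how many tasks completed. */
    static int processAll(int[] ids, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger processed = new AtomicInteger();
        for (int id : ids) {
            // Each task may start running while later tasks are still being created.
            pool.submit(() -> {
                handle(id);
                processed.incrementAndGet();
            });
        }
        pool.shutdown(); // no new tasks are accepted; queued tasks still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow(); // give up on whatever is left
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
        }
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println(processAll(new int[50_000], 8) + " tasks completed");
    }
}
```

The 50,000 tasks simply sit in the executor's queue until a thread is free, so no manual batching is needed here.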
About batch splitting:
ExecutorService has an internal queue for storing tasks. In your case, ExecutorService executor = Executors.newFixedThreadPool(15); has 15 threads, so at most 15 tasks run concurrently and the others are stored in the queue. The size of the queue can be configured; by default it can grow up to Integer.MAX_VALUE. invokeAll internally calls the execute method, which places tasks into the queue when all threads are busy.
IMHO there are two possible reasons why the CPU is not at 100%:
the thread pool is too small: try to enlarge it
threads are waiting for testServiceClient.callRemoteService() to complete, and meanwhile the CPU is starving
The QPS problem may be caused by a bandwidth limit or by transaction execution (which locks the table or row), so simply increasing the pool size will not help. Additionally, you can try the producer-consumer pattern.
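For reference, a bare-bones sketch of the producer-consumer pattern with a bounded BlockingQueue. The class name, the queue capacity of 100 and the POISON_PILL marker are all choices made for this example, and the counter increment stands in for the real database write.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class ProducerConsumerSketch {

    private static final int POISON_PILL = -1; // assumes real items are non-negative

    /** Produces itemCount items, consumes them on the given number of threads. */
    static int run(int itemCount, int consumers) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(100);
        AtomicInteger saved = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(consumers);
        for (int c = 0; c < consumers; c++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        int item = queue.take(); // blocks while the queue is empty
                        if (item == POISON_PILL) {
                            return; // no more work for this consumer
                        }
                        saved.incrementAndGet(); // placeholder for the database write
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        try {
            for (int i = 0; i < itemCount; i++) {
                queue.put(i); // blocks while the queue is full (back-pressure)
            }
            for (int c = 0; c < consumers; c++) {
                queue.put(POISON_PILL); // one stop marker per consumer
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
        }
        return saved.get();
    }

    public static void main(String[] args) {
        System.out.println(run(1000, 4) + " items saved");
    }
}
```

The bounded queue keeps the producer from racing far ahead of the consumers, which is the main difference from handing everything to the executor at once.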
Hello everyone!
I have just created a brute-force bot that uses WebDriver and multithreading to brute-force a 4-digit code. A 4-digit code means String values in the range 0000 to 9999. In my case, after clicking the "submit" button, at least 7 seconds pass before the client gets a response from the server, so I decided to use Thread.sleep(7200) to let the response page load fully. I then realized that I could not afford to wait roughly 9999 * 7.5 seconds for the task to finish, so I had to use multithreading. I have a quad-core AMD machine with one virtual core per hardware core, which lets me run 8 threads simultaneously. I split the 9999 combinations equally among 8 threads, so each got 1249 combinations, plus a remainder thread that starts at the very end. Now the job finishes in 1.5 hours (because the correct code happens to be in the middle of a thread's range). That is much better, but it could be better still. Thread.sleep(7200) is a pure waste of time: my machine could be switching to other threads that are waiting because of the limited number of hardware cores. How can I do this? Any ideas?
Below are two classes to represent my architecture approach:
public class BruteforceBot extends Thread {
// All the necessary implementation, blah-blah
public void run() {
brutforce();
}
private void brutforce() {
initDriver();
int counter = start;
while (counter <= finish) {
try {
webDriver.get(gatewayURL);
webDriver.findElement(By.name("code")).sendKeys(codes.get(counter));
webDriver.findElement(By.name("code")).submit();
Thread.sleep(7200);
String textFound = "";
try {
do {
textFound = Jsoup.parse(webDriver.getPageSource()).text();
//we need to be sure that the page is fully loaded
} while (textFound.contains("XXXXXXXXXXXXX"));
} catch (org.openqa.selenium.JavascriptException je) {
System.err.println("JavascriptException: TypeError: "
+ "document.documentElement is null");
continue;
}
// Test if the page returns XXXXXXXXXXXXX below
if (textFound.contains("XXXXXXXXXXXXXXXx") && !textFound.contains("YYYYYYY")) {
System.out.println("Not " + codes.get(counter));
counter++;
// Test if the page contains "YYYYYYY" string below
} else if (textFound.contains("YYYYYYY")) {
System.out.println("Correct Code is " + codes.get(counter));
botLogger.writeTheLogToFile("We have found it: " + textFound
+ " ... at the code of " + codes.get(counter));
break;
// Test if any other case of response below
} else {
System.out.println("WTF?");
botLogger.writeTheLogToFile("Strange response for code "
+ codes.get(counter));
continue;
}
} catch (InterruptedException intrrEx) {
System.err.println("Interrupted exception: ");
intrrEx.printStackTrace();
}
}
destroyDriver();
} // end of bruteforce() method
And
public class ThreadMaster {
// All the necessary implementation, blah-blah
public ThreadMaster(int amountOfThreadsArgument,
ArrayList<String> customCodes) {
this();
this.codes = customCodes;
this.amountOfThreads = amountOfThreadsArgument;
this.lastCodeIndex = codes.size() - 1;
this.remainderThread = codes.size() % amountOfThreads;
this.scopeOfWorkForASingleThread
= codes.size()/amountOfThreads;
}
public static void runThreads() {
do {
bots = new BruteforceBot[amountOfThreads];
System.out.println("Bots array is populated");
} while (bots.length != amountOfThreads);
for (int j = 0; j <= amountOfThreads - 1;) {
int finish = start + scopeOfWorkForASingleThread;
try {
bots[j] = new BruteforceBot(start, finish, codes);
} catch (Exception e) {
System.err.println("Putting a bot into a threads array failed");
continue;
}
bots[j].start();
start = finish;
j++;
}
try {
for (int j = 0; j <= amountOfThreads - 1; j++) {
bots[j].join();
}
} catch (InterruptedException ie) {
System.err.println("InterruptedException has occured "
+ "while a Bot was joining a thread ...");
ie.printStackTrace();
}
// if there are any codes that are still remain to be tested -
// this last bot/thread will take care of them
if (remainderThread != 0) {
try {
int remainderStart = lastCodeIndex - remainderThread;
int remainderFinish = lastCodeIndex;
BruteforceBot remainderBot
= new BruteforceBot(remainderStart, remainderFinish, codes);
remainderBot.start();
remainderBot.join();
} catch (InterruptedException ie) {
System.err.println("The remainder Bot has failed to "
+ "create or start or join a thread ...");
}
}
}
I need your advice on how to improve the architecture of this app so that it runs successfully with, say, 20 threads instead of 8. My problem is that when I simply remove Thread.sleep(7200) and run 20 Thread instances instead of 8, each thread constantly fails to get a response from the server because it does not wait the 7 seconds for it to arrive. The performance does not just drop, it becomes zero. Which approach would you choose in this case?
P.S.: I order the amount of threads from the main() method:
public static void main(String[] args)
throws InterruptedException, org.openqa.selenium.SessionNotCreatedException {
System.setProperty("webdriver.gecko.driver", "lib/geckodriver.exe");
ThreadMaster tm = new ThreadMaster(8, new CodesGenerator().getListOfCodesFourDigits());
tm.runThreads();
Okay, since nobody should have to wait for an answer to my question, I decided to answer it myself as soon as I could.
If you would like to increase the performance of a Selenium WebDriver-based brute-force bot like this one, you need to stop using Selenium WebDriver. The WebDriver runs as a separate process in the OS; it does not even need a JVM to run. So every instance of the bot was not only a thread managed by my JVM but also a Windows process. This is why I could hardly use my PC when the app was running with more than 8 threads (each thread launched a geckodriver.exe or chromedriver.exe Windows process). What you really need to do to increase the performance of such a bot is to use HtmlUnit instead of Selenium. HtmlUnit is a pure Java framework; its jar can be found on Maven Central, and it can be added as a dependency to your pom.xml. This way, brute-forcing a 4-digit code takes 15 to 20 minutes, even though the website takes at least 7 seconds to respond after each attempt. For comparison, with Selenium WebDriver the task took 90 minutes.
And thanks again to @MartinJames, who pointed out that Thread.sleep() does let the hardware core switch to other threads!
I need to check how many events are detected within 2 seconds. I have the timer working and I have everything else working...but I ran into a problem: the loop only checks one time, per second and I can't seem to figure out how to fix that. I need it to check constantly during these two seconds to see how many events there were in total!
Here is what I have:
int seconds = 0;
System.out.println("Seconds: " + seconds);
while(seconds < 2)
{
//Wait 1 second
try {
Thread.sleep(1000);
}
catch(Exception e) {}
seconds++;
System.out.println("Seconds: " + seconds);
//This needs to be looping the whole time.
//But right now, it's being blocked and only checked once
if(eventDetected() && seconds <= 2){
events++;
}
}
So you can see my problem. I can't split them up because then the second timer would run, and THEN eventDetected() would be checked. I need it to check constantly DURING the two second timer...so I basically need both things to happen at once. Is there any way I can do this?
Thanks for any help ahead of time!
I think your design pattern needs work. I don't know what type of event you're looking to detect, but no matter how short your sleep time is, there's a chance you could miss an event using the current pattern. Here's what I suggest:
Have eventDetected() increment your events counter. That way, you won't miss an event.
Then, you just need a way to turn listening on and off (and perhaps reset the event counter). If you're sure that in your current pattern you are really in a different thread that won't block your eventDetected() method, you could set a flag to check. For example:
When you want to start listening:
listenForEvents = true;
In eventDetected():
if (listenForEvents) { events++; }
When you want to stop listening (for example, after your Thread.sleep() call):
listenForEvents = false;
With multithreading, make sure to watch out for concurrency issues checking and setting the variables, of course.
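One way to address those concurrency concerns, sketched under the assumption that eventDetected() is called from another thread: make the flag volatile and keep the count in an AtomicInteger. The class EventCounter and its method names are invented for this example.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class EventCounter {

    // volatile so that a change made by one thread is visible to the others
    private volatile boolean listenForEvents = false;

    // AtomicInteger so that concurrent increments are not lost
    private final AtomicInteger events = new AtomicInteger();

    /** Resets the counter and starts counting events. */
    void startListening() {
        events.set(0);
        listenForEvents = true;
    }

    /** Stops counting; later events are ignored. */
    void stopListening() {
        listenForEvents = false;
    }

    /** Called by whatever detects an event, possibly from another thread. */
    void eventDetected() {
        if (listenForEvents) {
            events.incrementAndGet();
        }
    }

    int count() {
        return events.get();
    }
}
```

Usage would be startListening(), then Thread.sleep(2000), then stopListening() and count().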
I would tell you what kind of event I have to keep track of but then I'd have to kill you :D
Answered my own question. Hopefully this will help anyone else out who has a similar problem at some point! I looked up multithreading a bit...
I created a new class EventTimer which implements Runnable, with a public field for seconds:
public class EventTimer implements Runnable{
int seconds;
static int timerThreadCount = 0;
Thread t;
public EventTimer() {
timerThreadCount++;
this.seconds = 0;
t = new Thread(this, "Event Timer");
t.start(); // Start the thread
}
@Override
public void run() {
// TODO Auto-generated method stub
while(seconds < 2)
{
//Wait 1 second
try {
Thread.sleep(1000);
}
catch(Exception e) {
System.out.println("Waiting interrupted.");
}
seconds++;
System.out.println("Seconds: " + seconds);
}
}
}
Then I used an instance of the EventTimer, and used a while loop & if statement to solve my problem.
EventTimer t = new EventTimer();
while(t.seconds < 2){
if(eventDetected()) events++;
}
It was actually quite simple! I realize that each iteration of my loop of operation (since the entire code piece above is inside an infinite loop) will create a new EventTimer thread and I will eventually run into memory problems however. How would I close/end a thread after the timer has reached 2 seconds?
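For what it's worth, a thread ends on its own once its run() method returns, so the timer thread above stops after its loop finishes. A possible alternative that needs no extra thread at all is to poll against a deadline, as in this sketch. It assumes the check can be passed in as a BooleanSupplier; the class and method names are made up.

```java
import java.util.function.BooleanSupplier;

final class TimedEventCount {

    /** Polls eventDetected repeatedly for windowMillis and counts how often it returns true. */
    static long countEvents(long windowMillis, BooleanSupplier eventDetected) {
        long deadline = System.nanoTime() + windowMillis * 1_000_000L;
        long events = 0;
        // Comparing the difference to zero stays correct even if nanoTime wraps around.
        while (System.nanoTime() - deadline < 0) {
            if (eventDetected.getAsBoolean()) {
                events++;
            }
        }
        return events;
    }

    public static void main(String[] args) {
        // Hypothetical check that always reports an event, just to show the call.
        System.out.println(countEvents(2000, () -> true) + " positive checks in 2 seconds");
    }
}
```

Note that this keeps one core busy for the whole window, just like the while loop in the code above.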
I am trying out the executor service in Java, and wrote the following code to run Fibonacci (yes, the massively recursive version, just to stress out the executor service).
Surprisingly, it will run faster if I set the nThreads to 1. It might be related to the fact that the size of each "task" submitted to the executor service is really small. But still it must be the same number also if I set nThreads to 1.
To see if the access to the shared Atomic variables can cause this issue, I commented out the three lines with the comment "see text", and looked at the system monitor to see how long the execution takes. But the results are the same.
Any idea why this is happening?
BTW, I wanted to compare it with the similar implementation with Fork/Join. It turns out to be way slower than the F/J implementation.
public class MainSimpler {
static int N=35;
static AtomicInteger result = new AtomicInteger(0), pendingTasks = new AtomicInteger(1);
static ExecutorService executor;
public static void main(String[] args) {
int nThreads=2;
System.out.println("Number of threads = "+nThreads);
executor = Executors.newFixedThreadPool(nThreads);
Executable.inQueue = new AtomicInteger(nThreads);
long before = System.currentTimeMillis();
System.out.println("Fibonacci "+N+" is ... ");
executor.submit(new FibSimpler(N));
waitToFinish();
System.out.println(result.get());
long after = System.currentTimeMillis();
System.out.println("Duration: " + (after - before) + " milliseconds\n");
}
private static void waitToFinish() {
while (0 < pendingTasks.get()){
try {
Thread.sleep(1000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
executor.shutdown();
}
}
class FibSimpler implements Runnable {
int N;
FibSimpler (int n) { N=n; }
@Override
public void run() {
compute();
MainSimpler.pendingTasks.decrementAndGet(); // see text
}
void compute() {
int n = N;
if (n <= 1) {
MainSimpler.result.addAndGet(n); // see text
return;
}
MainSimpler.executor.submit(new FibSimpler(n-1));
MainSimpler.pendingTasks.incrementAndGet(); // see text
N = n-2;
compute(); // similar to the F/J counterpart
}
}
Runtime (approximately):
1 thread : 11 seconds
2 threads: 19 seconds
4 threads: 19 seconds
Update:
I notice that even if I use one thread inside the executor service, the whole program will use all four cores of my machine (each core around 80% usage on average). This could explain why using more threads inside the executor service slows down the whole process, but now, why does this program use 4 cores if only one thread is active inside the executor service??
It might be related to the fact that the size of each "task" submitted
to the executor service is really small.
This is certainly the case, and as a result you are mainly measuring the overhead of context switching. When nThreads == 1, there is no context switching and thus the performance is better.
But still it must be the same number also if I set nThreads to 1.
I'm guessing you meant 'to higher than 1' here.
You are running into the problem of heavy lock contention. When you have multiple threads, the lock on the result is contended all the time. Threads have to wait for each other before they can update the result and that slows them down. When there is only a single thread, the JVM probably detects that and performs lock elision, meaning it doesn't actually perform any locking at all.
You may get better performance if you don't divide the problem into N tasks, but rather divide it into N/nThreads tasks, which can be handled simultaneously by the threads (assuming you choose nThreads to be at most the number of physical cores/threads available). Each thread then does its own work, calculating its own total and only adding that to a grand total when the thread is done. Even then, for fib(35) I expect the costs of thread management to outweigh the benefits. Perhaps try fib(1000).
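A sketch of the coarse-grained idea, with one variation that should be stated clearly: instead of exactly N/nThreads tasks, it expands the recursion a few levels to get a small, fixed set of subproblems, lets each task compute its subproblem sequentially, and only combines the partial results at the end. The class CoarseFib and the split depth of 4 are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class CoarseFib {

    /** Plain sequential recursion; no shared state is touched. */
    static long fib(int n) {
        return n <= 1 ? n : fib(n - 1) + fib(n - 2);
    }

    /** Expands fib(n) so that fib(n) equals the sum of fib(k) over all collected k. */
    static void split(int n, int depth, List<Integer> out) {
        if (depth == 0 || n <= 1) {
            out.add(n);
            return;
        }
        split(n - 1, depth - 1, out);
        split(n - 2, depth - 1, out);
    }

    static long parallelFib(int n, int nThreads) {
        List<Integer> subproblems = new ArrayList<>();
        split(n, 4, subproblems); // at most 16 coarse tasks
        List<Callable<Long>> tasks = new ArrayList<>();
        for (int k : subproblems) {
            tasks.add(() -> fib(k)); // each task keeps its own local total
        }
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            long total = 0;
            for (Future<Long> partial : pool.invokeAll(tasks)) {
                total += partial.get(); // combined once, on the calling thread
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parallelFib(35, 4));
    }
}
```

Because only a handful of tasks are submitted and no atomic variable is updated inside the recursion, the executor overhead stays small compared with the per-call submission in the question.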