Same code block takes different amounts of time to execute - java

I was trying to compare the execution time of similar code blocks.
Sample code and output are below:
import java.util.ArrayList;
import java.util.List;

public class Tester {
    public static void main(String[] args) {
        System.out.println("Run 1");
        List<Integer> list = new ArrayList<>();
        int i = 0;
        long st = System.currentTimeMillis();
        while (++i < 10000) {
            list.add(i); // assumed loop body; the original body is not shown
        }
        System.out.println("Time taken :" + (System.currentTimeMillis() - st));

        System.out.println("Run 2");
        int j = 0;
        List<Integer> list2 = new ArrayList<>();
        long ST = System.currentTimeMillis();
        while (++j < 10000) {
            list2.add(j); // assumed loop body
        }
        System.out.println("Time taken :" + (System.currentTimeMillis() - ST));

        System.out.println("Run 3");
        int k = 0;
        List<Integer> list3 = new ArrayList<>();
        long ST2 = System.currentTimeMillis();
        while (++k < 10000) {
            list3.add(k); // assumed loop body
        }
        System.out.println("Time taken :" + (System.currentTimeMillis() - ST2));
    }
}
Run 1
Time taken :6
Run 2
Time taken :3
Run 3
Time taken :1
Why am I getting different time of execution?

This is probably due to just-in-time (JIT) compilation and HotSpot optimizing the ArrayList code, but you cannot be 100% sure.
Apart from that, your sample size is far too small to be significant.

a) Java code is compiled to bytecode, and some optimizations are applied to it along the way, although this may have nothing to do with what you observed.
b) Each subsequent similar operation runs faster until the JVM is "warmed up" for that operation, for example because of lazy class loading in the JVM or CPU caching.
c) If you want to try benchmarking, check out the Java Microbenchmark Harness (JMH).
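The warm-up effect from (b) can be made visible with a hand-rolled sketch like the one below. It is not a substitute for JMH. The class name WarmupDemo, the methods timeRun and timeRuns, and the round and size values are made up for illustration, and the loop body (filling an ArrayList) is an assumption about what the question was timing.

```java
import java.util.ArrayList;
import java.util.List;

class WarmupDemo {

    // Fills a fresh list with `size` integers and returns the elapsed nanoseconds.
    static long timeRun(int size) {
        long start = System.nanoTime();
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            list.add(i);
        }
        long elapsed = System.nanoTime() - start;
        // Use the list so the work is not trivially dead code.
        if (list.size() != size) {
            throw new IllegalStateException("unexpected size");
        }
        return elapsed;
    }

    // Runs the same block `rounds` times and returns every timing.
    static long[] timeRuns(int rounds, int size) {
        long[] timings = new long[rounds];
        for (int r = 0; r < rounds; r++) {
            timings[r] = timeRun(size);
        }
        return timings;
    }

    public static void main(String[] args) {
        long[] timings = timeRuns(20, 10_000);
        for (int r = 0; r < timings.length; r++) {
            System.out.println("Run " + (r + 1) + ": " + timings[r] / 1_000 + " us");
        }
    }
}
```

The first few runs usually take longer than the later ones, but the exact numbers vary between machines and JVM versions, which is why a harness such as JMH is the better tool for real measurements.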


Java speedup processes / threads

I have a rather big ArrayList.
I have to go through every index and do an expensive calculation.
My first idea to speed it up was to put it into a thread.
It works, but it is still extremely slow. I tinkered with the calculation to make it less expensive, but it's still too slow. The best solution I came up with is basically this one:
public void calculate() {
    calculatePart(0);
    calculatePart(1);
}

public void calculatePart(int offset) {
    new Thread() {
        public void run() {
            int i = offset;
            while (arrayList.size() > i) {
                // Do the calculation
                i += 2;
            }
        }
    }.start();
}
Yet this feels like a lazy, unprofessional solution. That is why I'm asking whether there is a cleaner and even faster solution.
Assuming that processing each element doesn't lead to data races, you can leverage parallelism. To maximize the number of computations running at the same time, give work to each of the processors available in your system.
In Java, you can get the number of processors (cores) available using this:
int parallelism = Runtime.getRuntime().availableProcessors();
The idea is to create a number of threads equal to the number of available processors.
So, if you have 4 processors available, you can create 4 threads and have each one process items at a stride of 4. Suppose you have a list of size 10 that needs to be processed in parallel:
Thread 1 processes items at index 0,4,8
Thread 2 processes items at index 1,5,9
Thread 3 processes items at index 2,6
Thread 4 processes items at index 3,7
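The assignment of indices to threads listed above can be sketched on its own, without starting any threads. The class name StridedPartition and the method indicesFor are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class StridedPartition {

    // Returns the indices that worker `worker` (0-based) handles when
    // `parallelism` workers share a list of `size` elements.
    static List<Integer> indicesFor(int worker, int parallelism, int size) {
        List<Integer> indices = new ArrayList<>();
        for (int i = worker; i < size; i += parallelism) {
            indices.add(i);
        }
        return indices;
    }

    public static void main(String[] args) {
        int parallelism = 4;
        int size = 10;
        for (int worker = 0; worker < parallelism; worker++) {
            System.out.println("Thread " + (worker + 1) + " processes items at index "
                    + indicesFor(worker, parallelism, size));
        }
    }
}
```

Because every index belongs to exactly one worker, the workers never touch the same element, which is what keeps this scheme free of data races as long as the per-element work is independent.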
I tried to simulate your scenario with the following code:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SpeedUpTest {

    public static void main(String[] args) throws InterruptedException, ExecutionException {
        long seqTime, twoThreadTime, multiThreadTime;
        List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);

        // Baseline: process every item on the calling thread.
        long time = System.currentTimeMillis();
        sequentialProcessing(list);
        seqTime = System.currentTimeMillis() - time;

        // Two threads, each handling every second item.
        int parallelism = 2;
        ExecutorService executorService = Executors.newFixedThreadPool(parallelism);
        time = System.currentTimeMillis();
        List<Future<?>> tasks = new ArrayList<>();
        for (int offset = 0; offset < parallelism; offset++) {
            int finalParallelism = parallelism;
            int finalOffset = offset;
            Future<?> task = executorService.submit(() -> {
                int i = finalOffset;
                while (list.size() > i) {
                    try {
                        processItem(i);
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                    i += finalParallelism;
                }
            });
            tasks.add(task);
        }
        for (Future<?> task : tasks) {
            task.get(); // wait for the task to finish
        }
        twoThreadTime = System.currentTimeMillis() - time;
        executorService.shutdown();

        // One thread per available processor.
        parallelism = Runtime.getRuntime().availableProcessors();
        executorService = Executors.newFixedThreadPool(parallelism);
        tasks = new ArrayList<>();
        time = System.currentTimeMillis();
        for (int offset = 0; offset < parallelism; offset++) {
            int finalParallelism = parallelism;
            int finalOffset = offset;
            Future<?> task = executorService.submit(() -> {
                int i = finalOffset;
                while (list.size() > i) {
                    try {
                        processItem(i);
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                    i += finalParallelism;
                }
            });
            tasks.add(task);
        }
        for (Future<?> task : tasks) {
            task.get();
        }
        multiThreadTime = System.currentTimeMillis() - time;
        executorService.shutdown();

        log("Total time for sequential execution : " + seqTime / 1000.0 + " seconds");
        log("Total time for execution with 2 threads: " + twoThreadTime / 1000.0 + " seconds");
        log("Total time for execution with " + parallelism + " threads: " + multiThreadTime / 1000.0 + " seconds");
    }

    private static void log(String msg) {
        System.out.println(msg);
    }

    private static void processItem(int index) throws InterruptedException {
        // Simulated expensive work; 5 seconds per item is an assumption
        // chosen to be consistent with the timings shown below.
        Thread.sleep(5000);
    }

    private static void sequentialProcessing(List<Integer> list) throws InterruptedException {
        for (int i = 0; i < list.size(); i++) {
            processItem(i);
        }
    }
}
Total time for sequential execution : 50.001 seconds
Total time for execution with 2 threads: 25.102 seconds
Total time for execution with 4 threads: 15.002 seconds
Speaking very theoretically:
if you have X elements and your calculation must perform N operations on each one, then
your computer (processor) must perform X*N operations in total. So:
Parallel threads can make it faster only if some of those operations leave the thread waiting (e.g. file or network operations); that time can be used by other threads. But if all operations are pure CPU work (e.g. mathematics) and the thread never waits, the time required to perform X*N operations on a given core stays the same.
Also, each thread must give other threads the ability to take control of the CPU at some point. It happens automatically between method calls or if you have a Thread.yield() call in your code.
As an example, a method like:
public void run() {
    long a = 0;
    for (long i = 1; i < Long.MAX_VALUE; i++) {
        a++; // assumed loop body; the original body is not shown
    }
}
will not give other threads a chance to take control of the CPU until it has fully completed and exited.

looping StringBuilder memory leak

For a project in my data structures class, we have to traverse a graph using both breadth first and depth first traversals. Depth first traversals, in this case, are commonly between 1 and 100,000 nodes long.
The content of each node is a string. Traversing through the graph and finding the solution haven't been a large issue. However, once the traversal is completed, it must be displayed.
To do this, I went back through each node's parent, and added the node's string to a Stack. Still at this point the code runs fine (to my knowledge). The massive slowdown comes when attempting to display these strings.
Given a stack of several tens of thousands of strings, how do I display them? Straight calling System.out.println() takes way too long. Yet using a StringBuilder iteratively somehow just eats up the memory. I've tried to clear out the StringBuilder at a rather arbitrary interval (in a rather shoddy manner), and this seemed to slightly help. Also introducing a small sleep seemed to help as well. Either way, by the time the function is finished printing, either Java has run out of memory, or is incredibly slow and unresponsive.
Is there some way I can restructure this code to output all of the strings in a relatively timely manner (~10 seconds shouldn't be asking too much), without crashing the VM?
int depth = 0;
while(DFSResult.parent != null)
if(in.equals("y")) dfs.push(DFSResult.hash());
DFSResult = DFSResult.parent;
System.out.println("found at a depth of "+depth+"\n\n");
StringBuilder s = new StringBuilder();
int x = 0;
boolean isSure = false;
System.out.println("Are you really sure you want to print out a lineage this long? (y/n)");
if(input.readLine().toLowerCase().equals("y"))isSure = true;
}catch (Exception e) {
s = new StringBuilder();
while(!dfs.empty() && isSure)
if(x % 500 == 0)
{//Flush the stringbuilder
try{Thread.sleep(50);}catch(Exception e){e.printStackTrace();}
s.delete(0,s.length());//supposed 25% efficiency increase over saying s = new StringBuilder()
You don't need to use a StringBuilder in your example.
Instead, just print each string directly as you pop it off the stack inside your loop. No sleeps are needed either; a single line inside the loop is enough.
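As a minimal sketch of printing straight from the stack, assuming the lineage is held in a java.util.ArrayDeque (the question uses a Stack, which works the same way for this purpose). The class name LineagePrinter and the method printAll are made up for illustration.

```java
import java.io.PrintStream;
import java.util.ArrayDeque;
import java.util.Deque;

class LineagePrinter {

    // Pops every string off the stack and prints it on its own line.
    // Returns the number of lines printed.
    static int printAll(Deque<String> stack, PrintStream out) {
        int printed = 0;
        while (!stack.isEmpty()) {
            out.println(stack.pop());
            printed++;
        }
        return printed;
    }

    public static void main(String[] args) {
        Deque<String> stack = new ArrayDeque<>();
        for (int i = 0; i < 50_000; i++) {
            stack.push("node " + i);
        }
        int printed = printAll(stack, System.out);
        System.err.println("Printed " + printed + " lines");
    }
}
```

Nothing is accumulated between iterations, so memory use stays flat no matter how long the lineage is.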
There's no problem with System.out.println() efficiency. Please consider the following code:
public class TestMain {
    public static void main(String[] args) {
        long timeBefore = System.currentTimeMillis();
        for (int i = 0; i < 50000; i++) {
            System.out.println("Value = " + i);
        }
        long timeAfter = System.currentTimeMillis();
        System.out.println("Time elapsed (ms): " + (timeAfter - timeBefore));
    }
}
These are the last lines of output on my machine:
. . .
Value = 49994
Value = 49995
Value = 49996
Value = 49997
Value = 49998
Value = 49999
Time elapsed (ms): 538
As you can see it's super fast. The problem is not in the println() method.
Example of stack usage below:
public static void main(String[] args) {
    long timeBefore1 = System.currentTimeMillis();
    Stack<String> s = new Stack<String>();
    for (int i = 0; i < 50000; i++) {
        s.push("Value = " + i);
    }
    long timeAfter1 = System.currentTimeMillis();
    long timeBefore2 = System.currentTimeMillis();
    while (!s.isEmpty()) {
        System.out.println(s.pop());
    }
    long timeAfter2 = System.currentTimeMillis();
    System.out.println("Time spent on building stack (ms): " + (timeAfter1 - timeBefore1));
    System.out.println("Time spent on reading stack (ms): " + (timeAfter2 - timeBefore2));
}
. . .
Value = 2
Value = 1
Value = 0
Time spent on building stack (ms): 31
Time spent on reading stack (ms): 551

Multi-threaded object creation slower than in a single thread

I have what probably is a basic question. When I create 100 million Hashtables it takes approximately 6 seconds (runtime = 6 seconds per core) on my machine if I do it on a single core. If I do this multi-threaded on 12 cores (my machine has 6 cores that allow hyperthreading) it takes around 10 seconds (runtime = 112 seconds per core).
This is the code I use:
public class Tests
public static void main(String args[])
double start = System.currentTimeMillis();
int nThreads = 12;
double[] runTime = new double[nThreads];
TestsThread[] threads = new TestsThread[nThreads];
int totalJob = 100000000;
int jobsize = totalJob/nThreads;
for(int i = 0; i < threads.length; i++)
threads[i] = new TestsThread(jobsize,runTime, i);
for(int i = 0; i < runTime.length; i++)
System.out.println("Runtime thread:" + i + " = " + (runTime[i]/1000000) + "ms");
double end = System.currentTimeMillis();
System.out.println("Total runtime = " + (end-start) + " ms");
private static void waitThreads(TestsThread[] threads)
for(int i = 0; i < threads.length; i++)
while(threads[i].finished == false)//keep waiting untill the thread is done
//System.out.println("waiting on thread:" + i);
try {
} catch (InterruptedException e) {
import java.util.HashMap;
import java.util.Map;
public class TestsThread extends Thread
int jobSize = 0;
double[] runTime;
boolean finished;
int threadNumber;
TestsThread(int job, double[] runTime, int threadNumber)
this.finished = false;
this.jobSize = job;
this.runTime = runTime;
this.threadNumber = threadNumber;
public void run()
double start = System.nanoTime();
for(int l = 0; l < jobSize ; l++)
double[] test = new double[65];
double end = System.nanoTime();
double difference = end-start;
runTime[threadNumber] += difference;
this.finished = true;
I do not understand why creating the objects simultaneously in multiple threads takes longer per thread than doing it serially in only one thread. If I remove the line where I create the Hashtable, this problem disappears. If anyone could help me with this I would be very grateful.
Update: This problem has an associated bug report and has been fixed with Java 1.7u40. And it was never an issue for Java 1.8 as Java 8 has an entirely different hash table algorithm.
Since you are not using the created objects, that operation will get optimized away. So you’re only measuring the overhead of creating threads, which naturally grows with the number of threads you start.
I have to correct my answer regarding a detail I didn’t know before: there is something special about the classes Hashtable and HashMap. They both invoke sun.misc.Hashing.randomHashSeed(this) in the constructor. In other words, their instances escape during construction, which has an impact on memory visibility. This implies that their construction, unlike, say, that of an ArrayList, cannot be optimized away, and multi-threaded construction slows down due to what happens inside that method (i.e. synchronization).
As said, that’s specific to these classes and of course to this implementation (my setup: 1.7.0_13). For ordinary classes the construction time goes straight to zero for such code.
Here is a more sophisticated benchmark. Watch the difference between DO_HASH_MAP = true and DO_HASH_MAP = false (when false, it creates an ArrayList instead, which has no such special behavior).
import java.util.*;
import java.util.concurrent.*;

public class AllocBench {
    static final int NUM_THREADS = 1;
    static final int NUM_OBJECTS = 100000000 / NUM_THREADS;
    static final boolean DO_HASH_MAP = true;

    public static void main(String[] args) throws InterruptedException, ExecutionException {
        ExecutorService threadPool = Executors.newFixedThreadPool(NUM_THREADS);
        Callable<Long> task = new Callable<Long>() {
            public Long call() {
                return doAllocation(NUM_OBJECTS);
            }
        };
        long startTime = System.nanoTime(), cpuTime = 0;
        for (Future<Long> f : threadPool.invokeAll(Collections.nCopies(NUM_THREADS, task))) {
            cpuTime += f.get(); // accumulate the time measured inside each task
        }
        long time = System.nanoTime() - startTime;
        System.out.println("Number of threads: " + NUM_THREADS);
        System.out.printf("entire allocation required %.03f s%n", time * 1e-9);
        System.out.printf("time x numThreads %.03f s%n", time * 1e-9 * NUM_THREADS);
        System.out.printf("real accumulated cpu time %.03f s%n", cpuTime * 1e-9);
        threadPool.shutdown();
    }

    static long doAllocation(int numObjects) {
        long t0 = System.nanoTime();
        for (int i = 0; i < numObjects; i++)
            if (DO_HASH_MAP) new HashMap<Object, Object>(); else new ArrayList<Object>();
        return System.nanoTime() - t0;
    }
}
What about if you do it on 6 cores? Hyperthreading isn't the exact same as having double the cores, so you might want to try the amount of real cores too.
Also the OS won't necessarily schedule each of your threads to their own cores.
Since all you are doing is measuring the time and churning memory, your bottleneck is likely to be your L3 cache or the bus to main memory. In this case, coordinating the work between threads could be producing so much overhead that it is worse instead of better.
This is too long for a comment, but your inner loop can be just:
long start = System.nanoTime();
for (int l = 0; l < jobSize; l++) {
    Map<String, Integer> test = new HashMap<String, Integer>();
}
// runtime is an AtomicLong for thread safety
runtime.addAndGet(System.nanoTime() - start); // time in nanoseconds
Taking the time can be as slow as creating a HashMap, so you might not be measuring what you think you are if you call the timer too often.
BTW Hashtable is synchronized and you might find using HashMap is faster, and possibly more scalable.
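Put together as a runnable sketch, the idea of one timer pair per thread plus an AtomicLong could look like this. The class name AllocTiming and the method run are made up for illustration, and Thread.join() is used here in place of the question's polling loop; that is a substitution, not what the original code does.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

class AllocTiming {

    // Starts `nThreads` threads that each create `jobSize` maps.
    // Returns the sum of the per-thread elapsed times in nanoseconds.
    static long run(int nThreads, int jobSize) throws InterruptedException {
        AtomicLong runtime = new AtomicLong();
        Thread[] threads = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            threads[t] = new Thread(() -> {
                long start = System.nanoTime(); // one timer call before the loop
                for (int l = 0; l < jobSize; l++) {
                    Map<String, Integer> test = new HashMap<String, Integer>();
                }
                runtime.addAndGet(System.nanoTime() - start); // one timer call after it
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join(); // blocks until the thread is done, no polling or sleeping
        }
        return runtime.get();
    }

    public static void main(String[] args) throws InterruptedException {
        int nThreads = 12;
        long total = run(nThreads, 100_000_000 / nThreads);
        System.out.println("Accumulated runtime = " + total / 1_000_000 + " ms");
    }
}
```

Note that the maps are still unused, so on implementations without the hash seed behavior described above the JIT may remove the allocation entirely.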

Java Parallel File Processing

I have following code:
import java.util.concurrent.* ;
public class Example{
public static void main(String args[]) {
try {
FileOutputStream fos = new FileOutputStream("1.dat");
DataOutputStream dos = new DataOutputStream(fos);
for (int i = 0; i < 200000; i++) {
dos.close(); // Two sample files created
FileOutputStream fos1 = new FileOutputStream("2.dat");
DataOutputStream dos1 = new DataOutputStream(fos1);
for (int i = 200000; i < 400000; i++) {
Exampless.createArray(200000); //Create a shared array
Exampless ex1 = new Exampless("1.dat");
Exampless ex2 = new Exampless("2.dat");
ExecutorService executor = Executors.newFixedThreadPool(2); //Exexuted parallaly to cont number of matches in two file
long startTime = System.nanoTime();
long endTime;
Future<Integer> future1 = executor.submit(ex1);
Future<Integer> future2 = executor.submit(ex2);
int count1 = future1.get();
int count2 = future2.get();
endTime = System.nanoTime();
long duration = endTime - startTime;
System.out.println("duration with threads:"+duration);
System.out.println("Matches: " + (count1 + count2));
startTime = System.nanoTime();;;
endTime = System.nanoTime();
duration = endTime - startTime;
System.out.println("duration without threads:"+duration);
} catch (Exception e) {
System.err.println("Error: " + e.getMessage());
class Exampless implements Callable {
public static int[] arr = new int[20000];
public String _name;
public Exampless(String name) {
this._name = name;
static void createArray(int z) {
for (int i = z; i < z + 20000; i++) { //shared array
arr[i - z] = i;
public Object call() {
try {
int cnt = 0;
FileInputStream fin = new FileInputStream(_name);
DataInputStream din = new DataInputStream(fin); // read file and calculate number of matches
for (int i = 0; i < 20000; i++) {
int c = din.readInt();
if (c == arr[i]) {
return cnt ;
} catch (Exception e) {
System.err.println("Error: " + e.getMessage());
return -1 ;
I am trying to count the number of matches between an array and two files. Although I am running it on two threads, the code is not performing well, because:
(running it on a single thread, file 1 + file 2 reading time) < (file 1 || file 2 reading time with multiple threads).
Can anyone help me solve this? (I have a 2-core CPU and the file size is approx. 1.5 GB.)
In the first case you are reading sequentially one file, byte-by-byte, block-by-block. This is as fast as disk I/O can be, providing the file is not very fragmented. When you are done with the first file, disk/OS finds the beginning of the second file and continues very efficient, linear reading of disk.
In the second case you are constantly switching between the first and the second file, forcing the disk to seek from one place to another. This extra seeking time (approximately 10 ms) is the root of your confusion.
Oh, and you do know that disk access is single-threaded and your task is I/O bound, so there is no way splitting this task across multiple threads could help as long as you're reading from the same physical disk? Your approach could only be justified if:
each thread, except reading from a file, was also performing some CPU intensive or blocking operations, slower by an order of magnitude compared to I/O.
files are on different physical drives (different partition is not enough) or on some RAID configurations
you are using SSD drive
As Tomasz pointed out, you will not get any benefit from multithreading when reading the data from disk. You may get some improvement in speed if you multithread the checks, i.e. you load the data from the files into arrays sequentially and then have the threads execute the checking in parallel. But considering the small size of your files (~80kb) and the fact that you are just comparing ints, I doubt the performance improvement will be worth the effort.
Something that will definitely improve your execution speed is if you do not use readInt(). Since you know you are comparing 20000 ints, you should read all 20000 ints into an array at once for each file (or at least in blocks), rather than calling the readInt() function 20000 times.
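One possible way to read all the ints at once is sketched below. It assumes the files were written with DataOutputStream.writeInt(), which produces big-endian values, and that a file fits in memory. The class name BulkIntReader and its methods are made up for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.file.Files;
import java.nio.file.Path;

class BulkIntReader {

    // Reads the whole file in one call and views it as 32-bit ints.
    // ByteBuffer is big-endian by default, matching DataOutputStream.writeInt().
    static int[] readInts(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        IntBuffer ints = ByteBuffer.wrap(bytes).asIntBuffer();
        int[] result = new int[ints.remaining()];
        ints.get(result);
        return result;
    }

    // Counts the positions at which the file contents equal the reference array.
    static int countMatches(int[] fromFile, int[] reference) {
        int n = Math.min(fromFile.length, reference.length);
        int count = 0;
        for (int i = 0; i < n; i++) {
            if (fromFile[i] == reference[i]) {
                count++;
            }
        }
        return count;
    }
}
```

For files too large to hold in memory, the same idea applies block by block: read a large chunk, compare it, then read the next one.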

Issues with using too many threads in a benchmark program

I've programmed a (very simple) benchmark in Java. It simply increments a double value up to a specified value and takes the time.
When I use this singlethreaded or with a low amount of threads (up to 100) on my 6-core desktop, the benchmark returns reasonable and repeatable results.
But when I use, for example, 1200 threads, the average multicore duration is significantly lower than the singlecore duration (about 10 times or more). I've made sure that the total number of increments is the same, no matter how many threads I use.
Why does the performance drop so much with more threads? Is there a trick to solve this problem?
I'm posting my source, but I don't think the problem is there.
package sibbo.benchmark;
import java.text.DecimalFormat;
import java.util.LinkedList;
import java.util.List;
public class Benchmark implements TestFinishedListener {
private static final double TARGET = 1e10;
private static final int THREAD_MULTIPLICATOR = 2;
public static void main(String[] args) throws InterruptedException {
Benchmark b = new Benchmark(TARGET);
private int coreCount;
private List<Worker> workers = new LinkedList<>();
private List<Worker> finishedWorkers = new LinkedList<>();
private double target;
public Benchmark(double target) { this.target = target;
private void getSystemInfos() {
coreCount = Runtime.getRuntime().availableProcessors();
private void printInfos() {
System.out.println("Usable cores: " + coreCount);
System.out.println("Multicore threads: " + coreCount * THREAD_MULTIPLICATOR);
System.out.println("Loops per core: " + new DecimalFormat("###,###,###,###,##0").format(TARGET));
public synchronized void start() throws InterruptedException {
System.out.print("Initializing singlecore benchmark... ");
Worker w = new Worker(this, 0);
System.out.print("Running singlecore benchmark... ");
// Multicore
System.out.print("Initializing multicore benchmark... ");
for (int i = 0; i < coreCount * THREAD_MULTIPLICATOR; i++) {
workers.add(new Worker(this, i));
System.out.print("Running multicore benchmark... ");
for (Worker worker : workers) {
worker.runBenchmark(target / THREAD_MULTIPLICATOR);
private void printResult() {
DecimalFormat df = new DecimalFormat("###,###,###,##0.000");
long min = -1, av = 0, max = -1;
int threadCount = 0;
boolean once = true;
for (Worker w : finishedWorkers) {
if (once) {
once = false;
min = w.getTime();
max = w.getTime();
if (w.getTime() > max) {
max = w.getTime();
if (w.getTime() < min) {
min = w.getTime();
av += w.getTime();
if (finishedWorkers.size() <= 6) {
System.out.println("Worker " + w.getId() + ": " + df.format(w.getTime() / 1e9) + "s");
System.out.println("Min: " + df.format(min / 1e9) + "s, Max: " + df.format(max / 1e9) + "s, Av per Thread: "
+ df.format((double) av / threadCount / 1e9) + "s");
public synchronized void testFinished(Worker w) {
if (workers.isEmpty()) {
package sibbo.benchmark;
public class Worker implements Runnable {
private double value = 0;
private long time;
private double target;
private TestFinishedListener l;
private final int id;
public Worker(TestFinishedListener l, int id) {
this.l = l; this.id = id;
new Thread(this).start();
public int getId() {
return id;
public synchronized void runBenchmark(double target) { this.target = target;
public long getTime() {
return time;
public void run() {
value = 0;
long startTime = System.nanoTime();
while (value < target) {
long endTime = System.nanoTime();
time = endTime - startTime;
private synchronized void synWait() {
try {
} catch (InterruptedException e) {
You need to understand that the OS (or Java thread scheduler, or both) is trying to balance between all of the threads in your application to give them all a chance to perform some work, and there is a non-zero cost to switch between threads. With 1200 threads, you have just reached (and probably far exceeded) the tipping point wherein the processor is spending more time context switching than doing actual work.
Here is a rough analogy:
You have one job to do in room A. You stand in room A for 8 hours a day, and do your job.
Then your boss comes by and tells you that you have to do a job in room B also. Now you need to periodically leave room A, walk down the hall to room B, and then walk back. That walking takes 1 minute per day. Now you spend 3 hours, 59.5 minutes working on each job, and one minute walking between rooms.
Now imagine that you have 1200 rooms to work in. You are going to spend more time walking between rooms than doing actual work. This is the situation that you have put your processor into. It is spending so much time switching between contexts that no real work gets done.
EDIT: Now, as per the comments below, maybe you spend a fixed amount of time in each room before moving on- your work will progress, but the number of context switches between rooms still affects the overall runtime of a single task.
OK, I think I've found my problem, but so far no solution.
When I measure how long each thread takes to do its part of the work, the possible minimum depends on the total number of threads, while the maximum is the same every time. The maximum occurs when a thread is started first, is then paused very often, and finishes last; for example, this maximum could be 10 seconds. Since the total number of operations stays the same no matter how many threads I use, the number of operations done by a single thread changes with the thread count. For example, with one thread it has to do 1000 operations, but with ten threads each of them has to do just 100. So with ten threads, the minimum time one thread can take is much lower than with one thread: it would be 1 second, which happens if a thread does its work without interruption. Calculating the average time each thread needs to do its work is therefore meaningless.
The solution would be to simply measure the time between the start of the first thread and the completion of the last.
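A minimal sketch of that measurement, assuming plain threads and Thread.join() rather than the listener mechanism from the question. The class name WallClockBenchmark and the method run are made up for illustration.

```java
class WallClockBenchmark {

    // Splits `totalIncrements` evenly over `threadCount` threads and returns
    // { wall-clock nanos from first start to last finish, longest single-thread nanos }.
    static long[] run(int threadCount, long totalIncrements) throws InterruptedException {
        final long perThread = totalIncrements / threadCount;
        final long[] threadNanos = new long[threadCount];
        Thread[] threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            final int id = t;
            threads[t] = new Thread(() -> {
                long start = System.nanoTime();
                double value = 0;
                while (value < perThread) {
                    value++;
                }
                threadNanos[id] = System.nanoTime() - start;
            });
        }

        long wallStart = System.nanoTime(); // taken before the first thread starts
        for (Thread thread : threads) {
            thread.start();
        }
        for (Thread thread : threads) {
            thread.join(); // returns only once that thread has finished
        }
        long wallNanos = System.nanoTime() - wallStart; // taken after the last thread ends

        long slowest = 0;
        for (long nanos : threadNanos) {
            slowest = Math.max(slowest, nanos);
        }
        return new long[] {wallNanos, slowest};
    }

    public static void main(String[] args) throws InterruptedException {
        long[] result = run(12, 1_200_000_000L);
        System.out.println("Wall clock: " + result[0] / 1e9 + " s, slowest thread: " + result[1] / 1e9 + " s");
    }
}
```

The wall-clock figure is the one that is comparable across different thread counts, because it covers the same total amount of work every time.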

