Java multithreaded vector addition

I am trying to get familiar with Java multithreaded applications, so I tried to think of a simple task that should parallelize very well. Vector addition seemed like a good candidate.
However, when running on my Linux server (which has 4 cores) I don't get any speedup. The execution time with 4, 2, or 1 threads is about the same.
Here is the code I came up with:
public static void main(String[] args) throws InterruptedException {
    final int threads = Integer.parseInt(args[0]);
    final int length = Integer.parseInt(args[1]);
    final int balk = (length / threads);
    Thread[] th = new Thread[threads];
    final double[] result = new double[length];
    final double[] array1 = getRandomArray(length);
    final double[] array2 = getRandomArray(length);

    long startingTime = System.nanoTime();
    for (int i = 0; i < threads; i++) {
        final int current = i;
        th[i] = new Thread(() -> {
            for (int k = current * balk; k < (current + 1) * balk; k++) {
                result[k] = array1[k] + array2[k];
            }
        });
        th[i].start();
    }
    for (int i = 0; i < threads; i++) {
        th[i].join();
    }
    System.out.println("Time needed: " + (System.nanoTime() - startingTime));
}
length is always a multiple of threads, and getRandomArray() creates an array of random doubles between 0 and 1.
Execution Time for 1-Thread: 84579446ns
Execution Time for 2-Thread: 74211325ns
Execution Time for 4-Thread: 89215100ns
length =10000000
Here is the code for getRandomArray():
private static double[] getRandomArray(int length) {
    Random random = new Random();
    double[] array = new double[length];
    for (int i = 0; i < length; i++) {
        array[i] = random.nextDouble();
    }
    return array;
}
I would appreciate any help.

The difference is observable for the following code. Try it.
public static void main(String[] args) throws InterruptedException {
    for (int z = 0; z < 10; z++) {
        final int threads = 1;
        final int length = 100_000_000;
        final int balk = (length / threads);
        Thread[] th = new Thread[threads];
        final boolean[] result = new boolean[length];
        final boolean[] array1 = getRandomArray(length);
        final boolean[] array2 = getRandomArray(length);

        long startingTime = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            final int current = i;
            th[i] = new Thread(() -> {
                for (int k = current * balk; k < (current + 1) * balk; k++) {
                    result[k] = array1[k] | array2[k];
                }
            });
            th[i].start();
        }
        for (int i = 0; i < threads; i++) {
            th[i].join();
        }
        System.out.println("Time needed: " + (System.nanoTime() - startingTime) * 1.0 / 1000 / 1000);

        boolean x = false;
        for (boolean d : result) {
            x |= d;
        }
        System.out.println(x);
    }
}
First things first, you need to warm up your code; that way you measure compiled code. The first two iterations take roughly the same time, but the following ones will differ. I also changed double to boolean because my machine doesn't have much memory; this lets me allocate a huge array and it also makes the work more CPU-heavy.
There is a link in the comments; I suggest you read it.
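As an illustration of the warm-up point (a sketch only; addVectors is a hypothetical helper that would hold the threaded loop from the question):
// Hypothetical helper assumed to contain the thread start/join loop shown above.
static long timeOnce(int threads, double[] a, double[] b, double[] out) throws InterruptedException {
    long t0 = System.nanoTime();
    addVectors(threads, a, b, out); // hypothetical: starts the worker threads and joins them
    return System.nanoTime() - t0;
}

// Discard the first runs so the JIT has compiled the hot loop before numbers are recorded.
for (int warmup = 0; warmup < 5; warmup++) {
    timeOnce(threads, array1, array2, result);
}
System.out.println("Time needed: " + timeOnce(threads, array1, array2, result));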

If you want to see how your cores share work, you can give all cores a very simple task but make them work constantly on something that is not shared across threads (basically simulating, for example, a merge sort, where the threads spend most of their time on their own complicated work and only touch shared resources briefly). Using your code I did something like the following. In such a case you should see almost exactly a 2x and a 4x speedup.
public static void main(String[]args)throws InterruptedException{
for(int a=0; a<5; a++) {
final int threads = 2;
final int length = 10;
final int balk = (length / threads);
Thread[] th = new Thread[threads];
System.out.println(Runtime.getRuntime().availableProcessors());
final double[] result = new double[length];
final double[] array1 = getRandomArray(length);
final double[] array2 = getRandomArray(length);
long startingTime = System.nanoTime();
for (int i = 0; i < threads; i++) {
final int current = i;
th[i] = new Thread(() -> {
Random random = new Random();
int meaningless = 0;
for (int k = current * balk; k < (current + 1) * balk; k++) {
result[k] = array1[k] + array2[k];
for (int j = 0; j < 10000000; j++) {
meaningless+=random.nextInt(10);
}
}
});
th[i].start();
}
for (int i = 0; i < threads; i++) {
th[i].join();
}
System.out.println("Time needed: " + ((System.nanoTime() - startingTime) * 1.0) / 1000000000 + " s");
}
}
You see, in your code most of the time is consumed by building the big arrays, and then the threads finish very quickly; their work is so fast that your time measurement is misleading, because most of the measured time goes into creating the threads. When I ran the same work in a plain precomputed loop like this:
long startingTime = System.nanoTime();
for (int k = 0; k < length; k++) {
    result[k] = array1[k] | array2[k];
}
System.out.println("Time needed: " + (System.nanoTime() - startingTime));
it ran twice as fast as your code with 2 threads. I hope you can see what I mean here, and why I gave my threads much more meaningless work.
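To see that overhead in isolation, you can time thread creation with no work at all (a sketch; it assumes it runs inside a main that declares throws InterruptedException, and the thread count is just an example):
// Measures only the fixed cost of spawning and joining threads that do nothing.
// If this number is close to the timed vector addition above, the benchmark is
// mostly measuring thread start-up rather than the addition itself.
int threads = 4; // example value
Thread[] th = new Thread[threads];
long t0 = System.nanoTime();
for (int i = 0; i < threads; i++) {
    th[i] = new Thread(() -> { /* no work */ });
    th[i].start();
}
for (int i = 0; i < threads; i++) {
    th[i].join();
}
System.out.println("Thread start/join overhead: " + (System.nanoTime() - t0) + " ns");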

Related

Why is Arrays.binarySearch not improving the performance compared to walking the array?

I gave a shot at solving the Hackerland Radio Transmitters programming challenge.
To summarize, the challenge goes as follows:
Hackerland is a one-dimensional city with n houses, where each house i is located at some xi on the x-axis. The Mayor wants to install radio transmitters on the roofs of the city's houses. Each transmitter has a range, k, meaning it can transmit a signal to all houses ≤ k units of distance away.
Given a map of Hackerland and the value of k, can you find the minimum number of transmitters needed to cover every house?
My implementation is as follows:
package biz.tugay;
import java.util.*;
public class HackerlandRadioTransmitters {
public static int minNumOfTransmitters(int[] houseLocations, int transmitterRange) {
// Sort and remove duplicates..
houseLocations = uniqueHouseLocationsSorted(houseLocations);
int towerCount = 0;
for (int nextHouseNotCovered = 0; nextHouseNotCovered < houseLocations.length; ) {
final int towerLocation = HackerlandRadioTransmitters.findNextTowerIndex(houseLocations, nextHouseNotCovered, transmitterRange);
towerCount++;
nextHouseNotCovered = HackerlandRadioTransmitters.nextHouseNotCoveredIndex(houseLocations, towerLocation, transmitterRange);
if (nextHouseNotCovered == -1) {
break;
}
}
return towerCount;
}
public static int findNextTowerIndex(final int[] houseLocations, final int houseNotCoveredIndex, final int transmitterRange) {
final int houseLocationWeWantToCover = houseLocations[houseNotCoveredIndex];
final int farthestHouseLocationAllowed = houseLocationWeWantToCover + transmitterRange;
int towerIndex = houseNotCoveredIndex;
int loop = 0;
while (true) {
loop++;
if (towerIndex == houseLocations.length - 1) {
break;
}
if (farthestHouseLocationAllowed >= houseLocations[towerIndex + 1]) {
towerIndex++;
continue;
}
break;
}
System.out.println("findNextTowerIndex looped : " + loop);
return towerIndex;
}
public static int nextHouseNotCoveredIndex(final int[] houseLocations, final int towerIndex, final int transmitterRange) {
final int towerCoversUntil = houseLocations[towerIndex] + transmitterRange;
int notCoveredHouseIndex = towerIndex + 1;
int loop = 0;
while (notCoveredHouseIndex < houseLocations.length) {
loop++;
final int locationOfHouseBeingChecked = houseLocations[notCoveredHouseIndex];
if (locationOfHouseBeingChecked > towerCoversUntil) {
break; // Tower does not cover the house anymore, break the loop..
}
notCoveredHouseIndex++;
}
if (notCoveredHouseIndex == houseLocations.length) {
notCoveredHouseIndex = -1;
}
System.out.println("nextHouseNotCoveredIndex looped : " + loop);
return notCoveredHouseIndex;
}
public static int[] uniqueHouseLocationsSorted(final int[] houseLocations) {
Arrays.sort(houseLocations);
final HashSet<Integer> integers = new HashSet<>();
final int[] houseLocationsUnique = new int[houseLocations.length];
int innerCounter = 0;
for (int houseLocation : houseLocations) {
if (integers.contains(houseLocation)) {
continue;
}
houseLocationsUnique[innerCounter] = houseLocation;
integers.add(houseLocationsUnique[innerCounter]);
innerCounter++;
}
return Arrays.copyOf(houseLocationsUnique, innerCounter);
}
}
I am pretty sure this implementation is correct. But note the detail in the functions findNextTowerIndex and nextHouseNotCoveredIndex: they walk the array one element at a time!
One of my tests is as follows:
static void test_01() throws FileNotFoundException {
final long start = System.currentTimeMillis();
final File file = new File("input.txt");
final Scanner scanner = new Scanner(file);
int[] houseLocations = new int[73382];
for (int counter = 0; counter < 73382; counter++) {
houseLocations[counter] = scanner.nextInt();
}
final int[] uniqueHouseLocationsSorted = HackerlandRadioTransmitters.uniqueHouseLocationsSorted(houseLocations);
final int minNumOfTransmitters = HackerlandRadioTransmitters.minNumOfTransmitters(uniqueHouseLocationsSorted, 73381);
assert minNumOfTransmitters == 1;
final long end = System.currentTimeMillis();
System.out.println("Took: " + (end - start) + " milliseconds..");
}
where input.txt can be downloaded from here. (It is not the most important detail in this question, but still..) So we have an array of 73382 houses, and I deliberately set the transmitter range so the methods I have loop a lot:
Here is a sample output from this test on my machine:
findNextTowerIndex looped : 38213
nextHouseNotCoveredIndex looped : 13785
Took: 359 milliseconds..
I also have this test, which does not assert anything, but just keeps time:
static void test_02() throws FileNotFoundException {
final long start = System.currentTimeMillis();
for (int i = 0; i < 400; i ++) {
final File file = new File("input.txt");
final Scanner scanner = new Scanner(file);
int[] houseLocations = new int[73382];
for (int counter = 0; counter < 73382; counter++) {
houseLocations[counter] = scanner.nextInt();
}
final int[] uniqueHouseLocationsSorted = HackerlandRadioTransmitters.uniqueHouseLocationsSorted(houseLocations);
final int transmitterRange = ThreadLocalRandom.current().nextInt(1, 70000);
final int minNumOfTransmitters = HackerlandRadioTransmitters.minNumOfTransmitters(uniqueHouseLocationsSorted, transmitterRange);
}
final long end = System.currentTimeMillis();
System.out.println("Took: " + (end - start) + " milliseconds..");
}
where I randomly create 400 transmitter ranges and run the program 400 times.. I get run times as follows on my machine..
Took: 20149 milliseconds..
So then I thought, why don't I use binary search instead of walking the array, and changed my implementations as follows:
public static int findNextTowerIndex(final int[] houseLocations, final int houseNotCoveredIndex, final int transmitterRange) {
final int houseLocationWeWantToCover = houseLocations[houseNotCoveredIndex];
final int farthestHouseLocationAllowed = houseLocationWeWantToCover + transmitterRange;
int nextTowerIndex = Arrays.binarySearch(houseLocations, 0, houseLocations.length, farthestHouseLocationAllowed);
if (nextTowerIndex < 0) {
nextTowerIndex = -nextTowerIndex;
nextTowerIndex = nextTowerIndex -2;
}
return nextTowerIndex;
}
public static int nextHouseNotCoveredIndex(final int[] houseLocations, final int towerIndex, final int transmitterRange) {
final int towerCoversUntil = houseLocations[towerIndex] + transmitterRange;
int nextHouseNotCoveredIndex = Arrays.binarySearch(houseLocations, 0, houseLocations.length, towerCoversUntil);
if (-nextHouseNotCoveredIndex > houseLocations.length) {
return -1;
}
if (nextHouseNotCoveredIndex < 0) {
nextHouseNotCoveredIndex = - (nextHouseNotCoveredIndex + 1);
return nextHouseNotCoveredIndex;
}
return nextHouseNotCoveredIndex + 1;
}
and I was expecting a great performance boost, as now I loop at most log(N) times instead of N.. So test_01 outputs:
Took: 297 milliseconds..
Remember, it was Took: 359 milliseconds.. before. And for test_02:
Took: 18047 milliseconds..
So I consistently get values around 20 seconds with the array-walking implementation and 18-19 seconds with the binary search implementation.
I was expecting a much better performance gain from Arrays.binarySearch, but obviously that is not the case. Why is this? What am I missing? Do I need an array with more than 73382 elements to see the benefit, or is it irrelevant?
Edit #01
After @huck_cussler's comment, I tried doubling and tripling the dataset I have (with random numbers) and running test_02 again (of course adjusting the array sizes in the test itself..). For the linear implementation the times go like this:
Took: 18789 milliseconds..
Took: 34396 milliseconds..
Took: 53504 milliseconds..
For the binary search implementation, I got values as follows:
Took: 18644 milliseconds..
Took: 33831 milliseconds..
Took: 52886 milliseconds..
Your timing includes the retrieval of data from your hard drive. This could be taking the majority of your runtime. Omit the data load from your timing to get a more accurate comparison of your two approaches. Imagine if it takes up 18 seconds and you're comparing 18.644 vs 18.789 (0.77% improvement) instead of 0.644 vs 0.789 (18.38% improvement).
If you have a linear operation O(n), such as loading a binary structure, and you combine it with a binary search O(log n), you end up with O(n). If you trust Big O notation, then you should expect O(n + log n) to not be significantly different from O(2 * n) as they both reduce to O(n).
Also, a binary search may perform better or worse than a linear search depending on the density of houses between towers. Consider, say 1024 homes with a tower evenly dispersed every 4 homes. A linear search will step 4 times per tower, while a binary search will take log2(1024)=10 steps per tower.
One more thing... your minNumOfTransmitters method is sorting the already-sorted array passed into it from test_01 and test_02. That resorting step takes longer than your searches themselves, which further obscures the timing differences between your two search algorithms.
======
I created a small timing class to give a better picture of what's happening. I've removed the line of code from minNumOfTransmitters to prevent it from rerunning the sort, and added a boolean param to select whether to use your binary version. It totals the sum of times for 400 iterations, separating out each step. The results on my system illustrate that the load time dwarfs the sort time, which in turn dwarfs the solve time.
Load: 22.565s
Sort: 4.518s
Linear: 0.012s
Binary: 0.003s
It's easy to see how optimizing that last step doesn't make much difference in overall runtime.
private static class Timing {
public long load=0;
public long sort=0;
public long solve1=0;
public long solve2=0;
private String secs(long millis) {
return String.format("%3d.%03ds", millis/1000, millis%1000);
}
public String toString() {
return " Load: " + secs(load) + "\n Sort: " + secs(sort) + "\nLinear: " + secs(solve1) + "\nBinary: " + secs(solve2);
}
public void add(Timing timing) {
load+=timing.load;
sort+=timing.sort;
solve1+=timing.solve1;
solve2+=timing.solve2;
}
}
static Timing test_01() throws FileNotFoundException {
Timing timing=new Timing();
long start = System.currentTimeMillis();
final File file = new File("c:\\path\\to\\xnpwdiG3.txt");
final Scanner scanner = new Scanner(file);
int[] houseLocations = new int[73382];
for (int counter = 0; counter < 73382; counter++) {
houseLocations[counter] = scanner.nextInt();
}
timing.load+=System.currentTimeMillis()-start;
start=System.currentTimeMillis();
final int[] uniqueHouseLocationsSorted = HackerlandRadioTransmitters.uniqueHouseLocationsSorted(houseLocations);
timing.sort=System.currentTimeMillis()-start;
start=System.currentTimeMillis();
final int minNumOfTransmitters = HackerlandRadioTransmitters.minNumOfTransmitters(uniqueHouseLocationsSorted, 73381, false);
timing.solve1=System.currentTimeMillis()-start;
start=System.currentTimeMillis();
final int minNumOfTransmittersBin = HackerlandRadioTransmitters.minNumOfTransmitters(uniqueHouseLocationsSorted, 73381, true);
timing.solve2=System.currentTimeMillis()-start;
final long end = System.currentTimeMillis();
return timing;
}
In your time measurement you include operations that are much slower than array search. Namely filesystem I/O and array sorting.
I/O in general (reading/writing from filesystem, network communication) is by orders of magnitude slower than operations that involve only CPU and RAM access.
Let's rewrite your test in a way that does not read the file on every loop iteration:
static void test_02() throws FileNotFoundException {
final File file = new File("input.txt");
final Scanner scanner = new Scanner(file);
int[] houseLocations = new int[73382];
for (int counter = 0; counter < 73382; counter++) {
houseLocations[counter] = scanner.nextInt();
}
scanner.close();
final int rounds = 400;
final int[] uniqueHouseLocationsSorted = uniqueHouseLocationsSorted(houseLocations);
final int transmitterRange = 73381;
final long start = System.currentTimeMillis();
for (int i = 0; i < rounds; i++) {
final int minNumOfTransmitters = minNumOfTransmitters(uniqueHouseLocationsSorted, transmitterRange);
}
final long end = System.currentTimeMillis();
System.out.println("Took: " + (end - start) + " milliseconds..");
}
Notice in this version of the test the file is read only once and time measuring starts after that.
With the above, I get Took: 1700 milliseconds.. (more or less a few millis) for both the iterative version and the binary search. So we still can't see that binary search is faster. That's because almost all of that time goes into sorting the array 400 times.
Now let's remove the line that sorts the input array from the minNumOfTransmitters method. We sort the array (once) anyway at the beginning of the test.
Now we can see that things are much faster. After removing the line houseLocations = uniqueHouseLocationsSorted(houseLocations) from minNumOfTransmitters I get: Took: 68 milliseconds.. for the iterative version. Clearly, since this duration is already very small, we will not see a significant difference with the binary search version.
So let's increase the number of loop rounds to: 100000.
Now I get Took: 2121 milliseconds.. for the iterative version and Took: 36 milliseconds.. for the binary search version.
Because we now isolated what we measure and focus on the array searches, rather than including operations that are much slower, we can notice the big difference in performance (for the better) of binary search.
If you want to see how many times binary search enters its while loop, you can implement it yourself and add a counter:
private static int binarySearch0(int[] a, int fromIndex, int toIndex, int key) {
int low = fromIndex;
int high = toIndex - 1;
int loop = 0;
while (low <= high) {
loop++;
int mid = (low + high) >>> 1;
int midVal = a[mid];
if (midVal < key) {
low = mid + 1;
} else if (midVal > key) {
high = mid - 1;
} else {
return mid; // key found
}
}
System.out.println("binary search looped " + loop + " times");
return -(low + 1); // key not found.
}
The method is copied from the Arrays class in the JDK - I just added the loop counter and the println.
When the length of the array to search is 73382, the loop enters only 16 times.
That is exactly what we expect: log(73382) =~ 16.
I agree with the other answers that the main issue with your tests is that they measure the wrong things: I/O and sorting. But I don't think the suggested tests are ideal either. My suggestion is the following:
static void test_02() throws FileNotFoundException {
final File file = new File("43620487.txt");
final Scanner scanner = new Scanner(file);
int[] houseLocations = new int[73382];
for (int counter = 0; counter < 73382; counter++) {
houseLocations[counter] = scanner.nextInt();
}
final int[] uniqueHouseLocationsSorted = uniqueHouseLocationsSorted(houseLocations);
final Random random = new Random(0); // fixed seed to have the same sequences in all tests
long sum = 0;
// warm up
for (int i = 0; i < 100; i++) {
final int transmitterRange = random.nextInt(70000) + 1;
final int minNumOfTransmitters = minNumOfTransmitters(uniqueHouseLocationsSorted, transmitterRange);
sum += minNumOfTransmitters;
}
// actual measure
final long start = System.currentTimeMillis();
for (int i = 0; i < 4000; i++) {
final int transmitterRange = random.nextInt(70000) + 1;
final int minNumOfTransmitters = minNumOfTransmitters(uniqueHouseLocationsSorted, transmitterRange);
sum += minNumOfTransmitters;
}
final long end = System.currentTimeMillis();
System.out.println("Took: " + (end - start) + " milliseconds. Sum = " + sum);
}
Note also that I removed all System.out.println calls from findNextTowerIndex and nextHouseNotCoveredIndex, and the uniqueHouseLocationsSorted call from minNumOfTransmitters, as they affect the measurements as well.
So what I think is important here:
Move all I/O and sorting out of the measurement loop
Perform some warm up outside of measurement
Use the same random sequence for all measurements
Don't discard the result of the calculation, so the JIT can't optimize the call away altogether
With such a test I see about a 10x difference on my machine: around 80 ms vs around 8 ms.
And if you really want to do performance tests in Java, you should consider using JMH (the Java Microbenchmark Harness).
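A minimal JMH benchmark for this problem might look roughly like the sketch below (assumptions: the jmh-core and jmh-generator-annprocess dependencies are on the classpath, and minNumOfTransmitters / uniqueHouseLocationsSorted are the static methods from the question, with the internal re-sort removed):
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class TransmitterBench {
    int[] sorted;
    int range;

    @Setup
    public void setup() {
        // Build the input once, outside the measured code (no I/O, no per-iteration sorting).
        Random r = new Random(0);
        int[] houses = new int[73382];
        for (int i = 0; i < houses.length; i++) houses[i] = r.nextInt(1_000_000);
        sorted = HackerlandRadioTransmitters.uniqueHouseLocationsSorted(houses);
        range = 70_000;
    }

    @Benchmark
    public int solve() {
        // Returning the result keeps the JIT from optimizing the call away.
        return HackerlandRadioTransmitters.minNumOfTransmitters(sorted, range);
    }
}
JMH then handles warm-up, forking, and statistics for you, which is exactly the part that is hard to get right by hand.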
I agree with the other answers: the I/O time is the biggest problem, sorting is second, and the search itself consumes the least time.
I also agree with phatfingers's example: binary search can sometimes be worse than linear search in your problem, because the linear walk touches each element roughly once in total (about n comparisons overall), while binary search runs once per tower (O(log n) per tower, so O(log n) * #towers in total). One suggestion is to start the binary search not from 0 but from the current location:
int nextTowerIndex = Arrays.binarySearch(houseLocations, houseNotCoveredIndex + 1, houseLocations.length, farthestHouseLocationAllowed);
which should bring it down to roughly (O(log n) * #towers) / 2.
Going further, you could estimate the average number of houses each tower covers, check that many houses first, and only then binary-search starting from houseNotCoveredIndex + avg + 1, but I am not sure the performance would be much better.
P.S. For sorting and deduplication you can use a TreeSet:
public static int[] uniqueHouseLocationsSorted(final int[] houseLocations) {
final Set<Integer> integers = new TreeSet<>();
for (int houseLocation : houseLocations) {
integers.add(houseLocation);
}
int[] unique = new int[integers.size()];
int i = 0;
for(Integer loc : integers){
unique[i] = loc;
i++;
}
return unique;
}
uniqueHouseLocationsSorted is not efficient; andy's solution seems better, but I think this version could also improve the time spent (note that I did not test the code):
public static int[] uniqueHouseLocationsSorted(final int[] houseLocations) {
int size = houseLocations.length;
if (size == 0) return null; // you have to check for null later or maybe throw an exception here
Arrays.sort(houseLocations);
final int[] houseLocationsUnique = new int[size];
int previous = houseLocationsUnique[0] = houseLocations[0];
int innerCounter = 1;
for (int i = 1; i < size; i++) {
int houseLocation = houseLocations[i];
if (houseLocation == previous) continue; // since elements are sorted this is faster
previous = houseLocationsUnique[innerCounter++] = houseLocation;
}
return Arrays.copyOf(houseLocationsUnique, innerCounter);
}
Consider also using an ArrayList, as copying the array takes time.
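For comparison, on Java 8+ the whole sort-and-deduplicate step can also be written with streams (a sketch, not benchmarked here):
public static int[] uniqueHouseLocationsSorted(final int[] houseLocations) {
    // sorted() + distinct() on an IntStream avoids boxing into a Set and the manual copy.
    return java.util.stream.IntStream.of(houseLocations)
            .sorted()
            .distinct()
            .toArray();
}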

Multiple parallel threads in Java don't terminate

I'm trying to run a set of parallel threads in Java. I'm creating them through a higher-order function as follows:
public static void parallelizedMap(Consumer<String> f, List<String> list, int count) {
List<List<String>> parts = new ArrayList<List<String>>();
final int N = list.size();
int L = N / (count - 1);
for (int i = 0; i < N; i += L) {
parts.add(new ArrayList<String>(list.subList(i, Math.min(N, i + L))));
}
for (List<String> e : parts) {
Runnable r = new Runnable() {
public void run() {
e.forEach(f);
}
};
new Thread(r).start();
}
}
This method is called every few minutes. After several minutes it has created hundreds of threads. Every thread should only run for about 20 seconds, but my debugging showed that they never terminate, and therefore I get this exception:
java.lang.OutOfMemoryError: GC overhead limit exceeded
thanks in advance
I do not think the threads themselves are the problem. But if count is sufficiently larger than the size of the list (here N = 10 and count = 22, so the integer division yields 0):
final int N = list.size();            // N = 10, count = 22
int L = N / (count - 1);              // L = 10 / (22 - 1) = 0.476..., which truncates to (int) 0
for (int i = 0; i < N; i += 0) {      // i never advances
    parts.add(new ArrayList<String>(list.subList(i, Math.min(N, i + 0))));
}
the iteration is an endless loop, and the memory usage grows with every pass.
Naturally that ends in an OutOfMemoryError.
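A possible fix is to make the chunk size at least 1 and compute it with a ceiling division (a sketch based on the method from the question; imports as in the original):
public static void parallelizedMap(Consumer<String> f, List<String> list, int count) {
    final int N = list.size();
    // Ceiling division, and never smaller than 1, so the loop below always advances.
    final int L = Math.max(1, (N + count - 1) / count);
    for (int i = 0; i < N; i += L) {
        final List<String> part = new ArrayList<>(list.subList(i, Math.min(N, i + L)));
        new Thread(() -> part.forEach(f)).start();
    }
}
Note that this still starts new threads on every call; if the method really is called every few minutes, submitting the chunks to a single shared ExecutorService (and reusing its worker threads) avoids unbounded thread creation.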
For me the solution was to give every thread its own JDBC connection and not let them share one. This resolved the deadlock and they terminate as expected.

ThreadLocalRandom vs. shared static Random instance performance comparison test

In our project we used a static Random instance for one task that needed random number generation. With the Java 7 release, the new ThreadLocalRandom class appeared for generating random numbers.
From the spec:
When applicable, use of ThreadLocalRandom rather than shared Random objects in concurrent programs will typically encounter much less overhead and contention. Use of ThreadLocalRandom is particularly appropriate when multiple tasks (for example, each a ForkJoinTask) use random numbers in parallel in thread pools.
and also:
When all usages are of this form, it is never possible to accidently share a ThreadLocalRandom across multiple threads.
So I've made my little test:
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
public class ThreadLocalRandomTest {
private static final int THREAD_COUNT = 100;
private static final int GENERATED_NUMBER_COUNT = 1000;
private static final int INT_RIGHT_BORDER = 5000;
private static final int EXPERIMENTS_COUNT = 5000;
public static void main(String[] args) throws InterruptedException {
System.out.println("Number of threads: " + THREAD_COUNT);
System.out.println("Length of generated numbers chain for each thread: " + GENERATED_NUMBER_COUNT);
System.out.println("Right border integer: " + INT_RIGHT_BORDER);
System.out.println("Count of experiments: " + EXPERIMENTS_COUNT);
int repeats = 0;
int workingTime = 0;
long startTime = 0;
long endTime = 0;
for (int i = 0; i < EXPERIMENTS_COUNT; i++) {
startTime = System.currentTimeMillis();
repeats += calculateRepeatsForSharedRandom();
endTime = System.currentTimeMillis();
workingTime += endTime - startTime;
}
System.out.println("Average repeats for shared Random instance: " + repeats / EXPERIMENTS_COUNT
+ ". Average working time: " + workingTime / EXPERIMENTS_COUNT + " ms.");
repeats = 0;
workingTime = 0;
for (int i = 0; i < EXPERIMENTS_COUNT; i++) {
startTime = System.currentTimeMillis();
repeats += calculateRepeatsForTheadLocalRandom();
endTime = System.currentTimeMillis();
workingTime += endTime - startTime;
}
System.out.println("Average repeats for ThreadLocalRandom: " + repeats / EXPERIMENTS_COUNT
+ ". Average working time: " + workingTime / EXPERIMENTS_COUNT + " ms.");
}
private static int calculateRepeatsForSharedRandom() throws InterruptedException {
final Random rand = new Random();
final Map<Integer, Integer> counts = new HashMap<>();
for (int i = 0; i < THREAD_COUNT; i++) {
Thread thread = new Thread() {
@Override
public void run() {
for (int j = 0; j < GENERATED_NUMBER_COUNT; j++) {
int random = rand.nextInt(INT_RIGHT_BORDER);
if (!counts.containsKey(random)) {
counts.put(random, 0);
}
counts.put(random, counts.get(random) + 1);
}
}
};
thread.start();
thread.join();
}
int repeats = 0;
for (Integer value : counts.values()) {
if (value > 1) {
repeats += value;
}
}
return repeats;
}
private static int calculateRepeatsForTheadLocalRandom() throws InterruptedException {
final Map<Integer, Integer> counts = new HashMap<>();
for (int i = 0; i < THREAD_COUNT; i++) {
Thread thread = new Thread() {
@Override
public void run() {
for (int j = 0; j < GENERATED_NUMBER_COUNT; j++) {
int random = ThreadLocalRandom.current().nextInt(INT_RIGHT_BORDER);
if (!counts.containsKey(random)) {
counts.put(random, 0);
}
counts.put(random, counts.get(random) + 1);
}
}
};
thread.start();
thread.join();
}
int repeats = 0;
for (Integer value : counts.values()) {
if (value > 1) {
repeats += value;
}
}
return repeats;
}
}
I've also added a test for a non-shared Random and got the following results:
Number of threads: 100
Length of generated numbers chain for each thread: 100
Right border integer: 5000
Count of experiments: 10000
Average repeats for non-shared Random instance: 8646. Average working time: 13 ms.
Average repeats for shared Random instance: 8646. Average working time: 13 ms.
Average repeats for ThreadLocalRandom: 8646. Average working time: 13 ms.
To me it's a little strange, as I expected at least some speed increase when using ThreadLocalRandom compared to the shared Random instance, but I see no difference at all.
Can someone explain why it works that way? Maybe I haven't done the testing properly. Thank you!
You're not running anything in parallel because you're waiting for each thread to finish immediately after starting it. You need a waiting loop outside the loop that starts the threads:
List<Thread> threads = new ArrayList<Thread>();
for (int i = 0; i < THREAD_COUNT; i++) {
Thread thread = new Thread() {
@Override
public void run() {
for (int j = 0; j < GENERATED_NUMBER_COUNT; j++) {
int random = rand.nextInt(INT_RIGHT_BORDER);
if (!counts.containsKey(random)) {
counts.put(random, 0);
}
counts.put(random, counts.get(random) + 1);
}
}
};
threads.add(thread);
thread.start();
}
for (Thread thread: threads) {
thread.join();
}
For one thing, your testing code is flawed; that's the bane of benchmarkers everywhere. You write:
thread.start();
thread.join();
Why not save lines of code and write
thread.run();
instead? The outcome is the same.
EDIT: If you don't see why the outcome is the same, that is the point: you're running single-threaded tests; there's no multithreading going on at all.
Maybe it would be easier to just have a look at what actually happens. Here is the source for ThreadLocal.get(), which is also what ThreadLocalRandom.current() calls.
public T get() {
Thread t = Thread.currentThread();
ThreadLocalMap map = getMap(t);
if (map != null) {
ThreadLocalMap.Entry e = map.getEntry(this);
if (e != null)
return (T)e.value;
}
return setInitialValue();
}
Where ThreadLocalMap is a specialized HashMap-like implementation with optimizations.
So what basically happens is that ThreadLocal holds a map Thread -> Object (or in this case Thread -> Random), which is looked up and either returned or created. As this is nothing 'magical', the timing equals a HashMap lookup plus the one-time creation overhead of the actual object to be returned. Since a HashMap lookup (in this optimized case) is constant time, the cost for a lookup is some constant k, where k is dominated by the cost of the hash function.
So you can make some assumptions:
ThreadLocal will be faster than creating the object each time in each Runnable, unless the creation cost is much smaller than k. So looking up Random is a good thing, putting an int inside might not be so smart.
ThreadLocal will be better than using your own HashMap, as such a generic implementation can be assumed to be equal to k or worse.
ThreadLocal will be slower than using any lookup with a cost < k. Example: store everything in an array first, then do myRandoms[threadID].
But then this assumes that you know which threads will be processing your work in the first place, so this isn't a real candidate for ThreadLocal anyways.
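For reference, the per-thread Random pattern whose lookup cost is analysed above looks roughly like this (a sketch; ThreadLocal.withInitial requires Java 8+):
import java.util.Random;

// One lazily created Random per thread; each get() is the ThreadLocalMap lookup described above.
private static final ThreadLocal<Random> PER_THREAD_RANDOM = ThreadLocal.withInitial(Random::new);

// Inside a worker thread (5000 is the same bound as INT_RIGHT_BORDER in the question):
int value = PER_THREAD_RANDOM.get().nextInt(5000);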

[Java] Arrays and Collections test, unexpected outcome?

There are probably hundreds of questions about Java Collections vs. arrays, but this is something I really didn't expect.
I am developing a server for my game, and to communicate between the client and server you need to send packets (obviously), so I did some tests to find out which collection (or array) would be best for dispatching them: a HashMap, an ArrayList, or a plain PacketHandler array. The outcome was very unexpected to me, because the ArrayList wins.
The packet handling structure is essentially a dictionary (index to PacketHandler), and because an array is the most primitive form of such a lookup I thought it would easily beat an ArrayList. Could someone explain to me why this is?
My test
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Random;
public class Main {
/**
* Packet handler interface.
*/
private interface PacketHandler {
void handle();
}
/**
* A dummy packet handler.
*/
private class DummyPacketHandler implements PacketHandler {
@Override
public void handle() { }
}
public Main() {
Random r = new Random();
PacketHandler[] handlers = new PacketHandler[256];
HashMap<Integer, PacketHandler> m = new HashMap<Integer, PacketHandler>();
ArrayList<PacketHandler> list = new ArrayList<PacketHandler>();
// packet handler initialization
for (int i = 0; i < 255; i++) {
DummyPacketHandler p = new DummyPacketHandler();
handlers[i] = p;
m.put(new Integer(i), p);
list.add(p);
}
// array
long time = System.currentTimeMillis();
for (int i = 0; i < 10000000; i++)
handlers[r.nextInt(255)].handle();
System.out.println((System.currentTimeMillis() - time));
// hashmap
time = System.currentTimeMillis();
for (int i = 0; i < 10000000; i++)
m.get(new Integer(r.nextInt(255))).handle();
System.out.println((System.currentTimeMillis() - time));
// arraylist
time = System.currentTimeMillis();
for (int i = 0; i < 10000000; i++)
list.get(r.nextInt(255)).handle();
System.out.println((System.currentTimeMillis() - time));
}
public static void main(String[] args) {
new Main();
}
}
I think the problem is quite solved, thanks everybody
The shorter answer is that ArrayList is slightly more optimised the first time, but is still slower in the long run.
How and when the JVM optimises the code before it is completely warmed up isn't always obvious and can change between versions and based on your command line options.
What is really interesting is what you get when you repeat the test. The reason it makes a difference here is that the code is compiled in stages in the background, and you want tests where the code is already as fast as it will get right from the start.
There are a few things you can do to make your benchmark more reproducible.
generate your random numbers in advance, they are not part of your test but can slow you down.
place each loop in a separate method. The first loop triggers the whole method to be compiled for better or worse.
repeat the test 5 to 10 times and ignore the first one.
Use System.nanoTime() instead of currentTimeMillis(). It may not make any difference here, but it is a good habit to get into.
Use autoboxing where you can so the Integer cache is used, or Integer.valueOf(n), which does the same thing; new Integer(n) will always create a new object (see the small illustration after this list).
Make sure your inner loop does something, otherwise it's quite likely the JIT will optimise it away to nothing. ;)
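On the Integer-cache point, a tiny illustration (the values used assume the default cache range of -128 to 127):
Integer a = Integer.valueOf(100);
Integer b = Integer.valueOf(100);
System.out.println(a == b);   // true: both references come from the Integer cache
Integer c = new Integer(100);
Integer d = new Integer(100);
System.out.println(c == d);   // false: new Integer(n) always allocates a fresh object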
Here is the benchmark rewritten with those points applied:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Random;
public class Main {
/**
* Packet handler interface.
*/
private interface PacketHandler {
void handle();
}
/**
* A dummy packet handler.
*/
static class DummyPacketHandler implements PacketHandler {
@Override
public void handle() {
}
}
public static void main(String[] args) {
Random r = new Random();
PacketHandler[] handlers = new PacketHandler[256];
HashMap<Integer, PacketHandler> m = new HashMap<Integer, PacketHandler>();
ArrayList<PacketHandler> list = new ArrayList<PacketHandler>();
// packet handler initialization
for (int i = 0; i < 256; i++) {
DummyPacketHandler p = new DummyPacketHandler();
handlers[i] = p;
m.put(new Integer(i), p);
list.add(p);
}
int runs = 10000000;
int[] handlerToUse = new int[runs];
for (int i = 0; i < runs; i++)
handlerToUse[i] = r.nextInt(256);
for (int i = 0; i < 5; i++) {
testArray(handlers, runs, handlerToUse);
testHashMap(m, runs, handlerToUse);
testArrayList(list, runs, handlerToUse);
System.out.println();
}
}
private static void testArray(PacketHandler[] handlers, int runs, int[] handlerToUse) {
// array
long time = System.nanoTime();
for (int i = 0; i < runs; i++)
handlers[handlerToUse[i]].handle();
System.out.print((System.nanoTime() - time)/1e6+" ");
}
private static void testHashMap(HashMap<Integer, PacketHandler> m, int runs, int[] handlerToUse) {
// hashmap
long time = System.nanoTime();
for (int i = 0; i < runs; i++)
m.get(handlerToUse[i]).handle();
System.out.print((System.nanoTime() - time)/1e6+" ");
}
private static void testArrayList(ArrayList<PacketHandler> list, int runs, int[] handlerToUse) {
// arraylist
long time = System.nanoTime();
for (int i = 0; i < runs; i++)
list.get(handlerToUse[i]).handle();
System.out.print((System.nanoTime() - time)/1e6+" ");
}
}
This prints, for array / HashMap / ArrayList:
24.62537 263.185092 24.19565
28.997305 206.956117 23.437585
19.422327 224.894738 21.191718
14.154433 194.014725 16.927638
13.897081 163.383876 16.678818
After the code warms up, the array is marginally faster.
There are at least a few problems with your benchmark:
you run your tests directly in main, meaning that when your main method gets compiled, the JIT compiler has not had time to optimise all the code because it has not run it yet
the map method creates a new integer each time, which is not fair: use m.get(r.nextInt(255)).handle(); to allow the Integer cache to be used
you need to run your test several times before you can draw conclusions
you are not using the result of what you do in your loops and the JIT is therefore allowed to simply ignore them
monitor GC, as it might run at the same time and bias the results of one of your loops; add a System.gc() call between each loop
But before doing all that, read this post ;-)
After tweaking your code a bit, I get these results:
Array: 116
Map: 139
List: 117
So array and list are close to identical once compiled and map is slightly slower.
Code:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Random;
public class Main {
/**
* Packet handler interface.
*/
private interface PacketHandler {
int handle();
}
/**
* A dummy packet handler.
*/
private class DummyPacketHandler implements PacketHandler {
@Override
public int handle() {
return 123;
}
}
public Main() {
Random r = new Random();
PacketHandler[] handlers = new PacketHandler[256];
HashMap<Integer, PacketHandler> m = new HashMap<Integer, PacketHandler>();
ArrayList<PacketHandler> list = new ArrayList<PacketHandler>();
// packet handler initialization
for (int i = 0; i < 255; i++) {
DummyPacketHandler p = new DummyPacketHandler();
handlers[i] = p;
m.put(new Integer(i), p);
list.add(p);
}
long sum = 0;
runArray(handlers, r, 20000);
runMap(m, r, 20000);
runList(list, r, 20000);
// array
long time = System.nanoTime();
sum += runArray(handlers, r, 10000000);
System.out.println("Array: " + (System.nanoTime() - time) / 1000000);
// hashmap
time = System.nanoTime();
sum += runMap(m, r, 10000000);
System.out.println("Map: " + (System.nanoTime() - time) / 1000000);
// arraylist
time = System.nanoTime();
sum += runList(list, r, 10000000);
System.out.println("List: " + (System.nanoTime() - time) / 1000000);
System.out.println(sum);
}
public static void main(String[] args) {
new Main();
}
private long runArray(PacketHandler[] handlers, Random r, int loops) {
long sum = 0;
for (int i = 0; i < loops; i++)
sum += handlers[r.nextInt(255)].handle();
return sum;
}
private long runMap(HashMap<Integer, PacketHandler> m, Random r, int loops) {
long sum = 0;
for (int i = 0; i < loops; i++)
sum += m.get(new Integer(r.nextInt(255))).handle();
return sum;
}
private long runList(List<PacketHandler> list, Random r, int loops) {
long sum = 0;
for (int i = 0; i < loops; i++)
sum += list.get(r.nextInt(255)).handle();
return sum;
}
}

Is there any "threshold" justifying multithreaded computation?

So basically I needed to optimize this piece of code today. It tries to find the longest sequence produced by some function for the first million starting numbers:
public static void main(String[] args) {
int mostLen = 0;
int mostInt = 0;
long currTime = System.currentTimeMillis();
for(int j=2; j<=1000000; j++) {
long i = j;
int len = 0;
while((i=next(i)) != 1) {
len++;
}
if(len > mostLen) {
mostLen = len;
mostInt = j;
}
}
System.out.println(System.currentTimeMillis() - currTime);
System.out.println("Most len is " + mostLen + " for " + mostInt);
}
static long next(long i) {
if(i%2==0) {
return i/2;
} else {
return i*3+1;
}
}
My mistake was to try to introduce multithreading:
void doSearch() throws ExecutionException, InterruptedException {
final int numProc = Runtime.getRuntime().availableProcessors();
System.out.println("numProc = " + numProc);
ExecutorService executor = Executors.newFixedThreadPool(numProc);
long currTime = System.currentTimeMillis();
List<Future<ValueBean>> list = new ArrayList<Future<ValueBean>>();
for (int j = 2; j <= 1000000; j++) {
MyCallable<ValueBean> worker = new MyCallable<ValueBean>();
worker.setBean(new ValueBean(j, 0));
Future<ValueBean> f = executor.submit(worker);
list.add(f);
}
System.out.println(System.currentTimeMillis() - currTime);
int mostLen = 0;
int mostInt = 0;
for (Future<ValueBean> f : list) {
final int len = f.get().getLen();
if (len > mostLen) {
mostLen = len;
mostInt = f.get().getNum();
}
}
executor.shutdown();
System.out.println(System.currentTimeMillis() - currTime);
System.out.println("Most len is " + mostLen + " for " + mostInt);
}
public class MyCallable<T> implements Callable<ValueBean> {
public ValueBean bean;
public void setBean(ValueBean bean) {
this.bean = bean;
}
public ValueBean call() throws Exception {
long i = bean.getNum();
int len = 0;
while ((i = next(i)) != 1) {
len++;
}
return new ValueBean(bean.getNum(), len);
}
}
public class ValueBean {
int num;
int len;
public ValueBean(int num, int len) {
this.num = num;
this.len = len;
}
public int getNum() {
return num;
}
public int getLen() {
return len;
}
}
long next(long i) {
if (i % 2 == 0) {
return i / 2;
} else {
return i * 3 + 1;
}
}
Unfortunately, the multithreaded version ran 5 times slower than the single-threaded one on 4 processors (cores).
Then I tried a bit more crude approach:
static int mostLen = 0;
static int mostInt = 0;
synchronized static void updateIfMore(int len, int intgr) {
if (len > mostLen) {
mostLen = len;
mostInt = intgr;
}
}
public static void main(String[] args) throws InterruptedException {
long currTime = System.currentTimeMillis();
final int numProc = Runtime.getRuntime().availableProcessors();
System.out.println("numProc = " + numProc);
ExecutorService executor = Executors.newFixedThreadPool(numProc);
for (int i = 2; i <= 1000000; i++) {
final int j = i;
executor.execute(new Runnable() {
public void run() {
long l = j;
int len = 0;
while ((l = next(l)) != 1) {
len++;
}
updateIfMore(len, j);
}
});
}
executor.shutdown();
executor.awaitTermination(30, TimeUnit.SECONDS);
System.out.println(System.currentTimeMillis() - currTime);
System.out.println("Most len is " + mostLen + " for " + mostInt);
}
static long next(long i) {
if (i % 2 == 0) {
return i / 2;
} else {
return i * 3 + 1;
}
}
and it ran much faster, but it was still slower than the single-threaded approach.
I hope it's not because I screwed up the way I'm doing multithreading, but rather that this particular calculation/algorithm is not a good fit for parallel computation. If I change the calculation to make it more processor-intensive by replacing the method next with:
long next(long i) {
Random r = new Random();
for(int j=0; j<10; j++) {
r.nextLong();
}
if (i % 2 == 0) {
return i / 2;
} else {
return i * 3 + 1;
}
}
both multithreaded versions start to execute more than twice as fast as the single-threaded version on a 4-core machine.
So clearly there must be some threshold that you can use to determine if it is worth to introduce multithreading and my question is:
What is the basic rule that would help decide whether a given calculation is intensive enough to be worth optimizing by running it in parallel (without spending the effort to actually implement it)?
The key to efficiently implementing multithreading is to make sure the cost is not too high. There are no fixed rules as they heavily depend on your hardware.
Starting and stopping threads has a high cost. Of course you already used the executor service which reduces these costs considerably because it uses a bunch of worker threads to execute your Runnables. However each Runnable still comes with some overhead. Reducing the number of runnables and increasing the amount of work each one has to do will improve performance, but you still want to have enough runnables for the executor service to efficiently distribute them over the worker threads.
You have chosen to create one Runnable for each starting value, so you end up creating 1,000,000 Runnables. You would probably get much better results if you let each Runnable handle a batch of, say, 1000 starting values. That means you only need 1000 Runnables, greatly reducing the overhead.
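A sketch of that batching idea, reusing the executor, next and updateIfMore from the question (the batch size of 1000 is just an example):
final int batchSize = 1000;
for (int start = 2; start <= 1000000; start += batchSize) {
    final int from = start;
    final int to = Math.min(1000000, start + batchSize - 1);
    executor.execute(new Runnable() {
        public void run() {
            int bestLen = 0;
            int bestNum = 0;
            for (int j = from; j <= to; j++) {
                long l = j;
                int len = 0;
                while ((l = next(l)) != 1) {
                    len++;
                }
                if (len > bestLen) {
                    bestLen = len;
                    bestNum = j;
                }
            }
            updateIfMore(bestLen, bestNum); // one synchronized call per batch instead of per value
        }
    });
}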
I think there is another component to this which you are not considering. Parallelization works best when the units of work have no dependence on each other. Running a calculation in parallel is sub-optimal when later calculation results depend on earlier calculation results. The dependence could be strong in the sense of "I need the first value to compute the second value". In that case, the task is completely serial and later values cannot be computed without waiting for earlier computations. There could also be a weaker dependence in the sense of "If I had the first value I could compute the second value faster". In that case, the cost of parallelization is that some work may be duplicated.
This problem lends itself to being optimized without multithreading, because some of the later values can be computed faster if you have the previous results already in hand. Take, for example, j == 4. One pass through the inner loop produces i == 2, but you computed the result for j == 2 two iterations ago; if you saved that value of len, you can compute it as len(4) = 1 + len(2).
Using an array to store previously computed values of len, and a little bit of twiddling in the next method, you can complete the task more than 50x faster.
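A sketch of that memoization idea (it keeps the next method from the question and only caches lengths for starting values up to the limit; untested here):
static final int LIMIT = 1_000_000;
static final int[] lenCache = new int[LIMIT + 1]; // 0 means "not computed yet"

static int lenOf(long i) {
    if (i == 1) {
        return 0;
    }
    if (i <= LIMIT && lenCache[(int) i] != 0) {
        return lenCache[(int) i];
    }
    // Recursion depth is bounded by the remaining chain length (a few hundred here).
    int len = 1 + lenOf(next(i));
    if (i <= LIMIT) {
        lenCache[(int) i] = len;
    }
    return len;
}
The main loop then just takes the maximum of lenOf(j) for j from 2 to 1,000,000.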
"Will the performance gain be greater than the cost of context switching and thread creation?"
That cost is very OS-, language-, and hardware-dependent; this question has some discussion about the cost in Java, with some numbers and some pointers on how to calculate it.
You also want to have one thread per CPU, or fewer, for CPU-intensive work. Thanks to David Harkness for the pointer to a thread on how to work out that number.
Estimate the amount of work a thread can do without interacting with other threads (directly or via shared data). If that piece of work can be completed in 1 microsecond or less, the overhead is too large and multithreading is of no use. If it is 1 millisecond or more, multithreading should work well. If it is in between, experimental testing is required.
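A quick way to apply that rule of thumb to the problem above (a rough sketch, not a rigorous benchmark; it times the per-value work from the question single-threaded):
int samples = 10_000;
long sink = 0; // consume the work so it cannot be optimized away
long t0 = System.nanoTime();
for (int j = 2; j < 2 + samples; j++) {
    long i = j;
    while ((i = next(i)) != 1) {
        sink++;
    }
}
long perValueNanos = (System.nanoTime() - t0) / samples;
System.out.println("~" + perValueNanos + " ns per starting value (sink=" + sink + ")");
// Well under 1,000 ns: one task per value will be dominated by overhead.
// Around 1,000,000 ns or more: one task per value should parallelize well.
// In between: batch several values per task and measure.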
