Java Math.min/max performance

EDIT: maaartinus gave the answer I was looking for and tmyklebu's data on the problem helped a lot, so thanks both! :)
I've read a bit about how HotSpot has some "intrinsics" that it injects into the code, especially for the standard Java Math libraries (from here).
So I decided to give it a try, to see how much difference HotSpot could make against doing the comparison directly (especially since I've heard min/max can compile to branchless asm).
public class OpsMath {
    public static final int max(final int a, final int b) {
        if (a > b) {
            return a;
        }
        return b;
    }
}
That's my implementation. From another SO question I read that using the ternary operator uses an extra register; in any case, I haven't found significant differences between an if block and a ternary operator (i.e., return ( a > b ) ? a : b).
Allocating an 8 MB int array (i.e., 2 million values) and randomizing it, I do the following test:
try ( final Benchmark bench = new Benchmark( "millis to max" ) )
{
    int max = Integer.MIN_VALUE;
    for ( int i = 0; i < array.length; ++i )
    {
        max = OpsMath.max( max, array[i] );
        // max = Math.max( max, array[i] );
    }
}
I'm using a Benchmark object in a try-with-resources block. When it finishes, it calls close() on the object and prints the time the block took to complete. The tests are done separately by commenting in/out the max calls in the code above.
'max' is added to a list outside the benchmark block and printed later, to keep the JVM from optimizing the whole block away.
The array is randomized each time the test runs.
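The Benchmark class itself isn't shown here; a minimal sketch of what such a helper could look like (my reconstruction, assuming it simply wraps System.nanoTime()):

public final class Benchmark implements AutoCloseable {

    private final String label;
    private final long start = System.nanoTime();

    public Benchmark(final String label) {
        this.label = label;
    }

    @Override
    public void close() {
        // printed as e.g. "millis to max 2.046396" in the output below
        System.out.println(label + " " + (System.nanoTime() - start) / 1000000.0);
    }
}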
Running the test 6 times, it gives these results:
Java standard Math:
millis to max 9.242167
millis to max 2.1566199999999998
millis to max 2.046396
millis to max 2.048616
millis to max 2.035761
millis to max 2.001044
So fairly stable after the first run, and running the tests again gives similar results.
OpsMath:
millis to max 8.65418
millis to max 1.161559
millis to max 0.955851
millis to max 0.946642
millis to max 0.994543
millis to max 0.9469069999999999
Again, very stable results after the first run.
The question is: why? That's quite a big difference, and I have no idea why. Even if I implement my max() method exactly like Math.max() (i.e., return (a >= b) ? a : b) I still get better results! It makes no sense.
Specs:
CPU: Intel i5 2500, 3.3 GHz.
Java version: JDK 8 (public March 18 release), x64.
Debian Jessie (testing release) x64.
I have yet to try with a 32-bit JVM.
EDIT: Self-contained test as requested. Added a line to force the JVM to preload the Math and OpsMath classes. That eliminates the ~18 ms cost of the first iteration for the OpsMath test.
// Constant nano to millis.
final double TO_MILLIS = 1.0d / 1000000.0d;
// 8Mb alloc.
final int[] array = new int[(8*1024*1024)/4];
// Result and time array.
final ArrayList<Integer> results = new ArrayList<>();
final ArrayList<Double> times = new ArrayList<>();
// Number of tests.
final int itcount = 6;
// Call both Math and OpsMath method so JVM initializes the classes.
System.out.println("initialize classes " +
        OpsMath.max( Math.max( 20.0f, array.length ), array.length / 2.0f ));
final Random r = new Random();
for ( int it = 0; it < itcount; ++it )
{
    int max = Integer.MIN_VALUE;
    // Randomize the array.
    for ( int i = 0; i < array.length; ++i )
    {
        array[i] = r.nextInt();
    }
    final long start = System.nanoTime();
    for ( int i = 0; i < array.length; ++i )
    {
        max = Math.max( array[i], max );
        // OpsMath.max() method implemented as described.
        // max = OpsMath.max( array[i], max );
    }
    // Calc time.
    final double end = (System.nanoTime() - start);
    // Store results.
    times.add( Double.valueOf( end ) );
    results.add( Integer.valueOf( max ) );
}
// Print everything.
for ( int i = 0; i < itcount; ++i )
{
    System.out.println( "IT" + i + " result: " + results.get( i ) );
    System.out.println( "IT" + i + " millis: " + times.get( i ) * TO_MILLIS );
}
Java Math.max result:
IT0 result: 2147477409
IT0 millis: 9.636998
IT1 result: 2147483098
IT1 millis: 1.901314
IT2 result: 2147482877
IT2 millis: 2.095551
IT3 result: 2147483286
IT3 millis: 1.9232859999999998
IT4 result: 2147482828
IT4 millis: 1.9455179999999999
IT5 result: 2147482475
IT5 millis: 1.882047
OpsMath.max result:
IT0 result: 2147482689
IT0 millis: 9.003616
IT1 result: 2147483480
IT1 millis: 0.882421
IT2 result: 2147483186
IT2 millis: 1.079143
IT3 result: 2147478560
IT3 millis: 0.8861169999999999
IT4 result: 2147477851
IT4 millis: 0.916383
IT5 result: 2147481983
IT5 millis: 0.873984
Still the same overall results. I've tried randomizing the array only once and repeating the tests over the same array; I get faster results overall, but the same 2x difference between Java Math.max and OpsMath.max.

It's hard to tell why Math.max is slower than Ops.max, but it's easy to tell why this benchmark strongly favors branching over conditional moves: on the n-th iteration, the probability of
Math.max( array[i], max );
being not equal to max is the probability that array[n-1] is bigger than all previous elements. Obviously, this probability gets lower and lower with growing n, and given
final int[] array = new int[(8*1024*1024)/4];
it's rather negligible most of the time. The conditional move instruction is insensitive to the branching probability: it always takes the same amount of time to execute. The conditional move instruction is faster than branch prediction if the branch is very hard to predict; on the other hand, branch prediction is faster if the branch can be predicted well with high probability. Currently, I'm unsure about the speed of conditional move compared to the best and worst case of branching.¹
In your case, all but the first few branches are fairly predictable. From about n == 10 onward, there's no point in using conditional moves, as the branch is all but guaranteed to be predicted correctly and can execute in parallel with other instructions (I guess you need exactly one cycle per iteration).
This seems to happen for algorithms computing a minimum/maximum or doing some inefficient sorting (good branch predictability means low entropy per step).
¹ Both a conditional move and a predicted branch take one cycle. The problem with the former is that it needs its two operands ready, and this takes an additional instruction. In the end the critical path may get longer and/or the ALUs saturated while the branching unit is idle. Often, but not always, branches can be predicted well in practical applications; that's why branch prediction was invented in the first place.
As for the gory details of timing conditional move vs. branch prediction best and worst case, see the discussion below in the comments. My own benchmark shows that the conditional move is significantly faster than branch prediction when branch prediction encounters its worst case, but I can't ignore contradictory results. We need some explanation for what exactly makes the difference; some more benchmarks and/or analysis could help.
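A quick, non-rigorous way to see this in isolation is to compare a running max (branch almost never taken after the first few elements, hence well predicted) against a pairwise max over two random arrays (branch taken about half the time, hence hard to predict). This is my own sketch, not the OP's benchmark, and if the JIT compiles one of the loops to conditional moves the contrast will flatten out:

import java.util.Random;

public class BranchDemo {

    static int max(int a, int b) { return (a > b) ? a : b; } // branchy max

    public static void main(String[] args) {
        int n = 2_000_000;
        int[] a = new Random(42).ints(n).toArray();
        int[] b = new Random(43).ints(n).toArray();
        for (int run = 0; run < 6; run++) {
            long t0 = System.nanoTime();
            int m = Integer.MIN_VALUE;
            for (int i = 0; i < n; i++) m = max(m, a[i]);     // branch almost never taken: well predicted
            long t1 = System.nanoTime();
            int s = 0;
            for (int i = 0; i < n; i++) s += max(a[i], b[i]); // branch taken ~50% at random: hard to predict
            long t2 = System.nanoTime();
            System.out.println("run " + run + ": running max " + (t1 - t0) + " ns, pairwise max "
                    + (t2 - t1) + " ns (" + m + ", " + s + ")"); // print m and s so the loops aren't dead code
        }
    }
}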

When I run your (suitably-modified) code using Math.max on an old (1.6.0_27) JVM, the hot loop looks like this:
0x00007f4b65425c50: mov %r11d,%edi ;*getstatic array
; - foo146::bench#81 (line 40)
0x00007f4b65425c53: mov 0x10(%rax,%rdx,4),%r8d
0x00007f4b65425c58: mov 0x14(%rax,%rdx,4),%r10d
0x00007f4b65425c5d: mov 0x18(%rax,%rdx,4),%ecx
0x00007f4b65425c61: mov 0x2c(%rax,%rdx,4),%r11d
0x00007f4b65425c66: mov 0x28(%rax,%rdx,4),%r9d
0x00007f4b65425c6b: mov 0x24(%rax,%rdx,4),%ebx
0x00007f4b65425c6f: rex mov 0x20(%rax,%rdx,4),%esi
0x00007f4b65425c74: mov 0x1c(%rax,%rdx,4),%r14d ;*iaload
; - foo146::bench#86 (line 40)
0x00007f4b65425c79: cmp %edi,%r8d
0x00007f4b65425c7c: cmovl %edi,%r8d
0x00007f4b65425c80: cmp %r8d,%r10d
0x00007f4b65425c83: cmovl %r8d,%r10d
0x00007f4b65425c87: cmp %r10d,%ecx
0x00007f4b65425c8a: cmovl %r10d,%ecx
0x00007f4b65425c8e: cmp %ecx,%r14d
0x00007f4b65425c91: cmovl %ecx,%r14d
0x00007f4b65425c95: cmp %r14d,%esi
0x00007f4b65425c98: cmovl %r14d,%esi
0x00007f4b65425c9c: cmp %esi,%ebx
0x00007f4b65425c9e: cmovl %esi,%ebx
0x00007f4b65425ca1: cmp %ebx,%r9d
0x00007f4b65425ca4: cmovl %ebx,%r9d
0x00007f4b65425ca8: cmp %r9d,%r11d
0x00007f4b65425cab: cmovl %r9d,%r11d ;*invokestatic max
; - foo146::bench#88 (line 40)
0x00007f4b65425caf: add $0x8,%edx ;*iinc
; - foo146::bench#92 (line 39)
0x00007f4b65425cb2: cmp $0x1ffff9,%edx
0x00007f4b65425cb8: jl 0x00007f4b65425c50
Apart from the weirdly-placed REX prefix (I'm not sure what that's about), here you have a loop that's been unrolled 8 times and does mostly what you'd expect: loads, comparisons, and conditional moves. Interestingly, if you swap the order of the arguments to max, it outputs the other kind of 8-deep cmovl chain. I guess it doesn't know how to generate a 3-deep tree of cmovls, or 8 separate cmovl chains to be merged after the loop is done. (The latter shape can be rolled by hand at the Java level; see the sketch below.)
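For what it's worth, here is my own sketch of that "separate chains merged afterwards" idea, with two independent running maxima instead of eight; it is an experiment to try, not something the compiler is known to emit from the original loop:

static int maxTwoChains(int[] array) {
    int m0 = Integer.MIN_VALUE, m1 = Integer.MIN_VALUE;
    int i = 0;
    for (; i + 1 < array.length; i += 2) {
        m0 = Math.max(m0, array[i]);     // two independent running maxima, so the
        m1 = Math.max(m1, array[i + 1]); // two cmov dependency chains can overlap
    }
    if (i < array.length) m0 = Math.max(m0, array[i]); // odd leftover element
    return Math.max(m0, m1);
}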
With the explicit OpsMath.max, it turns into a ratsnest of conditional and unconditional branches that's unrolled 8 times. I'm not going to post the loop; it's not pretty. Basically each mov/cmp/cmovl above gets broken into a load, a compare, and a conditional jump to where a mov and a jmp happen. Interestingly, if you swap the order of the arguments to max, it outputs an 8-deep cmovle chain instead. EDIT: As @maaartinus points out, said ratsnest of branches is actually faster on some machines because the branch predictor works its magic on them and these are well-predicted branches.
I would hesitate to draw conclusions from this benchmark. You have benchmark construction issues: you have to run it a lot more times than you are, and you have to factor your code differently if you want to time HotSpot's fastest code. Beyond the wrapper code, you aren't measuring how fast your max is, or how well HotSpot understands what you're trying to do, or anything else of value here. Both implementations of max result in code that's entirely too fast for any sort of direct measurement to be meaningful within the context of a larger program.
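If one does want a meaningful measurement, a harness such as JMH takes care of warmup, forking, and dead-code elimination. A minimal sketch of how the comparison could be set up (this assumes the JMH annotations dependency on the classpath and the OpsMath class from the question; it is not part of the original answer):

import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class MaxBenchmark {

    int[] array;

    @Setup
    public void setup() {
        array = new Random(1234).ints(2_000_000).toArray();
    }

    @Benchmark
    public int mathMax() {
        int max = Integer.MIN_VALUE;
        for (int v : array) max = Math.max(max, v);
        return max; // returning the result keeps the JIT from discarding the loop
    }

    @Benchmark
    public int opsMax() {
        int max = Integer.MIN_VALUE;
        for (int v : array) max = OpsMath.max(max, v);
        return max;
    }
}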

Using JDK 8:
java version "1.8.0"
Java(TM) SE Runtime Environment (build 1.8.0-b132)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode)
On Ubuntu 13.10
I ran the following:
import java.util.Random;
import java.util.function.BiFunction;

public class MaxPerformance {

    private final BiFunction<Integer, Integer, Integer> max;
    private final int[] array;

    public MaxPerformance(BiFunction<Integer, Integer, Integer> max, int[] array) {
        this.max = max;
        this.array = array;
    }

    public double time() {
        long start = System.nanoTime();
        int m = Integer.MIN_VALUE;
        for (int i = 0; i < array.length; ++i) m = max.apply(m, array[i]);
        m = Integer.MIN_VALUE;
        for (int i = 0; i < array.length; ++i) m = max.apply(array[i], m);
        // total time over number of calls to max
        return ((double) (System.nanoTime() - start)) / (double) array.length / 2.0;
    }

    public double averageTime(int repeats) {
        double cumulativeTime = 0;
        for (int i = 0; i < repeats; i++)
            cumulativeTime += time();
        return (double) cumulativeTime / (double) repeats;
    }

    public static void main(String[] args) {
        int size = 1000000;
        Random random = new Random(123123123L);
        int[] array = new int[size];
        for (int i = 0; i < size; i++) array[i] = random.nextInt();
        double tMath = new MaxPerformance(Math::max, array).averageTime(100);
        double tAlt1 = new MaxPerformance(MaxPerformance::max1, array).averageTime(100);
        double tAlt2 = new MaxPerformance(MaxPerformance::max2, array).averageTime(100);
        System.out.println("Java Math: " + tMath);
        System.out.println("Alt 1: " + tAlt1);
        System.out.println("Alt 2: " + tAlt2);
    }

    public static int max1(final int a, final int b) {
        if (a >= b) return a;
        return b;
    }

    public static int max2(final int a, final int b) {
        return (a >= b) ? a : b; // same as JDK implementation
    }
}
And I got the following results (average nanoseconds taken for each call to max):
Java Math: 15.443555810000003
Alt 1: 14.968298919999997
Alt 2: 16.442204045
So over a long run it looks like the second implementation is the fastest, although by a relatively small margin.
In order to have a slightly more scientific test, it makes sense to compute the max of pairs of elements where each call is independent of the previous one. This can be done by using two randomized arrays instead of one, as in this benchmark:
import java.util.Random;
import java.util.function.BiFunction;

public class MaxPerformance2 {

    private final BiFunction<Integer, Integer, Integer> max;
    private final int[] array1, array2;

    public MaxPerformance2(BiFunction<Integer, Integer, Integer> max, int[] array1, int[] array2) {
        this.max = max;
        this.array1 = array1;
        this.array2 = array2;
        if (array1.length != array2.length) throw new IllegalArgumentException();
    }

    public double time() {
        long start = System.nanoTime();
        int m = Integer.MIN_VALUE;
        for (int i = 0; i < array1.length; ++i) m = max.apply(array1[i], array2[i]);
        m += m; // to avoid optimizations!
        return ((double) (System.nanoTime() - start)) / (double) array1.length;
    }

    public double averageTime(int repeats) {
        // warm up rounds:
        double tmp = 0;
        for (int i = 0; i < 10; i++) tmp += time();
        tmp *= 2.0;
        double cumulativeTime = 0;
        for (int i = 0; i < repeats; i++)
            cumulativeTime += time();
        return cumulativeTime / (double) repeats;
    }

    public static void main(String[] args) {
        int size = 1000000;
        Random random = new Random(123123123L);
        int[] array1 = new int[size];
        int[] array2 = new int[size];
        for (int i = 0; i < size; i++) {
            array1[i] = random.nextInt();
            array2[i] = random.nextInt();
        }
        double tMath = new MaxPerformance2(Math::max, array1, array2).averageTime(100);
        double tAlt1 = new MaxPerformance2(MaxPerformance2::max1, array1, array2).averageTime(100);
        double tAlt2 = new MaxPerformance2(MaxPerformance2::max2, array1, array2).averageTime(100);
        System.out.println("Java Math: " + tMath);
        System.out.println("Alt 1: " + tAlt1);
        System.out.println("Alt 2: " + tAlt2);
    }

    public static int max1(final int a, final int b) {
        if (a >= b) return a;
        return b;
    }

    public static int max2(final int a, final int b) {
        return (a >= b) ? a : b; // same as JDK implementation
    }
}
Which gave me:
Java Math: 15.346468170000005
Alt 1: 16.378737519999998
Alt 2: 20.506475350000006
The way your test is set up makes a huge difference in the results. The JDK version seems to be the fastest in this scenario, this time by a relatively large margin compared to the previous case.
Somebody mentioned Caliper. Well, if you read the wiki, one of the first things they say about micro-benchmarking is not to do it: this is because it's hard to get accurate results in general. I think this is a clear example of that.

Here's a branchless min operation; max can be implemented by replacing diff = a - b with diff = b - a.
public static final long min(final long a, final long b) {
    final long diff = a - b;
    // All zeroes if a >= b, all ones if a < b, because the sign bit is propagated
    final long mask = diff >> 63;
    return (a & mask) | (b & (~mask));
}
It should be as fast as streaming the memory because the CPU operations should be hidden by the sequential memory read latency.
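The same idea works for int with a shift by 31. One caveat worth flagging for both variants (my note, not the original answer's): the subtraction can overflow when the operands are far apart, which flips the mask and yields the wrong result. A sketch of the int version:

public static int min(final int a, final int b) {
    // Caveat (applies to the long version above as well): a - b overflows when the
    // operands are more than Integer.MAX_VALUE apart, e.g. min(Integer.MIN_VALUE, 1)
    // would wrongly return 1, so use this only when the difference cannot overflow.
    final int diff = a - b;
    final int mask = diff >> 31; // all zeroes if a >= b, all ones if a < b
    return (a & mask) | (b & (~mask));
}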

Related

Multithreaded Segmented Sieve of Eratosthenes in Java

I am trying to create a fast prime generator in Java. It is (more or less) accepted that the fastest way for this is the segmented sieve of Eratosthenes: https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes. Lots of optimizations can be further implemented to make it faster. As of now, my implementation generates 50847534 primes below 10^9 in about 1.6 seconds, but I am looking to make it faster and at least break the 1 second barrier. To increase the chance of getting good replies, I will include a walkthrough of the algorithm as well as the code.
Still, as a TL;DR: I am looking to introduce multithreading into the code.
For the purposes of this question, I want to distinguish between the 'segmented' and the 'traditional' sieves of Eratosthenes. The traditional sieve requires O(n) space and is therefore very limited in the range of the input (the limit n). The segmented sieve, however, only requires O(n^0.5) space and can operate on much larger limits. (A main speed-up is using cache-friendly segmentation, taking into account the L1 & L2 cache sizes of the specific computer.) Finally, the main difference that concerns my question is that the traditional sieve is sequential, meaning it can only continue once the previous steps are completed. The segmented sieve, however, is not: each segment is independent and is 'processed' individually against the sieving primes (the primes not larger than n^0.5). This means that, theoretically, once I have the sieving primes, I can divide the work between multiple computers, each processing a different segment; the work of each is independent of the others. Assuming (wrongly) that each segment requires the same amount of time t to complete, and there are k segments, one computer would require a total time of T = k * t, whereas k computers, each working on a different segment, would require a total time of T = t to complete the entire process. (Practically, this is wrong, but for the sake of simplicity of the example.)
This brought me to reading about multithreading: dividing the work among a few threads, each processing a smaller amount of work, for better usage of the CPU. To my understanding, the traditional sieve cannot be multithreaded precisely because it is sequential; each thread would depend on the previous one, rendering the entire idea unfeasible. But a segmented sieve may indeed (I think) be multithreaded.
Instead of jumping straight into my question, I think it is important to introduce my code first, so I am hereby including my current fastest implementation of the segmented sieve. I have worked quite hard on it. It took quite some time, slowly tweaking and adding optimizations to it. The code is not simple. It is rather complex, I would say. I therefore assume the reader is familiar with the concepts I am introducing, such as wheel factorization, prime numbers, segmentation and more. I have included notes to make it easier to follow.
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;

public class primeGen {

    public static long x = (long) Math.pow(10, 9); //limit
    public static int sqrtx;
    public static boolean[] sievingPrimes; //the sieving primes, <= sqrtx
    public static int[] wheels = new int[] {2, 3, 5, 7, 11, 13, 17, 19}; // base wheel primes
    public static int[] gaps; //the gaps, according to the wheel. will enable skipping multiples of the wheel primes
    public static int nextp; // the first prime > wheel primes
    public static int l; // the number of gaps in the wheel

    public static void main(String[] args)
    {
        long startTime = System.currentTimeMillis();
        preCalc(); // creating the sieving primes and calculating the list of gaps
        int segSize = Math.max(sqrtx, 32768 * 8); //size of each segment
        long u = nextp; // 'u' is the running index of the program. will continue from one segment to the next
        int wh = 0; // this will be the gap index, indicating by how much we increment 'u' each time, skipping the multiples of the wheel primes
        long pi = pisqrtx(); // the primes count. initialize with the number of primes <= sqrtx
        for (long low = 0; low < x; low += segSize) //the heart of the code. enumerating the primes through segmentation. enumeration will begin at p > sqrtx
        {
            long high = Math.min(x, low + segSize);
            boolean[] segment = new boolean[(int) (high - low + 1)];
            int g = -1;
            for (int i = nextp; i <= sqrtx; i += gaps[g])
            {
                if (sievingPrimes[(i + 1) / 2])
                {
                    long firstMultiple = (long) (low / i * i);
                    if (firstMultiple < low)
                        firstMultiple += i;
                    if (firstMultiple % 2 == 0) //start with the first odd multiple of the current prime in the segment
                        firstMultiple += i;
                    for (long j = firstMultiple; j < high; j += i * 2)
                        segment[(int) (j - low)] = true;
                }
                g++;
                //if (g == l) //due to segment size, the full list of gaps is never used **within just one segment**, and therefore this check is redundant.
                //should be used with bigger segment sizes or smaller lists of gaps
                //g = 0;
            }
            while (u <= high)
            {
                if (!segment[(int) (u - low)])
                    pi++;
                u += gaps[wh];
                wh++;
                if (wh == l)
                    wh = 0;
            }
        }
        System.out.println(pi);
        long endTime = System.currentTimeMillis();
        System.out.println("Solution took " + (endTime - startTime) + " ms");
    }

    public static boolean[] simpleSieve(int l)
    {
        long sqrtl = (long) Math.sqrt(l);
        boolean[] primes = new boolean[l / 2 + 2];
        Arrays.fill(primes, true);
        int g = -1;
        for (int i = nextp; i <= sqrtl; i += gaps[g])
        {
            if (primes[(i + 1) / 2])
                for (int j = i * i; j <= l; j += i * 2)
                    primes[(j + 1) / 2] = false;
            g++;
            if (g == l)
                g = 0;
        }
        return primes;
    }

    public static long pisqrtx()
    {
        int pi = wheels.length;
        if (x < wheels[wheels.length - 1])
        {
            if (x < 2)
                return 0;
            int k = 0;
            while (wheels[k] <= x)
                k++;
            return k;
        }
        int g = -1;
        for (int i = nextp; i <= sqrtx; i += gaps[g])
        {
            if (sievingPrimes[(i + 1) / 2])
                pi++;
            g++;
            if (g == l)
                g = 0;
        }
        return pi;
    }

    public static void preCalc()
    {
        sqrtx = (int) Math.sqrt(x);
        int prod = 1;
        for (long p : wheels)
            prod *= p; // primorial
        nextp = BigInteger.valueOf(wheels[wheels.length - 1]).nextProbablePrime().intValue(); //the first prime that comes after the wheel
        int lim = prod + nextp; // circumference of the wheel
        boolean[] marks = new boolean[lim + 1];
        Arrays.fill(marks, true);
        for (int j = 2 * 2; j <= lim; j += 2)
            marks[j] = false;
        for (int i = 1; i < wheels.length; i++)
        {
            int p = wheels[i];
            for (int j = p * p; j <= lim; j += 2 * p)
                marks[j] = false; // removing all integers that are NOT coprime with the base wheel primes
        }
        ArrayList<Integer> gs = new ArrayList<Integer>(); //list of the gaps between the integers that are coprime with the base wheel primes
        int d = nextp;
        for (int p = d + 2; p < marks.length; p += 2)
        {
            if (marks[p]) //d is prime. if p is also prime, then a gap is identified, and is noted.
            {
                gs.add(p - d);
                d = p;
            }
        }
        gaps = new int[gs.size()];
        for (int i = 0; i < gs.size(); i++)
            gaps[i] = gs.get(i); // Arrays are faster than lists, so moving the list of gaps to an array
        l = gaps.length;
        sievingPrimes = simpleSieve(sqrtx); //initializing the sieving primes
    }
}
Currently, it produces 50847534 primes below 10^9 in about 1.6 seconds. This is very impressive, at least by my standards, but I am looking to make it faster and possibly break the 1-second barrier. Even then, I believe it can be made much faster still.
The whole program is based on wheel factorization: https://en.wikipedia.org/wiki/Wheel_factorization. I have noticed I get the fastest results using a wheel of all primes up to 19.
public static int[] wheels = new int[] {2, 3, 5, 7, 11, 13, 17, 19}; // base wheel primes
This means that the multiples of those primes are skipped, resulting in a much smaller search range. The gaps between the numbers we do need to visit are then calculated in the preCalc method; making those jumps between the numbers in the search range skips the multiples of the base primes, as the toy example below illustrates.
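As a toy illustration of the gap idea (my own example, not part of the program): for the base primes {2, 3, 5}, the candidates coprime to 30 starting at 7 are 7, 11, 13, 17, 19, 23, 29, 31, 37, ..., so the repeating gap list is {4, 2, 4, 2, 4, 6, 2, 6}:

public class WheelGapsDemo {
    public static void main(String[] args) {
        int[] gaps = {4, 2, 4, 2, 4, 6, 2, 6}; // gap list for the {2, 3, 5} wheel
        int u = 7, wh = 0;
        for (int step = 0; step < 16; step++) {
            System.out.print(u + " ");   // visits only numbers coprime to 2, 3 and 5
            u += gaps[wh];
            wh = (wh + 1) % gaps.length; // wrap around the wheel
        }
        System.out.println();
    }
}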
public static void preCalc()
{
    sqrtx = (int) Math.sqrt(x);
    int prod = 1;
    for (long p : wheels)
        prod *= p; // primorial
    nextp = BigInteger.valueOf(wheels[wheels.length - 1]).nextProbablePrime().intValue(); //the first prime that comes after the wheel
    int lim = prod + nextp; // circumference of the wheel
    boolean[] marks = new boolean[lim + 1];
    Arrays.fill(marks, true);
    for (int j = 2 * 2; j <= lim; j += 2)
        marks[j] = false;
    for (int i = 1; i < wheels.length; i++)
    {
        int p = wheels[i];
        for (int j = p * p; j <= lim; j += 2 * p)
            marks[j] = false; // removing all integers that are NOT coprime with the base wheel primes
    }
    ArrayList<Integer> gs = new ArrayList<Integer>(); //list of the gaps between the integers that are coprime with the base wheel primes
    int d = nextp;
    for (int p = d + 2; p < marks.length; p += 2)
    {
        if (marks[p]) //d is prime. if p is also prime, then a gap is identified, and is noted.
        {
            gs.add(p - d);
            d = p;
        }
    }
    gaps = new int[gs.size()];
    for (int i = 0; i < gs.size(); i++)
        gaps[i] = gs.get(i); // Arrays are faster than lists, so moving the list of gaps to an array
    l = gaps.length;
    sievingPrimes = simpleSieve(sqrtx); //initializing the sieving primes
}
At the end of the preCalc method, the simpleSieve method is called, efficiently sieving all the sieving primes mentioned before: the primes <= sqrtx. This is a simple sieve of Eratosthenes rather than a segmented one, but it is still based on the wheel factorization computed previously.
public static boolean[] simpleSieve(int l)
{
    long sqrtl = (long) Math.sqrt(l);
    boolean[] primes = new boolean[l / 2 + 2];
    Arrays.fill(primes, true);
    int g = -1;
    for (int i = nextp; i <= sqrtl; i += gaps[g])
    {
        if (primes[(i + 1) / 2])
            for (int j = i * i; j <= l; j += i * 2)
                primes[(j + 1) / 2] = false;
        g++;
        if (g == l)
            g = 0;
    }
    return primes;
}
Finally, we reach the heart of the algorithm. We start by enumerating all primes <= sqrtx, with the following call:
long pi = pisqrtx();
which uses the following method:
public static long pisqrtx()
{
    int pi = wheels.length;
    if (x < wheels[wheels.length - 1])
    {
        if (x < 2)
            return 0;
        int k = 0;
        while (wheels[k] <= x)
            k++;
        return k;
    }
    int g = -1;
    for (int i = nextp; i <= sqrtx; i += gaps[g])
    {
        if (sievingPrimes[(i + 1) / 2])
            pi++;
        g++;
        if (g == l)
            g = 0;
    }
    return pi;
}
Then, after initializing the pi variable, which keeps track of the count of primes, we perform the segmentation mentioned above, starting the enumeration from the first prime > sqrtx:
int segSize = Math.max(sqrtx, 32768 * 8); //size of each segment
long u = nextp; // 'u' is the running index of the program. will continue from one segment to the next
int wh = 0; // this will be the gap index, indicating by how much we increment 'u' each time, skipping the multiples of the wheel primes
long pi = pisqrtx(); // the primes count. initialize with the number of primes <= sqrtx
for (long low = 0; low < x; low += segSize) //the heart of the code. enumerating the primes through segmentation. enumeration will begin at p > sqrtx
{
    long high = Math.min(x, low + segSize);
    boolean[] segment = new boolean[(int) (high - low + 1)];
    int g = -1;
    for (int i = nextp; i <= sqrtx; i += gaps[g])
    {
        if (sievingPrimes[(i + 1) / 2])
        {
            long firstMultiple = (long) (low / i * i);
            if (firstMultiple < low)
                firstMultiple += i;
            if (firstMultiple % 2 == 0) //start with the first odd multiple of the current prime in the segment
                firstMultiple += i;
            for (long j = firstMultiple; j < high; j += i * 2)
                segment[(int) (j - low)] = true;
        }
        g++;
        //if (g == l) //due to segment size, the full list of gaps is never used **within just one segment**, and therefore this check is redundant.
        //should be used with bigger segment sizes or smaller lists of gaps
        //g = 0;
    }
    while (u <= high)
    {
        if (!segment[(int) (u - low)])
            pi++;
        u += gaps[wh];
        wh++;
        if (wh == l)
            wh = 0;
    }
}
I have also included this as a note in the code, but will explain it as well: because the segment size is relatively small, we will not go through the entire list of gaps within just one segment, so checking for the wrap-around there is redundant (assuming we use a 19-wheel). But in the broader scope of the program we do make use of the entire array of gaps, so the variable u has to follow it and not accidentally overrun it:
while (u <= high)
{
    if (!segment[(int) (u - low)])
        pi++;
    u += gaps[wh];
    wh++;
    if (wh == l)
        wh = 0;
}
Using higher limits will eventually yield a bigger segment, which might make it necessary to check that we don't overrun the gaps list even within a single segment; this, or tweaking the wheel's base primes, can have that effect on the program. Switching to bit-sieving can largely raise the segment limit, though; see the sketch below.
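For reference, the packed-bit idea replaces the boolean[] segment with a long[] in which every bit is one flag, cutting the segment's memory by a factor of 8 so a bigger segment fits in the same cache. A minimal sketch (the helper names are mine):

class BitSegment {
    // one long holds 64 flags, so an odd-only segment of size segSize needs (segSize / 2 + 63) / 64 longs
    static void mark(long[] bits, int idx) {
        bits[idx >>> 6] |= 1L << (idx & 63); // idx & 63 is idx % 64
    }

    static boolean isMarked(long[] bits, int idx) {
        return (bits[idx >>> 6] & (1L << (idx & 63))) != 0;
    }
}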
As an important side note, I am aware that efficient segmentation is one that takes the L1 & L2 cache sizes into account. I get the fastest results using a segment size of 32,768 * 8 = 262,144 = 2^18. I am not sure what the cache size of my computer is, but I do not think it can be that big, as I see most cache sizes <= 32,768. Still, this produces the fastest run time on my computer, so that is why it's the chosen segment size.
As I mentioned, I am still looking to improve this by a lot. I believe, according to my introduction, that multithreading can result in a speed-up factor of 4, using 4 threads (corresponding to 4 cores). The idea is that each thread will still use the segmented sieve, but work on a different portion: divide n into 4 equal portions, one per thread, each in turn performing the segmentation on the n/4 elements it is responsible for, using the above program. My question is how do I do that? Reading about multithreading and examples unfortunately did not give me any insight on how to implement it efficiently in the case above. It seemed to me, contrary to the logic behind it, that the threads were running sequentially rather than simultaneously. This is why I excluded it from the code, to make it more readable. I will really appreciate a code sample on how to do it in this specific code, but a good explanation and reference will maybe do the trick too.
Additionally, I would like to hear about more ways of speeding up this program even more; any ideas you have, I would love to hear! I really want to make it very fast and efficient. Thank you!
An example like this should help you get started.
An outline of a solution:
Define a data structure ("Task") that encompasses a specific segment; you can put all the immutable shared data into it for extra neatness, too. If you're careful enough, you can pass a common mutable array to all tasks, along with the segment limits, and only update the part of the array within these limits. This is more error-prone, but can simplify the step of joining the results (AFAICT; YMMV).
Define a data structure ("Result") that stores the result of a Task computation. Even if you just update a shared resulting structure, you may need to signal which part of that structure has been updated so far.
Create a Runnable that accepts a Task, runs a computation, and puts the results into a given result queue.
Create a blocking input queue for Tasks, and a queue for Results.
Create a ThreadPoolExecutor with a number of threads close to the number of machine cores.
Submit all your Tasks to the thread pool executor. They will be scheduled to run on the threads from the pool, and will put their results into the output queue, not necessarily in order.
Wait for all the tasks in the thread pool to finish.
Drain the output queue and join the partial results into the final result.
Extra speedup may (or may not) be achieved by joining the results in a separate task that reads the output queue, or even by updating a mutable shared output structure under synchronized, depending on how much work the joining step involves.
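To make the outline concrete, here is a minimal sketch using a fixed thread pool and Futures in place of hand-rolled queues. The class and method names are mine, and countPrimesInSegment is only a placeholder for the segment loop from the question:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSieveSketch {

    // Placeholder for the existing segment loop: sieve [low, high) against the
    // (already computed, read-only) sieving primes and return the prime count.
    static long countPrimesInSegment(long low, long high) {
        return 0L; // plug the segment code from the question in here
    }

    public static void main(String[] args) throws Exception {
        final long x = 1_000_000_000L;
        final long segSize = 1L << 18;
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<Long>> results = new ArrayList<>();
        for (long low = 0; low < x; low += segSize) {
            final long lo = low;
            final long hi = Math.min(x, low + segSize);
            // Each task is independent: it only reads the shared sieving primes.
            results.add(pool.submit(() -> countPrimesInSegment(lo, hi)));
        }
        long pi = 0;
        for (Future<Long> f : results) {
            pi += f.get(); // get() blocks until that task has finished
        }
        pool.shutdown();
        System.out.println(pi);
    }
}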
Hope this helps.
Are you familiar with the work of Tomas Oliveira e Silva? He has a very fast implementation of the Sieve of Eratosthenes.
How interested in speed are you? Would you consider using C++?
$ time ../c_code/segmented_bit_sieve 1000000000
50847534 primes found.
real 0m0.875s
user 0m0.813s
sys 0m0.016s
$ time ../c_code/segmented_bit_isprime 1000000000
50847534 primes found.
real 0m0.816s
user 0m0.797s
sys 0m0.000s
(on my newish laptop with an i5)
The first is from @Kim Walisch, using a bit array of odd prime candidates.
https://github.com/kimwalisch/primesieve/wiki/Segmented-sieve-of-Eratosthenes
The second is my tweak to Kim's, with IsPrime[] also implemented as a bit array, which is slightly less clear to read, although a little faster for big N due to the reduced memory footprint.
I will read your post carefully as I am interested in primes and performance no matter what language is used. I hope this isn't too far off topic or premature. But I noticed I was already beyond your performance goal.

Is a bitwise operation faster than the modulo/remainder operator in Java?

I read in a couple of blogs that in Java the modulo/remainder operator is slower than bitwise AND. So I wrote the following program to test it.
public class ModuloTest {

    public static void main(String[] args) {
        final int size = 1024;
        int index = 0;

        long start = System.nanoTime();
        for (int i = 0; i < Integer.MAX_VALUE; i++) {
            getNextIndex(size, i);
        }
        long end = System.nanoTime();
        System.out.println("Time taken by Modulo (%) operator --> " + (end - start) + "ns.");

        start = System.nanoTime();
        final int shiftFactor = size - 1;
        for (int i = 0; i < Integer.MAX_VALUE; i++) {
            getNextIndexBitwise(shiftFactor, i);
        }
        end = System.nanoTime();
        System.out.println("Time taken by bitwise AND --> " + (end - start) + "ns.");
    }

    private static int getNextIndex(int size, int nextInt) {
        return nextInt % size;
    }

    private static int getNextIndexBitwise(int size, int nextInt) {
        return nextInt & size;
    }
}
But in my runtime environment (MacBook Pro 2.9 GHz i7, 8 GB RAM, JDK 1.7.0_51) I am seeing otherwise: the bitwise AND is significantly slower, in fact twice as slow as the remainder operator.
I would appreciate it if someone could help me understand whether this is intended behavior or I am doing something wrong.
Thanks,
Niranjan
Your code reports bitwise AND being much faster on each Mac I've tried it on, with both Java 6 and Java 7. I suspect the first portion of the test on your machine happened to coincide with other activity on the system. You should try running the test multiple times to verify you aren't seeing distortions based on that. (I would have left this as a 'comment' rather than an 'answer', but apparently you need 50 reputation to do that -- quite silly, if you ask me.)
For starters, the logical-conjunction trick only works with natural-number dividends and power-of-2 divisors. So if you need negative dividends, floats, or non-powers of 2, stick with the default % operator; the example below shows where the two diverge.
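A minimal demonstration of both caveats (my example):

public class ModVsAnd {
    public static void main(String[] args) {
        int size = 1024;                       // a power of 2, so x & (size - 1) can replace x % size
        System.out.println(37 % size);         // 37
        System.out.println(37 & (size - 1));   // 37  -- same result
        System.out.println(-37 % size);        // -37 (Java's % keeps the sign of the dividend)
        System.out.println(-37 & (size - 1));  // 987 -- different!
        System.out.println(1037 % 1000);       // 37
        System.out.println(1037 & (1000 - 1)); // 5   -- wrong: 1000 is not a power of 2
    }
}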
My tests (with JIT warmup and 1M random iterations), on an i7 with a ton of cores and plenty of RAM, show about 20% better performance from the bitwise operation. This can vary per run, depending on how the process scheduler runs the code.
Using Scala 2.11.8 on JDK 1.8.91.
4 GHz i7-4790K, 8-core AMD, 32 GB PC3-19200 RAM, SSD.
This example in particular will always give you a wrong result. Moreover, I believe that any program calculating the modulo by a power of 2 will be faster than bitwise AND.
REASON: When you use N % X, where X is the k-th power of 2, only the last k bits are considered for the modulo, whereas in the case of the bitwise AND operator the runtime actually has to visit each bit of the number in question.
Also, I would like to point out that the HotSpot JVM optimizes repetitive calculations of a similar nature (one example being branch prediction). In your case, the method using the modulo just returns the last 10 bits of the number, because 1024 is the 10th power of 2.
Try using some prime-number value for size and check the same result.
Disclaimer: micro-benchmarking is not considered good practice.
Is this method missing something?
public static void oddVSmod() {
    float tests = 100000000;
    oddbit(tests);
    modbit(tests);
}

public static void oddbit(float tests) {
    for (int i = 0; i < tests; i++)
        if ((i & 1) == 1) { System.out.print(" " + i); }
    System.out.println();
}

public static void modbit(float tests) {
    for (int i = 0; i < tests; i++)
        if ((i % 2) == 1) { System.out.print(" " + i); }
    System.out.println();
}
With that, I used the NetBeans built-in profiler (advanced mode) to run this. I set var tests up to 10×10^8, and every time it showed that modulo is faster than bitwise.
Thank you all for the valuable input.
@pamphlet: Thank you very much for the concerns, but negative comments are fine with me. I confess that I did not do proper testing as suggested by AndyG. AndyG could have used a softer tone, but it's okay; sometimes negatives help you see the positive. :)
That said, I changed my code (as shown below) in a way that lets me run the test multiple times.
public class ModuloTest {

    public static final int SIZE = 1024;

    public int usingModuloOperator(final int operand1, final int operand2) {
        return operand1 % operand2;
    }

    public int usingBitwiseAnd(final int operand1, final int operand2) {
        return operand1 & operand2;
    }

    public void doCalculationUsingModulo(final int size) {
        for (int i = 0; i < Integer.MAX_VALUE; i++) {
            usingModuloOperator(1, size);
        }
    }

    public void doCalculationUsingBitwise(final int size) {
        for (int i = 0; i < Integer.MAX_VALUE; i++) {
            usingBitwiseAnd(i, size);
        }
    }

    public static void main(String[] args) {
        final ModuloTest moduloTest = new ModuloTest();
        final int invocationCount = 100;
        // testModuloOperator(moduloTest, invocationCount);
        testBitwiseOperator(moduloTest, invocationCount);
    }

    private static void testModuloOperator(final ModuloTest moduloTest, final int invocationCount) {
        for (int i = 0; i < invocationCount; i++) {
            final long startTime = System.nanoTime();
            moduloTest.doCalculationUsingModulo(SIZE);
            final long timeTaken = System.nanoTime() - startTime;
            System.out.println("Using modulo operator // Time taken for invocation counter " + i + " is " + timeTaken + "ns");
        }
    }

    private static void testBitwiseOperator(final ModuloTest moduloTest, final int invocationCount) {
        for (int i = 0; i < invocationCount; i++) {
            final long startTime = System.nanoTime();
            moduloTest.doCalculationUsingBitwise(SIZE);
            final long timeTaken = System.nanoTime() - startTime;
            System.out.println("Using bitwise operator // Time taken for invocation counter " + i + " is " + timeTaken + "ns");
        }
    }
}
I called testModuloOperator() and testBitwiseOperator() in a mutually exclusive way. The result was consistent with the idea that bitwise is faster than the modulo operator. I ran each of the calculations 100 times and recorded the execution times, then removed the first five and last five recordings and used the rest to compute the average time. Below are my test results.
Using modulo operator, the avg. time for 90 runs: 8388.89ns.
Using bitwise-AND operator, the avg. time for 90 runs: 722.22ns.
Please suggest whether my approach is correct or not.
Thanks again.
Niranjan

Java 8 nested loops with streams & performance

In order to practice Java 8 streams, I tried converting the following nested loop to the Java 8 stream API. It calculates the largest digit sum of a^b (a, b < 100) and takes ~0.135 s on my Core i5 760.
public static int digitSum(BigInteger x)
{
    int sum = 0;
    for (char c : x.toString().toCharArray()) { sum += Integer.valueOf(c + ""); }
    return sum;
}

@Test public void solve()
{
    int max = 0;
    for (int i = 1; i < 100; i++)
        for (int j = 1; j < 100; j++)
            max = Math.max(max, digitSum(BigInteger.valueOf(i).pow(j)));
    System.out.println(max);
}
My solution, which I expected to be faster because of the parallelism, actually took 0.25 s (0.19 s without the parallel()):
int max = IntStream.range(1, 100).parallel()
        .map(i -> IntStream.range(1, 100)
                .map(j -> digitSum(BigInteger.valueOf(i).pow(j)))
                .max().getAsInt())
        .max().getAsInt();
My questions
did I do the conversion right or is there a better way to convert nested loops to stream calculations?
why is the stream variant so much slower than the old one?
why did the parallel() statement actually increase the time from 0.19 s to 0.25 s?
I know that microbenchmarks are fragile and parallelism is only worth it for big problems but for a CPU, even 0.1s is an eternity, right?
Update
I measure with the JUnit 4 framework in Eclipse Kepler (it shows the time taken to execute a test).
My results for a,b<1000 instead of 100:
traditional loop 186s
sequential stream 193s
parallel stream 55s
Update 2
Replacing sum += Integer.valueOf(c + ""); with sum += c - '0'; (thanks Peter!) shaved 10 whole seconds off the parallel method, bringing it to 45 s. I didn't expect such a big performance impact! The updated digitSum is shown below.
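For reference, the updated digitSum then looks like this (the same output for the non-negative values used here, without the per-digit boxing and string concatenation):

public static int digitSum(BigInteger x)
{
    int sum = 0;
    for (char c : x.toString().toCharArray()) { sum += c - '0'; } // digit chars are contiguous, so c - '0' is the digit's value
    return sum;
}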
Also, reducing the parallelism to the number of CPU cores (4 in my case) didn't do much, as it reduced the time only to 44.8 s (yes, it adds a and b = 0, but I think this won't impact the performance much):
int max = IntStream.range(0, 4).parallel()
        .map(m -> IntStream.range(0, 250)
                .map(i -> IntStream.range(1, 1000)
                        .map(j -> digitSum(BigInteger.valueOf(250 * m + i).pow(j)))
                        .max().getAsInt())
                .max().getAsInt())
        .max().getAsInt();
I have created a quick and dirty micro benchmark based on your code. The results are:
loop: 3192
lambda: 3140
lambda parallel: 868
So the loop and lambda are equivalent and the parallel stream significantly improves the performance. I suspect your results are unreliable due to your benchmarking methodology.
public static void main(String[] args) {
    int sum = 0;
    // warmup
    for (int i = 0; i < 100; i++) {
        solve();
        solveLambda();
        solveLambdaParallel();
    }
    {
        long start = System.nanoTime();
        for (int i = 0; i < 100; i++) {
            sum += solve();
        }
        long end = System.nanoTime();
        System.out.println("loop: " + (end - start) / 1_000_000);
    }
    {
        long start = System.nanoTime();
        for (int i = 0; i < 100; i++) {
            sum += solveLambda();
        }
        long end = System.nanoTime();
        System.out.println("lambda: " + (end - start) / 1_000_000);
    }
    {
        long start = System.nanoTime();
        for (int i = 0; i < 100; i++) {
            sum += solveLambdaParallel();
        }
        long end = System.nanoTime();
        System.out.println("lambda parallel : " + (end - start) / 1_000_000);
    }
    System.out.println(sum);
}

public static int digitSum(BigInteger x) {
    int sum = 0;
    for (char c : x.toString().toCharArray()) {
        sum += Integer.valueOf(c + "");
    }
    return sum;
}

public static int solve() {
    int max = 0;
    for (int i = 1; i < 100; i++) {
        for (int j = 1; j < 100; j++) {
            max = Math.max(max, digitSum(BigInteger.valueOf(i).pow(j)));
        }
    }
    return max;
}

public static int solveLambda() {
    return IntStream.range(1, 100)
            .map(i -> IntStream.range(1, 100).map(j -> digitSum(BigInteger.valueOf(i).pow(j))).max().getAsInt())
            .max().getAsInt();
}

public static int solveLambdaParallel() {
    return IntStream.range(1, 100)
            .parallel()
            .map(i -> IntStream.range(1, 100).map(j -> digitSum(BigInteger.valueOf(i).pow(j))).max().getAsInt())
            .max().getAsInt();
}
I have also run it with JMH, which is more reliable than manual tests. The results are consistent with the above (microseconds per call):
Benchmark                               Mode       Mean  Units
c.a.p.SO21968918.solve                  avgt  32367.592  us/op
c.a.p.SO21968918.solveLambda            avgt  31423.123  us/op
c.a.p.SO21968918.solveLambdaParallel    avgt   8125.600  us/op
The problem you have is that you are looking at sub-optimal code. When you have code which might be heavily optimised, you are very dependent on whether the JVM is smart enough to optimise your code. Loops have been around much longer and are better understood.
One big difference in your loop code is that your working set is very small. You are only considering one maximum digit sum at a time. This means the code is cache-friendly and you have very short-lived objects. In the stream() case you are building up collections, so there is more in the working set at any one time, using more cache, with more overhead. I would expect your GC times to be longer and/or more frequent as well.
why is the stream variant so much slower than the old one?
Loops are fairly well optimised having been around since before Java was developed. They can be mapped very efficiently to hardware. Streams are fairly new and not as heavily optimised.
why did the parallel() statement actually increased the time from 0.19s to 0.25s?
Most likely you have a bottle neck on a shared resource. You create quite a bit of garbage but this is usually fairly concurrent. Using more threads, only guarantees you will have more overhead, it doesn't ensure you can take advantage of the extra CPU power you have.

In Java, is it more efficient to use byte or short instead of int and float instead of double?

I've noticed I've always used int and double no matter how small or big the number needs to be. So, in Java, is it more efficient to use byte or short instead of int, and float instead of double?
So assume I have a program with plenty of ints and doubles. Would it be worth going through and changing my ints to bytes or shorts if I knew the number would fit?
I know java doesn't have unsigned types but is there anything extra I could do if I knew the number would be positive only?
By efficient I mostly mean processing. I'd assume the garbage collector would be a lot faster if all the variables were half size and that calculations would probably be somewhat faster too.
(I guess since I am working on Android I need to somewhat worry about RAM too.)
(I'd assume the garbage collector only deals with Objects and not primitives, but still deletes all the primitives in abandoned objects, right?)
I tried it with a small Android app I have but didn't really notice a difference at all. (Though I didn't "scientifically" measure anything.)
Am I wrong in assuming it should be faster and more efficient? I'd hate to go through and change everything in a massive program to find out I wasted my time.
Would it be worth doing from the beginning when I start a new project? (I mean I think every little bit would help but then again if so, why doesn't it seem like anyone does it.)
Am I wrong in assuming it should be faster and more efficient? I'd hate to go through and change everything in a massive program to find out I wasted my time.
Short answer
Yes, you are wrong. In most cases, it makes little difference in terms of space used.
It is not worth trying to optimize this ... unless you have clear evidence that optimization is needed. And if you do need to optimize memory usage of object fields in particular, you will probably need to take other (more effective) measures.
Longer answer
The Java Virtual Machine models stacks and object fields using offsets that are (in effect) multiples of a 32-bit primitive cell size. So when you declare a local variable or object field as (say) a byte, the variable / field will be stored in a 32-bit cell, just like an int.
There are two exceptions to this:
long and double values require 2 primitive 32-bit cells
arrays of primitive types are represented in packed form, so that (for example) an array of bytes holds 4 bytes per 32-bit word.
So it might be worth optimizing the use of long and double ... and large arrays of primitives. But in general, no.
In theory, a JIT might be able to optimize this, but in practice I've never heard of a JIT that does. One impediment is that the JIT typically cannot run until after instances of the class being compiled have been created. If the JIT optimized the memory layout, you could have two (or more) "flavors" of object of the same class ... and that would present huge difficulties.
Revisitation
Looking at the benchmark results in @meriton's answer, it appears that using short and byte instead of int incurs a performance penalty for multiplication. Indeed, if you consider the operations in isolation, the penalty is significant. (You shouldn't consider them in isolation ... but that's another topic.)
I think the explanation is that the JIT is probably doing the multiplications using 32-bit multiply instructions in each case. But in the byte and short cases, it executes extra instructions to convert the intermediate 32-bit value to a byte or short in each loop iteration. (In theory, that conversion could be done once at the end of the loop ... but I doubt that the optimizer would be able to figure that out.)
Anyway, this does point to another problem with switching to short and byte as an optimization: it could make performance worse ... in an algorithm that is arithmetic- and compute-intensive.
Secondary questions
I know java doesn't have unsigned types but is there anything extra I could do if I knew the number would be positive only?
No. Not in terms of performance anyway. (There are some methods in Integer, Long, etc for dealing with int, long, etc as unsigned. But these don't give any performance advantage. That is not their purpose.)
(I'd assume the garbage collector only deals with Objects and not primitive but still deletes all the primitives in abandoned objects right? )
Correct. A field of an object is part of the object. It goes away when the object is garbage collected. Likewise the cells of an array go away when the array is collected. When the field or cell type is a primitive type, then the value is stored in the field / cell ... which is part of the object / array ... and that has been deleted.
That depends on the implementation of the JVM, as well as the underlying hardware. Most modern hardware will not fetch single bytes from memory (or even from the first-level cache); i.e., using the smaller primitive types generally does not reduce memory bandwidth consumption. Likewise, modern CPUs have a word size of 64 bits. They can perform operations on fewer bits, but that works by discarding the extra bits, which isn't faster either.
The only benefit is that smaller primitive types can result in a more compact memory layout, most notably when using arrays. This saves memory, which can improve locality of reference (thus reducing the number of cache misses) and reduce garbage collection overhead.
Generally speaking however, using the smaller primitive types is not faster.
To demonstrate that, behold the following benchmark:
import java.math.BigDecimal;

public class Benchmark {

    public static void benchmark(String label, Code code) {
        print(25, label);
        try {
            for (int iterations = 1; ; iterations *= 2) { // detect reasonable iteration count and warm up the code under test
                System.gc(); // clean up previous runs, so we don't benchmark their cleanup
                long previouslyUsedMemory = usedMemory();
                long start = System.nanoTime();
                code.execute(iterations);
                long duration = System.nanoTime() - start;
                long memoryUsed = usedMemory() - previouslyUsedMemory;
                if (iterations > 1E8 || duration > 1E9) {
                    print(25, new BigDecimal(duration * 1000 / iterations).movePointLeft(3) + " ns / iteration");
                    print(30, new BigDecimal(memoryUsed * 1000 / iterations).movePointLeft(3) + " bytes / iteration\n");
                    return;
                }
            }
        } catch (Throwable e) {
            throw new RuntimeException(e);
        }
    }

    private static void print(int desiredLength, String message) {
        System.out.print(" ".repeat(Math.max(1, desiredLength - message.length())) + message);
    }

    private static long usedMemory() {
        return Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
    }

    @FunctionalInterface
    interface Code {
        /**
         * Executes the code under test.
         *
         * @param iterations
         *            number of iterations to perform
         * @return any value that requires the entire code to be executed (to
         *         prevent dead code elimination by the just in time compiler)
         * @throws Throwable
         *             if the test could not complete successfully
         */
        Object execute(int iterations);
    }

    public static void main(String[] args) {
        benchmark("long[] traversal", (iterations) -> {
            long[] array = new long[iterations];
            for (int i = 0; i < iterations; i++) {
                array[i] = i;
            }
            return array;
        });
        benchmark("int[] traversal", (iterations) -> {
            int[] array = new int[iterations];
            for (int i = 0; i < iterations; i++) {
                array[i] = i;
            }
            return array;
        });
        benchmark("short[] traversal", (iterations) -> {
            short[] array = new short[iterations];
            for (int i = 0; i < iterations; i++) {
                array[i] = (short) i;
            }
            return array;
        });
        benchmark("byte[] traversal", (iterations) -> {
            byte[] array = new byte[iterations];
            for (int i = 0; i < iterations; i++) {
                array[i] = (byte) i;
            }
            return array;
        });
        benchmark("long fields", (iterations) -> {
            class C {
                long a = 1;
                long b = 2;
            }
            C[] array = new C[iterations];
            for (int i = 0; i < iterations; i++) {
                array[i] = new C();
            }
            return array;
        });
        benchmark("int fields", (iterations) -> {
            class C {
                int a = 1;
                int b = 2;
            }
            C[] array = new C[iterations];
            for (int i = 0; i < iterations; i++) {
                array[i] = new C();
            }
            return array;
        });
        benchmark("short fields", (iterations) -> {
            class C {
                short a = 1;
                short b = 2;
            }
            C[] array = new C[iterations];
            for (int i = 0; i < iterations; i++) {
                array[i] = new C();
            }
            return array;
        });
        benchmark("byte fields", (iterations) -> {
            class C {
                byte a = 1;
                byte b = 2;
            }
            C[] array = new C[iterations];
            for (int i = 0; i < iterations; i++) {
                array[i] = new C();
            }
            return array;
        });
        benchmark("long multiplication", (iterations) -> {
            long result = 1;
            for (int i = 0; i < iterations; i++) {
                result *= 3;
            }
            return result;
        });
        benchmark("int multiplication", (iterations) -> {
            int result = 1;
            for (int i = 0; i < iterations; i++) {
                result *= 3;
            }
            return result;
        });
        benchmark("short multiplication", (iterations) -> {
            short result = 1;
            for (int i = 0; i < iterations; i++) {
                result *= 3;
            }
            return result;
        });
        benchmark("byte multiplication", (iterations) -> {
            byte result = 1;
            for (int i = 0; i < iterations; i++) {
                result *= 3;
            }
            return result;
        });
    }
}
Run with OpenJDK 14 on my Intel Core i7 CPU @ 3.5 GHz, this prints:
long[] traversal 3.206 ns / iteration 8.007 bytes / iteration
int[] traversal 1.557 ns / iteration 4.007 bytes / iteration
short[] traversal 0.881 ns / iteration 2.007 bytes / iteration
byte[] traversal 0.584 ns / iteration 1.007 bytes / iteration
long fields 25.485 ns / iteration 36.359 bytes / iteration
int fields 23.126 ns / iteration 28.304 bytes / iteration
short fields 21.717 ns / iteration 20.296 bytes / iteration
byte fields 21.767 ns / iteration 20.273 bytes / iteration
long multiplication 0.538 ns / iteration 0.000 bytes / iteration
int multiplication 0.526 ns / iteration 0.000 bytes / iteration
short multiplication 0.786 ns / iteration 0.000 bytes / iteration
byte multiplication 0.784 ns / iteration 0.000 bytes / iteration
As you can see, the only significant speed savings occur when traversing large arrays; using smaller object fields yields negligible benefit, and computations are actually slightly slower on the small datatypes.
Overall, the performance differences are quite minor. Optimizing algorithms is far more important than the choice of primitive type.
Using byte instead of int can increase performance if you are using them in huge amounts. Here is an experiment:
import java.lang.management.*;

public class SpeedTest {

    /** Get CPU time in nanoseconds. */
    public static long getCpuTime() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        return bean.isCurrentThreadCpuTimeSupported() ? bean
                .getCurrentThreadCpuTime() : 0L;
    }

    public static void main(String[] args) {
        long durationTotal = 0;
        int numberOfTests = 0;
        for (int j = 1; j < 51; j++) {
            long beforeTask = getCpuTime();
            // MEASURES THIS AREA------------------------------------------
            long x = 20000000; // 20 millions
            for (long i = 0; i < x; i++) {
                TestClass s = new TestClass();
            }
            // MEASURES THIS AREA------------------------------------------
            long duration = getCpuTime() - beforeTask;
            System.out.println("TEST " + j + ": duration = " + duration + "ns = "
                    + (int) duration / 1000000);
            durationTotal += duration;
            numberOfTests++;
        }
        double average = durationTotal / numberOfTests;
        System.out.println("-----------------------------------");
        System.out.println("Average Duration = " + average + " ns = "
                + (int) average / 1000000 + " ms (Approximately)");
    }
}
This class tests the speed of creating a new TestClass. Each test does it 20 million times, and there are 50 tests.
Here is the TestClass:
public class TestClass {
    int a1 = 5;
    int a2 = 5;
    int a3 = 5;
    int a4 = 5;
    int a5 = 5;
    int a6 = 5;
    int a7 = 5;
    int a8 = 5;
    int a9 = 5;
    int a10 = 5;
    int a11 = 5;
    int a12 = 5;
    int a13 = 5;
    int a14 = 5;
}
I've run the SpeedTest class and in the end got this:
Average Duration = 8.9625E8 ns = 896 ms (Approximately)
Now I'm changing the ints into bytes in the TestClass and running it again. Here is the result:
Average Duration = 6.94375E8 ns = 694 ms (Approximately)
I believe this experiment shows that if you are instantiating a huge number of objects, using byte instead of int can increase efficiency.
byte is generally considered to be 8 bits. short is generally considered to be 16 bits.
In a "pure" environment, which Java isn't, since all implementation of bytes and longs and shorts and other fun things is generally hidden from you, byte makes better use of space.
However, your computer is probably not 8-bit, and it is probably not 16-bit. This means that to obtain 16 or 8 bits in particular, it would need to resort to "trickery" which wastes time in order to pretend that it has the ability to access those types when needed.
At this point, it depends on how the hardware is implemented. However, from what I've been taught, the best speed is achieved from storing things in chunks which are comfortable for your CPU to use. A 64-bit processor likes dealing with 64-bit elements, and anything less than that often requires "engineering magic" to pretend that it likes dealing with them.
One of the reasons for short/byte/char being less performant is the lack of direct support for these data types. By direct support, I mean that the JVM specification does not define an instruction set for these data types. Instructions like store, load, add, etc. have versions for the int data type, but they do not have versions for short/byte/char. E.g., consider the Java code below:
void spin() {
    int i;
    for (i = 0; i < 100; i++) {
        ; // Loop body is empty
    }
}
This gets converted into bytecode as below:
0 iconst_0 // Push int constant 0
1 istore_1 // Store into local variable 1 (i=0)
2 goto 8 // First time through don't increment
5 iinc 1 1 // Increment local variable 1 by 1 (i++)
8 iload_1 // Push local variable 1 (i)
9 bipush 100 // Push int constant 100
11 if_icmplt 5 // Compare and loop if less than (i < 100)
14 return // Return void when done
Now, consider changing int to short as below.
void sspin() {
    short i;
    for (i = 0; i < 100; i++) {
        ; // Loop body is empty
    }
}
The corresponding bytecode changes as follows:
0 iconst_0
1 istore_1
2 goto 10
5 iload_1 // The short is treated as though an int
6 iconst_1
7 iadd
8 i2s // Truncate int to short
9 istore_1
10 iload_1
11 bipush 100
13 if_icmplt 5
16 return
As you can see, to manipulate the short, the JVM still uses the int versions of the instructions and explicitly converts int to short where required (the i2s instruction). This extra work is where the performance goes.
The reason cited for not giving these types direct support is as follows:
The Java Virtual Machine provides the most direct support for data of
type int. This is partly in anticipation of efficient implementations
of the Java Virtual Machine's operand stacks and local variable
arrays. It is also motivated by the frequency of int data in typical
programs. Other integral types have less direct support. There are no
byte, char, or short versions of the store, load, or add instructions,
for instance.
Quoted from the JVM specification (page 58), available here.
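If you want to reproduce these listings yourself, the standard javap disassembler that ships with the JDK is enough. A small sketch (file and class name are mine):

// Spin.java -- compile and disassemble with:
//   javac Spin.java
//   javap -c Spin
public class Spin {
    void spin() {
        for (int i = 0; i < 100; i++) {
            ; // Loop body is empty
        }
    }

    void sspin() {
        for (short i = 0; i < 100; i++) {
            ; // look for the extra iload/iadd/i2s sequence in this method's disassembly
        }
    }
}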
I would say the accepted answer is somewhat wrong in saying that "it makes little difference in terms of space used". Here is an example showing that in some cases the difference is substantial:
Baseline usage 4.90MB, java: 11.0.12
Mem usage - bytes : +202.60 MB
Mem usage - shorts: +283.02 MB
Mem usage - ints : +363.02 MB
Mem usage - bytes : +203.02 MB
Mem usage - shorts: +283.02 MB
Mem usage - ints : +363.02 MB
Mem usage - bytes : +203.02 MB
Mem usage - shorts: +283.02 MB
Mem usage - ints : +363.02 MB
The code to verify:
// Needs: import org.junit.Test; and import static java.util.Locale.US;
static class Bytes {
    public byte f1;
    public byte f2;
    public byte f3;
    public byte f4;
}

static class Shorts {
    public short f1;
    public short f2;
    public short f3;
    public short f4;
}

static class Ints {
    public int f1;
    public int f2;
    public int f3;
    public int f4;
}

@Test
public void memUsageTest() throws Exception {
    int countOfItems = 10 * 1024 * 1024;
    float MB = 1024 * 1024;
    Runtime rt = Runtime.getRuntime();

    System.gc();
    Thread.sleep(1000);
    long baseLineUsage = rt.totalMemory() - rt.freeMemory();
    trace("Baseline usage %.2fMB, java: %s", (baseLineUsage / MB), System.getProperty("java.version"));

    for (int j = 0; j < 3; j++) {
        Bytes[] bytes = new Bytes[countOfItems];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = new Bytes();
        }
        System.gc();
        Thread.sleep(1000);
        trace("Mem usage - bytes : +%.2f MB", (rt.totalMemory() - rt.freeMemory() - baseLineUsage) / MB);
        bytes = null;

        Shorts[] shorts = new Shorts[countOfItems];
        for (int i = 0; i < shorts.length; i++) {
            shorts[i] = new Shorts();
        }
        System.gc();
        Thread.sleep(1000);
        trace("Mem usage - shorts: +%.2f MB", (rt.totalMemory() - rt.freeMemory() - baseLineUsage) / MB);
        shorts = null;

        Ints[] ints = new Ints[countOfItems];
        for (int i = 0; i < ints.length; i++) {
            ints[i] = new Ints();
        }
        System.gc();
        Thread.sleep(1000);
        trace("Mem usage - ints  : +%.2f MB", (rt.totalMemory() - rt.freeMemory() - baseLineUsage) / MB);
        ints = null;
    }
}

private static void trace(String message, Object... args) {
    String line = String.format(US, message, args);
    System.out.println(line);
}
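For what it's worth, these numbers are consistent with the typical 64-bit HotSpot object layout (assuming compressed oops, a 12-byte object header, and 8-byte alignment): a Bytes instance is 12 + 4 = 16 bytes, a Shorts instance is 12 + 8 = 20 bytes padded to 24, and an Ints instance is 12 + 16 = 28 bytes padded to 32. Multiplied by the 10,485,760 instances, plus roughly 40 MB for the reference array itself, that predicts about 200 MB, 280 MB and 360 MB, which is what the trace above shows.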
The difference is hardly noticeable! It's more a question of design, appropriateness, uniformity, habit, etc. Sometimes it's just a matter of taste. When all you care about is that your program gets up and running, and substituting a float for an int would not harm correctness, I see no advantage in going for one or the other, unless you can demonstrate that using either type alters performance. Tuning performance based on types that differ by 2 or 3 bytes is really the last thing you should care about; as Donald Knuth once said, "Premature optimization is the root of all evil" (not sure it was him, edit if you have the answer).

Java version of matlab's linspace

Is there a java version of matlab's colon operator or linspace? For instance, I'd like to make a for loop for evenly spaced numbers, but I don't want to bother with creating an array of those numbers manually.
For example to get all integers from 1 to 30, in matlab I would type:
1:30
or
linspace(1,30)
For the two-argument call, @x4u is correct. The three-argument call will be quite a bit harder to emulate.
For instance, I think that linspace(1,30,60) should produce the values 1, 1.5, 2, 2.5, 3, 3.5, ..., or maybe those are the values for linspace(1,30,59); either way, same problem.
With this form you'll have to do the calculations yourself. Personally, I'd create a new object to do the whole thing for me and forget the for loop.
Linspace counter = new Linspace(1, 30, 60);
while (counter.hasNext()) {
    process(counter.getNextFloat());
}
or simply
for (float f : new Linspace(1, 30, 60)) {
    process(f);
}
if you have your Linspace object implement Iterable (a sketch of that follows the class below).
The inside of the counter object should then be pretty obvious to implement, and it will clearly communicate what it is doing without obfuscating your code with a bunch of numeric calculations to figure out ratios.
An implementation might be something like this:
(NOTE: Untested and I'm pretty sure this would be vulnerable to edge cases and floating point errors! It also probably won't handle end < start for backwards counting, it's just a suggestion to get you going.)
public class Linspace {
    private float current;
    private final float end;
    private final float step;

    public Linspace(float start, float end, int totalCount) {
        this.current = start;
        this.end = end;
        // MATLAB's linspace(start, end, n) produces n points, i.e. n - 1 steps.
        this.step = (end - start) / (totalCount - 1);
    }

    public boolean hasNext() {
        return current < (end + step / 2); // MAY stop floating point error
    }

    public float getNextFloat() {
        float value = current; // return the current value first so the sequence starts at start
        current += step;
        return value;
    }
}
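To get the enhanced-for form shown above, the class needs to implement Iterable<Float>. A minimal, equally untested sketch along the same lines:

import java.util.Iterator;

public class Linspace implements Iterable<Float> {
    private final float start;
    private final float end;
    private final float step;

    public Linspace(float start, float end, int totalCount) {
        this.start = start;
        this.end = end;
        this.step = (end - start) / (totalCount - 1);
    }

    @Override
    public Iterator<Float> iterator() {
        return new Iterator<Float>() {
            private float current = start;

            @Override
            public boolean hasNext() {
                return current < end + step / 2; // MAY stop floating point error
            }

            @Override
            public Float next() {
                float value = current;
                current += step;
                return value;
            }
        };
    }
}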
Do you want to do this?
for( int number = 1; number <= 30; ++number )
If you need them spaced by a fixed amount, e.g. 3, you can write it this way:
for( int number = 1; number <= 30; number += 3 )
The left part of the for loop initializes the variable, the middle part is the condition that is evaluated before each iteration, and the right part is executed after each iteration.
I actually just had to do this for a Java project I was working on. I wanted to make sure it was implemented the same way as in MATLAB, so I first wrote a MATLAB equivalent:
function result = mylinspace(min, max, points)
    answer = zeros(1, points);
    for i = 1:points
        answer(i) = min + (i-1) * (max - min) / (points - 1);
    end
    result = answer;
I tested this against the built-in linspace function and it returned the correct result, so I then converted it into a static Java function:
public static double[] linspace(double min, double max, int points) {
    double[] d = new double[points];
    for (int i = 0; i < points; i++) {
        d[i] = min + i * (max - min) / (points - 1);
    }
    return d;
}
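For the example from the question, a quick check of this method (the 0.5 spacing the asker guessed at corresponds to 59 points):

double[] d = linspace(1, 30, 59);
System.out.println(d[0] + ", " + d[1] + ", " + d[58]); // prints: 1.0, 1.5, 30.0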
In my opinion this is much simpler than creating a new class for this one function.
I think Bill K got the right idea, but there is no need for a Linspace class:
// If you write linspace(start, end, totalCount) in Matlab ===>
// (note: the i < end condition means the endpoint itself is never reached)
for (float i = start; i < end; i += (end - start) / totalCount)
    something(i);
I was looking for a solution to this problem myself and investigated how MATLAB implements its linspace. I more or less converted it to Java and ended up with the method below. As far as I have tested, it works quite nicely and you get the endpoints exactly. There are probably floating-point errors, as in most cases.
I am not sure if there are copyright issues with this, though.
public static List<Double> linspace(double start, double stop, int n)
{
    List<Double> result = new ArrayList<Double>();
    double step = (stop - start) / (n - 1);
    for (int i = 0; i <= n - 2; i++)
    {
        result.add(start + (i * step));
    }
    result.add(stop);
    return result;
}
If you'd like to use Java Streams, one option would be:
// Needs: import java.util.stream.IntStream; and import java.util.stream.Stream;
public static Stream<Double> linspace(double start, double end, int numPoints) {
    return IntStream.range(0, numPoints)
            .boxed()
            .map(i -> start + i * (end - start) / (numPoints - 1));
}
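Hypothetical usage, just to show the shape of the result:

// prints 0.0, 0.25, 0.5, 0.75 and 1.0 on separate lines
linspace(0, 1, 5).forEach(System.out::println);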
Here's the main algorithm at work in MATLAB; you can convert it to Java with all the OOP details at your disposal:
>> st=3; ed=9; num=4;
>> linspace(st,ed,num)
ans =
   3   5   7   9
>> % number of additional points to create (other than 3)
>> p2c=num-1;
>> % 3 is excluded when calculating distance d.
>> a=st;
>> d=ed-st;
>> % the increment is calculated without taking the starting value into consideration.
>> icc=d/p2c;
>> for idx=[1:p2c];
a(idx+1)=a(idx)+icc;
end;
>> a
a =
   3   5   7   9
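A straightforward Java transliteration of that cumulative-increment approach might look like the sketch below (method name is mine). Note that repeatedly adding icc accumulates floating-point error, unlike the multiply-based versions above:

public static double[] linspaceCumulative(double st, double ed, int num) {
    double[] a = new double[num];
    // the increment is calculated without the starting value: num - 1 steps
    double icc = (ed - st) / (num - 1);
    a[0] = st;
    for (int idx = 1; idx < num; idx++) {
        a[idx] = a[idx - 1] + icc; // a(idx+1) = a(idx) + icc in the MATLAB session
    }
    return a;
}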
