Moving from EDU to java.util.concurrent halves performance - Java

Cross post from http://forums.oracle.com/forums/thread.jspa?threadID=2195025&tstart=0
There is a telecom application server (JAIN SLEE based) and the application running in it.
The application receives a message from the network, processes it, and sends a response back to the network.
The requirement for request/response latency is 250 ms for 95% of calls and 3000 ms for 99.999% of calls.
We use EDU.oswego.cs.dl.util.concurrent.ConcurrentHashMap (1 instance). While processing one call (one call is several messages), the following methods are invoked:
"put", "get", "get", "get", then 180 seconds later "remove".
There are 4 threads which invoke these methods.
(A small note: working with ConcurrentHashMap is not the only activity. Also for one network message there are a lot of other activities: protocol message parsing, querying a DB, writing an SDR into a file, creating short living and long living objects.)
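For reference, here is a minimal sketch of the per-call access pattern described above (the class and method names are hypothetical stand-ins; the real application does much more per message):

import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the per-call access pattern: one shared map,
// one put on the first message, a few gets during the call, and a
// remove roughly 180 seconds later.
class CallTracker {
    private final ConcurrentHashMap<String, Object> calls =
            new ConcurrentHashMap<String, Object>();

    void onInitialMessage(String callId, Object callState) {
        calls.put(callId, callState);      // "put" on the first message
    }

    Object onFollowUpMessage(String callId) {
        return calls.get(callId);          // invoked three times per call
    }

    void onCallEnd(String callId) {
        calls.remove(callId);              // "remove", ~180 s later
    }
}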
When we move from EDU.oswego.cs.dl.util.concurrent.ConcurrentHashMap to java.util.concurrent.ConcurrentHashMap, we see a performance degradation from 1400 to 800 calls per second.
The first bottleneck at 800 calls per second is latency, which no longer satisfies the requirement above.
This performance degradation is reproduced on hosts with the following CPUs:
2 x Quad-Core AMD Opteron 2356, 2312 MHz, 8 HW threads in total
2 x Intel Xeon E5410, 2.33 GHz, 8 HW threads in total
It is not reproduced on X5570 CPU (Intel Xeon Nehalem X5570 2.93 GHz, 16 HW threads in total).
Did anybody face similar issues? How to solve them?

I assume you are talking about nanoseconds rather than milliseconds (that is one million times smaller!), or the use of ConcurrentHashMap is a trivial portion of your delay.
EDIT: Have edited the example to be multi-threaded using 100 tasks.
/*
Average operation time for a map of 10,000,000 was 48 ns
Average operation time for a map of 5,000,000 was 51 ns
Average operation time for a map of 2,500,000 was 48 ns
Average operation time for a map of 1,250,000 was 46 ns
Average operation time for a map of 625,000 was 45 ns
Average operation time for a map of 312,500 was 44 ns
Average operation time for a map of 156,200 was 38 ns
Average operation time for a map of 78,100 was 34 ns
Average operation time for a map of 39,000 was 35 ns
Average operation time for a map of 19,500 was 37 ns
*/
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MapPerfTest {
    public static void main(String... args) {
        ExecutorService es = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            for (int size = 100000; size >= 100; size /= 2)
                test(es, size);
        } finally {
            es.shutdown();
        }
    }

    private static void test(ExecutorService es, final int size) {
        int tasks = 100;
        final ConcurrentHashMap<Integer, String> map = new ConcurrentHashMap<Integer, String>(tasks * size);
        List<Future> futures = new ArrayList<Future>();
        long start = System.nanoTime();
        for (int j = 0; j < tasks; j++) {
            final int offset = j * size;
            futures.add(es.submit(new Runnable() {
                public void run() {
                    for (int i = 0; i < size; i++)
                        map.put(offset + i, "" + i);
                    int total = 0;
                    for (int j = 0; j < 10; j++)
                        for (int i = 0; i < size; i++)
                            total += map.get(offset + i).length();
                    for (int i = 0; i < size; i++)
                        map.remove(offset + i);
                }
            }));
        }
        try {
            for (Future future : futures)
                future.get();
        } catch (Exception e) {
            throw new AssertionError(e);
        }
        long time = System.nanoTime() - start;
        // 12 = 1 put + 10 gets + 1 remove per key
        System.out.printf("Average operation time for a map of %,d was %,d ns%n", size * tasks, time / tasks / 12 / size);
    }
}

At first, did you check that the hash map is indeed the culprit? Assuming that you did: there is a lock-free hash map designed to scale to hundreds of processors without introducing a lot of contention. It's authored by Cliff Click, a well-known engineer on the original HotSpot compiler team, who is now working on scaling the JDK to machines with hundreds of CPUs. So I assume that he knows what he is doing in that hash map implementation. More info about this hash map can be found in these slides.
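If you want to try it: the map lives in Cliff Click's high-scale-lib and implements java.util.concurrent.ConcurrentMap, so switching is close to a drop-in change (a sketch; the package name is the one used by the original high-scale-lib releases):

import java.util.concurrent.ConcurrentMap;
import org.cliffc.high_scale_lib.NonBlockingHashMap;

public class NonBlockingMapDemo {
    public static void main(String[] args) {
        // Only the construction site changes compared to ConcurrentHashMap,
        // since both implement ConcurrentMap.
        ConcurrentMap<String, Object> map = new NonBlockingHashMap<String, Object>();
        map.put("key", "value");
        System.out.println(map.get("key"));
    }
}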

Have you tried changing the concurrencyLevel in the ConcurrentHashMap? Try some lower values like 8; try some bigger values. And remember that the performance and concurrency of ConcurrentHashMap depend on the quality of your hashCode function.
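For example, the pre-Java-8 three-argument constructor takes initial capacity, load factor, and concurrency level (a quick sketch; the right value has to be measured on your workload):

import java.util.concurrent.ConcurrentHashMap;

public class ConcurrencyLevelDemo {
    public static void main(String[] args) {
        // concurrencyLevel is the estimated number of concurrently updating
        // threads; pre-Java-8 it sets the number of internal lock segments.
        ConcurrentHashMap<String, Object> lower =
                new ConcurrentHashMap<String, Object>(16, 0.75f, 8);
        ConcurrentHashMap<String, Object> higher =
                new ConcurrentHashMap<String, Object>(16, 0.75f, 64);
        lower.put("a", 1);
        higher.put("a", 1);
    }
}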
And yes, java.util.concurrent.ConcurrentHashMap has the same origin (Doug Lea, from edu.oswego) as edu.oswego.cs.dl..., but it was totally rewritten by him so that it can scale better.
I think it may be worthwhile for you to check out the Javolution FastMap. It may be better suited for real-time applications.

Related

Why do 2 similar loops cost different amounts of time in Java?

I was confused by the following code:
public static void test() {
    long currentTime1 = System.currentTimeMillis();
    final int iBound = 10000000;
    final int jBound = 100;
    for (int i = 1; i <= iBound; i++) {
        int a = 1;
        int tot = 10;
        for (int j = 1; j <= jBound; j++) {
            tot *= a;
        }
    }
    long updateTime1 = System.currentTimeMillis();
    System.out.println("i:" + iBound + " j:" + jBound + "\nIt costs " + (updateTime1 - currentTime1) + " ms");
}
That's the first version; it costs 443 ms on my computer.
public static void test() {
    long currentTime1 = System.currentTimeMillis();
    final int iBound = 100;
    final int jBound = 10000000;
    for (int i = 1; i <= iBound; i++) {
        int a = 1;
        int tot = 10;
        for (int j = 1; j <= jBound; j++) {
            tot *= a;
        }
    }
    long updateTime1 = System.currentTimeMillis();
    System.out.println("i:" + iBound + " j:" + jBound + "\nIt costs " + (updateTime1 - currentTime1) + " ms");
}
The second version costs 832ms.
The only difference is that I simply swapped i and j.
This result is incredible; I tested the same code in C and the difference there is not nearly that large.
Why are these two similar pieces of code so different in Java?
My jdk version is openjdk-14.0.2
TL;DR - This is just a bad benchmark.
I did the following:
Create a Main class with a main method.
Copy in the two versions of the test as test1() and test2().
In the main method do this:
while (true) {
    test1();
    test2();
}
Here is the output I got (Java 8).
i:10000000 j:100
It costs 35 ms
i:100 j:10000000
It costs 33 ms
i:10000000 j:100
It costs 33 ms
i:100 j:10000000
It costs 25 ms
i:10000000 j:100
It costs 0 ms
i:100 j:10000000
It costs 0 ms
i:10000000 j:100
It costs 0 ms
i:100 j:10000000
It costs 0 ms
i:10000000 j:100
It costs 0 ms
i:100 j:10000000
It costs 0 ms
i:10000000 j:100
It costs 0 ms
....
So as you can see, when I run two versions of the same method alternately in the same JVM, the times for each method are roughly the same.
But more importantly, after a small number of iterations the time drops to ... zero! What has happened is that the JIT compiler has compiled the two methods and (probably) deduced that their loops can be optimized away.
It is not entirely clear why people are getting different times when the two versions are run separately. One possible explanation is that on the first run the JVM executable is being read from disk, and on the second run it is already cached in RAM. Or something like that.
Another possible explanation is that JIT compilation kicks in earlier1 with one version of test(), so the proportion of time spent in the slower interpreted (pre-JIT) phase differs between the two versions. (It may be possible to tease this out using JIT logging options.)
But it is immaterial really ... because the performance of a Java application while the JVM is warming up (loading code, JIT compiling, growing the heap to its working size, loading caches, etc) is generally speaking not important. And for the cases where it is important, look for a JVM that can do AOT compilation; e.g. GraalVM.
1 - This could be because of the way that the interpreter gathers stats. The general idea is that the bytecode interpreter accumulates statistics on things like branches until it has "enough". Then the JVM triggers the JIT compiler to compile the bytecodes to native code. When that is done, the code typically runs 10 or more times faster. The different looping patterns might make it reach "enough" earlier in one version compared to the other. NB: I am speculating here. I offer zero evidence ...
The bottom line is that you have to be careful when writing Java benchmarks because the timings can be distorted by various JVM warmup effects.
For more information read: How do I write a correct micro-benchmark in Java?
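For illustration, here is a sketch of what this loop comparison might look like as a JMH benchmark (not the original poster's code; consuming tot through a Blackhole is what prevents the JIT from optimizing the loops away, which is exactly the effect seen above):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class LoopBenchmark {
    @Benchmark
    public void outerLarge(Blackhole bh) {
        for (int i = 1; i <= 10000000; i++) {
            int a = 1;
            int tot = 10;
            for (int j = 1; j <= 100; j++) {
                tot *= a;
            }
            bh.consume(tot); // prevents dead-code elimination
        }
    }

    @Benchmark
    public void innerLarge(Blackhole bh) {
        for (int i = 1; i <= 100; i++) {
            int a = 1;
            int tot = 10;
            for (int j = 1; j <= 10000000; j++) {
                tot *= a;
            }
            bh.consume(tot);
        }
    }
}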
I tested it myself and I get the same difference (around 16 ms and 4 ms).
After testing, I found that:
Declaring a variable 1M times takes less time than multiplying by 1 1M times.
How?
I made a loop containing 100 multiplications:
final int nb = 100000000;
for (int i = 1; i <= nb; i++) {
    i *= 1;
    i *= 1;
    [... written 20 times]
    i *= 1;
    i *= 1;
}
And one containing 100 declarations:
final int nb = 100000000;
for (int i = 1; i <= nb; i++) {
    int a = 0;
    int aa = 0;
    [... written 20 times]
    int aaaaaaaaaaaaaaaaaaaaaa = 0;
    int aaaaaaaaaaaaaaaaaaaaaaa = 0;
}
And I get respectively 8 ms and 3 ms, which seems to correspond to what you got.
You can get different results if you have a different processor.
You can find the answer in the first chapter of algorithm books: the cost of producing and assigning is 1. In the first version you perform the declaration and assignment 10,000,000 times, while in the second you do it only 100 times, so you would expect reduced time...
In the first:
5 operations in the main loop and 3 in the inner loop -> the inner loop costs 3 * 100 = 300,
then (300 + 5) * 10,000,000 = 3,050,000,000.
In the second:
3 * 10,000,000 = 30,000,000 -> (30,000,000 + 5) * 100 = 3,000,000,500.
So on paper the second one should be faster, but I think it comes back to multiple CPUs: they can do 10,000,000 parallel jobs in the first but only 100 parallel jobs in the second, so the first one became faster.

Very slow iteration on Chronicle Map

I'm seeing very slow times iterating over a Chronicle Map - in the example below, 93 ms per iteration over 1M entries on my 2013 MacBook Pro. I'm wondering if there's a better way to iterate, whether I'm doing something wrong, or whether this is expected. I know Chronicle Map isn't optimized for iterating, but this ticket from a few years ago made me expect much faster iteration times. Toy example below:
import java.nio.ByteBuffer;
import net.openhft.chronicle.core.values.IntValue;
import net.openhft.chronicle.hash.Data;
import net.openhft.chronicle.map.ChronicleMap;
import net.openhft.chronicle.values.Values;
import org.agrona.BitUtil;

public class ChronicleIterationTest {
    public static void main(String[] args) throws Exception {
        int numEntries = 1_000_000;
        int numIterations = 1_000;
        int avgEntrySize = BitUtil.SIZE_OF_LONG + BitUtil.SIZE_OF_INT;

        ChronicleMap<IntValue, ByteBuffer> map = ChronicleMap.of(IntValue.class, ByteBuffer.class)
                .name("test").entries(numEntries).averageValueSize(avgEntrySize)
                .putReturnsNull(true).create();

        IntValue value = Values.newHeapInstance(IntValue.class);
        ByteBuffer buffer = ByteBuffer.allocate(avgEntrySize);
        for (int i = 0; i < numEntries; i++) {
            value.setValue(i);
            buffer.clear();
            buffer.putLong(i);
            buffer.putInt(i);
            buffer.flip();
            map.put(value, buffer);
        }
        System.out.println("Finished insertion");

        for (int i = 0; i < numIterations; i++) {
            map.forEachEntry(entry -> {
                Data<ByteBuffer> data = entry.value();
                ByteBuffer val = data.get();
            });
        }
        System.out.println("Finished priming");

        long start = System.currentTimeMillis();
        for (int i = 0; i < numIterations; i++) {
            map.forEachEntry(entry -> {
                Data<ByteBuffer> data = entry.value();
                ByteBuffer val = data.get();
            });
        }
        System.out.println(
                "Elapsed: " + (System.currentTimeMillis() - start) + " for " + numIterations
                        + " iterations");
    }
}
Output:
Finished insertion
Finished priming
Elapsed: 93327 for 1000 iterations
Your results of 93 milliseconds per 1 million keys exactly match the result of the benchmark here: http://jetbrains.github.io/xodus/#benchmarks, so they are in the expected ballpark. 93 ms / 1M keys is 93 ns per key; "very slow" compared to what? Your map contains 16 MB of payload and its total off-heap size is ~30 MB (FYI, you can check that via map.offHeapMemoryUsed()), which is much more than the L3 cache in consumer laptops, so iteration speed is bound by the latency of main memory. Chronicle Map's iteration is mostly not sequential, so memory prefetching doesn't help. I've created an issue about this.
Also, several notes about your code:
In your case the value size of the map is constant, so you should use constantValueSizeBySample(ByteBuffer.allocate(12)) instead of averageValueSize(). Even if the map's value size wasn't constant, it would be preferable to use averageValue() instead of averageValueSize(), because you cannot be sure how many bytes the serializers use for the values.
Your value seems to be a good use case for a value interface with two fields; moreover, you already use a value interface as the key type - IntValue. (See the sketch after these notes.)
Do benchmarks using JMH
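Here is a sketch of both value-related suggestions. The LongIntValue interface is hypothetical (Chronicle Values generates implementations for simple get/set interfaces like this, the same mechanism behind IntValue), and with a value interface the size is constant, so no averageValueSize() is needed at all:

import net.openhft.chronicle.core.values.IntValue;
import net.openhft.chronicle.map.ChronicleMap;
import net.openhft.chronicle.values.Values;

public class ValueInterfaceSketch {
    // Hypothetical two-field value interface replacing the 12-byte ByteBuffer
    interface LongIntValue {
        long getFirst();
        void setFirst(long first);
        int getSecond();
        void setSecond(int second);
    }

    public static void main(String[] args) {
        ChronicleMap<IntValue, LongIntValue> map =
                ChronicleMap.of(IntValue.class, LongIntValue.class)
                        .name("test").entries(1_000_000).create();
        IntValue key = Values.newHeapInstance(IntValue.class);
        LongIntValue value = Values.newHeapInstance(LongIntValue.class);
        key.setValue(1);
        value.setFirst(1L);
        value.setSecond(1);
        map.put(key, value);
    }
}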

Multithreaded application appears to run at random speeds

I wrote some Java code just to test how my CPU runs when it has many operations to do, so I wrote a loop that adds 1 to a variable across roughly 100,000,000,000 iterations:
public class NoThread {
    public static void main(String[] args) {
        long s = System.currentTimeMillis();
        int sum = 0;
        for (int i = 0; i <= 1000000; i++) {
            for (int j = 0; j <= 10000; j++) {
                for (int k = 0; k <= 10; k++) {
                    sum++;
                }
            }
        }
        long k = System.currentTimeMillis();
        System.out.println("Time" + (k - s) + " " + sum);
    }
}
The code finishes after 30-40 seconds.
Next I decided to split this work into 10 threads, to make my CPU "cry" even more, and told my program to print the time when each thread ends:
public class WithThread {
    public static void main(String[] args) {
        Runnable[] run = new Runnable[10];
        Thread[] thread = new Thread[10];
        for (int i = 0; i <= 9; i++) {
            run[i] = new Counter(i);
            thread[i] = new Thread(run[i]);
            thread[i].start();
        }
    }
}
and
public class Counter implements Runnable {
    private int inc;
    private int incc;
    private int sum = 0;
    private int id;

    public Counter(int a) {
        id = a;
        inc = a * 100000;
        incc = (a + 1) * 100000;
    }

    @Override
    public void run() {
        long s = System.currentTimeMillis();
        for (int i = inc; i <= incc; i++) {
            for (int j = 0; j <= 10000; j++) {
                for (int k = 0; k <= 10; k++) {
                    sum++;
                }
            }
        }
        long k = System.currentTimeMillis();
        System.out.println("Time" + (k - s) + " " + sum + " in thread " + id);
    }
}
As a result, the whole program finished in 18-20 seconds - so, twice as fast. But when I looked at the time at which each thread ended, I found something interesting. Each thread had the same amount of work to do, but 4 threads finished in a very short time (0.8 seconds) and the remaining 6 finished in 18-20 seconds. I ran it again and got 6 fast threads and 4 slow ones; again, and it was 7 fast and 3 slow. The number of fast and slow threads looks random. So my question is: why is there such a big difference between fast and slow threads, why is the number of each so random, and is this language-specific (Java), or is it the operating system, the CPU, or something else?
Before getting into the working of threads and processors, I'll explain it in a more understandable way.
Scenario
Location A ------------------------------ Location B
           |______________________________|
                       200 metres

400 bags of sand (at Location A) have to be carried to Location B
So, the worker will have to carry each sandbag from Location A to Location B until all the sandbags are moved to Location B.
Let's just pretend that the worker is instantly teleported back (for argument's sake) to Location A (but not the other way around) once he arrives at Location B.
Case 1
Number of workers = 1 (no. of men)
Time taken = 2 mins (time for moving 1 sandbag from Location A to Location B)
Total time taken to carry 400 sandbags from Location A to Location B:
Total time taken = 2 x 400 = 800 mins
Case 2
Number of workers = 4 (no. of men)
Time taken = 2 mins (time for moving 1 sandbag from Location A to Location B)
So now we're going to split the job equally among the available workers.
Sandbags assigned to each worker = 400 / 4 = 100
Let's say everyone starts their job at the same time.
Total time taken for an individual worker to carry his 100 sandbags from Location A to Location B:
Time taken per individual worker = 2 x 100 = 200 mins
Since everyone started their job at the same time, all 400 sandbags will be carried from Location A to Location B in 200 mins.
Case 3
Number of workers = 4 (no. of men)
Here, let's say that every man has to carry 4 sandbags from Location A to Location B in a single transfer.
Sandbags per transfer for every worker = 4 bags
Time taken = 12 mins (time for moving 4 sandbags from Location A to Location B in a single transfer)
Since everyone is forced to carry 4 sandbags with them instead of 1, this greatly reduces their speed.
Even consider this:
1) If I order you to carry 1 sandbag from A to B, you'll take 2 mins.
2) If I order you to carry 2 sandbags from A to B in one transfer, you'll take 5 mins instead of the theoretical 4 mins, because of our body condition and the weight we're carrying.
3) If I order you to carry 4 sandbags from A to B in one transfer, you'll take 12 mins instead of the theoretical 8 mins (extrapolating point 1) or 10 mins (extrapolating point 2), which is also because of human nature.
So now we're going to split the job equally among the available workers.
Sandbags assigned to each worker = 400 / 4 = 100
Transfers for each worker = 100 / 4 = 25 transfers
Calculating the time taken for a single worker to complete his full job:
Total time for a single worker = 12 mins x 25 transfers = 300 mins
So they've taken an additional 100 mins instead of the theoretical 200 mins (Case 2).
Case 4
Sandbags per transfer for every worker = 100 bags
Since this is impossible for anyone to do, he'll just quit.
xx--------------------------------------------------------------------------------------xx
This is the same kind of working principle as in threads and processors.
Here
Workers = no. of processors
Total sandbags = no. of threads
Sandbags in a single transfer = no. of threads one (1) processor is going to handle simultaneously
Assume
Available processors = 4
Runtime.getRuntime().availableProcessors() // -> syntax to get the no. of available processors
Note: link every case with the real-time case explained above.
Case 1
for (int i = 0; i <= 1000000; i++) {
    for (int j = 0; j <= 10000; j++) {
        for (int k = 0; k <= 10; k++) {
            sum++;
        }
    }
}
The whole operation is a serial process, so it'll take the execution time it's supposed to.
Case 2
for (int n = 1; n <= 4; n++) {
    Thread t = new Thread(new Runnable() {
        public void run() {
            // sum is assumed to be a shared field visible to all threads
            for (int i = 0; i <= 250000; i++) { // 1000000 / 4 = 250000
                for (int j = 0; j <= 10000; j++) {
                    for (int k = 0; k <= 10; k++) {
                        sum++;
                    }
                }
            }
        }
    });
    t.start();
}
Here each processor will handle 1 thread, so it'll take 1/4th of the actual time.
Case 3
for (int n = 1; n <= 16; n++) {
    Thread t = new Thread(new Runnable() {
        public void run() {
            // sum is assumed to be a shared field visible to all threads
            for (int i = 0; i <= 62500; i++) { // 1000000 / 16 = 62500
                for (int j = 0; j <= 10000; j++) {
                    for (int k = 0; k <= 10; k++) {
                        sum++;
                    }
                }
            }
        }
    });
    t.start();
}
In total, 16 threads will be created and each processor will have to handle 4 threads simultaneously. So practically, it'll increase the processor load to its maximum, reducing the efficiency of the processor and increasing the execution time of each thread.
In total it'll take 1/4th of (1/4th of the actual time) + the performance-degradation time (which will definitely be higher than 1/4th of the actual time).
Case 4
for (int n = 1; n <= 100000; n++) { // 100000 - just for argument's sake
    Thread t = new Thread(new Runnable() {
        public void run() {
            // sum is assumed to be a shared field visible to all threads
            for (int i = 0; i <= 1000000; i++) {
                for (int j = 0; j <= 10000; j++) {
                    for (int k = 0; k <= 10; k++) {
                        sum++;
                    }
                }
            }
        }
    });
    t.start();
}
At this stage, creating and starting a thread is more expensive (if the processor already has many threads in it) than the time taken for creating and starting the previous threads. As the number of simultaneous threads increases, it'll hugely increase the processor load until the processor reaches its capacity, leading to a system crash.
The reason why the threads created first had less execution time is that there won't be any performance degradation in the processor during the initial stage. But as the for loop continues, the number of threads each processor has to handle grows beyond the fair ratio (1:1), so you'll start to experience lag as the thread count per processor increases.
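A common way to avoid the oversubscription described above (a sketch, not the original poster's code) is to size a fixed thread pool to the hardware and submit the work chunks as tasks, so each core runs one thread at a time no matter how many chunks there are:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class FixedPoolCounter {
    public static void main(String[] args) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        final AtomicLong total = new AtomicLong();
        for (int chunk = 0; chunk < 10; chunk++) {   // same 10-way split as the question
            pool.submit(new Runnable() {
                public void run() {
                    long sum = 0;
                    for (int i = 0; i < 100000; i++)
                        for (int j = 0; j <= 10000; j++)
                            for (int k = 0; k <= 10; k++)
                                sum++;
                    total.addAndGet(sum);            // one contended write per chunk
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        System.out.println("Total: " + total.get());
    }
}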

Is Arrays.stream(array_name).sum() slower than the iterative approach?

I was solving a leetcode problem, https://oj.leetcode.com/problems/gas-station/, using Java 8.
My solution got TLE when I used Arrays.stream(integer_array).sum() to compute the sum, while the same solution was accepted when using iteration to calculate the sum of the elements in the array. The best possible time complexity for this problem is O(n), and I am surprised to get TLE when using the streaming APIs from Java 8. My solution is O(n) either way.
import java.util.Arrays;

public class GasStation {
    public int canCompleteCircuit(int[] gas, int[] cost) {
        int start = 0, i = 0, runningCost = 0, totalGas = 0, totalCost = 0;
        totalGas = Arrays.stream(gas).sum();
        totalCost = Arrays.stream(cost).sum();
        // for (int item : gas) totalGas += item;
        // for (int item : cost) totalCost += item;
        if (totalGas < totalCost)
            return -1;
        while (start > i || (start == 0 && i < gas.length)) {
            runningCost += gas[i];
            if (runningCost >= cost[i]) {
                runningCost -= cost[i++];
            } else {
                runningCost -= gas[i];
                if (--start < 0)
                    start = gas.length - 1;
                runningCost += (gas[start] - cost[start]);
            }
        }
        return start;
    }

    public static void main(String[] args) {
        GasStation sol = new GasStation();
        int[] gas = new int[] { 10, 5, 7, 14, 9 };
        int[] cost = new int[] { 8, 5, 14, 3, 1 };
        System.out.println(sol.canCompleteCircuit(gas, cost));
        gas = new int[] { 10 };
        cost = new int[] { 8 };
        System.out.println(sol.canCompleteCircuit(gas, cost));
    }
}
The solution gets accepted when I comment out the following two lines (calculating the sum using streams):
totalGas = Arrays.stream(gas).sum();
totalCost = Arrays.stream(cost).sum();
and uncomment the following two lines (calculating the sum using iteration):
//for (int item : gas) totalGas += item;
//for (int item : cost) totalCost += item;
Now the solution gets accepted. Why is the streaming API in Java 8 slower than iteration for primitives on large inputs?
The first step in dealing with problems like this is to bring the code into a controlled environment. That means running it in the JVM you control (and can invoke) and running tests inside a good benchmark harness like JMH. Analyze, don't speculate.
Here's a benchmark I whipped up using JMH to do some analysis on this:
import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class ArraySum {
    static final long SEED = -897234L;

    @Param({"1000000"})
    int sz;

    int[] array;

    @Setup
    public void setup() {
        Random random = new Random(SEED);
        array = new int[sz];
        Arrays.setAll(array, i -> random.nextInt());
    }

    @Benchmark
    public int sumForLoop() {
        int sum = 0;
        for (int a : array)
            sum += a;
        return sum;
    }

    @Benchmark
    public int sumStream() {
        return Arrays.stream(array).sum();
    }
}
Basically this creates an array of a million ints and sums them twice: once using a for-loop and once using streams. Running the benchmark produces a bunch of output (elided for brevity and for dramatic effect) but the summary results are below:
Benchmark (sz) Mode Samples Score Score error Units
ArraySum.sumForLoop 1000000 avgt 3 514.473 398.512 us/op
ArraySum.sumStream 1000000 avgt 3 7355.971 3170.697 us/op
Whoa! That Java 8 streams stuff is teh SUXX0R! It's 14 times slower than a for-loop, don't use it!!!1!
Well, no. First let's go over these results, and then look more closely to see if we can figure out what's going on.
The summary shows the two benchmark methods, with the "sz" parameter of a million. It's possible to vary this parameter but it doesn't turn out to make a difference in this case. I also only ran the benchmark methods 3 times, as you can see from the "samples" column. (There were also only 3 warmup iterations, not visible here.) The score is in microseconds per operation, and clearly the stream code is much, much slower than the for-loop code. But note also the score error: that's the amount of variability in the different runs. JMH helpfully prints out the standard deviation of the results (not shown here) but you can easily see that the score error is a significant fraction of reported score. This reduces our confidence in the score.
Running more iterations should help. More warmup iterations will let the JIT do more work and settle down before running the benchmarks, and running more benchmark iterations will smooth out any errors from transient activity elsewhere on my system. So let's try 10 warmup iterations and 10 benchmark iterations:
Benchmark (sz) Mode Samples Score Score error Units
ArraySum.sumForLoop 1000000 avgt 10 504.803 34.010 us/op
ArraySum.sumStream 1000000 avgt 10 7128.942 178.688 us/op
Performance is overall a little faster, and the measurement error is also quite a bit smaller, so running more iterations has had the desired effect. But the streams code is still considerably slower than the for-loop code. What's going on?
A large clue can be obtained by looking at the individual timings of the streams method:
# Warmup Iteration 1: 570.490 us/op
# Warmup Iteration 2: 491.765 us/op
# Warmup Iteration 3: 756.951 us/op
# Warmup Iteration 4: 7033.500 us/op
# Warmup Iteration 5: 7350.080 us/op
# Warmup Iteration 6: 7425.829 us/op
# Warmup Iteration 7: 7029.441 us/op
# Warmup Iteration 8: 7208.584 us/op
# Warmup Iteration 9: 7104.160 us/op
# Warmup Iteration 10: 7372.298 us/op
What happened? The first few iterations were reasonably fast, but then the 4th and subsequent iterations (and all the benchmark iterations that follow) were suddenly much slower.
I've seen this before. It was in this question and this answer elsewhere on SO. I recommend reading that answer; it explains how the JVM's inlining decisions in this case result in poorer performance.
A bit of background here: a for-loop compiles to a very simple increment-and-test loop, and can easily be handled by usual optimization techniques like loop peeling and unrolling. The streams code, while not very complex in this case, is actually quite complex compared to the for-loop code; there's a fair bit of setup, and each loop requires at least one method call. Thus, the JIT optimizations, particularly its inlining decisions, are critical to making the streams code go fast. And it's possible for it to go wrong.
Another background point is that integer summation is about the simplest possible operation you can think of to do in a loop or stream. This will tend to make the fixed overhead of stream setup look relatively more expensive. It is also so simple that it can trigger pathologies in the inlining policy.
The suggestion from the other answer was to add the JVM option -XX:MaxInlineLevel=12 to increase the amount of code that can be inlined. Rerunning the benchmark with that option gives:
Benchmark (sz) Mode Samples Score Score error Units
ArraySum.sumForLoop 1000000 avgt 10 502.379 27.859 us/op
ArraySum.sumStream 1000000 avgt 10 498.572 24.195 us/op
Ah, much nicer. Disabling tiered compilation using -XX:-TieredCompilation also had the effect of avoiding the pathological behavior. I also found that making the loop computation even a bit more expensive, e.g. summing squares of integers -- that is, adding a single multiply -- also avoids the pathological behavior.
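If you want to reproduce this, JMH can pass such JVM options to the forked benchmark JVM via the @Fork annotation (a sketch along the lines of the benchmark above):

import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Fork(value = 1, jvmArgsAppend = {"-XX:MaxInlineLevel=12"})
@State(Scope.Benchmark)
public class InliningFriendlySum {
    int[] array;

    @Setup
    public void setup() {
        array = new int[1_000_000];
        Arrays.setAll(array, i -> i); // deterministic fill; values don't matter here
    }

    @Benchmark
    public int sumStream() {
        return Arrays.stream(array).sum();
    }
}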
Now, your question is about running in the context of the leetcode environment, which seems to run the code in a JVM that you don't have any control over, so you can't change the inlining or compilation options. And you probably don't want to make your computation more complex to avoid the pathology either. So for this case, you might as well just stick to the good old for-loop. But don't be afraid to use streams, even for dealing with primitive arrays. It can perform quite well, aside from some narrow edge cases.
The normal iteration approach is going to be pretty much as fast as anything can be, but streams have a variety of overheads: even though the stream comes directly from an array, there's probably going to be a primitive Spliterator involved and lots of other objects being generated.
In general, you should expect the "normal approach" to usually be faster than streams unless you're both using parallelization and your data is very large.
My benchmark (see code below) shows that the streaming approach is about 10-15% slower than the iterative one. Interestingly enough, parallel stream results vary greatly on my 4-core (i7) MacBook Pro: while I have seen them a few times being about 30% faster than iterative, the most common result is almost three times slower than sequential.
Here is the benchmark code:
import java.util.*;
import java.util.function.*;

public class StreamingBenchmark {
    private static void benchmark(String name, LongSupplier f) {
        long start = System.currentTimeMillis(), sum = 0;
        for (int count = 0; count < 1000; count++) sum += f.getAsLong();
        System.out.println(String.format(
            "%10s in %d millis. Sum = %d",
            name, System.currentTimeMillis() - start, sum
        ));
    }

    public static void main(String argv[]) {
        int data[] = new int[1000000];
        Random randy = new Random();
        for (int i = 0; i < data.length; i++) data[i] = randy.nextInt();

        benchmark("iterative", () -> { int s = 0; for (int n : data) s += n; return s; });
        benchmark("stream", () -> Arrays.stream(data).sum());
        benchmark("parallel", () -> Arrays.stream(data).parallel().sum());
    }
}
Here is the output from a few runs:
iterative in 350 millis. Sum = 564821058000
stream in 394 millis. Sum = 564821058000
parallel in 883 millis. Sum = 564821058000
iterative in 340 millis. Sum = -295411382000
stream in 376 millis. Sum = -295411382000
parallel in 1031 millis. Sum = -295411382000
iterative in 365 millis. Sum = 1205763898000
stream in 379 millis. Sum = 1205763898000
parallel in 1053 millis. Sum = 1205763898000
etc.
This got me curious, and I also tried running equivalent logic in scala:
object Scarr {
  def main(argv: Array[String]) = {
    val randy = new java.util.Random
    val data = (1 to 1000000).map { _ => randy.nextInt }.toArray
    val start = System.currentTimeMillis
    var sum = 0L
    for (_ <- 1 to 1000) sum += data.sum
    println(sum + " in " + (System.currentTimeMillis - start) + " millis.")
  }
}
This took 14 seconds! About 40 times(!) longer than streaming in Java. Ouch!
The sum() method is functionally equivalent to return reduce(0, Integer::sum); in a large array there will be more overhead in making all the method calls than in the basic by-hand for-loop iteration. The bytecode for the for (int i : numbers) iteration is only very slightly more complicated than that generated by the by-hand for-loop. The stream operation is possibly faster in parallel-friendly environments (though maybe not for primitive operations), but unless we know that the environment is parallel-friendly, it may not be - and leetcode itself is probably designed to favor low-level over abstract code, since it's measuring efficiency rather than legibility.
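For illustration, the equivalence mentioned here (IntStream.sum() is documented as being equivalent to the reduce form):

import java.util.Arrays;

public class SumEquivalence {
    public static void main(String[] args) {
        int[] numbers = {1, 2, 3, 4, 5};
        // sum() is specified as equivalent to reduce(0, Integer::sum)
        int viaSum = Arrays.stream(numbers).sum();
        int viaReduce = Arrays.stream(numbers).reduce(0, Integer::sum);
        System.out.println(viaSum + " == " + viaReduce);
    }
}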
The sum operation done in any of the three ways (Arrays.stream(ints).sum(), for (int i : ints) { total += i; }, and for (int i = 0; i < ints.length; i++) { total += ints[i]; }) should be relatively similar in efficiency. I used the following test class, which sums a hundred million integers between 0 and 4096 a hundred times each and records the average times. All of them returned in very similar timeframes. It even attempts to limit parallel processing by occupying all but one of the available cores in while (true) loops, but I still found no particular difference:
import java.util.*;
import java.util.function.ToLongFunction;
import java.util.stream.IntStream;

public class SumTester {
    private static final int ARRAY_SIZE = 100_000_000;
    private static final int ITERATION_LIMIT = 100;
    private static final int INT_VALUE_LIMIT = 4096;

    public static void main(String[] args) {
        Random random = new Random();
        int[] numbers = new int[ARRAY_SIZE];
        IntStream.range(0, ARRAY_SIZE).forEach(i -> numbers[i] = random.nextInt(INT_VALUE_LIMIT));
        Map<String, ToLongFunction<int[]>> inputs = new HashMap<String, ToLongFunction<int[]>>();
        // NanoTimer is a small helper class of mine (not shown); start() begins
        // timing and microEnd() returns the elapsed time in microseconds
        NanoTimer initializer = NanoTimer.start();
        System.out.println("initialized NanoTimer in " + initializer.microEnd() + " microseconds");

        inputs.put("sumByStream", SumTester::sumByStream);
        inputs.put("sumByIteration", SumTester::sumByIteration);
        inputs.put("sumByForLoop", SumTester::sumByForLoop);

        System.out.println("Parallelables: ");
        averageTimeFor(ITERATION_LIMIT, inputs, Arrays.copyOf(numbers, numbers.length));

        int cores = Runtime.getRuntime().availableProcessors();
        List<CancelableThreadEater> threadEaters = new ArrayList<CancelableThreadEater>();
        if (cores > 1) {
            threadEaters = occupyThreads(cores - 1);
        }
        // Only one core should be left to our class
        System.out.println("\nSingleCore (" + threadEaters.size() + " of " + cores + " cores occupied)");
        averageTimeFor(ITERATION_LIMIT, inputs, Arrays.copyOf(numbers, numbers.length));

        for (CancelableThreadEater cte : threadEaters) {
            cte.end();
        }
        System.out.println("Complete!");
    }

    public static long sumByStream(int[] numbers) {
        return Arrays.stream(numbers).sum();
    }

    public static long sumByIteration(int[] numbers) {
        int total = 0;
        for (int i : numbers) {
            total += i;
        }
        return total;
    }

    public static long sumByForLoop(int[] numbers) {
        int total = 0;
        for (int i = 0; i < numbers.length; i++) {
            total += numbers[i];
        }
        return total;
    }

    public static void averageTimeFor(int iterations, Map<String, ToLongFunction<int[]>> testMap, int[] numbers) {
        Map<String, Long> durationMap = new HashMap<String, Long>();
        Map<String, Long> sumMap = new HashMap<String, Long>();
        for (String methodName : testMap.keySet()) {
            durationMap.put(methodName, 0L);
            sumMap.put(methodName, 0L);
        }
        for (int i = 0; i < iterations; i++) {
            for (String methodName : testMap.keySet()) {
                int[] newNumbers = Arrays.copyOf(numbers, ARRAY_SIZE);
                ToLongFunction<int[]> function = testMap.get(methodName);
                NanoTimer nt = NanoTimer.start();
                long sum = function.applyAsLong(newNumbers);
                long duration = nt.microEnd();
                sumMap.put(methodName, sum);
                durationMap.put(methodName, durationMap.get(methodName) + duration);
            }
        }
        for (String methodName : testMap.keySet()) {
            long duration = durationMap.get(methodName) / iterations;
            long sum = sumMap.get(methodName);
            System.out.println(methodName + ": result '" + sum + "', elapsed time: " + duration + " microseconds on average over " + iterations + " iterations");
        }
    }

    private static List<CancelableThreadEater> occupyThreads(int numThreads) {
        List<CancelableThreadEater> result = new ArrayList<CancelableThreadEater>();
        for (int i = 0; i < numThreads; i++) {
            CancelableThreadEater cte = new CancelableThreadEater();
            result.add(cte);
            new Thread(cte).start();
        }
        return result;
    }

    private static class CancelableThreadEater implements Runnable {
        // a volatile flag replaces the original synchronized(stop) blocks,
        // which locked on a reassigned Boolean instance and therefore did
        // not actually synchronize on a single monitor
        private volatile boolean stop = false;

        public void run() {
            while (!stop) {
                // spin to keep the core busy
            }
        }

        public void end() {
            stop = true;
        }
    }
}
which returned
initialized NanoTimer in 22 microseconds
Parallelables:
sumByIteration: result '-1413860413', elapsed time: 35844 microseconds on average over 100 iterations
sumByStream: result '-1413860413', elapsed time: 35414 microseconds on average over 100 iterations
sumByForLoop: result '-1413860413', elapsed time: 35218 microseconds on average over 100 iterations

SingleCore (3 of 4 cores occupied)
sumByIteration: result '-1413860413', elapsed time: 37010 microseconds on average over 100 iterations
sumByStream: result '-1413860413', elapsed time: 38375 microseconds on average over 100 iterations
sumByForLoop: result '-1413860413', elapsed time: 37990 microseconds on average over 100 iterations
Complete!
That said, there's no real reason to do the sum() operation in this case. You are iterating through each array, for a total of three iterations, and the last one may be a longer-than-normal iteration. It's possible to calculate the result correctly with one full simultaneous iteration of both arrays plus one short-circuiting iteration. It may be possible to do it even more efficiently, but I couldn't figure out any better way than the one I used. My solution ended up being one of the fastest Java solutions on the chart - it ran in 223 ms, which was in amongst the middle pack of Python solutions.
I'll add my solution to the problem if you care to see it, but I hope the actual question is answered here.
Stream functions are relatively slow, so during a leetcode contest (or any algorithms contest) always prefer classic loops over stream functions, as large inputs are prone to TLE. This can in turn cause a penalty, which would affect your final ranking.
A detailed explanation is given here: https://stackoverflow.com/a/27994074/6185191
I also came across this issue while doing a pretty basic LeetCode problem. The first code that I submitted used the Java Stream API's Arrays.stream().sum() to compute the array sum, which gave a time of 6 ms, while the classic for loop took just 1 ms to iterate through the same array. That's insane! The Stream API method takes at least 6x the time of your simple for loop. So yeah, always go with the simpler and classic method.

Java thread creation overhead

Conventional wisdom tells us that high-volume enterprise Java applications should use thread pooling in preference to spawning new worker threads. The use of java.util.concurrent makes this straightforward.
There do exist situations, however, where thread pooling is not a good fit. The specific example which I am currently wrestling with is the use of InheritableThreadLocal, which allows ThreadLocal variables to be "passed down" to any spawned threads. This mechanism breaks when using thread pools, since the worker threads are generally not spawned from the request thread, but are pre-existing.
Now there are ways around this (the thread locals can be explicitly passed in), but this isn't always appropriate or practical. The simplest solution is to spawn new worker threads on demand, and let InheritableThreadLocal do its job.
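To make the mechanism concrete, here is a minimal sketch of why this works for freshly spawned threads but not for pooled ones (the inheritance happens only in the Thread constructor):

public class InheritableDemo {
    static final InheritableThreadLocal<String> USER = new InheritableThreadLocal<String>();

    public static void main(String[] args) throws InterruptedException {
        USER.set("alice");
        // A thread spawned now inherits the value captured at construction time.
        Thread worker = new Thread(new Runnable() {
            public void run() {
                System.out.println("spawned worker sees: " + USER.get()); // "alice"
            }
        });
        worker.start();
        worker.join();
        // A pool thread created before USER.set() would print null here,
        // because the copy is made in the Thread constructor, not at run time.
    }
}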
This brings us back to the question - if I have a high volume site, where user request threads are spawning off half a dozen worker threads each (i.e. not using a thread pool), is this going to give the JVM a problem? We're potentially talking about a couple of hundred new threads being created every second, each one lasting less than a second. Do modern JVMs optimize this well? I remember the days when object pooling was desirable in Java, because object creation was expensive. This has since become unnecessary. I'm wondering if the same applies to thread pooling.
I'd benchmark it, if I knew what to measure, but my fear is that the problems may be more subtle than can be measured with a profiler.
Note: the wisdom of using thread locals is not the issue here, so please don't suggest that I not use them.
Here is an example microbenchmark:
public class ThreadSpawningPerformanceTest {
    static long test(final int threadCount, final int workAmountPerThread) throws InterruptedException {
        Thread[] tt = new Thread[threadCount];
        final int[] aa = new int[tt.length];
        System.out.print("Creating " + tt.length + " Thread objects... ");
        long t0 = System.nanoTime(), t00 = t0;
        for (int i = 0; i < tt.length; i++) {
            final int j = i;
            tt[i] = new Thread() {
                public void run() {
                    int k = j;
                    for (int l = 0; l < workAmountPerThread; l++) {
                        k += k * k + l;
                    }
                    aa[j] = k;
                }
            };
        }
        System.out.println(" Done in " + (System.nanoTime() - t0) * 1E-6 + " ms.");
        System.out.print("Starting " + tt.length + " threads with " + workAmountPerThread + " steps of work per thread... ");
        t0 = System.nanoTime();
        for (int i = 0; i < tt.length; i++) {
            tt[i].start();
        }
        System.out.println(" Done in " + (System.nanoTime() - t0) * 1E-6 + " ms.");
        System.out.print("Joining " + tt.length + " threads... ");
        t0 = System.nanoTime();
        for (int i = 0; i < tt.length; i++) {
            tt[i].join();
        }
        System.out.println(" Done in " + (System.nanoTime() - t0) * 1E-6 + " ms.");
        long totalTime = System.nanoTime() - t00;
        // display checksum in order to give the JVM no chance to optimize out
        // the contents of the run() method and possibly even thread creation
        int checkSum = 0;
        for (int a : aa) {
            checkSum += a;
        }
        System.out.println("Checksum: " + checkSum);
        System.out.println("Total time: " + totalTime * 1E-6 + " ms");
        System.out.println();
        return totalTime;
    }

    public static void main(String[] kr) throws InterruptedException {
        int workAmount = 100000000;
        int[] threadCount = new int[]{1, 2, 10, 100, 1000, 10000, 100000};
        int trialCount = 2;
        long[][] time = new long[threadCount.length][trialCount];
        for (int j = 0; j < trialCount; j++) {
            for (int i = 0; i < threadCount.length; i++) {
                time[i][j] = test(threadCount[i], workAmount / threadCount[i]);
            }
        }
        System.out.print("Number of threads ");
        for (long t : threadCount) {
            System.out.print("\t" + t);
        }
        System.out.println();
        for (int j = 0; j < trialCount; j++) {
            System.out.print((j + 1) + ". trial time (ms)");
            for (int i = 0; i < threadCount.length; i++) {
                System.out.print("\t" + Math.round(time[i][j] * 1E-6));
            }
            System.out.println();
        }
    }
}
The results on 64-bit Windows 7 with Sun's 32-bit Java 1.6.0_21 Client VM on an Intel Core2 Duo E6400 @ 2.13 GHz are as follows:
Number of threads 1 2 10 100 1000 10000 100000
1. trial time (ms) 346 181 179 191 286 1229 11308
2. trial time (ms) 346 181 187 189 281 1224 10651
Conclusions: Two threads do the work almost twice as fast as one, as expected, since my computer has two cores. My computer can spawn nearly 10000 threads per second, i.e. thread creation overhead is 0.1 milliseconds. Hence, on such a machine, a couple of hundred new threads per second pose a negligible overhead (as can also be seen by comparing the numbers in the columns for 2 and 100 threads).
First of all, this will of course depend very much on which JVM you use. The OS will also play an important role. Assuming the Sun JVM (Hm, do we still call it that?):
One major factor is the stack memory allocated to each thread, which you can tune using the -Xssn JVM parameter - you'll want to use the lowest value you can get away with.
And this is just a guess, but I think "a couple of hundred new threads every second" is definitely beyond what the JVM is designed to handle comfortably. I suspect that a simple benchmark will quickly reveal quite unsubtle problems.
For your benchmark you can use JMeter plus a profiler, which should give you a direct overview of the behaviour in such a heavily loaded environment. Just let it run for an hour and monitor memory, CPU, etc. If nothing breaks and the CPU(s) don't overheat, it's OK :)
Perhaps you could use a thread pool, or customize (extend) the one you are using, by adding some code to set the appropriate InheritableThreadLocals each time a Thread is acquired from the thread pool.
Each Thread has these package-private properties:
/* ThreadLocal values pertaining to this thread. This map is maintained
* by the ThreadLocal class. */
ThreadLocal.ThreadLocalMap threadLocals = null;
/*
* InheritableThreadLocal values pertaining to this thread. This map is
* maintained by the InheritableThreadLocal class.
*/
ThreadLocal.ThreadLocalMap inheritableThreadLocals = null;
You could use these (well, with reflection) in combination with Thread.currentThread() to get the desired behaviour. However, this is a bit ad hoc, and furthermore, I can't tell whether the reflection won't introduce even bigger overhead than just creating the threads. (A reflection-free sketch follows.)
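A reflection-free sketch of the "explicitly passed in" workaround from the question: wrap each task so it captures the submitter's value at submit time and installs it on whichever pool thread runs it (this only works for thread locals your own code knows about):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ContextPropagation {
    static final ThreadLocal<String> CONTEXT = new ThreadLocal<String>();

    // Captures the submitting thread's context value now and installs it
    // around the task's execution on the pool thread.
    static Runnable withContext(final Runnable task) {
        final String captured = CONTEXT.get();
        return new Runnable() {
            public void run() {
                String previous = CONTEXT.get();
                CONTEXT.set(captured);
                try {
                    task.run();
                } finally {
                    CONTEXT.set(previous); // don't leak into the next task
                }
            }
        };
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CONTEXT.set("request-42");
        pool.submit(withContext(new Runnable() {
            public void run() {
                System.out.println("worker sees: " + CONTEXT.get());
            }
        }));
        pool.shutdown();
    }
}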
I am wondering whether it is necessary to spawn new threads on each user request if their typical life-cycle is as short as a second. Could you use some kind of notify/wait queue where you spawn a given number of (daemon) threads, and they all wait until there's a task to solve? If the task queue gets long, you spawn additional threads, but not at a 1:1 ratio. It will most likely perform better than spawning hundreds of new threads whose life-cycles are so short.
