I was playing around with infinite streams and made this program for benchmarking. Basically, the bigger the number you provide, the faster it should finish. However, I was amazed to find that using a parallel stream resulted in drastically worse performance compared to a sequential stream. Intuitively, one would expect an infinite stream of random numbers to be generated and evaluated much faster in a multi-threaded environment, but this appears not to be the case. Why is this?
final int target = Integer.parseInt(args[0]);
if (target <= 0) {
    System.err.println("Target must be between 1 and 2147483647");
    return;
}

final long startTime, endTime;
startTime = System.currentTimeMillis();
System.out.println(
    IntStream.generate(() -> new Double(Math.random() * 2147483647).intValue())
        //.parallel()
        .filter(i -> i <= target)
        .findFirst()
        .getAsInt()
);
endTime = System.currentTimeMillis();
System.out.println("Execution time: " + (endTime - startTime) + " ms");
I totally agree with the other comments and answers, but your test does indeed behave strangely when the target is very low. On my modest laptop the parallel version is on average about 60x slower when very low targets are given. This extreme difference cannot be explained by the overhead of parallelization in the stream APIs, so I was also amazed :-). IMO the culprit lies here:
Math.random()
Internally this call relies on a global instance of java.util.Random. In the documentation of Random it is written:
Instances of java.util.Random are threadsafe. However, the concurrent use of the same java.util.Random instance across threads may encounter contention and consequent poor performance. Consider instead using ThreadLocalRandom in multithreaded designs.
So I think that the really poor performance of the parallel execution compared to the sequential one is explained by thread contention in Random rather than by any other overhead. If you use ThreadLocalRandom instead (as recommended in the documentation), the performance difference will not be so dramatic. Another option would be to implement a more advanced number supplier.
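A minimal sketch of that swap against the question's code (assuming the same 2147483647 bound; needs java.util.concurrent.ThreadLocalRandom and java.util.stream.IntStream imported):

// ThreadLocalRandom.current() is called inside the supplier so that each
// fork/join worker thread draws from its own, uncontended generator.
System.out.println(
    IntStream.generate(() -> ThreadLocalRandom.current().nextInt(2147483647))
        .parallel()
        .filter(i -> i <= target)
        .findFirst()
        .getAsInt()
);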
The cost of handing work off to multiple threads is expensive, especially the first time you do it. This cost is fairly fixed, so even if your task is trivial, the overhead is relatively high.
One of the problems you have is that highly inefficient code is a very poor way to determine how well a solution performs. Also, how code runs the first time and how it runs after a few seconds can often differ by 100x (it can be much more). I suggest starting with an example which is already optimal, and only then attempting to use multiple threads.
e.g.
long start = System.nanoTime();
int value = (int) (Math.random() * (target + 1L));
long time = System.nanoTime() - start; // subtract the start timestamp, not the value
// don't time IO as it is sooo much slower
System.out.println(value);
Note: this will not be efficient until the code has warmed up and been compiled, i.e. ignore the first 2-5 seconds this code runs.
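A sketch of that warm-up advice, reusing the snippet above (the 20_000 count is illustrative, not an exact JIT threshold; the running sink keeps the JIT from discarding the warm-up loop):

// Run the measured work many times first so HotSpot can compile it,
// then time a single fresh run.
long sink = 0;
for (int i = 0; i < 20_000; i++) {
    sink += (int) (Math.random() * (target + 1L)); // warm-up only
}
long start = System.nanoTime();
int value = (int) (Math.random() * (target + 1L));
long time = System.nanoTime() - start;
// print after timing; don't time IO as it is much slower
System.out.println(value + " in " + time + " ns (sink " + sink + ")");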
Following suggestions from various answers, I think I've fixed it. I'm not sure what the exact bottleneck was but on an i5-4590T the parallel version with the following code performs faster than the sequential variant. For brevity, I've included only the relevant parts of the (refactored) code:
static IntStream getComputation() {
    return IntStream
            .generate(() -> ThreadLocalRandom.current().nextInt(2147483647));
}

static void computeSequential(int target) {
    for (int loop = 0; loop < target; loop++) {
        final int result = getComputation()
                .filter(i -> i <= target)
                .findAny()
                .getAsInt();
        System.out.println(result);
    }
}

static void computeParallel(int target) {
    IntStream.range(0, target)
            .parallel()
            .forEach(loop -> {
                final int result = getComputation()
                        .parallel()
                        .filter(i -> i <= target)
                        .findAny()
                        .getAsInt();
                System.out.println(result);
            });
}
EDIT: I should also note that I put it all in a loop to get longer running times.
I have a requirement to capture the execution time of some code over a number of iterations. I've decided to use a Map<Integer, Long> for capturing this data, where the Integer key is the iteration number and the Long value is the time consumed by that iteration in milliseconds.
I've written the Java code below to compute the time taken for each iteration. I want to ensure that the time taken by all iterations is zero before invoking the actual code. Surprisingly, the code behaves differently on every execution.
Sometimes I get the desired output (zero milliseconds for all iterations), but at times I get positive and even negative values for some random iterations.
I've tried replacing System.currentTimeMillis(); with the alternatives below:
new java.util.Date().getTime();
System.nanoTime();
org.apache.commons.lang.time.StopWatch
but still no luck.
Any suggestions as to why some iterations take additional time, and how to eliminate it?
package com.stackoverflow.programmer;

import java.util.HashMap;
import java.util.Map;

public class TestTimeConsumption {

    public static void main(String[] args) {
        Integer totalIterations = 100000;
        Integer nonZeroMilliSecondsCounter = 0;
        Map<Integer, Long> timeTakenMap = new HashMap<>();
        for (Integer iteration = 1; iteration <= totalIterations; iteration++) {
            timeTakenMap.put(iteration, getTimeConsumed(iteration));
            if (timeTakenMap.get(iteration) != 0) {
                nonZeroMilliSecondsCounter++;
                System.out.format("Iteration %6d has taken %d millisecond(s).\n", iteration,
                        timeTakenMap.get(iteration));
            }
        }
        System.out.format("Total non zero entries : %d", nonZeroMilliSecondsCounter);
    }

    private static Long getTimeConsumed(Integer iteration) {
        long startTime = System.currentTimeMillis();
        // Execute code for which execution time needs to be captured
        long endTime = System.currentTimeMillis();
        return (endTime - startTime);
    }
}
Here's the sample output from 5 different executions of the same code:
Execution #1 (NOT OK)
Iteration 42970 has taken 1 millisecond(s).
Total non zero entries : 1
Execution #2 (OK)
Total non zero entries : 0
Execution #3 (OK)
Total non zero entries : 0
Execution #4 (NOT OK)
Iteration 65769 has taken -1 millisecond(s).
Total non zero entries : 1
Execution #5 (NOT OK)
Iteration 424 has taken 1 millisecond(s).
Iteration 33053 has taken 1 millisecond(s).
Iteration 76755 has taken -1 millisecond(s).
Total non zero entries : 3
I am looking for a Java-based solution that ensures that all iterations consume zero milliseconds consistently. I prefer to accomplish this using pure Java code, without using a profiler.
Note: I was also able to accomplish this through C code.
Your HashMap performance may be dropping if it is resizing. The default capacity is 16, which you are exceeding. If you know the expected number of entries up front, create the HashMap with an appropriate capacity, taking into account the default load factor of 0.75.
If you rerun iterations without creating a new map, and the Integer keys do not start again from zero, you will need to size the map taking into account the total of all possible iterations:
int capacity = (int) ((100000/0.75)+1);
Map<Integer, Long> timeTakenMap = new HashMap<>(capacity);
As you are starting to learn here, writing microbenchmarks in Java is not as easy as one would first assume. Everybody gets bitten at some point, even the hardened performance experts who have been doing it for years.
A lot is going on within the JVM and the OS that skews the results, such as GC, HotSpot's on-the-fly optimisations, recompilations, clock corrections, thread contention/scheduling, memory contention and cache misses, to name just a few. And sadly these skews are not consistent, and they can very easily dominate a microbenchmark.
To answer your immediate question of why the timings can sometimes go negative: it is because currentTimeMillis is designed to capture wall-clock time, not elapsed time. No wall clock is accurate on a computer, and there are times when the clock will be adjusted, very possibly backwards. More detail on Java's clocks can be read in the Oracle blog post Inside the Oracle Hotspot VM clocks.
Further details on nanoTime versus currentTimeMillis can be read here.
Before continuing with your own benchmark, I strongly recommend that you read up on how to write a correct microbenchmark in Java. The quick synopsis is: 1) warm up the JVM before taking results, 2) jump through hoops to avoid dead-code elimination, 3) ensure that nothing else is running on the same machine, but accept that there will be thread scheduling going on (you may even want to pin threads to cores, depending on how far you want to take this), 4) use a framework specifically designed for microbenchmarking, such as JMH, or for quick lightweight spikes, JUnitMosaic gives good results.
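To make point 4 concrete, here is a minimal JMH sketch (a sketch only: it assumes the jmh-core and jmh-generator-annprocess artifacts are on the classpath, and the class and method names are illustrative):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class IterationBenchmark {

    @Benchmark
    public void measureIteration(Blackhole bh) {
        // Put the code whose per-iteration cost you want to measure here;
        // feed any result to the Blackhole to defeat dead-code elimination.
        bh.consume(System.currentTimeMillis());
    }
}

JMH then takes care of warm-up iterations, forking and the dead-code pitfalls for you.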
I'm not sure if I understand your question.
You're trying to execute a certain set of statements S, and expect the execution time to be zero. You then test this premise by executing it a number of times and verifying the result.
That is a strange expectation to have: anything consumes some time, possibly even more than you can observe. Hence, although the test could pass, that does not prove that no time has been used, since your program is save_time(); execute(S); compare_time(). Even if execute(S) is nothing, your timing is discrete, and as such it is possible that the 'tick' of your wall clock just happens to fall between save_time and compare_time, leading to some time having visibly passed.
As such, I'd expect your C program to behave exactly the same. Have you run it multiple times? What happens when you increase the iterations to over a million? If it still does not occur, then apparently your C compiler has optimized the code in such a way that no time is measured, and apparently Java doesn't.
Or am I understanding you wrong?
You hinted at it right: System.currentTimeMillis() is the way to go in this case.
There is no guarantee on any system that incrementing the Integer counter i corresponds to either a millisecond or a CPU cycle.
You should take System.currentTimeMillis() before and after the work and calculate the elapsed time.
Example:
public static void main(String[] args) {
    long start = System.currentTimeMillis();
    doFoo();
    long elapsedTime = System.currentTimeMillis() - start;
    System.out.println("Time: " + elapsedTime);
}
I am also not sure I understand exactly: you're trying to execute certain code and to capture the execution time of each iteration.
If I understand correctly, I would suggest using System.nanoTime() instead of System.currentTimeMillis(), because if your block of statements is small enough you will always get zero in milliseconds.
A simple example could be:
public static void main(String[] args) {
    long start = System.nanoTime();
    // do your stuff here
    long elapsedTime = System.nanoTime() - start;
    System.out.println("Time taken: " + elapsedTime + " ns");
}
There is not much difference between System.nanoTime() and System.currentTimeMillis(); it is just a question of how accurate a result you need. With millisecond resolution you may get zero if the set of statements in each iteration does not do much work.
I'm rewriting an application that deals with objects on the order of 10 million using Java 8, and I noticed that streams can slow down the application by up to 25%. Interestingly, this happens when my collections are empty as well, so it's the constant initialization cost of the stream. To reproduce the problem, consider the following code:
long start = System.nanoTime();
for (int i = 0; i < 10_000_000; i++) {
    Set<String> set = Collections.emptySet();
    set.stream().forEach(s -> System.out.println(s));
}
long end = System.nanoTime();
System.out.println((end - start) / 1_000_000);

start = System.nanoTime();
for (int i = 0; i < 10_000_000; i++) {
    Set<String> set = Collections.emptySet();
    for (String s : set) {
        System.out.println(s);
    }
}
end = System.nanoTime();
System.out.println((end - start) / 1_000_000);
The result is as follows: 224 vs. 5 ms.
If I use forEach on the set directly, i.e. set.forEach(s -> System.out.println(s)), the result will be: 12 vs 5 ms.
Finally, if I create the closure outside once as
Consumer<? super String> consumer = s -> System.out.println(s);
and use set.forEach(consumer), the result will be 7 vs 5 ms.
Of course, the numbers are small and my benchmarking is very primitive, but does this example show that there is an overhead in initializing streams and closures?
(Actually, since the set is empty, the initialization cost of the closures should not matter in this case; but nevertheless, should I consider creating closures beforehand instead of on the fly?)
The cost you see here is not associated with the "closures" at all but with the cost of Stream initialization.
Let's take your three sample codes:
for (int i = 0; i < 10_000_000; i++) {
    Set<String> set = Collections.emptySet();
    set.stream().forEach(s -> System.out.println(s));
}
This one creates a new Stream instance on each loop iteration, at least for the first 10k iterations; see below. After those 10k iterations, well, the JIT is probably smart enough to see that it's a no-op anyway.
for (int i = 0; i < 10_000_000; i++) {
    Set<String> set = Collections.emptySet();
    for (String s : set) {
        System.out.println(s);
    }
}
Here the JIT kicks in again: empty set? Well, that's a no-op, end of story.
set.forEach(System.out::println);
An Iterator is created for the set, which is always empty? Same story, the JIT kicks in.
The problem with your code to start with is that you fail to account for the JIT; for realistic measurements, run at least 10k iterations before measuring, since roughly 10k executions is what the JIT requires to kick in (at least, HotSpot acts this way).
Now, lambdas: they are call sites, and they are linked only once; but the cost of the initial linkage is still there, of course, and in your loops you include this cost. Try running one loop before taking your measurements so that this cost is out of the way.
All in all, this is not a valid microbenchmark. Use Caliper or JMH to really measure the performance.
There is an excellent video on how lambdas work here. It is a little old now, and the JVM handles lambdas much better than it did at the time.
If you want to know more, look for literature about invokedynamic.
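As an illustration of the warm-up point above, a rough sketch of how the OP's first loop could be measured after the JIT has had a chance to compile it (the 20_000 warm-up count is illustrative, not an exact HotSpot threshold):

// Warm up first so the JIT has compiled the stream pipeline,
// then time a fresh batch.
for (int i = 0; i < 20_000; i++) {
    Collections.<String>emptySet().stream().forEach(System.out::println); // warm-up
}
long start = System.nanoTime();
for (int i = 0; i < 10_000_000; i++) {
    Collections.<String>emptySet().stream().forEach(System.out::println);
}
System.out.println((System.nanoTime() - start) / 1_000_000 + " ms");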
Hi guys, I'm trying to make a load generator. My goal is to compare how much of my system's resources are consumed when spawning Erlang processes versus spawning Java threads. I do this by having the program count to 1000000000 ten times. Java takes roughly 35 seconds to finish the whole process with 10 threads created; Erlang takes ages with 10 processes. I grew impatient with it because it spent over 4 minutes counting. If I just make Erlang and Java count to 1000000000 without spawning threads/processes, Erlang takes 1 minute and 32 seconds and Java takes a good 3 or so seconds. I know Erlang is not made for crunching numbers, but that much of a difference is alarming. Why is there such a big difference? Both use my CPU at 100%, but there is no spike in RAM. I am not sure what other methods can be used to make this comparison; I am open to any suggestions as well.
Here is the code for both versions:
-module(loop).
-compile(export_all).

start(NumberOfProcesses) ->
    loop(0, NumberOfProcesses).

%% Processes to spawn
loop(A, NumberOfProcesses) ->
    if
        A < NumberOfProcesses ->
            spawn(loop, outerCount, [0]),
            loop(A + 1, NumberOfProcesses);
        true -> ok
    end.

%% outer loop
outerCount(A) ->
    if
        A < 10 ->
            innerCount(0),
            outerCount(A + 1);
        true -> ok
    end.

%% inner loop
innerCount(A) ->
    if
        A < 1000000000 ->
            innerCount(A + 1);
        true -> ok
    end.
And the Java:
import java.util.Scanner;

class Loop implements Runnable
{
    public static void main(String[] args)
    {
        System.out.println("Input number of processes");
        Scanner scan = new Scanner(System.in);
        String theNumber = scan.nextLine();
        for (int t = 0; t < Integer.parseInt(theNumber); t++)
        {
            new Thread(new Loop()).start();
        }
    }

    public void run()
    {
        int i;
        for (i = 0; i < 10; i++)
        {
            for (int j = 0; j < 1000000000; j++);
        }
    }
}
Are you running a 32- or 64-bit version of Erlang? If it's 32 bit, then the inner loop limit 1000000000 won't fit in a single-word fixnum (max 28 bits incl. sign), and the loop will start to do bignum arithmetic on the heap which is way way more expensive than just incrementing a word and looping (it will also cause garbage collection to happen now and then, to get rid of old unused numbers from the heap). Changing the outer loop from 10 to 1000 and removing 2 zeros correspondingly from the inner loop should make it use fixnum arithmetic only even on a 32-bit BEAM.
Then, it's also a question of whether the Java version is actually doing any work at all, or if the loop gets optimized away to a no-op at some point. (The Erlang compiler doesn't do that sort of trick - at least not yet.)
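On that second point, one quick hedged check is to give the Java loop a result the JIT cannot throw away and see whether the 35-second figure changes; an illustrative variant of run():

public void run()
{
    long sum = 0;
    for (int i = 0; i < 10; i++)
    {
        for (int j = 0; j < 1000000000; j++)
        {
            sum += j; // the data dependency keeps the loop from being eliminated
        }
    }
    // printing the result keeps sum (and hence the loops) observable
    System.out.println(Thread.currentThread().getName() + ": " + sum);
}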
RichardC's answer gives some clues to understanding the difference in execution time. I will add that if your Java code is JIT-compiled, it may benefit a lot from the processor's branch prediction and thus make better use of the cache memories.
But more important, in my opinion, is that you are not choosing the right ratio of processes to processing to evaluate the cost of process spawning.
The test uses 10 processes that each do a significant amount of work. I would have chosen a test where many processes are spawned (some thousands? I don't know how many threads the JVM can manage), each process doing very little. For example, the following code spawns at each step twice the number of processes and waits for the deepest processes to send back the message done. With a depth of 17, which means 262143 processes in total and 131072 returned messages, it takes less than 0.5 s on my very slow PC; that is less than 2 µs per process (and of course the dual-core, dual-thread CPU should be fully used).
-module(cascade).
-compile([export_all]).

test() ->
    timer:tc(?MODULE, start, []).

start() ->
    spawn(?MODULE, child, [self(), 17]),
    loop(1024 * 128).

loop(0) -> done;
loop(N) ->
    receive
        done -> loop(N - 1)
    end.

child(P, 0) -> P ! done;
child(P, N) ->
    spawn(?MODULE, child, [P, N - 1]),
    spawn(?MODULE, child, [P, N - 1]).
There are a few problems here.
I don't know how you can evaluate what the Java compiler is doing, but I'd wager it's optimizing the loop out of existence. I think you'd have to have the loop do something meaningful to make any sort of comparison.
More importantly, the Erlang code is not doing what you think it's doing, as best as I can tell. It appears that each process is counting up to 1000000000, and then doing it again for a total of 10 times.
Perhaps worse, your functions are not tail recursive, so they keep accumulating stack frames while waiting for the last call to execute. (Edit: I may be wrong about that; I am unaccustomed to the if statement.)
Here's Erlang that does what you want it to do. It's still very slow.
-module(realloop).
-compile(export_all).

start(N) ->
    loop(0, N).

loop(N, N) ->
    io:format("Spawned ~B processes~n", [N]);
loop(A, N) ->
    spawn(realloop, count, [0, 1000000000]),
    loop(A + 1, N).

count(Upper, Upper) ->
    io:format("Reached ~B~n", [Upper]);
count(Lower, Upper) ->
    count(Lower + 1, Upper).
I need a few easily implementable single-CPU- and memory-intensive calculations that I can write in Java for a test thread scheduler.
They should be slightly time consuming, but more importantly resource consuming.
Any ideas?
A few easy examples of CPU-intensive tasks:
searching for prime numbers (involves lots of BigInteger divisions; see the sketch after this list)
calculating large factorials, e.g. 2000! (involves lots of BigInteger multiplications)
many Math.tan() calculations (this is interesting because Math.tan is native, so you're using two call stacks: one for Java calls, the other for C calls.)
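As promised in the first item, a minimal sketch of the prime search, assuming a probable-prime test is acceptable (the bit bound and certainty parameter are arbitrary knobs; raise them for a longer run):

import java.math.BigInteger;

public class PrimeSoak {
    public static void main(String[] args) {
        BigInteger two = BigInteger.valueOf(2);
        int count = 0;
        // Test odd numbers; isProbablePrime(50) does the expensive
        // BigInteger arithmetic that keeps the CPU busy.
        for (BigInteger n = BigInteger.valueOf(3); n.bitLength() <= 18; n = n.add(two)) {
            if (n.isProbablePrime(50)) {
                count++;
            }
        }
        System.out.println(count + " probable primes found");
    }
}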
Multiply two matrices. The matrices should be huge and stored on the disk.
String search. Or index a huge document (detect and count the occurrences of each word or string of letters). For example, you can index all of the identifiers in the source code of a large software project.
Calculate pi.
Rotate a 2D matrix, or an image.
Compress some huge files.
...
The CPU soak test for the PDP-11 was tan(atan(tan(atan(...))) etc. Works the FPU pretty hard and also the stack and registers.
OK, this is not Java, but it is based on the Dhrystone benchmark algorithm found here. These implementations of the algorithm might give you an idea of how it is done. The link contains sources in C/C++ and assembler for obtaining the benchmarks.
Calculate the nth term of the Fibonacci series, where n is greater than 70 (time consuming).
Calculate factorials of large numbers (time consuming).
Find all possible paths between two nodes in a graph (memory consuming).
Official RSA Challenge
Unofficial RSA Challenge - Grab some ciphertext that you want to read in plaintext. Let the computer at it. If you use a randomized algorithm, there is a small but non-zero chance that you will succeed.
I was messing around with Thread priority in Java and used the code below. It seems to keep the CPU busy enough that the thread priority makes a difference.
// Assumes static imports of Math.tan, Math.atan and Math.cbrt, JUnit's @Test,
// and an SLF4J-style LOGGER defined in the enclosing test class.
@Test
public void testCreateMultipleThreadsWithDifferentPriorities() throws Exception {
    class MyRunnable implements Runnable {
        @Override
        public void run() {
            for (int i = 0; i < 1_000_000; i++) {
                double d = tan(atan(tan(atan(tan(atan(tan(atan(tan(atan(123456789.123456789))))))))));
                cbrt(d);
            }
            LOGGER.debug("I am {}, and I have finished", Thread.currentThread().getName());
        }
    }

    final int NUMBER_OF_THREADS = 32;
    List<Thread> threadList = new ArrayList<Thread>(NUMBER_OF_THREADS);
    for (int i = 1; i <= NUMBER_OF_THREADS; i++) {
        Thread t = new Thread(new MyRunnable());
        if (i == NUMBER_OF_THREADS) {
            // Last thread gets MAX_PRIORITY
            t.setPriority(Thread.MAX_PRIORITY);
            t.setName("T-" + i + "-MAX_PRIORITY");
        } else {
            // All other threads get MIN_PRIORITY
            t.setPriority(Thread.MIN_PRIORITY);
            t.setName("T-" + i);
        }
        threadList.add(t);
    }
    threadList.forEach(t -> t.start());
    for (Thread t : threadList) {
        t.join();
    }
}
I am trying to digest the Stream package, and it seems very difficult for me to understand.
I was reading the Stream package documentation, and at one point I tried to implement it to learn by doing. This is the text I have read:
Intermediate operations return a new stream. They are always lazy; executing an intermediate operation such as filter() does not actually perform any filtering, but instead creates a new stream that, when traversed, contains the elements of the initial stream that match the given predicate. Traversal of the pipeline source does not begin until the terminal operation of the pipeline is executed.
I understand this much, that they provide a new stream, so my first question is: is creating a stream without traversing it a heavy operation?
Now, intermediate operations are lazy and terminal operations are eager, and streams are meant to be more efficient and more readable than the old if-else style of programming. The documentation continues:
Processing streams lazily allows for significant efficiencies; in a pipeline such as the filter-map-sum example above, filtering, mapping, and summing can be fused into a single pass on the data, with minimal intermediate state. Laziness also allows avoiding examining all the data when it is not necessary; for operations such as "find the first string longer than 1000 characters", it is only necessary to examine just enough strings to find one that has the desired characteristics without examining all of the strings available from the source. (This behavior becomes even more important when the input stream is infinite and not merely large.)
To demonstrate this, I started implementing a small program to understand the concept. Here is the program:
List<String> stringList = new ArrayList<>();
for (int i = 0; i < 10000; i++) {
    stringList.add("String" + i);
}

long start = System.currentTimeMillis();
Stream<String> stream = stringList.stream().filter(s -> s.contains("99"));
long midEnd = System.currentTimeMillis();
System.out.println("Time in millis before applying terminal operation: " + (midEnd - start));
System.out.println(stream.findFirst().get());
long end = System.currentTimeMillis();
System.out.println("Whole time in millis: " + (end - start));
System.out.println("Time in millis for Terminal operation: " + (end - midEnd));

start = System.currentTimeMillis();
for (String ss1 : stringList) {
    if (ss1.contains("99")) {
        System.out.println(ss1);
        break;
    }
}
end = System.currentTimeMillis();
System.out.println("Time in millis with old standard: " + (end - start));
I have executed this program many times, and each time it has shown me that creating a new stream via intermediate operations is the heavy part; the terminal operation takes very little time in comparison.
Overall, the old if-else pattern came out way more efficient than streams. So, again, more questions here:
Did I misunderstand something?
If I understand correct, why and when to use streams?
If I am doing or understanding anything wrong, can you please help clarify my understanding of the java.util.stream package?
Actual Numbers:
Try 1:
Time in millis before applying terminal operation: 73
String99
Whole time in millis: 76
Time in millis for Terminal operation: 3
String99
Time in millis with old standard: 0
Try 2:
Time in millis before applying terminal operation: 56
String99
Whole time in millis: 59
Time in millis for Terminal operation: 3
String99
Time in millis with old standard: 0
Try 3:
Time in millis before applying terminal operation: 69
String99
Whole time in millis: 72
Time in millis for Terminal operation: 3
String99
Time in millis with old standard: 0
These are my machine details, if they help:
Memory: 11.6 GiB
Processor: Intel® Core™ i7-3632QM CPU @ 2.20GHz × 8
OS type: 64-bit
One of the rationales for the Stream API is that it eliminates the inherent assumption of the for loop: that all iteration happens in the same way. When you use an iterator-based for loop, you are hard-coding the iteration logic to always iterate sequentially. Consider the question, "what if I wanted to change the implementation of the 'for' loop with something more efficient?"
The Stream API addresses that--it abstracts the notion of iteration and allows other ways of processing multiple data points to be considered: iterating serially vs. in parallel, adding optimizations if it is known that the data is unordered, etc.
Consider your example: although you can't change the implementation of the for loop, you can change the implementation of the Stream to suit different situations. For example, if you have more CPU-intensive operations to do on each task, you might choose a parallel Stream. Here's an example with 10 ms delays to simulate more complex processing, done in parallel, with very different results:
List<String> stringList = new ArrayList<>();
for (int i = 0; i < 10000; i++) {
    stringList.add("String" + i);
}

long start = System.currentTimeMillis();
Stream<String> stream = stringList.parallelStream().filter(s -> {
    try {
        Thread.sleep(10);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    return s.contains("99");
});
long midEnd = System.currentTimeMillis();
System.out.println("Time in millis before applying terminal operation: " + (midEnd - start));
System.out.println(stream.findAny().get());
long end = System.currentTimeMillis();
System.out.println("Whole time in millis: " + (end - start));
System.out.println("Time in millis for Terminal operation: " + (end - midEnd));

start = System.currentTimeMillis();
for (String ss1 : stringList) {
    try {
        Thread.sleep(10);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    if (ss1.contains("99")) {
        System.out.println(ss1);
        break;
    }
}
end = System.currentTimeMillis();
System.out.println("Time in millis with old standard: " + (end - start));
I kept the same benchmark logic everyone is complaining about, to make it easier for you to compare.
As you can see, there are situations where for loops will always be more efficient than using a Stream, but Streams offer significant advantages in certain situations as well. It would be unwise to extrapolate from one isolated test that one approach is always better than the other--that is an axiom for life as well.
Unless your tests involve JMH, your code is pretty much proof of nothing and, even worse, it will give an ALTERED impression of reality.
assylias's comment should make it clear what goes wrong.
Also, your measurements of the "intermediate operation" and then the "short circuit" are wrong. The intermediate operation, because it is lazy, does nothing really; it only takes place when a terminal one kicks in.
If you have ever worked with Guava, this is how transform/filter is done in their code too, at least logically.
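For comparison, a hedged sketch of that Guava analogy (it assumes Guava is on the classpath; Iterables.filter and Iterables.transform return lazy views, so nothing is evaluated until the result is iterated):

import java.util.Arrays;
import com.google.common.collect.Iterables;

Iterable<String> lazyView = Iterables.transform(
        Iterables.filter(Arrays.asList("a", "bb", "ccc"), s -> s.length() > 1),
        String::toUpperCase);
// No work has happened yet; iterating triggers filter + transform per element.
lazyView.forEach(System.out::println);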
As others have already noted, your benchmark is flawed. The main problem is that the results are skewed by ignoring compilation time. Try the following:
Stream<String> stream = stringList.stream().filter(s -> s.contains("99"));
long start = System.currentTimeMillis();
stream = stringList.stream().filter(s -> s.contains("99"));
long midEnd = System.currentTimeMillis();
Now the code that backs filter is already compiled and the second call is fast. Even this would work:
Stream<String> stream = stringList.stream().map(s -> s);
long start = System.currentTimeMillis();
stream = stringList.stream().filter(s -> s.contains("99"));
long midEnd = System.currentTimeMillis();
map shares most of the code with filter, so calling filter here is fast too, because the code is already compiled. And in case you ask: calling filter or map on a different stream would work too, of course.
Your "old style" code doesn't require additional compilation.
I really don't trust your "benchmark", because too many things can go wrong; you'd better use a framework. But anyway, when people or the docs say streams are more efficient, they don't mean examples like the one you provided.
Streams as lifted collections (they don't hold data) are more efficient than eager ones, such as Scala Lists, where a filter allocates a new List and a map transforms the results into yet another new List.
When we compare against that kind of implementation, streams win.
But yes, streams allocate objects, which is very cheap on modern JVMs and is handled well by modern GCs.
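To make that concrete, a minimal sketch (names and data are illustrative): the stream pipeline fuses filter and map into a single traversal with no intermediate collection, whereas an eager, stage-by-stage version materializes one.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

List<String> words = Arrays.asList("alpha", "beta", "gamma");

// Lazy: filter and map are fused into one pass; no intermediate List.
List<Integer> lazy = words.stream()
        .filter(w -> w.length() > 4)
        .map(String::length)
        .collect(Collectors.toList());

// Eager equivalent: each stage allocates and fills its own collection.
List<String> filtered = new ArrayList<>();
for (String w : words) {
    if (w.length() > 4) {
        filtered.add(w);
    }
}
List<Integer> mapped = new ArrayList<>();
for (String w : filtered) {
    mapped.add(w.length());
}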