Java - System.out effect on performance

I've seen this question and it's somewhat similar. I would like to know whether System.out really is a big factor in the performance of my application. Here's my scenario.
I have a Java webapp that can upload thousands of rows of data from a spreadsheet, which is read row by row from top to bottom. I'm using System.out.println() to show, on the server side, which line the application is currently reading.
- I'm aware that I could create a log file. In fact, I'm already writing a log file and, at the same time, displaying the logs on the server's prompt.
Is there any other way of printing the current data on the prompt?

I was recently testing with (reading and) writing large (1-1.5gb) text-files, and I found out that:
PrintWriter out = new PrintWriter(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(java.io.FileDescriptor.out), "UTF-8"), 512));
out.println(yourString);
//...
out.flush();
is in fact about 2.5 times as fast as
System.out.println(yourString);
My test program first read about 1 GB of data, processed it a bit and output it in a slightly different format.
Test results (on a MacBook Pro with an SSD, reading and writing on the same disk):
data-output-to-system-out > output.txt => 1min32sec
data-written-to-file-in-java => 37sec
data-written-to-buffered-writer-stdout > output.txt => 36sec
I tried multiple buffer sizes between 256 and 10k, but that didn't seem to matter.
So keep in mind: if you're creating Unix command-line tools in Java whose output is meant to be redirected or piped somewhere else, don't use System.out directly!
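One plausible contributor to the gap is how often each approach hits the underlying stream. The sketch below is only an illustration under assumptions: CountingOutputStream and WriteCallDemo are hypothetical names, the counting stream is a stand-in for the console, and it does not model how System.out is wired up on any particular JVM. It just counts write calls for an auto-flushing PrintStream versus a BufferedWriter that is flushed once.

```java
import java.io.BufferedWriter;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintStream;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for the console: counts how often it is written to. */
final class CountingOutputStream extends OutputStream {
    private int writeCalls;

    @Override
    public void write(int b) {
        writeCalls++;
    }

    @Override
    public void write(byte[] b, int off, int len) {
        writeCalls++;
    }

    int writeCalls() {
        return writeCalls;
    }
}

final class WriteCallDemo {
    /** Prints through an auto-flushing PrintStream, so every println reaches the sink. */
    static int autoFlushWrites(int lines) {
        CountingOutputStream sink = new CountingOutputStream();
        PrintStream out = new PrintStream(sink, true);
        for (int i = 0; i < lines; i++) {
            out.println("row " + i);
        }
        out.flush();
        return sink.writeCalls();
    }

    /** Prints through a BufferedWriter and flushes once at the end. */
    static int bufferedWrites(int lines) {
        CountingOutputStream sink = new CountingOutputStream();
        PrintWriter out = new PrintWriter(
                new BufferedWriter(new OutputStreamWriter(sink, StandardCharsets.UTF_8), 8192));
        for (int i = 0; i < lines; i++) {
            out.println("row " + i);
        }
        out.flush();
        return sink.writeCalls();
    }

    public static void main(String[] args) {
        System.out.println("auto-flush writes: " + autoFlushWrites(1000));
        System.out.println("buffered writes:   " + bufferedWrites(1000));
    }
}
```

With a real console or pipe, each of those write calls can turn into a system call, which is why the buffered variant tends to do less work per line.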

It can have an impact on your application's performance. The magnitude will vary depending on the hardware you are running on and the load on the host.
Some ways in which this can translate into a performance cost:
-> As Rocket boy stated, println is synchronized, which means you incur locking overhead on the object header, and it may cause thread bottlenecks depending on your design.
-> Printing to the console requires kernel time, and kernel time means the CPU is not running in user mode: it is busy executing kernel code instead of your application code.
-> If you are already logging this, that means extra kernel time for I/O, and if your platform does not support asynchronous I/O, your CPU might stall on busy waits.
You can benchmark this and verify it yourself.
There are ways to get away with it, for example having really fast I/O, a huge machine for dedicated use, or enabling biased locking in your JVM options if your application design will not be multithreaded around that console printing.
Like everything in performance, it all depends on your hardware and priorities.

System.out.println()
is synchronized.
public void println(String x) {
    synchronized (this) {
        print(x);
        newLine();
    }
}
If multiple threads write to it, its performance will suffer.
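If several threads really do need to print, one way to reduce how often they compete for that lock is to collect a thread's lines locally and emit them in one call. This is a rough sketch, not a recommendation from the answer above; BatchedPrinter and its methods are hypothetical names.

```java
import java.io.PrintStream;
import java.util.Arrays;
import java.util.List;

final class BatchedPrinter {
    /** Joins the lines into one string, each terminated by the platform line separator. */
    static String batch(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line).append(System.lineSeparator());
        }
        return sb.toString();
    }

    /** Emits the whole batch with a single print call instead of one println per line. */
    static void printBatch(PrintStream out, List<String> lines) {
        out.print(batch(lines));
    }

    public static void main(String[] args) {
        printBatch(System.out, Arrays.asList("row 1", "row 2", "row 3"));
    }
}
```

The trade-off is that output appears in bursts rather than line by line, so the batch size decides how "live" the console looks.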

Yes, it will have a HUGE impact on performance. If you want a quantifiable number, well then there's plenty of software and/or ways of measuring your own code's performance.

System.out.println is very slow, even compared to most other slow operations. This is because it places more work on the machine than other IO operations (and it is effectively single-threaded, since writers have to take turns).
I suggest you write the output to a file and tail that file. This way, the output will still be slow, but it won't slow down your web service as much.
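A minimal sketch of that idea, assuming a hypothetical ProgressLog helper and an arbitrary flush interval of 100 rows (both are illustrative, not part of the original suggestion). The file can then be followed from a terminal with tail -f.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

/** Hypothetical helper: appends progress lines to a file that can be followed with tail -f. */
final class ProgressLog implements AutoCloseable {
    private static final int FLUSH_EVERY = 100; // assumed batch size, tune to taste
    private final BufferedWriter writer;
    private int pending;

    ProgressLog(Path file) throws IOException {
        writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    void row(int rowNumber) throws IOException {
        writer.write("Reading row " + rowNumber);
        writer.newLine();
        if (++pending >= FLUSH_EVERY) { // flush in batches so tail sees regular updates
            writer.flush();
            pending = 0;
        }
    }

    @Override
    public void close() throws IOException {
        writer.close(); // close() flushes whatever is still buffered
    }

    /** Writes 250 sample rows to a temporary file and reads them back. */
    static List<String> demo() {
        try {
            Path file = Files.createTempFile("progress", ".log");
            try (ProgressLog log = new ProgressLog(file)) {
                for (int row = 1; row <= 250; row++) {
                    log.row(row);
                }
            }
            List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
            Files.delete(file);
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Flushing in batches keeps the number of disk writes low while still letting tail show progress at a reasonable pace.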

Here's a very simple program to check performance of System.out.println and compare it with multiplication operation (You can use other operations or function specific to your requirements).
public class Main{
public static void main(String []args) throws InterruptedException{
long tTime = System.nanoTime();
long a = 123L;
long b = 234L;
long c = a*b;
long uTime = System.nanoTime();
System.out.println("a * b = "+ c +". Time taken for multiplication = "+ (uTime - tTime) + " nano Seconds");
long vTime = System.nanoTime();
System.out.println("Time taken to execute Print statement : "+ (vTime - uTime) + " nano Seconds");
}
}
Output depends on your machine and its current state.
Here's what I got on https://www.onlinegdb.com/online_java_compiler:
a * b = 28782. Time taken for multiplication = 330 nano Seconds
Time taken to execute Print statement : 338650 nano Seconds
EDIT:
I have a logger set up on my local machine, so I wanted to give you an idea of the performance difference between System.out.println and logger.info, i.e. a comparison between console printing and logging:
public static void main(String []args) throws InterruptedException{
long tTime = System.nanoTime();
long a = 123L;
long b = 234L;
long c = a*b;
long uTime = System.nanoTime();
System.out.println("a * b = "+ c +". Time taken for multiplication = "+ (uTime - tTime) + " nano Seconds");
long vTime = System.nanoTime();
System.out.println("Time taken to execute Print statement : "+ (vTime - uTime) + " nano Seconds");
long wTime = System.nanoTime();
logger.info("a * b = "+ c +". Time taken for multiplication = "+ (uTime - tTime) + " nano Seconds");
long xTime = System.nanoTime();
System.out.println("Time taken to execute log statement : "+ (xTime - wTime) + " nano Seconds");
}
Here's what I got on my local machine :
a * b = 28782. Time taken for multiplication = 1667 nano Seconds
Time taken to execute Print statement : 34080917 nano Seconds
2022-11-15 11:36:32.734 [] INFO CreditAcSvcImpl uuid: - a * b = 28782. Time taken for multiplication = 1667 nano Seconds
Time taken to execute log statement : 9645083 nano Seconds
Notice that System.out.println takes almost 24 ms longer than logger.info.

Related

How to get the time taken for part of a program in Unix

I'm working on comparing a Binary Search Tree to an AVL one and want to see the usr/sys time for a search operation performed on both. Thing is: I have an application (SearchBST.java/SearchAVL.java) that reads in a file and populates the trees, and then searches them. I want to know if I can check the usr/sys time for just the searching instead of the entire thing (inserting and searching). It seems to me that the insertion is causing the AVL's time (using "time java SearchAVL") to be roughly the same as the BST's.
Should I be doing it differently (such that populating the tree doesn't affect the overall time)? I'll post some code as soon as I can, but I wanted to see if anyone has any thoughts in the meantime.

Why don't you measure the time inside your application?
// Read file to a temporary collection or array
// to prevent measuring disk performance instead of tree performance
long t = System.nanoTime();
// populate tree
long tPopulate = System.nanoTime() - t;
t = System.nanoTime();
// search tree
long tSearch = System.nanoTime() - t;
System.out.println("tPopulate = " + tPopulate + " ns");
System.out.println("tSearch = " + tSearch + " ns");
This will only print the wall clock time, but since you don't have any Thread.sleep(...) commands or things like that in your program, the wall clock time shouldn't differ much from the user time.
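If the user CPU time itself is what you are after, the JVM can report it per thread through ThreadMXBean. The following is a sketch under assumptions: CpuTimer is a hypothetical helper, and whether CPU time measurement is available depends on the JVM and platform, so the method returns -1 when it cannot measure.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

final class CpuTimer {
    /**
     * Runs the task and returns the user-mode CPU time it consumed on the
     * current thread in nanoseconds, or -1 if the JVM cannot measure it.
     */
    static long userNanos(Runnable task) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isCurrentThreadCpuTimeSupported()) {
            return -1;
        }
        long before = threads.getCurrentThreadUserTime();
        task.run();
        long after = threads.getCurrentThreadUserTime();
        // getCurrentThreadUserTime() returns -1 when measurement is disabled
        return (before < 0 || after < 0) ? -1 : after - before;
    }

    public static void main(String[] args) {
        long nanos = userNanos(() -> {
            // placeholder for the "search tree" step
            double acc = 0;
            for (int i = 1; i < 1_000_000; i++) {
                acc += Math.sqrt(i);
            }
            if (acc < 0) {
                System.out.println(acc); // keeps the loop from being trivially discarded
            }
        });
        System.out.println("user CPU time = " + nanos + " ns");
    }
}
```

Wrapping only the search step in such a call keeps the tree population out of the number, which is the separation the question asks for.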

Java : Issue with capturing execution time per iteration in a Map

I've a requirement to capture the execution time of some code in iterations. I've decided to use a Map<Integer,Long> for capturing this data where Integer(key) is the iteration number and Long(value) is the time consumed by that iteration in milliseconds.
I've written the below java code to compute the time taken for each iteration. I want to ensure that the time taken by all iterations is zero before invoking actual code. Surprisingly, the below code behaves differently for every execution.
Sometimes, I get the desired output(zero millisecond for all iterations), but at times I do get positive and even negative values for some random iterations.
I've tried replacing System.currentTimeMillis(); with below code:
new java.util.Date().getTime();
System.nanoTime();
org.apache.commons.lang.time.StopWatch
but still no luck.
Any suggestions as to why some iterations take additional time and how to eliminate it?
package com.stackoverflow.programmer;
import java.util.HashMap;
import java.util.Map;
public class TestTimeConsumption {
public static void main(String[] args) {
Integer totalIterations = 100000;
Integer nonZeroMilliSecondsCounter = 0;
Map<Integer, Long> timeTakenMap = new HashMap<>();
for (Integer iteration = 1; iteration <= totalIterations; iteration++) {
timeTakenMap.put(iteration, getTimeConsumed(iteration));
if (timeTakenMap.get(iteration) != 0) {
nonZeroMilliSecondsCounter++;
System.out.format("Iteration %6d has taken %d millisecond(s).\n", iteration,
timeTakenMap.get(iteration));
}
}
System.out.format("Total non zero entries : %d", nonZeroMilliSecondsCounter);
}
private static Long getTimeConsumed(Integer iteration) {
long startTime = System.currentTimeMillis();
// Execute code for which execution time needs to be captured
long endTime = System.currentTimeMillis();
return (endTime - startTime);
}
}
Here's the sample output from 5 different executions of the same code:
Execution #1 (NOT OK)
Iteration 42970 has taken 1 millisecond(s).
Total non zero entries : 1
Execution #2 (OK)
Total non zero entries : 0
Execution #3 (OK)
Total non zero entries : 0
Execution #4 (NOT OK)
Iteration 65769 has taken -1 millisecond(s).
Total non zero entries : 1
Execution #5 (NOT OK)
Iteration 424 has taken 1 millisecond(s).
Iteration 33053 has taken 1 millisecond(s).
Iteration 76755 has taken -1 millisecond(s).
Total non zero entries : 3
I am looking for a Java based solution that ensures that all
iterations consume zero milliseconds consistently. I prefer to
accomplish this using pure Java code without using a profiler.
Note: I was also able to accomplish this through C code.

Your HashMap performance may be dropping if it is resizing. The default capacity is 16, which you are exceeding. If you know the expected capacity up front, create the HashMap with the appropriate size, taking into account the default load factor of 0.75.
If you rerun iterations without defining a new map and the Integer key does not start again from zero, you will need to resize the map taking into account the total of all possible iterations.
int capacity = (int) ((100000/0.75)+1);
Map<Integer, Long> timeTakenMap = new HashMap<>(capacity);
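The same calculation can be wrapped in a small helper so the load factor is not repeated at every call site. PresizedMaps is a hypothetical name, and the sketch assumes the default load factor of 0.75 mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

final class PresizedMaps {
    private static final double DEFAULT_LOAD_FACTOR = 0.75;

    /** Initial capacity that lets the map hold expectedEntries without resizing. */
    static int capacityFor(int expectedEntries) {
        return (int) (expectedEntries / DEFAULT_LOAD_FACTOR) + 1;
    }

    /** Creates an empty HashMap sized for the expected number of entries. */
    static <K, V> Map<K, V> presized(int expectedEntries) {
        return new HashMap<>(capacityFor(expectedEntries));
    }

    public static void main(String[] args) {
        Map<Integer, Long> timeTakenMap = presized(100_000);
        System.out.println(capacityFor(100_000)); // prints 133334
        System.out.println(timeTakenMap.isEmpty()); // prints true
    }
}
```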

As you are starting to learn here, writing microbenchmarks in Java is not as easy as one would first assume. Everybody gets bitten at some point, even the hardened performance experts who have been doing it for years.
A lot is going on within the JVM and the OS that skews the results, such as GC, hotspot on the fly optimisations, recompilations, clock corrections, thread contention/scheduling, memory contention and cache misses. To name just a few. And sadly these skews are not consistent, and they can very easily dominate a microbenchmark.
To answer your immediate question of why the timings can sometimes go negative: currentTimeMillis is designed to capture wall-clock time, not elapsed time. No wall clock on a computer is perfectly accurate, and there are times when the clock will be adjusted, quite possibly backwards. More detail on Java's clocks can be found in the Oracle blog post Inside the Oracle Hotspot VM clocks.
Further details and support for nanoTime versus currentTimeMillis can be read here.
Before continuing with your own benchmark, I strongly recommend that you read How do I write a correct micro-benchmark in Java. The quick synopsis is to 1) warm up the JVM before taking results, 2) jump through hoops to avoid dead code elimination, 3) ensure that nothing else is running on the same machine, but accept that there will be thread scheduling going on (you may even want to pin threads to cores, depending on how far you want to take this), 4) use a framework specifically designed for microbenchmarking such as JMH or, for quick lightweight spikes, JUnitMosaic, which gives good results.
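For a rough idea of what points 1 and 2 look like when done by hand, here is a sketch. ManualBench is a hypothetical helper, the run counts are arbitrary, and this is no substitute for JMH: it only shows a warm-up phase, System.nanoTime() for elapsed time, and a sink that keeps the measured work from being eliminated as dead code.

```java
import java.util.function.LongSupplier;

final class ManualBench {
    /** Receives the accumulated results so the JIT cannot discard the measured work. */
    static volatile long sink;

    /** Returns the average time per call in nanoseconds after a warm-up phase. */
    static double averageNanos(LongSupplier task, int warmupRuns, int measuredRuns) {
        long acc = 0;
        for (int i = 0; i < warmupRuns; i++) { // give the JIT a chance to compile the task
            acc += task.getAsLong();
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            acc += task.getAsLong();
        }
        long elapsed = System.nanoTime() - start;
        sink = acc; // publish the result so the loops have an observable effect
        return (double) elapsed / measuredRuns;
    }

    public static void main(String[] args) {
        double avg = averageNanos(() -> Long.bitCount(System.nanoTime()), 20_000, 100_000);
        System.out.println("average per call: " + avg + " ns");
    }
}
```

Even with these precautions the numbers will fluctuate between runs, which is the main argument for using a dedicated framework.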

I'm not sure if I understand your question.
You're trying to execute a certain set of statements S, and expect the execution time to be zero. You then test this premise by executing it a number of times and verifying the result.
That is a strange expectation to have: anything consumes some time, and possibly more than you expect. Hence, although the test could succeed, that does not prove that no time has been used, since your program is save_time();execute(S);compare_time(). Even if execute(S) does nothing, your timing is discrete, and so it is possible that the 'tick' of your wall clock happens to fall right between save_time and compare_time, making it look as if some time has passed.
As such, I'd expect your C program to behave exactly the same. Have you run that multiple times? What happens when you increase the iterations to millions? If it still does not occur, then apparently your C compiler has optimized the code in such a way that no time is measured, and apparently Java doesn't.
Or am I understanding you wrong?
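The discreteness is easy to observe. The sketch below (ClockTick is a hypothetical name) spins until System.currentTimeMillis() changes, so that it starts right at a tick boundary, then spins again and reports the size of the next step. No work is done between the reads, yet the difference is never zero.

```java
final class ClockTick {
    /** Spins until currentTimeMillis() changes twice and returns the size of the second step. */
    static long observedTickMillis() {
        long first = System.currentTimeMillis();
        long edge;
        do {
            edge = System.currentTimeMillis();
        } while (edge == first); // align to the start of a tick
        long next;
        do {
            next = System.currentTimeMillis();
        } while (next == edge); // wait for the following tick
        return next - edge;
    }

    public static void main(String[] args) {
        System.out.println("observed tick: " + observedTickMillis() + " ms");
    }
}
```

On most systems the reported step is 1 ms, but the exact value depends on the platform, and a clock adjustment during the loop could even make it negative.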

You hinted at it correctly... System.currentTimeMillis() is the way to go in this case.
There is no guarantee that incrementing the Integer iteration counter corresponds to a millisecond or a CPU cycle time on any system...
You should take System.currentTimeMillis() and calculate the elapsed time.
Example:
public static void main(String[] args) {
long lapsedTime = System.currentTimeMillis();
doFoo();
lapsedTime -= System.currentTimeMillis();
System.out.println("Time:" + -lapsedTime);
}

I am also not sure I understand exactly. You're trying to execute certain code and get the execution time for each iteration.
I hope I understand correctly. If so, I would suggest using
System.nanoTime() instead of System.currentTimeMillis(), because if your block of statements is small enough you will always get zero milliseconds.
A simple example could be:
public static void main(String[] args) {
long lapsedTime = System.nanoTime();
//do your stuff here.
lapsedTime -= System.nanoTime();
System.out.println("Time Taken" + -lapsedTime);
}
System.nanoTime() and System.currentTimeMillis() are used in much the same way. The difference is how fine-grained a result you need: measured in milliseconds, you may get zero if the statements in each iteration do very little work.
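To make the truncation concrete, here is a small sketch that reports the same elapsed time in both units. Elapsed is a hypothetical helper and the sample values are arbitrary.

```java
import java.util.concurrent.TimeUnit;

final class Elapsed {
    /** Formats an elapsed time in nanoseconds together with its whole-millisecond value. */
    static String describe(long nanos) {
        long millis = TimeUnit.NANOSECONDS.toMillis(nanos); // truncates, so anything under 1 ms becomes 0
        return nanos + " ns (" + millis + " ms)";
    }

    public static void main(String[] args) {
        System.out.println(describe(338_650L));    // prints "338650 ns (0 ms)"
        System.out.println(describe(34_080_917L)); // prints "34080917 ns (34 ms)"
    }
}
```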

What is the actual execution time of a method in Java, and what does it depend on?

I have one problem that I can't explain. Here is the code in main function:
String numberStr = "3151312423412354315";
System.out.println(numberStr + "\n");
System.out.println("Lehman method: ");
long beginTime = System.currentTimeMillis();
System.out.println(Lehman.getFullFactorization(numberStr));
long finishTime = System.currentTimeMillis();
System.out.println((finishTime-beginTime)/1000. + " sec.");
System.out.println();
System.out.println("Lehman method: ");
beginTime = System.currentTimeMillis();
System.out.println(Lehman.getFullFactorization(numberStr));
finishTime = System.currentTimeMillis();
System.out.println((finishTime-beginTime)/1000. + " sec.");
In case it is relevant: the method Lehman.getFullFactorization(...) returns an ArrayList of the prime divisors in String format.
Here is the output:
3151312423412354315
Lehman method:
[5, 67, 24473, 384378815693]
0.149 sec.
Lehman method:
[5, 67, 24473, 384378815693]
0.016 sec.
I was surprised when I saw it. Why is the second execution of the same method much faster than the first? At first I thought that the first run's time included the time to start the JVM and its resources, but that's impossible, because obviously the JVM starts before the execution of the "main" method.

In some cases, Java's JIT compiler (see http://java.sun.com/developer/onlineTraining/Programming/JDCBook/perf2.html#jit) kicks in on the first execution of a method and performs optimizations of that methods code. That is supposed to make all subsequent executions faster. I think this might be what happens in your case.

Try doing it more than 10,000 times and it will be much faster. This is because the code first has to be loaded (expensive) then runs in interpreted mode (ok speed) and is finally compiled to native code (much faster)
Can you try this?
int runs = 100*1000;
for(int i = -20000 /* warmup */; i < runs; i++) {
if(i == 0)
beginTime = System.nanoTime();
Lehman.getFullFactorization(numberStr);
}
finishTime = System.nanoTime();
System.out.println("Average time was " + (finishTime - beginTime) / 1e9 / runs + " sec.");

I suppose the JVM has cached the results (maybe partially) of the first calculation, and you observe the faster second calculation. JIT in action.

There are two things that make the second run faster.
The first time, the class containing the method must be loaded. The second time, it is already in memory.
Most importantly, the JIT optimizes code that is often executed: during the first call, the JVM starts by interpreting the byte code and then compiles it into machine code and continues the execution. The second time, the code is already compiled.
That's why micro-benchmarks in Java are often hard to validate.

My guess is that it's saved in the L1/L2 cache on the CPU for optimization.
Or Java doesn't have to interpret it again and recalls it from the memory as already compiled code.

How can I benchmark method execution time in Java?

I have a program which I have written myself in Java, but I want to test method execution times and get timings for specific methods. I was wondering if this is possible, maybe with an Eclipse plug-in, or maybe by inserting some code?
It is quite a small program, nothing more than 1500 lines. Which would be better: a dedicated tool or System.currentTimeMillis()?

Other than using a profiler, a simple way of getting what you want is the following:
public class SomeClass{
public void somePublicMethod()
{
long startTime = System.currentTimeMillis();
someMethodWhichYouWantToProfile();
long endTime = System.currentTimeMillis();
System.out.println("Total execution time: " + (endTime-startTime) + "ms");
}
}

If the bottleneck is big enough to be observed with a profiler, use a profiler.
If you need more accuracy, the best way of measuring a small piece of code is to use a Java microbenchmark framework like OpenJDK's JMH or Google's Caliper. I believe they are as simple to use as JUnit, and not only will you get more accurate results, you will also gain from the community's expertise in doing it right.
Here is a JMH microbenchmark that measures the execution time of Math.log():
// fields like this live in a class annotated with @State(Scope.Thread)
private double x = Math.PI;
@Benchmark
public double myBenchmark() {
    return Math.log(x);
}
Using currentTimeMillis() and nanoTime() for measuring has many limitations:
They have latency (they also take time to execute), which biases your measurements.
They have VERY limited precision: you can only measure in steps of about 26 ns on Linux and 300 or so on Windows, as described here.
The warmup phase is not taken into consideration, making your measurements fluctuate A LOT.
The currentTimeMillis() and nanoTime() methods in Java can be useful but must be used with EXTREME CAUTION, or you can get measurement errors like this, where the order of the measured snippets influences the measurements, or like this, where the author wrongly concludes that several million operations were performed in less than a ms, when in fact the JVM realised no operations were needed and hoisted the code, running no code at all.
Here is a wonderful video explaining how to microbenchmark the right way: https://shipilev.net/#benchmarking

For quick and dirty time measurement tests, don't use wall-clock time (System.currentTimeMillis()). Prefer System.nanoTime() instead:
public static void main(String... ignored) throws InterruptedException {
final long then = System.nanoTime();
TimeUnit.SECONDS.sleep(1);
final long millis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - then);
System.out.println("Slept for (ms): " + millis); // = something around 1000.
}

You should use a profiler like
jprofiler
yourkit
They will easily integrate with any IDE and show whatever detail you need.
Of course, these tools are complex and meant for profiling complex programs. If you just need some simple benchmarks, I suggest you use System.currentTimeMillis() or System.nanoTime() and calculate the delta between calls yourself.

Using a profiler is better because you can find out average execution times and bottlenecks in your app.
I use VisualVM. Slick and simple.

Google Guava has a stopwatch, makes things simple and easy.
Stopwatch stopwatch = Stopwatch.createStarted();
myFunctionCall();
LOGGER.debug("Time taken by myFunctionCall: " + stopwatch.stop());

JProfiler and YourKit are good, but cost money.
There is a free plugin for Eclipse called TPTP (Test & Performance Tools Platform) that can give you code execution times. Here is a tutorial that a quick Google search brought up: http://www.eclipse.org/articles/Article-TPTP-Profiling-Tool/tptpProfilingArticle.html

Another custom-made solution could be based on the following post: http://www.mkyong.com/spring/spring-aop-examples-advice/
You also have the possibility of using application monitoring and SNMP utilities. If you need to "time" your methods on a regular basis in a production environment, you should probably consider using one of those SNMP tools.

Usually I store the time in a .txt file to analyse the outcome:
StopWatch sWatch = new StopWatch();
sWatch.start();
//do stuff that you want to measure
downloadContent();
sWatch.stop();
//make the time pretty
long timeInMilliseconds = sWatch.getTime();
long hours = TimeUnit.MILLISECONDS.toHours(timeInMilliseconds);
long minutes = TimeUnit.MILLISECONDS.toMinutes(timeInMilliseconds - TimeUnit.HOURS.toMillis(hours));
long seconds = TimeUnit.MILLISECONDS.toSeconds(timeInMilliseconds - TimeUnit.HOURS.toMillis(hours) - TimeUnit.MINUTES.toMillis(minutes));
long milliseconds = timeInMilliseconds - TimeUnit.HOURS.toMillis(hours) - TimeUnit.MINUTES.toMillis(minutes) - TimeUnit.SECONDS.toMillis(seconds);
String t = String.format("%02d:%02d:%02d:%d", hours, minutes, seconds, milliseconds);
//each line to store in a txt file, new line
String content = "Ref: " + ref + " - " + t + "\r\n";
//you may want wrap this section with a try catch
File file = new File("C:\\time_log.txt");
FileWriter fw = new FileWriter(file.getAbsoluteFile(), true); //append content set to true, so it does not overwrite existing data
BufferedWriter bw = new BufferedWriter(fw);
bw.write(content);
bw.close();
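The "make the time pretty" step can be written more compactly with java.time.Duration. This is an alternative sketch, not the code above: ElapsedFormat is a hypothetical name, and it pads the milliseconds to three digits and separates them with a dot, which differs slightly from the format string used above.

```java
import java.time.Duration;
import java.util.Locale;

final class ElapsedFormat {
    /** Formats elapsed milliseconds as HH:mm:ss.SSS. */
    static String format(long elapsedMillis) {
        Duration d = Duration.ofMillis(elapsedMillis);
        return String.format(Locale.ROOT, "%02d:%02d:%02d.%03d",
                d.toHours(),           // whole hours
                d.toMinutes() % 60,    // minutes within the hour
                d.getSeconds() % 60,   // seconds within the minute
                d.toMillis() % 1000);  // milliseconds within the second
    }

    public static void main(String[] args) {
        System.out.println(format(3_723_004L)); // prints 01:02:03.004
    }
}
```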

You can add this code and it will tell you how long the method took to execute.
long millisecondsStart = System.currentTimeMillis();
executeMethod();
long timeSpentInMilliseconds = System.currentTimeMillis() - millisecondsStart;

How Stream is more efficient?

I am trying to digest the Stream package, and it seems very difficult for me to understand.
I was reading the Stream package documentation, and at one point I tried to implement it to learn by doing. This is the text I read:
Intermediate operations return a new stream. They are always lazy;
executing an intermediate operation such as filter() does not actually
perform any filtering, but instead creates a new stream that, when
traversed, contains the elements of the initial stream that match the
given predicate. Traversal of the pipeline source does not begin until
the terminal operation of the pipeline is executed.
I understand this much: they provide a new Stream. So my first question is: is creating a stream without traversing it a heavy operation?
Now, intermediate operations are lazy and terminal operations are eager, and streams are also meant to be more efficient and more readable than the old programming style of if-else. The documentation continues:
Processing streams lazily allows for significant efficiencies; in a
pipeline such as the filter-map-sum example above, filtering, mapping,
and summing can be fused into a single pass on the data, with minimal
intermediate state. Laziness also allows avoiding examining all the
data when it is not necessary; for operations such as "find the first
string longer than 1000 characters", it is only necessary to examine
just enough strings to find one that has the desired characteristics
without examining all of the strings available from the source. (This
behavior becomes even more important when the input stream is infinite
and not merely large.)
To demonstrate this, I started implementing a small program to understand the concept. Here is the program:
List<String> stringList = new ArrayList<>();
for (int i = 0; i < 10000; i++) {
stringList.add("String" + i);
}
long start = System.currentTimeMillis();
Stream stream = stringList.stream().filter(s -> s.contains("99"));
long midEnd = System.currentTimeMillis();
System.out.println("Time is millis before applying terminal operation: " + (midEnd - start));
System.out.println(stream.findFirst().get());
long end = System.currentTimeMillis();
System.out.println("Whole time in millis: " + (end - start));
System.out.println("Time in millis for Terminal operation: " + (end - midEnd));
start = System.currentTimeMillis();
for (String ss1 : stringList) {
if (ss1.contains("99")) {
System.out.println(ss1);
break;
}
}
end = System.currentTimeMillis();
System.out.println("Time in millis with old standard: " + (end - start));
I have executed this program many times, and each time it suggested that creating a new stream from intermediate operations is the heavy task. Terminal operations take very little time compared to intermediate operations.
And overall, the old if-else pattern is way more efficient than streams. So, again, more questions:
Did I misunderstand something?
If I understand correctly, why and when should I use streams?
If I am doing or understanding anything wrong, can you please help clarify the concepts of the package java.util.stream?
Actual Numbers:
Try 1:
Time is millis before applying terminal operation: 73
String99
Whole time in millis: 76
Time in millis for Terminal operation: 3
String99
Time in millis with old standard: 0
Try 2:
Time is millis before applying terminal operation: 56
String99
Whole time in millis: 59
Time in millis for Terminal operation: 3
String99
Time in millis with old standard: 0
Try 3:
Time is millis before applying terminal operation: 69
String99
Whole time in millis: 72
Time in millis for Terminal operation: 3
String99
Time in millis with old standard: 0
These are my machine details if this help:
Memory: 11.6 GiB
Processor: Intel® Core™ i7-3632QM CPU @ 2.20GHz × 8
OS-Type: 64-bit

One of the rationales for the Stream api is that it eliminates the inherent assumption of the for loop, that all iteration happens in the same way. When you use an iterator-based for loop, you are hard-coding the iteration logic to always iterate sequentially. Consider the question, "what if I wanted to change the implementation of the 'for' loop with something more efficient?"
The Stream api addresses that--it abstracts the notion of iteration and allows other ways of processing multiple data points to be considered -- iterate serially vs. in parallel, add optimizations if it is known that the data is unordered, etc.
Consider your example: although you can't change the implementation of the for loop, you can change the implementation of the Stream to suit different situations. For example, if you have more CPU-intensive operations to do on each task, you might choose a parallel Stream. Here's an example with 10 ms delays to simulate more complex processing, done in parallel, with very different results:
List<String> stringList = new ArrayList<>();
for (int i = 0; i < 10000; i++) {
stringList.add("String" + i);
}
long start = System.currentTimeMillis();
Stream stream = stringList.parallelStream().filter(s -> {
try {
Thread.sleep(10);
} catch (InterruptedException e) {
e.printStackTrace();
}
return s.contains("99" );});
long midEnd = System.currentTimeMillis();
System.out.println("Time is millis before applying terminal operation: " + (midEnd - start));
System.out.println(stream.findAny().get());
long end = System.currentTimeMillis();
System.out.println("Whole time in millis: " + (end - start));
System.out.println("Time in millis for Terminal operation: " + (end - midEnd));
start = System.currentTimeMillis();
for (String ss1 : stringList) {
try {
Thread.sleep(10);
} catch (InterruptedException e) {
e.printStackTrace();
}
if (ss1.contains("99")) {
System.out.println(ss1);
break;
}
}
end = System.currentTimeMillis();
System.out.println("Time in millis with old standard: " + (end - start));
I kept the same benchmark logic everyone is complaining about, to make it easier for you to compare.
As you can see, there are situations where for loops will always be more efficient than using a Stream, but Streams offer significant advantages in certain situations as well. It would be unwise to extrapolate from one isolated test that one approach is always better than the other--that is an axiom for life as well.
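The point about swapping the iteration strategy can be shown without any timing at all. In this sketch (IterationStrategyDemo is a hypothetical name) the only difference between the two methods is stream() versus parallelStream(), and both produce the same count.

```java
import java.util.ArrayList;
import java.util.List;

final class IterationStrategyDemo {
    /** Builds the same sample data as the question: "String0" to "String9999". */
    static List<String> sampleStrings() {
        List<String> strings = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            strings.add("String" + i);
        }
        return strings;
    }

    static long sequentialMatches(List<String> strings) {
        return strings.stream().filter(s -> s.contains("99")).count();
    }

    static long parallelMatches(List<String> strings) {
        return strings.parallelStream().filter(s -> s.contains("99")).count();
    }

    public static void main(String[] args) {
        List<String> strings = sampleStrings();
        System.out.println("sequential: " + sequentialMatches(strings));
        System.out.println("parallel:   " + parallelMatches(strings));
    }
}
```

Whether the parallel version is actually faster depends on the cost per element and the machine, so it is worth measuring rather than assuming.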

Unless your tests involve JMH, then your code is pretty much a proof of nothing and even worse, it will give an ALTERED impression of reality.
assylias made the comment that should make it clear on what goes wrong.
Also your measurements of the "intermediate operation" and then the "short circuit" are also wrong. The intermediate operation, because it is lazy, does nothing really, it will only take place when a terminal one will kick in.
If you ever worked with guava, this is how transform/filter is done in their code also, at least logically.
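The laziness can be checked directly by counting how often the predicate runs. In this sketch (LazyStreamDemo is a hypothetical name) the counter stays at zero after filter() and only moves once findFirst() pulls elements through the pipeline, stopping at the first match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

final class LazyStreamDemo {
    /** Returns the predicate call count before and after the terminal operation. */
    static int[] predicateCalls() {
        List<String> strings = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            strings.add("String" + i);
        }
        AtomicInteger calls = new AtomicInteger();
        Stream<String> filtered = strings.stream().filter(s -> {
            calls.incrementAndGet(); // count every element the predicate actually sees
            return s.contains("99");
        });
        int before = calls.get(); // nothing has been filtered yet
        String first = filtered.findFirst().orElse("");
        int after = calls.get(); // elements String0 to String99 were examined
        if (!first.equals("String99")) {
            throw new IllegalStateException("unexpected first match: " + first);
        }
        return new int[] {before, after};
    }

    public static void main(String[] args) {
        int[] counts = predicateCalls();
        System.out.println("calls before terminal operation: " + counts[0]); // prints 0
        System.out.println("calls after findFirst():         " + counts[1]); // prints 100
    }
}
```

So the time measured around the filter() call in the question is not filtering time at all; the filtering happens inside the terminal operation.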

As others have already noted, your benchmark is flawed. The main problem is that the results are skewed by ignoring compilation time. Try the following:
Stream stream = stringList.stream().filter(s -> s.contains("99"));
long start = System.currentTimeMillis();
stream = stringList.stream().filter(s -> s.contains("99"));
long midEnd = System.currentTimeMillis();
Now the code that backs filter is already compiled and the second call is fast. Even this would work:
Stream stream = stringList.stream().map(s -> s);
long start = System.currentTimeMillis();
stream = stringList.stream().filter(s -> s.contains("99"));
long midEnd = System.currentTimeMillis();
map shares most of the code with filter, so calling filter is fast here, too, because the code is already compiled. And in case you ask: Calling filter or map on a different stream would work too, of course.
Your "old style" code doesn't require additional compilation.

I really don't trust your "benchmark", because too many things can go wrong; you had better use a framework. But anyway, when people or docs say streams are more efficient, they don't mean the example you provided.
Streams, as lifted collections (they don't hold data), are more efficient than eager collections such as Scala Lists, where a filter allocates a new List and the map transforms the results into yet another new List.
Compared with that kind of implementation, Streams win.
But yes, streams do allocate objects, which is very cheap on modern JVMs and well handled by modern GCs.
