Why is System.out.println so slow?

Why is System.out.println so slow? - java

Is this something common to all programming languages? Doing multiple print followed by a println seems faster but moving everything to a string and just printing that seems fastest. Why?
EDIT: For example, Java can find all the prime numbers up to 1 million in less than a second - but printing then all out each on their own println can take minutes! Up to a 10 billion can hours to print!
EX:
package sieveoferatosthenes;
public class Main {
public static void main(String[] args) {
int upTo = 10000000;
boolean primes[] = new boolean[upTo];
for( int b = 0; b < upTo; b++ ){
primes[b] = true;
}
primes[0] = false;
primes[1] = false;
int testing = 1;
while( testing <= Math.sqrt(upTo)){
testing ++;
int testingWith = testing;
if( primes[testing] ){
while( testingWith < upTo ){
testingWith = testingWith + testing;
if ( testingWith >= upTo){
}
else{
primes[testingWith] = false;
}
}
}
}
for( int b = 2; b < upTo; b++){
if( primes[b] ){
System.out.println( b );
}
}
}
}

println is not slow, it's the underlying PrintStream that is connected with the console, provided by the hosting operating system.
You can check it yourself: compare dumping a large text file to the console with piping the same textfile into another file:
cat largeTextFile.txt
cat largeTextFile.txt > temp.txt
Reading and writing are similiar and proportional to the size of the file (O(n)), the only difference is, that the destination is different (console compared to file). And that's basically the same with System.out.
The underlying OS operation (displaying chars on a console window) is slow because
The bytes have to be sent to the console application (should be quite fast)
Each char has to be rendered using (usually) a true type font (that's pretty slow, switching off anti aliasing could improve performance, btw)
The displayed area may have to be scrolled in order to append a new line to the visible area (best case: bit block transfer operation, worst case: re-rendering of the complete text area)

System.out is a static PrintStream class. PrintStream has, among other things, those methods you're probably quite familiar with, like print() and println() and such.
It's not unique to Java that input and output operations take a long time. "long." printing or writing to a PrintStream takes a fraction of a second, but over 10 billion instances of this print can add up to quite a lot!
This is why your "moving everything to a String" is the fastest. Your huge String is built, but you only print it once. Sure, it's a huge print, but you spend time on actually printing, not on the overhead associated with the print() or println().
As Dvd Prd has mentioned, Strings are immutable. That means whenever you assign a new String to an old one but reusing references, you actually destroy the reference to the old String and create a reference to the new one. So you can make this whole operation go even faster by using the StringBuilder class, which is mutable. This will decrease the overhead associated with building that string you'll eventually print.

I believe this is because of buffering. A quote from the article:
Another aspect of buffering concerns
text output to a terminal window. By
default, System.out (a PrintStream) is
line buffered, meaning that the output
buffer is flushed when a newline
character is encountered. This is
important for interactivity, where
you'd like to have an input prompt
displayed before actually entering any
input.
A quote explaining buffers from wikipedia:
In computer science, a buffer is a
region of memory used to temporarily
hold data while it is being moved from
one place to another. Typically, the
data is stored in a buffer as it is
retrieved from an input device (such
as a Mouse) or just before it is sent
to an output device (such as Speakers)
public void println()
Terminate the current line by writing
the line separator string. The line
separator string is defined by the
system property line.separator, and is
not necessarily a single newline
character ('\n').
So the buffer get's flushed when you do println which means new memory has to be allocated etc which makes printing slower. The other methods you specified require lesser flushing of buffers thus are faster.

Take a look at my System.out.println replacement.
By default, System.out.print() is only line-buffered and does a lot work related to Unicode handling. Because of its small buffer size, System.out.println() is not well suited to handle many repetitive outputs in a batch mode. Each line is flushed right away. If your output is mainly ASCII-based then by removing the Unicode-related activities, the overall execution time will be better.

If you're printing to the console window, not to a file, that will be the killer.
Every character has to be painted, and on every line the whole window has to be scrolled.
If the window is partly overlaid with other windows, it also has to do clipping.
That's going to take far more cycles than what your program is doing.
Usually that's not a bad price to pay, since console output is supposed to be for your reading pleasure :)

The problem you have is that displaying to the screen is very espensive, especially if you have a graphical windows/X-windows environment (rather than a pure text terminal) Just to render one digit in a font is far more expensive than the calculations you are doing. When you send data to the screen faster than it can display it, it buffered the data and quickly blocks. Even writing to a file is significant compare to the calculations, but its 10x - 100x faster than displaying on the screen.
BTW: math.sqrt() is very expensive, and using a loop is much slower than using modulus i.e. % to determine if a number is a multiple. BitSet can be 8x more space efficient than boolean[], and faster for operations on multiple bits e.g. counting or searching bits.
If I dump the output to a file, it is quick, but writing to the console is slow, and if I write to the console the data which was written to a file it takes about the same amount of time.
Took 289 ms to examine 10,000,000 numbers.
Took 149 ms to toString primes up to 10,000,000.
Took 306 ms to write to a file primes up to 10,000,000.
Took 61,082 ms to write to a System.out primes up to 10,000,000.
time cat primes.txt
real 1m24.916s
user 0m3.619s
sys 0m12.058s
The code
int upTo = 10*1000*1000;
long start = System.nanoTime();
BitSet nonprimes = new BitSet(upTo);
for (int t = 2; t * t < upTo; t++) {
if (nonprimes.get(t)) continue;
for (int i = 2 * t; i <= upTo; i += t)
nonprimes.set(i);
}
PrintWriter report = new PrintWriter("report.txt");
long time = System.nanoTime() - start;
report.printf("Took %,d ms to examine %,d numbers.%n", time / 1000 / 1000, upTo);
long start2 = System.nanoTime();
for (int i = 2; i < upTo; i++) {
if (!nonprimes.get(i))
Integer.toString(i);
}
long time2 = System.nanoTime() - start2;
report.printf("Took %,d ms to toString primes up to %,d.%n", time2 / 1000 / 1000, upTo);
long start3 = System.nanoTime();
PrintWriter pw = new PrintWriter(new BufferedOutputStream(new FileOutputStream("primes.txt"), 64*1024));
for (int i = 2; i < upTo; i++) {
if (!nonprimes.get(i))
pw.println(i);
}
pw.close();
long time3 = System.nanoTime() - start3;
report.printf("Took %,d ms to write to a file primes up to %,d.%n", time3 / 1000 / 1000, upTo);
long start4 = System.nanoTime();
for (int i = 2; i < upTo; i++) {
if (!nonprimes.get(i))
System.out.println(i);
}
long time4 = System.nanoTime() - start4;
report.printf("Took %,d ms to write to a System.out primes up to %,d.%n", time4 / 1000 / 1000, upTo);
report.close();

Most of the answers here are right, but they don't cover the most important point: system calls. This is the operation that induces the more overhead.
When your software needs to access some hardware resource (your screen for example), it needs to ask the OS (or hypervisor) if it can access the hardware. This costs a lot:
Here are interesting blogs about syscalls, the last one being dedicated to syscall and Java
http://arkanis.de/weblog/2017-01-05-measurements-of-system-call-performance-and-overhead
http://www.brendangregg.com/blog/2014-05-11/strace-wow-much-syscall.html
https://blog.packagecloud.io/eng/2017/03/14/using-strace-to-understand-java-performance-improvement/

Related

Java wordcount: a mediocre implementation

I implemented a wordcount program with Java. Basically, the program takes a large file (in my tests, I used a 10 gb data file that contained numbers only), and counts the number of times each 'word' appears - in this case, a number (23723 for example might appear 243 times in the file).
Below is my implementation. I seek to improve it, with mainly performance in mind, but a few other things as well, and I am looking for some guidance. Here are a few of the issues I wish to correct:
Currently, the program is threaded and works properly. However, what I do is pass a chunk of memory (500MB/NUM_THREADS) to each thread, and each thread proceeds to wordcount. The problem here is that I have the main thread wait for ALL the threads to complete before passing more data to each thread. It isn't too much of a problem, but there is a period of time where a few threads will wait and do nothing for a while. I believe some sort of worker pool or executor service could solve this problem (I have not learned the syntax for this yet).
The program will only work for a file that contains integers. That's a problem. I struggled with this a lot, as I didn't know how to iterate through the data without creating loads of unused variables (using a String or even StringBuilder had awful performance). Currently, I use the fact that I know the input is an integer, and just store the temporary variables as an int, so no memory problems there. I want to be able to use some sort of delimiter, whether that delimiter be a space, or several characters.
I am using a global ConcurrentHashMap to story key value pairs. For example, if a thread finds a number "24624", it searches for that number in the map. If it exists, it will increase the value of that key by one. The value of the keys at the end represent the number of occurrences of that key. So is this the proper design? Would I gain in performance by giving each thread it's own hashmap, and then merging them all at the end?
Is there any other way of seeking through a file with an offset without using the class RandomAccessMemory? This class will only read into a byte array, which I then have to convert. I haven't timed this conversion, but maybe it could be faster to use something else.
I am open to other possibilities as well, this is just what comes to mind.
Note: Splitting the file is not an option I want to explore, as I might be deploying this on a server in which I should not be creating my own files, but if it would really be a performance boost, I might listen.
Other Note: I am new to java threading, as well as new to StackOverflow. Be gentle.
public class BigCount2 {
public static void main(String[] args) throws IOException, InterruptedException {
int num, counter;
long i, j;
String delimiterString = " ";
ArrayList<Character> delim = new ArrayList<Character>();
for (char c : delimiterString.toCharArray()) {
delim.add(c);
}
int counter2 = 0;
num = Integer.parseInt(args[0]);
int bytesToRead = 1024 * 1024 * 1024 / 2; //500 MB, size of loop
int remainder = bytesToRead % num;
int k = 0;
bytesToRead = bytesToRead - remainder;
int byr = bytesToRead / num;
String filepath = "C:/Users/Daniel/Desktop/int-dataset-10g.dat";
RandomAccessFile file = new RandomAccessFile(filepath, "r");
Thread[] t = new Thread [num];//array of threads
ConcurrentMap<Integer, Integer> wordCountMap = new ConcurrentHashMap<Integer, Integer>(25000);
byte [] byteArray = new byte [byr]; //allocates 500mb to a 2D byte array
char[] newbyte;
for (i = 0; i < file.length(); i += bytesToRead) {
counter = 0;
for (j = 0; j < bytesToRead; j += byr) {
file.seek(i + j);
file.read(byteArray, 0, byr);
newbyte = new String(byteArray).toCharArray();
t[counter] = new Thread(
new BigCountThread2(counter,
newbyte,
delim,
wordCountMap));//giving each thread t[i] different file fileReader[i]
t[counter].start();
counter++;
newbyte = null;
}
for (k = 0; k < num; k++){
t[k].join(); //main thread continues after ALL threads have finished.
}
counter2++;
System.gc();
}
file.close();
System.exit(0);
}
}
class BigCountThread2 implements Runnable {
private final ConcurrentMap<Integer, Integer> wordCountMap;
char [] newbyte;
private ArrayList<Character> delim;
private int threadId; //use for later
BigCountThread2(int tid,
char[] newbyte,
ArrayList<Character> delim,
ConcurrentMap<Integer, Integer> wordCountMap) {
this.delim = delim;
threadId = tid;
this.wordCountMap = wordCountMap;
this.newbyte = newbyte;
}
public void run() {
int intCheck = 0;
int counter = 0; int i = 0; Integer check; int j =0; int temp = 0; int intbuilder = 0;
for (i = 0; i < newbyte.length; i++) {
intCheck = Character.getNumericValue(newbyte[i]);
if (newbyte[i] == ' ' || intCheck == -1) { //once a delimiter is found, the current tempArray needs to be added to the MAP
check = wordCountMap.putIfAbsent(intbuilder, 1);
if (check != null) { //if returns null, then it is the first instance
wordCountMap.put(intbuilder, wordCountMap.get(intbuilder) + 1);
}
intbuilder = 0;
}
else {
intbuilder = (intbuilder * 10) + intCheck;
counter++;
}
}
}
}

Some thoughts on a little of most ..
.. I believe some sort of worker pool or executor service could solve this problem (I have not learned the syntax for this yet).
If all the threads take about the same time to process the same amount of data, then there really isn't that much of a "problem" here.
However, one nice thing about a Thread Pool is it allows one to rather trivially adjust some basic parameters such as number of concurrent workers. Furthermore, using an executor service and Futures can provide an additional level of abstraction; in this case it could be especially handy if each thread returned a map as the result.
The program will only work for a file that contains integers. That's a problem. I struggled with this a lot, as I didn't know how to iterate through the data without creating loads of unused variables (using a String or even StringBuilder had awful performance) ..
This sounds like an implementation issue. While I would first try a StreamTokenizer (because it's already written), if doing it manually, I would check out the source - a good bit of that can be omitted when simplifying the notion of a "token". (It uses a temporary array to build the token.)
I am using a global ConcurrentHashMap to story key value pairs. .. So is this the proper design? Would I gain in performance by giving each thread it's own hashmap, and then merging them all at the end?
It would reduce locking and may increase performance to use a separate map per thread and merge strategy. Furthermore, the current implementation is broken as wordCountMap.put(intbuilder, wordCountMap.get(intbuilder) + 1) is not atomic and thus the operation might under count. I would use a separate map simply because reducing mutable shared state makes a threaded program much easier to reason about.
Is there any other way of seeking through a file with an offset without using the class RandomAccessMemory? This class will only read into a byte array, which I then have to convert. I haven't timed this conversion, but maybe it could be faster to use something else.
Consider using a FileReader (and BufferedReader) per thread on the same file. This will avoid having to first copy the file into the array and slice it out for individual threads which, while the same amount of total reading, avoids having to soak up so much memory. The reading done is actually not random access, but merely sequential (with a "skip") starting from different offsets - each thread still works on a mutually exclusive range.
Also, the original code with the slicing is broken if an integer value was "cut" in half as each of the threads would read half the word. One work-about is have each thread skip the first word if it was a continuation from the previous block (i.e. scan one byte sooner) and then read-past the end of it's range as required to complete the last word.

Java: why is computing faster than assigning value (int)?

The 2 following versions of the same function (which basically tries to recover a password by brute force) do not give same performance:
Version 1:
private static final char[] CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();
private static final int N_CHARS = CHARS.length;
private static final int MAX_LENGTH = 8;
private static char[] recoverPassword()
{
char word[];
int refi, i, indexes[];
for (int length = 1; length <= MAX_LENGTH; length++)
{
refi = length - 1;
word = new char[length];
indexes = new int[length];
indexes[length - 1] = 1;
while(true)
{
i = length - 1;
while ((++indexes[i]) == N_CHARS)
{
word[i] = CHARS[indexes[i] = 0];
if (--i < 0)
break;
}
if (i < 0)
break;
word[i] = CHARS[indexes[i]];
if (isValid(word))
return word;
}
}
return null;
}
Version 2:
private static final char[] CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();
private static final int N_CHARS = CHARS.length;
private static final int MAX_LENGTH = 8;
private static char[] recoverPassword()
{
char word[];
int refi, i, indexes[];
for (int length = 1; length <= MAX_LENGTH; length++)
{
refi = length - 1;
word = new char[length];
indexes = new int[length];
indexes[length - 1] = 1;
while(true)
{
i = refi;
while ((++indexes[i]) == N_CHARS)
{
word[i] = CHARS[indexes[i] = 0];
if (--i < 0)
break;
}
if (i < 0)
break;
word[i] = CHARS[indexes[i]];
if (isValid(word))
return word;
}
}
return null;
}
I would expect version 2 to be faster, as it does (and that is the only difference):
i = refi;
...as compare to version 1:
i = length -1;
However, it's the opposite: version 1 is faster by over 3%!
Does someone knows why? Is that due to some optimization done by the compiler?
Thank you all for your answers that far.
Just to add that the goal is actually not to optimize this piece of code (which is already quite optimized), but more to understand, from a compiler / CPU / architecture perspective, what could explain such performance difference.
Your answers have been very helpful, thanks again!
Key

It is difficult to check this in a micro-benchmark because you cannot say for sure how the code has been optimised without reading the machine code generated, even then the CPU can do plenty of tricks to optimise it future eg. it turns the x86 code in RISC style instructions to actually execute.
A computation takes as little as one cycle and the CPU can perform up to three of them at once. An access to L1 cache takes 4 cycles and for L2, L3, main memory it takes 11, 40-75, 200 cycles.
Storing values to avoid a simple calculation is actually slower in many cases. BTW using division and modulus is quite expensive and caching this value can be worth it when micro-tuning your code.

The correct answer should be retrievable by a deassembler (i mean .class -> .java converter),
but my guess is that the compiler might have decided to get rid of iref altogether and decided to store length - 1 an auxiliary register.
I'm more of a c++ guy, but I would start by trying:
const int refi = length - 1;
inside the for loop. Also you should probably use
indexes[ refi ] = 1;

Comparing running times of codes does not give exact or quarantine results
First of all, it is not the way comparing performances like this. A running time analysis is needed here. Both 2 codes have same loop structure and their running time are the same. You may have different running times when you run codes. However, they mostly differ with cache hits, I/O times, thread & process schedules. There is no quarantine that code is always completed in a exact time.
However, there is still differences in your code, to understand the difference you should look into your CPU architecture. I can explain it according to x86 architecture basically.
What happens behind the scenes?
i = refi;
CPU takes refi and i to its registers from ram. there is 2 access to ram if the values in not in the cache. and value of i will be written to the ram. However, it always takes different times according to thread & process schedules. Furrhermore, if the values are in virtual memory it wil take longer time.
i = length -1;
CPU also access i and length from ram or cache. there is same number of accesses. In addition, there is a subtraction here which means extra CPU cycles. That is why you think this one take longer time to complete. It is expected, but the issues that i mentioned above explain why this take longer time.
Summation
As i explain this is not the way of comparing performance. I think, there is no real difference between these codes. There are lots of optimizations inside CPU and also in compiler. You can see optimized codes if you decompile .class files.
My advice is it is better to minimize BigO running time analysis. If you find better algorithms it is the best way of optimizing codes. In case you still have bottlenecks in your code, you may try micro-benchmarking.
See also
Analysis of algorithms
Big O notation
Microprocessor
Compiler optimization
CPU Scheduling

To start with, you can't really compare the performance by just running your program - micro benchmarking in Java is complicated.
Also, a subtraction on modern CPUs can take as little as a third of a clock cycle on average. On a 3GHz CPU, that is 0.1 nanoseconds. And nothing tells you that the subtraction actually happens as the compiler might have modified the code.
So:
You should try to check the generated assembly code.
If you really care about the performance, create an appropriate micro-benchmark.

Multiplication time in BigInteger

My mini benchmark:
import java.math.*;
import java.util.*;
import java.io.*;
public class c
{
static Random rnd = new Random();
public static String addDigits(String a, int n)
{
if(a==null) return null;
if(n<=0) return a;
for(int i=0; i<n; i++)
a+=rnd.nextInt(10);
return a;
}
public static void main(String[] args) throws IOException
{
int n = 10000; \\number of iterations
int k = 10; \\number of digits added at each iteration
BigInteger a;
BigInteger b;
String as = "";
String bs = "";
as += rnd.nextInt(9)+1;
bs += rnd.nextInt(9)+1;
a = new BigInteger(as);
b = new BigInteger(bs);
FileWriter fw = new FileWriter("c.txt");
long t1 = System.nanoTime();
a.multiply(b);
long t2 = System.nanoTime();
//fw.write("1,"+(t2-t1)+"\n");
if(k>0) {
as = addDigits(as, k-1);
bs = addDigits(as, k-1);
}
for(int i=0; i<n; i++)
{
a = new BigInteger(as);
b = new BigInteger(bs);
t1 = System.nanoTime();
a.multiply(b);
t2 = System.nanoTime();
fw.write(((i+1)*k)+","+(t2-t1)+"\n");
if(i < n-1)
{
as = addDigits(as, k);
bs = addDigits(as, k);
}
System.out.println((i+1)*k);
}
fw.close();
}
}
It measures multiplication time of n-digit BigInteger
Result:
You can easily see the trend but why there is so big noise above 50000 digits?
It is because of garbage collector or is there something else that affects my results?
When performing the test, there were no other applications running.
Result from test with only odd digits. The test was shorter (n=1000, k=100)
Odd digits (n=10000, k=10)
As you can see there is a huge noise between 65000 and 70000. I wonder why...
Odd digits (n=10000, k=10), System.gc() every 1000 iterations
Results in noise between 50000-70000

I also suspect this is a JVM warmup effect. Not warmup involving classloading or the JIT compiler, but warmup of the heap.
Put a (java) loop around the whole benchmark, and run it a number of times. (If this gives you the same graphs as before ... you will have evidence that this is not a warmup effect. Currently you don't have any empirical evidence one way or the other.)
Another possibility is that the noise is caused by your benchmark's interactions with the OS and/or other stuff running on the machine.
You are writing your timing data to an unbuffered stream. That means LOTS of syscalls, and (potentially) lots of fine-grained disc writes.
You are making LOTS of calls to nanoTime(), and that might introduce noise.
If something else is running on your machine (e.g. you are web browsing) that will slow down your benchmark for a bit and introduce noise.
There could be competition over physical memory ... if you've got too much running on your machine for the amount of RAM.
Finally, a certain amount of noise is inevitable, because each of those multiply calls generates garbage, and the garbage collector is going to need to work to deal with it.
Finally finally, if you manually run the garbage collector (or increase the heap size) to "smooth out" the data points, what you are actually doing is concealing one of the costs of multiply calls. The resulting graphs looks nice, but it is misleading:
The noisiness reflects what will happen in real life.
The true cost of the multiply actually includes the amortized cost of running the GC to deal with the garbage generated by the call.
To get a measurements that reflect the way that BigInteger behaves in real life, you need to run the test a large number of times, calculate average times and fit a curve to the average data-points.
Remember, the real aim of the game is to get scientifically valid results ... not a smooth curve.

If you do a microbenchmark, you must "warm up" the JVM first to let the JIT optimize the code, and then you can measure the performance. Otherwise you are measuring the work done by the JIT and that can change the result on each run.
The "noise" happens probably because the cache of the CPU is exceeded and the performance starts degrading.

Java Array Bulk Flush on Disk

I have two arrays (int and long) which contains millions of entries. Until now, I am doing it using DataOutputStream and using a long buffer thus disk I/O costs gets low (nio is also more or less same as I have huge buffer, so I/O access cost low) specifically, using
DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream("abc.txt"),1024*1024*100));
for(int i = 0 ; i < 220000000 ; i++){
long l = longarray[i];
dos.writeLong(l);
}
But it takes several seconds (more than 5 minutes) to do that. Actually, what I want to bulk flush (some sort of main memory to disk memory map). For that, I found a nice approach in here and here. However, can't understand how to use that in my javac. Can anybody help me about that or any other way to do that nicely ?

On my machine, 3.8 GHz i7 with an SSD
DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream("abc.txt"), 32 * 1024));
long start = System.nanoTime();
final int count = 220000000;
for (int i = 0; i < count; i++) {
long l = i;
dos.writeLong(l);
}
dos.close();
long time = System.nanoTime() - start;
System.out.printf("Took %.3f seconds to write %,d longs%n",
time / 1e9, count);
prints
Took 11.706 seconds to write 220,000,000 longs
Using memory mapped files
final int count = 220000000;
final FileChannel channel = new RandomAccessFile("abc.txt", "rw").getChannel();
MappedByteBuffer mbb = channel.map(FileChannel.MapMode.READ_WRITE, 0, count * 8);
mbb.order(ByteOrder.nativeOrder());
long start = System.nanoTime();
for (int i = 0; i < count; i++) {
long l = i;
mbb.putLong(l);
}
channel.close();
long time = System.nanoTime() - start;
System.out.printf("Took %.3f seconds to write %,d longs%n",
time / 1e9, count);
// Only works on Sun/HotSpot/OpenJDK to deallocate buffer.
((DirectBuffer) mbb).cleaner().clean();
final FileChannel channel2 = new RandomAccessFile("abc.txt", "r").getChannel();
MappedByteBuffer mbb2 = channel2.map(FileChannel.MapMode.READ_ONLY, 0, channel2.size());
mbb2.order(ByteOrder.nativeOrder());
assert mbb2.remaining() == count * 8;
long start2 = System.nanoTime();
for (int i = 0; i < count; i++) {
long l = mbb2.getLong();
if (i != l)
throw new AssertionError("Expected "+i+" but got "+l);
}
channel.close();
long time2 = System.nanoTime() - start2;
System.out.printf("Took %.3f seconds to read %,d longs%n",
time2 / 1e9, count);
// Only works on Sun/HotSpot/OpenJDK to deallocate buffer.
((DirectBuffer) mbb2).cleaner().clean();
prints on my 3.8 GHz i7.
Took 0.568 seconds to write 220,000,000 longs
on a slower machine prints
Took 1.180 seconds to write 220,000,000 longs
Took 0.990 seconds to read 220,000,000 longs
Is here any other way not to create that ? Because I have that array already on my main memory and I can't allocate more than 500 MB to do that?
This doesn't uses less than 1 KB of heap. If you look at how much memory is used before and after this call you will normally see no increase at all.
Another thing, is this gives efficient loading also means MappedByteBuffer?
In my experience, using a memory mapped file is by far the fastest because you reduce the number of system calls and copies into memory.
Because, in some article I found read(buffer) this gives better loading performance. (I check that one, really faster 220 million int array -float array read 5 seconds)
I would like to read that article because I have never seen that.
Another issue: readLong gives error while reading from your code output file
Part of the performance in provement is storing the values in native byte order. writeLong/readLong always uses big endian format which is much slower on Intel/AMD systems which are little endian format natively.
You can make the byte order big-endian which will slow it down or you can use native ordering (DataInput/OutputStream only supports big endian)

I am running it server with 16GB memory with 2.13 GhZ [CPU]
I doubt the problem has anything to do with your Java code.
Your file system appears to be extraordinarily slow (at least ten times slower than what one would expect from a local disk).
I would do two things:
Double check that you are actually writing to a local disk, and not to a network share. Bear in mind that in some environments home directories are NFS mounts.
Ask your sysadmins to take a look at the machine to find out why the disk is so slow. If I were in their shoes, I'd start by checking the logs and running some benchmarks (e.g. using Bonnie++).

Should I use Java's String.format() if performance is important?

We have to build Strings all the time for log output and so on. Over the JDK versions we have learned when to use StringBuffer (many appends, thread safe) and StringBuilder (many appends, non-thread-safe).
What's the advice on using String.format()? Is it efficient, or are we forced to stick with concatenation for one-liners where performance is important?
e.g. ugly old style,
String s = "What do you get if you multiply " + varSix + " by " + varNine + "?";
vs. tidy new style (String.format, which is possibly slower),
String s = String.format("What do you get if you multiply %d by %d?", varSix, varNine);
Note: my specific use case is the hundreds of 'one-liner' log strings throughout my code. They don't involve a loop, so StringBuilder is too heavyweight. I'm interested in String.format() specifically.

I took hhafez's code and added a memory test:
private static void test() {
Runtime runtime = Runtime.getRuntime();
long memory;
...
memory = runtime.freeMemory();
// for loop code
memory = memory-runtime.freeMemory();
I run this separately for each approach, the '+' operator, String.format and StringBuilder (calling toString()), so the memory used will not be affected by other approaches.
I added more concatenations, making the string as "Blah" + i + "Blah"+ i +"Blah" + i + "Blah".
The result are as follows (average of 5 runs each):
Approach
Time(ms)
Memory allocated (long)
+ operator
747
320,504
String.format
16484
373,312
StringBuilder
769
57,344
We can see that String + and StringBuilder are practically identical time-wise, but StringBuilder is much more efficient in memory use.
This is very important when we have many log calls (or any other statements involving strings) in a time interval short enough so the Garbage Collector won't get to clean the many string instances resulting of the + operator.
And a note, BTW, don't forget to check the logging level before constructing the message.
Conclusions:
I'll keep on using StringBuilder.
I have too much time or too little life.

I wrote a small class to test which has the better performance of the two and + comes ahead of format. by a factor of 5 to 6.
Try it your self
import java.io.*;
import java.util.Date;
public class StringTest{
public static void main( String[] args ){
int i = 0;
long prev_time = System.currentTimeMillis();
long time;
for( i = 0; i< 100000; i++){
String s = "Blah" + i + "Blah";
}
time = System.currentTimeMillis() - prev_time;
System.out.println("Time after for loop " + time);
prev_time = System.currentTimeMillis();
for( i = 0; i<100000; i++){
String s = String.format("Blah %d Blah", i);
}
time = System.currentTimeMillis() - prev_time;
System.out.println("Time after for loop " + time);
}
}
Running the above for different N shows that both behave linearly, but String.format is 5-30 times slower.
The reason is that in the current implementation String.format first parses the input with regular expressions and then fills in the parameters. Concatenation with plus, on the other hand, gets optimized by javac (not by the JIT) and uses StringBuilder.append directly.

All the benchmarks presented here have some flaws, thus results are not reliable.
I was surprised that nobody used JMH for benchmarking, so I did.
Results:
Benchmark Mode Cnt Score Error Units
MyBenchmark.testOld thrpt 20 9645.834 ± 238.165 ops/s // using +
MyBenchmark.testNew thrpt 20 429.898 ± 10.551 ops/s // using String.format
Units are operations per second, the more the better. Benchmark source code. OpenJDK IcedTea 2.5.4 Java Virtual Machine was used.
So, old style (using +) is much faster.

Your old ugly style is automatically compiled by JAVAC 1.6 as :
StringBuilder sb = new StringBuilder("What do you get if you multiply ");
sb.append(varSix);
sb.append(" by ");
sb.append(varNine);
sb.append("?");
String s = sb.toString();
So there is absolutely no difference between this and using a StringBuilder.
String.format is a lot more heavyweight since it creates a new Formatter, parses your input format string, creates a StringBuilder, append everything to it and calls toString().

Java's String.format works like so:
it parses the format string, exploding into a list of format chunks
it iterates the format chunks, rendering into a StringBuilder, which is basically an array that resizes itself as necessary, by copying into a new array. this is necessary because we don't yet know how large to allocate the final String
StringBuilder.toString() copies his internal buffer into a new String
if the final destination for this data is a stream (e.g. rendering a webpage or writing to a file), you can assemble the format chunks directly into your stream:
new PrintStream(outputStream, autoFlush, encoding).format("hello {0}", "world");
I speculate that the optimizer will optimize away the format string processing. If so, you're left with equivalent amortized performance to manually unrolling your String.format into a StringBuilder.

To expand/correct on the first answer above, it's not translation that String.format would help with, actually.
What String.format will help with is when you're printing a date/time (or a numeric format, etc), where there are localization(l10n) differences (ie, some countries will print 04Feb2009 and others will print Feb042009).
With translation, you're just talking about moving any externalizable strings (like error messages and what-not) into a property bundle so that you can use the right bundle for the right language, using ResourceBundle and MessageFormat.
Looking at all the above, I'd say that performance-wise, String.format vs. plain concatenation comes down to what you prefer. If you prefer looking at calls to .format over concatenation, then by all means, go with that.
After all, code is read a lot more than it's written.

In your example, performance probalby isn't too different but there are other issues to consider: namely memory fragmentation. Even concatenate operation is creating a new string, even if its temporary (it takes time to GC it and it's more work). String.format() is just more readable and it involves less fragmentation.
Also, if you're using a particular format a lot, don't forget you can use the Formatter() class directly (all String.format() does is instantiate a one use Formatter instance).
Also, something else you should be aware of: be careful of using substring(). For example:
String getSmallString() {
String largeString = // load from file; say 2M in size
return largeString.substring(100, 300);
}
That large string is still in memory because that's just how Java substrings work. A better version is:
return new String(largeString.substring(100, 300));
or
return String.format("%s", largeString.substring(100, 300));
The second form is probably more useful if you're doing other stuff at the same time.

Generally you should use String.Format because it's relatively fast and it supports globalization (assuming you're actually trying to write something that is read by the user). It also makes it easier to globalize if you're trying to translate one string versus 3 or more per statement (especially for languages that have drastically different grammatical structures).
Now if you never plan on translating anything, then either rely on Java's built in conversion of + operators into StringBuilder. Or use Java's StringBuilder explicitly.

Another perspective from Logging point of view Only.
I see a lot of discussion related to logging on this thread so thought of adding my experience in answer. May be someone will find it useful.
I guess the motivation of logging using formatter comes from avoiding the string concatenation. Basically, you do not want to have an overhead of string concat if you are not going to log it.
You do not really need to concat/format unless you want to log. Lets say if I define a method like this
public void logDebug(String... args, Throwable t) {
if(debugOn) {
// call concat methods for all args
//log the final debug message
}
}
In this approach the cancat/formatter is not really called at all if its a debug message and debugOn = false
Though it will still be better to use StringBuilder instead of formatter here. The main motivation is to avoid any of that.
At the same time I do not like adding "if" block for each logging statement since
It affects readability
Reduces coverage on my unit tests - thats confusing when you want to make sure every line is tested.
Therefore I prefer to create a logging utility class with methods like above and use it everywhere without worrying about performance hit and any other issues related to it.

I just modified hhafez's test to include StringBuilder. StringBuilder is 33 times faster than String.format using jdk 1.6.0_10 client on XP. Using the -server switch lowers the factor to 20.
public class StringTest {
public static void main( String[] args ) {
test();
test();
}
private static void test() {
int i = 0;
long prev_time = System.currentTimeMillis();
long time;
for ( i = 0; i < 1000000; i++ ) {
String s = "Blah" + i + "Blah";
}
time = System.currentTimeMillis() - prev_time;
System.out.println("Time after for loop " + time);
prev_time = System.currentTimeMillis();
for ( i = 0; i < 1000000; i++ ) {
String s = String.format("Blah %d Blah", i);
}
time = System.currentTimeMillis() - prev_time;
System.out.println("Time after for loop " + time);
prev_time = System.currentTimeMillis();
for ( i = 0; i < 1000000; i++ ) {
new StringBuilder("Blah").append(i).append("Blah");
}
time = System.currentTimeMillis() - prev_time;
System.out.println("Time after for loop " + time);
}
}
While this might sound drastic, I consider it to be relevant only in rare cases, because the absolute numbers are pretty low: 4 s for 1 million simple String.format calls is sort of ok - as long as I use them for logging or the like.
Update: As pointed out by sjbotha in the comments, the StringBuilder test is invalid, since it is missing a final .toString().
The correct speed-up factor from String.format(.) to StringBuilder is 23 on my machine (16 with the -server switch).

Here is modified version of hhafez entry. It includes a string builder option.
public class BLA
{
public static final String BLAH = "Blah ";
public static final String BLAH2 = " Blah";
public static final String BLAH3 = "Blah %d Blah";
public static void main(String[] args) {
int i = 0;
long prev_time = System.currentTimeMillis();
long time;
int numLoops = 1000000;
for( i = 0; i< numLoops; i++){
String s = BLAH + i + BLAH2;
}
time = System.currentTimeMillis() - prev_time;
System.out.println("Time after for loop " + time);
prev_time = System.currentTimeMillis();
for( i = 0; i<numLoops; i++){
String s = String.format(BLAH3, i);
}
time = System.currentTimeMillis() - prev_time;
System.out.println("Time after for loop " + time);
prev_time = System.currentTimeMillis();
for( i = 0; i<numLoops; i++){
StringBuilder sb = new StringBuilder();
sb.append(BLAH);
sb.append(i);
sb.append(BLAH2);
String s = sb.toString();
}
time = System.currentTimeMillis() - prev_time;
System.out.println("Time after for loop " + time);
}
}
Time after for loop 391
Time after for loop 4163
Time after for loop 227

The answer to this depends very much on how your specific Java compiler optimizes the bytecode it generates. Strings are immutable and, theoretically, each "+" operation can create a new one. But, your compiler almost certainly optimizes away interim steps in building long strings. It's entirely possible that both lines of code above generate the exact same bytecode.
The only real way to know is to test the code iteratively in your current environment. Write a QD app that concatenates strings both ways iteratively and see how they time out against each other.

Consider using "hello".concat( "world!" ) for small number of strings in concatenation. It could be even better for performance than other approaches.
If you have more than 3 strings, than consider using StringBuilder, or just String, depending on compiler that you use.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.