I have two arrays (int and long) which contains millions of entries. Until now, I am doing it using DataOutputStream and using a long buffer thus disk I/O costs gets low (nio is also more or less same as I have huge buffer, so I/O access cost low) specifically, using
DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream("abc.txt"),1024*1024*100));
for(int i = 0 ; i < 220000000 ; i++){
long l = longarray[i];
dos.writeLong(l);
}
But it takes several seconds (more than 5 minutes) to do that. Actually, what I want to bulk flush (some sort of main memory to disk memory map). For that, I found a nice approach in here and here. However, can't understand how to use that in my javac. Can anybody help me about that or any other way to do that nicely ?
On my machine, 3.8 GHz i7 with an SSD
DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream("abc.txt"), 32 * 1024));
long start = System.nanoTime();
final int count = 220000000;
for (int i = 0; i < count; i++) {
long l = i;
dos.writeLong(l);
}
dos.close();
long time = System.nanoTime() - start;
System.out.printf("Took %.3f seconds to write %,d longs%n",
time / 1e9, count);
prints
Took 11.706 seconds to write 220,000,000 longs
Using memory mapped files
final int count = 220000000;
final FileChannel channel = new RandomAccessFile("abc.txt", "rw").getChannel();
MappedByteBuffer mbb = channel.map(FileChannel.MapMode.READ_WRITE, 0, count * 8);
mbb.order(ByteOrder.nativeOrder());
long start = System.nanoTime();
for (int i = 0; i < count; i++) {
long l = i;
mbb.putLong(l);
}
channel.close();
long time = System.nanoTime() - start;
System.out.printf("Took %.3f seconds to write %,d longs%n",
time / 1e9, count);
// Only works on Sun/HotSpot/OpenJDK to deallocate buffer.
((DirectBuffer) mbb).cleaner().clean();
final FileChannel channel2 = new RandomAccessFile("abc.txt", "r").getChannel();
MappedByteBuffer mbb2 = channel2.map(FileChannel.MapMode.READ_ONLY, 0, channel2.size());
mbb2.order(ByteOrder.nativeOrder());
assert mbb2.remaining() == count * 8;
long start2 = System.nanoTime();
for (int i = 0; i < count; i++) {
long l = mbb2.getLong();
if (i != l)
throw new AssertionError("Expected "+i+" but got "+l);
}
channel.close();
long time2 = System.nanoTime() - start2;
System.out.printf("Took %.3f seconds to read %,d longs%n",
time2 / 1e9, count);
// Only works on Sun/HotSpot/OpenJDK to deallocate buffer.
((DirectBuffer) mbb2).cleaner().clean();
prints on my 3.8 GHz i7.
Took 0.568 seconds to write 220,000,000 longs
on a slower machine prints
Took 1.180 seconds to write 220,000,000 longs
Took 0.990 seconds to read 220,000,000 longs
Is here any other way not to create that ? Because I have that array already on my main memory and I can't allocate more than 500 MB to do that?
This doesn't uses less than 1 KB of heap. If you look at how much memory is used before and after this call you will normally see no increase at all.
Another thing, is this gives efficient loading also means MappedByteBuffer?
In my experience, using a memory mapped file is by far the fastest because you reduce the number of system calls and copies into memory.
Because, in some article I found read(buffer) this gives better loading performance. (I check that one, really faster 220 million int array -float array read 5 seconds)
I would like to read that article because I have never seen that.
Another issue: readLong gives error while reading from your code output file
Part of the performance in provement is storing the values in native byte order. writeLong/readLong always uses big endian format which is much slower on Intel/AMD systems which are little endian format natively.
You can make the byte order big-endian which will slow it down or you can use native ordering (DataInput/OutputStream only supports big endian)
I am running it server with 16GB memory with 2.13 GhZ [CPU]
I doubt the problem has anything to do with your Java code.
Your file system appears to be extraordinarily slow (at least ten times slower than what one would expect from a local disk).
I would do two things:
Double check that you are actually writing to a local disk, and not to a network share. Bear in mind that in some environments home directories are NFS mounts.
Ask your sysadmins to take a look at the machine to find out why the disk is so slow. If I were in their shoes, I'd start by checking the logs and running some benchmarks (e.g. using Bonnie++).
Related
I have a client-server application where the server sends some binary data to the client and the client has to deserialize objects from that byte stream according to a custom binary format. The data is sent via an HTTPS connection and the client uses HttpsURLConnection.getInputStream() to read it.
I implemented a DataDeserializer that takes an InputStream and deserializes it completely. It works in a way that it performs multiple inputStream.read(buffer) calls with small buffers (usually less than 100 bytes). On my way of achieving better overall performance I also tried different implementations here. One change did improve this class' performance significantly (I'm using a ByteBuffer now to read primitive types rather than doing it manually with byte shifting), but in combination with the network stream no differences show up. See the section below for more details.
Quick summary of my issue
Deserializing from the network stream takes way too long even though I proved that the network and the deserializer themselves are fast. Are there any common performance tricks that I could try? I am already wrapping the network stream with a BufferedInputStream. Also, I tried double buffering with some success (see code below). Any solution to achieve better performance is welcome.
The performance test scenario
In my test scenario server and client are located on the same machine and the server sends ~174 MB of data. The code snippets can be found at the end of this post. All numbers you see here are averages of 5 test runs.
First I wanted to know, how fast that InputStream of the HttpsURLConnection can be read. Wrapped into a BufferedInputStream, it took 26.250s to write the entire data into a ByteArrayOutputStream.1
Then I tested the performance of my deserializer passing it all that 174 MB as a ByteArrayInputStream. Before I improved the deserializer's implementation, it took 38.151s. After the improvement it took only 23.466s.2
So this is going to be it, I thought... but no.
What I actually want to do, somehow, is passing connection.getInputStream() to the deserializer. And here comes the strange thing: Before the deserializer improvement deserializing took 61.413s and after improving it was 60.100s!3
How can that happen? Almost no improvement here despite the deserializer improved significantly. Also, unrelated to that improvement, I was surprised that this takes longer than the separate performances summed up (60.100 > 26.250 + 23.466). Why? Don't get me wrong, I didn't expect this to be the best solution, but I didn't expect it to be that bad either.
So, three things to notice:
The overall speed is bound by the network which takes at least 26.250s. Maybe there are some http-settings that I could tweak or I could further optimize the server, but for now this is likely not what I should focus on.
My deserializer implementation is very likely still not perfect, but on its own it is faster than the network, so I don't think there is need to further improve it.
Based on 1. and 2. I'm assuming that it should be somehow possible to do the entire job in a combined way (reading from the network + deserializing) which should take not much more than 26.250s. Any suggestions on how to achieve this are welcome.
I was looking for some kind of double buffer allowing two threads to read from it and write to it in parallel.
Is there something like that in standard Java? Preferably some class inheriting from InputStream that allows to write to it in parallel? If there is something similar, but not inheriting from InputStream, I may be able to change my DataDeserializer to consume from that one as well.
As I haven't found any such DoubleBufferInputStream, I implemented it myself.
The code is quite long and likely not perfect and I don't want to bother you to do a code review for me. It has two 16kB buffers. Using it I was able to improve the overall performance to 39.885s.4
That is much better than 60.100s but still much worse than 26.250s. Choosing different buffer sizes didn't change much. So, I hope someone can lead me to some good double buffer implementation.
The test code
1 (26.250s)
InputStream inputStream = new BufferedInputStream(connection.getInputStream());
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
byte[] buffer = new byte[16 * 1024];
int count = 0;
long start = System.nanoTime();
while ((count = inputStream.read(buffer)) >= 0) {
outputStream .write(buffer, 0, count);
}
long end = System.nanoTime();
2 (23.466s)
InputStream inputStream = new ByteArrayInputStream(entire174MBbuffer);
DataDeserializer deserializer = new DataDeserializer(inputStream);
long start = System.nanoTime();
deserializer.Deserialize();
long end = System.nanoTime();
3 (60.100s)
InputStream inputStream = new BufferedInputStream(connection.getInputStream());
DataDeserializer deserializer = new DataDeserializer(inputStream);
long start = System.nanoTime();
deserializer.Deserialize();
long end = System.nanoTime();
4 (39.885s)
MyDoubleBufferInputStream doubleBufferInputStream = new MyDoubleBufferInputStream();
new Thread(new Runnable() {
#Override
public void run() {
try (InputStream inputStream = new BufferedInputStream(connection.getInputStream())) {
byte[] buffer = new byte[16 * 1024];
int count = 0;
while ((count = inputStream.read(buffer)) >= 0) {
doubleBufferInputStream.write(buffer, 0, count);
}
} catch (IOException e) {
} finally {
doubleBufferInputStream.closeWriting(); // read() may return -1 now
}
}
}).start();
DataDeserializer deserializer = new DataDeserializer(doubleBufferInputStream);
long start = System.nanoTime();
deserializer.deserialize();
long end = System.nanoTime();
Update
As requested, here is the core of my deserializer. I think the most important method is prepareForRead() which performs the actual reading of the stream.
class DataDeserializer {
private InputStream _stream;
private ByteBuffer _buffer;
public DataDeserializer(InputStream stream) {
_stream = stream;
_buffer = ByteBuffer.allocate(256 * 1024);
_buffer.order(ByteOrder.LITTLE_ENDIAN);
_buffer.flip();
}
private int readInt() throws IOException {
prepareForRead(4);
return _buffer.getInt();
}
private long readLong() throws IOException {
prepareForRead(8);
return _buffer.getLong();
}
private CustomObject readCustomObject() throws IOException {
prepareForRead(/*size of CustomObject*/);
int customMember1 = _buffer.getInt();
long customMember2 = _buffer.getLong();
// ...
return new CustomObject(customMember1, customMember2, ...);
}
// several other built-in and custom object read methods
private void prepareForRead(int count) throws IOException {
while (_buffer.remaining() < count) {
if (_buffer.capacity() - _buffer.limit() < count) {
_buffer.compact();
_buffer.flip();
}
int read = _stream.read(_buffer.array(), _buffer.limit(), _buffer.capacity() - _buffer.limit());
if (read < 0)
throw new EOFException("Unexpected end of stream.");
_buffer.limit(_buffer.limit() + read);
}
}
public HugeCustomObject Deserialize() throws IOException {
while (...) {
// call several of the above methods
}
return new HugeCustomObject(/* deserialized members */);
}
}
Update 2
I modified my code snippet #1 a little bit to see more precisely where time is being spent:
InputStream inputStream = new BufferedInputStream(connection.getInputStream());
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
byte[] buffer = new byte[16 * 1024];
long read = 0;
long write = 0;
while (true) {
long t1 = System.nanoTime();
int count = istream.read(buffer);
long t2 = System.nanoTime();
read += t2 - t1;
if (count < 0)
break;
t1 = System.nanoTime();
ostream.write(buffer, 0, count);
t2 = System.nanoTime();
write += t2 - t1;
}
System.out.println(read + " " + write);
This tells me that reading from the network stream takes 25.756s while writing to the ByteArrayOutputStream only takes 0.817s. This makes sense as these two numbers almost perfectly sum up to the previously measured 26.250s (plus some additional measuring overhead).
In the very same way I modified code snippet #4:
MyDoubleBufferInputStream doubleBufferInputStream = new MyDoubleBufferInputStream();
new Thread(new Runnable() {
#Override
public void run() {
try (InputStream inputStream = new BufferedInputStream(httpChannelOutputStream.getConnection().getInputStream(), 256 * 1024)) {
byte[] buffer = new byte[16 * 1024];
long read = 0;
long write = 0;
while (true) {
long t1 = System.nanoTime();
int count = inputStream.read(buffer);
long t2 = System.nanoTime();
read += t2 - t1;
if (count < 0)
break;
t1 = System.nanoTime();
doubleBufferInputStream.write(buffer, 0, count);
t2 = System.nanoTime();
write += t2 - t1;
}
System.out.println(read + " " + write);
} catch (IOException e) {
} finally {
doubleBufferInputStream.closeWriting();
}
}
}).start();
DataDeserializer deserializer = new DataDeserializer(doubleBufferInputStream);
deserializer.deserialize();
Now I would expect that the measured reading time is exactly the same as in the previous example. But instead, the read variable holds a value of 39.294s (How is that possible?? It's the exact same code being measured as in the previous example with 25.756s!)* while writing to my double buffer only takes 0.096s. Again, these numbers almost perfectly sum up to the measured time of code snippet #4.
Additionally, I profiled this very same code using Java VisualVM. That tells me that 40s were spent in this thread's run() method and 100% of these 40s are CPU time. On the other hand, it also spends 40s inside of the deserializer, but here only 26s are CPU time and 14s are spent waiting. This perfectly matches the time of reading from network into ByteBufferOutputStream. So I guess I have to improve my double buffer's "buffer-switching-algorithm".
*) Is there any explanation for this strange observation? I could only imagine that this way of measuring is very inaccurate. However, the read- and write-times of the latest measurements perfectly sum up to the original measurement, so it cannot be that inaccurate... Could someone please shed some light on this?
I was not able to find these read and write performances in the profiler... I will try to find some settings that allow me to observe the profiling results for these two methods.
Apparently, my "mistake" was to use a 32-bit JVM (jre1.8.0_172 being precise).
Running the very same code snippets on a 64-bit version JVM, and tadaaa... it is fast and makes all sense there.
In particular see these new numbers for the corresponding code snippets:
snippet #1: 4.667s (vs. 26.250s)
snippet #2: 11.568s (vs. 23.466s)
snippet #3: 17.185s (vs. 60.100s)
snippet #4: 12.336s (vs. 39.885s)
So apparently, the answers given to Does Java 64 bit perform better than the 32-bit version? are simply not true anymore. Or, there is a serious bug in this particular 32-bit JRE version. I didn't test any others yet.
As you can see, #4 is only slightly slower than #2 which perfectly matches my original assumption that
Based on 1. and 2. I'm assuming that it should be somehow possible to
do the entire job in a combined way (reading from the network +
deserializing) which should take not much more than 26.250s.
Also the very weird results of my profiling approach described in Update 2 of my question do not occur anymore. I didn't repeat every single test in 64 bit yet, but all profiling results that I did do are plausible now, i.e. the same code takes the same time no matter in which code snippet. So maybe it's really a bug, or does anybody have a reasonable explanation?
The most certain way to improve any of these is to change
connection.getInputStream()
to
new BufferedInputStream(connection.getInputStream())
If that doesn't help, the input stream isn't your problem.
To try MappedByteBuffer (memory mapped file in Java), I wrote a simple wc -l (text file line count) demo:
int wordCount(String fileName) throws IOException {
FileChannel fc = new RandomAccessFile(new File(fileName), "r").getChannel();
MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
int nlines = 0;
byte newline = '\n';
for(long i = 0; i < fc.size(); i++) {
if(mem.get() == newline)
nlines += 1;
}
return nlines;
}
I tried this on a file of about 15 MB (15008641 bytes), and 100k lines. On my laptop, it takes about 13.8 sec. Why is it so slow?
Complete class code is here: http://pastebin.com/t8PLRGMa
For the reference, I wrote the same idea in C: http://pastebin.com/hXnDvZm6
It runs in about 28 ms, or 490 times faster.
Out of curiosity, I also wrote a Scala version using essentially the same algorithm and APIs as in Java. It runs 10 times faster, which suggests there is definitely something odd going on.
Update: The file is cached by the OS, so there is no disk loading time involved.
I wanted to use memory mapping for random access to bigger files which may not fit into RAM. That is why I am not just using a BufferedReader.
The code is very slow, because fc.size() is called in the loop.
JVM obviously cannot eliminate fc.size(), since file size can be changed in run-time. Querying file size is relatively slow, because it requires a system call to the underlying file system.
Change this to
long size = fc.size();
for (long i = 0; i < size; i++) {
...
}
Ok, so I try to do this little experiment in java. I want to fill up a queue with integers and see how long it takes. Here goes:
import java.io.*;
import java.util.*;
class javaQueueTest {
public static void main(String args[]){
System.out.println("Hello World!");
long startTime = System.currentTimeMillis();
int i;
int N = 50000000;
ArrayDeque<Integer> Q = new ArrayDeque<Integer>(N);
for (i = 0;i < N; i = i+1){
Q.add(i);
}
long endTime = System.currentTimeMillis();
long totalTime = endTime - startTime;
System.out.println(totalTime);
}
}
OK, so I run this and get a
Hello World!
12396
About 12 secs, not bad for 50 million integers. But if I try to run it for 70 million integers I get:
Hello World!
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.Integer.valueOf(Integer.java:642)
at javaQueueTest.main(javaQueueTest.java:14)
I also notice that it takes about 10 mins to come up with this message. Hmm so what if I give almost all my memory (8gigs) for the heap? So I run it for heap size of 7gigs but I still get the same error:
javac javaQueueTest.java
java -cp . javaQueueTest -Xmx7g
Hello World!
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.Integer.valueOf(Integer.java:642)
at javaQueueTest.main(javaQueueTest.java:14)
I want to ask two things. First, why does it take so long to come up with the error? Second, Why is all this memory not enough? If I run the same experiment for 300 million integers in C (with the glib g_queue) it will run (and in 10 secs no less! although it will slow down the computer alot) so the number of integers must not be at fault here. For the record, here is the C code:
#include<stdlib.h>
#include<stdio.h>
#include<math.h>
#include<glib.h>
#include<time.h>
int main(){
clock_t begin,end;
double time_spent;
GQueue *Q;
begin = clock();
Q = g_queue_new();
g_queue_init(Q);
int N = 300000000;
int i;
for (i = 0; i < N; i = i+1){
g_queue_push_tail(Q,GINT_TO_POINTER(i));
}
end = clock();
time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("elapsed time: %f \n",time_spent);
}
I compile and get the result:
gcc cQueueTest.c `pkg-config --cflags --libs glib-2.0 gsl ` -o cQueueTest
~/Desktop/Software Development/Tests $ ./cQueueTest
elapsed time: 13.340000
My rough thoughts about your questions:
First, why does it take so long to come up with the error?
As gimpycpu in his comment stated, java does not start with full memory acquisition of your RAM. If you want so (and you have a 64 bit VM for greater amount of RAM), you can add the options -Xmx8g and -Xms8g at VM startup time to ensure that the VM gots 8 gigabyte of RAM and the -Xms means that it will also prepare the RAM for usage instead of just saying that it can use it. This will reduce the runtime significantly. Also as already mentioned, Java integer boxing is quite overhead.
Why is all this memory not enough?
Java introduces for every object a little bit of memory overhead, because the JVM uses Integer references in the ArrayDeque datastructur in comparision to just 4 byte plain integers due to boxing. So you have to calulate about 20 byte for every integer.
You can try to use an int[] instead of the ArrayDeque:
import java.io.*;
import java.util.*;
class javaQueueTest {
public static void main(args){
System.out.println("Hello World!");
long startTime = System.currentTimeMillis();
int i;
int N = 50000000;
int[] a = new int[N];
for (i = 0;i < N; i = i+1){
a[i] = 0;
}
long endTime = System.currentTimeMillis();
long totalTime = endTime - startTime;
System.out.println(totalTime);
}
}
This will be ultra fast and due the usage of plain arrays.
On my system I am under one second for every run!
In your case, the GC struggles as it assumes that at least some objects will be short lived. In your case all objects are long lived, this adds a significant overhead to managing this data.
If you use -Xmx7g -Xms7g -verbose:gc and N = 150000000 you get an output like
Hello World!
[GC (Allocation Failure) 1835008K->1615280K(7034368K), 3.8370127 secs]
5327
int is a primitive in Java (4 -bytes), while Integer is the wrapper. This wrapper need a reference to it and a header and padding and the result is that an Integer and its reference uses 20 bytes per value.
The solution is to not queue up some many values at once. You can use a Supplier to provide new values on demand, avoiding the need to create the queue in the first place.
Even so, with 7 GB heap you should be able to create a ArrayQueue of 200 M or more.
First, why does it take so long to come up with the error?
This looks like a classic example of a GC "death spiral". Basically what happens is that the JVM does full GCs repeatedly, reclaiming less and less space each time. Towards the end, the JVM spends more time running the GC than doing "useful" work. Finally it gives up.
If you are experiencing this, the solution is to configure a GC Overhead Limit as described here:
GC overhead limit exceeded
(Java 8 configures a GC overhead limit by default. But you are apparently using an older version of Java ... judging from the exception message.)
Second, Why is all this memory not enough?
See #Peter Lawrey's explanation.
The workaround is to find or implement a queue class that doesn't use generics. Unfortunately, that class will not be compatible with the standard Deque API.
You can catch OutOfMemoryError with :
try{
ArrayDeque<Integer> Q = new ArrayDeque<Integer>(N);
for (i = 0;i < N; i = i+1){
Q.add(i);
}
}
catch(OutOfMemoryError e){
Q=null;
System.gc();
System.err.println("OutOfMemoryError: "+i);
}
in order to show when the OutOfMemoryError is thrown.
And launch your code with :
java -Xmx4G javaQueueTest
in order to increase heap size for JVM
As mentionned earlier, Java is much slower with Objects than C with primitive types ...
I would like to achieve 0.5-1 million remote function calls per second. Let's assume we have one Central computer where computation starts, and one Worker computer which does the computation. There will be many Worker computers in real configuration.
Let's assume for a moment that our task is to calculate a sum of [(random int from 0 to MAX_VAL)*2], PROBLEM_SIZE times
The very naive prototype is
Worker:
//The real function takes 0.070ms to compute.
int compute(int input) {
return input * 2;
}
void go() {
try {
ServerSocket ss = new ServerSocket(socketNum);
Socket s = ss.accept();
System.out.println("Listening for " + socketNum);
DataInput di = new DataInputStream(s.getInputStream());
OutputStream os = s.getOutputStream();
byte[] arr = new byte[4];
ByteBuffer wrap = ByteBuffer.wrap(arr);
for (; ; ) {
wrap.clear();
di.readFully(arr);
int value = wrap.getInt();
int output = compute(value);
wrap.clear();
byte[] bytes = wrap.putInt(output).array();
os.write(bytes);
}
} catch (IOException e) {
System.err.println("Exception at " + socketNum);
e.printStackTrace();
}
}
Central:
void go(){
try {
Socket s = new Socket(ip, socketNum);
s.setSoTimeout(2000);
OutputStream os = s.getOutputStream();
DataInput di = new DataInputStream(s.getInputStream());
System.out.println("Central socket starting for " + socketNum);
Random r = new Random();
byte[] buf = new byte[4];
ByteBuffer wrap = ByteBuffer.wrap(buf);
long start = System.currentTimeMillis();
long sum = 0;
for(int i = 0; i < n; i++) {
wrap.clear();
int value = r.nextInt(10000);
os.write(wrap.putInt(value).array());
di.readFully(buf);
wrap.clear();
int answer = wrap.getInt();
sum += answer;
}
System.out.println(n + " calls in " + (System.currentTimeMillis() - start) + " ms");
} catch(SocketTimeoutException ste) {
System.err.println("Socket timeout at " + socketNum);
}
catch (Exception e) {
e.printStackTrace();
}
If the ping is 0.150ms and we run 1-threaded Worker, and 1-threaded Central, each iteration will take ~0.150ms. To improve performance, I run N threads on both Worker and Central, n-th thread listens to port 2000+n. After each thread stops, we sum up the result.
Benchmarks
First, I ran the program above in my fellow's school network. Second, I ran it on two Amazon EC2 Cluster instances. Gap in results was very big.
CHUNK_SIZE = 100_000 in all runs.
Fellow's network:
I think 3 years ago it was top configuration available (Xeon E5645). I believe it is heavily optimized for parallel computations and has simple LAN topology since it has only 20 machines.
OS: Ubuntu
Average ping: ~0.165ms
N=1 total time=6 seconds
N=10 total time=9 seconds
N=20 total time=11 seconds
N=32 total time=14 seconds
N=100 total time=21 seconds
N=500 total time=54 seconds
Amazon network:
I ran the program on two Cluster Compute Eight Extra Large Instance (cc2.8xlarge) started in the same Placement Group.
OS is some amazonian linux
Average ping: ~0.170ms.
results were a bit disappointing:
N=1 total time=16 seconds
N=10 total time=36 seconds
N=20 total time=55 seconds
N=32 total time=82 seconds
N=100 total time=250 seconds
N=500 total time=1200 seconds
I ran each configuration 2-4 times, results were similar, mostly +-5%
Amazon N=1 result makes sense, since 0.170ms per function call = 6000 calls per second = 100_000 calls per 16 seconds. 6 seconds for Fellow's network are actually surprising.
I think that maximum TCP packets per second with modern networks is around 40-70k per second.
It corresponds with N=100, time=250 seconds: N*CHUNK_SIZE / time = 100 * 100_000packets / 250sec = 10_000_000packets / 250sec = 40_000packets/second.
The question is, how my Fellow's network/computer configuration managed to do so well, especially with high N values?
My guess: it is wasteful to put each 4byte request and 4byte response to individual packet since there is ~40 byte overhead. It would be wise to pool all these tiny requests for, say, 0.010ms and put them in one big packet, and then redistribute the requests to the corresponding sockets.
It is possible to implement pooling on application level, but seems that Fellow's network/OS is configured to do it.
Update: I've played with java.net.Socket.setTcpNoDelay(), it didn't change anything.
The ultimate goal:
I approximate equation with millions of variables using very large tree. Currently, tree with 200_000 nodes fits in RAM. However I am intrested to approximate equation which requires tree with millions of nodes. It would take few Terabytes of RAM. Basic idea of algorithm is taking random path from node to leaf, and imroving values along it. Currently program is 32-threaded, each thread does 15000 iterations per second. I would like to move it to cluster with same iterations per second number.
You may be looking to enable Nagle' algorithm: wikipedia entry.
Here's a link about disabling it that might be helpful: Disabling Nagle's Algorithm in linux
Is this something common to all programming languages? Doing multiple print followed by a println seems faster but moving everything to a string and just printing that seems fastest. Why?
EDIT: For example, Java can find all the prime numbers up to 1 million in less than a second - but printing then all out each on their own println can take minutes! Up to a 10 billion can hours to print!
EX:
package sieveoferatosthenes;
public class Main {
public static void main(String[] args) {
int upTo = 10000000;
boolean primes[] = new boolean[upTo];
for( int b = 0; b < upTo; b++ ){
primes[b] = true;
}
primes[0] = false;
primes[1] = false;
int testing = 1;
while( testing <= Math.sqrt(upTo)){
testing ++;
int testingWith = testing;
if( primes[testing] ){
while( testingWith < upTo ){
testingWith = testingWith + testing;
if ( testingWith >= upTo){
}
else{
primes[testingWith] = false;
}
}
}
}
for( int b = 2; b < upTo; b++){
if( primes[b] ){
System.out.println( b );
}
}
}
}
println is not slow, it's the underlying PrintStream that is connected with the console, provided by the hosting operating system.
You can check it yourself: compare dumping a large text file to the console with piping the same textfile into another file:
cat largeTextFile.txt
cat largeTextFile.txt > temp.txt
Reading and writing are similiar and proportional to the size of the file (O(n)), the only difference is, that the destination is different (console compared to file). And that's basically the same with System.out.
The underlying OS operation (displaying chars on a console window) is slow because
The bytes have to be sent to the console application (should be quite fast)
Each char has to be rendered using (usually) a true type font (that's pretty slow, switching off anti aliasing could improve performance, btw)
The displayed area may have to be scrolled in order to append a new line to the visible area (best case: bit block transfer operation, worst case: re-rendering of the complete text area)
System.out is a static PrintStream class. PrintStream has, among other things, those methods you're probably quite familiar with, like print() and println() and such.
It's not unique to Java that input and output operations take a long time. "long." printing or writing to a PrintStream takes a fraction of a second, but over 10 billion instances of this print can add up to quite a lot!
This is why your "moving everything to a String" is the fastest. Your huge String is built, but you only print it once. Sure, it's a huge print, but you spend time on actually printing, not on the overhead associated with the print() or println().
As Dvd Prd has mentioned, Strings are immutable. That means whenever you assign a new String to an old one but reusing references, you actually destroy the reference to the old String and create a reference to the new one. So you can make this whole operation go even faster by using the StringBuilder class, which is mutable. This will decrease the overhead associated with building that string you'll eventually print.
I believe this is because of buffering. A quote from the article:
Another aspect of buffering concerns
text output to a terminal window. By
default, System.out (a PrintStream) is
line buffered, meaning that the output
buffer is flushed when a newline
character is encountered. This is
important for interactivity, where
you'd like to have an input prompt
displayed before actually entering any
input.
A quote explaining buffers from wikipedia:
In computer science, a buffer is a
region of memory used to temporarily
hold data while it is being moved from
one place to another. Typically, the
data is stored in a buffer as it is
retrieved from an input device (such
as a Mouse) or just before it is sent
to an output device (such as Speakers)
public void println()
Terminate the current line by writing
the line separator string. The line
separator string is defined by the
system property line.separator, and is
not necessarily a single newline
character ('\n').
So the buffer get's flushed when you do println which means new memory has to be allocated etc which makes printing slower. The other methods you specified require lesser flushing of buffers thus are faster.
Take a look at my System.out.println replacement.
By default, System.out.print() is only line-buffered and does a lot work related to Unicode handling. Because of its small buffer size, System.out.println() is not well suited to handle many repetitive outputs in a batch mode. Each line is flushed right away. If your output is mainly ASCII-based then by removing the Unicode-related activities, the overall execution time will be better.
If you're printing to the console window, not to a file, that will be the killer.
Every character has to be painted, and on every line the whole window has to be scrolled.
If the window is partly overlaid with other windows, it also has to do clipping.
That's going to take far more cycles than what your program is doing.
Usually that's not a bad price to pay, since console output is supposed to be for your reading pleasure :)
The problem you have is that displaying to the screen is very espensive, especially if you have a graphical windows/X-windows environment (rather than a pure text terminal) Just to render one digit in a font is far more expensive than the calculations you are doing. When you send data to the screen faster than it can display it, it buffered the data and quickly blocks. Even writing to a file is significant compare to the calculations, but its 10x - 100x faster than displaying on the screen.
BTW: math.sqrt() is very expensive, and using a loop is much slower than using modulus i.e. % to determine if a number is a multiple. BitSet can be 8x more space efficient than boolean[], and faster for operations on multiple bits e.g. counting or searching bits.
If I dump the output to a file, it is quick, but writing to the console is slow, and if I write to the console the data which was written to a file it takes about the same amount of time.
Took 289 ms to examine 10,000,000 numbers.
Took 149 ms to toString primes up to 10,000,000.
Took 306 ms to write to a file primes up to 10,000,000.
Took 61,082 ms to write to a System.out primes up to 10,000,000.
time cat primes.txt
real 1m24.916s
user 0m3.619s
sys 0m12.058s
The code
int upTo = 10*1000*1000;
long start = System.nanoTime();
BitSet nonprimes = new BitSet(upTo);
for (int t = 2; t * t < upTo; t++) {
if (nonprimes.get(t)) continue;
for (int i = 2 * t; i <= upTo; i += t)
nonprimes.set(i);
}
PrintWriter report = new PrintWriter("report.txt");
long time = System.nanoTime() - start;
report.printf("Took %,d ms to examine %,d numbers.%n", time / 1000 / 1000, upTo);
long start2 = System.nanoTime();
for (int i = 2; i < upTo; i++) {
if (!nonprimes.get(i))
Integer.toString(i);
}
long time2 = System.nanoTime() - start2;
report.printf("Took %,d ms to toString primes up to %,d.%n", time2 / 1000 / 1000, upTo);
long start3 = System.nanoTime();
PrintWriter pw = new PrintWriter(new BufferedOutputStream(new FileOutputStream("primes.txt"), 64*1024));
for (int i = 2; i < upTo; i++) {
if (!nonprimes.get(i))
pw.println(i);
}
pw.close();
long time3 = System.nanoTime() - start3;
report.printf("Took %,d ms to write to a file primes up to %,d.%n", time3 / 1000 / 1000, upTo);
long start4 = System.nanoTime();
for (int i = 2; i < upTo; i++) {
if (!nonprimes.get(i))
System.out.println(i);
}
long time4 = System.nanoTime() - start4;
report.printf("Took %,d ms to write to a System.out primes up to %,d.%n", time4 / 1000 / 1000, upTo);
report.close();
Most of the answers here are right, but they don't cover the most important point: system calls. This is the operation that induces the more overhead.
When your software needs to access some hardware resource (your screen for example), it needs to ask the OS (or hypervisor) if it can access the hardware. This costs a lot:
Here are interesting blogs about syscalls, the last one being dedicated to syscall and Java
http://arkanis.de/weblog/2017-01-05-measurements-of-system-call-performance-and-overhead
http://www.brendangregg.com/blog/2014-05-11/strace-wow-much-syscall.html
https://blog.packagecloud.io/eng/2017/03/14/using-strace-to-understand-java-performance-improvement/