I would like to achieve 0.5-1 million remote function calls per second. Let's assume we have one Central computer where the computation starts, and one Worker computer which does the computation. In the real configuration there will be many Worker computers.
Let's assume for a moment that our task is to compute the sum of [(random int from 0 to MAX_VAL) * 2], repeated PROBLEM_SIZE times.
A very naive prototype is:
Worker:
//The real function takes 0.070ms to compute.
int compute(int input) {
    return input * 2;
}

void go() {
    try {
        ServerSocket ss = new ServerSocket(socketNum);
        Socket s = ss.accept();
        System.out.println("Listening for " + socketNum);
        DataInput di = new DataInputStream(s.getInputStream());
        OutputStream os = s.getOutputStream();
        byte[] arr = new byte[4];
        ByteBuffer wrap = ByteBuffer.wrap(arr);
        for (; ; ) {
            wrap.clear();
            di.readFully(arr);
            int value = wrap.getInt();
            int output = compute(value);
            wrap.clear();
            byte[] bytes = wrap.putInt(output).array();
            os.write(bytes);
        }
    } catch (IOException e) {
        System.err.println("Exception at " + socketNum);
        e.printStackTrace();
    }
}
Central:
void go() {
    try {
        Socket s = new Socket(ip, socketNum);
        s.setSoTimeout(2000);
        OutputStream os = s.getOutputStream();
        DataInput di = new DataInputStream(s.getInputStream());
        System.out.println("Central socket starting for " + socketNum);
        Random r = new Random();
        byte[] buf = new byte[4];
        ByteBuffer wrap = ByteBuffer.wrap(buf);
        long start = System.currentTimeMillis();
        long sum = 0;
        for (int i = 0; i < n; i++) {
            wrap.clear();
            int value = r.nextInt(10000);
            os.write(wrap.putInt(value).array());
            di.readFully(buf);
            wrap.clear();
            int answer = wrap.getInt();
            sum += answer;
        }
        System.out.println(n + " calls in " + (System.currentTimeMillis() - start) + " ms");
    } catch (SocketTimeoutException ste) {
        System.err.println("Socket timeout at " + socketNum);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
If the ping is 0.150 ms and we run a 1-threaded Worker and a 1-threaded Central, each iteration will take ~0.150 ms. To improve performance, I run N threads on both Worker and Central; the n-th thread listens on port 2000+n. After each thread stops, we sum up the results (a sketch of the launcher is shown below).
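For reference, a minimal sketch of that launcher, assuming a hypothetical helper runCentralLoop() that runs the go() loop above against one port and returns its partial sum (N and CHUNK_SIZE here are just example values):
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CentralLauncher {
    // Stand-in for the per-socket request/response loop from go() above,
    // modified to return its partial sum instead of printing it.
    static long runCentralLoop(int port, int calls) {
        return 0L; // placeholder
    }

    public static void main(String[] args) throws Exception {
        final int N = 20;               // number of Worker sockets / Central threads (example)
        final int CHUNK_SIZE = 100_000; // calls per thread, as in the benchmarks below
        ExecutorService pool = Executors.newFixedThreadPool(N);
        List<Future<Long>> partials = new ArrayList<>();
        for (int i = 0; i < N; i++) {
            final int port = 2000 + i;  // thread i talks to port 2000 + i
            Callable<Long> task = () -> runCentralLoop(port, CHUNK_SIZE);
            partials.add(pool.submit(task));
        }
        long total = 0;
        for (Future<Long> f : partials) {
            total += f.get();           // wait for each thread and add its partial sum
        }
        pool.shutdown();
        System.out.println("total = " + total);
    }
}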
Benchmarks
First, I ran the program above on my fellow's school network. Second, I ran it on two Amazon EC2 Cluster instances. The gap in results was very big.
CHUNK_SIZE = 100_000 in all runs.
Fellow's network:
I think 3 years ago it was the top configuration available (Xeon E5645). I believe it is heavily optimized for parallel computation and has a simple LAN topology, since it has only 20 machines.
OS: Ubuntu
Average ping: ~0.165ms
N=1 total time=6 seconds
N=10 total time=9 seconds
N=20 total time=11 seconds
N=32 total time=14 seconds
N=100 total time=21 seconds
N=500 total time=54 seconds
Amazon network:
I ran the program on two Cluster Compute Eight Extra Large Instance (cc2.8xlarge) started in the same Placement Group.
OS: some Amazon Linux variant
Average ping: ~0.170ms.
Results were a bit disappointing:
N=1 total time=16 seconds
N=10 total time=36 seconds
N=20 total time=55 seconds
N=32 total time=82 seconds
N=100 total time=250 seconds
N=500 total time=1200 seconds
I ran each configuration 2-4 times; results were similar, mostly within ±5%.
The Amazon N=1 result makes sense, since 0.170 ms per function call ≈ 6000 calls per second ≈ 100_000 calls per 16 seconds. The 6 seconds for the Fellow's network is actually surprising.
I think that the maximum TCP packet rate on modern networks is around 40-70k packets per second.
That matches N=100, time=250 seconds: N * CHUNK_SIZE / time = 100 * 100_000 packets / 250 sec = 10_000_000 packets / 250 sec = 40_000 packets/second.
The question is: how did my Fellow's network/computer configuration manage to do so well, especially at high N values?
My guess: it is wasteful to put each 4-byte request and 4-byte response into an individual packet, since there is ~40 bytes of overhead per packet. It would be wise to pool all these tiny requests for, say, 0.010 ms, put them into one big packet, and then redistribute the requests to the corresponding sockets.
It is possible to implement pooling at the application level, but it seems that the Fellow's network/OS is configured to do it already.
Update: I've played with java.net.Socket.setTcpNoDelay(); it didn't change anything.
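For what it's worth, here is a rough sketch of the application-level pooling I have in mind (the method name and batch size are hypothetical, and the Worker would ideally buffer its responses the same way): instead of one 4-byte int per write, a Central thread packs a whole batch of requests into one buffered write and then reads the whole batch of answers back.
// Hypothetical batched variant of the Central loop. `batch` requests are
// queued in the BufferedOutputStream and flushed as one write, so the ~40-byte
// per-packet overhead is amortized over the whole batch. Byte order matches
// the original Worker (DataOutputStream and ByteBuffer both default to big-endian).
static long runBatched(String ip, int socketNum, int calls, int batch) throws IOException {
    try (Socket s = new Socket(ip, socketNum)) {
        DataOutputStream out = new DataOutputStream(new BufferedOutputStream(s.getOutputStream()));
        DataInputStream in = new DataInputStream(new BufferedInputStream(s.getInputStream()));
        Random r = new Random();
        long sum = 0;
        for (int done = 0; done < calls; done += batch) {
            int n = Math.min(batch, calls - done);
            for (int i = 0; i < n; i++) {
                out.writeInt(r.nextInt(10000));   // queue the request in the buffer
            }
            out.flush();                          // one flush sends the whole batch
            for (int i = 0; i < n; i++) {
                sum += in.readInt();              // answers come back in the same order
            }
        }
        return sum;
    }
}
This trades a bit of per-call latency for far fewer packets, which is exactly the batching effect I suspect the school network/OS is giving me for free.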
The ultimate goal:
I approximate an equation with millions of variables using a very large tree. Currently, a tree with 200_000 nodes fits in RAM. However, I am interested in approximating an equation which requires a tree with millions of nodes; it would take a few terabytes of RAM. The basic idea of the algorithm is to take a random path from a node to a leaf and improve the values along it. Currently the program is 32-threaded, and each thread does 15000 iterations per second. I would like to move it to a cluster with the same iterations-per-second number.
You may be looking to enable Nagle's algorithm: wikipedia entry.
Here's a link about disabling it that might be helpful: Disabling Nagle's Algorithm in Linux.
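For reference, the Java-side switch is Socket.setTcpNoDelay(); a quick sketch of the two settings:
Socket s = new Socket(ip, socketNum);
s.setTcpNoDelay(false); // false = leave Nagle's algorithm enabled (the default): small writes may be coalesced
// s.setTcpNoDelay(true);  // true = disable Nagle: every small write goes out immediately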
Related
I know the API "long getUidRxBytes(int uid)", but this interface cannot get the network speed of each process. Does someone know a simple way to get the speed of a process?
My English is not very good.
Basically, to measure the speed of anything, you need two parameters: time and amount.
Here, I assume you are calculating bytes/s, so you need to measure how many bytes are transferred every second.
Most of the time, you will need an algorithm such as:
totalTimeSpent = 0
bytesSent = 0
do
    beforeSendingTime = getCurrentMillisecond
    send n bytes to destination via network
    bytesSent = bytesSent + n
    afterSendingTime = getCurrentMillisecond
    timeSpent = afterSendingTime - beforeSendingTime
    totalTimeSpent = totalTimeSpent + timeSpent
    say: currentSpeed = n / timeSpent
    say: averageSpeed = bytesSent / totalTimeSpent
loop until no data remaining to send
Hope it helps; you need to implement this algorithm in your own development language. A Java sketch is below.
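For example, a rough Java version of the same loop, where sendChunk() and hasMoreData() are hypothetical placeholders for your actual sending code and end-of-data check:
long totalTimeSpent = 0;   // milliseconds
long bytesSent = 0;
byte[] chunk = new byte[8192];
while (hasMoreData()) {                              // hypothetical "data remaining" check
    long before = System.currentTimeMillis();
    sendChunk(chunk);                                // hypothetical: send chunk.length bytes over the network
    long timeSpent = System.currentTimeMillis() - before;
    bytesSent += chunk.length;
    totalTimeSpent += timeSpent;
    if (timeSpent > 0) {
        System.out.println("current speed: " + (chunk.length / timeSpent) + " bytes/ms");
    }
}
if (totalTimeSpent > 0) {
    System.out.println("average speed: " + (bytesSent / totalTimeSpent) + " bytes/ms");
}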
I wrote some Java code just to test how my CPU behaves when it has many operations to do, so I wrote a loop that adds 1 to a variable for roughly 100000000000 iterations:
public class NoThread {
    public static void main(String[] args) {
        long s = System.currentTimeMillis();
        int sum = 0;
        for (int i = 0; i <= 1000000; i++) {
            for (int j = 0; j <= 10000; j++) {
                for (int k = 0; k <= 10; k++) {
                    sum++;
                }
            }
        }
        long k = System.currentTimeMillis();
        System.out.println("Time" + (k - s) + " " + sum);
    }
}
The code finishes after 30-40 seconds.
Next I decided to split this operation into 10 threads to make my CPU work harder, and I told the program to print the time when each thread ends:
public class WithThread {
    public static void main(String[] args) {
        Runnable[] run = new Runnable[10];
        Thread[] thread = new Thread[10];
        for (int i = 0; i <= 9; i++) {
            run[i] = new Counter(i);
            thread[i] = new Thread(run[i]);
            thread[i].start();
        }
    }
}
and
public class Counter implements Runnable {
    private int inc;
    private int incc;
    private int sum = 0;
    private int id;

    public Counter(int a) {
        id = a;
        inc = a * 100000;
        incc = (a + 1) * 100000;
    }

    @Override
    public void run() {
        long s = System.currentTimeMillis();
        for (int i = inc; i <= incc; i++) {
            for (int j = 0; j <= 10000; j++) {
                for (int k = 0; k <= 10; k++) {
                    sum++;
                }
            }
        }
        long k = System.currentTimeMillis();
        System.out.println("Time" + (k - s) + " " + sum + " in thread " + id);
    }
}
As a result, the whole code finished in 18-20 seconds, so about two times faster. But when I looked at the time at which each thread ended, I found something interesting. Each thread had the same job to do, but 4 threads finished in a very short time (0.8 seconds) and the rest (6 threads) finished in 18 to 20 seconds. I started it again, and now I had 6 fast threads and 4 slow ones. I ran it again: 7 fast and 3 slow. The number of fast and slow threads looks random. So my question is: why is there such a big difference between the fast and slow threads? Why is the number of fast and slow threads so random, and is this language-specific (Java), or is it the operating system, the CPU, or something else?
Before moving into how threads and processors work, I'll explain it in a more understandable way.
Scenario
Location A ------------------------------ Location B
                (200 metres apart)

400 Bags of Sand (in Location A) have to be carried to Location B
So, the worker will have to carry each sandbag from Location A to Location B until all the sandbags are moved to Location B.
Let's just pretend that the worker is instantly teleported back (for argument's sake) to Location A (but not the other way around) once he arrives at Location B.
Case 1
Number of Workforce = 1 (No. of Men)
Time taken = 2 mins (Time for Moving 1 SandBag from Location A to Location B)
Total time taken to carry 400 Sandbags from Location A to Location B will be
Total time taken = 2 x 400 = 800 mins
Case 2
Number of Workforce = 4 (No. of Men)
Time taken = 2 mins (Time for Moving 1 SandBag from Location A to Location B)
So now we're going to split the job equally among the available workforce.
Assigned Sandbag for Each worker = 400 / 4 = 100
Let's say everyone starts their job at the same time.
Total time taken for carrying 100 Sandbags from Location A to Location B for an individual workforce
TimeTaken for Individual Workforce = 2 x 100 = 200 mins
Since everyone had started their job at the same time, all the 400 Sandbags will be carried from Location A to Location B in 200 mins
Case 3
Number of Workforce = 4 (No. of Men)
Here, let's say that every man has to carry 4 sandbags from Location A to Location B in a single transfer.
Total Sandbags in Single transfer for every worker = 4 bags
Time taken = 12 mins (Time for Moving 4 SandBags from Location A to Location B in a single transfer)
Since everyone is forced to carry 4 sandbags instead of 1, this greatly reduces their speed.
Consider this:
1) If I order you to carry 1 sandbag from A to B, you'll take 2 mins.
2) If I order you to carry 2 sandbags from A to B in one transfer, you'll take 5 mins instead of the theoretical 4 mins, because of your body condition and the weight you're carrying.
3) If I order you to carry 4 sandbags from A to B in one transfer, you'll take 12 mins instead of the theoretical 8 mins (from Point 1) or 10 mins (from Point 2), which is also because of human nature.
So now we're going to split the job equally among the available workforce.
Assigned Sandbag for Each worker = 400 / 4 = 100
Total transfers for Each worker = 100 / 4 = 25 Transfers
Calculating the time taken for single worker to complete his full job
Total time for a single worker = 12 mins x 25 transfers = 300 mins
So, they've taken an additional 100 mins instead of the theoretical 200 mins (Case 2).
Case 4
Total Sandbags in Single transfer for every worker = 100 bags
Since this is impossible for anyone to do, he'll just quit.
xx--------------------------------------------------------------------------------------xx
This is the same kind of working principle for threads and processors.
Here
Workforce = No. of Processors
Total Sandbags = No. of Threads
Sandbags in a single transfer = No. of threads one (1) processor is going to handle simultaneously
Assume
Available Processors = 4
Runtime.getRuntime().availableProcessors() // -> the way to get the number of available processors
Note: link each case below with the corresponding real-world case explained above.
Case 1
for (int i = 0; i <= 1000000; i++) {
    for (int j = 0; j <= 10000; j++) {
        for (int k = 0; k <= 10; k++) {
            sum++;
        }
    }
}
The whole operation is a serial process, so it'll take the full execution time it's supposed to.
Case 2
for (int n = 1; n <= 4; n++) {
    Thread t = new Thread(new Runnable() {
        public void run() {
            for (int i = 0; i <= 250000; i++) {   // 1000000 / 4 = 250000
                for (int j = 0; j <= 10000; j++) {
                    for (int k = 0; k <= 10; k++) {
                        sum++;
                    }
                }
            }
        }
    });
    t.start();
}
Here each processor is going to handle 1 thread, so it'll take roughly 1/4th of the actual time.
Case 3
for (int n = 1; n <= 16; n++) {
    Thread t = new Thread(new Runnable() {
        public void run() {
            for (int i = 0; i <= 62500; i++) {    // 1000000 / 16 = 62500
                for (int j = 0; j <= 10000; j++) {
                    for (int k = 0; k <= 10; k++) {
                        sum++;
                    }
                }
            }
        }
    });
    t.start();
}
In total, 16 threads will be created and each processor will have to handle 4 threads simultaneously. Practically, this pushes the processor load to its maximum, which reduces the efficiency of the processor and increases the execution time on each processor.
In total it'll take 1/4th of (1/4th of the actual time) + the performance-degradation time (which will definitely be higher than 1/4th of the actual time).
Case 4
for (int n = 1; n <= 100000; n++) {   // 100000 - just for argument's sake
    Thread t = new Thread(new Runnable() {
        public void run() {
            for (int i = 0; i <= 1000000; i++) {
                for (int j = 0; j <= 10000; j++) {
                    for (int k = 0; k <= 10; k++) {
                        sum++;
                    }
                }
            }
        }
    });
    t.start();
}
At this stage, creating and starting a thread is more expensive (when the processor already has many threads on it) than creating and starting the earlier threads. As the number of simultaneous threads increases, the processor load keeps growing until the processor reaches its capacity, which can eventually lead to a system crash.
The reason the threads created first had a shorter execution time is that there is no performance degradation in the processor during the initial stage. But as the for loop continues, the number of threads each processor has to handle increases beyond the fair ratio (1:1), so you start to see lag once the thread count per processor grows.
I need help or some ideas on how to get the loop in this code to stop executing when the speedUp factor settles to a particular value. The idea of this method is to continually run an ever-increasing number of threads and derive a speedUp factor from the results. The rounded speedUp factor is how many cores are present on the machine. Running a 4-threaded task will have the same speedUp factor as a 16-threaded task on a 4-core machine. I want to avoid having to manually set the number of threads to run. When the speedUp factor settles to a value, I want the program to terminate. There is no need to run a test for 8, 16, or 32 threads if the speedUp factor has already settled at 2, for example.
Example output for a 4 core machine:
Number of threads tested: 1
Speed up factor: 1.0
Number of threads tested: 2
Speed up factor: 1.8473736372646188
Number of threads tested: 4
Speed up factor: 3.9416666666666669
Number of threads tested: 8
Speed up factor: 3.9750993377483446
Number of threads tested: 16
Speed up factor: 4.026086956521739
THIS MACHINE HAS: 4 CORES
THE APPLICATION HAS COMPLETED EXECUTION. THANK YOU
private static void multiCoreTest() {
    // A runnable for the threads
    Counter task = new Counter(1500000000L);
    // A variable to store the number of threads to run
    int threadMultiplier = 1;
    // A variable to hold the time it takes for a single thread to execute
    double singleThreadTime = ThreadTest.runTime(1, task);
    // Calculating speedup factor for a single thread task
    double speedUp = (singleThreadTime * threadMultiplier) / (singleThreadTime);
    // Printing the speed up factor of a single thread
    System.out.println("Number of threads tested: " + threadMultiplier);
    System.out.println("Speed up factor: " + speedUp);
    // Testing multiple threads
    while (threadMultiplier < 16) {
        // Increasing the number of threads by a factor of two
        threadMultiplier *= 2;
        // A variable to hold the time it takes for multiple threads to execute
        double multiThreadTime = ThreadTest.runTime(threadMultiplier, task);
        // Calculating speedup factor for multiple thread tests
        speedUp = (singleThreadTime * threadMultiplier) / (multiThreadTime);
        // Message to the user
        System.out.println("\n" + "Number of threads tested: " + threadMultiplier);
        System.out.println("Speed up factor: " + speedUp);
    }
    // Print number of cores
    System.out.println("\n" + "THIS MACHINE HAS: " + Math.round(speedUp) + " CORES");
    System.out.println("\n" + "THE APPLICATION HAS COMPLETED EXECUTION. THANK YOU");
    // Exiting the system
    System.exit(0);
}
Test if the new speedup is the same as the old one:
double oldSpeedUp = 0;
boolean found = false;

while (!found && threadMultiplier < 16) {
    // ...
    found = Math.round(speedUp) == Math.round(oldSpeedUp);
    oldSpeedUp = speedUp;
}
As a side note, if you want the number of cores, you can call:
int cores = Runtime.getRuntime().availableProcessors();
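Putting the two together, here is a sketch of how the check could slot into your multiCoreTest() loop (it keeps your Counter and ThreadTest.runTime names; the termination check is the only new part):
Counter task = new Counter(1500000000L);
double singleThreadTime = ThreadTest.runTime(1, task);
double speedUp = 1.0;          // speedup of the single-threaded run is 1 by definition
double oldSpeedUp = 0.0;
int threadMultiplier = 1;
boolean found = false;
while (!found && threadMultiplier < 16) {
    threadMultiplier *= 2;
    double multiThreadTime = ThreadTest.runTime(threadMultiplier, task);
    speedUp = (singleThreadTime * threadMultiplier) / multiThreadTime;
    System.out.println("Number of threads tested: " + threadMultiplier);
    System.out.println("Speed up factor: " + speedUp);
    found = Math.round(speedUp) == Math.round(oldSpeedUp);   // stop once the rounded value settles
    oldSpeedUp = speedUp;
}
System.out.println("THIS MACHINE HAS: " + Math.round(speedUp) + " CORES");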
I am trying to measure the performance of our service by putting the data into a HashMap, like:
X number of calls came back in Y ms. Below is my code, which is very simple: it starts a timer before hitting the service, and once the response comes back it measures the elapsed time.
private static void serviceCall() {
    histogram = new HashMap<Long, Long>();
    keys = histogram.keySet();
    long total = 10;
    long runs = total;
    while (runs > 0) {
        long start_time = System.currentTimeMillis();
        // hitting the service
        result = restTemplate.getForObject("Some URL", String.class);
        long difference = (System.currentTimeMillis() - start_time);
        Long count = histogram.get(difference);
        if (count != null) {
            count++;
            histogram.put(Long.valueOf(difference), count);
        } else {
            histogram.put(Long.valueOf(difference), Long.valueOf(1L));
        }
        runs--;
    }
    for (Long key : keys) {
        Long value = histogram.get(key);
        System.out.println("SERVICE MEASUREMENT, HG data, " + key + ":" + value);
    }
}
Currently the output I am getting is something like this:
SERVICE MEASUREMENT, HG data, 166:1
SERVICE MEASUREMENT, HG data, 40:2
SERVICE MEASUREMENT, HG data, 41:4
SERVICE MEASUREMENT, HG data, 42:1
SERVICE MEASUREMENT, HG data, 43:1
SERVICE MEASUREMENT, HG data, 44:1
which means 1 call came back in 166 ms, 2 calls came back in 40 ms, and so on for the other lines.
Problem statement:
What I am looking for now is something like this. I should have ranges set up like this:
X Number of calls came back in between 1 and 10 ms
Y Number of calls came back in between 11 and 20 ms
Z Number of calls came back in between 21 and 30 ms
P Number of calls came back in between 31 and 40 ms
T number of calls came back in between 41 and 50 ms
....
....
I number of calls came back in more than 100 ms
I also need a way to configure the ranges: if in the future I need to tweak the ranges, I should be able to do so. How can I achieve this in my current program? Any suggestions would be of great help.
A histogram is a set of data arranged into "bins" of equal size. You should convert your time measurement to a bin and use that bin as the map key. This can be done simply by dividing your time value by the bin size. For example: time / 10L.
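For example, a sketch with a configurable bin size (the class and constant names are my own, and the off-by-one adjustment just makes the bins line up with the 1-10 / 11-20 / ... ranges from the question):
import java.util.Map;
import java.util.TreeMap;

// Sketch of a configurable-bin histogram. A measurement of d ms goes into
// bin (d - 1) / BIN_SIZE_MS; anything above MAX_MS goes into a single overflow bin.
class LatencyHistogram {
    static final long BIN_SIZE_MS = 10;   // width of each bin, tweak as needed
    static final long MAX_MS = 100;       // everything above this is "more than MAX_MS"
    static final Map<Long, Long> histogram = new TreeMap<>(); // TreeMap keeps bins sorted

    static void record(long differenceMs) {
        long bin = differenceMs > MAX_MS
                ? MAX_MS / BIN_SIZE_MS                         // overflow bin
                : (Math.max(differenceMs, 1) - 1) / BIN_SIZE_MS;
        histogram.merge(bin, 1L, Long::sum);
    }

    static void print() {
        for (Map.Entry<Long, Long> e : histogram.entrySet()) {
            long bin = e.getKey();
            if (bin >= MAX_MS / BIN_SIZE_MS) {
                System.out.println(e.getValue() + " calls came back in more than " + MAX_MS + " ms");
            } else {
                System.out.println(e.getValue() + " calls came back in between "
                        + (bin * BIN_SIZE_MS + 1) + " and " + ((bin + 1) * BIN_SIZE_MS) + " ms");
            }
        }
    }
}
In serviceCall() you would then call LatencyHistogram.record(difference) instead of the get/put pair, and LatencyHistogram.print() after the loop; changing BIN_SIZE_MS or MAX_MS reconfigures the ranges.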
I have two arrays (int and long) which contain millions of entries. Until now I have been writing them using DataOutputStream with a large buffer so that the disk I/O cost stays low (NIO gives more or less the same result, since I have a huge buffer and the I/O access cost is low); specifically, I am using:
DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream("abc.txt"),1024*1024*100));
for(int i = 0 ; i < 220000000 ; i++){
long l = longarray[i];
dos.writeLong(l);
}
But it takes a long time (more than 5 minutes) to do that. What I actually want is a bulk flush (some sort of main-memory-to-disk memory map). For that, I found a nice approach here and here. However, I can't understand how to use it in my Java code. Can anybody help me with that, or suggest any other way to do this nicely?
On my machine, a 3.8 GHz i7 with an SSD:
DataOutputStream dos = new DataOutputStream(
        new BufferedOutputStream(new FileOutputStream("abc.txt"), 32 * 1024));
long start = System.nanoTime();
final int count = 220000000;
for (int i = 0; i < count; i++) {
    long l = i;
    dos.writeLong(l);
}
dos.close();
long time = System.nanoTime() - start;
System.out.printf("Took %.3f seconds to write %,d longs%n", time / 1e9, count);
prints
Took 11.706 seconds to write 220,000,000 longs
Using memory mapped files
final int count = 220000000;
final FileChannel channel = new RandomAccessFile("abc.txt", "rw").getChannel();
MappedByteBuffer mbb = channel.map(FileChannel.MapMode.READ_WRITE, 0, count * 8);
mbb.order(ByteOrder.nativeOrder());
long start = System.nanoTime();
for (int i = 0; i < count; i++) {
    long l = i;
    mbb.putLong(l);
}
channel.close();
long time = System.nanoTime() - start;
System.out.printf("Took %.3f seconds to write %,d longs%n", time / 1e9, count);
// Only works on Sun/HotSpot/OpenJDK to deallocate the buffer.
((DirectBuffer) mbb).cleaner().clean();

final FileChannel channel2 = new RandomAccessFile("abc.txt", "r").getChannel();
MappedByteBuffer mbb2 = channel2.map(FileChannel.MapMode.READ_ONLY, 0, channel2.size());
mbb2.order(ByteOrder.nativeOrder());
assert mbb2.remaining() == count * 8;
long start2 = System.nanoTime();
for (int i = 0; i < count; i++) {
    long l = mbb2.getLong();
    if (i != l)
        throw new AssertionError("Expected " + i + " but got " + l);
}
channel2.close();
long time2 = System.nanoTime() - start2;
System.out.printf("Took %.3f seconds to read %,d longs%n", time2 / 1e9, count);
// Only works on Sun/HotSpot/OpenJDK to deallocate the buffer.
((DirectBuffer) mbb2).cleaner().clean();
prints on my 3.8 GHz i7.
Took 0.568 seconds to write 220,000,000 longs
on a slower machine prints
Took 1.180 seconds to write 220,000,000 longs
Took 0.990 seconds to read 220,000,000 longs
Is there any other way to avoid creating that? Because I already have that array in main memory and I can't allocate more than 500 MB for this.
This uses less than 1 KB of heap. If you look at how much memory is used before and after this call, you will normally see no increase at all.
Another thing: does this also give efficient loading, i.e. does MappedByteBuffer help for reading as well?
In my experience, using a memory mapped file is by far the fastest because you reduce the number of system calls and copies into memory.
Because in some article I found that read(buffer) gives better loading performance. (I checked that one; it was really fast, a 220 million int/float array was read in 5 seconds.)
I would like to read that article because I have never seen that.
Another issue: readLong gives an error when reading the file written by your code.
Part of the performance improvement is storing the values in native byte order. writeLong/readLong always uses big-endian format, which is much slower on Intel/AMD systems, since they are natively little-endian.
You can make the byte order big-endian, which will slow it down, or you can use native ordering (DataInput/OutputStream only supports big-endian).
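So, as a sketch, if you want the file to stay readable by DataInputStream.readLong(), keep the mapped buffer in big-endian order (reusing channel and count from the snippet above), at the cost of some write speed:
MappedByteBuffer mbb = channel.map(FileChannel.MapMode.READ_WRITE, 0, count * 8L);
mbb.order(ByteOrder.BIG_ENDIAN); // matches DataInput/DataOutputStream, but slower on little-endian x86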
I am running it on a server with 16 GB memory and a 2.13 GHz [CPU]
I doubt the problem has anything to do with your Java code.
Your file system appears to be extraordinarily slow (at least ten times slower than what one would expect from a local disk).
I would do two things:
Double check that you are actually writing to a local disk, and not to a network share. Bear in mind that in some environments home directories are NFS mounts.
Ask your sysadmins to take a look at the machine to find out why the disk is so slow. If I were in their shoes, I'd start by checking the logs and running some benchmarks (e.g. using Bonnie++).
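If you want a quick sanity check from the Java side as well, a minimal raw-throughput sketch like the following (file name and block size are arbitrary) writes roughly the same volume of data in plain 1 MB blocks; if this is also slow, the bottleneck is the disk or mount rather than the serialization code:
import java.io.FileOutputStream;
import java.io.IOException;

public class RawWriteCheck {
    public static void main(String[] args) throws IOException {
        byte[] block = new byte[1024 * 1024];            // 1 MB of zeroes
        long blocks = 220_000_000L * 8 / block.length;   // same total size as 220M longs (~1.7 GB)
        long start = System.nanoTime();
        try (FileOutputStream out = new FileOutputStream("raw-check.bin")) {
            for (long i = 0; i < blocks; i++) {
                out.write(block);
            }
            out.getFD().sync();                          // include flush-to-disk time, not just page cache
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("Wrote %d MB in %.1f s (%.1f MB/s)%n",
                blocks, seconds, blocks / seconds);
    }
}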